CN109887544A - RNA sequence parallel classification method based on non-negative matrix factorization - Google Patents

RNA sequence parallel classification method based on non-negative matrix factorization

Info

Publication number
CN109887544A
CN109887544A CN201910060301.6A CN201910060301A CN109887544A CN 109887544 A CN109887544 A CN 109887544A CN 201910060301 A CN201910060301 A CN 201910060301A CN 109887544 A CN109887544 A CN 109887544A
Authority
CN
China
Prior art keywords
bayes
coefficient
matrix
rna sequence
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910060301.6A
Other languages
Chinese (zh)
Other versions
CN109887544B (en)
Inventor
杨晓凯
钟诚
黄毅然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University
Original Assignee
Guangxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University
Priority to CN201910060301.6A
Publication of CN109887544A
Application granted
Publication of CN109887544B
Legal status: Active

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parallel RNA sequence classification method based on non-negative matrix factorization. After the RNA data are converted into matrix form, the Bayes coefficient corresponding to each candidate K value is computed from the raw data matrix, the choice of K in the non-negative matrix factorization is constrained by these Bayes coefficients, and the RNA sequences are classified in parallel using non-negative matrix factorization. The method effectively improves the accuracy of RNA sequence classification and, by using parallel techniques, effectively improves the running efficiency of the classification.

Description

RNA sequence parallel classification method based on non-negative matrix factorization
Technical field
The invention belongs to the technical field of bioinformatics, and in particular relates to a parallel RNA sequence classification method based on non-negative matrix factorization.
Background
Compared with experimental techniques, bioinformatics tools for analyzing single-cell RNA sequence data still lag behind. In recent years, a variety of methods have been developed to detect subpopulations (or subclasses) within a group of cells using single-cell RNA sequence data. These new computational tools have proven very important for understanding the heterogeneity of single-cell RNA sequences. In addition, once the subpopulations are determined, identifying the characteristic gene expression features of each subpopulation (subclass) is crucial for revealing the underlying biological mechanisms.
Non-negative matrix factorization (NMF) is an effective dimensionality reduction algorithm that has attracted wide attention because its idea is simple, it is easy to implement, and it has many variants. It decomposes a high-dimensional matrix into the product of two or more low-dimensional matrices, which reduces the dimension and makes it convenient to study the properties of high-dimensional data in a lower-dimensional space. NMF differs from other methods in that it approximately decomposes a given non-negative matrix into the product of two non-negative matrices; that is, NMF guarantees that every element of the resulting matrices is non-negative, so that an additive, parts-based representation of the original data is obtained from local features. NMF has properties such as locality and parts-based representation that are superior to similar algorithms, and it has an intuitive physical meaning: the whole can be viewed as an additive combination of multiple parts. NMF has therefore received great attention in recent years, has been further improved and developed from its conceptual framework to its implementation methods, and has given rise to a number of fast and efficient practical algorithms.
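To make the factorization described above concrete, the following is a minimal sketch of NMF using the well-known multiplicative update rules of Lee and Seung for the squared Frobenius error. It is an illustration only and is not taken from the patent: the function names, the random initialization, the iteration count and the small constant eps are all assumptions, and the matrices are stored as plain Python lists of rows.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]

def frobenius_error(m, p, e):
    """Squared Frobenius norm ||M - PE||^2."""
    pe = matmul(p, e)
    return sum((m[i][j] - pe[i][j]) ** 2
               for i in range(len(m)) for j in range(len(m[0])))

def nmf(m, rank, iterations=200, seed=0, eps=1e-9):
    """Factor a non-negative M (K x G) into P (K x rank) and E (rank x G).

    Illustrative sketch: multiplicative updates keep every entry of P and E
    non-negative because they only multiply by non-negative ratios.
    """
    rng = random.Random(seed)
    k, g = len(m), len(m[0])
    p = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(k)]
    e = [[rng.random() + 0.1 for _ in range(g)] for _ in range(rank)]
    for _ in range(iterations):
        # E <- E * (P^T M) / (P^T P E), element-wise
        pt = transpose(p)
        num, den = matmul(pt, m), matmul(matmul(pt, p), e)
        e = [[e[a][b] * num[a][b] / (den[a][b] + eps) for b in range(g)]
             for a in range(rank)]
        # P <- P * (M E^T) / (P E E^T), element-wise
        et = transpose(e)
        num, den = matmul(m, et), matmul(p, matmul(e, et))
        p = [[p[a][b] * num[a][b] / (den[a][b] + eps) for b in range(rank)]
             for a in range(k)]
    return p, e
```

Calling nmf(m, 2) on a small non-negative matrix returns two non-negative factors whose product approximates m; running more iterations should not increase the reconstruction error under these update rules.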
Summary of the invention
The object of the present invention is to provide a parallel RNA sequence classification method based on non-negative matrix factorization, so as to improve classification accuracy and computing speed.
The technical solution adopted by the present invention to achieve the above object is a parallel RNA sequence classification method based on non-negative matrix factorization: after the RNA data are converted into matrix form, the Bayes coefficient corresponding to each candidate K value is computed from the raw data matrix, the choice of K in the non-negative matrix factorization is constrained by the Bayes coefficients, and the RNA sequences are classified in parallel using non-negative matrix factorization.
The above parallel RNA sequence classification method based on non-negative matrix factorization comprises the following steps:
1) Convert the RNA data into matrix form:
The counts of mutations found in G different genomes are assembled into a K × G matrix M, where K = |A| and A is the alphabet of trinucleotide mutation types. If the basis vectors are merged into the K × N signature matrix P, and the coefficient vectors form the N × G matrix E, then the RNA data are expressed as M = P × E;
2) Compute the Bayes coefficient corresponding to each candidate K value from the raw data matrix, and constrain the choice of K in the non-negative matrix factorization by the Bayes coefficients:
The matrix obtained in step 1) is treated with an empirical Bayes method, in which the parameter θ, the hyperparameter Ψ and the superlinearity parameter η are estimated from the original data. For the chosen η, a sampler for π(θ | M, η) is generated; with Z denoting a random tensor, this requires iteratively generating a series of samples (Z^(k), …), k ≥ 1, from the complete set of conditional distributions. A standard Gibbs sampling package is used to compute the Bayes coefficient corresponding to each K value;
The range of k is determined by the sample size n: 2 ≤ k ≤ n/2. The Bayes coefficient is computed for each k in this range and an acceptable fluctuation interval W is specified. The k with the largest Bayes coefficient is found and then, with reference to W, the smallest k within the fluctuation interval is chosen;
3) Classify using non-negative matrix factorization:
Using the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are computed so as to satisfy min ||M - PE|| for the original data set. With E^(r) as the exposure matrix, the important information about the contribution of each signature in the genome samples is retained; signatures with a DES higher than a prescribed level are considered to have differential activity among the groups;
4) The parallel algorithm is implemented on the R language platform RStudio, with the following steps:
Input: RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;
Output: RNA sequence classification data;
Begin
1: n ← number of RNA sequences |M|;
2: n′ ← number of K values for which the Bayes coefficient must actually be computed, n/2 - 1;
3: x ← number of K values for which a single core must compute the Bayes coefficient, n′/p;
4: assign to each core, according to x, the K values whose Bayes coefficients it must compute;
5: each core computes the Bayes coefficients for its K values and stores them in array k, indexed by the corresponding K value;
6: traverse array k to find the maximum value k[m];
7: j ← m - 1, where m is the index of the maximum value in array k;
8: while ((k[m] - k[j]) / k[m]) < W do
9: j = j - 1;
10: end while;
11: for i = 1 to j do
12: with j as the factorization dimension, factor M into P and E by non-negative matrix factorization;
13: end for;
14: compute the RNA sequence classification data from the resulting P and E matrices combined with the reference sequence L;
End.
Beneficial effects of the invention: compared with existing methods, the method of the invention clearly improves classification accuracy. Moreover, after parallelization, its running efficiency improves significantly, and the time required to classify the same data is markedly shortened.
Brief description of the drawings
Fig. 1 shows the result of converting the RNA data of 21 breast cancer cells into matrix form;
Fig. 2 shows the classification result for the 21 breast cancer RNA data with sequence 12 as the reference length.
Detailed description of the embodiments
The specific steps of the parallel RNA sequence classification method based on non-negative matrix factorization of the invention are as follows:
1) Convert the RNA data into matrix form:
Most somatic mutations comprise single-base substitutions, insertions and deletions, rearrangements, and copy number variations (CNVs). A single-base substitution is one of six possible base changes: C:G > A:T, C:G > G:C, C:G > T:A, T:A > A:T, T:A > C:G and T:A > G:C. By including the 5' and 3' bases adjacent to each substitution site, this set can be further expanded into an alphabet A of 96 trinucleotide mutation types. Once A is properly defined, the counts of mutations found in G different genomes are assembled into a K × G matrix M, with K = |A|. A key assumption is to regard the counts in M as the additive effect of N mutational processes, each of which is defined as a K × 1 vector of mutation rates. The latter defines a so-called mutational signature. More precisely, the mutations in all genomes result from linear combinations of N basis vectors of dimension K × 1, with the mixing coefficients defined by N exposure vectors of size 1 × G. If the basis vectors are merged into the K × N signature matrix P, and the coefficient vectors form the N × G matrix E, then the RNA data can be conveniently expressed as M = P × E.
Fig. 1 shows the result of converting the RNA data of 21 breast cancer cells into matrix form.
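As an illustration of step 1), the sketch below enumerates the 96 trinucleotide mutation types (6 substitutions × 4 possible 5' bases × 4 possible 3' bases) and assembles per-genome counts into a K × G matrix M. The label format such as "A[C>A]G", the shorthand of writing C:G > A:T as C>A, and the input structure `samples` are assumptions made for this example and do not come from the patent.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes, abbreviated by the pyrimidine of the pair
# (assumed shorthand: C:G > A:T is written "C>A", and so on).
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def mutation_alphabet():
    """Return the 96 trinucleotide mutation types, e.g. 'A[C>A]G'."""
    return [f"{five}[{sub}]{three}"
            for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)]

def count_matrix(samples):
    """Assemble a K x G count matrix M.

    `samples` is a hypothetical input: one list of mutation-type labels
    per genome. Row i of M corresponds to alphabet[i], column g to genome g.
    """
    alphabet = mutation_alphabet()
    row = {label: i for i, label in enumerate(alphabet)}
    m = [[0] * len(samples) for _ in alphabet]
    for g, mutations in enumerate(samples):
        for label in mutations:
            m[row[label]][g] += 1
    return alphabet, m
```

With two genomes, for example count_matrix([["A[C>A]G", "A[C>A]G"], ["A[C>A]G"]]), the row for "A[C>A]G" holds the counts of that mutation type in each genome and all other rows are zero.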
2) Compute the Bayes coefficient corresponding to each candidate K value from the raw data matrix, and constrain the choice of K in the non-negative matrix factorization by the Bayes coefficients:
The matrix obtained in step 1) is treated with an empirical Bayes method, in which the parameter θ, the hyperparameter Ψ and the superlinearity parameter η are all estimated from the original data. For the chosen η, a sampler for π(θ | M, η) is generated; with Z denoting a random tensor, this requires iteratively generating a series of samples (Z^(k), …), k ≥ 1, from the complete set of conditional distributions. These samples are used to update the parameter θ, the hyperparameter Ψ and the superlinearity parameter η through random data. In this way, a standard Gibbs sampling package is used to compute the Bayes coefficient corresponding to each K value.
The range of k is determined by the sample size n: 2 ≤ k ≤ n/2. The Bayes coefficient is computed for each k in this range and an acceptable fluctuation interval W is specified. The k with the largest Bayes coefficient is found and then, with reference to W, the smallest k within the fluctuation interval is chosen.
Taking the 21 breast cancer RNA data as an example, k = 3 is obtained in this way.
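The selection rule in step 2) can be sketched as follows. This is one reading of the text, not the patent's implementation: it assumes the Bayes coefficients are positive numbers (so that a relative drop is meaningful), that W is a relative tolerance, and that the search walks downward from the best k and stops at the first k that falls outside the tolerance. The names `candidate_ks`, `select_k` and `scores` are hypothetical.

```python
def candidate_ks(n):
    """Candidate K values for n samples: 2 <= k <= n/2."""
    return list(range(2, n // 2 + 1))

def select_k(scores, w):
    """Pick the smallest k whose score stays within tolerance w of the best.

    scores: dict mapping k -> Bayes coefficient (assumed positive).
    w: acceptable relative drop from the maximum (the fluctuation interval).
    """
    best_k = max(scores, key=scores.get)
    best = scores[best_k]
    chosen = best_k
    k = best_k - 1
    # Walk downward while the relative drop from the maximum is below w.
    while k in scores and (best - scores[k]) / best < w:
        chosen = k
        k -= 1
    return chosen
```

The sketch returns the smallest in-interval k directly instead of mirroring the loop indices of the pseudocode, and it stops at the lower end of the candidate range rather than indexing past it.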
3) Classify using non-negative matrix factorization:
Using the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are computed so as to satisfy min ||M - PE|| for the original data set. With E^(r) as the exposure matrix, the result can be related to independent knowledge such as clinical data, in order to examine how the activity of each mutational process is associated with the latter. In particular, when prior information motivates a division of the samples into two or more classes, the Kruskal-Wallis test is used to check whether the actual values differ significantly between the classes. The negative logarithm of the median of these values defines the differential exposure score (DES). The important information about the contribution of each signature in the genome samples is retained. Signatures with a DES higher than a prescribed level are considered to have differential activity among the groups.
Fig. 2 shows the classification result for the 21 breast cancer RNA data with sequence 12 as the reference length.
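The DES computation in step 3) might look like the sketch below. Several points are assumptions rather than statements of the patent: "these values" is read as the Kruskal-Wallis p-values obtained from a series of sampled exposure rows for one signature, the test statistic is computed without tie correction, and the p-value uses closed-form chi-square tail formulas for integer degrees of freedom. All function names are hypothetical.

```python
import math
from statistics import median

def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def chi2_sf(x, df):
    """Upper tail probability of the chi-square distribution, integer df."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    term, total = 1.0, 0.0
    for i in range((df - 1) // 2):
        total += term
        term *= x / (2 * i + 3)
    return (math.erfc(math.sqrt(x / 2))
            + math.sqrt(2 * x / math.pi) * math.exp(-x / 2) * total)

def kruskal_wallis_p(groups):
    """Kruskal-Wallis p-value for two or more groups (no tie correction)."""
    values = [v for g in groups for v in g]
    ranks = average_ranks(values)
    n = len(values)
    h, start = 0.0, 0
    for g in groups:
        r = sum(ranks[start:start + len(g)])
        h += r * r / len(g)
        start += len(g)
    h = 12.0 / (n * (n + 1)) * h - 3 * (n + 1)
    return chi2_sf(h, len(groups) - 1)

def differential_exposure_score(exposure_draws, labels):
    """DES for one signature: -log of the median p-value over sampled rows.

    exposure_draws: list of exposure rows (one per sampled E), each of length G.
    labels: class label of each of the G samples.
    """
    classes = sorted(set(labels))
    pvals = []
    for row in exposure_draws:
        groups = [[v for v, lab in zip(row, labels) if lab == c] for c in classes]
        pvals.append(kruskal_wallis_p(groups))
    return -math.log(median(pvals))
```

A signature whose exposures separate the classes cleanly yields small p-values and therefore a large DES; the threshold above which a signature counts as differentially active is left to the caller.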
4) The parallel RNA sequence classification is implemented on the R language platform RStudio:
The specific steps are given in the form of pseudocode, as follows:
Input: breast cancer RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;
Output: breast cancer RNA sequence classification data;
Begin
1: n ← number of RNA sequences |M|;
2: n′ ← number of K values for which the Bayes coefficient must actually be computed, n/2 - 1;
3: x ← number of K values for which a single core must compute the Bayes coefficient, n′/p;
4: assign to each core, according to x, the K values whose Bayes coefficients it must compute;
5: each core computes the Bayes coefficients for its K values and stores them in array k, indexed by the corresponding K value;
6: traverse array k to find the maximum value k[m];
7: j ← m - 1, where m is the index of the maximum value in array k;
8: while ((k[m] - k[j]) / k[m]) < W do
9: j = j - 1;
10: end while;
11: for i = 1 to j do
12: with j as the factorization dimension, factor M into P and E by non-negative matrix factorization;
13: end for;
14: compute the breast cancer RNA sequence classification data from the resulting P and E matrices combined with the reference sequence L;
End.
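Steps 2 to 5 of the pseudocode, which distribute the candidate K values over p cores and collect the Bayes coefficients, can be sketched as below. The patent states an R implementation; this Python sketch is only an analogue under stated assumptions. `score_fn` is a hypothetical placeholder for whatever routine computes the Bayes coefficient of a given k, the K values are dealt out round-robin rather than in contiguous blocks of size x, and a thread pool is used for portability (a process pool would be the usual choice for CPU-bound scoring).

```python
from concurrent.futures import ThreadPoolExecutor

def split_ks(n, p):
    """Distribute the candidate K values 2..n/2 over p workers, round-robin."""
    ks = list(range(2, n // 2 + 1))
    return [ks[i::p] for i in range(p)]

def parallel_scores(n, p, score_fn):
    """Evaluate score_fn(k) for every candidate k using p workers.

    Returns a dict indexed by k, mirroring the array k of the pseudocode.
    score_fn is a caller-supplied (hypothetical) scoring routine.
    """
    chunks = split_ks(n, p)

    def work(chunk):
        # Each worker scores only the K values assigned to it.
        return {k: score_fn(k) for k in chunk}

    scores = {}
    with ThreadPoolExecutor(max_workers=p) as pool:
        for partial in pool.map(work, chunks):
            scores.update(partial)
    return scores
```

For n = 21 and p = 4 this evaluates the nine candidates k = 2, ..., 10, which matches the count n/2 - 1 used in step 2 of the pseudocode when n/2 is rounded down.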
Experiments were carried out by classifying several data sets, including the 21 breast cancer data set, with the above method. Tables 1 and 2 show, respectively, the comparison between the present invention and an existing method (see Rosales R A, Drummond R D, Valieris R, et al. signeR: An empirical Bayesian approach to mutational signature discovery [J]. Bioinformatics, 2016, 33(1): 8.) on the classification of the 21 breast cancer data. The method of the invention achieves higher classification accuracy and faster computation.
Table 1. Accuracy of the present invention compared with the existing method
Table 2. Running time of the present invention compared with the existing method

Claims (2)

1. A parallel RNA sequence classification method based on non-negative matrix factorization, characterized in that, after the RNA data are converted into matrix form, the Bayes coefficient corresponding to each candidate K value is computed from the raw data matrix, the choice of K in the non-negative matrix factorization is constrained by the Bayes coefficients, and the RNA sequences are classified in parallel using non-negative matrix factorization.
2. The parallel RNA sequence classification method based on non-negative matrix factorization according to claim 1, characterized by comprising the following steps:
1) Convert the RNA data into matrix form:
The counts of mutations found in G different genomes are assembled into a K × G matrix M, where K = |A| and A is the alphabet of trinucleotide mutation types. If the basis vectors are merged into the K × N signature matrix P, and the coefficient vectors form the N × G matrix E, then the RNA data are expressed as M = P × E;
2) Compute the Bayes coefficient corresponding to each candidate K value from the raw data matrix, and constrain the choice of K in the non-negative matrix factorization by the Bayes coefficients:
The matrix obtained in step 1) is treated with an empirical Bayes method, in which the parameter θ, the hyperparameter Ψ and the superlinearity parameter η are estimated from the original data. For the chosen η, a sampler for π(θ | M, η) is generated; with Z denoting a random tensor, this requires iteratively generating a series of samples (Z^(k), …), k ≥ 1, from the complete set of conditional distributions. A standard Gibbs sampling package is used to compute the Bayes coefficient corresponding to each K value;
The range of k is determined by the sample size n: 2 ≤ k ≤ n/2. The Bayes coefficient is computed for each k in this range and an acceptable fluctuation interval W is specified. The k with the largest Bayes coefficient is found and then, with reference to W, the smallest k within the fluctuation interval is chosen;
3) Classify using non-negative matrix factorization:
Using the k obtained in step 2) as the dimension of the non-negative matrix factorization, P and E are computed so as to satisfy min ||M - PE|| for the original data set. With E^(r) as the exposure matrix, the important information about the contribution of each signature in the genome samples is retained; signatures with a DES higher than a prescribed level are considered to have differential activity among the groups;
4) The parallel algorithm is implemented on the R language platform RStudio, with the following steps:
Input: RNA sequences M, reference sequence L, acceptable fluctuation interval W, number of cores p;
Output: RNA sequence classification data;
Begin
1: n ← number of RNA sequences |M|;
2: n′ ← number of K values for which the Bayes coefficient must actually be computed, n/2 - 1;
3: x ← number of K values for which a single core must compute the Bayes coefficient, n′/p;
4: assign to each core, according to x, the K values whose Bayes coefficients it must compute;
5: each core computes the Bayes coefficients for its K values and stores them in array k, indexed by the corresponding K value;
6: traverse array k to find the maximum value k[m];
7: j ← m - 1, where m is the index of the maximum value in array k;
8: while ((k[m] - k[j]) / k[m]) < W do
9: j = j - 1;
10: end while;
11: for i = 1 to j do
12: with j as the factorization dimension, factor M into P and E by non-negative matrix factorization;
13: end for;
14: compute the RNA sequence classification data from the resulting P and E matrices combined with the reference sequence L;
End.
CN201910060301.6A 2019-01-22 2019-01-22 RNA sequence parallel classification method based on non-negative matrix factorization Active CN109887544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910060301.6A CN109887544B (en) 2019-01-22 2019-01-22 RNA sequence parallel classification method based on non-negative matrix factorization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910060301.6A CN109887544B (en) 2019-01-22 2019-01-22 RNA sequence parallel classification method based on non-negative matrix factorization

Publications (2)

Publication Number Publication Date
CN109887544A true CN109887544A (en) 2019-06-14
CN109887544B CN109887544B (en) 2022-07-05

Family

ID=66926595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910060301.6A Active CN109887544B (en) 2019-01-22 2019-01-22 RNA sequence parallel classification method based on non-negative matrix factorization

Country Status (1)

Country Link
CN (1) CN109887544B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491443A (en) * 2019-07-23 2019-11-22 华中师范大学 A kind of lncRNA protein interaction prediction method based on projection neighborhood Non-negative Matrix Factorization
CN111370060A (en) * 2020-03-21 2020-07-03 广西大学 Protein interaction network co-location co-expression complex recognition system and method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102186987A (en) * 2008-04-24 2011-09-14 阿肯色大学托管委员会 Gene expression profiling based identification of genomic signature of high-risk multiple myeloma and uses thereof
CN102696034A (en) * 2008-10-31 2012-09-26 雅培制药有限公司 Genomic classification of non-small cell lung carcinoma based on patterns of gene copy number alterations
CN103235900A (en) * 2013-03-28 2013-08-07 中山大学 Weight assembly clustering method for excavating protein complex
CN107016261A (en) * 2017-04-11 2017-08-04 曲阜师范大学 Difference expression gene discrimination method based on joint constrained non-negative matrix decomposition
US20180203974A1 (en) * 2016-11-07 2018-07-19 Grail, Inc. Methods of identifying somatic mutational signatures for early cancer detection

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102186987A (en) * 2008-04-24 2011-09-14 阿肯色大学托管委员会 Gene expression profiling based identification of genomic signature of high-risk multiple myeloma and uses thereof
CN102696034A (en) * 2008-10-31 2012-09-26 雅培制药有限公司 Genomic classification of non-small cell lung carcinoma based on patterns of gene copy number alterations
CN103235900A (en) * 2013-03-28 2013-08-07 中山大学 Weight assembly clustering method for excavating protein complex
US20180203974A1 (en) * 2016-11-07 2018-07-19 Grail, Inc. Methods of identifying somatic mutational signatures for early cancer detection
CN107016261A (en) * 2017-04-11 2017-08-04 曲阜师范大学 Difference expression gene discrimination method based on joint constrained non-negative matrix decomposition

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
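CHUNXUAN SHAO et al.: "Robust classification of single-cell transcriptome data by nonnegative matrix factorization", BIOINFORMATICS *
孙华 et al.: "Research on classification and prediction methods for the relationship between miRNA and disease", Computer Knowledge and Technology (电脑知识与技术) *
尤燕玲: "Research on an improved non-negative matrix factorization algorithm for miRNA-gene interaction relationships", China Master's and Doctoral Theses Full-text Database (Master's), Basic Sciences (中国优秀博硕士学位论文全文数据库(硕士)基础科学辑) *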

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491443A (en) * 2019-07-23 2019-11-22 华中师范大学 A kind of lncRNA protein interaction prediction method based on projection neighborhood Non-negative Matrix Factorization
CN110491443B (en) * 2019-07-23 2022-04-01 华中师范大学 lncRNA protein correlation prediction method based on projection neighborhood non-negative matrix decomposition
CN111370060A (en) * 2020-03-21 2020-07-03 广西大学 Protein interaction network co-location co-expression complex recognition system and method

Also Published As

Publication number Publication date
CN109887544B (en) 2022-07-05

Similar Documents

Publication Publication Date Title
Kubatko et al. An invariants-based method for efficient identification of hybrid species from large-scale genomic data
Flagel et al. The unreasonable effectiveness of convolutional neural networks in population genetic inference
Rosenberg Discordance of species trees with their most likely gene trees: a unifying principle
Lanier et al. Is recombination a problem for species-tree analyses?
Blackburne et al. Class of multiple sequence alignment algorithm affects genomic analysis
Rasmussen et al. A Bayesian approach for fast and accurate gene tree reconstruction
Page et al. BamBam: genome sequence analysis tools for biologists
Yu et al. Coalescent-based delimitation outperforms distance-based methods for delineating less divergent species: the case of Kurixalus odontotarsus species group
Xie et al. Poly (A) motif prediction using spectral latent features from human DNA sequences
Bhargava et al. DNA barcoding in plants: evolution and applications of in silico approaches and resources
Senerchia et al. Evolutionary dynamics of retrotransposons assessed by high-throughput sequencing in wild relatives of wheat
Danko et al. Minerva: an alignment-and reference-free approach to deconvolve linked-reads for metagenomics
Herath et al. CoMet: a workflow using contig coverage and composition for binning a metagenomic sample with high precision
Fernandes et al. CSA: an efficient algorithm to improve circular DNA multiple alignment
Lei et al. Tumor copy number deconvolution integrating bulk and single-cell sequencing data
Hosseinpoor et al. Proposing a novel community detection approach to identify cointeracting genomic regions
CN109887544A (en) RNA sequence parallel sorting method based on Non-negative Matrix Factorization
Du et al. Species tree inference under the multispecies coalescent on data with paralogs is accurate
Hong et al. To rarefy or not to rarefy: robustness and efficiency trade-offs of rarefying microbiome data
Chowdhury et al. A bi-objective function optimization approach for multiple sequence alignment using genetic algorithm
Xie et al. A combination of boosting and bagging for kdd cup 2009-fast scoring on a large database
Bouckaert bModelTest: Bayesian site model selection for nucleotide data
Vavoulis et al. A statistical approach for tracking clonal dynamics in cancer using longitudinal next-generation sequencing data
Qiao et al. Poisson hurdle model-based method for clustering microbiome features
Prabhakara et al. Mutant-bin: unsupervised haplotype estimation of viral population diversity without reference genome

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant