CN109887544B - RNA sequence parallel classification method based on non-negative matrix factorization - Google Patents

Publication number
CN109887544B
Authority
CN
China
Prior art keywords
bayesian
matrix
rna sequence
coefficient
coefficients
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910060301.6A
Other languages
Chinese (zh)
Other versions
CN109887544A (en)
Inventor
杨晓凯
钟诚
黄毅然
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi University
Original Assignee
Guangxi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangxi University
Priority to CN201910060301.6A
Publication of CN109887544A
Application granted
Publication of CN109887544B
Legal status: Active

Landscapes

  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a parallel RNA sequence classification method based on non-negative matrix factorization. The RNA data are first arranged into a matrix; Bayesian coefficients are then computed from the original data matrix for different values of K and used to constrain the choice of K in the non-negative matrix factorization, which is finally applied to classify the RNA sequences in parallel. The method improves the accuracy of RNA sequence classification, and its use of parallel computation improves the running efficiency of the classification work.

Description

RNA sequence parallel classification method based on non-negative matrix factorization
Technical Field
The invention belongs to the technical field of bioinformatics, and particularly relates to a parallel RNA sequence classification method based on nonnegative matrix factorization.
Background
Bioinformatics tools for analyzing single-cell RNA sequence data still lag behind experimental techniques. In recent years, various methods have been developed to detect subpopulations (or subsets) within a group of cells using single-cell RNA sequence data. These computational tools show how important it is to understand the heterogeneity of single-cell RNA sequences. Furthermore, once the subpopulations are identified, each subpopulation (subclass) has its own characteristic gene expression profile, and finding these key features helps to reveal the underlying biological mechanisms.
Non-negative matrix factorization (NMF) is an effective dimensionality reduction algorithm that has attracted attention for its simple concept, convenient implementation and many variants. It achieves dimensionality reduction by decomposing a high-dimensional matrix into the product of two or more low-dimensional matrices, which makes it convenient to study the properties of high-dimensional data in a low-dimensional space. NMF differs from other methods in that it approximately decomposes a given non-negative matrix into the product of two non-negative matrices; that is, NMF guarantees that every element of the factor matrices is non-negative, and thereby yields an additive representation of the original data based on local features. Compared with other algorithms of the same kind, NMF has the distinctive properties of locality and parts-based representation, and it has an intuitive physical meaning: the whole can be expressed as an additive combination of several parts. For these reasons the NMF algorithm has received great attention in recent years, its theory and implementation have been further improved and developed, and a number of fast and efficient practical algorithms have emerged.
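As an illustration of the factorization described above, the following is a minimal, dependency-free sketch of NMF using the well-known Lee-Seung multiplicative updates for the Frobenius norm. It is not taken from the patent: the function names (nmf, frobenius_error), the iteration count and the small constant eps are assumptions made for the example, and a practical implementation would normally rely on a numerical library.

```python
import random

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def transpose(a):
    return [list(col) for col in zip(*a)]

def nmf(m, rank, iters=500, seed=0, eps=1e-9):
    """Approximate non-negative M (rows x cols) by P (rows x rank) times E (rank x cols)."""
    rng = random.Random(seed)
    rows, cols = len(m), len(m[0])
    # Start from strictly positive random factors (hypothetical initialization).
    p = [[rng.random() + eps for _ in range(rank)] for _ in range(rows)]
    e = [[rng.random() + eps for _ in range(cols)] for _ in range(rank)]
    for _ in range(iters):
        # E <- E * (P^T M) / (P^T P E), element-wise
        pt = transpose(p)
        num, den = matmul(pt, m), matmul(matmul(pt, p), e)
        e = [[e[i][j] * num[i][j] / (den[i][j] + eps) for j in range(cols)]
             for i in range(rank)]
        # P <- P * (M E^T) / (P E E^T), element-wise
        et = transpose(e)
        num, den = matmul(m, et), matmul(p, matmul(e, et))
        p = [[p[i][j] * num[i][j] / (den[i][j] + eps) for j in range(rank)]
             for i in range(rows)]
    return p, e

def frobenius_error(m, p, e):
    """Frobenius norm of M - PE."""
    approx = matmul(p, e)
    return sum((m[i][j] - approx[i][j]) ** 2
               for i in range(len(m)) for j in range(len(m[0]))) ** 0.5
```

Because the updates only multiply non-negative quantities, P and E stay non-negative throughout, which is the property the paragraph above relies on.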
Disclosure of Invention
The invention aims to provide a parallel RNA sequence classification method based on non-negative matrix factorization that improves classification accuracy and operation speed.
The technical scheme adopted by the invention to achieve this aim is as follows: in the parallel RNA sequence classification method based on non-negative matrix factorization, the RNA data are arranged into a matrix, the Bayesian coefficients corresponding to different K values are computed from the original data matrix, the selection of K in the non-negative matrix factorization is constrained according to the Bayesian coefficients, and non-negative matrix factorization is then used to classify the RNA sequences in parallel.
The RNA sequence parallel classification method based on non-negative matrix factorization comprises the following steps:
1) matrixing the RNA data:
the counts of mutations found in G different genomes are assembled into a K × G matrix M with K = |A|, where A is the alphabet of trinucleotide mutation types; if the basis vectors are merged into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data are expressed as M = P × E;
2) computing the Bayesian coefficients for different K values from the original data matrix, and constraining the selection of K in the non-negative matrix factorization according to the Bayesian coefficients:
the matrix obtained in step 1) is treated with an empirical Bayes method, in which the parameter $\theta$, the hyperparameter $\psi$ and the hyperprior parameter $\eta$ are estimated from the original data; for the selected $\eta$ a sampler for $\pi(\theta \mid M, \eta)$ is constructed; with Z denoting a random tensor, a sequence of samples $(Z^{(k)}, \theta^{(k)}, \psi^{(k)})$, $k \ge 1$, is generated iteratively from the complete set of conditional distributions; and the Bayesian coefficients corresponding to the different K values are computed with a standard Gibbs sampling package;
the range of k is determined from the sample size n as 2 ≤ k ≤ n/2; the Bayesian coefficient is computed for each k in this range; an acceptable fluctuation interval W is defined; the k with the largest Bayesian coefficient is found; and, with reference to W, the smallest k whose coefficient lies within the fluctuation interval is selected;
3) classifying using non-negative matrix factorization:
the k obtained in step 2) is taken as the dimension of the non-negative matrix factorization, and P and E are obtained such that $\min \lVert M - PE \rVert$ is satisfied for the original data set; with $E^{(r)}$ as the exposure matrix, the important information about the contribution of each signature in the genomic samples is retained, and signatures whose Differential Exposure Score (DES) is above a specified level are considered to have differential activity in the population;
4) the parallel algorithm is implemented under the R language platform RStudio, with the following steps:
Input: RNA sequences M, reference sequence L, acceptable fluctuation interval W and number of cores p;
Output: RNA sequence classification data;
Begin
1: n ← |M|, the number of RNA sequences;
2: n' ← n/2 − 1, the number of K values for which a Bayesian coefficient is actually calculated;
3: x ← n'/p, the number of K values whose Bayesian coefficients a single core must calculate;
4: according to x, assign to each core the K values whose Bayesian coefficients it is to calculate;
5: each core calculates the Bayesian coefficient for each of its K values and stores it in the array k, indexed by the corresponding K value;
6: traverse the array k and find its maximum value k[m];
7: j ← m − 1, where m is the index of the maximum value in the array k;
8: while ((k[m] − k[j]) / k[m]) < W do
9:   j = j − 1;
10: end while;
11: for i = 1 to j do
12:   taking j as the matrix decomposition dimension, perform non-negative matrix factorization of M into P and E;
13: end for;
14: calculate the RNA sequence classification data by combining the resulting P and E matrices with the reference sequence L;
End.
The beneficial effects of the invention are as follows: compared with the prior art, the method markedly improves classification accuracy; after parallelization its running efficiency is markedly improved, and the time required to classify the same data is markedly reduced.
Drawings
FIG. 1 shows the result of matrixing the RNA data of 21 breast cancer cells;
FIG. 2 shows the classification result for the 21 breast cancer RNA data samples with a reference sequence length of 12.
Detailed Description
The parallel RNA sequence classification method based on non-negative matrix factorization of the invention comprises the following specific steps:
1) matrixing RNA data:
Somatic mutations include single base substitutions, insertions and deletions, rearrangements and copy number variations (CNVs). Each single base substitution belongs to one of six possible base changes, namely C:G > A:T, C:G > G:C, C:G > T:A, T:A > A:T, T:A > C:G and T:A > G:C. This set can be further expanded by including the 5' and 3' neighboring bases of each substitution site, giving an alphabet A of 96 trinucleotide mutation types. Once A is defined, the counts of mutations found in G different genomes are assembled into a K × G matrix M with K = |A|. One key assumption is that the counts in M are the additive effect of N mutational processes, each defined by a K × 1 vector of mutation rates; these vectors define the so-called mutational signatures. More precisely, the mutations in all genomes result from a linear combination of N basis vectors of dimension K × 1, with mixing coefficients given by N exposure vectors of size 1 × G. If the basis vectors are merged into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data can be written simply as M = P × E.
FIG. 1 is a detailed result chart of the RNA data matrixing of 21 breast cancer cells.
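The matrixing step can be sketched as follows. The example enumerates the 96 trinucleotide mutation types (six substitution classes times four 5' bases times four 3' bases) and assembles the count matrix M from a list of observed mutations. The label format such as A[C>T]G, the function names and the input layout are assumptions chosen for illustration, not details given in the patent.

```python
from itertools import product

BASES = "ACGT"
# The six substitution classes, named by the pyrimidine of the mutated base pair:
# C:G>A:T, C:G>G:C, C:G>T:A, T:A>A:T, T:A>C:G, T:A>G:C.
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]

def mutation_alphabet():
    """Return the 96 mutation types as 5' base, substitution, 3' base."""
    return [f"{five}[{sub}]{three}"
            for sub, five, three in product(SUBSTITUTIONS, BASES, BASES)]

def count_matrix(mutations, genomes):
    """Assemble the K x G count matrix M from (genome, mutation_type) pairs."""
    alphabet = mutation_alphabet()
    row = {mut_type: i for i, mut_type in enumerate(alphabet)}
    col = {genome: j for j, genome in enumerate(genomes)}
    m = [[0] * len(genomes) for _ in alphabet]
    for genome, mut_type in mutations:
        m[row[mut_type]][col[genome]] += 1
    return alphabet, m
```

Each row of M corresponds to one mutation type and each column to one genome, so M has K = 96 rows regardless of how many genomes are analyzed.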
2) Computing the Bayesian coefficients for different K values from the original data matrix, and constraining the selection of K in the non-negative matrix factorization according to the Bayesian coefficients:
The matrix obtained in step 1) is treated with an empirical Bayes method, in which the parameter $\theta$, the hyperparameter $\psi$ and the hyperprior parameter $\eta$ are estimated from the original data. For the selected $\eta$ a sampler for $\pi(\theta \mid M, \eta)$ is constructed. With Z denoting a random tensor, a sequence of samples $(Z^{(k)}, \theta^{(k)}, \psi^{(k)})$, $k \ge 1$, is generated iteratively from the complete set of conditional distributions, and these random samples are used to update the parameter $\theta$, the hyperparameter $\psi$ and the hyperprior parameter $\eta$. In this way the Bayesian coefficients corresponding to the different K values are computed with a standard Gibbs sampling package.
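To make the idea of sampling from a complete set of conditional distributions concrete, here is a small sketch of a Gibbs sampler for a Poisson NMF model with Gamma priors, a standard textbook construction. It is an assumption-laden illustration and not the patent's procedure: the hyperparameters a, b, c, d are held fixed instead of being estimated by empirical Bayes, the hyperparameter and hyperprior updates are omitted, and no Bayesian coefficient is computed. All names are hypothetical.

```python
import random

def gibbs_poisson_nmf(m, n_sig, sweeps=200, a=1.0, b=1.0, c=1.0, d=1.0, seed=0):
    """Gibbs sampling for M[i][g] ~ Poisson(sum_s P[i][s] * E[s][g]).

    Assumed priors: P ~ Gamma(shape a, rate b), E ~ Gamma(shape c, rate d).
    Returns a list of (Z, P, E) samples, one per sweep.
    """
    rng = random.Random(seed)
    n_types, n_gen = len(m), len(m[0])
    # random.gammavariate takes (shape, scale), so scale = 1 / rate.
    p = [[rng.gammavariate(a, 1.0 / b) for _ in range(n_sig)] for _ in range(n_types)]
    e = [[rng.gammavariate(c, 1.0 / d) for _ in range(n_gen)] for _ in range(n_sig)]
    samples = []
    for _ in range(sweeps):
        # Z | M, P, E: split each count M[i][g] across signatures (multinomial draw).
        z = [[[0] * n_gen for _ in range(n_sig)] for _ in range(n_types)]
        for i in range(n_types):
            for g in range(n_gen):
                if m[i][g] > 0:
                    weights = [p[i][s] * e[s][g] for s in range(n_sig)]
                    for s in rng.choices(range(n_sig), weights=weights, k=m[i][g]):
                        z[i][s][g] += 1
        # P | Z, E: Gamma(a + sum_g Z[i][s][g], rate b + sum_g E[s][g]).
        for i in range(n_types):
            for s in range(n_sig):
                p[i][s] = rng.gammavariate(a + sum(z[i][s]), 1.0 / (b + sum(e[s])))
        # E | Z, P: Gamma(c + sum_i Z[i][s][g], rate d + sum_i P[i][s]).
        for s in range(n_sig):
            p_total = sum(p[i][s] for i in range(n_types))
            for g in range(n_gen):
                shape = c + sum(z[i][s][g] for i in range(n_types))
                e[s][g] = rng.gammavariate(shape, 1.0 / (d + p_total))
        samples.append((z, [r[:] for r in p], [r[:] for r in e]))
    return samples
```

Here Z plays the role of the random tensor: its entries split every observed count among the signatures, which is what makes the conditional distributions of P and E available in closed form.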
The range of k is determined from the sample size n as 2 ≤ k ≤ n/2, and the Bayesian coefficient is computed for each k in this range. An acceptable fluctuation interval W is defined, the k with the largest Bayesian coefficient is found, and, with reference to W, the smallest k whose coefficient lies within the fluctuation interval is selected.
Taking the 21 breast cancer RNA data samples as an example, the procedure above yields k = 3.
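The selection rule for k can be sketched as below. The sketch follows one reading of the text: starting from the k with the largest Bayesian coefficient, it walks downward while the relative drop from the maximum stays below W and returns the smallest k reached. The names candidate_ks and select_k, the dictionary layout and the assumption that the coefficients are positive are all illustrative, and the scores in the usage note are made up.

```python
def candidate_ks(n):
    """Candidate dimensions 2 <= k <= n/2 for a sample size n."""
    return range(2, n // 2 + 1)

def select_k(scores, w):
    """Pick the smallest k whose score stays within the fluctuation interval w.

    scores maps each candidate k to its Bayesian coefficient (assumed positive).
    """
    best_k = max(scores, key=scores.get)
    best = scores[best_k]
    chosen = best_k
    j = best_k - 1
    # Walk downward while the relative drop from the maximum stays below w.
    while j in scores and (best - scores[j]) / best < w:
        chosen = j
        j -= 1
    return chosen
```

For example, with the hypothetical scores {2: 0.80, 3: 0.97, 4: 0.99, 5: 1.00, 6: 0.90} and W = 0.05, the maximum is at k = 5 and select_k returns 3, because k = 4 and k = 3 are within 5% of the maximum while k = 2 is not.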
3) Classification using non-negative matrix factorization:
The k obtained in step 2) is taken as the dimension of the non-negative matrix factorization, and P and E are obtained such that $\min \lVert M - PE \rVert$ is satisfied for the original data set. With $E^{(r)}$ as the exposure matrix, the exposures can be related to independent knowledge, such as clinical data, in order to examine how the activity of each mutational process is associated with that knowledge. In particular, when prior information induces a partition of the samples into two or more classes, the Kruskal-Wallis test is used to check whether the exposure values differ significantly between the classes. The negative median of the logarithms of the resulting p-values defines the Differential Exposure Score (DES). In this way the important information about the contribution of each signature in the genomic samples is retained. Signatures with a DES above a specified level are considered to have differential activity in the population.
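A possible way to compute such a score for one signature is sketched below, under several stated assumptions: only two classes are handled, exposure values are assumed to have no ties, the DES is taken to be the negative median of the log p-values over the sampled exposure matrices, and all function names are hypothetical. For two groups the Kruskal-Wallis statistic H is referred to a chi-square distribution with one degree of freedom, whose tail probability is erfc(sqrt(H / 2)).

```python
import math
from statistics import median

def kruskal_two_groups(x, y):
    """Kruskal-Wallis p-value for two groups, assuming no tied values."""
    pooled = sorted((v, label) for label, grp in ((0, x), (1, y)) for v in grp)
    n = len(pooled)
    rank_sums = [0.0, 0.0]
    for rank, (_, label) in enumerate(pooled, start=1):
        rank_sums[label] += rank
    h = (12.0 / (n * (n + 1))
         * (rank_sums[0] ** 2 / len(x) + rank_sums[1] ** 2 / len(y))
         - 3 * (n + 1))
    # Chi-square survival function with 1 degree of freedom.
    return math.erfc(math.sqrt(max(h, 0.0) / 2.0))

def des(exposure_samples, labels):
    """Differential Exposure Score for one signature.

    exposure_samples: one exposure vector per sampled exposure matrix,
    each holding one value per genome; labels: class 0 or 1 per genome.
    """
    log_p = []
    for row in exposure_samples:
        x = [v for v, lab in zip(row, labels) if lab == 0]
        y = [v for v, lab in zip(row, labels) if lab == 1]
        log_p.append(math.log(kruskal_two_groups(x, y)))
    return -median(log_p)
```

A larger score means the exposures separate the two classes more consistently across samples; with more than two classes a general chi-square tail probability would be needed in place of the erfc shortcut.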
FIG. 2 shows the classification result for the 21 breast cancer RNA data samples with a reference sequence length of 12.
4) Parallel RNA sequence classification is implemented under the R language platform RStudio:
The specific steps, given in pseudocode form, are as follows:
Input: breast cancer RNA sequences M, reference sequence L, acceptable fluctuation interval W and number of cores p;
Output: breast cancer RNA sequence classification data;
Begin
1: n ← |M|, the number of RNA sequences;
2: n' ← n/2 − 1, the number of K values for which a Bayesian coefficient is actually calculated;
3: x ← n'/p, the number of K values whose Bayesian coefficients a single core must calculate;
4: according to x, assign to each core the K values whose Bayesian coefficients it is to calculate;
5: each core calculates the Bayesian coefficient for each of its K values and stores it in the array k, indexed by the corresponding K value;
6: traverse the array k and find its maximum value k[m];
7: j ← m − 1, where m is the index of the maximum value in the array k;
8: while ((k[m] − k[j]) / k[m]) < W do
9:   j = j − 1;
10: end while;
11: for i = 1 to j do
12:   taking j as the matrix decomposition dimension, perform non-negative matrix factorization of M into P and E;
13: end for;
14: calculate the breast cancer RNA sequence classification data by combining the resulting P and E matrices with the reference sequence L;
End.
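Steps 1 to 5 of the pseudocode, which distribute the candidate K values over the available cores, can be sketched as follows. The patent implements this in R; the Python version below is only an illustration, and the names split_k_values, score_chunk and parallel_scores, the round-robin assignment and the score_fn callback (standing in for the Bayesian coefficient computation) are assumptions.

```python
from concurrent.futures import ProcessPoolExecutor

def split_k_values(n, p):
    """Steps 1-4: candidate K values 2..n//2, dealt out to p workers."""
    ks = list(range(2, n // 2 + 1))        # n' = n/2 - 1 candidate values
    # Round-robin assignment gives each worker about n'/p values.
    return [ks[i::p] for i in range(p)]

def score_chunk(args):
    """Work done by one core: score every K value in its chunk."""
    score_fn, chunk = args
    return {k: score_fn(k) for k in chunk}

def parallel_scores(n, p, score_fn, executor_cls=ProcessPoolExecutor):
    """Step 5: compute all scores in parallel and merge them, keyed by K."""
    chunks = split_k_values(n, p)
    scores = {}
    with executor_cls(max_workers=p) as pool:
        for part in pool.map(score_chunk, [(score_fn, c) for c in chunks]):
            scores.update(part)
    return scores
```

With the default process pool, score_fn has to be a module-level function so that it can be sent to the worker processes; for a quick check a thread pool can be passed as executor_cls instead. Round-robin assignment was chosen here because larger K values usually cost more to evaluate, so interleaving them tends to balance the load; contiguous blocks of size x would match the pseudocode equally well.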
The classification experiment was carried out with the above method on several data sets, including the 21 breast cancer data samples. Tables 1 and 2 compare the results of classifying the 21 breast cancer data samples with the present invention and with the existing method (see Rosales R A, Drummond R D, Valieris R, et al. signeR: An empirical Bayesian approach to mutational signature discovery [J]. Bioinformatics, 2016, 33(1): 8.); the method of the present invention achieves higher classification accuracy and faster computation.
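TABLE 1 Comparison of the accuracy of the present invention with the existing method
[Table provided as an image in the original publication: Figure BDA0001953928750000051]
TABLE 2 Comparison of the run time of the present invention with the existing method
[Table provided as an image in the original publication: Figure BDA0001953928750000052]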

Claims (1)

1. A parallel RNA sequence classification method based on non-negative matrix factorization, in which RNA data are matrixed, Bayesian coefficients corresponding to different K values are computed from the original data matrix, the selection of K in the non-negative matrix factorization is constrained according to the Bayesian coefficients, and parallel RNA sequence classification is performed using non-negative matrix factorization; characterized in that the method comprises the following steps:
1) matrixing the RNA data:
the counts of mutations found in G different genomes are assembled into a K × G matrix M; if the basis vectors are merged into a K × N signature matrix P and the coefficient vectors into an N × G matrix E, the RNA data are expressed as M = P × E;
2) computing the Bayesian coefficients for different K values from the original data matrix, and constraining the selection of K in the non-negative matrix factorization according to the Bayesian coefficients:
the matrix obtained in step 1) is treated with an empirical Bayes method, in which the parameter $\theta$, the hyperparameter $\psi$ and the hyperprior parameter $\eta$ are estimated from the original data; for the selected $\eta$ a sampler for $\pi(\theta \mid M, \eta)$ is constructed; with Z denoting a random tensor, a sequence of samples $(Z^{(k)}, \theta^{(k)}, \psi^{(k)})$, $k \ge 1$, is generated iteratively from the complete set of conditional distributions; and the Bayesian coefficients corresponding to the different K values are computed with a standard Gibbs sampling package;
the range of k is determined from the sample size n as 2 ≤ k ≤ n/2; the Bayesian coefficient is computed for each k in this range; an acceptable fluctuation interval W is defined; the k with the largest Bayesian coefficient is found; and, with reference to W, the smallest k whose coefficient lies within the fluctuation interval is selected;
3) classifying using non-negative matrix factorization:
the k obtained in step 2) is taken as the dimension of the non-negative matrix factorization, and P and E are obtained such that $\min \lVert M - PE \rVert$ is satisfied for the original data set; with $E^{(r)}$ as the exposure matrix, the important information about the contribution of each signature in the genomic samples is retained;
4) the parallel algorithm is implemented under the R language platform RStudio, with the following steps:
Input: RNA sequences M, reference sequence L, acceptable fluctuation interval W and number of cores p;
Output: RNA sequence classification data;
Begin
1: n ← |M|, the number of RNA sequences;
2: n' ← n/2 − 1, the number of K values for which a Bayesian coefficient is actually calculated;
3: x ← n'/p, the number of K values whose Bayesian coefficients a single core must calculate;
4: according to x, assign to each core the K values whose Bayesian coefficients it is to calculate;
5: each core calculates the Bayesian coefficient for each of its K values and stores it in the array k, indexed by the corresponding K value;
6: traverse the array k and find its maximum value k[m];
7: j ← m − 1, where m is the index of the maximum value in the array k;
8: while ((k[m] − k[j]) / k[m]) < W do
9:   j = j − 1;
10: end while;
11: for i = 1 to j do
12:   taking j as the matrix decomposition dimension, perform non-negative matrix factorization of M into P and E;
13: end for;
14: calculate the RNA sequence classification data by combining the resulting P and E matrices with the reference sequence L;
End.
CN201910060301.6A 2019-01-22 2019-01-22 RNA sequence parallel classification method based on non-negative matrix factorization Active CN109887544B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910060301.6A CN109887544B (en) 2019-01-22 2019-01-22 RNA sequence parallel classification method based on non-negative matrix factorization

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910060301.6A CN109887544B (en) 2019-01-22 2019-01-22 RNA sequence parallel classification method based on non-negative matrix factorization

Publications (2)

Publication Number Publication Date
CN109887544A CN109887544A (en) 2019-06-14
CN109887544B true CN109887544B (en) 2022-07-05

Family

ID=66926595

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910060301.6A Active CN109887544B (en) 2019-01-22 2019-01-22 RNA sequence parallel classification method based on non-negative matrix factorization

Country Status (1)

Country Link
CN (1) CN109887544B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110491443B (en) * 2019-07-23 2022-04-01 华中师范大学 lncRNA protein correlation prediction method based on projection neighborhood non-negative matrix decomposition
CN111370060A (en) * 2020-03-21 2020-07-03 广西大学 Protein interaction network co-location co-expression complex recognition system and method

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3040930A1 (en) * 2016-11-07 2018-05-11 Grail, Inc. Methods of identifying somatic mutational signatures for early cancer detection

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102186987A (en) * 2008-04-24 2011-09-14 阿肯色大学托管委员会 Gene expression profiling based identification of genomic signature of high-risk multiple myeloma and uses thereof
CN102696034A (en) * 2008-10-31 2012-09-26 雅培制药有限公司 Genomic classification of non-small cell lung carcinoma based on patterns of gene copy number alterations
CN103235900A (en) * 2013-03-28 2013-08-07 中山大学 Weight assembly clustering method for excavating protein complex
CN107016261A (en) * 2017-04-11 2017-08-04 曲阜师范大学 Difference expression gene discrimination method based on joint constrained non-negative matrix decomposition

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
miRNA与疾病关系中分类预测方法研究;孙华 等;《电脑知识与技术》;20170425;第13卷(第12期);198、201 *
Robust classification of single-cell transcriptome data by nonnegative matrix factorization;Chunxuan Shao 等;《Bioinformatics》;20160923;第32卷(第2期);235-242 *
改进的非负矩阵分解算法在miRNA与基因互作关系的研究;尤燕玲;《中国优秀博硕士学位论文全文数据库(硕士)基础科学辑》;20131215(第S2期);A006-88 *

Also Published As

Publication number Publication date
CN109887544A (en) 2019-06-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant