WO2022061974A1 - 用于快速转录组表达定量的数据处理方法、装置及存储介质 - Google Patents

用于快速转录组表达定量的数据处理方法、装置及存储介质 Download PDF

Info

Publication number
WO2022061974A1
WO2022061974A1 PCT/CN2020/120407 CN2020120407W WO2022061974A1 WO 2022061974 A1 WO2022061974 A1 WO 2022061974A1 CN 2020120407 W CN2020120407 W CN 2020120407W WO 2022061974 A1 WO2022061974 A1 WO 2022061974A1
Authority
WO
WIPO (PCT)
Prior art keywords
transcriptome
subsequence
sequence
reference matrix
target vector
Prior art date
Application number
PCT/CN2020/120407
Other languages
English (en)
French (fr)
Inventor
易会广
Original Assignee
易会广
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 易会广 filed Critical 易会广
Publication of WO2022061974A1 publication Critical patent/WO2022061974A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • the invention belongs to the technical field of data processing, and in particular relates to a data processing method, device and storage medium for rapid transcriptome expression quantification.
  • transcriptome sample sequencing and transcriptome expression quantification are widely used in biomedical research.
  • the existing quantitative methods of transcriptome expression mainly include two categories: the quantitative method of transcriptome expression based on sequence alignment and the method of quantitative transcriptome expression not based on sequence alignment.
  • the main representative of the transcriptome expression quantification method not based on sequence alignment is Kallisto, which has the advantage of extremely fast speed, and for the sequencing data (such as: next-generation sequencing data) generated by the current mainstream sequencing platforms, it transcribes
  • the accuracy of quantitative calculation of group expression is also one of the highest among several methods.
  • the use of kallisto to quantify transcriptome expression has the problem of poor accuracy.
  • the purpose of this application is to provide a data processing method, electronic device and computer-readable storage medium for rapid transcriptome expression quantification, which can be applied not only to mainstream sequencing data characterized by short read length and high throughput, but also to For third-generation sequencing data characterized by long read lengths, it can improve the speed and accuracy of transcriptome expression quantification.
  • the embodiments of the present application provide a data processing method for rapid transcriptome expression quantification, including:
  • the expression level of the transcriptome is obtained and output.
  • An embodiment of the present application further provides an electronic device, the electronic device includes: a memory, a processor, and a computer program stored on the memory and executable on the processor, the computer program including:
  • an index building module for obtaining a reference transcriptome sequence, decomposing the reference transcriptome sequence into a first subsequence of preset length, and constructing a reference matrix according to the first subsequence;
  • a sequencing data processing module configured to obtain sequencing data of the transcriptome, decompose the sequencing data into a second subsequence of the preset length, and construct a target vector according to the second subsequence;
  • An expression quantification module configured to obtain and output the expression amount of the transcriptome by using the reference matrix, the target vector and a preset linear equation system.
  • Embodiments of the present application further provide a computer-readable storage medium, on which a computer program is stored, and when the computer program is executed by a processor, realizes the data processing for rapid transcriptome expression quantification as shown in the foregoing embodiments method.
  • the expression level of the transcriptome is obtained by using the reference matrix constructed based on the reference transcriptome sequence, the target vector constructed based on the sequencing data of the transcriptome, and the preset linear equation system.
  • the accuracy of the obtained expression level is higher than that of kallisto, and at the same time, because the time of sequence alignment is saved, the speed of data processing is further improved.
  • FIG. 1 is a schematic flowchart of a data processing method for rapid transcriptome expression quantification provided by an embodiment of the present application
  • step S101 is a schematic flowchart of the implementation of step S101 in the data processing method for rapid transcriptome expression quantification provided by an embodiment of the present application;
  • step S102 is a schematic flowchart of the implementation of step S102 in the data processing method for rapid transcriptome expression quantification provided by an embodiment of the present application;
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a data processing method for rapid transcriptome expression quantification provided in an embodiment of the application, and the method can be applied to, for example, desktop computers, laptop computers, tablet computers, personal computers, servers, and other Computer equipment that performs data processing in a mobile or non-mobile environment.
  • the method mainly includes the following steps S101-S103.
  • Step S101 obtaining a reference transcriptome sequence, decomposing the reference transcriptome sequence into a first subsequence of a preset length, and constructing a reference matrix according to the first subsequence.
  • the obtained reference transcriptome sequence includes, for example, a set of transcript sequences.
  • the first subsequence of preset length namely the first k-mer.
  • k-mer refers to a subsequence of length k in a biological sequence, where the value of k is any predetermined positive integer smaller than the length of the biological sequence.
  • step S101 specifically includes the following steps:
  • the reference transcriptome is a collection of transcripts, that is, the reference transcriptome sequence includes a plurality of transcript sequences, and the data format thereof can be, for example, the following fasta format:
  • Step S1012 decomposes each of the above-mentioned transcript sequences into a first subsequence of k length (ie, the first k-mer), where the value of k is preferably an integer in the range of 5-31.
  • the preset k value will be uniformly used in all subsequent data processing.
  • the value of k is assumed to be 11 in this example and the following examples.
  • the set of 11-mers corresponding to each transcript sequence obtained by the decomposition (hereinafter referred to as the 11-mer set) is recorded.
  • the decomposition results are as follows:
  • the reference matrix A is constructed through step S1013, as shown below.
  • the construction of the reference matrix A may have various data structures, and it is preferably a sparse matrix.
  • the storage mode of the reference matrix A is shown in Table 1 below.
  • the hash index is the k-mer (or k-mer number), and the hash value is an array of numbers containing all transcript sequences for this k-mer.
  • each row of the reference matrix A corresponds to a k-mer (that is, a first subsequence of preset length)
  • each column corresponds to a transcript sequence
  • the value in the reference matrix A indicates the row to which the value belongs. Whether the k-mer of the value exists in the transcript sequence of the column to which the value belongs, or, the number of times the k-mer of the row to which the value belongs appears in the transcript sequence of the column to which the value belongs.
  • Step S102 acquiring sequencing data of the transcriptome, decomposing the sequencing data into a second subsequence of the preset length, and constructing a target vector according to the second subsequence.
  • the sequencing data of the transcriptome can be obtained from a database, or can also be generated by sequencing.
  • the sequencing data is decomposed according to the same preset length (ie, the k-length) as the first subsequence described above, so as to obtain the second subsequence (ie, the second k-mer) of the k-length.
  • step S102 may specifically include the following steps:
  • the transcriptome sample can be sequenced by a sequencing tool to obtain sequencing data.
  • the sequencing tools may include, but are not limited to, second-generation (eg, illumina) or third-generation (eg, nanopore) sequencers, and single-cell sequencing (eg, 10Xgenomics) platforms.
  • the sequencing data obtained by a sequencer is generally a set of reads, and the data format can be, for example, the following fasta format.
  • the sequencing data usually includes read sequences and quality sequences.
  • the target vector b when constructing the target vector b according to the sequencing data, only the read sequences in the sequencing data do not need the quality sequences.
  • step S1021 the transcriptome sample is sequenced by a second-generation or third-generation sequencer to obtain the above-mentioned fasta-format sequencing data of the transcriptome, and the sequencing data can be specifically as follows:
  • each read sequence in the sequencing data is decomposed into second subsequences of k length (ie, second k-mers), and a set of second k-mers (hereinafter referred to as second k-mers) is obtained mer set).
  • the above-mentioned second k-mer set is counted through step S1023, and the target vector b shown below is constructed according to the counting result, and it is not necessary to distinguish which read sequence is from when counting.
  • Each row in the target vector b corresponds to a second k-mer, and the value in this row represents the number of occurrences of the second k-mer corresponding to that row in the sequencing data.
  • the specific method of counting is, for example, using A ij to represent the number of occurrences of the i-th 11-mer in the j-th transcript sequence or whether it has appeared before.
  • another statistical method can also be used, that is, b only records the number of read sequences in which the k-mer has appeared.
  • b only records the number of read sequences in which the k-mer has appeared.
  • the algorithm for constructing the target vector b can be implemented with an array inside.
  • the target vector b may be stored in a hash table, and the specific storage method may refer to the aforementioned reference matrix A, which will not be repeated here.
  • the aforementioned reference matrix A and target vector b in order to further improve the accuracy of the obtained transcriptome expression, when constructing the aforementioned reference matrix A and target vector b, skip the first subsequence of preset length. In the set and the set of second subsequences of preset length, the subsequences that meet the preset conditions.
  • the above-mentioned preset conditions include at least one of the following conditions: subsequences that have appeared in more than a preset transcript sequence; subsequences that appear more than a preset number of times in the same transcript sequence; sequencing technology sequences; and preset custom sequences.
  • the preset user-defined sequence refers to a sequence that needs to be skipped and set according to a user-defined operation.
  • the more k-mers appear in the transcript sequences the lower the information content, and therefore the more necessary to skip.
  • a k-mer appears multiple times (eg, >3 times) in the same transcript sequence, it may be a short repetitive fragment in the genome, which may affect quantification (eg: AAAAAAAA, TTTTTTTTT), so Can be skipped.
  • the sequencing technology sequence that is, the short sequence that is not derived from the sequencing sample itself, which is specially added for the convenience of sequencing, such as: molecular marker sequence, linker sequence and other technical sequences (non-biological sequences) in the sequencing data. This short sequence has a certain small size. The probability is the same as the k-mer of the transcript sequence, so it can be skipped.
  • the above skip operation can be implemented by setting the values in the reference matrix A and the target vector b to 0. For example, if a k-mer corresponds to the ith row in the reference matrix A and the target vector b, and the k-mer is repeated many times in the jth transcript sequence, then the reference matrix A ij The value is set to 0, and the value of bi in the target vector b is set to 0. With the above settings, in the internal implementation of the algorithm when A ij and bi are constructed, this k-mer will be skipped directly.
  • the reference matrix A is very sparse, and the data structure of the reference matrix A can be realized by hashing in the actual algorithm, that is, only k-mers that have appeared in the reference transcriptome are stored, and those in the k-mers that appear in the sequencing data but not in the reference transcriptome, and k-mers that appear in neither the sequencing data nor the reference transcriptome will be skipped, that is, the reference matrix A and the target vector
  • the read length of the predetermined sequencing data is L (generally, the read length is 100-150). Then, in each transcript sequence (assuming that the length of the transcript sequence is s, here is assumed to be greater than 2*L), the distance between the two ends of the transcript sequence is less than or equal to L-k+1 k-mer is at the original value Multiply by the weight w proportional to this distance, and the remaining distances from the two ends of the transcript sequence are greater than L-k+1 k-mer (that is, the L-k+1 k-mer to the sk-1 k-mer) ) is a fixed value (both equal to the weight of the Lkth k-mer).
  • the weights of the 1st to 90th k-mers of the transcript sequence are set to be 1 to 90, the weights of the 91st to 210th k-mers are 90, and the weights of the 211th to 300th k-mers are 90, respectively.
  • the weighted reference matrix A is obtained by multiplying the above weight by the original value in the reference matrix A.
  • the following calculation steps are the same. Multiply the finally calculated transcriptome expression level x by 90 to obtain the new transcriptome expression level x. Among them, for transcript sequences less than 2L, it can be skipped.
  • Step S103 using the reference matrix, the target vector and the preset linear equation system to obtain and output the expression level of the transcriptome.
  • the rows of the reference matrix A correspond one-to-one with the rows of the target vector b.
  • the two are in a one-to-one correspondence based on the k-mer space.
  • the two are in a strict one-to-one correspondence in the same 11-mer space, that is, each row in the 11-mer space corresponds to a 11-mer space.
  • the first line corresponds to AAAAAAAAAA
  • the second line corresponds to AAAAAAAAAC
  • the reference matrix A and the target vector b each have 411 rows, corresponding to AAAAAAAAAAA, AAAAAAAAAAC, AAAAAAAAAAG, . . . TTTTTTTTTTT.
  • the least squares method is used to estimate x, that is, the following formula (1) can be used to obtain the estimated value of the expression level of the transcriptome as the expression level x of the transcriptome.
  • A is the reference matrix
  • b is the target vector.
  • a T A and A T b can be calculated by using a known matrix operation package, and the calculation method is not specifically limited in this embodiment.
  • LU decomposition (LU Decomposition, triangular decomposition) may be performed on the A T A matrix first, and other matrix decomposition methods may also be selected in practical applications. Different linear equation solving packages may use different matrix factorization methods. But the form is the solution
  • the inverse matrix (A T A) -1 can also be obtained first, and then the following formula (2) is used to obtain the estimated value of the expression level of the transcriptome. as the expression level x of the transcriptome.
  • A is the reference matrix
  • b is the target vector.
  • the above (A T A) -1 A T may be calculated by using a known matrix operation package, and the calculation method is not specifically limited in this embodiment.
  • the representation of the rows and columns of the reference matrix A can be interchanged, that is, the columns of the reference matrix A and the columns of the target vector b are in a one-to-one correspondence based on the k-mer space.
  • the columns of the reference matrix A that is, the above formula A T
  • correspond to the columns of the target vector b that is, the above formula b T
  • the following formula (3) can be used to solve the estimated value of the expression level of the transcriptome as the expression level x of the transcriptome.
  • the following formula (4) can also be used to obtain the estimated expression level of the transcriptome as the expression level x of the transcriptome.
  • A is the reference matrix
  • b is the target vector
  • each column of the reference matrix A corresponds to a first k-mer
  • each row corresponds to a transcript sequence
  • the value in the reference matrix A indicates whether the first k-mer of the column to which the value belongs exists in the row to which the value belongs.
  • each column in the target vector b corresponds to a second k-mer, and its value represents the number of occurrences of the second k-mer in the sequencing data.
  • the gene expression amount (x) cannot be negative, that is to say, a negative solution is meaningless.
  • the solution result is limited to a non-negative solution, that is, taking the above formula (1) as an example, the value of the solution needs to satisfy and this condition.
  • steps S101 and S102 are independent of each other, and can be executed synchronously or asynchronously.
  • steps S102 and S103 are independent of each other, and the execution order can be interchanged.
  • the data processing method for rapid transcriptome expression quantification provided in this embodiment has a wide range of applications, not only can be applied to the current mainstream sequencing data characterized by short read length and high throughput, but also can be applied to the emerging long-read sequencing data. It has the advantages of fast and accurate transcriptome expression quantification.
  • the expression level of the transcriptome is obtained by using the reference matrix constructed based on the reference transcriptome sequence, the target vector constructed based on the transcriptome sequencing data, and the preset linear equation system.
  • the accuracy of the obtained expression level is higher than that of kallisto.
  • the speed of data processing is further improved.
  • FIG. 4 is a schematic structural diagram of an electronic device provided by an embodiment of the present application.
  • the electronic device 30 includes: a memory 50, a processor 60, and a computer program 40 stored on the memory 50 and running on the processor 60.
  • the processor 60 executes the computer program 40, The data processing method for rapid transcriptome expression quantification as in the previous example is implemented.
  • the computer program 40 includes:
  • an index building module 401 configured to obtain a reference transcriptome sequence, decompose the reference transcriptome sequence into a first subsequence of preset length, and construct a reference matrix according to the first subsequence;
  • a sequencing data processing module 402 configured to obtain sequencing data of the transcriptome, decompose the sequencing data into a second subsequence of the preset length, and construct a target vector according to the second subsequence;
  • the expression quantification module 403 is configured to obtain and output the expression amount of the transcriptome by using the reference matrix, the target vector and the preset linear equation system.
  • the index building module 401 includes:
  • a download module for obtaining the reference transcriptome sequence from the database, where the reference transcriptome sequence includes a plurality of transcript sequences;
  • a transcript decomposition module configured to decompose each of the transcript sequences, respectively, to obtain a set of the first subsequences corresponding to each of the transcript sequences;
  • a construction module for constructing the reference matrix according to the set of the first subsequences corresponding to each of the transcript sequences.
  • each row of the reference matrix corresponds to one of the first subsequences
  • each column corresponds to one of the transcript sequences
  • the value in the reference matrix indicates whether the first subsequence of the row to which the value belongs exists in the value.
  • the transcript sequence of the column to which the value belongs or, the number of times that the first subsequence of the row to which the value belongs appears in the transcript sequence of the column to which the value belongs.
  • each column of the reference matrix corresponds to one of the first subsequences
  • each row corresponds to one of the transcript sequences
  • the value in the reference matrix indicates whether the first subsequence of the column to which the value belongs exists in the value.
  • the transcript sequence of the row to which the value belongs or, the number of times that the first subsequence of the column to which the value belongs appears in the transcript sequence of the row to which the value belongs.
  • the sequencing data processing module 402 includes:
  • a sequencing module for sequencing the transcriptome sample to obtain the sequencing data
  • a read segment decomposition module configured to decompose the read segment sequence in the sequencing data to obtain the set of the second subsequence
  • the counting module is used for counting the set of the second subsequences, and constructing the target vector according to the counting result.
  • each row in the target vector corresponds to one type of the second subsequence, and the value in each row represents the number of occurrences of the corresponding second subsequence in the sequencing data.
  • each column in the target vector corresponds to one type of the second subsequence
  • the value in each column represents the number of occurrences of the corresponding second subsequence in the sequencing data.
  • the rows of the reference matrix are in one-to-one correspondence with the rows of the target vector, and the expression quantification module 403 is further configured to use the following formula (1) or formula (2) to solve the estimated value of the expression level of the transcriptome as the transcript.
  • A is the reference matrix
  • b is the target vector
  • the columns of the reference matrix are in one-to-one correspondence with the columns of the target vector, and the expression quantification module 403 is further configured to use the following formula (3) or formula (4) to solve the estimated value of the expression level of the transcriptome as the transcript.
  • A is the reference matrix
  • b is the target vector
  • the computer program 40 further includes:
  • the skipping module is used for skipping the subsequences that meet the preset conditions in the first subsequence and the second subsequence when constructing the reference matrix and the target vector.
  • the preset condition includes at least one of the following conditions:
  • Subsequences that appear in the transcript sequence exceeding a preset number are included in the transcript sequence exceeding a preset number; subsequences that appear more than a preset number of times in the same transcript sequence; sequencing technology sequences; and preset custom sequences.
  • the computer program 40 further includes:
  • the weight setting module is configured to set different weights for the values in the reference matrix according to the positions of the first subsequences in the corresponding transcript sequences when the reference matrix is constructed.
  • the solution result is limited to a non-negative solution, that is, taking the above formula (1) as an example, the solution value to be satisfied and this condition.
  • the electronic device further includes: at least one input device and at least one output device.
  • the above-mentioned memory 50, processor 60, input device and output device are connected through a bus.
  • the input device may specifically be a camera, a touch panel, a physical button, or the like.
  • the output device may specifically be a display, a printer, a radio frequency module, etc., wherein the display may include, but is not limited to, a touch or non-touch picture tube (CRT) display, a liquid crystal display (LCD), an LED display, and the like.
  • CTR touch or non-touch picture tube
  • LCD liquid crystal display
  • LED display and the like.
  • the memory 50 may be a high-speed random access memory (RAM, Random Access Memory) memory, or may be a non-volatile memory (non-volatile memory), such as a disk memory.
  • RAM Random Access Memory
  • non-volatile memory such as a disk memory.
  • the memory 50 is used to store a set of executable program codes, and the processor 60 is coupled to the memory 50 .
  • the expression level of the transcriptome is obtained by using the reference matrix constructed based on the reference transcriptome sequence, the target vector constructed based on the transcriptome sequencing data, and the preset linear equation system.
  • the accuracy of the obtained expression level is higher than that of kallisto.
  • the speed of data processing is further improved.
  • an embodiment of the present application further provides a computer-readable storage medium, where the computer-readable storage medium may be provided in the foregoing electronic device, and the computer-readable storage medium may be a memory of the foregoing electronic device.
  • a computer program is stored on the computer-readable storage medium, and when the program is executed by the processor, implements the data processing method for rapid transcriptome expression quantification described in the foregoing embodiments.
  • the computer-storable medium may also be a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a RAM, a magnetic disk, or an optical disk and other media that can store program codes.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

一种用于快速转录组表达定量的数据处理方法、装置及存储介质,应用于数据处理领域。该方法包括:获取参考转录组序列,将参考转录组序列分解为预设长度的第一子序列,并根据该第一子序列构建参考矩阵;获取转录组的测序数据,将测序数据分解为该预设长度的第二子序列,并根据该第二子序列构建目标向量;利用该参考矩阵、该目标向量以及预设的线性方程组,得到该转录组的表达量并输出。该方法不仅可应用于以短读长、高通量为特点的测序数据,还可应用于以长读长为特点的测序数据,并具有转录组表达定量速度快、准确度高的优点。

Description

用于快速转录组表达定量的数据处理方法、装置及存储介质 技术领域
本发明属于数据处理技术领域,尤其涉及一种用于快速转录组表达定量的数据处理方法、装置及存储介质。
背景技术
转录组样本测序以及转录组表达定量在生物医学研究中有十分广泛的应用,目前用转录组测序数据进行转录组表达定量的计算方法已经有很多。现有的转录组表达定量方法主要包括基于序列比对的转录组表达定量方法以及不基于序列比对的转录组表达定量方法两大类。其中,不基于序列比对的转录组表达定量方法的主要代表是Kallisto,其具有速度极快的优点,并且对于目前主流测序平台所产生的测序数据(如:二代测序数据),其进行转录组表达定量计算的准确度也是目前最高的几种方法之一。但是,对于近年来新出现的以长读长为特征的三代测序数据,用kallisto进行转录组表达定量则存在准确度较差的问题。
发明内容
本申请旨在提供一种用于快速转录组表达定量的数据处理方法、电子装置及计算机可读存储介质,不仅可应用于以短读长、高通量为特点的主流测序数据,还可应用于以长读长为特点的三代测序数据,并可提高转录组表达定量的速度和准确度。
本申请实施例提供了一种用于快速转录组表达定量的数据处理方法,包括:
获取参考转录组序列,将所述参考转录组序列分解为预设长度的第一子序列,并根据所述第一子序列构建参考矩阵;
获取转录组的测序数据,将所述测序数据分解为所述预设长度的第二子序列,并根据所述第二子序列构建目标向量;
利用所述参考矩阵、所述目标向量以及预设的线性方程组,得到所述转录组的表达量并输出。
本申请实施例还提供了一种电子装置,所述电子装置包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述计算机程序包括:
索引构建模块,用于获取参考转录组序列,将所述参考转录组序列分解为预设长度的第一子序列,并根据所述第一子序列构建参考矩阵;
测序数据处理模块,用于获取转录组的测序数据,将所述测序数据分解为所述预设长度的第二子序列,并根据所述第二子序列构建目标向量;
表达定量模块,用于利用所述参考矩阵、所述目标向量以及预设的线性方程组,得到所述转录组的表达量并输出。
本申请实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,所述计算机程序被处理器执行时,实现如上述实施例所示的用于快速转录组表达定量的数据处理方法。
上述本申请各实施例,通过利用基于参考转录组序列构建的参考矩阵、基于转录组的测序数据构建的目标向量以及预设的线性方程组,得到该转录组的表达量,经过测试表明,在对以长读长为特征的测序数据进行转录组表达定量时,得到的该表达量的准确性比kallisto更高,同时,由于省却了序列比对的时间,还进一步提高了数据处理的速度。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明的一些实施例。
图1是本申请实施例提供的用于快速转录组表达定量的数据处理方法的流程示意图;
图2是本申请实施例提供的用于快速转录组表达定量的数据处理方法中步骤S101的实现流程示意图;
图3是本申请实施例提供的用于快速转录组表达定量的数据处理方法中步骤S102的实现流程示意图;
图4是本申请实施例提供的电子装置的结构示意图。
具体实施方式
为使得本发明的发明目的、特征、优点能够更加的明显和易懂,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而非全部实施例。基于本发明中的实施例,本领域技术人员在 没有做出创造性劳动前提下所获得的所有其他实施例,都属于本发明保护的范围。
参见图1,图1为本申请实施例提供的用于快速转录组表达定量的数据处理方法的流程示意图,该方法可以应用如:台式电脑、手提电脑、平板电脑、个人计算机、服务器以及其他可在移动或非可移动的环境中进行数据处理的计算机设备。如图1所示,该方法主要包括如下步骤S101~S103。
步骤S101,获取参考转录组序列,将该参考转录组序列分解为预设长度的第一子序列,并根据该第一子序列构建参考矩阵。
具体的,获取的参考转录组序列包括例如一组转录本序列。预设长度的第一子序列,即第一k-mer。可以理解的,在生物信息中,k-mer指生物序列中长度为k的子序列,其中k的值为预设的小于该生物序列的长度的任意正整数。
进一步的,于本申请其他一实施方式中,如图2所示,步骤S101具体包括以下步骤:
S1011、从数据库获取该参考转录组序列,该参考转录组序列包括多条转录本序列;
S1012、分别将各该转录本序列进行分解,得到各该转录本序列各自对应的该第一子序列的集合;
S1013、根据各该转录本序列各自对应的该第一子序列的集合,构建该参考矩阵。
可以理解的,参考转录组是转录本的集合,即,该参考转录组序列包括多条转录本序列,其数据格式例如可以为以下fasta格式:
>转录本1ID
转录本1序列
>转录本2ID
转录本2序列
……。
举例来说,假设通过步骤S1011获取的上述fasta格式的参考转录组序列如下所示:
>ENSMUST00000000001
GTTGGAAATGTCGAGATCCGAATTGTGTGAAGTGG...
>ENSMUST00000000002
TTTAGGCTCTCCAGCGTGCAATTACTCATTCGATA...
>ENSMUST00000000003
ATTGGGTCGACCGCCTTGTAACACAGTATTCATCAAA...
>ENSMUST00000000004
CCAGTGGAGGGGTCTTTCGCATCTTCCAGA...
>ENSMUST00000000005
ATCGAGCACCTATCTATCGGTTAGAAGTGGCGTATAAG...
……。
通过步骤S1012将上述每一条转录本序列分解成k长度的第一子序列(即,第一k-mer),其中k的值优选为5~31范围内的整数。k的值被预设后,后续的所有数据处理中将统一使用该预设的k值。为便于理解,本示例及以下示例中均假设k的值为11。在执行分解操作后,记录分解得到的每一条转录本序列各自对应的11-mer的集合(以下简称11-mer集)。分解结果具体可如下所示:
转录本ENSMUST00000000001的11-mer集
GTTGGAAATGT
TTGGAAATGTC
TGGAAATGTCG
...
转录本ENSMUST00000000002的11-mer集
TTTAGGCTCTC
TTAGGCTCTCC
TAGGCTCTCCA
...。
之后,通过步骤S1013构建参考矩阵A,具体如下所示。
Figure PCTCN2020120407-appb-000001
可以理解的,构建参考矩阵A在算法实现上,可有多种数据结构,优选为稀疏矩阵。可选的,为了节约内存,参考矩阵A的存储方式如下表1所示。
表1
Figure PCTCN2020120407-appb-000002
如表1所示,对于每种k-mer只记录存在此k-mer的转录本序列的编号,这可用一个数组散列表示。散列索引为k-mer(或者k-mer编号),散列值为包含此k-mer的所有转录本序列的编号数组。
可以理解的,上述表1所示的参考矩阵A的存储方式仅为一种示例,实际应用中不限于此。
于本实施例中,参考矩阵A的每一行对应一种k-mer(即,预设长度的第一子序列),每一列对应一条转录本序列,参考矩阵A中的值表示该值所属行的k-mer是否存在于该值所 属列的转录本序列中,或者,该值所属行的k-mer在该值所属列的转录本序列中出现的次数。
步骤S102,获取转录组的测序数据,将该测序数据分解为该预设长度的第二子序列,并根据该第二子序列构建目标向量。
具体的,转录组的测序数据可以从数据库获取,或者也可以通过测序产生。该测序数据按照与上述第一子序列相同的预设长度(即,k长度)被分解,以得到k长度的第二子序列(即,第二k-mer)。
进一步的,于本申请其他一实施方式中,如图3所示,步骤S102具体可以包括以下步骤:
S1021、对转录组样本进行测序,得到该测序数据;
S1022、将该测序数据中的读段序列进行分解,得到该第二子序列的集合;
S1023、对该第二子序列的集合进行计数,并根据计数结果构建该目标向量。
具体的,可通过测序工具对转录组样本进行测序,以得到测序数据。该测序工具可以但不限于包括二代(如:illumina)或三代(如:nanopore)测序仪,以及单细胞测序(如:10Xgenomics)平台。其中,通过测序仪得到的测序数据一般是读段(read)的集合,数据格式例如可以为以下fasta格式。
@读段1ID
读段1序列
+读段1ID
读段1质量序列
@读段2ID
读段2序列
+读段2ID
读段2质量序列
……。
如上所示,测序数据中通常包括读段序列和质量序列,于本实施例中,在根据测序数据构建目标向量b时,只需该测序数据中的读段序列而不需要质量序列。
举例来说,假设首先在步骤S1021,通过二代或三代测序仪对转录组样本进行测序,得到转录组的上述fasta格式的测序数据,该测序数据具体可如下所示:
@ERR3261257.1 1 length=151
CTCACAAGGTTATCCACTATGTTTTTCGATAA...
+ERR3261257.1 1 length=151
AAAFFFFJJJJJJJJJJJFJJJJJJJJ...
@ERR3261257.2 2 length=151
CCATAGATAGCAAGAATTTTCCACAAGC...
+ERR3261257.2 2 length=151
AAAFFJJJJFJJJJJJJJJJJJJJFJJJ<JJJ...
......。
然后,通过步骤S1022将该测序数据中的各读段序列分解为k长度的第二子序列(即,第二k-mer),并得到第二k-mer的集合(以下简称第二k-mer集)。
之后,通过步骤S1023对上述第二k-mer集进行计数,并根据计数结果构建如下所示目标向量b,在计数时不必区分来自哪一条读段序列。目标向量b中的每一行对应一种第二k-mer,该行中的值表示该行对应的第二k-mer在测序数据中的出现次数。
Figure PCTCN2020120407-appb-000003
于本实施例中,计数的具体方法例如:利用A ij表示第i种11-mer在第j条转录本序列中的出现次数或是否出现过。利用b i表示第i种11-mer在测序数据(即,所有读段序列)中累计出现的次数,如:假设b 2对应的11-mer(AAAAAAAAAC)在读段1中出现过1次, 读段3中出现过2次,在其他读段中都没有出现过,那么b 2=1+2=3。
可选的,于本申请其他一实施方式中,还可采用另一种统计方法,即,b只记录出现过该k-mer的读段序列的数目,如上例中,假设AAAAAAAAAC只在读段1与读段3中出现过,那么b 2=2。
进一步的,构建目标向量b的算法内部可用数组实现。可选的,可如同参考矩阵A一样,将目标向量b用散列表存储,具体存储方式可参考前述参考矩阵A,此处不再赘述。
可选的,于本申请其他一实施例中,为进一步提高得到的转录组的表达量的准确性,在构建前述参考矩阵A和目标向量b时,跳过预设长度的第一子序列的集合和预设长度的第二子序列的集合中,符合预设条件的子序列。
其中,上述预设条件包括以下条件中的至少一种:在超过预设条转录本序列中出现过的子序列;在同一条转录本序列中出现的次数大于预设次数的子序列;测序技术序列;以及预设的自定义序列。其中,预设的自定义序列指根据用户自定义操作设置的需要跳过的序列。
可以理解的,k-mer在越多转录本序列中出现,说明其信息量越低,因此越需要跳过。此外,若一个k-mer在同一条转录本序列中出现多次(如,>3次),则说明其可能是基因组中的短重复片段,可能会影响定量(比如:AAAAAAAA,TTTTTTTTT),因此可跳过。另外,测序技术序列,即为了测序方便特别添加的非来自测序样本本身的短序列,如:测序数据中的分子标记序列、接头序列等技术序列(非生物序列),这种短序列有一定小概率与转录本序列的k-mer相同,因此可跳过。
具体的,可通过将参考矩阵A和目标向量b中的值设置为0实现上述跳过操作。举例来说,如果一个k-mer在参考矩阵A和目标向量b中对应第i行,该k-mer在第j条转录本序列中出现过多次重复,那么将参考矩阵A中A ij的值设置为0,将目标向量b中b i的值设置为0。通过上述设置,A ij和b i构建时的算法内部实现中,这个k-mer就会被直接跳过。
在一实际应用例中,参考矩阵A是十分稀疏的,实际算法中可用散列(hash)实现参考矩阵A的数据结构,即,只存储在参考转录组中出现过的k-mer,那些在测序数据中出现过而参考转录组中没有出现过的k-mer,以及,在测序数据中和参考转录组中均没有出现过的k-mer将被跳过,即,参考矩阵A与目标向量b中与上述未出现过的k-mer对应的行全为0 值。可以理解的,在数学上,参考矩阵A的行表示的是k-mer全空间,比如k=11时,参考矩阵A有4 11行,表示所有可能的11-mer组合,这些所有可能的11-mer中的很大一部分既不会在参考转录组中出现,也不会在测序数据中出现。
可选的,于本申请其他一实施例中,为进一步提高得到的转录组的表达量的准确性,在构建该参考矩阵时,根据各第一子序列(第一k-mer)在各自对应的转录本序列中的位置,为该参考矩阵中的值设置不同的权重。
可以理解的,由于测序时,转录本序列的各k-mer中靠近该转录本序列两端的k-mer较不容易被读段序列覆盖到,通过为参考矩阵A里的值设置不同的权重,可以提高最终得到的转录组的表达量的准确度。
举例来说,假设在构建参考矩阵A时,如果预设测序数据的读长是L(一般为读长100~150)。那么,每条转录本序列(假设该转录本序列的长度为s,此处假定大于2*L)中的距离该条转录本序列两端小于等于L-k+1的k-mer在原值乘以与此距离成正比的权重w,其余距离该转录本序列两端大于L-k+1的k-mer(即第L-k+1个k-mer至第s-k-1个k-mer)的权重为固定值(都等于第L-k个k-mer的权重)。
进一步的,假设预设测序数据的读长L=100,k=11,一条转录本序列的长度s为300。那么,设该转录本序列的第1~90个k-mer的权重分别为1~90,第91~210个k-mr的权重为90,第211~300个k-mer的权重分别为90~1。用上述权重乘以参考矩阵A中原先的值得到加权的参考矩阵A,接下来计算步骤相同,将最终计算得的转录组的表达量x乘以90即得新的转录组的表达量x。其中,对于小于2L的转录本序列,可以跳过。
步骤S103,利用该参考矩阵、该目标向量以及预设的线性方程组,得到该转录组的表达量并输出。
于本实施例中,参考矩阵A的行与目标向量b的行一一对应。具体的,二者基于k-mer空间一一对应,例如,二者在同一个11-mer空间严格一一对应,也就是说,二者在11-mer空间的每一行均对应一种11-mer,比如第一行均对应AAAAAAAAAA,第二行均对应AAAAAAAAAC,以此类推。举例来说,当k=11时,参考矩阵A与目标向量b各有4 11行,分别对应AAAAAAAAAAA,AAAAAAAAAAC,AAAAAAAAAAG,...TTTTTTTTTTT。
若假设转录组的表达量为x,x=(x 1,x 2,x 3...,x n),其中,x n是第n条转录本序列的表达量,则可利用线性关系:Ax=b,求解x。然而,通常理论值x无法求得,因此利用最小二乘法估计x,即,可利用以下公式(1),求得该转录组的表达量估计值
Figure PCTCN2020120407-appb-000004
作为该转录组的表达量x。
Figure PCTCN2020120407-appb-000005
其中,A为参考矩阵,b为目标向量。上述A TA和A Tb可利用已知的矩阵运算包计算得到,本实施例对计算方法不做具体限定。
具体的,可先对A TA矩阵进行LU分解(LU Decomposition,三角分解),在实际应用中或者也可选其他的矩阵分解方法。具体不同的线性方程组求解包可能使用不同的矩阵分解法。但形式上都是求解
Figure PCTCN2020120407-appb-000006
可选的,于本申请其他一实施例中,还可先求逆矩阵(A TA) -1,然后再利用以下公式(2),求解出该转录组的表达量估计值
Figure PCTCN2020120407-appb-000007
作为该转录组的表达量x。
Figure PCTCN2020120407-appb-000008
其中,A为参考矩阵,b为目标向量。上述(A TA) -1A T可利用已知的矩阵运算包计算得到,本实施例对计算方法不做具体限定。
可选的,可直接先求出(A TA) -1A T,然后再求解
Figure PCTCN2020120407-appb-000009
这样同一物种的所有测序数据均可通过使用
Figure PCTCN2020120407-appb-000010
快速求得
Figure PCTCN2020120407-appb-000011
从而提高数据处理的速度。
于本申请其他一实施例中,参考矩阵A的行列表现方式可以互换,即,参考矩阵A的列与目标向量b的列基于k-mer空间一一对应。此时,将上述求解方程式:Ax=b的两边进行转置,得到x TA T=b T作为行列互换(转置)后应用的求解方程。此时的参考矩阵A(即上式A T)的列与目标向量b(即上式b T)的列一一对应,可利用以下公式(3)求解出该转录组的表达量估计值
Figure PCTCN2020120407-appb-000012
作为该转录组的表达量x。
Figure PCTCN2020120407-appb-000013
或者,也可利用以下公式(4)求解出该转录组的表达量估计值
Figure PCTCN2020120407-appb-000014
作为该转录组的表达量x。
Figure PCTCN2020120407-appb-000015
上述公式(3)和(4)中,A为参考矩阵,b为目标向量。
可以理解的,当参考矩阵A的列和目标向量b的列一一对应时,转置后的参考矩阵A T的行与转置后的目标向量b T的行一一对应,目标向量b的行的含义即转置后的目标向量b T列的含义(目标向量b只有一行,转置后的目标向量b T只有一列)。此时,参考矩阵A的每一列对应一种第一k-mer,每一行对应一条转录本序列,参考矩阵A中的值表示该值所属列的第一k-mer是否存在于该值所属行的转录本序列中,或者,该值所属列的第一k-mer在该值所属行的转录本序列中出现的次数。目标向量b中的每一列对应一种第二k-mer,其值表示该第二k-mer在测序数据中的出现次数。
进一步的,由于基因表达量(x)不可能是负数,也就是说,负解没有意义。为进一步提高数据处理的速度,提高数据处理结果的准确性,在求解转录组的表达量时,限定求解结果为非负解,即,以上述公式(1)为例,求解的值需满足
Figure PCTCN2020120407-appb-000016
Figure PCTCN2020120407-appb-000017
这一条件。
需要说明的是,上述步骤S101和步骤S102互相独立,可同步执行,也可以不同步执行,不同步执行时,步骤S101和步骤S102的执行顺序可互换。上述步骤S102和步骤S103互相独立,执行顺序可互换。
本实施例提供的用于快速转录组表达定量的数据处理方法应用广泛,不仅能够应用于目前主流的以短读长、高通量为特点的测序数据,而且能应用于新出现的以长读长为特点的三代测序数据,并具有既快又准确地进行转录组表达定量的优点。
本申请实施例中,通过利用基于参考转录组序列构建的参考矩阵、基于转录组的测序数据构建的目标向量以及预设的线性方程组,得到该转录组的表达量,经过测试表明,在对以长读长为特征的测序数据进行转录组表达定量时,得到的该表达量的准确性比kallisto更高,同时,由于省却了序列比对的时间,还进一步提高了数据处理的速度。
参见图4,图4为本申请实施例提供的电子装置的结构示意图。如图4所示,该电子装置30包括:存储器50、处理器60及存储在该存储器50上并可在该处理器60上运行的计算机程序40,该处理器60执行该计算机程序40时,实现如前述实施例中的用于快速转录组表达定量的数据处理方法。
其中,计算机程序40包括:
索引构建模块401,用于获取参考转录组序列,将该参考转录组序列分解为预设长度的第一子序列,并根据该第一子序列构建参考矩阵;
测序数据处理模块402,用于获取转录组的测序数据,将该测序数据分解为该预设长度的第二子序列,并根据该第二子序列构建目标向量;
表达定量模块403,用于利用该参考矩阵、该目标向量以及预设的线性方程组,得到该转录组的表达量并输出。
可选的,索引构建模块401包括:
下载模块,用于从数据库获取该参考转录组序列,该参考转录组序列包括多条转录本序列;
转录本分解模块,用于分别将各该转录本序列进行分解,得到各该转录本序列各自对应的该第一子序列的集合;
构建模块,用于根据各该转录本序列各自对应的该第一子序列的集合,构建该参考矩阵。
可选的,该参考矩阵的每一行对应一种该第一子序列,每一列对应一条该转录本序列,该参考矩阵中的值表示该值所属行的该第一子序列是否存在于该值所属列的该转录本序列中,或者,该值所属行的该第一子序列在该值所属列的该转录本序列中出现的次数。
可选的,该参考矩阵的每一列对应一种该第一子序列,每一行对应一条该转录本序列,该参考矩阵中的值表示该值所属列的该第一子序列是否存在于该值所属行的该转录本序列中,或者,该值所属列的该第一子序列在该值所属行的该转录本序列中出现的次数。
可选的,测序数据处理模块402包括:
测序模块,用于对转录组样本进行测序,得到该测序数据;
读段分解模块,用于将该测序数据中的读段序列进行分解,得到该第二子序列的集合;
计数模块,用于对该第二子序列的集合进行计数,并根据计数结果构建该目标向量。
可选的,该目标向量中的每一行对应一种该第二子序列,每一行中的值表示对应的该第二子序列在该测序数据中的出现次数。
可选的,该目标向量中的每一列对应一种该第二子序列,每一列中的值表示对应的该第 二子序列在该测序数据中的出现次数。
可选的,该参考矩阵的行与该目标向量的行一一对应,表达定量模块403还用于利用以下公式(1)或公式(2),求解该转录组的表达量估计值作为该转录组的表达量并输出;
Figure PCTCN2020120407-appb-000018
Figure PCTCN2020120407-appb-000019
其中,A为该参考矩阵,b为该目标向量,
Figure PCTCN2020120407-appb-000020
为该转录组的表达量估计值。
可选的,该参考矩阵的列与该目标向量的列一一对应,表达定量模块403还用于利用以下公式(3)或公式(4),求解该转录组的表达量估计值作为该转录组的表达量并输出;
Figure PCTCN2020120407-appb-000021
Figure PCTCN2020120407-appb-000022
其中,A为该参考矩阵,b为该目标向量,
Figure PCTCN2020120407-appb-000023
为该转录组的表达量估计值。
可选的,计算机程序40还包括:
跳过模块,用于在构建该参考矩阵和该目标向量时,跳过该第一子序列和该第二子序列中,符合预设条件的子序列。
可选的,该预设条件包括以下条件中的至少一种:
在超过预设条该转录本序列中出现过的子序列;在同一条该转录本序列中出现的次数大于预设次数的子序列;测序技术序列;以及预设的自定义序列。
可选的,计算机程序40还包括:
权重设置模块,用于在构建该参考矩阵时,根据各该第一子序列在各自对应的该转录本序列中的位置,为该参考矩阵中的值设置不同的权重。
可选的,为进一步提高数据处理的速度,提高数据处理结果的准确性,在求解转录组的表达量时,限定求解结果为非负解,即,以上述公式(1)为例,求解的值需满足
Figure PCTCN2020120407-appb-000024
Figure PCTCN2020120407-appb-000025
这一条件。
进一步的,该电子装置还包括:至少一个输入设备以及至少一个输出设备。
上述存储器50、处理器60、输入设备以及输出设备,通过总线连接。
其中,输入设备具体可为摄像头、触控面板、物理按键等等。输出设备具体可为显示器、打印机、射频模块等等,其中显示器例如可以但不限于包括触控或非触控的映象管(CRT)显示器、液晶显示器(LCD)、LED显示器等等。
存储器50可以是高速随机存取记忆体(RAM,Random Access Memory)存储器,也可为非不稳定的存储器(non-volatile memory),例如磁盘存储器。存储器50用于存储一组可执行程序代码,处理器60与存储器50耦合。
本申请实施例中,通过利用基于参考转录组序列构建的参考矩阵、基于转录组的测序数据构建的目标向量以及预设的线性方程组,得到该转录组的表达量,经过测试表明,在对以长读长为特征的测序数据进行转录组表达定量时,得到的该表达量的准确性比kallisto更高,同时,由于省却了序列比对的时间,还进一步提高了数据处理的速度。
进一步的,本申请实施例还提供了一种计算机可读存储介质,该计算机可读存储介质可以是设置于上述电子装置中,该计算机可读存储介质可以是前述电子装置的存储器。该计算机可读存储介质上存储有计算机程序,该程序被处理器执行时实现前述实施例中描述的用于快速转录组表达定量的数据处理方法。进一步的,该计算机可存储介质还可以是U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、RAM、磁碟或者光盘等各种可以存储程序代码的介质。
需要说明的是,对于前述的各方法实施例,为了简便描述,故将其都表述为一系列的动作组合,但是本领域技术人员应该知悉,本发明并不受所描述的动作顺序的限制,因为依据本发明,某些步骤可以采用其它顺序或者同时进行。其次,本领域技术人员也应该知悉,说明书中所描述的实施例均属于优选实施例,所涉及的动作和模块并不一定都是本发明所必须的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述的部分,可以参见其它实施例的相关描述。
以上为对本发明所提供的用于快速转录组表达定量的数据处理方法、电子装置及计算机可读存储介质的描述,对于本领域的技术人员,依据本申请实施例的思想,在具体实施方式及应用范围上均会有改变之处,综上,本说明书内容不应理解为对本发明的限制。

Claims (10)

  1. 一种用于快速转录组表达定量的数据处理方法,其特征在于,包括:
    获取参考转录组序列,将所述参考转录组序列分解为预设长度的第一子序列,并根据所述第一子序列构建参考矩阵;
    获取转录组的测序数据,将所述测序数据分解为所述预设长度的第二子序列,并根据所述第二子序列构建目标向量;
    利用所述参考矩阵、所述目标向量以及预设的线性方程组,得到所述转录组的表达量并输出。
  2. 根据权利要求1所述的方法,其特征在于,所述获取参考转录组序列,将所述参考转录组序列分解为预设长度的第一子序列,并根据所述第一子序列构建参考矩阵,包括:
    从数据库获取所述参考转录组序列,所述参考转录组序列包括多条转录本序列;
    分别将各所述转录本序列进行分解,得到各所述转录本序列各自对应的所述第一子序列的集合;
    根据各所述转录本序列各自对应的所述第一子序列的集合,构建所述参考矩阵;
    其中,所述参考矩阵的每一行对应一种所述第一子序列,每一列对应一条所述转录本序列,所述参考矩阵中的值表示所述值所属行的所述第一子序列是否存在于所述值所属列的所述转录本序列中,或者,所述值所属行的所述第一子序列在所述值所属列的所述转录本序列中出现的次数;或者,
    所述参考矩阵的每一列对应一种所述第一子序列,每一行对应一条所述转录本序列,所述参考矩阵中的值表示所述值所属列的所述第一子序列是否存在于所述值所属行的所述转录本序列中,或者,所述值所属列的所述第一子序列在所述值所属行的所述转录本序列中出现的次数。
  3. 根据权利要求1所述的方法,其特征在于,所述获取转录组的测序数据,将所述测序数据分解为所述预设长度的第二子序列,并根据所述第二子序列构建目标向量,包括:
    对转录组样本进行测序,得到所述测序数据;
    将所述测序数据中的读段序列进行分解,得到所述第二子序列的集合;
    对所述第二子序列的集合进行计数,并根据计数结果构建所述目标向量;
    其中,所述目标向量中的每一行对应一种所述第二子序列,每一行中的值表示对应的所述第二子序列在所述测序数据中的出现次数;或者,
    所述目标向量中的每一列对应一种所述第二子序列,每一列中的值表示对应的所述第二子序列在所述测序数据中的出现次数。
  4. 根据权利要求1至3中的任一项所述的方法,其特征在于,所述参考矩阵的行与所述目标向量的行一一对应,所述利用所述参考矩阵、所述目标向量以及预设的线性方程组,得到所述转录组的表达量并输出,包括:
    利用以下公式(1)或公式(2),求解所述转录组的表达量估计值作为所述转录组的表达量并输出;
    Figure PCTCN2020120407-appb-100001
    Figure PCTCN2020120407-appb-100002
    其中,A为所述参考矩阵,b为所述目标向量,
    Figure PCTCN2020120407-appb-100003
    为所述转录组的表达量估计值。
  5. 根据权利要求1至3中的任一项所述的方法,其特征在于,所述参考矩阵的列与所述目标向量的列一一对应,所述利用所述参考矩阵、所述目标向量以及预设的线性方程组,得到所述转录组的表达量并输出,包括:
    利用以下公式(3)或公式(4),求解所述转录组的表达量估计值作为所述转录组的表达量并输出;
    Figure PCTCN2020120407-appb-100004
    Figure PCTCN2020120407-appb-100005
    其中,A为所述参考矩阵,b为所述目标向量,
    Figure PCTCN2020120407-appb-100006
    为所述转录组的表达量估计值。
  6. 根据权利要求1所述的方法,其特征在于,在构建所述参考矩阵和所述目标向量时,跳过所述第一子序列和所述第二子序列中,符合预设条件的子序列。
  7. 根据权利要求6所述的方法,其特征在于,所述预设条件包括以下条件中的至少一种:
    在超过预设条所述转录本序列中出现过的子序列;
    在同一条所述转录本序列中出现的次数大于预设次数的子序列;
    测序技术序列;以及
    预设的自定义序列。
  8. 根据权利要求1所述的方法,其特征在于,在构建所述参考矩阵时,根据各所述第一子序列在各自对应的所述转录本序列中的位置,为所述参考矩阵中的值设置不同的权重。
  9. 一种电子装置,所述电子装置包括:存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机程序,其特征在于,所述计算机程序包括:
    索引构建模块,用于获取参考转录组序列,将所述参考转录组序列分解为预设长度的第一子序列,并根据所述第一子序列构建参考矩阵;
    测序数据处理模块,用于获取转录组的测序数据,将所述测序数据分解为所述预设长度的第二子序列,并根据所述第二子序列构建目标向量;
    表达定量模块,用于利用所述参考矩阵、所述目标向量以及预设的线性方程组,得到所述转录组的表达量并输出。
  10. 一种计算机可读存储介质,其上存储有计算机程序,其特征在于,所述计算机程序被处理器执行时,实现如权利要求1至8中的任意一项所述的用于快速转录组表达定量的数据处理方法。
PCT/CN2020/120407 2020-09-23 2020-10-12 用于快速转录组表达定量的数据处理方法、装置及存储介质 WO2022061974A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011009255.6 2020-09-23
CN202011009255.6A CN112151120B (zh) 2020-09-23 2020-09-23 用于快速转录组表达定量的数据处理方法、装置及存储介质

Publications (1)

Publication Number Publication Date
WO2022061974A1 true WO2022061974A1 (zh) 2022-03-31

Family

ID=73897787

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/120407 WO2022061974A1 (zh) 2020-09-23 2020-10-12 用于快速转录组表达定量的数据处理方法、装置及存储介质

Country Status (2)

Country Link
CN (1) CN112151120B (zh)
WO (1) WO2022061974A1 (zh)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203703A (zh) * 2017-05-22 2017-09-26 人和未来生物科技(长沙)有限公司 一种转录组测序数据计算解读方法
CN111128303A (zh) * 2018-10-31 2020-05-08 深圳华大生命科学研究院 基于已知序列确定目标物种中对应序列的方法和系统

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3248110A4 (en) * 2015-01-22 2018-07-04 The Board of Trustees of The Leland Stanford Junior University Methods and systems for determining proportions of distinct cell subsets
WO2017210102A1 (en) * 2016-06-01 2017-12-07 Institute For Systems Biology Methods and system for generating and comparing reduced genome data sets
CN109997193B (zh) * 2016-11-10 2023-03-14 深圳华大生命科学研究院 一种对特定群中的亚群进行定量分析的方法

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107203703A (zh) * 2017-05-22 2017-09-26 人和未来生物科技(长沙)有限公司 一种转录组测序数据计算解读方法
CN111128303A (zh) * 2018-10-31 2020-05-08 深圳华大生命科学研究院 基于已知序列确定目标物种中对应序列的方法和系统

Also Published As

Publication number Publication date
CN112151120A (zh) 2020-12-29
CN112151120B (zh) 2024-03-12

Similar Documents

Publication Publication Date Title
Ono et al. PBSIM2: a simulator for long-read sequencers with a novel generative model of quality scores
US11649495B2 (en) Systems and methods for mitochondrial analysis
Knijnenburg et al. Fewer permutations, more accurate P-values
Mohamadi et al. ntCard: a streaming algorithm for cardinality estimation in genomics data
Lin et al. iPro54-PseKNC: a sequence-based predictor for identifying sigma-54 promoters in prokaryote with pseudo k-tuple nucleotide composition
Buchfink et al. Fast and sensitive protein alignment using DIAMOND
Song et al. Alignment-free sequence comparison based on next-generation sequencing reads
US8798936B2 (en) Methods and systems for data analysis using the Burrows Wheeler transform
Berger et al. Visualization and analysis of DNA sequences using DNA walks
Alser et al. From molecules to genomic variations: Accelerating genome analysis via intelligent algorithms and architectures
Pérez-Wohlfeil et al. Ultra-fast genome comparison for large-scale genomic experiments
Sheetlin et al. Frameshift alignment: statistics and post-genomic applications
CN112885412B (zh) 基因组注释方法、装置、可视化平台和存储介质
Xu et al. iss-pc: Identifying splicing sites via physical-chemical properties using deep sparse auto-encoder
Huo et al. Optimizing genetic algorithm for motif discovery
Lee et al. DeepFinder: An integration of feature-based and deep learning approach for DNA motif discovery
Bickel et al. An overview of recent developments in genomics and associated statistical methods
Ye et al. BlindCall: ultra-fast base-calling of high-throughput sequencing data by blind deconvolution
De Pierri et al. Sweep: representing large biological sequences datasets in compact vectors
WO2022061974A1 (zh) 用于快速转录组表达定量的数据处理方法、装置及存储介质
Liu et al. Are dropout imputation methods for scRNA-seq effective for scATAC-seq data?
Joudaki et al. Aligning distant sequences to graphs using long seed sketches
Asnicar et al. NES2RA: Network expansion by stratified variable subsetting and ranking aggregation
CN108897990B (zh) 面向大规模高维序列数据的交互特征并行选择方法
Kamal et al. Big DNA datasets analysis under push down automata

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20954820

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20954820

Country of ref document: EP

Kind code of ref document: A1