WO2020125448A1 - Differentially expressed gene screening method and device - Google Patents

Differentially expressed gene screening method and device

Info

Publication number
WO2020125448A1
Authority
WO
WIPO (PCT)
Prior art keywords
differentially expressed
screening
expressed genes
gene
screened
Prior art date
Application number
PCT/CN2019/123574
Other languages
English (en)
French (fr)
Inventor
罗依雯
殷鹏
朱木春
王伟任
张建业
Original Assignee
深圳先进技术研究院
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳先进技术研究院
Publication of WO2020125448A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00: ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
    • G16B25/10: Gene or protein expression profiling; Expression-ratio estimation or normalisation

Definitions

  • The present application relates to the technical field of differentially expressed gene screening, and in particular to a method and device for screening differentially expressed genes.
  • The differential expression of genes results from a combination of factors and is closely related to the onset and progression of many diseases. Bioinformatic and biostatistical analysis of differentially expressed genes is of great significance for studying cell regulation mechanisms and disease mechanisms.
  • Many tools based on conventional methods already exist for screening differentially expressed genes, but conventional methods include no re-screening step, so their screening accuracy is low.
  • The purpose of the present application is to provide a method and device for screening differentially expressed genes, so as to solve the technical problem that conventional methods include no re-screening step and therefore have low screening accuracy.
  • In a first aspect, an embodiment of the present application provides a method for screening differentially expressed genes, including:
  • screening differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules.
  • With reference to the first aspect, the embodiments of the present application provide a first possible implementation of the first aspect, wherein the step of performing preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes includes:
  • With reference to the first aspect, the embodiments of the present application provide a second possible implementation of the first aspect, wherein the step of generating the association rules corresponding to the preliminarily screened differentially expressed genes includes:
  • determining the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets and the gene transaction set.
  • With reference to the first aspect, the embodiments of the present application provide a third possible implementation of the first aspect, wherein the step of generating a differentially expressed gene transaction set from the preliminarily screened differentially expressed genes includes:
  • generating the differentially expressed gene transaction set according to the intensity categories.
  • With reference to the first aspect, the embodiments of the present application provide a fourth possible implementation of the first aspect, wherein the step of screening differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules includes:
  • screening differentially expressed genes from the preliminarily screened differentially expressed genes according to the support, the confidence, and the corresponding lift.
  • With reference to the first aspect, the embodiments of the present application provide a fifth possible implementation of the first aspect, wherein the step of screening differentially expressed genes from the preliminarily screened differentially expressed genes according to the support, the confidence, and the corresponding lift includes:
  • In a second aspect, an embodiment of the present application further provides a differentially expressed gene screening device, including:
  • an acquisition module, configured to acquire the expression levels of the genes to be screened;
  • a preliminary screening module, configured to perform preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes;
  • a generation module, configured to generate association rules corresponding to the preliminarily screened differentially expressed genes;
  • a screening module, configured to screen differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules.
  • The generation module includes:
  • a generation unit, configured to generate a differentially expressed gene transaction set from the preliminarily screened differentially expressed genes;
  • a first determination unit, configured to determine frequent itemsets from the differentially expressed gene transaction set;
  • a second determination unit, configured to determine the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets and the gene transaction set.
  • In a third aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above steps when executing the computer program.
  • In a fourth aspect, an embodiment of the present application further provides a computer-readable medium storing non-volatile program code executable by a processor, wherein the program code causes the processor to perform the above method.
  • FIG. 1 is a flowchart of a method for screening differentially expressed genes according to an embodiment of the present application
  • FIG. 2 is a flowchart of the method of step S102 in FIG. 1;
  • FIG. 3 is a flowchart of the method of step S103 in FIG. 1;
  • FIG. 4 is a flowchart of the method of step S301 in FIG. 3;
  • FIG. 5 is a flowchart of the method of step S104 in FIG. 1;
  • FIG. 6 is a flowchart of the method of step S503 in FIG. 5;
  • FIG. 7 is a schematic diagram of a differentially expressed gene screening device module provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of internal units of a generation module provided by an embodiment of the present application.
  • Reference numerals: 10-acquisition module; 20-preliminary screening module; 30-generation module; 40-screening module; 31-generation unit; 32-first determination unit; 33-second determination unit.
  • According to an embodiment of the present application, an embodiment of a method for screening differentially expressed genes is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
  • FIG. 1 shows a method for screening differentially expressed genes according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
  • Step S101: Acquire the expression levels of the genes to be screened.
  • The expression levels of the genes to be screened include experimental-group data and control-group data of the experimental subjects. For example, four of eight mice are given the experimental drug and the other four are injected with normal saline, and the gene expression levels of the eight mice over a preset time period are recorded to form the initial gene expression data.
  • Step S102: Perform preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes.
  • For step S102, the present application further provides an implementation, as shown in FIG. 2, including:
  • Step S201: Obtain an expression matrix, a grouping matrix, and a difference comparison matrix for the expression levels of the genes to be screened.
  • Gene expression data are usually represented in matrix form, called a gene expression matrix.
  • Each row of the gene expression matrix represents the expression of one gene under different environmental conditions or at different time points, and each column represents the expression of all genes under one condition or in one sample (such as a tissue, experimental condition, or treatment factor). Each cell gives the expression level of a specific gene in a specific sample. A grouping matrix and a difference comparison matrix are then established. Assuming data from an experimental group (y) and a control group (p), after the data are grouped, the elements of the difference comparison matrix are the fold changes FC:
  • Step S202: Perform differential expression analysis according to the expression matrix, the grouping matrix, and the difference comparison matrix to obtain the preliminarily screened differentially expressed genes.
  • The expression matrix, grouping matrix, and difference comparison matrix are used to analyze differential expression and obtain the preliminarily screened differentially expressed genes.
  • The differential expression matrix is as follows:
  • a, b, c, and d are fold changes FC; the columns are time periods and the rows are the experimental and control groups.
  • The grouping matrix is as follows:
  • Case is the experimental group, i.e. the group given the drug;
  • Control is the control group, i.e. the group given normal saline;
  • 0 means no;
  • 1 means yes, indicating whether the genes show differential expression.
  • The matrices can be processed by a preset computer program, and the specific method may be chosen according to the actual situation.
  • A differential expression index is obtained from the expression matrix, the grouping matrix, and the difference comparison matrix, and the genes are screened by this index to obtain the preliminarily screened genes.
  • The specific procedure is as follows. First, the model is fitted with the lmFit function: lmFit takes two main arguments, the expression matrix and the grouping matrix described above, and the grouping matrix is essentially an indicator matrix. Next, contrasts are used to compare experimental conditions: once the linear model has been fitted with a suitable grouping matrix, the fitted model and the difference comparison matrix can be passed to contrasts.fit to compute fold changes and t-statistics for the contrasts of interest, which allows all possible pairwise comparisons in the experiment to be computed. Finally, eBayes is used to assess differential expression: after the contrast linear model has been fitted, eBayes, i.e. a simple empirical Bayes model, is used to moderate the standard errors. The genes obtained from this preliminary screening are as follows:
  • Step S103: Generate association rules corresponding to the preliminarily screened differentially expressed genes.
  • Step S301: Generate a differentially expressed gene transaction set from the preliminarily screened differentially expressed genes.
  • A transaction set of expressed genes is generated from the preliminarily screened differentially expressed genes. Each column of the transaction set is one transaction; for example, one column may hold the genes expressed within one time period, with each expressed gene represented in a specific form.
  • For step S301, the present application further provides an implementation, as shown in FIG. 4, including:
  • Step S401: Obtain the fold change FC of each preliminarily screened differentially expressed gene and assign it an intensity category.
  • The FC values are divided into six intensity categories (up-regulated: A/AA/AAA; down-regulated: B/BB/BBB), and each preliminarily screened differentially expressed gene corresponds to one intensity category.
  • Step S402: Based on the preliminarily screened differentially expressed genes, generate the differentially expressed gene transaction set according to the intensity categories.
  • Each preliminarily screened differentially expressed gene corresponds to one intensity category; a transaction set is generated from the preliminarily screened differentially expressed genes and their intensity categories, in which each column is one transaction and the rows are time periods.
  • The generated transaction set is shown in the following table:
  • Step S302: Determine frequent itemsets from the differentially expressed gene transaction set.
  • Frequent itemsets can be determined from the differentially expressed gene transaction set following the idea of the Apriori algorithm.
  • On the first scan of the transaction set, frequent 1-itemsets are generated, for example 4 occurrences of {P04919 AA} and 3 occurrences of {O35218 BB}.
  • On this basis, frequent 2-itemsets are generated by joining and pruning, for example 2 occurrences of {P04919 AA, O35218 BB}.
  • For example, the candidate 5-itemset {P04919 AA, Q9WUK2 B, Q9CWW6 BB, P01654 A, Q9ES52 BB} is produced by joining the two 4-itemsets {P04919 AA, Q9WUK2 B, Q9CWW6 BB, P01654 A} and {P04919 AA, Q9WUK2 B, Q9CWW6 BB, Q9ES52 BB}.
  • Step S303: Determine the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets and the gene transaction set.
  • Frequent itemset mining is an important foundation of data mining research. It reveals the variables that frequently occur together in a data set and provides support for possible decisions, and it is the basis of association rule mining.
  • The embodiments of the present application determine the association rules corresponding to the preliminarily screened differentially expressed genes from the frequent itemsets and the transaction set, so that the association rules can later be used to re-screen the preliminarily screened differentially expressed genes and obtain more accurate differentially expressed genes.
  • Step S104: Based on the association rules, screen differentially expressed genes from the preliminarily screened differentially expressed genes.
  • The association rules need to be determined. For step S104, the present application further provides an implementation, as shown in FIG. 5, including:
  • Step S501: Determine the support of the association rule according to the transaction set.
  • Step S502: Determine the confidence of the association rule according to the frequent itemsets.
  • An association rule can be measured by two criteria, support and confidence. Assume an association rule R of the form X ⇒ Y. The support of R is the ratio of the number of transactions in the transaction set that contain both X and Y to |D|, that is:
  • The support reflects the probability that X and Y occur together, and |D| denotes the number of transactions in the transaction set.
  • The confidence is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X, that is:
  • The confidence reflects the probability that a transaction contains Y given that it contains X.
  • Step S503: Screen differentially expressed genes from the preliminarily screened differentially expressed genes according to the support, the confidence, and the corresponding lift.
  • Association rules are established with their support and confidence, and the lift is then used to filter them; the rules that pass are the valid association rules, which may be called strong association rules.
  • For step S503, the present application further provides an implementation, as shown in FIG. 6, including:
  • Step S601: Determine the lift as the ratio of the confidence to the support.
  • Step S602: Obtain a preset lift threshold.
  • Step S603: Screen differentially expressed genes from the preliminarily screened differentially expressed genes using the support and confidence of the rules whose lift is greater than the threshold.
  • The lift is determined from the support and the confidence, and can be used to judge whether an association rule is valid. The specific formula is as follows:
  • The lift threshold is generally set to 1. When the lift is greater than 1, the confidence and the support are judged to be related, the association rule is determined to be one required for the experiment, and re-screening can be performed according to that rule. If the lift is less than or equal to 1, the confidence and the support are not related and no association rule can be formed.
  • Once the valid association rules, i.e. the strong association rules, have been obtained, differentially expressed genes can be re-screened from the frequent itemsets with higher precision, giving more accurate differentially expressed genes than the conventional approach of using the preliminary screening step alone.
  • In conventional methods, the preliminarily screened differentially expressed genes must be analyzed experimentally one by one to determine whether they are the samples required for the experiment, which increases the workload, wastes resources, and is inefficient.
  • By building the transaction set and the frequent itemsets, the support and confidence are determined and the association rules are derived from them; the lift corresponding to the support and confidence is then used as the filtering condition for the association rules, and finally the valid association rules are used to identify more accurate differentially expressed genes in the frequent itemsets.
  • The embodiments of the present application also provide a differentially expressed gene screening device.
  • The device is mainly used to perform the differentially expressed gene screening method provided in the embodiments of the present application described above.
  • The device is described in detail below and, as shown in FIG. 7, includes:
  • an acquisition module 10, configured to acquire the expression levels of the genes to be screened;
  • a preliminary screening module 20, configured to perform preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes;
  • a generation module 30, configured to generate association rules corresponding to the preliminarily screened differentially expressed genes;
  • a screening module 40, configured to screen differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules.
  • The generation module 30 includes:
  • a generation unit 31, configured to generate a differentially expressed gene transaction set from the preliminarily screened differentially expressed genes;
  • a first determination unit 32, configured to determine frequent itemsets from the differentially expressed gene transaction set;
  • a second determination unit 33, configured to determine the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets.
  • An embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor.
  • The processor implements the above steps when executing the computer program.
  • An embodiment of the present application further provides a computer-readable medium storing non-volatile program code executable by a processor, wherein the program code causes the processor to perform the above method.
  • Unless otherwise expressly specified and limited, the terms "installed", "connected", and "connection" should be understood in a broad sense: for example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; and it may be direct, indirect through an intermediary, or an internal communication between two components.
  • The disclosed system, device, and method may be implemented in other ways.
  • The device embodiments described above are merely illustrative.
  • The division into units is only a division by logical function, and other divisions are possible in an actual implementation.
  • Multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • The mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through communication interfaces, devices, or units, and may be electrical, mechanical, or of another form.
  • The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • The functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
  • If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a non-volatile computer-readable storage medium executable by a processor.
  • The technical solution of the present application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application.
  • The aforementioned storage media include USB flash drives, removable hard disks, read-only memory (ROM), random access memory (RAM), magnetic disks, optical disks, and other media that can store program code.

Landscapes

  • Health & Medical Sciences (AREA)
  • Genetics & Genomics (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Engineering & Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Biophysics (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

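A method and device for screening differentially expressed genes, relating to the technical field of differentially expressed gene screening. The method includes: acquiring the expression levels of the genes to be screened (S101); performing preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes (S102); generating association rules corresponding to the preliminarily screened differentially expressed genes (S103); and, based on the association rules, screening differentially expressed genes from the preliminarily screened differentially expressed genes (S104). The method establishes association rules and uses them to re-screen the preliminarily screened differentially expressed genes, which yields differentially expressed genes of higher accuracy and also saves the large amount of labor and material resources otherwise needed to test the preliminarily screened differentially expressed genes one by one.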

Description

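Differentially expressed gene screening method and device
Technical Field
The present application relates to the technical field of differentially expressed gene screening, and in particular to a method and device for screening differentially expressed genes.
Background
The differential expression of genes results from a combination of factors and is closely related to the onset and progression of many diseases. Bioinformatic and biostatistical analysis of differentially expressed genes is therefore of great significance for studying cell regulation mechanisms and disease mechanisms.
At present, many tools based on conventional methods exist for screening differentially expressed genes, but conventional methods include no re-screening step, so their screening accuracy is low.
Summary
In view of this, the purpose of the present application is to provide a method and device for screening differentially expressed genes, so as to solve the technical problem that conventional methods include no re-screening step and therefore have low screening accuracy.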
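In a first aspect, an embodiment of the present application provides a method for screening differentially expressed genes, including:
acquiring the expression levels of the genes to be screened;
performing preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes;
generating association rules corresponding to the preliminarily screened differentially expressed genes;
screening differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules.
With reference to the first aspect, the embodiments of the present application provide a first possible implementation of the first aspect, wherein the step of performing preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes includes:
obtaining an expression matrix, a grouping matrix, and a difference comparison matrix for the expression levels of the genes to be screened;
performing differential expression analysis according to the expression matrix, the grouping matrix, and the difference comparison matrix to obtain the preliminarily screened differentially expressed genes.
With reference to the first aspect, the embodiments of the present application provide a second possible implementation of the first aspect, wherein the step of generating the association rules corresponding to the preliminarily screened differentially expressed genes includes:
generating a differentially expressed gene transaction set from the preliminarily screened differentially expressed genes;
determining frequent itemsets from the differentially expressed gene transaction set;
determining the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets and the gene transaction set.
With reference to the first aspect, the embodiments of the present application provide a third possible implementation of the first aspect, wherein the step of generating a differentially expressed gene transaction set from the preliminarily screened differentially expressed genes includes:
obtaining the fold change FC of each preliminarily screened differentially expressed gene and assigning it an intensity category;
based on the preliminarily screened differentially expressed genes, generating the differentially expressed gene transaction set according to the intensity categories.
With reference to the first aspect, the embodiments of the present application provide a fourth possible implementation of the first aspect, wherein the step of screening differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules includes:
determining the support of the association rule according to the transaction set;
determining the confidence of the association rule according to the frequent itemsets;
screening differentially expressed genes from the preliminarily screened differentially expressed genes according to the support, the confidence, and the corresponding lift.
With reference to the first aspect, the embodiments of the present application provide a fifth possible implementation of the first aspect, wherein the step of screening differentially expressed genes from the preliminarily screened differentially expressed genes according to the support, the confidence, and the corresponding lift includes:
determining the lift as the ratio of the confidence to the support;
obtaining a preset lift threshold;
screening differentially expressed genes from the preliminarily screened differentially expressed genes using the support and confidence of the rules whose lift is greater than the threshold.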
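In a second aspect, an embodiment of the present application further provides a differentially expressed gene screening device, including:
an acquisition module, configured to acquire the expression levels of the genes to be screened;
a preliminary screening module, configured to perform preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes;
a generation module, configured to generate association rules corresponding to the preliminarily screened differentially expressed genes;
a screening module, configured to screen differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules.
With reference to the second aspect, the embodiments of the present application provide a first possible implementation of the second aspect, wherein the generation module includes:
a generation unit, configured to generate a differentially expressed gene transaction set from the preliminarily screened differentially expressed genes;
a first determination unit, configured to determine frequent itemsets from the differentially expressed gene transaction set;
a second determination unit, configured to determine the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets and the gene transaction set.
In a third aspect, an embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above steps when executing the computer program.
In a fourth aspect, an embodiment of the present application further provides a computer-readable medium storing non-volatile program code executable by a processor, wherein the program code causes the processor to perform the above method.
In the embodiments of the present application, the expression levels of the genes to be screened are acquired; preliminary screening is performed on them to obtain preliminarily screened differentially expressed genes; association rules corresponding to the preliminarily screened differentially expressed genes are generated; and differentially expressed genes are screened from the preliminarily screened differentially expressed genes based on the association rules. In this way association rules are established and the preliminarily screened differentially expressed genes are re-screened, which yields differentially expressed genes of higher accuracy while also saving the large amount of labor and material resources needed to test the preliminarily screened differentially expressed genes one by one.
Other features and advantages of the present application will be set forth in the description that follows, and will in part become apparent from the description or be understood by practicing the present application. The objectives and other advantages of the present application are realized and attained by the structure particularly pointed out in the description, the claims, and the accompanying drawings.
To make the above objectives, features, and advantages of the present application clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.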
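Brief Description of the Drawings
To explain the specific embodiments of the present application or the technical solutions in the prior art more clearly, the drawings needed for describing them are briefly introduced below. The drawings described below show some embodiments of the present application, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flowchart of a method for screening differentially expressed genes according to an embodiment of the present application;
FIG. 2 is a flowchart of the method of step S102 in FIG. 1;
FIG. 3 is a flowchart of the method of step S103 in FIG. 1;
FIG. 4 is a flowchart of the method of step S301 in FIG. 3;
FIG. 5 is a flowchart of the method of step S104 in FIG. 1;
FIG. 6 is a flowchart of the method of step S503 in FIG. 5;
FIG. 7 is a schematic diagram of the modules of a differentially expressed gene screening device according to an embodiment of the present application;
FIG. 8 is a schematic diagram of the internal units of the generation module according to an embodiment of the present application.
Reference numerals:
10-acquisition module; 20-preliminary screening module; 30-generation module; 40-screening module; 31-generation unit; 32-first determination unit; 33-second determination unit.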
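Detailed Description
To make the objectives, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present application without creative effort fall within the scope of protection of the present application.
According to an embodiment of the present application, an embodiment of a method for screening differentially expressed genes is provided. It should be noted that the steps shown in the flowcharts of the accompanying drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
FIG. 1 shows a method for screening differentially expressed genes according to an embodiment of the present application. As shown in FIG. 1, the method includes the following steps:
Step S101: Acquire the expression levels of the genes to be screened.
In the embodiments of the present application, the expression levels of the genes to be screened include experimental-group data and control-group data of the experimental subjects. For example, four of eight mice are given the experimental drug and the other four are injected with normal saline, and the gene expression levels of the eight mice over a preset time period are recorded to form the initial gene expression data.
Step S102: Perform preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes.
In the embodiments of the present application, the expression levels of the genes to be screened first undergo preliminary screening. Conventional methods already offer many ways to do this, for example R packages such as affy, limma, pheatmap, and ggplot2, which wrap mainstream analysis methods and make differential gene screening considerably more convenient. The present application places no limitation on this, and the specific method may be chosen according to the actual situation. For step S102, the present application further provides an implementation, as shown in FIG. 2, including:
Step S201: Obtain an expression matrix, a grouping matrix, and a difference comparison matrix for the expression levels of the genes to be screened.
In the embodiments of the present application, gene expression data are usually represented in matrix form, called a gene expression matrix. Each row of the gene expression matrix represents the expression of one gene under different environmental conditions or at different time points, and each column represents the expression of all genes under one condition or in one sample (such as a tissue, experimental condition, or treatment factor). Each cell gives the expression level of a specific gene in a specific sample. A grouping matrix and a difference comparison matrix are then established. Assuming data from an experimental group (y) and a control group (p), after the data are grouped, the elements of the difference comparison matrix are the fold changes FC: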
Figure PCTCN2019123574-appb-000001
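The fold change can be illustrated with a minimal sketch. The formula itself is only available as a figure in the source, so the code below assumes the common definition FC = mean(experimental) / mean(control) and also reports log2 FC; the function name `fold_change` and the sample values are hypothetical.

```python
from math import log2
from statistics import mean


def fold_change(experimental, control):
    """Return (FC, log2 FC) for one gene.

    Assumption: FC is the ratio of the experimental-group mean (y)
    to the control-group mean (p); the source shows the formula only
    as a figure.
    """
    y_mean = mean(experimental)
    p_mean = mean(control)
    if y_mean <= 0 or p_mean <= 0:
        raise ValueError("expression levels must be positive to form a ratio")
    fc = y_mean / p_mean
    return fc, log2(fc)


# Hypothetical expression levels for one gene: four treated mice, four controls.
fc, log_fc = fold_change([8.0, 8.4, 7.6, 8.0], [2.0, 2.1, 1.9, 2.0])
print(f"FC = {fc:.2f}, log2 FC = {log_fc:.2f}")
```

Working on the log2 scale makes up- and down-regulation symmetric around zero, which is why tools such as limma report logFC rather than the raw ratio.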
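Step S202: Perform differential expression analysis according to the expression matrix, the grouping matrix, and the difference comparison matrix to obtain the preliminarily screened differentially expressed genes.
The expression matrix, grouping matrix, and difference comparison matrix are used to analyze differential expression and obtain the preliminarily screened differentially expressed genes. The differential expression matrix has the following form:
        X6h.P1  X6h.P2  ...
A2A432  a       b       ...
A2A863  c       d       ...
...     ...     ...     ...
Here a, b, c, and d are fold changes FC; the columns are time periods and the rows are the experimental and control groups.
The grouping matrix is as follows:
        Case  Control
X6h.P1  0     1
X6h.P2  0     1
Here Case is the experimental group, i.e. the group given the drug, and Control is the control group, i.e. the group given normal saline; 0 means no and 1 means yes, indicating whether the genes show differential expression. Finally, the difference comparison matrix is used to compute the ratio of the experimental group to the control group. The matrices can then be processed by a preset computer program, and the specific method may be chosen according to the actual situation. In the present application, a differential expression index is obtained from the expression matrix, the grouping matrix, and the difference comparison matrix, and the genes are screened by this index to obtain the preliminarily screened genes. The specific procedure is as follows. First, the model is fitted with the lmFit function: lmFit takes two main arguments, the expression matrix and the grouping matrix described above, and the grouping matrix is essentially an indicator matrix. Next, contrasts are used to compare experimental conditions: once the linear model has been fitted with a suitable grouping matrix, the fitted model and the difference comparison matrix can be passed to contrasts.fit to compute fold changes and t-statistics for the contrasts of interest, which allows all possible pairwise comparisons in the experiment to be computed. Finally, eBayes is used to assess differential expression: after the contrast linear model has been fitted, eBayes, i.e. a simple empirical Bayes model, is used to moderate the standard errors. The genes obtained from this preliminary screening are as follows: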
        logFC  AveExpr  t     P.Value       adj.P.Val  B      change
P04919  2.85   18.63    8.66  1.934979e-05  0.0468     2.833  up
Q9EPK2  1.28   16.32    6.81  1.139044e-04  0.1339     1.494  Not
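How the `change` column might be derived can be sketched as follows. The source does not state the cut-offs, so the thresholds here (|logFC| of at least 1 and adjusted p-value below 0.05) are assumptions chosen only because they reproduce the two rows shown; the "down" label and the function name `classify_change` are likewise hypothetical.

```python
def classify_change(log_fc, adj_p, log_fc_cutoff=1.0, adj_p_cutoff=0.05):
    """Label a gene 'up', 'down', or 'Not' from its logFC and adjusted p-value.

    Both cut-offs are illustrative assumptions, not values from the source.
    """
    if adj_p < adj_p_cutoff and log_fc >= log_fc_cutoff:
        return "up"
    if adj_p < adj_p_cutoff and log_fc <= -log_fc_cutoff:
        return "down"
    return "Not"


# The two rows from the preliminary screening results: (logFC, adj.P.Val).
rows = {"P04919": (2.85, 0.0468), "Q9EPK2": (1.28, 0.1339)}
labels = {gene: classify_change(lfc, p) for gene, (lfc, p) in rows.items()}
print(labels)
```

Q9EPK2 has a sizeable fold change but fails the adjusted p-value cut-off, so it is not carried forward as a preliminarily screened gene under these assumed thresholds.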
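Step S103: Generate association rules corresponding to the preliminarily screened differentially expressed genes.
In the embodiments of the present application, conventional methods already offer many ways to screen the expression levels of the genes to be screened, for example R packages such as affy, limma, pheatmap, and ggplot2, which wrap mainstream analysis methods and make differential gene screening considerably more convenient. However, because these tools differ in their scope of application, the quality of their screening results is uneven, so the preliminarily screened differentially expressed genes need further screening. For step S103, the present application further provides an implementation, as shown in FIG. 3, including:
Step S301: Generate a differentially expressed gene transaction set from the preliminarily screened differentially expressed genes.
In the embodiments of the present application, a transaction set of expressed genes is generated from the preliminarily screened differentially expressed genes. Each column of the transaction set is one transaction; for example, one column may hold the genes expressed within one time period, with each expressed gene represented in a specific form. For step S301, the present application further provides an implementation, as shown in FIG. 4, including:
Step S401: Obtain the fold change FC of each preliminarily screened differentially expressed gene and assign it an intensity category.
In the embodiments of the present application, the FC values are divided into six intensity categories (up-regulated: A/AA/AAA; down-regulated: B/BB/BBB), and each preliminarily screened differentially expressed gene corresponds to one intensity category.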
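The six-level scheme can be sketched as a small mapping function. The source names the categories but not the boundaries between them, so the cut-offs below (|log2 FC| of 2 and 3) are purely hypothetical, as are the function names.

```python
def intensity_category(log_fc, cutoffs=(2.0, 3.0)):
    """Map a log2 fold change to A/AA/AAA (up) or B/BB/BBB (down).

    The cut-offs are hypothetical; the source does not state them.
    """
    if log_fc == 0:
        raise ValueError("a gene with no change has no intensity category")
    letter = "A" if log_fc > 0 else "B"
    # One letter by default, plus one more for each cut-off reached.
    level = 1 + sum(abs(log_fc) >= c for c in cutoffs)
    return letter * level


def make_item(gene_id, log_fc):
    """Build a transaction item such as 'P04919 AA'."""
    return f"{gene_id} {intensity_category(log_fc)}"


print(make_item("P04919", 2.85))
print(make_item("O35218", -2.4))
```

Encoding the direction and strength of regulation in the item label means that the same gene with different intensities is treated as different items during rule mining.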
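Step S402: Based on the preliminarily screened differentially expressed genes, generate the differentially expressed gene transaction set according to the intensity categories.
In the embodiments of the present application, each preliminarily screened differentially expressed gene corresponds to one intensity category, and a transaction set is generated from the preliminarily screened differentially expressed genes and their intensity categories. Each column of the transaction set is one transaction, and the rows are time periods. For example, with 153 preliminarily screened differentially expressed genes across seven preset time periods, the generated transaction set is shown in the following table: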
Figure PCTCN2019123574-appb-000002
Figure PCTCN2019123574-appb-000003
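Since the transaction table survives only as figure references, the following sketch shows one plausible way to assemble such a transaction set: one transaction per time period, each holding "gene intensity" items. The input layout, the two-level `sign_category` stand-in for the six-level scheme, and all values are hypothetical.

```python
def build_transactions(log_fc_by_period, categorize):
    """Return {period: frozenset of 'gene intensity' items}.

    log_fc_by_period maps a time period to {gene_id: log2 fold change}
    for the preliminarily screened genes observed in that period.
    """
    transactions = {}
    for period, genes in log_fc_by_period.items():
        transactions[period] = frozenset(
            f"{gene} {categorize(log_fc)}" for gene, log_fc in genes.items()
        )
    return transactions


def sign_category(log_fc):
    """Hypothetical stand-in for the six-level intensity scheme."""
    return "AA" if log_fc > 0 else "BB"


# Hypothetical fold changes for two genes in two time periods.
data = {
    "6h": {"P04919": 2.85, "O35218": -2.1},
    "12h": {"P04919": 2.40},
}
for period, items in build_transactions(data, sign_category).items():
    print(period, sorted(items))
```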
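Step S302: Determine frequent itemsets from the differentially expressed gene transaction set.
In the embodiments of the present application, frequent itemsets can be determined from the differentially expressed gene transaction set following the idea of the Apriori algorithm. On the first scan of the transaction set, frequent 1-itemsets are generated, for example 4 occurrences of {P04919 AA} and 3 occurrences of {O35218 BB}. On this basis, frequent 2-itemsets are generated by joining and pruning, for example 2 occurrences of {P04919 AA, O35218 BB}. The process continues in the same way until no higher-order frequent itemsets can be generated. In the k-th iteration, when frequent k-itemsets are to be produced, the candidate k-itemsets are generated first: each candidate k-itemset is produced by joining two frequent (k-1)-itemsets that differ in exactly one item. For example, the candidate 5-itemset {P04919 AA, Q9WUK2 B, Q9CWW6 BB, P01654 A, Q9ES52 BB} is produced by joining the two 4-itemsets {P04919 AA, Q9WUK2 B, Q9CWW6 BB, P01654 A} and {P04919 AA, Q9WUK2 B, Q9CWW6 BB, Q9ES52 BB}. The frequent k-itemsets are obtained after filtering the candidates.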
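The join-and-prune loop described above can be sketched in a few lines of Python. This is a minimal, unoptimized Apriori written for illustration, not the applicant's implementation; the `min_count` parameter and the five toy transactions are assumptions, arranged so that the item counts match the 4, 3, and 2 quoted in the text.

```python
from itertools import combinations


def apriori(transactions, min_count):
    """Return {itemset: count} for every itemset found in >= min_count transactions."""
    transactions = [frozenset(t) for t in transactions]

    def count_frequent(candidates):
        counts = {c: sum(1 for t in transactions if c <= t) for c in candidates}
        return {c: n for c, n in counts.items() if n >= min_count}

    # First scan: frequent 1-itemsets.
    items = {item for t in transactions for item in t}
    current = count_frequent({frozenset([item]) for item in items})
    frequent = dict(current)
    k = 2
    while current:
        previous = list(current)
        # Join: union two frequent (k-1)-itemsets that differ in exactly one item.
        candidates = {a | b for a, b in combinations(previous, 2) if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = count_frequent(candidates)
        frequent.update(current)
        k += 1
    return frequent


# Hypothetical transactions (one per time period).
toy = [
    {"P04919 AA", "O35218 BB"},
    {"P04919 AA", "O35218 BB"},
    {"P04919 AA"},
    {"P04919 AA"},
    {"O35218 BB"},
]
for itemset, n in apriori(toy, min_count=2).items():
    print(sorted(itemset), n)
```

The prune step is what keeps the search tractable: a candidate is discarded without scanning the transactions as soon as one of its subsets is known to be infrequent.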
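Step S303: Determine the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets and the gene transaction set.
In the embodiments of the present application, frequent itemset mining is an important foundation of data mining research. It reveals the variables that frequently occur together in a data set and provides support for possible decisions, and it is the basis of association rule mining. The embodiments of the present application determine the association rules corresponding to the preliminarily screened differentially expressed genes from the frequent itemsets and the transaction set, so that the association rules can later be used to re-screen the preliminarily screened differentially expressed genes and obtain more accurate differentially expressed genes.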
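Step S104: Based on the association rules, screen differentially expressed genes from the preliminarily screened differentially expressed genes.
In the embodiments of the present application, the association rules need to be determined. For step S104, the present application further provides an implementation, as shown in FIG. 5, including:
Step S501: Determine the support of the association rule according to the transaction set.
Step S502: Determine the confidence of the association rule according to the frequent itemsets.
In the embodiments of the present application, an association rule can be measured by two criteria, support and confidence. Assume an association rule R of the form X ⇒ Y. The support of R is the ratio of the number of transactions in the transaction set that contain both X and Y to |D|, that is: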
Figure PCTCN2019123574-appb-000004
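Here the support reflects the probability that X and Y occur together, and |D| denotes the number of transactions in the transaction set. For the association rule R, the confidence is the ratio of the number of transactions that contain both X and Y to the number of transactions that contain X, that is: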
Figure PCTCN2019123574-appb-000005
Here the confidence reflects the probability that a transaction contains Y given that it contains X.
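The two measures follow directly from the verbal definitions above, so they can be sketched without relying on the formula figures. The function names and the toy transactions are hypothetical.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t))


def support(transactions, x, y):
    """support(X => Y): transactions containing both X and Y, divided by |D|."""
    return count_containing(transactions, set(x) | set(y)) / len(transactions)


def confidence(transactions, x, y):
    """confidence(X => Y): transactions containing X and Y, divided by those containing X."""
    n_x = count_containing(transactions, x)
    if n_x == 0:
        raise ValueError("antecedent X never occurs in the transaction set")
    return count_containing(transactions, set(x) | set(y)) / n_x


# Hypothetical transaction set with |D| = 5.
toy = [
    {"P04919 AA", "O35218 BB"},
    {"P04919 AA", "O35218 BB"},
    {"P04919 AA"},
    {"P04919 AA"},
    {"O35218 BB"},
]
x, y = {"P04919 AA"}, {"O35218 BB"}
print("support:", support(toy, x, y))
print("confidence:", confidence(toy, x, y))
```

In this toy set the rule holds in 2 of 5 transactions, and in 2 of the 4 transactions that contain the antecedent.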
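Step S503: Screen differentially expressed genes from the preliminarily screened differentially expressed genes according to the support, the confidence, and the corresponding lift.
In the embodiments of the present application, association rules are established with their support and confidence, and the lift is then used to filter them; the rules that pass are the valid association rules, which may be called strong association rules. For step S503, the present application further provides an implementation, as shown in FIG. 6, including:
Step S601: Determine the lift as the ratio of the confidence to the support.
Step S602: Obtain a preset lift threshold.
Step S603: Screen differentially expressed genes from the preliminarily screened differentially expressed genes using the support and confidence of the rules whose lift is greater than the threshold.
In the embodiments of the present application, the lift is determined from the support and the confidence, and can be used to judge whether an association rule is valid. The specific formula is as follows: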
Figure PCTCN2019123574-appb-000006
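The lift threshold is generally set to 1. When the lift is greater than 1, the confidence and the support are judged to be related, the association rule is determined to be one required for the experiment, and re-screening can be performed according to that rule. If the lift is less than or equal to 1, the confidence and the support are not related and no association rule can be formed. Once the valid association rules, i.e. the strong association rules, have been obtained, differentially expressed genes can be re-screened from the frequent itemsets with higher precision, giving more accurate differentially expressed genes than the conventional approach of using the preliminary screening step alone. The above method replaces the conventional approach, in which each preliminarily screened differentially expressed gene must be analyzed experimentally one by one to determine whether it is a sample required for the experiment, which increases the workload, wastes resources, and is inefficient. By building the transaction set and the frequent itemsets, the support and confidence are determined and the association rules are derived from them; the lift corresponding to the support and confidence is then used as the filtering condition for the association rules, and finally the valid association rules are used to identify more accurate differentially expressed genes in the frequent itemsets.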
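The lift filter can be sketched as below. The source gives the lift only as "confidence over support" plus a formula figure, so this sketch assumes the standard definition lift(X ⇒ Y) = confidence(X ⇒ Y) / support(Y); the function names and the toy data are hypothetical.

```python
def count_containing(transactions, itemset):
    """Number of transactions that contain every item in itemset."""
    itemset = frozenset(itemset)
    return sum(1 for t in transactions if itemset <= frozenset(t))


def lift(transactions, x, y):
    """lift(X => Y), assuming the standard form confidence(X => Y) / support(Y)."""
    n = len(transactions)
    n_x = count_containing(transactions, x)
    n_y = count_containing(transactions, y)
    n_xy = count_containing(transactions, set(x) | set(y))
    if n_x == 0 or n_y == 0:
        raise ValueError("X and Y must each occur at least once")
    # (n_xy / n_x) / (n_y / n), rearranged to use a single division.
    return n_xy * n / (n_x * n_y)


def strong_rules(transactions, rules, threshold=1.0):
    """Keep the rules whose lift exceeds the threshold (1 in the text above)."""
    return [(x, y) for x, y in rules if lift(transactions, x, y) > threshold]


# Hypothetical transaction set and two candidate rules.
toy = [
    {"P04919 AA", "O35218 BB"},
    {"P04919 AA", "O35218 BB"},
    {"P04919 AA", "Q9WUK2 B"},
    {"Q9WUK2 B"},
    {"Q9WUK2 B"},
]
rules = [
    ({"P04919 AA"}, {"O35218 BB"}),
    ({"P04919 AA"}, {"Q9WUK2 B"}),
]
for x, y in rules:
    print(sorted(x), "=>", sorted(y), round(lift(toy, x, y), 3))
print("kept:", strong_rules(toy, rules))
```

A lift above 1 means the consequent appears with the antecedent more often than it would if the two were independent, which is why 1 is the natural threshold.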
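The embodiments of the present application also provide a differentially expressed gene screening device, which is mainly used to perform the differentially expressed gene screening method provided in the embodiments described above. The device is described in detail below and, as shown in FIG. 7, includes:
an acquisition module 10, configured to acquire the expression levels of the genes to be screened;
a preliminary screening module 20, configured to perform preliminary screening on the expression levels of the genes to be screened to obtain preliminarily screened differentially expressed genes;
a generation module 30, configured to generate association rules corresponding to the preliminarily screened differentially expressed genes;
a screening module 40, configured to screen differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules.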
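The four-module structure can be pictured as a simple pipeline in which each module is a callable. The sketch below only illustrates how modules 10 to 40 hand data to one another; the class name, the field names, and the placeholder lambdas are all hypothetical and carry no real screening logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ScreeningDevice:
    """Hypothetical composition of the four modules shown in FIG. 7."""
    acquire: Callable[[], dict]              # acquisition module 10
    prescreen: Callable[[dict], list]        # preliminary screening module 20
    generate_rules: Callable[[list], list]   # generation module 30
    screen: Callable[[list, list], list]     # screening module 40

    def run(self):
        expression = self.acquire()
        candidates = self.prescreen(expression)
        rules = self.generate_rules(candidates)
        return self.screen(candidates, rules)


# Placeholder modules, for illustration only.
device = ScreeningDevice(
    acquire=lambda: {"P04919": 2.85, "Q9EPK2": 0.3},
    prescreen=lambda expr: [g for g, v in expr.items() if abs(v) >= 1.0],
    generate_rules=lambda genes: [(g,) for g in genes],
    screen=lambda genes, rules: [g for g in genes if (g,) in rules],
)
print(device.run())
```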
本申请实施例所提供的装置,其实现原理及产生的技术效果和前述方法实 施例相同,为简要描述,装置实施例部分未提及之处,可参考前述方法实施例中相应内容。
在本申请实施例的又一实施例中,如图8所示,生成模块30包括:
生成单元31,生成单元31根据初筛差异表达基因生成差异表达基因事务集;
第一确定单元32,第一确定单元32根据差异表达基因事务集确定频繁项集;
第二确定单元33,第二确定单元33根据频繁项集确定初筛差异表达基因对应的关联规则。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统和装置的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
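An embodiment of the present application further provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the above method when executing the computer program.
An embodiment of the present application further provides a computer-readable medium having non-volatile program code executable by a processor, wherein the program code causes the processor to execute the above method.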
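In addition, in the description of the embodiments of the present application, unless otherwise expressly specified and limited, the terms "mounted", "connected", and "coupled" should be understood broadly. For example, a connection may be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediate medium; and it may be an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present application according to the specific circumstances.
In the description of the present application, it should be noted that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present application, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore cannot be understood as limiting the present application. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance.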
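In the several embodiments provided in the present application, it should be understood that the disclosed system, device, and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some communication interfaces, devices, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.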
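If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a non-volatile computer-readable storage medium executable by a processor. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the embodiments described above are only specific implementations of the present application, intended to illustrate its technical solutions rather than to limit them, and the scope of protection of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed in the present application, still modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of their technical features. Such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application. Therefore, the scope of protection of the present application shall be subject to the scope of protection of the claims.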

Claims (13)

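  1. A differentially expressed gene screening method, characterized by comprising:
    acquiring gene expression levels to be screened;
    performing preliminary screening on the gene expression levels to be screened to obtain preliminarily screened differentially expressed genes;
    generating association rules corresponding to the preliminarily screened differentially expressed genes;
    screening differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules.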
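  2. The differentially expressed gene screening method according to claim 1, characterized in that the step of performing preliminary screening on the gene expression levels to be screened to obtain preliminarily screened differentially expressed genes comprises:
    obtaining an expression matrix, a grouping matrix, and a differential comparison matrix of the gene expression levels to be screened;
    performing differentially expressed gene analysis according to the expression matrix, the grouping matrix, and the differential comparison matrix to obtain the preliminarily screened differentially expressed genes.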
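  3. The differentially expressed gene screening method according to claim 1, characterized in that the step of generating association rules corresponding to the preliminarily screened differentially expressed genes comprises:
    generating a differentially expressed gene transaction set according to the preliminarily screened differentially expressed genes;
    determining frequent itemsets according to the differentially expressed gene transaction set;
    determining the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets and the gene transaction set.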
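  4. The differentially expressed gene screening method according to claim 3, characterized in that the step of generating a differentially expressed gene transaction set according to the preliminarily screened differentially expressed genes comprises:
    obtaining the intensity categories into which the preliminarily screened differentially expressed genes are divided according to their corresponding differential expression fold change (FC);
    generating the differentially expressed gene transaction set from the preliminarily screened differentially expressed genes according to the intensity categories.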
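  5. The differentially expressed gene screening method according to claim 3, characterized in that the step of screening differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules comprises:
    determining the support of the association rules according to the transaction set;
    determining the confidence of the association rules according to the frequent itemsets;
    screening differentially expressed genes from the preliminarily screened differentially expressed genes according to the support, the confidence, and the corresponding lift.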
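  6. The differentially expressed gene screening method according to claim 5, characterized in that the step of screening differentially expressed genes from the preliminarily screened differentially expressed genes according to the support, the confidence, and the corresponding lift comprises:
    determining the lift as the ratio of the confidence to the support;
    obtaining a preset lift threshold;
    using the support and confidence whose lift is greater than the threshold to screen differentially expressed genes from the preliminarily screened differentially expressed genes.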
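  7. A differentially expressed gene screening device, characterized by comprising:
    an acquisition module, configured to acquire gene expression levels to be screened;
    a preliminary screening module, configured to perform preliminary screening on the gene expression levels to be screened to obtain preliminarily screened differentially expressed genes;
    a generation module, configured to generate association rules corresponding to the preliminarily screened differentially expressed genes;
    a screening module, which screens differentially expressed genes from the preliminarily screened differentially expressed genes based on the association rules.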
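  8. The differentially expressed gene screening device according to claim 7, characterized in that the generation module comprises:
    a generation unit, which generates a differentially expressed gene transaction set according to the preliminarily screened differentially expressed genes;
    a first determination unit, which determines frequent itemsets according to the differentially expressed gene transaction set;
    a second determination unit, which determines the association rules corresponding to the preliminarily screened differentially expressed genes according to the frequent itemsets and the gene transaction set.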
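  9. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to claim 1 when executing the computer program.
  10. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to claim 2 when executing the computer program.
  11. An electronic device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to claim 3 when executing the computer program.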
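  12. A computer-readable medium having non-volatile program code executable by a processor, characterized in that the program code causes the processor to execute the method according to claim 1.
  13. A computer-readable medium having non-volatile program code executable by a processor, characterized in that the program code causes the processor to execute the method according to claim 2.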
PCT/CN2019/123574 2018-12-18 2019-12-06 Differentially expressed gene screening method and device WO2020125448A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811551609.2A CN111341385A (zh) 2018-12-18 2018-12-18 Differentially expressed gene screening method and device
CN201811551609.2 2018-12-18

Publications (1)

Publication Number Publication Date
WO2020125448A1 (zh) 2020-06-25

Family

ID=71102513

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/123574 WO2020125448A1 (zh) 2018-12-18 2019-12-06 Differentially expressed gene screening method and device

Country Status (2)

Country Link
CN (1) CN111341385A (zh)
WO (1) WO2020125448A1 (zh)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160253453A1 (en) * 2015-01-16 2016-09-01 University Of Virginia Patent Foundation Parameterizing Cell-to-Cell Regulatory Heterogeneities via Stochastic Transcriptional Profiles
CN108038352A (zh) * 2017-12-15 2018-05-15 Xidian University Method for mining genome-wide key genes by combining differential analysis and association rules

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060141489A1 (en) * 2004-07-13 2006-06-29 Allison David B Method of statistical genomic analysis
US20060190190A1 (en) * 2005-02-02 2006-08-24 Zohar Yakhini Method and system for analysis of gene-expression data
CN108830045B (zh) * 2018-06-29 2021-04-20 Shenzhen Institutes of Advanced Technology Multi-omics-based biomarker system screening method

Also Published As

Publication number Publication date
CN111341385A (zh) 2020-06-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19899880

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 12/11/2021)

122 Ep: pct application non-entry in european phase

Ref document number: 19899880

Country of ref document: EP

Kind code of ref document: A1