WO2020135500A1 - Method and system for constructing a reference data set for biological information analysis

Method and system for constructing a reference data set for biological information analysis

Info

Publication number
WO2020135500A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
reference data
sample
biological information
information analysis
Prior art date
Application number
PCT/CN2019/128283
Other languages
English (en)
Chinese (zh)
Inventor
王云峰
杜洋
李大为
玄兆伶
王海良
王娟
肖飞
Original Assignee
安诺优达基因科技(北京)有限公司
Priority date
Filing date
Publication date
Application filed by 安诺优达基因科技(北京)有限公司
Publication of WO2020135500A1

Classifications

    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00: ICT programming tools or database systems specially adapted for bioinformatics
    • G16B50/30: Data warehousing; Computing architectures
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00: ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10: Sequence alignment; Homology search
    • G: PHYSICS
    • G16: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B: BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00: ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding

Definitions

  • Primary databases include, for example, genome databases, nucleic acid and protein primary structure sequence databases, and biological macromolecular three-dimensional spatial structure databases.
  • The secondary database is constructed based on the primary database and the literature. According to the actual needs of different research fields in the life sciences, data such as genome maps, nucleic acid and protein sequences, protein structures, and literature are analyzed, organized, summarized, and annotated to construct secondary databases with special biological significance and special purposes. This is an effective way to develop information databases.
  • The present invention proposes a new method for constructing a background database, that is, a method for constructing a reference data set for bioinformatics analysis, which divides an existing bioinformatics analysis reference data set into different reference data subsets based on a certain feature, so as to obtain reference data subsets with relatively consistent internal features.
  • the above-mentioned features show significant differences between the subsets of reference data.
  • the object of the present invention is achieved by the following technical solutions.
  • a method for constructing a reference data set for biological information analysis including:
  • the sequencing data is preferably sequencing data from a high-throughput sequencing platform, specifically, those generally used by those skilled in the art, such as whole genome sequencing or target sequence capture sequencing.
  • the sequencing data includes all sequencing data or part of sequencing data, and all of the sequencing data and part of the sequencing data may be sequencing data obtained after filtering by a quality control method commonly used by those skilled in the art.
  • the sequencing feature data of each reference sample is extracted.
  • the reads alignment quality is preferably the alignment quality of unique reads.
  • the step of acquiring the partial sequencing data of the reference sample includes:
  • the geometric center can be obtained by a method generally used in the art, for example, a method for obtaining the geometric center of a cluster.
  • the data range of the sequencing data, the data range of the biological information analysis reference data set, and the data range of the reference data subsets are the positions on the reference genome obtained after the data is aligned to the reference genome.
  • the characteristic factors of the reference data subset or the biological information analysis reference data set are based on the characteristic factors extracted from the initial reference data set constructing it or the influencing factors of the sequencing data of the reference sample constructing the initial reference data set.
  • one of the influencing factors of the sequencing data of the reference samples that construct the initial reference data set is the storage temperature of the sample. Based on this influencing factor, GC content is selected as the characteristic factor of the sequencing data of the reference samples of the initial reference data set, so that the characteristic factor of the resulting biological information analysis reference data set or reference data subset is GC content.
  • An analysis method of chromosomal abnormality or copy number variation characterized in that the biological information analysis reference data set described in item 8 is used to analyze the chromosomal abnormality or copy number variation of the sample to be detected.
  • a system for constructing a reference data set for biological information analysis comprising:
  • Data acquisition module for acquiring sequencing data of multiple reference samples
  • the initial reference data set construction module is used to form the sequencing data of all reference samples into an initial reference data set
  • the reference data set construction module is used to use any reference data subset as a biological information analysis reference data set.
  • classification module includes,
  • a feature data extraction sub-module for extracting feature data of each reference sample according to the feature factor, wherein, based on all sequencing data of the reference sample, sequencing feature data of each reference sample is extracted,
  • the data segmentation sub-module is used to divide reference samples with similar feature data into one reference data subset based on the feature data of each reference sample, thereby obtaining more than two reference data subsets.
  • classification module includes,
  • Feature factor selection sub-module used to select feature factors based on sequencing data influencing factors
  • the data segmentation sub-module is used to divide reference samples with similar feature data into one reference data subset based on the feature data of each reference sample, thereby obtaining more than two reference data subsets.
  • the data filtering sub-module includes:
  • a feature coefficient extraction element used to obtain the feature coefficient of the segmented data according to the feature factor
  • the segmented data filtering component is used to judge whether the feature coefficient of the segmented data exceeds the set range, delete the segmented data whose feature coefficient exceeds the set range, and retain the segmented data with the feature coefficient within the set range.
  • the data segmentation sub-module includes a direct segmentation element and/or a quantitative segmentation element
  • Data acquisition module for acquiring sequencing data of newly added reference samples
  • a classification module used to classify the newly added reference sample into the same class as a reference data subset to be expanded
  • An expansion data set construction module used to merge the newly added reference sample with the reference data subset to be expanded into a class to form an expanded reference data subset
  • the reference data set construction module is used to use any expanded reference data subset as a biological information analysis reference data set.
  • classification module includes a feature data extraction sub-module and a classification judgment sub-module.
  • the data extraction sub-module is used to extract the characteristic data of the data range corresponding to the newly added reference sample according to the data range and the characteristic factor of the reference data subset to be expanded.
  • the classification judgment sub-module is used to compare the spatial distances between the feature data of the newly added reference sample and the geometric center of each reference data subset to be expanded.
  • the newly added reference sample and the reference data subset to be expanded that has the smallest spatial distance from it are classified into one category.
  • a biological information analysis system which uses the biological information analysis reference data set described in item 8 to analyze the test sample, the system includes,
  • the data acquisition module is used to acquire the sequencing data of the sample to be tested,
  • the data set preparation module is used for storing or extracting the biological information analysis reference data set and for extracting the parameters of that reference data set, the parameters including geometric center, data range and characteristic factor,
  • a classification module used to classify the sample to be tested and a reference data set for biological information analysis
  • the analysis module is used to perform biological information analysis using the biological information analysis reference data set into which the sample to be tested has been classified,
  • the data set preparation module includes a parameter extraction sub-module.
  • the parameter extraction sub-module is used to extract parameters of the biological information analysis reference data set.
  • the parameters preferably include geometric center, data range, and feature factor.
  • classification module includes a feature data extraction sub-module and a classification judgment sub-module.
  • the sample to be detected and the biological information analysis reference data set having the minimum spatial distance therefrom are classified into one category.
  • An electronic device including:
  • using the overall reference data set as the background can produce abnormal detection results, such as false positives.
  • the sequencing data of some samples may show certain fluctuations due to certain influencing factors. Under the overall reference data set, such fluctuations will be judged as abnormal fluctuations, that is, the detection results are displayed as abnormal.
  • the present invention overcomes this problem by performing statistics, evaluation, and screening on the influencing factors of the reference samples that constitute the overall reference data set, and classifying the overall reference data set based on this.
  • FIG. 1 shows a schematic diagram of an embodiment of a method for constructing a reference data set for biological information analysis according to the present invention
  • FIG. 3 shows a schematic diagram of an example of a preferred embodiment of the classification module of the system for constructing a biological information analysis reference data set of the present invention
  • FIG. 6 shows a schematic diagram of a specific implementation manner of an example of a biological information analysis system of the present invention
  • FIG. 8 is a graph showing the result of clustering the reference samples in Example 2.
  • FIG. 10 is a graph showing the average GC probability density distribution of the reference data subsets in Example 2.
  • Reads: the plural of read. A read is a short sequence fragment generated by a high-throughput sequencing platform.
  • Unique reads: reads that align to exactly one location in the reference genome. During sequencing, some reads can be aligned to multiple locations in the reference genome at the same time. Filtering these multi-mapped reads out of all non-duplicate reads leaves the unique reads.
  • FIG. 1 shows a schematic diagram of an embodiment of a method for constructing a reference data set for biological information analysis according to the present invention.
  • step S300 classifies the initial reference data set to obtain more than two reference data subsets.
  • the method includes: selecting feature factors based on the influencing factors of the sequencing data; extracting the feature data of each reference sample according to the feature factors; and, based on the feature data of each reference sample, dividing the reference samples with similar feature data into one reference data subset, thereby obtaining more than two reference data subsets.
  • the sequencing feature data of each reference sample is extracted.
  • the characteristic factor is selected from one or more of the following: reads alignment quality, GC content, sample base sequence complexity and sample genome local complexity.
  • the reads alignment quality is preferably the alignment quality of unique reads.
  • the characteristic factor may be selected according to the influencing factors of the sequencing data as follows. For example, in peripheral blood sample sequencing, the influencing factors include the packaging method, the type of blood collection tube, the centrifugation temperature, the sequencing platform, and so on, all of which affect the GC content distribution of the final library. In this case, GC content can be selected as the characteristic factor for the sequencing data of peripheral blood samples according to these influencing factors.
  • the conventional overall reference data set consisting of normal diploid samples has a high-confidence GC content distribution interval; however, due to various influencing factors, the library GC content deviates.
  • as a result, the samples in the reference data set belong to different GC distributions, which affects the accuracy of the final detection result obtained with the unclassified overall reference data set.
  • the step of acquiring partial sequencing data of the reference sample includes: segmenting all the sequencing data of the reference sample to obtain segmented data of each reference sample; obtaining the feature coefficients of the segmented data according to the characteristic factor; and determining whether the feature coefficients of the segmented data exceed the set range, deleting the segmented data whose feature coefficients are outside the set range, and keeping the segmented data whose feature coefficients are within the set range, to obtain the partial sequencing data of the reference sample.
  • segmenting all the sequencing data of the reference sample may mean dividing the sequencing data of the reference sample into windows. If, for example, GC content is used as the feature factor, the feature coefficient of the segmented data may be the coefficient of variation (CV) of each window. The coefficient of variation is the ratio of the standard deviation to the average value of the GC content of each window. The GC content is preferably the GC content of unique reads.
  • the set range can be chosen to exclude regions where the feature fluctuation is not obvious, and can also be chosen to exclude regions with low mappability for unique reads.
  • the set range may be, for example, 75% or more and 95% or less. A coefficient of variation below 75% indicates that the feature fluctuation is not obvious, and a coefficient of variation above 95% is associated with low mappability of unique reads.
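The window filtering described above can be sketched as follows. This is a minimal illustration, not the patented implementation. It assumes that each window's coefficient of variation is computed across the reference samples, and it reads the 75% and 95% bounds as percentile ranks among all window CVs; the text does not state either point explicitly. The function names and the demo data are hypothetical.

```python
from statistics import mean, pstdev


def window_cv(gc_values):
    """Coefficient of variation: standard deviation divided by mean."""
    m = mean(gc_values)
    return pstdev(gc_values) / m if m else float("inf")


def filter_windows(gc_by_window, lower=0.75, upper=0.95):
    """Keep windows whose CV rank lies within [lower, upper].

    gc_by_window maps a window id to that window's GC content in each
    reference sample. Treating the bounds as percentile ranks is an
    assumption made for this sketch.
    """
    cvs = {w: window_cv(values) for w, values in gc_by_window.items()}
    ranked = sorted(cvs, key=cvs.get)  # window ids, lowest CV first
    n = len(ranked)
    kept = [w for rank, w in enumerate(ranked, start=1)
            if lower <= rank / n <= upper]
    return sorted(kept), cvs


# Hypothetical data: 20 windows, two samples each, spread grows with the id.
demo = {i: [0.4 - (i + 1) * 0.005, 0.4 + (i + 1) * 0.005] for i in range(20)}
kept_windows, cv_by_window = filter_windows(demo)
print(kept_windows)
```

Windows with the lowest CVs (little fluctuation) and the very highest CVs (assumed here to correspond to low-mappability regions) are dropped, and only the retained windows would feed the later feature extraction.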
  • the step of dividing the reference samples with similar characteristic data into one reference data subset based on the characteristic data of each reference sample adopts either a direct division method or a quantitative division method based on a set number of reference data subsets.
  • the quantitative division can preferably be performed using, for example, a cumulative distribution function.
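One way to picture a division based on a set number of subsets and a cumulative distribution function is shown below. The sketch assumes each reference sample is summarized by a single scalar feature (for example, its mean GC content) and that subsets are cut at equal steps of the empirical CDF. Both are assumptions, and `split_by_cdf` is a hypothetical name.

```python
from bisect import bisect_right


def split_by_cdf(feature_by_sample, k):
    """Divide samples into k subsets at equal steps of the empirical CDF."""
    values = sorted(feature_by_sample.values())
    n = len(values)
    subsets = [[] for _ in range(k)]
    for sample, x in feature_by_sample.items():
        rank = bisect_right(values, x)  # number of samples with feature <= x
        index = (rank * k - 1) // n     # ceil(F(x) * k) - 1, integer arithmetic
        subsets[index].append(sample)
    return subsets


# Hypothetical mean GC content of six reference samples.
mean_gc = {"s1": 0.38, "s2": 0.45, "s3": 0.41,
           "s4": 0.52, "s5": 0.39, "s6": 0.47}
groups = split_by_cdf(mean_gc, 3)
print(groups)  # [['s1', 's5'], ['s2', 's3'], ['s4', 's6']]
```

Unlike clustering, this approach fixes the number of subsets in advance and gives each subset roughly the same number of samples.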
  • the specific implementation scheme of the direct division method may be an unsupervised clustering algorithm.
  • the method for constructing a reference data set for biological information analysis of the present invention includes: randomly selecting N reference samples and sequencing them (obtaining N gene libraries), so that the sequencing results of the N reference samples constitute the initial reference data set. Each of the N gene libraries may also be preprocessed to exclude the redundant parts of the library.
  • An unsupervised clustering algorithm is used to perform cluster analysis and classification on the N preprocessed libraries to determine L reference data subsets, that is, to divide the N libraries into L reference data subsets. Any reference data subset is then taken as a search cluster (a reference data subset to be expanded) and its geometric center is obtained, that is, the geometric center of each of the L reference data subsets is determined based on the unsupervised clustering algorithm.
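The text does not name a specific unsupervised clustering algorithm. As an illustration only, the sketch below uses k-means, with each library represented by a feature vector (for example, GC content per retained window). The choice of k-means, the naive initialization from the first k libraries, and all names are assumptions.

```python
from math import dist


def centroid(points):
    """Geometric center: the per-dimension mean of the points."""
    n = len(points)
    return tuple(sum(column) / n for column in zip(*points))


def kmeans(points, k, max_iterations=50):
    """Cluster points into k groups; return (labels, centers)."""
    centers = [tuple(p) for p in points[:k]]  # naive, deterministic start
    labels = None
    for _ in range(max_iterations):
        new_labels = [
            min(range(k), key=lambda c: dist(p, centers[c])) for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = centroid(members)
    return labels, centers


# Hypothetical feature vectors for N = 6 libraries, split into L = 2 subsets.
libraries = [(0.25, 0.25), (0.75, 0.75), (0.25, 0.375),
             (0.75, 0.625), (0.375, 0.25), (0.625, 0.75)]
labels, centers = kmeans(libraries, k=2)
print(labels)
```

The returned centers play the role of the geometric centers of the reference data subsets. A production pipeline would more likely rely on an established clustering library and a more careful initialization.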
  • M new samples different from the N reference samples are then randomly selected and sequenced to obtain M gene libraries, each of which may also be preprocessed. Then, based on the distance between each of the M gene libraries and the geometric center of each of the L reference data subsets, the M gene libraries are allocated to the L reference data subsets to expand them, yielding L expanded reference data subsets.
  • for each of the additional M gene libraries, the spatial distance between the library and the geometric center of each of the L reference data subsets is calculated, that is, L spatial distances are calculated, and the library is assigned to the reference data subset with the smallest of the L spatial distances, thereby completing the expansion process for that gene library. When there is no minimum value among the calculated L spatial distances, the gene library is not assigned to any of the L sub-background libraries (reference data subsets). In that case, the expansion step proceeds directly to the next of the M gene libraries. These steps are performed on all M gene libraries to complete the expansion.
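The expansion step can be sketched as a nearest-center assignment. The sketch assumes Euclidean distance and interprets "no minimum value" as "no unique minimum" (a tie between subsets); the text specifies neither. Function names and data are hypothetical.

```python
from math import dist


def assign_to_subset(feature_vector, centers):
    """Return the index of the nearest center, or None if it is not unique."""
    distances = [dist(feature_vector, c) for c in centers]
    smallest = min(distances)
    if distances.count(smallest) != 1:
        return None  # tie: treated here as having no minimum (assumption)
    return distances.index(smallest)


def expand_subsets(subsets, centers, new_libraries):
    """Add each new library to its nearest subset; collect the unassigned."""
    expanded = [list(members) for members in subsets]
    unassigned = []
    for name, vector in new_libraries.items():
        index = assign_to_subset(vector, centers)
        if index is None:
            unassigned.append(name)
        else:
            expanded[index].append(name)
    return expanded, unassigned


# Hypothetical subsets, centers, and M = 3 newly sequenced libraries.
subsets = [["n1", "n2"], ["n3"]]
centers = [(0.25, 0.25), (0.75, 0.75)]
new_libraries = {"m1": (0.25, 0.375), "m2": (0.75, 0.625), "m3": (0.5, 0.5)}
expanded, unassigned = expand_subsets(subsets, centers, new_libraries)
print(expanded, unassigned)
```

The sketch leaves the centers unchanged after expansion; whether they should be recomputed once libraries are added is not stated in the text.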
  • the reference samples may come from any kind of population, such as humans or other mammals, and multiple reference samples of the same kind are sequenced to obtain the sequencing data of these reference samples.
  • First-generation sequencing, second-generation sequencing, and third-generation sequencing can be used.
  • N reference samples may be selected at random, so that the N reference samples are sequenced to obtain sequencing data of the N reference samples, and the sequencing results of the N reference samples constitute an initial reference data set.
  • FIG. 2 shows a schematic diagram of an embodiment of a system for constructing a reference data set for biological information analysis according to the present invention.
  • this embodiment provides a system for constructing a reference data set for biological information analysis.
  • the system includes a data acquisition module 100 for acquiring sequencing data of multiple reference samples.
  • the initial reference data set construction module 200 is used to compose the sequencing data of all reference samples into an initial reference data set.
  • the classification module 300 is configured to classify the initial reference data set to obtain more than two reference data subsets.
  • the reference data set construction module 500 is configured to use any reference data subset as a biological information analysis reference data set.
  • FIG. 3 shows a schematic diagram of an example of a preferred implementation of the classification module of the specific implementation of FIG. 2.
  • FIG. 4 shows a schematic diagram of an example of a preferred implementation of the classification module of the specific implementation of FIG. 2.
  • the data filtering sub-module 320c includes a data segmentation component 321 for segmenting all the sequencing data of the reference samples to obtain segmented data for each reference sample; a feature coefficient extraction component 322 for obtaining the feature coefficient of the segmented data according to the feature factor; and a segmented data filtering component 323 for determining whether the feature coefficient of the segmented data exceeds the set range, deleting the segmented data whose feature coefficient exceeds the set range, and keeping the segmented data whose feature coefficient is within the set range.
  • the feature data extraction sub-module 330c is configured to extract feature data of each reference sample according to the feature factor, wherein the sequencing feature data of each reference sample is extracted based on the partial sequencing data of the reference sample.
  • the data segmentation sub-module 340c includes a direct segmentation element and/or a quantitative segmentation element that divides reference samples with similar feature data into one reference data subset based on the set number of reference data subsets.
  • FIG. 6 shows a schematic diagram of a specific implementation manner of an example of the biological information analysis system of the present invention.
  • FIG. 6 a schematic diagram of an example of a preferred embodiment of the system for constructing a biological information analysis reference data set of the present invention
  • the classification module 300b includes a feature data extraction submodule 301b and a classification judgment submodule 302b.
  • the feature data extraction sub-module 301b is configured to extract the feature data of the corresponding data range of the sample to be detected according to the data range and feature factors of the biological information analysis reference data set.
  • the input device 13 may include, for example, a keyboard, a mouse, and the like.
  • the output device 14 can output various information to the outside, such as the fitting curve of the target sequence.
  • the output device 14 may include, for example, a display, a speaker, a printer, and a communication network and a remote output device connected thereto, and so on.
  • the electronic device 10 may also include any other suitable components.
  • Extract the geometric center and radius of the 6 reference data subsets: merge the libraries in each reference data subset to calculate the average GC content of each feature window, and use the unique reads GC content distribution vector obtained from each reference data subset as the geometric center of that reference data subset.
  • the average unique reads GC distributions of the different reference data subsets are compared, as shown in Figure 10.
  • Figure 10 shows that the 6 reference data subsets have significantly different distributions. For any newly added sample or sample to be tested, it is only necessary to calculate the distance between its corresponding feature windows and the geometric centers of the 6 reference data subsets to determine which reference data subset the newly added sample belongs to, or which reference data subset serves as the biological information analysis reference data set for the sample to be tested.
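A small sketch of the center extraction and reference selection described above follows. It assumes each library is a vector of GC content per feature window, that the geometric center is the per-window mean over the libraries in a subset, and that the radius is the largest distance from a member library to the center. The text mentions a radius but does not define it, so that definition is an assumption, as are all names and values.

```python
from math import dist


def geometric_center(libraries):
    """Average GC content of each feature window across a subset's libraries."""
    n = len(libraries)
    return [sum(window) / n for window in zip(*libraries)]


def radius(libraries, center):
    """Largest member-to-center distance (assumed definition of radius)."""
    return max(dist(library, center) for library in libraries)


def nearest_subset(sample, centers):
    """Name of the subset whose geometric center is closest to the sample."""
    return min(centers, key=lambda name: dist(sample, centers[name]))


# Hypothetical subsets with two feature windows per library.
subsets = {
    "A": [[0.25, 0.25], [0.5, 0.25]],
    "B": [[0.5, 0.75], [0.75, 0.75]],
}
centers = {name: geometric_center(libs) for name, libs in subsets.items()}
radii = {name: radius(libs, centers[name]) for name, libs in subsets.items()}

test_sample = [0.375, 0.375]
print(nearest_subset(test_sample, centers))  # A
```

The selected subset would then act as the reference data set for analyzing that sample. The radius could also serve as a plausibility check on the assignment, although the text does not describe such a use.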
  • Principal component analysis is performed based on the GC distribution matrix of the 6 reference data subsets shown in Figure 8 to observe the distribution of samples under different principal components.
  • the analysis results are shown in Figure 9, in which the six different reference data subsets are labeled with different gray levels and symbols.
  • the results in FIG. 9 show that there is a significant difference between the different reference data subsets, especially the difference under the second and third principal components.

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Theoretical Computer Science (AREA)
  • Biophysics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Bioethics (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract

Also disclosed is a method for constructing a reference data set for biological information analysis. The method comprises: acquiring sequencing data of a plurality of reference samples (S100); forming an initial reference data set from the sequencing data of all the reference samples (S200); performing classification processing on the initial reference data set (S300) to obtain at least two reference data subsets; and using any one of the reference data subsets as a biological information analysis reference data set.
PCT/CN2019/128283 2018-12-29 2019-12-25 Method and system for constructing a reference data set for biological information analysis WO2020135500A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201811638706 2018-12-29
CN201811638706.5 2018-12-29

Publications (1)

Publication Number Publication Date
WO2020135500A1 (fr) 2020-07-02

Family

ID=71126324

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/128283 WO2020135500A1 (fr) 2018-12-29 2019-12-25 Method and system for constructing a reference data set for biological information analysis

Country Status (2)

Country Link
CN (1) CN111383717A (fr)
WO (1) WO2020135500A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113609204B (zh) * 2021-09-30 2021-12-24 深圳前海环融联易信息科技服务有限公司 数据关联特征分析方法、装置、设备及介质

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103824001A (zh) * 2014-02-27 2014-05-28 北京诺禾致源生物信息科技有限公司 染色体的检测方法和装置
WO2015053480A1 (fr) * 2013-10-11 2015-04-16 삼성에스디에스 주식회사 Système et procédé d'analyse d'échantillons biologiques
CN105631464A (zh) * 2015-12-18 2016-06-01 深圳先进技术研究院 对染色体序列和质粒序列进行分类的方法及装置
CN106610977A (zh) * 2015-10-22 2017-05-03 阿里巴巴集团控股有限公司 一种数据聚类方法和装置
CN109063959A (zh) * 2018-06-22 2018-12-21 深圳弘睿康生物科技有限公司 一种样本质量控制分析方法和系统

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102517392A (zh) * 2011-12-26 2012-06-27 深圳华大基因研究院 基于宏基因组16s高可变区v3的分类方法和装置
JP5938484B2 (ja) * 2012-01-20 2016-06-22 深▲せん▼華大基因医学有限公司Bgi Diagnosis Co., Ltd. ゲノムのコピー数変異の有無を判断する方法、システム及びコンピューター読み取り可能な記憶媒体
WO2016090583A1 (fr) * 2014-12-10 2016-06-16 深圳华大基因研究院 Dispositif et procédé de traitement de données de séquençage
KR101828052B1 (ko) * 2015-06-24 2018-02-09 사회복지법인 삼성생명공익재단 유전자의 복제수 변이(cnv)를 분석하는 방법 및 장치
EP3878974A1 (fr) * 2015-07-06 2021-09-15 Illumina Cambridge Limited Préparation d'échantillons pour l'amplification d'acide nucléique
CN105063208B (zh) * 2015-08-10 2018-03-06 北京吉因加科技有限公司 一种血浆中游离的目标dna低频突变富集测序方法
CN105132407B (zh) * 2015-08-10 2017-12-12 北京吉因加科技有限公司 一种脱落细胞dna低频突变富集测序方法
CN106815491B (zh) * 2016-12-29 2021-11-16 浙江安诺优达生物科技有限公司 一种用于检测ffpe样本基因融合的装置
CN108256292B (zh) * 2016-12-29 2021-11-02 浙江安诺优达生物科技有限公司 一种拷贝数变异检测装置
CN108763859B (zh) * 2018-05-17 2020-11-24 北京博奥医学检验所有限公司 一种基于未知cnv样本建立提供cnv检测所需的模拟数据集的方法

Also Published As

Publication number Publication date
CN111383717A (zh) 2020-07-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19901968

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19901968

Country of ref document: EP

Kind code of ref document: A1