WO2021164270A1 - Data analysis method, apparatus and device, and storage medium - Google Patents

Data analysis method, apparatus and device, and storage medium

Info

Publication number
WO2021164270A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
representative
representative sequence
sample
depth information
Prior art date
Application number
PCT/CN2020/119441
Other languages
English (en)
Chinese (zh)
Inventor
贾瑞凯
叶桦
方其
张艳
吴昕
廖国娟
Original Assignee
苏州金唯智生物科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州金唯智生物科技有限公司
Publication of WO2021164270A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/30 Detection of binding sites or motifs

Definitions

  • the present disclosure relates to the field of biology, for example, to a data analysis method, device, equipment, and storage medium.
  • Metagenomics analysis has become the main method of microbial population research.
  • The vast amount of sequence information obtained through high-throughput sequencing can be analyzed with bioinformatics methods to study the composition and diversity of microorganisms and the relationship between microorganisms and their environment; research in this area has entered a new era.
  • 16S rDNA sequencing technology is one of the main technologies for studying metagenomics.
  • The main purpose of the bioinformatics analysis method corresponding to 16S rDNA sequencing was to increase the speed of data processing as much as possible while ensuring the accuracy of the data.
  • OTU Table OTU cluster analysis
  • ASV table characteristic sequence analysis
  • the present disclosure provides a data analysis method, device, equipment, and storage medium, so as to achieve a more complete and accurate detection of components, and expand the sensitivity and completeness of detection.
  • a data analysis method including:
  • a data analysis device which includes:
  • a determining module configured to determine a representative sequence according to the sequencing data
  • a statistics module configured to count the depth information of the representative sequence in at least one sample
  • the analysis module is configured to analyze the representative sequence and the depth information of the representative sequence in at least one sample.
  • a computer device including a memory, a processor, and a computer program that is stored in the memory and can run on the processor; when the processor executes the program, the data analysis method is implemented.
  • FIG. 1 is a flowchart of a data analysis method in Embodiment 1 of the present invention.
  • FIG. 2 is a schematic structural diagram of a data analysis device in Embodiment 2 of the present invention.
  • FIG. 3 is a schematic structural diagram of a computer device in Embodiment 3 of the present invention.
  • FIG. 1 is a flowchart of a data analysis method provided in Embodiment 1 of the present invention. This embodiment is applicable to data analysis scenarios.
  • The method can be executed by the data analysis device in the embodiment of the present invention, and the device can be implemented in software and/or hardware. As shown in FIG. 1, the method includes the following steps:
  • S110 Read sequencing data of at least one sample.
  • the sequencing data includes at least one read, and each read corresponds to one base sequence fragment.
  • the sequencing data of at least one sample may be read.
  • For example, three samples M01, M02, and M03 need to be tested, and the three samples have known bacterial populations and proportions (see BEI Resources catalog: HM-782D).
  • The sequencing data of the three samples M01, M02, and M03 contained 38,386, 68,436, and 31,791 base sequences (reads), respectively, and each read contained 457 bases on average.
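Step S110 can be sketched as follows. This is a minimal illustration only: the text does not name a file format, so the FASTQ layout (four lines per read) and the helper name `read_fastq` are assumptions, and the toy records are made up. The read counts are the ones reported above for M01, M02, and M03.

```python
from io import StringIO


def read_fastq(handle):
    """Yield the base sequence of each read from a FASTQ text stream.

    Assumes the simple four-line FASTQ layout (header, bases, '+', qualities);
    the source text does not state which format the sequencing data uses.
    """
    while True:
        header = handle.readline()
        if not header:
            break  # end of stream
        if not header.startswith("@"):
            raise ValueError("malformed FASTQ record")
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        handle.readline()  # quality line, ignored in this sketch
        yield sequence


# Two made-up records standing in for one sample's file.
toy_stream = StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGA\n+\nIIII\n")
toy_reads = list(read_fastq(toy_stream))

# Read counts reported in the text for the three samples.
reported_counts = {"M01": 38386, "M02": 68436, "M03": 31791}
total_reads = sum(reported_counts.values())
```

Summing the three reported counts gives the pooled number of reads that the grouping step then works on.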
  • S120 Determine a representative sequence according to the sequencing data.
  • the sequencing data includes at least one base sequence; correspondingly, determining a representative sequence according to the sequencing data includes:
  • the same base sequence fragments in the sequencing data of at least one sample are combined into a group, and the corresponding base sequence of each group is used as the representative sequence of the group.
  • the base sequence includes at least one base.
  • 138,613 reads are divided into 127,299 groups according to whether their base sequences are completely identical, and the corresponding base sequence information of each group is used as the representative sequence of the group.
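The grouping in S120 amounts to exact-match deduplication of reads pooled across samples. The sketch below shows one way to do it under that reading; the function name `group_identical_reads` and the toy reads are hypothetical, not taken from the source.

```python
from collections import Counter


def group_identical_reads(samples):
    """Pool reads from all samples and group exactly identical sequences.

    `samples` maps a sample name to a list of read sequences.
    Returns a Counter mapping each distinct sequence (the group's
    representative sequence) to the number of pooled reads in its group.
    """
    groups = Counter()
    for reads in samples.values():
        groups.update(reads)
    return groups


# Made-up reads; real reads in the example average 457 bases.
toy_samples = {
    "M01": ["ACGT", "ACGT", "ACGA"],
    "M02": ["ACGA", "TTGA"],
}
groups = group_identical_reads(toy_samples)
representatives = list(groups)  # one representative sequence per group
```

With the figures in the text, 138,613 reads collapse to 127,299 groups, so a group holds roughly 1.09 reads on average.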
  • The base sequence includes at least one base; correspondingly, combining the identical base sequences in the sequencing data of at least one sample into a group, and using the corresponding base sequence of each group as the representative sequence of the group, includes:
  • If no other base sequence is exactly identical to a given base sequence, that base sequence forms a group on its own. Base sequences whose bases differ are placed in separate groups.
  • the representative sequence includes at least one base, for example, if the base sequence fragment is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, then each A is one base.
  • the way to determine whether the bases in the at least one base sequence are the same is to align all the bases in the base sequence in sequence.
  • base sequence 1 is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
  • To judge whether base sequence 1 and base sequence 2 are the same, the bases in base sequence 1 and base sequence 2 are aligned in order. If the first, third, and fourth bases of the two sequences differ, base sequence 1 and base sequence 2 are different, so base sequence 1 forms one group and base sequence 2 forms another. Likewise, if base sequence 1 is AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA, the bases in base sequence 1 and base sequence 3 are aligned in order
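The base-by-base comparison described above can be illustrated with a short sketch. The function names and the two toy sequences are hypothetical; the toy pair was chosen so that it differs at three positions, in the spirit of the example in the text.

```python
from itertools import zip_longest


def differing_positions(seq_a, seq_b):
    """Return the 1-based positions at which two base sequences differ.

    Sequences are compared position by position; if one sequence is
    longer, every extra position counts as a difference.
    """
    return [
        position
        for position, (a, b) in enumerate(zip_longest(seq_a, seq_b), start=1)
        if a != b
    ]


def belong_to_same_group(seq_a, seq_b):
    """Two sequences share a group only if no position differs."""
    return not differing_positions(seq_a, seq_b)


sequence_1 = "ACGTAC"  # made-up example
sequence_2 = "TCAAAC"  # differs from sequence_1 at positions 1, 3 and 4
mismatches = differing_positions(sequence_1, sequence_2)
```

In practice a plain string equality test gives the same grouping decision; listing the positions is only useful for explaining why two reads ended up in different groups.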
  • S140 Analyze the representative sequence and the depth information of the representative sequence in at least one sample.
  • The representative sequence and its depth information in at least one sample may be analyzed by first performing a difference analysis on a table containing the representative sequences and their depths in at least one sample. Alternatively, the representative sequences may be annotated to determine the species and/or genus corresponding to each representative sequence, the representative sequences of the same species and/or genus merged, and the corresponding depth information summed to obtain the bacterial composition of at least one sample at the species and/or genus level. The embodiment of the present invention does not limit the manner of analysis.
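The annotate-then-merge route can be sketched as below. Everything specific here is an assumption for illustration: the `annotation` lookup stands in for whatever annotation step is actually used (the text does not name a tool or database), and the sequences, genus labels, and depths are toy values.

```python
from collections import defaultdict


def merge_by_taxon(rows, annotation, sample_names):
    """Sum per-sample depths of representative sequences sharing a taxon.

    `rows` holds (representative_sequence, depth_per_sample...) tuples and
    `annotation` maps a sequence to its genus or species label. Sequences
    without an annotation are collected under "unclassified".
    """
    merged = defaultdict(lambda: [0] * len(sample_names))
    for sequence, *depths in rows:
        taxon = annotation.get(sequence, "unclassified")
        for index, depth in enumerate(depths):
            merged[taxon][index] += depth
    return dict(merged)


sample_names = ["M01", "M02"]
rows = [("ACGT", 2, 0), ("ACGA", 1, 1), ("TTGA", 0, 1)]  # toy depths
annotation = {"ACGT": "Escherichia", "ACGA": "Escherichia", "TTGA": "Bacillus"}
composition = merge_by_taxon(rows, annotation, sample_names)
```

The result gives, for each genus, the summed depth in every sample, which is the genus-level composition described above.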
  • analyzing the representative sequence and the depth information of the representative sequence in at least one sample includes:
  • Annotating the representative sequences obtained after grouping includes:
  • the method further includes:
  • a table is established according to the number, the representative sequence, and the depth information of the representative sequence, wherein the depth information of the representative sequence includes depth information for at least one sample.
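A table of this shape can be built as in the sketch below. It assumes that the depth of a representative sequence in a sample is the number of that sample's reads identical to it, which is an interpretation of the text rather than a stated definition; the function name and toy reads are hypothetical.

```python
from collections import Counter


def build_depth_table(samples):
    """Build (header, rows) with one row per representative sequence.

    Each row is (number, representative_sequence, depth in each sample).
    Depth is taken here to be the count of identical reads in the sample.
    """
    sample_names = list(samples)
    per_sample = {name: Counter(reads) for name, reads in samples.items()}

    # Distinct sequences in first-seen order across all samples.
    distinct = {}
    for reads in samples.values():
        for read in reads:
            distinct.setdefault(read, None)

    header = ["number", "representative_sequence", *sample_names]
    rows = [
        (number, sequence, *(per_sample[name][sequence] for name in sample_names))
        for number, sequence in enumerate(distinct, start=1)
    ]
    return header, rows


toy_samples = {
    "M01": ["ACGT", "ACGT", "ACGA"],
    "M02": ["ACGA", "TTGA"],
}
header, rows = build_depth_table(toy_samples)
```

A sequence absent from a sample gets depth 0 there, so every row has one depth column per sample.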
  • The technical solution of this embodiment reads the sequencing data of at least one sample; merges the identical base sequence fragments in the sequencing data into one group, with the base sequence information of each group used as the representative sequence of that group; counts the depth information of the representative sequence in at least one sample; and analyzes the representative sequence and the depth information of the representative sequence in at least one sample. In this way, the components can be detected more completely and accurately, and the sensitivity and completeness of the detection are improved.
  • FIG. 2 is a schematic structural diagram of a data analysis device provided in Embodiment 2 of the present invention. This embodiment is suitable for data analysis.
  • the device can be implemented in software and/or hardware.
  • the device can be integrated in any equipment that provides data analysis functions.
  • The data analysis device includes: a reading module 210, a determining module 220, a statistics module 230, and an analysis module 240.
  • the reading module 210 is configured to read sequencing data of at least one sample
  • the determining module 220 is configured to determine a representative sequence according to the sequencing data
  • the statistics module 230 is configured to count the depth information of the representative sequence in at least one sample
  • the analysis module 240 is configured to analyze the representative sequence and the depth information of the representative sequence in at least one sample.
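The four-module layout can be pictured as one class with a method per module. This is a rough sketch under stated assumptions, not the device's actual implementation: the class and method names are hypothetical, depth is taken to be a count of identical reads, and the relative-abundance calculation in the analysis module is a stand-in for whichever analysis is really performed.

```python
from collections import Counter


class DataAnalysisDevice:
    """Toy counterpart of modules 210, 220, 230 and 240."""

    def reading_module(self, source):  # module 210
        """Read sequencing data: sample name -> list of reads."""
        return {name: list(reads) for name, reads in source.items()}

    def determining_module(self, samples):  # module 220
        """Determine representative sequences (distinct reads, first-seen order)."""
        distinct = {}
        for reads in samples.values():
            for read in reads:
                distinct.setdefault(read, None)
        return list(distinct)

    def statistics_module(self, samples, representatives):  # module 230
        """Count the depth of each representative sequence in each sample."""
        counts = {name: Counter(reads) for name, reads in samples.items()}
        return {
            rep: {name: counts[name][rep] for name in samples}
            for rep in representatives
        }

    def analysis_module(self, depths):  # module 240
        """Placeholder analysis: relative abundance within each sample."""
        names = list(next(iter(depths.values()))) if depths else []
        totals = {n: sum(row[n] for row in depths.values()) for n in names}
        return {
            rep: {n: (row[n] / totals[n] if totals[n] else 0.0) for n in names}
            for rep, row in depths.items()
        }

    def run(self, source):
        samples = self.reading_module(source)
        representatives = self.determining_module(samples)
        depths = self.statistics_module(samples, representatives)
        return self.analysis_module(depths)


result = DataAnalysisDevice().run(
    {"M01": ["ACGT", "ACGT", "ACGA"], "M02": ["ACGA", "TTGA"]}
)
```

Keeping each module as a separate method mirrors the description, where each module is configured for exactly one step of the method.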
  • the above-mentioned product can execute the method provided by any embodiment of the present invention, and has the functional modules and effects corresponding to the execution method.
  • The technical solution of this embodiment reads the sequencing data of at least one sample; determines the representative sequence according to the sequencing data; counts the depth information of the representative sequence in at least one sample; and analyzes the representative sequence and the depth information of the representative sequence in at least one sample. In this way, the components can be detected more completely and accurately, and the sensitivity and completeness of the detection are improved.
  • FIG. 3 is a schematic structural diagram of a computer device in the third embodiment of the present invention.
  • Figure 3 shows a block diagram of an exemplary computer device 12 suitable for implementing embodiments of the present disclosure.
  • the computer device 12 shown in FIG. 3 is only an example, and should not bring any limitation to the function and application scope of the embodiment of the present invention.
  • the computer device 12 is represented in the form of a general-purpose computing device.
  • the components of the computer device 12 may include: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
  • For example, these architectures include the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA (EISA) bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • the computer device 12 includes a variety of computer system readable media. These media can be any available media that can be accessed by the computer device 12, including volatile and nonvolatile media, removable and non-removable media.
  • the system memory 28 may include a computer system readable medium in the form of a volatile memory, such as a random access memory (RAM) 30 and/or a cache memory 32.
  • the computer device 12 may include other removable/non-removable, volatile/nonvolatile computer system storage media.
  • the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (not shown in FIG. 3, usually referred to as a "hard drive").
  • each drive can be connected to the bus 18 through one or more data media interfaces.
  • the system memory 28 may include at least one program product, the program product having a set (for example, at least one) of program modules, and these program modules are configured to perform the functions of multiple embodiments of the present invention.
  • a program/utility tool 40 having a set of (at least one) program module 42 may be stored in, for example, the system memory 28.
  • Such program modules 42 include an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination of them, may include an implementation of a network environment.
  • the program module 42 generally executes the functions and/or methods in the embodiments described in the present disclosure.
  • The computer device 12 may also communicate with one or more external devices 14 (such as keyboards, pointing devices, displays 24, etc.), with one or more devices that enable users to interact with the computer device 12, and/or with any device (such as a network card, modem, etc.) that enables the computer device 12 to communicate with one or more other computing devices.
  • This communication can be performed through an input/output (Input/Output, I/O) interface 22.
  • the display 24 does not exist as an independent entity, but is embedded in a mirror surface. When the display surface of the display 24 is not displayed, the display surface of the display 24 and the mirror surface are visually integrated.
  • the computer device 12 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 20.
  • the network adapter 20 communicates with other modules of the computer device 12 through the bus 18.
  • Other hardware and/or software modules can be used in conjunction with the computer device 12, including: microcode, device drivers, redundant processing units, external disk drive arrays, Redundant Arrays of Independent Disks (RAID) systems, tape drives, and data backup storage systems.
  • The processing unit 16 executes a variety of functional applications and data processing by running programs stored in the system memory 28, for example, to implement the data analysis method provided by the embodiment of the present invention: reading sequencing data of at least one sample; determining a representative sequence according to the sequencing data; counting the depth information of the representative sequence in at least one sample; and analyzing the representative sequence and the depth information of the representative sequence in at least one sample.
  • The fourth embodiment of the present invention provides a computer-readable storage medium that stores a computer program which, when executed by a processor, implements the data analysis method provided in the embodiment of the present invention: reading sequencing data of at least one sample; determining a representative sequence according to the sequencing data; counting the depth information of the representative sequence in at least one sample; and analyzing the representative sequence and the depth information of the representative sequence in at least one sample.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
  • the computer-readable storage medium may be, for example, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the above.
  • Computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM) or flash memory, an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, apparatus, or device.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and computer-readable program code is carried therein. This propagated data signal can take many forms, including electromagnetic signals, optical signals, or any suitable combination of the above.
  • the computer-readable signal medium may also be any computer-readable medium other than the computer-readable storage medium.
  • The computer-readable medium may send, propagate, or transmit the program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the computer-readable medium can be transmitted by any suitable medium, including wireless, wire, optical cable, radio frequency (RF), etc., or any suitable combination of the foregoing.
  • the computer program code for performing the operations of the present disclosure can be written in one or more programming languages or a combination thereof.
  • The programming languages include object-oriented programming languages, such as Java, Smalltalk, and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • the program code can be executed entirely on the user's computer, partly on the user's computer, executed as an independent software package, partly on the user's computer and partly executed on a remote computer, or entirely executed on the remote computer or server.
  • the remote computer may be connected to the user computer through any kind of network including LAN or WAN, or may be connected to an external computer (for example, using an Internet service provider to connect through the Internet).

Landscapes

  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Biotechnology (AREA)
  • Biophysics (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Analytical Chemistry (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Theoretical Computer Science (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

The present invention relates to a data analysis method, apparatus and device, and a storage medium. The method comprises: reading sequencing data of at least one sample (S110); determining a representative sequence according to the sequencing data (S120); counting depth information of the representative sequence in the at least one sample (S130); and analyzing the representative sequence and the depth information of the representative sequence in the at least one sample (S140).
PCT/CN2020/119441 2020-02-20 2020-09-30 Data analysis method, apparatus and device, and storage medium WO2021164270A1 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010103766.8 2020-02-20
CN202010103766.8A CN111326213B (zh) 2020-02-20 2020-02-20 Data analysis method, apparatus, device and storage medium

Publications (1)

Publication Number Publication Date
WO2021164270A1 (fr) 2021-08-26

Family

ID=71165312

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/119441 WO2021164270A1 (fr) 2020-02-20 2020-09-30 Data analysis method, apparatus and device, and storage medium

Country Status (2)

Country Link
CN (1) CN111326213B (fr)
WO (1) WO2021164270A1 (fr)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
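CN111326213B (zh) * 2020-02-20 2023-10-03 苏州金唯智生物科技有限公司 Data analysis method, apparatus, device and storage medium
CN112631562B (zh) * 2020-12-01 2022-08-23 上海欧易生物医学科技有限公司 Python-based sample pooling method for second-generation sequencing, application, device, and computer-readable storage medium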

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160085910A1 (en) * 2014-09-18 2016-03-24 Illumina, Inc. Methods and systems for analyzing nucleic acid sequencing data
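CN106916891A (zh) * 2017-02-24 2017-07-04 昆明藻能生物科技有限公司 Method for identifying fatty acid desaturase genes related to DHA biosynthesis in Crypthecodinium using de novo transcriptome analysis
CN109378044A (zh) * 2018-10-18 2019-02-22 东莞博奥木华基因科技有限公司 Method and system for generating individualized medication reports
CN109920484A (zh) * 2019-02-14 2019-06-21 北京安智因生物技术有限公司 Method and system for analyzing gene detection data for a sequencer
CN110246581A (zh) * 2019-07-02 2019-09-17 广东瑞昊生物技术有限公司 Evaluation system based on genetic testing
CN111326213A (zh) * 2020-02-20 2020-06-23 苏州金唯智生物科技有限公司 Data analysis method, apparatus, device and storage medium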

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
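CN106021979A (zh) * 2016-05-12 2016-10-12 北京百迈客云科技有限公司 Human genome resequencing data analysis system and method
CN110349629B (zh) * 2019-06-20 2021-08-06 湖南赛哲医学检验所有限公司 Analysis method for detecting microorganisms using metagenome or metatranscriptome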

Also Published As

Publication number Publication date
CN111326213B (zh) 2023-10-03
CN111326213A (zh) 2020-06-23

Similar Documents

Publication Publication Date Title
Flygare et al. Taxonomer: an interactive metagenomics analysis portal for universal pathogen detection and host mRNA expression profiling
Kuczynski et al. Using QIIME to analyze 16S rRNA gene sequences from microbial communities
Zhu et al. HGTector: an automated method facilitating genome-wide discovery of putative horizontal gene transfers
Jombart et al. Discriminant analysis of principal components: a new method for the analysis of genetically structured populations
WO2021164270A1 (fr) Data analysis method, apparatus and device, and storage medium
Hiseni et al. HumGut: a comprehensive human gut prokaryotic genomes collection filtered by metagenome data
Brinkmann et al. Proficiency testing of virus diagnostics based on bioinformatics analysis of simulated in silico high-throughput sequencing data sets
Makałowski et al. Bioinformatics of nanopore sequencing
Fischer et al. Abundance estimation and differential testing on strain level in metagenomics data
Gliddon et al. Identification of reduced host transcriptomic signatures for tuberculosis disease and digital PCR-based validation and quantification
CN111710364B (zh) Method, apparatus, terminal and storage medium for obtaining flora markers
WO2022121337A1 (fr) Data mining method and apparatus, electronic device, and storage medium
WO2020047453A1 (fr) Systems and methods for analyzing single-cell RNA sequencing data
CN111814033A (zh) Method, apparatus, device and storage medium for determining media information for delivery
Palatnick et al. iGenomics: Comprehensive DNA sequence analysis on your Smartphone
CN110444254B (zh) Detection method, detection system and terminal for flora markers
Phelan et al. Genome-wide host-pathogen analyses reveal genetic interaction points in tuberculosis disease
Hadgu et al. Evaluation of screening tests for detecting Chlamydia trachomatis: bias associated with the patient-infected-status algorithm
US20200381084A1 (en) Identifying salient features for instances of data
WO2020211399A1 (fr) Data sending method and apparatus, computer device, and storage medium
JP2012063885A (ja) Information analysis device, information analysis method, information analysis system, and program
Pan et al. Microbial diversity biased estimation caused by intragenomic heterogeneity and interspecific conservation of 16s rrna genes
Zhang et al. Inferring transmission heterogeneity using virus genealogies: Estimation and targeted prevention
Mallawaarachchi et al. Genome-wide association, prediction and heritability in bacteria with application to Streptococcus pneumoniae
Wyllie et al. M. tuberculosis microvariation is common and is associated with transmission: analysis of three years prospective universal sequencing in England

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 20920580; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 EP: PCT application non-entry in European phase (Ref document number: 20920580; Country of ref document: EP; Kind code of ref document: A1)
Kind code of ref document: A1