WO2024101492A1 - Microorganism analysis system and microorganism analysis method using next-generation sequencing - Google Patents

Info

Publication number
WO2024101492A1
Authority
WO
WIPO (PCT)
Prior art keywords
read
base point
quality score
file
files
Application number
PCT/KR2022/017799
Other languages
English (en)
Korean (ko)
Inventor
김필수
오재현
최용진
어해석
김동원
김태환
이상용
Original Assignee
엘지전자 주식회사
Application filed by 엘지전자 주식회사
Priority to PCT/KR2022/017799
Publication of WO2024101492A1

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G16B30/20 Sequence assembly
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B50/00 ICT programming tools or database systems specially adapted for bioinformatics

Definitions

  • This example relates to a method for recombining the base sequence of a microorganism, and more specifically, to a method and system for analyzing the base sequence of a microorganism using a next-generation sequencing method.
  • eDNA is the DNA of microorganisms collected from various environmental samples and serves as a bio-indicator for diagnosing the degree of contamination in the surrounding environment. Because the DNA sequence differs for each microbial species, various technologies have been proposed that collect samples from the environment of interest and determine the types of microorganisms they contain.
  • Databases that distinguish various species of microorganisms have been developed and shared. By extracting microbial DNA sequence information from environmental samples of interest and comparing it with the information in such a database, it is possible to identify the microorganisms in the environment.
  • Sequence reads of microbial samples that are read and delivered through Illumina's NGS equipment are generated in a standardized format.
  • The data of each sample is cut into read units within a batch, the sequence is read in an amplified state, and each set of sequence read data is recorded as a FASTQ file.
  • Bioinformatics technology for generating desired DNA sequence information by processing the FASTQ file of each sample and related microbial classification technology have also been proposed in various ways.
  • US patent US 11335436B2 proposes a classification technology that allows simultaneous comparison of multiple targets for multiple microorganisms.
  • This embodiment provides a microbial analysis method that processes a FASTQ file using NGS technology, specifies only valid data through noise filtering, and applies an analysis algorithm only to the valid data.
  • The second task of this embodiment is to provide a microbial analysis method that can select a truncation location for a FASTQ file produced by NGS technology by considering the total read count.
  • The third task of this embodiment is to provide a microbial analysis method that corrects quality by selecting the truncation position in consideration of both quantitative and qualitative factors, so that only positions judged to be valid across a plurality of reads are analyzed.
  • This embodiment includes the steps of: reading a first group including a plurality of forward read files for one sample and a second group including a plurality of reverse read files; performing, for each of the first group and the second group, truncation based on the read count and quality score of the read files constituting each group; performing pairing by detecting an overlap section between the truncated forward read file and reverse read file for one ID; merging the forward read file and reverse read file for the paired ID to create an integrated read file; and matching the integrated read file with a standard DNA base sequence to specify the microorganism.
  • The forward read file and reverse read file for the one ID may be NGS-analyzed FASTQ files.
  • The forward read file and reverse read file for the one ID may be generated to correspond to each of a plurality of reads divided through NGS analysis of a microbial sample extracted from the target environment.
  • Each base point of the forward read file and the reverse read file may include a quality score.
  • The truncating step may include: defining a first range based on the read counts at each base point of a plurality of read files for either the first group or the second group; reading the quality score for each base point of the plurality of read files within the first range and determining the base point as valid data if the quality score is greater than or equal to a predetermined range; and, when the valid data ends, truncating after the base point that is not valid data.
  • A region having a read count of at least a predetermined percentage of the read count of the first base point may be defined as the first range.
  • A region having a read count that satisfies 0.2 to 0.3% or more of the read count of the first base point may be defined as the first range.
  • The step of determining valid data may be performed only at base points within the first range.
  • The step of determining the valid data may include: when the quality score at the current base point is greater than or equal to a first value, continuing to determine valid data by reading the quality score of the next base point; and when the quality score at the current base point falls in a first range lower than the first value, determining the current base point as valid data if any of the quality scores in a first window after the current base point is greater than or equal to the first value.
  • If the quality score at the current base point falls in the first range lower than the first value and none of the quality scores in the first window after the current base point is greater than or equal to the first value, the interval in which the quality scores within a subsequent second window remain in the first range can be determined as valid data.
  • Valid data may be determined using different methods depending on the location of the current base point.
  • When the quality score at the current base point is lower than the first range: if the position of the current base point is less than a reference position, the data up to the current base point is determined as the valid data; if the position is at or above the reference position and the quality score of the current base point is less than a reference value, the data up to the current base point is determined as the valid data.
  • The reference position may be below the first range, and the reference value may be greater than the first range.
  • the integrated read file may be matched from the standard DNA sequence database corresponding to the target environment among the standard DNA sequence databases for each individual environment.
  • the step of matching the integrated read file with the standard DNA sequence database can specify both the type and amount of microorganisms matched with the integrated read file.
  • the history may be updated and transmitted to the user terminal.
  • an alarm can be sent to the user terminal.
  • This embodiment includes: an input unit that receives a set of NGS analysis files for the target environment from an external NGS server; a preprocessor that reads, from the set of NGS analysis files, a first group including a plurality of forward read files and a second group including a plurality of reverse read files, performs truncation on each of the first group and the second group based on the read count and quality score of the read files forming each group, and merges the forward read file and reverse read file for one ID to generate an integrated read file; a classifier that receives the integrated read file from the preprocessor and matches it with a standard DNA sequence to specify microorganisms; and an information processing unit that processes the specified microorganism information and transmits it to a user terminal.
  • The forward read file and reverse read file for the one ID are NGS-analyzed FASTQ files, and each base point of each forward read file and reverse read file may include a quality score.
  • The preprocessor defines a first range based on the read count at each base point of a plurality of read files for either the first group or the second group, reads the quality score for each base point of the plurality of read files within the first range, determines the base point to be valid data if the quality score is at or above a predetermined range, and can truncate after the base point that is not valid data.
  • FIG. 1 is a diagram showing a microorganism analysis system according to this embodiment.
  • Figure 2 is a configuration diagram of the microorganism analysis device of Figure 1.
  • Figure 3 is a flowchart showing the microorganism analysis method of Figure 2.
  • Figures 4a and 4b are conceptual diagrams showing the characteristics of the DNA sequence of each microorganism for NGS analysis.
  • Figures 4c and 4d are graphs showing the quality degradation of sequence read files.
  • Figure 5 shows a FASTQ file by NGS analysis.
  • Figure 6 is a flowchart showing the pretreatment steps of the microbial analysis device.
  • Figure 7 is a graph showing the sequence read count distribution in Table 1.
  • Figure 8 is a detailed flowchart of the quality score analysis step of Figure 6.
  • Figure 9 is a graph showing the standard value of the quality score.
  • FIG. 10 is a table showing a method for specifying a common location in FIGS. 6 to 8.
  • Figures 11A and 11B are diagrams showing before and after a merge operation of a FASTQ file of paired sequence reads.
  • NGS stands for Next Generation Sequencing.
  • A target sample refers to a sample that is the target of NGS analysis.
  • The target sample may be a biological sample of microorganisms collected from a specific environment, that is, the target environment 300.
  • NGS data for each sample can be provided as a pair of files.
  • The microbial analysis system of this embodiment is an analysis system that collects microbial samples from the target environment 300, performs NGS analysis on them, and then processes and matches the NGS-analyzed FASTQ files so as to improve matching accuracy.
  • The FASTQ file is a standardized file format in which the nucleotide sequence of each read of each sample is recorded. Each read is provided as a pair of FASTQ files, one for the forward read (read from 5' to 3') and one for the reverse read (read from 3' to 5').
  • The microorganism analysis system includes an NGS server 400 that obtains a microbial sample collected from the target environment 300 and performs NGS analysis; a microbial analysis device 100 that receives a FASTQ file from the NGS server 400, processes it, matches it with a specific microorganism, and provides information about the microbial sample and a response method; and a user terminal 200.
  • The target environment 300 may be any of the surrounding environments where the user is located or that the user is interested in. It may be a home, in particular an area within the home that is vulnerable to microorganisms, such as a kitchen, refrigerator, or sink. Alternatively, it may be a specific environment within a business location, such as a display stand or countertop in a retail area such as a convenience store or restaurant.
  • The user terminal 200 is a device capable of wired or wireless communication that can receive data from the microorganism analysis device 100, and may be a tablet PC, PDA (Personal Digital Assistant), laptop, cellular phone, PCS (Personal Communication Service) phone, handheld PC, GSM (Global System for Mobile) phone, W-CDMA (Wideband CDMA) phone, CDMA-2000 phone, or smartphone.
  • The user terminal 200 includes a display device capable of displaying the final microbial information from the microbial analysis device 100, and may have installed an application that can receive that information in various forms.
  • The user terminal 200 may be able to respond to the microorganism, analyze the results, and determine the expected recovery time.
  • The NGS server 400 that performs NGS analysis collects DNA from the collected microbial sample, amplifies reads obtained by cutting the DNA to a predetermined length, reads bases from both ends of each read, and generates and provides a FASTQ file.
  • Various NGS servers 400 can be applied; as an example, Illumina's NGS equipment can be applied.
  • The microbial analysis device 100 of this embodiment receives a FASTQ file from the NGS server 400, processes the FASTQ file, and matches it with the reference base sequences in the databases of a plurality of classifiers 140, each built for a category, to identify the microorganisms in each FASTQ file.
  • The microorganism analysis device 100 of this embodiment stores the base sequences of each identified microorganism in the database of each category, and strengthens the matching model of each classifier 140 to enable gradually optimized modeling.
  • The microorganism analysis device 100 of this embodiment can detect, without omission, the various microorganisms that can be found in a specific environment around the user by processing a set of FASTQ files in which their DNA has been amplified and read together.
  • The microbial analysis device 100 does not perform matching on all of the read files. Instead, it takes into account the read counts and quality scores at each position across one set of FASTQ files, filters out everything except the valid data, and then performs the analysis, which increases the matching probability and reduces computation.
  • FIG. 2 is a configuration diagram of the microorganism analysis device 100 of FIG. 1, and FIG. 3 is a flowchart showing the microorganism analysis method of FIG. 2.
  • The microbial analysis device 100 includes a communication unit 110 including an input unit 111 and an output unit 113, a preprocessing unit 120, a normalization module 130, a classifier 140, and a processing unit 150.
  • The communication unit 110 is a communication module that communicates with the NGS server 400 and the user terminal 200 using wired or wireless communication, and can vary depending on the designated network.
  • The network can use wireless communication technologies such as IEEE 802.11 WLAN, IEEE 802.15 WPAN, UWB, Wi-Fi, Zigbee, Z-Wave, and Bluetooth, and at least one of these communication technologies can be applied.
  • the pre-processing unit 120 processes the FASTQ file input through the input unit 111 and provides it in a state that can be normalized and matched.
  • the preprocessor 120 may remove meaningless data while leaving only valid data through truncation of the received FASTQ file, and generate one merged read by pairing each FASTQ file of one ID.
  • The merged reads of each set are provided to the normalization module 130, undergo normalization, and are then matched against the model of the classifier 140.
  • The preprocessor 120 divides one set into a group of forward read files and a group of reverse read files, sets for each group a common location at which truncation can be performed on its read files, and truncates the read files of that group at the common location.
  • the common location can be set by considering both the read count and quality score for the read files of each group.
  • When the diversity of a plurality of received sets of read files is analyzed, differing sequencing depths (the amount of information about the microbial community) for each sample can produce over- or under-estimated diversity results, so the normalization module 130 performs normalization to equalize the amount of information.
  • The normalization module 130 can perform diversity analysis using the optimal amount of information by setting the read level for each sample to the maximum within the limit that preserves as much information as possible.
  • A specific diversity table can be loaded according to the target environment 300 of each sample, and the normalization module 130 can be activated according to the loaded diversity table.
  • The diversity table is built as a database by analyzing previously held data, deriving the degree of diversity for each read level, and storing the read-diversity relationship only for diversity that is above a certain range relative to the diversity saturation value.
  • Each diversity table is trained so that it becomes increasingly specialized for the corresponding environment.
  • The filtered final reads are loaded and the corresponding diversity table data is read. Among the final read values in the diversity table for which the number of reads filtered out is less than a first critical range of the total number of reads, the largest final read value is selected, and normalization is performed with it.
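  • As a rough illustration of equalizing the amount of information across samples, the sketch below subsamples every sample to one common read depth (rarefaction). Random subsampling is an assumption made here for illustration; the text does not state how the equalization is carried out, and the diversity-table rule for choosing the depth is not reproduced, so `depth` is simply passed in. The names `rarefy`, `samples`, and `seed` are hypothetical.

```python
import random
from typing import Dict, List


def rarefy(samples: Dict[str, List[str]], depth: int, seed: int = 0) -> Dict[str, List[str]]:
    """Subsample every sample to the same number of reads.

    Assumption: random subsampling without replacement; samples that have
    fewer than `depth` reads are dropped rather than padded.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    normalized: Dict[str, List[str]] = {}
    for name, reads in samples.items():
        if len(reads) < depth:
            continue  # too shallow for the chosen depth
        normalized[name] = rng.sample(reads, depth)
    return normalized


# Toy input: read identifiers stand in for merged reads.
toy_samples = {
    "s1": [f"read{i}" for i in range(100)],
    "s2": [f"read{i}" for i in range(50)],
    "s3": [f"read{i}" for i in range(5)],
}
normalized = rarefy(toy_samples, depth=50)
```

  • A larger `depth` keeps more information per sample but drops more shallow samples, which is the trade-off the diversity table is described as managing.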
  • The classifier 140 analyzes which species and genera of microorganisms are present in the sample by comparing and matching the selected final reads with a reference sequence database, using the classifier 140 for the relevant field.
  • The classifier 140 separately constructs a reference sequence database for each environment and matches the final reads against the reference sequence database of the corresponding environment.
  • The reference sequence databases for each environment can be classified into household, hospital, retail, and food production facilities, similar to the diversity analysis, but can also be classified differently.
  • By building a classifier 140 for a specific environment, that is, a reference sequence database, and running a classification algorithm that matches against it, the comparison can be limited to microorganisms specific to that environment.
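  • A minimal sketch of matching against an environment-specific reference database is given below. It assumes exact string matching between a merged read and a reference sequence, which is a simplification chosen for illustration; the text does not specify the matching algorithm. The names `classify_reads` and `ReferenceDb`, the environment keys, and the taxon labels are hypothetical placeholders.

```python
from collections import Counter
from typing import Dict, Iterable

# Hypothetical layout: reference sequence -> taxon name.
ReferenceDb = Dict[str, str]


def classify_reads(
    merged_reads: Iterable[str],
    databases: Dict[str, ReferenceDb],
    environment: str,
) -> Counter:
    """Count matched taxa using only the database of one environment.

    Assumption: a read matches when it is identical to a reference sequence.
    """
    reference = databases[environment]  # other environments are never searched
    counts: Counter = Counter()
    for read in merged_reads:
        taxon = reference.get(read)
        if taxon is not None:
            counts[taxon] += 1
    return counts


# Toy databases with placeholder sequences and taxon labels.
toy_databases = {
    "household": {"ACGT": "taxon_A", "GGCC": "taxon_B"},
    "hospital": {"ACGT": "taxon_C"},
}
household_counts = classify_reads(["ACGT", "ACGT", "GGCC", "TTTT"], toy_databases, "household")
```

  • The returned counter gives both the type and the amount of each matched microorganism, and restricting the lookup to one database is what limits the comparison to a single environment.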
  • a reference sequence database specialized for each field can be initially created by selecting and processing the genetic information of individual 16S rRNA genes registered in the publicly available National Center for Biotechnology Information (NCBI) and classifying it for each environment.
  • When the processing unit 150 obtains information on the type and amount of microorganisms matched to the final reads by the classifier 140, it processes the information and provides it to the user terminal 200.
  • The processing unit 150 can perform both alpha diversity analysis and beta diversity analysis on the information about the received final reads.
  • Alpha diversity analysis analyzes and displays the diversity level of each set of final reads.
  • Beta diversity analysis analyzes and provides the degree of dissimilarity between sets of final reads, and the results can be provided in tables and graphs.
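  • The two kinds of diversity analysis can be illustrated with common metrics. The sketch below uses the Shannon index for alpha diversity and Bray-Curtis dissimilarity for beta diversity; the text does not name any specific metric, so both choices are assumptions, as are the function names and the toy counts.

```python
import math
from typing import Dict


def shannon_index(counts: Dict[str, int]) -> float:
    """Alpha diversity of one sample (Shannon index, natural logarithm)."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in counts.values() if c > 0)


def bray_curtis(a: Dict[str, int], b: Dict[str, int]) -> float:
    """Beta diversity between two samples (0 = identical, 1 = nothing shared)."""
    taxa = set(a) | set(b)
    shared = sum(min(a.get(t, 0), b.get(t, 0)) for t in taxa)
    total = sum(a.values()) + sum(b.values())
    if total == 0:
        return 0.0
    return 1.0 - 2.0 * shared / total


# Toy taxon counts for two samples (placeholder labels).
kitchen = {"taxon_A": 50, "taxon_B": 50}
sink = {"taxon_C": 100}
alpha_kitchen = shannon_index(kitchen)
beta_kitchen_sink = bray_curtis(kitchen, sink)
```

  • Alpha diversity is computed per sample, while beta diversity always takes a pair of samples, which is why the latter lends itself to tables or distance matrices.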
  • The processing unit 150 can also provide, through history analysis, the trend of microbial changes in the target environment 300 and a response method.
  • The output unit 113 transmits the result data provided by the processing unit 150 to the designated user terminal 200, and can issue a user alarm if the detected microorganisms include dangerous microorganisms above a predetermined level.
  • the microbial analysis device 100 may be composed of an embedded system board equipped with a memory card as a data storage unit (not shown), a library file for microbial analysis, and a signal processing device.
  • a memory card capable of storing output signal data is inserted into the embedded system board, and the memory card stores the system OS, driving program, and library files for analysis.
  • signal processing for analysis of the plurality of final reads is calculated through comparative analysis with the library file in the CPU of the embedded system board, and the analysis results are stored back in the memory card.
  • the communication unit 110 can be mounted together on such an embedded system board, but it is not limited to this.
  • The microorganism analysis method of the microorganism analysis device 100 will be described with reference to FIGS. 3 to 11.
  • Figure 3 is a flowchart showing the microorganism analysis method of Figure 2, Figures 4a and 4b are conceptual diagrams showing the characteristics of the DNA sequence of each microorganism for NGS analysis, Figures 4c and 4d are graphs showing the deterioration of the quality score of the sequence read file, and Figure 5 shows a FASTQ file based on NGS analysis.
  • Microorganisms that can be found in a specific environment largely include bacteria and fungi (molds), and these number up to hundreds of thousands of species.
  • The DNA base sequences of each bacterium and fungus are stored in the database of the classifier 140, and such DNA base sequences are stored separately according to each environment.
  • The DNA sequence information of the V4 or V3 to V4 region of the 16S rRNA gene is generally used to distinguish each species, as shown in the DNA sequence of Figure 4b.
  • The length of the V4 region DNA sequence is approximately 300 bp (base point) or less.
  • The length of the V3 to V4 region DNA sequence is approximately 500 bp or less.
  • The read length that NGS equipment can sequence is up to 600 bp; longer reads either cannot be read, require expensive equipment, or have significantly lower accuracy, making them less useful.
  • The forward read and reverse read of each ID are truncated to remove some bp of the FASTQ file, then paired and merged, so that the length of the final merged read is 468 bp or less.
  • Truncation is performed by selecting a common position for each group, namely a first group including a plurality of forward reads and a second group including a plurality of reverse reads within a set, and cutting the plurality of reads within each group at that common position, thereby deleting invalid bp data.
  • Truncation is defined as a filtering operation that deletes all data in bp after a common position for a plurality of reads in one group.
  • The common position shows a characteristic of decreasing as the bp moves backward from the first position.
  • Truncation can also reflect quantitative factors by comparing the read counts, that is, the number of reads in one group in which each bp exists.
  • the microbial analysis method of this embodiment obtains a set of FASTQ files for one sample of the target environment 300 from the NGS server 400, as shown in FIG. 3 (S10).
  • a set of FASTQ files is assigned a unique ID for each read, and FASTQ files for each ID are transmitted as a pair, as shown in Figures 5A and 5B.
  • the left side of Figure 5 is a forward read file, and the right side is a reverse read file with the same ID.
  • When the forward read file and reverse read file of one ID are compared, the forward read file records the nucleotide sequence of one sample read from 5' to 3', and the reverse read file records the sequence in the opposite direction, from 3' to 5'.
  • Each FASTQ file is created in the same format. The first row is the ID of the corresponding sample read, and the forward read and the corresponding reverse read have the same ID.
  • The second row shows the base sequence in order for each bp (base point), and the forward read and reverse read are written as complements of each other.
  • The third row is a separator, and the fourth row consists of encoded characters indicating the quality score at each bp of the base sequence.
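  • The four-row record layout described above can be sketched in Python as follows. The quality characters are decoded with the Phred+33 convention, which is an assumption (it is common for Illumina output but is not stated in the text); `FastqRecord`, `parse_fastq`, and the sample record are hypothetical and invented for illustration.

```python
import io
from dataclasses import dataclass
from typing import Iterator, List, TextIO


@dataclass
class FastqRecord:
    read_id: str          # row 1: ID of the read, without the leading '@'
    sequence: str         # row 2: base sequence, one character per bp
    qualities: List[int]  # row 4: decoded quality score for each bp


def parse_fastq(handle: TextIO, offset: int = 33) -> Iterator[FastqRecord]:
    """Yield records from a four-rows-per-read FASTQ stream.

    Assumption: quality characters use the Phred+33 encoding.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            break  # end of stream
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # row 3: separator line
        quality_line = handle.readline().rstrip("\n")
        if len(sequence) != len(quality_line):
            raise ValueError(f"length mismatch in record {header}")
        qualities = [ord(ch) - offset for ch in quality_line]
        yield FastqRecord(header[1:], sequence, qualities)


# Invented single-record example.
example = io.StringIO("@read1\nACGT\n+\nIIG#\n")
records = list(parse_fastq(example))
```

  • Because the forward and reverse files carry the same ID in the first row, records parsed from the two files can later be paired by `read_id`.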
  • The preprocessor 120 searches for a common position for each group, that is, the forward read group and the reverse read group, in the set of FASTQ files received through the input unit 111, and filters the noise portion of each sequence read file by truncating each forward read and reverse read at the common position of its group (S20).
  • Pairing is then performed to determine whether there is an overlap section between the forward read file and the reverse read file, which are the filtered FASTQ files of one ID (S30).
  • the preprocessor 120 merges the FASTQ file of one ID for which pairing has been completed and generates one merged read file (S40).
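  • The pairing (S30) and merging (S40) steps can be sketched as follows. The sketch assumes that the overlap is an exact match between the end of the forward read and the start of the reverse-complemented reverse read, with a hypothetical minimum overlap of 10 bp; mismatch tolerance and quality-aware merging, which practical tools usually add, are left out. All function names are illustrative.

```python
from typing import Optional

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Return the sequence as it would be read on the opposite strand."""
    return seq.translate(COMPLEMENT)[::-1]


def find_overlap(forward: str, reverse_rc: str, min_overlap: int = 10) -> int:
    """Length of the longest suffix of `forward` equal to a prefix of `reverse_rc`."""
    longest = min(len(forward), len(reverse_rc))
    for length in range(longest, min_overlap - 1, -1):
        if forward[-length:] == reverse_rc[:length]:
            return length
    return 0  # no overlap section found: pairing fails


def merge_pair(forward: str, reverse: str, min_overlap: int = 10) -> Optional[str]:
    """Merge one forward/reverse pair into a single read, or return None."""
    reverse_rc = reverse_complement(reverse)
    overlap = find_overlap(forward, reverse_rc, min_overlap)
    if overlap == 0:
        return None
    return forward + reverse_rc[overlap:]


# Invented 25 bp fragment, read 18 bp from each end.
fragment = "ACGTTGCAAGGCTTAACCGGTTAGC"
forward_read = fragment[:18]
reverse_read = reverse_complement(fragment[7:])
merged = merge_pair(forward_read, reverse_read)
```

  • In this toy case the two reads share an 11 bp overlap, so merging reconstructs the original fragment; pairs without a sufficient overlap return `None` and would be excluded.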
  • the normalization module 130 normalizes the data for each sample.
  • Figure 6 is a flowchart showing the preprocessing steps of the microbial analysis device, Figure 7 is a graph showing the sequence read count distribution in Table 1, Figure 8 is a detailed flowchart of the quality score analysis step in Figure 6, Figure 9 is a graph showing reference values of the quality score, and FIG. 10 is an example showing a method of specifying a common position in FIGS. 6 to 8.
  • The preprocessor 120 opens the plurality of forward read files and reverse read files of one set, that is, one sample, as shown in FIGS. 5A and 5B.
  • The read count for each bp is computed separately for the first group including the forward read files and the second group including the reverse read files.
  • For example, the read count of the first bp is obtained by counting the number of forward read files in which a base exists at the first bp.
  • At the front positions the read count of each bp nearly matches the total number of reads, and the read count may decrease toward the back end.
  • The read count may decrease rapidly once the bp position passes a certain level, and the distribution may be as shown in Table 1 and FIG. 7.
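  • Counting, for each bp position, how many reads of one group still have a base at that position can be sketched as below. The function name and the toy reads are hypothetical.

```python
from typing import List


def read_counts_per_bp(reads: List[str]) -> List[int]:
    """Return, for every bp position, the number of reads that reach it.

    Index 0 of the result corresponds to the first bp.
    """
    if not reads:
        return []
    longest = max(len(read) for read in reads)
    counts = [0] * longest
    for read in reads:
        for position in range(len(read)):
            counts[position] += 1
    return counts


# Toy group of four reads with different lengths.
toy_counts = read_counts_per_bp(["ACGT", "ACG", "AC", "ACGT"])
```

  • The resulting list is non-increasing by construction, which matches the described behavior of read counts falling toward the back end.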
  • Table 1 shows, for the plurality of forward reads belonging to the first group, the read count indicating in how many reads a base exists at each bp, together with the distribution of quality scores at that bp.
  • The quality score at the 50% point of the distribution, which accounts for the largest number of reads, can be defined as the average quality score of the corresponding bp, and the average quality score can be used later in the step of specifying a common location based on the quality score.
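  • One way to read the "50% point of the distribution" is as the median quality score at each bp. The sketch below follows that interpretation, which is an assumption; it uses the lower median so that the result is always a score that actually occurs. The function name and the toy scores are hypothetical.

```python
from statistics import median_low
from typing import List


def representative_quality(qualities_per_read: List[List[int]]) -> List[int]:
    """Return one representative quality score per bp position.

    Assumption: the representative ("average") score is the median of the
    scores of all reads that have a base at that position.
    """
    if not qualities_per_read:
        return []
    longest = max(len(q) for q in qualities_per_read)
    result = []
    for position in range(longest):
        column = [q[position] for q in qualities_per_read if position < len(q)]
        result.append(median_low(column))
    return result


# Toy quality scores for three reads of different lengths.
toy_quality = representative_quality([[38, 38, 36], [38, 36], [37, 38, 30]])
```

  • Reads that are shorter than a given position simply do not contribute to that position, mirroring the way the read count shrinks toward the back end.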
  • Figure 7 graphically shows the read count distribution at each bp in Table 1.
  • The preprocessor 120 calculates the threshold bp(th) shown in FIG. 7 as the first reference value and determines the primary range so that truncation is considered only at bp in front of the first reference value.
  • When the read count at the first bp is a, a bp whose read count is less than 0.2 to 0.3 times a, specifically less than 0.25 times a, can be selected as the first reference value.
  • Figure 10 shows the first reference value (n1). If the read count of the first bp is 30000, the 249th bp, at which the read count falls to 5000, can be set as the first reference value (n1).
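  • The selection of the first reference value from the read counts can be sketched as below, using the 0.25 ratio mentioned above. The step-shaped toy distribution is invented so that it reproduces the 30000, 5000, and 249th-bp figures of the example; `first_reference_bp` is a hypothetical name.

```python
from typing import List


def first_reference_bp(counts: List[int], ratio: float = 0.25) -> int:
    """Return the 1-based bp position where the read count first falls
    below `ratio` times the read count of the first bp.

    If the count never falls that far, the last position is returned.
    """
    if not counts:
        return 0
    threshold = counts[0] * ratio
    for index, count in enumerate(counts):
        if count < threshold:
            return index + 1
    return len(counts)


# Invented distribution: 248 positions at 30000 reads, then a drop to 5000.
toy_distribution = [30000] * 248 + [5000] * 10
n1 = first_reference_bp(toy_distribution)
```

  • With a first-bp count of 30000 the threshold is 7500 reads, so the drop to 5000 at the 249th bp is the first position that qualifies.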
  • The preprocessor 120 then specifies a common location for truncation of the plurality of sequence reads in each group, based on the average quality score of each group for which the primary range has been set (S23).
  • A first value, which is the highest reference value, is set for the average quality score of one group, and a first range having a predetermined size is defined in a range lower than the first value.
  • The first range has a lower limit and an upper limit: the lower limit is a second value, which is the minimum value for specifying a common location, and the upper limit is a third value.
  • A range greater than the third value and less than the first value may be defined as a second range.
  • The preprocessor 120 receives the data in Table 1 for the first group of sequence reads for which the primary range has been determined, for example, the data shown in FIG. 10 (S231).
  • The preprocessor 120 examines the average quality score of each bp in Figure 10, starting from the first bp, and determines which quality score level of Figure 9 the average quality score of each position corresponds to (S232).
  • If the average quality score of the first bp is greater than or equal to the first value, it is determined to be valid data, and it is then determined whether the average quality score of the next bp, that is, the second bp, is greater than or equal to the first value.
  • This operation is repeated for as long as consecutive bp remain at or above the first value.
  • A score falls in the first range when it is less than or equal to the third value and exceeds the second value.
  • For example, when the first value is 38, the second value is 34, and the third value is 36, the quality scores in the first range are the two values 35 and 36.
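  • With the example values above, each average quality score falls into one of four bands. The sketch below encodes that classification for integer scores; the band labels and the function name are hypothetical.

```python
FIRST_VALUE = 38   # highest reference value
SECOND_VALUE = 34  # lower limit of the first range (exclusive)
THIRD_VALUE = 36   # upper limit of the first range (inclusive)


def quality_band(score: int) -> str:
    """Classify an average quality score using the example thresholds."""
    if score >= FIRST_VALUE:
        return "at_or_above_first_value"
    if score > THIRD_VALUE:
        return "second_range"   # greater than the third value, less than the first
    if score > SECOND_VALUE:
        return "first_range"    # 35 and 36 with the example values
    return "below_first_range"


bands = [quality_band(score) for score in (38, 37, 36, 35, 34)]
```

  • Keeping the three thresholds as named constants makes it easy to try other reference values without touching the classification logic.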
  • the first range it is determined whether there is a section in which at least one of the average quality scores of the bp within the first window B rises above the first value from the bp next to the corresponding bp (S235).
  • the first window may be 3bp to 5bp, but is not limited thereto.
  • the 241st bp is defined as the second point (n2) . It is determined whether the average quality score increases to 38 or more at each bp in the first window (B) after the second point (n2).
  • the 241st bp is considered a valid value and the quality score of the next bp is determined.
  • the second window (C) may be the same size as the first window (B) or a different size and may be, for example, 4 to 6 bp.
  • when the second window (C) is 4 bp, only the area in which bps with a value of 35 to 36 appear consecutively among the average quality scores up to the 248th bp is set as a valid value.
  • if the average quality score of the corresponding bp is not within the first range, the handling is divided into three cases.
  • the second range corresponds to a quality score of 37. Therefore, if the quality score is 37, bps are kept for as long as the second range continues, and the next bp is specified as the common position (S238).
  • filtering is applied within the second window (C), starting from the bp next to the corresponding bp: only the area in which values within the first range appear consecutively is set as a valid value, and the bp following the valid values is specified as the common position (S236).
  • the second window (C) may be the same as or different from the first window (B) and may be, for example, 4 to 6 bp.
  • the common position is determined differently depending on the location of the corresponding bp.
  • if the position of the corresponding bp is less than the ⁇ value (S242), the corresponding bp is specified as the common position (S243).
  • the ⁇ value may be smaller than the first point n1, which bounds the primary range, that is, the range limited by the read count, and may be, for example, between 200 and 220.
  • the common position is specified according to the average quality score of the corresponding bp (S244).
  • the corresponding bp is specified as a common location and truncation is set to be performed after the corresponding bp.
  • the fourth value may be within the second range, that is, it may be a value greater than the third value.
  • truncation is performed at the 246th bp, which is the common position (n4), and subsequent bps are removed.
  • the common location of the second group may be different from the common location of the first group. Accordingly, the forward read and reverse read of the same truncated ID may have different lengths, but the difference may be very small.
  • the preprocessor 120 determines whether an overlap section (OS) exists in the two read files.
  • based on the overlap section (OS), a merged read is created in which the end of the forward read and the end of the reverse read overlap at the overlap section (OS).
  • merged reads can be generated by converting the base sequence of the reverse read into a complementary base sequence and arranging it after the overlap section (OS) (S40).
  • the merged read file created in this way is created by also merging the quality score in the fourth row, as shown in Figure 11b.
  • each quality score is read corresponding to the position of each base sequence, so the value itself does not change.
  • the merged read file includes an augmented base sequence having a length of bp excluding the overlap section (OS) of the two reads, and the merged read file with the augmented length is output as the final read.
  • because the sequencing depth (the amount of information on the microbial community) differs from sample to sample, diversity analysis results may be over- or under-estimated, so normalization is performed to equalize the amount of information.
  • a specific diversity table is loaded with reference to environmental information, and normalization is performed by the normalization module 130 according to the loaded diversity table.
  • the final read selected by the classifier 140 is compared with the reference sequence database by applying the field-specific classifier 140, in order to analyze which species and genus of microorganisms are present in the sample (S60).
  • the reference database is read from the home classifier 140, each modeling is performed, and matching is performed between the nucleotide sequence of the final read and each DNA nucleotide sequence in the home reference database.
  • when the microorganism analysis device 100 obtains information on the type and amount of microorganisms matched to the final read, it processes the information and provides it to the user terminal 200 (S70).
  • the processing unit 150 can provide microbial change trends and response methods in the target environment 300 through history analysis.
  • the output unit 113 transmits the result data provided from the processing unit 150 to the designated user terminal 200, and can raise a user alarm if the detected microorganisms include dangerous microorganisms above a predetermined level.
  • the matching probability can be improved by selecting and matching valid single reads with high quality scores.
  • the type and amount of matched microorganisms are detected together, visualized, and provided to the user, enabling immediate response; in addition, applying the data back to the database of each classifier 140 as a reference makes the modeling of each classifier 140 more adaptable.
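
The quality-score scan in the list above can be sketched roughly as follows. This is a simplified illustration, not the claimed method: the function name `find_common_position` and its parameters are hypothetical, the thresholds 38, 34 and 36 are the example values given in the list, and the location-dependent rule (the ⁇ value and the fourth value) is left out because the list does not state it completely. How the bp at the returned index is treated during truncation is also an assumption.

```python
from typing import Sequence

FIRST_VALUE = 38   # example first value: scores at or above it count as valid
SECOND_VALUE = 34  # example second value: exclusive lower limit of the first range
THIRD_VALUE = 36   # example third value: inclusive upper limit of the first range


def in_first_range(score: float) -> bool:
    """First range: above the second value, up to the third value (35-36)."""
    return SECOND_VALUE < score <= THIRD_VALUE


def in_second_range(score: float) -> bool:
    """Second range: above the third value, below the first value (37)."""
    return THIRD_VALUE < score < FIRST_VALUE


def find_common_position(
    avg_q: Sequence[float], first_window: int = 4, second_window: int = 4
) -> int:
    """Return the 0-based index of the common position (hypothetical helper).

    Bps before the returned index are kept; the list's location rule
    (the garbled threshold value and the fourth value) is NOT modeled here.
    """
    n = len(avg_q)
    i = 0
    while i < n:
        q = avg_q[i]
        if q >= FIRST_VALUE:
            i += 1  # valid bp, move on to the next one
            continue
        if in_first_range(q):
            # First window (B): does any following bp recover to the first value?
            window = avg_q[i + 1 : i + 1 + first_window]
            if any(s >= FIRST_VALUE for s in window):
                i += 1  # temporary dip: treat this bp as valid
                continue
            # Second window (C): keep only the consecutive first-range run.
            limit = min(n, i + 1 + second_window)
            j = i
            while j < limit and in_first_range(avg_q[j]):
                j += 1
            return j
        if in_second_range(q):
            # Keep bps while the second range continues, then cut.
            j = i
            while j < n and in_second_range(avg_q[j]):
                j += 1
            return j
        return i  # below the first range: simplified to an immediate cut
    return n  # no low-quality position found


scores = [40] * 5 + [36] + [39] * 3 + [35, 36, 33, 30]
print(find_common_position(scores))  # 11
```

In the example, the dip to 36 at index 5 is kept because the first window recovers to 39, whereas the run 35, 36 starting at index 9 never recovers, so the common position is the bp right after that run.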

Landscapes

  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Biophysics (AREA)
  • Theoretical Computer Science (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioethics (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The present embodiment relates to a microorganism analysis device comprising: an input unit for receiving an NGS analysis file set for a target environment from an external NGS server; a preprocessing unit that reads a first group including a plurality of forward read files and a second group including a plurality of reverse read files from the NGS analysis file set, truncates each of the first group and the second group on the basis of the read count and the quality score of the read files constituting each group, and merges a forward read file and a reverse read file for one ID to generate a final read file; a classifier that receives the final read file from the preprocessing unit and matches the received final read file with a standard DNA sequence to specify a microorganism; and an information processing unit that processes information about the specified microorganism and transmits the processed information to a user terminal. Accordingly, a FASTQ file obtained by means of an NGS technique is processed so that only valid data is specified through noise filtering, and an analysis algorithm can be applied to the valid data only. In addition, the total read count is considered when selecting a truncation position for the FASTQ file obtained by means of an NGS technique, so that only positions determined to be valid across a plurality of reads are analyzed while quality is corrected, taking both quantitative and qualitative factors into account.
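
The merge step named in the abstract (a forward read and a reverse read for one ID combined into a final read) is described in the concept list as converting the reverse read into its complementary sequence and arranging it after the overlap section (OS), with quality scores carried over unchanged. A minimal sketch under stated assumptions follows: `merge_pair`, `reverse_complement` and `min_overlap` are hypothetical names, the overlap is found by exact string matching, and the forward read's quality scores are used inside the overlap. None of these choices is confirmed by the source.

```python
from typing import Optional, Tuple

# Complement table for DNA bases; N stays N.
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq: str) -> str:
    """Complement each base and reverse the sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def merge_pair(
    fwd_seq: str,
    fwd_qual: str,
    rev_seq: str,
    rev_qual: str,
    min_overlap: int = 10,
) -> Optional[Tuple[str, str]]:
    """Merge one forward/reverse pair, or return None if no overlap is found.

    Assumption: the overlap section is the longest exact match between the
    end of the forward read and the start of the reverse-complemented
    reverse read.
    """
    rc_seq = reverse_complement(rev_seq)
    rc_qual = rev_qual[::-1]  # quality values follow their bases, unchanged
    max_len = min(len(fwd_seq), len(rc_seq))
    for k in range(max_len, min_overlap - 1, -1):
        if fwd_seq[-k:] == rc_seq[:k]:
            # Append only the part of the reverse read after the overlap.
            return fwd_seq + rc_seq[k:], fwd_qual + rc_qual[k:]
    return None


fwd = "ACGTACGTAA"
rev = reverse_complement("CGTAAGGCC")
merged = merge_pair(fwd, "I" * 10, rev, "ABCDEFGHI", min_overlap=5)
print(merged)
```

The merged read is shorter than the sum of the two read lengths by the length of the overlap section, which matches the description of the final read in the concept list. Real paired-end mergers usually tolerate mismatches in the overlap; exact matching is used here only to keep the sketch short.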
PCT/KR2022/017799 2022-11-11 2022-11-11 Microorganism analysis system and microorganism analysis method using next-generation sequencing WO2024101492A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/KR2022/017799 WO2024101492A1 (fr) 2022-11-11 2022-11-11 Microorganism analysis system and microorganism analysis method using next-generation sequencing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/KR2022/017799 WO2024101492A1 (fr) 2022-11-11 2022-11-11 Microorganism analysis system and microorganism analysis method using next-generation sequencing

Publications (1)

Publication Number Publication Date
WO2024101492A1 (fr) 2024-05-16

Family

ID=91033009

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/KR2022/017799 WO2024101492A1 (fr) 2022-11-11 2022-11-11 Microorganism analysis system and microorganism analysis method using next-generation sequencing

Country Status (1)

Country Link
WO (1) WO2024101492A1 (fr)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180365375A1 (en) * 2015-04-24 2018-12-20 University Of Utah Research Foundation Methods and systems for multiple taxonomic classification
KR101798229B1 (ko) * 2016-12-27 2017-12-12 주식회사 천랩 Method for obtaining full-length ribosomal RNA sequence information and method for identifying microorganisms using the ribosomal RNA sequence information
KR20200027900A (ko) * 2018-09-05 2020-03-13 주식회사 천랩 Method for identifying and classifying sample microorganisms
WO2022028624A1 (fr) * 2020-08-07 2022-02-10 西安中科茵康莱医学检验有限公司 Method and apparatus for determining microbial species and acquiring related information by means of sequencing, computer-readable storage medium, and electronic device
CN114242173A (zh) * 2021-12-22 2022-03-25 深圳吉因加医学检验实验室 Data processing method, device, and storage medium for identifying microorganisms by mNGS

Similar Documents

Publication Publication Date Title
WO2020231193A1 (fr) Beam management method, apparatus, electronic device, and computer-readable storage medium
WO2012134180A2 (fr) Emotion classification method for analyzing inherent emotions in a sentence, and emotion classification method for multiple sentences using context information
CN104302781A (zh) Method and device for detecting chromosomal structural abnormalities
WO2017116123A1 (fr) System for identifying the cause of a disease using genetic variation information on an individual's genome
Yaakov et al. Coupling phenotypic persistence to DNA damage increases genetic diversity in severe stress
WO2020022733A1 (fr) Method for detecting chromosomal abnormality based on whole-genome sequencing, and use thereof
WO2020168606A1 (fr) Advertising video optimization method, apparatus, and device, and computer-readable storage medium
CN103377245B (zh) Automatic question answering method and device
WO2017014469A1 (fr) Method for predicting disease risk, and device for performing same
WO2018129978A1 (fr) Information processing method, device, storage medium, and computer device
WO2024101492A1 (fr) Microorganism analysis system and microorganism analysis method using next-generation sequencing
WO2017146338A1 (fr) Method and apparatus for archiving a database generating index information, and method and apparatus for searching an archived database including index information
DeLeo et al. RNA profile diversity across arthropoda: guidelines, methodological artifacts, and expected outcomes
Pornputtapong et al. KITSUNE: A tool for identifying empirically optimal K-mer length for alignment-free phylogenomic analysis
ATE299278T1 (de) Method and computer system for optimizing a Boolean expression for query processing
WO2013032198A1 (fr) Item-based recommendation engine for recommending a highly associated item
WO2018236120A1 (fr) Method and device for identifying quasispecies using a negative marker
WO2024205022A1 (fr) Electronic device and method thereof for managing item information
WO2024096149A1 (fr) System and method for microbial analysis using next-generation sequencing technology
WO2011068315A2 (fr) Apparatus for selecting an optimal database using a maximum conceptual strength recognition technique, and method therefor
Fisher et al. A highly discriminatory multilocus microsatellite typing (MLMT) system for Penicillium marneffei
WO2020050627A1 (fr) Method for identifying and classifying sample microorganisms
WO2021010670A1 (fr) Method and system for processing data using automatic thresholding
WO2021172780A1 (fr) Gene selection method and device
WO2024136630A1 (fr) Sequence homology search method for a nucleotide database

Legal Events

Date Code Title Description
121 Ep: the EPO has been informed by WIPO that EP was designated in this application. Ref document number: 22965264; Country of ref document: EP; Kind code of ref document: A1