WO2021196356A1 - Method, device and application for identifying multi-copy regions in microbial target fragments
- Publication number
- WO2021196356A1 (PCT/CN2020/090175)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- copy
- region
- candidate
- regions
- microbial target
- Prior art date
- 230000000813 microbial effect Effects 0.000 title claims abstract description 69
- 239000012634 fragment Substances 0.000 title claims abstract description 59
- 238000000034 method Methods 0.000 title claims abstract description 56
- 238000012795 verification Methods 0.000 claims abstract description 10
- 238000012545 processing Methods 0.000 claims description 21
- 238000004590 computer program Methods 0.000 claims description 17
- 244000005700 microbiome Species 0.000 claims description 6
- 238000012360 testing method Methods 0.000 claims description 6
- 241000224489 Amoeba Species 0.000 claims description 3
- 241000894006 Bacteria Species 0.000 claims description 3
- 241000223935 Cryptosporidium Species 0.000 claims description 3
- 241000233866 Fungi Species 0.000 claims description 3
- 241000224016 Plasmodium Species 0.000 claims description 3
- 241000223996 Toxoplasma Species 0.000 claims description 3
- 241000224526 Trichomonas Species 0.000 claims description 3
- 241000700605 Viruses Species 0.000 claims description 3
- 241000222712 Kinetoplastida Species 0.000 claims description 2
- 241000243190 Microsporidia Species 0.000 claims description 2
- 210000003495 flagella Anatomy 0.000 claims description 2
- 238000012216 screening Methods 0.000 claims description 2
- 241000238631 Hexapoda Species 0.000 claims 1
- 230000035945 sensitivity Effects 0.000 abstract description 7
- 108020004465 16S ribosomal RNA Proteins 0.000 abstract description 2
- 230000003252 repetitive effect Effects 0.000 abstract description 2
- 239000013612 plasmid Substances 0.000 description 13
- 238000001514 detection method Methods 0.000 description 11
- 241000894007 species Species 0.000 description 9
- 238000004891 communication Methods 0.000 description 6
- 108700022487 rRNA Genes Proteins 0.000 description 6
- 239000000523 sample Substances 0.000 description 6
- 238000004364 calculation method Methods 0.000 description 5
- 238000010586 diagram Methods 0.000 description 4
- 230000008859 change Effects 0.000 description 3
- 238000013461 design Methods 0.000 description 3
- 230000006870 function Effects 0.000 description 3
- 230000003287 optical effect Effects 0.000 description 3
- 241000196324 Embryophyta Species 0.000 description 2
- 230000000694 effects Effects 0.000 description 2
- 238000005516 engineering process Methods 0.000 description 2
- 244000000010 microbial pathogen Species 0.000 description 2
- XLYOFNOQVPJJNP-UHFFFAOYSA-N water Substances O XLYOFNOQVPJJNP-UHFFFAOYSA-N 0.000 description 2
- 241001295810 Microsporidium Species 0.000 description 1
- 241001646725 Mycobacterium tuberculosis H37Rv Species 0.000 description 1
- 108700035964 Mycobacterium tuberculosis HsaD Proteins 0.000 description 1
- 241000588652 Neisseria gonorrhoeae Species 0.000 description 1
- 230000002159 abnormal effect Effects 0.000 description 1
- 230000009286 beneficial effect Effects 0.000 description 1
- 239000012472 biological sample Substances 0.000 description 1
- 238000000802 evaporation-induced self-assembly Methods 0.000 description 1
- 238000002474 experimental method Methods 0.000 description 1
- 238000003703 image analysis method Methods 0.000 description 1
- 238000005259 measurement Methods 0.000 description 1
- 230000007246 mechanism Effects 0.000 description 1
- 238000012986 modification Methods 0.000 description 1
- 230000004048 modification Effects 0.000 description 1
- 238000007857 nested PCR Methods 0.000 description 1
- 230000002093 peripheral effect Effects 0.000 description 1
- 230000008569 process Effects 0.000 description 1
- 108090000623 proteins and genes Proteins 0.000 description 1
- 238000003753 real-time PCR Methods 0.000 description 1
- 238000012163 sequencing technique Methods 0.000 description 1
- 238000007671 third-generation sequencing Methods 0.000 description 1
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
- G16B20/10—Ploidy or copy number detection
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
Definitions
- the present invention relates to the field of bioinformatics, in particular to a method, device and application for identifying multi-copy regions in microbial target fragments.
- plasmids are not universal.
- some species do not carry plasmids at all, so plasmids cannot be used to detect those species, let alone to design primers on the plasmids to improve detection sensitivity. For example, studies have reported that about 5% of Neisseria gonorrhoeae strains cannot be detected because they do not have a plasmid.
- rRNA genes are present in the genomes of all microbial species, and there are often multiple copies that can improve detection sensitivity. But in fact, not all rRNA genes are multiple copies. For example, there is only one copy of the rRNA gene in Mycobacterium tuberculosis H37Rv.
- in addition, some properties of the rRNA gene sequence make it unsuitable for detection. For example, between closely related species, or even between strains of the same species with different subtypes, rRNA genes cannot meet the requirements of species specificity or subspecies specificity because the sequence is too conserved.
- the purpose of the present invention is to provide a method, device and application for identifying multi-copy regions in microbial target fragments.
- the first aspect of the present invention provides a method for identifying multi-copy regions in microbial target fragments, the method at least including the following steps:
- S100 Find candidate multi-copy regions: perform an internal comparison on the microbial target fragment, and take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions.
- the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
- S200 Verify and obtain multi-copy regions: obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, it is recorded as a multi-copy region.
- the second aspect of the present invention provides a device for identifying multiple copy regions in a microbial target segment, the device at least comprising:
- a candidate multi-copy region searching module, used to perform an internal comparison on the microbial target fragment and to take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions, where the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
- the multi-copy area verification and acquisition module is used to obtain the median value of the copy number of the candidate multi-copy area; if the median value of the copy number of the candidate multi-copy area is greater than 1, it is recorded as the multi-copy area.
- a third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for identifying a multi-copy region in the aforementioned microbial target segment is realized.
- a fourth aspect of the present invention provides a computer processing device, including a processor and the aforementioned computer-readable storage medium; the processor executes the computer program on the computer-readable storage medium to implement the steps of the aforementioned method for identifying multi-copy regions in microbial target fragments.
- the fifth aspect of the present invention provides an electronic terminal, including: a processor, a memory, and a communicator; the memory is used to store a computer program, the communicator is used to communicate with an external device, and the processor is used to execute the computer program stored in the memory, so that the terminal executes the aforementioned method for identifying multi-copy regions in microbial target fragments.
- the sixth aspect of the present invention provides the use of the aforementioned method for identifying multi-copy regions in microbial target fragments, the aforementioned device for identifying multi-copy regions in microbial target fragments, the aforementioned computer-readable storage medium, the aforementioned computer processing device, or the aforementioned electronic terminal for detecting multi-copy regions in microbial target fragments.
- the method, device and application for identifying the multi-copy region in the microbial target segment of the present invention have the following beneficial effects:
- the method for identifying multi-copy regions in microbial target fragments of the present invention has high accuracy and high sensitivity and can identify previously undiscovered multi-copy regions; it can search for repetitive sequences in incompletely assembled motifs; it is more comprehensive than 16S rRNA, since not all 16S rRNA genes are present in multiple copies; and the system does not depend on whether a whole genome sequence is available: calculation tasks can be submitted by providing the names of the target strains and comparison strains or by uploading sequence files locally.
- Fig. 1 is a flowchart of a method according to an embodiment of the present invention.
- Figure 1-1 is a graph showing the calculation results of the coverage ratio and sequence matching ratio of the aligned sequence.
- Figure 1-2 is a schematic diagram of the multi-copy area verification obtaining module of the present invention.
- Fig. 2 is a schematic diagram of an apparatus according to an embodiment of the present invention.
- Fig. 3 is a schematic diagram of an electronic terminal in an embodiment of the present invention.
- the method for identifying multi-copy regions in microbial target fragments of the present invention includes at least the following steps:
- S100 Find candidate multi-copy regions: perform an internal comparison on the microbial target fragment, and take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions.
- the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
- S200 Verify and obtain multi-copy regions: obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, it is recorded as a multi-copy region.
- the preset value of similarity can be determined as required.
- the recommended similarity preset value should exceed 80%, such as 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100%.
- Coverage rate = (length of similar sequence / (end value of sequence to be tested - start value of sequence to be tested + 1))%
- the matching rate is the identity value when the sequence to be tested is compared with another sequence.
- the identity value of the alignment of the two sequences can be obtained by software such as needle, water or blat.
- the length of a similar sequence refers to the number of bases that the matched fragment occupies in the sequence to be tested when the sequence to be tested is compared with another sequence, that is, the length of the matched fragment.
- for example, sequence A is the sequence to be tested and is compared with sequence B. The length of the matched fragment is 187, the start value (that is, the starting position) of sequence A is 1, and the end value (that is, the end position) is 187. Then the coverage rate is 187 / (187 - 1 + 1) = 100%.
- the alignment of sequence A with sequence B corresponds to an identity of 98.4%, so the similarity is 100% × 98.4% = 98.4%.
- the similarity preset value is set to 80%; the similarity of A and B meets the preset value, so they serve as candidate multi-copy regions.
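- the calculation in this example can be sketched in Python as follows. This is a minimal illustration, not the implementation of the invention: the function names are hypothetical, rates are expressed as fractions rather than percentages, and treating "meets the preset value" as "greater than or equal to" is an assumption.

```python
def coverage_rate(matched_length: int, start: int, end: int) -> float:
    """Coverage rate as a fraction: matched length / (end - start + 1)."""
    span = end - start + 1
    if span <= 0:
        raise ValueError("end must not be smaller than start")
    return matched_length / span


def similarity(matched_length: int, start: int, end: int, identity: float) -> float:
    """Similarity = coverage rate x match rate (identity), both as fractions."""
    return coverage_rate(matched_length, start, end) * identity


# Values from the example: 187 matched bases, positions 1-187, identity 98.4%.
PRESET = 0.80  # similarity preset value used in the example
sim = similarity(matched_length=187, start=1, end=187, identity=0.984)
is_candidate = sim >= PRESET  # assumption: "meets" means >=
print(f"similarity={sim:.1%} candidate={is_candidate}")
```

- the identity value itself would come from an alignment tool such as needle, water or blat; here it is simply passed in as a number.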
- the positions of the respective bases between the two sequences to be compared do not cross (that is, the two compared sequences are completely separated in the microbial target fragment, and there is no overlapping part).
- the alignment sequence pair with regional overlap can be removed before the alignment, or the similarity value obtained by the alignment sequence pair with the regional overlap can be removed after the alignment.
- for example, if the position of sequence A is 1-187, then no base of sequence B may fall within positions 1-187. The uniq function can be used for deduplication after the coverage rate and match rate have been calculated.
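- a minimal sketch of this filtering step is shown below, assuming each alignment is stored as a record with 1-based inclusive coordinates for both sequences. The `Hit` record and the function names are hypothetical, and the deduplication here only removes exact duplicate records, which is one possible reading of the uniq step.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One alignment between two regions of the same target fragment."""
    a_start: int
    a_end: int
    b_start: int
    b_end: int
    similarity: float


def overlaps(hit: Hit) -> bool:
    """True if the two aligned regions share at least one base position."""
    return hit.a_start <= hit.b_end and hit.b_start <= hit.a_end


def keep_disjoint_unique(hits: list[Hit]) -> list[Hit]:
    """Drop overlapping pairs, then drop duplicate records, preserving order."""
    seen = set()
    kept = []
    for hit in hits:
        if overlaps(hit) or hit in seen:
            continue
        seen.add(hit)
        kept.append(hit)
    return kept


hits = [
    Hit(1, 187, 150, 336, 0.95),   # B starts inside 1-187: removed
    Hit(1, 187, 500, 686, 0.984),  # fully separated: kept
    Hit(1, 187, 500, 686, 0.984),  # exact duplicate: removed
]
kept = keep_disjoint_unique(hits)
```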
- in step S200, the method for obtaining the median copy number of a candidate multi-copy region is: determine the position of each candidate multi-copy region on the microbial target fragment, obtain the number of other candidate multi-copy regions covering each base position of the candidate multi-copy region to be verified, and calculate the median copy number of the candidate multi-copy region to be verified.
- the other candidate multi-copy areas refer to candidate multi-copy areas other than the candidate multi-copy area to be verified.
- the first row represents the microbial target fragment sequence.
- the fragment within the frame is the candidate multi-copy region to be verified
- the number in the second row is the target fragment sequence to be verified.
- the gray segment in the figure represents the candidate multi-copy region other than the candidate multi-copy region to be verified (hereinafter referred to as the repeated fragment).
- take the first base A inside the frame in the first row: because this base appears in 5 repeated fragments (that is, it is covered by 5 repeated fragments), the number of repeated fragments at this position is 5, so the copy number at this position is 5.
- the number of repeats corresponding to the last base G in the frame is 4, and the number of multiple copies at this position is 4.
- the number of repeated fragments covered by each base position of the candidate multi-copy region to be verified is counted.
- the median is the value in the middle position of the sequence formed by arranging the values of the variable in the statistical population in order of size.
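- the per-base counting and the median test of step S200 can be sketched as follows. This is an illustration under stated assumptions: each repeated fragment is represented by the 1-based inclusive interval of the target fragment that it covers, the function names are hypothetical, and the coordinates in the example are invented for demonstration.

```python
from statistics import median


def per_base_copy_numbers(region, repeats):
    """For each base position of `region`, count the repeated fragments covering it."""
    start, end = region
    return [
        sum(1 for r_start, r_end in repeats if r_start <= pos <= r_end)
        for pos in range(start, end + 1)
    ]


def is_multi_copy(region, repeats):
    """Apply the rule of step S200: median copy number greater than 1."""
    return median(per_base_copy_numbers(region, repeats)) > 1


# Hypothetical candidate region to be verified and five repeated fragments.
region = (101, 110)
repeats = [(101, 110), (101, 110), (101, 105), (95, 103), (108, 120)]

counts = per_base_copy_numbers(region, repeats)
region_median = median(counts)
```

- a simple loop over positions is enough for short regions; for whole genomes a sweep over sorted interval endpoints would avoid the repeated scans, but that is a design choice outside the text.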
- the microbial target fragment may be a chain or multiple incomplete motifs.
- the order of motif connection is not particularly limited, and it can be connected in any order. For example, connect each motif in a random order to form a chain. If the region where the similarity meets the preset value contains different motifs, the region is cut according to the original motif connection point, divided into two regions, and it is judged whether the two regions are candidate multi-copy regions respectively.
- Microbial target fragments are incomplete multiple motifs, which means that part of the microbial target fragment sequence is not a continuous single sequence, but is composed of multiple motifs of different sizes.
- the motif is caused by incomplete splicing of short read lengths under the existing second-generation sequencing conditions. This method is also suitable for whole genome sequence data generated by new technologies such as third-generation sequencing.
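- the handling of multiple motifs can be sketched as follows: the motifs are joined into one chain, the junction positions are remembered, and any region that crosses a junction is cut there. The function names and the toy sequences are hypothetical, and coordinates are assumed to be 1-based and inclusive. The text describes cutting into two regions; this sketch also handles a region that spans more than one junction.

```python
from itertools import accumulate


def concatenate(motifs):
    """Join motifs into one chain; return the chain and each motif's end position."""
    chain = "".join(motifs)
    ends = list(accumulate(len(m) for m in motifs))
    return chain, ends


def split_at_junctions(region, ends):
    """Cut a region wherever it crosses a motif junction."""
    start, end = region
    pieces = []
    for motif_end in ends:
        if start <= motif_end < end:
            pieces.append((start, motif_end))
            start = motif_end + 1
    pieces.append((start, end))
    return pieces


# Three toy motifs of lengths 10, 10 and 6, connected in the given order.
chain, ends = concatenate(["ACGTACGTAC", "GGGTTTAAAC", "TTGACA"])
crossing = split_at_junctions((8, 13), ends)  # spans the first junction
inside = split_at_junctions((3, 7), ends)     # lies within one motif
```

- each resulting piece would then be judged separately as a candidate multi-copy region.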
- the microbial target fragments in step S1 are all derived from public databases, and the public databases are mainly selected from NCBI ( https://www.ncbi.nlm.nih.gov ).
- the method further includes the following step: S101, comparing the selected adjacent microbial target fragments in pairs; if a comparison result with a similarity lower than the preset value appears, an alarm is issued and the filter conditions corresponding to the target strain are displayed.
- the method of the present invention is not limited to whether there is a whole genome sequence, and the calculation task can be submitted by providing the names of the target strain and the comparison strain or uploading a sequence file locally. From the comparison of the detection range, this method can cover all pathogenic microorganisms, including but not limited to bacteria, viruses, fungi, amoeba, cryptosporidium, flagellates, microsporidia, piriformis, plasmodium, toxoplasma, Trichomonas, kinetoplasm, etc.
- a 95% confidence interval of the copy number of the candidate multi-copy region can also be calculated.
- the confidence interval refers to the estimated interval of the overall parameter constructed by the sample statistics, that is, the interval estimation of the overall copy number of this target area. It reflects the degree to which the true value of the copy number of the target area has a certain probability to fall around the measurement result, and it gives the credibility of the measured value of the measured parameter.
- the base number of the candidate multi-copy region is used as the sample number, and the copy value corresponding to each base in the candidate multi-copy region is the sample value.
- for example, if the candidate multi-copy region is 500 bp long, each base corresponds to 1 copy value, giving a set of 500 copy values in total.
- the present invention uses the 95% confidence interval of these 500 copy values as the interval estimate of the overall copy number of the multi-copy target region, at a significance level of 0.05 (a confidence level of 95%).
- when the confidence level is the same, the larger the sample size, the narrower the confidence interval and the closer it is to the mean.
- the microbial target segment can be the whole genome of the microbe or the gene segment of the microbe.
- the mechanism of the present invention is: under normal circumstances, the median value and 95% confidence interval representing these 500 copy values can reflect the true situation of the candidate multi-copy region.
- the design of this module can also exclude some special cases. For example, suppose that in this 500 bp candidate multi-copy region only 5 bases have a copy number of 1000 and the remaining 495 bases have a copy number of 1. Then the median copy number is 1, but the mean is 10.99 and the 95% confidence interval is (2.25-19.73). Although the mean suggests multiple copies, the median does not fall within the 95% confidence interval, so the candidate multi-copy region cannot be judged to be a multi-copy region.
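- the special case above can be reproduced with a short sketch. It uses only the Python standard library and a normal approximation for the interval, which is an assumption: the text does not state how the interval is computed, and the reported bounds (2.25-19.73) appear consistent with a Student's t interval, which is slightly wider. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, median, stdev


def copy_number_summary(copy_values, confidence=0.95):
    """Return mean, median and an approximate confidence interval of the mean."""
    n = len(copy_values)
    m = mean(copy_values)
    med = median(copy_values)
    # Normal approximation (assumption); z is about 1.96 for 95%.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * stdev(copy_values) / sqrt(n)
    return m, med, (m - margin, m + margin)


# 500 bp region: 5 bases with copy number 1000, 495 bases with copy number 1.
values = [1000] * 5 + [1] * 495
m, med, (low, high) = copy_number_summary(values)

median_in_interval = low <= med <= high
is_multi_copy = med > 1  # rule of step S200
print(f"mean={m:.2f} median={med} interval=({low:.2f}, {high:.2f})")
```

- with the normal approximation the interval is roughly (2.27, 19.71). The conclusion matches the text: the median of 1 lies outside the interval and the region is not recorded as multi-copy.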
- the device for identifying multiple-copy regions in a microbial target segment of the present invention at least includes:
- a candidate multi-copy region searching module, used to perform an internal comparison on the microbial target fragment and to take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions, where the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
- the multi-copy area verification and acquisition module is used to obtain the median value of the copy number of the candidate multi-copy area; if the median value of the copy number of the candidate multi-copy area is greater than 1, it is recorded as the multi-copy area.
- Coverage rate = (length of similar sequence / (end value of sequence to be tested - start value of sequence to be tested + 1))%.
- the matching rate is the identity value when the sequence to be tested is compared with another sequence.
- the identity value of the alignment of the two sequences can be obtained by software such as needle, water or blat.
- the preset value of similarity can be determined as required.
- the recommended similarity preset value should exceed 80%, such as 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100%.
- the candidate multi-copy region searching module further includes the following sub-module: a raw data similarity comparison module, which compares the selected adjacent microbial target fragments in pairs; if a comparison result with a similarity lower than the preset value appears, an alarm is issued and the filter conditions corresponding to the target strain are displayed. Based on the feedback report, the user can re-select the target strain before entering the background calculation.
- in the candidate multi-copy region searching module, when the microbial target segment consists of multiple incomplete motifs, the motifs are connected and the candidate multi-copy regions are then searched for.
- if the region where the similarity meets the preset value contains different motifs, the region is cut at the original motif connection point into two regions, and each of the two regions is judged separately as to whether it is a candidate multi-copy region.
- the motifs are connected in any order.
- the multi-copy region verification and acquisition module also includes a sub-module for obtaining the median copy number of a candidate multi-copy region, which is used to determine the position of each candidate multi-copy region on the microbial target segment, obtain the number of other candidate multi-copy regions covering each base position of the candidate multi-copy region to be verified, and calculate the median copy number of the candidate multi-copy region to be verified.
- the multi-copy area verification and acquisition module is also used to calculate the 95% confidence interval of the copy number of the candidate multi-copy area.
- the base number of the candidate multi-copy region is used as the sample number, and the copy value corresponding to each base in the candidate multi-copy region is the sample value.
- the device in this embodiment works on basically the same principles as the foregoing method embodiments. The definitions of the same features, the calculation methods, and the enumerated implementations and preferred implementations apply interchangeably to the method and device embodiments and are not repeated here.
- the division of the various modules of the above device is only a division of logical functions, and may be fully or partially integrated into a physical entity during actual implementation, or may be physically separated.
- These modules can all be implemented in the form of software called by processing elements; they can also be implemented in the form of hardware; some modules can be implemented in the form of calling software by processing elements, and some of the modules can be implemented in the form of hardware.
- the acquisition module may be a separate processing element, or it may be integrated in a certain chip for implementation.
- it may also be stored in the memory in the form of program code, and a certain processing element may call and execute the functions of the above acquisition module.
- the implementation of other modules is similar.
- each step of the above method or each of the above modules can be completed by an integrated logic circuit of hardware in the processor element or instructions in the form of software.
- the above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (ASIC for short), one or more digital signal processors (Digital Signal Processor, DSP for short), one or more field programmable gate arrays (Field Programmable Gate Array, FPGA for short), or one or more graphics processing units (Graphics Processing Unit, GPU for short), etc.
- the processing element may be a general-purpose processor, such as a central processing unit (CPU for short) or other processors that can call program codes.
- these modules can be integrated together and implemented in the form of a system-on-a-chip (SOC for short).
- a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for identifying the multi-copy region in the aforementioned microbial target segment is realized.
- a computer processing device including a processor and the aforementioned computer-readable storage medium, where the processor executes the computer program on the computer-readable storage medium to implement the steps of the aforementioned method for identifying multi-copy regions in microbial target fragments.
- an electronic terminal including: a processor, a memory, and a communicator; the memory is used to store a computer program, the communicator is used to communicate with an external device, and the processor is used to execute the computer program stored in the memory, so that the terminal executes the aforementioned method for identifying multi-copy regions in microbial target fragments.
- FIG. 3 shows a schematic diagram of an electronic terminal provided by the present invention.
- the electronic terminal includes a processor 31, a memory 32, a communicator 33, a communication interface 34, and a system bus 35; the memory 32 and the communication interface 34 are connected to the processor 31 and the communicator 33 through the system bus 35 to complete mutual communication.
- the memory 32 is used to store computer programs, the communicator 33 and the communication interface 34 are used to communicate with other devices, and the processor 31 and the communicator 33 are used to run the computer programs so that the electronic terminal executes the steps of the above method for identifying multi-copy regions in microbial target fragments.
- the aforementioned system bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
- the system bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
- the communication interface is used to realize the communication between the database access device and other devices (such as the client, the read-write library and the read-only library).
- the memory may include random access memory (Random Access Memory, RAM for short), and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
- the above-mentioned processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, CPU for short), a network processor (Network Processor, NP for short), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP for short), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC for short), a field-programmable gate array (Field-Programmable Gate Array, FPGA for short), a graphics processing unit (Graphics Processing Unit, GPU for short) or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
- the aforementioned computer program can be stored in a computer-readable storage medium.
- the computer-readable storage medium may include, but is not limited to, a floppy disk, an optical disk, a CD-ROM (compact disc read-only memory), a magneto-optical disk, ROM (read-only memory), RAM (random access memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), a magnetic or optical card, flash memory, or other types of media/machine-readable media suitable for storing machine-executable instructions.
- the computer-readable storage medium may be a product that has not been connected to a computer device, or a component that has been connected to a computer device for use.
- the computer program is a routine, program, object, component, data structure, etc., that performs a specific task or implements a specific abstract data type.
- the aforementioned method for identifying multi-copy regions in microbial target fragments, devices for identifying multi-copy regions in microbial target fragments, computer-readable storage media, computer processing equipment, or electronic terminals can be used in microbial PCR detection.
- the aforementioned device for identifying multi-copy areas in microbial target fragments can be used to detect multi-copy areas in microbial target fragments.
- the microorganism is selected from one or more of bacteria, viruses, fungi, amoeba, cryptosporidium, flagellum, microsporidium, piriformis, plasmodium, toxoplasma, trichomonas, or kinetoplast.
Landscapes
- Life Sciences & Earth Sciences (AREA)
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Biophysics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Biotechnology (AREA)
- Evolutionary Biology (AREA)
- General Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Analytical Chemistry (AREA)
- Chemical & Material Sciences (AREA)
- Data Mining & Analysis (AREA)
- Bioethics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Evolutionary Computation (AREA)
- Public Health (AREA)
- Software Systems (AREA)
- Genetics & Genomics (AREA)
- Molecular Biology (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
Claims (17)
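- A method for identifying multi-copy regions in a microbial target fragment, characterized in that the method comprises at least the following steps: S100, finding candidate multi-copy regions: performing an internal alignment of the microbial target fragment and taking the regions corresponding to sequences to be tested whose similarity meets a preset value as candidate multi-copy regions, wherein the similarity is the product of the coverage and the match rate of the sequence to be tested; S200, verifying and obtaining multi-copy regions: obtaining the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
- The method for identifying multi-copy regions in a microbial target fragment according to claim 1, characterized in that it further comprises one or more of the following features: a. coverage = (length of the similar sequence / (end position of the sequence to be tested - start position of the sequence to be tested + 1))%; b. the preset similarity value exceeds 80%; c. the base positions of the two sequences being compared do not overlap; d. the method further comprises the following step: S101, aligning the selected adjacent microbial target fragments pairwise and, if an alignment result with a similarity below the preset value occurs, issuing an alarm and displaying the screening conditions corresponding to the target strain; e. in step S200, the 95% confidence interval of the copy number of the candidate multi-copy region may also be calculated.
- The method for identifying multi-copy regions in a microbial target fragment according to claim 2, characterized in that, when the 95% confidence interval of the copy number of a candidate multi-copy region is calculated, the number of bases in the candidate multi-copy region is taken as the sample size and the copy number value corresponding to each base in the candidate multi-copy region is taken as a sample value.
- The method for identifying multi-copy regions in a microbial target fragment according to claim 1, characterized in that, when the microbial target fragment consists of multiple incomplete motifs, the motifs are concatenated before candidate multi-copy regions are searched for.
- The method for identifying multi-copy regions in a microbial target fragment according to claim 3, characterized in that it further comprises one or more of the following features: a. if a region whose similarity meets the preset value contains different motifs, the region is cut at the original motif junction into two regions, and each of the two regions is judged separately as to whether it is a candidate multi-copy region; b. the motifs are concatenated in any order.
- The method for identifying multi-copy regions in a microbial target fragment according to claim 1, characterized in that the median copy number of a candidate multi-copy region is obtained by: determining the position of each candidate multi-copy region on the microbial target fragment, obtaining the number of other candidate multi-copy regions covering each base position of the candidate multi-copy region to be verified, and calculating the median copy number of the candidate multi-copy region to be verified.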
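- A device for identifying multi-copy regions in a microbial target fragment, characterized in that the device comprises at least: a candidate multi-copy region finding module, configured to perform an internal alignment of the microbial target fragment and take the regions corresponding to sequences to be tested whose similarity meets a preset value as candidate multi-copy regions, wherein the similarity is the product of the coverage and the match rate of the sequence to be tested; and a multi-copy region verification module, configured to obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
- The device for identifying multi-copy regions in a microbial target fragment according to claim 7, characterized in that it further comprises one or more of the following features: a. coverage = (length of the similar sequence / (end position of the sequence to be tested - start position of the sequence to be tested + 1))%; b. the preset similarity value exceeds 80%; c. the base positions of the two sequences being compared do not overlap; d. the candidate multi-copy region finding module further comprises the following submodule: a raw data similarity comparison submodule, configured to align the selected adjacent microbial target fragments pairwise and, if an alignment result with a similarity below the preset value occurs, issue an alarm and display the screening conditions corresponding to the target strain; e. the multi-copy region verification module is further configured to calculate the 95% confidence interval of the copy number of the candidate multi-copy region.
- The device for identifying multi-copy regions in a microbial target fragment according to claim 8, characterized in that, when the 95% confidence interval of the copy number of a candidate multi-copy region is calculated, the number of bases in the candidate multi-copy region is taken as the sample size and the copy number value corresponding to each base in the candidate multi-copy region is taken as a sample value.
- The device for identifying multi-copy regions in a microbial target fragment according to claim 7, characterized in that, in the candidate multi-copy region finding module, when the microbial target fragment consists of multiple incomplete motifs, the motifs are concatenated before candidate multi-copy regions are searched for.
- The device for identifying multi-copy regions in a microbial target fragment according to claim 10, characterized in that it further comprises one or more of the following features: a. if a region whose similarity meets the preset value contains different motifs, the region is cut at the original motif junction into two regions, and each of the two regions is judged separately as to whether it is a candidate multi-copy region; b. the motifs are concatenated in any order.
- The device for identifying multi-copy regions in a microbial target fragment according to claim 7, characterized in that the multi-copy region verification module further comprises a submodule for obtaining the median copy number of a candidate multi-copy region, configured to determine the position of each candidate multi-copy region on the microbial target fragment, obtain the number of other candidate multi-copy regions covering each base position of the candidate multi-copy region to be verified, and calculate the median copy number of the candidate multi-copy region to be verified.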
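- A computer-readable storage medium on which a computer program is stored, characterized in that the program, when executed by a processor, implements the method for identifying multi-copy regions in a microbial target fragment according to any one of claims 1-6.
- A computer processing device comprising a processor and the computer-readable storage medium according to claim 13, characterized in that the processor executes the computer program on the computer-readable storage medium to implement the steps of the method for identifying multi-copy regions in a microbial target fragment according to any one of claims 1-6.
- An electronic terminal, characterized in that it comprises a processor, a memory, and a communicator; the memory is used to store a computer program, the communicator is used to establish a communication connection with external devices, and the processor is used to execute the computer program stored in the memory so that the terminal performs the method for identifying multi-copy regions in a microbial target fragment according to any one of claims 1-6.
- Use of the method for identifying multi-copy regions in a microbial target fragment according to claims 1-6, the device for identifying multi-copy regions in a microbial target fragment according to claims 7-12, the computer-readable storage medium according to claim 13, the computer processing device according to claim 14, or the electronic terminal according to claim 15 for detecting multi-copy regions in microbial target fragments.
- The use according to claim 16, characterized in that the microorganism is selected from one or more of bacteria, viruses, fungi, amoebae, Cryptosporidium, flagellates, microsporidia, piroplasms, Plasmodium, Toxoplasma, Trichomonas, and kinetoplastids.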
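The two claimed steps (S100: similarity as coverage times match rate; S200: median copy number greater than 1, with an optional 95% confidence interval) can be illustrated with a minimal Python sketch. This is not the applicant's implementation. All function names are hypothetical, coordinates are assumed to be 1-based and inclusive, the match rate is assumed to be a fraction supplied by whatever aligner is used, the region itself is assumed to count as one copy on top of the "other" regions covering each base, and the confidence interval uses a normal approximation (mean ± 1.96 standard errors) because the claims do not state a formula.

```python
from math import sqrt
from statistics import mean, median, stdev


def similarity(similar_len, q_start, q_end, match_rate):
    """Similarity = coverage x match rate, both expressed as fractions in [0, 1].

    Coverage follows the claimed formula: similar length divided by
    (end - start + 1) of the sequence to be tested.
    """
    coverage = similar_len / (q_end - q_start + 1)
    return coverage * match_rate


def is_candidate(similar_len, q_start, q_end, match_rate, threshold=0.80):
    """A region is a candidate when its similarity exceeds the preset value."""
    return similarity(similar_len, q_start, q_end, match_rate) > threshold


def per_base_copy_numbers(region, others):
    """Copy number at each base of `region`.

    Assumption: 1 (the region itself) plus the number of other candidate
    regions whose interval covers that base.
    """
    start, end = region
    return [1 + sum(s <= pos <= e for s, e in others)
            for pos in range(start, end + 1)]


def verify_region(region, others):
    """Median copy number, approximate 95% CI, and the multi-copy decision."""
    copies = per_base_copy_numbers(region, others)
    n = len(copies)               # sample size = number of bases in the region
    centre = mean(copies)
    # Normal-approximation half-width; assumed, not taken from the claims.
    half = 1.96 * stdev(copies) / sqrt(n) if n > 1 else 0.0
    med = median(copies)
    return {
        "median": med,
        "ci95": (centre - half, centre + half),
        "multi_copy": med > 1,    # claimed decision rule
    }


if __name__ == "__main__":
    # Hypothetical intervals on a target fragment.
    print(is_candidate(95, 1, 100, 0.9))
    print(verify_region((1, 10), [(1, 10), (6, 10)]))
```

Using the median rather than the mean for the decision makes it less sensitive to a short stretch of the region that happens to be covered by many overlapping hits.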
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU2020439391A AU2020439391B2 (en) | 2020-04-02 | 2020-05-14 | Method and apparatus for identifying multi-copy region in microbial target fragment, and use |
JP2022560044A JP7367234B2 (ja) | 2020-04-02 | 2020-05-14 | Method and device for identifying multi-copy regions in microbial target fragments, and use thereof |
US17/916,189 US20230154568A1 (en) | 2020-04-02 | 2020-05-14 | Method and device for identifying multi-copy region in microorganism target fragment and use thereof |
EP20928847.1A EP4120279A4 (en) | 2020-04-02 | 2020-05-14 | METHOD AND DEVICE FOR IDENTIFYING A MULTIPLE COPY REGION IN A MICROBIAL TARGET FRAGMENT AND USE |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010254690.9 | 2020-04-02 | | |
CN202010254690.9A CN111477275B (zh) | 2020-04-02 | 2020-04-02 | Method and device for identifying multi-copy regions in microbial target fragments, and use thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021196356A1 (zh) | 2021-10-07 |
Family
ID=71749593
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/CN2020/090175 WO2021196356A1 (zh) | 2020-04-02 | 2020-05-14 | Method and device for identifying multi-copy regions in microbial target fragments, and use thereof |
Country Status (6)
Country | Link |
---|---|
US (1) | US20230154568A1 (zh) |
EP (1) | EP4120279A4 (zh) |
JP (1) | JP7367234B2 (zh) |
CN (1) | CN111477275B (zh) |
AU (1) | AU2020439391B2 (zh) |
WO (1) | WO2021196356A1 (zh) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6694335B1 (en) * | 1999-10-04 | 2004-02-17 | Microsoft Corporation | Method, computer readable medium, and system for monitoring the state of a collection of resources |
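CN101930502A (zh) * | 2010-09-03 | 2010-12-29 | 深圳华大基因科技有限公司 | Method and system for phenotypic gene detection and bioinformatics analysis |
CN103810402A (zh) * | 2014-02-25 | 2014-05-21 | 北京诺禾致源生物信息科技有限公司 | Data processing method and device for genomes |
CN104870056A (zh) * | 2012-10-05 | 2015-08-26 | 弗·哈夫曼-拉罗切有限公司 | Methods for diagnosing and treating inflammatory bowel disease |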
Family Cites Families (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2003104487A2 (en) * | 2002-06-06 | 2003-12-18 | Centre For Addiction And Mental Health | Detection of epigenetic abnormalities and diagnostic method based thereon |
CN105574361B (zh) * | 2015-11-05 | 2018-11-02 | 上海序康医疗科技有限公司 | A method for detecting genomic copy number variation |
JP2019501641A (ja) * | 2015-11-12 | 2019-01-24 | Samuel WILLIAMS | Rapid sequencing of short DNA fragments using nanopore technology |
US10095831B2 (en) * | 2016-02-03 | 2018-10-09 | Verinata Health, Inc. | Using cell-free DNA fragment size to determine copy number variations |
CN106845154B (zh) * | 2016-12-29 | 2022-04-08 | 浙江安诺优达生物科技有限公司 | A device for detecting copy number variation in FFPE samples |
US11993811B2 (en) * | 2017-01-31 | 2024-05-28 | Myriad Women's Health, Inc. | Systems and methods for identifying and quantifying gene copy number variations |
US20180235352A1 (en) * | 2017-02-23 | 2018-08-23 | Janay Jones | Multi purpose personal transport gear that converts from backpack to comfort pad to poncho to hammock |
CN108048530B (zh) * | 2018-01-23 | 2021-07-27 | 广州大学 | Method for developing EPICs primers based on EST sequences |
CN109234267B (zh) * | 2018-09-12 | 2021-07-30 | 中国科学院遗传与发育生物学研究所 | A genome assembly method |
2020
- 2020-04-02 CN CN202010254690.9A patent/CN111477275B/zh active Active
- 2020-05-14 WO PCT/CN2020/090175 patent/WO2021196356A1/zh unknown
- 2020-05-14 EP EP20928847.1A patent/EP4120279A4/en active Pending
- 2020-05-14 US US17/916,189 patent/US20230154568A1/en active Pending
- 2020-05-14 JP JP2022560044A patent/JP7367234B2/ja active Active
- 2020-05-14 AU AU2020439391A patent/AU2020439391B2/en active Active
Non-Patent Citations (1)
Title |
---|
See also references of EP4120279A4 * |
Also Published As
Publication number | Publication date |
---|---|
CN111477275A (zh) | 2020-07-31 |
CN111477275B (zh) | 2020-12-25 |
US20230154568A1 (en) | 2023-05-18 |
JP7367234B2 (ja) | 2023-10-23 |
EP4120279A4 (en) | 2023-11-22 |
JP2023516504A (ja) | 2023-04-19 |
AU2020439391B2 (en) | 2024-02-29 |
AU2020439391A1 (en) | 2022-11-10 |
EP4120279A1 (en) | 2023-01-18 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021196357A1 (zh) | Method and device for obtaining species-specific consensus sequences of microorganisms, and use thereof | |
Lun et al. | A step-by-step workflow for low-level analysis of single-cell RNA-seq data with Bioconductor | |
US10354747B1 (en) | Deep learning analysis pipeline for next generation sequencing | |
Agius et al. | High resolution models of transcription factor-DNA affinities improve in vitro and in vivo binding predictions | |
JP2005531853A (ja) | System and method for SNP genotype clustering | |
CN106460045B (zh) | Common copy number variations of the human genome for cancer susceptibility risk assessment | |
US20220277811A1 (en) | Detecting False Positive Variant Calls In Next-Generation Sequencing | |
US20140288844A1 (en) | Characterization of biological material in a sample or isolate using unassembled sequence information, probabilistic methods and trait-specific database catalogs | |
Ochoa et al. | Beyond the E-value: stratified statistics for protein domain prediction | |
WO2021196356A1 (zh) | Method and device for identifying multi-copy regions in microbial target fragments, and use thereof | |
Li et al. | SENIES: DNA shape enhanced two-layer deep learning predictor for the identification of enhancers and their strength | |
Ziegler et al. | MiMSI-a deep multiple instance learning framework improves microsatellite instability detection from tumor next-generation sequencing | |
Jiang et al. | DRAMS: A tool to detect and re-align mixed-up samples for integrative studies of multi-omics data | |
AU2022218581B2 (en) | Sequencing data-based itd mutation ratio detecting apparatus and method | |
WO2021196358A1 (zh) | Method and device for identifying specific regions in microbial target fragments, and use thereof | |
Miglietta et al. | Smart-Plexer: a breakthrough workflow for hybrid development of multiplex PCR assays | |
Bang et al. | Deep-learning optimized DEOCSU suite provides an iterable pipeline for accurate ChIP-exo peak calling | |
CN116646010B (zh) | Human-derived virus detection method and apparatus, device, and storage medium | |
US20230005569A1 (en) | Chromosomal and Sub-Chromosomal Copy Number Variation Detection | |
WO2024140881A1 (zh) | Method and apparatus for determining fetal DNA concentration | |
Morris et al. | Two algorithms for biospecimen comparison and differentiation using SNP genotypes | |
Tahara et al. | MOCCS profile analysis clarifies the cell type dependency of transcription factor-binding sequences and cis-regulatory SNPs in humans | |
Hedges | Bioinformatics of Human Genetic Disease Studies | |
Michel et al. | Large-scale structure prediction enabled by reliable model quality assessment and improved contact predictions for small families. | |
Ji et al. | Shine: A novel strategy to extract specific, sensitive and well-conserved biomarkers from massive microbial genomic datasets |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 20928847; Country of ref document: EP; Kind code of ref document: A1 |
| | ENP | Entry into the national phase | Ref document number: 2022560044; Country of ref document: JP; Kind code of ref document: A |
| | ENP | Entry into the national phase | Ref document number: 2020928847; Country of ref document: EP; Effective date: 20221010 |
| | NENP | Non-entry into the national phase | Ref country code: DE |
| | ENP | Entry into the national phase | Ref document number: 2020439391; Country of ref document: AU; Date of ref document: 20200514; Kind code of ref document: A |