WO2021196356A1 - Method, device and application for identifying multi-copy regions in microbial target fragments - Google Patents

Method, device and application for identifying multi-copy regions in microbial target fragments

Info

Publication number
WO2021196356A1
Authority
WO
WIPO (PCT)
Prior art keywords
copy
region
candidate
regions
microbial target
Prior art date
Application number
PCT/CN2020/090175
Other languages
English (en)
French (fr)
Inventor
嵇匆
邵俊斌
刘燕
齐霞
金宇丹
李启腾
Original Assignee
上海之江生物科技股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海之江生物科技股份有限公司
Priority to AU2020439391A (AU2020439391B2, en)
Priority to JP2022560044A (JP7367234B2, ja)
Priority to US17/916,189 (US20230154568A1, en)
Priority to EP20928847.1A (EP4120279A4, en)
Publication of WO2021196356A1 (zh)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B20/00 ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
    • G16B20/10 Ploidy or copy number detection
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20 Supervised data analysis

Definitions

  • the present invention relates to the field of bioinformatics, in particular to a method, device and application for identifying multi-copy regions in microbial target fragments.
  • plasmids are not universal.
  • some species have no plasmids at all, so plasmids cannot be used to detect them, let alone to improve detection sensitivity by designing primers on plasmids. For example, studies have reported that about 5% of Neisseria gonorrhoeae strains cannot be detected because they lack a plasmid.
  • rRNA genes are present in the genomes of all microbial species and often occur in multiple copies, which can improve detection sensitivity. In fact, however, not all rRNA genes are multi-copy. For example, Mycobacterium tuberculosis H37Rv has only one copy of the rRNA gene.
  • the sequence variation of some rRNA genes is not suitable for detection. For example, between closely related species, or even between strains of different subtypes of the same species, rRNA genes are too conserved in sequence to meet the requirement of species-level or even subspecies-level specificity.
  • the purpose of the present invention is to provide a method, device and application for identifying multi-copy regions in microbial target fragments.
  • the first aspect of the present invention provides a method for identifying multi-copy regions in microbial target fragments, the method at least including the following steps:
  • S100, find candidate multi-copy regions: perform an internal alignment of the microbial target fragment, and take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions.
  • the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
  • S200, verify and obtain multi-copy regions: obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
  • the second aspect of the present invention provides a device for identifying multiple copy regions in a microbial target segment, the device at least comprising:
  • a candidate multi-copy region searching module, used to perform an internal alignment of the microbial target fragment and take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions, where the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
  • a multi-copy region verification and acquisition module, used to obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
  • a third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for identifying a multi-copy region in the aforementioned microbial target segment is realized.
  • a fourth aspect of the present invention provides a computer processing device, including a processor and the aforementioned computer-readable storage medium.
  • the processor executes the computer program on the computer-readable storage medium to implement the steps of the aforementioned method for identifying multi-copy regions in microbial target fragments.
  • the fifth aspect of the present invention provides an electronic terminal, including: a processor, a memory, and a communicator; the memory is used to store a computer program, the communicator is used to communicate with an external device, and the processor is used to execute the computer program stored in the memory, so that the terminal executes the aforementioned method for identifying multi-copy regions in microbial target fragments.
  • the sixth aspect of the present invention provides the use of the aforementioned method for identifying multi-copy regions in microbial target fragments, the aforementioned device for identifying multi-copy regions in microbial target fragments, the aforementioned computer-readable storage medium, the aforementioned computer processing equipment, or the aforementioned electronic terminal for detecting multi-copy regions in microbial target fragments.
  • the method, device and application for identifying the multi-copy region in the microbial target segment of the present invention have the following beneficial effects:
  • compared with literature databases, the method for identifying multi-copy regions in microbial target fragments of the present invention is highly accurate and highly sensitive, and identifies previously undiscovered multi-copy regions; it can search for repetitive sequences in incompletely assembled motifs; it is more comprehensive than 16S rRNA, since not all 16S rRNA genes are multi-copy; the system does not depend on whether a whole-genome sequence exists, and calculation tasks can be submitted by providing the names of the target strain and comparison strains or by uploading sequence files locally.
  • Fig. 1 is a flowchart of a method according to an embodiment of the present invention.
  • Fig. 1-1 is a graph showing the calculated coverage rate and match rate of the aligned sequences.
  • Fig. 1-2 is a schematic diagram of the multi-copy region verification and acquisition module of the present invention.
  • Fig. 2 is a schematic diagram of an apparatus according to an embodiment of the present invention.
  • Fig. 3 is a schematic diagram of an electronic terminal in an embodiment of the present invention.
  • the method for identifying multi-copy regions in microbial target fragments of the present invention includes at least the following steps:
  • S100, find candidate multi-copy regions: perform an internal alignment of the microbial target fragment, and take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions.
  • the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
  • S200, verify and obtain multi-copy regions: obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
  • the preset value of similarity can be determined as required.
  • the recommended similarity preset value should exceed 80%, such as 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100%.
  • Coverage rate = (length of similar sequence / (end value of sequence to be tested - start value of sequence to be tested + 1)) × 100%
  • the matching rate is the identity value when the sequence to be tested is compared with another sequence.
  • the identity value of the alignment of the two sequences can be obtained by software such as needle, water or blat.
  • the length of a similar sequence refers to the number of bases that the matched fragment occupies in the sequence to be tested when the sequence to be tested is compared with another sequence, that is, the length of the matched fragment.
  • Sequence A is the sequence to be tested. Sequences A and B are aligned; the length of the matched fragment is 187, the start value (that is, the start position) of sequence A is 1, and the end value (that is, the end position) is 187. Then:
  • coverage rate of sequence A = (187 / (187 - 1 + 1)) × 100% = 100%.
  • the match rate of sequence A and sequence B corresponds to an identity of 98.4%.
  • the similarity of A and B is therefore 100% × 98.4% = 98.4%. With the similarity preset value set to 80%, the similarity of A and B meets the preset value, so the region is taken as a candidate multi-copy region.
  • the base positions of the two sequences being aligned must not overlap (that is, the two aligned sequences are completely separate in the microbial target fragment, with no overlapping part).
  • sequence pairs whose regions overlap can be removed before the alignment, or the similarity values obtained from such overlapping pairs can be removed after the alignment.
  • for example, if sequence A occupies positions 1-187, no base of sequence B lies between positions 1 and 187. After the coverage rate and match rate are calculated, the uniq function can be used for deduplication.
  • in step S200, the median copy number of a candidate multi-copy region is obtained as follows: determine the position of each candidate multi-copy region on the microbial target fragment, obtain the number of other candidate multi-copy regions covering each base position of the candidate multi-copy region to be verified, and calculate the median of these copy numbers for that region.
  • the other candidate multi-copy regions are the candidate multi-copy regions other than the one to be verified.
  • in Fig. 1-2, the first row represents the microbial target fragment sequence.
  • the fragment inside the frame is the candidate multi-copy region to be verified.
  • the numbers in the second row are the copy numbers corresponding to each base of the candidate multi-copy region to be verified.
  • the gray segments in the figure represent the candidate multi-copy regions other than the one to be verified (hereinafter referred to as repeated fragments).
  • starting from the left, the first base in the frame, A, appears in 5 repeated fragments (that is, it is covered by 5 repeated fragments), so the number of repeated fragments at that position is 5 and the copy number at that position is 5.
  • likewise, the last base in the frame, G, is covered by 4 repeated fragments, so the copy number at that position is 4.
  • in this way, the number of repeated fragments covering each base position of the candidate multi-copy region to be verified is counted, and the median copy number of the region is calculated from the per-position values.
  • the median is the value in the middle position when all the values are arranged in order of size.
  • the microbial target fragment may be a single chain or multiple incomplete motifs.
  • when the microbial target fragment consists of multiple incomplete motifs, the motifs are joined together before candidate multi-copy regions are searched for. The order in which the motifs are joined is not particularly limited; for example, the motifs can be joined in random order to form one chain. If a region whose similarity meets the preset value spans different motifs, the region is cut at the original motif junction into two regions, and each of the two regions is judged separately as to whether it is a candidate multi-copy region.
  • "multiple incomplete motifs" means that part of the microbial target fragment sequence is not one continuous sequence but consists of multiple motifs of different sizes.
  • the motifs result from the incomplete assembly of short reads under current second-generation sequencing conditions. The method is equally applicable to whole-genome sequence data generated by newer technologies such as third-generation sequencing.
  • the microbial target fragments in step S1 all come from public databases, chiefly NCBI ( https://www.ncbi.nlm.nih.gov ).
  • the method further includes the following step: S101, align the selected adjacent microbial target fragments pairwise; if any alignment result has a similarity below the preset value, an alarm is issued and the filter conditions corresponding to the target strain are displayed.
  • the method of the present invention does not depend on whether a whole-genome sequence exists; calculation tasks can be submitted by providing the names of the target strain and comparison strains or by uploading sequence files locally. In terms of detection range, the method can cover all pathogenic microorganisms, including but not limited to bacteria, viruses, fungi, amoebae, Cryptosporidium, flagellates, microsporidia, piroplasms, Plasmodium, Toxoplasma, Trichomonas, kinetoplastids, etc.
  • a 95% confidence interval of the copy number of the candidate multi-copy region can also be calculated.
  • the confidence interval is the estimated interval of a population parameter constructed from sample statistics, here an interval estimate of the overall copy number of the target region. It expresses the range around the measured result within which the true copy number of the target region falls with a given probability, and so indicates how credible the measured value is.
  • the number of bases in the candidate multi-copy region is the sample size, and the copy value corresponding to each base in the candidate multi-copy region is a sample value.
  • for example, in a 500 bp candidate multi-copy region each base corresponds to one copy value, giving a set of 500 copy values in total.
  • the present invention uses the 95% confidence interval of these 500 copy values as the interval estimate of the overall copy number of the multi-copy target region, at a significance level of 0.05 (confidence level of 95%).
  • at a given confidence level, the larger the sample size, the narrower the confidence interval and the closer its bounds lie to the mean.
  • the microbial target segment can be the whole genome of the microbe or the gene segment of the microbe.
  • the mechanism of the present invention is: under normal circumstances, the median and the 95% confidence interval of these 500 copy values reflect the true situation of the candidate multi-copy region.
  • the design of this module can also exclude some special cases. For example, suppose that in this 500 bp candidate multi-copy region only 5 bases have a copy number of 1000 and the remaining 495 bases have a copy number of 1. The median copy number is then 1, while the mean is 10.99 with a 95% confidence interval of (2.25-19.73). Although the mean suggests multiple copies, the median lies outside the 95% confidence interval, so the candidate region cannot be judged to be multi-copy.
  • the device for identifying multiple-copy regions in a microbial target segment of the present invention at least includes:
  • a candidate multi-copy region searching module, used to perform an internal alignment of the microbial target fragment and take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions, where the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
  • a multi-copy region verification and acquisition module, used to obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
  • Coverage rate = (length of similar sequence / (end value of sequence to be tested - start value of sequence to be tested + 1)) × 100%.
  • the matching rate is the identity value when the sequence to be tested is compared with another sequence.
  • the identity value of the alignment of the two sequences can be obtained by software such as needle, water or blat.
  • the preset value of similarity can be determined as required.
  • the recommended similarity preset value should exceed 80%, such as 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100%.
  • the candidate multi-copy region searching module further includes a raw data similarity comparison sub-module, which aligns the selected adjacent microbial target fragments pairwise; if any alignment result has a similarity below the preset value, an alarm is issued and the filter conditions corresponding to the target strain are displayed. Based on the feedback report, the user can re-select the target strain and proceed to the background calculation.
  • in the candidate multi-copy region searching module, when the microbial target fragment consists of multiple incomplete motifs, the motifs are joined together before candidate multi-copy regions are searched for.
  • if a region whose similarity meets the preset value spans different motifs, the region is cut at the original motif junction into two regions, and each of the two regions is judged separately as to whether it is a candidate multi-copy region.
  • the motifs can be joined in any order.
  • the multi-copy region verification and acquisition module also includes a sub-module for obtaining the median copy number of a candidate multi-copy region, which determines the position of each candidate multi-copy region on the microbial target fragment, obtains the number of other candidate multi-copy regions covering each base position of the candidate multi-copy region to be verified, and calculates the median copy number of that region.
  • the multi-copy area verification and acquisition module is also used to calculate the 95% confidence interval of the copy number of the candidate multi-copy area.
  • the base number of the candidate multi-copy region is used as the sample number, and the copy value corresponding to each base in the candidate multi-copy region is the sample value.
  • the device in this embodiment works on essentially the same principles as the foregoing method embodiments; the definitions of shared features, the calculation methods, and the listed implementations and preferred implementations apply interchangeably to both and are not repeated here.
  • the division of the various modules of the above device is only a division of logical functions, and may be fully or partially integrated into a physical entity during actual implementation, or may be physically separated.
  • These modules can all be implemented in the form of software called by processing elements; they can also be implemented in the form of hardware; some modules can be implemented in the form of calling software by processing elements, and some of the modules can be implemented in the form of hardware.
  • the acquisition module may be a separate processing element, or it may be integrated in a certain chip for implementation.
  • it may also be stored in the memory in the form of program code, and a certain processing element may call and execute the functions of the above acquisition module.
  • the implementation of other modules is similar.
  • each step of the above method or each of the above modules can be completed by an integrated logic circuit of hardware in the processor element or instructions in the form of software.
  • the above modules may be one or more integrated circuits configured to implement the above methods, for example: one or more application specific integrated circuits (ASIC for short), one or more microprocessors (Digital Signal Processor, DSP for short), one or more Field Programmable Gate Arrays (FPGA for short), or Graphics Processing Units (GPU for short), etc.
  • the processing element may be a general-purpose processor, such as a central processing unit (CPU for short) or other processors that can call program codes.
  • these modules can be integrated together and implemented in the form of a system-on-a-chip (SOC for short).
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the method for identifying the multi-copy region in the aforementioned microbial target segment is realized.
  • a computer processing device including a processor and the aforementioned computer-readable storage medium; the processor executes the computer program on the computer-readable storage medium to implement the steps of the aforementioned method for identifying multi-copy regions in microbial target fragments.
  • an electronic terminal including: a processor, a memory, and a communicator; the memory is used to store a computer program, the communicator is used to communicate with an external device, and the processor is used to execute the computer program stored in the memory, so that the terminal executes the aforementioned method for identifying multi-copy regions in microbial target fragments.
  • Fig. 3 shows a schematic diagram of an electronic terminal provided by the present invention.
  • the electronic terminal includes a processor 31, a memory 32, a communicator 33, a communication interface 34, and a system bus 35; the memory 32 and the communication interface 34 are connected to the processor 31 and the communicator 33 through the system bus 35 and communicate with one another.
  • the memory 32 is used to store computer programs, the communicator 33 and the communication interface 34 are used to communicate with other devices, and the processor 31 and the communicator 33 are used to run the computer programs so that the electronic terminal executes the steps of the above method.
  • the aforementioned system bus may be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus.
  • the system bus can be divided into address bus, data bus, control bus and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the communication interface is used to realize the communication between the database access device and other devices (such as the client, the read-write library and the read-only library).
  • the memory may include random access memory (Random Access Memory, RAM for short), and may also include non-volatile memory (non-volatile memory), such as at least one disk memory.
  • the above-mentioned processor may be a general-purpose processor, including a central processing unit (CPU for short), a network processor (NP for short), etc.; it may also be a digital signal processor (DSP for short), an application specific integrated circuit (ASIC for short), a field-programmable gate array (FPGA for short), a graphics processing unit (GPU for short), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
  • the aforementioned computer program can be stored in a computer-readable storage medium.
  • the computer-readable storage medium may include, but is not limited to, a floppy disk, an optical disk, a CD-ROM (compact disc read-only memory), a magneto-optical disk, ROM (read-only memory), RAM (random access memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), a magnetic or optical card, flash memory, or other types of media/machine-readable media suitable for storing machine-executable instructions.
  • the computer-readable storage medium may be a product that has not been connected to a computer device, or a component that has been connected to a computer device for use.
  • the computer program is a routine, program, object, component, data structure, etc., that performs a specific task or implements a specific abstract data type.
  • the aforementioned method for identifying multi-copy regions in microbial target fragments, devices for identifying multi-copy regions in microbial target fragments, computer-readable storage media, computer processing equipment, or electronic terminals can be used in microbial PCR detection.
  • the aforementioned device for identifying multi-copy areas in microbial target fragments can be used to detect multi-copy areas in microbial target fragments.
  • the microorganism is selected from one or more of bacteria, viruses, fungi, amoebae, Cryptosporidium, flagellates, microsporidia, piroplasms, Plasmodium, Toxoplasma, Trichomonas, or kinetoplastids.

Landscapes

  • Life Sciences & Earth Sciences (AREA)
  • Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Biotechnology (AREA)
  • Evolutionary Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Analytical Chemistry (AREA)
  • Chemical & Material Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Bioethics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Epidemiology (AREA)
  • Evolutionary Computation (AREA)
  • Public Health (AREA)
  • Software Systems (AREA)
  • Genetics & Genomics (AREA)
  • Molecular Biology (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)

Abstract

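A method for identifying multi-copy regions in microbial target fragments, comprising at least the following steps: finding candidate multi-copy regions: performing an internal alignment of the microbial target fragment and taking the regions corresponding to sequences to be tested whose similarity meets a preset value as candidate multi-copy regions, where the similarity is the product of the coverage rate and the match rate of the sequence to be tested (S1); verifying and obtaining multi-copy regions: obtaining the median copy number of each candidate multi-copy region and, if the median copy number of a candidate multi-copy region is greater than 1, recording it as a multi-copy region (S2). The method is highly accurate and highly sensitive and identifies previously undiscovered multi-copy regions; it can search for repetitive sequences in incompletely assembled motifs; and it is more comprehensive than 16S rRNA.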

Description

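Method, device and application for identifying multi-copy regions in microbial target fragments
Technical Field
The present invention relates to the field of bioinformatics, and in particular to a method, device and application for identifying multi-copy regions in microbial target fragments.
Background
The DNA concentration of pathogenic microorganisms in biological samples is mostly very low, close to the detection limit, so conventional PCR or real-time PCR often lacks sufficient sensitivity. Other methods such as two-step nested PCR can be used to improve sensitivity, but they are time-consuming, costly and not very accurate. Improving detection sensitivity is therefore essential. One approach is to look for a suitable template region when designing primers; plasmids and 16S rRNA are the usual choices.
However, using plasmids for primer design raises some problems: not all microorganisms contain species-specific plasmids, and some microorganisms have no plasmids. First, the species specificity of plasmid DNA is uncertain: sequences on the plasmids of some species are highly similar to sequences on the plasmids of other species, so plasmid-based PCR detection carries a high risk of false-positive or false-negative results, and many clinical laboratories still need to run confirmatory experiments with other PCR primer pairs. Second, plasmids are not universal: some species have no plasmids at all, so plasmids cannot be used to detect them, let alone to improve detection sensitivity by designing primers on plasmids. For example, studies have reported that about 5% of Neisseria gonorrhoeae strains cannot be detected because they lack a plasmid.
Similarly, using the rRNA gene region as the template for PCR detection also has problems. rRNA genes are present in the genomes of all microbial species and often occur in multiple copies, which can improve detection sensitivity. In fact, however, not all rRNA genes are multi-copy. For example, Mycobacterium tuberculosis H37Rv has only one copy of the rRNA gene. In addition, the sequence variation of some rRNA genes is not suitable for detection. For example, between closely related species, or even between strains of different subtypes of the same species, rRNA genes are too conserved in sequence to meet the requirement of species-level or even subspecies-level specificity.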
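Summary of the Invention
In view of the above shortcomings of the prior art, the purpose of the present invention is to provide a method, device and application for identifying multi-copy regions in microbial target fragments.
The first aspect of the present invention provides a method for identifying multi-copy regions in microbial target fragments, the method including at least the following steps:
S100: find candidate multi-copy regions: perform an internal alignment of the microbial target fragment, and take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions, where the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
S200: verify and obtain multi-copy regions: obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
The second aspect of the present invention provides a device for identifying multi-copy regions in microbial target fragments, the device including at least:
a candidate multi-copy region searching module, used to perform an internal alignment of the microbial target fragment and take the regions corresponding to sequences to be tested whose similarity meets the preset value as candidate multi-copy regions, where the similarity is the product of the coverage rate and the match rate of the sequence to be tested;
a multi-copy region verification and acquisition module, used to obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
The third aspect of the present invention provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the aforementioned method for identifying multi-copy regions in microbial target fragments is implemented.
The fourth aspect of the present invention provides a computer processing device, including a processor and the aforementioned computer-readable storage medium; the processor executes the computer program on the computer-readable storage medium to implement the steps of the aforementioned method for identifying multi-copy regions in microbial target fragments.
The fifth aspect of the present invention provides an electronic terminal, including: a processor, a memory, and a communicator; the memory is used to store a computer program, the communicator is used to communicate with an external device, and the processor is used to execute the computer program stored in the memory, so that the terminal executes the aforementioned method for identifying multi-copy regions in microbial target fragments.
The sixth aspect of the present invention provides the use of the aforementioned method for identifying multi-copy regions in microbial target fragments, the aforementioned device for identifying multi-copy regions in microbial target fragments, the aforementioned computer-readable storage medium, the aforementioned computer processing device, or the aforementioned electronic terminal for detecting multi-copy regions in microbial target fragments.
As described above, the method, device and application for identifying multi-copy regions in microbial target fragments of the present invention have the following beneficial effects:
Compared with literature databases, the method for identifying multi-copy regions in microbial target fragments of the present invention is highly accurate and highly sensitive, and identifies previously undiscovered multi-copy regions; it can search for repetitive sequences in incompletely assembled motifs; it is more comprehensive than 16S rRNA, since not all 16S rRNA genes are multi-copy; the system does not depend on whether a whole-genome sequence exists, and calculation tasks can be submitted by providing the names of the target strain and comparison strains or by uploading sequence files locally.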
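Brief Description of the Drawings
Fig. 1 is a flowchart of the method according to an embodiment of the present invention.
Fig. 1-1 is a graph showing the calculated coverage rate and match rate of the aligned sequences.
Fig. 1-2 is a schematic diagram of the multi-copy region verification and acquisition module of the present invention.
Fig. 2 is a schematic diagram of the device according to an embodiment of the present invention.
Fig. 3 is a schematic diagram of the electronic terminal in an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention through specific examples; those skilled in the art can readily understand other advantages and effects of the invention from the content disclosed in this specification. The invention can also be implemented or applied through other specific embodiments, and the details in this specification can be modified or changed from different viewpoints and for different applications without departing from the spirit of the invention.
It should further be understood that one or more method steps mentioned in the present invention do not exclude the presence of other method steps before or after the combined steps, or the insertion of other method steps between the explicitly mentioned steps, unless otherwise stated; likewise, the combination and connection relationship between one or more steps mentioned in the present invention does not exclude the presence of other steps before or after the combined steps, or the insertion of other steps between two explicitly mentioned steps, unless otherwise stated. Moreover, unless otherwise stated, the numbering of the method steps is only a convenient means of identifying them and is not intended to limit their order or the implementable scope of the invention; changes or adjustments to their relative relationship, without substantive change to the technical content, shall also be regarded as within the implementable scope of the invention.
Please refer to Figs. 1 to 3. Note that the drawings provided in this embodiment illustrate the basic concept of the invention only schematically: they show only the components related to the invention and are not drawn according to the number, shape and size of the components in an actual implementation. In an actual implementation the form, number and proportion of each component may be changed arbitrarily, and the component layout may be more complex.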
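As shown in Figure 1, the method of the present invention for identifying multi-copy regions in a microbial target fragment comprises at least the following steps:
S100: Search for candidate multi-copy regions: align the microbial target fragment against itself and take the regions corresponding to query sequences whose similarity meets a preset value as candidate multi-copy regions, where the similarity is the product of the coverage and the match rate of the query sequence;
S200: Verify and obtain multi-copy regions: obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, record the region as a multi-copy region.
The similarity preset value can be chosen as needed. It is recommended that the preset value exceed 80%, for example 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100%.
Coverage = (length of the similar sequence / (end position of the query sequence - start position of the query sequence + 1)) × 100%
The match rate is the identity value obtained when the query sequence is aligned with another sequence. The identity of a pairwise alignment can be obtained with software such as needle, water, or blat.
The length of the similar sequence is the number of bases in the query sequence occupied by the matching segment when the query sequence is aligned with another sequence, that is, the length of the matching segment.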
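For example, the data for the query sequence corresponding to one candidate multi-copy region are shown in Figure 1-1, where:
Sequence A is the query sequence. Aligning sequence A with sequence B gives a matching segment of length 187; the start value (start position) of sequence A is 1 and its end value (end position) is 187, so:
Coverage of sequence A = (187 / (187 - 1 + 1)) × 100% = 100%
The match rate of sequences A and B, that is, the identity, is 98.4%.
The similarity of A and B is therefore 100% × 98.4% = 98.4%. With the similarity preset value set to 80%, the similarity of A and B meets the preset value, so the region is taken as a candidate multi-copy region.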
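The worked example above can be reproduced with a few lines of Python. This is only a minimal sketch: the function names are hypothetical, positions are assumed to be 1-based and inclusive, and "meets the preset value" is assumed to mean greater than or equal to the threshold, which the text does not state explicitly.

```python
def coverage(similar_length: int, start: int, end: int) -> float:
    """Coverage (%) of the query: matched length over the query span.

    Positions are assumed to be 1-based and inclusive, as in the example.
    """
    span = end - start + 1
    if span <= 0:
        raise ValueError("end must not be smaller than start")
    return similar_length / span * 100.0


def similarity(coverage_pct: float, identity_pct: float) -> float:
    """Similarity (%) = coverage x match rate, both given as percentages."""
    return coverage_pct * identity_pct / 100.0


def is_candidate(similarity_pct: float, threshold_pct: float = 80.0) -> bool:
    """Assumption: 'meets the preset value' is read as >= threshold."""
    return similarity_pct >= threshold_pct


# Sequence A from the example: matched length 187, positions 1..187,
# identity 98.4% as reported by the aligner.
cov_a = coverage(187, 1, 187)
sim_ab = similarity(cov_a, 98.4)
```

The identity value itself would come from an aligner such as needle, water, or blat; the sketch takes it as an input rather than computing it.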
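The base positions of the two sequences being compared must not overlap (that is, the two aligned sequences are completely separate in the microbial target fragment, with no shared portion). Sequence pairs with overlapping regions can be removed before alignment, or the similarity values obtained from overlapping pairs can be removed after alignment. For example, as shown in Figure 1-1, sequence A occupies positions 1-187, so no base of sequence B lies within positions 1-187. After the coverage and match rate have been calculated, duplicates can be removed with the uniq function.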
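One possible way to apply the no-overlap rule after alignment is sketched below. The `Hit` record and its field names are hypothetical, and treating the mirrored pair (B against A) as a duplicate of (A against B) is an assumption about what the de-duplication step is meant to remove.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    """Hypothetical record for one self-alignment hit (1-based, inclusive)."""
    a_start: int
    a_end: int
    b_start: int
    b_end: int


def overlaps(hit: Hit) -> bool:
    """True if the two aligned intervals share at least one base position."""
    return hit.a_start <= hit.b_end and hit.b_start <= hit.a_end


def filter_hits(hits: List[Hit]) -> List[Hit]:
    """Drop hits whose two intervals overlap, then drop mirrored duplicates."""
    seen = set()
    kept = []
    for hit in hits:
        if overlaps(hit):
            continue
        # Assumption: A-vs-B and B-vs-A describe the same pair of regions.
        key = tuple(sorted([(hit.a_start, hit.a_end), (hit.b_start, hit.b_end)]))
        if key in seen:
            continue
        seen.add(key)
        kept.append(hit)
    return kept


hits = [
    Hit(1, 187, 500, 686),   # separate regions: kept
    Hit(500, 686, 1, 187),   # mirror of the first hit: dropped
    Hit(1, 187, 100, 286),   # B overlaps positions 1-187: dropped
]
kept_hits = filter_hits(hits)
```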
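In step S200, the median copy number of a candidate multi-copy region is obtained as follows: determine the position of each candidate multi-copy region on the microbial target fragment; for the candidate region to be verified, obtain the number of other candidate multi-copy regions covering each base position; and calculate the median copy number of that region. The other candidate multi-copy regions are all candidate multi-copy regions except the one being verified.
Specifically, as shown in Figure 1-2, the first row represents the microbial target fragment sequence, in which the boxed segment is the candidate multi-copy region to be verified; the numbers in the second row are the copy numbers of the individual bases in that region; and the gray segments represent the candidate multi-copy regions other than the one being verified (hereafter called repeat fragments). Starting from the left, the first base in the box, A, appears in 5 repeat fragments (that is, it is covered by 5 repeat fragments), so the number of repeat fragments at its position is 5 and the copy number at that position is 5. Likewise, the last base in the box, G, is covered by 4 repeat fragments, so the copy number at that position is 4. The number of repeat fragments covering each base position of the candidate region is counted in the same way. The counts are given in the second row of the figure, and the median copy number of the candidate multi-copy region is calculated from the copy numbers at all positions. The median is the value in the middle position when all values in the population are arranged in order of magnitude.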
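The per-base counting described above can be sketched as follows. The interval representation (tuples of 1-based inclusive start and end positions) and the function names are assumptions made for illustration; the coordinates in the usage lines are invented and are not taken from Figure 1-2.

```python
from statistics import median
from typing import List, Tuple

Interval = Tuple[int, int]  # (start, end), 1-based and inclusive


def per_base_copy_numbers(region: Interval, others: List[Interval]) -> List[int]:
    """For each base of `region`, count how many other candidate regions cover it."""
    start, end = region
    return [
        sum(1 for s, e in others if s <= pos <= e)
        for pos in range(start, end + 1)
    ]


def is_multicopy(region: Interval, others: List[Interval]) -> bool:
    """S200 decision rule: the median per-base copy number must exceed 1."""
    return median(per_base_copy_numbers(region, others)) > 1


# Illustrative coordinates only.
region = (10, 14)
others = [(8, 12), (9, 14), (13, 20), (10, 10)]
counts = per_base_copy_numbers(region, others)
```

This direct version is quadratic in the worst case; for whole genomes a sweep over sorted interval endpoints would be the usual optimization, at the cost of a longer example.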
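Further, in step S100 the microbial target fragment may be a single sequence or multiple incomplete contigs.
When the microbial target fragment consists of multiple incomplete contigs, the contigs are joined together before searching for candidate multi-copy regions. The joining order is not particularly limited, and the contigs may be joined in any order; for example, they may be joined in random order into a single sequence. If a region whose similarity meets the preset value contains different contigs, the region is cut at the original contig junction into two regions, and each region is judged separately as to whether it is a candidate multi-copy region.
That the microbial target fragment consists of multiple incomplete contigs means that, for some microbial target fragments, the sequence is not a single continuous sequence but is made up of several contigs of different sizes. Contigs arise because the short reads produced by current second-generation sequencing cannot be assembled completely. The method is equally applicable to whole-genome sequence data produced by newer technologies such as third-generation sequencing.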
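A minimal sketch of the joining and cutting steps is given below, assuming the contigs are plain strings and that the end position of each contig in the joined sequence is recorded as the junction. The function names are hypothetical. The text describes cutting a region into two parts; handling a region that spans more than one junction is a generalization added here.

```python
from typing import List, Tuple


def concatenate(contigs: List[str]) -> Tuple[str, List[int]]:
    """Join contigs in the given order; return the joined sequence and the
    1-based end position of each contig within it."""
    ends = []
    position = 0
    for contig in contigs:
        position += len(contig)
        ends.append(position)
    return "".join(contigs), ends


def split_at_junctions(start: int, end: int, ends: List[int]) -> List[Tuple[int, int]]:
    """Cut the region [start, end] wherever it crosses a contig boundary."""
    pieces = []
    piece_start = start
    for boundary in ends:
        # A boundary inside the region (but not at its last base) forces a cut.
        if piece_start <= boundary < end:
            pieces.append((piece_start, boundary))
            piece_start = boundary + 1
    pieces.append((piece_start, end))
    return pieces


joined, ends = concatenate(["ACGT", "GGCC", "TT"])
```

Each piece returned by `split_at_junctions` would then be re-evaluated on its own against the similarity preset value.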
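The microbial target fragments in step S100 all come from public databases, mainly NCBI (https://www.ncbi.nlm.nih.gov).
Further, the method also includes step S101: perform pairwise alignment of the selected adjacent microbial target fragments; if any alignment result has a similarity below the preset value, raise an alert and display the screening conditions corresponding to the target strain.
This filters out abnormal data caused by human error or other causes.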
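The input check of step S101 might look like the sketch below. Everything here is an assumption made for illustration: the function names, the choice to compare every pair of selected fragments, and in particular `ungapped_identity`, which is a toy stand-in for a real aligner and only compares positions one by one.

```python
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def ungapped_identity(a: str, b: str) -> float:
    """Toy similarity (%): matching positions over the longer length.
    A real pipeline would use an alignment tool instead."""
    if not a or not b:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b)) * 100.0


def check_inputs(
    fragments: Dict[str, str],
    similarity_fn: Callable[[str, str], float] = ungapped_identity,
    threshold_pct: float = 80.0,
) -> List[Tuple[str, str, float]]:
    """Return every pair of fragments whose similarity is below the threshold."""
    alerts = []
    for (name_1, seq_1), (name_2, seq_2) in combinations(fragments.items(), 2):
        score = similarity_fn(seq_1, seq_2)
        if score < threshold_pct:
            alerts.append((name_1, name_2, score))
    return alerts


# Hypothetical strain names and sequences.
fragments = {"s1": "ACGTACGTAC", "s2": "ACGTACGTAA", "s3": "TTTTTTTTTT"}
alerts = check_inputs(fragments)
```

A fragment that appears in most alerts (here `s3`) is the likely outlier that a user would deselect before rerunning the search.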
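The method of the present invention does not depend on the availability of a whole-genome sequence; a computing task can be submitted by providing the names of the target strain and comparison strains or by uploading local sequence files. In terms of detection range, the method can cover all pathogenic microorganisms, including but not limited to bacteria, viruses, fungi, amoebae, Cryptosporidium, flagellates, microsporidia, piroplasms, Plasmodium, Toxoplasma, Trichomonas, and kinetoplastids.
In a preferred embodiment, step S200 may also calculate the 95% confidence interval of the copy number of the candidate multi-copy region. A confidence interval is an interval estimate of a population parameter constructed from a sample statistic, here an interval estimate of the overall copy number of the target region. It expresses the degree to which the true copy number of the target region falls, with a given probability, around the measured result, and so indicates how reliable the measured value of the parameter is.
When calculating the 95% confidence interval of the copy number of a candidate multi-copy region, the number of bases in the region is the sample size, and the copy number of each base in the region is a sample value.
As shown in Figure 1-2, in this 500 bp multi-copy target region each base has one copy number value, giving a set of 500 copy number values in total.
In addition to the median copy number described above, the present invention uses the 95% confidence interval of these 500 copy number values as an interval estimate of the overall copy number of the multi-copy target region at a significance level of 0.05, that is, a confidence level of 95%. At a given confidence level, the larger the sample size, the narrower the confidence interval and the closer it lies to the mean.
The microbial target fragment may be the whole genome of a microorganism or a gene fragment of a microorganism.
The rationale of the present invention is as follows. Normally, the median and the 95% confidence interval of these 500 copy number values reflect the true state of the candidate multi-copy region. Besides further verifying multi-copy status, this module can also rule out certain special cases. For example, suppose that only 5 bases of the 500 bp candidate region have a copy number of 1000 while the remaining 495 bases have a copy number of 1. The median copy number is then 1, whereas the mean is 10.99 and the 95% confidence interval is (2.25-19.73). Clearly, although the mean suggests a multi-copy region, the median falls outside the 95% confidence interval, so the candidate region cannot be judged to be multi-copy.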
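The special case above can be checked numerically. The sketch below uses only the Python standard library and a normal approximation for the interval, which is an assumption: the text does not say how the interval is computed. The normal approximation gives roughly (2.27, 19.71), slightly narrower than the reported (2.25-19.73), which appears consistent with a Student's t interval. The combined acceptance rule in `passes_verification` is likewise an illustrative reading of the text, not a stated rule.

```python
from math import sqrt
from statistics import NormalDist, mean, median, stdev
from typing import List, Tuple


def summarize(copy_numbers: List[int], confidence: float = 0.95) -> Tuple[float, float, Tuple[float, float]]:
    """Return (median, mean, confidence interval of the mean).

    The interval uses a normal approximation (an assumption of this sketch).
    """
    n = len(copy_numbers)
    centre = mean(copy_numbers)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * stdev(copy_numbers) / sqrt(n)
    return median(copy_numbers), centre, (centre - half_width, centre + half_width)


def passes_verification(copy_numbers: List[int]) -> bool:
    """Assumed rule: median above 1 and inside the confidence interval."""
    med, _, (low, high) = summarize(copy_numbers)
    return med > 1 and low <= med <= high


# 5 bases with copy number 1000 and 495 bases with copy number 1.
values = [1000] * 5 + [1] * 495
med, avg, (low, high) = summarize(values)
```

For this input the median is 1 and the mean is 10.99, so the handful of extreme positions inflates the mean while the median correctly reports that most bases are single-copy.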
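As shown in Figure 2, the device of the present invention for identifying multi-copy regions in a microbial target fragment comprises at least:
a candidate multi-copy region search module, configured to align the microbial target fragment against itself and take the regions corresponding to query sequences whose similarity meets a preset value as candidate multi-copy regions, where the similarity is the product of the coverage and the match rate of the query sequence;
a multi-copy region verification module, configured to obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
Coverage = (length of the similar sequence / (end position of the query sequence - start position of the query sequence + 1)) × 100%.
The match rate is the identity value obtained when the query sequence is aligned with another sequence. The identity of a pairwise alignment can be obtained with software such as needle, water, or blat.
The similarity preset value can be chosen as needed. It is recommended that the preset value exceed 80%, for example 85%, 90%, 95%, 96%, 97%, 98%, 99%, or 100%.
Further, the base positions of the two sequences being compared do not overlap.
Optionally, the candidate multi-copy region search module further includes the following submodule: a raw-data similarity comparison module, configured to perform pairwise alignment of the selected adjacent microbial target fragments and, if any alignment result has a similarity below the preset value, to raise an alert and display the screening conditions corresponding to the target strain. The user can then reselect the target strains according to the feedback report and resubmit them for back-end computation.
In the candidate multi-copy region search module, when the microbial target fragment consists of multiple incomplete contigs, the contigs are joined together before searching for candidate multi-copy regions.
If a region whose similarity meets the preset value contains different contigs, the region is cut at the original contig junction into two regions, and each region is judged separately as to whether it is a candidate multi-copy region.
The contigs are joined in any order.
The multi-copy region verification module further includes a submodule for obtaining the median copy number of a candidate multi-copy region, configured to determine the position of each candidate multi-copy region on the microbial target fragment, obtain the number of other candidate multi-copy regions covering each base position of the candidate region to be verified, and calculate the median copy number of that region.
The multi-copy region verification module is also configured to calculate the 95% confidence interval of the copy number of a candidate multi-copy region.
When calculating the 95% confidence interval of the copy number of a candidate multi-copy region, the number of bases in the region is the sample size, and the copy number of each base in the region is a sample value.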
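Because the device in this embodiment works on essentially the same principle as the foregoing method embodiment, the definitions of shared features, the calculation methods, and the listed embodiments and preferred embodiments apply interchangeably to the method and device embodiments and are not repeated.
It should be understood that the division of the above device into modules is only a division of logical functions; in an actual implementation the modules may be wholly or partly integrated into one physical entity or may be physically separate. The modules may all be implemented as software invoked by a processing element, or all as hardware, or some as software invoked by a processing element and others as hardware. For example, the verification module may be a separately provided processing element, may be integrated into a chip, or may be stored in a memory as program code that a processing element calls to perform the function of the module. The other modules are implemented similarly. The modules may also be wholly or partly integrated together or implemented independently. The processing element described here may be an integrated circuit with signal processing capability. In implementation, each step of the above method, or each of the above modules, may be completed by integrated logic circuits of hardware in the processor element or by instructions in the form of software.
For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more application-specific integrated circuits (ASIC), one or more digital signal processors (DSP), one or more field-programmable gate arrays (FPGA), or graphics processing units (GPU). As another example, when one of the above modules is implemented as program code scheduled by a processing element, the processing element may be a general-purpose processor such as a central processing unit (CPU) or another processor that can call program code. As a further example, the modules may be integrated together and implemented as a system-on-a-chip (SOC).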
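Some embodiments of the present invention also provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the foregoing method for identifying multi-copy regions in a microbial target fragment.
Some embodiments of the present invention also provide a computer processing device comprising a processor and the foregoing computer-readable storage medium, the processor executing the computer program on the computer-readable storage medium to carry out the steps of the foregoing method for identifying multi-copy regions in a microbial target fragment.
Some embodiments of the present invention also provide an electronic terminal comprising a processor, a memory, and a communicator; the memory stores a computer program, the communicator establishes communication connections with external devices, and the processor executes the computer program stored in the memory so that the terminal performs the foregoing method for identifying multi-copy regions in a microbial target fragment.
Figure 3 is a schematic diagram of an electronic terminal provided by the present invention. The electronic terminal includes a processor 31, a memory 32, a communicator 33, a communication interface 34, and a system bus 35. The memory 32 and the communication interface 34 are connected to the processor 31 and the communicator 33 through the system bus 35 and communicate with one another; the memory 32 stores the computer program, the communicator 33 and the communication interface 34 communicate with other devices, and the processor 31 and the communicator 33 run the computer program so that the electronic terminal performs the steps of the above method for identifying multi-copy regions in a microbial target fragment.
The system bus mentioned above may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The system bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration it is drawn as a single thick line in the figure, which does not mean that there is only one bus or one type of bus. The communication interface enables communication between the electronic terminal and other devices (for example clients, read-write databases, and read-only databases). The memory may include random access memory (RAM) and may also include non-volatile memory, for example at least one disk storage device.
The processor mentioned above may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA), a graphics processing unit (GPU), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
Those of ordinary skill in the art will understand that all or some of the steps of the above method embodiments can be carried out by hardware under the control of a computer program. The computer program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The computer-readable storage medium may include, but is not limited to, floppy disks, optical discs, CD-ROM (compact disc read-only memory), magneto-optical disks, ROM (read-only memory), RAM (random access memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), magnetic or optical cards, flash memory, or other types of media or machine-readable media suitable for storing machine-executable instructions. The computer-readable storage medium may be a product not yet connected to a computer device or a component already in use in a computer device.
In a specific implementation, the computer program comprises routines, programs, objects, components, data structures, and the like that perform particular tasks or implement particular abstract data types.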
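The foregoing identification method, identification device, computer-readable storage medium, computer processing device, or electronic terminal can be applied in microbial PCR detection.
Specifically, they are used for screening template sequences.
The foregoing identification device, the foregoing computer-readable storage medium, the foregoing computer processing device, or the foregoing electronic terminal can be used to detect multi-copy regions in a microbial target fragment.
The microorganism is selected from one or more of bacteria, viruses, fungi, amoebae, Cryptosporidium, flagellates, microsporidia, piroplasms, Plasmodium, Toxoplasma, Trichomonas, and kinetoplastids.
The above embodiments merely illustrate the principles and effects of the present invention and are not intended to limit it. Anyone familiar with this technology may modify or change the above embodiments without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications or changes made by persons of ordinary skill in the art without departing from the spirit and technical ideas disclosed by the invention shall still be covered by the claims of the invention.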

Claims (17)

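  1. A method for identifying multi-copy regions in a microbial target fragment, characterized in that the method comprises at least the following steps:
    S100: Search for candidate multi-copy regions: align the microbial target fragment against itself and take the regions corresponding to query sequences whose similarity meets a preset value as candidate multi-copy regions, where the similarity is the product of the coverage and the match rate of the query sequence;
    S200: Verify and obtain multi-copy regions: obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, record the region as a multi-copy region.
  2. The method for identifying multi-copy regions in a microbial target fragment according to claim 1, characterized in that it further comprises one or more of the following features:
    a. Coverage = (length of the similar sequence / (end position of the query sequence - start position of the query sequence + 1)) × 100%;
    b. the similarity preset value exceeds 80%;
    c. the base positions of the two sequences being compared do not overlap;
    d. the method further comprises step S101: perform pairwise alignment of the selected adjacent microbial target fragments; if any alignment result has a similarity below the preset value, raise an alert and display the screening conditions corresponding to the target strain;
    e. in step S200, the 95% confidence interval of the copy number of the candidate multi-copy region may also be calculated.
  3. The method for identifying multi-copy regions in a microbial target fragment according to claim 2, characterized in that, when calculating the 95% confidence interval of the copy number of a candidate multi-copy region, the number of bases in the region is the sample size and the copy number of each base in the region is a sample value.
  4. The method for identifying multi-copy regions in a microbial target fragment according to claim 1, characterized in that, when the microbial target fragment consists of multiple incomplete contigs, the contigs are joined together before searching for candidate multi-copy regions.
  5. The method for identifying multi-copy regions in a microbial target fragment according to claim 3, characterized in that it further comprises one or more of the following features:
    a. if a region whose similarity meets the preset value contains different contigs, the region is cut at the original contig junction into two regions, and each region is judged separately as to whether it is a candidate multi-copy region;
    b. the contigs are joined in any order.
  6. The method for identifying multi-copy regions in a microbial target fragment according to claim 1, characterized in that the median copy number of a candidate multi-copy region is obtained by: determining the position of each candidate multi-copy region on the microbial target fragment, obtaining the number of other candidate multi-copy regions covering each base position of the candidate region to be verified, and calculating the median copy number of that region.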
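  7. A device for identifying multi-copy regions in a microbial target fragment, characterized in that the device comprises at least:
    a candidate multi-copy region search module, configured to align the microbial target fragment against itself and take the regions corresponding to query sequences whose similarity meets a preset value as candidate multi-copy regions, where the similarity is the product of the coverage and the match rate of the query sequence;
    a multi-copy region verification module, configured to obtain the median copy number of each candidate multi-copy region; if the median copy number of a candidate multi-copy region is greater than 1, the region is recorded as a multi-copy region.
  8. The device for identifying multi-copy regions in a microbial target fragment according to claim 7, characterized in that it further comprises one or more of the following features:
    a. Coverage = (length of the similar sequence / (end position of the query sequence - start position of the query sequence + 1)) × 100%;
    b. the similarity preset value exceeds 80%;
    c. the base positions of the two sequences being compared do not overlap;
    d. the candidate multi-copy region search module further includes the following submodule: a raw-data similarity comparison submodule, configured to perform pairwise alignment of the selected adjacent microbial target fragments and, if any alignment result has a similarity below the preset value, to raise an alert and display the screening conditions corresponding to the target strain;
    e. the multi-copy region verification module is also configured to calculate the 95% confidence interval of the copy number of a candidate multi-copy region.
  9. The device for identifying multi-copy regions in a microbial target fragment according to claim 8, characterized in that, when calculating the 95% confidence interval of the copy number of a candidate multi-copy region, the number of bases in the region is the sample size and the copy number of each base in the region is a sample value.
  10. The device for identifying multi-copy regions in a microbial target fragment according to claim 7, characterized in that, in the candidate multi-copy region search module, when the microbial target fragment consists of multiple incomplete contigs, the contigs are joined together before searching for candidate multi-copy regions.
  11. The device for identifying multi-copy regions in a microbial target fragment according to claim 10, characterized in that it further comprises one or more of the following features:
    a. if a region whose similarity meets the preset value contains different contigs, the region is cut at the original contig junction into two regions, and each region is judged separately as to whether it is a candidate multi-copy region;
    b. the contigs are joined in any order.
  12. The device for identifying multi-copy regions in a microbial target fragment according to claim 7, characterized in that the multi-copy region verification module further includes a submodule for obtaining the median copy number of a candidate multi-copy region, configured to determine the position of each candidate multi-copy region on the microbial target fragment, obtain the number of other candidate multi-copy regions covering each base position of the candidate region to be verified, and calculate the median copy number of that region.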
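  13. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the method for identifying multi-copy regions in a microbial target fragment according to any one of claims 1-6.
  14. A computer processing device comprising a processor and the computer-readable storage medium according to claim 13, characterized in that the processor executes the computer program on the computer-readable storage medium to carry out the steps of the method for identifying multi-copy regions in a microbial target fragment according to any one of claims 1-6.
  15. An electronic terminal, characterized in that it comprises a processor, a memory, and a communicator; the memory stores a computer program, the communicator establishes communication connections with external devices, and the processor executes the computer program stored in the memory so that the terminal performs the method for identifying multi-copy regions in a microbial target fragment according to any one of claims 1-6.
  16. Use of the method for identifying multi-copy regions in a microbial target fragment according to claims 1-6, the device for identifying multi-copy regions in a microbial target fragment according to claims 7-12, the computer-readable storage medium according to claim 13, the computer processing device according to claim 14, or the electronic terminal according to claim 15 for detecting multi-copy regions in a microbial target fragment.
  17. The use according to claim 16, characterized in that the microorganism is selected from one or more of bacteria, viruses, fungi, amoebae, Cryptosporidium, flagellates, microsporidia, piroplasms, Plasmodium, Toxoplasma, Trichomonas, and kinetoplastids.
PCT/CN2020/090175 2020-04-02 2020-05-14 Method and device for identifying multi-copy regions in a microbial target fragment, and use thereof WO2021196356A1 (zh)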

Priority Applications (4)

Application Number Priority Date Filing Date Title
AU2020439391A AU2020439391B2 (en) 2020-04-02 2020-05-14 Method and apparatus for identifying multi-copy region in microbial target fragment, and use
JP2022560044A JP7367234B2 (ja) 2020-04-02 2020-05-14 微生物の標的断片における多コピー領域の識別方法、装置及び応用
US17/916,189 US20230154568A1 (en) 2020-04-02 2020-05-14 Method and device for identifying multi-copy region in microorganism target fragment and use thereof
EP20928847.1A EP4120279A4 (en) 2020-04-02 2020-05-14 METHOD AND DEVICE FOR IDENTIFYING A MULTIPLE COPY REGION IN A MICROBIAL TARGET FRAGMENT AND USE

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010254690.9 2020-04-02
CN202010254690.9A CN111477275B (zh) 2020-04-02 2020-04-02 Method and device for identifying multi-copy regions in a microbial target fragment, and use thereof

Publications (1)

Publication Number Publication Date
WO2021196356A1 true WO2021196356A1 (zh) 2021-10-07

Family

ID=71749593

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/090175 WO2021196356A1 (zh) 2020-04-02 2020-05-14 Method and device for identifying multi-copy regions in a microbial target fragment, and use thereof

Country Status (6)

Country Link
US (1) US20230154568A1 (zh)
EP (1) EP4120279A4 (zh)
JP (1) JP7367234B2 (zh)
CN (1) CN111477275B (zh)
AU (1) AU2020439391B2 (zh)
WO (1) WO2021196356A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6694335B1 (en) * 1999-10-04 2004-02-17 Microsoft Corporation Method, computer readable medium, and system for monitoring the state of a collection of resources
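CN101930502A (zh) * 2010-09-03 2010-12-29 深圳华大基因科技有限公司 Method and system for phenotype gene detection and bioinformatic analysis
CN103810402A (zh) * 2014-02-25 2014-05-21 北京诺禾致源生物信息科技有限公司 Data processing method and device for genomes
CN104870056A (zh) * 2012-10-05 2015-08-26 弗·哈夫曼-拉罗切有限公司 Methods for diagnosing and treating inflammatory bowel disease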

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2003104487A2 (en) * 2002-06-06 2003-12-18 Centre For Addiction And Mental Health Detection of epigenetic abnormalities and diagnostic method based thereon
CN105574361B (zh) * 2015-11-05 2018-11-02 上海序康医疗科技有限公司 Method for detecting genomic copy number variation
JP2019501641A (ja) * 2015-11-12 2019-01-24 サミュエル ウィリアムスSamuel WILLIAMS ナノポア技術を用いた短いdna断片の迅速な配列決定
US10095831B2 (en) * 2016-02-03 2018-10-09 Verinata Health, Inc. Using cell-free DNA fragment size to determine copy number variations
CN106845154B (zh) * 2016-12-29 2022-04-08 浙江安诺优达生物科技有限公司 Device for detecting copy number variation in FFPE samples
US11993811B2 (en) * 2017-01-31 2024-05-28 Myriad Women's Health, Inc. Systems and methods for identifying and quantifying gene copy number variations
US20180235352A1 (en) * 2017-02-23 2018-08-23 Janay Jones Multi purpose personal transport gear that converts from backpack to comfort pad to poncho to hammock
CN108048530B (zh) * 2018-01-23 2021-07-27 广州大学 Method for developing EPICs primers based on EST sequences
CN109234267B (zh) * 2018-09-12 2021-07-30 中国科学院遗传与发育生物学研究所 Genome assembly method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4120279A4 *

Also Published As

Publication number Publication date
CN111477275A (zh) 2020-07-31
CN111477275B (zh) 2020-12-25
US20230154568A1 (en) 2023-05-18
JP7367234B2 (ja) 2023-10-23
EP4120279A4 (en) 2023-11-22
JP2023516504A (ja) 2023-04-19
AU2020439391B2 (en) 2024-02-29
AU2020439391A1 (en) 2022-11-10
EP4120279A1 (en) 2023-01-18

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20928847; Country of ref document: EP; Kind code of ref document: A1)
ENP Entry into the national phase (Ref document number: 2022560044; Country of ref document: JP; Kind code of ref document: A)
ENP Entry into the national phase (Ref document number: 2020928847; Country of ref document: EP; Effective date: 20221010)
NENP Non-entry into the national phase (Ref country code: DE)
ENP Entry into the national phase (Ref document number: 2020439391; Country of ref document: AU; Date of ref document: 20200514; Kind code of ref document: A)