WO2012071685A1 - Method and system for bioinformatics analysis of hpv precise typing - Google Patents

Method and system for bioinformatics analysis of HPV precise typing

Publication number
WO2012071685A1
WO2012071685A1 PCT/CN2010/001943 CN2010001943W WO2012071685A1 WO 2012071685 A1 WO2012071685 A1 WO 2012071685A1 CN 2010001943 W CN2010001943 W CN 2010001943W WO 2012071685 A1 WO2012071685 A1 WO 2012071685A1
Authority
WO
WIPO (PCT)
Prior art keywords
sequence
sample
sequencing
hpv
fragments
Application number
PCT/CN2010/001943
Other languages
French (fr)
Chinese (zh)
Inventor
刘智盛
田仕林
潘荣
Original Assignee
深圳华大基因科技有限公司
Application filed by 深圳华大基因科技有限公司
Priority to CN201080070484.7A (CN103261442B)
Priority to PCT/CN2010/001943 (WO2012071685A1)
Publication of WO2012071685A1
Priority to HK13112598.6A (HK1185113A1)

Classifications

    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G PHYSICS
    • G16 INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00 ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10 Sequence alignment; Homology search

Definitions

  • The invention relates to the field of biological genetic engineering technology, and in particular to a method and system for bioinformatics analysis of precise HPV typing.
  • Background art
  • Human papillomavirus (HPV)
  • High-risk types, such as HPV16
  • Low-risk types, such as HPV6
  • Infection rates range from less than 1% to as high as 50%.
  • More than 100 types of HPV can infect the skin (cutaneous types) or the mucous membranes of the respiratory and anogenital tracts (mucosal types), and more than 40 types can infect the cervix.
  • HPV plays an important role in the initiation, development, progression, and malignant transformation of many tumors, and is considered the virus most closely associated with human tumors.
  • HPV typing is important for developing HPV treatment options, assessing the risk of HPV infection, and studying the regional specificity of HPV infection. Current research therefore suggests that the HPV types present in each sample should be determined, which helps to analyze the pathogenicity of the various HPV types in more detail and to achieve the best clinical prevention and treatment.
  • The detection methods for HPV genotyping in the prior art mainly include the following:
  • ELISA method: the specific reaction between antigen and antibody is used to link the analyte to an enzyme, and the color reaction between the enzyme and its substrate is used for quantitative determination. This method can only identify individual subtypes and has gradually been replaced by other assays.
  • PCR (polymerase chain reaction)
  • Hybrid capture detection method: molecular hybridization with chemiluminescence is used to amplify the signal, and the HPV type is determined by reading the light intensity. This method cannot detect specific HPV types or multiple infections, and its cost is high.
  • PCR combined with hybridization: a method that uses PCR and hybridization together. This method is also time-consuming and procedurally complicated.
  • Gene chip technology: there are many kinds of gene chips; in situ synthesis of oligonucleotides is commonly used. The method has the disadvantages of inaccurate detection results, demanding experimental conditions, and high cost.
  • A technical problem to be solved by the present invention is to provide a method and system for bioinformatics analysis of precise HPV typing that can identify HPV types quickly from gene sequences with high sensitivity and specificity.
  • One aspect of the present invention provides a method for bioinformatics analysis of precise HPV typing, the method comprising: receiving sequencing fragments obtained by high-throughput sequencing technology; aligning the sample linker sequence in each sequencing fragment against a sample linker sequence library to perform the sub-sample operation (assigning fragments to their samples); aligning the sequencing fragments against the reference genome sequences and screening the alignments to determine the HPV type of each sequencing fragment or determine it to be negative; merging the typed sequence fragments by sample and screening according to the number and proportion of sequence fragments supporting each type after merging; and finally confirming the HPV type of each sample or determining the sample to be negative.
  • The method further comprises: after receiving the sequencing sequences, filtering them to remove unqualified sequences.
  • The step of "filtering the sequencing sequences to remove unqualified sequences" further comprises: presetting a sequencing quality threshold and a ratio threshold for unqualified bases; when the proportion of bases in a sequencing sequence whose sequencing quality value is below the sequencing quality threshold exceeds the ratio threshold, treating the sequencing sequence as unqualified and filtering it out; when the number of undetermined bases in a sequencing sequence exceeds 10% of the number of bases in the entire sequence, treating the sequencing sequence as unqualified and filtering it out; and aligning the sequencing sequence against the sequencing linker sequence library and, if a sequencing linker sequence is present in the sequencing sequence, treating it as unqualified and filtering it out.
  • The method further comprises: after performing the sub-sample operation, removing the sample linker sequence from the sequence fragment.
  • The step of "removing the sample linker sequence from the sequence fragment" further comprises: presetting a sequencing quality threshold and a base number threshold for the sample linker sequence; and removing any sequence in which the number of bases in the linker sequence whose sequencing quality value is below the sequencing quality threshold exceeds the base number threshold.
  • The method further comprises: step a, performing a complete match between the sample linker sequence and the sequences in the sample linker sequence library; step b, shortening (degrading) the sample linker sequence by 1-2 bp and performing a complete match against the corresponding portion of the sequences in the sample linker sequence library; step c, allowing the sample linker sequence to contain exactly one base insertion, that is, performing a complete match from the start of the sample linker sequence, treating the first base that cannot be matched as an inserted base, skipping it, and continuing the complete match; step d, allowing the sample linker sequence to contain exactly one base deletion.
  • The final alignment result for the sample linker sequence is determined in the priority order step a > step b > "step c or step d"; sequences with the same sample linker sequence are considered to come from the same sample, which distinguishes the samples; and the sample linker sequence is then removed from the sequences of each sample.
  • The method further comprises: if none of steps a - d yields an alignment result, or a single step yields two results at once, or only step c and step d yield results and do so at the same time, the alignment result is considered indistinguishable and therefore invalid, and the corresponding entire sequence is removed.
  • The step of "screening the aligned sequences" further comprises: after aligning the sequencing fragments obtained by the high-throughput sequencing technique to the reference genome sequences, removing alignment results whose alignment length is below 70% or whose identity is below 85%; retaining the best alignment result for each sequence; and retaining suboptimal results.
  • A suboptimal result satisfies: its identity × alignment length and its alignment score are at least 0.9 times and 0.85 times those of the best result, respectively, and the probability that the sequence is unrelated to the reference sequence is lower than that of the best result.
  • The method further comprises: after merging the typed sequence fragments by sample, normalizing the number of sequence fragments of each sample.
  • Normalizing the number of sequence fragments after merging further comprises: scaling the number of sequences of each sample in each library proportionally, as if the sequencing amount of the library were the average sequencing amount of the ideal case.
  • The step "screening according to the number and proportion of sequence fragments supporting each type after merging" further comprises: after normalization, screening in the following order: if the number of available sequences is less than the mean number of valid sequence fragments of the negative control samples plus four standard deviations, the experiment or sequencing operation is considered unsuccessful; otherwise, if the number of sequence fragments whose alignments support an HPV type is less than a predetermined threshold, the sample is considered negative; if the ratio of the number of sequence fragments supporting an HPV type to the total number of sequence fragments reaches a predetermined threshold or more, the sample is considered infected with that type.
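  • The first screening condition above can be sketched as follows. This is only an illustration: the function name, the use of the sample (rather than population) standard deviation, and the example counts are assumptions that do not come from the patent text.

```python
import statistics


def failure_threshold(negative_counts, n_sd=4):
    """Mean valid-fragment count of the negative controls plus n_sd standard deviations."""
    mean = statistics.mean(negative_counts)
    sd = statistics.stdev(negative_counts)  # sample SD; the text does not say which SD is meant
    return mean + n_sd * sd


# Hypothetical valid-fragment counts observed in negative control samples.
negative_counts = [10, 20, 30]
threshold = failure_threshold(negative_counts)


def run_failed(available_sequences, threshold):
    """A sample with fewer available sequences than the threshold counts as a failed run."""
    return available_sequences < threshold
```

  • A sample whose available sequence count falls below this threshold cannot be told apart from a negative control, so it is reported as a failed experiment or sequencing run before any type is called.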
  • Another aspect of the present invention provides a system for bioinformatics analysis of precise HPV typing, the system comprising: a receiving module for receiving sequencing fragments obtained by high-throughput sequencing technology; a sub-sample module for aligning the sample linker sequence in each sequencing fragment against the sample linker sequence library to perform the sub-sample operation; a sequence type determination module for aligning the sequencing fragments against the reference genome sequences and screening the alignments to determine the HPV type of each sequencing fragment or determine it to be negative; and a sample type determining module for merging the typed sequence fragments by sample, screening according to the number and proportion of sequence fragments supporting each type after merging, and finally confirming the HPV type of each sample or determining the sample to be negative.
  • The receiving module is further configured to: after receiving the sequencing sequences, filter them to remove unqualified sequences.
  • The sub-sample module is further configured to: after the sub-sample operation is performed, remove the sample linker sequence from the sequence fragment.
  • The combined screening module is further configured to: after merging the typed sequence fragments by sample, normalize the number of sequence fragments of each sample.
  • Normalizing the number of sequence fragments after merging further comprises: scaling the number of sequences of each sample in each library proportionally, as if the sequencing amount of the library were the average sequencing amount of the ideal case.
  • The method and system for bioinformatics analysis of precise HPV typing provided by the invention use sequencing technology and analysis methods to identify and confirm HPV types quickly, with high sensitivity and specificity.
  • FIG. 1 is a flow chart showing a method for bioinformatics analysis of HPV accurate typing according to an embodiment of the present invention
  • FIG. 2 is a flow chart showing another embodiment of a method of bioinformatics analysis of HPV exact typing provided by the present invention
  • FIG. 3 is a flow chart showing another embodiment of a method of bioinformatics analysis of HPV precise typing provided by the present invention.
  • FIG. 4 is a flow chart showing another embodiment of a method of bioinformatics analysis of HPV exact typing provided by the present invention.
  • Figure 5 is a flow chart showing one embodiment of a method of bioinformatics analysis of HPV exact typing provided by the present invention
  • FIG. 6 is a schematic structural diagram of a system for bioinformatics analysis of HPV accurate typing according to an embodiment of the present invention
  • FIG. 7 is a schematic diagram showing how the valid sequences at each stage of the analysis vary relative to the original sequences in the method and system for bioinformatics analysis of precise HPV typing provided by the embodiment of the present invention
  • FIG. 8 is a schematic diagram showing the distribution of the number of valid sequence fragments of real samples and negative control samples according to an embodiment of the present invention
  • FIG. 9 is a schematic diagram showing the repeatability results after each sample was sequenced and analyzed 10 times according to an embodiment of the present invention.
  • FIG. 10 is a schematic diagram comparing the negative/positive results measured for all real samples and blood negative samples with the clinical test results according to an embodiment of the present invention.
  • FIG. 11 is a diagram showing the detection results for the plasmid samples in the second type of library according to an embodiment of the present invention.
  • Detailed description
  • The samples used in the examples of the present invention include: 328 real patient tissue samples, blood negative samples, pure water negative samples, and positive samples consisting of plasmids carrying specific HPV types.
  • The strategy that can be employed in the various embodiments is as follows: each sequencing library contains 96 samples, and two types of libraries are designed. The first type contains 82 real patient tissue samples, 6 pure water negative samples, 6 blood negative samples, and 2 plasmid positive samples; the second type contains 90 plasmid positive samples and 6 pure water negative samples. Each library was sequenced 10 times to verify the repeatability of the information analysis, so 50 libraries were sequenced on the machine.
  • FIG. 1 is a flow chart showing a method for bioinformatics analysis of HPV accurate typing according to an embodiment of the present invention.
  • The method 100 for bioinformatics analysis of precise HPV typing comprises: Step 102, receiving sequencing fragments obtained by high-throughput sequencing technology.
  • The high-throughput sequencing technology employed in the present invention may be Illumina GA sequencing technology or another existing high-throughput sequencing technology.
  • Step 104: align the sample linker sequence in each sequencing fragment against the sample linker sequence library to perform the sub-sample operation.
  • The sample linker sequence library used in this embodiment consists of 96 experimentally designed primer-index pairs. (The sample linker sequence library used in the present invention can be designed according to the experimental requirements and the number of samples. When designing the base composition and length of the sample linker sequences, the number of samples tested and the non-homology between different sample linker sequences should be considered together, to ensure that different samples can be distinguished by sample linker alignment.)
  • Step 106: align the sequencing fragments against the reference genome sequences and screen the alignments to determine the HPV type of each sequencing fragment or determine it to be negative.
  • The sequencing fragments obtained by the high-throughput sequencing technology are aligned to the reference genome sequences by any short-sequence mapping program (such as blast), where the reference genome sequences can be taken from the public database NCBI.
  • "Screening the aligned sequences" further comprises: after aligning the sequencing fragments obtained by the high-throughput sequencing technology to the reference genome sequences, removing alignment results whose alignment length is below 70% or whose identity is below 85% (100% identity means that the two sequences are identical); retaining the best alignment result for each sequence; and retaining suboptimal results, where a suboptimal result satisfies: its identity × alignment length and its alignment score are at least 0.9 times and 0.85 times those of the best result, respectively, and the probability that the sequence is unrelated to the reference sequence is 10^3 times lower than that of the best result. It is then judged whether the best result and the suboptimal results of a sequence align to the same type or its subtypes; only sequences whose alignment results point to a single type are kept as valid sequences, and the HPV type of each valid sequence is determined from its alignment, or the sequence is determined to be negative.
  • Step 108: merge the typed sequence fragments by sample, and screen according to the number and proportion of sequence fragments supporting each type after merging; finally confirm the HPV type of each sample or determine the sample to be negative.
  • This embodiment of the method for bioinformatics analysis of precise HPV typing uses bioinformatics analysis methods and technical means to test a large number of samples quickly and to detect the infecting HPV types quickly, with high sensitivity and specificity.
  • FIG. 2 is a flow chart showing another embodiment of a method of bioinformatics analysis of HPV precise typing provided by the present invention.
  • The method 200 for bioinformatics analysis of precise HPV typing includes steps 202, 203, 204, 206, and 208, wherein steps 202, 204, 206, and 208 may have the same or similar technical content as steps 102, 104, 106, and 108 shown in FIG. 1, respectively; for brevity, that content is not repeated here.
  • Step 203 is performed to filter the sequencing sequences and remove unqualified sequences.
  • The step of "filtering the sequencing sequences to remove unqualified sequences" further includes: presetting a sequencing quality threshold and a ratio threshold for unqualified bases. (The low-quality thresholds in the present invention depend on the specific sequencing technology and sequencing environment. For example, if the number of bases whose sequencing quality value is less than 5 exceeds 50% of the number of bases of the entire sequence, the sequence is considered unqualified.)
  • When the proportion of bases in a sequencing sequence whose sequencing quality value is below the sequencing quality threshold (e.g., 5) exceeds the ratio threshold (e.g., 50%), the sequence is considered unqualified and is filtered out.
  • the method for bioinformatics analysis of HPV accurate typing removes the unqualified sequence by filtering the sequencing sequence, thereby further reducing the influence of the unqualified sequence, thereby improving the accuracy of the detection analysis.
  • Figure 3 is a flow chart showing another embodiment of the method of bioinformatics analysis of HPV exact typing provided by the present invention.
  • The method 300 for bioinformatics analysis of precise HPV typing includes steps 302, 304, 305, 306, and 308, wherein steps 302, 304, 306, and 308 may have the same or similar technical content as steps 102, 104, 106, and 108 shown in FIG. 1, respectively; for brevity, that content is not repeated here.
  • Step 305 is performed to remove the sample linker sequence from the sequence fragment.
  • The step of "removing the sample linker sequence from the sequence fragment" further comprises: presetting a sequencing quality threshold (e.g., 5) and a base number threshold (e.g., 3) for the sample linker sequence; and removing any sequence in which the number of bases in the linker sequence whose sequencing quality value is below the sequencing quality threshold exceeds the base number threshold.
  • In the present embodiment, a sequence is removed if more than 3 of the 10 bp (base pairs) of its linker sequence have a sequencing quality value below 5.
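  • The linker quality rule of this embodiment amounts to counting low-quality bases inside the linker. A minimal sketch follows; the function and parameter names are hypothetical, and the quality values are assumed to be already decoded into integers.

```python
def linker_is_low_quality(linker_quals, quality_threshold=5, max_low_bases=3):
    """True if more than max_low_bases linker bases fall below quality_threshold."""
    low = sum(1 for q in linker_quals if q < quality_threshold)
    return low > max_low_bases


# Hypothetical quality values for a 10 bp linker.
four_low = [2, 3, 4, 1, 30, 30, 30, 30, 30, 30]   # four bases below 5: removed
three_low = [2, 3, 4, 30, 30, 30, 30, 30, 30, 30]  # three bases below 5: kept
```

  • Because the comparison is strict ("exceeds"), a linker with exactly three low-quality bases is still accepted under the example thresholds.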
  • Step a: completely match the sample linker sequence against the sequences in the sample linker sequence library;
  • Step b: shorten (degrade) the sample linker sequence by 1-2 bp and completely match it against the corresponding portion of the sequences in the sample linker sequence library;
  • Step c: allow the sample linker sequence to contain exactly one base insertion, that is, perform a complete match from the start of the sample linker sequence, treat the first base that cannot be matched as an inserted base, skip it, and continue the complete match;
  • Step d: allow the sample linker sequence to contain exactly one base deletion, i.e., in the sample
  • The final sample linker alignment result is determined in the priority order step a > step b > "step c or step d". (When handling linker alignments, the same sequence sometimes obtains different alignment results. The priority used to screen the alignment results can be understood as: step a highest, step b second, and steps c and d sharing the same, lowest priority.)
  • If none of steps a - d yields an alignment result, or a single step yields two results at once, or only step c and step d yield results and do so at the same time, the alignment result is considered indistinguishable and therefore invalid, and the corresponding entire sequence is removed.
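  • The prioritized linker matching (step a > step b > "step c or step d") and the ambiguity rules can be sketched as below. This is a hedged illustration, not the patent's implementation: all function names and the example linkers are hypothetical, the linker is assumed to sit at the start of the read, step b is read as the linker having lost its first 1-2 bases, and step d is read as one linker base missing from the read (the source text for step d is truncated).

```python
def one_removed(seq):
    """Return every string obtained by deleting exactly one base from seq."""
    return {seq[:i] + seq[i + 1:] for i in range(len(seq))}


def match_a(read, linker):
    # Step a: the read starts with the complete linker.
    return read.startswith(linker)


def match_b(read, linker):
    # Step b (assumed reading): the linker has lost its first 1-2 bases.
    return any(read.startswith(linker[k:]) for k in (1, 2))


def match_c(read, linker):
    # Step c: one extra base in the read; dropping it restores the linker.
    return linker in one_removed(read[:len(linker) + 1])


def match_d(read, linker):
    # Step d (assumed reading): one linker base is missing from the read.
    return read[:len(linker) - 1] in one_removed(linker)


def assign_sample(read, library):
    """Return the sample ID for read, or None if the match is missing or ambiguous."""
    hits = {
        step: {sample for sample, linker in library.items() if fn(read, linker)}
        for step, fn in (("a", match_a), ("b", match_b), ("c", match_c), ("d", match_d))
    }
    for step in ("a", "b"):
        if hits[step]:
            # Two results within one step cannot be told apart.
            return next(iter(hits[step])) if len(hits[step]) == 1 else None
    if hits["c"] and hits["d"]:
        return None  # only steps c and d matched, and both did
    found = hits["c"] or hits["d"]
    return next(iter(found)) if len(found) == 1 else None


library = {"S1": "ACGTAC", "S2": "TTGCAA"}  # hypothetical sample linkers
```

  • Evaluating all four steps first and then walking them in priority order keeps the "same step, two results" and "c and d together" rules explicit; reads for which `assign_sample` returns None would be discarded.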
  • In this embodiment of the method for bioinformatics analysis of precise HPV typing, the sample linker sequence in each sequencing fragment is aligned against the sample linker sequence library and, after the sub-sample operation, the sample linker sequence is removed from the sequence fragment. This ensures the authenticity and reliability of the HPV typing analysis and safeguards the subsequent HPV typing.
  • Figure 4 is a flow chart showing another embodiment of the method of bioinformatics analysis of HPV exact typing provided by the present invention.
  • The method 400 for bioinformatics analysis of precise HPV typing includes steps 402, 404, 406, 408, 409, and 410, wherein steps 402, 404, and 406 may have the same or similar technical content as steps 102, 104, and 106 shown in FIG. 1, respectively; for brevity, that content is not repeated here.
  • Step 408 is performed to merge the typed sequence fragments by sample. Specifically, step 404 has established which sample each sequence comes from; according to this relationship, the sequences belonging to the same sample are grouped together and their alignment results against the HPV reference genomes are counted.
  • In step 409, the number of sequence fragments of each sample after merging is normalized.
  • Because concentrations are not uniform across the libraries, the sequencing amount differs from library to library.
  • The number of sequences of each sample is therefore scaled proportionally, as if the sequencing amount of its library were the average sequencing amount of the ideal case. That is, the number of merged sequences of each sample is normalized.
  • In step 410, screening is performed according to the number and proportion of sequence fragments supporting each type after normalization, and the HPV type of each sample is finally confirmed or the sample is determined to be negative.
  • The existing information of the sample is screened and filtered.
  • The screening conditions are as follows: if the number of available sequence fragments is less than a certain value (such as 137), the experiment or sequencing operation is considered unsuccessful; if the number of sequence fragments whose alignments support an HPV type is less than a certain threshold (such as 350), the test result is considered negative.
  • If the ratio of the number of sequence fragments whose alignments support a certain HPV type to the total number of sequence fragments reaches a predetermined threshold (such as 12%; the threshold is set for the specific experimental background, considering both the authenticity and the repeatability of the detection) or more, the sample is considered to be infected with this type. The specific value of each threshold depends on the specific experimental conditions.
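  • The three screening conditions can be read as a small decision procedure. The sketch below uses the example thresholds from the text (137, 350, 12%); the function name, the return format, the choice to sum the HPV-supporting fragments over all types, and the treatment of a sample in which no type reaches the ratio threshold as negative are assumptions.

```python
def call_sample(total_valid, type_counts, min_valid=137, min_hpv=350, min_ratio=0.12):
    """Classify one sample from its normalized fragment counts.

    total_valid: number of available (valid) sequence fragments of the sample.
    type_counts: mapping from HPV type to the number of fragments supporting it.
    """
    if total_valid < min_valid:
        return ("failed", [])       # experiment or sequencing run unsuccessful
    if sum(type_counts.values()) < min_hpv:
        return ("negative", [])     # too few HPV-supporting fragments
    types = sorted(t for t, n in type_counts.items() if n / total_valid >= min_ratio)
    # Assumption: if no single type reaches the ratio threshold, report negative.
    return ("positive", types) if types else ("negative", [])


# Hypothetical sample with a double infection and a minor background type.
result = call_sample(5000, {"HPV16": 3000, "HPV18": 700, "HPV6": 100})
```

  • Reporting every type above the ratio threshold, rather than only the most abundant one, is what allows multiple infections to be detected in a single sample.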
  • Figure 5 is a flow chart showing one embodiment of a method of bioinformatics analysis of HPV exact typing provided by the present invention.
  • The method for bioinformatics analysis of precise HPV typing comprises: Step 502: receiving sequencing fragments obtained by high-throughput sequencing technology.
  • Illumina GA high-throughput sequencing technology is employed.
  • Step 504: after receiving the sequencing sequences, filter them to remove unqualified sequences.
  • A sequence is unqualified if: the number of bases with a sequencing quality value below 5 exceeds 50% of the number of bases in the entire sequence; or the number of N bases in the sequencing result exceeds 10% of the number of bases in the entire sequence; or, when the sequence is aligned against the sequencing linker sequence library, a sequencing linker sequence is found in the sequence.
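  • The three filtering criteria of step 504 can be sketched as a single predicate. The names below are hypothetical, quality values are assumed to be decoded integers, and the sequencing linker check is simplified to an exact substring search, whereas the text describes an alignment against a sequencing linker sequence library.

```python
def is_unqualified(seq, quals, sequencing_linkers,
                   quality_threshold=5, low_ratio=0.50, n_ratio=0.10):
    """True if the read fails any of the three filtering criteria."""
    n = len(seq)
    if n == 0:
        return True  # an empty read carries no information
    low = sum(1 for q in quals if q < quality_threshold)
    if low / n > low_ratio:
        return True  # too many low-quality bases
    if seq.upper().count("N") / n > n_ratio:
        return True  # too many undetermined bases
    # Simplification: exact substring search instead of an alignment.
    return any(linker in seq for linker in sequencing_linkers)


linkers = ["AGATCGGAAG"]  # hypothetical sequencing linker
good_read = "ACGTACGTACGTACGTACGT"
good_quals = [30] * 20
```

  • A read is kept only when the predicate returns False; the criteria are checked cheapest first, so a read failing the quality ratio is rejected without scanning for linkers.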
  • Step 506: align the sample linker sequence in each sequence against the sample linker sequence library to perform the sub-sample operation.
  • Step 508: the sample linker sequence is removed from the sequence fragment. Specifically, any sequence in which more than three bases of the linker sequence have a sequencing quality value below 5 is removed. Then: 1) the sample linker sequence is completely matched against the sequences in the sample linker sequence library; 2) the sample linker sequence is shortened (degraded) by 1-2 bp and completely matched against the corresponding portion of the sequences in the sample linker sequence library; 3) the sample linker sequence is allowed exactly one base insertion: a complete match is performed from the start of the sample linker sequence, the first base that cannot be matched is treated as an inserted base, and after skipping it the strict complete match continues; 4) the sample linker sequence is allowed exactly one base deletion.
  • In step 510, the sequencing fragments are aligned against the reference genome sequences, the alignments are screened, and the HPV type of each screened sequencing fragment is determined or the fragment is determined to be negative.
  • The blast mapping program is used to align the sequencing fragments obtained by the high-throughput sequencing technology to the reference genome sequences. After the alignment, alignment results whose alignment length is below 70% or whose identity is below 85% are removed.
  • The best result of each sequence alignment, that is, the first alignment result output by the blast software, is retained, and suboptimal results are also retained. A suboptimal result satisfies: its identity × alignment length and its alignment score are at least 0.9 times and 0.85 times those of the best result, respectively, and the probability that the sequence is unrelated to the reference sequence is 10^3 times lower than that of the best result.
  • It is then judged whether the alignment results of a sequence point to the same type (or its subtypes); only sequences whose selected alignment results point to a single type are kept as valid sequences, and the HPV type of each sequence is determined from its alignment or the sequence is confirmed to be negative.
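  • The alignment screening can be sketched as follows, under several stated assumptions: the `Hit` record and its field names are hypothetical, hits are assumed to arrive in the aligner's output order (so the first surviving hit is the best one), alignment length is measured against the read length, subtypes are assumed to be already collapsed into their type label, and the probability-based condition on suboptimal hits is left out because its direction is not clear from the text.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    hpv_type: str    # HPV type of the reference sequence
    identity: float  # fraction of identical bases, 0..1
    align_len: int   # aligned length in bases
    score: float     # alignment score


def call_read_type(hits, read_len, min_cov=0.70, min_identity=0.85):
    """Return the HPV type of one read, or None if it has no hit or is ambiguous."""
    kept = [h for h in hits
            if h.align_len / read_len >= min_cov and h.identity >= min_identity]
    if not kept:
        return None
    best = kept[0]  # first surviving hit in output order
    suboptimal = [h for h in kept[1:]
                  if h.identity * h.align_len >= 0.9 * best.identity * best.align_len
                  and h.score >= 0.85 * best.score]
    types = {best.hpv_type} | {h.hpv_type for h in suboptimal}
    # A read is valid only if the best and suboptimal hits agree on one type.
    return best.hpv_type if len(types) == 1 else None


# Hypothetical hits for a 100 bp read.
clear = [Hit("HPV16", 0.98, 95, 180), Hit("HPV16", 0.95, 93, 170), Hit("HPV31", 0.86, 75, 100)]
ambiguous = [Hit("HPV16", 0.98, 95, 180), Hit("HPV18", 0.97, 95, 178)]
```

  • In the `clear` example the HPV31 hit passes the basic filter but is too weak to count as suboptimal, so it does not make the read ambiguous; in the `ambiguous` example two near-equal hits of different types cause the read to be discarded.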
  • In step 512, the alignment results of the typed sequences are merged by sample. Specifically, step 506 has established which sample each sequence comes from; according to this relationship, the sequences belonging to the same sample are grouped together, and their alignment results against the HPV reference genomes are counted.
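  • Merging by sample is a grouping and counting step. A minimal sketch, assuming two hypothetical lookup tables produced by the earlier steps (read ID to sample, and read ID to HPV type for valid reads only):

```python
from collections import Counter, defaultdict


def merge_by_sample(read_to_sample, read_to_type):
    """Count, for each sample, how many valid reads support each HPV type."""
    counts = defaultdict(Counter)
    for read_id, sample in read_to_sample.items():
        hpv_type = read_to_type.get(read_id)
        if hpv_type is not None:  # reads without a valid type are not counted
            counts[sample][hpv_type] += 1
    return counts


# Hypothetical inputs.
read_to_sample = {"r1": "S1", "r2": "S1", "r3": "S2", "r4": "S1"}
read_to_type = {"r1": "HPV16", "r2": "HPV16", "r3": "HPV18"}  # r4 had no valid type
merged = merge_by_sample(read_to_sample, read_to_type)
```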
  • The number of merged sequences of each sample is normalized:
  • $\mathrm{sample\_read\_num\_STD} = \mathrm{sample\_read\_num\_ori} \times (150000 / \mathrm{read\_num\_ori})$, where sample_read_num_STD is the number of sequences of the sample after normalization, sample_read_num_ori is the actual number of sequences of the sample, and read_num_ori is the number of sequences obtained by sequencing the library to which the sample belongs.
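  • As a worked example of this normalization, the sketch below applies the formula with the constant 150000 from the text; the function name and the example counts are hypothetical.

```python
def normalize_count(sample_read_num_ori, read_num_ori, ideal_library_reads=150000):
    """Scale a sample's read count as if its library had the ideal sequencing amount."""
    return sample_read_num_ori * (ideal_library_reads / read_num_ori)


# A library sequenced twice as deeply as the ideal halves each sample's count.
deep = normalize_count(1000, 300000)
# A library sequenced below the ideal amount scales counts up.
shallow = normalize_count(200, 100000)
```

  • After this scaling, fixed thresholds such as a minimum number of valid fragments can be applied to samples from libraries of different sequencing depth.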
  • Step 516: screen according to the number and proportion of sequence fragments supporting each type after normalization, and finally confirm the HPV type of each sample or determine the sample to be negative.
  • The screening is performed in the following order: if the number of available sequences is less than 137, the experiment or sequencing operation is considered unsuccessful; otherwise, if the number of sequence fragments whose alignments support an HPV type is less than 350, the sample is considered negative.
  • If the sequence fragments whose alignments support an HPV type account for more than 12% of the total number of sequence fragments, the sample is considered infected with that type; the HPV types infecting each sample are thus finally determined, or the sample is determined to be negative.
  • FIG. 6 is a schematic structural diagram of a system for bioinformatics analysis of precise HPV typing according to an embodiment of the present invention.
  • A system 600 for bioinformatics analysis of precise HPV typing includes: a receiving module 602, a sub-sample module 604, a sequence type determining module 606, and a sample type determining module 608, wherein:
  • The receiving module 602 is configured to receive the sequencing fragments obtained by the high-throughput sequencing technology.
  • The sub-sample module 604 is configured to align the sample linker sequence in each sequencing fragment against the sample linker sequence library to perform the sub-sample operation.
  • The sequence type determining module 606 is configured to align the sequencing fragments against the reference genome sequences and screen the alignments to determine the HPV type of each sequencing fragment or determine it to be negative.
  • The sample type determining module 608 is configured to merge the typed sequence fragments by sample, screen according to the number and proportion of sequence fragments supporting each type after merging, and finally confirm the HPV type of each sample or determine the sample to be negative.
  • the receiving module is further configured to: after receiving the sequencing sequence, filtering the sequencing sequence to remove the unqualified sequence.
  • filtering the sequencing sequence For details of the specific process, refer to the description in the method embodiment, and details are not described herein again.
  • the sub-sample module is further configured to: after the sub-sample operation is performed, remove the sample linker sequence from the sequence fragment.
  • the combined screening module is further configured to: after combining the determined sequence segments by samples, performing the combined number of sequence fragments of the samples standardization.
  • standardizing the number of sequence fragments after combining the samples further comprises: proportionally the number of sequences owned by each sample in each library
  • the amount of sequencing scaled to the library is the average amount of sequencing in the ideal case.
  • The embodiment of the system for bioinformatics analysis of HPV precise typing uses bioinformatics analysis methods and technical means to test large numbers of samples rapidly and to detect the infecting HPV type quickly, with high sensitivity and specificity.
  • FIG. 7 is a schematic diagram showing how the proportion of valid sequences relative to the original sequences changes at each stage of the analysis performed by the method and system for bioinformatics analysis of HPV precise typing provided by the embodiment of the present invention.
  • The abscissa represents the sequencing library number and the ordinate represents the proportion of valid sequences relative to the original sequences.
  • The Filter curve shows, for the different sequencing libraries, the proportion of valid sequences relative to the original sequences after the sequencing reads have been filtered;
  • the Lib-match curve shows this proportion after the separation by sample has been completed;
  • the Final curve shows this proportion after the HPV type of the sequences has been determined.
  • The sequence utilization rate of all 50 sequencing libraries in this example exceeds 80%.
  • FIG. 8 is a schematic diagram showing the distribution of the number of valid sequence fragments in the real samples and the negative control samples provided by the embodiment of the present invention.
  • The mean number of valid sequence fragments in the negative control samples is 19.82.
  • The mean plus four times the standard deviation of the number of valid sequence fragments is 136.98.
  • Using 137 valid sequence fragments as the cutoff for deciding whether the experiment or sequencing run succeeded therefore effectively distinguishes real samples from negative control samples.
  • FIG. 9 is a schematic diagram showing the repeatability results obtained after each sample was sequenced and analyzed 10 times, according to the embodiment of the present invention.
  • The abscissa represents the cutoff value used to call a positive detection result, and the ordinate represents the average repetition rate over all samples.
  • As those skilled in the art can clearly see from FIG. 9, for all samples, whether sequenced in Hong Kong or Shenzhen, when the number of sequence fragments supporting an HPV type is used as the cutoff for a positive detection result, the repeatability of the repeated analyses is as high as 99%, which fully reflects the stability of the present invention for HPV detection.
  • FIG. 10 is a schematic diagram comparing the negative and positive results measured for all real samples with the blood negative samples and the clinical test results, according to the embodiment of the present invention.
  • Blood is a confirmed negative sample without HPV infection. Patients with a test result greater than 1 were clinically confirmed as positive for HPV infection.
  • The positive HPV infection calls made in this embodiment agree with the clinical test results in most cases.
  • The value of 350 separates the blood negative samples from the positive samples and avoids false positives. Because the clinical test results are themselves not entirely correct, the detection results of this embodiment are sufficient to demonstrate the accuracy of the present invention.
  • FIG. 11 is a schematic diagram showing the detection results for the plasmid samples in the second class of library provided by the embodiment of the present invention.
  • The abscissa indicates the HPV type loaded into the plasmid, and the ordinate indicates the proportion of sequence fragments supporting the corresponding HPV type in the analysis of this example. As those skilled in the art can clearly see from FIG. 11, using the proportion of sequence fragments that support a given HPV type to decide which type a sample is infected with allows the specific type in the sample to be detected effectively and specifically.
  • HBB; sample 43: HPV6; sample 75: HBB
    sample 12: -; sample 44: HBB; sample 76: HBB
    sample 13: HBB; sample 45: -; sample 77: HBB
    sample 14: HPV59; sample 46: HBB; sample 78: HBB
    sample 15: HPV16; sample 47: -; sample 79: HBB
    sample 16: HBB; sample 48: HBB; sample 80: HBB
    sample 17: HBB; sample 49: HBB; sample 81: HBB
    sample 18: HBB; sample 50: HBB; sample 82: HBB
    sample 19: HBB; sample 51: HBB; plasmid (type 33): HPV33
    sample 20: HPV16; sample 52: HBB; plasmid (type 33): HPV33
    sample 21: HBB; sample 53: HBB; blood negative sample: HBB
    sample 22: HBB; sample 54: HBB; blood negative sample: HBB
    sample 23: HPV11; sample 55: HBB; blood negative sample: HBB
    sample 24: HBB; sample 56: HBB; blood negative sample: HBB
    sample 25: HBB; sample 57: HBB; blood negative sample: HBB
    sample 26: HBB; sample 58: -; blood negative sample: HBB
    sample 27: HBB; sample 59: HBB; pure water negative sample: -
    sample 28: HBB; sample 60: HBB; pure water negative sample: -
    sample
  • Table 1 shows the detection results for one sample library provided by the experimental example of the present invention; the library belongs to the first class of library. "HBB" indicates that the test result is negative, and "-" indicates that fewer than 137 sequences were detected because of a sample problem or an experimental problem, so the test of that sample is considered to have failed.
  • The embodiments of the method and system for bioinformatics analysis of HPV precise typing provided by the present invention use bioinformatics analysis methods and technical means to test large numbers of samples rapidly and to detect the infecting HPV type quickly, with high sensitivity and specificity.
  • In the embodiments of the method and system provided by the present invention, filtering the sequencing reads and removing unqualified sequences reduces the influence of unqualified sequences and thereby improves the accuracy of the detection analysis.
  • In the embodiments of the method and system provided by the present invention, the sample linker sequence in each sequencing fragment is aligned against the sample linker sequence library to separate the fragments by sample, and the sample linker sequence is then removed from the sequence fragments, which ensures the authenticity and reliability of the HPV typing analysis and avoids interfering with the subsequent HPV typing.

Abstract

Disclosed are a method and system for bioinformatics analysis of HPV precise typing. The method comprises: receiving sequencing fragments obtained by a high-throughput sequencing technique; comparing the sample linker sequence in the sequencing fragments with a sample linker sequence library to separate the samples; comparing the sequencing fragments with reference genome sequences and filtering the compared sequences; determining the HPV type of the filtered sequence fragments or determining them to be negative; combining the typed sequence fragments sample by sample; filtering according to the quantity and ratio of the combined sequence fragments that support the corresponding type; and finally identifying the HPV type of each sample or identifying the sample as negative. The method and system use bioinformatics analysis methods and technical means to test large numbers of samples rapidly and to detect the infecting HPV type rapidly, with relatively high sensitivity and specificity.

Description

Method and system for bioinformatics analysis of HPV precise typing

Technical Field

The present invention relates to the field of biological genetic engineering, and in particular to a method and system for bioinformatics analysis of HPV precise typing.

Background
Human papillomavirus (HPV) is an epitheliotropic virus that is classified by pathogenicity into high-risk types (for example HPV16, 18, 31, 33 and 45) and low-risk types (for example HPV6, 11, 42, 43 and 44). In the general population, infection rates range from below 1% to as high as 50%. More than 100 HPV types can infect the skin (cutaneous types) or the mucosa of the respiratory and anogenital tracts (mucosal types), and more than 40 types can infect the cervix. HPV plays an important role in the initiation, occurrence, progression and even malignant transformation of many tumors, and is therefore regarded as the tumor virus most closely associated with human tumors.
Accurate detection of HPV infection can improve the sensitivity of lesion screening for HPV-related tumors, particularly cervical cancer in women, and improve the means of prevention and treatment. Studies combining typing with clinical findings have confirmed that different HPV subtypes differ considerably in carcinogenicity. HPV typing is of great significance for formulating HPV treatment plans, assessing the risk level of an HPV infection, and characterizing the regional specificity of HPV infection. Current research therefore considers it necessary to type the HPV present in each sample, which helps to analyze the pathogenicity of the various HPV types in more detail and to achieve the best clinical prevention and treatment outcomes.
At present, the detection methods used for HPV genotyping in the prior art mainly include the following:
1. ELISA: the specific reaction between antigen and antibody is used to link the analyte to an enzyme, and the color reaction produced by the enzyme and its substrate is used for quantitative determination. This method can only identify individual subtypes and has gradually been replaced by other detection methods.
2. PCR (Polymerase Chain Reaction) detection: the extracted DNA is amplified to detect HPV infection. Universal-primer PCR and real-time fluorescent quantitative PCR are currently in common use. This method has a high false-positive rate, is complicated and time-consuming, and cannot accurately diagnose multiple infections.
3. Hybrid capture detection: molecular hybridization with chemiluminescence is used to amplify the signal, and the HPV type is determined from the intensity of the light. This method cannot detect specific HPV types or multiple infections, and it is costly.
4. PCR combined with hybridization: a method that uses PCR together with hybridization. It is likewise time-consuming and complicated.
5. Gene chip technology: there are several varieties of gene chip technology, the most commonly used being in situ synthesis of oligonucleotides. This method gives inaccurate results, places high demands on experimental conditions, and is costly.
In summary, providing an HPV typing detection technique with high sensitivity, high specificity and high accuracy has become a technical problem to be solved urgently in this field.

Summary of the Invention
One technical problem to be solved by the present invention is to provide a method and system for bioinformatics analysis of HPV precise typing that can identify the HPV type of gene sequences rapidly and with high sensitivity and specificity.
One aspect of the present invention provides a method for bioinformatics analysis of HPV precise typing, the method comprising: receiving sequencing fragments obtained by high-throughput sequencing; aligning the sample linker sequence in each sequencing fragment against a sample linker sequence library to separate the fragments by sample; aligning the sequencing fragments against reference genome sequences, screening the aligned sequences, and determining the HPV type of each screened sequence fragment or determining it to be negative; merging the typed sequence fragments by sample, and screening according to the number and proportion of merged sequence fragments supporting each type; and finally confirming the HPV type of each sample or determining the sample to be negative.
In one embodiment of the method provided by the present invention, the method further comprises: after the sequencing reads are received, filtering them to remove unqualified sequences.
In one embodiment of the method provided by the present invention, the step "filtering the sequencing reads to remove unqualified sequences" further comprises: presetting a sequencing quality threshold and a proportion threshold for unqualified bases; when bases in a sequencing read have sequencing quality values below the sequencing quality threshold, and the proportion of such bases among all bases of the read exceeds the proportion threshold, the read is considered unqualified and is filtered out; when the number of undetermined bases in the sequencing result of a read exceeds 10% of the number of bases in the whole read, the read is considered unqualified and is filtered out; and when comparison with the sequencing adapter sequence library shows that a sequencing adapter sequence is present in a read, the read is unqualified and is filtered out.
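The three filtering rules above can be sketched as a single predicate. This is a minimal illustration, not the patented implementation: the `Read` type, the function name, the default quality threshold of 20 and the default proportion threshold of 0.5 are assumptions made for the example (the source only says these two thresholds are preset), and adapter detection is simplified to an exact substring search. Only the 10% limit on undetermined bases comes from the text.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Read:
    seq: str          # base calls, with "N" marking an undetermined base
    quals: List[int]  # per-base quality values, same length as seq


def is_unqualified(read: Read,
                   qual_threshold: int = 20,       # hypothetical preset value
                   low_qual_ratio: float = 0.5,    # hypothetical preset value
                   max_n_ratio: float = 0.10,      # 10% limit stated in the text
                   adapters: Sequence[str] = ()) -> bool:
    """Return True if the read should be filtered out."""
    n = len(read.seq)
    if n == 0:
        return True
    # Rule 1: too large a share of bases falls below the quality threshold.
    low = sum(1 for q in read.quals if q < qual_threshold)
    if low / n > low_qual_ratio:
        return True
    # Rule 2: undetermined bases exceed 10% of the read.
    if read.seq.upper().count("N") / n > max_n_ratio:
        return True
    # Rule 3: a sequencing adapter is present (exact substring, a simplification).
    return any(adapter in read.seq for adapter in adapters)
```

A read is kept only when the predicate returns `False`; in practice the adapter check would more likely tolerate mismatches, which this sketch does not attempt.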
In one embodiment of the method provided by the present invention, the method further comprises: after the separation by sample, removing the sample linker sequence from the sequence fragments.
In one embodiment of the method provided by the present invention, the step "removing the sample linker sequence from the sequence fragments" further comprises: presetting a sequencing quality threshold and a base-count threshold for the sample linker sequence; and removing sequences in which the number of linker bases whose sequencing quality value is below the sequencing quality threshold exceeds the base-count threshold.
In one embodiment of the method provided by the present invention, the method further comprises:
step a: performing an exact-match operation between the sample linker sequence and the sequences in the sample linker sequence library;
step b: degrading the sample linker sequence by 1-2 bp and performing an exact-match operation against the corresponding portion of the sequences in the sample linker sequence library;
step c: allowing exactly one inserted base in the sample linker sequence, that is, exact matching proceeds from the start of the sample linker sequence, and when one base fails to match it is treated as an inserted base and skipped, after which exact matching continues;
step d: allowing exactly one deleted base in the sample linker sequence, that is, the deletion of any one base in the sample linker sequence is simulated and the exact-match operation is then performed;
after steps a-d are completed, the final alignment result for the sample linker sequence is determined in the priority order step a > step b > "step c or step d"; sequences aligned to the same sample linker sequence are considered to come from the same sample, which distinguishes the samples; and the sample linker sequence is removed from the sequences of each sample.
In one embodiment of the method provided by the present invention, the method further comprises: if none of steps a-d yields an alignment result, or a single step yields two results at once, or only steps c and d yield results and both do so, the alignment result is judged to be invalid information because it cannot be resolved, and the whole corresponding sequence is removed.
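The priority scheme of steps a-d and the discard rules can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: all function names are hypothetical, the read is assumed to begin with its sample linker, the 1-2 bp lost in step b are assumed to be lost from the start of the linker, and linkers are assumed to be longer than 2 bp.

```python
from typing import Dict, Optional


def match_exact(read: str, index: str) -> bool:
    # Step a: the read starts with the full linker.
    return read.startswith(index)


def match_trimmed(read: str, index: str) -> bool:
    # Step b: the linker has lost its first 1 or 2 bases (assumed 5' loss).
    return any(len(index) > k and read.startswith(index[k:]) for k in (1, 2))


def match_one_insertion(read: str, index: str) -> bool:
    # Step c: match from the start; skip the first mismatching read base
    # as an inserted base, then require an exact match for the remainder.
    i = 0
    while i < len(index) and i < len(read) and read[i] == index[i]:
        i += 1
    if i == len(index):
        return True
    return read[i + 1:].startswith(index[i:])


def match_one_deletion(read: str, index: str) -> bool:
    # Step d: simulate deleting any single base of the linker.
    return any(read.startswith(index[:i] + index[i + 1:])
               for i in range(len(index)))


def assign_sample(read: str, library: Dict[str, str]) -> Optional[str]:
    """Return the sample id for the read, or None if it must be discarded."""
    def hits(matcher):
        return [sid for sid, idx in library.items() if matcher(read, idx)]

    for matcher in (match_exact, match_trimmed):  # steps a then b
        found = hits(matcher)
        if len(found) == 1:
            return found[0]
        if len(found) > 1:   # one step matched two linkers: unresolvable
            return None
    ins, dele = hits(match_one_insertion), hits(match_one_deletion)
    if ins and dele:         # steps c and d both matched: unresolvable
        return None
    found = ins or dele
    return found[0] if len(found) == 1 else None
```

For example, with `library = {"S1": "ACGTAC", "S2": "TTGCAA"}`, the read `"ACGGTACTTTT"` fails steps a and b but is assigned to `S1` in step c, because skipping the extra `G` restores the linker.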
In one embodiment of the method provided by the present invention, the step "screening the aligned sequences" further comprises: aligning the sequencing fragments obtained by high-throughput sequencing to the reference genome sequences; after alignment, screening out and removing sequences whose alignment length is below 70% or whose identity is below 85%; retaining the best result among the alignment results of each sequence; retaining suboptimal results, where a suboptimal result satisfies: its identity × alignment length and its alignment score are greater than or equal to 0.9 times and 0.85 times those of the best result, respectively, and the probability that its match between the sequence and the reference sequence is unrelated is less than 10^3 times that of the best result; and judging whether the best result and the suboptimal results of each sequence align to the same type or its subtypes, and if so, retaining the sequences whose alignment results align to only one type as valid sequences and determining the HPV type to which each valid sequence aligns, or determining it to be negative.
In one embodiment of the method provided by the present invention, the method further comprises: after the typed sequence fragments are merged by sample, normalizing the number of merged sequence fragments of each sample.
In one embodiment of the method provided by the present invention, normalizing the number of merged sequence fragments of each sample further comprises: scaling the number of sequences of each sample in each library proportionally, as if the sequencing volume of the library were the ideal average sequencing volume.
In one embodiment of the method provided by the present invention, the step "screening according to the number and proportion of merged sequence fragments supporting each type" further comprises: after normalization, screening under the following conditions in order: if the number of available sequences is less than the mean number of valid sequence fragments of the negative control samples plus four times their standard deviation, the experiment or sequencing run is considered to have failed; otherwise, if the number of sequence fragments whose alignment results support HPV types is less than a predetermined threshold, the sample is considered negative; and if the sequence fragments whose alignment results support a given HPV type make up a proportion of the total number of sequence fragments at or above a predetermined threshold, the sample is considered infected with that type.
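The normalization and the ordered screening can be sketched as below, using the concrete values given for the embodiment elsewhere in this description (a 150000-read reference library size, cutoffs of 137 and 350 fragments, and a 12% proportion). Several details are assumptions made for illustration: the function and constant names, counting the 350-fragment cutoff over all HPV types together, using the sample's total available fragments as the denominator of the proportion, treating the 12% threshold as inclusive, and returning "negative" when no single type reaches it. The standard deviation of 29.29 used in the usage note is back-computed from the reported mean (19.82) and cutoff (136.98), not quoted from the source.

```python
from typing import Dict, List, Tuple

IDEAL_LIBRARY_READS = 150_000  # reference sequencing volume per library
MIN_AVAILABLE = 137            # negative-control mean + 4 SD (136.98), rounded up
MIN_HPV_FRAGMENTS = 350        # below this, the sample is called negative
MIN_TYPE_FRACTION = 0.12       # share of fragments needed to call a type


def normalize(sample_read_num_ori: float, read_num_ori: float) -> float:
    """sample_read_num_STD = sample_read_num_ori * (150000 / read_num_ori)."""
    return sample_read_num_ori * (IDEAL_LIBRARY_READS / read_num_ori)


def failure_cutoff(mean: float, std: float) -> float:
    """Mean of the negative controls plus four standard deviations."""
    return mean + 4 * std


def call_sample(type_counts: Dict[str, int],
                sample_total: int,
                library_total: int) -> Tuple[str, List[str]]:
    """Classify one sample as failed, negative, or positive for given types.

    type_counts: raw number of fragments supporting each HPV type
    sample_total: raw number of available fragments for the sample
    library_total: raw number of sequences in the sample's library
    """
    total = normalize(sample_total, library_total)
    if total < MIN_AVAILABLE:
        return "failed", []
    hpv = {t: normalize(c, library_total) for t, c in type_counts.items()}
    if sum(hpv.values()) < MIN_HPV_FRAGMENTS:
        return "negative", []
    infected = sorted(t for t, c in hpv.items()
                      if c / total >= MIN_TYPE_FRACTION)
    return ("positive", infected) if infected else ("negative", [])
```

As a usage example, `failure_cutoff(19.82, 29.29)` gives approximately 136.98, which is where the 137-fragment cutoff comes from, and a sample with 400 HPV16 and 200 HPV18 fragments out of 1000 in a 150000-read library is called positive for both types.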
Another aspect of the present invention provides a system for bioinformatics analysis of HPV precise typing, the system comprising: a receiving module, configured to receive sequencing fragments obtained by high-throughput sequencing; a sub-sample module, configured to align the sample linker sequence in each sequencing fragment against a sample linker sequence library to separate the fragments by sample; a sequence type determining module, configured to align the sequencing fragments against reference genome sequences, screen the aligned sequences, and determine the HPV type of each screened sequence fragment or determine it to be negative; and a sample type determining module, configured to merge the typed sequence fragments by sample, screen according to the number and proportion of merged sequence fragments supporting each type, and finally confirm the HPV type of each sample or determine the sample to be negative.
In one embodiment of the system provided by the present invention, the receiving module is further configured to filter the sequencing reads after receiving them, removing unqualified sequences.
In one embodiment of the system provided by the present invention, the sub-sample module is further configured to remove the sample linker sequence from the sequence fragments after the separation by sample.
In one embodiment of the system provided by the present invention, the merging and screening module is further configured to normalize the number of merged sequence fragments of each sample after the typed sequence fragments have been merged by sample.
In one embodiment of the system provided by the present invention, normalizing the number of merged sequence fragments of each sample further comprises: scaling the number of sequences of each sample in each library proportionally, as if the sequencing volume of the library were the ideal average sequencing volume.
The method and system for bioinformatics analysis of HPV precise typing provided by the present invention use sequencing technology and analysis means to identify and confirm HPV types rapidly, with high sensitivity and specificity.

Brief Description of the Drawings
FIG. 1 is a flowchart of a method for bioinformatics analysis of HPV precise typing provided by an embodiment of the present invention;
FIG. 2 is a flowchart of another embodiment of the method for bioinformatics analysis of HPV precise typing provided by the present invention;
FIG. 3 is a flowchart of another embodiment of the method for bioinformatics analysis of HPV precise typing provided by the present invention;
FIG. 4 is a flowchart of another embodiment of the method for bioinformatics analysis of HPV precise typing provided by the present invention;
FIG. 5 is a flowchart of a specific implementation of the method for bioinformatics analysis of HPV precise typing provided by the present invention;
FIG. 6 is a schematic structural diagram of a system for bioinformatics analysis of HPV precise typing provided by an embodiment of the present invention;
FIG. 7 is a schematic diagram showing how the proportion of valid sequences relative to the original sequences changes at each stage of the analysis performed by the method and system provided by an embodiment of the present invention;
FIG. 8 is a schematic diagram showing the distribution of the number of valid sequence fragments in the real samples and the negative control samples provided by an embodiment of the present invention;
FIG. 9 is a schematic diagram showing the repeatability results obtained after each sample was sequenced and analyzed 10 times, according to an embodiment of the present invention;
FIG. 10 is a schematic diagram comparing the negative and positive results measured for all real samples with the blood negative samples and the clinical test results, according to an embodiment of the present invention;
FIG. 11 is a schematic diagram showing the detection results for the plasmid samples in the second class of library provided by an embodiment of the present invention.

Detailed Description
The samples used in the embodiments of the present invention comprise: 328 real patient tissue samples, blood negative samples, pure-water negative samples, and plasmid positive samples loaded with specific HPV types.
The sequencing strategy that can be used in the embodiments is as follows: each sequencing library contains 96 samples, and two classes of library are designed. The first class comprises 82 real patient tissue samples, 6 pure-water negative samples, 6 blood negative samples and 2 plasmid positive samples; the second class comprises 90 plasmid positive samples and 6 pure-water negative samples. Each library is sequenced 10 times so that the repeatability of the information analysis can be verified. A total of 50 libraries are therefore sequenced.
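The library counts above can be checked with a few lines of arithmetic. The sketch assumes, as an inference that the source does not state explicitly, that the 328 patient samples fill four first-class libraries of 82 samples each and that there is a single second-class library; under that reading, five distinct libraries sequenced 10 times each give the 50 runs mentioned.

```python
# Composition of the two library classes as described in the text.
first_class = {"patient": 82, "water_negative": 6,
               "blood_negative": 6, "plasmid_positive": 2}
second_class = {"plasmid_positive": 90, "water_negative": 6}

# Both classes fill a 96-sample library.
assert sum(first_class.values()) == 96
assert sum(second_class.values()) == 96

n_first_class = 328 // first_class["patient"]  # 4 libraries (inferred)
n_second_class = 1                             # assumed: one plasmid library
replicates = 10

total_runs = (n_first_class + n_second_class) * replicates
print(total_runs)
```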
The present invention is described more fully below with reference to the accompanying drawings, which illustrate exemplary embodiments of the invention.
FIG. 1 is a flowchart of a method for bioinformatics analysis of HPV precise typing provided by an embodiment of the present invention.
As shown in FIG. 1, the method 100 for bioinformatics analysis of HPV precise typing comprises:
Step 102: receiving sequencing fragments obtained by high-throughput sequencing. The high-throughput sequencing technology used in the present invention may be Illumina GA sequencing or another existing high-throughput sequencing technology.
Step 104: aligning the sample linker sequence in each sequencing fragment against the sample linker sequence library to separate the fragments by sample. The sample linker sequence library used in the embodiment is a set of 96 experimentally designed primer-index pairs. (The sample linker sequence library can be designed according to the experimental requirements and the number of samples. When designing it, the base composition and length of the sample linker sequences should take into account both the number of samples to be tested and the non-homology of the different sample linker sequences, so that different samples can be distinguished by alignment of the sample linker sequences.)
Step 106: align the sequencing fragments against the reference genome sequences, filter the aligned sequences, and determine the HPV type of each retained sequence fragment or determine that it is negative. For example, any short-sequence mapping program (such as BLAST) may be used to align the sequencing fragments obtained by high-throughput sequencing to the reference genome sequences. The reference genome sequences may be taken from the public NCBI database, which can be accessed at http://www.ncbi.nlm.nih.gov/gene?term=hvp.
In one embodiment of the present invention, "filtering the aligned sequences" further includes the following. After the sequencing fragments obtained by high-throughput sequencing have been aligned to the reference genome sequences, sequences whose alignment length is below 70% or whose identity is below 85% (100% meaning that the two sequences are identical) are screened out and removed. For each sequence, the best of its alignment results is retained, and suboptimal results are also retained. A suboptimal result is one for which the product of identity and alignment length is at least 0.9 times that of the best result, the alignment score is at least 0.85 times that of the best result, and the probability that the match between the sequence and the reference sequence is unrelated is less than 10^3 times that of the best result. It is then determined whether the best result and the suboptimal results of each sequence align to the same type or its subtypes. Only sequences whose retained alignment results all align to a single type are kept as valid sequences, and the HPV type to which each valid sequence aligns, or a negative result, is determined.
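The filtering rule above can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the invention: the `Hit` record and the function name `call_read_type` are hypothetical, the hits are assumed to be listed best first, the 70% alignment-length threshold is assumed to be relative to the read length, subtypes are assumed to have already been mapped to their parent type, and the aligner's E-value is used as a stand-in for the "probability that the match is unrelated".

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    hpv_type: str    # type of the reference hit (subtypes already mapped to their type)
    identity: float  # percent identity, 0-100
    aln_len: int     # alignment length in bases
    score: float     # alignment score
    evalue: float    # assumed stand-in for the chance-match probability


def call_read_type(hits: List[Hit], read_len: int) -> Optional[str]:
    """Return the single HPV type supported by a read, or None if it is not a valid sequence."""
    # Remove hits with alignment length below 70% or identity below 85%.
    kept = [h for h in hits if h.aln_len >= 0.70 * read_len and h.identity >= 85.0]
    if not kept:
        return None
    best = kept[0]  # assumption: hits are ordered best first
    # Retain suboptimal hits that are close enough to the best hit.
    # Note: with "<" as in the text, a best E-value of exactly 0 retains no suboptimal hit.
    retained = [best] + [
        h for h in kept[1:]
        if h.identity * h.aln_len >= 0.9 * best.identity * best.aln_len
        and h.score >= 0.85 * best.score
        and h.evalue < 1e3 * best.evalue
    ]
    # The read is valid only if every retained hit points to the same type.
    types = {h.hpv_type for h in retained}
    return best.hpv_type if len(types) == 1 else None
```

A read whose best and suboptimal hits point to different types returns `None` and would be discarded rather than assigned to either type.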
Step 108: merge the typed sequence fragments by sample, and filter according to the number and proportion of sequence fragments that support each type after merging, so as to finally determine the HPV type(s) of each sample or determine that the sample is negative.
Specific implementations of the foregoing steps are described in further detail, by way of example, in the embodiments below.
One embodiment of the method for bioinformatics analysis of precise HPV typing provided by the present invention uses bioinformatics analysis methods and techniques to test a large number of samples rapidly and to identify the infecting HPV types quickly, with high sensitivity and specificity.
FIG. 2 is a flowchart of another embodiment of the method for bioinformatics analysis of precise HPV typing provided by the present invention.
As shown in FIG. 2, the method 200 for bioinformatics analysis of precise HPV typing includes steps 202, 203, 204, 206 and 208. Steps 202, 204, 206 and 208 may carry out the same or similar technical content as steps 102, 104, 106 and 108 shown in FIG. 1, respectively; for brevity, that content is not repeated here.
As shown in FIG. 2, step 203 is performed after step 202: the sequencing sequences are filtered to remove unqualified sequences.
Specifically, the step of "filtering the sequencing sequences to remove unqualified sequences" further includes presetting a sequencing quality threshold and a proportion threshold for unqualified bases. (In the present invention the low-quality thresholds depend on the specific sequencing technology and sequencing environment. For example, a sequence is considered unqualified if the number of bases with a sequencing quality value below 5 exceeds 50% of the total number of bases in the sequence.)
When the bases of a sequencing sequence whose sequencing quality value is below the sequencing quality threshold (e.g., 5) make up a proportion of all bases in the sequence that exceeds the proportion threshold (e.g., 50%), the sequencing sequence is considered unqualified and is filtered out.
When the number of undetermined bases in a sequencing sequence (e.g., N in Illumina GA sequencing results) exceeds 10% of the total number of bases in the sequence, the sequencing sequence is considered unqualified and is filtered out.
When a sequencing sequence is aligned against the sequencing adapter sequence library and is found to contain a sequencing adapter sequence, the sequencing sequence is considered unqualified and is filtered out.
By filtering the sequencing sequences and removing unqualified sequences, the method for bioinformatics analysis of precise HPV typing provided by the present invention reduces the influence of unqualified sequences and thereby improves the accuracy of the detection and analysis.
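The three filtering criteria of step 203 can be illustrated with a short Python sketch. The function name `is_qualified` and its parameters are hypothetical, the quality values are assumed to be given as one integer per base, and the comparison against the sequencing adapter library is simplified here to an exact substring search, which is an assumption rather than the alignment actually used.

```python
from typing import Iterable, Sequence


def is_qualified(
    seq: str,
    quals: Sequence[int],
    adapters: Iterable[str] = (),
    min_quality: int = 5,
    max_low_fraction: float = 0.50,
    max_n_fraction: float = 0.10,
) -> bool:
    """Return True if a sequencing sequence passes all three filters."""
    n = len(seq)
    if n == 0:
        return False
    # Filter 1: too many bases with a quality value below the threshold.
    low = sum(1 for q in quals if q < min_quality)
    if low > max_low_fraction * n:
        return False
    # Filter 2: too many undetermined bases (N).
    if seq.upper().count("N") > max_n_fraction * n:
        return False
    # Filter 3: the sequence contains a sequencing adapter (simplified to substring search).
    if any(adapter in seq for adapter in adapters):
        return False
    return True
```

The default thresholds are the example values given in the text (quality 5, 50%, 10%); on another sequencing platform they would need to be re-established.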
FIG. 3 is a flowchart of another embodiment of the method for bioinformatics analysis of precise HPV typing provided by the present invention.
As shown in FIG. 3, the method 300 for bioinformatics analysis of precise HPV typing includes steps 302, 304, 305, 306 and 308. Steps 302, 304, 306 and 308 may carry out the same or similar technical content as steps 102, 104, 106 and 108 shown in FIG. 1, respectively; for brevity, that content is not repeated here.
As shown in FIG. 3, step 305 is performed after step 304: the sample adapter sequence is removed from the sequence fragment.
Specifically, the step of "removing the sample adapter sequence from the sequence fragment" further includes presetting a sequencing quality threshold (e.g., 5) and a base-count threshold (e.g., 3) for the sample adapter sequence, and removing every sequence in which the number of adapter bases with a sequencing quality value below the quality threshold exceeds the base-count threshold. For example, taking the sequencing conditions and environment into account, in this embodiment sequences whose 10 bp (base pair) adapter sequence contains more than 3 bases with a sequencing quality value below 5 are removed.
In one embodiment of the method for bioinformatics analysis of precise HPV typing provided by the present invention, the following steps are further performed:
Step a: exactly match the sample adapter sequence against the sequences in the sample adapter sequence library;
Step b: treat the sample adapter sequence as degraded by 1-2 bp, and exactly match it against the corresponding portion of the sequences in the sample adapter sequence library;
Step c: allow exactly one inserted base in the sample adapter sequence; that is, perform exact matching from the start of the sample adapter sequence, and when one base fails to match, treat it as an inserted base, skip it, and continue exact matching;
Step d: allow exactly one deleted base in the sample adapter sequence; that is, simulate the deletion of any single base in the sample adapter sequence and then perform exact matching.
After steps a to d have been completed, the final alignment result for the sample adapter sequence is determined in the priority order step a > step b > "step c or step d". (When adapter alignments are processed, the same sequence sometimes yields different alignment results. The priority used to select among them can be understood as follows: step a has the highest priority, step b the next, and steps c and d have equal priority.)
Sequences that align to the same sample adapter sequence are considered to come from the same sample, which is how the samples are distinguished. The sample adapter sequence (which may be 8-11 bp long) is then removed from the sequences of each sample.
Further, if none of the four steps a to d yields an alignment result, or a single step aligns to two results at the same time, or only step c and step d both yield results, the alignment is regarded as invalid information because the sample cannot be distinguished, and the entire corresponding sequence is removed.
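Steps a to d and their priority order can be sketched in Python as shown below. This is an illustrative reading of the text, not the implementation of the invention. The function name `assign_sample` and the layout of `library` (sample name mapped to its adapter sequence) are hypothetical, the adapter is assumed to sit at the start of the read, and "degraded by 1-2 bp" is assumed to mean that the first one or two bases of the adapter are missing. The function returns the sample together with the number of leading bases to trim, which is 8-11 for a 10 bp adapter.

```python
from typing import Callable, Dict, Optional, Tuple


def assign_sample(read: str, library: Dict[str, str]) -> Optional[Tuple[str, int]]:
    """Return (sample, bases_to_trim) for a read, or None if it cannot be assigned."""

    def exact(idx: str) -> Optional[int]:        # step a: exact match
        return len(idx) if read.startswith(idx) else None

    def degraded(idx: str) -> Optional[int]:     # step b: adapter lacks its first 1-2 bases
        for k in (1, 2):
            if read.startswith(idx[k:]):
                return len(idx) - k
        return None

    def insertion(idx: str) -> Optional[int]:    # step c: one extra base in the read
        n = len(idx)
        window = read[: n + 1]
        if len(window) < n + 1:
            return None
        for i in range(n):
            if window[:i] + window[i + 1:] == idx:
                return n + 1
        return None

    def deletion(idx: str) -> Optional[int]:     # step d: one adapter base missing from the read
        n = len(idx)
        window = read[: n - 1]
        if len(window) < n - 1:
            return None
        for i in range(n):
            if idx[:i] + idx[i + 1:] == window:
                return n - 1
        return None

    def run(step: Callable[[str], Optional[int]]) -> Dict[str, int]:
        hits = {}
        for sample, idx in library.items():
            trim = step(idx)
            if trim is not None:
                hits[sample] = trim
        return hits

    # Steps a and b in priority order; two hits within one step are ambiguous.
    for step in (exact, degraded):
        hits = run(step)
        if len(hits) == 1:
            return next(iter(hits.items()))
        if len(hits) > 1:
            return None
    # Steps c and d have equal priority; hits from both at once are ambiguous.
    ins, dele = run(insertion), run(deletion)
    if ins and dele:
        return None
    hits = ins or dele
    if len(hits) == 1:
        return next(iter(hits.items()))
    return None
```

For example, with a hypothetical library `{"S1": "ACGTACGTAC", "S2": "TTGGCCAATT"}`, a read starting with `GTACGTAC` is assigned to `S1` by step b with 8 bases to trim.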
In one embodiment of the method for bioinformatics analysis of precise HPV typing provided by the present invention, the sample adapter sequence in each sequencing fragment is aligned against the sample adapter sequence library to assign the fragments to samples, after which the sample adapter sequence is removed from the sequence fragment. This ensures the authenticity and reliability of the HPV typing analysis and provides a sound basis for the subsequent precise HPV typing.
FIG. 4 is a flowchart of another embodiment of the method for bioinformatics analysis of precise HPV typing provided by the present invention.
As shown in FIG. 4, the method 400 for bioinformatics analysis of precise HPV typing includes steps 402, 404, 406, 408, 409 and 410. Steps 402, 404 and 406 may carry out the same or similar technical content as steps 102, 104 and 106 shown in FIG. 1, respectively; for brevity, that content is not repeated here.
As shown in FIG. 4, step 408 is performed after step 406: the typed sequence fragments are merged by sample. Specifically, step 404 has already established which sample each sequence comes from. Using this relationship, the sequences belonging to the same sample are grouped together and their alignment results against the HPV reference genomes are tallied.
Step 409: normalize the number of sequence fragments of each sample after merging. In the present invention, samples from different libraries are mixed and sequenced in the same lane, so uneven loading concentrations between libraries cause the sequencing yield of the samples to differ from library to library. To eliminate this difference, the number of sequences obtained for each sample of each library is scaled proportionally so that the sequencing yield of its library equals the average yield under ideal conditions. In other words, the merged sequence count of each sample is normalized. Ideal conditions mean that every sample mixed into a lane is loaded in the same amount and receives an equal share of the sequencing yield, that is, the theoretical yield unaffected by experimental and sequencing operations. The normalization formula is:
sample_read_num_STD = sample_read_num_ori * (150000 / read_num_ori)
where sample_read_num_STD is the normalized number of sequences of the sample, sample_read_num_ori is the actual number of sequences of the sample, and read_num_ori is the number of sequences output by the sequencer for the library to which the sample belongs.
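As a worked illustration of the normalization formula, the following Python sketch applies it to a single sample. The function name `normalize_sample_count` is hypothetical; the variable names and the constant 150000 are taken from the formula in the text.

```python
def normalize_sample_count(
    sample_read_num_ori: float,
    read_num_ori: float,
    ideal_library_reads: float = 150000,
) -> float:
    """Scale a sample's sequence count as if its library had the ideal sequencing yield."""
    if read_num_ori <= 0:
        raise ValueError("read_num_ori must be positive")
    return sample_read_num_ori * (ideal_library_reads / read_num_ori)
```

For instance, a sample with 1200 sequences in a library that yielded 300000 sequences is scaled to 1200 * (150000 / 300000) = 600, while a sample with 500 sequences in a library that yielded only 75000 sequences is scaled to 1000.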
Step 410: filter according to the normalized number and proportion of sequence fragments that support each type, so as to finally determine the HPV type(s) of each sample or determine that the sample is negative.
After normalization, the information available for each sample is filtered. The filtering criteria are applied in the following order. If the number of usable sequence fragments is below a certain threshold (e.g., 137), the experiment or sequencing run is considered to have failed. If the number of sequence fragments whose alignment results support HPV types is below a certain threshold (e.g., 350), the test result is considered negative. If the sequence fragments whose alignment results support a particular HPV type account for at least a predetermined proportion of the total number of sequence fragments (this threshold should be set for the specific experimental context, taking both the authenticity and the reproducibility of the detection into account, e.g., 12%), the sample is considered to be infected with that type. The specific thresholds depend on the particular experimental conditions. The values given above were obtained statistically from actual samples, using the mean number of sequences from the negative samples plus four times the standard deviation as the threshold for distinguishing negative from positive. The statistical results are shown in FIGS. 2, 5 and 6, respectively. Different sequencing platforms call for different filter values, which in actual production should first be established by the general approach described here. In this way all HPV types infecting a sample are finally detected, or the test result is determined to be negative.
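The ordered criteria above can be sketched in Python as follows. The function name `call_sample` and the returned labels are hypothetical. The sketch assumes that the proportion for each type is computed against the sample's usable (normalized) sequence fragment count, and that a sample in which no type reaches the proportion threshold is reported as negative; neither point is stated explicitly in the text.

```python
from typing import Dict, List, Tuple


def call_sample(
    usable_reads: float,
    type_counts: Dict[str, float],
    min_usable: float = 137,
    min_hpv_reads: float = 350,
    min_fraction: float = 0.12,
) -> Tuple[str, List[str]]:
    """Return ("failed" | "negative" | "positive", list of HPV types) for one sample."""
    # Criterion 1: too few usable fragments means the experiment or run failed.
    if usable_reads < min_usable:
        return ("failed", [])
    # Criterion 2: too few fragments supporting any HPV type means negative.
    if sum(type_counts.values()) < min_hpv_reads:
        return ("negative", [])
    # Criterion 3: report every type that reaches the proportion threshold.
    types = sorted(t for t, c in type_counts.items() if c / usable_reads >= min_fraction)
    return ("positive", types) if types else ("negative", [])
```

With 2000 usable fragments of which 1500 support HPV16, 300 support HPV18 and 100 support HPV6, the sample would be reported as infected with HPV16 (75%) and HPV18 (15%), while HPV6 (5%) falls below the 12% threshold.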
FIG. 5 is a flowchart of a specific implementation of the method for bioinformatics analysis of precise HPV typing provided by the present invention.
As shown in FIG. 5, the method 500 for bioinformatics analysis of precise HPV typing includes the following steps. Step 502: receive the sequencing fragments obtained by a high-throughput sequencing technology. In this embodiment of the present invention, Illumina GA high-throughput sequencing is used.
Step 504: after the sequencing sequences have been received, filter them to remove unqualified sequences. A sequence is considered unqualified if the number of bases with a sequencing quality value below 5 exceeds 50% of the total number of bases in the sequence, if the number of N bases in the sequencing result exceeds 10% of the total number of bases in the sequence, or if alignment against the sequencing adapter sequence library shows that the sequence contains a sequencing adapter sequence.
Step 506: align the sample adapter sequence in each sequence against the sample adapter sequence library, thereby assigning the sequences to samples.
Step 508: remove the sample adapter sequence from the sequence fragment. Specifically, sequences whose adapter sequence contains more than 3 bases with a sequencing quality value below 5 are removed. Then: 1) the sample adapter sequence is exactly matched against the sequences in the sample adapter sequence library; 2) the sample adapter sequence is assumed to be degraded by 1-2 bp and is exactly matched against the corresponding portion of the sequences in the sample adapter sequence library; 3) exactly one inserted base is allowed in the sample adapter sequence: exact matching is performed from the start of the sample adapter sequence, and when a base fails to match it is treated as an inserted base and skipped, after which strict exact matching continues; 4) exactly one deleted base is allowed in the sample adapter sequence: the deletion of any single base in the sample adapter sequence is simulated and exact matching is then performed. After these four operations, the final alignment result for the sample adapter sequence is determined in the priority order 1) > 2) > 3), 4). If none of the four operations yields an alignment result, or a single operation aligns to two results at the same time, or only operations 3) and 4) both yield results, the alignment is regarded as invalid information because the sample cannot be distinguished, and the entire corresponding sequence is removed. Sequences that align to the same sample adapter sequence are considered to come from the same sample, which achieves the purpose of distinguishing the samples. Finally, the sample adapter sequence portion (8-11 bp) is removed from each sequence.
Step 510: align the sequencing fragments against the reference genome sequences, filter the aligned sequences, and determine the HPV type of each retained sequence fragment or determine that it is negative. This embodiment uses the BLAST mapping program to align the sequencing fragments obtained by high-throughput sequencing to the reference genome sequences. After alignment, sequences whose alignment length is below 70% or whose identity is below 85% are filtered out. For each sequence, the best alignment result, namely the first alignment result output by the BLAST software, is retained, and suboptimal results are also retained. A suboptimal result is one for which the product of identity and alignment length is at least 0.9 times that of the best result, the alignment score is at least 0.85 times that of the best result, and the probability that the match between the sequence and the reference sequence is unrelated is less than 10^3 times that of the best result. It is then determined whether the types to which the sequence aligns are the same type (or its subtypes). Finally, only sequences whose retained alignment results all align to a single type are kept as valid sequences, so that the HPV type to which each sequence aligns is determined or the sequence is confirmed as negative.
Step 512: merge the alignment results of the typed sequences by sample. Specifically, step 506 has already established which sample each sequence comes from. Using this relationship, the sequences belonging to the same sample are grouped together and their alignment results against the HPV reference genomes are tallied.
Step 514: normalize the merged sequence count of each sample. In the present invention, to eliminate the differences in sequencing yield between libraries, the number of sequences obtained for each sample of each library is scaled proportionally so that the sequencing yield of its library equals the average yield under ideal conditions. In other words, the merged sequence count of each sample is normalized.
The normalization formula is:
sample_read_num_STD = sample_read_num_ori * (150000 / read_num_ori)
where sample_read_num_STD is the normalized number of sequences of the sample, sample_read_num_ori is the actual number of sequences of the sample, and read_num_ori is the number of sequences output by the sequencer for the library to which the sample belongs.
Step 516: filter according to the normalized number and proportion of sequence fragments that support each type, so as to finally determine the HPV type(s) of each sample or determine that the sample is negative. In this embodiment, after normalization the following criteria are applied in order. If the number of usable sequences is less than 137, the experiment or sequencing run is considered to have failed. Otherwise, if the number of sequence fragments whose alignment results support HPV types is less than 350, the sample is considered negative. If the sequence fragments whose alignment results support a particular HPV type account for 12% or more of the total number of sequence fragments, the sample is considered to be infected with that type. In this way the HPV type(s) infecting each sample are finally determined, or the sample is determined to be negative.
FIG. 6 is a schematic structural diagram of a system for bioinformatics analysis of precise HPV typing according to an embodiment of the present invention.
As shown in FIG. 6, a system 600 for bioinformatics analysis of precise HPV typing includes a receiving module 602, a sample-splitting module 604, a sequence type determination module 606 and a sample type determination module 608, wherein:
The receiving module 602 is configured to receive the sequencing fragments obtained by a high-throughput sequencing technology.
The sample-splitting module 604 is configured to align the sample adapter sequence in each sequencing fragment against the sample adapter sequence library, thereby assigning the fragments to samples.
The sequence type determination module 606 is configured to align the sequencing fragments against the reference genome sequences, filter the aligned sequences, and determine the HPV type of each retained sequence fragment or determine that it is negative.
The sample type determination module 608 is configured to merge the typed sequence fragments by sample and to filter according to the number and proportion of sequence fragments that support each type after merging, so as to finally determine the HPV type(s) of each sample or determine that the sample is negative.
In one embodiment of the system for bioinformatics analysis of precise HPV typing provided by the present invention, the receiving module is further configured to filter the sequencing sequences after receiving them, removing unqualified sequences. For the details of the procedure, refer to the description in the method embodiments; they are not repeated here.
In one embodiment of the system for bioinformatics analysis of precise HPV typing provided by the present invention, the sample-splitting module is further configured to remove the sample adapter sequence from the sequence fragment after the fragments have been assigned to samples. For the details of the procedure, refer to the description in the method embodiments; they are not repeated here.
In one embodiment of the system for bioinformatics analysis of precise HPV typing provided by the present invention, the merging and screening module is further configured to normalize the number of sequence fragments of each sample after the typed sequence fragments have been merged by sample.
In one embodiment of the system for bioinformatics analysis of precise HPV typing provided by the present invention, normalizing the number of sequence fragments of each sample after merging further includes scaling the number of sequences obtained for each sample of each library proportionally so that the sequencing yield of the library equals the average yield under ideal conditions. For the details of the procedure, refer to the description in the method embodiments; they are not repeated here.
One embodiment of the system for bioinformatics analysis of precise HPV typing provided by the present invention uses bioinformatics analysis methods and techniques to test a large number of samples rapidly and to identify the infecting HPV types quickly, with high sensitivity and specificity.
FIG. 7 is a schematic diagram showing how the proportion of valid sequences relative to the original sequences changes at each stage of the analysis performed by the method and system for bioinformatics analysis of precise HPV typing according to an embodiment of the present invention.
As shown in FIG. 7, the horizontal axis represents the sequencing library code and the vertical axis represents the proportion of valid sequences relative to the original sequences. The Filter curve shows this proportion for each sequencing library after the sequencing sequences have been filtered. The Lib_Match curve shows this proportion for each sequencing library after the samples have been distinguished. The Final curve shows this proportion for each sequencing library after the HPV type of each sequence has been determined. In this example, the effective sequence utilization of all 50 sequencing libraries exceeded 80%.
FIG. 8 is a schematic diagram showing the distribution of the number of valid sequence fragments in the real samples and the negative control samples according to an embodiment of the present invention.
As shown in FIG. 8, the mean number of valid sequence fragments in the negative control samples is 19.82, and this mean plus four times the standard deviation of the number of valid sequence fragments is 136.98. FIG. 8 shows that using 137 valid sequence fragments as the cut-off for whether the experiment or sequencing run succeeded effectively separates the real samples from the negative control samples.
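The cut-off of 137 follows from the rule "mean plus four times the standard deviation" applied to the negative controls; the figures above imply a standard deviation of about (136.98 - 19.82) / 4 = 29.29. A minimal Python sketch of how such a threshold might be re-derived on another platform is given below. The function name `negative_control_threshold` is hypothetical, and the use of the population standard deviation is an assumption, since the text does not say which estimator was used.

```python
import statistics
from typing import Sequence


def negative_control_threshold(counts: Sequence[float], k: float = 4.0) -> float:
    """Return mean + k * standard deviation of the negative-control fragment counts."""
    # Assumption: population standard deviation; the source does not specify the estimator.
    return statistics.mean(counts) + k * statistics.pstdev(counts)
```

Samples whose usable fragment count falls below the resulting value would be treated as failed experiments or sequencing runs.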
FIG. 9 is a schematic diagram showing the reproducibility of the results when each sample is sequenced and analyzed 10 times according to an embodiment of the present invention.
To evaluate the validity of using 350 sequence fragments supporting an HPV type as the cut-off between negative and positive test results, FIG. 9 shows the reproducibility of the results when each sample is sequenced and analyzed 10 times. As shown in FIG. 9, the horizontal axis represents the cut-off value used to distinguish negative from positive results, and the vertical axis represents the mean reproducibility rate over all samples. As FIG. 9 makes clear to those skilled in the art, regardless of whether the samples were sequenced in Hong Kong or in Shenzhen, when 350 sequence fragments supporting an HPV type is used as the cut-off between negative and positive results, the reproducibility of the repeated analyses is as high as 99%, which demonstrates the stability of the present invention for HPV detection.
FIG. 10 is a schematic diagram comparing the negative and positive results obtained for all real samples according to an embodiment of the present invention with the blood negative samples and the clinical test results.
如图 10所示, 血液(Blood DNA )是确定的没有 HPV感染 的阴性样本。 在临床上将检测结果大于 1的患者确认为 HPV感染 阳性。 本领域技术人员根据图 10 所示可以清楚地知晓, 将支持 As shown in Figure 10, blood (Blood DNA) is a confirmed negative sample without HPV infection. Patients with a test result greater than 1 were clinically confirmed to be positive for HPV infection. Those skilled in the art can clearly understand according to FIG. 10 and will support
HPV型别的序列片段数 350确定为检测结果阴阳性的界定值时, 本实施例中确认 HPV感染阴阳性的结果绝大部分与临床检测结果 相同。 而 350 的阁值又可以把血液阴性样本和阳性样本区分开 来, 避免了假阳性。 由于临床检测结果并不能完全作为阳性对 照, 所以本实施例的检测结果已足以证明本发明的精确性。 When the number of HPV-type sequence fragments 350 is determined as the definition value of the positive result of the detection result, the result of confirming the positive result of HPV infection in this embodiment is mostly the same as the clinical test result. The value of 350 can distinguish between blood-negative and positive samples, avoiding false positives. Because the clinical test results are not completely positive Therefore, the detection results of this embodiment are sufficient to demonstrate the accuracy of the present invention.
图 11 示出本发明实施例提供的第二类文库中质粒样本的检测 结果的示意图。  Figure 11 is a schematic diagram showing the results of detection of plasmid samples in a second type of library provided by an embodiment of the present invention.
如图 11 所示, 横坐标表示为质粒中载入 HPV病毒的型别, 纵坐标表示的为实施例分析过程中支持对应 HPV病毒型别的序列 片段所占比例。 本领域技术人员根据图 11所示可以清楚地知晓, 将支持 HPV 某型别的序列片段数比例达到 12%以上的样本确定 为感染 HPV的型别, 可以有效的特异的检测出样本感染了的具体 型别。  As shown in Fig. 11, the abscissa indicates the type in which the HPV virus was loaded into the plasmid, and the ordinate indicates the proportion of the sequence fragment supporting the corresponding HPV virus type during the analysis of the example. It can be clearly seen by those skilled in the art according to FIG. 11 that a sample supporting a ratio of the number of sequence fragments of a certain type of HPV is determined to be a type of HPV infection, and the sample can be effectively and specifically detected. Specific type.
Figure imgf000020_0001

Detection results of the sample library

Sample      Result   Sample      Result   Sample                       Result
sample 11   HBB      sample 43   HPV6     sample 75                    HBB
sample 12   -        sample 44   HBB      sample 76                    HBB
sample 13   HBB      sample 45   -        sample 77                    HBB
sample 14   HPV59    sample 46   HBB      sample 78                    HBB
sample 15   HPV16    sample 47   -        sample 79                    HBB
sample 16   HBB      sample 48   HBB      sample 80                    HBB
sample 17   HBB      sample 49   HBB      sample 81                    HBB
sample 18   HBB      sample 50   HBB      sample 82                    HBB
sample 19   HBB      sample 51   HBB      plasmid (type 33)            HPV33
sample 20   HPV16    sample 52   HBB      plasmid (type 33)            HPV33
sample 21   HBB      sample 53   HBB      blood negative sample        HBB
sample 22   HBB      sample 54   HBB      blood negative sample        HBB
sample 23   HPV11    sample 55   HBB      blood negative sample        HBB
sample 24   HBB      sample 56   HBB      blood negative sample        HBB
sample 25   HBB      sample 57   HBB      blood negative sample        HBB
sample 26   HBB      sample 58   -        blood negative sample        HBB
sample 27   HBB      sample 59   HBB      pure water negative sample   -
sample 28   HBB      sample 60   HBB      pure water negative sample   -
sample 29   HBB      sample 61   HBB      pure water negative sample   -
sample 30   HBB      sample 62   HBB      pure water negative sample   -
sample 31   HBB      sample 63   HBB      pure water negative sample   -
sample 32   -        sample 64   HBB      pure water negative sample   -
Table 1 shows the detection results for one sample library provided by an experimental example of the present invention. The table gives the results for one sample library of the first type of library. "HBB" indicates a negative result; "-" indicates that, because of a sample or experimental problem, fewer than 137 sequences were detected, and the detection of that sample is considered to have failed.
With reference to the foregoing exemplary description, those skilled in the art will clearly appreciate that the present invention has the following advantages:
1. An embodiment of the method and system for bioinformatics analysis of HPV precise typing provided by the present invention uses bioinformatics analysis methods and techniques to test large numbers of samples quickly and to determine the infecting HPV type quickly, with high sensitivity and specificity.
2. An embodiment of the method and system for bioinformatics analysis of HPV precise typing provided by the present invention filters the sequencing reads to remove unqualified sequences, which reduces their influence and thereby improves the accuracy of the analysis.
3. An embodiment of the method and system for bioinformatics analysis of HPV precise typing provided by the present invention aligns the sample linker sequence in each sequencing fragment against the sample linker sequence library to split the fragments by sample, and then removes the sample linker sequence from the sequence fragments. This ensures the authenticity and reliability of the HPV typing analysis and provides a safeguard for further precise HPV typing.
The description of the present invention is given for purposes of illustration and description; it is not exhaustive and does not limit the invention to the disclosed form. Many modifications and variations will be apparent to those of ordinary skill in the art. The functional modules described herein, and the way they are divided, merely illustrate the idea of the invention; those skilled in the art may freely change the division and construction of the modules to achieve the same functions, according to the teaching of the invention and the needs of the actual application. The embodiments were chosen and described to better explain the principles and practical application of the invention, and to enable those of ordinary skill in the art to understand the invention and design various embodiments, with various modifications, suited to particular uses.

Claims

1. A method for bioinformatics analysis of HPV precise typing, characterized in that the method comprises:
receiving sequencing fragments obtained by high-throughput sequencing;
aligning the sample linker sequence in each sequencing fragment against a sample linker sequence library, so as to split the fragments by sample;
aligning the sequencing fragments to reference genome sequences, screening the aligned sequences, and determining, for each sequence fragment that passes screening, its HPV type or that it is negative;
merging the typed sequence fragments by sample, screening according to the number and proportion of sequence fragments supporting each type after merging, and finally confirming the HPV type of each sample or determining that the sample is negative.
2. The method according to claim 1, characterized in that the method further comprises: after the sequencing reads are received, filtering the sequencing reads to remove unqualified sequences.
3. The method according to claim 1, characterized in that the step of "filtering the sequencing reads to remove unqualified sequences" further comprises:
presetting a sequencing quality threshold and a proportion threshold for unqualified bases;
when bases in a sequencing read have sequencing quality values below the sequencing quality threshold, and the number of such bases as a proportion of all bases in the read exceeds the proportion threshold, regarding the read as an unqualified sequence and filtering it out;
when the number of undetermined bases in a sequencing read exceeds 10% of all bases in the read, regarding the read as an unqualified sequence and filtering it out;
when a sequencing read is aligned against a sequencing linker (adapter) sequence library and is found to contain a sequencing linker sequence, regarding the read as an unqualified sequence and filtering it out.
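The three filtering conditions of claim 3 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `is_unqualified`, the default values of `q_threshold` and `ratio_threshold`, and the use of exact substring search in place of a real alignment against the sequencing linker library are all hypothetical. Only the 10% limit on undetermined bases comes from the claim.

```python
def is_unqualified(seq, quals, q_threshold=20, ratio_threshold=0.5,
                   adapters=(), max_n_fraction=0.10):
    """Return True if a read fails any of the three conditions of claim 3.

    seq      -- read bases as a string
    quals    -- per-base sequencing quality values (integers)
    adapters -- known sequencing linker (adapter) sequences
    Assumption: an empty read is treated as unqualified.
    """
    n = len(seq)
    if n == 0:
        return True

    # Condition 1: too large a share of bases below the quality threshold.
    low_quality = sum(1 for q in quals if q < q_threshold)
    if low_quality / n > ratio_threshold:
        return True

    # Condition 2: undetermined bases (N) exceed 10% of the read.
    if seq.upper().count("N") / n > max_n_fraction:
        return True

    # Condition 3: the read contains a sequencing linker sequence.
    # Simplification: exact substring search instead of alignment.
    return any(adapter in seq for adapter in adapters)
```

Reads for which the function returns `True` would be dropped before the sample-splitting step.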
4. The method according to claim 1, characterized in that the method further comprises: after the fragments are split by sample, removing the sample linker sequence from the sequence fragments.
5. The method according to claim 4, characterized in that the step of "removing the sample linker sequence from the sequence fragments" further comprises:
presetting a sequencing quality threshold and a base-count threshold for the sample linker sequence;
removing any sequence in whose linker sequence the number of bases with a sequencing quality value below the sequencing quality threshold exceeds the base-count threshold.
6. The method according to claim 5, characterized in that the method further comprises:
step a: performing an exact-match operation between the sample linker sequence and the sequences in the sample linker sequence library;
step b: allowing the sample linker sequence to be degraded by 1-2 bp, and performing an exact-match operation against the corresponding part of the sequences in the sample linker sequence library;
step c: allowing exactly one base insertion in the sample linker sequence, that is, performing exact matching from the start of the sample linker sequence and, when one base fails to match, treating that base as an inserted base, skipping it, and continuing the exact-match operation;
step d: allowing exactly one base deletion in the sample linker sequence, that is, simulating the deletion of any one base in the sample linker sequence and then performing an exact-match operation;
after steps a-d are completed, determining the final alignment result of the sample linker sequence according to the priority order step a > step b > "step c or step d";
treating sequences aligned to the same sample linker sequence as coming from the same sample, thereby distinguishing the samples; and
removing the sample linker sequence from the sequences of each sample.
7. The method according to claim 6, characterized in that the method further comprises:
if none of steps a-d yields an alignment result, or a single step aligns to two results at the same time, or only step c and step d both yield results, regarding the alignment result as invalid information because the sample cannot be distinguished, and removing the entire corresponding sequence.
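The matching steps of claim 6 and the rejection rules of claim 7 can be sketched together in Python. Several points are assumptions made for illustration: the sample linker sequence is taken to sit at the start of the read, "degraded by 1-2 bp" is read as the loss of the first one or two linker bases, and the names `assign_sample` and `tag_library` are hypothetical. Trimming the linker after assignment is omitted to keep the sketch short.

```python
def _exact(read, tag):
    # Step a: the read starts with the full linker sequence.
    return read.startswith(tag)


def _degraded(read, tag):
    # Step b (assumed reading): the first 1-2 linker bases are missing.
    return any(read.startswith(tag[k:]) for k in (1, 2))


def _one_insertion(read, tag):
    # Step c: skip the first mismatching read base, then match exactly.
    window = read[:len(tag) + 1]
    if len(window) < len(tag) + 1:
        return False
    i = 0
    while i < len(tag) and window[i] == tag[i]:
        i += 1
    return window[i + 1:] == tag[i:]


def _one_deletion(read, tag):
    # Step d: simulate deleting any one linker base, then match exactly.
    return any(read.startswith(tag[:i] + tag[i + 1:]) for i in range(len(tag)))


def assign_sample(read, tag_library):
    """Return the sample name for a read, or None if it must be discarded.

    tag_library maps sample name -> sample linker sequence.
    """
    def hits(rule):
        return [name for name, tag in tag_library.items() if rule(read, tag)]

    # Priority: step a > step b; two hits within one step are invalid.
    for rule in (_exact, _degraded):
        found = hits(rule)
        if found:
            return found[0] if len(found) == 1 else None

    # Steps c and d share the lowest priority; both firing is invalid.
    inserted, deleted = hits(_one_insertion), hits(_one_deletion)
    if inserted and deleted:
        return None
    found = inserted or deleted
    return found[0] if len(found) == 1 else None
```

Reads for which `assign_sample` returns `None` correspond to the "invalid information" case of claim 7 and would be removed entirely.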
8. The method according to claim 1, characterized in that the step of "screening the aligned sequences" further comprises:
aligning the sequencing fragments obtained by high-throughput sequencing to the reference genome sequences;
after alignment, screening out and removing sequences whose alignment length is below 70% or whose identity is below 85%;
retaining the best result among the alignment results of each sequence;
retaining suboptimal results, where a suboptimal result satisfies: its identity × alignment length and its alignment score are greater than or equal to 0.9 times and 0.85 times those of the best result, respectively, and the probability that the match between the sequence and the reference sequence is unrelated is less than 10^3 times that of the best result;
determining whether the best result and the suboptimal results of each sequence align to the same type or its subtypes and, if so, retaining the sequence, whose alignment results align to only one type, as a valid sequence, and determining the HPV type, or negative status, to which the valid sequence aligns.
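The screening rules of claim 8 can be sketched as a per-read function over its alignment hits. The following Python is an illustration only. The `Hit` record, the function name `type_read`, the choice of the highest alignment score as the "best result", and the reading of "alignment length below 70%" as a fraction of the read length are assumptions. Subtype handling is left out; types are compared by name.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    hpv_type: str     # reference type this hit aligns to
    identity: float   # fraction of identical bases, 0..1
    aligned_len: int  # number of aligned bases
    score: float      # alignment score
    evalue: float     # probability-like measure of an unrelated match


def type_read(hits, read_len):
    """Return the single type supported by a read, or None if unusable."""
    # Remove hits with alignment length < 70% or identity < 85%.
    kept = [h for h in hits
            if h.aligned_len / read_len >= 0.70 and h.identity >= 0.85]
    if not kept:
        return None

    # Assumption: the best result is the hit with the highest score.
    best = max(kept, key=lambda h: h.score)
    types = {best.hpv_type}

    # Keep suboptimal hits that are close enough to the best one.
    # Note: if best.evalue is exactly 0.0, no hit passes the strict
    # comparison, so none counts as suboptimal.
    for h in kept:
        if h is best:
            continue
        if (h.identity * h.aligned_len >= 0.9 * best.identity * best.aligned_len
                and h.score >= 0.85 * best.score
                and h.evalue < 1e3 * best.evalue):
            types.add(h.hpv_type)

    # The read is valid only if best and suboptimal hits agree on one type.
    return best.hpv_type if len(types) == 1 else None
```

A read whose best and suboptimal hits point to different types returns `None` and would not count toward any type.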
9. The method according to claim 1, characterized in that the method further comprises:
after the typed sequence fragments are merged by sample, normalizing the number of sequence fragments of each sample after merging.
10. The method according to claim 9, characterized in that normalizing the number of sequence fragments of each sample after merging further comprises:
scaling the number of sequences belonging to each sample in each library proportionally, so that the sequencing output of the library equals the average sequencing output under ideal conditions.
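The proportional scaling of claim 10 amounts to multiplying every sample count in a library by the ratio of the ideal library output to the actual library output. A minimal Python sketch follows; the function name `normalize_counts` and the parameter `ideal_library_total` are hypothetical, and the text does not give a value for the ideal average sequencing output.

```python
def normalize_counts(sample_counts, ideal_library_total):
    """Scale per-sample sequence counts of one library to an ideal total.

    sample_counts       -- dict mapping sample name -> sequence count
    ideal_library_total -- assumed ideal average sequencing output
    """
    actual_total = sum(sample_counts.values())
    if actual_total == 0:
        # Nothing was sequenced; avoid division by zero.
        return {name: 0.0 for name in sample_counts}
    factor = ideal_library_total / actual_total
    return {name: count * factor for name, count in sample_counts.items()}
```

For instance, a library with counts of 100 and 300 scaled to an ideal total of 800 gives 200.0 and 600.0, so the proportions between samples are preserved.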
11. The method according to claim 9, characterized in that the step of "screening according to the number and proportion of sequence fragments supporting each type after merging" further comprises: after normalization, screening according to the following conditions, in order:
if the number of usable sequences is less than the mean number of valid sequence fragments of the negative control samples plus four times their standard deviation, regarding the experiment or sequencing run as failed;
otherwise, if the number of sequence fragments supporting an HPV type in the alignment results is less than a predetermined threshold, regarding the sample as negative;
if the number of sequence fragments supporting an HPV type, as a proportion of the total number of sequence fragments, reaches a predetermined threshold or more, regarding the sample as infected with that type.
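The ordered decision of claim 11 can be sketched with the cutoffs reported in the description (137 valid fragments for a successful run, 350 supporting fragments for a positive call, 12% for type assignment). The Python below is an illustration under assumptions: the function name `call_sample` is hypothetical, the 350 cutoff is applied to the sum of fragments supporting any HPV type, the proportion is taken over all valid fragments of the sample, and a sample in which no type reaches 12% is reported as negative because the text does not cover that case.

```python
def call_sample(type_counts, total_valid, qc_cutoff=137,
                min_support=350, min_fraction=0.12):
    """Classify one sample from its normalized fragment counts.

    type_counts -- dict mapping HPV type -> supporting fragment count
                   (fragments aligned to the human control are excluded)
    total_valid -- total number of valid fragments for the sample
    Returns (status, types) with status in {"failed", "negative", "positive"}.
    """
    # Condition 1: too few usable sequences, the run is considered failed.
    if total_valid < qc_cutoff:
        return ("failed", [])

    # Condition 2: too few HPV-supporting fragments, the sample is negative.
    if sum(type_counts.values()) < min_support:
        return ("negative", [])

    # Condition 3: report every type whose share reaches the threshold.
    types = sorted(t for t, c in type_counts.items()
                   if c / total_valid >= min_fraction)
    # Assumption: no type above the threshold is reported as negative.
    return ("positive", types) if types else ("negative", [])
```

Because every type is tested against the proportion threshold separately, a sample can be reported with more than one type.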
12. A system for bioinformatics analysis of HPV precise typing, characterized in that the system comprises:
a receiving module, configured to receive sequencing fragments obtained by high-throughput sequencing;
a sample-splitting module, configured to align the sample linker sequence in each sequencing fragment against a sample linker sequence library, so as to split the fragments by sample;
a sequence type determination module, configured to align the sequencing fragments to reference genome sequences, screen the aligned sequences, and determine, for each sequence fragment that passes screening, its HPV type or that it is negative;
a sample type determination module, configured to merge the typed sequence fragments by sample, screen according to the number and proportion of sequence fragments supporting each type after merging, and finally confirm the HPV type of each sample or determine that the sample is negative.
13. The system according to claim 12, characterized in that the receiving module is further configured to: after the sequencing reads are received, filter the sequencing reads to remove unqualified sequences.
14. The system according to claim 12, characterized in that the sample-splitting module is further configured to: after the fragments are split by sample, remove the sample linker sequence from the sequence fragments.
15. The system according to claim 12, characterized in that the sample type determination module is further configured to: after the typed sequence fragments are merged by sample, normalize the number of sequence fragments of each sample after merging.
16. The system according to claim 15, characterized in that normalizing the number of sequence fragments of each sample after merging further comprises:
scaling the number of sequences belonging to each sample in each library proportionally, so that the sequencing output of the library equals the average sequencing output under ideal conditions.
PCT/CN2010/001943 2010-12-02 2010-12-02 Method and system for bioinformatics analysis of hpv precise typing WO2012071685A1 (en)

Priority Applications (3)

Application Number Priority Date Filing Date Title
CN201080070484.7A CN103261442B (en) 2010-12-02 2010-12-02 Method and system for bioinformatics analysis of HPV precise typing
PCT/CN2010/001943 WO2012071685A1 (en) 2010-12-02 2010-12-02 Method and system for bioinformatics analysis of hpv precise typing
HK13112598.6A HK1185113A1 (en) 2010-12-02 2013-11-11 Method and system for bioinformatics analysis of hpv precise typing hpv

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/001943 WO2012071685A1 (en) 2010-12-02 2010-12-02 Method and system for bioinformatics analysis of hpv precise typing

Publications (1)

Publication Number Publication Date
WO2012071685A1 true WO2012071685A1 (en) 2012-06-07

Family

ID=46171145

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/001943 WO2012071685A1 (en) 2010-12-02 2010-12-02 Method and system for bioinformatics analysis of hpv precise typing

Country Status (3)

Country Link
CN (1) CN103261442B (en)
HK (1) HK1185113A1 (en)
WO (1) WO2012071685A1 (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111032885B (en) * 2017-09-07 2024-05-17 深圳华大基因股份有限公司 Bioinformatics analysis method and system for HPV precise typing
CN111919257B (en) * 2018-07-27 2021-05-28 思勤有限公司 Method and system for reducing noise in sequencing data, and implementation and application thereof
CN111755075B (en) * 2019-03-28 2023-09-29 深圳华大生命科学研究院 Method for filtering sequence pollution among high-throughput sequencing samples of immune repertoire
CN110951853B (en) * 2019-12-10 2021-03-30 中山大学附属第一医院 Method for accurately detecting DNA viruses in human genome
CN116403647B (en) * 2023-06-08 2023-08-15 上海精翰生物科技有限公司 Biological information detection method for detecting slow virus integration site and application thereof

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101397590A (en) * 2008-10-27 2009-04-01 杭州迪安医学检验中心有限公司 Typing method for human papilloma virus gene
CN101435002A (en) * 2008-12-12 2009-05-20 深圳华大基因科技有限公司 Method for detecting human papilloma virogene type
CN101838709A (en) * 2010-04-13 2010-09-22 中山大学 Method for performing rapid gene typing on trace human papilloma virus (HPV)

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3404114B1 (en) * 2005-12-22 2021-05-05 Keygene N.V. Method for high-throughput aflp-based polymorphism detection

Also Published As

Publication number Publication date
CN103261442A (en) 2013-08-21
CN103261442B (en) 2014-12-10
HK1185113A1 (en) 2014-02-07

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 10860226

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 10860226

Country of ref document: EP

Kind code of ref document: A1