CN106529211A - Variable site obtaining method and apparatus - Google Patents

Publication number
CN106529211A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610972449.3A
Other languages
Chinese (zh)
Inventor
范振鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Xin Yun Decoding Technology Co Ltd
Original Assignee
Chengdu Xin Yun Decoding Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Xin Yun Decoding Technology Co Ltd filed Critical Chengdu Xin Yun Decoding Technology Co Ltd
Priority to CN201610972449.3A priority Critical patent/CN106529211A/en
Publication of CN106529211A publication Critical patent/CN106529211A/en

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B25/00ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression

Abstract

The invention provides a variable site obtaining method and apparatus, and relates to the technical field of bio-information. The method comprises the steps of performing data comparison on a plurality of short sequences of to-be-tested genes and reference genomes, and obtaining initial variable site information of the to-be-tested genes, wherein the initial variable site information comprises a plurality of initial variable sites; and deleting the variable sites which do not meet a preset reservation condition in the initial variable sites according to the initial variable site information, and obtaining variable sites of the to-be-tested genes. According to the method and the apparatus, the variable sites which do not meet the preset reservation condition are further deleted based on variable sites obtained for the first time, so that more accurate variable sites can be obtained.

Description

Method and apparatus for obtaining variant sites
Technical Field
The present application relates to the technical field of bioinformatics, and in particular to a method and an apparatus for obtaining variant sites.
Background
Existing methods for obtaining variant sites commonly rely on next-generation sequencing, also called second-generation sequencing. Compared with first-generation sequencing (Sanger sequencing), second-generation sequencing offers a huge data volume, short sequencing time, and a low cost per gene site, but it also suffers from a high error rate in the raw data and from insufficiently accurate identification of variant sites.
Summary of the Invention
In view of this, the embodiments of the present application provide a method and an apparatus for obtaining variant sites, which further filter the variant sites preliminarily obtained by sequencing software and delete those that do not satisfy a preset retention condition, so that the variant sites obtained are more accurate and the above problem is mitigated.
To achieve the above object, the technical solution adopted by the present application is as follows:
A method for obtaining variant sites, the method comprising: aligning a plurality of short sequences of a gene under test against a reference genome to obtain preliminary variant site information of the gene under test, the preliminary variant site information comprising a plurality of preliminary variant sites; and, according to the preliminary variant site information, deleting the variant sites among the plurality of preliminary variant sites that do not satisfy a preset retention condition, to obtain the variant sites in the gene under test.
An apparatus for obtaining variant sites, the apparatus comprising: an alignment module, configured to align a plurality of short sequences of a gene under test against a reference genome to obtain preliminary variant site information of the gene under test, the preliminary variant site information comprising a plurality of preliminary variant sites; and a filtering module, configured to delete, according to the preliminary variant site information, the variant sites among the plurality of preliminary variant sites that do not satisfy a preset retention condition, to obtain the variant sites in the gene under test.
In the method and apparatus provided by the embodiments of the present application, after the short sequences of the gene under test are aligned against the reference genome to obtain preliminary variant site information covering a plurality of variant sites, those variant sites are filtered again according to the preliminary variant site information. That is, the preliminary variant sites that do not satisfy the preset retention condition are deleted, so that the sites finally retained have a higher accuracy. Compared with the prior art, the variant sites of the gene under test obtained by this solution are more accurate.
To make the above objects, features, and advantages of the present application more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief Description of the Drawings
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.
Fig. 1 is a schematic structural diagram of the computer provided by an embodiment of the present application;
Fig. 2 is a flowchart of the method for obtaining variant sites provided by the first embodiment of the present application;
Fig. 3 is another flowchart of the method for obtaining variant sites provided by the first embodiment of the present application;
Fig. 4 is a functional block diagram of the apparatus for obtaining variant sites provided by the second embodiment of the present application;
Fig. 5 is a functional block diagram of the filtering module of the apparatus for obtaining variant sites provided by the second embodiment of the present application;
Fig. 6 is a functional block diagram of the alignment module of the apparatus for obtaining variant sites provided by the second embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, rather than all, of the embodiments of the present application. The components of the embodiments, as generally described and illustrated in the drawings, may be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the scope of the claimed application, but merely represents selected embodiments of the application. All other embodiments obtained by a person skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be defined or explained again in subsequent drawings. In addition, in the description of the present application, the terms "first", "second", and the like are used only to distinguish descriptions and are not to be understood as indicating or implying relative importance.
Fig. 1 is a block diagram of the computer 100 of the present application. The computer 100 includes an apparatus 200 for obtaining variant sites, a memory 101, a memory controller 102, a processor 103, a peripheral interface 104, an input/output unit 105, and other components.
The memory 101, memory controller 102, processor 103, peripheral interface 104, and input/output unit 105 are electrically connected to one another, directly or indirectly, to enable the transmission or exchange of data. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The apparatus 200 for obtaining variant sites includes at least one software function module that can be stored in the memory 101 in the form of software or firmware, or built into the operating system (OS) of the computer 100. The processor 103 is configured to execute the executable modules stored in the memory 101, such as the software function modules or computer programs included in the apparatus 200 for obtaining variant sites.
The memory 101 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or the like. The memory 101 is used to store a program, and the processor 103 executes the program after receiving an execution instruction. The method performed by the computer 100, as defined by the flows disclosed in any embodiment of the present application, may be applied to the processor 103 or implemented by the processor 103.
The processor 103 may be an integrated circuit chip with signal processing capability. The processor 103 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor 103 may be any conventional processor.
The peripheral interface 104 couples various input/output devices to the processor 103 and the memory 101. In some embodiments, the peripheral interface 104, the processor 103, and the memory controller 102 may be implemented in a single chip. In other embodiments, they may each be implemented by a separate chip.
The input/output unit 105 allows a user to input data and thereby interact with the computer. The input/output unit may be, but is not limited to, a data reading device, a mouse, a keyboard, or the like.
It should be understood that the structure shown in Fig. 1 is only illustrative; the computer 100 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 may be implemented in hardware, software, or a combination thereof.
First embodiment
This embodiment of the present application provides a method for obtaining variant sites. Referring to Fig. 2, the method includes:
Step S110: Align a plurality of short sequences of the gene under test against a reference genome to obtain preliminary variant site information of the gene under test, the preliminary variant site information including a plurality of preliminary variant sites.
First, a plurality of short sequences of the gene under test are obtained; the short sequences may be output by a second-generation sequencing platform. The short sequences of the gene under test are then aligned against the reference genome. For example, if the gene under test is a human gene, the reference genome is the human reference genome.
Of course, the alignment process may include repeated alignment, deduplication, and other processing, after which variant site information including a plurality of variant sites is obtained.
Specifically, as shown in Fig. 3, in this embodiment the process of aligning the data to obtain the preliminary variant site information in this step may include:
Step S111: Align the plurality of short sequences of the gene under test against the reference genome for the first time to obtain an alignment result in SAM format.
The short sequences of the gene under test are aligned against the reference genome. The alignment may be performed with existing alignment software, such as Bowtie2, which yields an alignment result in SAM format that stores the alignment information obtained. It should be understood that the SAM-format alignment result includes information about each base of the gene under test, such as position information.
Of course, the specific alignment software used and the representation of the alignment result are not limited in this embodiment; any approach that can align the plurality of short sequences of the gene under test against the reference genome and produce alignment information representing the result is suitable.
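As an illustration of the position information carried by a SAM-format alignment result, the following minimal Python sketch reads the mandatory tab-separated fields of one SAM alignment line. The function name `parse_sam_line` and the example read are hypothetical and are not part of the described method; a real pipeline would normally use an established SAM/BAM library.

```python
def parse_sam_line(line):
    """Extract a few mandatory SAM fields from one alignment line.

    Hypothetical helper for illustration; header lines (starting with '@')
    must be skipped by the caller.
    """
    f = line.rstrip("\n").split("\t")
    flag = int(f[1])
    return {
        "name": f[0],                   # QNAME: read name
        "chrom": f[2],                  # RNAME: reference sequence name
        "pos": int(f[3]),               # POS: 1-based leftmost mapped position
        "mapq": int(f[4]),              # MAPQ: mapping quality
        "cigar": f[5],                  # CIGAR string
        "seq": f[9],                    # SEQ: read bases
        "qual": f[10],                  # QUAL: per-base qualities (ASCII-encoded)
        "unmapped": bool(flag & 0x4),   # flag bit 0x4: read is unmapped
        "reverse": bool(flag & 0x10),   # flag bit 0x10: reverse strand
    }

# Made-up example read aligned to the reverse strand of chr1 at position 100.
example = "read1\t16\tchr1\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII"
record = parse_sam_line(example)
print(record["chrom"], record["pos"], record["reverse"])
```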
Step S112: Deduplicate the alignment result so that at most one short sequence is aligned to any one position of the reference genome.
The alignment result obtained in step S111 contains a certain proportion of duplicate sequences and results; for example, several short sequences may be aligned to the same position of the reference genome. In this step, the alignment result is therefore deduplicated.
In this embodiment, the software Picard may be used for deduplication. Specifically, Picard's MarkDuplicates tool may be used, yielding a deduplicated result in BAM format.
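The goal of step S112, at most one short sequence per reference position, can be illustrated with a small Python sketch. This is not Picard's algorithm, whose duplicate criteria are more elaborate; the key `(chrom, pos, strand)` and the rule of keeping the alignment with the highest mapping quality are simplifying assumptions, and the names `Alignment` and `deduplicate` are hypothetical.

```python
from typing import NamedTuple


class Alignment(NamedTuple):
    name: str      # read name
    chrom: str     # reference sequence name
    pos: int       # 1-based leftmost mapped position
    reverse: bool  # True if mapped to the reverse strand
    mapq: int      # mapping quality


def deduplicate(alignments):
    """Keep one alignment per (chrom, pos, strand), preferring the highest mapq.

    Assumption for illustration only: reads sharing this key are duplicates.
    """
    best = {}
    for aln in alignments:
        key = (aln.chrom, aln.pos, aln.reverse)
        if key not in best or aln.mapq > best[key].mapq:
            best[key] = aln
    # Return the survivors in coordinate order.
    return sorted(best.values(), key=lambda a: (a.chrom, a.pos, a.reverse))


reads = [
    Alignment("r1", "chr1", 100, False, 60),
    Alignment("r2", "chr1", 100, False, 42),  # same position and strand as r1
    Alignment("r3", "chr1", 100, True, 50),   # same position, other strand
    Alignment("r4", "chr1", 250, False, 30),
]
kept = deduplicate(reads)
print([a.name for a in kept])
```

In this made-up input, r2 is dropped because r1 occupies the same position and strand with a higher mapping quality.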
Step S113: Perform local realignment (local multiple alignment) on the deduplicated alignment result.
Because the short sequences aligned against the reference genome are difficult to align accurately in highly similar repeat regions, false-positive variant sites, such as false-positive SNPs, are easily obtained in the repeat regions of the genome. A false-positive variant site is a variant site that results from an incorrect alignment. To reduce the number and proportion of false-positive variant sites, this embodiment performs local realignment on the deduplicated alignment result.
Specifically, the local realignment (local multiple alignment) may be performed with IndelRealigner in GATK, yielding a realigned result in BAM format. The realignment generally has three steps: a. detect the suspicious regions that need realignment; b. realign these suspicious regions; c. repair the mate-pair information lost during realignment.
Step S114: Recalculate the base quality scores in the locally realigned alignment result.
In step S111 of the foregoing processing, each single base is assigned a quality score during data processing, which reflects the confidence in the nucleotide observed at that base.
However, the quality scores obtained in the foregoing processing are not well correlated with the actual probability of an erroneous genotyping result. Moreover, the quality score of a single base is not linked to other parameters, for example, within the same sample, the sequencing platform, the sequencing cycle, or the library.
Therefore, in step S114, the quality score of each base is associated with the relevant factors of the sequencing process and recalculated, producing a new quality score that is used to judge whether each base is reliable.
Specifically, in this embodiment, GATK may be used to perform empirical quality score recalibration, yielding a result in BAM format.
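The idea of tying quality scores to factors of the sequencing process can be sketched as follows. This is a simplified illustration, not GATK's implementation: the choice of covariates (reported quality and sequencing cycle), the add-one smoothing, and the name `recalibrate` are all assumptions. A real tool would also need to avoid counting true variants as sequencing errors.

```python
import math
from collections import defaultdict


def recalibrate(observations):
    """Build an empirical quality table from observed mismatches.

    observations: iterable of (reported_quality, cycle, is_mismatch).
    Returns {(reported_quality, cycle): empirical Phred quality}.
    The covariates used here are an assumption for illustration.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [mismatches, total]
    for reported_q, cycle, is_mismatch in observations:
        bucket = counts[(reported_q, cycle)]
        bucket[0] += int(is_mismatch)
        bucket[1] += 1

    table = {}
    for key, (mismatches, total) in counts.items():
        # Add-one smoothing keeps the error rate strictly between 0 and 1.
        p_error = (mismatches + 1) / (total + 2)
        table[key] = round(-10 * math.log10(p_error))
    return table


# Made-up data: bases reported as Q30 at cycle 1, of which 9 in 998 mismatch
# the reference, so the smoothed error rate is 10 / 1000 = 0.01.
obs = [(30, 1, True)] * 9 + [(30, 1, False)] * 989
table = recalibrate(obs)
print(table[(30, 1)])
```

In this made-up case the bases were reported as Q30 but behave empirically like Q20, which is the kind of discrepancy recalibration is meant to correct.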
Step S115: According to the base quality scores, perform SNP and indel analysis on the locally realigned alignment result to obtain the preliminary variant site information.
According to the recalculated base quality scores, a preliminary call of SNPs and indels is made on the locally realigned alignment result, that is, SNP and indel genotyping is performed to obtain variant site information, which serves as the preliminary variant site information. It should be understood that the preliminary variant site information includes each variant site and its position. In this embodiment the variant sites are SNPs and indels; preferably, the variant sites are SNPs only.
Specifically, in this step the analysis may be performed with the UnifiedGenotyper of GATK. Because a number of data filtering parameters are applied after SNP genotyping to filter the data again and further control data quality, the standard minimum confidence thresholds are all set to zero in this step. It should be understood that SNPs is the plural form of SNP.
Of course, the preliminary calling of SNPs and indels may also be performed in other ways, which are not limited in this embodiment, for example with the HaplotypeCaller of GATK.
In this step, a VCF file containing the preliminary variant site information can be obtained. The preliminary variant site information in the VCF file includes each variant site obtained in step S110 and the position information corresponding to each variant site, as well as other information that is not described in detail here.
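To make the content of such a VCF file concrete, the sketch below parses one data line into the fields that the later filtering uses. It assumes a single-sample file in which MQ appears in the INFO column and GQ in the per-sample FORMAT column; the function name `parse_vcf_line` and the example record are hypothetical, and a dedicated VCF library would be preferable in practice.

```python
def parse_vcf_line(line):
    """Parse one VCF data line (not a '#' header line) into a small dict.

    Assumptions for illustration: a single sample column, MQ in INFO,
    GQ in FORMAT. Missing values are returned as None.
    """
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)

    record = {
        "chrom": chrom,
        "pos": int(pos),            # 1-based position
        "ref": ref,
        "alts": alt.split(","),     # one entry per alternate allele
        "mq": float(info_map["MQ"]) if "MQ" in info_map else None,
        "gq": None,
    }
    if len(fields) >= 10:
        sample = dict(zip(fields[8].split(":"), fields[9].split(":")))
        if sample.get("GQ", ".") != ".":
            record["gq"] = int(sample["GQ"])
    return record


# Made-up heterozygous SNP record.
line = "chr1\t12345\t.\tA\tG\t50.0\t.\tDP=20;MQ=58.7\tGT:GQ\t0/1:45"
site = parse_vcf_line(line)
print(site)
```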
Step S120: According to the preliminary variant site information, delete the variant sites among the plurality of preliminary variant sites that do not satisfy the preset retention condition, to obtain the variant sites in the gene under test.
The preliminary variant sites in the preliminary variant site information obtained in step S110 may still include false-positive variant sites. This step therefore filters the preliminary variant sites further, deleting those with a higher probability of being false positives, and takes the variant sites remaining after deletion as the variant sites in the gene under test, so that the variant sites finally obtained are more accurate. It should be understood that the result after deletion also includes the position information and other information of each variant site, which is not described in detail here.
Specifically, this step may include one or more of the following modes of deleting variant sites that do not satisfy the preset retention condition:
Mode one: Among the plurality of preliminary variant sites, remove the variant sites whose number of alleles is greater than a preset threshold.
Variant sites whose number of alleles is greater than the preset threshold have a higher probability of being false positives and are removed. In this embodiment, the preset threshold may be set according to actual needs. Because sites containing more than one allele have a higher genotyping error rate, the preset threshold is preferably 1.
When the preset threshold is 1, the variant sites with more than one allele are removed from the plurality of preliminary variant sites obtained.
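Mode one can be sketched in a few lines of Python. The text does not specify how alleles are counted; this sketch assumes that the count refers to the alternate alleles listed for a site, and the dictionary layout and the name `remove_multiallelic` are hypothetical.

```python
def remove_multiallelic(variants, max_alleles=1):
    """Keep only sites with at most `max_alleles` alternate alleles.

    Assumption for illustration: the allele count is the number of ALT alleles.
    """
    return [v for v in variants if len(v["alts"]) <= max_alleles]


# Made-up sites; the one at position 200 has two alternate alleles.
sites = [
    {"chrom": "chr1", "pos": 100, "ref": "A", "alts": ["G"]},
    {"chrom": "chr1", "pos": 200, "ref": "C", "alts": ["A", "T"]},
    {"chrom": "chr1", "pos": 300, "ref": "T", "alts": ["C"]},
]
kept = remove_multiallelic(sites)
print([v["pos"] for v in kept])
```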
Mode two: Among the plurality of preliminary variant sites, delete all variant sites located in the upstream range or the downstream range of each insertion or deletion (indel), where the upstream range and the downstream range each include a preset number of bases.
The short sequences being aligned are usually output by a second-generation sequencing platform, and such short sequences are more prone to misalignment in regions near indels; the local realignment in the foregoing processing cannot completely eliminate this kind of error. All variant sites in the upstream range or the downstream range of an indel are therefore deleted to reduce the probability of false-positive results.
The number of bases included in the upstream range and the downstream range is a preset number that the user can determine according to actual needs; it is not limited in this embodiment, and the preset numbers for the upstream range and the downstream range may be the same or different.
In this embodiment, the upstream range preferably includes 5 bases and the downstream range preferably includes 5 bases. That is, all indels among the preliminary variant sites are determined, and for each indel, all variant sites within 5 bp (5 bases) upstream of it are deleted, or all variant sites within 5 bp downstream of it are deleted.
Of course, in this embodiment it is possible to delete only the variant sites in the upstream range of the indel or only those in the downstream range, or to delete the variant sites in both the upstream range and the downstream range.
Preferably, in this embodiment, what is deleted is all SNPs in the upstream range or the downstream range of each indel.
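A minimal sketch of mode two follows, under several assumptions that the text leaves open: an indel is any site with an alternate allele whose length differs from the reference allele, the indel occupies positions `pos` to `pos + len(ref) - 1`, "upstream" means lower coordinates on the reference, both ranges are applied, and only SNPs (not the indels themselves) are removed. The names `is_indel` and `remove_snps_near_indels` are hypothetical.

```python
def is_indel(variant):
    """A site is treated as an indel if any ALT differs in length from REF."""
    return any(len(alt) != len(variant["ref"]) for alt in variant["alts"])


def remove_snps_near_indels(variants, upstream=5, downstream=5):
    """Drop SNPs lying within the given number of bases of any indel.

    Assumptions for illustration: the indel spans pos .. pos + len(ref) - 1,
    and upstream means lower reference coordinates.
    """
    indels = [v for v in variants if is_indel(v)]
    kept = []
    for v in variants:
        if is_indel(v):
            kept.append(v)  # indels themselves are retained in this sketch
            continue
        near = any(
            v["chrom"] == d["chrom"]
            and d["pos"] - upstream <= v["pos"] <= d["pos"] + len(d["ref"]) - 1 + downstream
            for d in indels
        )
        if not near:
            kept.append(v)
    return kept


# Made-up sites around a 1-base deletion at chr1:100 (REF "AT", ALT "A").
sites = [
    {"chrom": "chr1", "pos": 94, "ref": "A", "alts": ["G"]},
    {"chrom": "chr1", "pos": 96, "ref": "C", "alts": ["T"]},
    {"chrom": "chr1", "pos": 100, "ref": "AT", "alts": ["A"]},
    {"chrom": "chr1", "pos": 106, "ref": "G", "alts": ["C"]},
    {"chrom": "chr1", "pos": 107, "ref": "T", "alts": ["C"]},
    {"chrom": "chr2", "pos": 100, "ref": "A", "alts": ["G"]},
]
kept = remove_snps_near_indels(sites)
print([(v["chrom"], v["pos"]) for v in kept])
```

With these inputs the excluded window on chr1 runs from position 95 to 106, so the SNPs at 96 and 106 are removed while those at 94 and 107, and the site on chr2, remain.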
Mode three: Among the plurality of preliminary variant sites, delete the variant sites that are separated from one another by fewer than a preset number of bases.
In this step, variant sites that are close to one another are deleted, that is, variant sites whose distance from one another is less than a certain value are deleted.
In this embodiment, the preset number of bases is not limited and may be set according to actual needs.
Preferably, the preset number of bases is 4: if the number of bases separating certain variant sites is less than 4, those variant sites are deleted. That is, variant sites lying within 5 bp upstream or downstream of one another are deleted.
Preferably, in this step, what is deleted is the SNPs that are separated from one another by fewer than the preset number of bases.
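Mode three can be sketched as follows. The reading used here, that two sites are too close when fewer than `min_gap` bases lie strictly between them, is one interpretation of the text and is an assumption, as is the choice to delete every member of a close pair. The name `remove_clustered` is hypothetical.

```python
def remove_clustered(variants, min_gap=4):
    """Drop every site that has a neighbour with fewer than `min_gap` bases between.

    After sorting, checking adjacent pairs is sufficient: if any site is too
    close to a given site, its nearest neighbour in sorted order is too.
    """
    ordered = sorted(variants, key=lambda v: (v["chrom"], v["pos"]))
    flagged = set()
    for i in range(len(ordered) - 1):
        a, b = ordered[i], ordered[i + 1]
        bases_between = b["pos"] - a["pos"] - 1
        if a["chrom"] == b["chrom"] and bases_between < min_gap:
            flagged.update((i, i + 1))
    return [v for i, v in enumerate(ordered) if i not in flagged]


# Made-up positions on one chromosome.
sites = [{"chrom": "chr1", "pos": p} for p in (100, 103, 110, 200, 205)]
kept = remove_clustered(sites)
print([v["pos"] for v in kept])
```

Positions 100 and 103 have only two bases between them and are both removed, whereas 200 and 205 have exactly four bases between them and are retained under this reading.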
Mode four: Among the plurality of preliminary variant sites, delete the variant sites whose GQ (Genotype quality) value is less than a preset GQ threshold.
GQ (Genotype quality) is a Phred-scaled posterior probability value. For each site, the GQ value expresses the probability that the genotype currently obtained for the site is not the true genotype, that is, it indicates how reliable the genotype obtained for the site is. It is calculated as:
GQ = -10 × log10(P[error]), where P[error] denotes the probability that the genotype obtained for the corresponding site is not the true genotype.
Preferably, in this embodiment, the preset GQ threshold is 20. A GQ threshold of 20 corresponds to a theoretical error rate of 1%.
Mode five: Among the plurality of preliminary variant sites, delete the variant sites whose MQ (Mapping quality) value is less than a preset MQ threshold.
MQ represents the specificity (uniqueness) of the aligned sequence. When the same short sequence can be aligned to different regions of the same genome, the larger the difference between the alignment score of the first best alignment and the alignment score of the second best alignment, the more specific the alignment is and the higher the MQ value.
In this embodiment, variant sites whose MQ value is less than the preset MQ threshold are considered to have a higher probability of being false positives and are deleted.
Preferably, in this embodiment, the preset MQ threshold is 30. When the MQ value is 30, P[error] = 0.001, that is, relative to the current alignment position, the probability that the sequence aligns to another position is at most 0.1%.
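The relation between the Phred-scaled thresholds in modes four and five and their error probabilities can be checked with a short sketch. The helper names and the dictionary layout are hypothetical, and the sketch assumes that every site carries both a GQ and an MQ value.

```python
import math


def phred_to_error(q):
    """Convert a Phred-scaled quality to an error probability: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_to_phred(p_error):
    """Convert an error probability to a Phred-scaled quality: Q = -10 * log10(P)."""
    return -10 * math.log10(p_error)


def filter_by_quality(variants, min_gq=20, min_mq=30):
    """Keep sites whose GQ and MQ both reach the preset thresholds."""
    return [v for v in variants if v["gq"] >= min_gq and v["mq"] >= min_mq]


# Made-up sites with different quality values.
sites = [
    {"pos": 100, "gq": 45, "mq": 60.0},
    {"pos": 200, "gq": 12, "mq": 60.0},  # GQ below 20
    {"pos": 300, "gq": 99, "mq": 18.5},  # MQ below 30
    {"pos": 400, "gq": 20, "mq": 30.0},  # exactly at both thresholds
]
kept = filter_by_quality(sites)
print(phred_to_error(20), phred_to_error(30), [v["pos"] for v in kept])
```

A site exactly at a threshold is retained here because the text deletes only values that are less than the threshold.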
In the embodiments of the present invention, modes one to five are optional, that is, this step may use any one of them, several of them, or all of them. When several modes are used to delete the variant sites that do not satisfy the retention condition, the order in which they are executed is not limited. Of course, the modes may also be executed in parallel.
In addition, in step S120, when several modes are executed serially, a later mode may be executed on the result of an earlier one. For example, suppose mode one (removing the variant sites whose number of alleles is greater than the preset threshold) and mode three (deleting the variant sites that are separated from one another by fewer than the preset number of bases) are both used, with mode one executed first and mode three second. Then what mode three deletes may be the variant sites, among those remaining after mode one, that are separated from one another by fewer than the preset number of bases.
After the preliminary variant sites have been filtered by deletion in step S120, the final result is taken as the variant sites of the gene under test and may be represented as a VCF-format file.
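The effect of serial execution can be illustrated with a small sketch in which each mode is a function from a list of sites to a list of sites. The function names, the data layout, and the specific readings of modes one and three are assumptions made for illustration only.

```python
def drop_multiallelic(variants, max_alts=1):
    """Mode one (illustrative reading): drop sites with more than `max_alts` ALT alleles."""
    return [v for v in variants if len(v["alts"]) <= max_alts]


def drop_clustered(variants, min_gap=4):
    """Mode three (illustrative reading): drop sites with fewer than `min_gap` bases between them."""
    ordered = sorted(variants, key=lambda v: (v["chrom"], v["pos"]))
    close = set()
    for i in range(len(ordered) - 1):
        a, b = ordered[i], ordered[i + 1]
        if a["chrom"] == b["chrom"] and b["pos"] - a["pos"] - 1 < min_gap:
            close.update((i, i + 1))
    return [v for i, v in enumerate(ordered) if i not in close]


def run_serially(variants, filters):
    """Apply each filter to the output of the previous one."""
    for f in filters:
        variants = f(variants)
    return variants


# Made-up sites: a SNP at 100 sits next to a multi-allelic site at 102.
sites = [
    {"chrom": "chr1", "pos": 100, "alts": ["G"]},
    {"chrom": "chr1", "pos": 102, "alts": ["C", "T"]},
    {"chrom": "chr1", "pos": 500, "alts": ["A"]},
]
one_then_three = run_serially(sites, [drop_multiallelic, drop_clustered])
three_then_one = run_serially(sites, [drop_clustered, drop_multiallelic])
print([v["pos"] for v in one_then_three], [v["pos"] for v in three_then_one])
```

In this made-up case the order matters: when mode one runs first, the site at 102 is already gone, so the SNP at 100 no longer has a close neighbour and survives mode three; when mode three runs first, both sites are removed.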
Second embodiment
This embodiment provides an apparatus 200 for obtaining variant sites. Referring to Fig. 4, the apparatus 200 includes:
An alignment module 210, configured to align a plurality of short sequences of the gene under test against a reference genome to obtain preliminary variant site information of the gene under test, the preliminary variant site information including a plurality of preliminary variant sites.
A filtering module 220, configured to delete, according to the preliminary variant site information, the variant sites among the plurality of preliminary variant sites that do not satisfy the preset retention condition, to obtain the variant sites in the gene under test.
Further, in this embodiment, as shown in Fig. 5, the filtering module 220 may include one or more of the following units:
A first deleting unit 221, configured to remove, among the plurality of preliminary variant sites, the variant sites whose number of alleles is greater than a preset threshold.
A second deleting unit 222, configured to delete, among the plurality of preliminary variant sites, all variant sites located in the upstream range or the downstream range of each insertion or deletion (indel), where the upstream range and the downstream range each include a preset number of bases.
A third deleting unit 223, configured to delete, among the plurality of preliminary variant sites, the variant sites that are separated from one another by fewer than a preset number of bases.
A fourth deleting unit 224, configured to delete, among the plurality of preliminary variant sites, the variant sites whose GQ (Genotype quality) value is less than a preset GQ threshold.
A fifth deleting unit 225, configured to delete, among the plurality of preliminary variant sites, the variant sites whose MQ (Mapping quality) value is less than a preset MQ threshold.
Further, as shown in Fig. 6, the alignment module 210 provided by this embodiment may also include:
An alignment unit 211, configured to align the plurality of short sequences of the gene under test against the reference genome for the first time to obtain an alignment result in SAM format.
A deduplication unit 212, configured to deduplicate the alignment result so that at most one short sequence is aligned to any one position of the reference genome.
A realignment unit 213, configured to perform local realignment (local multiple alignment) on the deduplicated alignment result.
A quality score calculation unit 214, configured to recalculate the base quality scores in the locally realigned alignment result.
A preliminary calling unit 215, configured to perform SNP and indel analysis on the locally realigned alignment result according to the base quality scores, to obtain the preliminary variant site information.
It should be noted that, because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant details, refer to the description of the method embodiment.
In summary, in the method and apparatus for obtaining variant sites provided by the embodiments of the present invention, after variant site information is preliminarily obtained with existing software, the preliminary variant sites obtained are filtered further and those that do not satisfy the preset retention condition are deleted, so that the variant sites finally obtained for the gene under test are more accurate.
In several embodiments provided herein, it should be understood that disclosed apparatus and method, it is also possible to pass through Other modes are realized.Device embodiment described above is only schematically, for example flow chart and block diagram in accompanying drawing Show the device of multiple embodiments according to the application, the architectural framework in the cards of method and computer program product, Function and operation.At this point, each square frame in flow chart or block diagram can represent the one of module, program segment or a code Part, a part for the module, program segment or code are used to realize holding for the logic function for specifying comprising one or more Row instruction.It should also be noted that at some as in the implementations replaced, the function of being marked in square frame can also be being different from The order marked in accompanying drawing occurs.For example, two continuous square frames can essentially be performed substantially in parallel, and they are sometimes Can perform in the opposite order, this is depending on involved function.It is also noted that every in block diagram and/or flow chart The combination of individual square frame and block diagram and/or the square frame in flow chart, can use the special base for performing the function or action of regulation Realize in the system of hardware, or can be realized with the combination of specialized hardware and computer instruction.
In addition, the functional modules in the embodiments of the application may all be integrated to form one independent part, each module may exist separately, or two or more modules may be integrated to form one independent part.
If the functions are implemented in the form of software functional modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the application, in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, the server 100, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
It should be noted that, in this document, relational terms such as "first", "second", and "another" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include" and "comprise", and any variants of them, are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to that process, method, article, or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The foregoing describes only preferred embodiments of the application and is not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the application shall fall within the scope of protection of the application. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.
The foregoing is only a specific embodiment of the application, but the scope of protection of the application is not limited to it. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in the application shall be covered by the scope of protection of the application. Therefore, the scope of protection of the application shall be determined by the scope of the claims.

Claims (10)

1. A method for obtaining variant sites, characterized in that the method comprises:
aligning a plurality of short sequences of a gene to be tested with a reference genome to obtain preliminary variant site information of the gene to be tested, the preliminary variant site information comprising a plurality of preliminary variant sites;
according to the preliminary variant site information, deleting, from the plurality of preliminary variant sites, the variant sites that do not satisfy a preset retention condition, to obtain the variant sites in the gene to be tested.
2. The method according to claim 1, characterized in that deleting, from the plurality of preliminary variant sites, the variant sites that do not satisfy the preset retention condition comprises:
removing, from the plurality of preliminary variant sites, the variant sites whose number of alleles is greater than a preset threshold.
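As an illustration of the allele-count condition in claim 2, a minimal Python sketch follows. The record layout (`Site`), the function name `filter_by_allele_count`, the choice to count the reference allele toward the total, and the threshold of 2 are all assumptions made for this example; the claim itself only requires some preset threshold.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Site:
    """Hypothetical record for one preliminary variant site."""
    chrom: str
    pos: int
    ref: str
    alts: Tuple[str, ...]


def allele_count(site: Site) -> int:
    # Assumption: the reference allele counts as one allele.
    return 1 + len(site.alts)


def filter_by_allele_count(sites: List[Site], max_alleles: int = 2) -> List[Site]:
    """Keep only sites whose allele count does not exceed the preset threshold."""
    return [s for s in sites if allele_count(s) <= max_alleles]


preliminary = [
    Site("chr1", 100, "A", ("G",)),      # 2 alleles: kept
    Site("chr1", 200, "C", ("T", "G")),  # 3 alleles: removed
]
kept = filter_by_allele_count(preliminary)
```

With a threshold of 2, this sketch retains only biallelic sites; a larger threshold would retain multiallelic sites as well.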
3. The method according to claim 1, characterized in that the preliminary variant site information further comprises the positions of the plurality of preliminary variant sites, and deleting, from the plurality of preliminary variant sites, the variant sites that do not satisfy the preset retention condition comprises:
deleting, from the plurality of preliminary variant sites, all variant sites located within the upstream span or the downstream span of each insertion/deletion (indel), the upstream span and the downstream span each comprising a preset number of bases.
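The indel-proximity condition of claim 3 can be sketched as follows. This is only one possible reading: the dictionary keys, the function names, the window of 5 bases, the rule that an indel is any record whose reference and alternate alleles differ in length, and the decision to leave the indel records themselves in place are assumptions of this example and are not stated in the claim.

```python
from typing import Dict, List

Site = Dict[str, object]  # hypothetical keys: chrom, pos, ref, alt


def is_indel(site: Site) -> bool:
    # Assumption: differing allele lengths mark an insertion or deletion.
    return len(site["ref"]) != len(site["alt"])


def remove_sites_near_indels(sites: List[Site], window: int = 5) -> List[Site]:
    """Drop non-indel sites within `window` bases upstream or downstream of an indel."""
    # Each indel spans from its start position to the last reference base it covers.
    spans = [
        (s["chrom"], s["pos"], s["pos"] + len(s["ref"]) - 1)
        for s in sites
        if is_indel(s)
    ]
    kept = []
    for s in sites:
        if is_indel(s):
            kept.append(s)  # indels are left for a separate step in this sketch
            continue
        near = any(
            chrom == s["chrom"] and start - window <= s["pos"] <= end + window
            for chrom, start, end in spans
        )
        if not near:
            kept.append(s)
    return kept


preliminary = [
    {"chrom": "chr1", "pos": 95, "ref": "A", "alt": "G"},
    {"chrom": "chr1", "pos": 100, "ref": "AT", "alt": "A"},  # deletion
    {"chrom": "chr1", "pos": 104, "ref": "C", "alt": "T"},
    {"chrom": "chr1", "pos": 120, "ref": "G", "alt": "A"},
    {"chrom": "chr2", "pos": 100, "ref": "T", "alt": "C"},
]
kept = remove_sites_near_indels(preliminary, window=5)
```

The comparison is made per chromosome, so a site at the same coordinate on a different chromosome is unaffected.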
4. The method according to claim 1, characterized in that the preliminary variant site information further comprises the positions of the plurality of preliminary variant sites, and deleting, from the plurality of preliminary variant sites, the variant sites that do not satisfy the preset retention condition comprises:
deleting, from the plurality of preliminary variant sites, the variant sites that are separated from one another by a preset number of bases.
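The spacing condition of claim 4 admits more than one reading. The sketch below assumes that sites lying within at most a preset number of bases of a neighbouring site on the same chromosome are treated as clustered and removed together. That interpretation, the tuple layout `(chrom, pos)`, the function name, and the gap of 5 bases are assumptions of this example.

```python
from typing import List, Set, Tuple

Site = Tuple[str, int]  # hypothetical layout: (chromosome, position)


def remove_clustered_sites(sites: List[Site], max_gap: int = 5) -> List[Site]:
    """Drop every site that lies within `max_gap` bases of an adjacent site."""
    ordered = sorted(sites)
    crowded: Set[Site] = set()
    # After sorting, only adjacent pairs need to be compared.
    for left, right in zip(ordered, ordered[1:]):
        same_chrom = left[0] == right[0]
        if same_chrom and right[1] - left[1] <= max_gap:
            crowded.add(left)
            crowded.add(right)
    return [s for s in ordered if s not in crowded]


preliminary = [("chr1", 100), ("chr1", 103), ("chr1", 200), ("chr2", 101)]
kept = remove_clustered_sites(preliminary, max_gap=5)
```

Both members of a close pair are removed here; an implementation that keeps one member of each pair would be an equally plausible variant.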
5. The method according to claim 1, characterized in that deleting, from the plurality of preliminary variant sites, the variant sites that do not satisfy the preset retention condition comprises:
deleting, from the plurality of preliminary variant sites, the variant sites whose corresponding GQ value is lower than a preset GQ threshold.
6. The method according to claim 1, characterized in that deleting, from the plurality of preliminary variant sites, the variant sites that do not satisfy the preset retention condition comprises:
deleting, from the plurality of preliminary variant sites, the variant sites whose corresponding MQ value is lower than a preset MQ threshold.
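The GQ and MQ conditions of claims 5 and 6 are both simple threshold tests, sketched together below. In VCF files GQ conventionally denotes genotype quality and MQ mapping quality. The dictionary layout, the function names, the thresholds of 20 and 40, and the rule that a site with a missing value is discarded are assumptions of this example, not values taken from the claims.

```python
from typing import Dict, List, Optional

Site = Dict[str, Optional[float]]  # hypothetical keys: pos, GQ, MQ


def passes_quality(site: Site, min_gq: float = 20.0, min_mq: float = 40.0) -> bool:
    """Return True when both quality values reach their preset thresholds."""
    gq = site.get("GQ")
    mq = site.get("MQ")
    if gq is None or mq is None:
        # Assumption: a site without a quality value cannot be verified, so drop it.
        return False
    return gq >= min_gq and mq >= min_mq


def filter_by_quality(sites: List[Site], min_gq: float = 20.0,
                      min_mq: float = 40.0) -> List[Site]:
    return [s for s in sites if passes_quality(s, min_gq, min_mq)]


preliminary = [
    {"pos": 100, "GQ": 99.0, "MQ": 60.0},  # kept
    {"pos": 200, "GQ": 10.0, "MQ": 60.0},  # GQ below threshold
    {"pos": 300, "GQ": 50.0, "MQ": 25.0},  # MQ below threshold
    {"pos": 400, "GQ": None, "MQ": 60.0},  # missing GQ
]
kept = filter_by_quality(preliminary)
```

The two thresholds are independent, so either test can be disabled by setting its threshold to 0.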
7. The method according to claim 1, characterized in that aligning the plurality of short sequences of the gene to be tested with the reference genome to obtain the preliminary variant site information of the gene to be tested comprises:
performing an initial alignment of the plurality of short sequences of the gene to be tested against the reference genome to obtain an alignment result in SAM format;
removing duplicates from the alignment result, so that the number of short sequences aligned to any one position of the reference genome is less than or equal to 1;
performing local realignment on the deduplicated alignment result;
recalculating the base quality scores in the locally realigned alignment result;
according to the base quality scores, performing SNP and indel analysis on the locally realigned alignment result to obtain the preliminary variant site information.
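Of the steps in claim 7, the duplicate-removal step is the easiest to illustrate in isolation. The sketch below keeps at most one short sequence per reference position, as the claim requires. The record layout, the function name, and the rule of keeping the read with the highest mapping quality are assumptions of this example; the claim does not say which read is retained.

```python
from typing import Dict, List, Tuple

Read = Dict[str, object]  # hypothetical keys: name, chrom, pos, mapq


def deduplicate(reads: List[Read]) -> List[Read]:
    """Keep at most one read per (chromosome, start position)."""
    best: Dict[Tuple[str, int], Read] = {}
    for read in reads:
        key = (read["chrom"], read["pos"])
        # Assumption: among reads at the same position, prefer the higher mapping quality.
        if key not in best or read["mapq"] > best[key]["mapq"]:
            best[key] = read
    return list(best.values())


aligned = [
    {"name": "r1", "chrom": "chr1", "pos": 100, "mapq": 30},
    {"name": "r2", "chrom": "chr1", "pos": 100, "mapq": 60},
    {"name": "r3", "chrom": "chr1", "pos": 150, "mapq": 60},
]
unique = deduplicate(aligned)
```

The remaining steps (local realignment, base quality recalculation, and SNP/indel analysis) depend on the external software used and are not sketched here.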
8. The method according to claim 1, characterized in that the variant sites are SNPs.
9. A device for obtaining variant sites, characterized in that the device comprises:
an alignment module, configured to align a plurality of short sequences of a gene to be tested with a reference genome to obtain preliminary variant site information of the gene to be tested, the preliminary variant site information comprising a plurality of preliminary variant sites;
a filtering module, configured to delete, according to the preliminary variant site information, the variant sites among the plurality of preliminary variant sites that do not satisfy a preset retention condition, to obtain the variant sites in the gene to be tested.
10. The device according to claim 9, characterized in that the filtering module comprises one or more of the following:
a first deletion unit, configured to remove, from the plurality of preliminary variant sites, the variant sites whose number of alleles is greater than a preset threshold;
a second deletion unit, configured to delete, from the plurality of preliminary variant sites, all variant sites located within the upstream span or the downstream span of each insertion/deletion (indel), the upstream span and the downstream span each comprising a preset number of bases;
a third deletion unit, configured to delete, from the plurality of preliminary variant sites, the variant sites that are separated from one another by a preset number of bases;
a fourth deletion unit, configured to delete, from the plurality of preliminary variant sites, the variant sites whose corresponding GQ value is lower than a preset GQ threshold;
a fifth deletion unit, configured to delete, from the plurality of preliminary variant sites, the variant sites whose corresponding MQ value is lower than a preset MQ threshold.
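Because the filtering module of claim 10 may contain any one or more of the deletion units, one way to picture it is as a chain of independent functions applied in sequence. The sketch below is illustrative only: the type aliases, the helper names, the dictionary keys, and the thresholds are hypothetical, and only two of the five units are shown.

```python
from typing import Any, Callable, Dict, List

Site = Dict[str, Any]  # hypothetical keys: pos, n_alleles, GQ
DeletionUnit = Callable[[List[Site]], List[Site]]


def allele_count_unit(max_alleles: int) -> DeletionUnit:
    """First deletion unit: drop sites with more alleles than the threshold."""
    return lambda sites: [s for s in sites if s["n_alleles"] <= max_alleles]


def gq_unit(min_gq: float) -> DeletionUnit:
    """Fourth deletion unit: drop sites whose GQ value is below the threshold."""
    return lambda sites: [s for s in sites if s["GQ"] >= min_gq]


def build_filter_module(units: List[DeletionUnit]) -> DeletionUnit:
    """Combine one or more deletion units into a single filtering module."""
    def run(sites: List[Site]) -> List[Site]:
        for unit in units:
            sites = unit(sites)
        return sites
    return run


preliminary = [
    {"pos": 100, "n_alleles": 2, "GQ": 99.0},
    {"pos": 200, "n_alleles": 3, "GQ": 99.0},
    {"pos": 300, "n_alleles": 2, "GQ": 5.0},
]
filter_module = build_filter_module([allele_count_unit(2), gq_unit(20.0)])
final_sites = filter_module(preliminary)
```

Each unit only ever removes sites, so the order in which the units are chained does not change the final set in this sketch.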

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610972449.3A CN106529211A (en) 2016-11-04 2016-11-04 Variable site obtaining method and apparatus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610972449.3A CN106529211A (en) 2016-11-04 2016-11-04 Variable site obtaining method and apparatus

Publications (1)

Publication Number Publication Date
CN106529211A true CN106529211A (en) 2017-03-22

Family

ID=58349459

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610972449.3A Pending CN106529211A (en) 2016-11-04 2016-11-04 Variable site obtaining method and apparatus

Country Status (1)

Country Link
CN (1) CN106529211A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103198236A (en) * 2012-01-06 2013-07-10 深圳华大基因科技有限公司 CYP450 gene type database and gene typing and enzymatic activity identification method
CN104462869A (en) * 2014-11-28 2015-03-25 天津诺禾致源生物信息科技有限公司 Method and device for detecting somatic cell SNP
CN106011224A (en) * 2015-12-24 2016-10-12 晶能生物技术(上海)有限公司 Nervous system genetic disease gene united screening method, kit and preparation method thereof
CN105930690A (en) * 2016-05-13 2016-09-07 万康源(天津)基因科技有限公司 Whole-exome sequencing data analysis method
CN105925685A (en) * 2016-05-13 2016-09-07 万康源(天津)基因科技有限公司 Exome potential pathogenic mutation detection method based on family line
CN105956415A (en) * 2016-05-13 2016-09-21 万康源(天津)基因科技有限公司 SNV detection system affecting RNA splicing
CN105975809A (en) * 2016-05-13 2016-09-28 万康源(天津)基因科技有限公司 SNV detection method affecting RNA splicing

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
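He Weiming (何伟明): "Population SNP site detection and genotype determination based on resequencing data", China Master's Theses Full-text Database, Basic Sciences *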

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110021342A (en) * 2017-08-21 2019-07-16 北京哲源科技有限责任公司 For accelerating the method and system of the identification of variant sites
CN110021342B (en) * 2017-08-21 2020-12-15 北京哲源科技有限责任公司 Method and system for accelerating identification of variant sites
CN111919257A (en) * 2018-07-27 2020-11-10 思勤有限公司 Reducing noise in sequencing data
CN111919257B (en) * 2018-07-27 2021-05-28 思勤有限公司 Method and system for reducing noise in sequencing data, and implementation and application thereof
CN109979530A (en) * 2019-03-26 2019-07-05 北京市商汤科技开发有限公司 A kind of genetic mutation recognition methods, device and storage medium
CN109979530B (en) * 2019-03-26 2021-03-16 北京市商汤科技开发有限公司 Gene variation identification method, device and storage medium
CN109979531A (en) * 2019-03-29 2019-07-05 北京市商汤科技开发有限公司 A kind of genetic mutation recognition methods, device and storage medium
CN109994155A (en) * 2019-03-29 2019-07-09 北京市商汤科技开发有限公司 A kind of genetic mutation recognition methods, device and storage medium
CN109994155B (en) * 2019-03-29 2021-08-20 北京市商汤科技开发有限公司 Gene variation identification method, device and storage medium
CN109979531B (en) * 2019-03-29 2021-08-31 北京市商汤科技开发有限公司 Gene variation identification method, device and storage medium
CN111091870A (en) * 2019-12-18 2020-05-01 中国科学院大学 Method and system for controlling quality of gene mutation site
CN111091870B (en) * 2019-12-18 2021-11-02 中国科学院大学 Method and system for controlling quality of gene mutation site

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170322