CN106529211A - Variable site obtaining method and apparatus - Google Patents
Variable site obtaining method and apparatus
- Publication number
- CN106529211A (application CN201610972449.3A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B25/00—ICT specially adapted for hybridisation; ICT specially adapted for gene or protein expression
Abstract
The invention provides a method and an apparatus for obtaining variant sites, and relates to the technical field of bioinformatics. The method comprises: aligning a plurality of short sequences of a gene to be tested against a reference genome to obtain preliminary variant site information of the gene to be tested, the preliminary variant site information comprising a plurality of preliminary variant sites; and, according to the preliminary variant site information, deleting those preliminary variant sites that do not meet a preset retention condition to obtain the variant sites of the gene to be tested. Because the variant sites that do not meet the preset retention condition are deleted from the variant sites obtained in the first pass, more accurate variant sites can be obtained.
Description
Technical field
The present application relates to the technical field of bioinformatics, and in particular to a method and an apparatus for obtaining variant sites.
Background
Existing methods for obtaining variant sites commonly rely on next-generation sequencing, also called second-generation sequencing. Compared with first-generation (Sanger) sequencing, second-generation sequencing offers a large data volume, fast sequencing and a low cost per genetic site, but its raw data have a high error rate, so the variant sites it identifies are not accurate enough.
Summary of the invention
In view of this, the embodiments of the present application provide a method and an apparatus for obtaining variant sites, which further filter the variant sites preliminarily obtained by sequencing software and delete those that do not satisfy a preset retention condition, so that the variant sites obtained are more accurate and the above problem is mitigated.
To achieve the above object, the present application adopts the following technical solution:
A method for obtaining variant sites, comprising: aligning a plurality of short sequences of a gene to be tested against a reference genome to obtain preliminary variant site information of the gene to be tested, the preliminary variant site information including a plurality of preliminary variant sites; and, according to the preliminary variant site information, deleting the variant sites among the plurality of preliminary variant sites that do not satisfy a preset retention condition, to obtain the variant sites in the gene to be tested.
An apparatus for obtaining variant sites, comprising: an alignment module, configured to align a plurality of short sequences of a gene to be tested against a reference genome to obtain preliminary variant site information of the gene to be tested, the preliminary variant site information including a plurality of preliminary variant sites; and a filtering module, configured to delete, according to the preliminary variant site information, the variant sites among the plurality of preliminary variant sites that do not satisfy a preset retention condition, to obtain the variant sites in the gene to be tested.
With the method and apparatus for obtaining variant sites provided by the embodiments of the present application, after the short sequences of the gene to be tested are aligned against the reference genome and preliminary variant site information covering a plurality of variant sites is obtained, the variant sites in that information are filtered again according to the preliminary variant site information. That is, the preliminary variant sites that do not satisfy the preset retention condition are deleted, so that the sites finally retained are those of higher accuracy. Compared with the prior art, the variant sites of the gene to be tested obtained by this solution have a higher accuracy.
To make the above objects, features and advantages of the present application clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently some, rather than all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.
Fig. 1 shows a schematic structural diagram of the computer provided by an embodiment of the present application;
Fig. 2 shows a flow chart of the method for obtaining variant sites provided by the first embodiment of the present application;
Fig. 3 shows another flow chart of the method for obtaining variant sites provided by the first embodiment of the present application;
Fig. 4 shows a functional block diagram of the apparatus for obtaining variant sites provided by the second embodiment of the present application;
Fig. 5 shows a functional block diagram of the filtering module of the apparatus for obtaining variant sites provided by the second embodiment of the present application;
Fig. 6 shows a functional block diagram of the alignment module of the apparatus for obtaining variant sites provided by the second embodiment of the present application.
Detailed description of the embodiments
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, rather than all, of the embodiments of the present application. The components of the embodiments, as generally described and illustrated in the drawings herein, may be arranged and designed in a variety of configurations. The following detailed description of the embodiments provided in the drawings is therefore not intended to limit the claimed scope of the application but merely represents selected embodiments. All other embodiments obtained by those skilled in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present application.
It should be noted that similar reference numerals and letters denote similar items in the following drawings; once an item is defined in one drawing, it therefore need not be further defined or explained in subsequent drawings. In the description of the present application, the terms "first", "second" and the like are used only to distinguish descriptions and should not be understood as indicating or implying relative importance.
Fig. 1 is a block diagram of the computer 100 of the present application. The computer 100 includes an apparatus 200 for obtaining variant sites, a memory 101, a memory controller 102, a processor 103, a peripheral interface 104, an input/output unit 105 and other components.
The memory 101, memory controller 102, processor 103, peripheral interface 104 and input/output unit 105 are electrically connected to one another, directly or indirectly, to transmit or exchange data. For example, these elements may be electrically connected to one another through one or more communication buses or signal lines. The apparatus 200 for obtaining variant sites includes at least one software function module that can be stored in the memory 101 in the form of software or firmware, or built into the operating system (OS) of the computer 100. The processor 103 executes the executable modules stored in the memory 101, such as the software function modules or computer programs included in the apparatus 200 for obtaining variant sites.
The memory 101 may be, but is not limited to, a random access memory (RAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), or the like. The memory 101 stores a program, which the processor 103 executes after receiving an execution instruction. The method performed by the computer 100, as defined by the flow disclosed in any embodiment of the present application, may be applied to the processor 103 or implemented by the processor 103.
The processor 103 may be an integrated circuit chip with signal processing capability. The processor 103 may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP) and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps and logic block diagrams disclosed in the embodiments of the present application. The general-purpose processor may be a microprocessor, or the processor 103 may be any conventional processor.
The peripheral interface 104 couples various input/output devices to the processor 103 and the memory 101. In some embodiments, the peripheral interface 104, the processor 103 and the memory controller 102 may be implemented in a single chip. In other examples, they may each be implemented by a separate chip.
The input/output unit 105 allows the user to input data so that the user can interact with the computer. The input/output unit may be, but is not limited to, a data reading device, a mouse, a keyboard and the like.
It should be understood that the structure shown in Fig. 1 is merely illustrative; the computer 100 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1. Each component shown in Fig. 1 may be implemented in hardware, software or a combination thereof.
First embodiment
This embodiment of the present application provides a method for obtaining variant sites. Referring to Fig. 2, the method includes:
Step S110: align a plurality of short sequences of the gene to be tested against a reference genome to obtain preliminary variant site information of the gene to be tested, the preliminary variant site information including a plurality of preliminary variant sites.
First, the plurality of short sequences of the gene to be tested are obtained; the short sequences may be output by a second-generation sequencing platform. The short sequences are then aligned against the reference genome. For example, if the gene to be tested is a human gene, the reference genome is the human reference genome.
The alignment process may of course include multiple rounds of alignment, duplicate removal and similar processing; after alignment, variant site information covering a plurality of variant sites is obtained.
Specifically, as shown in Fig. 3, in this embodiment the process of aligning the data and obtaining the preliminary variant site information in this step may include:
Step S111: perform an initial alignment of the plurality of short sequences of the gene to be tested against the reference genome to obtain an alignment result in SAM format.
The short sequences of the gene to be tested are aligned against the reference genome. The alignment may be carried out with existing alignment software such as Bowtie2, yielding an alignment result in SAM format, which stores the alignment information obtained. It should be understood that the SAM-format alignment result includes information about each base in the gene to be tested, such as its position.
The particular alignment software and the representation of the alignment result are of course not limited in this embodiment, provided that the short sequences of the gene to be tested can be aligned against the reference genome and alignment information representing the result can be obtained.
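As a non-limiting illustration of the position information carried by a SAM-format alignment result, the following Python sketch splits one alignment line into its mandatory fields. The names `SamRecord` and `parse_sam_line` and the example record are hypothetical and are not part of the application; only the field order follows the SAM format.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str  # name of the short sequence (read)
    flag: int   # bitwise flag
    rname: str  # reference sequence name
    pos: int    # 1-based leftmost aligned position
    mapq: int   # mapping quality
    cigar: str  # alignment description
    seq: str    # bases of the short sequence
    qual: str   # per-base quality string


def parse_sam_line(line: str) -> SamRecord:
    """Parse one non-header SAM line (hypothetical helper)."""
    f = line.rstrip("\n").split("\t")
    if len(f) < 11:
        raise ValueError("a SAM alignment line has 11 mandatory fields")
    return SamRecord(f[0], int(f[1]), f[2], int(f[3]), int(f[4]), f[5], f[9], f[10])


# Made-up example record, for illustration only.
example = "read1\t0\tchr1\t100\t42\t5M\t*\t0\t0\tACGTA\tIIIII"
rec = parse_sam_line(example)
print(rec.rname, rec.pos, rec.mapq)
```

In practice a dedicated SAM/BAM library would normally be preferred over manual splitting; the sketch only makes the record layout explicit.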
Step S112: remove duplicates from the alignment result so that at most one short sequence is aligned to any given position of the reference genome.
The alignment result obtained in step S111 contains a certain proportion of duplicate sequences and results; for example, several short sequences may be aligned to the same position of the reference genome. In this step, the alignment result is therefore deduplicated.
In this embodiment, the Picard software may be used for deduplication. Specifically, Picard's MarkDuplicates tool may be used, producing a deduplicated result in BAM format.
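The effect of the deduplication can be sketched as follows. This is a deliberately simplified illustration and not Picard's actual algorithm: it assumes alignments are given as hypothetical `(read_name, reference, position)` tuples and simply keeps the first short sequence seen at each position.

```python
def deduplicate(alignments):
    """Keep at most one short sequence per reference position.

    `alignments` is an iterable of (read_name, reference, position) tuples,
    a hypothetical simplified stand-in for the records of a SAM/BAM file.
    """
    seen = set()
    kept = []
    for read_name, reference, position in alignments:
        key = (reference, position)
        if key in seen:
            continue  # another short sequence already occupies this position
        seen.add(key)
        kept.append((read_name, reference, position))
    return kept


reads = [("r1", "chr1", 100), ("r2", "chr1", 100), ("r3", "chr1", 105)]
unique_reads = deduplicate(reads)
print([name for name, _, _ in unique_reads])
```

Which duplicate is retained is a design choice; the sketch keeps the first one encountered purely for simplicity.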
Step S113: perform local realignment (local multiple alignment) on the deduplicated alignment result.
Because the short sequences aligned against the reference genome are difficult to align accurately in highly similar repetitive regions, false-positive variant sites, such as false-positive SNPs, are easily obtained in the repetitive regions of the genome. A false-positive variant site is understood to be a variant site resulting from an erroneous alignment. To reduce the number and proportion of false-positive variant sites, this embodiment performs local realignment on the deduplicated alignment result.
Specifically, the local realignment may be carried out with IndelRealigner in GATK, producing a realigned result in BAM format. The realignment generally has three steps: a. detect the suspicious regions that need realignment; b. realign these suspicious regions; c. repair the mate-pair information lost during realignment.
Step S114: recalculate the base quality scores in the locally realigned alignment result.
During the data processing of step S111, each single base is assigned a quality score that reflects the confidence in the nucleotide observed for that base.
The quality scores obtained in this earlier processing are not well correlated with the probability of an erroneous genotyping result. In addition, the quality score of a single base is not linked to other parameters, for example the different sequencing platforms, sequencing cycles and libraries within the same sample.
Therefore, in step S114 the quality score of each base is associated with the factors of the sequencing process and recalculated, generating a new quality score that is used to judge whether each base is reliable.
Specifically, in this embodiment GATK may be used to perform empirical quality score recalibration, producing a result in BAM format.
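The idea behind an empirical quality score can be sketched as follows. This is a simplified illustration under stated assumptions, not GATK's implementation: bases are grouped only by their reported quality score (the application also mentions platform, cycle and library as factors), mismatches against the reference are counted per group, and the observed error rate is converted back to a Phred-scaled score. The function name and the add-one smoothing are assumptions.

```python
import math
from collections import defaultdict


def empirical_qualities(observations):
    """Map each reported quality to an empirically observed quality.

    `observations` is an iterable of (reported_quality, is_mismatch) pairs
    (hypothetical input; one pair per base compared with the reference).
    """
    counts = defaultdict(lambda: [0, 0])  # reported quality -> [mismatches, total]
    for reported_q, is_mismatch in observations:
        counts[reported_q][0] += int(is_mismatch)
        counts[reported_q][1] += 1
    table = {}
    for reported_q, (mismatches, total) in counts.items():
        # Add-one smoothing (an assumption) avoids log10(0) when no mismatch occurs.
        p_error = (mismatches + 1) / (total + 2)
        table[reported_q] = -10 * math.log10(p_error)
    return table


# 998 bases reported as Q30, of which 9 disagree with the reference.
obs = [(30, True)] * 9 + [(30, False)] * 989
recalibrated = empirical_qualities(obs)
print(round(recalibrated[30], 2))
```

In this made-up example the bases were reported as Q30, but their observed error rate of about 1% corresponds to a score near 20.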
Step S115: according to the base quality scores, perform SNP and indel analysis on the locally realigned alignment result to obtain the preliminary variant site information.
According to the recalculated base quality scores, SNPs and indels are preliminarily called on the locally realigned alignment result, that is, SNP and indel genotyping is performed to obtain variant site information, which serves as the preliminary variant site information. It should be understood that the preliminary variant site information includes each variant site and its position. In this embodiment the variant sites are SNPs and indels; preferably, the variant sites are SNPs only.
Specifically, the analysis in this step may be performed with the UnifiedGenotyper of GATK. Because multiple data-filtering parameters are applied after SNP genotyping to filter the data again and further control data quality, the standard minimum confidence thresholds are set to zero in this step. Here, SNPs denotes the plural of SNP.
The preliminary calling of SNPs and indels may of course also be carried out in other ways, for example with the HaplotypeCaller of GATK; this embodiment places no limitation on this.
This step yields a VCF file containing the preliminary variant site information, which includes each variant site obtained in step S110 and its position, as well as other information not detailed here.
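As a non-limiting illustration, the following sketch reads the fields of one VCF data line that the later filtering modes rely on: position, reference and alternate alleles, the MQ value from the INFO column and the GQ value from the sample column. It assumes a single-sample VCF line; the function name `parse_vcf_line` and the example line are hypothetical.

```python
def parse_vcf_line(line):
    """Return a dict for one VCF data line with a single sample column."""
    f = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt, _qual, _filter, info, fmt, sample = f[:10]
    # INFO is a ';'-separated list of KEY=VALUE items (flags without '=' are skipped).
    info_map = dict(item.split("=", 1) for item in info.split(";") if "=" in item)
    # FORMAT names the ':'-separated sub-fields of the sample column.
    sample_map = dict(zip(fmt.split(":"), sample.split(":")))
    return {
        "chrom": chrom,
        "pos": int(pos),
        "ref": ref,
        "alts": alt.split(","),
        "mq": float(info_map["MQ"]) if "MQ" in info_map else None,
        "gq": int(sample_map["GQ"]) if sample_map.get("GQ", ".") != "." else None,
    }


# Made-up example line, for illustration only.
line = "chr1\t1200\t.\tA\tG\t88.0\t.\tDP=25;MQ=60.00\tGT:GQ\t0/1:99"
site = parse_vcf_line(line)
print(site["pos"], site["alts"], site["mq"], site["gq"])
```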
Step S120: according to the preliminary variant site information, delete the variant sites among the plurality of preliminary variant sites that do not satisfy the preset retention condition to obtain the variant sites in the gene to be tested.
The preliminary variant sites in the information obtained in step S110 may still include false-positive variant sites. This step therefore filters the preliminary variant sites further, deleting those with a higher probability of being false positives, and takes the variant sites remaining after deletion as the variant sites in the gene to be tested, so that the variant sites finally obtained are more accurate. It should be understood that the result after deletion also includes the position of each variant site and other information, which is not detailed here.
Specifically, this step may delete the variant sites that do not satisfy the preset retention condition in one or more of the following modes:
Mode one: remove, from the plurality of preliminary variant sites, the variant sites whose number of alleles is greater than a preset threshold.
Variant sites whose number of alleles exceeds the preset threshold have a higher probability of being false positives and are removed. In this embodiment the preset threshold may be chosen according to actual needs. Because sites with more than one allele have a higher genotyping error rate, the preset threshold is preferably 1.
When the preset threshold is 1, the variant sites with more than one allele are removed from the preliminary variant sites obtained.
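Mode one can be sketched as below. The sketch assumes, as an interpretation not stated in the application, that "number of alleles" means the number of alternate alleles listed for a site, and it represents each site as a hypothetical dict with an `alts` list.

```python
def remove_multiallelic(sites, max_alleles=1):
    """Mode one: keep only sites with at most `max_alleles` alternate alleles.

    Counting alternate alleles is an assumption made for this illustration.
    """
    return [s for s in sites if len(s["alts"]) <= max_alleles]


preliminary = [
    {"chrom": "chr1", "pos": 100, "alts": ["G"]},
    {"chrom": "chr1", "pos": 250, "alts": ["A", "T"]},  # more than one allele
    {"chrom": "chr1", "pos": 900, "alts": ["C"]},
]
after_mode_one = remove_multiallelic(preliminary)
print([s["pos"] for s in after_mode_one])
```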
Mode two: delete, from the plurality of preliminary variant sites, all variant sites located within the upstream range or the downstream range of each insertion/deletion (indel), where the upstream range and the downstream range each span a preset number of bases.
The short sequences being aligned are usually output by a second-generation sequencing platform, and such short sequences are more prone to misalignment near insertion/deletion (indel) regions; the local realignment in the processing above cannot completely eliminate these errors. All variant sites within the upstream range or the downstream range of an indel are therefore deleted to reduce the probability of false-positive results.
The number of bases spanned by the upstream range and by the downstream range is a preset number that the user may determine according to actual needs; it is not limited in this embodiment, and the preset numbers for the upstream range and the downstream range may be the same or different.
In this embodiment the upstream range preferably spans 5 bases and the downstream range preferably spans 5 bases. That is, all indels among the preliminary variant sites are identified and, for each indel, all variant sites within 5 bp (5 bases) upstream of it, or all variant sites within 5 bp downstream of it, are deleted.
In this embodiment it is of course possible to delete only the variant sites in the upstream range of the indel or only those in the downstream range, or to delete the variant sites in both ranges.
Preferably, in this embodiment, what is deleted is all SNPs within the upstream range or the downstream range of each indel.
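A minimal sketch of mode two, in its preferred SNP-only form, follows. Several simplifications are assumptions made for illustration: an indel is recognised by a length difference between the reference and an alternate allele, its location is taken to be its recorded position alone (its length is ignored), and "upstream" is taken to mean lower coordinates. The site dicts and function names are hypothetical.

```python
def is_indel(site):
    """Treat a site as an indel if any alternate allele differs in length from REF."""
    return any(len(site["ref"]) != len(alt) for alt in site["alts"])


def remove_snps_near_indels(sites, upstream=5, downstream=5):
    """Mode two: drop SNPs within `upstream`/`downstream` bases of any indel."""
    indels = [(s["chrom"], s["pos"]) for s in sites if is_indel(s)]
    kept = []
    for s in sites:
        if is_indel(s):
            kept.append(s)  # indels themselves are retained in this sketch
            continue
        near = any(
            chrom == s["chrom"] and -upstream <= s["pos"] - pos <= downstream
            for chrom, pos in indels
        )
        if not near:
            kept.append(s)
    return kept


def snp(chrom, pos):
    return {"chrom": chrom, "pos": pos, "ref": "A", "alts": ["G"]}


preliminary = [
    snp("chr1", 94),
    snp("chr1", 96),
    {"chrom": "chr1", "pos": 100, "ref": "AT", "alts": ["A"]},  # deletion
    snp("chr1", 103),
    snp("chr1", 106),
    snp("chr2", 100),
]
after_mode_two = remove_snps_near_indels(preliminary)
print([(s["chrom"], s["pos"]) for s in after_mode_two])
```

Setting `upstream=0` or `downstream=0` in the sketch corresponds to filtering on one side only, and the two values may differ, as the embodiment allows.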
Mode three: delete, from the plurality of preliminary variant sites, the variant sites that are separated from one another by fewer than a preset number of bases.
In this step, variant sites that lie close to one another are deleted, that is, variant sites whose distance from one another is less than a certain value.
In this embodiment the preset number of bases is not limited and may be set according to actual needs.
Preferably, the preset number of bases is 4: if fewer than 4 bases separate certain variant sites, those sites are deleted. In other words, variant sites within 5 bp upstream or downstream of one another are deleted.
Preferably, what is deleted in this step is SNPs separated from one another by fewer than the preset number of bases.
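Mode three can be sketched as follows, assuming that the spacing between two sites is counted as the number of bases lying strictly between them and that every site in a close pair is deleted. The site dicts and the function name are hypothetical; for brevity the sketch treats all sites alike rather than SNPs only.

```python
def remove_clustered_sites(sites, min_gap=4):
    """Mode three: drop sites with fewer than `min_gap` bases between them."""
    ordered = sorted(sites, key=lambda s: (s["chrom"], s["pos"]))
    too_close = set()
    # Comparing sorted neighbours is enough: if any two sites are close,
    # every adjacent pair lying between them is close as well.
    for left, right in zip(ordered, ordered[1:]):
        bases_between = right["pos"] - left["pos"] - 1
        if left["chrom"] == right["chrom"] and bases_between < min_gap:
            too_close.add((left["chrom"], left["pos"]))
            too_close.add((right["chrom"], right["pos"]))
    return [s for s in sites if (s["chrom"], s["pos"]) not in too_close]


preliminary = [
    {"chrom": "chr1", "pos": 100},
    {"chrom": "chr1", "pos": 103},  # only 2 bases between 100 and 103
    {"chrom": "chr1", "pos": 110},
    {"chrom": "chr1", "pos": 200},
]
after_mode_three = remove_clustered_sites(preliminary)
print([s["pos"] for s in after_mode_three])
```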
Mode four: delete, from the plurality of preliminary variant sites, the variant sites whose GQ (genotype quality) value is less than a preset GQ threshold.
GQ (genotype quality) is a Phred-scaled posterior probability value. For each site, the GQ value expresses how likely it is that the genotype currently called at the site is not the true genotype, and thus indicates how reliable the called genotype is. It is calculated as:
GQ = -10 × log10(P[error]), where P[error] is the probability that the genotype called at the site is not the true genotype.
Preferably, in this embodiment the preset GQ threshold is 20. It has been verified that a GQ threshold of 20 corresponds to a theoretical error rate of 1%.
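The relationship between GQ and the error probability, and the resulting filter, can be sketched as follows. The function names and the site dicts are hypothetical; the formula is the one given above.

```python
import math


def gq_from_error(p_error):
    """GQ = -10 * log10(P[error])."""
    return -10 * math.log10(p_error)


def error_from_gq(gq):
    """Inverse of the formula above: P[error] = 10 ** (-GQ / 10)."""
    return 10 ** (-gq / 10)


def remove_low_gq(sites, gq_threshold=20):
    """Mode four: keep only sites whose GQ is not below the threshold."""
    return [s for s in sites if s["gq"] >= gq_threshold]


preliminary = [{"pos": 100, "gq": 99}, {"pos": 250, "gq": 12}, {"pos": 900, "gq": 20}]
after_mode_four = remove_low_gq(preliminary)
print(error_from_gq(20))                      # error probability at the threshold
print([s["pos"] for s in after_mode_four])
```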
Mode five: delete, from the plurality of preliminary variant sites, the variant sites whose MQ (mapping quality) value is less than a preset MQ threshold.
MQ represents the specificity (uniqueness) of an aligned sequence. When the same short sequence can be aligned to different regions of the same genome, the larger the difference between the alignment score of the first best alignment and that of the second best alignment, the better the specificity of the alignment and the higher the MQ value.
In this embodiment, variant sites whose MQ value is below the preset MQ threshold are considered to have a higher probability of being false positives and are deleted.
Preferably, in this embodiment the preset MQ threshold is 30. It has been verified that when the MQ value is 30, P[error] = 0.001; that is, relative to the current aligned position, the probability that the sequence aligns to another position is at most 0.1%.
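A minimal sketch of mode five follows. It assumes that MQ is Phred-scaled in the same way as GQ, which is consistent with the stated correspondence between an MQ of 30 and P[error] = 0.001; the function names and site dicts are hypothetical. The sketch returns the deleted sites as well, so that the effect of a chosen threshold can be inspected.

```python
def mapping_error_probability(mq):
    """Probability of a wrong alignment position for a Phred-scaled MQ."""
    return 10 ** (-mq / 10)


def split_by_mq(sites, mq_threshold=30):
    """Mode five: return (kept, deleted) according to the MQ threshold."""
    kept = [s for s in sites if s["mq"] >= mq_threshold]
    deleted = [s for s in sites if s["mq"] < mq_threshold]
    return kept, deleted


preliminary = [{"pos": 100, "mq": 60.0}, {"pos": 250, "mq": 18.5}, {"pos": 900, "mq": 30.0}]
kept, deleted = split_by_mq(preliminary)
print(mapping_error_probability(30))
print([s["pos"] for s in kept], [s["pos"] for s in deleted])
```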
In the embodiments of the present invention, modes one to five are optional; that is, this step may use any one, several or all of them. When several modes are used to delete the variant sites that do not satisfy the retention condition, the order in which they are executed is not limited. The modes may of course also be executed in parallel.
In addition, when several modes are executed serially in step S120, a later mode may operate on the result of an earlier one. For example, suppose mode one (removing the preliminary variant sites whose number of alleles is greater than the preset threshold) and mode three (deleting the preliminary variant sites separated from one another by fewer than the preset number of bases) are both used, with mode one executed first and mode three second. Mode three may then delete the closely spaced variant sites among the variant sites that remain after mode one.
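Serial execution can be sketched by applying each mode to the output of the previous one. The sketch below re-declares minimal, hypothetical versions of mode one and mode three for a single chromosome (counting alternate alleles and bases strictly between sites are assumptions) and shows that, when modes are chained in this way, the order of execution can change the result.

```python
def mode_one(sites, max_alleles=1):
    """Mode one: keep sites with at most `max_alleles` alternate alleles."""
    return [s for s in sites if len(s["alts"]) <= max_alleles]


def mode_three(sites, min_gap=4):
    """Mode three: drop sites with fewer than `min_gap` bases between them."""
    ordered = sorted(sites, key=lambda s: s["pos"])
    close = set()
    for left, right in zip(ordered, ordered[1:]):
        if right["pos"] - left["pos"] - 1 < min_gap:
            close.update((left["pos"], right["pos"]))
    return [s for s in sites if s["pos"] not in close]


def apply_serially(sites, modes):
    """Run each mode on the sites that survived the previous mode."""
    for mode in modes:
        sites = mode(sites)
    return sites


preliminary = [
    {"pos": 100, "alts": ["G", "T"]},  # more than one allele
    {"pos": 103, "alts": ["C"]},
    {"pos": 200, "alts": ["A"]},
]
one_then_three = apply_serially(preliminary, [mode_one, mode_three])
three_then_one = apply_serially(preliminary, [mode_three, mode_one])
print([s["pos"] for s in one_then_three])
print([s["pos"] for s in three_then_one])
```

In this made-up example the site at position 103 survives when mode one runs first, because its close neighbour at position 100 has already been removed before the spacing is checked.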
After the preliminary variant sites have been filtered by deletion in step S120, the final result obtained serves as the variant sites of the gene to be tested and can be represented as a VCF-format file.
Second embodiment
This embodiment provides an apparatus 200 for obtaining variant sites. Referring to Fig. 4, the apparatus 200 includes:
An alignment module 210, configured to align a plurality of short sequences of the gene to be tested against a reference genome to obtain preliminary variant site information of the gene to be tested, the preliminary variant site information including a plurality of preliminary variant sites.
A filtering module 220, configured to delete, according to the preliminary variant site information, the variant sites among the plurality of preliminary variant sites that do not satisfy the preset retention condition, to obtain the variant sites in the gene to be tested.
Further, in this embodiment, as shown in Fig. 5, the filtering module 220 may include one or more of the following units:
A first deletion unit 221, configured to remove, from the plurality of preliminary variant sites, the variant sites whose number of alleles is greater than a preset threshold.
A second deletion unit 222, configured to delete, from the plurality of preliminary variant sites, all variant sites located within the upstream range or the downstream range of each insertion/deletion (indel), where the upstream range and the downstream range each span a preset number of bases.
A third deletion unit 223, configured to delete, from the plurality of preliminary variant sites, the variant sites separated from one another by fewer than a preset number of bases.
A fourth deletion unit 224, configured to delete, from the plurality of preliminary variant sites, the variant sites whose GQ (genotype quality) value is less than a preset GQ threshold.
A fifth deletion unit 225, configured to delete, from the plurality of preliminary variant sites, the variant sites whose MQ (mapping quality) value is less than a preset MQ threshold.
Further, as shown in Fig. 6, the alignment module 210 provided by this embodiment may also include:
An alignment unit 211, configured to perform an initial alignment of the plurality of short sequences of the gene to be tested against the reference genome to obtain an alignment result in SAM format.
A deduplication unit 212, configured to remove duplicates from the alignment result so that at most one short sequence is aligned to any given position of the reference genome.
A realignment unit 213, configured to perform local realignment (local multiple alignment) on the deduplicated alignment result.
A quality score calculation unit 214, configured to recalculate the base quality scores in the locally realigned alignment result.
A preliminary calling unit 215, configured to perform SNP and indel analysis on the locally realigned alignment result according to the base quality scores, to obtain the preliminary variant site information.
It should be noted that, because the apparatus embodiment is substantially similar to the method embodiment, it is described relatively briefly; for the relevant details, refer to the description of the method embodiment.
In summary, with the method and apparatus for obtaining variant sites provided by the embodiments of the present invention, after variant site information is preliminarily obtained with existing software, the preliminary variant sites obtained are further filtered and those that do not satisfy the preset retention condition are deleted, so that the variant sites of the gene to be tested that are finally obtained have a higher accuracy.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may also be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the flow charts and block diagrams in the drawings show the architecture, functions and operations that may be implemented by apparatuses, methods and computer program products according to multiple embodiments of the present application. In this regard, each block in a flow chart or block diagram may represent a module, a program segment or a portion of code that contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the drawings. For example, two consecutive blocks may in fact be executed substantially in parallel, or sometimes in the reverse order, depending on the functions involved. It should further be noted that each block in the block diagrams and/or flow charts, and combinations of such blocks, may be implemented by a dedicated hardware-based system that performs the specified functions or actions, or by a combination of dedicated hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated to form a single independent part, each module may exist separately, or two or more modules may be integrated to form a single independent part.
If the functions are implemented in the form of software function modules and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server 100, a network device or the like) to perform all or some of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a portable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc and other media capable of storing program code.
It should be noted that herein, such as first and second, another or the like relational terms be used merely to an entity or
Person is operated and is made a distinction with another entity or operation, and not necessarily requires or imply that presence is appointed between these entities or operation
What this actual relation or order.And, term " including ", "comprising" or its any other variant are intended to non-row
His property is included, so that a series of process, method, article or equipment including key elements not only include those key elements, and
And also include other key elements being not expressly set out, or also include for this process, method, article or equipment institute inherently
Key element.In the absence of more restrictions, the key element for being limited by sentence "including a ...", it is not excluded that including institute
Also there is other identical element in process, method, article or the equipment of stating key element.
The foregoing is only a preferred embodiment of the application and is not intended to limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement, improvement, or the like made within the spirit and principles of the application shall fall within the protection scope of the application. It should be noted that similar reference numerals and letters denote similar items in the following drawings; therefore, once an item is defined in one drawing, it need not be further defined or explained in subsequent drawings.
The above is only a specific embodiment of the application, but the protection scope of the application is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the application shall fall within the protection scope of the application. Therefore, the protection scope of the application shall be defined by the claims.
Claims (10)
1. A method for obtaining variant sites, characterized in that the method comprises:
aligning a plurality of short sequences of a gene under test against a reference genome to obtain preliminary variant site information of the gene under test, the preliminary variant site information comprising a plurality of preliminary variant sites; and
according to the preliminary variant site information, deleting from the plurality of preliminary variant sites those variant sites that do not satisfy a preset retention condition, to obtain the variant sites in the gene under test.
2. The method according to claim 1, characterized in that deleting from the plurality of preliminary variant sites those variant sites that do not satisfy the preset retention condition comprises:
removing, from the plurality of preliminary variant sites, the variant sites whose number of alleles is greater than a preset threshold.
3. The method according to claim 1, characterized in that the preliminary variant site information further comprises the positions of the plurality of preliminary variant sites, and deleting from the plurality of preliminary variant sites those variant sites that do not satisfy the preset retention condition comprises:
deleting, from the plurality of preliminary variant sites, all variant sites located in the upstream span or the downstream span of each insertion/deletion (indel), wherein the upstream span and the downstream span each comprise a preset number of bases.
4. The method according to claim 1, characterized in that the preliminary variant site information further comprises the positions of the plurality of preliminary variant sites, and deleting from the plurality of preliminary variant sites those variant sites that do not satisfy the preset retention condition comprises:
deleting, from the plurality of preliminary variant sites, the variant sites that are separated from each other by a preset number of bases.
5. The method according to claim 1, characterized in that deleting from the plurality of preliminary variant sites those variant sites that do not satisfy the preset retention condition comprises:
deleting, from the plurality of preliminary variant sites, the variant sites whose corresponding GQ value is less than a preset GQ threshold.
6. The method according to claim 1, characterized in that deleting from the plurality of preliminary variant sites those variant sites that do not satisfy the preset retention condition comprises:
deleting, from the plurality of preliminary variant sites, the variant sites whose corresponding MQ value is less than a preset MQ threshold.
7. The method according to claim 1, characterized in that aligning the plurality of short sequences of the gene under test against the reference genome to obtain the preliminary variant site information of the gene under test comprises:
performing an initial alignment of the plurality of short sequences of the gene under test against the reference genome to obtain an alignment result in SAM format;
de-duplicating the alignment result, so that the number of short sequences aligned to any one position of the reference genome is less than or equal to 1;
performing local realignment on the de-duplicated alignment result;
recalculating the base quality scores in the alignment result after local realignment; and
according to the base quality scores, performing SNP and indel analysis on the alignment result after local realignment to obtain the preliminary variant site information.
8. The method according to claim 1, characterized in that the variant sites are SNPs.
9. A device for obtaining variant sites, characterized in that the device comprises:
an alignment module, configured to align a plurality of short sequences of a gene under test against a reference genome to obtain preliminary variant site information of the gene under test, the preliminary variant site information comprising a plurality of preliminary variant sites; and
a filtering module, configured to delete, according to the preliminary variant site information, those variant sites among the plurality of preliminary variant sites that do not satisfy a preset retention condition, to obtain the variant sites in the gene under test.
10. The device according to claim 9, characterized in that the filtering module comprises one or more of the following:
a first deletion unit, configured to remove, from the plurality of preliminary variant sites, the variant sites whose number of alleles is greater than a preset threshold;
a second deletion unit, configured to delete, from the plurality of preliminary variant sites, all variant sites located in the upstream span or the downstream span of each insertion/deletion (indel), wherein the upstream span and the downstream span each comprise a preset number of bases;
a third deletion unit, configured to delete, from the plurality of preliminary variant sites, the variant sites that are separated from each other by a preset number of bases;
a fourth deletion unit, configured to delete, from the plurality of preliminary variant sites, the variant sites whose corresponding GQ value is less than a preset GQ threshold;
a fifth deletion unit, configured to delete, from the plurality of preliminary variant sites, the variant sites whose corresponding MQ value is less than a preset MQ threshold.
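As an illustration only, the five deletion conditions recited in claims 2 to 6 (and the corresponding units of claim 10) could be sketched as a single filtering pass in Python. Everything named here is an assumption made for the sketch and is not taken from the application: the `Site` record, the `filter_sites` function, and all default thresholds are hypothetical. The sketch also assumes that "separated by a preset number of bases" means lying within that many bases of another site, that an indel is represented by a single coordinate and is not inside its own span, and that every condition is evaluated against the full preliminary list, so the order of the conditions does not matter.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Site:
    """Hypothetical record for one preliminary variant site."""
    chrom: str
    pos: int                # 1-based coordinate on the reference genome
    ref: str                # reference allele
    alts: Tuple[str, ...]   # alternate alleles
    gq: float               # genotype quality (GQ)
    mq: float               # mapping quality (MQ)

    @property
    def is_indel(self) -> bool:
        # An insertion/deletion changes the allele length.
        return any(len(alt) != len(self.ref) for alt in self.alts)


def filter_sites(sites: List[Site], max_alleles: int = 2, indel_span: int = 5,
                 cluster_gap: int = 10, min_gq: float = 20.0,
                 min_mq: float = 40.0) -> List[Site]:
    """Keep only the sites that pass all five conditions (thresholds are illustrative)."""
    indels = [s for s in sites if s.is_indel]
    kept = []
    for s in sites:
        # Claim 2: number of alleles (reference plus alternates) above the threshold.
        too_many_alleles = 1 + len(s.alts) > max_alleles
        # Claim 3: site lies within indel_span bases upstream or downstream of an indel.
        near_indel = any(
            i.chrom == s.chrom and 0 < abs(i.pos - s.pos) <= indel_span
            for i in indels
        )
        # Claim 4 (as interpreted here): another site lies within cluster_gap bases.
        clustered = any(
            o is not s and o.chrom == s.chrom and abs(o.pos - s.pos) <= cluster_gap
            for o in sites
        )
        # Claims 5 and 6: GQ or MQ below its threshold.
        low_quality = s.gq < min_gq or s.mq < min_mq
        if not (too_many_alleles or near_indel or clustered or low_quality):
            kept.append(s)
    return kept


preliminary = [
    Site("chr1", 100, "A", ("G",), 60, 60),      # passes every condition
    Site("chr1", 500, "C", ("T", "G"), 60, 60),  # three alleles
    Site("chr1", 1000, "AT", ("A",), 60, 60),    # indel, clustered with 1003
    Site("chr1", 1003, "G", ("C",), 60, 60),     # inside the indel span
    Site("chr1", 2000, "T", ("C",), 10, 60),     # low GQ
    Site("chr1", 3000, "G", ("A",), 60, 20),     # low MQ
    Site("chr1", 4000, "A", ("C",), 60, 60),     # clustered with 4008
    Site("chr1", 4008, "T", ("G",), 60, 60),     # clustered with 4000
    Site("chr2", 100, "C", ("T",), 60, 60),      # passes every condition
]
kept = filter_sites(preliminary)
print([(s.chrom, s.pos) for s in kept])  # [('chr1', 100), ('chr2', 100)]
```

The pairwise comparisons make this sketch quadratic in the number of sites; a real pipeline would more likely sort sites by position and scan neighbours, but the simple form keeps each condition readable next to the claim it illustrates.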
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610972449.3A CN106529211A (en) | 2016-11-04 | 2016-11-04 | Variable site obtaining method and apparatus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610972449.3A CN106529211A (en) | 2016-11-04 | 2016-11-04 | Variable site obtaining method and apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106529211A (en) | 2017-03-22 |
Family
ID=58349459
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610972449.3A Pending CN106529211A (en) | 2016-11-04 | 2016-11-04 | Variable site obtaining method and apparatus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106529211A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109979530A (en) * | 2019-03-26 | 2019-07-05 | 北京市商汤科技开发有限公司 | A kind of genetic mutation recognition methods, device and storage medium |
CN109979531A (en) * | 2019-03-29 | 2019-07-05 | 北京市商汤科技开发有限公司 | A kind of genetic mutation recognition methods, device and storage medium |
CN109994155A (en) * | 2019-03-29 | 2019-07-09 | 北京市商汤科技开发有限公司 | A kind of genetic mutation recognition methods, device and storage medium |
CN110021342A (en) * | 2017-08-21 | 2019-07-16 | 北京哲源科技有限责任公司 | For accelerating the method and system of the identification of variant sites |
CN111091870A (en) * | 2019-12-18 | 2020-05-01 | 中国科学院大学 | Method and system for controlling quality of gene mutation site |
CN111919257A (en) * | 2018-07-27 | 2020-11-10 | 思勤有限公司 | Reducing noise in sequencing data |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103198236A (en) * | 2012-01-06 | 2013-07-10 | 深圳华大基因科技有限公司 | CYP450 gene type database and gene typing and enzymatic activity identification method |
CN104462869A (en) * | 2014-11-28 | 2015-03-25 | 天津诺禾致源生物信息科技有限公司 | Method and device for detecting somatic cell SNP |
CN105930690A (en) * | 2016-05-13 | 2016-09-07 | 万康源(天津)基因科技有限公司 | Whole-exome sequencing data analysis method |
CN105925685A (en) * | 2016-05-13 | 2016-09-07 | 万康源(天津)基因科技有限公司 | Exome potential pathogenic mutation detection method based on family line |
CN105956415A (en) * | 2016-05-13 | 2016-09-21 | 万康源(天津)基因科技有限公司 | SNV detection system affecting RNA splicing |
CN105975809A (en) * | 2016-05-13 | 2016-09-28 | 万康源(天津)基因科技有限公司 | SNV detection method affecting RNA splicing |
CN106011224A (en) * | 2015-12-24 | 2016-10-12 | 晶能生物技术(上海)有限公司 | Nervous system genetic disease gene united screening method, kit and preparation method thereof |
2016-11-04: CN201610972449.3A filed (CN106529211A), status: Pending
Non-Patent Citations (1)
Title |
---|
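He Weiming (何伟明): "Population SNP site detection and genotype determination based on resequencing data" (基于重测序数据的群体SNP位点检测及基因型判断), China Master's Theses Full-text Database, Basic Sciences * |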
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110021342A (en) * | 2017-08-21 | 2019-07-16 | 北京哲源科技有限责任公司 | For accelerating the method and system of the identification of variant sites |
CN110021342B (en) * | 2017-08-21 | 2020-12-15 | 北京哲源科技有限责任公司 | Method and system for accelerating identification of variant sites |
CN111919257A (en) * | 2018-07-27 | 2020-11-10 | 思勤有限公司 | Reducing noise in sequencing data |
CN111919257B (en) * | 2018-07-27 | 2021-05-28 | 思勤有限公司 | Method and system for reducing noise in sequencing data, and implementation and application thereof |
CN109979530A (en) * | 2019-03-26 | 2019-07-05 | 北京市商汤科技开发有限公司 | A kind of genetic mutation recognition methods, device and storage medium |
CN109979530B (en) * | 2019-03-26 | 2021-03-16 | 北京市商汤科技开发有限公司 | Gene variation identification method, device and storage medium |
CN109979531A (en) * | 2019-03-29 | 2019-07-05 | 北京市商汤科技开发有限公司 | A kind of genetic mutation recognition methods, device and storage medium |
CN109994155A (en) * | 2019-03-29 | 2019-07-09 | 北京市商汤科技开发有限公司 | A kind of genetic mutation recognition methods, device and storage medium |
CN109994155B (en) * | 2019-03-29 | 2021-08-20 | 北京市商汤科技开发有限公司 | Gene variation identification method, device and storage medium |
CN109979531B (en) * | 2019-03-29 | 2021-08-31 | 北京市商汤科技开发有限公司 | Gene variation identification method, device and storage medium |
CN111091870A (en) * | 2019-12-18 | 2020-05-01 | 中国科学院大学 | Method and system for controlling quality of gene mutation site |
CN111091870B (en) * | 2019-12-18 | 2021-11-02 | 中国科学院大学 | Method and system for controlling quality of gene mutation site |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170322 |