CN103810402B

CN103810402B - Data processing method and device for genomes

Info

Publication number: CN103810402B
Application number: CN201410064832.XA
Authority: CN
Inventors: 江文恺; 占伟
Original assignee: Nuo Hezhi Source Beijing Bioinformation Science And Technology Ltd
Current assignee: Beijing Polytron Technologies Inc
Priority date: 2014-02-25
Filing date: 2014-02-25
Publication date: 2017-01-18
Anticipated expiration: 2034-02-25
Also published as: CN103810402A

Abstract

The invention discloses a data processing method and device for genomes. The method includes: performing a first comparison between information of a target genome and information of a reference genome to obtain a first comparison result; obtaining, from the first comparison result, information of the genome fragments that were not matched in the comparison; performing a second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain a second comparison result; and obtaining information of the distinguished sequences of the target genome from the second comparison result. The method and device solve the problem that accurate distinguished sequences are difficult to obtain with the related art.

Description

Data processing method and device for genomes

Technical field

The present invention relates to the field of data processing, and in particular to a data processing method and device for genomes.

Background art

Comparative genomics analysis has two main directions: first, finding similar gene sequences in the genomes of different species, in order to study the similar gene functions and mechanisms those species may share; second, finding similar sequences and distinguished sequences across wider genomic regions between species, in order to study the evolutionary history of the species and the genome variation events that arose during their evolution.

At present, in the related art, the distinguished sequences of genomes between species are found simply by comparing the genomic protein sequences of the species under study with the genomic protein sequences of closely related species on the evolutionary tree, obtaining comparison information for the protein sequences between species, and clustering that comparison information to obtain the distinguished sequences of the genomes. Because a genome includes sequences of other elements in addition to protein sequences, it is difficult to obtain accurate distinguished sequences in this way.

In addition, because a genome carries a large amount of information, the comparison of genomic protein sequences in the above scheme consumes a large amount of time and memory.

No effective solution has yet been proposed for the problem that accurate distinguished sequences are difficult to obtain in the related art.

Summary of the invention

The main object of the present invention is to provide a data processing method and device for genomes, so as to solve the problem that accurate distinguished sequences are difficult to obtain in the related art.

To achieve this object, according to one aspect of the present invention, a data processing method for genomes is provided. The method includes: performing a first comparison between the information of a target genome and the information of a reference genome to obtain a first comparison result; obtaining, from the first comparison result, the information of the genome fragments that were not matched (the unmatched genome fragments); performing a second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain a second comparison result; and obtaining the information of the distinguished sequences of the target genome from the second comparison result.

Further, performing the second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain the second comparison result includes: detecting whether repeated sequence information exists in the information of the unmatched genome fragments; if repeated sequence information is detected, labeling the repeated sequence information to obtain labeled information; filtering the labeled information out of the information of the unmatched genome fragments to obtain filtered information; and comparing the filtered information with the information of the reference genome to obtain the second comparison result.

Further, the first comparison result includes a plurality of homologous genome fragments, the homologous genome fragments being the matched genome fragments. Obtaining the information of the unmatched genome fragments from the first comparison result includes: filtering the homologous genome fragments out of the first comparison result to obtain a plurality of unmatched genome sub-fragments; sorting the unmatched genome sub-fragments according to their positions in the target genome to obtain an ordered sequence of unmatched genome sub-fragments; merging any two genome sub-fragments that are adjacent in the ordered sequence and overlap, to obtain an ordered sequence that includes the merged unmatched genome sub-fragments; and connecting all genome sub-fragments in that ordered sequence to obtain the information of the unmatched genome fragments.

Further, the second comparison result includes a plurality of homologous genome fragments. Obtaining the information of the distinguished sequences of the target genome from the second comparison result includes: extracting the homologous genome fragments; sorting the homologous genome fragments according to their positions in the target genome to obtain an ordered sequence of homologous genome fragments; detecting whether any two homologous genome fragments that are adjacent in the ordered sequence overlap; if an overlap is detected, merging the overlapping fragments to obtain a plurality of merged homologous genome fragments; and filtering the information of the merged homologous genome fragments out of the second comparison result to obtain the information of the distinguished sequences of the target genome.

Further, before the homologous genome fragments are extracted, the data processing method also includes: judging whether the lengths of a plurality of genome fragments are greater than or equal to a preset length; if the lengths are greater than or equal to the preset length, judging whether the similarity of the genome fragments is greater than or equal to a preset similarity; if the similarity is greater than or equal to the preset similarity, judging whether the comparison rate of the genome fragments is greater than or equal to a preset comparison rate; and if the comparison rate is greater than or equal to the preset comparison rate, taking the information of the genome fragments as the information of the homologous genome fragments.

To achieve this object, according to another aspect of the present invention, a data processing device for genomes is provided. The device includes: a first comparing unit, configured to perform a first comparison between the information of a target genome and the information of a reference genome to obtain a first comparison result; a first acquisition unit, configured to obtain the information of the unmatched genome fragments from the first comparison result; a second comparing unit, configured to perform a second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain a second comparison result; and a second acquisition unit, configured to obtain the information of the distinguished sequences of the target genome from the second comparison result.

Further, the second comparing unit includes: a first detection module, configured to detect whether repeated sequence information exists in the information of the unmatched genome fragments; a labeling module, configured to label the repeated sequence information, if such information is detected, to obtain labeled information; a first filtering module, configured to filter the labeled information out of the information of the unmatched genome fragments to obtain filtered information; and a comparing module, configured to compare the filtered information with the information of the reference genome to obtain the second comparison result.

Further, the first comparison result includes a plurality of homologous genome fragments, the homologous genome fragments being the matched genome fragments, and the first acquisition unit includes: a second filtering module, configured to filter the homologous genome fragments out of the first comparison result to obtain a plurality of unmatched genome sub-fragments; a first sorting module, configured to sort the unmatched genome sub-fragments according to their positions in the target genome to obtain an ordered sequence of unmatched genome sub-fragments; a first merging module, configured to merge any two genome sub-fragments that are adjacent in the ordered sequence and overlap, to obtain an ordered sequence that includes the merged unmatched genome sub-fragments; and a connection module, configured to connect all genome sub-fragments in that ordered sequence to obtain the information of the unmatched genome fragments.

Further, the second comparison result includes a plurality of homologous genome fragments, and the second acquisition unit includes: an extraction module, configured to extract the homologous genome fragments; a second sorting module, configured to sort the homologous genome fragments according to their positions in the target genome to obtain an ordered sequence of homologous genome fragments; a second detection module, configured to detect whether any two homologous genome fragments that are adjacent in the ordered sequence overlap; a second merging module, configured to merge the overlapping fragments, if an overlap is detected, to obtain a plurality of merged homologous genome fragments; and a third filtering module, configured to filter the information of the merged homologous genome fragments out of the second comparison result to obtain the information of the distinguished sequences of the target genome.

Further, the data processing device also includes: a first judging module, configured to judge, before the homologous genome fragments are extracted, whether the lengths of a plurality of genome fragments are greater than or equal to a preset length; a second judging module, configured to judge, if the lengths are greater than or equal to the preset length, whether the similarity of the genome fragments is greater than or equal to a preset similarity; a third judging module, configured to judge, if the similarity is greater than or equal to the preset similarity, whether the comparison rate of the genome fragments is greater than or equal to a preset comparison rate; and a determining module, configured to determine, if the comparison rate is greater than or equal to the preset comparison rate, the information of the genome fragments to be the information of the homologous genome fragments.

With the present invention, a first comparison is performed between the information of the target genome and the information of the reference genome to obtain a first comparison result; the information of the unmatched genome fragments is obtained from the first comparison result; a second comparison is performed between the information of the unmatched genome fragments and the information of the reference genome to obtain a second comparison result; and the information of the distinguished sequences of the target genome is obtained from the second comparison result. This solves the problem that accurate distinguished sequences are difficult to obtain in the related art, and thereby improves the accuracy of the distinguished sequences.

Brief description of the drawings

The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:

Fig. 1 is a schematic diagram of a data processing device for genomes according to an embodiment of the present invention;

Fig. 2 is a schematic diagram of a preferred data processing device for genomes according to an embodiment of the present invention;

Fig. 3 is a flow chart of a data processing method for genomes according to an embodiment of the present invention; and

Fig. 4 is a flow chart of a preferred data processing method for genomes according to an embodiment of the present invention.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.

To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort shall fall within the protection scope of the present invention.

It should be noted that the terms "first", "second", and the like in the description, the claims, and the above drawings of this application are used to distinguish similar objects, and not to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described here can be implemented in orders other than those illustrated or described here. In addition, the terms "comprising" and "having" and any variations of them are intended to cover non-exclusive inclusion.

According to an embodiment of the present invention, a data processing device for genomes is provided. The device is used to obtain information of accurate distinguished sequences, creating the conditions for accurate gene analysis.

Fig. 1 is a schematic diagram of the data processing device for genomes according to an embodiment of the present invention.

As shown in Fig. 1, the device includes: a first comparing unit 10, a first acquisition unit 20, a second comparing unit 30, and a second acquisition unit 40.

The first comparing unit 10 is configured to perform a first comparison between the information of the target genome and the information of the reference genome to obtain a first comparison result.

Specifically, the nucmer tool in the MUMmer software can be used to perform the first comparison between the target genome and the reference genome, obtaining the first comparison result between the two genomes. It should be noted that, for a first comparison at whole-genome scale, the nucmer tool can be replaced with another comparison tool.

The target genome and the reference genome can come from different species. The target genome can be the genome of the species under study, and the reference genome can be the genome of a species whose gene information is known. For example, when the genome of a willow is analyzed, the willow genome can serve as the target genome. If the gene function relationship between the willow under analysis and another willow is to be studied, the genome of that other willow can serve as the reference genome; if the gene function relationship between the willow and Sophora japonica L. is to be studied, the genome of Sophora japonica L. can serve as the reference genome. The first comparison can be a preliminary comparison, and the corresponding first comparison result can be a preliminary comparison result. It should be noted that the species under study can include plants, animals, microorganisms, and the like.

Preferably, in the first comparison, the target genome and the reference genome can each be divided into n genomic regions, and the n genomic regions of the target genome can be compared with the n genomic regions of the reference genome simultaneously. This saves comparison time and improves comparison efficiency.
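The region-wise simultaneous comparison described above could be sketched as follows. This is only an illustration under stated assumptions, not the implementation of the embodiment: `split_regions` and `compare_all_regions` are hypothetical helper names, comparing every target region against every reference region is an assumed pairing, and `compare_region` is a caller-supplied stand-in for a real comparison tool.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def split_regions(sequence, n):
    """Split a sequence into n contiguous regions of near-equal length.

    Returns a list of half-open (start, end) coordinate pairs.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    size, extra = divmod(len(sequence), n)
    regions, start = [], 0
    for i in range(n):
        # The first `extra` regions each take one additional base.
        end = start + size + (1 if i < extra else 0)
        regions.append((start, end))
        start = end
    return regions


def compare_all_regions(target, reference, n, compare_region):
    """Run compare_region on every target/reference region pair concurrently.

    compare_region is a hypothetical caller-supplied function that takes two
    sequence strings and returns any result object.
    """
    pairs = list(product(split_regions(target, n), split_regions(reference, n)))

    def run(pair):
        (t_start, t_end), (r_start, r_end) = pair
        return compare_region(target[t_start:t_end], reference[r_start:r_end])

    with ThreadPoolExecutor(max_workers=n) as pool:
        # Map each region pair to its comparison result.
        return dict(zip(pairs, pool.map(run, pairs)))
```

Threads are a reasonable choice when `compare_region` launches an external program and waits for it; a CPU-bound pure-Python comparison would more likely call for a process pool.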

Optionally, the data processing device can also include a third acquisition unit and a fourth acquisition unit. Before the first comparison between the information of the target genome and the information of the reference genome is performed to obtain the first comparison result, the third acquisition unit obtains the information of the target genome and the fourth acquisition unit obtains the information of the reference genome.

The first acquisition unit 20 is configured to obtain the information of the unmatched genome fragments from the first comparison result.

The first comparison result can include the information of the matched genome fragments and the information of the unmatched genome fragments. A matched genome fragment is also called a homologous genome fragment.

Specifically, the first acquisition unit 20 can obtain the information of the unmatched genome fragments by either of the following two methods:

Method one: extract the information of the unmatched genome fragments from the first comparison result.

The information of the unmatched genome fragments may include: information of genome fragments whose similarity to the reference genome is less than a first preset similarity, which can be, for example, 98%; information of genome fragments whose length is less than a first preset length, which can be, for example, 40 bp, or, if the first preset length is a length cluster, 90 bp; and information of genome fragments whose comparison rate is less than a first preset comparison rate. The comparison rate can be the ratio of the sequence to be compared in a genome fragment of the target genome to the sequence to be compared in the reference genome.

Method two: filter the information of the homologous genome fragments out of the first comparison result to obtain the information of the remaining genome fragments, and take the information of the remaining genome fragments as the information of the unmatched genome fragments.

The information of the homologous genome fragments can be filtered out with the bedtools tool. Filtering out the information of the homologous genome fragments in this way saves computer memory.

The second comparing unit 30 is configured to perform a second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain a second comparison result.

The second comparison result can include a plurality of homologous genome fragments and distinguished sequences. A homologous genome fragment is a matched genome fragment; a distinguished sequence is a sequence that was not matched, and it can include gene sequences and sequences of other elements.

Specifically, the blastn software can be used to compare the information of the unmatched genome fragments with the information of the reference genome, obtaining the second comparison result. This comparison is a fine comparison, and the corresponding second comparison result is a fine comparison result. In this way, homologous genome fragments that were not identified in the first comparison can be found and filtered out, so that accurate distinguished sequences can be obtained. This is because, in the second comparison, the length of a homologous genome fragment can be a second preset length, which can be greater than the first preset length, for example 100 bp; the similarity of a homologous genome fragment can be a second preset similarity; and the comparison rate of the second comparison can be a second preset comparison rate, for example 90.
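As a rough illustration of how the output of the second comparison might be read into a program, the sketch below parses BLAST tabular output. It assumes blastn was run with its tabular format (`-outfmt 6`) and the default twelve columns; the embodiment does not state which output format is used. `Hit` and `parse_tabular_hits` are hypothetical names, and only a subset of the columns is kept.

```python
from typing import NamedTuple


class Hit(NamedTuple):
    """One row of BLAST tabular output (subset of the default columns)."""
    query_id: str
    subject_id: str
    identity: float  # percent identity (pident)
    length: int      # alignment length
    q_start: int     # 1-based start on the query
    q_end: int       # 1-based end on the query


def parse_tabular_hits(text):
    """Parse tab-separated BLAST rows, skipping blank and comment lines."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Default column order: qseqid sseqid pident length mismatch gapopen
        # qstart qend sstart send evalue bitscore
        hits.append(
            Hit(
                fields[0],
                fields[1],
                float(fields[2]),
                int(fields[3]),
                int(fields[6]),
                int(fields[7]),
            )
        )
    return hits
```

The parsed hits can then be checked against the second preset length, similarity, and comparison rate before they are treated as homologous genome fragments.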

Preferably, in the second comparison, the unmatched genome fragments and the reference genome can each be divided into n genomic regions, and the n genomic regions of the unmatched genome fragments can be compared with the n genomic regions of the reference genome simultaneously. This saves comparison time and improves comparison efficiency.

The second acquisition unit 40 is configured to obtain the information of the distinguished sequences of the target genome from the second comparison result.

The method of obtaining the information of the distinguished sequences of the target genome from the second comparison result is similar to the method of obtaining the information of the unmatched genome fragments from the first comparison result, and is not repeated here.

With this embodiment of the present invention, the information of the target genome and the information of the reference genome undergo two successive comparisons, a first comparison and a second comparison, and each comparison uses different comparison software and different comparison parameters, such as preset length, preset similarity, and preset comparison rate. This improves the accuracy of the distinguished sequences. In addition, by using the MUMmer software together with the blastn software, the differences of the distinguished sequences at the level of gene structure can be analyzed.
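The overall two-comparison flow can be summarized in a short sketch. It is a simplification under stated assumptions: fragments are represented as half-open (start, end) intervals on the target genome, `uncovered` and `two_pass` are hypothetical helper names, and `coarse_compare` and `fine_compare` are caller-supplied stand-ins for tools such as nucmer and blastn that return the matched intervals of their first argument.

```python
def uncovered(regions, matched):
    """Return the parts of each region that no matched interval covers."""
    matched = sorted(matched)
    remaining = []
    for region_start, region_end in regions:
        cursor = region_start
        for hit_start, hit_end in matched:
            if hit_end <= cursor or hit_start >= region_end:
                continue  # this hit does not touch the uncovered remainder
            if hit_start > cursor:
                remaining.append((cursor, hit_start))
            cursor = max(cursor, hit_end)
        if cursor < region_end:
            remaining.append((cursor, region_end))
    return remaining


def two_pass(target, reference, coarse_compare, fine_compare):
    """Return intervals of target left unmatched after both comparisons."""
    # First comparison: matched intervals on the whole target genome.
    first_hits = coarse_compare(target, reference)
    unmatched = uncovered([(0, len(target))], first_hits)

    # Second comparison: only the unmatched fragments are compared again.
    second_hits = []
    for start, end in unmatched:
        for hit_start, hit_end in fine_compare(target[start:end], reference):
            # Shift fragment-relative coordinates back onto the target.
            second_hits.append((start + hit_start, start + hit_end))

    # What is still uncovered is the candidate distinguished sequence.
    return uncovered(unmatched, second_hits)
```

Passing the two comparison steps in as functions keeps the sketch independent of any particular tool and makes the interval bookkeeping easy to test on small strings.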

Fig. 2 is a schematic diagram of a preferred data processing device for genomes according to an embodiment of the present invention.

As shown in Fig. 2, this embodiment can serve as a preferred implementation of the embodiment shown in Fig. 1. The data processing device for genomes of this embodiment includes the first comparing unit 10, the first acquisition unit 20, the second comparing unit 30, and the second acquisition unit 40 of the first embodiment, where the second comparing unit 30 includes a first detection module 301, a labeling module 302, a first filtering module 303, and a comparing module 304.

The functions of the first comparing unit 10, the first acquisition unit 20, and the second acquisition unit 40 are the same as in the first embodiment and are not repeated here.

The first detection module 301 is configured to detect whether repeated sequence information exists in the information of the unmatched genome fragments.

Preferably, when the species under study is a plant, whether repeated sequence information exists in the information of the unmatched genome fragments is detected, because the genome of a plant contains a large number of repeated sequences. When the species under study is an animal, this detection may be omitted, because the genome of an animal contains only a small number of repeated sequences.

The labeling module 302 is configured to label the repeated sequence information, if repeated sequence information is detected in the information of the unmatched genome fragments, to obtain labeled information.

Specifically, the repeated sequence information can be marked out with the RepeatMasker software, and it can be labeled with characters other than base symbols, or with digits and the like. This prevents the labeled information from being confused with base sequence information.

The first filtering module 303 is configured to filter the labeled information out of the information of the unmatched genome fragments to obtain filtered information.

It should be noted that the labeled information need not be filtered out; instead, it can be skipped when the information is compared with the information of the reference genome.

The comparing module 304 is configured to compare the filtered information with the information of the reference genome to obtain the second comparison result.

With this embodiment of the present invention, repeated sequence information is detected and then either filtered out or skipped during the comparison with the information of the reference genome. This reduces the number of genome sequences to be compared and so improves comparison efficiency, and filtering out the labeled information also reduces the computer memory that the genome consumes.
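The labeling and filtering (or skipping) of repeated sequences described above might look like the following sketch. It assumes the repeat positions are already known as half-open (start, end) intervals, for instance from a repeat-detection tool; `mask_repeats`, `unmasked_segments`, and the `#` mask character are illustrative choices, not details given in the embodiment.

```python
def mask_repeats(sequence, repeats, mask_char="#"):
    """Replace every base inside a repeat interval with mask_char.

    mask_char is deliberately not a base symbol, so labeled positions
    cannot be confused with sequence data.
    """
    chars = list(sequence)
    for start, end in repeats:
        for i in range(max(start, 0), min(end, len(chars))):
            chars[i] = mask_char
    return "".join(chars)


def unmasked_segments(masked, mask_char="#"):
    """Return (offset, subsequence) for each run of unlabeled bases."""
    segments, run_start = [], None
    for i, ch in enumerate(masked):
        if ch != mask_char and run_start is None:
            run_start = i  # a new unlabeled run begins here
        elif ch == mask_char and run_start is not None:
            segments.append((run_start, masked[run_start:i]))
            run_start = None
    if run_start is not None:
        segments.append((run_start, masked[run_start:]))
    return segments
```

Only the segments returned by `unmasked_segments` would be passed on to the comparison, which is one way of skipping the labeled information; the kept offsets allow any match to be mapped back to the original coordinates.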

Optionally, in this embodiment of the present invention, the first comparison result can include a plurality of homologous genome fragments, the homologous genome fragments being the matched genome fragments, and the first acquisition unit may include: a second filtering module, a first sorting module, a first merging module, and a connection module.

The second filtering module is configured to filter the homologous genome fragments out of the first comparison result to obtain a plurality of unmatched genome sub-fragments.

It should be noted that the above step of filtering the homologous genome fragments out of the first comparison result to obtain the unmatched genome sub-fragments can be replaced with a step of extracting the unmatched genome sub-fragments.

The first sorting module is configured to sort the unmatched genome sub-fragments according to their positions in the target genome to obtain an ordered sequence of unmatched genome sub-fragments.

The first merging module is configured to merge any two genome sub-fragments that are adjacent in the ordered sequence and overlap, to obtain an ordered sequence that includes the merged unmatched genome sub-fragments.

Specifically, the bedtools tool can be used to merge the genome sub-fragments that overlap.

Preferably, before the merging, it can first be detected whether any two genome sub-fragments that are adjacent in the ordered sequence overlap. If an overlap is detected, the two adjacent, overlapping genome sub-fragments are merged, obtaining the ordered sequence that includes the merged unmatched genome sub-fragments. If no overlap is detected, the merging step is skipped. Here, an overlap can mean that parts of the two genome sub-fragments overlap, that the two genome sub-fragments overlap entirely, or that the whole of one genome sub-fragment overlaps part of the other.

Merging the overlapping parts of the unmatched genome sub-fragments reduces repeated comparison of the same genome fragment in the second comparison, which reduces the time spent on comparison; merging the overlapping parts also reduces the consumption of computer memory.

The connection module is configured to connect all genome sub-fragments in the ordered sequence that includes the merged unmatched genome sub-fragments, to obtain the information of the unmatched genome fragments.

For example, after the homologous genome fragments in the first comparison result are filtered out, four unmatched genome sub-fragments may be obtained: a first, a second, a third, and a fourth sub-fragment. These are arranged from left to right into an ordered sequence according to their positions in the genome. If the tail of the third sub-fragment overlaps the head of the fourth sub-fragment, the overlapping part can be merged so that the third and fourth sub-fragments become one new genome sub-fragment, the fifth sub-fragment. This yields a new ordered sequence consisting of the first, second, and fifth sub-fragments, and the information obtained by connecting these three sub-fragments in order is the information of the unmatched genome fragments.
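The sort, merge, and connect steps of this example can be sketched in a few lines. The sketch assumes each sub-fragment is a half-open (start, end) interval on the target genome; the function names and the coordinates in the usage note are invented for illustration, and in practice a tool such as bedtools would typically do this work.

```python
def merge_overlapping(fragments):
    """Sort fragments by position and merge neighbours that overlap.

    Two half-open intervals overlap when the later one starts before the
    earlier one ends; this covers partial overlap and full containment.
    """
    merged = []
    for start, end in sorted(fragments):
        if merged and start < merged[-1][1]:
            previous_start, previous_end = merged[-1]
            merged[-1] = (previous_start, max(previous_end, end))
        else:
            merged.append((start, end))
    return merged


def connect(sequence, fragments):
    """Concatenate the subsequences of the given fragments in order."""
    return "".join(sequence[start:end] for start, end in fragments)
```

With four made-up sub-fragments `[(0, 10), (20, 30), (40, 55), (50, 70)]`, the third and fourth overlap and `merge_overlapping` returns three: `[(0, 10), (20, 30), (40, 70)]`, mirroring the first, second, and fifth sub-fragments of the example.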

Optionally, the second comparison result can include a plurality of homologous genome fragments, and the second acquisition unit may include: an extraction module, a second sorting module, a second detection module, a second merging module, and a third filtering module.

The extraction module is configured to extract the homologous genome fragments. The second sorting module is configured to sort the homologous genome fragments according to their positions in the target genome to obtain an ordered sequence of homologous genome fragments; specifically, the sort tool in bedtools can be used to sort the homologous genome fragments. The second detection module is configured to detect whether any two homologous genome fragments that are adjacent in the ordered sequence overlap. The second merging module is configured to merge the overlapping fragments, if an overlap is detected, to obtain a plurality of merged homologous genome fragments. The third filtering module is configured to filter the information of the merged homologous genome fragments out of the second comparison result to obtain the information of the distinguished sequences of the target genome; the information filtered out here includes, in addition to the information of the merged homologous genome fragments, the information of the homologous genome fragments that do not overlap. The step of filtering out the homologous genome fragments can be replaced with a step of inverting the homologous genome fragments; specifically, the complement tool can be used to invert the homologous genome fragments.

It should be noted that the function of the second acquisition unit can be replaced with the function of the first acquisition unit; this is not repeated here.

Preferably, the data processing device can also include: a first judging module, a second judging module, a third judging module, and a determining module. The first judging module is configured to judge, before the homologous genome fragments are extracted, whether the lengths of a plurality of genome fragments are greater than or equal to a preset length; this preset length is the same as the second preset length. The second judging module is configured to judge, if the lengths are greater than or equal to the preset length, whether the similarity of the genome fragments is greater than or equal to a preset similarity; this preset similarity is the same as the second preset similarity. The third judging module is configured to judge, if the similarity is greater than or equal to the preset similarity, whether the comparison rate of the genome fragments is greater than or equal to a preset comparison rate; this preset comparison rate is the same as the second preset comparison rate. The determining module is configured to take, if the comparison rate is greater than or equal to the preset comparison rate, the information of the genome fragments as the information of the homologous genome fragments.
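The three judgments form a simple cascade, which could be sketched as below. `Fragment` and `is_homologous` are hypothetical names. The default length of 100 bp and comparison rate of 90 follow the example values given for the second comparison, with the rate read as a percentage by assumption; no example value is given for the second preset similarity, so the caller must supply one.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Fragment:
    """Summary statistics of one compared genome fragment."""
    length: int             # fragment length in bp
    similarity: float       # percent similarity to the reference
    comparison_rate: float  # percent of the fragment that was compared


def is_homologous(fragment, *, min_similarity, min_length=100, min_rate=90.0):
    """Apply the length, similarity, and comparison-rate checks in order.

    Each check runs only if the previous one passed, as in the cascade
    of judging modules.
    """
    if fragment.length < min_length:
        return False
    if fragment.similarity < min_similarity:
        return False
    return fragment.comparison_rate >= min_rate
```

A fragment that fails any check is not treated as homologous and therefore remains a candidate for the distinguished sequences.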

According to an embodiment of the present invention, a data processing method for genomes is provided. The method is used to obtain the information of accurate distinguished sequences, creating the conditions for accurate gene analysis. The method may run on a computer processing device. It should be noted that the data processing method for genomes provided by the embodiment of the present invention can be executed by the data processing device for genomes of the embodiment, and that the device can likewise be used to execute the method.

Fig. 3 is a flow chart of the data processing method for genomes according to an embodiment of the present invention.

As shown in Fig. 3, the method includes the following steps S302 to S308:

Step S302: perform a first comparison between the information of the target genome and the information of the reference genome to obtain a first comparison result.

Specifically, the nucmer tool in the MUMmer software can be used to perform the first comparison of the target genome against the reference genome, obtaining the first comparison result between the two genomes. It should be noted that, for the first comparison at whole-genome scale, the nucmer tool may be replaced by another comparison tool.

Preferably, in the first comparison, the target genome and the reference genome can each be divided into n genome regions, and the n genome regions of the target genome can be compared with the n genome regions of the reference genome simultaneously. This saves comparison time and improves comparison efficiency.
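The region-wise comparison described above can be sketched as follows. This is only a minimal illustration under stated assumptions: the helper names are hypothetical, the text does not say how regions are paired (here every target region is paired with every reference region), and `shared_kmers` is a toy stand-in for a real comparison tool such as nucmer.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def split_into_regions(seq, n):
    """Split seq into n contiguous regions of near-equal length.

    Returns a list of (start_offset, region_sequence) tuples.
    """
    size, extra = divmod(len(seq), n)
    regions, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        regions.append((start, seq[start:end]))
        start = end
    return regions


def shared_kmers(target_region, reference_region, k=4):
    """Toy stand-in for a real aligner: count k-mers present in both regions."""
    def kmers(s):
        return {s[i:i + k] for i in range(len(s) - k + 1)}
    return len(kmers(target_region[1]) & kmers(reference_region[1]))


def compare_in_parallel(target, reference, n, workers=4):
    """Compare all region pairs concurrently.

    Assumption: every target region is paired with every reference region.
    Returns {(target_offset, reference_offset): score}.
    """
    pairs = list(product(split_into_regions(target, n),
                         split_into_regions(reference, n)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(lambda p: shared_kmers(*p), pairs))
    return {(t[0], r[0]): s for (t, r), s in zip(pairs, scores)}
```

In practice each worker would invoke the external comparison tool on one region pair; the thread pool here only illustrates that the region comparisons are independent of one another.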

Optionally, before the first comparison between the information of the target genome and the information of the reference genome is performed to obtain the first comparison result, the data processing method may further include: obtaining the information of the target genome, and obtaining the information of the reference genome.

Step S304: obtain, from the first comparison result, the information of the genome fragments that were not matched in the comparison (the unmatched genome fragments).

Specifically, the information of the unmatched genome fragments can be obtained by the following two methods:

Among these, the information of the homologous genome fragments can be filtered out by the nucmer tool in the MUMmer software. Filtering out the information of the homologous genome fragments in this way reduces the consumption of computer memory.

Step S306: perform a second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain a second comparison result.

Specifically, the blastn software can be used to compare the information of the unmatched genome fragments with the information of the reference genome, obtaining the second comparison result. This comparison is a fine comparison, and the corresponding second comparison result is a fine comparison result. In this way, homologous genome fragments that the first comparison missed can be found and filtered out, so that accurate distinguished sequences can be obtained. This is because, in the second comparison, the length of a homologous genome fragment can be the second preset length, and the second preset length can be greater than the first preset length; for example, the second preset length can be 100 bp. The similarity of a homologous genome fragment can be the second preset similarity, and the comparison rate of the second comparison can be the second preset comparison rate; for example, the second preset comparison rate can be 90.

Preferably, in the second comparison, the unmatched genome fragments and the reference genome can each be divided into n genome regions, and the n genome regions of the unmatched genome fragments can be compared with the n genome regions of the reference genome simultaneously. This saves comparison time and improves comparison efficiency.

Step S308: obtain the information of the distinguished sequence of the target genome from the second comparison result.

As shown in Fig. 4, the data processing method for genomes includes the following steps S402 to S414. This embodiment can serve as a preferred implementation of the embodiment shown in Fig. 3.

Steps S402 to S404 are the same as steps S302 to S304 of the embodiment shown in Fig. 3, respectively, and are not repeated here.

Step S406: detect whether repeated sequence information exists in the information of the unmatched genome fragments.

Preferably, when the species under study is a plant, it is worthwhile to detect whether repeated sequence information exists in the information of the unmatched genome fragments, because plant genomes contain a large number of repeated sequences. When the species under study is an animal, this detection may be omitted, because animal genomes contain only a small number of repeated sequences.

Step S408: if repeated sequence information is detected in the information of the unmatched genome fragments, mark the repeated sequence information to obtain marked information.

Specifically, the repeated sequence information can be marked by the RepeatMasker software, using characters other than base symbols, or digits and the like, as the marks. This prevents the marked information from being confused with the base sequence information.

Step S410: filter the marked information out of the information of the unmatched genome fragments to obtain filtered information.

Step S412: compare the filtered information with the information of the reference genome to obtain the second comparison result.

Step S414 is the same as step S308 of the embodiment shown in Fig. 3 and is not repeated here.

Through this embodiment of the present invention, repeated sequence information is detected and then filtered out or skipped before the comparison with the information of the reference genome. This reduces the number of genome sequences to be compared and thus improves comparison efficiency, and filtering out the marked information also reduces the computer memory consumed by the genome data.
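The marking and filtering of steps S408 and S410 can be sketched as below. This is a minimal illustration, not the patented implementation: it assumes the repeat intervals have already been detected by a tool such as RepeatMasker and are supplied as half-open `(start, end)` pairs, and the mark character `X` and both function names are hypothetical.

```python
import re

MASK_CHAR = "X"  # assumption: any symbol that is not a base letter


def mask_repeats(seq, repeat_intervals):
    """Replace every half-open [start, end) repeat interval with MASK_CHAR."""
    chars = list(seq)
    for start, end in repeat_intervals:
        for i in range(start, min(end, len(chars))):
            chars[i] = MASK_CHAR
    return "".join(chars)


def filter_masked(masked_seq):
    """Drop the marked runs; return (start, fragment) for each remaining stretch."""
    pattern = "[^" + MASK_CHAR + "]+"
    return [(m.start(), m.group()) for m in re.finditer(pattern, masked_seq)]
```

For example, masking positions 3 to 7 of `ACGTTTTTACGA` leaves two stretches, `ACG` at offset 0 and `ACGA` at offset 8, and only those stretches would be passed to the second comparison.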

Optionally, in the embodiment of the present invention, the first comparison result can include multiple homologous genome fragments, where the multiple homologous genome fragments are the multiple matched genome fragments. Obtaining the information of the unmatched genome fragments from the first comparison result may include the following steps:

First, filter the multiple homologous genome fragments out of the first comparison result to obtain multiple unmatched genome sub-fragments.

Then, sort the multiple unmatched genome sub-fragments according to their positions in the target genome to obtain an ordered sequence of the multiple unmatched genome sub-fragments.

Then, merge any two positionally adjacent genome sub-fragments in the ordered sequence that overlap, obtaining an ordered sequence of unmatched genome sub-fragments that includes multiple merged sub-fragments.

Finally, connect all genome sub-fragments in the ordered sequence that includes the multiple merged sub-fragments to obtain the information of the unmatched genome fragments.

For example, after the multiple homologous genome fragments in the first comparison result are filtered out, four unmatched genome sub-fragments may be obtained: a first, a second, a third, and a fourth sub-fragment. These four sub-fragments are arranged from left to right, according to their positions in the genome, into an ordered sequence in which the tail of the third sub-fragment overlaps the head of the fourth sub-fragment. The overlapping part can therefore be merged, so that the third and fourth sub-fragments are combined into a new genome sub-fragment, the fifth sub-fragment. This yields a new ordered sequence consisting of the first, second, and fifth sub-fragments. The information obtained by connecting the first, second, and fifth sub-fragments of the new ordered sequence in order is the information of the unmatched genome fragments.
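The sort, merge, and connect steps of this example can be sketched as follows. It is a minimal illustration under stated assumptions: sub-fragments are represented as half-open `(start, end)` coordinates on the target genome, only truly overlapping neighbours are merged, and both function names are hypothetical.

```python
def merge_overlapping(intervals):
    """Sort half-open (start, end) intervals by position and merge neighbours that overlap."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            # Overlaps the previous sub-fragment: extend it instead of adding a new one.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def connect(genome, intervals):
    """Concatenate the genome subsequences covered by the intervals, in order."""
    return "".join(genome[start:end] for start, end in intervals)


# Four sub-fragments; the third (40-55) and fourth (50-70) overlap.
sub_fragments = [(40, 55), (0, 10), (50, 70), (20, 30)]
merged = merge_overlapping(sub_fragments)  # three sub-fragments remain
```

With these coordinates the third and fourth sub-fragments collapse into a single sub-fragment spanning 40 to 70, which corresponds to the fifth sub-fragment of the example.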

Optionally, the second comparison result can include multiple homologous genome fragments, and obtaining the information of the distinguished sequence of the target genome from the second comparison result may include the following steps:

First, extract the multiple homologous genome fragments. Second, sort the multiple homologous genome fragments according to their positions in the target genome to obtain an ordered sequence of the multiple homologous genome fragments; specifically, the fragments can be sorted with the sort tool in bedtools. Third, detect whether any two positionally adjacent homologous genome fragments in the ordered sequence overlap. Then, if such an overlap is detected, merge the overlapping parts to obtain multiple merged homologous genome fragments. Finally, filter the information of the homologous genome fragments, including the multiple merged fragments, out of the second comparison result to obtain the information of the distinguished sequence of the target genome. The information filtered out here includes not only the information of the merged homologous genome fragments but also the information of the homologous genome fragments that have no overlap. The step of filtering the homologous genome fragments may be replaced by a step of inverting the homologous genome fragments; specifically, the homologous genome fragments can be inverted with the complement tool.
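The inversion alternative mentioned above amounts to taking the complement of the homologous intervals: whatever part of the target genome they do not cover is a candidate distinguished region. The following is a minimal pure-Python sketch of that idea, not a wrapper around any real tool. It assumes half-open `(start, end)` coordinates on a single sequence of known length, and the function name is hypothetical.

```python
def complement(homologous_intervals, genome_length):
    """Return the regions of [0, genome_length) not covered by any homologous interval.

    Sorting first and tracking the furthest covered position handles
    overlapping intervals without a separate merge pass.
    """
    gaps, cursor = [], 0
    for start, end in sorted(homologous_intervals):
        if start > cursor:
            gaps.append((cursor, start))  # uncovered stretch before this interval
        cursor = max(cursor, end)
    if cursor < genome_length:
        gaps.append((cursor, genome_length))  # uncovered tail of the genome
    return gaps
```

For instance, homologous intervals `(10, 20)`, `(15, 30)`, and `(50, 60)` on a sequence of length 100 leave the regions `(0, 10)`, `(30, 50)`, and `(60, 100)`.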

It should be noted that the step obtaining the information of distinguished sequence of target gene group from the second comparison result is permissible Replaced with the step with the information obtaining the genetic fragment not compared from the first comparison result, will not be described here.

Preferably, before the multiple homologous genome fragments are extracted, the data processing method may further include the following. First, judge whether the length of multiple genome fragments is greater than or equal to a preset length, where the preset length is the same as the second preset length. Next, if the length of the multiple genome fragments is greater than or equal to the preset length, judge whether the similarity of the multiple genome fragments is greater than or equal to a preset similarity, where the preset similarity is the same as the second preset similarity. Then, if the similarity of the multiple genome fragments is greater than or equal to the preset similarity, judge whether the comparison rate of the multiple genome fragments is greater than or equal to a preset comparison rate, where the preset comparison rate is the same as the second preset comparison rate. Finally, if the comparison rate of the multiple genome fragments is greater than or equal to the preset comparison rate, take the information of the multiple genome fragments as the information of the multiple homologous genome fragments.
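The three successive judgments can be sketched as a simple threshold cascade. This is a minimal illustration under stated assumptions: the `Hit` record and its field names are hypothetical, the defaults of 100 for length and 90 for comparison rate follow the example values given for the second comparison, and the similarity default is a placeholder because the text gives no value for it.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """Hypothetical record for one genome fragment in the second comparison result."""
    length: int             # fragment length in bp
    similarity: float       # similarity to the reference
    comparison_rate: float  # comparison rate of the fragment


def is_homologous(hit, min_length=100, min_similarity=90.0, min_rate=90.0):
    """Apply the length, similarity, and comparison-rate judgments in order.

    min_similarity is a placeholder value; the text does not specify it.
    """
    if hit.length < min_length:
        return False
    if hit.similarity < min_similarity:
        return False
    return hit.comparison_rate >= min_rate


def select_homologous(hits, **thresholds):
    """Keep only the fragments that pass all three judgments."""
    return [h for h in hits if is_homologous(h, **thresholds)]
```

Fragments that fail any judgment are simply not treated as homologous, so they remain candidates for the distinguished sequence.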

As can be seen from the above description, by using a long-sequence comparison program together with a short-sequence comparison program, the present invention obtains accurate distinguished sequences of all types between species (not limited to protein sequences) and reduces the time and memory required for genome comparison, which can provide the conditions for subsequent analysis of species diversity.

It should be noted that the step that illustrates of flow process in accompanying drawing can be in such as one group of computer executable instructions Execute in computer system, and although showing logical order in flow charts, but in some cases, can be with not It is same as the step shown or described by order execution herein.

Obviously, those skilled in the art should understand that each module or step of the present invention described above can be implemented with a general-purpose computing device. The modules or steps can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be fabricated as an individual integrated circuit module, or several of the modules or steps can be fabricated as a single integrated circuit module. Thus, the present invention is not restricted to any specific combination of hardware and software.

The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall be included within the scope of protection of the present invention.

Claims

1. A data processing method for genomes, characterized by comprising:

performing a first comparison between information of a target genome and information of a reference genome to obtain a first comparison result;

obtaining information of unmatched genome fragments from the first comparison result;

performing a second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain a second comparison result; and

obtaining information of a distinguished sequence of the target genome from the second comparison result;

wherein the first comparison result includes multiple homologous genome fragments, the multiple homologous genome fragments being multiple matched genome fragments, and obtaining the information of the unmatched genome fragments from the first comparison result comprises:

filtering the multiple homologous genome fragments out of the first comparison result to obtain multiple unmatched genome sub-fragments;

sorting the multiple unmatched genome sub-fragments according to their positions in the target genome to obtain an ordered sequence of the multiple unmatched genome sub-fragments;

merging any two positionally adjacent genome sub-fragments in the ordered sequence that overlap, to obtain an ordered sequence of unmatched genome sub-fragments that includes multiple merged sub-fragments; and

connecting all genome sub-fragments in the ordered sequence that includes the multiple merged sub-fragments, to obtain the information of the unmatched genome fragments.

2. The data processing method according to claim 1, characterized in that performing the second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain the second comparison result comprises:

detecting whether repeated sequence information exists in the information of the unmatched genome fragments;

if repeated sequence information is detected in the information of the unmatched genome fragments, marking the repeated sequence information to obtain marked information;

filtering the marked information out of the information of the unmatched genome fragments to obtain filtered information; and

comparing the filtered information with the information of the reference genome to obtain the second comparison result.

3. The data processing method according to claim 1, characterized in that the second comparison result includes multiple homologous genome fragments, and obtaining the information of the distinguished sequence of the target genome from the second comparison result comprises:

extracting the multiple homologous genome fragments;

sorting the multiple homologous genome fragments according to their positions in the target genome to obtain an ordered sequence of the multiple homologous genome fragments;

detecting whether any two positionally adjacent homologous genome fragments in the ordered sequence overlap;

if any two positionally adjacent homologous genome fragments in the ordered sequence are detected to overlap, merging the overlapping parts to obtain multiple merged homologous genome fragments; and

filtering the information of the homologous genome fragments, including the multiple merged fragments, out of the second comparison result to obtain the information of the distinguished sequence of the target genome.

4. The data processing method according to claim 3, characterized in that, before the multiple homologous genome fragments are extracted, the data processing method further comprises:

judging whether the length of multiple genome fragments is greater than or equal to a preset length;

if the length of the multiple genome fragments is greater than or equal to the preset length, judging whether the similarity of the multiple genome fragments is greater than or equal to a preset similarity;

if the similarity of the multiple genome fragments is greater than or equal to the preset similarity, judging whether the comparison rate of the multiple genome fragments is greater than or equal to a preset comparison rate; and

if the comparison rate of the multiple genome fragments is greater than or equal to the preset comparison rate, taking the information of the multiple genome fragments as the information of the multiple homologous genome fragments.

5. A data processing device for genomes, characterized by comprising:

a first comparison unit, used to perform a first comparison between information of a target genome and information of a reference genome to obtain a first comparison result;

a first acquisition unit, used to obtain information of unmatched genome fragments from the first comparison result;

a second comparison unit, used to perform a second comparison between the information of the unmatched genome fragments and the information of the reference genome to obtain a second comparison result; and

a second acquisition unit, used to obtain information of a distinguished sequence of the target genome from the second comparison result;

wherein the first comparison result includes multiple homologous genome fragments, the multiple homologous genome fragments being multiple matched genome fragments, and the first acquisition unit comprises:

a second filtering module, used to filter the multiple homologous genome fragments out of the first comparison result to obtain multiple unmatched genome sub-fragments;

a first sorting module, used to sort the multiple unmatched genome sub-fragments according to their positions in the target genome to obtain an ordered sequence of the multiple unmatched genome sub-fragments;

a first merging module, used to merge any two positionally adjacent genome sub-fragments in the ordered sequence that overlap, to obtain an ordered sequence of unmatched genome sub-fragments that includes multiple merged sub-fragments; and

a connecting module, used to connect all genome sub-fragments in the ordered sequence that includes the multiple merged sub-fragments, to obtain the information of the unmatched genome fragments.

6. The data processing device according to claim 5, characterized in that the second comparison unit comprises:

a first detection module, used to detect whether repeated sequence information exists in the information of the unmatched genome fragments;

a marking module, used to mark the repeated sequence information, if repeated sequence information is detected in the information of the unmatched genome fragments, to obtain marked information;

a first filtering module, used to filter the marked information out of the information of the unmatched genome fragments to obtain filtered information; and

a comparison module, used to compare the filtered information with the information of the reference genome to obtain the second comparison result.

7. The data processing device according to claim 5, characterized in that the second comparison result includes multiple homologous genome fragments, and the second acquisition unit comprises:

an extraction module, used to extract the multiple homologous genome fragments;

a second sorting module, used to sort the multiple homologous genome fragments according to their positions in the target genome to obtain an ordered sequence of the multiple homologous genome fragments;

a second detection module, used to detect whether any two positionally adjacent homologous genome fragments in the ordered sequence overlap;

a second merging module, used to merge the overlapping parts, if any two positionally adjacent homologous genome fragments in the ordered sequence are detected to overlap, to obtain multiple merged homologous genome fragments; and

a third filtering module, used to filter the information of the homologous genome fragments, including the multiple merged fragments, out of the second comparison result to obtain the information of the distinguished sequence of the target genome.

8. The data processing device according to claim 7, characterized by further comprising:

a first judging module, used to judge, before the multiple homologous genome fragments are extracted, whether the length of multiple genome fragments is greater than or equal to a preset length;

a second judging module, used to judge, if the length of the multiple genome fragments is greater than or equal to the preset length, whether the similarity of the multiple genome fragments is greater than or equal to a preset similarity;

a third judging module, used to judge, if the similarity of the multiple genome fragments is greater than or equal to the preset similarity, whether the comparison rate of the multiple genome fragments is greater than or equal to a preset comparison rate; and

a determining module, used to confirm the information of the multiple genome fragments as the information of the multiple homologous genome fragments if the comparison rate of the multiple genome fragments is greater than or equal to the preset comparison rate.