CN105260395B - STR data storage and paternity-test ranked comparison method based on an inverted index structure

Info

Publication number: CN105260395B
Application number: CN201510590067.XA
Authority: CN
Inventors: 刘健; 李宝娟; 高东怀; 许卫中; 孙茂; 许浩; 靳豪杰; 张军超
Original assignee: Fourth Military Medical University FMMU
Current assignee: Fourth Military Medical University FMMU
Priority date: 2015-09-16
Filing date: 2015-09-16
Publication date: 2018-05-01
Anticipated expiration: 2035-09-16
Also published as: CN105260395A

Abstract

The invention discloses an STR data storage and paternity-test ranked comparison method based on an inverted index structure, belonging to the technical field of data storage and processing. The method has two main parts. First, an STR data storage method based on an inverted index structure, which establishes a separate data field for each STR locus selected for the samples and stores the STR data within each data field as an inverted index. Second, a paternity-test ranked comparison method which, using the per-field inverted index, computes the kinship between a relative-search sample and the samples in the database, enabling fast, stable, and reliable online relative search.

Description

STR data storage and paternity-test ranked comparison method based on an inverted index structure

Technical field

The invention belongs to the technical field of data storage and processing, and specifically relates to an STR data storage and paternity-test ranked comparison method based on an inverted index structure.

Background

According to incomplete statistics, about 500,000 people in China are currently searching for lost relatives. War orphans (for example, Japanese orphans left behind in China), orphans of natural disasters, and abducted persons, whose separation stems from historical causes, natural disasters, and social problems, make up the majority of them. In recent years, with the continued development of biotechnology, searching for relatives by genetic techniques has become increasingly feasible.

Relative search based on genetic techniques mainly uses paternity-testing technology to detect human genetic markers and, by analyzing the laws of inheritance, to identify the genetic relationship between a suspected parent and child. DNA is the basic carrier of human genetic information, and human chromosomes consist mainly of DNA. Each human somatic cell has 22 pairs of autosomes and 1 pair of sex chromosomes, 46 chromosomes in total, half from the father and half from the mother. Each parent contributes half of the offspring's chromosomes, which pair up after fertilization to form the chromosomes of the offspring. Because the human genome consists of about 3 billion nucleotides, and exchange and recombination before germ-cell formation are random, no two people other than identical twins have exactly the same nucleotide sequence; this is human genetic polymorphism. Despite this polymorphism, every person's chromosomes can only come from his or her parents, which is the theoretical basis of DNA paternity testing. At present, the technique most widely used in paternity testing is identification based on short tandem repeats (STR); because it is highly sensitive, highly individual-specific, and fully digital, it has become a standard identification technique worldwide. A typical set of autosomal STR locus data is as follows:

Site	STR
D8S1179	13/14
D21S11	31/32
D7S820	11/12
CSF1PO	10/13
D3S1358	15/16
D5S818	11/13
D13S317	8/12
D16S539	9/12
D2S1338	17/23
D19S433	14/14
vWA	16/18
D12S391	18/21
D18S51	13/13
AMEL	X/Y
D6S1043	12/19
FGA	22/23

At present, solutions to the paternity-testing problem mostly rely on relational databases to store and compare STR data in order to determine whether the sample donors are parent and child. For each locus, the STR data consist mainly of two numbers, one inherited from the father and the other from the mother. Suppose that 16 loci are tested for each sample (including one sex-determining locus), so that each locus of each sample carries the values of two alleles. For two tested persons who are biological parent and child, at least one value must be identical at each of the 15 STR loci. To decide whether two individuals are parent and child, at most 4 comparisons are therefore needed per locus, and at most 60 for 15 loci. As the number of samples stored in the system grows, the amount of comparison grows with it. Thus, although relational storage and comparison solve the storage and retrieval problem of a relative-search database to some extent, the nature of human STR locus data makes it ill-suited to a table-based relational database, which greatly reduces the efficiency of STR comparison. In addition, genetic information can mutate; once STR data mutate, relative search and paternity testing with STR data become even more difficult.
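The pairwise comparison described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names and the dictionary layout (locus name mapped to a pair of allele strings) are assumptions made for the example.

```python
from itertools import product

def shares_allele(a, b):
    """Return (match, comparisons) for two allele pairs at one locus."""
    comparisons = 0
    for va, vb in product(a, b):  # at most 2 x 2 = 4 comparisons
        comparisons += 1
        if va == vb:
            return True, comparisons
    return False, comparisons

def naive_parent_child(sample_a, sample_b):
    """Check every shared locus; return (all_match, total_comparisons)."""
    total = 0
    all_match = True
    for locus in sample_a.keys() & sample_b.keys():
        match, n = shares_allele(sample_a[locus], sample_b[locus])
        total += n
        all_match = all_match and match
    return all_match, total

# Hypothetical two-locus samples, used only to exercise the functions.
parent = {"D8S1179": ("13", "14"), "D21S11": ("31", "32")}
child = {"D8S1179": ("14", "15"), "D21S11": ("29", "32")}
result = naive_parent_child(parent, child)
```

Every stored sample has to be compared with the query in this way, so the cost of one search grows linearly with the size of the database.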

Summary of the invention

To overcome the above problems of the prior art, the object of the invention is to provide an STR data storage and paternity-test ranked comparison method based on an inverted index structure. The method effectively improves the robustness and comparison efficiency of a relative-search database while ensuring the reliability of the search results.

The invention is achieved through the following technical solution:

The STR data storage and paternity-test ranked comparison method based on an inverted index structure disclosed by the invention comprises the following steps:

1) STR data storage based on an inverted index structure

First, all STR data are preprocessed, and the STR data set of each sample is arranged into a standard format; then, each locus is treated as a data field, and each data field stores the STR data of its own locus; finally, the STR data are stored as an inverted index;

2) Paternity-test ranked comparison based on the STR data stored as an inverted index

First, the STR data of the relative-search sample are preprocessed and arranged into the standard format; then, the STR data of each locus are compared within the corresponding data field, and a final parent-child relationship index is formed; finally, whether a parent-child relationship exists between the samples is judged: if the parent-child relationship index is higher than a specified value, the donor of the candidate sample and the donor of the relative-search sample are considered parent and child; otherwise, no parent-child relationship is considered to exist between them.

In step 1), the STR data are preprocessed and the STR data set of each sample is arranged into the standard format, specifically as follows:

The sample data set is denoted $X = \{x_1, x_2, \ldots, x_n\}$;

where $x_i$ denotes the STR data of the $i$-th individual, $x_i = \{\mathit{str}_j : (v_{j1}/v_{j2})\}$, in which $\mathit{str}_j$ is the name of the $j$-th STR locus and $v_{jk}$ is the STR value at locus $j$ on the $k$-th chromosome ($k = 1, 2$).
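As a rough sketch of this preprocessing step, the snippet below turns raw "v1/v2" strings into the standard format. The function name `standardize` and the choice to keep allele values as strings (so that microvariants such as 30.2 and the AMEL values X and Y are preserved verbatim) are assumptions of the example, not details given in the text.

```python
def standardize(raw):
    """Turn {"D8S1179": "13/14", ...} into {"D8S1179": ("13", "14"), ...}."""
    record = {}
    for locus, genotype in raw.items():
        # Each genotype is expected to hold exactly two values.
        v1, v2 = (part.strip() for part in genotype.split("/"))
        record[locus.strip()] = (v1, v2)
    return record

x1 = standardize({"D8S1179": "14/15", "D21S11": "30.2/31", "AMEL": "X/X"})
```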

In step 1), the data fields are established as follows:

The STR data of all samples are traversed to build the set of STR locus names $\mathit{STR}_N = \{\mathit{str}_1, \mathit{str}_2, \ldots, \mathit{str}_m\}$; for each $\mathit{str}_i$ in $\mathit{STR}_N$, a separate data field, denoted $d_i$, is established, $i = 1, 2, \ldots, m$.

In step 1), the STR data are stored as an inverted index as follows: the sample data set $X$ is traversed; for each $x_i$, every $\mathit{str}_j : (v_{j1}/v_{j2})$ is traversed; if the data field $d_m$ corresponding to $\mathit{str}_j$ already contains an index for $v_{j1}$, $x_i$ is added to that index; if no inverted index for $v_{j1}$ exists, the index is created and $x_i$ is added to it; $v_{j2}$ is handled in the same way.
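The storage procedure above can be sketched with nested dictionaries: one outer entry per data field (locus), and inside it one posting set per STR value. The name `build_index` and the in-memory layout are illustrative assumptions; the text does not prescribe a concrete storage engine.

```python
from collections import defaultdict

def build_index(samples):
    """samples: {sample_id: {locus: (v1, v2)}} -> {locus: {value: set(ids)}}."""
    fields = defaultdict(lambda: defaultdict(set))
    for sample_id, record in samples.items():
        for locus, (v1, v2) in record.items():
            # Missing fields and value indexes are created on first use.
            fields[locus][v1].add(sample_id)
            fields[locus][v2].add(sample_id)
    return fields

samples = {
    "00001": {"D8S1179": ("14", "15"), "D21S11": ("30.2", "31")},
    "00002": {"D8S1179": ("13", "15"), "D21S11": ("29", "32.2")},
}
fields = build_index(samples)
```

Using a set for each posting list means a homozygous genotype such as 11/11 adds the sample only once.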

In step 2), the STR data of the relative-search sample are preprocessed as follows:

The relative-search sample is arranged into the form $y = \{\mathit{str}_j : (v_{j1}/v_{j2})\}$, where $\mathit{str}_j$ is the name of the $j$-th STR locus and $v_{jk}$ is the STR value at locus $j$ on the $k$-th chromosome.

In step 2), the parent-child relationship index is calculated as follows:

For sample $y$, every $\mathit{str}_j : (v_{j1}/v_{j2})$ is traversed; if a data field $d_m$ corresponding to $\mathit{str}_j$ exists, the sample sets indexed under $v_{j1}$ and $v_{j2}$ are obtained and denoted $X_{j1}$ and $X_{j2}$;

The union of $X_{j1}$ and $X_{j2}$ is taken and denoted $X_j = X_{j1} \cup X_{j2}$;

After the $X_j$ corresponding to every $\mathit{str}_j$ has been obtained, the union of the $X_j$ is calculated, $X = X_1 \cup X_2 \cup \ldots \cup X_J$; each element of $X$ is a candidate sample;

For each element $x_i$ in $X$, calculate $q_i = \sum_{j=1}^{J} s_{ij}$, where $s_{ij} = 1$ if $x_i \in X_j$ and $s_{ij} = 0$ otherwise;

Then $q_i$ is the parent-child relationship index of candidate sample $x_i$.
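A minimal sketch of this calculation follows, assuming the index is held as `{locus: {value: set(sample_ids)}}` and that each locus contributes at most 1 to a candidate's score, which is how the per-locus 0/1 scores in Table 3 add up. The function name and the sample data are hypothetical.

```python
from collections import Counter

def parent_child_index(fields, query):
    """fields: {locus: {value: set(ids)}}, query: {locus: (v1, v2)}."""
    scores = Counter()
    for locus, (v1, v2) in query.items():
        postings = fields.get(locus)
        if postings is None:  # no data field for this locus: skip it
            continue
        x_j = postings.get(v1, set()) | postings.get(v2, set())
        for sample_id in x_j:  # each sample gains at most 1 per locus
            scores[sample_id] += 1
    return scores

fields = {
    "D8S1179": {"14": {"00001"}, "15": {"00001", "00002"}, "13": {"00002"}},
    "D7S820": {"10": {"00001"}, "11": {"00001"}, "8": {"00002"}, "9": {"00002"}},
}
query = {"D8S1179": ("13", "15"), "D7S820": ("11", "11"), "FGA": ("23", "24")}
q = parent_child_index(fields, query)
```

Only samples that appear in at least one posting list are ever touched, so the work depends on the number of candidates rather than on the full database size.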

In step 2), whether a parent-child relationship exists is judged as follows:

The candidate samples $x_i$ are sorted in descending order of $q_i$; if $q_i \ge \theta$, the donor of the relative-search sample $y$ and the donor of candidate sample $x_i$ are considered parent and child; otherwise, no parent-child relationship is considered to exist between them. Here $\theta$ is a threshold preset by the system, equal to the number of loci minus 1.
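The ranking and threshold test can be sketched as below. The scores are made-up values for a 16-locus panel, and `rank_candidates` is a hypothetical helper name.

```python
def rank_candidates(scores, num_loci):
    """Sort candidates by descending q_i and flag those with q_i >= theta."""
    theta = num_loci - 1
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(sample_id, q, q >= theta) for sample_id, q in ranked]

scores = {"A": 9, "B": 16, "C": 15, "D": 12}  # hypothetical q_i values
ranking = rank_candidates(scores, num_loci=16)
```

With 16 loci the threshold is 15, so candidates B and C are flagged while D and A are only listed further down the ranking.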

Compared with the prior art, the invention has the following beneficial technical effects:

1. Higher query efficiency

Traditional relative-search databases usually use a relational database and improve query efficiency through vertical partitioning, views, statistics, and similar optimizations. These retrieval methods are relatively complex, however, and are hard for operations staff without the relevant background to understand. The invention sets up a separate data field for each locus and stores the paired STR locus data within each field as an inverted index, which greatly improves the query efficiency of the system.

2. Higher scalability

A traditional relative-search database based on a relational database usually requires every searcher to be tested at a specific set of loci, which severely limits the scope of use of the database and is very inconvenient for users. The invention places no specific requirement on the loci used; to extend the system with new loci, it is only necessary to add the corresponding data fields, without modifying the underlying data structure, which greatly widens the applicability of the system.
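To illustrate the scalability point, the sketch below adds a sample that was tested at a locus the index has not seen before. The helper `add_sample` and the sample IDs are assumptions, and "Penta E" is a locus name chosen only for illustration.

```python
def add_sample(fields, sample_id, record):
    """Add one sample; an unseen locus simply becomes a new data field."""
    for locus, alleles in record.items():
        postings = fields.setdefault(locus, {})
        for value in alleles:
            postings.setdefault(value, set()).add(sample_id)

fields = {}
add_sample(fields, "S1", {"D8S1179": ("13", "14")})
before = set(fields)
# "Penta E" is a locus name chosen only for illustration.
add_sample(fields, "S2", {"D8S1179": ("14", "15"), "Penta E": ("7", "12")})
after = set(fields)
```

No existing entry is rewritten when the new field appears, which is the contrast with adding a column to a relational table.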

3. Reduced influence of gene mutation on paternity-test results

Gene mutation is a major obstacle to accurate paternity testing in a relative-search database. Because of mutation, the STR locus data of parent and child do not necessarily coincide at every locus. When a relational database stores STR locus data, exact matching between STR locus data is therefore hard to carry out: the WHERE conditions of the SQL statement can hardly match precisely, which severely limits the effectiveness of paternity testing. The invention stores STR locus data in inverted indexes in separate data fields. At query time, the parent-child relationship index between samples is computed from the comparison scores in the individual data fields, which gives the likelihood of a genetic relationship between sample donors, and the candidates are ranked accordingly. This largely avoids the influence of gene mutation on paternity-test results. The invention thus effectively improves the robustness and comparison efficiency of a relative-search database while ensuring the reliability of the search results.
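The effect of scoring instead of exact matching can be seen in a small sketch. The four-locus genotypes are invented for the example, and the threshold of "number of loci minus 1" follows the setting given for θ in the text.

```python
def locus_score(a, b):
    """1 if the two genotypes share at least one allele value, else 0."""
    return int(bool(set(a) & set(b)))

loci = ["D8S1179", "D21S11", "D7S820", "CSF1PO"]
parent = {"D8S1179": ("13", "15"), "D21S11": ("29", "31"),
          "D7S820": ("11", "11"), "CSF1PO": ("11", "12")}
# Hypothetical child: D7S820 carries 12 instead of the inherited 11.
child = {"D8S1179": ("13", "14"), "D21S11": ("30", "31"),
         "D7S820": ("10", "12"), "CSF1PO": ("10", "11")}

q = sum(locus_score(parent[name], child[name]) for name in loci)
theta = len(loci) - 1
exact_match = all(locus_score(parent[name], child[name]) for name in loci)
```

An all-loci exact condition rejects this pair, while the score still reaches the threshold, so a single mutated locus does not remove the true parent from the result list.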

Brief description of the drawings

Fig. 1 is a schematic diagram of the overall framework of the system;

Fig. 2 shows the storage of STR locus data based on an inverted index structure as used in the invention;

Fig. 3 is a flow chart of the STR data storage algorithm based on an inverted index structure;

Fig. 4 is a flow chart of the paternity-test ranked comparison algorithm based on the STR data stored as an inverted index;

Fig. 5 and Fig. 6 show the data structure of the D8S1179 data field before and after sample No. 00002 is stored, respectively;

Fig. 7 and Fig. 8 show a relative-search system prototype designed and implemented according to the principle of the invention; Fig. 7 shows the STR data stored with the inverted index structure, and Fig. 8 shows the result obtained by the relative-search algorithm.

Detailed description of the embodiments

The invention is described in further detail below with reference to a specific embodiment, which explains the invention and does not limit it.

Referring to Fig. 1, the STR data storage and paternity-test ranked comparison method based on an inverted index structure has two main parts. First, the STR data storage method based on an inverted index structure (see Fig. 3), which establishes a separate data field for each STR locus selected for the samples and stores the STR data within each data field as an inverted index. Second, the paternity-test ranked comparison method (see Fig. 4), which, using the per-field inverted index, computes the kinship between the relative-search sample and the samples in the database, enabling fast, stable, and reliable online relative search.

1. STR data storage method based on an inverted index structure

The STR data storage method based on an inverted index structure comprises the following steps: first, all STR data are preprocessed, and the STR data set of each sample is arranged into a standard format; then, each locus is treated as a data field, and each data field stores the STR data of its own locus; finally, the STR data are stored as an inverted index. The detailed procedure is shown in Fig. 3, specifically:

Step 1: Data preprocessing. The data to be stored are arranged into the following form: the sample data set is denoted $X = \{x_1, x_2, \ldots, x_n\}$, where $x_i$ denotes the STR data of the $i$-th individual and can be written $x_i = \{\mathit{str}_j : (v_{j1}/v_{j2})\}$, in which $\mathit{str}_j$ is the name of the $j$-th STR locus and $v_{jk}$ is the STR value at locus $j$ on the $k$-th chromosome.

Step 2: Establish data fields. The STR data of all samples are traversed to build the set of STR locus names $\mathit{STR}_N = \{\mathit{str}_1, \mathit{str}_2, \ldots, \mathit{str}_m\}$; for each $\mathit{str}_i$ in $\mathit{STR}_N$, a separate data field, denoted $d_i$, is established.

Step 3: Data storage. The sample data set $X$ is traversed; for each $x_i$, every $\mathit{str}_j : (v_{j1}/v_{j2})$ is traversed; if the data field $d_m$ corresponding to $\mathit{str}_j$ already contains an index for $v_{j1}$, $x_i$ is added to that index; if no inverted index for $v_{j1}$ exists, the index is created and $x_i$ is added to it. $v_{j2}$ is handled in the same way.

The STR locus data stored by the above method are shown in Fig. 2. D8S1179, D21S11, and so on, shown at the top, are the data fields corresponding to the STR loci; the lower left shows the data keys, whose numbers are STR values; the lower right shows the inverted index list for each data key, in which each number is the ID of a sample donor. For example, the list for STR value 12 contains the numbers 1, 5, 7, 13, 22, and so on, meaning that one chromosome of each of samples 1, 5, 7, 13, and 22 has the value 12 at locus D3S1358.

2. Paternity-test ranked comparison method based on an inverted index structure

The STR data storage method based on an inverted index structure yields the STR locus data storage structure shown in Fig. 2. Relative search is then carried out with the paternity-test ranked comparison method based on an inverted index structure, which mainly comprises the following steps: first, the STR data of the relative-search sample are preprocessed and arranged into the standard format; then, the STR data of each locus are compared within the corresponding data field, and a final parent-child relationship index is formed; finally, whether a parent-child relationship exists between the samples is judged: if the parent-child relationship index is higher than a specified value, the donor of the candidate sample and the donor of the relative-search sample are considered parent and child; otherwise, no parent-child relationship is considered to exist between them.

Step 1: Data preprocessing. The relative-search sample is arranged into the form $y = \{\mathit{str}_j : (v_{j1}/v_{j2})\}$, where $\mathit{str}_j$ is the name of the $j$-th STR locus and $v_{jk}$ is the STR value at locus $j$ on the $k$-th chromosome.

Step 2: Calculate the parent-child relationship index. For sample $y$, every $\mathit{str}_j : (v_{j1}/v_{j2})$ is traversed; if a data field $d_m$ corresponding to $\mathit{str}_j$ exists, the sample sets indexed under $v_{j1}$ and $v_{j2}$ are obtained and denoted $X_{j1}$ and $X_{j2}$. Their union is taken and denoted $X_j = X_{j1} \cup X_{j2}$. After the $X_j$ corresponding to every $\mathit{str}_j$ has been obtained, the union $X = X_1 \cup X_2 \cup \ldots \cup X_J$ is calculated; each element of $X$ is a candidate sample. For each element $x_i$ in $X$, calculate $q_i = \sum_{j=1}^{J} s_{ij}$, where $s_{ij} = 1$ if $x_i \in X_j$ and $s_{ij} = 0$ otherwise.

Then $q_i$ is the parent-child relationship index of candidate sample $x_i$.

Step 3: Judge whether a parent-child relationship exists. The candidate samples $x_i$ are sorted in descending order of $q_i$; if $q_i \ge \theta$, the donor of the relative-search sample $y$ and the donor of candidate sample $x_i$ are considered parent and child; otherwise, no parent-child relationship is considered to exist between them. Here $\theta$ is a threshold preset by the system; in general it can be set to the number of loci minus 1.

A specific example follows:

The samples to be stored are shown in Table 1, and the relative-search sample is shown in Table 2.

Table 1. Example samples to be stored in the relative-search database

Sample ID	00001	00002	00003	00004	00005	……
D8S1179	14/15	13/15	10/13	13/15	13/14	……
D21S11	30.2/31	29/32.2	30/31.2	29/32.2	29/30	……
D7S820	10/11	8/9	11/11	10/12	……
CSF1PO	10/11	12/14	10/13	11/12	10/10	……
D3S1358	15/16	16/16	16/17	15/15	15/16	……
D5S818	10/11	11/12	11/13	10/11	10/13	……
D13S317	12/12	11/11	11/12	10/11	11/11	……
D16S539	10/13	9/11	11/12	10/12	……
D2S1338	20/23	21/23	18/24	20/23	18/19	……
D19S433	13/14	13/13	14/15	13/15.2	13/14	……
vWA	17/20	14/16	16/17	13/14	17/19	……
D12S391	18/21	19/20	17/17.3	18/19	18/18	……
D18S51	13/14	13/15	14/15	13/16	14/17	……
AMEL	X/X	X/Y	……
D6S1043	14/21.3	19/20	13/14	10/19	17/18	……
FGA	19/22	19/24	23/24	24/26	23/23	……

Table 2. Example relative-search sample

Site	STR
D8S1179	13/15
D21S11	29/31
D7S820	11/11
CSF1PO	11/11
D3S1358	15/15
D5S818	10/12
D13S317	10/10
D16S539	9/11
D2S1338	18/23
D19S433	13/14
vWA	14/14
D12S391	18/19
D18S51	13/15
AMEL	X/Y
D6S1043	10/18
FGA	23/24

1. STR data storage based on an inverted index structure

Step 1: Data preprocessing. All samples to be stored are arranged into the standard format. Taking sample No. 00001 as an example, the result is:

x₁ = {D8S1179:(14/15), D21S11:(30.2/31), D7S820:(10/11), CSF1PO:(10/11), D3S1358:(15/16), D5S818:(10/11), D13S317:(12/12), D16S539:(10/13), D2S1338:(20/23), D19S433:(13/14), vWA:(17/20), D12S391:(18/21), D18S51:(13/14), AMEL:(X/X), D6S1043:(14/21.3), FGA:(19/22)}

Step 2: Establish data fields. In this example the locus names of all samples are identical, so 16 data fields are established in total, one per locus.

Step 3: Data storage. Take the sample with ID 00002 as an example; at this point the sample with ID 00001 has already been stored in the database, as shown in Fig. 5. The first data item, D8S1179:(13/15), is read first. The data field D8S1179 already exists in the database and contains index 15 but not index 13, so index 13 is created and 00002 is added to both index 13 and index 15, as shown in Fig. 6. The remaining data fields of sample No. 00002 are then processed in turn in the same way.
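The transition from Fig. 5 to Fig. 6 for the D8S1179 data field can be traced with a few lines of code. The dictionary literal and the helper `store` are illustrative assumptions about how the field might be held in memory.

```python
# D8S1179 data field after sample 00001 (14/15) has been stored.
d8s1179 = {"14": ["00001"], "15": ["00001"]}

def store(postings, sample_id, genotype):
    """Add sample_id under both allele values, creating missing indexes."""
    created = []
    for value in genotype:
        if value not in postings:
            postings[value] = []
            created.append(value)
        if sample_id not in postings[value]:  # homozygous: add only once
            postings[value].append(sample_id)
    return created

new_indexes = store(d8s1179, "00002", ("13", "15"))
```

After the call, only index 13 has been newly created, and sample 00002 appears in the lists for both 13 and 15.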

2. Paternity-test ranked comparison method based on an inverted index structure

Step 1: Data preprocessing.

The relative-search sample in Table 2 is arranged into the following form:

y = {D8S1179:(13/15), D21S11:(29/31), D7S820:(11/11), CSF1PO:(11/11), D3S1358:(15/15), D5S818:(10/12), D13S317:(10/10), D16S539:(9/11), D2S1338:(18/23), D19S433:(13/14), vWA:(14/14), D12S391:(18/19), D18S51:(13/15), AMEL:(X/Y), D6S1043:(10/18), FGA:(23/24)}

Step 2: Calculate the parent-child relationship index.

For sample y, starting with the first data field D8S1179, the union of the sample sets indexed under 13 and 15 is taken, X_j = {00001, 00002, 00003, 00004, 00005, ...}; on this basis the union over all data fields is calculated, X = {00001, 00002, 00003, 00004, 00005, ...}. For each element x_i in X, its score is calculated, as shown in Table 3:

Table 3. Sample scores

Sample ID	00001	00002	00003	00004	00005	……
D8S1179	1	1	1	1	1	……
D21S11	0	1	0	1	1	……
D7S820	1	0	1	1	0	……
CSF1PO	1	0	0	1	0	……
D3S1358	1	0	0	1	1	……
D5S818	1	1	0	1	1	……
D13S317	0	0	0	1	0	……
D16S539	0	1	1	1	0	……
D2S1338	1	1	1	1	1	……
D19S433	1	1	1	1	1	……
vWA	0	1	0	1	0	……
D12S391	1	1	0	1	1	……
D18S51	1	1	1	1	0	……
AMEL	1	1	1	1	1	……
D6S1043	0	0	0	1	1	……
FGA	0	1	1	1	1	……
q_i	10	11	8	16	10

Step 3: Judge whether a parent-child relationship exists.

The candidate samples are sorted in descending order of score. With θ = 15, sample 00004 and y can be judged to be parent and child.
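The totals in Table 3 and the resulting decision can be checked with a short script. The per-locus values are copied from Table 3; the variable names are chosen for the example.

```python
loci = ["D8S1179", "D21S11", "D7S820", "CSF1PO", "D3S1358", "D5S818",
        "D13S317", "D16S539", "D2S1338", "D19S433", "vWA", "D12S391",
        "D18S51", "AMEL", "D6S1043", "FGA"]
# Per-locus scores copied from Table 3, one list per sample, in locus order.
table3 = {
    "00001": [1, 0, 1, 1, 1, 1, 0, 0, 1, 1, 0, 1, 1, 1, 0, 0],
    "00002": [1, 1, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1],
    "00003": [1, 0, 1, 0, 0, 0, 0, 1, 1, 1, 0, 0, 1, 1, 0, 1],
    "00004": [1] * 16,
    "00005": [1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 1, 1],
}
q = {sample_id: sum(row) for sample_id, row in table3.items()}
theta = len(loci) - 1  # 16 loci, so theta = 15
matches = [sample_id for sample_id, score in q.items() if score >= theta]
```

The column sums reproduce the q_i row of Table 3, and 00004 is the only sample that reaches the threshold.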

A database system prototype designed according to this method is shown in Fig. 7 and Fig. 8: Fig. 7 shows the STR data stored with the inverted index structure described in this patent, and Fig. 8 shows the relative-search result.

In summary, the invention proposes an STR data storage and paternity-test ranked comparison method based on an inverted index structure. By establishing data fields and indexes over STR values, the method stores paired STR locus data as inverted indexes; on this basis, the paternity-test ranked comparison method computes the parent-child relationship index between samples, which enables fast paternity-test comparison, improves retrieval efficiency, and reduces the influence of gene mutation on paternity-test comparison. The use of data fields also greatly widens the applicability of the method.

Claims

1. An STR data storage and paternity-test ranked comparison method based on an inverted index structure, characterized by comprising the following steps:

1) STR data storage based on an inverted index structure

First, all STR data are preprocessed, and the STR data set of each sample is arranged into a standard format; then, each locus is treated as a data field, and each data field stores the STR data of its own locus; finally, the STR data are stored as an inverted index;

The STR data are preprocessed and the STR data set of each sample is arranged into the standard format, specifically as follows:

The sample data set is denoted $X = \{x_1, x_2, \ldots, x_n\}$;

where $x_i$ denotes the STR data of the $i$-th individual, $x_i = \{\mathit{str}_j : (v_{j1}/v_{j2})\}$, in which $\mathit{str}_j$ is the name of the $j$-th STR locus and $v_{jk}$ is the STR value at locus $j$ on the $k$-th chromosome;

2) Paternity-test ranked comparison based on the STR data stored as an inverted index

First, the STR data of the relative-search sample are preprocessed and arranged into the standard format; then, the STR data of each locus are compared within the corresponding data field, and a final parent-child relationship index is formed; finally, whether a parent-child relationship exists between the samples is judged: if the parent-child relationship index is higher than a specified value, the donor of the candidate sample and the donor of the relative-search sample are considered parent and child; otherwise, no parent-child relationship is considered to exist between them;

wherein the STR data of the relative-search sample are preprocessed as follows:

The relative-search sample is arranged into the form $y = \{\mathit{str}_j : (v_{j1}/v_{j2})\}$, where $\mathit{str}_j$ is the name of the $j$-th STR locus and $v_{jk}$ is the STR value at locus $j$ on the $k$-th chromosome;

The parent-child relationship index is calculated as follows:

For sample $y$, every $\mathit{str}_j : (v_{j1}/v_{j2})$ is traversed; if a data field $d_m$ corresponding to $\mathit{str}_j$ exists, the sample sets indexed under $v_{j1}$ and $v_{j2}$ are obtained and denoted $X_{j1}$ and $X_{j2}$;

The union of $X_{j1}$ and $X_{j2}$ is taken and denoted $X_j = X_{j1} \cup X_{j2}$;

After the $X_j$ corresponding to every $\mathit{str}_j$ has been obtained, the union of the $X_j$ is calculated, $X = X_1 \cup X_2 \cup \ldots \cup X_J$; each element of $X$ is a candidate sample;

For each element $x_i$ in $X$, calculate $q_i = \sum_{j=1}^{J} s_{ij}$, where $s_{ij} = 1$ if $x_i \in X_j$ and $s_{ij} = 0$ otherwise;

Then $q_i$ is the parent-child relationship index of candidate sample $x_i$.

2. The STR data storage and paternity-test ranked comparison method based on an inverted index structure according to claim 1, characterized in that the data fields in step 1) are established as follows:

The STR data of all samples are traversed to build the set of STR locus names $\mathit{STR}_N = \{\mathit{str}_1, \mathit{str}_2, \ldots, \mathit{str}_m\}$; for each $\mathit{str}_i$ in $\mathit{STR}_N$, a separate data field, denoted $d_i$, is established, $i = 1, 2, \ldots, m$.

3. The STR data storage and paternity-test ranked comparison method based on an inverted index structure according to claim 1, characterized in that the STR data are stored as an inverted index in step 1) as follows: the sample data set $X$ is traversed; for each $x_i$, every $\mathit{str}_j : (v_{j1}/v_{j2})$ is traversed; if the data field $d_m$ corresponding to $\mathit{str}_j$ already contains an index for $v_{j1}$, $x_i$ is added to that index; if no inverted index for $v_{j1}$ exists, the index is created and $x_i$ is added to it; $v_{j2}$ is handled in the same way.

4. The STR data storage and paternity-test ranked comparison method based on an inverted index structure according to claim 1, characterized in that whether a parent-child relationship exists is judged in step 2) as follows:

The candidate samples $x_i$ are sorted in descending order of $q_i$; if $q_i \ge \theta$, the donor of the relative-search sample $y$ and the donor of candidate sample $x_i$ are considered parent and child; otherwise, no parent-child relationship is considered to exist between them; here $\theta$ is a threshold preset by the system, equal to the number of loci minus 1.