CN102200967B

CN102200967B - Method and system for processing text based on DNA sequences

Info

Publication number: CN102200967B
Application number: CN 201110079135
Authority: CN
Inventors: 张成岗; 周扬; 屈武斌
Original assignee: Institute of Radiation Medicine of CAMMS
Current assignee: Institute of Radiation Medicine of CAMMS
Priority date: 2011-03-30
Filing date: 2011-03-30
Publication date: 2012-10-24
Anticipated expiration: 2031-03-30
Also published as: CN102200967A

Abstract

The invention provides a method and system for processing text based on DNA sequences. The method comprises: assigning DNA sequence codes to the characters of two or more texts; and performing similarity analysis on the encoded texts using DNA sequence processing methods. The characters are one or more of digits, letters, words, or symbols, and the letters or words may belong to one or more languages. The DNA sequence codes are assigned by: assigning decimal numbers to the characters of the texts; converting the decimal numbers into quaternary numbers; mapping the quaternary digits 0, 1, 2, 3 one-to-one onto the four deoxyribonucleotides; and converting the quaternary numbers into DNA sequence codes. The invention also provides a system implementing the method. The method and system do not depend on building a database or extracting keywords, place no restriction on the number of characters or character combinations, and enable efficient and comprehensive analysis of text information.

Description

A text processing method and system based on DNA sequences

Technical field

The present invention relates to an information processing method and system, and in particular to a text processing method and system based on DNA sequences.

Background technology

Frequency-spectrum characterization, similarity comparison, and cluster analysis are conventional means of text analysis. Many text processing systems exist, but most perform only one of these tasks. For example, the academic paper detection system of China National Knowledge Infrastructure (CNKI) and the ROST anti-plagiarism system developed by Associate Professor Shen Yang of Wuhan University and his team perform only text similarity comparison.

Frequency-spectrum characterization analyzes one or more texts at the character level (single characters or multi-character combinations). Every character or character combination that may occur is fixed on the horizontal axis, its frequency of occurrence in the text is counted, and that frequency is plotted on the vertical axis to produce a spectrum of the text. Although this describes text information intuitively, the sheer number of characters and the difficulty of fixing their positions on the horizontal axis mean that current practice is limited to counting the frequencies of a small number of characters (fewer than 20), and the technique is seldom used.

Text similarity comparison (or detection) analyzes different texts by comparing how similar their wording is. Its usual core method is word-frequency calculation. The article is first processed hierarchically: fingerprint indexes (short fragments that serve as labels for larger blocks of text) are created at the chapter, paragraph, and sentence levels, and these fingerprints are used as queries against a database to retrieve similar texts. This is the core of anti-plagiarism systems and is also used in fields such as similar-text search and text mining. However, it depends heavily on the database: every comparison (detection) system needs a very large body of text behind it. It therefore gives poor support for lightweight pairwise comparison or for similarity comparison among a small set of texts, because the text to be compared against must already be present in the database. This is a major limitation in text analysis.

Text clustering describes the relationships among texts from their pairwise similarity, on the principle that documents of the same class are more similar to one another. At present, text clustering is mainly based on extracting document keywords, which turns the problem of text similarity into a problem of keyword similarity. Although this simplifies the analysis to some extent, clustering through keyword extraction alone accumulates extraction errors and loses information, making it difficult to compare the similarity of texts from a globally optimal perspective.

In summary, existing text processing methods and systems each perform a relatively narrow task, run inefficiently, and do not interoperate, and each individual task has its own problems as well.

By contrast, the analysis tools for DNA sequences, which are likewise information carriers, are powerful and efficient. For example, the DNA sequence similarity comparison software BLAT, running on an ordinary desktop computer, can search hundreds of thousands of gene sequences for sequences similar to a 1000-bp cDNA sequence with a query response time of less than one second. The similar DNA sequences found can then be compared further and spliced to identify similar regions and conserved sites, and similarity calculation and cluster analysis can be carried out. Sequences can also be clustered by analyzing the differences between their DNA sequence frequency spectra. None of these methods depends on building a database or extracting keywords, and none restricts the number of base combinations that can be counted.

As can be seen from the above, DNA sequence analysis and text analysis have similar goals. The difference is that a DNA sequence expresses information in DNA sequence codes (the deoxyribonucleotides A, T, C, and G), whereas a text expresses information in characters. How to convert the character information of a text into DNA sequence codes, so that DNA sequence processing methods can be used to analyze text comprehensively and efficiently, is a problem that remains to be solved.

Summary of the invention

The present invention provides a text processing method based on DNA sequences. DNA sequence codes are assigned to the characters of a text, and the text is then processed with DNA sequence processing methods. The method does not depend on building a database or extracting keywords, is not limited by the number of characters or character combinations, and enables efficient and comprehensive analysis of text information.

The present invention also provides a text processing system based on DNA sequences. The system assigns DNA sequence codes to the characters of a text and then processes the text with DNA sequence processing methods. It addresses the problems that existing text processing systems each perform a relatively narrow task, run inefficiently, and do not interoperate, and it achieves comprehensive and efficient analysis of text.

The present invention also provides the use of the system in text processing.

The text processing method based on DNA sequences provided by the invention comprises: assigning DNA sequence codes to the characters of two or more texts, identical characters receiving identical DNA sequence codes; and performing similarity analysis on the encoded texts using DNA sequence processing methods.

Assigning DNA sequence codes to the characters of the two or more texts comprises: assigning a decimal number to each character in the texts, identical characters receiving identical decimal numbers; converting the decimal number of each character into a quaternary number of n digits, where 4^n is at least the total number of distinct characters in the texts and quaternary numbers with fewer than n digits are padded with leading zeros; and mapping the quaternary digits 0, 1, 2, 3 one-to-one onto the four deoxyribonucleotides, so that the n-digit quaternary number of each character is converted into an n-position DNA sequence code and a DNA sequence is obtained for each text.

In one embodiment of the invention, decimal numbers are assigned to the characters of the two or more texts in the order in which the characters appear in the texts.
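By way of non-limiting illustration, the assignment steps described above can be sketched in Python as follows. The function names, the numbering from 0, and the mapping 0→A, 1→C, 2→G, 3→T are assumptions chosen for the sketch; the description only requires that distinct characters receive distinct numbers and that the digit-to-nucleotide mapping be one-to-one.

```python
BASES = "ACGT"  # assumed mapping: digit 0->A, 1->C, 2->G, 3->T


def assign_decimals(texts):
    """Number each distinct character in order of first appearance (from 0, an assumption)."""
    table = {}
    for text in texts:
        for ch in text:
            if ch not in table:
                table[ch] = len(table)
    return table


def to_dna(number, n):
    """Convert a decimal number to an n-position DNA code via its n-digit quaternary form."""
    if not 0 <= number < 4 ** n:
        raise ValueError("number does not fit in n quaternary digits")
    bases = []
    for _ in range(n):
        number, digit = divmod(number, 4)  # peel off the lowest quaternary digit
        bases.append(BASES[digit])
    return "".join(reversed(bases))  # leading zeros become leading 'A's


def encode_texts(texts, n):
    """Return one DNA sequence per text, plus the character-to-number table."""
    table = assign_decimals(texts)
    if 4 ** n < len(table):
        raise ValueError("n is too small for the number of distinct characters")
    codes = ["".join(to_dna(table[ch], n) for ch in text) for text in texts]
    return codes, table
```

With this mapping, `to_dna(61, 7)` returns `"AAAATTC"`, the code given for decimal number 61 in Embodiment 1.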

In another embodiment of the invention, performing similarity analysis on the two or more encoded texts using DNA sequence processing methods comprises carrying out one or more of the following analyses 1 to 3. Analysis 1 counts sequence frequencies for the encoded texts, pair by pair, to obtain a sequence frequency table; calculates distances from the sequence frequency table; and clusters the texts according to the distance results.

Analysis 2 performs pairwise sequence similarity comparison on the encoded texts and determines their similar parts from the high-scoring segment pairs (HSPs) obtained. Determining the similar parts includes restoring each HSP to character information as follows. The start positions of the HSP in the two sequences are each divided by n and the remainders are taken. If the two remainders are not equal, the HSP is skipped and not read. If they are equal to some k with k ≠ 0, the read position is moved n − k positions toward the tail to reach the start of a character, and n-position DNA sequence codes are read consecutively from there and converted into characters. If they are equal with k = 0, n-position DNA sequence codes are read consecutively from the HSP start position and converted into characters. Reading continues to the tail of the DNA sequence; a trailing run of fewer than n codes is discarded and not read.
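The restoration steps of analysis 2 can be sketched as follows. This is an illustration only: it assumes 0-based start positions (so that a remainder of 0 marks a character boundary), a half-open end position on sequence A, and a hypothetical `code_to_char` lookup that inverts the character encoding; none of these conventions is fixed by the description.

```python
def restore_hsp(seq_a, start_a, start_b, end_a, n, code_to_char):
    """Restore one HSP to characters, reading from sequence A.

    start_a, start_b: 0-based HSP start positions in the two sequences (assumed).
    end_a: position just past the HSP on sequence A (assumed half-open).
    Returns None when the HSP is skipped.
    """
    k_a, k_b = start_a % n, start_b % n
    if k_a != k_b:
        return None  # the two starts are out of phase: skip, do not read
    # k == 0: start reading at the HSP start; otherwise move n - k toward the tail
    first = start_a if k_a == 0 else start_a + (n - k_a)
    chars = []
    # step through whole n-position codes; an incomplete trailing code is dropped
    for i in range(first, end_a - n + 1, n):
        chars.append(code_to_char[seq_a[i:i + n]])
    return "".join(chars)
```

For example, with n = 3 and codes `AAA`, `AAC`, `AAG` standing for `x`, `y`, `z`, an HSP starting at position 1 of `AAAAACAAG` (and at position 4 of the other sequence) is read from position 3 onward and restores to `yz`.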

Analysis 3 performs pairwise sequence similarity comparison on the encoded texts, splices the HSPs obtained into long matching fragments that neither overlap nor repeat, calculates a similarity value, and clusters the texts according to that value. The similarity value is calculated from the start and end positions of the long matching fragments by the following formula:

where Similarity(A) denotes the similarity of sequence A with respect to sequence B for any two DNA sequences A and B; Num(hsp) is the number of long matching fragments of sequences A and B after splicing; hsp.begin and hsp.end are the start and end positions of a long matching fragment; and length(A) is the total length of sequence A.
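The formula itself is not reproduced in this text. From the symbol definitions given, one plausible reading is Similarity(A) = Σ (hsp.end − hsp.begin) / length(A), with the sum running over the Num(hsp) long matching fragments. The sketch below implements that reading; it is an assumption, as are the interval-merging rule used for splicing and the half-open (begin, end) convention, and the function names are hypothetical.

```python
def splice(hsps):
    """Merge overlapping, nested, or touching (begin, end) intervals into disjoint fragments."""
    merged = []
    for begin, end in sorted(hsps):
        if merged and begin <= merged[-1][1]:
            # overlaps or is contained in the previous fragment: extend it
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((begin, end))
    return merged


def similarity(hsps_on_a, length_a):
    """Fraction of sequence A covered by spliced long matching fragments (assumed formula)."""
    covered = sum(end - begin for begin, end in splice(hsps_on_a))
    return covered / length_a
```

For instance, HSPs at (0, 10), (5, 15), (20, 30), and (22, 25) on a sequence of length 50 splice into (0, 15) and (20, 30), giving a value of 0.5 under this reading.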

In the method provided by the invention, the characters may be one or more of digits, letters, words, or symbols, and the letters or words may belong to one or more languages. For example, the characters may be one or more of: Roman numerals; Arabic numerals; Chinese characters; English, French, Russian, Arabic, or Spanish words; Japanese words; ASCII characters; and so on. Decimal numbers may be assigned in the order in which characters appear in the text (a character that has already been assigned a number keeps that number when it appears again), or in some other order, for example the order of the characters in a dictionary (including dictionaries of different languages), the standard Chinese code table, or the standard ASCII table. If a text contains both characters that are in the dictionary, standard Chinese code table, or standard ASCII table and characters that are not, decimal numbers may either be assigned entirely in order of appearance in the text, or the characters found in the table may first be numbered in table order and the remaining characters then numbered in order of appearance, continuing from the numbers already assigned. Numbers may be assigned in increasing or decreasing order, provided that every distinct character across all the texts being processed receives a unique decimal number.

In the method provided by the invention, the number of quaternary digits n can be determined from the number of distinct characters in the text. For example, if the text contains 7351 distinct characters, 4^n ≥ 7351 is required to represent all of them, so n should be 7 (4^6 = 4096 < 7351 ≤ 4^7 = 16384).
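The choice of n can be checked with a few lines of Python. The helper name `min_code_length` is hypothetical; it simply returns the smallest n for which 4^n is at least the number of distinct characters, using integer arithmetic to avoid floating-point rounding.

```python
def min_code_length(num_distinct):
    """Smallest n such that 4**n >= num_distinct."""
    n = 1
    while 4 ** n < num_distinct:
        n += 1
    return n


print(min_code_length(7351))  # 7, since 4**6 = 4096 < 7351 <= 4**7 = 16384
```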

In the method provided by the invention, the sequence frequency statistics may count the occurrences of each distinct run of n consecutive DNA sequence codes in the DNA sequence, or of runs longer than n, preferably an integer multiple of n. For example, counting n-position codes yields a sequence frequency table whose horizontal axis lists all possible n-position combinations and whose vertical axis gives the number of times each occurs in the text. Specifically, n-position DNA sequence codes are read consecutively starting from the first position of the DNA sequence, and the occurrences of each are counted until the last position of the DNA sequence is reached. The method supports sequence frequency tables for n-mers with n in [5, 20].
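A minimal sketch of the counting step follows. The description does not state whether consecutive reads overlap, so the `step` parameter is an assumption: `step=1` slides the window one position at a time, while `step=n` reads one character-aligned code at a time. The function name is hypothetical.

```python
from collections import Counter


def sequence_frequency(dna, n, step=1):
    """Count every n-position code read from `dna`, advancing `step` positions per read."""
    return Counter(dna[i:i + n] for i in range(0, len(dna) - n + 1, step))


# Overlapping windows over a short sequence
print(sequence_frequency("ACGACG", 3))
# Character-aligned reads of two identical 7-position codes
print(sequence_frequency("AAAATTCAAAATTC", 7, step=7))
```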

In the method provided by the invention, the correspondence between the quaternary digits 0, 1, 2, 3 and the four deoxyribonucleotides is not specifically restricted, provided that it is one-to-one.

The distance calculation based on the sequence frequency table may use methods well known in the art, for example the generalized distance (new relative entropy) method of Shen Juan et al., which treats the counting result as a frequency vector and defines a distance on that vector. Let the discrete probability distributions of two text spectra be P = (p₁, p₂, p₃, ..., pₙ)ᵀ and Q = (q₁, q₂, q₃, ..., qₙ)ᵀ; the new symmetric relative entropy of P and Q is then defined as:

Clustering of the two or more texts from the distance results can be carried out with conventional software, for example commonly used cluster analysis software such as Cluster, TreeView, or R packages.
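The formula for the new relative entropy of Shen Juan et al. is not reproduced in this text, so it is not reconstructed here. Purely as a stand-in, the sketch below computes the Jensen-Shannon divergence, a different but well-known symmetric measure built from relative entropy that stays finite when some frequencies are zero. It illustrates how a distance can be defined on two frequency vectors P and Q; it is not the method cited above.

```python
from math import log2


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (base 2) between two discrete distributions.

    Stand-in measure only; NOT the 'new relative entropy' cited in the description.
    """
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2  # mixture distribution M = (P + Q) / 2
        if pi > 0:
            total += 0.5 * pi * log2(pi / mi)
        if qi > 0:
            total += 0.5 * qi * log2(qi / mi)
    return total
```

The value is 0 for identical distributions and 1 for distributions with disjoint support, and swapping P and Q leaves it unchanged.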

In the method provided by the invention, high-scoring segment pairs (HSPs) may likewise be obtained with conventional alignment software, for example the globally optimal alignment software BLAT or the locally optimal alignment software BLAST or FASTA. The specific software parameters, the score range of the HSPs, and the HSP splicing parameters can be chosen by those skilled in the art according to the texts being processed and the purpose of the processing. After the similarity values have been obtained, clustering of the two or more texts can also be carried out with conventional software, for example commonly used cluster analysis software such as Cluster, TreeView, or R packages.

The invention further provides a system for carrying out the above method. The text processing system based on DNA sequences provided by the invention comprises a character assignment module and a similarity analysis module.

The character assignment module assigns DNA sequence codes to the characters of two or more texts, identical characters receiving identical DNA sequence codes. The similarity analysis module performs similarity analysis on the encoded texts using DNA sequence processing methods.

In one embodiment of the invention, the character assignment module comprises a decimal number assignment module, a quaternary conversion module, and a DNA sequence code conversion module. The decimal number assignment module assigns a decimal number to each character in the two or more texts, identical characters receiving identical decimal numbers. The quaternary conversion module converts the decimal number of each character into a quaternary number of n digits, where 4^n is at least the total number of distinct characters in the texts and quaternary numbers with fewer than n digits are padded with leading zeros. The DNA sequence code conversion module maps the quaternary digits 0, 1, 2, 3 one-to-one onto the four deoxyribonucleotides and converts the n-digit quaternary number of each character into an n-position DNA sequence code, yielding a DNA sequence for each text.

In one embodiment of the invention, the decimal number assignment module assigns decimal numbers to the characters of the two or more texts in the order in which the characters appear.

In another embodiment of the invention, the similarity analysis module comprises one or more of a sequence spectrum characterization module, a sequence similarity comparison module, and a sequence cluster analysis module.

The sequence spectrum characterization module comprises a statistical module, a computing module, and a clustering module. The statistical module counts sequence frequencies for the encoded texts, pair by pair, to obtain a sequence frequency table. The computing module calculates distances from the sequence frequency table. The clustering module clusters the two or more texts according to the distance results.

The sequence similarity comparison module comprises a comparison module and a restoration module. The comparison module performs pairwise sequence similarity comparison on the encoded texts to obtain high-scoring segment pairs (HSPs). The restoration module determines the similar parts of the texts from the HSPs obtained, which includes restoring each HSP to character information as follows. The start positions of the HSP in the two sequences are each divided by n and the remainders are taken. If the two remainders are not equal, the HSP is skipped and not read. If they are equal to some k with k ≠ 0, the read position is moved n − k positions toward the tail to reach the start of a character, and n-position DNA sequence codes are read consecutively from there and converted into characters. If they are equal with k = 0, n-position DNA sequence codes are read consecutively from the HSP start position and converted into characters. Reading continues to the tail of the DNA sequence; a trailing run of fewer than n codes is discarded and not read.

The sequence cluster analysis module comprises a comparison module, a splicing module, and a clustering module. The comparison module performs pairwise sequence similarity comparison on the encoded texts to obtain HSPs. The splicing module splices the HSPs obtained into long matching fragments that neither overlap nor repeat. The clustering module calculates a similarity value and clusters the two or more texts according to that value. The similarity value is calculated from the start and end positions of the long matching fragments using the same formula as in analysis 3 of the method.

Based on this system, the invention also provides the use of the system in text processing.

The method and system provided by the invention have the following advantages and effects:

1) By converting characters into DNA sequence codes, the method turns a text processing problem into a sequence processing problem, so that sequence processing methods such as sequence spectrum characterization, sequence similarity comparison, and sequence cluster analysis can be applied to text, improving text processing capability.

2) After a text has been converted into a DNA sequence, its information can be analyzed from the perspectives of global optimum, local optimum, and short-sequence frequency, making text analysis more comprehensive. No database is required, so similarity comparison among small-scale texts can be performed directly.

3) In the method provided by the invention, the step of restoring HSPs to character information precisely controls the conversion of DNA sequence codes back into characters, making comparison results directly readable. The method of calculating similarity after splicing can also be applied to any two DNA sequences, and it effectively avoids the problems of containment and interleaving among HSPs in sequence alignment.

4) Analyzing text with the system provided by the invention does not depend on building a database or extracting keywords and is not limited by the number of characters or character combinations. This addresses the problems that existing text processing systems each perform a relatively narrow task, run inefficiently, and do not interoperate, and it helps achieve comprehensive and efficient analysis of text.

Description of the drawings

Fig. 1 shows a text processing system based on DNA sequences provided by the invention.

Fig. 2 shows another text processing system based on DNA sequences provided by the invention.

Fig. 3A is the sequence frequency table of the 20 PIBB texts, and Fig. 3B is the sequence frequency table of the 20 "text" texts.

Fig. 4 shows the result of clustering the 20 PIBB texts and the 20 "text" texts according to the distance results.

Fig. 5 shows the result of clustering the 20 PIBB texts and the 20 "text" texts according to the similarity values.

Detailed description of the embodiments

The technical solutions of the embodiments of the invention are further described below with reference to the drawings and specific embodiments.

Embodiment 1: Converting the characters of a text into a DNA sequence according to the method of the invention, using the character assignment module 11 of the text processing system 10 shown in Fig. 1

A text containing about 7000 distinct characters is chosen; it contains the character string "中文文本挖掘" ("Chinese text mining"). According to the method of the invention, representing all the characters in the text requires 7 quaternary digits. Taking this string as an example, the conversion of characters into a DNA sequence in this embodiment proceeds as follows:

First, the decimal number assignment module 101 assigns each character a decimal number according to its position in the standard Chinese code table: "中" is the 61st entry and is assigned 61; "文" is the 2027th entry and is assigned 2027; "本" is the 1315th entry and is assigned 1315; "挖" is the 1091st entry and is assigned 1091; "掘" is the 1132nd entry and is assigned 1132.

The quaternary conversion module 102 then converts these decimal numbers into 7-digit quaternary numbers according to the rule for converting decimal to quaternary. The resulting sequence of quaternary numbers is "0000331 0133223 0133223 0110203 0101003 0101230".

The quaternary digits are mapped as follows: 0 to A (adenine deoxyribonucleotide), 1 to C (cytosine deoxyribonucleotide), 2 to G (guanine deoxyribonucleotide), and 3 to T (thymine deoxyribonucleotide).

The DNA sequence code conversion module 103 converts the above quaternary sequence into the DNA sequence "AAAATTC ACTTGGT ACTTGGT ACCAGAT ACACAAT ACACGTA".
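The worked example can be checked in the reverse direction with a short sketch that reads the DNA sequence back into decimal numbers. It uses the mapping A→0, C→1, G→2, T→3 set out in this embodiment; the function name is hypothetical.

```python
BASE_TO_DIGIT = {"A": 0, "C": 1, "G": 2, "T": 3}


def dna_to_decimals(dna, n):
    """Split a DNA sequence into n-position codes and return their decimal values."""
    dna = dna.replace(" ", "")  # spaces in the example are only visual separators
    numbers = []
    for i in range(0, len(dna) - n + 1, n):
        value = 0
        for base in dna[i:i + n]:
            value = value * 4 + BASE_TO_DIGIT[base]  # accumulate quaternary digits
        numbers.append(value)
    return numbers


print(dna_to_decimals("AAAATTC ACTTGGT ACTTGGT ACCAGAT ACACAAT ACACGTA", 7))
# [61, 2027, 2027, 1315, 1091, 1132]
```

The six values match the decimal numbers assigned in this embodiment, with 2027 appearing twice because its character occurs twice in the string.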

The whole text containing about 7000 distinct characters can be converted in the same way. Once the characters of the text have been converted into DNA sequence codes, the similarity analysis module 12 can apply sequence processing methods such as sequence spectrum characterization, sequence similarity comparison, and sequence cluster analysis to the text, which lowers the technical threshold for processing small-scale texts.

Embodiment 2: Frequency-spectrum characterization of two or more texts according to the method of the invention, using the text processing system 20 shown in Fig. 2

The clustering objects are the 20 recently published articles from the journal "Progress in Biochemistry and Biophysics" listed in Table 1 (referred to as the 20 PIBB texts) and the 20 articles listed in Table 2 that were retrieved from the China National Knowledge Infrastructure (CNKI) text library with the keyword "text mining" (referred to as the 20 "text" texts):

Table 1

Table 2

Following the method of Embodiment 1, the total number of distinct characters in the 40 texts is first counted; it is 3243. The decimal number assignment module 201 in the character assignment module 21 then assigns decimal numbers to the characters of the 20 PIBB texts in the order in which the characters appear in those texts, and the quaternary conversion module 202 converts the decimal numbers into 7-digit quaternary numbers. Then, according to the chosen correspondence between the quaternary digits 0, 1, 2, 3 and the four deoxyribonucleotides (for example, 0 to A, 1 to C, 2 to G, and 3 to T), the DNA sequence code conversion module 203 converts the 7-digit quaternary numbers into 7-position DNA sequence codes.

The characters of the 20 "text" texts are then converted into DNA sequence codes in the same way. Characters that already occurred in the 20 PIBB texts reuse the DNA sequence codes already assigned. Characters that did not occur in the 20 PIBB texts are assigned decimal numbers in increasing order, and these are converted into 7-digit quaternary numbers and then into 7-position DNA sequence codes. For example, the first character that does not occur in the 20 PIBB texts is μ; the last of the distinct characters in the 20 PIBB texts was assigned the decimal number 2144, so μ is assigned 2145. The remaining characters that do not occur in the 20 PIBB texts are numbered in turn and converted into quaternary numbers and DNA sequence codes.
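The two-stage numbering described here can be sketched as a single helper that is called once per corpus. The function name is hypothetical, and numbering from 1 is an assumption made so that the sketch mirrors the example in which the number after 2144 is 2145.

```python
def extend_code_table(table, texts):
    """Add characters not yet in `table`, in order of appearance, numbering upward."""
    next_number = max(table.values(), default=0) + 1  # continue after the last number used
    for text in texts:
        for ch in text:
            if ch not in table:
                table[ch] = next_number
                next_number += 1
    return table


# First corpus fixes the initial numbering; the second only adds unseen characters.
table = extend_code_table({}, ["ab"])
table = extend_code_table(table, ["bμa", "c"])
print(table)  # {'a': 1, 'b': 2, 'μ': 3, 'c': 4}
```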

The sequence spectrum characterization module 204 in the similarity analysis module 22 then characterizes the frequency spectrum of the DNA sequences obtained from the 20 PIBB texts. In this embodiment one character corresponds to a 7-position DNA sequence code, so the statistical module 2041 counts the distinct 7-position DNA sequence codes in the DNA sequences obtained; the frequencies of distinct 14-position codes could also be counted if required. For the 7-position count, the horizontal axis lists all 4^7 possible 7-position combinations, from "AAAAAAA", "AAAAAAC", ..., "CCCCCCC", ..., "GGGGGGG", through to "TTTTTTT", and the vertical axis gives the number of times each 7-position code occurs in the 20 PIBB texts. This gives the sequence frequency table shown in Fig. 3A. The sequence frequency table of the 20 "text" texts is obtained in the same way and is shown in Fig. 3B.
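The horizontal axis and the corresponding frequency vector can be built as sketched below. Enumerating the alphabet in the order A, C, G, T reproduces the ordering "AAAAAAA", "AAAAAAC", ... given above. Dividing the counts by their total to obtain a probability distribution is an assumption made here so that the vector can feed a distance calculation; the function name is hypothetical.

```python
from collections import Counter
from itertools import product


def frequency_vector(counts, n, alphabet="ACGT"):
    """Return (axis, vector): all 4**n codes in fixed order and their relative frequencies."""
    axis = ["".join(code) for code in product(alphabet, repeat=n)]
    total = sum(counts.get(code, 0) for code in axis)
    vector = [counts.get(code, 0) / total if total else 0.0 for code in axis]
    return axis, vector


axis, vector = frequency_vector(Counter({"AAAAAAA": 1, "TTTTTTT": 3}), 7)
print(len(axis), axis[0], axis[1], axis[-1])  # 16384 AAAAAAA AAAAAAC TTTTTTT
print(vector[0], vector[-1])                  # 0.25 0.75
```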

Based on the above sequence frequency tables, the calculation module 2042 calculates distances using the generalized distance (new relative entropy) method of Shen Juan et al. (see Shen Juan, Wu Wenwu, Xie Xiaoli. A non-sequence-alignment analysis based on the K-tuple distribution of DNA sequences [J]. Hereditas (Beijing), 2010, 32(6): 606).

Let $P = (p_1, p_2, p_3, \ldots, p_n)^T$ and $Q = (q_1, q_2, q_3, \ldots, q_n)^T$ be two discrete probability distributions; the new symmetric relative entropy of P and Q is then defined as:
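The defining formula itself is not reproduced in this text. Purely as a hedged illustration, the sketch below implements one symmetric relative entropy, the sum over i of p_i·log(2p_i/(p_i+q_i)) + q_i·log(2q_i/(p_i+q_i)), which is 0 for identical distributions and 2log(2) for distributions with no common support. That range agrees with the one stated for Fig. 4 below, but the choice of this particular form is an assumption, not a quotation of the cited method, and the function name is hypothetical.

```python
import math


def symmetric_relative_entropy(p, q):
    """Assumed form of a symmetric relative entropy between two discrete
    distributions given as equal-length sequences of probabilities.
    Terms with a zero probability contribute 0 (limit of x*log(x))."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            total += pi * math.log(2 * pi / (pi + qi))
        if qi > 0:
            total += qi * math.log(2 * qi / (pi + qi))
    return total
```

To apply it to two sequence frequency tables, each table would first be normalized so that its counts sum to 1.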

The distance calculation results are shown in Tables 3 to 6. Table 3 gives the distances between texts text_01 to text_10 and each of the 20 texts of the text group and the 20 texts of the PIBB group; Table 4 gives the distances between texts text_11 to text_20 and each of the 20 texts of the text group and the 20 PIBB texts; Table 5 gives the distances between texts PIBB_01 to PIBB_10 and each of the 20 texts of the text group and the 20 PIBB texts; Table 6 gives the distances between texts PIBB_11 to PIBB_20 and each of the 20 texts of the text group and the 20 PIBB texts.

Table 3

	text_01	text_02	text_03	text_04	text_05	text_06	text_07	text_08	text_09	text_10
											text_01	0	0.324	0.316	0.2823	0.3132	0.2693	0.3038	0.2985	0.2915	0.3859
text_02	0.324	0	0.2476	0.2687	0.2881	0.3293	0.2843	0.3146	0.2814	0.3174
											text_03	0.316	0.2476	0	0.2689	0.2568	0.3093	0.2486	0.3039	0.2535	0.2435
text_04	0.2823	0.2687	0.2689	0	0.2754	0.2988	0.2787	0.2807	0.2735	0.3263
											text_05	0.3132	0.2881	0.2568	0.2754	0	0.2911	0.2271	0.1976	0.2352	0.2533
text_06	0.2693	0.3293	0.3093	0.2988	0.2911	0	0.2628	0.2847	0.3053	0.3617
											text_07	0.3038	0.2843	0.2486	0.2787	0.2271	0.2628	0	0.2782	0.2205	0.2643
text_08	0.2985	0.3146	0.3039	0.2807	0.1976	0.2847	0.2782	0	0.3004	0.3681
											text_09	0.2915	0.2814	0.2535	0.2735	0.2352	0.3053	0.2205	0.3004	0	0.2515
text_10	0.3859	0.3174	0.2435	0.3263	0.2533	0.3617	0.2643	0.3681	0.2515	0
											text_11	0.2981	0.2752	0.2286	0.2647	0.203	0.3061	0.228	0.2122	0.2311	0.213
text_12	0.3011	0.2715	0.2187	0.2714	0.1974	0.3007	0.1784	0.2087	0.2291	0.2165
											text_13	0.3224	0.3073	0.2519	0.3013	0.1657	0.3197	0.2295	0.1984	0.2461	0.2404
text_14	0.2562	0.2442	0.1902	0.2391	0.1894	0.2736	0.1863	0.1774	0.207	0.2418
											text_15	0.342	0.2778	0.2239	0.2811	0.2219	0.3467	0.2581	0.2877	0.273	0.2288
text_16	0.253	0.2891	0.2884	0.228	0.2499	0.2449	0.2634	0.2141	0.2691	0.376
											text_17	0.3367	0.3243	0.3434	0.2671	0.2968	0.3108	0.3413	0.2541	0.3368	0.4439
text_18	0.3002	0.2713	0.2851	0.2419	0.2607	0.2517	0.2827	0.2431	0.2892	0.3773
											text_19	0.5157	0.4406	0.3553	0.4932	0.369	0.5281	0.4081	0.4858	0.3816	0.2374
text_20	0.3169	0.2736	0.1994	0.2663	0.2163	0.3174	0.2372	0.315	0.2272	0.1834
											PIBB_01	0.5256	0.548	0.603	0.481	0.5351	0.5077	0.613	0.4103	0.6539	0.7853
PIBB_02	0.4552	0.4888	0.5232	0.4199	0.4692	0.4414	0.527	0.3621	0.57	0.6932
											PIBB_03	0.5769	0.5734	0.6117	0.5114	0.5425	0.5215	0.6154	0.4634	0.6432	0.752
PIBB_04	0.4758	0.4854	0.5441	0.4459	0.4875	0.4498	0.5508	0.3915	0.5916	0.6841
											PIBB_05	0.4999	0.5118	0.568	0.442	0.5055	0.481	0.5793	0.3936	0.6207	0.7421
PIBB_06	0.4891	0.5102	0.5509	0.4543	0.4819	0.4641	0.5469	0.3608	0.5811	0.719
											PIBB_07	0.386	0.3719	0.3987	0.3182	0.356	0.347	0.4069	0.2733	0.4529	0.5474
PIBB_08	0.5176	0.5337	0.5691	0.4739	0.5117	0.4874	0.5755	0.4268	0.6091	0.7156
											PIBB_09	0.4801	0.4873	0.5351	0.4331	0.4847	0.4364	0.5231	0.3873	0.5747	0.6759
PIBB_10	0.5148	0.522	0.557	0.4728	0.5021	0.4883	0.557	0.4312	0.5961	0.7038
											PIBB_11	0.4647	0.4818	0.5354	0.4274	0.4624	0.4317	0.5321	0.356	0.5742	0.6808
PIBB_12	0.4325	0.4313	0.483	0.3818	0.4244	0.3981	0.4728	0.3493	0.516	0.6275
											PIBB_13	0.501	0.5271	0.5831	0.4626	0.5287	0.4823	0.5753	0.4158	0.6284	0.7577
PIBB_14	0.5131	0.5223	0.5523	0.4532	0.4793	0.4782	0.5338	0.3815	0.59	0.7046
											PIBB_15	0.4812	0.495	0.5414	0.443	0.4848	0.4428	0.5364	0.384	0.5879	0.6969
PIBB_16	0.4156	0.4161	0.4455	0.3771	0.4027	0.3767	0.4471	0.3134	0.488	0.6065
											PIBB_17	0.4454	0.4572	0.5019	0.4158	0.452	0.409	0.4758	0.3494	0.5413	0.6355
PIBB_18	0.4962	0.5169	0.5536	0.449	0.5024	0.4647	0.5514	0.3941	0.5931	0.7027
											PIBB_19	0.4693	0.493	0.5411	0.4399	0.4902	0.4358	0.5274	0.3681	0.5846	0.6998
PIBB_20	0.5044	0.5096	0.5437	0.4571	0.4991	0.46	0.5327	0.4018	0.5988	0.6887

Table 4

	text_11	text_12	text_13	text_14	text_15	text_16	text_17	text_18	text_19	text_20
											text_01	0.2981	0.3011	0.3224	0.2562	0.342	0.253	0.3367	0.3002	0.5157	0.3169
text_02	0.2752	0.2715	0.3073	0.2442	0.2778	0.2891	0.3243	0.2713	0.4406	0.2736
											text_03	0.2286	0.2187	0.2519	0.1902	0.2239	0.2884	0.3434	0.2851	0.3553	0.1994
text_04	0.2647	0.2714	0.3013	0.2391	0.2811	0.228	0.2671	0.2419	0.4932	0.2663
											text_05	0.203	0.1974	0.1657	0.1894	0.2219	0.2499	0.2968	0.2607	0.369	0.2163
text_06	0.3061	0.3007	0.3197	0.2736	0.3467	0.2449	0.3108	0.2517	0.5281	0.3174
											text_07	0.228	0.1784	0.2295	0.1863	0.2581	0.2634	0.3413	0.2827	0.4081	0.2372
text_08	0.2122	0.2087	0.1984	0.1774	0.2877	0.2141	0.2541	0.2431	0.4858	0.315
											text_09	0.2311	0.2291	0.2461	0.207	0.273	0.2691	0.3368	0.2892	0.3816	0.2272
text_10	0.213	0.2165	0.2404	0.2418	0.2288	0.376	0.4439	0.3773	0.2374	0.1834
											text_11	0	0.1499	0.1831	0.157	0.2403	0.28	0.3395	0.2879	0.3228	0.1987
text_12	0.1499	0	0.1776	0.1257	0.242	0.2588	0.3368	0.288	0.3586	0.2192
											text_13	0.1831	0.1776	0	0.1781	0.2133	0.2846	0.3508	0.3065	0.3139	0.2417
text_14	0.157	0.1257	0.1781	0	0.2242	0.2167	0.2749	0.2328	0.4008	0.2047
											text_15	0.2403	0.242	0.2133	0.2242	0	0.3277	0.391	0.3156	0.2986	0.2019
text_16	0.28	0.2588	0.2846	0.2167	0.3277	0	0.2288	0.2002	0.5667	0.299
											text_17	0.3395	0.3368	0.3508	0.2749	0.391	0.2288	0	0.2333	0.606	0.3418
text_18	0.2879	0.288	0.3065	0.2328	0.3156	0.2002	0.2333	0	0.5677	0.3077
											text_19	0.3228	0.3586	0.3139	0.4008	0.2986	0.5667	0.606	0.5677	0	0.3051
text_20	0.1987	0.2192	0.2417	0.2047	0.2019	0.299	0.3418	0.3077	0.3051	0
											PIBB_01	0.6135	0.6229	0.6031	0.5246	0.6016	0.4438	0.4201	0.3704	0.9368	0.6621
PIBB_02	0.5336	0.5381	0.5148	0.4514	0.5304	0.3852	0.3649	0.2972	0.8254	0.5806
											PIBB_03	0.6121	0.6294	0.5907	0.5543	0.6008	0.4857	0.4691	0.421	0.8633	0.6604
PIBB_04	0.5489	0.5656	0.5308	0.4803	0.5391	0.4195	0.4016	0.3381	0.8203	0.5891
											PIBB_05	0.5795	0.593	0.5729	0.5009	0.5852	0.4091	0.3811	0.3437	0.9036	0.6336
PIBB_06	0.557	0.5535	0.5274	0.471	0.5664	0.4016	0.3831	0.3395	0.8596	0.6114
											PIBB_07	0.423	0.4175	0.4014	0.3369	0.3971	0.2986	0.2863	0.2374	0.6952	0.4557
PIBB_08	0.5769	0.5906	0.5578	0.5105	0.5686	0.4627	0.4381	0.3751	0.848	0.6196
											PIBB_09	0.5383	0.5444	0.5338	0.4782	0.5374	0.4204	0.3998	0.3413	0.8063	0.5775
PIBB_10	0.5727	0.5731	0.5343	0.5	0.5588	0.4579	0.45	0.3793	0.8378	0.6208
											PIBB_11	0.5381	0.5428	0.5194	0.4675	0.5263	0.4006	0.3748	0.3086	0.8239	0.5741
PIBB_12	0.4958	0.4959	0.476	0.4113	0.4648	0.3548	0.3377	0.2792	0.7557	0.5139
											PIBB_13	0.5906	0.5953	0.5846	0.5173	0.6035	0.4341	0.4106	0.3653	0.9082	0.6507
PIBB_14	0.5429	0.5435	0.5224	0.4702	0.5449	0.4111	0.3933	0.3271	0.8244	0.603
											PIBB_15	0.5571	0.5534	0.5316	0.4797	0.5553	0.4173	0.3872	0.328	0.8407	0.5986
PIBB_16	0.4654	0.4532	0.4321	0.3775	0.4485	0.3459	0.3233	0.2637	0.7285	0.5152
											PIBB_17	0.4986	0.4895	0.4785	0.4364	0.5006	0.4186	0.3861	0.3378	0.7585	0.5592
PIBB_18	0.556	0.5641	0.5399	0.485	0.5579	0.4358	0.4143	0.3723	0.8331	0.601
											PIBB_19	0.5394	0.5477	0.5306	0.4717	0.5492	0.3982	0.3775	0.3273	0.8495	0.5898
PIBB_20	0.5634	0.5603	0.5358	0.487	0.547	0.4388	0.4244	0.3656	0.8226	0.6043

Table 5

	PIBB_01	PIBB_02	PIBB_03	PIBB_04	PIBB_05	PIBB_06	PIBB_07	PIBB_08	PIBB_09	PIBB_10
											text_01	0.5256	0.4552	0.5769	0.4758	0.4999	0.4891	0.386	0.5176	0.4801	0.5148
text_02	0.548	0.4888	0.5734	0.4854	0.5118	0.5102	0.3719	0.5337	0.4873	0.522
											text_03	0.603	0.5232	0.6117	0.5441	0.568	0.5509	0.3987	0.5691	0.5351	0.557
text_04	0.481	0.4199	0.5114	0.4459	0.442	0.4543	0.3182	0.4739	0.4331	0.4728
											text_05	0.5351	0.4692	0.5425	0.4875	0.5055	0.4819	0.356	0.5117	0.4847	0.5021
text_06	0.5077	0.4414	0.5215	0.4498	0.481	0.4641	0.347	0.4874	0.4364	0.4883
											text_07	0.613	0.527	0.6154	0.5508	0.5793	0.5469	0.4069	0.5755	0.5231	0.557
text_08	0.4103	0.3621	0.4634	0.3915	0.3936	0.3608	0.2733	0.4268	0.3873	0.4312
											text_09	0.6539	0.57	0.6432	0.5916	0.6207	0.5811	0.4529	0.6091	0.5747	0.5961
text_10	0.7853	0.6932	0.752	0.6841	0.7421	0.719	0.5474	0.7156	0.6759	0.7038
											text_11	0.6135	0.5336	0.6121	0.5489	0.5795	0.557	0.423	0.5769	0.5383	0.5727
text_12	0.6229	0.5381	0.6294	0.5656	0.593	0.5535	0.4175	0.5906	0.5444	0.5731
											text_13	0.6031	0.5148	0.5907	0.5308	0.5729	0.5274	0.4014	0.5578	0.5338	0.5343
text_14	0.5246	0.4514	0.5543	0.4803	0.5009	0.471	0.3369	0.5105	0.4782	0.5
											text_15	0.6016	0.5304	0.6008	0.5391	0.5852	0.5664	0.3971	0.5686	0.5374	0.5588
text_16	0.4438	0.3852	0.4857	0.4195	0.4091	0.4016	0.2986	0.4627	0.4204	0.4579
											text_17	0.4201	0.3649	0.4691	0.4016	0.3811	0.3831	0.2863	0.4381	0.3998	0.45
text_18	0.3704	0.2972	0.421	0.3381	0.3437	0.3395	0.2374	0.3751	0.3413	0.3793
											text_19	0.9368	0.8254	0.8633	0.8203	0.9036	0.8596	0.6952	0.848	0.8063	0.8378
text_20	0.6621	0.5806	0.6604	0.5891	0.6336	0.6114	0.4557	0.6196	0.5775	0.6208
											PIBB_01	0	0.2332	0.3519	0.2206	0.2151	0.2489	0.2439	0.2646	0.256	0.3243
PIBB_02	0.2332	0	0.2795	0.2091	0.1951	0.2533	0.185	0.2347	0.2105	0.2361
											PIBB_03	0.3519	0.2795	0	0.3089	0.2832	0.3539	0.291	0.3274	0.2945	0.285
PIBB_04	0.2206	0.2091	0.3089	0	0.2341	0.2689	0.2366	0.1492	0.1822	0.269
											PIBB_05	0.2151	0.1951	0.2832	0.2341	0	0.2608	0.2036	0.2746	0.234	0.2741
PIBB_06	0.2489	0.2533	0.3539	0.2689	0.2608	0	0.2418	0.2944	0.2901	0.3122
											PIBB_07	0.2439	0.185	0.291	0.2366	0.2036	0.2418	0	0.2764	0.2257	0.268
PIBB_08	0.2646	0.2347	0.3274	0.1492	0.2746	0.2944	0.2764	0	0.2063	0.2797
											PIBB_09	0.256	0.2105	0.2945	0.1822	0.234	0.2901	0.2257	0.2063	0	0.2507
PIBB_10	0.3243	0.2361	0.285	0.269	0.2741	0.3122	0.268	0.2797	0.2507	0
											PIBB_11	0.1832	0.2055	0.2868	0.1871	0.2127	0.2295	0.2208	0.2323	0.2284	0.2761
PIBB_12	0.2479	0.2051	0.2937	0.2016	0.2267	0.2593	0.171	0.2455	0.2219	0.269
											PIBB_13	0.2417	0.2065	0.3257	0.2313	0.1973	0.281	0.2426	0.2654	0.2302	0.2868
PIBB_14	0.3114	0.2049	0.3091	0.2917	0.2672	0.298	0.2498	0.2927	0.2735	0.2941
											PIBB_15	0.2318	0.174	0.2644	0.193	0.1946	0.2684	0.2093	0.2095	0.1812	0.2305
PIBB_16	0.2817	0.2084	0.3256	0.2544	0.2488	0.2632	0.1572	0.309	0.2751	0.3047
											PIBB_17	0.3411	0.2712	0.3335	0.2822	0.2941	0.3209	0.2059	0.3316	0.2613	0.3374
PIBB_18	0.2286	0.2474	0.3264	0.2577	0.2328	0.29	0.2503	0.2811	0.2538	0.3208
											PIBB_19	0.2523	0.2245	0.3276	0.2121	0.2119	0.2709	0.2207	0.2723	0.2336	0.2949
PIBB_20	0.2389	0.2379	0.3341	0.2407	0.2417	0.3129	0.259	0.2609	0.2384	0.2994

Table 6

	PIBB_11	PIBB_12	PIBB_13	PIBB_14	PIBB_15	PIBB_16	PIBB_17	PIBB_18	PIBB_19	PIBB_20
											text_01	0.4647	0.4325	0.501	0.5131	0.4812	0.4156	0.4454	0.4962	0.4693	0.5044
text_02	0.4818	0.4313	0.5271	0.5223	0.495	0.4161	0.4572	0.5169	0.493	0.5096
											text_03	0.5354	0.483	0.5831	0.5523	0.5414	0.4455	0.5019	0.5536	0.5411	0.5437
text_04	0.4274	0.3818	0.4626	0.4532	0.443	0.3771	0.4158	0.449	0.4399	0.4571
											text_05	0.4624	0.4244	0.5287	0.4793	0.4848	0.4027	0.452	0.5024	0.4902	0.4991
text_06	0.4317	0.3981	0.4823	0.4782	0.4428	0.3767	0.409	0.4647	0.4358	0.46
											text_07	0.5321	0.4728	0.5753	0.5338	0.5364	0.4471	0.4758	0.5514	0.5274	0.5327
text_08	0.356	0.3493	0.4158	0.3815	0.384	0.3134	0.3494	0.3941	0.3681	0.4018
											text_09	0.5742	0.516	0.6284	0.59	0.5879	0.488	0.5413	0.5931	0.5846	0.5988
text_10	0.6808	0.6275	0.7577	0.7046	0.6969	0.6065	0.6355	0.7027	0.6998	0.6887
											text_11	0.5381	0.4958	0.5906	0.5429	0.5571	0.4654	0.4986	0.556	0.5394	0.5634
text_12	0.5428	0.4959	0.5953	0.5435	0.5534	0.4532	0.4895	0.5641	0.5477	0.5603
											text_13	0.5194	0.476	0.5846	0.5224	0.5316	0.4321	0.4785	0.5399	0.5306	0.5358
text_14	0.4675	0.4113	0.5173	0.4702	0.4797	0.3775	0.4364	0.485	0.4717	0.487
											text_15	0.5263	0.4648	0.6035	0.5449	0.5553	0.4485	0.5006	0.5579	0.5492	0.547
text_16	0.4006	0.3548	0.4341	0.4111	0.4173	0.3459	0.4186	0.4358	0.3982	0.4388
											text_17	0.3748	0.3377	0.4106	0.3933	0.3872	0.3233	0.3861	0.4143	0.3775	0.4244
text_18	0.3086	0.2792	0.3653	0.3271	0.328	0.2637	0.3378	0.3723	0.3273	0.3656
											text_19	0.8239	0.7557	0.9082	0.8244	0.8407	0.7285	0.7585	0.8331	0.8495	0.8226
text_20	0.5741	0.5139	0.6507	0.603	0.5986	0.5152	0.5592	0.601	0.5898	0.6043
											PIBB_01	0.1832	0.2479	0.2417	0.3114	0.2318	0.2817	0.3411	0.2286	0.2523	0.2389
PIBB_02	0.2055	0.2051	0.2065	0.2049	0.174	0.2084	0.2712	0.2474	0.2245	0.2379
											PIBB_03	0.2868	0.2937	0.3257	0.3091	0.2644	0.3256	0.3335	0.3264	0.3276	0.3341
PIBB_04	0.1871	0.2016	0.2313	0.2917	0.193	0.2544	0.2822	0.2577	0.2121	0.2407
											PIBB_05	0.2127	0.2267	0.1973	0.2672	0.1946	0.2488	0.2941	0.2328	0.2119	0.2417
PIBB_06	0.2295	0.2593	0.281	0.298	0.2684	0.2632	0.3209	0.29	0.2709	0.3129
											PIBB_07	0.2208	0.171	0.2426	0.2498	0.2093	0.1572	0.2059	0.2503	0.2207	0.259
PIBB_08	0.2323	0.2455	0.2654	0.2927	0.2095	0.309	0.3316	0.2811	0.2723	0.2609
											PIBB_09	0.2284	0.2219	0.2302	0.2735	0.1812	0.2751	0.2613	0.2538	0.2336	0.2384
PIBB_10	0.2761	0.269	0.2868	0.2941	0.2305	0.3047	0.3374	0.3208	0.2949	0.2994
											PIBB_11	0	0.2068	0.227	0.2736	0.1927	0.2525	0.2838	0.2269	0.2116	0.2223
PIBB_12	0.2068	0	0.2501	0.258	0.2099	0.2134	0.2531	0.2573	0.2266	0.2596
											PIBB_13	0.227	0.2501	0	0.2913	0.1951	0.2668	0.3002	0.2335	0.238	0.246
PIBB_14	0.2736	0.258	0.2913	0	0.2445	0.2913	0.3253	0.3078	0.2939	0.3039
											PIBB_15	0.1927	0.2099	0.1951	0.2445	0	0.2495	0.266	0.2173	0.2064	0.2059
PIBB_16	0.2525	0.2134	0.2668	0.2913	0.2495	0	0.2664	0.2751	0.2406	0.285
											PIBB_17	0.2838	0.2531	0.3002	0.3253	0.266	0.2664	0	0.3245	0.2554	0.3318
PIBB_18	0.2269	0.2573	0.2335	0.3078	0.2173	0.2751	0.3245	0	0.2648	0.2406
											PIBB_19	0.2116	0.2266	0.238	0.2939	0.2064	0.2406	0.2554	0.2648	0	0.2886
PIBB_20	0.2223	0.2596	0.246	0.3039	0.2059	0.285	0.3318	0.2406	0.2886	0

The clustering module 2043 is used to cluster the above texts according to the results in Tables 3 to 6, for example with the cluster analysis software R Package, performing hierarchical clustering by the average-linkage method in the conventional way. The clustering uses the data in Tables 3 to 6 and, on the principle that a larger value means a greater distance, presents the relationships among the texts as a dendrogram. The result is shown in Fig. 4, where color represents the distance between two texts: the nearest is 0 and the farthest is 2log(2). As the hierarchical-clustering dendrogram at the top of Fig. 4 shows, the 20 PIBB texts can be separated from the 20 texts of the text group. The upper-left region represents the distances among the 20 PIBB texts; its darker color indicates that the 20 PIBB texts are close to one another. The lower-left region represents the distances from the 20 PIBB texts to the 20 texts of the text group; its lighter color indicates that the two groups are far apart. The upper-right region represents the distances from the 20 texts of the text group to the 20 PIBB texts; its lighter color likewise indicates that the two groups are far apart. The lower-right region represents the distances among the 20 texts of the text group; its darker color indicates that they are close to one another.
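Average-linkage hierarchical clustering can be illustrated with a small pure-Python sketch. It is not the R software used in the embodiment; the function name and data layout are assumptions, and the six distances are taken from Tables 4 to 6 for four of the forty texts.

```python
def average_linkage(names, dist):
    """Agglomerative clustering with average linkage.
    `dist` maps frozenset({a, b}) -> distance between texts a and b.
    Returns the merges in order as (cluster_1, cluster_2, distance)."""
    clusters = [(name,) for name in names]
    merges = []

    def between(c1, c2):
        # Average of all pairwise distances between members of c1 and c2.
        total = sum(dist[frozenset((a, b))] for a in c1 for b in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda ij: between(clusters[ij[0]], clusters[ij[1]]))
        merges.append((clusters[i], clusters[j], between(clusters[i], clusters[j])))
        merged = clusters[i] + clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


# Distances copied from Tables 4 to 6 for four of the texts.
example = {
    frozenset(("text_11", "text_12")): 0.1499,
    frozenset(("PIBB_01", "PIBB_11")): 0.1832,
    frozenset(("text_11", "PIBB_01")): 0.6135,
    frozenset(("text_12", "PIBB_01")): 0.6229,
    frozenset(("text_11", "PIBB_11")): 0.5381,
    frozenset(("text_12", "PIBB_11")): 0.5428,
}
merges = average_linkage(["text_11", "text_12", "PIBB_01", "PIBB_11"], example)
```

On these four texts the two text-group texts merge first, then the two PIBB texts, and the two groups join last at a much larger average distance, which mirrors the separation described for Fig. 4.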

By converting the character information in a text into DNA sequence codes, the method of this embodiment can characterize sequence frequency tables of n-mers for n in [5, 20], whereas existing text spectrum characterization methods can only characterize sequence frequency tables of n-mers for n in [0, 10]. This improves the precision and capability of the spectrum characterization method and makes text processing more efficient.

Embodiment 3: performing sequence similarity alignment on two texts with the method of the invention, using the text processing system 20 shown in Fig. 2

Two texts are chosen arbitrarily from the 40 texts of Embodiment 2 for sequence similarity alignment. The two chosen texts are "Applied research of text mining in the multicultural intercommunion platform" (text_01) and "Text mining progress of protein interaction" (text_02). The sequence similarity alignment process of this embodiment is described in detail below, taking one passage from each of text_01 and text_02 as an example. The passage from text_01 is named the "query text (Query.txt)", and the passage from text_02 is named the "target text (Subject.txt)".

The query text (Query.txt) is:

"Text mining based on concept lattices. Text mining is the discovery of latent concepts, and of the relationships among concepts, from unstructured text. As an effective technique for discovering latent, valuable knowledge in vast Web information resources, Web text mining has received much attention. This paper proposes using concept lattices to extract the latent conceptual relationships implicit in text, and presents the relationships between documents and keywords in text mining through the concept lattice structure. Keywords: text mining; concept lattice; feature extraction"

The target text (Subject.txt) is:

"A grid-based Multi_Agent web text mining system. Enterprise decision-making places increasingly high demands on the speed and accuracy of web text mining. This paper proposes a Multi_Agent web text mining system, based on grid technology, that is capable of parallel processing, and discusses the methods and steps by which the system provides text mining services. Keywords: grid; multi-agent"

Following the method of Embodiment 1, the character allocation module 21 converts the two texts above into the following two DNA sequences:

The DNA sequence obtained by converting the query text (Query.txt) is:

>Query.txt

ATCTAGGCATTGTCCAGTCACCCCTAAGCAAGATAAGGCCAAAGCTTGGCAACATGACTGTTGAAACGCGAACCTTGCAACATGACTGTTGAAACGCGAACCTTGATCCCCAATACCGGCAGGCAACCAAGGAACGTTCCAGGCACTAGCTTGGCAACATGACTGTTGAGCCGGAAAGACCTCGCATACCCATAACCCCTGTAAGCTTGGCCCTAAGCAAGATAAGTTCGTATAGCACCCCTAAGCAAGATACGCATGTAGCTTGGAAAGCCACACGCGGCATCAGAACTGCTGCCTCTGAATGCCGCCAGCCCAATACCGGCAAGGTCACAGGTGAGCTTGGAATTCATAGTGACCCACAGGCACGTATCATATGCACCCACAGCCCTTACAGTATGTAGCCGGAAAGACCTCGCATACCCATAACCCCTGTAAGCTTGGAATGACGATTCGGAACCTGCAAAGACATAAAACCGCCAATGCAGCTTGGATTCGGAAACTCTCACTAGCGATCGAGCATCGGAAAATTCATAGTGACCCACAGGCACGTATCCAACATGACTGTTGAAACGCGAACCTTGCGACAACCCACGAGCCTATCGCATCAGAAGGAAGACCTCTGACAACATGAGCCGGACAGTCTGCCCACCAATGTATCAATACTTATTACTCCCCTAAGCAAGATAAGGCCAACAAGGGTCAGCCCCCACACCCCAAATCCCCTTCAGCCCTGTACAACATGACTGTTGAGCCGGACCATAACCCCTGTAAGCTTGGCCCTAAGCAAGATACATCAGAACTGCTGATCGGAAACAATTACAACATGACTGTTGAAACGCGAACCTTGAGCCGGACAACATGACCTTGGACCGCATCATCAGAAATCAAACAGAGAAACGCGCTCGCATGTAGCTTGGCATCAGAACTGCTGCGACACAAACAAGGCCCTAAGCAAGATAAGGCCAACCAAGGAACGTTCCAGTGCTGCGCATACCCCACCACAAGGGTCCTCTGACATCAGAAATCAAACAGAGAAACAGTCACAACATGACTGTTGAAACGCGAACCTTGAGTGTGGCCCTAAGCAAGATAAGGCCAAAGTGTGGAGACACGATATCAACAGCCCCCACACCC

The DNA sequence obtained by converting the target text (Subject.txt) is:

>Subject.txt

ATCTAGGCATTGTCCAGTCACCCCAAGCAGGCCAAAGCTTGGATGAAGCCAACGTTACTAGAGAGCGATGATATATGCCGGTAAAGGATGTCGCAACAAGTGACCCCATACCAGCGATGACGTATCAATTCATAGTGACCCACAGGCACGTATCCAACATGACTGTTGAAACGCGAACCTTGACTGCTGCAAGACTCCGGCTTAGACTTCATAGGGGATAAAGTCAAAGTGAATTCATAGTGACCCACAGGCACGTATCCAACATGACTGTTGAAACGCGAACCTTGAGCTTGGACGCGACCCCATTCAACCACGAGTGTTGACGAGAACGATTTCAGCTTGGCCCCAGGCAACCCCAAACTTGCAAGGGTAAACTTGCATCTGACCTCTGAACTGTTGCAACATGCAGTCTGCCCACCAATGTATCCAAACATAGGACCACATTGTCCAGTCACCCCAAGCAGGCCAAACTAGCGATCGAGCAGCTTGGACCGGTACACTTTAAGCTAAGCGATGTCCATGAAAATAGCTCAGCTTGGATGAAGCCAACGTTACTAGAGAGCGATGATATATGCCGGTAAAGGATGTCGCAACAAGTGACCCCATACCAGCGATGACGTATCAATTCATAGTGACCCACAGGCACGTATCCAACATGACTGTTGAAACGCGAACCTTGACTGCTGCAAGACTCCTCTGAAGCTAAGCGACCTTCATACACATGTATCCGCCCATACTGCTGCAAGACTCAGTCTGAACCCTGCAACATGACTGTTGAAACGCGAACCTTGCACCGGCATCTCCCAGCTTGGCAAGAAAAGGGATTAACCACGCATACGTCCGATGGCCTCTGACATCAGAAATCAAACAGAGAAACAGTCACCCAAGCAGGCCAAAGTGTGGATATCCTAGGATGTCGCAACAAGTGACCCCATACCAGCGATG

The alignment module 2051 in the sequence similarity alignment module 205 is then used to align the two DNA sequences. The alignment module 2051 may be the global-optimization alignment software BLAT, or local-optimization alignment software such as BLAST or FASTA. This embodiment uses the global-optimization alignment software BLAT, with the parameters set as follows: -query (query text): Query.txt; -subject (target text): Subject.txt; -perc_identity (percentage of exact matching): 100; -dust (dust sequence filtering): disabled; -gapopen (gap opening): 1000; -gapextend (gap extension): 1000.

The DNA sequence corresponding to the target text is named the target sequence, and the DNA sequence corresponding to the query text is named the query sequence. Alignment yields the HSP fragments of the two DNA sequences, as shown in Table 7, where: query id is the id of the query sequence; subject id is the id of the target sequence; %identity is the match ratio of the high-scoring segment pair (HSP), i.e. paired fragment length / HSP length, which is reported because one or two bases in the HSP may be unpaired; alignment is the total length of the HSP fragment; mismatches is the number of frameshift-mismatched bases, and since frameshift mismatches must be avoided in the system, the alignment parameter -perc_identity (percentage of exact matching): 100 is used to prevent them from occurring; gap opens is the number of gaps in the alignment, which must likewise be avoided through the alignment parameters -gapopen (gap opening): 1000 and -gapextend (gap extension): 1000; q.start and q.end are respectively the start and end positions of the HSP in the query sequence; s.start and s.end are respectively the start and end positions of the HSP in the target sequence; and bit score is the overall score of the HSP.

Table 7

The restoration module 2052 is used to restore the HSPs obtained above to character information, so as to identify the similar parts of the two sample texts. The start positions of each HSP obtained by the pairwise sequence similarity alignment, in the query sequence and in the target sequence, are each divided by 7 and the remainders taken. If the two remainders are not equal, the HSP is skipped and not read. If they are equal, with remainder k and k ≠ 0, the position 7-k further on is taken as the head of the character string, and from there 7-position DNA sequence codes are read consecutively and converted into characters. If they are equal, with remainder k and k = 0, 7-position DNA sequence codes are read consecutively from the HSP start position and converted into characters. Reading continues to the tail of the DNA sequence; trailing DNA sequence codes fewer than 7 positions long are discarded and not read. The result after restoration is shown in Table 8:
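The restoration rule can be sketched as follows. The sketch assumes that start positions are 0-based offsets, so that a remainder of 0 means the HSP begins exactly on a character boundary; the coordinate convention of the alignment output is not stated in the text. The function name and the three-character codebook are hypothetical.

```python
def restore_hsp(hsp, q_start, s_start, code_to_char, n=7):
    """Restore the matched DNA fragment `hsp` to characters.
    q_start and s_start are the HSP start offsets in the query and target
    sequences (assumed 0-based). Returns None when the HSP is skipped."""
    kq, ks = q_start % n, s_start % n
    if kq != ks:
        return None                      # different reading frames: skip
    offset = 0 if kq == 0 else n - kq    # move on to the next character head
    chars = []
    # Read whole n-position codes; a shorter tail is discarded.
    for i in range(offset, len(hsp) - n + 1, n):
        chars.append(code_to_char[hsp[i:i + n]])
    return "".join(chars)


# Hypothetical three-character codebook for illustration only.
codebook = {"AAAAAAA": "a", "AAAAAAC": "b", "AAAAAAG": "c"}
sequence = "AAAAAAA" + "AAAAAAC" + "AAAAAAG"   # encodes "abc"
```

With this codebook, an HSP covering `sequence[5:]` at offset 5 in both sequences skips the two leftover bases of "a" and restores "bc".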

Table 8

The similar parts of the target text and the query text can be seen directly from Table 8. Likewise, for the complete text_01 and text_02, the similar fragments between the whole texts can be determined by the same method. The method of this embodiment can use different DNA sequence alignment software to analyze texts from the perspectives of global optimum, local optimum and so on, making the analysis of texts more comprehensive. It does not depend on building a database, so similarity alignment between small-scale texts can be carried out directly. By precisely controlling the restoration of DNA sequence codes to characters, HSPs can be restored to character information, which makes the alignment results intuitive and readable.

Embodiment 4: performing sequence cluster analysis on two texts with the method of the invention, using the text processing system 20 shown in Fig. 2

The 40 texts of Embodiment 2 are used, and characters are converted into DNA sequence codes as in Embodiment 2. The alignment module 2061 in the sequence cluster analysis module 206 then performs pairwise alignment of the resulting DNA sequences, using the same alignment method as in Embodiment 3.

After alignment, the splicing module 2062 splices the HSPs obtained into long matching fragments with no overlap and no repetition. After splicing, the clustering module 2063 calculates similarity values and clusters the two or more texts based on those similarity values.

The similarity value is calculated by a formula from the start and end positions of the long matching fragments, where Similarity(A) denotes, for any two DNA sequences A and B, the similarity of sequence A relative to sequence B; Num(hsp) is the total number of long matching fragments of sequence A and sequence B after splicing; hsp.begin and hsp.end are respectively the start and end positions of a long matching fragment of sequence A and sequence B; and length(A) is the total length of sequence A.

The splicing and clustering process is described in detail below, taking the target text and query text analyzed in Embodiment 3 as an example. The alignment result for the target sequence and query sequence corresponding to the target text and query text is shown in Table 7. According to Table 7, the HSPs of the two DNA sequences are spliced separately. First, the tuples formed from the start and end positions of each HSP are sorted by their first element, generating a tuple sequence in ascending order of first element. The sorted tuples of the query sequence are: (49,79), (50,77), (50,77), (50,77), (78,106), (78,106), (78,106), (526,581), (526,581), (526,581), (554,582), (820,847), (820,848), (820,848), (820,850), (1009,1044), (1044,1071), (1044,1072), (1044,1072), (1044,1073). The sorted tuples of the target sequence are: (127,182), (155,182), (155,183), (155,183), (155,183), (232,287), (260,287), (260,288), (260,289), (260,290), (624,679), (652,679), (652,680), (652,680), (652,680), (763,793), (764,791), (764,791), (764,792), (848,883).

The splicing module 2062 is then used to splice the sorted tuples of the HSPs into long matching fragments with no overlap and no repetition. The tuple of the first HSP (representing the earliest position of the matching sequence) is taken as the base and spliced with the subsequent tuples in turn: A) when the next tuple is contained within, or identical to, the range of the previous tuple, no splicing is performed; B) when the next tuple overlaps the range of the previous tuple and extends it, the range of the previous tuple is updated with the extension and the updated tuple is saved in the tuple list; C) when the next tuple has no intersection with the range of the previous tuple, the previous tuple is saved in the long matching fragment list as a long matching fragment with no overlap and no repetition, and the next tuple is taken as the new base for repeating the above steps with the subsequent tuples.
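Rules A) to C) amount to merging sorted intervals, which can be sketched as below. The function name is hypothetical, and treating two tuples that share an end position, such as (1009,1044) and (1044,1071), as overlapping is an assumption inferred from the example data.

```python
def splice_hsps(tuples):
    """Merge (start, end) tuples into long matching fragments with no
    overlap and no repetition."""
    ordered = sorted(tuples)             # ascending by first element
    if not ordered:
        return []
    fragments = []
    cur_start, cur_end = ordered[0]      # the first tuple is the base
    for start, end in ordered[1:]:
        if start <= cur_end:
            # A) contained or identical: max() leaves the range unchanged.
            # B) overlapping and longer: max() extends the range.
            cur_end = max(cur_end, end)
        else:
            # C) no intersection: save the fragment and start a new base.
            fragments.append((cur_start, cur_end))
            cur_start, cur_end = start, end
    fragments.append((cur_start, cur_end))
    return fragments


query_hsps = [
    (49, 79), (50, 77), (50, 77), (50, 77), (78, 106), (78, 106), (78, 106),
    (526, 581), (526, 581), (526, 581), (554, 582),
    (820, 847), (820, 848), (820, 848), (820, 850),
    (1009, 1044), (1044, 1071), (1044, 1072), (1044, 1072), (1044, 1073),
]
query_fragments = splice_hsps(query_hsps)
```

Applied to the sorted tuples of the query sequence, the sketch yields four fragments: (49,106), (526,582), (820,850) and (1009,1073).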

The tuples of all HSPs in the query sequence and in the target sequence are spliced separately by the above method. The long matching fragment list obtained by splicing the tuples of all HSPs in the query sequence is: (49,106), (526,582), (820,850), (1009,1073).

The long matching fragment list obtained by splicing the tuples of all HSPs in the target sequence is: (127,183), (232,290), (624,680), (763,791), (848,883). The clustering module 2063 then uses the information of the long matching fragments obtained to calculate, according to the formula, the similarity of the query sequence relative to the target sequence.

The results are Similarity(Query) (the similarity of the query sequence relative to the target sequence) = 3.733% and Similarity(Subject) (the similarity of the target sequence relative to the query sequence) = 3.280%. It can be seen that the two texts have a certain similarity.

Following the processing steps applied above to the target text and query text, pairwise similarity is calculated for the complete text_01 and text_02 and the remaining texts among the 40 texts. The results are shown in Tables 9 to 12: Table 9 gives the similarity values among the 20 PIBB texts; Table 10 gives the similarity values of the 20 PIBB texts relative to the 20 texts of the text group; Table 11 gives the similarity values of the 20 texts of the text group relative to the 20 PIBB texts; and Table 12 gives the similarity values among the 20 texts of the text group:

Table 9

Table 10

Table 11

Table 12

Based on the data in Tables 9 to 12, the clustering module 2063 clusters the above texts by the same clustering method as in Embodiment 2; the result is shown in Fig. 5, where grey indicates similarity (positive) and black indicates no similarity (negative). The figure clearly shows that the 20 PIBB texts are separated from the 20 texts of the text group. The upper-left region represents the similarity values of the 20 PIBB texts relative to the 20 texts of the text group; it is overwhelmingly black, indicating almost no similarity. The lower-left region represents the similarity values among the 20 PIBB texts; it is grey, indicating high similarity. The upper-right region represents the similarity values among the 20 texts of the text group; it is grey, indicating high similarity. The lower-right region represents the similarity values of the 20 texts of the text group relative to the 20 PIBB texts; it is overwhelmingly black, indicating almost no similarity.

Clustering texts with the method and system of this embodiment does not rely on keyword extraction. This avoids the accumulation of keyword-extraction errors, which leads to loss of text analysis information, and makes the analysis of texts comprehensive and accurate.

Although the present invention has been disclosed above by way of embodiments, they are not intended to limit the invention. Any person skilled in the art may make changes or equivalent substitutions without departing from the spirit and scope of the invention; the scope of protection of the invention shall therefore be as defined by the claims of this application.

Claims

1. A text processing method based on DNA sequences, characterized by comprising:

assigning DNA sequence codes to the characters of two or more texts, wherein identical characters in the texts are assigned identical DNA sequence codes;

performing similarity analysis, using a DNA sequence processing method, on the two or more texts to which DNA sequence codes have been assigned;

2. The method according to claim 1, wherein assigning DNA sequence codes to the characters of two or more texts comprises:

assigning decimal numbers to the characters in the two or more texts, wherein identical characters in the texts are assigned identical decimal numbers;

converting the decimal number corresponding to each character in the two or more texts into a quaternary number, wherein the number of digits of the quaternary number is n, 4^n is greater than the total number of mutually different characters in the texts, and a quaternary number with fewer than n digits is padded with 0 at its front end;

making each of 0, 1, 2 and 3 in the quaternary numbers correspond to one of the four deoxyribonucleotides, and converting the n-digit quaternary number corresponding to each character in the two or more texts into an n-position DNA sequence code, thereby obtaining the DNA sequence corresponding to each text.

3. The method according to claim 2, wherein assigning decimal numbers to the characters in the two or more texts is: assigning decimal numbers to the characters according to their order of appearance in the two or more texts.
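The encoding recited in claims 2 and 3 can be illustrated with a minimal Python sketch. It is not the patented implementation. The function names, the numbering starting at 1, and the mapping 0→A, 1→C, 2→G, 3→T are assumptions made for illustration only; the claims require merely that each quaternary digit correspond to one of the four nucleotides.

```python
# Assumed digit-to-nucleotide mapping: 0->A, 1->C, 2->G, 3->T (illustrative only).
NUCLEOTIDES = "ACGT"


def build_code_table(texts):
    """Assign decimal numbers to characters in order of first appearance.

    Numbering starts at 1 (an assumption); identical characters share a number.
    """
    table = {}
    for text in texts:
        for ch in text:
            if ch not in table:
                table[ch] = len(table) + 1
    return table


def code_length(num_distinct):
    """Return the smallest n such that 4**n is greater than num_distinct."""
    n = 1
    while 4 ** n <= num_distinct:
        n += 1
    return n


def to_dna(number, n):
    """Convert a decimal number to an n-digit quaternary number, written as DNA.

    Leading positions are padded with the nucleotide that stands for 0.
    """
    letters = []
    for _ in range(n):
        number, digit = divmod(number, 4)
        letters.append(NUCLEOTIDES[digit])
    return "".join(reversed(letters))


def encode_texts(texts):
    """Encode every text as a DNA sequence using one shared code table."""
    table = build_code_table(texts)
    n = code_length(len(table))
    sequences = ["".join(to_dna(table[ch], n) for ch in text) for text in texts]
    return sequences, table, n


if __name__ == "__main__":
    sequences, table, n = encode_texts(["abca", "cd"])
    print(n, table, sequences)
```

Because the code table is built over all texts together, the same character always maps to the same n-position code, which is what makes the resulting sequences comparable.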

4. The method according to any one of claims 1-3, wherein performing similarity analysis, using a DNA sequence processing method, on the two or more texts to which DNA sequence codes have been assigned comprises performing one or more of the following analyses 1 to 3 on the texts:

said analysis 1 is: counting, pairwise between texts, the sequence frequencies of the two or more texts to which DNA sequence codes have been assigned, to obtain a sequence frequency table; then calculating distances based on said sequence frequency table, and clustering the two or more texts based on the distance calculation result;

said analysis 3 is: comparing, pairwise between texts, the sequence similarity of the two or more texts to which DNA sequence codes have been assigned; splicing the high-scoring matching fragments obtained into long matching fragments that neither cross nor repeat; then calculating similarity values, and clustering the two or more texts according to said similarity values; wherein calculating the similarity values comprises calculating a similarity value from the position information of the start point and end point of each said long matching fragment by the following formula:

Similarity(A) = \sum_{i=1}^{Num(Hsp)} \frac{Hsp_i.End - Hsp_i.Begin + 1}{Length(A)},

wherein, for any two DNA sequences A and B, Similarity(A) denotes the similarity of sequence A with respect to sequence B; Num(Hsp) is the total number of long matching fragments between sequence A and sequence B after splicing; Hsp_i.Begin and Hsp_i.End are respectively the positions of the start point and end point of the i-th long matching fragment of sequence A and sequence B; and Length(A) is the total length of sequence A.
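As a worked illustration of the formula in claim 4, the following Python sketch sums the lengths of the long matching fragments and divides by the length of sequence A. The function name and the representation of fragments as 1-based, inclusive (begin, end) pairs on sequence A are assumptions for illustration; the fragments are taken to be already spliced, so they do not overlap.

```python
def similarity(fragments, length_a):
    """Similarity of sequence A with respect to sequence B.

    fragments: iterable of (begin, end) pairs, 1-based and inclusive,
               giving the long matching fragments on sequence A (assumed
               to be non-crossing and non-repeating).
    length_a:  total length of sequence A.
    """
    if length_a <= 0:
        raise ValueError("length_a must be positive")
    matched = sum(end - begin + 1 for begin, end in fragments)
    return matched / length_a


if __name__ == "__main__":
    # Two fragments of 10 positions each on a sequence of length 40.
    print(similarity([(1, 10), (21, 30)], 40))
```

Since the denominator is Length(A), the value is not symmetric in general: Similarity(A) and Similarity(B) differ whenever the two sequences have different lengths.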

5. A text processing system based on DNA sequences, comprising a character distribution module and a similarity analysis module;

said character distribution module is used to assign DNA sequence codes to the characters of two or more texts, wherein identical characters in the texts are assigned identical DNA sequence codes;

said similarity analysis module is used to perform similarity analysis, using a DNA sequence processing method, on the two or more texts to which DNA sequence codes have been assigned;

6. The system according to claim 5, wherein said character distribution module comprises a decimal number distribution module, a quaternary number modular converter and a DNA sequence code modular converter;

said decimal number distribution module is used to assign decimal numbers to the characters in the two or more texts, wherein identical characters in the texts are assigned identical decimal numbers;

said quaternary number modular converter is used to convert the decimal number corresponding to each character in the two or more texts into a quaternary number, wherein the number of digits of the quaternary number is n, 4^n is greater than the total number of mutually different characters in the texts, and a quaternary number with fewer than n digits is padded with 0 at its front end;

said DNA sequence code modular converter is used to make each of 0, 1, 2 and 3 in the quaternary numbers correspond to one of the four deoxyribonucleotides, and to convert the n-digit quaternary number corresponding to each character in the two or more texts into an n-position DNA sequence code, thereby obtaining the DNA sequence corresponding to each text.

7. The system according to claim 6, wherein said decimal number distribution module is used to assign decimal numbers to the characters according to their order of appearance in the two or more texts.

8. The system according to any one of claims 5-7, wherein said similarity analysis module comprises one or more of a sequence spectrum portrayal module, a sequence similarity comparing module, or a sequence cluster analysis module.

9. The system according to claim 8, characterized in that

said sequence cluster analysis module comprises a comparing module, a concatenation module and a cluster module; said comparing module is used to compare, pairwise between texts, the sequence similarity of the two or more texts to which DNA sequence codes have been assigned, to obtain high-scoring matching fragments; said concatenation module is used to splice the high-scoring matching fragments obtained into long matching fragments that neither cross nor repeat; said cluster module is used to calculate similarity values and to cluster the two or more texts according to said similarity values; wherein calculating the similarity values comprises calculating a similarity value from the position information of the start point and end point of each said long matching fragment by the following formula:

Similarity(A) = \sum_{i=1}^{Num(Hsp)} \frac{Hsp_i.End - Hsp_i.Begin + 1}{Length(A)},
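The splicing step performed by the concatenation module is not spelled out in the claim. One plausible reading, shown here only as a hedged sketch, is a standard interval merge: duplicate fragments are dropped and overlapping fragments are fused, so that the result neither crosses nor repeats. The function name and the use of 1-based, inclusive (begin, end) coordinates on sequence A are assumptions, and the actual module may splice differently.

```python
def merge_fragments(hsps):
    """Merge high-scoring matching fragments into non-overlapping long fragments.

    hsps: iterable of (begin, end) pairs, 1-based and inclusive, on sequence A.
    Returns a sorted list of (begin, end) pairs with no overlaps or duplicates.
    """
    merged = []
    # set() removes exact repeats; sorting orders fragments by start position.
    for begin, end in sorted(set(hsps)):
        if merged and begin <= merged[-1][1]:
            # Fragment overlaps the previous one: extend it instead of adding.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([begin, end])
    return [tuple(fragment) for fragment in merged]


if __name__ == "__main__":
    print(merge_fragments([(5, 12), (1, 8), (20, 25), (20, 25)]))
```

Merging first matters because the formula sums fragment lengths: without it, positions covered by several overlapping fragments would be counted more than once and the similarity value could exceed 1.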

10. Use of the system according to any one of claims 5-9 in text processing.