CN106095791A - A context-based abstract-sample information retrieval system and its abstract-sample feature representation method - Google Patents

A context-based abstract-sample information retrieval system and its abstract-sample feature representation method Download PDF

Info

Publication number
CN106095791A
Authority
CN
China
Prior art keywords
abstract
word
sample
term vector
barycenter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610369833.4A
Other languages
Chinese (zh)
Other versions
CN106095791B (en)
Inventor
吴琳
韩广
袁鑫攀
李亚楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changyuan Power (Beijing) Technology Co., Ltd.
Original Assignee
Changyuan Power (Shandong) Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changyuan Power (Shandong) Technology Co., Ltd.
Publication of CN106095791A publication Critical patent/CN106095791A/en
Application granted granted Critical
Publication of CN106095791B publication Critical patent/CN106095791B/en
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The present invention provides a context-based abstract-sample information retrieval system. In this system, the abstract-sample feature representation method first uses Word2vector to extract word-sense features, obtaining a word vector for each abstract word. The word vectors of the abstract words are then clustered under an "optimal-fitness partition", and according to the clustering result each abstract word is replaced by the centroid of its cluster. Finally, from the centroids and the frequencies of the abstract words they represent, a word-vector cluster centroid frequency model (ST-IDF) is built to represent an abstract sample as a feature vector. The invention reduces the number of clustering runs and fitness evaluations, improves the performance of abstract-sample similarity analysis, and raises sample classification accuracy.

Description

A context-based abstract-sample information retrieval system and its abstract-sample feature representation method
Technical field
The present invention relates to the field of information retrieval over data-link messages, semi-structured text, and plain text, and in particular to word-vector (Word2vector) based sample similarity analysis and classification.
Background technology
An abstract word is a special word in an information retrieval sample that cannot be understood directly as language, i.e., its actual meaning cannot be recognized from known linguistic rules (word sense, grammar, word order). Abstract words occur widely in information retrieval samples, for example military data-link messages (Link-16, Link-22) and the semi-structured text (XML) or plain text used for data exchange. Moreover, many data-link messages, semi-structured texts, and plain texts record their information entirely in abstract words. We refer to such messages or texts in an information retrieval task as abstract samples.
At present, since the semantics of abstract words cannot be recognized directly, abstract samples in information retrieval tasks are mostly handled with sample feature representation methods based on word statistics, such as the TF-IDF (Term Frequency-Inverse Document Frequency) model and the BOW (Bag of Words) model. These word-statistics representations cannot effectively extract word-sense (semantic) features.
Word2vector is a method for extracting word-sense (semantic) features from context, first proposed by Mikolov et al. in early 2013 in a Google open-source project. When documents serve as information retrieval samples, Word2vector can efficiently extract, for each word in each document, a semantic (word-sense) feature from its context, given in the form of a word vector. Note, however, that the word-sense extraction mechanism of Word2vector assigns different word vectors to the same word in different documents. This makes it difficult to build a feature representation of a retrieval sample from Word2vector word vectors, and in particular to form a sample feature representation in VSM (vector space model) form.
A feature representation for abstract samples therefore needs to use Word2vector as a context-based word-sense extraction method while remaining compatible with existing retrieval algorithms based on sample feature vectors. However, no widely recognized method yet forms a VSM-style abstract-sample feature representation from Word2vector word-sense features.
Hence there is an urgent need for a context-based abstract-sample information retrieval system and a corresponding abstract-sample feature representation method that solve the above problems.
Summary of the invention
For information retrieval applications, the invention provides a context-based abstract-sample information retrieval system and describes its feature representation method in detail. The object of the invention is to overcome the difficulty in the prior art of forming a sample feature representation from Word2vector word vectors, and to solve the problem of word-sense feature extraction in abstract-sample feature representation.
A context-based abstract-sample information retrieval system comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement representation module, an ST-IDF module, and a classification module. The abstract-sample feature representation method of the system comprises the following steps:
Step 1: segment the sample into abstract words with the segmentation module. When the sample is a data-link message, the abstract words can be separated according to the message format and word length; when the sample is text, according to spaces and specific segmentation rules.
Step 2: extract the semantic features of the abstract words with the word-sense feature extraction module. For the abstract words obtained in step 1, the Word2vector method extracts word-sense features from the context of each abstract word and expresses them as word vectors.
Step 3: replace the abstract-word features with the replacement representation module. First, using the cluster count under the optimal clustering-effect fitness, the word vectors obtained in step 2 are clustered with the K-means algorithm, i.e., an "optimal-fitness partition" of the abstract-word vectors is realized. Here the centroid of a word-vector cluster is denoted S (a vector in word-vector space); the number k of centroids S is the number of clusters; N is the number of abstract words in all samples; C is the number of known sample classes; and f(k) is the clustering-effect fitness function,

f(k) = α / β,  N ≤ k ≤ N × C,

where α is the mean cosine distance between the k centroid vectors S, and β is the average, over the k clusters, of the mean cosine distance between the word vectors within a cluster; k is a positive integer with k ∈ [N, N × C]. When f(k) = max(f(k)), the cluster count under the optimal clustering-effect fitness is K = k, and the number of centroids S is finally fixed at K. Then, according to the final clustering result, each abstract word is replaced by the centroid S of the cluster containing its word vector; equivalently, the centroid S represents the abstract words in its cluster, and the feature of an abstract word is approximated by the centroid of its cluster.
Step 4: output the abstract-sample feature representation with the ST-IDF module. First, count the frequency of each abstract word in a sample; by the replacement relation of step 3, the occurrence frequency in the sample of the abstract words represented by a centroid S is counted as the frequency of S, and the inverse document frequency of each cluster centroid is computed. Then, by analogy with the TF-IDF model, the word-vector cluster centroid frequency model ST-IDF is built; the ST-IDF model is in VSM form and serves as the feature representation of an abstract sample.
Step 5: similarity computation for abstract-sample similarity analysis. Using the feature representation of step 4, compute the similarity between two abstract samples, on which sample classification algorithms of the information retrieval field can then operate.
Step 6: judge the class of the feature-represented abstract sample with the classification module, applying the NWKNN algorithm to the similarities.
The beneficial effects of the present invention are as follows:
The invention proposes a context-based information retrieval system and an abstract-sample feature representation method with improvements in two respects: (1) an optimal clustering-effect fitness partition algorithm, with abstract-word replacement representation based on word-vector clustering under the optimal fitness; (2) the word-vector cluster centroid frequency model ST-IDF for abstract-sample feature representation.
The invention first extracts word-sense features with Word2vector, obtaining word vectors for all abstract words in the samples. It then proposes the optimal clustering-effect fitness partition algorithm, performs K-means clustering of the abstract-word vectors under the optimal clustering-effect fitness, and, according to the clustering result, replaces each abstract word by the centroid (denoted S) of the cluster containing its word vector. Finally, the in-sample occurrence frequency of the abstract words represented by a centroid is counted as the frequency of that centroid S, and the word-vector cluster centroid frequency model ST-IDF is built as the feature representation of an abstract sample. Compared with traditional word-statistics feature representations, the ST-IDF model incorporates the word-sense features of abstract words and is in VSM (vector space model) form, so it suits existing feature-vector-based retrieval algorithms (classification, regression, clustering).
Empirically, using NWKNN, a classic sample classification algorithm of the information retrieval field, comparative experiments between the ST-IDF and TF-IDF models on the public data sets Reuter-21578 and Wikipedia XML objectively demonstrate the clear advantage of the proposed method: it improves the accuracy of abstract-sample similarity computation, raises abstract-sample classification accuracy, and effectively extends the construction of vector space models in information retrieval.
Brief description of the drawings
Fig. 1 is the data and module diagram of the abstract-sample information retrieval system of the invention.
Fig. 2 is the flow chart of the information retrieval method of the invention.
Fig. 3 is a schematic diagram of the basic principle of the Word2vector method.
Fig. 4 is a plot of the clustering-effect fitness function.
Fig. 5 is a schematic diagram of the replacement relation in word-vector space according to the clustering.
Detailed description of the invention
The present invention is further described below with reference to the drawings and embodiments.
As shown in Fig. 1, the context-based abstract-sample information retrieval system of the invention comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement representation module, an ST-IDF module, and a classification module.
The abstract-sample feature representation method of the abstract-sample information retrieval system comprises the following steps:
Step 1: segment the sample into abstract words with the segmentation module. When a sample records its information entirely in abstract words, segmentation cannot rely on a dictionary or lexicon, so this step simply treats each abstract word as a string of ASCII characters. When the sample is a data-link message, the abstract words are separated according to the message format and word length; when the sample is text, according to spaces and specific segmentation rules. The segmented abstract words are denoted word_{i,t}, the t-th kind of abstract word in the i-th sample, with i = {1, 2, ..., |D|}, where |D| is the number of samples in data set D, t = {1, 2, ..., n}, n is the number of kinds of abstract words, and the total number of abstract words word_{i,t} across all samples is N.
Step 2: extract the semantic features of the abstract words with the word-sense feature extraction module. For the abstract words obtained in step 1, the Word2vector method extracts word-sense features from the context of each abstract word and expresses them as word vectors. This step uses the Word2vec tool to obtain the word vectors of the abstract words.
Word2vec is a model implementation of the Word2vector method; it can quickly and effectively train and generate word vectors from word context. It contains two training modes, CBOW and Skip-gram. As a software tool for training word vectors, Word2vec bases its training model on the neural network language model NNLM, whose basic principle is shown in Fig. 3.
Given the abstract words obtained in step 1, NNLM computes the probability that the next word after some context is word_{i,t}, i.e., p(word_{i,t} = t | context); the word vectors are a by-product of this training. NNLM generates a vocabulary V corresponding to data set D; each word in V corresponds to a label word_{i,t}. To determine the parameters of the neural network, training samples must be built from the data set as the network input. The construction of the NNLM word-context samples is: for any word word_{i,t} in D, take its context context(word_{i,t}) (e.g., the preceding n-1 words), obtaining a tuple (context(word_{i,t}), word_{i,t}), which is fed to the network for training. Unlike a traditional neural network, each input node of NNLM is no longer a scalar but a vector whose components are variables updated during training; this vector is the word vector. As Fig. 3 shows, NNLM maps each word word_{i,t} to a vector w_{i,t}, its word vector.
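The construction of (context, word) training tuples described above can be sketched as follows. This is an illustrative assumption of one concrete reading (a window of the n-1 preceding words); the token strings are made-up stand-ins for abstract words, not data from the patent.

```python
# Hypothetical sketch: building the NNLM training tuples (context(word), word),
# assuming "context" means the n-1 words preceding each word (n = 3 here).
def build_context_tuples(tokens, n=3):
    """Return (context, word) pairs; context = up to n-1 preceding tokens."""
    pairs = []
    for idx, word in enumerate(tokens):
        context = tuple(tokens[max(0, idx - (n - 1)):idx])
        pairs.append((context, word))
    return pairs

# Toy abstract-word sequence (invented identifiers, not real Link-16 fields):
sample = ["A1F", "0x3B", "C09", "A1F", "77D"]
pairs = build_context_tuples(sample, n=3)
```

Each tuple is then one training input; the network's input-layer vectors, updated during training, become the word vectors.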
The word vector w_{i,t} obtained with the Word2vec tool represents the word-sense feature of the t-th kind of abstract word in the i-th sample, with i = {1, 2, ..., |D|}, where |D| is the number of samples; the total number of abstract-word vectors w_{i,t} across all samples is N.
Step 3: with the abstract-word replacement representation module, represent the abstract words of each cluster by the cluster's word-vector centroid. First, using the cluster count under the optimal clustering-effect fitness, the word vectors obtained in step 2 are clustered with the K-means algorithm, realizing the "optimal-fitness partition" of the abstract-word vectors. In the K-means clustering of word vectors, the cosine of the angle between two word vectors is used as the distance between them.
From step 2, the number of abstract-word vectors w_{i,t} across all samples is N, where w_{i,t} represents the word-sense feature of the t-th kind of abstract word in the i-th sample. The number of known sample classes is C and the number of samples is M. In this step, the centroid of a word-vector cluster is denoted S (a vector in word-vector space), and the number k of centroids S is the number of clusters.
To quantify the K-means clustering effect in word-vector space, the invention computes the cluster count adaptively. Let f(k) be the function expressing the clustering-effect fitness,

f(k) = α / β,  N ≤ k ≤ N × C,

where α is the mean cosine distance between the k centroid vectors S, and β is the average, over the k clusters, of the mean cosine distance between the word vectors within a cluster; specifically:
α = (1/k) Σ cos(S, S′),

β = (1/k) Σ_{b=1}^{k} mean cos(w_{i,t}, w′_{i,t}),

where S and S′ are the centroid vectors of different clusters, and w_{i,t} and w′_{i,t} are the word vectors of different abstract words belonging to the b-th cluster.
The cluster count k ∈ [N, N × C] is a positive integer. When f(k) = max(f(k)), let K = k be the cluster count under the optimal clustering-effect fitness; f(K) is the maximum of the fitness. Computation shows that f(k) is monotonically increasing on the interval from N to K and monotonically decreasing from K to N × C; the graph of f(k) is shown in Fig. 4.
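The fitness f(k) = α/β can be sketched for a given K-means partition as below. This is an assumption-laden illustration, not the patent's code: the function names and the toy 2-D vectors are invented, and α follows the (1/k)-normalized sum of pairwise centroid cosines exactly as the formula above is written.

```python
# Illustrative sketch of f(k) = alpha/beta for one k-means partition, using
# the cosine of the angle between word vectors as the distance measure.
from itertools import combinations
from math import sqrt

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def fitness(centroids, clusters):
    """centroids: the k centroid vectors S; clusters: k lists of member vectors."""
    k = len(centroids)
    # alpha: (1/k)-normalized sum of pairwise cosines between the k centroids S
    alpha = sum(cos_sim(s, s2) for s, s2 in combinations(centroids, 2)) / k
    # beta: average over the k clusters of the mean pairwise cosine inside each
    within = []
    for members in clusters:
        pairs = list(combinations(members, 2))
        within.append(sum(cos_sim(u, v) for u, v in pairs) / len(pairs))
    beta = sum(within) / k
    return alpha / beta

# Two orthogonal, internally tight clusters: alpha = 0, beta = 1, so f = 0.
f_val = fitness([(1.0, 0.0), (0.0, 1.0)],
                [[(1.0, 0.0), (2.0, 0.0)], [(0.0, 1.0), (0.0, 3.0)]])
```

Low f(k) here reflects centroids that are far apart (small α) with compact clusters (large β), matching the intended shape of the fitness.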
Thus, when f(k) = max(f(k)), K = k and f(K) is the extremum of the clustering-effect fitness function, i.e., the optimal clustering-effect fitness, and the number of K-means cluster centroids S is finally fixed at K. In determining max(f(k)), K, and f(K), each evaluation of f(k) requires first running a K-means clustering with k centroids; to reduce the number of clustering runs and fitness evaluations, the invention proposes the optimal clustering-effect fitness partition algorithm, as follows:
Optimal clustering-effect fitness partition algorithm
Analysis of the optimal clustering-effect fitness partition algorithm: by its recursive structure, its time complexity is O(log₂[(N × C - N)/4]), so the number of K-means clusterings actually performed in this step, and of f(k) evaluations, is at most log₂[(N × C - N)/4]. Without the partition algorithm, k = {N, N+1, ..., N × C}, and determining max(f(k)), K, and f(K) requires on average (N × C - N)/2 K-means clusterings and f(k) evaluations. The partition algorithm of this step therefore reduces the number of clustering runs and fitness evaluations.
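One way the stated unimodality of f(k) (increasing up to K, then decreasing) can be exploited to find K in logarithmically many fitness evaluations is a ternary search, sketched below. This is an illustration under that unimodality assumption only; the patent's exact recursive partition algorithm is not reproduced here, and the toy fitness function is invented.

```python
# Hedged sketch: locating the argmax of a unimodal f(k) on [lo, hi] with far
# fewer evaluations than the (hi - lo)/2 average of an exhaustive scan.
def argmax_unimodal(f, lo, hi):
    evaluations = 0
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        evaluations += 2
        if f(m1) < f(m2):
            lo = m1 + 1      # peak lies strictly right of m1
        else:
            hi = m2 - 1      # peak lies strictly left of m2
    best = max(range(lo, hi + 1), key=f)
    return best, evaluations

# Toy unimodal fitness peaking at k = 37 over [10, 100]:
K, n_evals = argmax_unimodal(lambda k: -(k - 37) ** 2, 10, 100)
```

Each f(k) probe here stands in for one K-means run plus a fitness computation, which is exactly the cost the partition algorithm is designed to bound.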
Finally, according to the final clustering result, each abstract word is replaced by the centroid S of the cluster containing its word vector. Specifically, when f(k) = max(f(k)) and the cluster count under the optimal clustering-effect fitness is K = k, each abstract word w_{i,t} is replaced by the centroid S of the cluster containing its word vector, i.e., the feature of the abstract word is approximated by the centroid of its cluster. In any word-vector subspace, the centroid S represents the abstract words of its cluster; the correspondence is shown in Fig. 5. The replacement relation is

W_b = { word_{i,t} | w_{i,t} belongs to the cluster with centroid S_b },
where the abstract words word_{i,t} represented by the b-th cluster centroid S_b form an abstract-word set, w_{i,t} is the word vector of abstract word word_{i,t}, and W_b is the set of abstract words whose word vectors belong to the cluster with centroid S_b.
Step 4: output the abstract-sample feature representation with the ST-IDF module. First, count the frequency of each abstract word in a sample; by the replacement relation between centroids S and abstract words given in step 3, the occurrence frequency in the sample of the abstract words represented by the b-th centroid S_b is counted as the frequency of S_b, and the inverse document frequency of each cluster centroid S_b is computed, b = {1, 2, ..., K}. Then, by analogy with the TF-IDF model, the word-vector cluster centroid frequency model ST-IDF is built; its construction is detailed below.
In the TF-IDF model, the feature representation of sample doc_i is realized by the feature vector d_i,

d_i = (d_i(1), d_i(2), ..., d_i(n)),

where the t-th element d_i(t) of vector d_i is computed as

d_i(t) = TF(word_t, doc_i) · IDF(word_t).

TF(word_t, doc_i) is the frequency of word word_t in sample doc_i, computed as

TF(word_t, doc_i) = count(word_t) / Σ_{j=1}^{n} count(word_j),

where the numerator is the number of occurrences of the word in the sample and the denominator is the total number of occurrences of all words in the sample.

IDF(word_t) is the inverse document frequency of word word_t, computed as

IDF(word_t) = |D| / |{doc_i | word_t ∈ doc_i}|,

where D is the data set containing sample doc_i, |D| is the total number of samples in data set D, and |{doc_i | word_t ∈ doc_i}| is the number of samples containing word word_t.
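The TF and IDF formulas above can be sketched in a few lines. Note this follows the formulas exactly as the description writes them, so the IDF variant omits the logarithm found in the usual TF-IDF; the toy documents are invented for illustration.

```python
# Minimal sketch of TF and IDF exactly per the formulas above (no log in IDF).
def tf(word, doc):
    # occurrences of `word` divided by total occurrences of all words in `doc`
    return doc.count(word) / len(doc)

def idf(word, docs):
    # |D| divided by the number of samples containing `word`
    return len(docs) / sum(1 for d in docs if word in d)

docs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
weight = tf("a", docs[0]) * idf("a", docs)   # TF = 2/3, IDF = 3/2
```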
By analogy with the TF-IDF model, the ST-IDF model is built as follows:
SF(S_b, doc_i) is the frequency of word-vector cluster centroid S_b in abstract sample doc_i, computed as

SF(S_b, doc_i) = Σ_{w_{i,t} ∈ W_b} TF(w_{i,t}),

where W_b is the set of abstract words whose word vectors belong to the cluster with centroid S_b, TF(w_{i,t}) is the frequency with which abstract word w_{i,t} occurs in abstract sample doc_i, and SF(S_b, doc_i) counts only the abstract words in doc_i that centroid S_b represents.
IDF(S_b) is the inverse document frequency of word-vector cluster centroid S_b, computed as

IDF(S_b) = |D| / |{doc_i | w_{i,t} ∈ doc_i, w_{i,t} ∈ W_b}|,

where D is the data set formed by the abstract samples doc_i, |D| is the total number of samples in data set D, and the denominator is the number of samples containing an abstract word represented by centroid S_b.
In the ST-IDF model, the feature representation of abstract sample doc_i is realized by the feature vector ḋ_i,

ḋ_i = (ḋ_i(1), ḋ_i(2), ..., ḋ_i(K)),

where the b-th element ḋ_i(b) of vector ḋ_i is computed as

ḋ_i(b) = SF(S_b, doc_i) · IDF(S_b).
The ST-IDF model proposed in this step is in VSM (vector space model) form and serves as the feature representation of an abstract sample.
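A compact sketch of an ST-IDF feature vector is given below. It is a hypothetical illustration: the word-to-centroid mapping stands in for the K-means result of step 3, the abstract-word strings are invented, and the SF and IDF terms follow the formulas above.

```python
# Hypothetical ST-IDF sketch: abstract words are mapped to their cluster
# centroid index b; SF sums the in-sample frequency of the words a centroid
# represents; IDF counts samples containing any word the centroid represents.
def st_idf_vector(doc, word2centroid, docs, K):
    vec = []
    for b in range(K):
        # SF(S_b, doc): summed frequency in `doc` of centroid b's words
        sf = sum(1 for w in doc if word2centroid[w] == b) / len(doc)
        # document frequency of centroid b over the data set
        df = sum(1 for d in docs if any(word2centroid[w] == b for w in d))
        vec.append(sf * len(docs) / df if df else 0.0)
    return vec

# Stand-in clustering result (invented): three centroids over four words.
word2centroid = {"A1F": 0, "A2F": 0, "B7": 1, "C9": 2}
docs = [["A1F", "A2F", "B7"], ["B7", "C9"], ["C9", "C9", "A1F"]]
v = st_idf_vector(docs[0], word2centroid, docs, K=3)   # K-dimensional vector
```

The result is a fixed K-dimensional VSM vector per sample, which is what makes the representation compatible with feature-vector retrieval algorithms.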
Step 5: similarity computation for abstract-sample similarity analysis. Using the feature representation of step 4, compute the similarity between two abstract samples; sample classification algorithms of the information retrieval field then operate on it.
The context-based abstract-sample feature representation method uses the ST-IDF model of step 4 for abstract-sample feature representation. The similarity of any two abstract samples doc_i and doc_i′ is given by the similarity function Sim(doc_i, doc_i′), computed as

Sim(doc_i, doc_i′) = cos(ḋ_i, ḋ_i′),

the cosine of the angle between the feature vectors ḋ_i and ḋ_i′ in ST-IDF vector space.
Step 6: judge the class of the feature-represented abstract sample with the classification module, applying the NWKNN algorithm to the similarities.
Based on the similarity function Sim(doc_i, doc_i′), abstract-sample classification is performed with NWKNN, a classic sample classification algorithm of the information retrieval field. NWKNN is a neighbour-weighted KNN algorithm for classifying samples from imbalanced class sets; its formula is

score(doc, c_i) = Weight_i · ( Σ_{doc_j ∈ KNN(doc)} Sim(doc, doc_j) δ(doc_j, c_i) ),

where score(doc, c_i) is the score for assigning document doc to class c_i; Sim(doc, doc_j) is the similarity between sample doc and known-class sample doc_j, computed as the vector cosine distance; Weight_i is the class weight, set to 3.5; and δ(doc_j, c_i) indicates whether sample doc_j belongs to class c_i: its value is 1 if doc_j belongs to class c_i and 0 otherwise.
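The NWKNN scoring formula above can be sketched as follows. The neighbour similarities and class labels are invented toy values, and the class weight is taken as the fixed 3.5 stated in the description (a full NWKNN would typically weight classes individually).

```python
# Illustrative NWKNN scoring: the score of class c_i is the class weight times
# the summed similarity of the K nearest neighbours that belong to c_i.
def nwknn_score(neighbours, target_class, weight=3.5):
    """neighbours: (similarity, class_label) pairs for the K nearest samples."""
    return weight * sum(sim for sim, c in neighbours if c == target_class)

# Toy 3-nearest-neighbour list (invented similarities and labels):
neighbours = [(0.9, "link16"), (0.8, "xml"), (0.7, "link16")]
predicted = max({c for _, c in neighbours},
                key=lambda c: nwknn_score(neighbours, c))
```

The sample is assigned to the class with the highest score, here the label shared by the two most similar neighbours.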
Classification performance is evaluated with the F1-measure, which combines recall (Recall) and precision (Precision) into the assessment metric

F1 = 2 × Recall × Precision / (Recall + Precision).
With the F1-measure, the classification effect of the system on a data set can be observed. For ease of comparison, the macro F1 metric Macro-F1 of the abstract-sample classification results is summarized; the average precision of the classification results can be obtained at the same time.
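The F1 formula and its macro average over classes (Macro-F1) amount to the short computation below; the per-class recall and precision values are invented for illustration.

```python
# Sketch of the F1 measure and the Macro-F1 average used for evaluation.
def f1(recall, precision):
    return 2 * recall * precision / (recall + precision)

def macro_f1(per_class):
    """per_class: (recall, precision) pairs, one per class."""
    return sum(f1(r, p) for r, p in per_class) / len(per_class)

m = macro_f1([(0.5, 0.5), (1.0, 0.5)])   # (0.5 + 2/3) / 2
```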
With the Wikipedia XML data set of wikipedia XML data (semi-structured text for data exchange) and the Reuters document set Reuter-21578 (plain text) as data sets, abstract-sample classification experiments were run with the NWKNN algorithm and evaluated with the F1-measure. The classification results of the proposed ST-IDF vectors versus the prior-art TF-IDF vectors are shown in Tables 1 and 2:
Table 1: classification results of TF-IDF vectors versus ST-IDF vectors on the Wikipedia XML data set
Table 2: classification results of TF-IDF vectors versus ST-IDF vectors on the Reuter-21578 data set
As Tables 1 and 2 show, the classification effect of the proposed ST-IDF vectors is clearly better than that of the prior-art TF-IDF vectors: on the Wikipedia XML data set the average accuracy rises from the original 48.7% to 59.2%, and on the Reuter-21578 data set from the original 57.1% to 63.3%. The experimental results show that, for information retrieval tasks classifying by abstract-sample similarity, the proposed ST-IDF model achieves better F1-measure results than the TF-IDF model, demonstrating that the feature representation method provided by the invention has the advantage of extracting the word-sense features of abstract samples.

Claims (1)

1. A context-based abstract-sample information retrieval system, characterized in that it comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement representation module, an ST-IDF module, and a classification module, the abstract-sample feature representation method of said abstract-sample information retrieval system comprising the following steps:
Step 1: segment the sample into abstract words with the segmentation module: when the sample is a data-link message, the abstract words can be separated according to the message format and word length; when the sample is text, according to spaces and specific segmentation rules;
Step 2: extract the semantic features of the abstract words with the word-sense feature extraction module: for the abstract words obtained in step 1, the Word2vector method extracts word-sense features from the context of each abstract word and expresses them as word vectors;
Step 3: replace the abstract-word features with the replacement representation module: first, using the cluster count under the optimal clustering-effect fitness, the word vectors obtained in step 2 are clustered with the K-means algorithm, i.e., an "optimal-fitness partition" of the abstract-word vectors is realized, where the centroid of a word-vector cluster is denoted S (a vector in word-vector space), the number k of centroids S is the number of clusters, N is the number of abstract words in all samples, C is the number of known sample classes, and f(k) is the clustering-effect fitness function,

f(k) = α / β,  N ≤ k ≤ N × C,

where α is the mean cosine distance between the k centroid vectors S, and β is the average, over the k clusters, of the mean cosine distance between the word vectors within a cluster, k being a positive integer with k ∈ [N, N × C]; when f(k) = max(f(k)), the cluster count under the optimal clustering-effect fitness is K = k, and the number of centroids S is finally fixed at K; then, according to the final clustering result, each abstract word is replaced by the centroid S of the cluster containing its word vector, i.e., the centroid S represents the abstract words in its cluster and the feature of an abstract word is approximated by the centroid of its cluster;
Step 4: output the abstract-sample feature representation with the ST-IDF module: first, count the frequency of each abstract word in a sample; by the replacement relation of step 3, the occurrence frequency in the sample of the abstract words represented by a centroid S is counted as the frequency of S, and the inverse document frequency of each cluster centroid is computed; then, by analogy with the TF-IDF model, the word-vector cluster centroid frequency model ST-IDF is built; the ST-IDF model is in VSM form and serves as the feature representation of an abstract sample;
Step 5, similarity computation, realizing similarity analysis of abstract samples: using the characteristic representation produced by step 4, compute the similarity between two abstract samples, and on that basis execute sample classification algorithms from the information retrieval field;
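The patent does not name the similarity measure in step 5; cosine similarity is the conventional choice for VSM vectors such as the ST-IDF representation, so it is shown here as an assumption.

```python
import math

def cosine_similarity(u, v):
    # standard cosine similarity between two VSM vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```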
Step 6, use the classification module to judge the class of each characterized abstract sample: according to the similarities, apply the NWKNN (neighbor-weighted K-nearest-neighbor) algorithm to assign a class to the abstract sample.
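NWKNN extends KNN by weighting each class's votes inversely to its size, so large classes do not swamp small ones. The sketch below follows the usual formulation (Tan's neighbor-weighted KNN); the function names, the cosine similarity as the distance basis, and the exponent default are my assumptions, not details taken from the patent.

```python
import math
from collections import Counter

def _cos(u, v):
    # cosine similarity, as assumed for step 5
    dot = sum(a * b for a, b in zip(u, v))
    n = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / n if n else 0.0

def nwknn(query, train_vecs, train_labels, k=3, exponent=2.0):
    # k nearest neighbours of the query under cosine similarity
    neighbours = sorted(zip((_cos(query, v) for v in train_vecs), train_labels),
                        reverse=True)[:k]
    # class weighting: smaller classes get larger weights, countering imbalance
    sizes = Counter(train_labels)
    min_size = min(sizes.values())
    weights = {c: (min_size / n) ** (1.0 / exponent) for c, n in sizes.items()}
    # weighted similarity vote; the highest-scoring class wins
    scores = {}
    for sim, label in neighbours:
        scores[label] = scores.get(label, 0.0) + weights[label] * sim
    return max(scores, key=scores.get)
```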
CN201610369833.4A 2016-01-31 2016-05-29 A kind of abstract sample information searching system based on context Active CN106095791B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610068972 2016-01-31
CN2016100689723 2016-01-31

Publications (2)

Publication Number Publication Date
CN106095791A true CN106095791A (en) 2016-11-09
CN106095791B CN106095791B (en) 2019-08-09

Family

ID=57230265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610369833.4A Active CN106095791B (en) 2016-01-31 2016-05-29 A kind of abstract sample information searching system based on context

Country Status (1)

Country Link
CN (1) CN106095791B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339551A (en) * 2007-07-05 2009-01-07 NEC (China) Co., Ltd. Natural language query expansion device and method
CN101847405A (en) * 2009-03-23 2010-09-29 Sony Corporation Speech recognition device and method, language model generation device and method, and program
US20110087468A1 (en) * 2009-10-12 2011-04-14 Lewis James M Approximating a System Using an Abstract Geometrical Space
CN104598586A (en) * 2015-01-18 2015-05-06 Beijing University of Technology Large-scale text classification method


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502994A (en) * 2016-11-29 2017-03-15 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Method and apparatus for text keyword extraction
CN106502994B (en) * 2016-11-29 2019-12-13 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Method and device for extracting keywords of text
CN106874367A (en) * 2016-12-30 2017-06-20 Jiangsu Haobai Information Service Co., Ltd. Sampling-based distributed clustering method for a public opinion platform
CN110363206A (en) * 2018-03-26 2019-10-22 Alibaba Group Holding Limited Clustering, data processing and data identification methods for data objects
CN110363206B (en) * 2018-03-26 2023-06-27 Alibaba Group Holding Limited Clustering of data objects, data processing and data identification method
CN111241269A (en) * 2018-11-09 2020-06-05 China Mobile (Hangzhou) Information Technology Co., Ltd. Short message text classification method and device, electronic equipment and storage medium
CN111241269B (en) * 2018-11-09 2024-02-23 China Mobile (Hangzhou) Information Technology Co., Ltd. Short message text classification method and device, electronic equipment and storage medium
CN110110143A (en) * 2019-04-15 2019-08-09 Xiamen Wangsu Co., Ltd. Video classification method and device
CN110457470A (en) * 2019-07-05 2019-11-15 Shenzhen OneConnect Smart Technology Co., Ltd. Text classification model learning method and device
CN113127636A (en) * 2019-12-31 2021-07-16 Beijing Gridsum Technology Co., Ltd. Method and device for selecting the center point of a text cluster
CN113127636B (en) * 2019-12-31 2024-02-13 Beijing Gridsum Technology Co., Ltd. Text clustering cluster center point selection method and device

Also Published As

Publication number Publication date
CN106095791B (en) 2019-08-09

Similar Documents

Publication Publication Date Title
CN106095791A (en) 2016-11-09 A kind of abstract sample information searching system based on context and abstract sample characteristics method for expressing thereof
CN113792818B (en) 2022-04-08 Intention classification method and device, electronic equipment and computer readable storage medium
CN107861939A (en) 2018-03-30 Domain entity disambiguation method fusing word vectors and a topic model
CN100583101C (en) 2010-01-20 Text categorization feature selection and weight computation method based on domain knowledge
CN109885824B (en) 2021-03-26 Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
CN104750844B (en) 2018-03-02 Method and apparatus for generating TF-IGM-based text feature vectors, and text classification method and device
CN109948149B (en) 2020-07-28 Text classification method and device
CN107818164A (en) 2018-03-20 Intelligent question-answering method and system
CN108287858A (en) 2018-07-17 Method and device for semantic extraction from natural language
CN106980609A (en) 2017-07-25 Named entity recognition method using conditional random fields based on word vector representation
CN105808524A (en) 2016-07-27 Automatic patent classification method based on patent document abstracts
CN104573046A (en) 2015-04-29 Comment analysis method and system based on word vectors
CN107992542A (en) 2018-05-04 Topic-model-based similar article recommendation method
CN106372061A (en) 2017-02-01 Short text similarity calculation method based on semantics
CN109508379A (en) 2019-03-22 Short text clustering method based on weighted word vector representation combined with similarity
CN103617157A (en) 2014-03-05 Text similarity calculation method based on semantics
CN110321925A (en) 2019-10-11 Multi-granularity text similarity comparison method based on semantic fusion fingerprints
CN101231634A (en) 2008-07-30 Automatic summarization method for multiple documents
CN114169442B (en) 2022-12-02 Remote sensing image small sample scene classification method based on double prototype network
CN105760888A (en) 2016-07-13 Neighborhood rough set ensemble learning method based on attribute clustering
CN103473217B (en) 2017-12-29 Method and apparatus for extracting keywords from text
CN103020167B (en) 2016-05-11 Computer-based Chinese text classification method
CN105205124A (en) 2015-12-30 Semi-supervised text sentiment classification method based on random feature subspaces
CN111597328B (en) 2022-10-25 New event topic extraction method
CN104484380A (en) 2015-04-01 Personalized search method and personalized search device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20190628

Address after: 100095 Beijing Haidian District Gaolizhang Road No. 1 Courtyard 2 Floor 201-004

Applicant after: Changyuan power (Beijing) Technology Co., Ltd.

Address before: 250300 Shandong Province Changqing District Guyunhu Street Office Danfeng District South District 1 Building

Applicant before: Changyuan power (Shandong) Technology Co. Ltd.

GR01 Patent grant
GR01 Patent grant