CN108829679A - Corpus labeling method and device - Google Patents

Corpus labeling method and device

Info

Publication number
CN108829679A
Authority
CN
China
Prior art keywords
corpus
word
converted
marked
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810644479.0A
Other languages
Chinese (zh)
Inventor
吴健君
倪嘉呈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd filed Critical Beijing QIYI Century Science and Technology Co Ltd
Priority to CN201810644479.0A priority Critical patent/CN108829679A/en
Publication of CN108829679A publication Critical patent/CN108829679A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/221 Parsing markup language streams
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the present invention provide a corpus labeling method and device. Each entry in a corpus to be labeled is segmented into words, yielding a plurality of words to be converted corresponding to the corpus to be labeled. A word vector is determined for each word to be converted according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus. A corpus vector is then determined for each entry in the corpus to be labeled according to the word vectors of its words to be converted. The corpus vectors are clustered into a plurality of clusters, and each cluster is labeled. Because the invention labels clusters of corpus vectors rather than labeling each entry individually, labeling is more convenient and faster, which effectively improves the efficiency of corpus labeling.

Description

Corpus labeling method and device
Technical field
The present invention relates to the field of natural language processing, and in particular to a corpus labeling method and device.
Background art
Labeling the entries of a corpus is an important technique in the field of natural language processing and is widely used in tasks such as query phrase (Query) analysis.
Existing corpus labeling techniques label each entry of the corpus individually. Because a corpus contains a large number of entries, the labeling efficiency of existing techniques is low.
How to improve the efficiency of corpus labeling therefore remains a technical problem to be solved in this field.
Summary of the invention
In view of this, the present invention provides a corpus labeling method and device to achieve efficient labeling of a corpus. The technical solution is as follows:
A corpus labeling method, including:
segmenting each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
determining a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
determining a corpus vector for each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
clustering the corpus vectors to obtain a plurality of clusters, and labeling each cluster.
Optionally, determining the word vector of each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to the training corpus includes:
for any word to be converted corresponding to the corpus to be labeled: searching for that word among the words corresponding to the training corpus and, if it is found, determining its word vector from the word vectors of the words corresponding to the training corpus.
Optionally, determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled includes:
determining a weight for each word to be converted corresponding to the corpus to be labeled;
determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
Optionally, determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled includes:
computing, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
Optionally, determining the weight of each word to be converted corresponding to the corpus to be labeled includes:
computing, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
Optionally, after determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled, and before clustering the corpus vectors, the method further includes:
normalizing the corpus vector determined for each entry;
and clustering the corpus vectors includes:
clustering the normalized corpus vectors.
A corpus labeling device, including: a word segmentation unit, a word vector determination unit, a corpus vector determination unit and a clustering unit,
the word segmentation unit being configured to segment each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
the word vector determination unit being configured to determine a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
the corpus vector determination unit being configured to determine a corpus vector for each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
the clustering unit being configured to cluster the corpus vectors to obtain a plurality of clusters and to label each cluster.
Optionally, the corpus vector determination unit includes: a weight determination subunit and a corpus determination subunit,
the weight determination subunit being configured to determine a weight for each word to be converted corresponding to the corpus to be labeled;
the corpus determination subunit being configured to determine the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
Optionally, the corpus determination subunit is specifically configured to:
compute, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
Optionally, the weight determination subunit is specifically configured to:
compute, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
With the corpus labeling method and device provided by the embodiments of the present invention, each entry in a corpus to be labeled is segmented into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled; a word vector is determined for each word to be converted according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus; a corpus vector is determined for each entry in the corpus to be labeled according to the word vectors of its words to be converted; and the corpus vectors are clustered to obtain a plurality of clusters, each of which is labeled. Because the present invention labels clusters of corpus vectors, there is no need to label each entry individually, which is more convenient and faster and effectively improves the efficiency of corpus labeling. Moreover, because the present invention first segments each entry into words, obtains the word vector of each word, and then derives the corpus vector from those word vectors, the accuracy of the corpus vector can be improved. In addition, because the word vectors used in the present invention are converted from words obtained by segmenting each entry in the training corpus, the accuracy of the word vectors can be further improved by screening the entries of the training corpus.
Brief description of the drawings
In order to explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below.
Fig. 1 is a kind of flow diagram of corpus labeling method provided in an embodiment of the present invention;
Fig. 2 is the flow diagram of another corpus labeling method provided in an embodiment of the present invention;
Fig. 3 is the flow diagram of another corpus labeling method provided in an embodiment of the present invention;
Fig. 4 is a kind of structural schematic diagram of corpus labeling device provided in an embodiment of the present invention.
Specific embodiment
The present invention discloses a corpus labeling method and device. Those skilled in the art can draw on the content of this disclosure and appropriately adjust process parameters when implementing it. In particular, it should be noted that all similar substitutions and modifications will be apparent to those skilled in the art and are deemed to be included within the present invention. The method and application of the present invention have been described by way of preferred embodiments, and a person skilled in the art may clearly modify, or appropriately change and combine, the method and application described herein without departing from the content, spirit and scope of the present invention in order to implement and apply the technology of the present invention.
As shown in Fig. 1, a corpus labeling method provided by an embodiment of the present invention may include:
S100, segmenting each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
Each entry in the corpus to be labeled may be selected by a technician and added to the corpus to be labeled. In practical applications, the entries in the corpus to be labeled may all relate to a certain topic, for example automobiles; by adding entries relevant to a topic to the corpus to be labeled, a corpus to be labeled that is relevant to that topic is obtained. Specifically, each entry in the corpus to be labeled may be a query phrase (Query). In practice, the query phrases used by users when querying may be collected and/or recorded, or may be obtained from a third party.
Specifically, the embodiments of the present invention may perform word segmentation using a variety of methods. For example, segmentation training data may be prepared and a segmentation model obtained by machine learning on that data; an entry then only needs to be input into the segmentation model, and the words to be converted are obtained from the model's output. Of course, the embodiments of the present invention may also perform segmentation using a string-matching-based method, an understanding-based method, or a statistics-based method.
In practical applications, the embodiments of the present invention may deduplicate the words to be converted obtained for the corpus to be labeled, that is, remove duplicate words to be converted. This reduces the workload of subsequent processing and avoids interference caused by duplicate words to be converted.
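By way of illustration only, step S100 could be sketched as follows with the open-source jieba segmenter standing in for the segmentation model; the function name and the sample query phrases are assumptions, not part of the patent.

```python
# Minimal sketch of step S100: segment each entry and deduplicate the words
# to be converted. jieba and the sample queries are illustrative assumptions.
import jieba

def segment_corpus(entries):
    """Return, for each entry, its deduplicated list of words to be converted."""
    segmented = []
    for entry in entries:
        words = list(jieba.cut(entry))
        seen, unique_words = set(), []
        for w in words:          # deduplicate while preserving order
            if w not in seen:
                seen.add(w)
                unique_words.append(w)
        segmented.append(unique_words)
    return segmented

corpus_to_label = ["发动机排量多大", "自动变速箱怎么样"]  # hypothetical query phrases
print(segment_corpus(corpus_to_label))
```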
S200, determining a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
Specifically, the entries of the training corpus may be obtained in many ways, for example by crawling a large number of entries related to a certain topic from web pages with a crawler, thereby forming the training corpus. It should be understood that, by setting different crawling rules, the embodiments of the present invention can crawl entries related to different topics and thereby obtain training corpora for different topics. The topic may be set and modified according to actual needs, and the embodiments of the present invention are not limited in this respect. For example, entries related to automobiles may be crawled to obtain an automobile training corpus, where the automobile-related entries may include: engine displacement, automatic gearbox, 0-100 km/h acceleration in 7.5 seconds, airbag, and so on.
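Purely as an illustration of how such a crawler might collect topic-related entries, the following sketch uses requests and BeautifulSoup; the URL and the choice of paragraph tags are hypothetical assumptions.

```python
# Hypothetical sketch of collecting training-corpus entries from a topic page;
# the URL and the tag selection are assumptions, not taken from the patent.
import requests
from bs4 import BeautifulSoup

def crawl_entries(url):
    """Fetch a page and extract short text snippets as candidate training entries."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    return [p.get_text(strip=True) for p in soup.find_all("p") if p.get_text(strip=True)]

# training_entries = crawl_entries("https://example.com/automobile-news")  # hypothetical URL
```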
Specifically, the topic of the training corpus may be the same as the topic of the corpus to be labeled, which can effectively improve the accuracy of the word vectors.
The method used to segment the entries of the training corpus may be the same as or different from the method used to segment the entries of the corpus to be labeled.
In practical applications, the embodiments of the present invention may convert each word obtained by segmenting the entries of the training corpus into a word vector. Specifically, the present invention may use the word2vec technique to convert words into word vectors.
In other embodiments of the present invention, step S200, determining the word vector of each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to the training corpus, may include:
for any word to be converted corresponding to the corpus to be labeled: searching for that word among the words corresponding to the training corpus and, if it is found, determining its word vector from the word vectors of the words corresponding to the training corpus.
For ease of understanding, an example follows:
Suppose the training corpus corresponds to five words, A, B, C, D and E, whose word vectors are a, b, c, d and e respectively. When a word to be converted is one of A, B, C, D and E, it can be found among the words corresponding to the training corpus, and its word vector can therefore be found. For instance, if the word to be converted is C, C can be found among the words corresponding to the training corpus, so c is determined to be the word vector of that word.
If a word to be converted is not found among the words corresponding to the training corpus, the embodiments of the present invention may record that word. A technician can then continue to expand the training corpus so that the word is added to the words corresponding to the training corpus. Specifically, a crawler may be used to crawl web pages containing that word to obtain entries that include it, and those entries are then added to the training corpus; the word is thus obtained when the entries of the training corpus are segmented.
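A non-authoritative sketch of the word2vec conversion and the lookup described above, using the gensim library; the vector size, the other training parameters and the sample entries are assumptions.

```python
# Sketch of converting training-corpus words into word vectors with word2vec
# (gensim) and looking up a word to be converted; parameters are assumptions.
from gensim.models import Word2Vec

# Each element is one segmented training-corpus entry (a list of words).
segmented_training_corpus = [
    ["发动机", "排量"],
    ["自动", "变速箱"],
    ["安全", "气囊"],
]

model = Word2Vec(sentences=segmented_training_corpus, vector_size=100,
                 window=5, min_count=1, workers=2)

def lookup_word_vector(word):
    """Return the word vector if the word is among the training-corpus words, else None."""
    if word in model.wv:   # found among the words corresponding to the training corpus
        return model.wv[word]
    return None            # not found: record it so the training corpus can be expanded
```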
S300, determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
Step S300 may determine the corpus vector in several ways, for example by adding up the word vectors of the words to be converted obtained by segmenting an entry and taking the result as the corpus vector of that entry.
In other embodiments of the present invention, weights may be introduced, and the corpus vector may be determined from the weights and the word vectors. Specifically, as shown in Fig. 2, step S300 may include:
S310, determining a weight for each word to be converted corresponding to the corpus to be labeled;
The embodiments of the present invention may determine the weight of each word to be converted in several ways; one of them is given below as an example.
Step S310 may specifically include:
computing, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
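A minimal sketch of this weight computation; computing the inverse document frequency over the segmented corpus to be labeled, and the helper names, are assumptions.

```python
# Sketch of w(t_ij) = freq(t_ij) × idf(t_ij). Using the corpus to be labeled as
# the document collection for idf is an assumption; names are illustrative.
import math
from collections import Counter

def word_weights(segmented_corpus):
    """For each entry, map every word to be converted to its TF-IDF weight."""
    num_entries = len(segmented_corpus)
    df = Counter()                      # in how many entries does each word appear?
    for words in segmented_corpus:
        df.update(set(words))
    weights = []
    for words in segmented_corpus:
        freq = Counter(words)
        total = len(words)
        weights.append({w: (freq[w] / total) * math.log(num_entries / df[w])
                        for w in freq})
    return weights
```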
S320, determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
Further, the embodiments of the present invention may compute a weighted sum of the word vectors of the words to be converted obtained by segmenting an entry, and take the result as the corpus vector of that entry. Therefore, step S320 may specifically include:
computing, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
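The weighted sum of step S320 could be sketched as follows in numpy; returning a zero vector when no word vector is available is an assumption, not something stated in the patent.

```python
# Sketch of v(q_j) = sum_i w(t_ij) * v(t_ij) for one entry; the zero-vector
# fallback and the default dimension are assumptions.
import numpy as np

def corpus_vector(words, weights, lookup_word_vector, dim=100):
    """Weighted sum of the word vectors of one entry's words to be converted."""
    vec = np.zeros(dim)
    for w in words:
        wv = lookup_word_vector(w)
        if wv is not None:
            vec += weights.get(w, 0.0) * np.asarray(wv)
    return vec
```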
S400, clustering the corpus vectors to obtain a plurality of clusters, and labeling each cluster.
Specifically, the embodiments of the present invention may use the K-means clustering technique to cluster the corpus vectors.
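A sketch of this clustering step using the scikit-learn implementation of K-means; the number of clusters is an assumption and would be chosen per application.

```python
# Sketch of step S400: K-means clustering of the corpus vectors.
# n_clusters=10 is an assumed value, not taken from the patent.
import numpy as np
from sklearn.cluster import KMeans

def cluster_corpus_vectors(corpus_vectors, n_clusters=10):
    """Cluster the corpus vectors; returns the cluster index of each entry."""
    X = np.vstack(corpus_vectors)
    kmeans = KMeans(n_clusters=n_clusters, n_init=10, random_state=0)
    return kmeans.fit_predict(X)
```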
It should be understood that, after step S400 clusters the corpus vectors and obtains the clusters, the embodiments of the present invention may further classify the clusters, or use the clusters for synonym mining and the like.
Specifically, the clusters may be labeled in a variety of ways, for example each cluster may be labeled according to its industry, and so on.
When each entry in the corpus to be labeled is a query phrase, the present invention, by cluster-labeling the query phrases, can effectively divide or identify the query phrases, which facilitates subsequent use of the divided or identified query phrases, for example determining a user's query intention from the division of the query phrases so as to provide the user with query results relevant to that intention.
With the corpus labeling method provided by the embodiments of the present invention, each entry in a corpus to be labeled is segmented into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled; a word vector is determined for each word to be converted according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus; a corpus vector is determined for each entry in the corpus to be labeled according to the word vectors of its words to be converted; and the corpus vectors are clustered to obtain a plurality of clusters, each of which is labeled. Because the present invention labels clusters of corpus vectors, there is no need to label each entry individually, which is more convenient and faster and effectively improves the efficiency of corpus labeling. Moreover, because the present invention first segments each entry into words, obtains the word vector of each word, and then derives the corpus vector from those word vectors, the accuracy of the corpus vector can be improved. In addition, because the word vectors used in the present invention are converted from words obtained by segmenting each entry in the training corpus, the accuracy of the word vectors can be further improved by screening the entries of the training corpus.
As shown in Fig. 3, another corpus labeling method provided by an embodiment of the present invention may further include, between step S300 and step S400:
S301, normalizing the corpus vector determined for each entry.
Normalization is a data preprocessing method applied before the data is used; it can improve the speed and precision of the data processing result without disturbing the distribution of the data. For the present invention, normalization can improve the speed and precision of clustering the corpus vectors.
On this basis, step S400 may specifically include:
clustering the normalized corpus vectors to obtain a plurality of clusters, and labeling each cluster.
The method for normalizing a corpus vector in the embodiments of the present invention may include:
normalizing the corpus vector v(qj) by the formula
v(qj)′ = v(qj) / ||v(qj)||
to obtain the normalized corpus vector v(qj)′, where ||v(qj)|| denotes the norm of the vector v(qj).
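A one-line sketch of this normalization in numpy; guarding against a zero-norm vector is an assumption.

```python
# Sketch of v(q_j)' = v(q_j) / ||v(q_j)||; the zero-norm guard is an assumption.
import numpy as np

def normalize(vec):
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
```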
Corresponding to the above method embodiments, an embodiment of the present invention further provides a corpus labeling device.
As shown in Fig. 4, a corpus labeling device provided by an embodiment of the present invention may include: a word segmentation unit 100, a word vector determination unit 200, a corpus vector determination unit 300 and a clustering unit 400,
the word segmentation unit 100 being configured to segment each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
Each entry in the corpus to be labeled may be selected by a technician and added to the corpus to be labeled. In practical applications, the entries in the corpus to be labeled may all relate to a certain topic, for example automobiles; by adding entries relevant to a topic to the corpus to be labeled, a corpus to be labeled that is relevant to that topic is obtained. Specifically, each entry in the corpus to be labeled may be a query phrase (Query). In practice, the query phrases used by users when querying may be collected and/or recorded, or may be obtained from a third party.
Specifically, the embodiments of the present invention may perform word segmentation using a variety of methods. For example, segmentation training data may be prepared and a segmentation model obtained by machine learning on that data; an entry then only needs to be input into the segmentation model, and the words to be converted are obtained from the model's output. Of course, the embodiments of the present invention may also perform segmentation using a string-matching-based method, an understanding-based method, or a statistics-based method.
In practical applications, the embodiments of the present invention may deduplicate the words to be converted obtained for the corpus to be labeled, that is, remove duplicate words to be converted. This reduces the workload of subsequent processing and avoids interference caused by duplicate words to be converted.
the word vector determination unit 200 being configured to determine a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
Specifically, the entries of the training corpus may be obtained in many ways, for example by crawling a large number of entries related to a certain topic from web pages with a crawler, thereby forming the training corpus. It should be understood that, by setting different crawling rules, the embodiments of the present invention can crawl entries related to different topics and thereby obtain training corpora for different topics. The topic may be set and modified according to actual needs, and the embodiments of the present invention are not limited in this respect. For example, entries related to automobiles may be crawled to obtain an automobile training corpus, where the automobile-related entries may include: engine displacement, automatic gearbox, 0-100 km/h acceleration in 7.5 seconds, airbag, and so on.
Specifically, the topic of the training corpus may be the same as the topic of the corpus to be labeled, which can effectively improve the accuracy of the word vectors.
The method used to segment the entries of the training corpus may be the same as or different from the method used to segment the entries of the corpus to be labeled.
In practical applications, the embodiments of the present invention may convert each word obtained by segmenting the entries of the training corpus into a word vector. Specifically, the present invention may use the word2vec technique to convert words into word vectors.
If a word to be converted is not found among the words corresponding to the training corpus, the embodiments of the present invention may record that word. A technician can then continue to expand the training corpus so that the word is added to the words corresponding to the training corpus. Specifically, a crawler may be used to crawl web pages containing that word to obtain entries that include it, and those entries are then added to the training corpus; the word is thus obtained when the entries of the training corpus are segmented.
The word vector determination unit 200 may specifically be configured to: for any word to be converted corresponding to the corpus to be labeled, search for that word among the words corresponding to the training corpus and, if it is found, determine its word vector from the word vectors of the words corresponding to the training corpus.
the corpus vector determination unit 300 being configured to determine the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
The corpus vector determination unit 300 may determine the corpus vector in several ways, for example by adding up the word vectors of the words to be converted obtained by segmenting an entry and taking the result as the corpus vector of that entry.
In other embodiments of the present invention, weights may be introduced, and the corpus vector may be determined from the weights and the word vectors. Specifically, the corpus vector determination unit 300 may include: a weight determination subunit and a corpus determination subunit,
the weight determination subunit being configured to determine a weight for each word to be converted corresponding to the corpus to be labeled;
the corpus determination subunit being configured to determine the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
The corpus determination subunit may specifically be configured to:
compute, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
The weight determination subunit may specifically be configured to:
compute, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, where j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
the clustering unit 400 being configured to cluster the corpus vectors to obtain a plurality of clusters, and to label each cluster.
Specifically, the embodiments of the present invention may use the K-means clustering technique to cluster the corpus vectors.
It should be understood that, after the clustering unit 400 clusters the corpus vectors and obtains the clusters, the embodiments of the present invention may further classify the clusters, or use the clusters for synonym mining and the like.
Specifically, the clusters may be labeled in a variety of ways, for example each cluster may be labeled according to its industry, and so on.
When each entry in the corpus to be labeled is a query phrase, the present invention, by cluster-labeling the query phrases, can effectively divide or identify the query phrases, which facilitates subsequent use of the divided or identified query phrases, for example determining a user's query intention from the division of the query phrases so as to provide the user with query results relevant to that intention.
With the corpus labeling device provided by the embodiments of the present invention, each entry in a corpus to be labeled is segmented into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled; a word vector is determined for each word to be converted according to the word vectors of the words corresponding to a training corpus, where the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus; a corpus vector is determined for each entry in the corpus to be labeled according to the word vectors of its words to be converted; and the corpus vectors are clustered to obtain a plurality of clusters, each of which is labeled. Because the present invention labels clusters of corpus vectors, there is no need to label each entry individually, which is more convenient and faster and effectively improves the efficiency of corpus labeling. Moreover, because the present invention first segments each entry into words, obtains the word vector of each word, and then derives the corpus vector from those word vectors, the accuracy of the corpus vector can be improved. In addition, because the word vectors used in the present invention are converted from words obtained by segmenting each entry in the training corpus, the accuracy of the word vectors can be further improved by screening the entries of the training corpus.
On the basis of the embodiment shown in Fig. 4, another corpus labeling device provided by an embodiment of the present invention may further include: a normalization unit, configured to normalize the corpus vector determined for each entry after the corpus vector determination unit determines the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted, and before the clustering unit clusters the corpus vectors.
Normalization is a data preprocessing method applied before the data is used; it can improve the speed and precision of the data processing result without disturbing the distribution of the data. For the present invention, normalization can improve the speed and precision of clustering the corpus vectors.
Further, the clustering unit 400 may specifically be configured to: cluster the normalized corpus vectors to obtain a plurality of clusters, and label each cluster.
The clustering unit 400 may normalize the corpus vector v(qj) by the formula
v(qj)′ = v(qj) / ||v(qj)||
to obtain the normalized corpus vector v(qj)′, where ||v(qj)|| denotes the norm of the vector v(qj).
The above are only preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A corpus labeling method, characterized by comprising:
segmenting each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
determining a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, wherein the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
determining a corpus vector for each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
clustering the corpus vectors to obtain a plurality of clusters, and labeling each cluster.
2. The method according to claim 1, characterized in that determining the word vector of each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to the training corpus comprises:
for any word to be converted corresponding to the corpus to be labeled: searching for that word among the words corresponding to the training corpus and, if it is found, determining its word vector from the word vectors of the words corresponding to the training corpus.
3. The method according to claim 1 or 2, characterized in that determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled comprises:
determining a weight for each word to be converted corresponding to the corpus to be labeled;
determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
4. The method according to claim 3, characterized in that determining the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled comprises:
computing, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, wherein j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
5. The method according to claim 3, characterized in that determining the weight of each word to be converted corresponding to the corpus to be labeled comprises:
computing, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, wherein j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
6. The method according to claim 1, characterized in that, after determining the corpus vector of each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled and before clustering the corpus vectors, the method further comprises:
normalizing the corpus vector determined for each entry;
and clustering the corpus vectors comprises:
clustering the normalized corpus vectors.
7. A corpus labeling device, characterized by comprising: a word segmentation unit, a word vector determination unit, a corpus vector determination unit and a clustering unit,
the word segmentation unit being configured to segment each entry in a corpus to be labeled into words, obtaining a plurality of words to be converted corresponding to the corpus to be labeled;
the word vector determination unit being configured to determine a word vector for each word to be converted corresponding to the corpus to be labeled according to the word vectors of the words corresponding to a training corpus, wherein the words corresponding to the training corpus are obtained by segmenting each entry in the training corpus;
the corpus vector determination unit being configured to determine a corpus vector for each entry in the corpus to be labeled according to the word vectors of the words to be converted corresponding to the corpus to be labeled;
the clustering unit being configured to cluster the corpus vectors to obtain a plurality of clusters and to label each cluster.
8. The device according to claim 7, characterized in that the corpus vector determination unit comprises: a weight determination subunit and a corpus determination subunit,
the weight determination subunit being configured to determine a weight for each word to be converted corresponding to the corpus to be labeled;
the corpus determination subunit being configured to determine the corpus vector of each entry in the corpus to be labeled according to the word vectors and the weights of the words to be converted corresponding to the corpus to be labeled.
9. The device according to claim 8, characterized in that the corpus determination subunit is specifically configured to:
compute, by the formula
v(qj) = Σ_{i=1}^{n} w(tij) × v(tij),
the corpus vector v(qj) of each entry in the corpus to be labeled, wherein j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, qj is the entry numbered j in the corpus to be labeled, n is the number of words to be converted obtained by segmenting qj, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, w(tij) is the weight of tij, v(tij) is the word vector of tij, and j, i and n are natural numbers.
10. The device according to claim 8, characterized in that the weight determination subunit is specifically configured to:
compute, according to the formula
w(tij) = freq(tij) × idf(tij),
the weight w(tij) of each word to be converted tij corresponding to the corpus to be labeled, wherein j is the number of an entry in the corpus to be labeled, i is the number of a word to be converted obtained by segmenting that entry, tij is the word to be converted numbered i obtained by segmenting the entry numbered j, freq(tij) is the frequency with which tij appears in the entry numbered j, idf(tij) is the inverse document frequency of tij, and j and i are natural numbers.
CN201810644479.0A 2018-06-21 2018-06-21 Corpus labeling method and device Pending CN108829679A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810644479.0A CN108829679A (en) 2018-06-21 2018-06-21 Corpus labeling method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810644479.0A CN108829679A (en) 2018-06-21 2018-06-21 Corpus labeling method and device

Publications (1)

Publication Number Publication Date
CN108829679A true CN108829679A (en) 2018-11-16

Family

ID=64141923

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810644479.0A Pending CN108829679A (en) 2018-06-21 2018-06-21 Corpus labeling method and device

Country Status (1)

Country Link
CN (1) CN108829679A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030191625A1 (en) * 1999-11-05 2003-10-09 Gorin Allen Louis Method and system for creating a named entity language model
CN106528642A (en) * 2016-10-13 2017-03-22 广东广业开元科技有限公司 TF-IDF feature extraction based short text classification method
CN106776713A (en) * 2016-11-03 2017-05-31 中山大学 It is a kind of based on this clustering method of the Massive short documents of term vector semantic analysis
CN106611052A (en) * 2016-12-26 2017-05-03 东软集团股份有限公司 Text label determination method and device
CN107766426A (en) * 2017-09-14 2018-03-06 北京百分点信息科技有限公司 A kind of file classification method, device and electronic equipment

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
王立梅 et al.: "基于k均值聚类的直推式支持向量机学习算法" (Transductive support vector machine learning algorithm based on k-means clustering), 《计算机工程与应用》 (Computer Engineering and Applications) *

Similar Documents

Publication Publication Date Title
CN105005589B (en) A kind of method and apparatus of text classification
CN104765769B (en) The short text query expansion and search method of a kind of word-based vector
Zhang et al. Ad hoc table retrieval using semantic similarity
Li et al. NPRF: A neural pseudo relevance feedback framework for ad-hoc information retrieval
CN110134772A (en) Medical text Relation extraction method based on pre-training model and fine tuning technology
CN107515895A (en) A kind of sensation target search method and system based on target detection
CN109145190A (en) A kind of local quotation recommended method and system based on neural machine translation mothod
Zaw et al. Web document clustering using cuckoo search clustering algorithm based on levy flight
CN107577739A (en) A kind of semi-supervised domain term excavates the method and apparatus with classification
CN106250513A (en) A kind of event personalization sorting technique based on event modeling and system
CN106202211A (en) A kind of integrated microblogging rumour recognition methods based on microblogging type
CN104933181A (en) Mathematical formula searching method and device
WO2005036351A3 (en) Systems and methods for search processing using superunits
CN105468673A (en) Mathematical formula search method and system
CN109977250A (en) Merge the depth hashing image search method of semantic information and multistage similitude
CN110110228A (en) Intelligent real-time professional literature recommendation method and system based on Internet and word bag
CN109492156A (en) A kind of Literature pushing method and device
CN104866517A (en) Method and device for capturing webpage content
CN107656920B (en) Scientific and technological talent recommendation method based on patents
CN103064982A (en) Method for intelligent recommendation of patents in patent searching
CN109684460A (en) A kind of calculation method and system of the negative network public-opinion index based on deep learning
CN111078859A (en) Author recommendation method based on reference times
CN110032619A (en) A kind of segmenter training method and its device based on deep learning
CN108829679A (en) Corpus labeling method and device
CN108021657A (en) A kind of similar author's searching method based on document title semantic information

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181116

RJ01 Rejection of invention patent application after publication