CN107480200A - Word annotation method, apparatus, server and storage medium based on word labels - Google Patents

Word annotation method, apparatus, server and storage medium based on word labels Download PDF

Info

Publication number
CN107480200A
CN107480200A (application CN201710581312.XA; granted as CN107480200B)
Authority
CN
China
Prior art keywords
word
label
annotation
sample
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710581312.XA
Other languages
Chinese (zh)
Other versions
CN107480200B (en)
Inventor
梁予之
曲强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201710581312.XA priority Critical patent/CN107480200B/en
Publication of CN107480200A publication Critical patent/CN107480200A/en
Application granted granted Critical
Publication of CN107480200B publication Critical patent/CN107480200B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention is applicable to the field of computer technology and provides a word annotation method, apparatus, server and storage medium based on word labels. The method includes: searching an input text document for words to be annotated; querying, via a pre-trained word classifier, a preset known dictionary for known words related to each word to be annotated; and setting the related known words as label words of the word to be annotated, so that the word is annotated by its label words. The word classifier is trained in a supervised manner. By training the classifier with supervision and using known words as label words, words to be annotated are labeled automatically, which effectively improves the efficiency of annotation, reduces the manual effort it requires, and improves the precision and recall of the annotation.

Description

Word annotation method, apparatus, server and storage medium based on word labels
Technical field
The invention belongs to the field of computer technology, and in particular relates to a word annotation method, apparatus, server and storage medium based on word labels.
Background technology
With social media flourishing, many new words are coined on network media such as Weibo and Facebook, and these newly born words are used more and more in everyday life. When such a word first appears, it is difficult to obtain an annotation for it in time: dictionaries and online encyclopedias (such as Wikipedia) have no entry for it yet, and manually creating an entry for every new word requires a large amount of tedious work.
At present, research on word annotation focuses mostly on part-of-speech (POS) tagging, i.e. presetting a few classes (such as person, place, organization name) and assigning each target word to one or several of them. POS tagging methods are mature and fairly accurate. However, for new-media words awaiting annotation, merely assigning them to a limited set of classes is not enough to understand their meaning, particularly because many of these words are tied to trending events.
Tagging methods have been widely applied in fields such as photo description and document description, but research on annotating words with labels is still very limited. An existing method that annotates words with label words uses an unsupervised algorithm: based on microblog data, it represents each known word and each target word as a vector, computes the cosine similarity between them, and sets the most similar known words as the target word's labels. However, the unsupervised algorithm suffers from lack of guidance, an overly simple single assumption, and the need to set thresholds manually, which hurts the precision and recall of the labeling system.
The content of the invention
It is an object of the present invention to provide a word annotation method, apparatus, server and storage medium based on word labels, aiming to solve the prior-art problems that, when annotating new words, the classes available for partitioning them are limited and the partitioning process lacks guidance, resulting in low annotation efficiency and accuracy.
In one aspect, the invention provides a word annotation method based on word labels, the method comprising the steps of:
searching an input text document for a word to be annotated;
querying, via a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier having been trained in a supervised manner; and
setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by means of its label words.
In another aspect, the invention provides a word annotation apparatus based on word labels, the apparatus comprising:
a word searching unit for searching an input text document for a word to be annotated;
a related-word query unit for querying, via a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier having been trained in a supervised manner; and
a word annotation unit for setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by means of its label words.
In another aspect, the invention also provides a server comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the steps of the above word annotation method based on word labels when executing the computer program.
In another aspect, the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above word annotation method based on word labels.
The present invention searches a pre-built text document for words to be annotated, queries a preset known dictionary, via a pre-trained word classifier, for known words related to each such word, and sets the related known words as its label words so that the word is annotated by them. Since the classifier is trained in a supervised manner, known words can serve as labels and words to be annotated are labeled automatically, which effectively improves annotation efficiency, reduces the manual effort of annotation and, thanks to the supervised training, improves the precision and recall of the annotation.
Brief description of the drawings
Fig. 1 is a flowchart of the word annotation method based on word labels provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the word classifier training process in the word annotation method based on word labels provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the word annotation apparatus based on word labels provided by Embodiment 3 of the present invention;
Fig. 4 is a schematic diagram of a preferred structure of the word annotation apparatus based on word labels provided by Embodiment 3 of the present invention; and
Fig. 5 is a schematic structural diagram of the server provided by Embodiment 4 of the present invention.
Embodiment
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, not to limit it.
Specific implementations of the present invention are described in detail below in conjunction with specific embodiments:
Embodiment one:
Fig. 1 shows the implementation flow of the word annotation method based on word labels provided by Embodiment 1 of the present invention. For ease of description, only the parts related to this embodiment are shown. The details are as follows:
In step S101, a word to be annotated is searched for in an input text document.
In an embodiment of the present invention, a word to be annotated is a newly coined word that needs annotating, such as words like "freestyle" that appear on network media such as Weibo or Facebook. Data are collected on such media to provide the input text document. As an example, raw data are collected from a microblog platform, and the portion with the most recent posting times serves as the input text document.
In an embodiment of the present invention, the text of the text document can be segmented into words; in the segmented document, the words whose frequency of occurrence exceeds a preset frequency threshold are retained, screening out low-frequency words that carry little meaning or lie outside the range of interest (e.g. misspellings and personal names). It is then checked whether the remaining words appear in the known dictionary or in a preset lexicon; a word that appears in neither is considered a new word from network media, and such words are set as the words to be annotated.
In an embodiment of the present invention, the known dictionary can likewise be obtained by collecting data from network media. As an example, raw data are collected from a microblog platform, the portion with the earliest posting times is taken as a microblog word document, and that document is segmented; among the segmented words, those whose frequency exceeds the preset frequency threshold are set as known words, low-frequency words of little interest are screened out, and the known words make up the known dictionary.
Specifically, the segmentation method can be a conditional random field, a hidden Markov model, or any of various unsupervised segmentation methods.
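The candidate-detection step above (segment, count, filter against the known dictionary) can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented into a token list; in practice a CRF- or HMM-based segmenter would produce the tokens, and the threshold value here is arbitrary.

```python
from collections import Counter

def find_candidate_words(tokens, known_dict, freq_threshold=5):
    """Return tokens frequent enough to matter but absent from the known dictionary."""
    counts = Counter(tokens)
    return {w for w, c in counts.items()
            if c >= freq_threshold and w not in known_dict}

# Toy data: tokens from recent posts; "freestyle" plays the new word here.
tokens = ["freestyle"] * 6 + ["hello"] * 10 + ["typo1"]
known = {"hello"}
print(find_candidate_words(tokens, known, freq_threshold=5))  # {'freestyle'}
```

The frequency filter is what screens out misspellings and rare names before the dictionary lookup.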
In step S102, a preset known dictionary is queried, via a pre-trained word classifier, for known words related to the word to be annotated; the word classifier is trained in a supervised manner.
In step S103, the related known words are set as label words of the word to be annotated, so that the word to be annotated is annotated by its label words.
In an embodiment of the present invention, the word classifier is trained in a supervised manner; the training process is described step by step in Embodiment 2. The word to be annotated is fed into the classifier to find the known words related to it in the known dictionary, and these related known words are set as its label words, completing the annotation. For example, the word to be annotated "Liu is good" can be annotated with label words such as "Tsinghua", "natural sciences champion", "station is on earth", "new record" and "answer and wear exam pool".
In an embodiment of the present invention, a word to be annotated is found in the text document, a trained word classifier queries the known dictionary for the known words related to it, and these related known words are set as its label words, achieving the annotation. Since the classifier is trained in a supervised manner, a word to be annotated is explained through known words, which effectively improves annotation efficiency, reduces the manual effort of annotation and, thanks to the supervised training, improves the precision and recall of the annotation.
Embodiment two:
Fig. 2 show the embodiment of the present invention two provide word-based label word mask method in word's kinds device train The implementation process of process, for convenience of description, the part related to the embodiment of the present invention is illustrate only, details are as follows:
In step s 201, concentrated in the training data built in advance and search sample word.
In embodiments of the present invention, word segmentation processing, the training dataset after word segmentation processing can be carried out to training dataset These vocabulary are arranged to sample by the middle vocabulary for searching the frequency of occurrences and exceeding predeterminated frequency threshold value and not appearing in known dictionary The newborn word that this word, i.e. training data are concentrated.As illustratively, initial data is collected in microblog, during by issuing Between positioned at a part of initial data of intermediate period be arranged to training dataset.
In step S202, the annotation of each sample word is queried in a preset entry annotation database, the keywords of the annotation are extracted, and the keywords that appear in the known dictionary are set as label words of the sample word.
In an embodiment of the present invention, the entry annotation database can be composed of the entry annotations of an online encyclopedia (such as Wikipedia or Baidu Baike) and can be downloaded from the encyclopedia's website. The annotation corresponding to each sample word is queried in the database, and sample words for which no annotation is found are removed. The keywords of each sample word's annotation can be extracted with an existing text keyword extraction method, and those keywords that appear in the known dictionary are set as the sample word's label words.
Preferably, words related to a sample word's label words are additionally searched for in a preset Chinese thesaurus or synonym website and are also set as label words of the sample word, which increases the number of label words per sample word and benefits training.
Alternatively, extracting the keywords of each sample word's annotation and setting those that appear in the known dictionary as the sample word's label words can be achieved by the following steps:
(1) The annotation is segmented and POS-tagged, and candidate label words are extracted from the tagged annotation.
In an embodiment of the present invention, the annotation is segmented and POS-tagged. Since content-bearing words are mostly verbs, adjectives and nouns, the words in the tagged annotation that belong to these parts of speech and also appear in the known dictionary can be set as candidate label words.
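A minimal sketch of step (1), assuming a POS tagger has already produced (word, tag) pairs; the single-letter tag scheme ('n' noun, 'v' verb, 'a' adjective, 'u' particle) and the sample data are hypothetical.

```python
def candidate_label_words(tagged_annotation, known_dict, content_pos=("n", "v", "a")):
    """Keep words whose POS carries content and that appear in the known dictionary."""
    return [w for w, pos in tagged_annotation
            if pos in content_pos and w in known_dict]

tagged = [("Tsinghua", "n"), ("of", "u"), ("champion", "n"), ("run", "v"), ("zzz", "n")]
known = {"Tsinghua", "champion", "run"}
print(candidate_label_words(tagged, known))  # ['Tsinghua', 'champion', 'run']
```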
(2) The encyclopedia word frequency of each candidate label word is computed from the custom weight of each part of the annotation and the frequency with which the candidate label word occurs in each part.
In an embodiment of the present invention, since the annotation under an encyclopedia entry is highly structured and carries structural markup, an encyclopedia word frequency can be designed and computed for each candidate label word.
Specifically, a custom weight is set in advance for each part of the annotation. For example, the table of contents of an annotation generally reveals the entry's structure rather than carrying content, so its custom weight can be set to a small value, whereas the first part after the table of contents is usually a summary of the current term, so its custom weight can be set to a larger value. Because the length of each part also matters for keyword extraction (shorter content is more likely to contain keywords), the preset custom weights are then redefined per part, where β_j is the custom weight of the j-th part p_j of the annotation and α_j is the weight obtained for p_j after redefinition.
In an embodiment of the present invention, the encyclopedia word frequency of each candidate label word is then computed from the redefined per-part weights and the frequency with which the candidate label word appears in each part of the annotation:

btf(w_i) = Σ_j α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_j α_j · f(w_k, p_j)

where btf(w_i) is the encyclopedia word frequency of the i-th candidate label word w_i; f(w_i, p_j) and f(w_k, p_j) are the frequencies with which the i-th and k-th candidate label words occur in the j-th part p_j; Φ is the set of candidate label words; and A = {α_j} is the set of weights of all parts of the annotation.
(3) The inverse document frequency of each candidate label word is computed over the known dictionary, and the keyword score of each candidate label word is computed from its encyclopedia word frequency and inverse document frequency.
In an embodiment of the present invention, the inverse document frequency of each candidate label word over the known dictionary can be computed as

idf(w_i, doc_corpus) = log( |doc_corpus| / |{ j : w_i ∈ d_j }| )

where doc_corpus is the collection of texts from which the known dictionary is generated (the known dictionary is obtained by segmenting these texts), |doc_corpus| is the number of texts in the collection, w_i is the i-th candidate label word, and |{ j : w_i ∈ d_j }| is the number of texts (e.g. microblog posts) that contain w_i.
In an embodiment of the present invention, the keyword score of each candidate label word is computed from its encyclopedia word frequency and inverse document frequency:

score(w_i) = btf(w_i) × idf(w_i, doc_corpus)

where score(w_i) is the keyword score of the i-th candidate label word and idf denotes the inverse document frequency.
(4) When a candidate label word's keyword score exceeds a preset score threshold, the candidate label word is set as a label word of the sample word.
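Steps (2)-(4) can be sketched as below. The btf normalization is a reconstruction from the definitions given in the text (the original formulas are rendered as images in the patent), so treat its exact form as an assumption; annotation parts are token lists and corpus documents are word sets.

```python
import math

def btf(word, parts, weights, candidates):
    """Encyclopedia word frequency: weighted frequency of `word` across annotation
    parts, normalized over all candidate label words (reconstructed formula)."""
    num = sum(a * part.count(word) for a, part in zip(weights, parts))
    den = sum(a * part.count(w)
              for w in candidates
              for a, part in zip(weights, parts))
    return num / den if den else 0.0

def idf(word, docs):
    """Inverse document frequency over the texts behind the known dictionary."""
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def keyword_labels(candidates, parts, weights, docs, threshold):
    """Step (4): keep candidates whose score btf * idf exceeds the threshold."""
    return [w for w in candidates
            if btf(w, parts, weights, candidates) * idf(w, docs) > threshold]
```

A word that is frequent in the annotation but rare in the corpus scores highest, which is the usual tf-idf intuition applied to encyclopedia parts.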
In step S203, the relation features between each sample word and each known word in the known dictionary are computed, and the word classifier is trained from these relation features and the sample words' label words.
In an embodiment of the present invention, before the word classifier is trained, features are first used to describe the relation between a sample word and the known words in the known dictionary; the classifier is then trained from these relation features and the sample words' label words.
In an embodiment of the present invention, each sample word and each known word in the known dictionary is represented as a term vector v_w = {θ_1, θ_2, ..., θ_n}, where n is the number of known words in the known dictionary. All texts (e.g. microblog posts) in the training data set that contain the current sample word (or known word) are first collected into a document t_w, and the tf-idf weight of each word in t_w is computed:

θ_k = tf(w_k, t_w) × idf(w_k, doc_corpus)

where θ_k, the k-th component of the term vector, is the tf-idf weight of w_k in t_w; tf(w_k, t_w) is the term frequency of w_k in t_w; and idf(w_k, doc_corpus) is the inverse document frequency of w_k over the known dictionary. The term frequency is computed as

tf(w_k, t_w) = f(w_k, t_w) / Σ_{w ∈ Φ(t_w)} f(w, t_w)

where f(w_k, t_w) is the number of times w_k occurs in t_w and Φ(t_w) is the set of all words in t_w.
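A sketch of the term-vector construction under the same assumptions: the context document t_w is a token list, the known dictionary is an ordered list of known words, and corpus documents are word sets. The vector components are the tf-idf weights defined above.

```python
import math
from collections import Counter

def term_vector(context_tokens, known_words, corpus_docs):
    """tf-idf vector of one word's aggregated context document t_w,
    with one component per known-dictionary word."""
    counts = Counter(context_tokens)
    total = sum(counts.values())
    vec = []
    for w in known_words:
        tf = counts[w] / total if total else 0.0
        df = sum(1 for d in corpus_docs if w in d)
        idf = math.log(len(corpus_docs) / df) if df else 0.0
        vec.append(tf * idf)
    return vec
```

Words absent from the context or from the corpus simply get a zero component, so every vector has the same dimension n.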
In an embodiment of the present invention, after the current sample word and each known word have been represented as term vectors, the word distance between them can be computed as the Euclidean distance

d(w_new, w_i) = sqrt( Σ_k (θ_k − θ'_k)² )

where w_new is the current sample word, w_i is the i-th known word, θ_k is the k-th component of w_new's term vector, θ'_k is the k-th component of w_i's term vector, and d(w_new, w_i) is their word distance. The word cosine similarity of w_new and w_i is then computed as

sim(w_new, w_i) = Σ_k θ_k θ'_k / ( sqrt(Σ_k θ_k²) · sqrt(Σ_k θ'_k²) )

where sim(w_new, w_i) is the word cosine similarity of sample word w_new and known word w_i.
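Both vector comparisons are standard; a direct stdlib sketch:

```python
import math

def euclidean(u, v):
    """Word distance d(w_new, w_i) between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Word cosine similarity sim(w_new, w_i) between two term vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (nu * nv) if nu and nv else 0.0

print(euclidean([3.0, 4.0], [0.0, 0.0]))  # 5.0
print(cosine([1.0, 0.0], [1.0, 0.0]))     # 1.0
```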
In an embodiment of the present invention, after the word cosine similarity has been computed, the co-occurrence frequencies of the sample word and each known word in the known dictionary are computed as c_co/c_new and c_co/c_i, where c_co is the number of texts in the training data set that contain both the current sample word and the i-th known word, c_new is the number of texts in the training data set that contain the current sample word, and c_i is the number of texts in the training data set that contain the i-th known word. The word distance, the word cosine similarity and the two word co-occurrence frequencies together form the relation features between the sample word and each known word.
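Putting the pieces together, each (sample word, known word) pair yields a small feature tuple: word distance, cosine similarity, and the two co-occurrence rates c_co/c_new and c_co/c_i. The four-dimensional layout is an assumption based on the features the text enumerates.

```python
import math

def relation_features(vec_new, vec_known, c_co, c_new, c_known):
    """Relation features between one sample word and one known word:
    (distance, cosine similarity, c_co/c_new, c_co/c_known)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_new, vec_known)))
    dot = sum(a * b for a, b in zip(vec_new, vec_known))
    norm = (math.sqrt(sum(a * a for a in vec_new))
            * math.sqrt(sum(b * b for b in vec_known)))
    cos = dot / norm if norm else 0.0
    return (dist,
            cos,
            c_co / c_new if c_new else 0.0,
            c_co / c_known if c_known else 0.0)
```

These tuples, paired with a binary label ("is this known word a label word of the sample word?"), form the supervised training set.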
In an embodiment of the present invention, the relation features between the sample words and each known word, together with the sample words' label words, are fed into a preset support vector machine for training, generating the word classifier; the kernel function of the support vector machine can be a radial basis function kernel, and other classification algorithms may also be selected for the training.
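The text names a radial basis function kernel for the support vector machine; in practice one would hand the relation-feature vectors to an off-the-shelf SVM (for example scikit-learn's SVC(kernel="rbf")). The kernel itself is simple to state; the gamma value below is an arbitrary choice for illustration.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2),
    the kernel suggested for the word classifier's SVM."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0
```

K is 1 for identical feature vectors and decays toward 0 as the vectors move apart, which is what lets the SVM draw nonlinear boundaries over the relation features.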
In an embodiment of the present invention, sample words are found in the pre-built training data set, their label words are extracted from the preset entry annotation database, the relation features between each sample word and each known word in the known dictionary are computed, and the word classifier is trained from the sample words' label words and these relation features. Known-word labels are thereby generated automatically in a supervised manner, so that a large number of words to be annotated can be labeled in a short time; the precision and recall of the annotation are improved, manual effort is effectively saved, and the granularity of the annotation is also effectively refined.
Embodiment three:
Fig. 3 shows the structure of the word annotation apparatus provided by Embodiment 3 of the present invention. For ease of description, only the parts related to this embodiment are shown, including:
A word searching unit 31 for searching an input text document for a word to be annotated.
In an embodiment of the present invention, a word to be annotated is a newly coined word that needs annotating, such as words like "freestyle" that appear on network media such as Weibo or Facebook. Data are collected on such media to provide the input text document. As an example, raw data are collected from a microblog platform, and the portion with the most recent posting times serves as the input text document.
In an embodiment of the present invention, the text of the text document can be segmented; among the segmented words, those whose frequency of occurrence exceeds a preset frequency threshold are retained, low-frequency words that carry little meaning or lie outside the range of interest are screened out, and it is then checked whether the remaining words appear in the known dictionary or in a preset lexicon. A word that appears in neither is considered a new word from network media and is set as a word to be annotated.
In an embodiment of the present invention, the known dictionary can likewise be obtained by collecting data from network media. As an example, raw data are collected from a microblog platform, the portion with the earliest posting times is taken as a microblog word document and segmented; among the segmented words, those whose frequency exceeds the preset frequency threshold are set as known words, low-frequency words of little interest are screened out, and the known words make up the known dictionary.
Specifically, the segmentation method can be a conditional random field, a hidden Markov model, or any of various unsupervised segmentation methods.
A related-word query unit 32 for querying, via a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier being trained in a supervised manner.
A word annotation unit 33 for setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by its label words.
In an embodiment of the present invention, the word classifier is trained in a supervised manner. The word to be annotated is fed into the classifier to find the known words related to it in the known dictionary, and these related known words are set as its label words, completing the annotation. For example, the word to be annotated "Liu is good" can be annotated with label words such as "Tsinghua", "natural sciences champion", "station is on earth", "new record" and "answer and wear exam pool".
Preferably, as shown in Fig. 4, the word annotation apparatus based on word labels further includes a sample word searching unit 41, a keyword extracting unit 42 and a classifier training unit 43, wherein:
The sample word searching unit 41 searches for sample words in the pre-built training data set.
In an embodiment of the present invention, the training data set can be segmented; among the segmented words, those whose frequency of occurrence exceeds the preset frequency threshold and that do not appear in the known dictionary are set as sample words, i.e. the new words in the training data. As an example, raw data are collected from a microblog platform, and the portion whose posting times fall in the intermediate period is set as the training data set.
The keyword extracting unit 42 queries the annotation of each sample word in the preset entry annotation database, extracts the keywords of the annotation, and sets the keywords that appear in the known dictionary as label words of the sample word.
In an embodiment of the present invention, the entry annotation database can be composed of the entry annotations of an online encyclopedia and can be downloaded from the encyclopedia's website. The annotation corresponding to each sample word is queried in the database, and sample words for which no annotation is found are removed. The keywords of each sample word's annotation can be extracted with an existing text keyword extraction method, and those keywords that appear in the known dictionary are set as the sample word's label words.
Preferably, words related to a sample word's label words are additionally searched for in a preset Chinese thesaurus or synonym website and are also set as label words of the sample word, which increases the number of label words per sample word and benefits training.
Alternatively, extracting the keywords from the annotation of each sample word and setting the keywords that appear in the known dictionary as label words of the sample word may be implemented by the following steps:
(1) Perform word segmentation and part-of-speech tagging on the annotation, and extract candidate label words from the tagged annotation.
In embodiments of the present invention, word segmentation and part-of-speech tagging are performed on the annotation. Since content-bearing words are mostly verbs, adjectives and nouns, the words in the tagged annotation that belong to these parts of speech and appear in the known dictionary may be set as candidate label words.
(2) Calculate the encyclopedia word frequency of each candidate label word according to the self-defined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation.
In embodiments of the present invention, since the annotation under an encyclopedia entry is highly structured and carries structural markers, an encyclopedia word frequency may be designed and calculated for each candidate label word.
Specifically, a self-defined weight is set in advance for each part of the annotation. For example, the table of contents in an annotation is generally used to show the structure of the entry rather than to carry content, so its self-defined weight may be set to a small value; the first part of content after the table of contents is usually a summary of the current word, so its self-defined weight may be set to a large value. Since the word length of each part of the annotation is also relevant to keyword extraction (the shorter a part is, the more likely it is to contain keywords), the weight of each part of the annotation may be redefined from its preset self-defined weight. The redefinition formula may be:
α_j = (β_j / |p_j|) / Σ_{p_k ∈ P} (β_k / |p_k|)
where β_j is the self-defined weight of the j-th part of the annotation, p_j is the j-th part, |p_j| is its word length, P is the annotation (the set of its parts), and α_j is the weight obtained after redefining the weight of the j-th part.
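Under the length-normalised reading of the weight redefinition above, the computation can be sketched in a few lines of Python (function name and toy data are illustrative assumptions, not from the patent):

```python
def redefine_weights(parts, custom_weights):
    """Length-normalise the self-defined part weights beta_j: shorter parts
    (more likely to contain keywords) receive proportionally larger alpha_j."""
    raw = [beta / len(part) for beta, part in zip(custom_weights, parts)]
    total = sum(raw)
    return [r / total for r in raw]  # the alphas sum to 1

parts = [["summary"] * 2, ["body"] * 8]  # a short part and a long part
alphas = redefine_weights(parts, [1.0, 1.0])
print(alphas)  # [0.8, 0.2]
```

With equal self-defined weights, the part that is four times shorter receives four times the redefined weight.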
In embodiments of the present invention, the encyclopedia word frequency of each candidate label word is then calculated from the redefined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part. The calculation formula is:
btf(w_i) = Σ_{p_j ∈ P} α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_{p_j ∈ P} α_j · f(w_k, p_j)
where btf(w_i) is the encyclopedia word frequency of the i-th candidate label word w_i, w_k is the k-th candidate label word, f(w_i, p_j) and f(w_k, p_j) are respectively the frequencies with which the i-th and k-th candidate label words occur in the j-th part p_j, Φ is the set of candidate label words, and A = {α_j} is the set of weights of all parts of the annotation.
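Read this way, the encyclopedia word frequency is the α-weighted occurrence count of a candidate, normalised across all candidates. A toy Python version (names and data are illustrative):

```python
def encyclopedia_tf(candidates, parts, alphas):
    """btf(w_i): alpha-weighted occurrences of candidate w_i over the parts,
    normalised by the weighted occurrences of all candidates."""
    def weighted(w):
        return sum(a * p.count(w) for a, p in zip(alphas, parts))
    total = sum(weighted(w) for w in candidates)
    return {w: weighted(w) / total for w in candidates}

parts = [["net", "bank"], ["bank", "bank", "loan"]]
btf = encyclopedia_tf(["bank", "loan"], parts, alphas=[0.7, 0.3])
print(btf)  # bank ≈ 0.8125, loan ≈ 0.1875
```

"bank" dominates because it occurs in both parts, including the heavily weighted first one.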
(3) Calculate the inverse document frequency of each candidate label word from the known dictionary, and calculate the keyword score of the candidate label word from its encyclopedia word frequency and inverse document frequency.
In embodiments of the present invention, the inverse document frequency of each candidate label word with respect to the known dictionary is calculated. The calculation formula may be:
idf(w_i, doc_corpus) = log( |doc_corpus| / |{j : w_i ∈ d_j}| )
where doc_corpus is the set of text documents used to generate the known dictionary (the known dictionary is obtained by segmenting these documents), |doc_corpus| is the number of texts in this document set, w_i is the i-th candidate label word, and |{j : w_i ∈ d_j}| is the number of texts containing w_i (e.g. the number of microblogs).
In embodiments of the present invention, the keyword score of each candidate label word is calculated from its encyclopedia word frequency and inverse document frequency. The calculation formula is:
score(w_i) = btf(w_i) × idf(w_i, doc_corpus)
where score(w_i) is the keyword score of the i-th candidate label word and idf denotes the inverse document frequency.
(4) When the keyword score of a candidate label word exceeds a preset score threshold, the candidate label word is set as a label word of the sample word.
The classifier training unit 43 is configured to calculate the relationship features between each sample word and each known word in the known dictionary, and to train a word classifier according to the relationship features and the label words of the sample words.
In embodiments of the present invention, before the word classifier is trained, features are first used to describe the relationship between a sample word and the known words in the known dictionary; the word classifier is then trained on these relationship features and the label words of the sample words.
In embodiments of the present invention, each sample word and each known word in the known dictionary is represented as a word vector v_w = {θ_1, θ_2, ..., θ_n}, where n is the number of known words in the known dictionary. All texts (e.g. microblogs) in the training data set that contain the current sample word (or known word) are first collected to form a document t_w, and the term frequency–inverse document frequency of each word in t_w is calculated as:
θ_k = tf(w_k, t_w) × idf(w_k, doc_corpus)
where θ_k, the k-th component of the word vector, is the term frequency–inverse document frequency of the word w_k in t_w, tf(w_k, t_w) is the term frequency of w_k in t_w, and idf(w_k, doc_corpus) is the inverse document frequency of w_k with respect to the known dictionary. tf(w_k, t_w) is calculated as:
tf(w_k, t_w) = f(w_k, t_w) / Σ_{w ∈ Φ(t_w)} f(w, t_w)
where f(w_k, t_w) is the frequency with which w_k occurs in t_w and Φ(t_w) is the set of all words in t_w.
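A sketch of this word-vector construction (toy data; it assumes every vocabulary word occurs somewhere in the corpus so that the idf is defined, and the names are illustrative):

```python
import math

def idf(word, corpus_docs):
    # idf(w_k, doc_corpus) = log(|doc_corpus| / number of texts containing w_k)
    containing = sum(1 for doc in corpus_docs if word in doc)
    return math.log(len(corpus_docs) / containing)

def word_vector(target, texts, vocabulary, corpus_docs):
    """Collect every training text containing `target` into t_w, then take
    theta_k = tf(w_k, t_w) * idf(w_k, doc_corpus) as the k-th component."""
    t_w = [tok for text in texts if target in text for tok in text]
    return [t_w.count(w_k) / len(t_w) * idf(w_k, corpus_docs)
            for w_k in vocabulary]

texts = [["a", "b"], ["a", "a"], ["b"]]
corpus = [{"a", "b"}, {"b"}, {"b"}]
vec = word_vector("a", texts, ["a", "b"], corpus)
print(vec)  # [0.75 * log(3), 0.0]
```

The component for "b" is zero because "b" appears in every corpus text, so its idf vanishes.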
In embodiments of the present invention, after the current sample word and each known word are represented as word vectors, the word distance between the current sample word and each known word may be calculated as the Euclidean distance:
d(w_new, w_i) = √( Σ_{k=1}^{n} (θ_k − θ'_k)² )
where w_new is the current sample word, w_i is the i-th known word, θ_k is the k-th component of the word vector of w_new, θ'_k is the k-th component of the word vector of w_i, and d(w_new, w_i) is the word distance between w_new and w_i. Then the word cosine similarity between w_new and w_i is calculated as:
sim(w_new, w_i) = Σ_{k=1}^{n} θ_k θ'_k / ( √(Σ_{k=1}^{n} θ_k²) · √(Σ_{k=1}^{n} θ'_k²) )
where sim(w_new, w_i) is the word cosine similarity between the sample word w_new and the known word w_i.
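Both measures are standard; a direct Python transcription (toy vectors for illustration):

```python
import math

def word_distance(v, u):
    # Euclidean distance d(w_new, w_i) between two word vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, u)))

def cosine_similarity(v, u):
    # sim(w_new, w_i): dot product over the product of the vector norms
    dot = sum(a * b for a, b in zip(v, u))
    norms = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in u))
    return dot / norms

print(word_distance([3.0, 0.0], [0.0, 4.0]))      # 5.0
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # 1/sqrt(2) ≈ 0.7071
```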
In embodiments of the present invention, after the word cosine similarity is calculated, the word co-occurrence frequencies of the sample word and each known word in the known dictionary are calculated as c_co / c_new and c_co / c_i, where c_co is the number of texts in the training data set that contain both the current sample word and the i-th known word, c_new is the number of texts in the training data set that contain the current sample word, and c_i is the number of texts in the training data set that contain the i-th known word. The word distance, word cosine similarity and word co-occurrence frequencies of the sample word and each known word together form the relationship features between the sample word and that known word.
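The four relationship features for one (sample word, known word) pair can then be assembled as follows (function name and toy texts are illustrative):

```python
def relation_features(sample, known, texts, dist, cos_sim):
    """[word distance, word cosine similarity, c_co/c_new, c_co/c_i] for one
    (sample word, known word) pair; `texts` is the segmented training set."""
    c_new = sum(1 for t in texts if sample in t)
    c_i = sum(1 for t in texts if known in t)
    c_co = sum(1 for t in texts if sample in t and known in t)
    return [dist, cos_sim, c_co / c_new, c_co / c_i]

texts = [["x", "y"], ["x"], ["y"], ["x", "y"]]
feats = relation_features("x", "y", texts, dist=1.2, cos_sim=0.9)
print(feats)  # [1.2, 0.9, 0.666..., 0.666...]
```

Here "x" and "y" co-occur in two of the three texts containing each, so both co-occurrence ratios are 2/3.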
In embodiments of the present invention, the relationship features between the sample words and each known word, together with the label words of the sample words, are input into a preset support vector machine for training to generate the word classifier, where the kernel function of the support vector machine may be an RBF kernel; other classification algorithms may also be selected for training.
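A hedged sketch of this training step using scikit-learn's SVC with an RBF kernel (the feature values and labels below are invented toy data, and framing the task as a binary "is this known word a label of the sample word" decision is one possible reading of the text, not the patent's definitive formulation):

```python
from sklearn.svm import SVC

# X: relationship-feature vectors [distance, cosine similarity, c_co/c_new,
# c_co/c_i] for (sample word, known word) pairs; y: 1 if the known word is a
# label word of the sample word, 0 otherwise. All values are toy data.
X = [[0.1, 0.9, 0.8, 0.7],
     [2.5, 0.1, 0.0, 0.0],
     [0.3, 0.8, 0.6, 0.5],
     [3.0, 0.05, 0.1, 0.0]]
y = [1, 0, 1, 0]

clf = SVC(kernel="rbf", gamma="scale")  # RBF kernel, as suggested in the text
clf.fit(X, y)
preds = list(clf.predict(X))
```

At prediction time, the known words classified as related to a word to be annotated would be set as its labels.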
Preferably, the classifier training unit includes:
a word vector converting unit, configured to convert the sample word and each known word in the known dictionary into corresponding word vectors;
a relationship calculating unit, configured to calculate, according to the word vector of the sample word and the word vectors of the known words, the word distance and word cosine similarity between the sample word and each known word, and to calculate the word co-occurrence frequency of the sample word and each known word in the training data set; and
a relationship combining unit, configured to combine the word distance, the word cosine similarity and the word co-occurrence frequency into the relationship features between the sample word and the known words.
In embodiments of the present invention, sample words are searched in a training data set built in advance, the label words of the sample words are extracted from a preset entry annotation database, the relationship features between each sample word and each known word in the known dictionary are calculated, and a word classifier is trained according to the label words of the sample words and these relationship features. The trained classifier then classifies the words to be annotated, obtaining from the known dictionary the known words related to the words to be annotated in a text document, and these known words are set as the labels of the words to be annotated. In this supervised manner, known-word labels are generated automatically, a large number of words to be annotated can be labeled in a short time, the accuracy and recall of word annotation are improved, manual effort is effectively saved, and the granularity of word annotation is effectively improved.
In embodiments of the present invention, each unit of the word-label-based word annotation apparatus may be implemented by a corresponding hardware or software unit; the units may be independent hardware/software units or may be integrated into one hardware/software unit, which does not limit the present invention.
Embodiment Four:
Fig. 5 shows the structure of the server provided by Embodiment Four of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown.
The server 5 of the embodiment of the present invention includes a processor 50, a memory 51, and a computer program 52 that is stored in the memory 51 and runnable on the processor 50. When the processor 50 executes the computer program 52, the steps in the above method embodiments are implemented, such as steps S101 to S103 shown in Fig. 1; alternatively, when the processor 50 executes the computer program 52, the functions of the units in the above apparatus embodiments are implemented, such as the functions of units 31 to 33 shown in Fig. 3.
In embodiments of the present invention, sample words are searched in a training data set built in advance, the label words of the sample words are extracted from a preset entry annotation database, the relationship features between each sample word and each known word in the known dictionary are calculated, and a word classifier is trained according to the label words of the sample words and these relationship features. When an input text document is received, the trained classifier classifies the words to be annotated, obtaining from the known dictionary the known words related to the words to be annotated in the text document, and these known words are set as the labels of the words to be annotated. In this supervised manner, known-word labels are generated automatically, a large number of words to be annotated can be labeled in a short time, the accuracy and recall of word annotation are improved, manual effort is effectively saved, and the granularity of word annotation is effectively improved.
Embodiment Five:
In embodiments of the present invention, a computer-readable storage medium is provided, which stores a computer program. When the computer program is executed by a processor, the steps in the above method embodiments are implemented, such as steps S101 to S103 shown in Fig. 1; alternatively, when the computer program is executed by a processor, the functions of the units in the above apparatus embodiments are implemented, such as the functions of units 31 to 33 shown in Fig. 3.
In embodiments of the present invention, sample words are searched in a training data set built in advance, the label words of the sample words are extracted from a preset entry annotation database, the relationship features between each sample word and each known word in the known dictionary are calculated, and a word classifier is trained according to the label words of the sample words and these relationship features. When an input text document is received, the trained classifier classifies the words to be annotated, obtaining from the known dictionary the known words related to the words to be annotated in the text document, and these known words are set as the labels of the words to be annotated. In this supervised manner, known-word labels are generated automatically, a large number of words to be annotated can be labeled in a short time, the accuracy and recall of word annotation are improved, manual effort is effectively saved, and the granularity of word annotation is effectively improved.
The computer-readable storage medium of the embodiment of the present invention may include any entity or device capable of carrying computer program code, or a recording medium, for example, a memory such as ROM/RAM, a magnetic disk, an optical disc or a flash memory.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A word annotation method based on word labels, characterized in that the method comprises the following steps:
searching for a word to be annotated in an input text document;
querying, by a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier being obtained by supervised training; and
setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated with the label words.
2. The method according to claim 1, characterized in that before the step of searching for a word to be annotated in the input text document, the method further comprises:
searching for sample words in a training data set built in advance;
querying the annotation of each sample word in a preset entry annotation database, extracting the keywords of the annotation, and setting the keywords that appear in the known dictionary as label words of the sample word; and
calculating relationship features between the sample word and each known word in the known dictionary, and training the word classifier according to the relationship features and the label words of the sample word.
3. The method according to claim 2, characterized in that the step of querying the annotation of each sample word in the preset entry annotation database, extracting the keywords of the annotation, and setting the keywords that appear in the known dictionary as label words of the sample word comprises:
querying the annotation of the sample word in the entry annotation database, performing word segmentation and part-of-speech tagging on the annotation, and extracting candidate label words from the tagged annotation;
calculating the encyclopedia word frequency of each candidate label word according to the self-defined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation;
calculating the inverse document frequency of the candidate label word according to the known dictionary, and calculating the keyword score of the candidate label word according to the encyclopedia word frequency and the inverse document frequency of the candidate label word; and
when the keyword score of the candidate label word exceeds a preset score threshold, setting the candidate label word as a label word of the sample word.
4. The method according to claim 3, characterized in that the step of calculating the encyclopedia word frequency of each candidate label word according to the self-defined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation comprises:
redefining the weight of each part of the annotation according to the self-defined weight of each part of the annotation, the formula for redefining the weight of the j-th part of the annotation being:
α_j = (β_j / |p_j|) / Σ_{p_k ∈ P} (β_k / |p_k|)
wherein β_j is the self-defined weight of the j-th part of the annotation, p_j is the j-th part, P is the annotation, and α_j is the value obtained after redefining the weight of the j-th part; and
calculating the encyclopedia word frequency of the candidate label word according to the redefined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation, the calculation formula being:
btf(w_i) = Σ_{p_j ∈ P} α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_{p_j ∈ P} α_j · f(w_k, p_j)
wherein btf(w_i) is the encyclopedia word frequency of the i-th candidate label word w_i, w_k is the k-th candidate label word, f(w_i, p_j) and f(w_k, p_j) are respectively the frequencies with which the i-th and k-th candidate label words occur in the j-th part p_j, Φ is the set of candidate label words, and A is the set of weights of all parts of the annotation.
5. The method according to claim 2, characterized in that the step of calculating the relationship features between the sample word and each known word in the known dictionary comprises:
converting the sample word and each known word in the known dictionary into corresponding word vectors;
calculating, according to the word vector of the sample word and the word vectors of the known words, the word distance and word cosine similarity between the sample word and each known word, and calculating the word co-occurrence frequency of the sample word and the known word in the training data set; and
combining the word distance, the word cosine similarity and the word co-occurrence frequency into the relationship features between the sample word and the known word.
6. A word annotation apparatus based on word labels, characterized in that the apparatus comprises:
a word searching unit, configured to search for a word to be annotated in an input text document;
a related word querying unit, configured to query, by a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier being obtained by supervised training; and
a word annotating unit, configured to set the related known words as label words of the word to be annotated, so as to annotate the word to be annotated with the label words.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a sample word searching unit, configured to search for sample words in a training data set built in advance;
a keyword extracting unit, configured to query the annotation of each sample word in a preset entry annotation database, extract the keywords of the annotation, and set the keywords that appear in the known dictionary as label words of the sample word; and
a classifier training unit, configured to calculate relationship features between the sample word and each known word in the known dictionary, and to train the word classifier according to the relationship features and the label words of the sample word.
8. The apparatus according to claim 7, characterized in that the classifier training unit comprises:
a word vector converting unit, configured to convert the sample word and each known word in the known dictionary into corresponding word vectors;
a relationship calculating unit, configured to calculate, according to the word vector of the sample word and the word vectors of the known words, the word distance and word cosine similarity between the sample word and each known word, and to calculate the word co-occurrence frequency of the sample word and the known word in the training data set; and
a relationship combining unit, configured to combine the word distance, the word cosine similarity and the word co-occurrence frequency into the relationship features between the sample word and the known word.
9. A server comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, characterized in that when the processor executes the computer program, the steps of the method according to any one of claims 1 to 5 are implemented.
10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 5 are implemented.
CN201710581312.XA 2017-07-17 2017-07-17 Word labeling method, device, server and storage medium based on word labels Active CN107480200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710581312.XA CN107480200B (en) 2017-07-17 2017-07-17 Word labeling method, device, server and storage medium based on word labels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710581312.XA CN107480200B (en) 2017-07-17 2017-07-17 Word labeling method, device, server and storage medium based on word labels

Publications (2)

Publication Number Publication Date
CN107480200A true CN107480200A (en) 2017-12-15
CN107480200B CN107480200B (en) 2020-10-23

Family

ID=60595121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710581312.XA Active CN107480200B (en) 2017-07-17 2017-07-17 Word labeling method, device, server and storage medium based on word labels

Country Status (1)

Country Link
CN (1) CN107480200B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145296A (en) * 2018-08-09 2019-01-04 新华智云科技有限公司 A kind of general word recognition method and device based on monitor model
CN109271392A (en) * 2018-10-30 2019-01-25 长威信息科技发展股份有限公司 Quick discrimination and the method and apparatus for extracting relevant database entity and attribute
CN109344367A (en) * 2018-10-24 2019-02-15 厦门美图之家科技有限公司 Region mask method, device and computer readable storage medium
CN109522424A (en) * 2018-10-16 2019-03-26 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of data
CN109740157A (en) * 2018-12-29 2019-05-10 贵州小爱机器人科技有限公司 The label of working individual determines method, apparatus and computer storage medium
CN109816047A (en) * 2019-02-19 2019-05-28 北京达佳互联信息技术有限公司 Method, apparatus, equipment and the readable storage medium storing program for executing of label are provided
CN110276064A (en) * 2018-03-14 2019-09-24 普天信息技术有限公司 A kind of part-of-speech tagging method and device
CN110991181A (en) * 2019-11-29 2020-04-10 腾讯科技(深圳)有限公司 Method and apparatus for enhancing labeled samples
CN113177109A (en) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 Text weak labeling method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102508923A (en) * 2011-11-22 2012-06-20 北京大学 Automatic video annotation method based on automatic classification and keyword marking
US20160042427A1 (en) * 2011-04-06 2016-02-11 Google Inc. Mining For Product Classification Structures For Internet-Based Product Searching

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160042427A1 (en) * 2011-04-06 2016-02-11 Google Inc. Mining For Product Classification Structures For Internet-Based Product Searching
CN102508923A (en) * 2011-11-22 2012-06-20 北京大学 Automatic video annotation method based on automatic classification and keyword marking

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD HOSSEIN ELAHIMANESH ET AL.: ""ACUT: An Associative Classifier Approach to Unknown Word POS Tagging"", 《INTERNATIONAL SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND SIGNAL PROCESSING》 *
YUZHI LIANG ET AL.: ""New Word Detection and Tagging on Chinese Twitter Stream"", 《INTERNATIONAL CONFERENCE ON BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY》 *
刘遥峰 等: ""中文分词和词性标注模型"", 《计算机工程》 *
姜维 等: ""基于条件随机域的词性标注模型"", 《计算机工程与应用》 *
王阿园 等: ""查询扩展中扩展词提取算法研究"", 《中国科技论文在线》 *
郭振 等: ""基于字符的中文分词、词性标注和依存句法分析联合模型"", 《中文信息学报》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276064B (en) * 2018-03-14 2023-06-23 普天信息技术有限公司 Part-of-speech tagging method and device
CN110276064A (en) * 2018-03-14 2019-09-24 普天信息技术有限公司 A kind of part-of-speech tagging method and device
CN109145296A (en) * 2018-08-09 2019-01-04 新华智云科技有限公司 A kind of general word recognition method and device based on monitor model
CN109522424A (en) * 2018-10-16 2019-03-26 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of data
CN109344367A (en) * 2018-10-24 2019-02-15 厦门美图之家科技有限公司 Region mask method, device and computer readable storage medium
CN109344367B (en) * 2018-10-24 2022-11-01 厦门美图之家科技有限公司 Region labeling method and device and computer readable storage medium
CN109271392A (en) * 2018-10-30 2019-01-25 长威信息科技发展股份有限公司 Quick discrimination and the method and apparatus for extracting relevant database entity and attribute
CN109740157A (en) * 2018-12-29 2019-05-10 贵州小爱机器人科技有限公司 The label of working individual determines method, apparatus and computer storage medium
CN109816047A (en) * 2019-02-19 2019-05-28 北京达佳互联信息技术有限公司 Method, apparatus, equipment and the readable storage medium storing program for executing of label are provided
CN109816047B (en) * 2019-02-19 2022-05-24 北京达佳互联信息技术有限公司 Method, device and equipment for providing label and readable storage medium
CN110991181A (en) * 2019-11-29 2020-04-10 腾讯科技(深圳)有限公司 Method and apparatus for enhancing labeled samples
CN110991181B (en) * 2019-11-29 2023-03-31 腾讯科技(深圳)有限公司 Method and apparatus for enhancing labeled samples
CN113177109A (en) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 Text weak labeling method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN107480200B (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN107480200A (en) Word mask method, device, server and the storage medium of word-based label
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
US11216504B2 (en) Document recommendation method and device based on semantic tag
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
CN106649818B (en) Application search intention identification method and device, application search method and server
Huston et al. Evaluating verbose query processing techniques
CN106776574B (en) User comment text mining method and device
CN107608999A (en) A kind of Question Classification method suitable for automatically request-answering system
CN106126619A (en) A kind of video retrieval method based on video content and system
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN112069312B (en) Text classification method based on entity recognition and electronic device
Man Feature extension for short text categorization using frequent term sets
Barriere et al. TerminoWeb: a software environment for term study in rich contexts
CN111291177A (en) Information processing method and device and computer storage medium
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
Wang et al. Neural related work summarization with a joint context-driven attention mechanism
CN105786971B (en) A kind of grammer point recognition methods towards international Chinese teaching
Singh et al. Writing Style Change Detection on Multi-Author Documents.
Gong et al. A semantic similarity language model to improve automatic image annotation
CN112000929A (en) Cross-platform data analysis method, system, equipment and readable storage medium
Ohta et al. CRF-based bibliography extraction from reference strings focusing on various token granularities
Pu et al. A vision-based approach for deep web form extraction
Tan et al. Sentiment analysis of chinese short text based on multiple features
Chen Natural language processing in web data mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant