CN107480200A - Word annotation method, apparatus, server and storage medium based on label words - Google Patents
- Publication number
- CN107480200A CN107480200A CN201710581312.XA CN201710581312A CN107480200A CN 107480200 A CN107480200 A CN 107480200A CN 201710581312 A CN201710581312 A CN 201710581312A CN 107480200 A CN107480200 A CN 107480200A
- Authority
- CN
- China
- Prior art keywords
- word
- label
- annotation
- sample
- marked
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/374—Thesaurus
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Mathematical Physics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The present invention is applicable to the field of computer technology and provides a word annotation method, apparatus, server and storage medium based on label words. The method includes: searching an input text document for words to be annotated; querying, through a pre-trained word classifier, a preset known-word dictionary for known words related to each word to be annotated; and setting the related known words as label words of the word to be annotated, so that the word to be annotated is annotated by its label words. The word classifier is trained in a supervised manner. By training the word classifier with supervision and using known words as label words, automatic annotation of words based on label words is realized, which effectively improves the efficiency of annotating words, reduces the manual effort the annotation requires, and effectively improves the precision and recall of the annotation.
Description
Technical field
The present invention belongs to the field of computer technology, and in particular relates to a word annotation method, apparatus, server and storage medium based on label words.
Background technology
In today's flourishing social media, many new words are coined on network media such as microblogs and Facebook, and these newly coined words are used more and more in real life. When a new word first appears on network media, people find it difficult to obtain an explanation of it in time, because its entry has not yet been created in dictionaries or online encyclopedias (such as Wikipedia), and manually creating an entry for every new word requires a great deal of tedious work.
At present, research on word annotation mostly focuses on part-of-speech tagging (POS tagging), i.e. presetting several classes (such as person, place and organization names) and then assigning each target word to one or a few of the classes. Part-of-speech tagging methods are relatively mature, and their accuracy is also high. However, for the words to be annotated on network media, merely assigning them to a limited set of classes is not enough to understand their meaning, especially since many of these words are related to trending events.
Tagging methods have been widely applied in fields such as photo description and document description, but research on annotating words themselves is still very limited. The existing method of annotating words with label words uses an unsupervised algorithm: based on microblog data, each known word and each target word is represented as a vector, the cosine similarity between each known word and the target word is calculated, and the words with high similarity are set as the labels of the target word. However, the unsupervised algorithm has the shortcomings of lacking guidance, relying on a single assumption and requiring manually set thresholds, which harms the precision and recall of the word annotation system.
The content of the invention
It is an object of the present invention to provide a word annotation method, apparatus, server and storage medium based on label words, aiming to solve the problems in the prior art that, when newly coined words are annotated, the classes into which they can be divided are limited and the division process lacks guidance, resulting in low efficiency and accuracy of word annotation.
In one aspect, the present invention provides a word annotation method based on label words, the method comprising the following steps:
searching an input text document for words to be annotated;
querying, through a pre-trained word classifier, a preset known-word dictionary for known words related to the word to be annotated, the word classifier being trained in a supervised manner; and
setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by the label words.
In another aspect, the present invention provides a word annotation apparatus based on label words, the apparatus comprising:
a word searching unit for searching an input text document for words to be annotated;
a related-word query unit for querying, through a pre-trained word classifier, a preset known-word dictionary for known words related to the word to be annotated, the word classifier being trained in a supervised manner; and
a word annotation unit for setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by the label words.
In another aspect, the present invention further provides a server, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the word annotation method based on label words as described above.
In another aspect, the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the word annotation method based on label words as described above.
The present invention searches a pre-built text document for words to be annotated, queries a preset known-word dictionary through a pre-trained word classifier for known words related to each word to be annotated, and sets the related known words as label words of the word to be annotated, so that the word is annotated by its label words. Since the word classifier is trained in a supervised manner, the invention realizes automatic annotation of words to be annotated with known words as label words, which effectively improves the efficiency of explaining the words to be annotated and reduces the manual effort of annotating them; moreover, the word classifier obtained through supervised training effectively improves the precision and recall of the annotation.
Brief description of the drawings
Fig. 1 is the implementation flowchart of the word annotation method based on label words provided by Embodiment 1 of the present invention;
Fig. 2 is the implementation flowchart of the word classifier training process in the word annotation method based on label words provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the word annotation apparatus based on label words provided by Embodiment 3 of the present invention;
Fig. 4 is a schematic diagram of a preferred structure of the word annotation apparatus based on label words provided by Embodiment 3 of the present invention; and
Fig. 5 is a schematic structural diagram of the server provided by Embodiment 4 of the present invention.
Embodiment
In order to make the purpose, technical scheme and advantages of the present invention clearer, the present invention is further elaborated below in conjunction with the drawings and embodiments. It should be understood that the specific embodiments described herein are only intended to illustrate the present invention, not to limit it.
The implementation of the present invention is described in detail below in conjunction with specific embodiments:
Embodiment one:
Fig. 1 shows the implementation flow of the word annotation method based on label words provided by Embodiment 1 of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, detailed as follows:
In step S101, words to be annotated are searched for in an input text document.
In the embodiment of the present invention, the words to be annotated are newly coined words that need to be annotated, for example words like "freestyle" that appear on network media such as microblogs and Facebook. Data is collected on such network media to obtain the input text document. As an example, raw data is collected on a microblog platform, and the most recent portion of the raw data by publication time is used as the input text document.
In the embodiment of the present invention, the text in the text document may be word-segmented. In the segmented text document, words whose frequency of occurrence exceeds a preset frequency threshold are searched for, and low-frequency words that have little effect or fall outside the range of concern (such as misspelled words and personal names) are screened out. It is then detected whether the remaining words appear in the known-word dictionary or in a preset lexicon: words that appear in neither are considered newly coined words on network media, and are set as the words to be annotated.
In the embodiment of the present invention, the known-word dictionary can also be obtained by collecting data on network media. As an example, raw data is collected on a microblog platform, and the earliest portion of the raw data by publication time is set as a microblog text document. The microblog text document is word-segmented; among the segmented words, those whose frequency exceeds a preset frequency threshold are set as known words, low-frequency words that have little effect or fall outside the range of concern are screened out, and the known-word dictionary is composed of these known words.
Specifically, the segmentation method used for word segmentation may be a conditional random field, a hidden Markov model, or any of various unsupervised segmentation methods.
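As an illustrative sketch of the dictionary construction just described (the segmenter, the threshold value and the screening list are not fixed by the embodiment, so the ones below are assumptions):

```python
from collections import Counter

def build_known_dictionary(segmented_texts, freq_threshold=5, stop_words=()):
    """Collect words whose corpus frequency exceeds a preset threshold.

    segmented_texts: iterable of token lists (already word-segmented).
    Low-frequency words and words outside the range of concern
    (stop_words) are screened out, as the embodiment describes.
    """
    counts = Counter()
    for tokens in segmented_texts:
        counts.update(tokens)
    return {w for w, c in counts.items()
            if c > freq_threshold and w not in stop_words}

# Toy segmented corpus: "new" occurs 6 times, "word" 4 times.
texts = [["new", "word", "appears"], ["new", "word"], ["new"],
         ["new", "word"], ["word", "new"], ["new", "typo1"]]
known = build_known_dictionary(texts, freq_threshold=4)
```

With the threshold set to 4, only "new" survives; the rare token "typo1" is screened out as a low-frequency word.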
In step S102, known words related to the word to be annotated are queried in the preset known-word dictionary through the pre-trained word classifier, the word classifier being trained in a supervised manner.
In step S103, the related known words are set as label words of the word to be annotated, so as to annotate the word to be annotated by the label words.
In the embodiment of the present invention, the word classifier is trained in a supervised manner; the training process may refer to the steps described in Embodiment 2. The word to be annotated is input into the word classifier so as to search the known-word dictionary for known words related to it, and by setting these related known words as its label words, the annotation of the word to be annotated is completed. For example, the word to be annotated "Liu is good" is annotated by label words such as "Tsing-Hua University", "natural sciences champion", "station is on earth", "new record" and "answer and wear exam pool".
In the embodiment of the present invention, words to be annotated are searched for in the text document, known words related to each word to be annotated are queried in the known-word dictionary through the trained word classifier, and these related known words are set as label words of the word to be annotated, realizing the annotation of the word. Since the word classifier is trained in a supervised manner, words to be annotated are explained by known words, which effectively improves the efficiency of the annotation and reduces the manual effort it requires; the word classifier obtained through supervised training effectively improves the precision and recall of the annotation.
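The flow of steps S101–S103 can be sketched as follows. The classifier here is a toy stand-in (Embodiment 2 describes the real supervised training); `ToyClassifier` and `toy_features` are hypothetical names introduced only for illustration:

```python
def annotate_word(new_word, known_dictionary, classifier, extract_relation_features):
    """Set every known word the classifier deems related as a label word
    of the word to be annotated (steps S102-S103)."""
    label_words = []
    for known_word in known_dictionary:
        features = extract_relation_features(new_word, known_word)
        if classifier.predict(features):  # supervised word classifier
            label_words.append(known_word)
    return label_words

# Toy stand-ins for the trained classifier and the feature extractor:
class ToyClassifier:
    def predict(self, features):
        return features["cosine_similarity"] > 0.5

def toy_features(new_word, known_word):
    sims = {("freestyle", "rap"): 0.8, ("freestyle", "swim"): 0.2}
    return {"cosine_similarity": sims.get((new_word, known_word), 0.0)}

labels = annotate_word("freestyle", ["rap", "swim"], ToyClassifier(), toy_features)
```

Only the pair the toy classifier judges related ("rap") becomes a label word.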
Embodiment two:
Fig. 2 shows the implementation flow of the word classifier training process in the word annotation method based on label words provided by Embodiment 2 of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, detailed as follows:
In step S201, sample words are searched for in a pre-built training data set.
In the embodiment of the present invention, the training data set may be word-segmented. In the segmented training data set, words whose frequency of occurrence exceeds a preset frequency threshold and which do not appear in the known-word dictionary are searched for and set as sample words, i.e. the newly coined words in the training data set. As an example, raw data is collected on a microblog platform, and the portion of the raw data whose publication time lies in the intermediate period is set as the training data set.
In step S202, the annotation of each sample word is queried in a preset entry annotation database, the keywords of the annotation are extracted, and the keywords that appear in the known-word dictionary are set as the label words of the sample word.
In the embodiment of the present invention, the entry annotation database may be composed of the entry annotations of an online encyclopedia (such as Wikipedia or Baidu Baike), and can be downloaded from the encyclopedia website. The annotation corresponding to each sample word is queried in the entry annotation database, and sample words for which no corresponding annotation is found are removed. The keywords of the annotation corresponding to each sample word can be extracted by an existing text keyword extraction method, and the keywords that appear in the known-word dictionary are correspondingly set as the label words of the sample word.
Preferably, words related to the label words of each sample word are searched for in a preset Chinese thesaurus or on a Chinese synonym website, and these related words are also set as label words of the sample word, thereby increasing the number of label words per sample word, which is beneficial to the training effect.
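A minimal sketch of this preferred expansion, assuming a toy synonym lexicon in place of the preset Chinese thesaurus:

```python
def expand_label_words(label_words, synonym_lexicon, known_dictionary):
    """Add words related to existing label words (looked up in a
    thesaurus) as extra label words, increasing labels per sample word.
    Only related words present in the known-word dictionary qualify."""
    expanded = set(label_words)
    for word in label_words:
        for related in synonym_lexicon.get(word, ()):
            if related in known_dictionary:
                expanded.add(related)
    return expanded

# Toy thesaurus: "titleholder" is related but not a known word,
# so it is not added.
lexicon = {"champion": ["winner", "titleholder"]}
labels = expand_label_words({"champion"}, lexicon, {"champion", "winner"})
```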
Alternatively, extracting keywords from the annotation corresponding to each sample word and setting the keywords that appear in the known-word dictionary as label words of the sample word can be realized by the following steps:
(1) Perform word segmentation and part-of-speech tagging on the annotation, and extract candidate label words from the tagged annotation.
In the embodiment of the present invention, the annotation is word-segmented and part-of-speech tagged. Since content-bearing words are mostly verbs, adjectives and nouns, the words in the tagged annotation that belong to these parts of speech and appear in the known-word dictionary can be set as candidate label words.
(2) Calculate the encyclopedia word frequency of each candidate label word according to the custom weight corresponding to each part of the annotation and the frequency with which the candidate label word occurs in each part.
In the embodiment of the present invention, based on the fact that the annotations under encyclopedia entries are highly structured and carry structural markers, an encyclopedia word frequency can be designed and calculated for each candidate label word.
Specifically, a custom weight is set in advance for each part of the annotation. For example, the table of contents in an annotation is generally used to show the structure of the entry rather than to carry content, so its custom weight can be set to a small value, while the first part of the entry after the table of contents is usually a summary of the current word, so its custom weight can be set to a large value. Since the word length of each part is also related to keyword extraction — ordinarily, the shorter a text is, the more likely it is to contain keywords — the weight of each part can be redefined according to its preset custom weight. The redefinition formula may be:
α_j = (β_j / |p_j|) / Σ_{p_k ∈ P} (β_k / |p_k|)
wherein β_j is the custom weight of the jth part of the annotation, p_j is the content of the jth part (|p_j| being its word length), P is the annotation, i.e. the set of all its parts, and α_j is the weight obtained after redefinition for the jth part.
In the embodiment of the present invention, the encyclopedia word frequency of each candidate label word is then calculated according to the redefined weight of each part and the frequency with which the candidate label word occurs in each part of the annotation. The calculation formula is:
tf_baike(w_i) = Σ_{α_j ∈ A} α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_{α_j ∈ A} α_j · f(w_k, p_j)
wherein tf_baike(w_i) is the encyclopedia word frequency of the ith candidate label word w_i; w_k is the kth candidate label word; f(w_i, p_j) and f(w_k, p_j) are respectively the frequencies with which the ith and kth candidate label words occur in the jth part; p_j is the content of the jth part; Φ is the set of candidate label words; and A is the set of weights of all parts of the annotation.
(3) Calculate the inverse document frequency of each candidate label word in the known-word dictionary, and calculate the keyword score of the candidate label word according to its encyclopedia word frequency and inverse document frequency.
In the embodiment of the present invention, the inverse document frequency of each candidate label word in the known-word dictionary is calculated. The calculation formula may be:
idf(w_i, doc_corpus) = log(|doc_corpus| / |{j : w_i ∈ d_j}|)
wherein doc_corpus is the set of text documents used to generate the known-word dictionary (the known-word dictionary is obtained by segmenting these texts), |doc_corpus| is the number of texts in the set, w_i is the ith candidate label word, and |{j : w_i ∈ d_j}| is the number of texts (e.g. the number of microblogs) in the set that contain w_i.
In the embodiment of the present invention, the keyword score of each candidate label word is calculated according to its encyclopedia word frequency and inverse document frequency. The calculation formula is:
score(w_i) = tf_baike(w_i) × idf(w_i, doc_corpus)
wherein score(w_i) is the keyword score of the ith candidate label word and idf denotes the inverse document frequency.
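Steps (2) and (3) can be sketched together as follows, under one reading of the embodiment: section weights favour short parts, the encyclopedia word frequency is a section-weighted, candidate-normalised frequency, and the inverse document frequency is the standard logarithmic form. All inputs are toy data:

```python
import math

def redefined_weights(parts, custom_weights):
    """alpha_j proportional to beta_j / |p_j| (shorter parts weigh more),
    normalised over all parts of the annotation."""
    raw = [beta / len(part) for part, beta in zip(parts, custom_weights)]
    total = sum(raw)
    return [r / total for r in raw]

def encyclopedia_tf(word, parts, alphas, candidates):
    """Weighted frequency of `word` over annotation parts, normalised by
    the weighted frequency mass of all candidate label words."""
    def mass(w):
        return sum(a * part.count(w) for part, a in zip(parts, alphas))
    return mass(word) / sum(mass(c) for c in candidates)

def idf(word, corpus_texts):
    """Inverse document frequency over the texts forming the known dictionary."""
    containing = sum(1 for text in corpus_texts if word in text)
    return math.log(len(corpus_texts) / containing)

def keyword_score(word, parts, alphas, candidates, corpus_texts):
    return encyclopedia_tf(word, parts, alphas, candidates) * idf(word, corpus_texts)

# Toy annotation: a short summary part (high custom weight) and a longer
# body part (low custom weight); parts are token lists.
parts = [["summary", "champion"], ["history", "of", "the", "show", "champion"]]
alphas = redefined_weights(parts, [0.8, 0.2])
candidates = ["champion", "history"]
corpus = [["champion"], ["fun"], ["cold"], ["champion", "fun"]]
score = keyword_score("champion", parts, alphas, candidates, corpus)
```

A candidate would then become a label word when its score exceeds the preset threshold of step (4).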
(4) When the keyword score of a candidate label word exceeds a preset score threshold, the candidate label word is set as a label word of the sample word.
In step S203, the relation features between each sample word and each known word in the known-word dictionary are calculated, and the word classifier is obtained by training according to the relation features and the label words of the sample words.
In the embodiment of the present invention, before the word classifier is trained, features are first used to describe the relation between each sample word and the known words in the known-word dictionary; the word classifier is then obtained by training on these relation features and the label words of the sample words.
In the embodiment of the present invention, each sample word and each known word in the known-word dictionary is represented as a term vector v_w = {θ_1, θ_2, ..., θ_n}, where n is the number of known words in the known-word dictionary. First, all texts (e.g. microblogs) in the training data set that contain the current sample word (or known word) are found and composed into a document t_w; each component of the term vector is then a tf-idf value computed over t_w:
θ_k = tf(w_k, t_w) × idf(w_k, doc_corpus)
wherein θ_k, the kth component of the term vector, is the tf-idf of word w_k in t_w; tf(w_k, t_w) is the term frequency of w_k in t_w; and idf(w_k, doc_corpus) is the inverse document frequency of w_k over the corpus from which the known-word dictionary was built. The term frequency is calculated as:
tf(w_k, t_w) = f(w_k, t_w) / |Φ(t_w)|
wherein f(w_k, t_w) is the frequency with which w_k occurs in t_w, and Φ(t_w) is the set of all words in t_w.
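A sketch of the term-vector construction just described. The source's indexing is ambiguous between the current word and the kth known word; the reading here takes component k to be the tf-idf of the kth known word within t_w, which matches the stated dimensionality n, and takes |Φ(t_w)| as the token count of t_w:

```python
import math

def tf(word, document_tokens):
    """Term frequency: occurrences of `word` over the length of t_w."""
    return document_tokens.count(word) / len(document_tokens)

def idf(word, corpus_texts):
    containing = sum(1 for text in corpus_texts if word in text)
    return math.log(len(corpus_texts) / containing)

def term_vector(word, training_texts, known_words, corpus_texts):
    """Represent `word` as v_w = (theta_1 .. theta_n): concatenate all
    training texts containing `word` into t_w, then take the tf-idf of
    each known word in t_w (one component per known word)."""
    t_w = [tok for text in training_texts if word in text for tok in text]
    vec = []
    for w_k in known_words:
        containing = sum(1 for text in corpus_texts if w_k in text)
        component = 0.0
        if t_w and containing:
            component = tf(w_k, t_w) * idf(w_k, corpus_texts)
        vec.append(component)
    return vec

# Toy data: texts containing "dance" form t_w; known words are the axes.
vec = term_vector("dance",
                  [["hot", "topic", "dance"], ["dance", "fun"], ["cold"]],
                  ["fun", "cold"],
                  [["fun"], ["cold"], ["fun", "x"], ["y"]])
```

"fun" co-occurs with "dance", so its component is positive; "cold" never appears in t_w, so its component is zero.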
In the embodiment of the present invention, after the current sample word and each known word have been represented as term vectors, the word distance between the current sample word and each known word can be calculated as the Euclidean distance between their term vectors:
d(w_new, w_i) = sqrt(Σ_{k=1}^{n} (θ_k − θ'_k)²)
wherein w_new is the current sample word, w_i is the ith known word, θ_k is the kth component of the term vector of w_new, θ'_k is the kth component of the term vector of w_i, and d(w_new, w_i) is the word distance between w_new and w_i. Then, the word cosine similarity of w_new and w_i is calculated:
sim(w_new, w_i) = Σ_{k=1}^{n} θ_k θ'_k / (sqrt(Σ_{k=1}^{n} θ_k²) · sqrt(Σ_{k=1}^{n} θ'_k²))
wherein sim(w_new, w_i) is the word cosine similarity of w_new and w_i.
In the embodiment of the present invention, after the word cosine similarity is calculated, the word co-occurrence frequencies of the sample word and each known word in the known-word dictionary are calculated, represented by c_co/c_new and c_co/c_i, wherein c_co is the number of texts in the training data set that contain both the current sample word and the ith known word, c_new is the number of texts in the training data set that contain the current sample word, and c_i is the number of texts in the training data set that contain the ith known word. The word distance, word cosine similarity and word co-occurrence frequencies of the sample word and each known word form the relation features of the sample word and that known word.
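The relation features of one (sample word, known word) pair — word distance, word cosine similarity and the two co-occurrence ratios — can be sketched directly from the quantities just described (texts are assumed pre-segmented):

```python
import math

def relation_features(vec_new, vec_known, new_word, known_word, training_texts):
    """Feature vector describing one (sample word, known word) pair:
    word distance, word cosine similarity, and co-occurrence ratios."""
    # Euclidean distance between the two term vectors
    distance = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_new, vec_known)))
    # Cosine similarity between the two term vectors
    dot = sum(a * b for a, b in zip(vec_new, vec_known))
    norm = (math.sqrt(sum(a * a for a in vec_new))
            * math.sqrt(sum(b * b for b in vec_known)))
    cosine = dot / norm if norm else 0.0
    # Co-occurrence ratios c_co/c_new and c_co/c_i over the training texts
    c_new = sum(1 for t in training_texts if new_word in t)
    c_i = sum(1 for t in training_texts if known_word in t)
    c_co = sum(1 for t in training_texts if new_word in t and known_word in t)
    return [distance, cosine,
            c_co / c_new if c_new else 0.0,
            c_co / c_i if c_i else 0.0]

# Toy pair: identical unit vectors, co-occurring in 1 of each word's 2 texts.
feats = relation_features([1.0, 0.0], [1.0, 0.0], "a", "b",
                          [["a", "b"], ["a"], ["b"]])
```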
In the embodiment of the present invention, the relation features between each sample word and each known word, together with the label words of the sample word, are input into a preset support vector machine for training, generating the word classifier. The kernel function of the support vector machine may be a radial basis function kernel; other classification algorithms may also be selected for training.
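A minimal training sketch using scikit-learn's `SVC` (an illustrative library choice; the patent specifies only a support vector machine with an RBF kernel). Each training example is the relation-feature vector of one (sample word, known word) pair, labelled 1 when the known word is a label word of that sample word — the feature values below are invented toy data:

```python
from sklearn.svm import SVC

# Toy relation-feature vectors: [word distance, cosine similarity,
# c_co/c_new, c_co/c_i] for (sample word, known word) pairs.
X = [
    [0.1, 0.9, 0.6, 0.5],   # related pair -> known word is a label word
    [0.2, 0.8, 0.5, 0.4],
    [1.5, 0.1, 0.0, 0.0],   # unrelated pair
    [1.2, 0.2, 0.1, 0.0],
]
y = [1, 1, 0, 0]

# The embodiment names an RBF kernel; other classifiers could be swapped in.
word_classifier = SVC(kernel="rbf", gamma="scale")
word_classifier.fit(X, y)

# At annotation time, a known word whose pair is predicted 1 is set
# as a label word of the word to be annotated.
pred = word_classifier.predict([[0.15, 0.85, 0.55, 0.45]])
```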
In the embodiment of the present invention, sample words are searched for in the pre-built training data set, the label words of each sample word are extracted from the preset entry annotation database, the relation features between each sample word and each known word in the known-word dictionary are calculated, and the word classifier is obtained by training according to the label words of the sample words and these relation features. In this way, label words consisting of known words are generated automatically in a supervised manner, a large number of words to be annotated can be annotated in a short time, the precision and recall of the annotation are improved, manual effort is effectively saved, and the thoroughness of the annotation is also effectively improved.
Embodiment three:
Fig. 3 shows the structure of the word annotation apparatus provided by Embodiment 3 of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown, including:
The word searching unit 31 is configured to search an input text document for words to be annotated.
In the embodiment of the present invention, the words to be annotated are newly coined words that need to be annotated, for example words like "freestyle" that appear on network media such as microblogs and Facebook. Data is collected on such network media to obtain the input text document. As an example, raw data is collected on a microblog platform, and the most recent portion of the raw data by publication time is used as the input text document.
In the embodiment of the present invention, the text in the text document may be word-segmented. In the segmented text document, words whose frequency of occurrence exceeds a preset frequency threshold are searched for, and low-frequency words that have little effect or fall outside the range of concern are screened out. It is then detected whether the remaining words appear in the known-word dictionary or in a preset lexicon: words that appear in neither are considered newly coined words on network media, and are set as the words to be annotated.
In the embodiment of the present invention, the known-word dictionary can also be obtained by collecting data on network media. As an example, raw data is collected on a microblog platform, and the earliest portion of the raw data by publication time is set as a microblog text document. The microblog text document is word-segmented; among the segmented words, those whose frequency exceeds a preset frequency threshold are set as known words, low-frequency words that have little effect or fall outside the range of concern are screened out, and the known-word dictionary is composed of these known words.
Specifically, the segmentation method used for word segmentation may be a conditional random field, a hidden Markov model, or any of various unsupervised segmentation methods.
The related-word query unit 32 is configured to query, through the pre-trained word classifier, the preset known-word dictionary for known words related to the word to be annotated, the word classifier being trained in a supervised manner.
The word annotation unit 33 is configured to set the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by the label words.
In the embodiment of the present invention, the word classifier is trained in a supervised manner. The word to be annotated is input into the word classifier so as to search the known-word dictionary for known words related to it, and by setting these related known words as its label words, the annotation of the word to be annotated is completed. For example, the word to be annotated "Liu is good" is annotated by label words such as "Tsing-Hua University", "natural sciences champion", "station is on earth", "new record" and "answer and wear exam pool".
Preferably, as shown in Fig. 4, the word annotation apparatus based on label words further includes a sample word searching unit 41, a keyword extracting unit 42 and a classifier training unit 43, wherein:
The sample word searching unit 41 is configured to search a pre-built training data set for sample words.
In the embodiment of the present invention, the training data set may be word-segmented. In the segmented training data set, words whose frequency of occurrence exceeds a preset frequency threshold and which do not appear in the known-word dictionary are searched for and set as sample words, i.e. the newly coined words in the training data set. As an example, raw data is collected on a microblog platform, and the portion of the raw data whose publication time lies in the intermediate period is set as the training data set.
The keyword extracting unit 42 is configured to query the annotation of each sample word in the preset entry annotation database, extract the keywords of the annotation, and set the keywords that appear in the known-word dictionary as label words of the sample word.
In the embodiment of the present invention, the entry annotation database may be composed of the entry annotations of an online encyclopedia, and can be downloaded from the encyclopedia website. The annotation corresponding to each sample word is queried in the entry annotation database, and sample words for which no corresponding annotation is found are removed. The keywords of the annotation corresponding to each sample word can be extracted by an existing text keyword extraction method, and the keywords that appear in the known-word dictionary are correspondingly set as the label words of the sample word.
Preferably, words related to the label words of each sample word are searched for in a preset Chinese thesaurus or on a Chinese synonym website, and these related words are also set as label words of the sample word, thereby increasing the number of label words per sample word, which is beneficial to the training effect.
Alternatively, extracting keywords from the annotation corresponding to each sample word and setting the keywords that appear in the known-word dictionary as label words of the sample word can be realized by the following steps:
(1) Perform word segmentation and part-of-speech tagging on the annotation, and extract candidate label words from the tagged annotation.
In the embodiment of the present invention, the annotation is word-segmented and part-of-speech tagged. Since content-bearing words are mostly verbs, adjectives and nouns, the words in the tagged annotation that belong to these parts of speech and appear in the known-word dictionary can be set as candidate label words.
(2) self-defined weight, candidate's label in every partial content of annotation according to corresponding to every partial content of annotation
The frequency that word occurs, calculate encyclopaedia word frequency corresponding to candidate's label word.
In the embodiments of the present invention, based on the fact that the annotation under an encyclopedia entry is highly structured and carries structural markers, an encyclopedic word frequency may be designed, and the encyclopedic word frequency of each candidate label word is calculated.
Specifically, a self-defined weight is set in advance for each part of the annotation. For example, the table of contents in an annotation is generally used to show the structure of the entry rather than to carry content, so its self-defined weight may be set to a small value; the first part of the entry after the table of contents is usually a summary of the current term, so its self-defined weight may be set to a large value. Since the word length of each part of the annotation is also related to keyword extraction (normally, the shorter the content, the more likely it is to contain keywords), the weight of each part of the annotation may be redefined according to its preset self-defined weight; the redefinition formula may be:
α_j = (β_j / |p_j|) / Σ_{p_k ∈ P} (β_k / |p_k|)
where β_j is the self-defined weight of the j-th part of the annotation, p_j is the j-th part, |p_j| is the word length of the j-th part, P is the annotation, and α_j is the weight obtained after redefining the weight of the j-th part.
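The redefinition formula appears only as an image in the published text; the following Python sketch therefore implements one plausible form, under the stated assumption that a part's redefined weight grows with its self-defined weight β_j, shrinks with its word length |p_j|, and is normalized over all parts:

```python
def redefine_weights(parts):
    """parts: list of (beta_j, length_j) pairs for each part of the annotation.
    Assumed form (not the patent's exact formula):
        alpha_j = (beta_j / len_j) / sum_k (beta_k / len_k)
    so short, highly weighted parts end up with the largest alpha."""
    raw = [beta / length for beta, length in parts]
    total = sum(raw)
    return [r / total for r in raw]

# Toy annotation: a short summary with high beta, a long body, a table of contents.
alphas = redefine_weights([(0.8, 20), (0.5, 200), (0.1, 10)])
print([round(a, 3) for a in alphas])
```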
In the embodiments of the present invention, the encyclopedic word frequency of each candidate label word is then calculated according to the redefined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part; the calculation formula is:
tf_enc(w_i) = Σ_j α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_j α_j · f(w_k, p_j)
where tf_enc(w_i) is the encyclopedic word frequency of the i-th candidate label word w_i, w_k is the k-th candidate label word, f(w_i, p_j) and f(w_k, p_j) are respectively the frequencies with which the i-th and k-th candidate label words occur in the j-th part p_j, Φ is the set of candidate label words, and A is the set of weights of all parts of the annotation.
(3) Calculate the inverse document frequency of each candidate label word according to the known dictionary, and calculate the keyword score of the candidate label word according to its encyclopedic word frequency and inverse document frequency.
In the embodiments of the present invention, the inverse document frequency of each candidate label word in the known dictionary is calculated; the calculation formula may be:
idf(w_i) = log( |doc_corpus| / |{j : w_i ∈ d_j}| )
where doc_corpus is the collection of text documents used to generate the known dictionary (the known dictionary is obtained by performing word segmentation on these documents), |doc_corpus| is the number of texts in the collection, w_i is the i-th candidate label word, and |{j : w_i ∈ d_j}| is the number of texts containing w_i (for example, the number of microblog posts).
In the embodiments of the present invention, the keyword score of each candidate label word is calculated according to its encyclopedic word frequency and inverse document frequency; the calculation formula is:
score(w_i) = tf_enc(w_i) × idf(w_i)
where score(w_i) is the keyword score of the i-th candidate label word and idf denotes the inverse document frequency.
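Step (3) can be sketched in Python; the toy corpus below stands in for the documents used to build the known dictionary:

```python
import math

def idf(word, corpus_docs):
    """Inverse document frequency of `word` over the documents used to build
    the known dictionary: log(|corpus| / |{docs containing word}|)."""
    df = sum(1 for doc in corpus_docs if word in doc)
    return math.log(len(corpus_docs) / df)

def keyword_score(tf_enc, idf_value):
    # Keyword score of a candidate label word: encyclopedic tf times idf.
    return tf_enc * idf_value

corpus = [{"graph", "node"}, {"graph"}, {"tree"}, {"edge"}]
print(round(idf("graph", corpus), 3))  # log(4/2) = 0.693
print(round(keyword_score(0.558, idf("graph", corpus)), 3))
```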
(4) When the keyword score of a candidate label word exceeds a preset score threshold, the candidate label word is set as a label word of the sample word.
The classifier training unit 43 is configured to calculate the relation features between the sample word and each known word in the known dictionary, and to train a word classifier according to the relation features and the label words of the sample word.
In the embodiments of the present invention, before the word classifier is trained, the relation between the sample word and the known words in the known dictionary is first described by features; the word classifier is then trained using these relation features and the label words of the sample word.
In the embodiments of the present invention, the sample word and each known word in the known dictionary are respectively represented as word vectors, where a word vector may be expressed as v_w = {θ_1, θ_2, ..., θ_n}, n being the number of known words in the known dictionary. First, all texts (for example, microblog posts) containing the current sample word (or known word) are found in the training data set, and these texts form a document t_w. The term frequency-inverse document frequency of each word in t_w is then calculated; the calculation formula is:
θ_k = tf(w_k, t_w) × idf(w_k, doc_corpus)
where θ_k is the term frequency-inverse document frequency of the word w_k in t_w, which is also the k-th component of the word vector; tf(w_k, t_w) is the term frequency of w_k in t_w; and idf(w_k, doc_corpus) is the inverse document frequency of w_k over the documents used to generate the known dictionary. The calculation formula of tf(w_k, t_w) is:
tf(w_k, t_w) = f(w_k, t_w) / |Φ(t_w)|
where f(w_k, t_w) is the frequency with which w_k occurs in t_w, and Φ(t_w) is the set of all words in t_w.
In the embodiments of the present invention, after the current sample word and each known word have been represented as word vectors, the word distance between the current sample word and each known word may be calculated as the Euclidean distance; the calculation formula is:
d(w_new, w_i) = sqrt( Σ_{k=1..n} (θ_k − θ'_k)² )
where w_new is the current sample word, w_i is the i-th known word, θ_k is the k-th component of the word vector of the sample word w_new, θ'_k is the k-th component of the word vector of the known word w_i, and d(w_new, w_i) is the word distance between w_new and w_i. Then, the word cosine similarity between the sample word w_new and the known word w_i is calculated; the calculation formula is:
sim(w_new, w_i) = Σ_{k=1..n} θ_k θ'_k / ( sqrt(Σ_{k=1..n} θ_k²) · sqrt(Σ_{k=1..n} θ'_k²) )
where sim(w_new, w_i) is the word cosine similarity between the sample word w_new and the known word w_i.
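The word distance and word cosine similarity formulas above can be sketched directly in Python:

```python
import math

def word_distance(u, v):
    # Euclidean distance between two word vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Cosine similarity between two word vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

u, v = [1.0, 0.0, 2.0], [0.0, 1.0, 2.0]
print(round(word_distance(u, v), 3))      # sqrt(1 + 1 + 0) = 1.414
print(round(cosine_similarity(u, v), 3))  # 4 / 5 = 0.8
```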
In the embodiments of the present invention, after the word cosine similarity has been calculated, the word co-occurrence frequencies of the sample word and each known word in the known dictionary are calculated. The co-occurrence frequencies may be expressed as c_co/c_new and c_co/c_i, where c_co is the number of texts in the training data set that contain both the current sample word and the i-th known word, c_new is the number of texts in the training data set that contain the current sample word, and c_i is the number of texts in the training data set that contain the i-th known word. The word distance, word cosine similarity and word co-occurrence frequencies of the sample word and each known word form the relation features between the sample word and that known word.
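The co-occurrence frequencies and the assembly of the relation feature vector can be sketched as follows; the distance and similarity values plugged into the feature vector are illustrative placeholders:

```python
def cooccurrence(texts, w_new, w_i):
    """Co-occurrence frequencies c_co/c_new and c_co/c_i over the training texts."""
    c_new = sum(1 for t in texts if w_new in t)
    c_i = sum(1 for t in texts if w_i in t)
    c_co = sum(1 for t in texts if w_new in t and w_i in t)
    return c_co / c_new, c_co / c_i

texts = [{"graph", "node"}, {"graph", "tree"}, {"node", "tree"}, {"node"}]
r_new, r_i = cooccurrence(texts, "graph", "node")

# Relation feature vector: [distance, cosine similarity, c_co/c_new, c_co/c_i].
features = [1.414, 0.8, r_new, r_i]  # first two values are illustrative
print(features)
```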
In the embodiments of the present invention, the relation features between the sample word and each known word, together with the label words of the sample word, are input into a preset support vector machine for training, thereby generating the word classifier. The kernel function of the support vector machine may be an RBF kernel; other classification algorithms may also be selected for training.
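The RBF kernel mentioned above maps two relation-feature vectors to exp(−γ‖x−y‖²); a minimal stand-alone computation follows, where the value of γ is an arbitrary illustrative choice and actual training would use an SVM implementation rather than this sketch:

```python
import math

def rbf_kernel(x, y, gamma=0.5):
    """RBF kernel value exp(-gamma * ||x - y||^2) between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)

# Two toy relation-feature vectors: [distance, cosine sim, co-occurrence ratios].
x = [0.5, 0.8, 0.33]
y = [0.4, 0.7, 0.25]
print(round(rbf_kernel(x, y), 4))
```

Identical feature vectors yield a kernel value of exactly 1, and the value decays toward 0 as the vectors move apart, which is what makes the kernel usable as a similarity measure inside the SVM.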
Preferably, the classifier training unit includes:
a word vector conversion unit, configured to convert the sample word and each known word in the known dictionary into corresponding word vectors;
a relation calculation unit, configured to calculate the word distance and word cosine similarity between the sample word and each known word according to the word vector of the sample word and the word vectors of the known words, and to calculate the word co-occurrence frequencies of the sample word and the known words in the training data set; and
a relation combination unit, configured to combine the word distance, word cosine similarity and word co-occurrence frequencies into the relation features between the sample word and the known words.
In the embodiments of the present invention, sample words are searched in the pre-built training data set, the label words of the sample words are extracted from the preset entry annotation database, the relation features between each sample word and each known word in the known dictionary are calculated, and the word classifier is trained according to the label words of the sample words and the relation features between the sample words and the known words. Through the trained word classifier, the known words in the known dictionary that are related to a word to be labeled in a text document are obtained, and these known words are set as the labels of the word to be labeled in the text document. In this way, labels for words are generated automatically in a supervised manner, a large number of words to be labeled can be labeled in a short time, the precision and recall of word labeling are improved, manual effort is effectively saved, and the granularity of word labeling is also effectively improved.
In the embodiments of the present invention, each unit of the word labeling apparatus based on word labels may be implemented by a corresponding hardware or software unit. The units may be independent software or hardware units, or may be integrated into one software or hardware unit, which is not intended to limit the present invention.
Embodiment 4:
Fig. 5 shows the structure of the server provided in Embodiment 4 of the present invention. For convenience of description, only the parts related to the embodiments of the present invention are shown.
The server 5 of this embodiment of the present invention includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps in each of the above method embodiments, such as steps S101 to S103 shown in Fig. 1, or implements the functions of the units in the above apparatus embodiments, such as the functions of units 31 to 33 shown in Fig. 3.
In the embodiments of the present invention, sample words are searched in the pre-built training data set, the label words of the sample words are extracted from the preset entry annotation database, the relation features between each sample word and each known word in the known dictionary are calculated, and the word classifier is trained according to the label words of the sample words and the relation features between the sample words and the known words. When an input text document is received, the trained word classifier obtains the known words in the known dictionary that are related to the words to be labeled in the text document, and these known words are set as the labels of the words to be labeled. In this way, labels for words are generated automatically in a supervised manner, a large number of words to be labeled can be labeled in a short time, the precision and recall of word labeling are improved, manual effort is effectively saved, and the granularity of word labeling is also effectively improved.
Embodiment 5:
In the embodiments of the present invention, a computer-readable storage medium is provided. The computer-readable storage medium stores a computer program; when the computer program is executed by a processor, the steps in the above method embodiments are implemented, for example, steps S101 to S103 shown in Fig. 1. Alternatively, when executed by a processor, the computer program implements the functions of the units in the above apparatus embodiments, such as the functions of units 31 to 33 shown in Fig. 3.
In the embodiments of the present invention, sample words are searched in the pre-built training data set, the label words of the sample words are extracted from the preset entry annotation database, the relation features between each sample word and each known word in the known dictionary are calculated, and the word classifier is trained according to the label words of the sample words and the relation features between the sample words and the known words. When an input text document is received, the trained word classifier obtains the known words in the known dictionary that are related to the words to be labeled in the text document, and these known words are set as the labels of the words to be labeled. In this way, labels for words are generated automatically in a supervised manner, a large number of words to be labeled can be labeled in a short time, the precision and recall of word labeling are improved, manual effort is effectively saved, and the granularity of word labeling is also effectively improved.
The computer-readable storage medium of the embodiments of the present invention may include any entity or device capable of carrying computer program code, or a recording medium, for example, a memory such as a ROM/RAM, a magnetic disk, an optical disc or a flash memory.
The foregoing is merely illustrative of preferred embodiments of the present invention and is not intended to limit the present invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.
Claims (10)
1. A word labeling method based on word labels, characterized in that the method comprises the following steps:
searching for a word to be labeled in an input text document;
querying, by a pre-trained word classifier, a preset known dictionary for known words related to the word to be labeled, the word classifier being obtained through supervised training; and
setting the related known words as label words of the word to be labeled, so as to label the word to be labeled with the label words.
2. The method according to claim 1, characterized in that before the step of searching for a word to be labeled in the input text document, the method further comprises:
searching for sample words in a pre-built training data set;
querying a preset entry annotation database for the annotation of each sample word, extracting the keywords of the annotation, and setting the keywords that appear in the known dictionary as label words of the sample word; and
calculating relation features between the sample word and each known word in the known dictionary, and training the word classifier according to the relation features and the label words of the sample word.
3. The method according to claim 2, characterized in that the step of querying the preset entry annotation database for the annotation of the sample word, extracting the keywords of the annotation, and setting the keywords that appear in the known dictionary as label words of the sample word comprises:
querying the entry annotation database for the annotation of the sample word, performing word segmentation and part-of-speech tagging on the annotation, and extracting candidate label words from the annotation after the part-of-speech tagging;
calculating the encyclopedic word frequency of each candidate label word according to the self-defined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation;
calculating the inverse document frequency of the candidate label word according to the known dictionary, and calculating the keyword score of the candidate label word according to the encyclopedic word frequency and the inverse document frequency of the candidate label word; and
setting the candidate label word as a label word of the sample word when the keyword score of the candidate label word exceeds a preset score threshold.
4. The method according to claim 3, characterized in that the step of calculating the encyclopedic word frequency of each candidate label word according to the self-defined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation comprises:
redefining the weight of each part of the annotation according to the self-defined weight of each part, the formula for redefining the weight of the j-th part of the annotation being:
α_j = (β_j / |p_j|) / Σ_{p_k ∈ P} (β_k / |p_k|)
where β_j is the self-defined weight of the j-th part of the annotation, p_j is the j-th part, P is the annotation, and α_j is the value obtained after redefining the weight of the j-th part; and
calculating the encyclopedic word frequency of each candidate label word according to the redefined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part, the calculation formula being:
tf_enc(w_i) = Σ_j α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_j α_j · f(w_k, p_j)
where tf_enc(w_i) is the encyclopedic word frequency of the i-th candidate label word w_i, w_k is the k-th candidate label word, f(w_i, p_j) and f(w_k, p_j) are respectively the frequencies with which the i-th and k-th candidate label words occur in the j-th part p_j, Φ is the set of candidate label words, and A is the set of weights of all parts of the annotation.
5. The method according to claim 2, characterized in that the step of calculating the relation features between the sample word and each known word in the known dictionary comprises:
converting the sample word and each known word in the known dictionary into corresponding word vectors;
calculating the word distance and word cosine similarity between the sample word and each known word according to the word vector of the sample word and the word vectors of the known words, and calculating the word co-occurrence frequencies of the sample word and the known words in the training data set; and
combining the word distance, the word cosine similarity and the word co-occurrence frequencies into the relation features between the sample word and the known words.
6. A word labeling apparatus based on word labels, characterized in that the apparatus comprises:
a word searching unit, configured to search for a word to be labeled in an input text document;
a related word query unit, configured to query, by a pre-trained word classifier, a preset known dictionary for known words related to the word to be labeled, the word classifier being obtained through supervised training; and
a word labeling unit, configured to set the related known words as label words of the word to be labeled, so as to label the word to be labeled with the label words.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a sample word searching unit, configured to search for sample words in a pre-built training data set;
a keyword extraction unit, configured to query a preset entry annotation database for the annotation of each sample word, extract the keywords of the annotation, and set the keywords that appear in the known dictionary as label words of the sample word; and
a classifier training unit, configured to calculate relation features between the sample word and each known word in the known dictionary, and to train the word classifier according to the relation features and the label words of the sample word.
8. The apparatus according to claim 7, characterized in that the classifier training unit comprises:
a word vector conversion unit, configured to convert the sample word and each known word in the known dictionary into corresponding word vectors;
a relation calculation unit, configured to calculate the word distance and word cosine similarity between the sample word and each known word according to the word vector of the sample word and the word vectors of the known words, and to calculate the word co-occurrence frequencies of the sample word and the known words in the training data set; and
a relation combination unit, configured to combine the word distance, the word cosine similarity and the word co-occurrence frequencies into the relation features between the sample word and the known words.
9. A server comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that when executing the computer program, the processor implements the steps of the method according to any one of claims 1 to 5.
10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 5 are implemented.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710581312.XA CN107480200B (en) | 2017-07-17 | 2017-07-17 | Word labeling method, device, server and storage medium based on word labels |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710581312.XA CN107480200B (en) | 2017-07-17 | 2017-07-17 | Word labeling method, device, server and storage medium based on word labels |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107480200A true CN107480200A (en) | 2017-12-15 |
CN107480200B CN107480200B (en) | 2020-10-23 |
Family
ID=60595121
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710581312.XA Active CN107480200B (en) | 2017-07-17 | 2017-07-17 | Word labeling method, device, server and storage medium based on word labels |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107480200B (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109145296A (en) * | 2018-08-09 | 2019-01-04 | 新华智云科技有限公司 | A kind of general word recognition method and device based on monitor model |
CN109271392A (en) * | 2018-10-30 | 2019-01-25 | 长威信息科技发展股份有限公司 | Quick discrimination and the method and apparatus for extracting relevant database entity and attribute |
CN109344367A (en) * | 2018-10-24 | 2019-02-15 | 厦门美图之家科技有限公司 | Region mask method, device and computer readable storage medium |
CN109522424A (en) * | 2018-10-16 | 2019-03-26 | 北京达佳互联信息技术有限公司 | Processing method, device, electronic equipment and the storage medium of data |
CN109740157A (en) * | 2018-12-29 | 2019-05-10 | 贵州小爱机器人科技有限公司 | The label of working individual determines method, apparatus and computer storage medium |
CN109816047A (en) * | 2019-02-19 | 2019-05-28 | 北京达佳互联信息技术有限公司 | Method, apparatus, equipment and the readable storage medium storing program for executing of label are provided |
CN110276064A (en) * | 2018-03-14 | 2019-09-24 | 普天信息技术有限公司 | A kind of part-of-speech tagging method and device |
CN110991181A (en) * | 2019-11-29 | 2020-04-10 | 腾讯科技(深圳)有限公司 | Method and apparatus for enhancing labeled samples |
CN113177109A (en) * | 2021-05-27 | 2021-07-27 | 中国平安人寿保险股份有限公司 | Text weak labeling method, device, equipment and storage medium |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102508923A (en) * | 2011-11-22 | 2012-06-20 | 北京大学 | Automatic video annotation method based on automatic classification and keyword marking |
US20160042427A1 (en) * | 2011-04-06 | 2016-02-11 | Google Inc. | Mining For Product Classification Structures For Internet-Based Product Searching |
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160042427A1 (en) * | 2011-04-06 | 2016-02-11 | Google Inc. | Mining For Product Classification Structures For Internet-Based Product Searching |
CN102508923A (en) * | 2011-11-22 | 2012-06-20 | 北京大学 | Automatic video annotation method based on automatic classification and keyword marking |
Non-Patent Citations (6)
Title |
---|
Mohammad Hossein Elahimanesh et al.: "ACUT: An Associative Classifier Approach to Unknown Word POS Tagging", International Symposium on Artificial Intelligence and Signal Processing * |
Yuzhi Liang et al.: "New Word Detection and Tagging on Chinese Twitter Stream", International Conference on Big Data Analytics and Knowledge Discovery * |
Liu Yaofeng et al.: "Chinese Word Segmentation and Part-of-Speech Tagging Model", Computer Engineering * |
Jiang Wei et al.: "Part-of-Speech Tagging Model Based on Conditional Random Fields", Computer Engineering and Applications * |
Wang Ayuan et al.: "Research on Expansion Term Extraction Algorithms in Query Expansion", Sciencepaper Online * |
Guo Zhen et al.: "A Character-Based Joint Model for Chinese Word Segmentation, POS Tagging and Dependency Parsing", Journal of Chinese Information Processing * |
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110276064B (en) * | 2018-03-14 | 2023-06-23 | 普天信息技术有限公司 | Part-of-speech tagging method and device |
CN110276064A (en) * | 2018-03-14 | 2019-09-24 | 普天信息技术有限公司 | A kind of part-of-speech tagging method and device |
CN109145296A (en) * | 2018-08-09 | 2019-01-04 | 新华智云科技有限公司 | A kind of general word recognition method and device based on monitor model |
CN109522424A (en) * | 2018-10-16 | 2019-03-26 | 北京达佳互联信息技术有限公司 | Processing method, device, electronic equipment and the storage medium of data |
CN109344367A (en) * | 2018-10-24 | 2019-02-15 | 厦门美图之家科技有限公司 | Region mask method, device and computer readable storage medium |
CN109344367B (en) * | 2018-10-24 | 2022-11-01 | 厦门美图之家科技有限公司 | Region labeling method and device and computer readable storage medium |
CN109271392A (en) * | 2018-10-30 | 2019-01-25 | 长威信息科技发展股份有限公司 | Quick discrimination and the method and apparatus for extracting relevant database entity and attribute |
CN109740157A (en) * | 2018-12-29 | 2019-05-10 | 贵州小爱机器人科技有限公司 | The label of working individual determines method, apparatus and computer storage medium |
CN109816047A (en) * | 2019-02-19 | 2019-05-28 | 北京达佳互联信息技术有限公司 | Method, apparatus, equipment and the readable storage medium storing program for executing of label are provided |
CN109816047B (en) * | 2019-02-19 | 2022-05-24 | 北京达佳互联信息技术有限公司 | Method, device and equipment for providing label and readable storage medium |
CN110991181A (en) * | 2019-11-29 | 2020-04-10 | 腾讯科技(深圳)有限公司 | Method and apparatus for enhancing labeled samples |
CN110991181B (en) * | 2019-11-29 | 2023-03-31 | 腾讯科技(深圳)有限公司 | Method and apparatus for enhancing labeled samples |
CN113177109A (en) * | 2021-05-27 | 2021-07-27 | 中国平安人寿保险股份有限公司 | Text weak labeling method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN107480200B (en) | 2020-10-23 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107480200A (en) | Word mask method, device, server and the storage medium of word-based label | |
CN111177365B (en) | Unsupervised automatic abstract extraction method based on graph model | |
US11216504B2 (en) | Document recommendation method and device based on semantic tag | |
CN106997382B (en) | Innovative creative tag automatic labeling method and system based on big data | |
CN106649818B (en) | Application search intention identification method and device, application search method and server | |
Huston et al. | Evaluating verbose query processing techniques | |
CN106776574B (en) | User comment text mining method and device | |
CN107608999A (en) | A kind of Question Classification method suitable for automatically request-answering system | |
CN106126619A (en) | A kind of video retrieval method based on video content and system | |
CN110347790B (en) | Text duplicate checking method, device and equipment based on attention mechanism and storage medium | |
CN113569050B (en) | Method and device for automatically constructing government affair field knowledge map based on deep learning | |
CN112069312B (en) | Text classification method based on entity recognition and electronic device | |
Man | Feature extension for short text categorization using frequent term sets | |
Barriere et al. | TerminoWeb: a software environment for term study in rich contexts | |
CN111291177A (en) | Information processing method and device and computer storage medium | |
CN108228612B (en) | Method and device for extracting network event keywords and emotional tendency | |
Wang et al. | Neural related work summarization with a joint context-driven attention mechanism | |
CN105786971B (en) | A kind of grammer point recognition methods towards international Chinese teaching | |
Singh et al. | Writing Style Change Detection on Multi-Author Documents. | |
Gong et al. | A semantic similarity language model to improve automatic image annotation | |
CN112000929A (en) | Cross-platform data analysis method, system, equipment and readable storage medium | |
Ohta et al. | CRF-based bibliography extraction from reference strings focusing on various token granularities | |
Pu et al. | A vision-based approach for deep web form extraction | |
Tan et al. | Sentiment analysis of chinese short text based on multiple features | |
Chen | Natural language processing in web data mining |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||