CN107480200A - Word annotation method, apparatus, server and storage medium based on word labels - Google Patents

Word annotation method, apparatus, server and storage medium based on word labels Download PDF

Info

Publication number
CN107480200A
CN107480200A (application CN201710581312.XA; granted as CN107480200B)
Authority
CN
China
Prior art keywords
word
label
annotation
sample
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710581312.XA
Other languages
Chinese (zh)
Other versions
CN107480200B (en)
Inventor
梁予之
曲强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Institute of Advanced Technology of CAS
Original Assignee
Shenzhen Institute of Advanced Technology of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Institute of Advanced Technology of CAS filed Critical Shenzhen Institute of Advanced Technology of CAS
Priority to CN201710581312.XA priority Critical patent/CN107480200B/en
Publication of CN107480200A publication Critical patent/CN107480200A/en
Application granted granted Critical
Publication of CN107480200B publication Critical patent/CN107480200B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374 Thesaurus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The present invention is applicable to the field of computer technology and provides a word annotation method, apparatus, server and storage medium based on word labels. The method includes: searching an input text document for words to be annotated; querying, via a pre-trained word classifier, a preset known dictionary for known words related to each word to be annotated; and setting the related known words as label words of the word to be annotated, so that the word is annotated by its label words. The word classifier is trained in a supervised manner. By training the classifier with supervision and using known words as label words, words to be annotated are labeled automatically, which effectively improves the efficiency of annotation, reduces the manual effort it requires, and improves the precision and recall of the annotation.

Description

Word annotation method, apparatus, server and storage medium based on word labels
Technical field
The invention belongs to the field of computer technology, and in particular relates to a word annotation method, apparatus, server and storage medium based on word labels.
Background technology
With social media flourishing, many new words are coined on network media such as Weibo and Facebook, and these newly born words are used more and more in everyday life. When such a word first appears, it is difficult to obtain an annotation for it in time: dictionaries and online encyclopedias (such as Wikipedia) have no entry for it yet, and manually creating an entry for every new word requires a large amount of tedious work.
At present, research on word annotation focuses mostly on part-of-speech (POS) tagging, i.e. presetting a few classes (such as person, place, organization name) and assigning each target word to one or several of them. POS tagging methods are mature and fairly accurate. However, for new-media words awaiting annotation, merely assigning them to a limited set of classes is not enough to understand their meaning, particularly because many of these words are tied to trending events.
Tagging methods have been widely applied in fields such as photo description and document description, but research on annotating words with labels is still very limited. An existing method that annotates words with label words uses an unsupervised algorithm: based on microblog data, it represents each known word and each target word as a vector, computes the cosine similarity between them, and sets the most similar known words as the target word's labels. However, the unsupervised algorithm suffers from lack of guidance, an overly simple single assumption, and the need to set thresholds manually, which hurts the precision and recall of the labeling system.
The content of the invention
It is an object of the present invention to provide a word annotation method, apparatus, server and storage medium based on word labels, aiming to solve the prior-art problems that, when annotating new words, the classes available for partitioning them are limited and the partitioning process lacks guidance, resulting in low annotation efficiency and accuracy.
In one aspect, the invention provides a word annotation method based on word labels, the method comprising the steps of:
searching an input text document for a word to be annotated;
querying, via a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier having been trained in a supervised manner; and
setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by means of its label words.
In another aspect, the invention provides a word annotation apparatus based on word labels, the apparatus comprising:
a word searching unit for searching an input text document for a word to be annotated;
a related-word query unit for querying, via a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier having been trained in a supervised manner; and
a word annotation unit for setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by means of its label words.
In another aspect, the invention also provides a server comprising a memory, a processor, and a computer program stored in the memory and runnable on the processor, the processor implementing the steps of the above word annotation method based on word labels when executing the computer program.
In another aspect, the invention also provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the above word annotation method based on word labels.
The present invention searches a pre-built text document for words to be annotated, queries a preset known dictionary, via a pre-trained word classifier, for known words related to each such word, and sets the related known words as its label words so that the word is annotated by them. Since the classifier is trained in a supervised manner, known words can serve as labels and words to be annotated are labeled automatically, which effectively improves annotation efficiency, reduces the manual effort of annotation and, thanks to the supervised training, improves the precision and recall of the annotation.
Brief description of the drawings
Fig. 1 is a flowchart of the word annotation method based on word labels provided by Embodiment 1 of the present invention;
Fig. 2 is a flowchart of the word classifier training process in the word annotation method based on word labels provided by Embodiment 2 of the present invention;
Fig. 3 is a schematic structural diagram of the word annotation apparatus based on word labels provided by Embodiment 3 of the present invention;
Fig. 4 is a schematic diagram of a preferred structure of the word annotation apparatus based on word labels provided by Embodiment 3 of the present invention; and
Fig. 5 is a schematic structural diagram of the server provided by Embodiment 4 of the present invention.
Embodiment
In order to make the objects, technical solutions and advantages of the present invention clearer, the invention is further elaborated below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, not to limit it.
Specific implementations of the present invention are described in detail below in conjunction with specific embodiments:
Embodiment one:
Fig. 1 shows the implementation flow of the word annotation method based on word labels provided by Embodiment 1 of the present invention. For ease of description, only the parts related to this embodiment are shown. The details are as follows:
In step S101, a word to be annotated is searched for in an input text document.
In an embodiment of the present invention, a word to be annotated is a newly coined word that needs annotating, such as words like "freestyle" that appear on network media such as Weibo or Facebook. Data are collected on such media to provide the input text document. As an example, raw data are collected from a microblog platform, and the portion with the most recent posting times serves as the input text document.
In an embodiment of the present invention, the text of the text document can be segmented into words; in the segmented document, the words whose frequency of occurrence exceeds a preset frequency threshold are retained, screening out low-frequency words that carry little meaning or lie outside the range of interest (e.g. misspellings and personal names). It is then checked whether the remaining words appear in the known dictionary or in a preset lexicon; a word that appears in neither is considered a new word from network media, and such words are set as the words to be annotated.
In an embodiment of the present invention, the known dictionary can likewise be obtained by collecting data from network media. As an example, raw data are collected from a microblog platform, the portion with the earliest posting times is taken as a microblog word document, and that document is segmented; among the segmented words, those whose frequency exceeds the preset frequency threshold are set as known words, low-frequency words of little interest are screened out, and the known words make up the known dictionary.
Specifically, the segmentation method can be a conditional random field, a hidden Markov model, or any of various unsupervised segmentation methods.
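The candidate-detection step above (segment, count, filter against the known dictionary) can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented into a token list; in practice a CRF- or HMM-based segmenter would produce the tokens, and the threshold value here is arbitrary.

```python
from collections import Counter

def find_candidate_words(tokens, known_dict, freq_threshold=5):
    """Return tokens frequent enough to matter but absent from the known dictionary."""
    counts = Counter(tokens)
    return {w for w, c in counts.items()
            if c >= freq_threshold and w not in known_dict}

# Toy data: tokens from recent posts; "freestyle" plays the new word here.
tokens = ["freestyle"] * 6 + ["hello"] * 10 + ["typo1"]
known = {"hello"}
print(find_candidate_words(tokens, known, freq_threshold=5))  # {'freestyle'}
```

The frequency filter is what screens out misspellings and rare names before the dictionary lookup.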
In step S102, a preset known dictionary is queried, via a pre-trained word classifier, for known words related to the word to be annotated; the word classifier is trained in a supervised manner.
In step S103, the related known words are set as label words of the word to be annotated, so that the word to be annotated is annotated by its label words.
In an embodiment of the present invention, the word classifier is trained in a supervised manner; the training process is described step by step in Embodiment 2. The word to be annotated is fed into the classifier to find the known words related to it in the known dictionary, and these related known words are set as its label words, completing the annotation. For example, the word to be annotated "Liu is good" can be annotated with label words such as "Tsinghua", "natural sciences champion", "station is on earth", "new record" and "answer and wear exam pool".
In an embodiment of the present invention, a word to be annotated is found in the text document, a trained word classifier queries the known dictionary for the known words related to it, and these related known words are set as its label words, achieving the annotation. Since the classifier is trained in a supervised manner, a word to be annotated is explained through known words, which effectively improves annotation efficiency, reduces the manual effort of annotation and, thanks to the supervised training, improves the precision and recall of the annotation.
Embodiment two:
Fig. 2 show the embodiment of the present invention two provide word-based label word mask method in word's kinds device train The implementation process of process, for convenience of description, the part related to the embodiment of the present invention is illustrate only, details are as follows:
In step s 201, concentrated in the training data built in advance and search sample word.
In embodiments of the present invention, word segmentation processing, the training dataset after word segmentation processing can be carried out to training dataset These vocabulary are arranged to sample by the middle vocabulary for searching the frequency of occurrences and exceeding predeterminated frequency threshold value and not appearing in known dictionary The newborn word that this word, i.e. training data are concentrated.As illustratively, initial data is collected in microblog, during by issuing Between positioned at a part of initial data of intermediate period be arranged to training dataset.
In step S202, the annotation of each sample word is queried in a preset entry annotation database, the keywords of the annotation are extracted, and the keywords that appear in the known dictionary are set as label words of the sample word.
In an embodiment of the present invention, the entry annotation database can be composed of the entry annotations of an online encyclopedia (such as Wikipedia or Baidu Baike) and can be downloaded from the encyclopedia's website. The annotation corresponding to each sample word is queried in the database, and sample words for which no annotation is found are removed. The keywords of each sample word's annotation can be extracted with an existing text keyword extraction method, and those keywords that appear in the known dictionary are set as the sample word's label words.
Preferably, words related to a sample word's label words are additionally searched for in a preset Chinese thesaurus or synonym website and are also set as label words of the sample word, which increases the number of label words per sample word and benefits training.
Alternatively, extracting the keywords of each sample word's annotation and setting those that appear in the known dictionary as the sample word's label words can be achieved by the following steps:
(1) The annotation is segmented and POS-tagged, and candidate label words are extracted from the tagged annotation.
In an embodiment of the present invention, the annotation is segmented and POS-tagged. Since content-bearing words are mostly verbs, adjectives and nouns, the words in the tagged annotation that belong to these parts of speech and also appear in the known dictionary can be set as candidate label words.
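A minimal sketch of step (1), assuming a POS tagger has already produced (word, tag) pairs; the single-letter tag scheme ('n' noun, 'v' verb, 'a' adjective, 'u' particle) and the sample data are hypothetical.

```python
def candidate_label_words(tagged_annotation, known_dict, content_pos=("n", "v", "a")):
    """Keep words whose POS carries content and that appear in the known dictionary."""
    return [w for w, pos in tagged_annotation
            if pos in content_pos and w in known_dict]

tagged = [("Tsinghua", "n"), ("of", "u"), ("champion", "n"), ("run", "v"), ("zzz", "n")]
known = {"Tsinghua", "champion", "run"}
print(candidate_label_words(tagged, known))  # ['Tsinghua', 'champion', 'run']
```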
(2) The encyclopedia word frequency of each candidate label word is computed from the custom weight of each part of the annotation and the frequency with which the candidate label word occurs in each part.
In an embodiment of the present invention, since the annotation under an encyclopedia entry is highly structured and carries structural markup, an encyclopedia word frequency can be designed and computed for each candidate label word.
Specifically, a custom weight is set in advance for each part of the annotation. For example, the table of contents of an annotation generally reveals the entry's structure rather than carrying content, so its custom weight can be set to a small value, whereas the first part after the table of contents is usually a summary of the current term, so its custom weight can be set to a larger value. Because the length of each part also matters for keyword extraction (shorter content is more likely to contain keywords), the preset custom weights are then redefined per part, where β_j is the custom weight of the j-th part p_j of the annotation and α_j is the weight obtained for p_j after redefinition.
In an embodiment of the present invention, the encyclopedia word frequency of each candidate label word is then computed from the redefined per-part weights and the frequency with which the candidate label word appears in each part of the annotation:

btf(w_i) = Σ_j α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_j α_j · f(w_k, p_j)

where btf(w_i) is the encyclopedia word frequency of the i-th candidate label word w_i; f(w_i, p_j) and f(w_k, p_j) are the frequencies with which the i-th and k-th candidate label words occur in the j-th part p_j; Φ is the set of candidate label words; and A = {α_j} is the set of weights of all parts of the annotation.
(3) The inverse document frequency of each candidate label word is computed over the known dictionary, and the keyword score of each candidate label word is computed from its encyclopedia word frequency and inverse document frequency.
In an embodiment of the present invention, the inverse document frequency of each candidate label word over the known dictionary can be computed as

idf(w_i, doc_corpus) = log( |doc_corpus| / |{ j : w_i ∈ d_j }| )

where doc_corpus is the collection of texts from which the known dictionary is generated (the known dictionary is obtained by segmenting these texts), |doc_corpus| is the number of texts in the collection, w_i is the i-th candidate label word, and |{ j : w_i ∈ d_j }| is the number of texts (e.g. microblog posts) that contain w_i.
In an embodiment of the present invention, the keyword score of each candidate label word is computed from its encyclopedia word frequency and inverse document frequency:

score(w_i) = btf(w_i) × idf(w_i, doc_corpus)

where score(w_i) is the keyword score of the i-th candidate label word and idf denotes the inverse document frequency.
(4) When a candidate label word's keyword score exceeds a preset score threshold, the candidate label word is set as a label word of the sample word.
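Steps (2)-(4) can be sketched as below. The btf normalization is a reconstruction from the definitions given in the text (the original formulas are rendered as images in the patent), so treat its exact form as an assumption; annotation parts are token lists and corpus documents are word sets.

```python
import math

def btf(word, parts, weights, candidates):
    """Encyclopedia word frequency: weighted frequency of `word` across annotation
    parts, normalized over all candidate label words (reconstructed formula)."""
    num = sum(a * part.count(word) for a, part in zip(weights, parts))
    den = sum(a * part.count(w)
              for w in candidates
              for a, part in zip(weights, parts))
    return num / den if den else 0.0

def idf(word, docs):
    """Inverse document frequency over the texts behind the known dictionary."""
    n_containing = sum(1 for d in docs if word in d)
    return math.log(len(docs) / n_containing) if n_containing else 0.0

def keyword_labels(candidates, parts, weights, docs, threshold):
    """Step (4): keep candidates whose score btf * idf exceeds the threshold."""
    return [w for w in candidates
            if btf(w, parts, weights, candidates) * idf(w, docs) > threshold]
```

A word that is frequent in the annotation but rare in the corpus scores highest, which is the usual tf-idf intuition applied to encyclopedia parts.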
In step S203, the relation features between each sample word and each known word in the known dictionary are computed, and the word classifier is trained from these relation features and the sample words' label words.
In an embodiment of the present invention, before the word classifier is trained, features are first used to describe the relation between a sample word and the known words in the known dictionary; the classifier is then trained from these relation features and the sample words' label words.
In an embodiment of the present invention, each sample word and each known word in the known dictionary is represented as a term vector v_w = {θ_1, θ_2, ..., θ_n}, where n is the number of known words in the known dictionary. All texts (e.g. microblog posts) in the training data set that contain the current sample word (or known word) are first collected into a document t_w, and the tf-idf weight of each word in t_w is computed:

θ_k = tf(w_k, t_w) × idf(w_k, doc_corpus)

where θ_k, the k-th component of the term vector, is the tf-idf weight of w_k in t_w; tf(w_k, t_w) is the term frequency of w_k in t_w; and idf(w_k, doc_corpus) is the inverse document frequency of w_k over the known dictionary. The term frequency is computed as

tf(w_k, t_w) = f(w_k, t_w) / Σ_{w ∈ Φ(t_w)} f(w, t_w)

where f(w_k, t_w) is the number of times w_k occurs in t_w and Φ(t_w) is the set of all words in t_w.
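A sketch of the term-vector construction under the same assumptions: the context document t_w is a token list, the known dictionary is an ordered list of known words, and corpus documents are word sets. The vector components are the tf-idf weights defined above.

```python
import math
from collections import Counter

def term_vector(context_tokens, known_words, corpus_docs):
    """tf-idf vector of one word's aggregated context document t_w,
    with one component per known-dictionary word."""
    counts = Counter(context_tokens)
    total = sum(counts.values())
    vec = []
    for w in known_words:
        tf = counts[w] / total if total else 0.0
        df = sum(1 for d in corpus_docs if w in d)
        idf = math.log(len(corpus_docs) / df) if df else 0.0
        vec.append(tf * idf)
    return vec
```

Words absent from the context or from the corpus simply get a zero component, so every vector has the same dimension n.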
In an embodiment of the present invention, after the current sample word and each known word have been represented as term vectors, the word distance between them can be computed as the Euclidean distance

d(w_new, w_i) = sqrt( Σ_k (θ_k − θ'_k)² )

where w_new is the current sample word, w_i is the i-th known word, θ_k is the k-th component of w_new's term vector, θ'_k is the k-th component of w_i's term vector, and d(w_new, w_i) is their word distance. The word cosine similarity of w_new and w_i is then computed as

sim(w_new, w_i) = Σ_k θ_k θ'_k / ( sqrt(Σ_k θ_k²) · sqrt(Σ_k θ'_k²) )

where sim(w_new, w_i) is the word cosine similarity of sample word w_new and known word w_i.
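Both vector comparisons are standard; a direct stdlib sketch:

```python
import math

def euclidean(u, v):
    """Word distance d(w_new, w_i) between two term vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine(u, v):
    """Word cosine similarity sim(w_new, w_i) between two term vectors."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (nu * nv) if nu and nv else 0.0

print(euclidean([3.0, 4.0], [0.0, 0.0]))  # 5.0
print(cosine([1.0, 0.0], [1.0, 0.0]))     # 1.0
```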
In an embodiment of the present invention, after the word cosine similarity has been computed, the co-occurrence frequencies of the sample word and each known word in the known dictionary are computed as c_co/c_new and c_co/c_i, where c_co is the number of texts in the training data set that contain both the current sample word and the i-th known word, c_new is the number of texts in the training data set that contain the current sample word, and c_i is the number of texts in the training data set that contain the i-th known word. The word distance, the word cosine similarity and the two word co-occurrence frequencies together form the relation features between the sample word and each known word.
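Putting the pieces together, each (sample word, known word) pair yields a small feature tuple: word distance, cosine similarity, and the two co-occurrence rates c_co/c_new and c_co/c_i. The four-dimensional layout is an assumption based on the features the text enumerates.

```python
import math

def relation_features(vec_new, vec_known, c_co, c_new, c_known):
    """Relation features between one sample word and one known word:
    (distance, cosine similarity, c_co/c_new, c_co/c_known)."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_new, vec_known)))
    dot = sum(a * b for a, b in zip(vec_new, vec_known))
    norm = (math.sqrt(sum(a * a for a in vec_new))
            * math.sqrt(sum(b * b for b in vec_known)))
    cos = dot / norm if norm else 0.0
    return (dist,
            cos,
            c_co / c_new if c_new else 0.0,
            c_co / c_known if c_known else 0.0)
```

These tuples, paired with a binary label ("is this known word a label word of the sample word?"), form the supervised training set.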
In an embodiment of the present invention, the relation features between the sample words and each known word, together with the sample words' label words, are fed into a preset support vector machine for training, generating the word classifier; the kernel function of the support vector machine can be a radial basis function kernel, and other classification algorithms may also be selected for the training.
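The text names a radial basis function kernel for the support vector machine; in practice one would hand the relation-feature vectors to an off-the-shelf SVM (for example scikit-learn's SVC(kernel="rbf")). The kernel itself is simple to state; the gamma value below is an arbitrary choice for illustration.

```python
import math

def rbf_kernel(u, v, gamma=0.5):
    """Radial basis function kernel K(u, v) = exp(-gamma * ||u - v||^2),
    the kernel suggested for the word classifier's SVM."""
    sq = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-gamma * sq)

print(rbf_kernel([1.0, 2.0], [1.0, 2.0]))  # 1.0
```

K is 1 for identical feature vectors and decays toward 0 as the vectors move apart, which is what lets the SVM draw nonlinear boundaries over the relation features.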
In an embodiment of the present invention, sample words are found in the pre-built training data set, their label words are extracted from the preset entry annotation database, the relation features between each sample word and each known word in the known dictionary are computed, and the word classifier is trained from the sample words' label words and these relation features. Known-word labels are thereby generated automatically in a supervised manner, so that a large number of words to be annotated can be labeled in a short time; the precision and recall of the annotation are improved, manual effort is effectively saved, and the granularity of the annotation is also effectively refined.
Embodiment three:
Fig. 3 shows the structure of the word annotation apparatus provided by Embodiment 3 of the present invention. For ease of description, only the parts related to this embodiment are shown, including:
A word searching unit 31 for searching an input text document for a word to be annotated.
In an embodiment of the present invention, a word to be annotated is a newly coined word that needs annotating, such as words like "freestyle" that appear on network media such as Weibo or Facebook. Data are collected on such media to provide the input text document. As an example, raw data are collected from a microblog platform, and the portion with the most recent posting times serves as the input text document.
In an embodiment of the present invention, the text of the text document can be segmented; among the segmented words, those whose frequency of occurrence exceeds a preset frequency threshold are retained, low-frequency words that carry little meaning or lie outside the range of interest are screened out, and it is then checked whether the remaining words appear in the known dictionary or in a preset lexicon. A word that appears in neither is considered a new word from network media and is set as a word to be annotated.
In an embodiment of the present invention, the known dictionary can likewise be obtained by collecting data from network media. As an example, raw data are collected from a microblog platform, the portion with the earliest posting times is taken as a microblog word document and segmented; among the segmented words, those whose frequency exceeds the preset frequency threshold are set as known words, low-frequency words of little interest are screened out, and the known words make up the known dictionary.
Specifically, the segmentation method can be a conditional random field, a hidden Markov model, or any of various unsupervised segmentation methods.
A related-word query unit 32 for querying, via a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier being trained in a supervised manner.
A word annotation unit 33 for setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated by its label words.
In an embodiment of the present invention, the word classifier is trained in a supervised manner. The word to be annotated is fed into the classifier to find the known words related to it in the known dictionary, and these related known words are set as its label words, completing the annotation. For example, the word to be annotated "Liu is good" can be annotated with label words such as "Tsinghua", "natural sciences champion", "station is on earth", "new record" and "answer and wear exam pool".
Preferably, as shown in Fig. 4, the word annotation apparatus based on word labels further includes a sample word searching unit 41, a keyword extracting unit 42 and a classifier training unit 43, wherein:
The sample word searching unit 41 searches for sample words in the pre-built training data set.
In an embodiment of the present invention, the training data set can be segmented; among the segmented words, those whose frequency of occurrence exceeds the preset frequency threshold and that do not appear in the known dictionary are set as sample words, i.e. the new words in the training data. As an example, raw data are collected from a microblog platform, and the portion whose posting times fall in the intermediate period is set as the training data set.
The keyword extracting unit 42 queries the annotation of each sample word in the preset entry annotation database, extracts the keywords of the annotation, and sets the keywords that appear in the known dictionary as label words of the sample word.
In an embodiment of the present invention, the entry annotation database can be composed of the entry annotations of an online encyclopedia and can be downloaded from the encyclopedia's website. The annotation corresponding to each sample word is queried in the database, and sample words for which no annotation is found are removed. The keywords of each sample word's annotation can be extracted with an existing text keyword extraction method, and those keywords that appear in the known dictionary are set as the sample word's label words.
Preferably, words related to a sample word's label words are additionally searched for in a preset Chinese thesaurus or synonym website and are also set as label words of the sample word, which increases the number of label words per sample word and benefits training.
Alternatively, extracting the keywords from the annotation of each sample word and setting the keywords that appear in the known dictionary as label words of the sample word may be implemented by the following steps:
(1) Perform word segmentation and part-of-speech tagging on the annotation, and extract candidate label words from the tagged annotation.
In embodiments of the present invention, word segmentation and part-of-speech tagging are performed on the annotation. Since content-bearing words are mostly verbs, adjectives and nouns, the words in the tagged annotation that belong to these parts of speech and appear in the known dictionary may be set as candidate label words.
(2) Calculate the encyclopedia word frequency of each candidate label word according to the self-defined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation.
In embodiments of the present invention, since the annotation under an encyclopedia entry is highly structured and carries structural markers, an encyclopedia word frequency may be designed and calculated for each candidate label word.
Specifically, a self-defined weight is set in advance for each part of the annotation. For example, the table of contents in an annotation is generally used to show the structure of the entry rather than to carry content, so its self-defined weight may be set to a small value; the first part of content after the table of contents is usually a summary of the current word, so its self-defined weight may be set to a large value. Since the word length of each part of the annotation is also relevant to keyword extraction (the shorter a part is, the more likely it is to contain keywords), the weight of each part of the annotation may be redefined from its preset self-defined weight. The redefinition formula may be:
α_j = (β_j / |p_j|) / Σ_{p_k ∈ P} (β_k / |p_k|)
where β_j is the self-defined weight of the j-th part of the annotation, p_j is the j-th part, |p_j| is its word length, P is the annotation (the set of its parts), and α_j is the weight obtained after redefining the weight of the j-th part.
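Under the length-normalised reading of the weight redefinition above, the computation can be sketched in a few lines of Python (function name and toy data are illustrative assumptions, not from the patent):

```python
def redefine_weights(parts, custom_weights):
    """Length-normalise the self-defined part weights beta_j: shorter parts
    (more likely to contain keywords) receive proportionally larger alpha_j."""
    raw = [beta / len(part) for beta, part in zip(custom_weights, parts)]
    total = sum(raw)
    return [r / total for r in raw]  # the alphas sum to 1

parts = [["summary"] * 2, ["body"] * 8]  # a short part and a long part
alphas = redefine_weights(parts, [1.0, 1.0])
print(alphas)  # [0.8, 0.2]
```

With equal self-defined weights, the part that is four times shorter receives four times the redefined weight.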
In embodiments of the present invention, the encyclopedia word frequency of each candidate label word is then calculated from the redefined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part. The calculation formula is:
btf(w_i) = Σ_{p_j ∈ P} α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_{p_j ∈ P} α_j · f(w_k, p_j)
where btf(w_i) is the encyclopedia word frequency of the i-th candidate label word w_i, w_k is the k-th candidate label word, f(w_i, p_j) and f(w_k, p_j) are respectively the frequencies with which the i-th and k-th candidate label words occur in the j-th part p_j, Φ is the set of candidate label words, and A = {α_j} is the set of weights of all parts of the annotation.
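Read this way, the encyclopedia word frequency is the α-weighted occurrence count of a candidate, normalised across all candidates. A toy Python version (names and data are illustrative):

```python
def encyclopedia_tf(candidates, parts, alphas):
    """btf(w_i): alpha-weighted occurrences of candidate w_i over the parts,
    normalised by the weighted occurrences of all candidates."""
    def weighted(w):
        return sum(a * p.count(w) for a, p in zip(alphas, parts))
    total = sum(weighted(w) for w in candidates)
    return {w: weighted(w) / total for w in candidates}

parts = [["net", "bank"], ["bank", "bank", "loan"]]
btf = encyclopedia_tf(["bank", "loan"], parts, alphas=[0.7, 0.3])
print(btf)  # bank ≈ 0.8125, loan ≈ 0.1875
```

"bank" dominates because it occurs in both parts, including the heavily weighted first one.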
(3) Calculate the inverse document frequency of each candidate label word from the known dictionary, and calculate the keyword score of the candidate label word from its encyclopedia word frequency and inverse document frequency.
In embodiments of the present invention, the inverse document frequency of each candidate label word with respect to the known dictionary is calculated. The calculation formula may be:
idf(w_i, doc_corpus) = log( |doc_corpus| / |{j : w_i ∈ d_j}| )
where doc_corpus is the set of text documents used to generate the known dictionary (the known dictionary is obtained by segmenting these documents), |doc_corpus| is the number of texts in this document set, w_i is the i-th candidate label word, and |{j : w_i ∈ d_j}| is the number of texts containing w_i (e.g. the number of microblogs).
In embodiments of the present invention, the keyword score of each candidate label word is calculated from its encyclopedia word frequency and inverse document frequency. The calculation formula is:
score(w_i) = btf(w_i) × idf(w_i, doc_corpus)
where score(w_i) is the keyword score of the i-th candidate label word and idf denotes the inverse document frequency.
(4) When the keyword score of a candidate label word exceeds a preset score threshold, the candidate label word is set as a label word of the sample word.
The classifier training unit 43 is configured to calculate the relationship features between each sample word and each known word in the known dictionary, and to train a word classifier according to the relationship features and the label words of the sample words.
In embodiments of the present invention, before the word classifier is trained, features are first used to describe the relationship between a sample word and the known words in the known dictionary; the word classifier is then trained on these relationship features and the label words of the sample words.
In embodiments of the present invention, each sample word and each known word in the known dictionary is represented as a word vector v_w = {θ_1, θ_2, ..., θ_n}, where n is the number of known words in the known dictionary. All texts (e.g. microblogs) in the training data set that contain the current sample word (or known word) are first collected to form a document t_w, and the term frequency–inverse document frequency of each word in t_w is calculated as:
θ_k = tf(w_k, t_w) × idf(w_k, doc_corpus)
where θ_k, the k-th component of the word vector, is the term frequency–inverse document frequency of the word w_k in t_w, tf(w_k, t_w) is the term frequency of w_k in t_w, and idf(w_k, doc_corpus) is the inverse document frequency of w_k with respect to the known dictionary. tf(w_k, t_w) is calculated as:
tf(w_k, t_w) = f(w_k, t_w) / Σ_{w ∈ Φ(t_w)} f(w, t_w)
where f(w_k, t_w) is the frequency with which w_k occurs in t_w and Φ(t_w) is the set of all words in t_w.
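A sketch of this word-vector construction (toy data; it assumes every vocabulary word occurs somewhere in the corpus so that the idf is defined, and the names are illustrative):

```python
import math

def idf(word, corpus_docs):
    # idf(w_k, doc_corpus) = log(|doc_corpus| / number of texts containing w_k)
    containing = sum(1 for doc in corpus_docs if word in doc)
    return math.log(len(corpus_docs) / containing)

def word_vector(target, texts, vocabulary, corpus_docs):
    """Collect every training text containing `target` into t_w, then take
    theta_k = tf(w_k, t_w) * idf(w_k, doc_corpus) as the k-th component."""
    t_w = [tok for text in texts if target in text for tok in text]
    return [t_w.count(w_k) / len(t_w) * idf(w_k, corpus_docs)
            for w_k in vocabulary]

texts = [["a", "b"], ["a", "a"], ["b"]]
corpus = [{"a", "b"}, {"b"}, {"b"}]
vec = word_vector("a", texts, ["a", "b"], corpus)
print(vec)  # [0.75 * log(3), 0.0]
```

The component for "b" is zero because "b" appears in every corpus text, so its idf vanishes.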
In embodiments of the present invention, after the current sample word and each known word are represented as word vectors, the word distance between the current sample word and each known word may be calculated as the Euclidean distance:
d(w_new, w_i) = √( Σ_{k=1}^{n} (θ_k − θ'_k)² )
where w_new is the current sample word, w_i is the i-th known word, θ_k is the k-th component of the word vector of w_new, θ'_k is the k-th component of the word vector of w_i, and d(w_new, w_i) is the word distance between w_new and w_i. Then the word cosine similarity between w_new and w_i is calculated as:
sim(w_new, w_i) = Σ_{k=1}^{n} θ_k θ'_k / ( √(Σ_{k=1}^{n} θ_k²) · √(Σ_{k=1}^{n} θ'_k²) )
where sim(w_new, w_i) is the word cosine similarity between the sample word w_new and the known word w_i.
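Both measures are standard; a direct Python transcription (toy vectors for illustration):

```python
import math

def word_distance(v, u):
    # Euclidean distance d(w_new, w_i) between two word vectors
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(v, u)))

def cosine_similarity(v, u):
    # sim(w_new, w_i): dot product over the product of the vector norms
    dot = sum(a * b for a, b in zip(v, u))
    norms = math.sqrt(sum(a * a for a in v)) * math.sqrt(sum(b * b for b in u))
    return dot / norms

print(word_distance([3.0, 0.0], [0.0, 4.0]))      # 5.0
print(cosine_similarity([1.0, 0.0], [1.0, 1.0]))  # 1/sqrt(2) ≈ 0.7071
```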
In embodiments of the present invention, after the word cosine similarity is calculated, the word co-occurrence frequencies of the sample word and each known word in the known dictionary are calculated as c_co / c_new and c_co / c_i, where c_co is the number of texts in the training data set that contain both the current sample word and the i-th known word, c_new is the number of texts in the training data set that contain the current sample word, and c_i is the number of texts in the training data set that contain the i-th known word. The word distance, word cosine similarity and word co-occurrence frequencies of the sample word and each known word together form the relationship features between the sample word and that known word.
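The four relationship features for one (sample word, known word) pair can then be assembled as follows (function name and toy texts are illustrative):

```python
def relation_features(sample, known, texts, dist, cos_sim):
    """[word distance, word cosine similarity, c_co/c_new, c_co/c_i] for one
    (sample word, known word) pair; `texts` is the segmented training set."""
    c_new = sum(1 for t in texts if sample in t)
    c_i = sum(1 for t in texts if known in t)
    c_co = sum(1 for t in texts if sample in t and known in t)
    return [dist, cos_sim, c_co / c_new, c_co / c_i]

texts = [["x", "y"], ["x"], ["y"], ["x", "y"]]
feats = relation_features("x", "y", texts, dist=1.2, cos_sim=0.9)
print(feats)  # [1.2, 0.9, 0.666..., 0.666...]
```

Here "x" and "y" co-occur in two of the three texts containing each, so both co-occurrence ratios are 2/3.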
In embodiments of the present invention, the relationship features between the sample words and each known word, together with the label words of the sample words, are input into a preset support vector machine for training to generate the word classifier, where the kernel function of the support vector machine may be an RBF kernel; other classification algorithms may also be selected for training.
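A hedged sketch of this training step using scikit-learn's SVC with an RBF kernel (the feature values and labels below are invented toy data, and framing the task as a binary "is this known word a label of the sample word" decision is one possible reading of the text, not the patent's definitive formulation):

```python
from sklearn.svm import SVC

# X: relationship-feature vectors [distance, cosine similarity, c_co/c_new,
# c_co/c_i] for (sample word, known word) pairs; y: 1 if the known word is a
# label word of the sample word, 0 otherwise. All values are toy data.
X = [[0.1, 0.9, 0.8, 0.7],
     [2.5, 0.1, 0.0, 0.0],
     [0.3, 0.8, 0.6, 0.5],
     [3.0, 0.05, 0.1, 0.0]]
y = [1, 0, 1, 0]

clf = SVC(kernel="rbf", gamma="scale")  # RBF kernel, as suggested in the text
clf.fit(X, y)
preds = list(clf.predict(X))
```

At prediction time, the known words classified as related to a word to be annotated would be set as its labels.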
Preferably, the classifier training unit includes:
a word vector converting unit, configured to convert the sample word and each known word in the known dictionary into corresponding word vectors;
a relationship calculating unit, configured to calculate, according to the word vector of the sample word and the word vectors of the known words, the word distance and word cosine similarity between the sample word and each known word, and to calculate the word co-occurrence frequency of the sample word and each known word in the training data set; and
a relationship combining unit, configured to combine the word distance, the word cosine similarity and the word co-occurrence frequency into the relationship features between the sample word and the known words.
In embodiments of the present invention, sample words are searched in a training data set built in advance, the label words of the sample words are extracted from a preset entry annotation database, the relationship features between each sample word and each known word in the known dictionary are calculated, and a word classifier is trained according to the label words of the sample words and these relationship features. The trained classifier then classifies the words to be annotated, obtaining from the known dictionary the known words related to the words to be annotated in a text document, and these known words are set as the labels of the words to be annotated. In this supervised manner, known-word labels are generated automatically, a large number of words to be annotated can be labeled in a short time, the accuracy and recall of word annotation are improved, manual effort is effectively saved, and the granularity of word annotation is effectively improved.
In embodiments of the present invention, each unit of the word-label-based word annotation apparatus may be implemented by a corresponding hardware or software unit; the units may be independent hardware/software units or may be integrated into one hardware/software unit, which does not limit the present invention.
Embodiment Four:
Fig. 5 shows the structure of the server provided by Embodiment Four of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown.
The server 5 of the embodiment of the present invention includes a processor 50, a memory 51, and a computer program 52 that is stored in the memory 51 and runnable on the processor 50. When the processor 50 executes the computer program 52, the steps in the above method embodiments are implemented, such as steps S101 to S103 shown in Fig. 1; alternatively, when the processor 50 executes the computer program 52, the functions of the units in the above apparatus embodiments are implemented, such as the functions of units 31 to 33 shown in Fig. 3.
In embodiments of the present invention, sample words are searched in a training data set built in advance, the label words of the sample words are extracted from a preset entry annotation database, the relationship features between each sample word and each known word in the known dictionary are calculated, and a word classifier is trained according to the label words of the sample words and these relationship features. When an input text document is received, the trained classifier classifies the words to be annotated, obtaining from the known dictionary the known words related to the words to be annotated in the text document, and these known words are set as the labels of the words to be annotated. In this supervised manner, known-word labels are generated automatically, a large number of words to be annotated can be labeled in a short time, the accuracy and recall of word annotation are improved, manual effort is effectively saved, and the granularity of word annotation is effectively improved.
Embodiment Five:
In embodiments of the present invention, a computer-readable storage medium is provided, which stores a computer program. When the computer program is executed by a processor, the steps in the above method embodiments are implemented, such as steps S101 to S103 shown in Fig. 1; alternatively, when the computer program is executed by a processor, the functions of the units in the above apparatus embodiments are implemented, such as the functions of units 31 to 33 shown in Fig. 3.
In embodiments of the present invention, sample words are searched in a training data set built in advance, the label words of the sample words are extracted from a preset entry annotation database, the relationship features between each sample word and each known word in the known dictionary are calculated, and a word classifier is trained according to the label words of the sample words and these relationship features. When an input text document is received, the trained classifier classifies the words to be annotated, obtaining from the known dictionary the known words related to the words to be annotated in the text document, and these known words are set as the labels of the words to be annotated. In this supervised manner, known-word labels are generated automatically, a large number of words to be annotated can be labeled in a short time, the accuracy and recall of word annotation are improved, manual effort is effectively saved, and the granularity of word annotation is effectively improved.
The computer-readable storage medium of the embodiment of the present invention may include any entity or device capable of carrying computer program code, or a recording medium, for example, a memory such as ROM/RAM, a magnetic disk, an optical disc or a flash memory.
The foregoing is merely a description of preferred embodiments of the present invention and is not intended to limit the present invention; any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A word annotation method based on word labels, characterized in that the method comprises the following steps:
searching for a word to be annotated in an input text document;
querying, by a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier being obtained by supervised training; and
setting the related known words as label words of the word to be annotated, so as to annotate the word to be annotated with the label words.
2. The method according to claim 1, characterized in that before the step of searching for a word to be annotated in the input text document, the method further comprises:
searching for sample words in a training data set built in advance;
querying the annotation of each sample word in a preset entry annotation database, extracting the keywords of the annotation, and setting the keywords that appear in the known dictionary as label words of the sample word; and
calculating relationship features between the sample word and each known word in the known dictionary, and training the word classifier according to the relationship features and the label words of the sample word.
3. The method according to claim 2, characterized in that the step of querying the annotation of each sample word in the preset entry annotation database, extracting the keywords of the annotation, and setting the keywords that appear in the known dictionary as label words of the sample word comprises:
querying the annotation of the sample word in the entry annotation database, performing word segmentation and part-of-speech tagging on the annotation, and extracting candidate label words from the tagged annotation;
calculating the encyclopedia word frequency of each candidate label word according to the self-defined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation;
calculating the inverse document frequency of the candidate label word according to the known dictionary, and calculating the keyword score of the candidate label word according to the encyclopedia word frequency and the inverse document frequency of the candidate label word; and
when the keyword score of the candidate label word exceeds a preset score threshold, setting the candidate label word as a label word of the sample word.
4. The method according to claim 3, characterized in that the step of calculating the encyclopedia word frequency of each candidate label word according to the self-defined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation comprises:
redefining the weight of each part of the annotation according to the self-defined weight of each part of the annotation, the formula for redefining the weight of the j-th part of the annotation being:
α_j = (β_j / |p_j|) / Σ_{p_k ∈ P} (β_k / |p_k|)
wherein β_j is the self-defined weight of the j-th part of the annotation, p_j is the j-th part, P is the annotation, and α_j is the value obtained after redefining the weight of the j-th part; and
calculating the encyclopedia word frequency of the candidate label word according to the redefined weight of each part of the annotation and the frequency with which the candidate label word occurs in each part of the annotation, the calculation formula being:
btf(w_i) = Σ_{p_j ∈ P} α_j · f(w_i, p_j) / Σ_{w_k ∈ Φ} Σ_{p_j ∈ P} α_j · f(w_k, p_j)
wherein btf(w_i) is the encyclopedia word frequency of the i-th candidate label word w_i, w_k is the k-th candidate label word, f(w_i, p_j) and f(w_k, p_j) are respectively the frequencies with which the i-th and k-th candidate label words occur in the j-th part p_j, Φ is the set of candidate label words, and A is the set of weights of all parts of the annotation.
5. The method according to claim 2, characterized in that the step of calculating the relationship features between the sample word and each known word in the known dictionary comprises:
converting the sample word and each known word in the known dictionary into corresponding word vectors;
calculating, according to the word vector of the sample word and the word vectors of the known words, the word distance and word cosine similarity between the sample word and each known word, and calculating the word co-occurrence frequency of the sample word and the known word in the training data set; and
combining the word distance, the word cosine similarity and the word co-occurrence frequency into the relationship features between the sample word and the known word.
6. A word annotation apparatus based on word labels, characterized in that the apparatus comprises:
a word searching unit, configured to search for a word to be annotated in an input text document;
a related word querying unit, configured to query, by a pre-trained word classifier, a preset known dictionary for known words related to the word to be annotated, the word classifier being obtained by supervised training; and
a word annotating unit, configured to set the related known words as label words of the word to be annotated, so as to annotate the word to be annotated with the label words.
7. The apparatus according to claim 6, characterized in that the apparatus further comprises:
a sample word searching unit, configured to search for sample words in a training data set built in advance;
a keyword extracting unit, configured to query the annotation of each sample word in a preset entry annotation database, extract the keywords of the annotation, and set the keywords that appear in the known dictionary as label words of the sample word; and
a classifier training unit, configured to calculate relationship features between the sample word and each known word in the known dictionary, and to train the word classifier according to the relationship features and the label words of the sample word.
8. The apparatus according to claim 7, characterized in that the classifier training unit comprises:
a word vector converting unit, configured to convert the sample word and each known word in the known dictionary into corresponding word vectors;
a relationship calculating unit, configured to calculate, according to the word vector of the sample word and the word vectors of the known words, the word distance and word cosine similarity between the sample word and each known word, and to calculate the word co-occurrence frequency of the sample word and the known word in the training data set; and
a relationship combining unit, configured to combine the word distance, the word cosine similarity and the word co-occurrence frequency into the relationship features between the sample word and the known word.
9. A server comprising a memory, a processor and a computer program stored in the memory and runnable on the processor, characterized in that when the processor executes the computer program, the steps of the method according to any one of claims 1 to 5 are implemented.
10. A computer-readable storage medium storing a computer program, characterized in that when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 5 are implemented.
CN201710581312.XA 2017-07-17 2017-07-17 Word labeling method, device, server and storage medium based on word labels Active CN107480200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710581312.XA CN107480200B (en) 2017-07-17 2017-07-17 Word labeling method, device, server and storage medium based on word labels

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710581312.XA CN107480200B (en) 2017-07-17 2017-07-17 Word labeling method, device, server and storage medium based on word labels

Publications (2)

Publication Number Publication Date
CN107480200A true CN107480200A (en) 2017-12-15
CN107480200B CN107480200B (en) 2020-10-23

Family

ID=60595121

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710581312.XA Active CN107480200B (en) 2017-07-17 2017-07-17 Word labeling method, device, server and storage medium based on word labels

Country Status (1)

Country Link
CN (1) CN107480200B (en)

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145296A (en) * 2018-08-09 2019-01-04 新华智云科技有限公司 A kind of general word recognition method and device based on monitor model
CN109271392A (en) * 2018-10-30 2019-01-25 长威信息科技发展股份有限公司 Quick discrimination and the method and apparatus for extracting relevant database entity and attribute
CN109344367A (en) * 2018-10-24 2019-02-15 厦门美图之家科技有限公司 Region mask method, device and computer readable storage medium
CN109522424A (en) * 2018-10-16 2019-03-26 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of data
CN109740157A (en) * 2018-12-29 2019-05-10 贵州小爱机器人科技有限公司 The label of working individual determines method, apparatus and computer storage medium
CN109816047A (en) * 2019-02-19 2019-05-28 北京达佳互联信息技术有限公司 Method, apparatus, equipment and the readable storage medium storing program for executing of label are provided
CN110276064A (en) * 2018-03-14 2019-09-24 普天信息技术有限公司 A kind of part-of-speech tagging method and device
CN110991181A (en) * 2019-11-29 2020-04-10 腾讯科技(深圳)有限公司 Method and apparatus for enhancing labeled samples
CN113177109A (en) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 Text weak labeling method, device, equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102508923A (en) * 2011-11-22 2012-06-20 北京大学 Automatic video annotation method based on automatic classification and keyword marking
US20160042427A1 (en) * 2011-04-06 2016-02-11 Google Inc. Mining For Product Classification Structures For Internet-Based Product Searching

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160042427A1 (en) * 2011-04-06 2016-02-11 Google Inc. Mining For Product Classification Structures For Internet-Based Product Searching
CN102508923A (en) * 2011-11-22 2012-06-20 北京大学 Automatic video annotation method based on automatic classification and keyword marking

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
MOHAMMAD HOSSEIN ELAHIMANESH ET AL.: ""ACUT: An Associative Classifier Approach to Unknown Word POS Tagging"", 《INTERNATIONAL SYMPOSIUM ON ARTIFICIAL INTELLIGENCE AND SIGNAL PROCESSING》 *
YUZHI LIANG ET AL.: ""New Word Detection and Tagging on Chinese Twitter Stream"", 《INTERNATIONAL CONFERENCE ON BIG DATA ANALYTICS AND KNOWLEDGE DISCOVERY》 *
刘遥峰 等: ""中文分词和词性标注模型"", 《计算机工程》 *
姜维 等: ""基于条件随机域的词性标注模型"", 《计算机工程与应用》 *
王阿园 等: ""查询扩展中扩展词提取算法研究"", 《中国科技论文在线》 *
郭振 等: ""基于字符的中文分词、词性标注和依存句法分析联合模型"", 《中文信息学报》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110276064B (en) * 2018-03-14 2023-06-23 普天信息技术有限公司 Part-of-speech tagging method and device
CN110276064A (en) * 2018-03-14 2019-09-24 普天信息技术有限公司 A kind of part-of-speech tagging method and device
CN109145296A (en) * 2018-08-09 2019-01-04 新华智云科技有限公司 A kind of general word recognition method and device based on monitor model
CN109522424A (en) * 2018-10-16 2019-03-26 北京达佳互联信息技术有限公司 Processing method, device, electronic equipment and the storage medium of data
CN109344367A (en) * 2018-10-24 2019-02-15 厦门美图之家科技有限公司 Region mask method, device and computer readable storage medium
CN109344367B (en) * 2018-10-24 2022-11-01 厦门美图之家科技有限公司 Region labeling method and device and computer readable storage medium
CN109271392A (en) * 2018-10-30 2019-01-25 长威信息科技发展股份有限公司 Quick discrimination and the method and apparatus for extracting relevant database entity and attribute
CN109740157A (en) * 2018-12-29 2019-05-10 贵州小爱机器人科技有限公司 The label of working individual determines method, apparatus and computer storage medium
CN109816047A (en) * 2019-02-19 2019-05-28 北京达佳互联信息技术有限公司 Method, apparatus, equipment and the readable storage medium storing program for executing of label are provided
CN109816047B (en) * 2019-02-19 2022-05-24 北京达佳互联信息技术有限公司 Method, device and equipment for providing label and readable storage medium
CN110991181A (en) * 2019-11-29 2020-04-10 腾讯科技(深圳)有限公司 Method and apparatus for enhancing labeled samples
CN110991181B (en) * 2019-11-29 2023-03-31 腾讯科技(深圳)有限公司 Method and apparatus for enhancing labeled samples
CN113177109A (en) * 2021-05-27 2021-07-27 中国平安人寿保险股份有限公司 Text weak labeling method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN107480200B (en) 2020-10-23

Similar Documents

Publication Publication Date Title
CN107480200A (en) Word mask method, device, server and the storage medium of word-based label
CN111177365B (en) Unsupervised automatic abstract extraction method based on graph model
US11216504B2 (en) Document recommendation method and device based on semantic tag
CN106997382B (en) Innovative creative tag automatic labeling method and system based on big data
CN106649818B (en) Application search intention identification method and device, application search method and server
Huston et al. Evaluating verbose query processing techniques
CN106776574B (en) User comment text mining method and device
CN107608999A (en) A kind of Question Classification method suitable for automatically request-answering system
CN106126619A (en) A kind of video retrieval method based on video content and system
CN110347790B (en) Text duplicate checking method, device and equipment based on attention mechanism and storage medium
CN113569050B (en) Method and device for automatically constructing government affair field knowledge map based on deep learning
CN112069312B (en) Text classification method based on entity recognition and electronic device
Man Feature extension for short text categorization using frequent term sets
Barriere et al. TerminoWeb: a software environment for term study in rich contexts
CN111291177A (en) Information processing method and device and computer storage medium
CN108228612B (en) Method and device for extracting network event keywords and emotional tendency
Wang et al. Neural related work summarization with a joint context-driven attention mechanism
CN105786971B (en) A kind of grammer point recognition methods towards international Chinese teaching
Singh et al. Writing Style Change Detection on Multi-Author Documents.
Gong et al. A semantic similarity language model to improve automatic image annotation
CN112000929A (en) Cross-platform data analysis method, system, equipment and readable storage medium
Ohta et al. CRF-based bibliography extraction from reference strings focusing on various token granularities
Pu et al. A vision-based approach for deep web form extraction
Tan et al. Sentiment analysis of chinese short text based on multiple features
Chen Natural language processing in web data mining

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant