CN103955453B - A kind of method and device for finding neologisms automatic from document sets - Google Patents

Publication number
CN103955453B
Authority
CN
China
Prior art keywords
candidate
templates
words
word
template
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410220317.6A
Other languages
Chinese (zh)
Other versions
CN103955453A (en
Inventor
黄民烈
朱小燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Application filed by Tsinghua University
Priority to CN201410220317.6A
Publication of CN103955453A
Application granted
Publication of CN103955453B
Legal status: Expired - Fee Related

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and device for automatically discovering new words from a document set, wherein a template acquisition unit acquires one or more templates; a word extraction unit extracts, from the document set, the words matching each of the one or more templates; a candidate template set adding unit selects at least a part of the one or more templates and adds them to a candidate template set; a candidate word set adding unit selects at least a part of the extracted words matching each of the one or more templates and adds them to a candidate word set; and a new word set adding unit sorts the candidate words in the candidate word set based on the templates in the candidate template set and adds a certain number of candidate words to a new word set based on that sorting. Compared with the prior art, the method and device provided by the invention can discover new words effectively.

Description

Method and device for automatically discovering new words from document set
Technical Field
The invention relates to a natural language processing technology, in particular to a method and a device for automatically discovering new words from a document set.
Background
In social networks, netizens like to express opinions on politics, society, culture and so on in their own personalized language. Such personalized expressions spread easily and can become new network hot words (referred to simply as "new words"). New words currently have important applications in automatic summarization, text clustering and classification, information retrieval and the like. According to statistics, more than 1000 new Chinese words appear on the internet every year. Most of them are time-sensitive terms from various fields, and because most of them are not in any dictionary, existing word segmentation algorithms have difficulty recognizing them in a document set. Take the emotion-class new word "give force" (an adjective) and the document "the performance is very give force" as an example. An existing word segmentation algorithm generally segments this document as: performance/noun very/adverb give/verb force/noun. The new word "give force" is therefore not segmented as a complete word, which hinders its recognition.
Disclosure of Invention
One of the technical problems solved by the invention is to improve the accuracy of new word recognition.
According to an embodiment of one aspect of the present invention, there is provided a method for automatically discovering new words from a document set, comprising:
acquiring one or more templates;
extracting words matched with each template in the one or more templates from the document set;
at least selecting a part of templates from the one or more templates and adding the selected part of templates into a candidate template set;
at least selecting a part of words from the extracted words matched with each template in the one or more templates and adding the words into a candidate word set;
sorting the candidate words in the candidate word set based on the templates in the candidate template set, and adding a certain number of candidate words to a new word set based on that sorting.
According to one embodiment of the invention, the one or more templates are obtained by any one of:
predefining the one or more templates; or
after a document set is obtained, performing word segmentation on the document set, and extracting, from the segmented document set, the one or more templates that match a specific regular expression.
According to an embodiment of the present invention, the step of selecting at least a part of the templates from the one or more templates to be added to the candidate template set comprises any one of the following:
adding all of the one or more templates into a candidate template set;
adding a portion of the templates to a set of candidate templates based on a number of times each of the one or more templates occurs in the set of documents.
According to one embodiment of the present invention, the step of adding a portion of the templates to the set of candidate templates based on the number of times each of the one or more templates occurs in the set of documents comprises:
adding the f templates that occur most frequently in the document set to the candidate template set, wherein f is a positive integer; or
adding the templates that occur in the document set more than a specific threshold number of times to the candidate template set.
According to an embodiment of the present invention, the step of selecting at least a part of the extracted words matching each of the one or more templates to be added to the candidate word set includes any one of the following steps:
adding all the matched words into a candidate word set;
adding a part of the matched words to the candidate word set based on the number of times each matched word matches the templates.
According to an embodiment of the present invention, the step of adding a part of words into the candidate word set based on the matching times of the matched words and the templates includes:
adding, for each template, the g matched words that match that template most often to the candidate word set, wherein g is a positive integer; or
adding the matched words whose number of matches with the templates exceeds a specific threshold to the candidate word set.
According to an embodiment of the invention, the method further comprises: before the candidate words in the candidate word set are sorted based on the templates in the candidate template set, ranking the templates in the candidate template set using a preset new word set, and filtering the candidate template set based on that ranking.
According to an embodiment of the invention, the method further comprises: ranking the templates in the candidate template set using the obtained new word set, filtering the candidate template set based on that ranking, sorting the candidate words in the candidate word set again using the filtered candidate template set, and again adding a certain number of candidate words to the new word set based on the new sorting.
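The alternation between filtering templates and re-sorting candidate words can be pictured with a minimal sketch. It assumes the matches are available as (template, word) pairs and, purely for illustration, scores a template by how many of its matches involve an already accepted new word. The function `refine`, its parameters, the toy data and this counting-based score are all hypothetical stand-ins, not the weighting the method itself specifies.

```python
from collections import Counter

def refine(matches, seed_words, keep_templates=2, add_words=1, rounds=2):
    """matches: (template, word) pairs; seed_words: the initial new word set."""
    new_words = set(seed_words)
    for _ in range(rounds):
        # Rank templates by how often they co-occur with accepted new words.
        template_score = Counter(t for t, w in matches if w in new_words)
        kept = {t for t, _ in template_score.most_common(keep_templates)}
        # Re-sort the remaining candidates using only the filtered templates.
        word_score = Counter(
            w for t, w in matches if t in kept and w not in new_words
        )
        new_words.update(w for w, _ in word_score.most_common(add_words))
    return new_words

# Hypothetical toy matches; the "the" template only ever matches "table".
matches = [
    ("too", "give force"), ("too", "pit die"),
    ("good", "give force"), ("good", "pit die"), ("good", "beautiful"),
    ("the", "table"), ("the", "table"), ("the", "table"),
]
result = refine(matches, {"give force"})
```

In this toy run the frequent but uninformative template never co-occurs with an accepted word, so the word it matches is never promoted.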
According to one embodiment of the present invention, ranking the templates of the set of candidate templates is performed by calculating template weights in the set of candidate templates based on the following formula and ranking the templates of the set of candidate templates according to the calculated template weights:
$n_{1i} = k_{1i} + k_{3i}$, $n_{2i} = k_{2i} + k_{4i}$
where $W$ denotes the new word set, $P$ denotes the candidate template set, $w_i$ denotes a word in the new word set $W$, and $p_j$ denotes a template in the candidate template set $P$. Among the matches found from the document set with each of the one or more templates, $k_{1i}$ is the number of matches that contain both $w_i$ and $p_j$, $k_{2i}$ is the number that contain $w_i$ but not $p_j$, $k_{3i}$ is the number that contain $p_j$ but not $w_i$, and $k_{4i}$ is the number that contain neither $p_j$ nor $w_i$.
According to an embodiment of the present invention, the sorting candidate words in the candidate word set based on templates in the candidate template set, and adding a certain number of candidate words to the new word set based on the sorting candidate words in the candidate word set by templates in the candidate template set comprises:
adding the m candidate words that match the templates in the candidate template set most often to the new word set, wherein m is a positive integer; or
adding the candidate words whose number of matches with the templates in the candidate template set exceeds a specific threshold to the new word set.
According to an embodiment of the present invention, in the step of ordering the candidate words in the set of candidate words based on the templates in the set of candidate templates,
the weights of the candidate words in the candidate word set are calculated according to $LLR(w_i)$, $E(w_i)$, $P(w_i)$, $EMI(w_i)$ and $1/NMED(w_i)$, and the candidate words in the candidate word set are ordered based on the calculated weights;
wherein $w_i$ represents a candidate word in the candidate word set, $LLR(w_i)$ represents the closeness of the statistical association between the candidate word $w_i$ and the templates in the candidate template set, $E(w_i)$ represents the left entropy of $w_i$, $P(w_i)$ represents the probability that $w_i$ forms a word, and $EMI(w_i)$ and $NMED(w_i)$ represent different measures of the semantic compositionality of $w_i$;
wherein, in the step of ordering the candidate words in the candidate word set based on the templates in the candidate template set, $LLR(w_i)$, $E(w_i)$, $P(w_i)$, $EMI(w_i)$ and $1/NMED(w_i)$ are respectively obtained by the following calculations:
$n_{1j} = k_{1j} + k_{3j}$, $n_{2j} = k_{2j} + k_{4j}$
where $W$ represents the candidate word set, $P$ represents the candidate template set, $w_i$ represents a candidate word in $W$, and $p_j$ represents a template in the candidate template set $P$. Among the matches found from the document set with each of the one or more templates, $k_{1j}$ is the number of matches that contain both $w_i$ and $p_j$, $k_{2j}$ is the number that contain $w_i$ but not $p_j$, $k_{3j}$ is the number that contain $p_j$ but not $w_i$, and $k_{4j}$ is the number that contain neither $p_j$ nor $w_i$;
where $L$ represents the set of left-side words $l_o$ that appear in the document set immediately to the left of the candidate word $w_i$ in matches with any template in the candidate template set, $c(l_o)$ represents the frequency with which the left-side word $l_o$ appears to the left of $w_i$ in such matches, and $N$ represents the total number of co-occurrences of $w_i$ with the templates in the candidate template set;
where $t_h$ represents the $h$-th character of the candidate word $w_i$ and $n$ represents the number of characters contained in $w_i$;
$all(t_h)$ represents the number of times the $h$-th character of $w_i$ appears in the document set, and $s(t_h)$ represents the number of times that character appears in the document set as a single-character word;
where $S$ is the total number of text segments in the document set $M$, $n$ represents the number of characters contained in $w_i$, $F$ represents the number of text segments in the document set that contain $w_i$, and $F_h$ represents the number of text segments in the document set that contain the $h$-th character of $w_i$;
where $S$ is the total number of text segments in the document set $M$, $\mu(g)$ represents the number of text segments in $M$ that contain all the characters of the candidate word $w_i$, and (g) represents the number of text segments in $M$ in which all the characters of $w_i$ appear strictly consecutively as a single phrase.
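As an illustration of the left-entropy quantities $L$, $c(l_o)$ and $N$ defined above, the sketch below assumes the usual Shannon form, $-\sum_{l_o \in L} \frac{c(l_o)}{N} \log \frac{c(l_o)}{N}$. That form is an assumption made here, not a formula taken from the text, and the function name and sample inputs are hypothetical.

```python
import math
from collections import Counter

def left_entropy(left_words):
    """left_words: the left-side word of each match of one candidate word."""
    counts = Counter(left_words)      # c(l_o) for every left-side word l_o
    total = sum(counts.values())      # N, the total number of co-occurrences
    return -sum((c / total) * math.log(c / total) for c in counts.values())

varied = left_entropy(["too", "good"])   # two different left neighbours
fixed = left_entropy(["too", "too"])     # always the same left neighbour
```

Under this assumption a candidate that appears after many different left-side words scores higher than one that always follows the same word, which is the usual reason for using left entropy as a word-boundary signal.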
According to an embodiment of the present invention, the step of sorting the templates in the candidate template set by using the obtained new word set, and filtering the candidate template set based on the sorting of the templates in the candidate template set by using the obtained new word set comprises:
filtering the candidate template set according to the top r candidate words ranked by number of matches with the templates in the candidate template set, to obtain a filtered candidate template set, wherein r is a positive integer; or
filtering the candidate template set according to the candidate words whose number of matches with the templates in the candidate template set is higher than a specific threshold, to obtain a filtered candidate template set.
There is also provided, in accordance with an embodiment of another aspect of the present invention, apparatus for automatically discovering new words from a document collection, including:
a template acquisition unit configured to acquire one or more templates;
a word extraction unit configured to extract words matching each of the one or more templates from the document set;
the candidate template set adding unit is configured to select at least one part of templates from the one or more templates and add the selected part of templates into the candidate template set;
the candidate word set adding unit is configured to select at least one part of words from the extracted words matched with each template in the one or more templates and add the selected words into a candidate word set;
the new word set adding unit is configured to rank the candidate words in the candidate word set based on the templates in the candidate template set, and add a certain number of candidate words to the new word set based on the ranking of the candidate words in the candidate word set by the templates in the candidate template set.
According to an embodiment of the present invention, the template obtaining unit obtains the one or more templates by any one of:
predefining said one or more templates, or
after a document set is obtained, performing word segmentation on the document set, and extracting, from the segmented document set, the one or more templates that match a specific regular expression.
According to an embodiment of the invention, the apparatus further comprises:
a candidate template set filtering unit configured to rank the templates in the candidate template set with a predefined new word set before ranking the candidate words in the candidate word set based on the templates in the candidate template set, and filter the candidate template set based on the ranking of the templates in the candidate template set with the predefined new word set.
According to an embodiment of the present invention, the new word set adding unit is configured to sort again the candidate words in the candidate word set with the filtered candidate template set and add again a certain number of candidate words to the new word set based on the sorting.
Compared with the prior art, the method for automatically discovering new words from a document set provided by the embodiments of the invention can discover new words effectively, more accurately and without supervision.
In addition, according to the method for automatically discovering new words from a document set provided by one embodiment of the present invention, after the document set is obtained, word segmentation is performed on the document set, and the one or more templates matched with the specific regular expression are extracted from the document set subjected to word segmentation, so that the obtained templates are more standard, and thus new words are discovered more accurately.
In addition, according to the method for automatically discovering new words from the document set provided by an embodiment of the present invention, based on the number of times that each template of the one or more templates appears in the document set, a part of templates are added to the candidate template set, that is, by screening the templates, the computational efficiency of the present invention is improved.
In addition, in the method for automatically discovering new words from a document set provided by one embodiment of the present invention, for each template the g matched words that match that template most often are added to the candidate word set, where g is a positive integer; or the matched words whose number of matches with the templates exceeds a specific threshold are added to the candidate word set. Better candidate words are thereby screened out, so that new words are discovered more accurately.
In addition, according to the method for automatically discovering new words from a document set provided by an embodiment of the present invention, before the candidate words in the candidate word set are ranked based on the templates in the candidate template set, the templates in the candidate template set are ranked by using the predefined new word set, and the candidate template set is filtered based on the ranking of the templates in the candidate template set by using the predefined new word set, that is, by screening the templates, a better template matching with the new words to be discovered can be obtained, so as to discover the new words more accurately.
It will be appreciated by those of ordinary skill in the art that although the following detailed description will proceed with reference being made to illustrative embodiments, the present invention is not intended to be limited to these embodiments. Rather, the scope of the invention is broad and is intended to be defined only by the claims appended hereto.
Drawings
Other features, objects and advantages of the invention will become more apparent upon reading of the detailed description of non-limiting embodiments made with reference to the following drawings:
FIG. 1 illustrates a flow diagram of a method for automatically discovering new words from a document collection, according to one embodiment of the invention;
FIG. 2 shows a schematic block diagram of an apparatus for automatically discovering new words from a document collection according to another embodiment of the present invention;
the same or similar reference numbers in the drawings identify the same or similar elements.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
FIG. 1 illustrates a flow diagram of a method for automatically discovering new words from a document collection, according to one embodiment of the invention. According to one embodiment of the invention, the method comprises:
Step S101, acquiring one or more templates;
Step S102, extracting words matched with each template in the one or more templates from the document set;
Step S103, selecting at least a part of the templates from the one or more templates and adding them to a candidate template set;
Step S104, selecting at least a part of the extracted words matched with each template in the one or more templates and adding them to a candidate word set;
Step S105, sorting the candidate words in the candidate word set based on the templates in the candidate template set, and adding a certain number of candidate words to the new word set based on that sorting.
The document set may refer to a single document or a set of multiple documents, and of course, the document set is only an example, and may also be other corpus resources, such as a dictionary, a microblog database, and the like, and is also applicable to the present invention.
Specifically, in step S101, one or more templates may be obtained by any one of the following ways:
Mode 1: the one or more templates are specified in advance. For example, templates are defined by part of speech, such as adverbs and auxiliary words, and the one or more templates are then enumerated; alternatively, the one or more templates may simply be preset or given in advance; or,
Mode 2: after a document set is obtained, word segmentation is performed on the document set, and the one or more templates that match a specific matching rule are extracted from the segmented document set. The word segmentation method is not limited: methods based on character string matching, on understanding, on statistics and the like are all applicable to the present invention and are incorporated herein by reference. The specific matching rule includes, for example, a rule that performs matching based on a regular expression; this is not limiting, and other matching rules may also be used. A regular expression here mainly refers to an expression formed by combining one or more specific parts of speech. For example, "noun + adjective" and "adverb + space + interjection" are two different regular expressions, where the space stands for arbitrary content of any part of speech. Preferably, in this embodiment, if the one or more templates are obtained in Mode 2, different regular expressions are defined according to the part of speech of the new words to be discovered. For example, this embodiment aims to find emotion-class words, so the regular expressions mainly defined are "adverb + space + interjection", "adverb + space" and the like.
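One way to picture the second way of obtaining templates is the sketch below. It assumes the segmented text is written as space-separated `word/tag` tokens and that the part-of-speech tags are literally named `adverb` and `interjection`; the tag names, the tokens `le` and `ah`, and the function name are all hypothetical choices for illustration. The pattern implements "adverb + space + interjection", with the space covering one or more arbitrary tokens.

```python
import re

# adverb, then at least one arbitrary token, then an interjection.
PATTERN = re.compile(
    r"(?<!\S)(\S+)/adverb(?: \S+/\S+)+? (\S+)/interjection(?!\S)"
)

def extract_templates(tagged_text):
    """Return templates of the form 'adverb + * + interjection'."""
    return [
        f"{m.group(1)} + * + {m.group(2)}"
        for m in PATTERN.finditer(tagged_text)
    ]

tagged = (
    "performance/noun too/adverb give/verb force/noun le/interjection "
    "good/adverb beautiful/adjective ah/interjection"
)
templates = extract_templates(tagged)
```

Only the adverb and the interjection are kept in the template; whatever was matched in between is replaced by the wildcard, which is what later allows unseen words to fill that slot.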
A template mainly refers to a combination of words and a space or custom symbol. For example, "too + * + [interjection]" and "good + * + [interjection]" are two different templates, where the space or custom symbol, written here as *, stands for any combination of words. As an example of enumerating a template, the definition "adverb + space + interjection" yields the template "really + * + [interjection]".
In step S102, words matching each of the one or more templates are extracted from the document set. Specifically, taking the document set "too give force, too pit die, good give force, good pit die, good beautiful, very pit die" as an example, the words extracted by matching the templates are "give force", "pit die" and "beautiful". Of course, this embodiment does not limit the way in which words are extracted from the document set; other ways are incorporated herein by reference.
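A minimal sketch of this extraction step follows. It assumes each template is a (left word, right word) pair around a wildcard and that the documents are space-separated strings; the closing tokens `le` and `ah` are hypothetical stand-ins for the interjections, and `extract_matches` is an illustrative name rather than anything defined by the method.

```python
import re

def extract_matches(documents, templates):
    """Return (template, word) pairs for every template match in the documents."""
    matches = []
    for left, right in templates:
        pattern = re.compile(rf"\b{re.escape(left)} (.+?) {re.escape(right)}\b")
        for doc in documents:
            matches.extend(
                ((left, right), m.group(1)) for m in pattern.finditer(doc)
            )
    return matches

documents = [
    "too give force le", "too pit die le",
    "good give force ah", "good pit die ah", "good beautiful ah",
    "very pit die ah",   # matched by neither template
]
templates = [("too", "le"), ("good", "ah")]
matches = extract_matches(documents, templates)
words = {word for _, word in matches}
```

Keeping the (template, word) pairs rather than only the words preserves the counts that the later selection and sorting steps rely on.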
In step S103, at least a part of the templates from the one or more templates is selected and added to the candidate template set. Specifically, a part of the templates may be selected from the one or more templates and added to the candidate template set, or all of the one or more templates may be added to the candidate template set.
Optionally, a part of the templates is added to the candidate template set based on the number of times each of the one or more templates occurs in the document set. Specifically, the number of occurrences of each template in the document set is counted, the templates are then screened according to a certain rule, and the screened templates are added to the candidate template set. Still taking the document set "too give force, too pit die, good give force, good pit die, good beautiful, very pit die" and the templates "too + * + [interjection]" and "good + * + [interjection]" as an example: the template "too + * + [interjection]" matches "too give force" and "too pit die" in the document set, so counting gives 2 occurrences of this template in the document set; similarly, counting gives 3 occurrences of the template "good + * + [interjection]". The templates are then screened according to a certain rule, and the screened templates are added to the candidate template set.
Optionally, the step of adding a portion of the templates to the set of candidate templates based on the number of times each of the one or more templates appears in the set of documents comprises:
adding the f templates that occur most frequently in the document set to the candidate template set, wherein f is a positive integer; or
adding the templates that occur in the document set more than a specific threshold number of times to the candidate template set.
For example, still taking the document set "too give force, too pit die, good give force, good pit die, good beautiful, very pit die" and the templates "too + * + [interjection]" and "good + * + [interjection]": counting gives 2 occurrences of the template "too + * + [interjection]" and 3 occurrences of the template "good + * + [interjection]" in the document set. If only the template ranked first by number of occurrences is added to the candidate template set, the template "too + * + [interjection]" is filtered out and the template "good + * + [interjection]" is added; if the templates that occur in the document set more than 1 time are added, both templates are added to the candidate template set.
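Both selection rules can be sketched in a few lines, using the occurrence counts 2 and 3 from the example. The function name, its parameters and the shortened template labels are illustrative assumptions.

```python
from collections import Counter

def select_templates(template_counts, top_f=None, min_count=None):
    """Keep the top_f most frequent templates, or those occurring more than min_count times."""
    if top_f is not None:
        return [t for t, _ in Counter(template_counts).most_common(top_f)]
    return [t for t, c in template_counts.items() if c > min_count]

template_counts = {"too + *": 2, "good + *": 3}
by_rank = select_templates(template_counts, top_f=1)
by_threshold = select_templates(template_counts, min_count=1)
```

The rank-based rule fixes the size of the candidate template set in advance, while the threshold rule lets the size follow the data.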
In step S104, at least a part of the extracted words matching each of the one or more templates is selected and added to the candidate word set. Specifically, a part of the extracted words matched with each of the one or more templates may be selected and added to the candidate word set, or all the extracted words matched with each of the one or more templates may be added to the candidate word set.
Optionally, a part of the words is added to the candidate word set based on the number of times each matched word matches the templates. Specifically, the number of matches between each matched word and the templates is counted, the matched words are screened according to a certain rule, and the screened words are added to the candidate word set. Still taking the document set "too give force, too pit die, good give force, good pit die, good beautiful, very pit die" and the templates "too + * + [interjection]" and "good + * + [interjection]" as an example, counting gives: the word "give force" matches the template "too + * + [interjection]" once and the template "good + * + [interjection]" once; the word "pit die" matches each of the two templates once; the word "beautiful" matches the two templates 0 times and 1 time respectively. The matched words are then screened according to a certain rule, and the screened words are added to the candidate word set.
Optionally, the step of adding a part of words into the candidate word set based on the matching times of the matched words and the templates includes:
adding, for each template, the g matched words that match that template most often to the candidate word set, wherein g is a positive integer; or
adding the matched words whose number of matches with the templates exceeds a specific threshold to the candidate word set.
For example, still taking the document set "too give force, too pit die, good give force, good pit die, good beautiful, very pit die" and the templates "too + * + [interjection]" and "good + * + [interjection]": the number of matches between each word and each template is counted. If the words ranked first by number of matches with the template "too + * + [interjection]" or with the template "good + * + [interjection]" are added to the candidate word set, then "give force", "pit die" and "beautiful" are added; if the matched words that match the templates more than 1 time are added, the union of "give force", "pit die" and "beautiful" is added to the candidate word set.
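The per-template selection can be sketched as follows. The sketch assumes that words tied with the g-th best count are all kept, which is one reading that reproduces the first-rank result of the example; the function name, parameters and shortened template labels are hypothetical.

```python
from collections import Counter, defaultdict

def select_words(pair_counts, top_g=None, min_count=None):
    """pair_counts maps (template, word) to the number of matches."""
    per_template = defaultdict(Counter)
    for (template, word), count in pair_counts.items():
        per_template[template][word] = count
    selected = set()
    for counter in per_template.values():
        if top_g is not None:
            ranked = counter.most_common()
            # Assumption: keep every word tied with the g-th best count.
            cutoff = ranked[min(top_g, len(ranked)) - 1][1]
            selected.update(w for w, c in ranked if c >= cutoff)
        else:
            selected.update(w for w, c in counter.items() if c > min_count)
    return selected

pair_counts = {
    ("too", "give force"): 1, ("too", "pit die"): 1,
    ("good", "give force"): 1, ("good", "pit die"): 1, ("good", "beautiful"): 1,
}
first_ranked = select_words(pair_counts, top_g=1)
```

Because the result is a set, a word selected under several templates is added only once, which corresponds to taking the union over templates.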
In step S105, the candidate words in the candidate word set are sorted based on the templates in the candidate template set, and a certain number of candidate words are added to the new word set based on the sorting of the candidate words in the candidate word set by the templates in the candidate template set.
Specifically, taking the candidate template set {"too + * + [interjection]", "good + * + [interjection]"} and the candidate word set {"give force", "pit die"} as an example, the candidate words may be sorted based on the number of times each word in the candidate word set matches each template in the candidate template set. Suppose the candidate word "give force" matches the template "too + * + [interjection]" 5 times and the template "good + * + [interjection]" 4 times, and the candidate word "pit die" matches the template "too + * + [interjection]" 3 times and the template "good + * + [interjection]" 5 times. The candidate words may then be ranked by their total number of matches with all templates in the candidate template set: the total for "give force" is 9 and the total for "pit die" is 8. The candidate words may also be sorted according to their number of matches with each individual template in the candidate template set, or by other manners or algorithms, which is not limited herein.
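The total-count ranking in this example can be written out as a short sketch. The function name and the nested-dictionary layout of the counts are assumptions made for illustration; the numbers are the ones given above.

```python
def rank_candidates(match_counts, candidate_templates):
    """Sort candidate words by total matches with the candidate templates."""
    totals = {
        word: sum(n for t, n in per_template.items() if t in candidate_templates)
        for word, per_template in match_counts.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

match_counts = {
    "give force": {"too": 5, "good": 4},
    "pit die": {"too": 3, "good": 5},
}
ranking = rank_candidates(match_counts, {"too", "good"})
top_1 = [word for word, _ in ranking[:1]]   # keep the best m = 1 candidates
```

Restricting the sum to the candidate template set means that filtering the templates changes the ranking, even though the underlying match counts stay the same.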
Optionally, the sorting the candidate words in the candidate word set based on the templates in the candidate template set, and the adding a certain number of candidate words to the new word set based on the sorting the candidate words in the candidate word set by the templates in the candidate template set includes:
adding the m candidate words that match the templates in the candidate template set most often to the new word set, wherein m is a positive integer; or
adding the candidate words whose number of matches with the templates in the candidate template set exceeds a specific threshold to the new word set.
Optionally, in the step of ordering the candidate words in the candidate word set based on the templates in the candidate template set,
the weight $weight(w_i)$ of each candidate word in the candidate word set is calculated according to one or any combination of $LLR(w_i)$, $E(w_i)$, $P(w_i)$, $EMI(w_i)$ and $1/NMED(w_i)$, for example as $LLR(w_i)$, $LLR(w_i) \cdot E(w_i)$, $LLR(w_i) \cdot E(w_i) \cdot P(w_i)$, $LLR(w_i) \cdot E(w_i) \cdot EMI(w_i)$ or $LLR(w_i) \cdot E(w_i) / NMED(w_i)$, and the candidate words in the candidate word set are sorted based on the calculated weights. Of course, the calculation of $weight(w_i)$ is not limited to the above; any calculation that can evaluate the closeness of association between a candidate word and the templates in the candidate template set and/or a cohesion measure of the candidate word may be suitable for use in the present invention and is incorporated herein by reference.
Here, $w_i$ represents a candidate word in the candidate word set, $LLR(w_i)$ represents the closeness of association between the candidate word $w_i$ and the templates in the candidate template set, $E(w_i)$ represents the left entropy of $w_i$, $P(w_i)$ represents the word-formation probability of $w_i$, and $EMI(w_i)$ and $NMED(w_i)$ represent different measures of the semantic compositionality of $w_i$.
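As a small illustration of how the listed measures may be combined, the sketch below multiplies whichever measures are selected and sorts the candidates by the result. The measure values are invented placeholders and the function name `combined_weight` is hypothetical; only the combination pattern follows the text.

```python
from math import prod

def combined_weight(measures, use):
    """Multiply the selected measures, e.g. use=("LLR", "E", "1/NMED")."""
    return prod(measures[name] for name in use)

# Placeholder measure values for two candidate words (not real data).
candidates = {
    "give force": {"LLR": 4.0, "E": 1.5, "P": 0.5, "EMI": 3.0, "1/NMED": 2.0},
    "pit die": {"LLR": 2.0, "E": 1.0, "P": 0.5, "EMI": 2.0, "1/NMED": 4.0},
}
use = ("LLR", "E", "1/NMED")  # corresponds to LLR * E / NMED
weights = {word: combined_weight(m, use) for word, m in candidates.items()}
ranking = sorted(weights, key=lambda word: -weights[word])
```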
Wherein, in the step of ordering the candidate words in the candidate word set based on the templates in the candidate template set, $LLR(w_i)$, $E(w_i)$, $P(w_i)$, $EMI(w_i)$ and $1/NMED(w_i)$ can be obtained by the following calculations, respectively:
$n_{1j} = k_{1j} + k_{3j}$, $n_{2j} = k_{2j} + k_{4j}$
where $W$ represents the candidate word set, $P$ represents the candidate template set, $w_i$ represents a candidate word in $W$, and $p_j$ represents a template in the candidate template set $P$. Among the matches found in the document set for each of the one or more templates, $k_{1j}$ is the number of matches that contain both $w_i$ and $p_j$; $k_{2j}$ is the number of matches that contain $w_i$ but not $p_j$; $k_{3j}$ is the number of matches that contain $p_j$ but not $w_i$; and $k_{4j}$ is the number of matches that contain neither $p_j$ nor $w_i$. For example, taking the document set "Tai Dong Li, Tai Tuo Li, Hao Tuo Li and Suo Tuo Li", the candidate word "Dong Li", and the template $p_1$ = "too +/+ plus" in the candidate template set $P$, the counts are: $k_{11} = 1$, $k_{21} = 1$, $k_{31} = 1$, $k_{41} = 2$;
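The four counts form a 2x2 contingency table for one word and one template. The sketch below tallies them from a list of (word, template) matches and then applies Dunning's log-likelihood ratio for two binomial proportions as one plausible way of turning the counts into a closeness score. Both the match list (built to give the counts 1, 1, 1, 2) and the use of Dunning's statistic are assumptions made for illustration; the function names are hypothetical.

```python
import math

def contingency(matches, word, template):
    """Return (k1, k2, k3, k4) for one word and one template.

    matches is a list of (matched_word, template) pairs (hypothetical layout).
    """
    k1 = k2 = k3 = k4 = 0
    for w, p in matches:
        if w == word and p == template:
            k1 += 1          # contains both the word and the template
        elif w == word:
            k2 += 1          # contains the word but not the template
        elif p == template:
            k3 += 1          # contains the template but not the word
        else:
            k4 += 1          # contains neither
    return k1, k2, k3, k4

def _log_l(p, k, n):
    """Binomial log-likelihood k*log(p) + (n-k)*log(1-p), with 0*log(0) = 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def llr(k1, k2, k3, k4):
    """Dunning-style log-likelihood ratio (assumed form), n1 = k1+k3, n2 = k2+k4."""
    n1, n2 = k1 + k3, k2 + k4
    p1 = k1 / n1 if n1 else 0.0
    p2 = k2 / n2 if n2 else 0.0
    p = (k1 + k2) / (n1 + n2) if (n1 + n2) else 0.0
    return 2.0 * (_log_l(p1, k1, n1) + _log_l(p2, k2, n2)
                  - _log_l(p, k1, n1) - _log_l(p, k2, n2))

matches = [
    ("give force", "T1"), ("give force", "T2"),
    ("pit die", "T1"), ("pit die", "T2"), ("pit die", "T3"),
]
counts = contingency(matches, "give force", "T1")
score = llr(*counts)
```

The score is zero when the word occurs with the template at the same rate as without it, and grows as the two rates diverge.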
Wherein, $E(w_i)$ is calculated as follows:
$$E(w_i) = -\sum_{l_o \in L} \frac{c(l_o)}{N} \log \frac{c(l_o)}{N}$$
where $L$ represents the set of left words $l_o$ that occur immediately to the left of the candidate word $w_i$ in the document set within a match of any template in the candidate template set, $c(l_o)$ represents the frequency with which the left word $l_o$ occurs to the left of the candidate word $w_i$ within such a match, and $N$ represents the total number of co-occurrences of the candidate word $w_i$ with the templates in the candidate template set. Usually, the larger the left entropy $E(w_i)$, the more diverse the left words of the candidate template set with which the candidate word collocates, and therefore the more likely the candidate word is a new word. For example, taking the document set {"give a force o well", "give a force la too much", "give a force a good", "give a force really"}, the candidate word "give a force", and the candidate template set {"good + + o", "too + + la", "good + + a", "true + +"}, the left-word set $l_o$ is {good, too, true}, with $c(l_1) = 2$, $c(l_2) = 2$, $c(l_3) = 1$, and $E(w_i)$ is calculated as follows:
$E(w_i) = -\frac{2}{5}\log\frac{2}{5} - \frac{2}{5}\log\frac{2}{5} - \frac{1}{5}\log\frac{1}{5}$;
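The left-entropy computation in this example can be checked with a short sketch. It assumes the natural logarithm (the text does not state the base) and takes the left words as a plain list, one entry per co-occurrence; the function name is hypothetical.

```python
import math
from collections import Counter

def left_entropy(left_words):
    """Entropy of the distribution of words seen to the left of a candidate."""
    counts = Counter(left_words)
    total = sum(counts.values())  # N: total co-occurrences with the templates
    return -sum((c / total) * math.log(c / total) for c in counts.values())

# Left words from the example: "good" twice, "too" twice, "true" once.
entropy = left_entropy(["good", "good", "too", "too", "true"])
```

A candidate that always follows the same left word gets entropy 0, while one that follows many different left words gets a higher value.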
wherein, for a candidate word $w_i$ in the candidate word set, $P(w_i)$ is calculated as follows:
$$P(w_i) = \prod_{h=1}^{n} \frac{P(t_h)}{1 - P(t_h)}, \qquad P(t_h) = \frac{all(t_h) - s(t_h)}{all(t_h)}$$
where $t_h$ represents the h-th character of the candidate word $w_i$, and $n$ represents the number of characters contained in $w_i$;
$all(t_h)$ represents the number of times the h-th character of $w_i$ appears in the document set, and $s(t_h)$ represents the number of times the h-th character of $w_i$ combines with other characters in the document set to appear as a single word. Take as an example a two-character candidate word whose first character is "love" and whose second character is "say". Statistics show that "love" appears 100 times in the document set and "say" appears 200 times, that "love" combines with other characters in the document set into a single word 50 times, and that "say" does so 150 times. The word-formation probability of the first character is therefore $P(\text{love}) = (100 - 50)/100$, that of the second character is $P(\text{say}) = (200 - 150)/200$, and that of the candidate word is $(P(\text{love})/(1 - P(\text{love}))) \cdot (P(\text{say})/(1 - P(\text{say})))$;
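The arithmetic of this example can be sketched as follows. The per-character probability is (all - s) / all, and the word-level value is taken as the product of the odds P / (1 - P) of its characters, as in the worked example; the function names and the input layout are assumptions.

```python
def char_probability(total, combined):
    """(all(t) - s(t)) / all(t): share of occurrences not merged into another word."""
    return (total - combined) / total

def word_formation_probability(char_stats):
    """Product of the odds P/(1-P) over the characters of the candidate word.

    char_stats is a list of (all(t_h), s(t_h)) pairs, one per character.
    """
    result = 1.0
    for total, combined in char_stats:
        p = char_probability(total, combined)
        if p == 1.0:
            return float("inf")  # the character never merges into another word
        result *= p / (1.0 - p)
    return result

# "love": 100 occurrences, 50 combined; "say": 200 occurrences, 150 combined.
p_love = char_probability(100, 50)
p_say = char_probability(200, 150)
p_word = word_formation_probability([(100, 50), (200, 150)])
```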
wherein, for a candidate word $w_i$ in the candidate word set, the semantic compositionality measure EMI is calculated as follows:
where $S$ is the total number of segments in the document set $M$, $n$ represents the number of characters contained in the candidate word $w_i$, $F$ represents the number of segments in the document set that contain the candidate word $w_i$, and $F_h$ represents the number of segments in the document set that contain the h-th character of $w_i$. A "segment" may be defined as a natural paragraph of each document in the document set $M$, or as each document in the document set $M$, which is not limited herein; for example, if the document set $M$ consists of multiple microblog posts, each post may be regarded as a segment. Taking the candidate word "give force" as an example, suppose the total number of segments in the document set $M$ is 200, the number of segments containing the candidate word "give force" is 10, the number of segments containing its first character "give" is 20, and the number of segments containing its second character "force" is 30. Then $S = 200$, $n = 2$, $F = 10$, $F_1 = 20$, $F_2 = 30$, from which $EMI(w_i)$ is calculated.
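For orientation, the sketch below evaluates the example values with the enhanced mutual information definition commonly given in the literature, log2 of (F/S) divided by the product of (F_h - F)/S over the characters. That this is the exact expression intended here is an assumption, as are the function name and the handling of the degenerate case.

```python
import math

def emi(total_segments, word_segments, char_segments):
    """Enhanced mutual information (assumed literature form).

    total_segments: S, number of segments in the document set
    word_segments:  F, segments containing the whole candidate word
    char_segments:  [F_1, ..., F_n], segments containing each character
    """
    denominator = 1.0
    for f_h in char_segments:
        # Segments containing the character but not the whole word.
        denominator *= (f_h - word_segments) / total_segments
    if denominator == 0.0:
        return float("inf")  # a character never occurs outside the word
    return math.log2((word_segments / total_segments) / denominator)

# Values from the example: S = 200, F = 10, F_1 = 20, F_2 = 30.
value = emi(200, 10, [20, 30])
```

Under this assumed form, a higher value means the characters occur together far more often than apart.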
wherein, for a candidate word $w_i$ in the candidate word set, the semantic compositionality measure NMED is calculated as follows:
where $S$ is the total number of segments in the document set $M$, $\mu(g)$ represents the number of segments in the document set $M$ that contain all the characters of the candidate word $w_i$, and the second quantity, also a function of $g$, represents the number of segments in the document set $M$ in which all the characters of $w_i$ appear strictly consecutively as a single phrase. Usually, the smaller the value of $NMED(w_i)$, the greater the probability that each character of the candidate word $w_i$ forms a word together with the other characters of $w_i$. Taking the candidate word "give force" as an example, a segment in which its characters appear dispersedly may be: "busy a day without any strength, give kneeling."; a segment in which its characters appear strictly consecutively may be: "their performance is simply too powerful!".
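The two segment counts that NMED is built on can be obtained with simple substring tests, as sketched below. The Chinese strings are back-translated guesses at the two example segments and are hypothetical, as is the function name; the sketch stops at the counts and does not compute NMED itself.

```python
def nmed_counts(segments, word):
    """Return (segments containing every character, segments containing the word).

    The first count corresponds to mu(g); the second counts only strictly
    consecutive occurrences of the characters as a single phrase.
    """
    chars = set(word)
    contains_all = sum(1 for seg in segments if all(ch in seg for ch in chars))
    consecutive = sum(1 for seg in segments if word in seg)
    return contains_all, consecutive

segments = [
    "忙了一天没有力气，给跪了。",   # characters present but dispersed
    "他们的表演简直太给力了！",     # characters strictly consecutive
    "今天天气不错。",               # neither character present
]
counts = nmed_counts(segments, "给力")
```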
Optionally, before the candidate words in the candidate word set are sorted based on the templates in the candidate template set, the templates in the candidate template set are sorted using a predefined new word set, and the candidate template set is filtered based on that sorting. The words in the predefined new word set may be any number of words specified in advance. To improve the accuracy of new word discovery, words that the user already considers new or popular before new word discovery are typically chosen for this set.
Optionally, the step of sorting the templates in the candidate template set using a predefined new word set, and filtering the candidate template set based on that sorting, includes:
selecting, from the candidate template set, the top r candidate templates ranked by the number of matches with the new words in the new word set, to obtain a filtered candidate template set, wherein r is a positive integer; or
selecting, from the candidate template set, the candidate templates whose number of matches with the new words in the new word set is higher than a specific threshold, to obtain a filtered candidate template set.
Optionally, the method further comprises: ranking the templates in the candidate template set using the obtained new word set, filtering the candidate template set based on that ranking, ranking the candidate words in the candidate word set again using the filtered candidate template set, and again adding a certain number of candidate words to the new word set based on the renewed ranking. Here, "again" does not limit the number of repetitions, which may be adjusted according to a set condition. It should be noted that the "obtained new word set" here refers to the set obtained by processing the "predefined new word set" through the method for automatically discovering new words from a document set provided by the present invention.
Optionally, the step of ranking the templates in the candidate template set using the obtained new word set, and filtering the candidate template set based on that ranking, includes:
selecting, from the candidate template set, the top r candidate templates ranked by the number of matches with the new words in the new word set, to obtain a filtered candidate template set, wherein r is a positive integer; or
selecting, from the candidate template set, the candidate templates whose number of matches with the new words in the new word set is higher than a specific threshold, to obtain a filtered candidate template set.
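The alternation between filtering templates with the current new word set and re-ranking candidate words with the filtered templates can be sketched as a small loop. Plain match counts stand in for the template and word weights here, which is a simplification; the data, the template labels and all function names are hypothetical.

```python
from collections import Counter

def rank_templates(match_counts, new_words):
    """Rank templates by how often they match words already in the new word set."""
    scores = Counter()
    for (word, template), count in match_counts.items():
        if word in new_words:
            scores[template] += count
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

def rank_words(match_counts, templates, exclude):
    """Rank remaining candidate words by matches with the filtered templates."""
    scores = Counter()
    for (word, template), count in match_counts.items():
        if template in templates and word not in exclude:
            scores[word] += count
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))

def bootstrap(match_counts, seed_words, r, m, rounds):
    """Repeat: keep the top r templates, then add the top m new candidate words."""
    new_words = set(seed_words)
    for _ in range(rounds):
        kept = {t for t, _ in rank_templates(match_counts, new_words)[:r]}
        ranked = rank_words(match_counts, kept, new_words)
        new_words.update(word for word, _ in ranked[:m])
    return new_words

match_counts = {
    ("give force", "T1"): 5, ("give force", "T2"): 4,
    ("pit die", "T1"): 3, ("pit die", "T2"): 5,
    ("awesome", "T2"): 2, ("weather", "T3"): 7,
}
result = bootstrap(match_counts, {"give force"}, r=1, m=1, rounds=2)
```

In this toy run the frequent but unrelated word "weather" is never added, because its only template never matches a word in the new word set.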
Optionally, the templates in the candidate template set are ranked with the obtained new word set by calculating a weight for each template in the candidate template set based on the following formula and ranking the templates according to the calculated weights:
$n_{1i} = k_{1i} + k_{3i}$, $n_{2i} = k_{2i} + k_{4i}$
where $W$ represents the new word set, $P$ represents the candidate template set, $w_i$ represents a word in the new word set $W$, and $p_j$ represents a template in the candidate template set $P$. Among the matches found in the document set for each of the one or more templates, $k_{1i}$ is the number of matches that contain both $w_i$ and $p_j$; $k_{2i}$ is the number of matches that contain $w_i$ but not $p_j$; $k_{3i}$ is the number of matches that contain $p_j$ but not $w_i$; and $k_{4i}$ is the number of matches that contain neither $p_j$ nor $w_i$.
FIG. 2 shows a schematic block diagram of an apparatus for automatically discovering new words from a document collection according to another embodiment of the present invention. According to another embodiment of the present invention, an apparatus for automatically discovering new words from a document collection comprises:
a template acquisition unit 201 configured to acquire one or more templates;
a word extraction unit 202 configured to extract words matching each of the one or more templates from the document set;
a candidate template set adding unit 203, configured to select at least a part of templates from the one or more templates to add to a candidate template set;
a candidate word set adding unit 204 configured to select at least a part of the extracted words matching each of the one or more templates and add the selected words to a candidate word set;
a new word set adding unit 205 configured to rank the candidate words in the candidate word set based on the templates in the candidate template set, and add a certain number of candidate words to the new word set based on the ranking of the candidate words in the candidate word set by the templates in the candidate template set.
It should be understood that the block diagram shown in fig. 2 is for exemplary purposes only and is not limiting upon the scope of the present invention. In some cases, certain elements or devices may be added or subtracted as appropriate.
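To make the division of work between units 201 to 205 concrete, the sketch below chains five small steps over space-separated (already segmented) text. Everything specific in it is an assumption for illustration: the templates are (left word, right word) pairs with a slot of one or two tokens between them, all templates and all matched words are kept as candidates, and ranking uses total match counts.

```python
import re
from collections import Counter

def acquire_templates():
    """Unit 201: return predefined templates as (left, right) pairs (hypothetical)."""
    return [("too", "le"), ("so", "oh")]

def extract_words(documents, templates):
    """Unit 202: count the words found in each template's slot."""
    matches = Counter()
    for left, right in templates:
        # The slot captures one or two whitespace-separated tokens.
        pattern = re.compile(
            rf"(?<!\S){re.escape(left)} (\S+(?: \S+)?) {re.escape(right)}(?!\S)"
        )
        for doc in documents:
            for word in pattern.findall(doc):
                matches[(word, (left, right))] += 1
    return matches

def discover(documents, m):
    templates = acquire_templates()
    matches = extract_words(documents, templates)
    candidate_templates = set(templates)            # unit 203: keep all templates
    candidate_words = {word for word, _ in matches}  # unit 204: keep all words
    totals = Counter()                               # unit 205: rank and select
    for (word, template), count in matches.items():
        if word in candidate_words and template in candidate_templates:
            totals[word] += count
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:m]]

documents = [
    "this show is too give force le",
    "so give force oh",
    "the boss is too pit die le",
]
found = discover(documents, 1)
```

Each unit maps to one function or one clearly marked step, so a unit can be replaced (for example, by a weighted ranking in unit 205) without touching the others.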
Optionally, the template obtaining unit 201 obtains the one or more templates by any one of the following methods:
predefining the one or more templates; or
after a document set is obtained, performing word segmentation on the document set, and extracting, from the segmented document set, the one or more templates that match a specific regular expression.
Optionally, the apparatus for automatically discovering new words from a document set further includes:
a candidate template set filtering unit 206 configured to rank the templates in the candidate template set with a predefined new word set before ranking the candidate words in the candidate word set based on the templates in the candidate template set, and to filter the candidate template set based on the ranking of the templates in the candidate template set with the predefined new word set.
Optionally, the new word set adding unit 205 is configured to sort again the candidate words in the candidate word set by using the filtered candidate template set and add again a certain number of candidate words to the new word set based on the sorting.
As will be appreciated by one skilled in the art, the present invention may be embodied as an apparatus, device, method, or computer program product. Accordingly, the present disclosure may be embodied entirely in hardware, entirely in software, or in a combination of hardware and software.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims (10)

1. A method of automatically discovering new words from a document collection, comprising:
acquiring one or more templates (S101), wherein the templates comprise words and spaces or/and custom symbols;
extracting words matched with each template in the one or more templates from the document set (S102), wherein the words matched with each template in the one or more templates extracted from the document set are words except for words included in the template;
selecting at least one part of templates from the one or more templates and adding the selected part of templates into a candidate template set (S103);
at least selecting a part of words from the extracted words matched with each template in the one or more templates and adding the selected words into a candidate word set (S104);
ranking the candidate words in the candidate word set based on the templates in the candidate template set, and adding a certain number of candidate words to the new word set based on that ranking (S105).
2. The method of claim 1, wherein the one or more templates are obtained by any one of:
predefining said one or more templates, or
after a document set is obtained, performing word segmentation on the document set, and extracting, from the segmented document set, the one or more templates that match a specific regular expression.
3. The method of claim 1, wherein the step of selecting at least a portion of the templates from the one or more templates for adding to the set of candidate templates comprises any one of:
adding all of the one or more templates into a candidate template set;
adding a portion of the templates to a set of candidate templates based on a number of times each of the one or more templates occurs in the set of documents.
4. The method of claim 3, wherein adding a portion of the templates to the set of candidate templates based on a number of times each of the one or more templates occurs in the set of documents comprises:
adding the top f templates, ranked by frequency of occurrence in the document set, to the candidate template set, wherein f is a positive integer; or
adding the templates that appear in the document set more than a specific threshold number of times to the candidate template set.
5. The method of claim 1, wherein the step of selecting at least a portion of the extracted words that match each of the one or more templates to be added to the set of candidate words comprises any one of:
adding all the matched words into a candidate word set;
adding a part of the words to the candidate word set based on the number of matches between the matched words and each template.
6. The method of claim 5, wherein adding a portion of words to a set of candidate words based on the number of matches of the matched words to each template comprises:
adding, from the matched words, the top g words ranked by the number of matches with each template to the candidate word set, wherein g is a positive integer; or
adding, from the matched words, the words whose number of matches with the templates exceeds a specific threshold to the candidate word set.
7. The method of claim 1, further comprising: before the candidate words in the candidate word set are sorted based on the templates in the candidate template set, the templates in the candidate template set are sorted by a preset new word set, and the candidate template set is filtered based on the sorting of the templates in the candidate template set by the preset new word set.
8. The method of claim 1, further comprising: ranking the templates in the candidate template set using the obtained new word set, filtering the candidate template set based on that ranking, ranking the candidate words in the candidate word set again using the filtered candidate template set, and again adding a certain number of candidate words to the new word set based on the renewed ranking.
9. The method of claim 1, wherein in the step of ordering candidate words in the set of candidate words based on templates in the set of candidate templates,
the weights of the candidate words in the candidate word set are calculated according to $LLR(w_i)$, $E(w_i)$, $P(w_i)$, $EMI(w_i)$ and $1/NMED(w_i)$, and the candidate words in the candidate word set are ordered based on the calculated weights;
wherein $w_i$ represents a candidate word in the candidate word set, $LLR(w_i)$ represents the closeness of association between the candidate word $w_i$ and the templates in the candidate template set, $E(w_i)$ represents the left entropy of $w_i$, $P(w_i)$ represents the word-formation probability of $w_i$, and $EMI(w_i)$ and $1/NMED(w_i)$ respectively represent different measures of the semantic compositionality of $w_i$, and wherein $LLR(w_i)$, $E(w_i)$, $P(w_i)$, $EMI(w_i)$ and $1/NMED(w_i)$ are respectively obtained by the following calculations:
$n_{1j} = k_{1j} + k_{3j}$, $n_{2j} = k_{2j} + k_{4j}$
where $W$ represents the candidate word set, $P$ represents the candidate template set, $w_i$ represents a candidate word in $W$, $p_j$ represents a template in the candidate template set $P$, and, among the matches found in the document set for each of the one or more templates, $k_{1j}$ is the number of matches that contain both $w_i$ and $p_j$, $k_{2j}$ is the number of matches that contain $w_i$ but not $p_j$, $k_{3j}$ is the number of matches that contain $p_j$ but not $w_i$, and $k_{4j}$ is the number of matches that contain neither $p_j$ nor $w_i$;
where $L$ represents the set of left words $l_o$ that occur immediately to the left of the candidate word $w_i$ in the document set within a match of any template in the candidate template set, $c(l_o)$ represents the frequency with which the left word $l_o$ occurs to the left of the candidate word $w_i$ within such a match, and $N$ represents the total number of co-occurrences of the candidate word $w_i$ with the templates in the candidate template set;
where $t_h$ represents the h-th character of the candidate word $w_i$, and $n$ represents the number of characters contained in $w_i$;
$all(t_h)$ represents the number of times the h-th character of $w_i$ appears in the document set, and $s(t_h)$ represents the number of times the h-th character of $w_i$ combines with other characters in the document set to appear as a single word;
where $S$ is the total number of segments in the document set $M$, $n$ represents the number of characters contained in the candidate word $w_i$, $F$ represents the number of segments in the document set that contain the candidate word $w_i$, and $F_h$ represents the number of segments in the document set that contain the h-th character of $w_i$;
where $S$ is the total number of segments in the document set $M$, $\mu(g)$ represents the number of segments in the document set $M$ that contain all the characters of the candidate word $w_i$, and the second quantity, also a function of $g$, represents the number of segments in the document set $M$ in which all the characters of $w_i$ appear strictly consecutively as a single phrase.
10. An apparatus for automatically discovering new words from a document collection, comprising:
a template obtaining unit (201) configured to obtain one or more templates, the templates comprising words and spaces or/and custom symbols;
a word extraction unit (202) configured to extract words matching each of the one or more templates from the document set, the words matching each of the one or more templates being words other than the words included in the template;
a candidate template set adding unit (203) configured to select at least a part of the one or more templates from the one or more templates to add to the candidate template set;
a candidate word set adding unit (204) configured to select at least a part of words from the extracted words matched with each of the one or more templates and add the selected words to a candidate word set;
a new word set adding unit (205) configured to rank the candidate words in the candidate word set based on the templates in the candidate template set, and add a certain number of candidate words to the new word set based on the ranking of the candidate words in the candidate word set by the templates in the candidate template set.
CN201410220317.6A 2014-05-23 2014-05-23 A kind of method and device for finding neologisms automatic from document sets Expired - Fee Related CN103955453B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410220317.6A CN103955453B (en) 2014-05-23 2014-05-23 A kind of method and device for finding neologisms automatic from document sets


Publications (2)

Publication Number Publication Date
CN103955453A CN103955453A (en) 2014-07-30
CN103955453B true CN103955453B (en) 2017-09-29

Family

ID=51332728

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410220317.6A Expired - Fee Related CN103955453B (en) 2014-05-23 2014-05-23 A kind of method and device for finding neologisms automatic from document sets

Country Status (1)

Country Link
CN (1) CN103955453B (en)


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170929