CN108875040B - Dictionary updating method and computer-readable storage medium - Google Patents


Info

Publication number
CN108875040B
CN108875040B (application CN201810677081.7A)
Authority
CN
China
Prior art keywords
word
candidate data
data string
data
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810677081.7A
Other languages
Chinese (zh)
Other versions
CN108875040A (en)
Inventor
朱频频
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Xiaoi Robot Technology Co Ltd
Original Assignee
Shanghai Xiaoi Robot Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Xiaoi Robot Technology Co Ltd filed Critical Shanghai Xiaoi Robot Technology Co Ltd
Priority to CN201810677081.7A priority Critical patent/CN108875040B/en
Publication of CN108875040A publication Critical patent/CN108875040A/en
Application granted granted Critical
Publication of CN108875040B publication Critical patent/CN108875040B/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3334Selection or weighting of terms from queries, including natural language queries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a dictionary updating method and a computer-readable storage medium. The method discovers new words in the following way: preprocessing the received corpus to obtain text data; splitting the text data into lines to obtain sentence data; performing word segmentation on the sentence data according to the single words contained in the dictionary to obtain segmented word data; combining adjacent segmented word data to generate candidate data strings; and judging the candidate data strings to find new words. The judgment includes: calculating the information entropy of each word in a candidate data string with its outer words, and removing any candidate data string whose information entropy with the outer words falls outside a preset range. When a new word is found, it is added to the dictionary, and the updated dictionary is used to repeat the word segmentation, combination and new word discovery until no further new word is found. The invention can improve the accuracy of new word discovery.

Description

Dictionary updating method and computer-readable storage medium
This application is a divisional application of Chinese application No. 201510706254.X, filed on October 27, 2015 and entitled "New word discovery method and device".
Technical Field
The present invention relates to the field of intelligent interaction, and in particular, to a dictionary updating method and a computer-readable storage medium.
Background
Many fields of Chinese information processing rely on dictionary-based functions. For example, in an intelligent retrieval system or an intelligent dialogue system, word segmentation, question retrieval, similarity matching, determination of search results and determination of answers are all computed with the single word as the minimum unit, and all of these computations are based on a word dictionary; the word dictionary therefore has a great influence on the performance of the whole system.
Social and cultural change and rapid economic development constantly drive language change, and the appearance of new words reflects that change fastest. Particularly in a specific field, whether the word dictionary can be updated promptly after a new word appears has a decisive influence on the efficiency of the intelligent dialogue system that uses it.
A new word is a newly discovered single word. In the prior art there are at least three sources: new words in the domain provided by the customer; new words discovered in a corpus provided by the customer; and new words found during operation.
The accuracy of new word discovery in the prior art needs to be improved.
Disclosure of Invention
The invention solves the technical problem of how to improve the accuracy of finding new words.
To solve the above technical problem, an embodiment of the present invention provides a computer-readable storage medium having a program stored thereon which, when executed, implements a new word discovery method. The method includes:
preprocessing the received corpus to obtain text data;
splitting the text data into lines to obtain sentence data;
performing word segmentation on the sentence data according to the single words contained in a dictionary to obtain segmented word data;
combining adjacent segmented word data to generate candidate data strings;
judging the candidate data strings to find new words; the judgment includes: calculating the information entropy of each word in a candidate data string with its outer words, and removing any candidate data string whose information entropy with the outer words falls outside a preset range.
Optionally, the judgment further includes: calculating a frequency-related probability feature value of the candidate data string, and removing the candidate data string when that value falls outside a preset range.
Optionally, the frequency-related probability feature values include: the occurrence count and the frequency of the candidate data string, or a value calculated from its occurrence count and frequency.
Optionally, the judgment further includes: calculating mutual information among the word data in the candidate data string, and removing any candidate data string whose mutual information falls outside a preset range.
Optionally, the judgment further includes: calculating the information entropy of the boundary word data of the candidate data string with its inner word data, and removing any candidate data string whose information entropy falls outside a preset range.
Optionally, judging the candidate data strings to find new words sequentially includes:
calculating the frequency of the candidate data strings, and removing those whose frequency falls outside a preset range;
calculating the mutual information of the remaining candidate data strings, and removing those whose mutual information falls outside a preset range;
calculating the information entropy of the boundary word data of the remaining candidate data strings with their inner word data, and removing those whose information entropy falls outside a preset range;
calculating the information entropy of the boundary word data of the remaining candidate data strings with their outer word data, and removing those whose information entropy falls outside a preset range;
taking the remaining candidate data strings as new words.
Optionally, generating the candidate data strings includes: using a Bigram model to take adjacent words within the same line of sentence data as candidate data strings.
Optionally, preprocessing the received corpus to obtain text data includes: unifying the format of the corpus into a text format; and filtering one or more of dirty words, sensitive words, and stop words.
Optionally, the method further includes: setting a length range for the candidate data strings so as to exclude candidate data strings whose length falls outside the length range.
An embodiment of the invention also provides a dictionary updating method, including:
discovering new words in the following way: preprocessing the received corpus to obtain text data; splitting the text data into lines to obtain sentence data; performing word segmentation on the sentence data according to the single words contained in the dictionary to obtain segmented word data; combining adjacent segmented word data to generate candidate data strings; judging the candidate data strings to find new words, the judgment including: calculating the information entropy of each word in a candidate data string with its outer words, and removing any candidate data string whose information entropy with the outer words falls outside a preset range;
when a new word is found, adding the new word to the dictionary, and using the updated dictionary to perform the word segmentation, combination and new word discovery again until no further new word is found.
Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:
the information entropy of each word and the outer word in the candidate data string is judged by calculating the information entropy of each word and the outer word in the candidate data string, so that the possibility of combining each word and the outer word in the candidate data string can be judged; and removing the candidate data strings of which the information entropies of the words and the words outside the words are out of the preset range, and removing the candidate data strings of which the combination possibility of the words and the words outside the words in the candidate data strings is high, so that the accuracy of the new word discovery method can be improved.
Further, when the type of the probability characteristic value of the candidate data string to be calculated to become the new word is more than one, the candidate data string is sequentially judged, whether the probability characteristic value in the front of the calculation order is in the preset range is judged, and only the candidate data string with the probability characteristic value in the preset range is subjected to calculation of the probability characteristic value in the back of the order, so that the calculation range in the back of the order can be reduced, the calculation amount is reduced, and the updating efficiency is improved.
In addition, the length range of the candidate data string is set to exclude the adjacent word data with the length outside the length range, so that only the probability characteristic value calculation is carried out on the adjacent word data with the length within the length range, finally, the calculation amount of new word discovery can be further reduced, and the updating efficiency is improved.
Drawings
FIG. 1 is a flow chart of a method for discovering new words in an embodiment of the present invention;
FIG. 2 is a flow chart of another method for discovering new words in an embodiment of the present invention;
FIG. 3 is a flow chart of another method for discovering new words in an embodiment of the present invention;
FIG. 4 is a flow chart of another method for discovering new words in an embodiment of the present invention;
FIG. 5 is a flowchart of a judgment process according to an embodiment of the present invention;
FIG. 6 is a schematic structural diagram of a new word discovery apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of another new word discovery apparatus according to an embodiment of the present invention.
Detailed Description
The inventor found through research that existing new word discovery methods only judge how tightly the words within a candidate data string combine, and take a candidate data string whose internal words combine tightly as a new word. However, in some candidate data strings the words combine even more tightly with outside words, so such strings are not suitable as new words. Judging only the relations among the words inside a candidate data string therefore yields insufficiently accurate results.
In the embodiments of the invention, the information entropy of each word in a candidate data string with its outer words is calculated, and candidate data strings whose information entropy falls outside a preset range are removed. Candidate data strings judged better suited to combining with outside words are thus eliminated, so the accuracy of new word discovery can be improved.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in detail below.
Fig. 1 is a flowchart of a new word discovery method according to an embodiment of the present invention.
S11, preprocessing the received corpus to obtain text data.
The corpus may be any text passage that may contain a new word appearing in a specific field. For example, when the dictionary is applied to a bank's intelligent question-answering system, the corpus may be articles provided by the bank, frequently asked questions, system logs, and the like.
Diverse corpus sources allow new words to be found more comprehensively, but they also bring many format types, so the corpus needs to be preprocessed into text data to facilitate subsequent processing.
In particular implementations, the preprocessing may unify the format of the corpus into a text format and filter one or more of dirty words, sensitive words and stop words. When unifying the corpus into the text format, content that cannot be converted to text by existing techniques may be filtered out.
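By way of illustration, a minimal Python sketch of this preprocessing step might look as follows; the filter word lists and the markup-stripping rule are assumptions for the example, not part of the patent.

```python
import re

STOP_WORDS = {"的", "了", "吗"}   # illustrative stop-word list (assumption)
FILTERED_WORDS = {"badword"}      # illustrative dirty/sensitive words (assumption)

def preprocess(corpus_items):
    """Unify corpus items into plain text and filter unwanted words."""
    texts = []
    for item in corpus_items:
        text = re.sub(r"<[^>]+>", "", str(item))   # drop simple markup that is not plain text
        if not text.strip():                       # skip content that cannot be converted
            continue
        for w in STOP_WORDS | FILTERED_WORDS:      # filter stop, dirty and sensitive words
            text = text.replace(w, "")
        texts.append(text.strip())
    return texts
```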
S12, splitting the text data into lines to obtain sentence data.
The line-splitting may divide the material into lines at punctuation, for example at periods, commas, exclamation marks and question marks. The sentence data obtained here is an initial segmentation of the corpus, intended to determine the scope of the subsequent word segmentation.
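A minimal sketch of this line-splitting, assuming the punctuation set shown; the patent only requires that lines be split at punctuation such as the marks above.

```python
import re

DELIMITERS = r"[。，！？.,!?]"   # assumed punctuation set for line splitting

def split_sentences(text):
    # every non-empty fragment between punctuation marks becomes one line of sentence data
    return [s.strip() for s in re.split(DELIMITERS, text) if s.strip()]
```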
S13, performing word segmentation on the sentence data according to the single words contained in the dictionary to obtain segmented word data.
The dictionary contains many single words, which may differ in length. In specific implementations, the dictionary-based word segmentation may use one or more of the dictionary bidirectional maximum matching method, the HMM method and the CRF method.
Word segmentation is performed on the sentence data within each line, so the segmented word data stay in the same line, and every piece of word data is a single word contained in the dictionary.
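As a sketch of the dictionary-based segmentation, the following implements forward maximum matching, the simpler half of the bidirectional maximum matching method named above; the `max_len` limit and the fallback to single characters are assumptions.

```python
def forward_max_match(sentence, dictionary, max_len=6):
    """Greedily match the longest dictionary word starting at each position."""
    words, i = [], 0
    while i < len(sentence):
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + length]
            if length == 1 or piece in dictionary:  # fall back to a single character
                words.append(piece)
                i += length
                break
    return words
```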
In a domain dialogue system, intelligent replies are produced through word segmentation, question retrieval, similarity matching, answer determination and similar processes, all computed with the single word as the minimum unit. Segmenting according to the basic dictionary is similar to the segmentation performed during dialogue system operation, differing only in the dictionary contents on which the segmentation is based.
The new word discovery method of the embodiments is suitable for dictionary updating: found new words can be added to the dictionary, and new word discovery can be run again on the original material with the updated dictionary until no further new word is found.
S14, combining adjacent segmented word data to generate candidate data strings.
Because word segmentation follows the dictionary, word data that ought to form one word in a given field may be split into several pieces of word data; this is why new words must be discovered. Conditions are set to screen the candidate data strings, and the strings that pass the screening are taken as new words. Generating candidate data strings, the precondition of this screening, can be done in several ways.
If every pair of adjacent words in the corpus were used as a candidate data string, the computation of the new word discovery system would be too large and inefficient, and adjacent words in different lines carry no computational significance. Adjacent words are therefore screened when generating candidate data strings.
In a specific implementation, a Bigram model can be used to take two adjacent words within the same line of sentence data as a candidate data string.
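A sketch of this combination step; adjacent pairs are formed only within one line of sentence data, so no cross-line candidates arise.

```python
def bigram_candidates(segmented_lines):
    """Generate Bigram candidate data strings from per-line word data."""
    candidates = []
    for words in segmented_lines:                 # one list of segmented word data per line
        candidates.extend(zip(words, words[1:]))  # adjacent pairs within the same line only
    return candidates
```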
Assume that a sentence S can be represented as a word sequence S = w1w2…wn. The language model gives the probability p(S) of the sentence S:
p(S) = p(w1, w2, w3, w4, w5, …, wn)
= p(w1) p(w2|w1) p(w3|w1, w2) … p(wn|w1, w2, …, wn-1)
(1)
The probability statistics in formula (1) follow the Ngram model, but computing these probabilities directly is too expensive for practical applications. The Markov assumption is therefore adopted: the next word depends only on the one or few words preceding it. If the next word depends only on the single word before it, then:
p(S) = p(w1) p(w2|w1) p(w3|w1, w2) … p(wn|w1, w2, …, wn-1)
= p(w1) p(w2|w1) p(w3|w2) … p(wn|wn-1)
(2)
If the next word depends on the two words before it, then:
p(S) = p(w1) p(w2|w1) p(w3|w1, w2) … p(wn|w1, w2, …, wn-1)
= p(w1) p(w2|w1) p(w3|w1, w2) … p(wn|wn-2, wn-1)
(3)
Formula (2) is the Bigram probability formula, and formula (3) is the Trigram probability formula. A larger n imposes more constraint information on the appearance of the next word and gives higher discrimination; a smaller n lets each candidate data string occur more often during new word discovery, providing more reliable statistics and hence higher reliability.
In theory, the larger n is, the higher the discrimination, and existing processing methods use at most the Trigram. The Bigram, however, requires less computation and gives higher system efficiency.
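For illustration, a maximum-likelihood estimate of the Bigram probabilities of formula (2) can be computed from corpus counts as sketched below (no smoothing; every word in the sentence is assumed to occur in the training lines).

```python
from collections import Counter

def train_bigram(lines):
    """Count unigrams and adjacent bigrams over per-line word data."""
    unigrams, bigrams = Counter(), Counter()
    for words in lines:
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams

def sentence_prob(words, unigrams, bigrams):
    """Formula (2): p(S) = p(w1) * product over i of p(wi | wi-1)."""
    total = sum(unigrams.values())
    p = unigrams[words[0]] / total                # p(w1)
    for prev, w in zip(words, words[1:]):
        p *= bigrams[(prev, w)] / unigrams[prev]  # p(wi | wi-1)
    return p
```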
In particular implementations, a length range may also be set for the candidate data strings to exclude those whose length falls outside the range. New words of different length ranges can thus be obtained as needed for different scenarios: a smaller length range yields words, suitable for an intelligent question-answering system, while a larger length range yields phrases or expressions usable as keywords of document retrieval catalogues and the like.
S15, judging the candidate data strings to find new words; the judgment includes: calculating the information entropy of each word in the candidate data string with its outer words, and removing any candidate data string whose information entropy with the outer words falls outside a preset range.
In specific implementations, the judgment of the candidate data strings may further include an internal judgment of how tightly the words within a candidate data string combine: a probability feature value for the candidate data string becoming a new word is calculated, and candidate data strings whose feature value falls outside a preset range are removed.
Referring to fig. 2, in an embodiment of the present invention, judging the candidate data strings in step S15 to find new words includes:
S153, calculating a frequency-related probability feature value of the candidate data string, and removing the candidate data string when that value falls outside a preset range.
In a specific implementation, the frequency-related probability feature values include: the occurrence count and the frequency of the candidate data string, or a value calculated from its occurrence count and frequency.
The occurrence count of a candidate data string is the number of times it appears in the corpus; count filtering judges how often the combination occurs, and a candidate data string is filtered out when its count falls below a threshold. The frequency of a candidate data string relates its occurrence count to the total number of words in the corpus. Using a value computed from the occurrence count and frequency as the probability feature value gives higher accuracy. In an embodiment of the present invention, the probability feature value computed from the occurrence count and frequency may use the TF-IDF (term frequency-inverse document frequency) technique.
TF-IDF is a weighting technique commonly used in information retrieval and text mining to evaluate how important a word is to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in a document, but decreases in inverse proportion to its frequency across the corpus.
The main idea of TF-IDF is: if a word or phrase appears with a high frequency TF in one article but rarely in other articles, it has good discriminating power and is suitable for classification. TF-IDF is simply TF × IDF, where TF is the term frequency, the number of times a given term appears in document d, and IDF is the inverse document frequency. The main idea of IDF is: the fewer the documents containing term t, the larger the IDF, and the better term t distinguishes categories. If m documents of a class C contain term t and k documents of other classes contain it, then m + k documents contain t in total; when m is large, the total is also large, the IDF computed from the IDF formula becomes small, and term t distinguishes categories poorly. Conversely, if a term appears frequently within one class of documents yet rarely elsewhere, it represents that class of text well; such a term should be given a higher weight and selected as a feature word of that class to distinguish it from other classes. Such terms can serve as new words in the domain where the dictionary is applied.
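A minimal sketch of the standard TF-IDF computation described above; `docs` is a list of token lists, and the `1 +` in the denominator is a common smoothing assumption rather than part of the patent.

```python
import math

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)                 # TF: term frequency in document d
    containing = sum(1 for d in docs if term in d)  # number of documents containing the term
    idf = math.log(len(docs) / (1 + containing))    # IDF: fewer containing docs -> larger IDF
    return tf * idf
```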
S151, calculating the information entropy of each word in the candidate data string with its outer words, and removing any candidate data string whose information entropy with the outer words falls outside a preset range.
Information entropy measures the uncertainty of a random variable and is calculated as follows:
H(X) = -∑ p(xi) log p(xi)
The larger the information entropy, the greater the uncertainty of the variable, i.e. the more evenly the probabilities of its possible values are distributed. If one value of the variable occurs with probability 1, the entropy is 0: only that value occurs, a certain event.
The left and right information entropies of word data W are calculated as follows:
H1(W) = -∑x∈X p(x|W) log p(x|W), where X is the set of all word data appearing to the left of W, and H1(W) is the left information entropy of the word data W;
H2(W) = -∑y∈Y p(y|W) log p(y|W), where Y is the set of all word data appearing to the right of W, and H2(W) is the right information entropy of the word data W.
Calculating the entropy of the word data in a candidate data string with the word data outside it shows how disordered the outside word data are. For example, for a candidate data string w1w2, computing the left information entropy of the left-side word data w1 and the right information entropy of the right-side word data w2 indicates the degree of disorder outside w1 and w2; screening against a preset range then excludes candidate data strings whose probability feature value for forming a new word together with the outside words falls outside the preset range.
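The outer-entropy filter can be sketched as follows: collect the words observed immediately left and right of a candidate data string, compute the entropy of each neighbour distribution, and remove candidates whose entropy falls outside the preset range. The threshold value and the Counter representation are assumptions.

```python
import math
from collections import Counter

def entropy(neighbour_counts: Counter) -> float:
    """H = -sum p(x) log p(x) over the observed neighbour words."""
    total = sum(neighbour_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbour_counts.values())

def passes_outer_entropy(left: Counter, right: Counter, low: float = 0.5) -> bool:
    # a low entropy means one outside word dominates, i.e. the candidate
    # combines too readily with an outer word and should be removed
    return entropy(left) >= low and entropy(right) >= low
```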
S152, taking the remaining candidate data strings as new words.
It should be understood that steps S153 and S151 are both specific embodiments of the judgment performed on the candidate data strings, and step S153 may come before or after step S151.
Referring to fig. 3, in another specific implementation, judging the candidate data strings in step S15 to find new words includes:
S154, calculating mutual information among the word data in the candidate data string, and removing any candidate data string whose mutual information falls outside a preset range.
Mutual Information (MI) is defined as follows:
[The MI formula appears as an image in the original document; it defines the mutual information of a candidate data string in terms of the probabilities of the string and of its constituent word data.]
Mutual information reflects the co-occurrence relation between a candidate data string and the word data within it; for a candidate data string composed of two single words it is a single value (the mutual information between the two words). When the co-occurrence frequency of a candidate data string W and its word data is high, i.e. their occurrence frequencies are close, the mutual information MI of W is close to 1, and the probability that W forms a word is high. If MI is small, close to 0, W is hardly possible as a word and even less likely to be a new word. Mutual information thus reflects the degree of dependency inside a candidate data string and can be used to judge whether the string may become a new word.
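Since the exact MI formula is an image in the original, the sketch below uses a normalized co-occurrence ratio chosen to reproduce the behaviour described (close to 1 when the string and its constituent words almost always co-occur, close to 0 otherwise). It is an assumption standing in for the patent's formula.

```python
def normalized_mi(p_string, p_w1, p_w2):
    """Assumed normalized mutual-information score in [0, 1].

    p_string: probability of the candidate data string in the corpus;
    p_w1, p_w2: probabilities of its constituent word data.
    Equals 1 when the words always co-occur (p_w1 == p_w2 == p_string),
    and approaches 0 when they rarely co-occur.
    """
    return p_string / (p_w1 + p_w2 - p_string)
```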
And S151, calculating the information entropy of each word and the outer words in the candidate data string, and removing the candidate data string of which the information entropy of each word and the outer words is out of a preset range.
And S152, taking the remaining candidate data strings as new words.
The order of steps S154 and S151 is not limited. Step S15 may further include step S153, and likewise the execution order among steps S153, S154 and S151 may be set according to the actual needs of the judgment.
Referring to fig. 4, in yet another specific implementation, the judgment may further include: S155, calculating the information entropy of the boundary word data of the candidate data string with its inner word data, and removing any candidate data string whose information entropy falls outside a preset range.
The inner information entropy fixes each single word of the candidate data string in turn and computes the information entropy of the other word occurring given that word. For a candidate data string (w1w2), the right information entropy of word data w1 and the left information entropy of word data w2 are calculated.
Taking a candidate data string containing only two single words (w1w2) as an example: the single word w1 has one outer information entropy with the single word of the adjacent candidate data string and one inner information entropy with the single word w2 of the same candidate data string; likewise, w2 has one inner information entropy with w1 and one outer information entropy with the single word of the adjacent candidate data string. That is, a single word at a middle (non-end) position has both an inner and an outer information entropy.
When judging the inner or outer information entropy, both inner (or both outer) information entropies of a candidate data string must be judged. The inner (or outer) information entropy of the candidate data string is considered within the preset range only when both values are within the range; if either value falls outside the preset range, the candidate data string's inner (or outer) information entropy is considered outside the preset range.
For example, take two adjacent candidate data strings: one consisting of the single words "I" and "transact", and one consisting of the single words "North China" and "shopping mall". The inner information entropies of the two candidate data strings are, respectively, the information entropy between "I" and "transact", and the information entropy between "North China" and "shopping mall". The outer information entropy between the two candidate data strings is the information entropy between "transact" and "North China".
It should be understood that the judgment may include any one or more of step S151 and steps S153 to S155, selected according to the specific application.
Fig. 5 is a flowchart of another judgment process in an embodiment of the present invention.
S351, calculating the frequency of the candidate data strings.
S352, judging whether the frequency of the candidate data string is within a preset range, and if the frequency of the candidate data string is within the preset range, executing the step S353; if the frequency of the candidate data string is not within the preset range, step S361 is executed.
And S353, calculating mutual information among word data in the candidate data string. It is understood that the calculation of the mutual information is only performed for the candidate data strings with the frequency within the preset range.
S354, determining whether mutual information between word data in the candidate data string is within a preset range, and if the mutual information between word data in the candidate data string is within the preset range, performing step S355; if the mutual information between the word data in the candidate data string is not within the preset range, step S361 is executed.
And S355, calculating the information entropy of the boundary word data and the inner word data of the candidate data string.
It can be understood that, at this point, the information entropy of the boundary word data with the inner word data is calculated only for candidate data strings whose mutual information is within the preset range and whose frequency is within the preset range.
S356, judging whether the information entropy of the boundary word data and the inner word data of the candidate data string is in a preset range, and if the information entropy of the boundary word data and the inner word data of the candidate data string is in the preset range, executing the step S357; if the information entropy of the boundary word data and the inner word data of the candidate data string is not within the preset range, step S361 is executed.
And S357, calculating the information entropy of the boundary word data and the outer word data of the candidate data string.
It can be understood that, at this point, the information entropy of the boundary word data with the outer word data is calculated only for candidate data strings whose mutual information is within the preset range, whose frequency is within the preset range, and whose information entropy of the boundary word data with the inner word data is within the preset range.
S358, judging whether the information entropy of the boundary word data of the candidate data string with its outer word data is within a preset range; if it is within the preset range, step S362 is executed; if it is not within the preset range, step S361 is executed.
In the embodiment of the invention, the frequency, the mutual information, and the information entropies of the boundary word data with the inner and outer word data of the candidate data strings are calculated in order of increasing computational difficulty. Candidate data strings outside the preset range are eliminated by the earlier, cheaper calculations and do not participate in the later ones, which saves computation time and improves the efficiency of the new word discovery method.
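The sequential judgment of fig. 5 can be sketched as the filter cascade below; each stage runs only on survivors of the cheaper stages before it. The feature functions and preset ranges are parameters, since the patent does not fix their values.

```python
def discover_new_words(candidates, freq, mi, inner_ent, outer_ent, ranges):
    """Apply the four filters of fig. 5 in order of increasing cost."""
    def in_range(value, key):
        low, high = ranges[key]          # preset range per feature (values are assumptions)
        return low <= value <= high

    survivors = [c for c in candidates if in_range(freq(c), "freq")]       # S351/S352
    survivors = [c for c in survivors if in_range(mi(c), "mi")]            # S353/S354
    survivors = [c for c in survivors if in_range(inner_ent(c), "inner")]  # S355/S356
    survivors = [c for c in survivors if in_range(outer_ent(c), "outer")]  # S357/S358
    return survivors            # the remaining candidate data strings are new words
```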
As described above, the new word discovery method of the embodiments may be used for dictionary updating: when a new word is found, it is added to the dictionary, and the updated dictionary is used to perform word segmentation, combination and new word discovery again until no further new word is found.
In one specific example, the received corpus is the speech data "How long do I need to transact the North China shopping mall dragon card?". A first preprocessing turns the speech data into text data; a first line-splitting separates this text data from the text data of other lines; word segmentation divides the text data into: I, transact, North China, shopping mall, dragon card, needs, more, long, time.
A first combination yields the following candidate data strings: "I transact", "transact North China", "North China shopping mall", "shopping mall dragon card", "dragon card needs", "needs more", "more long", "long time". The first frequency calculation removes the two candidate data strings "I transact" and "transact North China"; the first mutual information calculation removes the three candidate data strings "needs more", "more long" and "long time"; the first calculation of the information entropy with the outside word data removes the candidate data strings containing "dragon card", yielding the new word "North China shopping mall", which is added to the basic dictionary.
The text data is then segmented again into: I, transact, North China shopping mall, dragon card, needs, more, long, time. A second combination yields the candidate data strings: "I transact", "transact North China shopping mall", "North China shopping mall dragon card", "dragon card needs", "needs more", "more long", "long time". The second frequency calculation removes "I transact" and "transact North China shopping mall"; the second mutual information calculation removes "needs more", "more long" and "long time"; the second calculation of the information entropy with the outside word data removes "dragon card needs", yielding the new word "North China shopping mall dragon card", which is added to the basic dictionary.
Word segmentation, combination and judgment can continue according to the basic dictionary that now includes "North China shopping mall dragon card", and the basic dictionary can keep being updated with the new words found in each pass.
In the example above, later judgment passes may re-judge all candidate data strings; alternatively, earlier judgment results may be recorded so they can be reused directly for identical candidate data strings; or only candidate data strings containing a new word may be formed, so that judgment is performed only on those strings.
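Putting the pieces together, the dictionary-updating loop can be sketched as below; `judge` stands for the candidate generation plus the sequential filters sketched earlier, and the dictionary is assumed to be a set of words.

```python
def update_dictionary(corpus, dictionary, judge):
    """Repeat segmentation, combination and judgment until no new word is found."""
    while True:
        lines = [forward_max_match(s, dictionary)   # re-segment with the current dictionary
                 for text in preprocess(corpus)
                 for s in split_sentences(text)]
        new_words = judge(lines)                    # combination + judgment processing
        if not new_words:
            return dictionary                       # no new word found: stop
        dictionary |= set(new_words)                # add the new words and iterate again
```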
By calculating the information entropy of each word in a candidate data string with its outer words, the embodiments judge how likely the words of the candidate data string are to combine with outside words; removing candidate data strings whose information entropy falls outside the preset range eliminates strings whose words are likely to combine with outside words, which improves the accuracy of the new word discovery method.
An embodiment of the present invention further provides a new word discovery apparatus, including: a preprocessing unit 61, a line-splitting processing unit 62, a word segmentation processing unit 63, a combination processing unit 64, and a new word discovery unit 65;
the preprocessing unit 61 is adapted to preprocess the received corpus to obtain text data;
the line-splitting processing unit 62 is adapted to split the text data into lines to obtain sentence data;
the word segmentation processing unit 63 is adapted to segment the sentence data according to the word data contained in a dictionary to obtain segmented word data;
the combination processing unit 64 is adapted to combine adjacent segmented word data to generate candidate data strings;
the new word discovery unit 65 is adapted to judge the candidate data strings to find new words; the judgment includes: calculating the information entropy of each word in a candidate data string with its outer words, and removing any candidate data string whose information entropy with the outer words falls outside a preset range.
In a specific implementation, the judgment may further include: calculating a frequency-related probability feature value of the candidate data string, and removing the candidate data string when that value falls outside a preset range.
In a specific implementation, the frequency-related probability feature values include: the occurrence count and the frequency of the candidate data string, or a value calculated from its occurrence count and frequency.
In a specific implementation, the judgment may further include: calculating mutual information among the word data in the candidate data string, and removing any candidate data string whose mutual information falls outside a preset range.
In a specific implementation, the judgment may further include: calculating the information entropy of the boundary word data of the candidate data string with its inner word data, and removing any candidate data string whose information entropy falls outside a preset range.
Referring to fig. 7, in a specific implementation, the new word discovery unit 65 may include: a frequency filtering unit 651, a mutual information filtering unit 652, an internal information entropy filtering unit 653, and an external information entropy filtering unit 654;
the frequency filtering unit 651 is adapted to calculate the frequency of the candidate data strings and remove those whose frequency falls outside a preset range;
the mutual information filtering unit 652 is adapted to calculate the mutual information of the candidate data strings remaining after the frequency filtering unit, and remove those whose mutual information falls outside a preset range;
the internal information entropy filtering unit 653 is adapted to calculate, for the candidate data strings remaining after the mutual information filtering unit, the information entropy of the boundary word data with the inner word data, and remove those whose information entropy falls outside a preset range;
the external information entropy filtering unit 654 is adapted to calculate, for the candidate data strings remaining after the internal information entropy filtering unit, the information entropy of the boundary word data with the outer word data, and remove those whose information entropy falls outside a preset range.
In a specific implementation, the combination processing unit is adapted to use a Bigram model to take adjacent words within the same line of sentence data as candidate data strings.
In a specific implementation, the preprocessing unit is adapted to unify the format of the corpus into a text format and to filter one or more of dirty words, sensitive words and stop words.
In a specific implementation, the word segmentation processing unit is adapted to employ one or more of the dictionary bidirectional maximum matching method, the HMM method and the CRF method.
In a specific implementation, the new word discovery apparatus may further include a length filtering unit 66, adapted to set a length range for the candidate data strings so as to exclude candidate data strings whose length falls outside the length range.
The specific working process of the new word discovery device may refer to the foregoing method, and is not described herein again.
Those skilled in the art will appreciate that all or part of the steps of the above method embodiments may be implemented by hardware instructed by a program, and the program may be stored in a computer-readable storage medium, which may include a ROM, a RAM, a magnetic disk, an optical disk, or the like.
Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be effected therein by one skilled in the art without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (9)

1. A computer-readable storage medium on which a program is stored, the program, when executed, implementing a new word discovery method, the method comprising:
preprocessing the received corpus to obtain text data;
splitting the text data into lines to obtain sentence data;
performing word segmentation on the sentence data according to the single words contained in a dictionary to obtain segmented word data;
combining adjacent segmented word data to generate candidate data strings;
judging the candidate data strings to find new words; the judgment comprising: calculating the information entropy of each word in a candidate data string with its outer words, and removing any candidate data string whose information entropy with the outer words falls outside a preset range; and calculating the information entropy of the boundary word data of the candidate data string with its inner word data, and removing any candidate data string whose information entropy falls outside a preset range.
2. The computer-readable storage medium of claim 1, wherein the judgment further comprises: calculating a frequency-related probability feature value of the candidate data string, and removing the candidate data string when that value falls outside a preset range.
3. The computer-readable storage medium of claim 2, wherein the frequency-related probability feature values comprise: the occurrence count and the frequency of the candidate data string, or a value calculated from its occurrence count and frequency.
4. The computer-readable storage medium of claim 1, wherein the judgment further comprises: calculating mutual information among the word data in the candidate data string, and removing any candidate data string whose mutual information falls outside a preset range.
5. The computer-readable storage medium of claim 1, wherein judging the candidate data strings to find new words sequentially comprises:
calculating the frequency of the candidate data strings, and removing those whose frequency falls outside a preset range;
calculating the mutual information of the remaining candidate data strings, and removing those whose mutual information falls outside a preset range;
calculating the information entropy of the boundary word data of the remaining candidate data strings with their inner word data, and removing those whose information entropy falls outside a preset range;
calculating the information entropy of the boundary word data of the remaining candidate data strings with their outer word data, and removing those whose information entropy falls outside a preset range;
taking the remaining candidate data strings as new words.
6. The computer-readable storage medium of claim 1, wherein generating the candidate data strings comprises: using a Bigram model to take adjacent words within the same line of sentence data as candidate data strings.
7. The computer-readable storage medium of claim 1, wherein preprocessing the received corpus to obtain text data comprises: unifying the format of the corpus into a text format; and filtering one or more of dirty words, sensitive words, and stop words.
8. The computer-readable storage medium of claim 1, wherein the method further comprises: setting a length range for the candidate data strings so as to exclude candidate data strings whose length falls outside the length range.
9. A dictionary updating method, comprising:
discovering new words in the following way: preprocessing the received corpus to obtain text data; splitting the text data into lines to obtain sentence data; performing word segmentation on the sentence data according to the single words contained in the dictionary to obtain segmented word data; combining adjacent segmented word data to generate candidate data strings; judging the candidate data strings to find new words, the judgment comprising: calculating the information entropy of each word in a candidate data string with its outer words, and removing any candidate data string whose information entropy with the outer words falls outside a preset range; and calculating the information entropy of the boundary word data of the candidate data string with its inner word data, and removing any candidate data string whose information entropy falls outside a preset range;
when a new word is found, adding the new word to the dictionary, and using the updated dictionary to perform the word segmentation, combination and new word discovery again until no further new word is found.
CN201810677081.7A 2015-10-27 2015-10-27 Dictionary updating method and computer-readable storage medium Active CN108875040B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810677081.7A CN108875040B (en) 2015-10-27 2015-10-27 Dictionary updating method and computer-readable storage medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810677081.7A CN108875040B (en) 2015-10-27 2015-10-27 Dictionary updating method and computer-readable storage medium
CN201510706254.XA CN105183923B (en) 2015-10-27 2015-10-27 New word discovery method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201510706254.XA Division CN105183923B (en) 2015-10-27 2015-10-27 New word discovery method and device

Publications (2)

Publication Number Publication Date
CN108875040A CN108875040A (en) 2018-11-23
CN108875040B true CN108875040B (en) 2020-08-18

Family

ID=54906004

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201810677081.7A Active CN108875040B (en) 2015-10-27 2015-10-27 Dictionary updating method and computer-readable storage medium
CN201510706254.XA Active CN105183923B (en) 2015-10-27 2015-10-27 New word discovery method and device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN201510706254.XA Active CN105183923B (en) 2015-10-27 2015-10-27 New word discovery method and device

Country Status (1)

Country Link
CN (2) CN108875040B (en)

Families Citing this family (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105975460A (en) * 2016-05-30 2016-09-28 上海智臻智能网络科技股份有限公司 Question information processing method and device
CN107463548B (en) * 2016-06-02 2021-04-27 阿里巴巴集团控股有限公司 Phrase mining method and device
CN106126494B (en) * 2016-06-16 2018-12-28 上海智臻智能网络科技股份有限公司 Synonym finds method and device, data processing method and device
CN105955965A (en) * 2016-06-21 2016-09-21 上海智臻智能网络科技股份有限公司 Question information processing method and device
CN106502984B (en) * 2016-10-19 2019-05-24 上海智臻智能网络科技股份有限公司 A kind of method and device of field new word discovery
CN107066447B (en) * 2017-04-19 2021-03-26 广东惠禾科技发展有限公司 Method and equipment for identifying meaningless sentences
CN109241392A (en) * 2017-07-04 2019-01-18 北京搜狗科技发展有限公司 Recognition methods, device, system and the storage medium of target word
CN107622051A (en) * 2017-09-14 2018-01-23 马上消费金融股份有限公司 A kind of neologisms screening technique and device
CN107577667B (en) * 2017-09-14 2020-10-27 北京奇艺世纪科技有限公司 Entity word processing method and device
CN107861940A (en) * 2017-10-10 2018-03-30 昆明理工大学 A kind of Chinese word cutting method based on HMM
CN107704452B (en) * 2017-10-20 2020-12-22 传神联合(北京)信息技术有限公司 Method and device for extracting Thai terms
CN108509425B (en) * 2018-04-10 2021-08-24 中国人民解放军陆军工程大学 Chinese new word discovery method based on novelty
CN108595433A (en) * 2018-05-02 2018-09-28 北京中电普华信息技术有限公司 A kind of new word discovery method and device
CN108829658B (en) * 2018-05-02 2022-05-24 石家庄天亮教育科技有限公司 Method and device for discovering new words
CN108959259B (en) * 2018-07-05 2019-11-08 第四范式(北京)技术有限公司 New word discovery method and system
CN109408818B (en) * 2018-10-12 2023-04-07 平安科技(深圳)有限公司 New word recognition method and device, computer equipment and storage medium
CN110442685A (en) * 2019-08-14 2019-11-12 杭州品茗安控信息技术股份有限公司 Data extending method, apparatus, equipment and the storage medium of architectural discipline dictionary
CN111061866B (en) * 2019-08-20 2024-01-02 河北工程大学 Barrage text clustering method based on feature expansion and T-oBTM
CN110674252A (en) * 2019-08-26 2020-01-10 银江股份有限公司 High-precision semantic search system for judicial domain
CN111090742B (en) * 2019-12-19 2024-05-17 东软集团股份有限公司 Question-answer pair evaluation method, question-answer pair evaluation device, storage medium and equipment
CN111209746B (en) * 2019-12-30 2024-01-30 航天信息股份有限公司 Natural language processing method and device, storage medium and electronic equipment
CN111209372B (en) * 2020-01-02 2021-08-17 北京字节跳动网络技术有限公司 Keyword determination method and device, electronic equipment and storage medium
CN111832299A (en) * 2020-07-17 2020-10-27 成都信息工程大学 Chinese word segmentation system

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169496A (en) * 2011-04-12 2011-08-31 清华大学 Anchor text analysis-based automatic domain term generating method
CN102930055A (en) * 2012-11-18 2013-02-13 浙江大学 New network word discovery method in combination with internal polymerization degree and external discrete information entropy
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103294664A (en) * 2013-07-04 2013-09-11 清华大学 Method and system for discovering new words in open fields
CN103678371A (en) * 2012-09-14 2014-03-26 富士通株式会社 Lexicon updating device, data integration device and method and electronic device
WO2014087703A1 (en) * 2012-12-06 2014-06-12 楽天株式会社 Word division device, word division method, and word division program
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8825648B2 (en) * 2010-04-15 2014-09-02 Microsoft Corporation Mining multilingual topics
CN102360383B (en) * 2011-10-15 2013-07-31 西安交通大学 Method for extracting text-oriented field term and term relationship

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102169496A (en) * 2011-04-12 2011-08-31 清华大学 Anchor text analysis-based automatic domain term generating method
CN103678371A (en) * 2012-09-14 2014-03-26 富士通株式会社 Lexicon updating device, data integration device and method and electronic device
CN102930055A (en) * 2012-11-18 2013-02-13 浙江大学 New network word discovery method in combination with internal polymerization degree and external discrete information entropy
WO2014087703A1 (en) * 2012-12-06 2014-06-12 楽天株式会社 Word division device, word division method, and word division program
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103294664A (en) * 2013-07-04 2013-09-11 清华大学 Method and system for discovering new words in open fields
CN103970733A (en) * 2014-04-10 2014-08-06 北京大学 New Chinese word recognition method based on graph structure

Also Published As

Publication number Publication date
CN105183923A (en) 2015-12-23
CN108875040A (en) 2018-11-23
CN105183923B (en) 2018-06-22

Similar Documents

Publication Publication Date Title
CN108875040B (en) Dictionary updating method and computer-readable storage medium
CN108897842B (en) Computer readable storage medium and computer system
CN108776709B (en) Computer-readable storage medium and dictionary updating method
US8892420B2 (en) Text segmentation with multiple granularity levels
CN107480143B (en) Method and system for segmenting conversation topics based on context correlation
CN109960724B (en) Text summarization method based on TF-IDF
CN109933656B (en) Public opinion polarity prediction method, public opinion polarity prediction device, computer equipment and storage medium
JP3041268B2 (en) Chinese Error Checking (CEC) System
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN109033066B (en) Abstract forming method and device
CN106681985A (en) Establishment system of multi-field dictionaries based on theme automatic matching
CN116227466B (en) Sentence generation method, device and equipment with similar semantic different expressions
CN110929510A (en) Chinese unknown word recognition method based on dictionary tree
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN115757743A (en) Document search term matching method and electronic equipment
CN110287493B (en) Risk phrase identification method and device, electronic equipment and storage medium
CN114491062B (en) Short text classification method integrating knowledge graph and topic model
CN114036957B (en) Rapid semantic similarity calculation method
CN106970919B (en) Method and device for discovering new word group
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN114036907A (en) Text data amplification method based on domain features
CN113434639A (en) Audit data processing method and device
CN112632272A (en) Microblog emotion classification method and system based on syntactic analysis

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant