CN108268440A

CN108268440A - Unregistered word identification method

Info

Publication number: CN108268440A
Application number: CN201710003573.3A
Authority: CN
Inventors: 张春荣; 韦玮
Original assignee: Putian Information Technology Co Ltd
Current assignee: Potevio Information Technology Co Ltd; Putian Information Technology Co Ltd
Priority date: 2017-01-04
Filing date: 2017-01-04
Publication date: 2018-07-10

Abstract

This application provides a method for identifying unregistered words that combines rules with statistics. The rules incorporate lexical features and domain knowledge, such as that of the project bidding field, into the identification process, while the statistical method captures statistical information and selects the words that occur frequently in the text information. Combining rules and statistics improves both the efficiency and the accuracy of unregistered word identification.

Description

Unregistered word identification method

Technical field

The present invention relates to the field of natural language processing, and in particular to a method for identifying unregistered words.

Background art

Natural language processing generally "understands" the meaning of language in units of words, so its first task is word segmentation. In every area of Chinese information processing, the corresponding functions are built on a dictionary. Word segmentation, question retrieval, similarity matching, determining the answer from retrieval results, intelligent dialogue, and so on all compute with the word as the smallest unit, and the basis of that computation is the word dictionary. The construction of the word dictionary therefore has a great influence on the performance of the whole system.

Many word segmentation algorithms are designed on the assumption that the dictionary is complete, but in practice this assumption often does not hold. With the continuous development of society and the Internet, new words keep emerging in daily life, and the vocabulary of professional domains rarely appears in general-purpose basic dictionaries. An unregistered word is defined as a word that does not appear in the dictionary, including newly coined words and domain-specific terminology. Segmentation errors caused by unregistered words far outnumber those caused by segmentation ambiguity. In a specific domain in particular, updating the dictionary in time has a decisive effect on the efficiency of the application system that relies on it, and the scale and quality of the dictionary directly determine the performance of the related applications.

A segmentation dictionary can be built and extended either manually or automatically. Manual construction, in which unregistered words are added to the dictionary by hand, is highly accurate, but it requires the long-term participation of many domain experts, so its labor and time costs are too high and it cannot keep up in real time. Automatic generation judges the domain attribute of a word by analyzing how its statistical properties differ across corpora from different domains. It needs no domain experts and saves a great deal of labor, but the accuracy of the resulting dictionary entries is not high.

Therefore, automatically identifying the unregistered words that keep emerging in daily life and adding them to the dictionary is a basic task of natural language processing. Unregistered word identification is one of the greater difficulties in this task and one of the main factors affecting segmentation precision. Current methods for discovering unregistered words are mainly of two kinds: rule-based and statistics-based.

The main idea of the rule-based method is to build a rule base, specialized dictionary, or pattern base from the word-formation or surface features of unregistered words, and then to find unregistered words by rule matching. Rule-based methods are confined to a particular domain and require a rule base or similar resource to be built.

Statistics-based methods usually extract candidate strings with a statistical strategy and then use linguistic knowledge to exclude garbage strings that are not unregistered words; alternatively, they compute a degree of association and look for the combinations of characters and words whose association is strongest. Statistics-based methods are limited in that they can only find relatively short new words.

Summary of the invention

In view of this, the application provides an unregistered word identification method that can improve the efficiency and accuracy of unregistered word identification.

To solve the above technical problem, the technical solution of the application is realized as follows:

An unregistered word identification method, comprising:

obtaining HTML web page information and parsing it into text information;

performing automatic Chinese word segmentation on the text information with a segmenter, and storing the words that are not recognized by named entity recognition in a candidate dictionary as candidate words;

filtering the candidate dictionary using a segmentation-mark library and an exclusion word set;

for each candidate word in the current candidate dictionary, constructing combined words from the candidate word and its adjacent candidate words in the text information, and filtering them using a prefix-word set, a suffix-word set, and configured part-of-speech combination rules;

storing, as unregistered words in a backup dictionary, those unfiltered combined words whose frequency of occurrence in the text information is less than a preset combined-word frequency threshold; and deleting from the current candidate dictionary the candidate words corresponding to the combined words stored in the backup dictionary and to the combined words that were filtered out;

for each candidate word in the current candidate dictionary, calculating the mutual information entropy between the candidate word and its adjacent candidate word in the text information, and retaining in the candidate dictionary the candidate words whose mutual information entropy is greater than a preset mutual information entropy threshold;

for each candidate word in the current candidate dictionary, calculating the boundary information entropy of the candidate word, and filtering out the candidate words that cannot form a combined word with the corresponding boundary candidate word because their information entropy exceeds a preset boundary entropy threshold;

filtering, with a stop-word set, the combined words that were not filtered out by boundary information entropy; and adding the combined words that are not filtered out by the stop-word set to the backup dictionary as unregistered words.

As can be seen from the above technical solution, the application identifies unregistered words by combining rules with statistics. The rules incorporate lexical features and domain knowledge, such as that of the project bidding field, into the identification process, while the statistical method captures statistical information and selects the words that occur frequently in the text information. Combining rules and statistics improves both the efficiency and the accuracy of unregistered word identification.

Description of the drawings

Fig. 1 is a flow diagram of unregistered word identification in an embodiment of the application;

Fig. 2 is a schematic diagram of the contents of the segmentation dictionary in an embodiment of the application.

Detailed description of the embodiments

To make the purpose, technical solution, and advantages of the present invention clearer, the technical solution is described in detail below with reference to the accompanying drawings and embodiments.

The unregistered word identification scheme provided by the application is applied to building a domain dictionary for the project bidding field. The scheme identifies unregistered words by combining rules with statistics: the rules incorporate lexical features and the domain knowledge of the project bidding field into the identification process, while the statistical method captures statistical information and selects the words that occur frequently in the text information. Combining rules and statistics improves both the efficiency and the accuracy of unregistered word identification.

In the specific embodiment of the application, unregistered words are identified at the same time as automatic word segmentation, and the words recognized by named entity recognition are added to the segmentation dictionary under construction. Once unregistered words have been identified, they are also added to the segmentation dictionary, so that the segmentation dictionary is both built and updated.

The segmentation dictionary here, which may also be called the core dictionary, is the dictionary used for word segmentation; it contains general words and unregistered words.

For convenience of description, the device that identifies unregistered words and builds and updates the segmentation dictionary is referred to below simply as the device.

The unregistered word identification process is described first, with reference to the drawings.

Referring to Fig. 1, Fig. 1 is a flow diagram of unregistered word identification in an embodiment of the application. The specific steps are:

Step 101: the device obtains HyperText Markup Language (HTML) web page information and parses it into text information.

In a specific implementation, a parser can be used to parse the HTML web page information into text information.
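
As a minimal sketch of this parsing step, the following Python uses only the standard-library `html.parser` module. The names `TextExtractor` and `html_to_text` are illustrative and do not come from the application, and skipping the contents of `<script>` and `<style>` elements is an assumption about what counts as text information.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text and skips <script> and <style> contents."""

    SKIP = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text found outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html: str) -> str:
    """Return the text of an HTML page, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.chunks)
```

The resulting string is what the later steps treat as the text information.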

Step 102: the device performs automatic Chinese word segmentation on the text information with a segmenter and stores the words that are not recognized by named entity recognition in the candidate dictionary as candidate words.

The choice of segmenter is not limited in a specific implementation; the LTP segmenter, for example, can be used.

Part-of-speech tagging and named entity recognition are performed together with automatic Chinese word segmentation. In the project bidding field, entity names may include person names, place names, organization names, and so on. Words that are recognized as named entities are used to build the segmentation dictionary; words that are not are treated as candidate words for unregistered word identification.

Take the following passage from the text information as an example: "Shanghai City rail transit Line 18 project, Hu Nan Highway Station: publicity of the winning bid for the construction project covering removal of the Ф800 sewage pipe, pile pulling at Xiulong Bridge, obstacle clearing, backfill, and other works".

After automatic word segmentation, the content is: "Shanghai City" "track" "traffic" "18" "No." "line" "engineering" "Hu" "Nan" "highway" "station" "Ф" "800" "sewage pipe" "removal" "," "Xiulong Bridge" "pile pulling" "," "obstacle clearing" "and" "back" "fill" "etc." "engineering" "construction" "project" "bid winning" "publicity".

Suppose "Shanghai City" is recognized by named entity recognition and is used to build the segmentation dictionary; the other words are not recognized and are stored in the candidate dictionary as candidate unregistered words.

The current candidate dictionary then contains: "track" "traffic" "18" "No." "line" "engineering" "Hu" "Nan" "highway" "station" "Ф" "800" "sewage pipe" "removal" "," "Xiulong Bridge" "pile pulling" "," "obstacle clearing" "and" "back" "fill" "etc." "engineering" "construction" "project" "bid winning" "publicity".

Step 103: the device filters the candidate dictionary using the segmentation-mark library and the exclusion word set.

The segmentation-mark library treats non-Chinese characters such as punctuation marks, digits, and English letters as segmentation marks. After filtering with the segmentation-mark library, the candidate dictionary contains: "track" "traffic" "18" "No." "line" "engineering" "Hu" "Nan" "highway" "station" "Ф" "800" "sewage pipe" "removal"; "Xiulong Bridge" "pile pulling"; "obstacle clearing" "and" "back" "fill" "etc." "engineering" "construction" "project" "bid winning" "publicity".

The exclusion word set contains two kinds of words. The first kind is single-character function words whose part of speech is preposition, auxiliary word, pronoun, adverb, conjunction, interrogative, or interjection, such as "most", "too", "this", and "I". The second kind is words of other parts of speech that have weak word-forming ability, such as "be", "in", "have", "such as", and "when". Weak word-forming ability means that the probability of forming an unregistered word with other strings is less than 10%.

After filtering with both the segmentation-mark library and the exclusion word set, the candidate dictionary contains: "track" "traffic" "18" "No." "line" "engineering" "Hu" "Nan" "highway" "station" "Ф" "800" "sewage pipe" "removal"; "Xiulong Bridge" "pile pulling"; "obstacle clearing"; "back" "fill"; "engineering" "construction" "project" "bid winning" "publicity".
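
The filtering in step 103 can be sketched as splitting the token sequence into segments wherever a segmentation mark or an exclusion word occurs. This is only an illustration under stated assumptions: the function names are hypothetical, the tokens are English glosses, and `is_punctuation` treats only punctuation as a mark because the worked example above keeps "18", "Ф", and "800" in the candidate dictionary. A caller can pass a broader predicate that also rejects digits and letters.

```python
import unicodedata


def is_punctuation(token):
    """True if every character of the token is Unicode punctuation."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def split_candidates(tokens, is_mark, exclusion_words):
    """Split tokens into segments at marks and exclusion words.

    Marks and exclusion words are dropped; candidate words on either
    side of them end up in different segments, so they are never
    combined across the cut.
    """
    segments, current = [], []
    for token in tokens:
        if is_mark(token) or token in exclusion_words:
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments
```

Keeping the segments separate, rather than flattening them into one list, is what later prevents a sliding window from building combined words across a punctuation mark.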

Step 104: for each candidate word in the current candidate dictionary, the device constructs combined words from the candidate word and its adjacent candidate words in the text information, and filters them using the prefix-word set, the suffix-word set, and the configured part-of-speech combination rules.

In a specific implementation, the prefix-word set, the suffix-word set, and the part-of-speech combination rules are configured in advance.

A word that has strong word-forming ability and occupies position A in a combined word AB is called a prefix word, for example "on", "before", "highest", "successively", and "neutralization". Strong word-forming ability here means that the probability of forming an unregistered word with other words exceeds 90%. Such prefix words are collected into the prefix-word set.

A word that has strong word-forming ability and occupies position B in a combined word AB is called a suffix word. The project bidding field commonly uses tail words that denote routes, place names, station names, and the like, such as "** Line" (for example, the Beijing-Guangzhou Line), "*** Road", "** Station", "** Street", and "** Bid Section" (for example, the First Bid Section). Such suffix words are collected into the suffix-word set.

The configured part-of-speech combination rules include binary, ternary, and quaternary part-of-speech combination rules. In practice, n-gram combination rules are configured separately for each n; the larger n is, the higher the reliability. A length range for candidate strings can also be set so that strings whose length falls outside it are excluded. Because unregistered words in the project bidding field are generally shorter than a 5-gram combined word, the specific embodiment of the application filters only with the part-of-speech combination rules for 2-gram, 3-gram, and 4-gram combined words.

The part-of-speech combination rules follow the 863 part-of-speech tag set. Examples of binary rules are V+V ("service bidding"), N+V ("communications and transportation"), A+N ("intelligent transportation"), V+N ("call for tender"), and N+N ("rail transit").

Examples of ternary rules are j+j+N ("Beijing-Tianjin Highway"), M+Q+N ("Line 9"), and N+Nd+N ("Xinhua East Road").

An example of a quaternary rule is N+N+M+N (a road name that glosses word for word as "earth" "garden" "one" "road").

Constructing combined words from each candidate word in the current candidate dictionary and its adjacent candidate words in the text information, and filtering them using the prefix-word set, the suffix-word set, and the configured part-of-speech combination rules, includes:

for each candidate word, constructing 2-gram, 3-gram, and 4-gram combined words using a sliding window;

for the constructed 2-gram combined words, using the prefix-word set to filter the candidate word at the tail of the combined word, and applying the corresponding binary part-of-speech combination rules.

The 2-gram combined words constructed from the candidate words are "track traffic", "traffic 18", "18 No.", "No. line", "line engineering", and so on.

Take "track traffic" (rail transit) as an example. The candidate word "traffic" is checked against the prefix-word set; if it is filtered out, "track traffic" cannot serve as a candidate unregistered word. In this embodiment it is not filtered out, so "track traffic" can serve as a candidate unregistered word.

The binary part-of-speech combination rules include the combination N+N, which further confirms that the combined word "track traffic" can serve as a candidate unregistered word; that is, the candidate words "track" and "traffic" remain in the candidate dictionary and are not filtered out.

Suppose that after the above processing of 2-gram combined words, the following are not filtered out: "track" "traffic"; "line" "engineering"; "Hu" "Nan"; "highway" "station"; "sewage pipe" "removal"; "Xiulong Bridge" "pile pulling"; "back" "fill"; "engineering" "construction"; "construction" "project"; "project" "bid winning"; "bid winning" "publicity".

For 3-gram combined words, the prefix-word set is used to filter each of the two candidate words other than the head of the combined word; the suffix-word set is used to filter the candidate word at the tail of the combined word; and the corresponding ternary part-of-speech combination rules are applied.

The 3-gram combined words constructed from the candidate words are: "track" "traffic" "18"; "traffic" "18" "No."; "18" "No." "line"; "No." "line" "engineering"; "line" "engineering" "Hu"; "engineering" "Hu" "Nan"; "Hu" "Nan" "highway"; "Nan" "highway" "station"; "highway" "station" "Ф"; "station" "Ф" "800"; "Ф" "800" "sewage pipe"; "800" "sewage pipe" "removal"; "engineering" "construction" "project"; "construction" "project" "bid winning"; "project" "bid winning" "publicity".

After filtering with the prefix-word set, the suffix-word set, and the ternary part-of-speech combination rules, the legal 3-gram combined words are: "18" "No." "line"; "Hu" "Nan" "highway".

For 4-gram combined words, the prefix-word set is used to filter each of the three candidate words other than the head of the combined word, and the corresponding quaternary part-of-speech combination rules are applied.

The 4-gram combined words constructed from the candidate words are: "track" "traffic" "18" "No."; "traffic" "18" "No." "line"; "18" "No." "line" "engineering"; "No." "line" "engineering" "Hu"; "line" "engineering" "Hu" "Nan"; "engineering" "Hu" "Nan" "highway"; "Hu" "Nan" "highway" "station"; "Nan" "highway" "station" "Ф"; "highway" "station" "Ф" "800"; "station" "Ф" "800" "sewage pipe"; "Ф" "800" "sewage pipe" "removal"; "engineering" "construction" "project" "bid winning"; "construction" "project" "bid winning" "publicity".

After filtering with the prefix-word set and the quaternary part-of-speech combination rules, the legal 4-gram combined word is: "Hu" "Nan" "highway" "station".
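
The sliding-window construction and the rule filtering described above can be sketched as follows. This is a hedged illustration, not the application's implementation: `POS_RULES` holds only the example patterns quoted in the text (lower-cased), the function names are hypothetical, the input is assumed to be one segment of `(word, tag)` pairs from a tagger, and the suffix-word check is left out because its exact semantics are not clear from the text.

```python
# Illustrative rule table built from the examples in the text; a real
# configuration would be larger. Tags are lower-cased before matching.
POS_RULES = {
    2: {("v", "v"), ("n", "v"), ("a", "n"), ("v", "n"), ("n", "n")},
    3: {("j", "j", "n"), ("m", "q", "n"), ("n", "nd", "n")},
    4: {("n", "n", "m", "n")},
}


def sliding_ngrams(tagged, n):
    """Yield every run of n adjacent (word, tag) pairs."""
    for i in range(len(tagged) - n + 1):
        yield tagged[i:i + n]


def rule_filtered_ngrams(tagged, prefix_words=frozenset()):
    """Return the 2-, 3- and 4-gram word tuples that pass both checks."""
    kept = []
    for n in (2, 3, 4):
        for gram in sliding_ngrams(tagged, n):
            words = tuple(word for word, _ in gram)
            tags = tuple(tag.lower() for _, tag in gram)
            # A prefix word is only acceptable at the head of a combined
            # word, so any prefix word in a later position rejects it.
            if any(word in prefix_words for word in words[1:]):
                continue
            # Keep the n-gram only if its tag sequence is a configured rule.
            if tags in POS_RULES[n]:
                kept.append(words)
    return kept
```

With this small rule table the sketch does not reproduce every result of the worked example; it only shows how the window, the prefix check, and the pattern lookup fit together.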

If a 2-gram combined word that survives filtering is part of a surviving 3-gram or 4-gram combined word, the 2-gram is filtered out; if a surviving 3-gram combined word is part of a surviving 4-gram combined word, the 3-gram is filtered out.

The 3-gram combined word "Hu" "Nan" "highway" is part of the 4-gram combined word "Hu" "Nan" "highway" "station", so this 3-gram is filtered out.

Step 105: the device stores, as unregistered words in the backup dictionary, those unfiltered combined words whose frequency of occurrence in the text information is less than the preset combined-word frequency threshold, and deletes from the current candidate dictionary the candidate words corresponding to the combined words stored in the backup dictionary and to the combined words that were filtered out.

After the processing of step 105, the candidate words corresponding to combined words that were neither filtered out nor stored in the backup dictionary are retained in the candidate dictionary.

Suppose the two combined words "18 No. line" (Line 18) and "Hu Nan highway station" occur in the text information with a frequency less than the preset combined-word frequency threshold. These two combined words are then stored directly in the backup dictionary as unregistered words, and the candidate words corresponding to them are deleted from the candidate dictionary.

After the above processing, suppose the current candidate dictionary contains the candidate words: "track"; "traffic"; "Xiulong Bridge"; "pile pulling"; "engineering"; "construction"; "bid winning"; "publicity".
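
The containment filter and the frequency split of step 105 can be sketched as below. The function names and the `joiner` parameter are hypothetical; `joiner` defaults to the empty string because Chinese words are concatenated without spaces, and the test data use a space only because the tokens are English glosses. Counting with `str.count` (non-overlapping occurrences) is an assumption about how frequency is measured.

```python
def is_part_of(short, long):
    """True if `short` occurs as a contiguous run inside the tuple `long`."""
    n = len(short)
    return any(long[i:i + n] == short for i in range(len(long) - n + 1))


def drop_contained(ngrams):
    """Drop every n-gram that is part of a longer n-gram in the list."""
    grams = [tuple(g) for g in ngrams]
    return [g for g in grams
            if not any(len(h) > len(g) and is_part_of(g, h) for h in grams)]


def split_by_frequency(ngrams, text, threshold, joiner=""):
    """Split combined words into (backup, remaining) by text frequency.

    Following the text, combined words whose frequency is LESS than the
    threshold go straight to the backup dictionary; the rest stay behind
    for the statistical steps.
    """
    backup, remaining = [], []
    for gram in ngrams:
        word = joiner.join(gram)
        if text.count(word) < threshold:
            backup.append(word)
        else:
            remaining.append(word)
    return backup, remaining
```

One way to read this design is that combined words which already satisfy the rules are accepted without statistical evidence when they are rare, while frequent ones are passed on to the mutual information and boundary entropy checks.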

Step 106: for each candidate word in the current candidate dictionary, the mutual information entropy between the candidate word and its adjacent candidate word in the text information is calculated, and the candidate words whose mutual information entropy is greater than the preset mutual information entropy threshold are retained in the candidate dictionary.

When the mutual information entropy between each candidate word in the current candidate dictionary and its adjacent candidate word in the text information is calculated in this step, the method further includes:

if the mutual information entropy between the candidate word and its adjacent candidate word in the text information is not greater than the preset mutual information entropy threshold, deleting the candidate word from the candidate dictionary.

The internal cohesion of a word measures how tightly two characters are bound together and is used to gauge how likely they are to form a word. The greater the internal cohesion, the more tightly the characters combine and the more likely they are to form a combined word. A threshold makes the final decision: when the internal cohesion exceeds the threshold, the characters are considered able to form a combined word.

Mutual information is commonly used to measure the degree of interdependence between two signals and can be used to measure how tightly a two-tuple or word is bound internally. Mutual information is defined as:

MI(x, y) = log₂( P(xy) / (P(x) P(y)) )

where P(xy) is the probability that x and y occur together in the corpus, P(x) is the probability that x occurs on its own, and P(y) is the probability that y occurs on its own. When MI(x, y) >> 0, x and y are highly correlated, that is, they often occur together, and the string xy is more likely to form a new word. When MI(x, y) = 0, x and y are distributed independently of each other. When MI(x, y) << 0, x and y are in complementary distribution, that is, they rarely occur together. The larger the mutual information entropy, the greater the internal cohesion of the two-tuple and the more likely the two-tuple is an unregistered word or part of one.
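
A minimal sketch of this calculation over a token sequence is given below, assuming the pointwise form MI(x, y) = log₂(P(xy) / (P(x)·P(y))) and simple relative-frequency estimates. The function name `pmi_table` is hypothetical, and estimating P(xy) from adjacent-pair counts and P(x), P(y) from single-token counts is an assumption, since the text does not say how the probabilities are estimated.

```python
import math
from collections import Counter


def pmi_table(tokens):
    """Return {(x, y): MI(x, y)} for every adjacent token pair."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    table = {}
    for (x, y), count in bigrams.items():
        p_xy = count / n_bi          # probability that x and y occur together
        p_x = unigrams[x] / n_uni    # probability of x on its own
        p_y = unigrams[y] / n_uni    # probability of y on its own
        table[(x, y)] = math.log2(p_xy / (p_x * p_y))
    return table


def keep_cohesive(table, threshold):
    """Pairs whose mutual information exceeds the preset threshold."""
    return {pair for pair, mi in table.items() if mi > threshold}
```

Candidate words whose pairs all fall at or below the threshold would then be deleted from the candidate dictionary, as described in step 106.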

Step 107: for each candidate word in the current candidate dictionary, the device calculates the boundary information entropy of the candidate word and filters out the candidate words that cannot form a combined word with the corresponding boundary candidate word because their information entropy exceeds the preset boundary entropy threshold.

Boundary information entropy comprises left boundary information entropy and right boundary information entropy. The two are used to measure the boundary degree of freedom of a word and thereby to determine its left and right boundaries.

Boundary degree of freedom refers to the number of distinct adjacent word types in the adjacency set of a character string. The larger the boundary degree of freedom, the more varied the characters in the boundary set of the string, that is, the more complex the characters adjacent to it, and the more likely the edge of the string is a word boundary; and vice versa. Boundary degree of freedom is usually measured with information entropy. Suppose X is a discrete random variable with value space R, and its probability distribution is p(x) for each value x (x ∈ R). The information entropy of X is then calculated as follows.

H(X) = -∑ p(x) log₂ p(x), summed over all x ∈ R

The boundary degree of freedom of a word is measured with its left information entropy and right information entropy, which determine the left and right boundaries of the word. The left boundary information entropy and the right boundary information entropy are calculated as follows.
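
The two formulas themselves are not reproduced in this text. A conventional form that is consistent with the entropy definition above is given here as an assumption rather than as the application's exact notation. In it, w is the candidate word, A and B are the sets of words observed immediately to the left and right of w in the text information, and P(aw | w) and P(wb | w) are the conditional probabilities of those neighbours given w.

```latex
H_L(w) = -\sum_{a \in A} P(aw \mid w)\,\log_2 P(aw \mid w),
\qquad
H_R(w) = -\sum_{b \in B} P(wb \mid w)\,\log_2 P(wb \mid w)
```

Both expressions are the entropy H(X) applied to the distribution of left or right neighbours of w.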

When the left boundary information entropy of a word is greater than the preset boundary entropy threshold, the word is not suited to the tail position of a combined word; when the right boundary information entropy of a word is greater than the preset boundary entropy threshold, the word is not suited to the head position of a combined word.

In this step, calculating the boundary information entropy of the candidate word and filtering out the candidate words that cannot form a combined word with the corresponding boundary candidate word because their information entropy exceeds the preset boundary entropy threshold includes the following.

A first candidate word forms a combined word with a second candidate word, and the second candidate word forms a combined word with a third candidate word. The left boundary information entropy and the right boundary information entropy of the second candidate word are calculated.

When the left boundary information entropy of the second candidate word is greater than the preset boundary entropy threshold, the combined word formed by the first and second candidate words is not treated as a candidate unregistered word.

When the right boundary information entropy of the second candidate word is greater than the preset boundary entropy threshold, the combined word formed by the second and third candidate words is not treated as a candidate unregistered word.

When both the left and the right boundary information entropy of the second candidate word are greater than the preset boundary entropy threshold, the second candidate word is deleted from the candidate dictionary.
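
The three cases above can be sketched as follows. The function names are hypothetical, neighbour distributions are estimated from adjacent-token counts (an assumption), and the entropy uses base-2 logarithms as in the formula for H(X).

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (base 2) of the distribution given by the counts."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return sum(-(c / total) * math.log2(c / total) for c in counter.values())


def boundary_entropies(tokens):
    """Return {word: (left_entropy, right_entropy)} from adjacent tokens."""
    left, right = defaultdict(Counter), defaultdict(Counter)
    for prev, cur in zip(tokens, tokens[1:]):
        left[cur][prev] += 1    # prev is a left neighbour of cur
        right[prev][cur] += 1   # cur is a right neighbour of prev
    return {w: (entropy(left[w]), entropy(right[w])) for w in set(tokens)}


def boundary_decision(h_left, h_right, threshold):
    """Apply the three threshold rules to the middle (second) candidate word."""
    return {
        # First + second is kept only if the second word's left side is not too free.
        "keep_with_previous": h_left <= threshold,
        # Second + third is kept only if the second word's right side is not too free.
        "keep_with_next": h_right <= threshold,
        # The word is deleted when both of its sides exceed the threshold.
        "delete_word": h_left > threshold and h_right > threshold,
    }
```

A word with many different neighbours on one side has high entropy there, which signals a word boundary on that side, so the word should not be glued to a neighbour across it.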

Suppose that after the above steps the candidate dictionary contains the candidate words: "track"; "traffic"; "engineering"; "construction"; "bid winning"; "publicity".

Step 108: the device uses the stop-word set to filter the combined words that were not filtered out by boundary information entropy, adds the combined words that are not filtered out by the stop-word set to the backup dictionary as unregistered words, and filters the backup dictionary with the stop-word set.

The stop-word set can be an existing one, or one built for the application at hand. For example, a professional stop-word dictionary can be formed by taking the Harbin Institute of Technology stop-word list as a base and adding common bidding terms and everyday words.

In practice, before the candidate dictionary is filtered with the segmentation-mark library and the exclusion word set, it can first be filtered with the stop-word set to make unregistered word identification more efficient.

The process of building and updating the segmentation dictionary is described in detail below with reference to the drawings.

The segmentation dictionary is designed for the maximum matching algorithm. A serious problem of maximum-matching segmentation is that the maximum word length is hard to determine. In addition, the construction project bidding field contains a large number of unregistered words that share the same suffix, following patterns such as "** Station" and "** Street". The segmentation dictionary scheme provided by the application therefore adds word-length and last-character information.

The segmentation dictionary consists of a first-character word-length index table, a last-character word-length index table, and the dictionary text. The first-character word-length index table contains three items: the first character of every word in the dictionary text, the maximum length of the words in the dictionary text that begin with that character, and the index of the words in the dictionary text that begin with that character. The last-character word-length index table likewise contains three items: the last character of every word in the dictionary text, the maximum length of the words in the dictionary text that end with that character, and the index of the words in the dictionary text that end with that character.

Referring to Fig. 2, Fig. 2 is a schematic diagram of the contents of the segmentation dictionary in an embodiment of the application. In the first-character index table of Fig. 2, "capital" is a first character; "5" indicates that the maximum length of the words beginning with "capital" is 5; "1", "3", and so on are line numbers in the dictionary text. The last-character index table is analogous to the first-character index table.
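
One way to picture this structure is the sketch below. The class and attribute names are hypothetical, and each index entry holds a set of words rather than the line numbers into the dictionary text shown in Fig. 2, which is a simplification for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class CharIndexEntry:
    """Index entry for one first or last character."""
    max_len: int = 0                          # longest word for this character
    words: set = field(default_factory=set)   # stands in for line numbers


class SegmentationDictionary:
    """Dictionary with first-character and last-character length indexes."""

    def __init__(self, words=()):
        self.first_index = {}
        self.last_index = {}
        for word in words:
            self.add(word)

    def add(self, word):
        """Add a word and update both index tables."""
        if not word:
            return
        for index, key in ((self.first_index, word[0]),
                           (self.last_index, word[-1])):
            entry = index.setdefault(key, CharIndexEntry())
            entry.max_len = max(entry.max_len, len(word))
            entry.words.add(word)

    def __contains__(self, word):
        entry = self.last_index.get(word[-1]) if word else None
        return entry is not None and word in entry.words
```

Because `add` keeps both tables current, words identified later, such as entries promoted from the backup dictionary, can be added without rebuilding the indexes.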

The above method can be used to carry out the structure of dictionary for word segmentation in the application specific embodiment, be carried out using segmenter During automatic segmentation of Chinese word, the word for being named Entity recognition is increased in the dictionary for word segmentation of structure.

A word-segmentation dictionary built according to the scheme provided by the present application supports word lookup as follows:

When text is to be segmented using the word-segmentation dictionary, the text to be segmented is first scanned. For each character, the maximum length of a word that can begin with that character is found and compared, and the maximum word length value that can form a word is taken as the maximum word length for the matched first character. The reverse maximum matching algorithm is then applied: a judgment is made according to the first character, the maximum word length of the first character, and the last character, and the remaining character sequence is matched and segmented.

During string matching, for each character, the maximum word length corresponding to that character as a last character is determined first, and the corresponding first character is then located according to that maximum word length. If the maximum word length corresponding to the first character is not less than the maximum word length corresponding to the last character, it is preliminarily determined that the word formed by the first character, the last character and the characters between them may match. The characters between the last character and the first character are then checked one by one, and when all of them match, the matched word is determined.
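The lookup described above can be sketched in Python as follows. This is one possible reading, not the method of the present application: the names `build_max_len_tables` and `reverse_max_match` are hypothetical, the sample words are invented, and the pre-check compares the first character's maximum word length with the length of the candidate string currently being tried, which is an interpretation of the comparison described in the text. The final character-by-character check is replaced here by a set lookup.

```python
def build_max_len_tables(words):
    """Return (first_max, last_max): maximum word length per first/last character."""
    first_max, last_max = {}, {}
    for word in words:
        if not word:
            continue
        first_max[word[0]] = max(first_max.get(word[0], 0), len(word))
        last_max[word[-1]] = max(last_max.get(word[-1], 0), len(word))
    return first_max, last_max


def reverse_max_match(text, words):
    """Segment text from right to left, preferring the longest dictionary word."""
    vocab = set(words)
    first_max, last_max = build_max_len_tables(words)
    tokens = []
    end = len(text)
    while end > 0:
        # Longest word that can end with the current last character.
        limit = min(last_max.get(text[end - 1], 1), end)
        matched = 1  # fall back to a single character
        for length in range(limit, 1, -1):
            start = end - length
            # Cheap pre-check: no word starting with this character is long enough.
            if first_max.get(text[start], 0) < length:
                continue
            if text[start:end] in vocab:
                matched = length
                break
        tokens.append(text[end - matched:end])
        end -= matched
    tokens.reverse()
    return tokens


# Hypothetical sample dictionary.
sample_words = ["北京", "北京大学", "大学", "学生"]
print(reverse_max_match("北京大学生", sample_words))
```

The two maximum-length tables let the matcher skip candidate strings that cannot be words before the more expensive full comparison is made.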

The dictionaries in the embodiments of the present application include the built word-segmentation dictionary and a spare dictionary. The word-segmentation dictionary is used for dictionary matching during automatic word segmentation. The spare dictionary stores unregistered words while the unregistered-word dictionary is being dynamically updated, and can easily be loaded into and unloaded from the word-segmentation dictionary.

When the unregistered words in the spare dictionary are updated, the updated unregistered words are updated into the word-segmentation dictionary.

In a specific implementation, the word-segmentation dictionary stores the great majority of entries and is used specifically for matching during segmentation. The word-segmentation dictionary may be updated in stages: the vocabulary identified into the spare dictionary is updated into the core dictionary in a single batch once it has stabilized. This prevents transiently erroneous unregistered-word identifications from affecting the accuracy of the segmentation results.
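The staged update can be sketched in Python as follows. The text does not define when the vocabulary in the spare dictionary counts as stable, so this sketch assumes a hypothetical criterion: a word is stable once it has been identified in `stable_rounds` consecutive identification rounds. The class name `SpareDictionary` and its methods are likewise assumptions for illustration.

```python
class SpareDictionary:
    """Holds identified unregistered words until they are considered stable."""

    def __init__(self, stable_rounds=3):
        # Assumed stability criterion: consecutive rounds in which a word appears.
        self.stable_rounds = stable_rounds
        self.rounds_seen = {}

    def update(self, identified_words):
        """Record one identification round.

        Words missing from this round are dropped, so a transient
        misidentification never reaches the word-segmentation dictionary.
        """
        self.rounds_seen = {
            word: self.rounds_seen.get(word, 0) + 1
            for word in set(identified_words)
        }

    def flush_stable(self, segmentation_dictionary):
        """Move all stable words into the word-segmentation dictionary in one batch."""
        stable = {
            word for word, rounds in self.rounds_seen.items()
            if rounds >= self.stable_rounds
        }
        segmentation_dictionary.update(stable)
        for word in stable:
            del self.rounds_seen[word]
        return stable


# Hypothetical usage with invented words.
segmentation_dictionary = {"北京"}
spare = SpareDictionary(stable_rounds=2)
spare.update(["招标公告", "错词"])
spare.update(["招标公告"])
promoted = spare.flush_stable(segmentation_dictionary)
print(promoted)
```

Keeping the two dictionaries separate means the word-segmentation dictionary changes only at flush time, which matches the one-off batch update described above.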

In conclusion the application identifies unregistered word by using the method that rule and statistics are combined；Rule is by vocabulary During the domain knowledge in feature and calling for tenders of project field etc. is dissolved into identification unregistered word well, the method for statistics can be with It is preferable to capture statistical information, will occur frequent word in text message.The identifying schemes that rule and statistics are combined can carry High unknown word identification efficiency and accuracy.

A kind of scheme for building dictionary for word segmentation is given in the embodiment of the present application, the efficiency for searching dictionary can be improved.

The foregoing is merely illustrative of the preferred embodiments of the present invention, is not intended to limit the invention, all essences in the present invention God and any modification, equivalent substitution, improvement and etc. within principle, done, should be included within the scope of protection of the invention.

Claims

1. An unknown word identification method, characterized in that the method comprises:

Obtaining hypertext markup language (HTML) web page information and parsing it into text information;

Performing automatic Chinese word segmentation on the text information using a segmenter, and storing the words that fail named entity recognition as candidate words in a candidate dictionary;

Filtering the candidate dictionary using a segmentation mark library and an exclusion word set;

For each candidate word in the current candidate dictionary, constructing combined words from the candidate word and its adjacent candidate words in the text information, and filtering the combined words using a preceding asyllabia set, a rear asyllabia set and configured part-of-speech combination rules;

Storing in a spare dictionary, as unregistered words, those unfiltered combined words whose frequency of occurrence in the text information is less than a preset combined-word frequency threshold; and deleting from the current candidate dictionary the candidate words corresponding to the combined words stored in the spare dictionary and to the combined words that were filtered out;

For each candidate word in the current candidate dictionary, calculating the mutual information entropy between the candidate word and its adjacent candidate word in the text information, and retaining in the candidate dictionary the candidate words whose mutual information entropy is greater than a preset mutual information entropy threshold;

For each candidate word in the current candidate dictionary, calculating the boundary information entropy of the candidate word, and filtering out the candidate words that cannot be combined into a word with the corresponding boundary candidate word because their information entropy is greater than a preset boundary entropy threshold;

Filtering, using a stop word set, the combined words not filtered out by the boundary information entropy; and adding the combined words not filtered out by the stop word set to the spare dictionary as unregistered words.
2. The method according to claim 1, characterized in that, for each candidate word in the current candidate dictionary, constructing combined words from the candidate word and its adjacent candidate words in the text information, and filtering the combined words using the preceding asyllabia set, the rear asyllabia set and the configured part-of-speech combination rules, comprises:

For each candidate word, constructing binary combined words, ternary combined words and quaternary combined words respectively;

For the constructed binary combined words, filtering the candidate word located at the suffix of the combined word using the preceding asyllabia set, and filtering using the corresponding binary part-of-speech combination rule;

For the ternary combined words, filtering each of the two candidate words other than the prefix of the combined word using the preceding asyllabia set; filtering the candidate word located at the suffix of the combined word using the rear asyllabia set; and filtering using the corresponding ternary part-of-speech combination rule;

For the quaternary combined words, filtering each of the three candidate words other than the prefix of the combined word using the preceding asyllabia set, and filtering using the corresponding ternary part-of-speech combination rule;

Wherein the configured part-of-speech combination rules comprise: a binary part-of-speech combination rule, a ternary part-of-speech combination rule and a quaternary part-of-speech combination rule.
3. The method according to claim 2, characterized in that the method further comprises:

If a binary combined word remaining after filtering is part of a ternary combined word or a quaternary combined word remaining after filtering, filtering out the binary combined word; if a ternary combined word remaining after filtering is part of a quaternary combined word remaining after filtering, filtering out the ternary combined word.
4. The method according to claim 1, characterized in that, when calculating, for each candidate word in the current candidate dictionary, the mutual information entropy between the candidate word and its adjacent candidate word in the text information, the method further comprises:

If the mutual information entropy between the candidate word and its adjacent candidate word in the text information is not greater than the preset mutual information entropy threshold, deleting the candidate word from the candidate dictionary.
5. The method according to claim 1, characterized in that calculating the boundary degree of freedom of the candidate word using information entropy, and filtering out the candidate words whose boundary information entropy is greater than the preset boundary entropy threshold, comprises:

A first candidate word and a second candidate word form a combined word; the second candidate word and a third candidate word form a combined word; calculating the left boundary information entropy and the right boundary information entropy of the second candidate word;

When the left boundary information entropy of the second candidate word is greater than the preset boundary entropy threshold, not taking the combined word formed by the first candidate word and the second candidate word as a candidate unregistered word;

When the right boundary information entropy of the second candidate word is greater than the preset boundary entropy threshold, not taking the combined word formed by the second candidate word and the third candidate word as a candidate unregistered word;

When the left boundary information entropy and the right boundary information entropy of the second candidate word are both greater than the preset boundary entropy threshold, deleting the second candidate word from the candidate dictionary.
6. The method according to claim 1, characterized in that, after performing automatic Chinese word segmentation on the text information using the segmenter and storing the words that fail named entity recognition as candidate words in the candidate dictionary, and before filtering the candidate dictionary using the segmentation mark library and the exclusion word set, the method further comprises:

Filtering the candidate dictionary using a stop word set.
7. The method according to any one of claims 1-6, characterized in that the method further comprises:

Building a word-segmentation dictionary;

When the segmenter performs automatic Chinese word segmentation, adding the words recognized by named entity recognition to the built word-segmentation dictionary.
8. The method according to claim 7, characterized in that building the word-segmentation dictionary comprises:

The word-segmentation dictionary comprises: a first-character word-length index table, a last-character word-length index table, and the dictionary text; wherein the first-character word-length index table comprises: the first characters of all words in the word-segmentation dictionary text, the maximum length of the words in the dictionary text that begin with each first character, and the index of the words in the dictionary text that begin with that first character; and the last-character word-length index table comprises: the last characters of all words in the domain dictionary text, the maximum length of the words in the dictionary text that end with each last character, and the index of the words in the dictionary text that end with that last character.
9. The method according to claim 8, characterized in that the method further comprises:

When text is to be segmented using the word-segmentation dictionary, first scanning the text to be segmented; for each character, finding and comparing the maximum length of a word that can begin with that character, and taking the maximum word length value that can form a word as the maximum word length for the matched first character; then applying the reverse maximum matching algorithm, making a judgment according to the first character, the maximum word length of the first character, and the last character, and matching and segmenting the remaining character sequence.
10. The method according to claim 9, characterized in that the method further comprises:

When the unregistered words in the spare dictionary are updated, updating the updated unregistered words into the word-segmentation dictionary.