CN108268440A - A kind of unknown word identification method - Google Patents

Publication number
CN108268440A
Authority
CN
China
Prior art keywords
word
candidate
dictionary
combination
portmanteau
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710003573.3A
Other languages
Chinese (zh)
Inventor
张春荣
韦玮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Potevio Information Technology Co Ltd
Putian Information Technology Co Ltd
Original Assignee
Putian Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Putian Information Technology Co Ltd
Priority to CN201710003573.3A
Publication of CN108268440A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

This application provides an unknown word identification method that identifies unregistered (out-of-vocabulary) words by combining rules and statistics. The rules bring the lexical features and domain knowledge of fields such as project bidding into the identification process, while the statistical method captures statistical information and selects the words that occur frequently in the text information. Combining rules and statistics improves the efficiency and accuracy of unknown word identification.

Description

A kind of unknown word identification method
Technical field
The present invention relates to the field of natural language processing, and in particular to an unknown word identification method.
Background art
Natural language processing generally "understands" the meaning of language in units of words, so its first task is word segmentation. Many areas of Chinese information processing rely on a dictionary to perform their functions. Segmentation, question retrieval, similarity matching, determining the answer from the retrieval results, and intelligent dialogue all compute with the word as the smallest unit and with the word dictionary as the basis of the computation, so the construction of the word dictionary strongly affects the performance of the whole system.
Many segmentation methods are designed on the assumption that the dictionary is complete, but in practice this assumption often does not hold. As society and the internet develop, new words constantly emerge in daily life, and the vocabulary of specialized domains rarely appears in a general-purpose base dictionary. An unregistered word is defined as a word that does not appear in the dictionary, including newly coined words and domain-specific terms. Segmentation errors caused by unregistered words far outnumber those caused by segmentation ambiguity. In a specific domain in particular, keeping the dictionary up to date is decisive for the efficiency of the application system that uses it, and the scale and quality of the dictionary directly determine the performance of related applications.
A segmentation dictionary can be built and extended either manually or automatically. Manual construction adds unregistered words to the dictionary with high accuracy, but it requires the long-term participation of many domain experts, its labor and time costs are too high, and it lacks timeliness. Automatic generation judges the domain attribute of a word by analyzing how its statistical properties differ across corpora of different domains; it needs no domain experts and saves a great deal of labor cost, but the accuracy of the words it admits to the dictionary is not high.
Therefore, automatically identifying the unregistered words that keep emerging in daily life and adding them to the dictionary is a basic task of natural language processing. Unregistered word identification is one of the major difficulties in segmentation and one of the main factors affecting segmentation accuracy. Current methods for discovering unregistered words are mainly of two kinds: rule-based and statistics-based.
The main idea of rule-based methods is to build a rule base, specialized dictionary, or pattern base from the word-formation or surface-form features of unregistered words, and then to find unregistered words by rule matching. Rule-based methods are confined to a particular domain and require a rule base to be built.
Statistics-based methods usually extract candidate strings with a statistical strategy and then use linguistic knowledge to exclude junk strings that are not unregistered words; alternatively, they compute a degree of association and look for the character or word combinations with the highest association. Statistics-based methods are limited in that they can only find relatively short new words.
Summary of the invention
In view of this, the application provides an unknown word identification method that can improve the efficiency and accuracy of unknown word identification.
To solve the above technical problem, the technical solution of the application is implemented as follows:
An unknown word identification method, comprising:
obtaining HTML web page information and parsing it into text information;
performing automatic Chinese word segmentation on the text information with a segmenter, and storing the words that named entity recognition fails to recognize in a candidate dictionary as candidate words;
filtering the candidate dictionary using a segmentation marker library and an exclusion word set;
for each candidate word in the current candidate dictionary, constructing combined words from the candidate word and its adjacent candidate words in the text information, and filtering them using a prefix word set, a suffix word set, and configured part-of-speech combination rules;
storing in a spare dictionary, as unregistered words, the unfiltered combined words whose frequency in the text information is less than a preset combined-word frequency threshold; and deleting from the current candidate dictionary the candidate words corresponding to the combined words stored in the spare dictionary and to the combined words that were filtered out;
for each candidate word in the current candidate dictionary, computing the mutual information entropy between the candidate word and its adjacent candidate word in the text information, and retaining in the candidate dictionary the candidate words whose mutual information entropy exceeds a preset mutual information entropy threshold;
for each candidate word in the current candidate dictionary, computing the boundary information entropy of the candidate word, and filtering out the candidate words that cannot form a combined word with the corresponding boundary candidate word because the information entropy exceeds a preset boundary entropy threshold;
filtering, with a stop word set, the combined words not filtered out by boundary information entropy, and adding the combined words that the stop word set does not filter out to the spare dictionary as unregistered words.
As the above technical solution shows, the application identifies unregistered words by combining rules and statistics. The rules bring the lexical features and domain knowledge of fields such as project bidding into the identification process, while the statistical method captures statistical information and selects the words that occur frequently in the text information. Combining rules and statistics improves the efficiency and accuracy of unknown word identification.
Description of the drawings
Fig. 1 is a flow diagram of unregistered word identification in an embodiment of the application;
Fig. 2 is a structural diagram of the contents of the segmentation dictionary in an embodiment of the application.
Specific embodiment
To make the purpose, technical solution, and advantages of the invention clearer, the technical solution is described in detail below with reference to the accompanying drawings and embodiments.
The unknown word identification scheme provided by the application is applied to building a domain lexicon for project bidding. It identifies unregistered words by combining rules and statistics: the rules bring the lexical features and domain knowledge of the project bidding field into the identification process, while the statistical method captures statistical information and selects the words that occur frequently in the text information. Combining the two improves the efficiency and accuracy of unknown word identification.
In the specific embodiment, unregistered words are identified during automatic word segmentation, and the words recognized by named entity recognition are added to the segmentation dictionary being built. Once unregistered words have been identified, they are also added to the segmentation dictionary, so that the segmentation dictionary is both built and updated.
The segmentation dictionary here, which may also be called the core lexicon, is the dictionary used for segmentation; it contains general words and unregistered words.
For convenience, the equipment that identifies unregistered words and builds and updates the segmentation dictionary is referred to below simply as the device.
The identification process for unregistered words is described first, with reference to the drawings.
Referring to Fig. 1, which is a flow diagram of unregistered word identification in an embodiment of the application, the specific steps are:
Step 101: the device obtains hypertext markup language (HTML) web page information and parses it into text information.
In a specific implementation, a parser can be used to parse the HTML web page information into text information.
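The text does not name a particular parser. As a minimal sketch, assuming Python's standard `html.parser` module is acceptable, the hypothetical helper `html_to_text` below collects the visible text of a page and skips the contents of `<script>` and `<style>` elements:

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside skipped elements.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(data.strip())

    def text(self):
        return "\n".join(self._chunks)


def html_to_text(html):
    """Hypothetical helper: return the visible text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Each text node ends up on its own line, which keeps headings and paragraphs apart before segmentation.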
Step 102: the device performs automatic Chinese word segmentation on the text information with a segmenter and stores the words that named entity recognition fails to recognize in the candidate dictionary as candidate words.
In a specific implementation, the choice of segmenter is not restricted; the LTP segmenter, for example, can be used.
Part-of-speech tagging and named entity recognition are performed together with automatic Chinese word segmentation. In the project bidding field, entity names can include person names, place names, organization names, and so on. Words recognized as named entities are used to build the segmentation dictionary; words not recognized as named entities become candidate words for unknown word identification.
Take the following passage from the text information as an example: "Shanghai Rail Transit Line 18 project, Hu Nan Highway Station: bid award announcement for the construction project covering Ф800 sewage pipe removal, Xiulong Bridge pile pulling, obstacle clearing, and backfilling".
After automatic word segmentation, the content is: "Shanghai City" "track" "traffic" "18" "No." "line" "engineering" "Hu" "Nan" "highway" "station" "Ф" "800" "sewage pipe" "removal" "," "Xiulong Bridge" "pile pulling" "," "obstacle clearing" "and" "back" "fill" "etc." "engineering" "construction" "project" "bid award" "announcement".
Suppose "Shanghai City" is recognized by named entity recognition and is used to build the segmentation dictionary; the other words are not recognized and are stored in the candidate dictionary as candidate unregistered words.
The current candidate dictionary then contains: "track" "traffic" "18" "No." "line" "engineering" "Hu" "Nan" "highway" "station" "Ф" "800" "sewage pipe" "removal" "," "Xiulong Bridge" "pile pulling" "," "obstacle clearing" "and" "back" "fill" "etc." "engineering" "construction" "project" "bid award" "announcement".
Step 103: the device filters the candidate dictionary using the segmentation marker library and the exclusion word set.
The segmentation marker library treats non-Chinese characters such as punctuation marks, digits, and English letters as segmentation markers. After filtering with the segmentation marker library, the candidate dictionary contains: "track" "traffic" "18" "No." "line" "engineering" "Hu" "Nan" "highway" "station" "Ф" "800" "sewage pipe" "removal"; "Xiulong Bridge" "pile pulling"; "obstacle clearing" "and" "back" "fill" "etc." "engineering" "construction" "project" "bid award" "announcement".
Single-character function words whose part of speech is preposition, auxiliary word, pronoun, adverb, conjunction, interrogative, or interjection, such as "most", "too", "this", and "I", have weak word-formation ability. Some words of other parts of speech, such as "is", "at", "have", "like", and "when", are also weak word builders. Weak word-formation ability here means that the probability of forming an unregistered word with other strings is below 10%. These words are collected into the exclusion word set.
After filtering with both the segmentation marker library and the exclusion word set, the candidate dictionary contains: "track" "traffic" "18" "No." "line" "engineering" "Hu" "Nan" "highway" "station" "Ф" "800" "sewage pipe" "removal"; "Xiulong Bridge" "pile pulling"; "obstacle clearing"; "back" "fill"; "engineering" "construction" "project" "bid award" "announcement".
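The filtering in step 103 can be sketched as splitting the token sequence into segments wherever a marker or an excluded word occurs. This is an illustrative sketch only: `EXCLUSION_WORDS` is a made-up two-entry set, and `is_marker` treats only punctuation tokens as markers, which matches the worked example (where "18", "Ф", and "800" survive) rather than the broader definition that also names digits and English letters.

```python
import unicodedata

# Illustrative exclusion set (assumption); the full set is not listed in the text.
EXCLUSION_WORDS = {"and", "etc."}


def is_marker(token):
    """True when every character of the token is a punctuation character."""
    return all(unicodedata.category(ch).startswith("P") for ch in token)


def filter_candidates(tokens, exclusion_words=EXCLUSION_WORDS):
    """Split tokens into candidate segments at markers and excluded words."""
    segments, current = [], []
    for token in tokens:
        if is_marker(token) or token in exclusion_words:
            # A marker or excluded word closes the current segment.
            if current:
                segments.append(current)
            current = []
        else:
            current.append(token)
    if current:
        segments.append(current)
    return segments
```

Keeping the segments separate matters for the next step, because combined words are only built from candidate words that are adjacent within one segment.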
Step 104: for each candidate word in the current candidate dictionary, the device constructs combined words from the candidate word and its adjacent candidate words in the text information, and filters them using the prefix word set, the suffix word set, and the configured part-of-speech combination rules.
In a specific implementation, the prefix word set, the suffix word set, and the part-of-speech combination rules are configured in advance.
A word with strong word-formation ability that occupies position A in a combined word AB is called a prefix word, for example "on", "before", "highest", "successively", and "neutralization". Strong word-formation ability here means that the probability of forming an unregistered word with other words exceeds 90%. These prefix words are collected into the prefix word set.
A word with strong word-formation ability that occupies position B in a combined word AB is called a suffix word. In the project bidding field, common tail words denote routes, place names, station names, and the like, such as ** Line (Beijing-Guangzhou Line), *** Road, ** Station, ** Street, and ** bid section (first bid section). These suffix words are collected into the suffix word set.
The configured part-of-speech combination rules include binary, ternary, and quaternary rules. In practical applications, n-ary combination rules can be configured separately; the larger n is, the higher the reliability. A length range for candidate strings can also be set so that candidate strings outside the range are excluded. Because unregistered words in the project bidding field are generally shorter than a 5-word combination, the specific embodiment only considers filtering 2-word, 3-word, and 4-word combinations with part-of-speech combination rules.
The part-of-speech combination rules follow the 863 part-of-speech tag set. Binary rules include V+V (service bidding), N+V (transportation), A+N (intelligent transportation), V+N (tender announcement), and N+N (rail traffic);
ternary rules include j+j+N (Beijing-Tianjin Expressway), M+Q+N (Line 9), and N+Nd+N (Xinhua East Road);
quaternary rules include N+N+M+N (Earth Garden First Road).
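A rule lookup of this kind can be sketched as a table keyed by combination length. The patterns below are copied from the examples in the text; the names `POS_RULES` and `matches_pos_rule`, and the choice to compare tags case-insensitively, are assumptions made for illustration.

```python
# Patterns taken from the examples above, lower-cased (assumption).
POS_RULES = {
    2: {("v", "v"), ("n", "v"), ("a", "n"), ("v", "n"), ("n", "n")},
    3: {("j", "j", "n"), ("m", "q", "n"), ("n", "nd", "n")},
    4: {("n", "n", "m", "n")},
}


def matches_pos_rule(tags, rules=POS_RULES):
    """True when the tag sequence is a configured combination pattern."""
    key = tuple(tag.lower() for tag in tags)
    # Lengths without configured rules (for example 5) never match.
    return key in rules.get(len(key), set())
```

A table per length keeps the check a constant-time set lookup and makes it easy to add further patterns without touching the matching code.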
Constructing combined words for each candidate word in the current candidate dictionary from its adjacent candidate words in the text information, and filtering them using the prefix word set, the suffix word set, and the configured part-of-speech combination rules, comprises:
For each candidate word, 2-word, 3-word, and 4-word combinations are constructed with a sliding window.
For 2-word combinations, the prefix word set is used to filter on the candidate word at the tail of the combined word, and the corresponding binary part-of-speech combination rules are applied.
The binary combinations constructed from the candidate words are "rail traffic", "traffic 18", "18 No.", "No. line", "line engineering", and so on.
Take "rail traffic": the candidate word "traffic" is checked against the prefix word set. If it is filtered out, "rail traffic" cannot serve as a candidate unregistered word. In this embodiment it is not filtered out, so "rail traffic" can serve as a candidate unregistered word.
The binary part-of-speech combination rules contain the pattern N+N, so "rail traffic" is further confirmed as a candidate unregistered word; that is, the candidate words "track" and "traffic" remain in the candidate dictionary and are not filtered out.
Suppose that after this processing the following binary combinations are not filtered out: "track" "traffic"; "line" "engineering"; "Hu" "Nan"; "highway" "station"; "sewage pipe" "removal"; "Xiulong Bridge" "pile pulling"; "back" "fill"; "engineering" "construction"; "construction" "project"; "project" "bid award"; "bid award" "announcement".
For 3-word combinations, the prefix word set is used to filter on each of the two candidate words that are not at the head of the combined word; the suffix word set is used to filter on the candidate word at the tail of the combined word; and the corresponding ternary part-of-speech combination rules are applied.
The ternary combinations constructed from the candidate words are: "track" "traffic" "18"; "traffic" "18" "No."; "18" "No." "line"; "No." "line" "engineering"; "line" "engineering" "Hu"; "engineering" "Hu" "Nan"; "Hu" "Nan" "highway"; "Nan" "highway" "station"; "highway" "station" "Ф"; "station" "Ф" "800"; "Ф" "800" "sewage pipe"; "800" "sewage pipe" "removal"; "engineering" "construction" "project"; "construction" "project" "bid award"; "project" "bid award" "announcement".
After filtering with the prefix word set, the suffix word set, and the ternary part-of-speech combination rules, the valid ternary combinations are: "18" "No." "line"; "Hu" "Nan" "highway".
For 4-word combinations, the prefix word set is used to filter on each of the three candidate words that are not at the head of the combined word, and the corresponding quaternary part-of-speech combination rules are applied.
The quaternary combinations constructed from the candidate words are: "track" "traffic" "18" "No."; "traffic" "18" "No." "line"; "18" "No." "line" "engineering"; "No." "line" "engineering" "Hu"; "line" "engineering" "Hu" "Nan"; "engineering" "Hu" "Nan" "highway"; "Hu" "Nan" "highway" "station"; "Nan" "highway" "station" "Ф"; "highway" "station" "Ф" "800"; "station" "Ф" "800" "sewage pipe"; "Ф" "800" "sewage pipe" "removal"; "engineering" "construction" "project" "bid award"; "construction" "project" "bid award" "announcement".
After filtering with the prefix word set and the quaternary part-of-speech combination rules, the valid quaternary combination is: "Hu" "Nan" "highway" "station".
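The sliding-window construction and the prefix-word check can be sketched as follows. The function names are hypothetical, the segments are assumed to come from step 103, and only the prefix-word check is shown; the suffix-word check and the part-of-speech rules are left out to keep the sketch short.

```python
def ngrams(segment, n):
    """All contiguous n-word windows of one segment, in text order."""
    return [tuple(segment[i:i + n]) for i in range(len(segment) - n + 1)]


def passes_prefix_filter(combo, prefix_words):
    """Reject a combination when a prefix word sits anywhere but the head."""
    return not any(word in prefix_words for word in combo[1:])


def candidate_combinations(segments, prefix_words, sizes=(2, 3, 4)):
    """Build 2-, 3- and 4-word combinations and apply the prefix check."""
    combos = []
    for segment in segments:
        for n in sizes:
            combos.extend(c for c in ngrams(segment, n)
                          if passes_prefix_filter(c, prefix_words))
    return combos
```

Windows never cross a segment boundary, which is why the worked example contains no combination that spans a punctuation mark or an excluded word.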
If a binary combination that survived filtering is part of a ternary or quaternary combination that also survived filtering, the binary combination is filtered out; if a ternary combination that survived filtering is part of a quaternary combination that also survived filtering, the ternary combination is filtered out.
The ternary combination "Hu" "Nan" "highway" is part of the quaternary combination "Hu" "Nan" "highway" "station", so this ternary combination is filtered out.
Step 105: the device stores in the spare dictionary, as unregistered words, the unfiltered combined words whose frequency in the text information is less than the preset combined-word frequency threshold, and deletes from the current candidate dictionary the candidate words corresponding to the combined words stored in the spare dictionary and to the combined words that were filtered out.
After step 105, the candidate words that correspond to combined words which were neither filtered out nor stored in the spare dictionary remain in the candidate dictionary.
Suppose the two combined words "No. 18 line" and "Hu Nan highway station" occur in the text information less often than the preset combined-word frequency threshold. They are then stored directly in the spare dictionary as unregistered words, and the candidate words corresponding to these two combined words are deleted from the candidate dictionary.
After the above processing, suppose the current candidate dictionary contains: "track"; "traffic"; "Xiulong Bridge"; "pile pulling"; "engineering"; "construction"; "bid award"; "announcement".
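The containment rule at the end of step 104 and the frequency test of step 105 can be sketched together. Combined words are represented as tuples of candidate words; the function names and the threshold value used in the usage example are assumptions, not values given in the text.

```python
from collections import Counter


def is_contained(short, long_):
    """True when `short` occurs as a contiguous run inside `long_`."""
    k = len(short)
    return any(long_[i:i + k] == short for i in range(len(long_) - k + 1))


def drop_subsumed(combos):
    """Remove combinations that are part of a longer surviving combination."""
    return [c for c in combos
            if not any(len(o) > len(c) and is_contained(c, o) for o in combos)]


def route_by_frequency(occurrences, threshold):
    """Split combined words by how often they occur in the text.

    `occurrences` holds one entry per occurrence. Words below the threshold
    are accepted directly as unregistered words; the others stay behind for
    the statistical checks of the later steps.
    """
    counts = Counter(occurrences)
    accepted = sorted(c for c, n in counts.items() if n < threshold)
    remaining = sorted(c for c, n in counts.items() if n >= threshold)
    return accepted, remaining
```

For example, with a threshold of 2, a combination seen once is accepted directly, while one seen three times is passed on to the mutual information and boundary entropy checks.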
Step 106: for each candidate word in the current candidate dictionary, the mutual information entropy between the candidate word and its adjacent candidate word in the text information is computed, and the candidate words whose mutual information entropy exceeds the preset mutual information entropy threshold are retained in the candidate dictionary.
When the mutual information entropy is computed in this step for each candidate word and its adjacent candidate word in the text information, the method further includes:
if the mutual information entropy between the candidate word and its adjacent candidate word in the text information is not greater than the preset mutual information entropy threshold, deleting the candidate word from the candidate dictionary.
Internal cohesion measures how tightly two words are bound together and is used to weigh the likelihood that the two words form a single word. The greater the internal cohesion, the more tightly the Chinese characters combine and the more likely they are to form a combined word. A threshold makes the final decision: when the internal cohesion exceeds the threshold, the two words are considered able to form a combined word.
Mutual information is commonly used to measure the interdependence of two signals and can be used to measure how tightly the parts of a two-element tuple or word are bound together. Mutual information is defined as:
$$MI(x, y) = \log_2 \frac{p(xy)}{p(x)\,p(y)}$$
where p(xy) is the probability that x and y occur together in the corpus, p(x) is the probability that x occurs on its own, and p(y) is the probability that y occurs on its own. When MI(x, y) >> 0, x and y are highly correlated: they often occur together, and the string xy is more likely to form a new word. When MI(x, y) = 0, x and y are distributed independently of each other. When MI(x, y) << 0, x and y are in complementary distribution. The larger the mutual information entropy, the greater the internal cohesion of the two-element tuple and the more likely the tuple is an unregistered word or part of one.
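As a minimal sketch, the measure can be estimated from corpus counts, assuming the usual pointwise form log2(p(xy) / (p(x) p(y))). The function name, the use of a single `total` as the denominator for all three probabilities, and the handling of zero counts are simplifying assumptions.

```python
import math


def mutual_information(count_xy, count_x, count_y, total):
    """Estimate log2(p(xy) / (p(x) * p(y))) from raw counts.

    count_xy: times x is immediately followed by y
    count_x, count_y: times x and y occur on their own
    total: number of positions used to normalise all three counts
    """
    if min(count_xy, count_x, count_y, total) <= 0:
        # Unseen events carry no evidence of association.
        return float("-inf")
    p_xy = count_xy / total
    p_x = count_x / total
    p_y = count_y / total
    return math.log2(p_xy / (p_x * p_y))
```

A pair that always occurs together scores well above zero, while a pair that co-occurs exactly as often as chance predicts scores zero, which is the boundary the preset threshold is compared against.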
Step 107: for each candidate word in the current candidate dictionary, the device computes the boundary information entropy of the candidate word and filters out the candidate words that cannot form a combined word with the corresponding boundary candidate word because the information entropy exceeds the preset boundary entropy threshold.
Boundary information entropy comprises left boundary information entropy and right boundary information entropy. Both are used to measure the boundary degree of freedom of a word and thereby determine its left and right boundaries.
Boundary degree of freedom refers to the number of distinct adjacent word types in the adjacency set of a string. The greater the boundary degree of freedom, the more varied the characters in the boundary set of the string, that is, the more complex the characters adjacent to it, and the more likely the edge of the string is a word boundary; and vice versa. The size of the boundary degree of freedom is usually measured with information entropy. Suppose X is a discrete random variable with value space R whose probability distribution is p(x) for x ∈ R. The information entropy of X is then:
$$H(X) = -\sum_{x \in R} p(x) \log_2 p(x)$$
Left information entropy and right information entropy measure the boundary degree of freedom of a word and thereby determine its left and right boundaries. The left boundary information entropy and the right boundary information entropy are computed separately by applying this definition to the words adjacent to the word on the left and on the right, respectively.
When the left boundary information entropy of a word exceeds the preset boundary entropy threshold, the word is not suitable as the tail of a combined word; when the right boundary information entropy of a word exceeds the preset boundary entropy threshold, the word is not suitable as the head of a combined word.
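One way to compute the two entropies is to count the words seen immediately to the left and right of the target word and take the entropy of each distribution. This is a sketch under assumptions: the function names are hypothetical, the input is a list of token segments, and words at a segment edge contribute no neighbour on that side.

```python
import math
from collections import Counter


def entropy(counts):
    """Shannon entropy (base 2) of a Counter of observed neighbours."""
    total = sum(counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def boundary_entropies(segments, word):
    """Return (left_entropy, right_entropy) of `word` over all segments."""
    left, right = Counter(), Counter()
    for segment in segments:
        for i, token in enumerate(segment):
            if token != word:
                continue
            if i > 0:
                left[segment[i - 1]] += 1
            if i + 1 < len(segment):
                right[segment[i + 1]] += 1
    return entropy(left), entropy(right)
```

A word that is always followed by the same neighbour has a right entropy of zero, which marks that side as bound rather than free.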
In this step, computing the boundary information entropy of the candidate word and filtering out the candidate words that cannot form a combined word with the corresponding boundary candidate word because the information entropy exceeds the preset boundary entropy threshold comprises:
A first candidate word forms a combined word with a second candidate word, and the second candidate word forms a combined word with a third candidate word; the left boundary information entropy and the right boundary information entropy of the second candidate word are computed.
When the left boundary information entropy of the second candidate word exceeds the preset boundary entropy threshold, the combined word formed by the first and second candidate words is not treated as a candidate unregistered word.
When the right boundary information entropy of the second candidate word exceeds the preset boundary entropy threshold, the combined word formed by the second and third candidate words is not treated as a candidate unregistered word.
When both the left and the right boundary information entropy of the second candidate word exceed the preset boundary entropy threshold, the second candidate word is deleted from the candidate dictionary.
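The three rules above reduce to two threshold comparisons on the middle word. A minimal sketch, with a hypothetical function name and result keys chosen for illustration:

```python
def boundary_decisions(left_entropy, right_entropy, threshold):
    """Apply the three rules to the second word of first+second+third."""
    left_free = left_entropy > threshold    # left side behaves like a word boundary
    right_free = right_entropy > threshold  # right side behaves like a word boundary
    return {
        "keep_first_second": not left_free,
        "keep_second_third": not right_free,
        "delete_second": left_free and right_free,
    }
```

For instance, `boundary_decisions(2.0, 0.5, 1.0)` rejects the first+second combination, keeps second+third as a candidate, and leaves the second word in the candidate dictionary.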
Suppose that after the above steps the candidate dictionary contains: "track"; "traffic"; "engineering"; "construction"; "bid award"; "announcement".
Step 108: the device filters, with the stop word set, the combined words that were not filtered out by boundary information entropy, adds the combined words that the stop word set does not filter out to the spare dictionary as unregistered words, and filters the spare dictionary with the stop word set.
The stop word set can be an existing one, or it can be built for the application at hand. For example, starting from the Harbin Institute of Technology stop word list, general bidding terms and common words can be added to form a specialized stop word dictionary.
In practical applications, the candidate dictionary can also be filtered with the stop word set first, before it is filtered with the segmentation marker library and the exclusion word set, to improve the efficiency of unregistered word identification.
The construction and update process of the segmentation dictionary is described in detail below with reference to the accompanying drawings.
The segmentation dictionary is designed for the maximum matching algorithm. A serious problem with maximum matching segmentation is that the maximum word length is difficult to determine. In addition, the construction project bidding field contains a large number of unregistered words that share the same suffix, following patterns such as "** station" and "** street". Therefore, the segmentation dictionary construction scheme provided by the present application adds word length and last-character information.
The segmentation dictionary consists of a first-character word-length index table, a last-character word-length index table, and the dictionary text. The first-character word-length index table contains three items: the first characters of all words in the dictionary text, the maximum length of the words in the dictionary text that begin with each first character, and the index of the words in the dictionary text that begin with that first character. The last-character word-length index table likewise contains three items: the last characters of all words in the dictionary text, the maximum length of the words in the dictionary text that end with each last character, and the index of the words in the dictionary text that end with that last character.
Referring to Fig. 2, Fig. 2 is a schematic diagram of the contents of the segmentation dictionary in an embodiment of the present application. In the first-character index table of Fig. 2, "capital" denotes a first character; "5" indicates that the maximum length of the words whose first character is "capital" is 5; values such as "1" and "3" are line numbers. The last-character index table is structured in the same way as the first-character index table.
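The dictionary layout described above can be sketched in Python as follows. This is only an illustration under assumptions: the dictionary text is taken to be a list with one word per line, line numbers start at 1, and the table layout (a dict with `max_len` and `lines`) and the function name are hypothetical rather than taken from Fig. 2.

```python
def build_index_tables(dictionary_text):
    """Build first-character and last-character word-length index tables.

    dictionary_text: list of words, one per line (assumed format).
    Each table maps a character to the maximum length of the words that
    start (or end) with it, plus the 1-based line numbers of those words.
    """
    first_index = {}
    last_index = {}
    for line_no, word in enumerate(dictionary_text, start=1):
        if not word:
            continue  # skip empty lines
        for table, key in ((first_index, word[0]), (last_index, word[-1])):
            entry = table.setdefault(key, {"max_len": 0, "lines": []})
            entry["max_len"] = max(entry["max_len"], len(word))
            entry["lines"].append(line_no)
    return first_index, last_index
```

With a made-up dictionary text such as `["北京站", "京沪高铁", "南京", "京广铁路线"]`, the first-character entry for "京" records a maximum word length of 5 and lines 2 and 4, which mirrors the kind of entry described for Fig. 2.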
In specific embodiments of the present application, the segmentation dictionary can be constructed with the above method; when the segmenter performs automatic Chinese word segmentation, the words recognized by named entity recognition are added to the constructed segmentation dictionary.
With a segmentation dictionary built according to the scheme provided by the present application, word lookup can be carried out as follows:
When the segmentation dictionary is to be used for segmentation, the text to be segmented is scanned first. For each character, the maximum length of the words that can begin with that character is found and compared, and the largest such value is taken as the maximum word length for this round of first-character matching. The reverse maximum matching algorithm is then applied: a judgment is made according to the first character, the maximum word length of the first character, and the last character, and the remaining character sequence is matched and segmented.
During string matching, for each character, the maximum word length for that character as a last character is determined first, and the corresponding first character is then located according to that maximum word length. If the maximum word length of the first character is not less than the maximum word length of the last character, it is preliminarily determined that the string formed by the first character, the last character and the characters between them may match. The characters between the last character and the first character are then checked one by one; when all of them match, the matched segment is determined.
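The lookup procedure above is close in spirit to reverse maximum matching with a per-character length bound. The sketch below is a simplified Python illustration, not the patent's exact procedure: it bounds the search window by the maximum word length of the current last character, prunes windows whose first character cannot start a word of that length, and uses a set lookup in place of the character-by-character comparison. All names and the sample vocabulary are hypothetical.

```python
def reverse_max_match(text, words):
    """Segment text from right to left using first/last-character length bounds."""
    word_set = {w for w in words if w}
    first_max, last_max = {}, {}
    for w in word_set:
        first_max[w[0]] = max(first_max.get(w[0], 0), len(w))
        last_max[w[-1]] = max(last_max.get(w[-1], 0), len(w))

    tokens = []
    end = len(text)
    while end > 0:
        # Longest word that can end with the current last character.
        limit = min(last_max.get(text[end - 1], 1), end)
        matched = text[end - 1]  # fall back to a single character
        for length in range(limit, 1, -1):
            start = end - length
            # Prune: the first character must be able to start a word this long.
            if first_max.get(text[start], 0) < length:
                continue
            if text[start:end] in word_set:
                matched = text[start:end]
                break
        tokens.append(matched)
        end -= len(matched)
    tokens.reverse()
    return tokens
```

For example, with the hypothetical vocabulary `["轨道", "交通", "轨道交通", "工程"]`, the text "轨道交通工程" is split into "轨道交通" and "工程", because the longest word ending at each position is tried first.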
The dictionaries in the embodiments of the present application include the constructed segmentation dictionary and the spare dictionary. The segmentation dictionary is the matching dictionary used during automatic word segmentation; the spare dictionary stores unregistered words while the unregistered-word dictionary is being dynamically updated, and can easily be loaded into the segmentation dictionary.
When the unregistered words in the spare dictionary are updated, the updated unregistered words are updated into the segmentation dictionary.
In a specific implementation, the segmentation dictionary stores most of the entries and is used specifically for matching segmentation. The segmentation dictionary can be updated in stages: the vocabulary identified in the spare dictionary is merged into the core dictionary in one batch once it has stabilized. This prevents temporarily erroneous unregistered-word identifications from affecting the accuracy of the segmentation results.
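One possible way to realize this staged update is sketched below in Python. The passage does not define when a word counts as "stable"; the sketch assumes, purely for illustration, that a word is stable once it has been identified in a configurable number of consecutive rounds. The class name, method names and the `stable_rounds` parameter are hypothetical.

```python
class DictionaryPair:
    """Segmentation (core) dictionary plus spare dictionary, with batch promotion."""

    def __init__(self, core_words=(), stable_rounds=3):
        self.segmentation = set(core_words)  # used for matching segmentation
        self.spare = {}                      # word -> consecutive rounds seen
        self.stable_rounds = stable_rounds   # assumed stability criterion

    def record_round(self, identified_words):
        """Record the unregistered words identified in one round."""
        identified = set(identified_words)
        # Words that were not identified again are treated as transient
        # errors and dropped (an assumption of this sketch).
        for word in list(self.spare):
            if word not in identified:
                del self.spare[word]
        for word in identified:
            if word not in self.segmentation:
                self.spare[word] = self.spare.get(word, 0) + 1

    def promote_stable(self):
        """Move stable words into the segmentation dictionary in one batch."""
        stable = [w for w, n in self.spare.items() if n >= self.stable_rounds]
        self.segmentation.update(stable)
        for word in stable:
            del self.spare[word]
        return sorted(stable)
```

Keeping new words out of the segmentation dictionary until they are promoted means that a one-off misidentification never influences segmentation results.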
In conclusion the application identifies unregistered word by using the method that rule and statistics are combined;Rule is by vocabulary During the domain knowledge in feature and calling for tenders of project field etc. is dissolved into identification unregistered word well, the method for statistics can be with It is preferable to capture statistical information, will occur frequent word in text message.The identifying schemes that rule and statistics are combined can carry High unknown word identification efficiency and accuracy.
A kind of scheme for building dictionary for word segmentation is given in the embodiment of the present application, the efficiency for searching dictionary can be improved.
The foregoing is merely illustrative of the preferred embodiments of the present invention, is not intended to limit the invention, all essences in the present invention God and any modification, equivalent substitution, improvement and etc. within principle, done, should be included within the scope of protection of the invention.

Claims (10)

  1. An unknown word identification method, characterized in that the method comprises:
    obtaining HyperText Markup Language (HTML) web page information and parsing it into text information;
    performing automatic Chinese word segmentation on the text information using a segmenter, and storing the words that fail named entity recognition in a candidate dictionary as candidate words;
    filtering the candidate dictionary using a segmentation marker library and an exclusion word set;
    for each candidate word in the current candidate dictionary, constructing combined words from the candidate word and its adjacent candidate words in the text information, and filtering them using a preceding non-word set, a following non-word set and configured part-of-speech combination rules;
    storing, as unregistered words in a spare dictionary, the unfiltered combined words whose frequency of occurrence in the text information is less than a preset combined-word frequency threshold; and deleting from the current candidate dictionary the candidate words corresponding to the combined words stored in the spare dictionary and to the combined words that were filtered out;
    for each candidate word in the current candidate dictionary, calculating the mutual information entropy between the candidate word and its adjacent candidate words in the text information, and retaining in the candidate dictionary the candidate words whose mutual information entropy is greater than a preset mutual information entropy threshold;
    for each candidate word in the current candidate dictionary, calculating the boundary information entropy of the candidate word, and filtering out the candidate words that cannot be combined with the corresponding boundary candidate word because the information entropy exceeds a preset boundary entropy threshold;
    filtering, using a stop word set, the combined words not filtered out by the boundary information entropy; and adding the combined words not filtered out by the stop word set to the spare dictionary as unregistered words.
  2. The method according to claim 1, characterized in that, for each candidate word in the current candidate dictionary, constructing combined words from the candidate word and its adjacent candidate words in the text information, and filtering them using the preceding non-word set, the following non-word set and the configured part-of-speech combination rules, comprises:
    for each candidate word, constructing binary, ternary and quaternary combined words respectively;
    for a constructed binary combined word, filtering the candidate word located at the suffix of the combined word using the preceding non-word set, and filtering using the corresponding binary part-of-speech combination rule;
    for a ternary combined word, filtering each of the two candidate words other than the prefix of the combined word using the preceding non-word set; filtering the candidate word located at the suffix of the combined word using the following non-word set; and filtering using the corresponding ternary part-of-speech combination rule;
    for a quaternary combined word, filtering each of the three candidate words other than the prefix of the combined word using the preceding non-word set, and filtering using the corresponding ternary part-of-speech combination rule;
    wherein the configured part-of-speech combination rules include: binary part-of-speech combination rules, ternary part-of-speech combination rules and quaternary part-of-speech combination rules.
  3. The method according to claim 2, characterized in that the method further comprises:
    if a binary combined word remaining after filtering is part of a ternary or quaternary combined word remaining after filtering, filtering out the binary combined word; if a ternary combined word remaining after filtering is part of a quaternary combined word remaining after filtering, filtering out the ternary combined word.
  4. The method according to claim 1, characterized in that, when calculating, for each candidate word in the current candidate dictionary, the mutual information entropy between the candidate word and its adjacent candidate words in the text information, the method further comprises:
    if the mutual information entropy between the candidate word and its adjacent candidate word in the text information is not greater than the preset mutual information entropy threshold, deleting the candidate word from the candidate dictionary.
  5. The method according to claim 1, characterized in that the boundary degree of freedom of the candidate word is calculated using information entropy, and filtering out the candidate words whose boundary information entropy exceeds the preset boundary entropy threshold comprises:
    a first candidate word forms a combined word with a second candidate word; the second candidate word forms a combined word with a third candidate word; calculating the left boundary information entropy and the right boundary information entropy of the second candidate word;
    when the left boundary information entropy of the second candidate word is greater than the preset boundary entropy threshold, the combined word formed by the first candidate word and the second candidate word is not taken as a candidate unregistered word;
    when the right boundary information entropy of the second candidate word is greater than the preset boundary entropy threshold, the combined word formed by the second candidate word and the third candidate word is not taken as a candidate unregistered word;
    when both the left boundary information entropy and the right boundary information entropy of the second candidate word are greater than the preset boundary entropy threshold, deleting the second candidate word from the candidate dictionary.
  6. The method according to claim 1, characterized in that, after performing automatic Chinese word segmentation on the text information using the segmenter and storing the words that fail named entity recognition in the candidate dictionary as candidate words, and before filtering the candidate dictionary using the segmentation marker library and the exclusion word set, the method further comprises:
    filtering the candidate dictionary using the stop word set.
  7. The method according to any one of claims 1-6, characterized in that the method further comprises:
    constructing a segmentation dictionary;
    when the segmenter is used to perform automatic Chinese word segmentation, adding the words recognized by named entity recognition to the constructed segmentation dictionary.
  8. The method according to claim 7, characterized in that constructing the segmentation dictionary comprises:
    the segmentation dictionary includes: a first-character word-length index table, a last-character word-length index table, and the dictionary text; wherein the first-character word-length index table includes: the first characters of all words in the segmentation dictionary text, the maximum length of the words in the dictionary text that begin with each first character, and the index of the words in the dictionary text that begin with that first character; the last-character word-length index table includes: the last characters of all words in the domain dictionary text, the maximum length of the words in the dictionary text that end with each last character, and the index of the words in the dictionary text that end with that last character.
  9. The method according to claim 8, characterized in that the method further comprises:
    when the segmentation dictionary is to be used for segmentation, first scanning the text to be segmented; for each character, finding and comparing the maximum length of the words that can begin with that character, and taking the largest such value as the maximum word length for this round of first-character matching; then applying the reverse maximum matching algorithm, making a judgment according to the first character, the maximum word length of the first character, and the last character, and matching and segmenting the remaining character sequence.
  10. The method according to claim 9, characterized in that the method further comprises:
    when the unregistered words in the spare dictionary are updated, updating the updated unregistered words into the segmentation dictionary.
CN201710003573.3A 2017-01-04 2017-01-04 A kind of unknown word identification method Pending CN108268440A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710003573.3A CN108268440A (en) 2017-01-04 2017-01-04 A kind of unknown word identification method

Publications (1)

Publication Number Publication Date
CN108268440A true CN108268440A (en) 2018-07-10

Family

ID=62771561

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710003573.3A Pending CN108268440A (en) 2017-01-04 2017-01-04 A kind of unknown word identification method

Country Status (1)

Country Link
CN (1) CN108268440A (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090326927A1 (en) * 2008-06-27 2009-12-31 Microsoft Corporation Adaptive generation of out-of-dictionary personalized long words
CN102169496A (en) * 2011-04-12 2011-08-31 清华大学 Anchor text analysis-based automatic domain term generating method
CN102929873A (en) * 2011-08-08 2013-02-13 腾讯科技(深圳)有限公司 Method and device for extracting searching value terms based on context search
CN103049501A (en) * 2012-12-11 2013-04-17 上海大学 Chinese domain term recognition method based on mutual information and conditional random field model
CN103092966A (en) * 2013-01-23 2013-05-08 盘古文化传播有限公司 Vocabulary mining method and device
CN103235774A (en) * 2013-04-27 2013-08-07 杭州电子科技大学 Extraction method of feature words of science and technology project application form
CN103309852A (en) * 2013-06-14 2013-09-18 瑞达信息安全产业股份有限公司 Method for discovering compound words in specific field based on statistics and rules
CN103778243A (en) * 2014-02-11 2014-05-07 北京信息科技大学 Domain term extraction method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张锋 等: "基于互信息的中文术语抽取系统", 《计算机应用研究》 *
徐亮: "中文新词识别研究", 《中国优秀硕士学位论文全文数据库信息科技辑》 *
魏莎莎: "一种中文未登录词识别及词典设计新方法", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109492224A (en) * 2018-11-07 2019-03-19 北京金山数字娱乐科技有限公司 A kind of method and device of vocabulary building
CN109492224B (en) * 2018-11-07 2024-05-03 北京金山数字娱乐科技有限公司 Vocabulary construction method and device
CN110442685A (en) * 2019-08-14 2019-11-12 杭州品茗安控信息技术股份有限公司 Data extending method, apparatus, equipment and the storage medium of architectural discipline dictionary
CN112800760A (en) * 2019-11-14 2021-05-14 云拓科技有限公司 Device for automatically determining the location of a claim element and its associated element
CN111353020A (en) * 2020-02-27 2020-06-30 北京奇艺世纪科技有限公司 Method, device, computer equipment and storage medium for mining text data
CN111353020B (en) * 2020-02-27 2023-06-30 北京奇艺世纪科技有限公司 Method, device, computer equipment and storage medium for mining text data
WO2021258739A1 (en) * 2020-06-22 2021-12-30 中国标准化研究院 Method for automatically identifying word repetition error
CN112199943A (en) * 2020-09-24 2021-01-08 东北大学 Unknown word recognition method based on maximum agglomeration coefficient and boundary entropy
CN112199943B (en) * 2020-09-24 2023-10-03 东北大学 Unknown word recognition method based on maximum condensation coefficient and boundary entropy
CN113157903A (en) * 2020-12-28 2021-07-23 国网浙江省电力有限公司信息通信分公司 Multi-field-oriented electric power word stock construction method
CN113468879A (en) * 2021-07-16 2021-10-01 上海明略人工智能(集团)有限公司 Method, system, electronic device and medium for judging unknown words
CN113435199A (en) * 2021-07-18 2021-09-24 谢勇 Storage and reading interference method and system for character corresponding culture

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180710