CN107577668A

CN107577668A - Semantics-based method for correcting non-standard words in social media

Info

Publication number: CN107577668A
Application number: CN201710829908.7A
Authority: CN
Inventors: 费高雷; 郑夏; 李元磊; 胡光岷
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2017-09-15
Filing date: 2017-09-15
Publication date: 2018-01-12

Abstract

The present invention discloses a semantics-based method for correcting non-standard words in social media. To address the current lack of effective methods for detecting and identifying non-standard words, the application uses the minimum edit distance to characterize the lexical form of a word and the cosine distance between word vectors to characterize the semantic similarity of words; the two measures jointly select the expected correct word that replaces a non-standard word. The method is further combined with the existing tools PyEnchant and PyTypo, which greatly narrows the range of words whose semantic relatedness must be compared and thereby speeds up word normalization.

Description

Semantics-based method for correcting non-standard words in social media

Technical field

The invention belongs to the field of data mining, and in particular relates to a technique for detecting and identifying non-standard words.

Background technology

With the rise of Web 2.0, the Internet model has shifted from websites built by professionals to websites built with the participation of all users. While this makes the form more democratic, it also means that a large amount of low-cost, low-quality user-generated information is produced. Social media is an important platform on which users publish and spread information; it lets users share their lives and ideas without restriction on place or time. Twitter is a social networking site with a global audience, and one of its characteristics is that a tweet is limited in its number of characters. This leads users to express their views with shorter, more convenient abbreviations or Internet slang, which produces a large number of non-standard words and hampers researchers' subsequent analysis of tweets. Correcting non-standard words is therefore very important.

Professor Zhang Yangsen has pointed out that two kinds of word errors mainly occur in English text: non-word errors and real-word errors. A non-word error is a non-standard spelling that cannot be found in a dictionary; a real-word error is a word that can be found in a dictionary but does not fit the context, that is, a grammatical-error type of word. The present invention addresses only non-word errors. Non-standard words arising from non-word errors are generally misspelled words, elongated words and meaningless words.

As for the detection and identification of non-standard words, there are currently few effective methods. The most commonly used is dictionary lookup: the dictionary is traversed to search for a word that matches the word to be identified; if a match is found the word is judged to be a standard word, otherwise it is judged to be a non-standard word. Some researchers identify non-standard words by searching N-gram tables, that is, by traversing existing N-gram tables and counting occurrences; a word whose frequency is below a certain threshold is judged to be a non-standard word.
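The frequency-threshold idea can be sketched as follows. This is only an illustration under stated assumptions: it uses a unigram table (the simplest N-gram table), and the function name `find_rare_words`, the threshold `min_count` and the toy corpus are invented here rather than taken from any cited work.

```python
from collections import Counter

def find_rare_words(words, corpus_tokens, min_count=2):
    """Return the words whose corpus frequency is below min_count."""
    counts = Counter(corpus_tokens)  # unigram table built from the corpus
    return [word for word in words if counts[word] < min_count]

# Toy corpus, invented for illustration only.
corpus = "see you tomorrow see you soon tomorrow is fine".split()
rare = find_rare_words(["tomorrow", "you", "tmrw"], corpus)
print(rare)
```

A word that never occurs in the corpus has a count of zero, so it always falls below any positive threshold.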

As for the correction of non-standard words, there are some fairly effective methods and studies, which have been applied in commercial products such as search engines and input methods. The most commonly used include the minimum edit distance method, stemming, statistical methods, rule induction and dictionary construction.

Before natural language can be handed to a machine learning algorithm, the language must first be expressed mathematically, and the most common approach is to represent each word as a word vector. A word vector is a good way to represent the semantics of a word. The one-hot representation is the most intuitive: each word is represented as a very long vector in which only one dimension has the value 1, denoting the current word, and all other dimensions are 0. Although simple, this method suffers from the curse of dimensionality, so researchers later proposed the distributed representation, which expresses a word as a low-dimensional real-valued vector. Models currently used to obtain word vectors include: (1) the LSA matrix factorization model; (2) the PLSA probabilistic latent semantic analysis model; (3) the Word2Vector model; (4) the GloVe model.
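The difference between the two representations can be illustrated with a small sketch. The vocabulary and the two-dimensional "distributed" vectors below are made-up values chosen for illustration; they are not the output of any trained model.

```python
def one_hot(word, vocabulary):
    """Return the one-hot vector of `word` over an ordered vocabulary."""
    return [1 if entry == word else 0 for entry in vocabulary]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

vocabulary = ["good", "great", "table"]  # toy vocabulary
good = one_hot("good", vocabulary)
great = one_hot("great", vocabulary)

# Distinct one-hot vectors are always orthogonal, so they carry no similarity.
print(dot(good, great))

# Hypothetical low-dimensional vectors: related words can point the same way.
dense = {"good": [0.9, 0.1], "great": [0.8, 0.2], "table": [0.1, 0.9]}
print(dot(dense["good"], dense["great"]) > dot(dense["good"], dense["table"]))
```

The one-hot vector also grows with the vocabulary, whereas the dimension of a distributed vector is fixed in advance.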

Social media such as Twitter continually produces large amounts of meaningless noise and repetitive, redundant information, such as users' chat and retweets. To make social media data easier for researchers to analyze, denoising and normalizing the text is necessary. With the development of natural language processing and deeper research into word normalization and semantic representation, many normalization systems for plain text have appeared in recent years, but these conventional methods are mostly limited to the lexical form of words, which greatly reduces their effectiveness on Twitter.

Summary of the invention

To solve the above technical problem, the application proposes a semantics-based method for correcting non-standard words in social media. On top of conventional spelling correction, the semantic information of non-standard words is added as a further factor, which remedies the difficulty the minimum edit distance method has with non-standard words whose lexical form differs greatly from the correct word.

The technical solution adopted by the present invention is a semantics-based method for correcting non-standard words in social media, comprising:

S1, construction of semantic information: the word vector of each word is obtained using the GloVe model, and the distance between any two word vectors is calculated;

S2, identification of non-standard words: the tweets are preprocessed to obtain a list of all words; each word in the list is compared with the words in the dictionary set; a word in the list that is matched successfully is a standard word, otherwise it is a non-standard word;

S3, for each non-standard word identified in step S2, the words corresponding to the N word vectors at the smallest distances from its word vector are found; the standard words among these words are identified, and the standard word whose vector is at the smallest distance from the vector of the non-standard word is selected to replace it.

Further, the method for calculating the distance between any two word vectors is: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or cosine angle.

Further, the preprocessing in step S2 specifically comprises:

A1, filtering tweet noise, the tweet noise including invalid characters and garbled text;

A2, extracting the words that contain a topic and the words that contain a user name, to obtain topic words and user names;

A3, removing duplicate words;

A4, segmenting English words using all non-letter characters as delimiters.

Further, the dictionary set in step S2 comprises at least: the commonly used English dictionaries, and the topic words and user names obtained by the preprocessing.

Further, step S3 also comprises:

B1, representing lexical similarity by the minimum edit distance d between words;

B2, setting a word-sense parameter α to represent the quality of the word vectors, and multiplying the distance l between word vectors by α to represent semantic similarity;

B3, calculating, according to the following formula, the standard word most closely related to the non-standard word, and correcting the non-standard word with that standard word;

S(ω₁,ω₂) = d + β × l × α

where S(ω₁,ω₂) denotes the closeness of the relation between two words, a smaller value of S(ω₁,ω₂) indicating a closer relation, and β denotes the semantic weight.

A semantics-based method for correcting non-standard words in social media, comprising:

S3, for the non-standard words identified in step S2, processing each non-standard word to be corrected with PyEnchant and PyTypo to obtain a corresponding list of correction suggestions;

traversing the list of correction suggestions and calculating the minimum edit distance between each suggested word in the list and the non-standard word to be corrected, to obtain the lexical similarity; calculating the distance between the word vector of each suggested word and the word vector of the non-standard word to be corrected, to obtain the semantic similarity;

combining the lexical similarity and the semantic similarity according to the following formula to calculate a score for each suggested word with respect to the non-standard word to be corrected; for each non-standard word to be corrected, re-sorting the list of correction suggestions by score from low to high and replacing the non-standard word with the suggested word at the front of the list; if the list of correction suggestions is empty, filtering out the non-standard word directly;

S(ω₁,ω₂) = d + β × l × α

where S(ω₁,ω₂) denotes the closeness of the relation between two words, a smaller value of S(ω₁,ω₂) indicating a closer relation, β denotes the semantic weight, α is the word-sense parameter, and l is the distance between word vectors.

Further, the preprocessing in step S2 specifically comprises:

A3, removing duplicate words;

Beneficial effects of the present invention: the semantics-based method for correcting non-standard words in social media uses the minimum edit distance to characterize the lexical form of a word and the cosine distance between word vectors to characterize the semantic similarity of words, the two jointly selecting the expected correct word to replace a non-standard word. It achieves the following beneficial effects:

1. The existing PyEnchant tool is used to obtain a preliminary list of spelling-correction suggestions, which greatly reduces the amount of computation, speeds up obtaining the set of expected correct words, and helps locate the best replacement quickly;

2. When the correct word is determined, not only the usual lexical similarity is considered but also the fact that word vectors can characterize the semantic information of words; the cosine distance characterizes the degree of semantic similarity between word vectors, so the criterion for the closeness between a correct word and a non-standard word is designed more comprehensively, the probability that the selected correct word is the ideal expected word is greatly increased, and the correction is more effective;

3. The different types of non-standard word are all taken into account: elongated words, on which many common methods correct poorly, are handled with PyTypo; meaningless words are filtered out and then judged manually before being added to the dictionary set; and the interference of topic words and user names with the correction process is considered during tweet preprocessing, these words being collected and added to the dictionary set so that it is more complete.

Brief description of the drawings

Fig. 1 is a flow chart of the scheme of the application;

Fig. 2 is a flow chart of non-standard word identification;

Fig. 3 is a flow chart of word correction.

Detailed description of the embodiments

To help those skilled in the art understand the technical content of the present invention, the invention is further explained below with reference to the accompanying drawings.

Fig. 1 shows the flow chart of the scheme of the application. The technical solution of the application is a semantics-based method for correcting non-standard words in social media, comprising:

GloVe encodes the ratio of the co-occurrence probabilities of two words as a vector difference in the vector space. For example, the larger the value of P(k|i)/P(k|j), the closer the semantics of word k are to word i; conversely, the closer the value of P(k|i)/P(k|j) is to 0, the closer the semantics of word k are to word j.

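Define P_ij as the probability that word j appears in the context of word i:

$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i} \tag{1}$$

where X_ij is the number of times word j appears in the context of word i, and X_i is the sum of X_ik over all words k.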

Define F(ω_i, ω_j, ω_k) as the ratio of the co-occurrence probability of word i with word k to the co-occurrence probability of word j with word k:

$$F(\omega_i, \omega_j, \omega_k) = \frac{P_{ik}}{P_{jk}} \tag{2}$$

where ω_i, ω_j, ω_k denote the word vectors of words i, j and k respectively, each a real-valued vector in a d-dimensional space whose dimension is set in advance.
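The behaviour of the co-occurrence probability ratio can be checked on a toy example. The counts below are invented for illustration, and the helper names `context_probability` and `probability_ratio` are hypothetical; the sketch assumes that P_ik is obtained by dividing the count X_ik by the row total X_i.

```python
# Toy co-occurrence counts X[i][k]: how often word k appears in the
# context of word i. The numbers are invented for illustration.
X = {
    "ice":   {"solid": 8, "gas": 1, "water": 6},
    "steam": {"solid": 1, "gas": 8, "water": 6},
}

def context_probability(i, k, counts):
    """P_ik: the count X_ik divided by the row total X_i."""
    row = counts[i]
    return row[k] / sum(row.values())

def probability_ratio(i, j, k, counts):
    """P_ik / P_jk, the quantity GloVe encodes as a vector difference."""
    return context_probability(i, k, counts) / context_probability(j, k, counts)

for k in ("solid", "gas", "water"):
    print(k, round(probability_ratio("ice", "steam", k, X), 3))
```

A ratio well above 1 indicates that word k is closer to word i, a ratio near 0 that it is closer to word j, and a ratio near 1 that k does not discriminate between the two.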

However, since the aim of GloVe is to encode the ratio of word co-occurrence probabilities as a vector difference, the above formula is revised to:

$$F(\omega_i - \omega_j, \omega_k) = \frac{P_{ik}}{P_{jk}} \tag{3}$$

In the above formula, the arguments on the left-hand side are vectors in the vector space, whereas the right-hand side is a real number derived from the co-occurrence matrix counted over the corpus. To simplify the formula, the inner product of the vector ω_k with the vector difference ω_i − ω_j converts the argument on the left-hand side from a vector into a real number:

$$F\left((\omega_i - \omega_j)^{T} \omega_k\right) = \frac{P_{ik}}{P_{jk}} \tag{4}$$

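To satisfy the symmetry of the co-occurrence matrix, the left-hand side of the above equation is transformed as follows:

$$F\left((\omega_i - \omega_j)^{T} \omega_k\right) = \frac{F(\omega_i^{T} \omega_k)}{F(\omega_j^{T} \omega_k)} \tag{5}$$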

This gives the following equality:

$$F(\omega_i^{T} \omega_k) = P_{ik} = \frac{X_{ik}}{X_i} \tag{6}$$

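Solving the equation, with F taken as the exponential function, gives:

$$\omega_i^{T} \omega_k = \log P_{ik} = \log X_{ik} - \log X_i \tag{7}$$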

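The above formula is transformed again and bias terms are introduced; this is required by the properties of the logarithmic function and keeps the expression from diverging:

$$\omega_i^{T} \omega_k + b_i + b_k = \log X_{ik} \tag{8}$$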

For the above formula to hold, a least-squares fit is optimized and a weighting function f(X_ij) is introduced, which finally gives the objective function:

$$J = \sum_{i,j} f(X_{ij}) \left(\omega_i^{T} \omega_j + b_i + b_j - \log X_{ij}\right)^{2} \tag{9}$$

where f(X_ij) depends on X_ij: when the co-occurrence count between two words is low, the value of f(X_ij) should be low, so as to reduce the influence of such pairs on the objective function; when the co-occurrence count is very high, f(X_ij) should be held constant once X_ij reaches a certain threshold, so that overall training is not affected.
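One weighting function with the two properties just described is the piecewise power function used in the published GloVe model. The sketch below assumes that form, with the threshold `x_max = 100` and the exponent `0.75` taken from the GloVe paper rather than from this application; the exponent is unrelated to the word-sense parameter α used later, and `pair_loss` is a hypothetical helper showing the weighted squared error of a single word pair.

```python
import math

def weight(x, x_max=100.0, exponent=0.75):
    """Low for rare pairs, capped at 1.0 once x reaches the threshold x_max."""
    return (x / x_max) ** exponent if x < x_max else 1.0

def pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """Weighted squared error contributed by one co-occurring pair (x_ij > 0)."""
    inner = sum(a * b for a, b in zip(w_i, w_j))
    return weight(x_ij) * (inner + b_i + b_j - math.log(x_ij)) ** 2

for count in (1, 10, 50, 100, 1000):
    print(count, round(weight(count), 3))
```

Summing `pair_loss` over all pairs with a non-zero co-occurrence count gives a weighted least-squares objective of the kind described above.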

Common ways of measuring the distance between vectors include the Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance and cosine angle. The most widely used are the Euclidean distance and the cosine angle. After analysis and experiment, this embodiment measures vector distance by calculating the cosine angle. The calculation formula is as follows:

$$\cos\theta = \frac{\omega_1 \cdot \omega_2}{\lVert \omega_1 \rVert \, \lVert \omega_2 \rVert} \tag{10}$$
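A minimal sketch of the cosine measure follows. Because the later scoring treats a smaller value as a closer relation, the sketch assumes the distance is taken as one minus the cosine of the angle; that convention, and the function names, are assumptions made here for illustration.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Assumed convention: 0 for the same direction, 1 for orthogonal vectors."""
    return 1.0 - cosine_similarity(u, v)

print(cosine_distance([1.0, 2.0], [2.0, 4.0]))
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))
```

The measure depends only on direction, so scaling a word vector does not change its distance to other vectors.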

As shown in Fig. 2, a large amount of tweet information is obtained through the Twitter API, the raw text data of the tweets is extracted from the database, and certain preprocessing is first carried out:

1. filter text noise such as invalid characters and garbled text;

2. extract the words containing "#" (topics) and the user names containing "@" (mentions) as reserved material;

3. remove duplicates;

4. segment the text into words;

5. filter out symbols, Arabic numerals and the like.

When processing words that contain digits, the present invention segments English words using all non-letter characters as delimiters. Because Arabic numerals are filtered out and the segmentation is made more aggressive, compound words and words containing digits are effectively decomposed, which can itself be regarded as a form of normalization. The present invention uses the open-source Python library WordSegment to decompose compound words.
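The preprocessing steps can be sketched with regular expressions from the Python standard library. This is an illustrative approximation rather than the implementation used in the application: the function name `preprocess_tweet` is hypothetical, and the choices to lowercase tokens and to take topic words and user names out of the text before segmentation are assumptions. Decomposition of compound words with WordSegment is not shown.

```python
import re

TAG_PATTERN = re.compile(r"[#@]\w+")     # "#topic" and "@user" tokens
NON_LETTERS = re.compile(r"[^A-Za-z]+")  # every run of non-letter characters

def preprocess_tweet(text):
    """Return (words, reserved) for one tweet."""
    # Step 2: set topic words and user names aside as reserved material.
    reserved = sorted(set(TAG_PATTERN.findall(text)))
    # Assumption: reserved tokens are removed from the text before segmentation.
    remaining = TAG_PATTERN.sub(" ", text)
    # Steps 1, 4 and 5: splitting on non-letter runs segments the text and,
    # as a side effect, drops digits, symbols and non-ASCII debris.
    tokens = [t.lower() for t in NON_LETTERS.split(remaining) if t]
    # Step 3: remove duplicates while keeping first-seen order.
    words = list(dict.fromkeys(tokens))
    return words, reserved

words, reserved = preprocess_tweet(
    "@bob I loooove #NLP sooo much, gr8 day!! I loooove it")
print(words)
print(reserved)
```

Note how "gr8" is reduced to "gr" once the digit acts as a delimiter; such fragments are then left for the identification and correction steps.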

Preprocessing yields a list of all words, and each word in the list is then judged to be standard or not. As mentioned above, there is currently no good method for identifying such words, so only the most basic method of traversing a dictionary is used. Each word in the word list is compared with each word in the dictionary; if the same word is matched in the dictionary, the word under test is considered a standard word, otherwise it is considered a non-standard word. The dictionary set in the present invention contains not only several commonly used English dictionaries collected online but also the topic words and user names obtained during preprocessing. Given the particular nature of social media, where new Internet slang and rare words keep appearing, the present invention also improves the dictionary set by manually judging whether words are standard. Commonly used English dictionaries include the Longman dictionary, the Oxford dictionary and the Collins dictionary.

The list of non-standard words is finally obtained and passed on to the subsequent correction step.
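The identification step can be sketched as a membership test. The text describes traversing the dictionary; the sketch below uses a Python set, which gives the same result with a faster lookup. The function name `split_by_dictionary`, the case-insensitive comparison and the tiny dictionary are assumptions made for illustration.

```python
def split_by_dictionary(words, dictionary, reserved=()):
    """Return (standard, non_standard) lists, preserving input order."""
    known = {entry.lower() for entry in dictionary}
    # Topic words and user names from preprocessing also count as known.
    known.update(entry.lower() for entry in reserved)
    standard, non_standard = [], []
    for word in words:
        (standard if word.lower() in known else non_standard).append(word)
    return standard, non_standard

dictionary = {"i", "much", "day", "it"}  # stand-in for real English dictionaries
reserved = {"#NLP", "@bob"}
standard, non_standard = split_by_dictionary(
    ["i", "loooove", "sooo", "much", "gr", "day", "it", "#nlp"],
    dictionary, reserved)
print(standard)
print(non_standard)
```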

S3, the non-standard words identified in step S2 are corrected; the application proposes two schemes:

Scheme one: the words corresponding to the N word vectors at the smallest distances from the word vector of the non-standard word are found; the standard words among these are identified, and the standard word whose vector is at the smallest distance from the vector of the non-standard word is selected to replace it. Specifically:

Compound words and words containing digits have already been normalized in the preprocessing above; misspelled words, elongated words and meaningless words remain untreated. The present invention handles meaningless words by filtering. Misspelled words and elongated words are corrected using the cosine distance between word vectors described above: for each incorrect word, the words whose word vectors are closest to its word vector in cosine distance are found, and the non-standard word is replaced with the correct word among them.
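The nearest-neighbour search of scheme one can be sketched as follows. The word vectors are made-up two-dimensional values rather than trained GloVe vectors, the function name `correct_by_neighbours` is hypothetical, and the cosine distance is assumed to be one minus the cosine of the angle.

```python
import math

def cosine_distance(u, v):
    """Assumed convention: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def correct_by_neighbours(word, vectors, dictionary, n=3):
    """Return the closest standard word among the n nearest vectors, or None."""
    target = vectors[word]
    ranked = sorted(
        (cosine_distance(target, vec), other)
        for other, vec in vectors.items() if other != word)
    for _, candidate in ranked[:n]:
        if candidate in dictionary:
            return candidate
    return None

# Made-up vectors: "2morrow" is nearest to "tmrw" but is itself non-standard.
toy_vectors = {
    "tmrw": [0.90, 0.10],
    "2morrow": [0.91, 0.10],
    "tomorrow": [0.88, 0.12],
    "banana": [0.10, 0.90],
}
toy_dictionary = {"tomorrow", "banana"}
print(correct_by_neighbours("tmrw", toy_vectors, toy_dictionary))
```

The choice of N matters: if N is too small, the neighbourhood may contain only other non-standard spellings and no replacement is found.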

However, selecting the corresponding correct word directly by cosine distance alone does not work very well. Similarity between words can be divided into lexical similarity and semantic similarity. In traditional word correction, the minimum edit distance is an effective measure of the lexical similarity between words; the minimum edit distance between words is denoted d. As described above, the cosine distance between word vectors represents the semantic similarity of words, and the semantic similarity is expressed as the cosine distance l between word vectors multiplied by a parameter α. The parameter α can be adjusted manually and represents the quality of the word vectors.

The following scoring formula is obtained:

S(ω₁,ω₂) = d + β × l × α (11)

β in the formula denotes the semantic weight. The lower the score between two words, the closer the relation between them. Because the parameter α and the semantic weight β are related, h = α × β is set to simplify the formula, and the scoring formula becomes:

S(ω₁,ω₂) = d + h × l (12)
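The scoring can be sketched with a standard dynamic-programming edit distance (Levenshtein distance with unit costs, which is an assumption about the exact variant meant). The cosine distances and the value of h in the example are invented for illustration.

```python
def edit_distance(a, b):
    """Minimum number of single-character insertions, deletions and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

def score(d, l, h=1.0):
    """S = d + h * l; a lower score means a closer relation."""
    return d + h * l

# Invented cosine distances from "tmrw": "term" is lexically nearer,
# "tomorrow" is semantically nearer.
s_term = score(edit_distance("tmrw", "term"), l=0.9, h=5.0)
s_tomorrow = score(edit_distance("tmrw", "tomorrow"), l=0.1, h=5.0)
print(s_term, s_tomorrow)
```

With these values the semantic term outweighs the larger edit distance of "tomorrow", which is the situation the combined score is meant to handle; with h = 0 the score reduces to the edit distance alone.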

Using the above formula, the correct word with the lowest score with respect to the non-standard word is found and substituted for it, achieving the correction. The formula not only takes lexical features into account and effectively handles the problem of a large lexical difference between the non-standard word and the correct word, but also incorporates semantic information; the effect of inaccurate lexical semantic information on the correction can be avoided by adjusting the parameter.

Scheme two: if the formula alone is applied directly, the search range is too large, so the workload increases sharply and the processing takes too long. Python has the open-source library PyEnchant, which processes incorrect words and returns a list of correction suggestions, and PyTypo is good at processing elongated words directly.

As shown in Fig. 3, the non-standard word to be corrected is first processed with PyEnchant and PyTypo to obtain the corresponding list of correction suggestions. The list is then traversed: the minimum edit distance between each suggested word and the word to be corrected is calculated to obtain the lexical similarity, and the cosine distance between their word vectors is calculated to obtain the semantic similarity. Formula (12) combines the lexical similarity and the semantic similarity into a score. The list of correction suggestions is re-sorted by score from low to high, so that the lower a suggested word's score, the nearer the front it is placed, giving a new list of correction suggestions. The word at the front of each new list replaces the non-standard word, achieving the correction. If the list of correction suggestions is empty, the word to be corrected is taken to be a meaningless word; the present invention filters it out directly, and it is later judged manually whether such meaningless words can be treated as correct words and, if so, they are added to the dictionary set.
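The re-ranking step can be sketched as below. To stay self-contained, the suggestion list is passed in as a plain Python list; in practice it could come from PyEnchant, for example `enchant.Dict("en_US").suggest(word)`. The function names, the toy vectors, the value of h and the fallback distance of 1.0 for words without a vector are all assumptions made for illustration.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance with unit costs."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1, current[j - 1] + 1,
                               previous[j - 1] + cost))
        previous = current
    return previous[-1]

def cosine_distance(u, v):
    """Assumed convention: 1 minus the cosine of the angle between u and v."""
    dot = sum(a * b for a, b in zip(u, v))
    norms = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norms

def rerank(word, suggestions, vectors, h=5.0):
    """Sort suggestions by S = d + h * l, lowest score first."""
    def score(candidate):
        d = edit_distance(word, candidate)
        if word in vectors and candidate in vectors:
            l = cosine_distance(vectors[word], vectors[candidate])
        else:
            l = 1.0  # assumption: a missing vector is treated as unrelated
        return d + h * l
    return sorted(suggestions, key=score)

def correct(word, suggestions, vectors, h=5.0):
    """Return the best suggestion, or None so the caller can filter the word."""
    ranked = rerank(word, suggestions, vectors, h)
    return ranked[0] if ranked else None

toy_vectors = {"tmrw": [0.90, 0.10], "tomorrow": [0.88, 0.12], "term": [0.10, 0.90]}
print(correct("tmrw", ["term", "tomorrow"], toy_vectors))
```

Restricting the search to the suggestion list means the score is computed for a handful of candidates instead of the whole vocabulary, which is the speed benefit this scheme aims for.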

PyEnchant is a powerful open-source Python toolkit based on the Enchant library, used mainly for word checking and spelling correction. It returns a list of correction suggestions, usually made up of several correct words, in which the words ranked nearest the front are the corrections the system recommends most.

PyTypo is an open-source Python library, a toolkit designed specifically for processing non-standard words (elongated words in particular) in tweets. It not only uses the dictionary-construction method but also adds a function that removes repeated substrings, so that elongated words are processed effectively. Its correction of misspelled words and meaningless words, however, is not ideal.
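The idea of removing repeated substrings and then consulting a dictionary can be illustrated with a regular expression. This sketch does not use PyTypo and does not claim to reproduce its implementation; the function names, the rule of shortening runs of three or more repeats, and the toy dictionary are hypothetical.

```python
import re

# A substring repeated three or more times in a row, e.g. "ooooo" or "hahaha".
REPEATS = re.compile(r"(.+?)\1{2,}")

def collapse_candidates(word):
    """Return candidate spellings with long repeats cut to two copies, then one."""
    double = REPEATS.sub(r"\1\1", word)
    single = REPEATS.sub(r"\1", word)
    return [double] if double == single else [double, single]

def normalise_elongated(word, dictionary):
    """Return the first candidate found in the dictionary, or None."""
    for candidate in collapse_candidates(word):
        if candidate in dictionary:
            return candidate
    return None

toy_dictionary = {"so", "good", "haha"}
for elongated in ("sooooo", "goooood", "hahahaha", "xyzzzz"):
    print(elongated, normalise_elongated(elongated, toy_dictionary))
```

Trying the two-copy form before the one-copy form keeps legitimate double letters, as in "good", from being collapsed too far.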

The existing PyEnchant tool is used to obtain a preliminary list of spelling-correction suggestions, which greatly reduces the amount of computation, speeds up obtaining the set of expected correct words, and helps locate the best replacement quickly;

When the correct word is determined, not only the usual lexical similarity is considered but also the fact that word vectors can characterize the semantic information of words; the cosine distance characterizes the degree of semantic similarity between word vectors, so the criterion for the closeness between a correct word and a non-standard word is designed more comprehensively, the probability that the selected correct word is the ideal expected word is greatly increased, and the correction is more effective. The different types of non-standard word are all taken into account: elongated words, on which many common methods correct poorly, are handled with PyTypo; meaningless words are filtered out and then judged manually before being added to the dictionary set; and the interference of topic words and user names with the correction process is considered during tweet preprocessing, these words being collected and added to the dictionary set so that it is more complete.

Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principle of the present invention, and it should be understood that the scope of protection of the present invention is not limited to such specific statements and embodiments. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent substitution or improvement made within the spirit and principle of the present invention shall fall within the scope of the claims of the present invention.

Claims

1. A semantics-based method for correcting non-standard words in social media, characterized by comprising:

S1, construction of semantic information: the word vector of each word is obtained using the GloVe model, and the distance between any two word vectors is calculated;

S2, identification of non-standard words: the tweets are preprocessed to obtain a list of all words; each word in the list is compared with the words in the dictionary set; a word in the list that is matched successfully is a standard word, otherwise it is a non-standard word;

S3, for each non-standard word identified in step S2, the words corresponding to the N word vectors at the smallest distances from its word vector are found; the standard words among these words are identified, and the standard word whose vector is at the smallest distance from the vector of the non-standard word is selected to replace it.

2. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that the method for calculating the distance between any two word vectors is: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or cosine angle.

3. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that the preprocessing in step S2 specifically comprises:

A3, removing duplicate words;

4. The semantics-based method for correcting non-standard words in social media according to claim 3, characterized in that the dictionary set in step S2 comprises at least: the commonly used English dictionaries, and the topic words and user names obtained by the preprocessing.

5. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that step S3 also comprises:

B2, setting a word-sense parameter α to represent the quality of the word vectors, and multiplying the distance l between word vectors by α to represent semantic similarity;

B3, calculating, according to the following formula, the standard word most closely related to the non-standard word, and correcting the non-standard word with that standard word;

S(ω₁,ω₂) = d + β × l × α

where S(ω₁,ω₂) denotes the closeness of the relation between two words, a smaller value of S(ω₁,ω₂) indicating a closer relation, and β denotes the semantic weight.

6. A semantics-based method for correcting non-standard words in social media, characterized by comprising:

S3, for the non-standard words identified in step S2, processing each non-standard word to be corrected with PyEnchant and PyTypo to obtain a corresponding list of correction suggestions;

traversing the list of correction suggestions and calculating the minimum edit distance between each suggested word in the list and the non-standard word to be corrected, to obtain the lexical similarity; calculating the distance between the word vector of each suggested word and the word vector of the non-standard word to be corrected, to obtain the semantic similarity;

combining the lexical similarity and the semantic similarity according to the following formula to calculate a score for each suggested word with respect to the non-standard word to be corrected; for each non-standard word to be corrected, re-sorting the list of correction suggestions by score from low to high and replacing the non-standard word with the suggested word at the front of the list; if the list of correction suggestions is empty, filtering out the non-standard word directly;

S(ω₁,ω₂) = d + β × l × α

where S(ω₁,ω₂) denotes the closeness of the relation between two words, a smaller value of S(ω₁,ω₂) indicating a closer relation, β denotes the semantic weight, α is the word-sense parameter, and l is the distance between word vectors.

7. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the method for calculating the distance between any two word vectors is: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or cosine angle.

8. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the preprocessing in step S2 specifically comprises:

A3, removing duplicate words;

9. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the dictionary set in step S2 comprises at least: the commonly used English dictionaries, and the topic words and user names obtained by the preprocessing.