CN110032738A

CN110032738A - Microblogging text normalization method based on context graph random walk and phonetic-stroke code

Info

Publication number: CN110032738A
Application number: CN201910305628.5A
Authority: CN
Inventors: Not disclosed (the inventor requested that the name be withheld)
Original assignee: Zhongsen Yunchain (Chengdu) Technology Co Ltd
Current assignee: Zhongsen Yunchain (Chengdu) Technology Co Ltd
Priority date: 2019-04-16
Filing date: 2019-04-16
Publication date: 2019-07-19

Abstract

The invention provides a microblog text normalization method based on context graph random walk and phonetic-stroke code, belonging to the technical field of social media text content analysis and mining within computer technology. The method comprises: identifying non-standard words and extracting word contexts; constructing a context graph and performing random walks on it to obtain a context-based normalization candidate set; using the phonetic-stroke codes of Chinese characters to obtain a normalization candidate set based on pronunciation and shape; and processing the two candidate sets to obtain the final normalization result. The method overcomes the shortcoming of conventional methods, which do not fully consider the pronunciation and shape of Chinese characters. Social media text differs from written text such as news: it is full of non-standard abbreviations, homophones, and similar-looking characters, which makes natural language processing tools perform poorly on microblog text. The invention therefore combines phonetic-stroke codes with contextual understanding, so that natural language processing tools can be used for analysis and mining once the text has been normalized.

Description

Microblogging text normalization method based on context graph random walk and phonetic-stroke code

Technical field

The invention belongs to the field of computer technology, and specifically relates to a microblog text normalization method based on context graph random walk and phonetic-stroke code.

Background technique

With the spread of social networks, new users keep joining, and each social platform generates tens of thousands of pieces of text every day. Because it is immediate, short, and fast-spreading, microblogging has become one of the most important social network platforms today. It has also become an important medium through which people obtain news and current affairs, communicate, express themselves, share with society, and take part in public life. Microblog data therefore has great research value. However, microblog text contains a large number of non-standard words, so existing natural language processing tools perform poorly when applied to it directly. Normalizing the non-standard words in microblog text would improve, to some extent, the results of related natural language processing research.

In recent years, existing work has proposed a variety of normalization methods for English text, but applying these methods to Chinese text runs into problems. For example, computing the literal similarity between a non-standard word and a standard word from the longest common substring ratio and the edit distance works reasonably well for English text but is poorly suited to Chinese. Existing normalization methods therefore cannot meet the needs of Chinese microblog text normalization.

The present invention proposes a microblog text normalization method based on context graph random walk and phonetic-stroke code. To improve Chinese microblog text normalization, the invention considers two aspects. First, on top of the context graph random walk, it introduces the phonetic-stroke code method, which captures the similarity in pronunciation and shape between non-standard words and standard words and so takes the characteristics of the Chinese language into account. Second, it modifies the original phonetic-stroke code so that the code better matches the way microblog text is written and better accomplishes the microblog text normalization task.

Summary of the invention

The purpose of the present invention is to provide a microblog text normalization method based on context graph random walk and phonetic-stroke code. By introducing the phonetic-stroke code method, the invention calculates the similarity in pronunciation and shape between non-standard words and standard words, so that the final normalization result is more accurate.

The present invention is a microblog text normalization method based on context graph random walk and phonetic-stroke code, comprising the following steps:

Step 1: perform word segmentation on the Chinese microblog text.

Step 2: identify the non-standard words in the microblog text using a standard dictionary, and extract the context of each word.

Step 3: construct a context graph from the words, their corresponding contexts, and the co-occurrence counts of words and contexts.

Step 4: perform random walks on the context graph to obtain the context-based normalization candidate set for each non-standard word.

Step 5: derive the phonetic-stroke code of each word from the phonetic-stroke codes of its individual Chinese characters.

Step 6: for each non-standard word, extract the feature vector of its phonetic-stroke code, input it into the phonetic-stroke code model, and output the feature vector of the predicted standard word corresponding to the non-standard word.

Step 7: compare with the feature vectors of the standard words in the standard dictionary and find the k words whose feature vectors are closest to that of the predicted standard word. These are the top-K standard words with the highest similarity in pronunciation and shape, and they form the normalization candidate sequence based on pronunciation and shape.

Step 8: process the two normalization candidate sets and output the top N standard words as the normalization result for the corresponding non-standard word.

In step 2, identifying the non-standard words in the microblog text using the standard dictionary and extracting the context of each word specifically means: comparing the words obtained after segmentation with the words in the standard dictionary, identifying the non-standard words and standard words in the Chinese microblog text, and finding the contexts of both, in preparation for building the context graph in the next step.

Here, the invention defines the context of each word as the word sequence formed by the one word immediately before it and the one word immediately after it.
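Step 2 can be illustrated with a minimal Python sketch. It assumes the text has already been segmented into a token list; the function name `extract_contexts`, the boundary placeholders `<s>` and `</s>`, and the toy dictionary are hypothetical and not taken from the patent. The sample sentence uses the slang word 神马 in place of 什么 ("what").

```python
def extract_contexts(tokens, standard_dictionary):
    """Label each token as standard or not and pair it with its context.

    The context is the (previous word, next word) pair. Padding the sentence
    boundaries with "<s>" and "</s>" is an assumption of this sketch.
    """
    results = []
    for i, word in enumerate(tokens):
        prev_word = tokens[i - 1] if i > 0 else "<s>"
        next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
        is_standard = word in standard_dictionary
        results.append((word, is_standard, (prev_word, next_word)))
    return results


# Hypothetical segmented post and toy standard dictionary.
tokens = ["这", "是", "神马", "东西"]
standard_dictionary = {"这", "是", "什么", "东西"}

labelled = extract_contexts(tokens, standard_dictionary)
non_standard = [word for word, is_standard, _ in labelled if not is_standard]
```

Any word missing from the dictionary is treated as non-standard here, which follows the dictionary comparison described above.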

In step 3, constructing the context graph specifically means: using the words, their corresponding contexts, and the co-occurrence counts of words and contexts to construct a context graph G (W, C, E).

Here W contains all nodes representing standard words and non-standard words, C contains all nodes representing contexts, and E represents the edges that connect word nodes to context nodes. The weight of an edge is the number of times the word and the context co-occur.
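The bipartite graph G (W, C, E) can be sketched as two nested count maps, one per direction. This is only an illustration: the function name `build_context_graph`, the boundary placeholders, and the three toy sentences are assumptions made for the example.

```python
from collections import defaultdict


def build_context_graph(sentences):
    """Build word -> context -> count and context -> word -> count maps.

    Each edge weight is the co-occurrence count of a word and its
    (previous word, next word) context.
    """
    word_to_ctx = defaultdict(lambda: defaultdict(int))
    ctx_to_word = defaultdict(lambda: defaultdict(int))
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Boundary placeholders are an assumption of this sketch.
            prev_word = tokens[i - 1] if i > 0 else "<s>"
            next_word = tokens[i + 1] if i + 1 < len(tokens) else "</s>"
            context = (prev_word, next_word)
            word_to_ctx[word][context] += 1
            ctx_to_word[context][word] += 1
    return word_to_ctx, ctx_to_word


# Hypothetical segmented posts.
sentences = [
    ["这", "是", "神马", "东西"],
    ["这", "是", "什么", "东西"],
    ["那", "是", "什么", "东西"],
]
word_to_ctx, ctx_to_word = build_context_graph(sentences)
shared = ("是", "东西")
```

In this toy corpus the context `("是", "东西")` links the non-standard word 神马 (weight 1) and the standard word 什么 (weight 2), which is the kind of shared context a random walk can traverse.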

In step 4, performing random walks on the context graph to obtain the context-based normalization candidate set for each non-standard word specifically means: for each non-standard word, performing K random walks on the context graph to obtain that word's context-based normalization candidate set.

Each random walk starts from the node N_i of a non-standard word and then moves, with probability P_ij, to any context node M_j connected to it. The transition between each pair of nodes is governed by the transition probability P_ij. The transition probability between any two nodes N_i and M_j is defined as:

$$P_{ij} = \frac{W_{ij}}{\sum_{k} W_{ik}}$$

where N_i denotes a non-standard word node, M_j denotes a context node, P_ij denotes the transition probability between nodes N_i and M_j, W_ij denotes the weight of the edge between N_i and M_j, and W_ik denotes the weight of the edge between N_i and any context node M_k connected to it.
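As a small worked example, the transition probabilities of one node are its edge weights divided by their sum. The sketch below uses the edge weights 1, 1 and 3 that appear in the embodiment's walk-through; the function name and the context labels `ctx_a`, `ctx_b`, `ctx_c` are hypothetical.

```python
def transition_probabilities(edge_weights):
    """Normalize the edge weights of one node into transition probabilities."""
    total = sum(edge_weights.values())
    return {neighbor: weight / total for neighbor, weight in edge_weights.items()}


# Edge weights from a non-standard word node to its three context nodes.
weights = {"ctx_a": 1, "ctx_b": 1, "ctx_c": 3}
probs = transition_probabilities(weights)
```

With these weights the probabilities come out as 0.2, 0.2 and 0.6, matching the values quoted in the embodiment.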

The independent random walk is repeated K times, traversing the bipartite graph randomly according to the transition probability distribution. For any random walk, the number of steps taken between two nodes is called the hitting time. The hitting time between the non-standard word node and standard word node pair (n, m) in the r-th random walk is therefore h_r(n, m). The cost between two nodes is defined as the average hitting time H (n, m) over all random walks that connect them:

$$H(n, m) = \frac{1}{R} \sum_{r=1}^{R} h_r(n, m)$$

where H (n, m) denotes the average hitting time of node pair (n, m) over all random walks, h_r(n, m) denotes the hitting time of node pair (n, m) in the r-th random walk, and R denotes the number of random walks that connect node pair (n, m).

The contextual similarity of a non-standard word node and standard word node pair (n, m) is defined as L (n, m): the relative frequency of the average hitting time H (n, m) of the two nodes among all the standard word nodes connected with the non-standard word. L (n, m) is therefore calculated as follows:

$$L(n, m) = \frac{H(n, m)}{\sum_{i} H(n, m_i)}$$

where L (n, m) denotes the contextual similarity of node pair (n, m) and the sum runs over all standard word nodes m_i connected with the non-standard word n.

The contextual similarity between each non-standard word and its corresponding standard words is calculated and sorted, giving the normalization candidate sequence based on contextual similarity.
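The two quantities can be computed from a list of walk outcomes as in the sketch below. It reads "relative frequency" as H (n, m) divided by the sum of H over every standard word reached from the same non-standard word, which is an interpretation of the definition above. The function name, the placeholder words `m1` and `m2`, and the step counts are hypothetical. The sort direction is not stated in the text, so the sketch stops at computing H and L.

```python
from collections import defaultdict


def contextual_similarity(walks):
    """Compute H and L for one non-standard word.

    walks: list of (standard_word, steps) pairs, one per successful walk.
    Returns (H, L), where H[m] is the average hitting time and L[m] is
    H[m] divided by the sum of H over all standard words reached.
    """
    totals = defaultdict(int)
    counts = defaultdict(int)
    for word, steps in walks:
        totals[word] += steps
        counts[word] += 1
    H = {word: totals[word] / counts[word] for word in totals}
    norm = sum(H.values())
    L = {word: H[word] / norm for word in H}
    return H, L


# Hypothetical outcomes of four successful walks from one non-standard word.
walks = [("m1", 2), ("m1", 2), ("m1", 4), ("m2", 4)]
H, L = contextual_similarity(walks)
```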

In step 5, deriving the phonetic-stroke code of a word from the phonetic-stroke codes of its individual Chinese characters specifically means: calculating the phonetic-stroke code of each Chinese character according to the modified phonetic-stroke code structure, and then, from the codes of the individual characters, obtaining the phonetic-stroke code of each non-standard word and standard word.

Here, the phonetic-stroke code of a single Chinese character is a 1 × 10 vector. Words are assumed to consist of at most four characters, so each word is represented as a 4 × 10 matrix. If a word has fewer than four characters, the matrix is zero-padded at the end.
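The 4 × 10 word representation with trailing zero padding can be sketched as follows. The per-character codes here are invented placeholder digits, not real phonetic-stroke codes, and the function name `word_code_matrix` is hypothetical.

```python
CODE_LENGTH = 10  # each character code is a 1 x 10 vector
MAX_CHARS = 4     # words are assumed to have at most four characters


def word_code_matrix(word, char_codes):
    """Stack per-character codes into a 4 x 10 matrix, zero-padded at the end."""
    if len(word) > MAX_CHARS:
        raise ValueError("word longer than four characters")
    rows = [list(char_codes[ch]) for ch in word]
    for row in rows:
        if len(row) != CODE_LENGTH:
            raise ValueError("each character code must have ten entries")
    # Pad short words with all-zero rows so every word has the same shape.
    rows += [[0] * CODE_LENGTH for _ in range(MAX_CHARS - len(word))]
    return rows


# Placeholder codes: the digits are invented for illustration only.
char_codes = {
    "神": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    "马": [2, 3, 4, 5, 6, 7, 8, 9, 1, 2],
}
matrix = word_code_matrix("神马", char_codes)
```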

In step 6, extracting the feature vector of the phonetic-stroke code of each non-standard word, inputting it into the phonetic-stroke code model, and outputting the feature vector of the predicted standard word specifically means: for each non-standard word, multiplying its phonetic-stroke code by a weight matrix to extract a feature vector (a 1 × 4 vector), inputting that vector into the phonetic-stroke code model, and outputting the feature vector of the predicted standard word corresponding to the non-standard word.
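The feature extraction step can be read as a weighted sum over each row of the 4 × 10 code matrix, giving one value per character and therefore a 1 × 4 vector. The sketch below follows that reading; the weight values and code digits are placeholders, since the actual weights are given only in a figure, and the phonetic-stroke code model itself is not sketched because the text does not describe it.

```python
def extract_features(code_matrix, weights):
    """Weighted sum of each row: a 4 x 10 matrix becomes a 1 x 4 feature vector."""
    return [sum(value * w for value, w in zip(row, weights)) for row in code_matrix]


# Hypothetical weights, one per code position (not the patent's values).
weights = [0.4, 0.4, 0.1, 0.1, 0.2, 0.1, 0.1, 0.1, 0.1, 0.1]

# Placeholder 4 x 10 code matrix for a two-character word, zero-padded.
code_matrix = [
    [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    [2, 3, 4, 5, 6, 7, 8, 9, 1, 2],
    [0] * 10,
    [0] * 10,
]
features = extract_features(code_matrix, weights)
```

Zero-padded rows contribute a feature value of zero, so short words remain comparable with four-character words.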

In step 7, the comparison specifically means: comparing the feature vector of the predicted standard word obtained in step 6 for each non-standard word with the feature vectors of the standard words in the standard dictionary, and finding the k words whose feature vectors are closest to it. These are the top-K standard words with the highest similarity in pronunciation and shape, and they form the normalization candidate sequence based on pronunciation and shape.

In step 8, processing the two normalization candidate sets and outputting the top N standard words as the normalization result specifically means: for each non-standard word, re-ranking its context-based normalization candidate set together with its candidate set based on pronunciation and shape, and outputting the top N standard words (Top N) as the normalization result for that non-standard word.
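The text does not say how the two candidate sets are re-ranked. Purely as an illustration, the sketch below merges two ranked lists by summed reciprocal rank, a common rank-fusion heuristic. This scoring rule is an assumption, not the patent's method, and the function name and candidate labels are hypothetical.

```python
def rerank(context_candidates, phonetic_candidates, top_n):
    """Merge two ranked candidate lists and return the top N words.

    Scoring by summed reciprocal rank is an assumption of this sketch;
    the source does not specify the re-ranking rule.
    """
    scores = {}
    for ranked in (context_candidates, phonetic_candidates):
        for rank, word in enumerate(ranked, start=1):
            scores[word] = scores.get(word, 0.0) + 1.0 / rank
    # Higher score first; ties broken alphabetically for determinism.
    ordered = sorted(scores, key=lambda w: (-scores[w], w))
    return ordered[:top_n]


# Hypothetical ranked candidates for one non-standard word.
context_candidates = ["a", "b", "c"]
phonetic_candidates = ["b", "d", "a"]
top_two = rerank(context_candidates, phonetic_candidates, top_n=2)
```

A candidate that appears near the top of both lists scores highest, which reflects the intent of combining contextual evidence with pronunciation and shape evidence.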

Brief description of the drawings

Fig. 1 is a flow diagram of the invention.

Fig. 2 is a flow chart of the random walk.

Fig. 3 is a structure diagram of the modified phonetic-stroke code.

Fig. 4 shows the phonetic-stroke code representation of a word.

Fig. 5 shows the weight settings used in feature extraction.

Fig. 6 shows the feature representation of a word.

Specific embodiment

The present invention is a microblog text normalization method based on context graph random walk and phonetic-stroke code. Its overall flow is shown in Fig. 1 and comprises the following steps:

Step 1: perform word segmentation on the Chinese microblog text.

In step 1, performing word segmentation on the Chinese microblog text specifically means: using a word segmentation tool to segment the Chinese microblog text and obtain the words it contains, in preparation for identifying non-standard words in the next step.

Here, the invention defines the context of each word as the word sequence formed by the one word before it and the one word after it. In Fig. 2, for example, one context of the word "神马" (shenma, internet slang for 什么, "what") is "inquiring into topic".

W contains all nodes representing standard words and non-standard words, C contains all nodes representing contexts, and E represents the edges that connect word nodes to context nodes. The weight of an edge is the number of times the word and the context co-occur. In Fig. 2, the white nodes on the left represent the contexts, such as "being thing", of the words "神马" and "什么" ("what"). On the right, the grey word node represents the non-standard word "神马" and the white word node represents the standard word "什么". The weight 1 on the edge connecting "神马" and "being thing" indicates that "being 神马 thing" occurs once in the text.

Each random walk starts from the node N_i of a non-standard word and then moves, with probability P_ij, to any context node M_j connected to it. The transition between each pair of nodes is governed by the transition probability P_ij. The transition probability between any two nodes N_i and M_j is defined as:

$$P_{ij} = \frac{W_{ij}}{\sum_{k} W_{ik}}$$

Take Fig. 2 as an example. Step (1): starting from the non-standard word "神马" (shenma, internet slang for 什么, "what"), find its context nodes "being thing", "inquiring into topic", and "for meeting", whose edge weights are (1, 1, 3). From the transition probability P_ij, the probabilities of moving from "神马" to each context node are [0.2, 0.2, 0.6]. A random number is then generated. Suppose it is 0.61221: it falls in the interval (0.4, 1.0], so the index value is 2 and the selected context node is "for meeting". Step (2): starting from the context node "for meeting", find the word nodes connected to it, which are "神马" and "什么" ("what"). Because the invention does not allow a random walk to return to the previous node, the index value is 1 and the word node "什么" is selected. Step (3): two checks are made, namely whether the word node is a standard word and whether the number of steps has reached the maximum value S, which the invention sets to S = 4. If the word node is a standard word node or the number of steps is 4, this walk stops. If the word is not a standard word and the number of steps has not reached 4, the walk continues by repeating steps (1), (2), and (3). In Fig. 2, "什么" is a standard word, so the walk stops, having found the standard word "什么" for "神马".
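The walk-through above can be simulated with a short Python sketch. The toy graph, the context labels `c1` to `c3`, and the function name `random_walk` are hypothetical. Weighting the context-to-word move by edge weight, as well as the word-to-context move, is an assumption, since the text defines the transition probability only for the latter. Each move between nodes counts as one step, and the walk never steps straight back to the node it just left.

```python
import random


def random_walk(start, word_to_ctx, ctx_to_word, standard_words, rng, max_steps=4):
    """Run one walk from a non-standard word.

    Returns (standard_word, steps) if a standard word is reached within
    max_steps moves, otherwise None.
    """
    current, previous, steps = start, None, 0
    on_word_side = True
    while steps < max_steps:
        graph = word_to_ctx if on_word_side else ctx_to_word
        neighbors = dict(graph.get(current, {}))
        neighbors.pop(previous, None)  # do not return to the previous node
        if not neighbors:
            return None
        nxt = rng.choices(list(neighbors), weights=list(neighbors.values()))[0]
        previous, current = current, nxt
        steps += 1
        on_word_side = not on_word_side
        if on_word_side and current in standard_words:
            return current, steps
    return None


# Toy bipartite graph: the edge weights from "神马" are 1, 1 and 3.
word_to_ctx = {"神马": {"c1": 1, "c2": 1, "c3": 3}, "什么": {"c1": 2, "c3": 5}}
ctx_to_word = {
    "c1": {"神马": 1, "什么": 2},
    "c2": {"神马": 1},
    "c3": {"神马": 3, "什么": 5},
}
rng = random.Random(0)
results = [
    random_walk("神马", word_to_ctx, ctx_to_word, {"什么"}, rng) for _ in range(200)
]
hits = [r for r in results if r is not None]
```

In this toy graph a walk that picks `c2` dead-ends, because the only neighbor is the node it came from, while a walk through `c1` or `c3` reaches "什么" in two steps.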

Here, L (n, m) denotes the contextual similarity of node pair (n, m), as defined in the summary above.

As shown in Fig. 3, the modified phonetic-stroke code structure is divided into two parts. The first part is the sound code, which represents the pinyin of the Chinese character and contains the initial, the final, and the auxiliary final. The second part is the shape code, which represents the shape of the Chinese character and contains the structure, the four-corner code, and the stroke count.

Here, the phonetic-stroke code of a single Chinese character is a 1 × 10 vector. Words are assumed to consist of at most four characters, so each word is represented as a 4 × 10 matrix, zero-padded at the end if the word has fewer than four characters. In Fig. 4, for example, the phonetic-stroke code of the word "鸭梨山大" (yali shan da, slang for being under great pressure) is the 4 × 10 matrix formed by the phonetic-stroke codes of its four characters "鸭" (duck), "梨" (pear), "山" (mountain), and "大" (big).

Features are extracted from a word according to the weight settings in Fig. 5. In Fig. 6, for example, the phonetic-stroke code representation of the word "鸭梨山大" is multiplied by the weights, giving the feature vector [6.86 4.77 9.33 2.34] for "鸭梨山大".

In step 7, the comparison with the feature vectors of the standard words in the standard dictionary uses a k-d tree algorithm to find the k words nearest to the feature vector of the predicted standard word. Specifically, the feature vector of the predicted standard word obtained in step 6 for each non-standard word is compared with the feature vectors of the standard words in the standard dictionary, and the k closest words are found. These are the top-K standard words with the highest similarity in pronunciation and shape, and they form the normalization candidate sequence based on pronunciation and shape.
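The nearest-neighbor lookup can be sketched with a brute-force search from the Python standard library. This is a stand-in, not a k-d tree: a k-d tree returns the same k neighbors but answers queries faster on a large dictionary. Euclidean distance is an assumption, as the text only says "distance". The dictionary entries `w1` to `w3` and their feature values are hypothetical; the query vector is the feature vector quoted in the embodiment.

```python
import heapq
import math


def nearest_standard_words(predicted, dictionary_features, k):
    """Return the k dictionary words whose feature vectors are closest to `predicted`."""
    return heapq.nsmallest(
        k,
        dictionary_features,
        key=lambda word: math.dist(predicted, dictionary_features[word]),
    )


# Hypothetical 1 x 4 feature vectors for three dictionary words.
dictionary_features = {
    "w1": [6.9, 4.8, 9.3, 2.3],
    "w2": [1.0, 1.0, 1.0, 1.0],
    "w3": [6.0, 5.0, 9.0, 2.0],
}
predicted = [6.86, 4.77, 9.33, 2.34]
candidates = nearest_standard_words(predicted, dictionary_features, k=2)
```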

The method of the invention has the following beneficial effects. First, on top of the context graph random walk, it introduces the phonetic-stroke code method, which captures the similarity in pronunciation and shape between non-standard words and standard words and so takes the characteristics of the Chinese language into account. Second, it modifies the original phonetic-stroke code so that the code better matches the way microblog text is written and better accomplishes the microblog text normalization task.

The microblog text normalization method based on context graph random walk and phonetic-stroke code provided by the embodiments of the invention has been described in detail above. The principles and implementation of the invention are set out herein, and the description of the embodiments is intended only to help in understanding the method of the invention and its core idea.

Claims

1. A microblog text normalization method based on context graph random walk and phonetic-stroke code, characterized in that the method is applied to Chinese microblog text normalization and comprises the following steps:

Step 1: performing word segmentation on the Chinese microblog text;

Step 2: identifying the non-standard words in the microblog text using a standard dictionary, and extracting the context of each word;

Step 3: constructing a context graph from the words, their corresponding contexts, and the co-occurrence counts of words and contexts;

Step 4: performing random walks on the context graph to obtain the context-based normalization candidate set for each non-standard word;

Step 5: deriving the phonetic-stroke code of each word from the phonetic-stroke codes of its individual Chinese characters;

Step 6: for each non-standard word, extracting the feature vector of its phonetic-stroke code, inputting it into the phonetic-stroke code model, and outputting the feature vector of the predicted standard word corresponding to the non-standard word;

Step 7: comparing with the feature vectors of the standard words in the standard dictionary and finding the k words whose feature vectors are closest to that of the predicted standard word, these being the top-K standard words with the highest similarity in pronunciation and shape, to obtain the normalization candidate sequence based on pronunciation and shape;

Step 8: processing the two normalization candidate sets and outputting the top N standard words as the normalization result for the corresponding non-standard word.

2. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 1, performing word segmentation on the Chinese microblog text specifically means using a word segmentation tool to segment the Chinese microblog text and obtain the words it contains, in preparation for identifying non-standard words in the next step.

3. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 2, identifying the non-standard words in the microblog text using the standard dictionary and extracting the context of each word specifically means comparing the words obtained after segmentation with the words in the standard dictionary, identifying the non-standard words and standard words in the Chinese microblog text, and finding the contexts of both, in preparation for building the context graph in the next step.

4. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 3, constructing the context graph specifically means using the words, their corresponding contexts, and the co-occurrence counts of words and contexts to construct a context graph G (W, C, E).

Here W contains all nodes representing standard words and non-standard words, C contains all nodes representing contexts, and E represents the edges that connect word nodes to context nodes, the weight of an edge being the number of times the word and the context co-occur.

5. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 4, performing random walks on the context graph to obtain the context-based normalization candidate set for each non-standard word specifically means, for each non-standard word, performing K random walks on the context graph to obtain that word's context-based normalization candidate set.

Each random walk starts from the node N_i of a non-standard word and then moves, with probability P_ij, to any context node M_j connected to it. The transition between each pair of nodes is governed by the transition probability P_ij, and the transition probability between any two nodes N_i and M_j is defined as:

$$P_{ij} = \frac{W_{ij}}{\sum_{k} W_{ik}}$$

where N_i denotes a non-standard word node, M_j denotes a context node, P_ij denotes the transition probability between nodes N_i and M_j, W_ij denotes the weight of the edge between N_i and M_j, and W_ik denotes the weight of the edge between N_i and any context node M_k connected to it.

The independent random walk is repeated K times, traversing the bipartite graph randomly according to the transition probability distribution. For any random walk, the number of steps taken between two nodes is called the hitting time, so the hitting time between the non-standard word node and standard word node pair (n, m) in the r-th random walk is h_r(n, m). The cost between two nodes is defined as the average hitting time H (n, m) over all random walks that connect them:

$$H(n, m) = \frac{1}{R} \sum_{r=1}^{R} h_r(n, m)$$

where H (n, m) denotes the average hitting time of node pair (n, m) over all random walks, h_r(n, m) denotes the hitting time of node pair (n, m) in the r-th random walk, and R denotes the number of random walks that connect node pair (n, m).

The contextual similarity of a non-standard word node and standard word node pair (n, m) is defined as L (n, m): the relative frequency of the average hitting time H (n, m) of the two nodes among all the standard word nodes connected with the non-standard word. L (n, m) is therefore calculated as follows:

$$L(n, m) = \frac{H(n, m)}{\sum_{i} H(n, m_i)}$$

where L (n, m) denotes the contextual similarity of node pair (n, m) and the sum runs over all standard word nodes m_i connected with the non-standard word n.

The contextual similarity between each non-standard word and its corresponding standard words is calculated and sorted, giving the normalization candidate sequence based on contextual similarity.

6. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 5, deriving the phonetic-stroke code of a word from the phonetic-stroke codes of its individual Chinese characters specifically means calculating the phonetic-stroke code of each Chinese character according to the modified phonetic-stroke code structure, and then, from the codes of the individual characters, obtaining the phonetic-stroke code of each non-standard word and standard word.

Here, the phonetic-stroke code of a single Chinese character is a 1 × 10 vector. Words are assumed to consist of at most four characters, so each word is represented as a 4 × 10 matrix. If a word has fewer than four characters, the matrix is zero-padded at the end.

7. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 6, extracting the feature vector of the phonetic-stroke code of each non-standard word, inputting it into the phonetic-stroke code model, and outputting the feature vector of the predicted standard word specifically means, for each non-standard word, multiplying its phonetic-stroke code by a weight matrix to extract a feature vector (a 1 × 4 vector), inputting that vector into the phonetic-stroke code model, and outputting the feature vector of the predicted standard word corresponding to the non-standard word.

8. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 7, the comparison specifically means comparing the feature vector of the predicted standard word obtained in step 6 for each non-standard word with the feature vectors of the standard words in the standard dictionary, and finding the k words whose feature vectors are closest to it, these being the top-K standard words with the highest similarity in pronunciation and shape, to obtain the normalization candidate sequence based on pronunciation and shape.

9. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 8, processing the two normalization candidate sets and outputting the top N standard words as the normalization result specifically means, for each non-standard word, re-ranking its context-based normalization candidate set together with its candidate set based on pronunciation and shape, and outputting the top N standard words (Top N) as the normalization result for that non-standard word.