CN110032738A - Microblogging text normalization method based on context graph random walk and phonetic-stroke code - Google Patents


Info

Publication number
CN110032738A
Authority
CN
China
Prior art keywords
word
context
standard
phonetic
modular
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910305628.5A
Other languages
Chinese (zh)
Inventor
Inventor name not disclosed
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongsen Yunchain (chengdu) Technology Co Ltd
Original Assignee
Zhongsen Yunchain (chengdu) Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongsen Yunchain (chengdu) Technology Co Ltd
Priority to CN201910305628.5A
Publication of CN110032738A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Economics (AREA)
  • Human Resources & Organizations (AREA)
  • Computing Systems (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Primary Health Care (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a microblogging text normalization method based on context graph random walk and phonetic-stroke code, belonging to the technical field of social media text content analysis and mining within computer technology. The method comprises: identifying non-standard words and extracting word contexts; constructing a context graph and performing random walks on it to obtain a context-based normalization candidate set; using the phonetic-stroke code of Chinese characters to obtain a normalization candidate set based on pronunciation and glyph shape; and processing the two candidate sets to obtain the final normalization result. The method overcomes the failure of conventional methods to fully consider the pronunciation and shape of Chinese characters. Social media text differs from written text such as news: it is full of non-standard abbreviations, homophones, and visually similar words, so natural language processing tools perform poorly on microblog text. The invention therefore proposes a microblog text normalization method that combines the phonetic-stroke code with context understanding, which makes it possible to analyze and mine the normalized text with natural language processing tools.

Description

Microblogging text normalization method based on context graph random walk and phonetic-stroke code
Technical field
The invention belongs to the field of computer technology, and specifically relates to a microblogging text normalization method based on context graph random walk and phonetic-stroke code.
Background art
As social networks become more widespread, new users keep joining them, and every social platform generates tens of thousands of text items each day. Because microblogs are immediate, short, and fast to spread, they have become one of the most important social network platforms today. They have also become an important medium through which people obtain news and current affairs, communicate with one another, express themselves, share with society, and take part in public life. Microblog data therefore has great research value. However, microblog text contains a large number of non-standard words, so existing natural language tools give unsatisfactory results when applied to it directly. If the non-standard words in microblog text can be normalized, the results of downstream natural language processing research can be improved to some extent.
In recent years, existing work has proposed a variety of normalization methods for English text, but applying these methods to Chinese text causes problems of varying degree. For example, some methods compute the literal similarity between a non-standard word and a standard word from the longest-common-string ratio and the edit distance. This suits English text reasonably well but is poorly suited to computing similarity in Chinese. Existing normalization methods therefore cannot meet the needs of Chinese microblog text normalization.
The present invention proposes a microblogging text normalization method based on context graph random walk and phonetic-stroke code. To improve Chinese microblog text normalization, the invention considers two aspects. First, on top of the context graph random walk, it introduces the phonetic-stroke code method, which captures the similarity in pronunciation and shape between non-standard words and standard words and thus takes the characteristics of the Chinese language into account. Second, it modifies the original phonetic-stroke code so that the code better fits the way microblog text is written and better supports the normalization task.
Summary of the invention
The purpose of the present invention is to provide a microblogging text normalization method based on context graph random walk and phonetic-stroke code. By introducing the phonetic-stroke code method, the invention computes the similarity in pronunciation and shape between non-standard words and standard words, so that the final normalization result is more accurate.
The present invention is a microblogging text normalization method based on context graph random walk and phonetic-stroke code, comprising the following steps:
Step 1: perform word segmentation on the Chinese microblog text.
Step 2: identify the non-standard words in the microblog text using a standard dictionary, and extract the context of each word.
Step 3: construct a context graph from the words, their corresponding contexts, and the co-occurrence counts of words and contexts.
Step 4: perform random walks on the context graph to obtain a context-based normalization candidate set for each non-standard word.
Step 5: derive the phonetic-stroke code of each word from the phonetic-stroke codes of its individual Chinese characters.
Step 6: for each non-standard word, extract the feature vector of its phonetic-stroke code and input it to the phonetic-stroke code model, which outputs the feature vector of the predicted standard word corresponding to the non-standard word.
Step 7: compare this output with the feature vectors of the standard words in the standard dictionary, and find the k words whose feature vectors are closest to that of the predicted standard word. These are the top-k standard words by pronunciation-and-shape similarity, and they form the normalization candidate sequence based on pronunciation and glyph shape.
Step 8: process the two normalization candidate sets and output the top N standard words as the normalization result for the corresponding non-standard word.
In step 2, identifying the non-standard words in the microblog text using a standard dictionary and extracting the context of each word specifically means: the words obtained after segmentation are compared with the words in the standard dictionary to identify the non-standard words and the standard words in the Chinese microblog text, and the contexts of both are found in preparation for building the context graph in the next step.
Here the invention defines the context of a word as the word sequence formed by the one word before it and the one word after it.
In step 3, constructing the context graph specifically means: the words, their corresponding contexts, and the co-occurrence counts of words and contexts are used to construct a context graph G(W, C, E).
W contains all nodes that represent standard words and non-standard words, C contains all nodes that represent contexts, and E represents the edges connecting word nodes and context nodes in the graph. The weight of an edge is the co-occurrence count of the word and the context.
In step 4, performing random walks on the context graph to obtain the context-based normalization candidate set of each non-standard word specifically means: K random walks are performed on the context graph for each non-standard word.
Each random walk starts from a node N_i that belongs to a non-standard word and then moves with probability P_ij to any context node M_j connected to it. Transitions between node pairs are governed by the transition probability P_ij. The transition probability between any two nodes N_i and M_j is defined as:
$$P_{ij} = \frac{W_{ij}}{\sum_{k} W_{ik}}$$
where N_i denotes a non-standard word node, M_j denotes a context node, P_ij denotes the transition probability between nodes N_i and M_j, W_ij denotes the weight of the edge between N_i and M_j, and W_ik denotes the weight of the edge between N_i and any context node M_k connected to it.
K independent random walks are repeated, traversing the bipartite graph randomly according to the transition probability distribution. For any random walk, the number of steps taken between two nodes is called the hitting time. The hitting time between the pair (n, m) of a non-standard word node and a standard word node in the r-th random walk is therefore h_r(n, m). The cost between two nodes is defined as the average hitting time H(n, m) over all random walks that connect them:
$$H(n, m) = \frac{1}{R} \sum_{r=1}^{R} h_r(n, m)$$
where H(n, m) denotes the average hitting time of the node pair (n, m) over all random walks, h_r(n, m) denotes the hitting time of the node pair (n, m) in the r-th random walk, and R denotes the number of random walks that connect the node pair (n, m).
The context similarity of a non-standard word node and a standard word node (n, m) is defined as L(n, m): the average hitting time H(n, m) of the two nodes relative to the average hitting times of all standard word nodes connected to the non-standard word. L(n, m) is therefore computed as:
$$L(n, m) = \frac{H(n, m)}{\sum_{i} H(n, m_i)}$$
where L(n, m) denotes the context similarity of the node pair (n, m), and m_i ranges over the standard word nodes connected to the non-standard word n.
The context similarity between each non-standard word and its multiple candidate standard words is computed and sorted, giving the normalization candidate sequence based on context similarity.
In step 5, deriving the phonetic-stroke code of each word from the codes of its individual Chinese characters specifically means: the phonetic-stroke code of each Chinese character is computed according to the modified phonetic-stroke code structure, and the phonetic-stroke codes of each non-standard word and standard word are then obtained from the codes of their characters.
The phonetic-stroke code of a single Chinese character is a 1 × 10 vector. Words are assumed to consist of at most four characters, so a word is represented as a 4 × 10 matrix. If a word has fewer than four characters, the matrix is zero-padded at the end.
In step 6, each non-standard word is processed as follows: its phonetic-stroke code is multiplied by a weight matrix to extract a feature vector (a 1 × 4 vector), which is input to the phonetic-stroke code model. The model outputs the feature vector of the predicted standard word corresponding to the non-standard word.
In step 7, the feature vector of the predicted standard word obtained in step 6 is compared with the feature vectors of the standard words in the standard dictionary, and the k words whose feature vectors are closest to it are found. These are the top-k standard words by pronunciation-and-shape similarity, and they form the normalization candidate sequence based on pronunciation and glyph shape.
In step 8, for each non-standard word, the context-based normalization candidate set and the candidate set based on pronunciation and glyph shape are reordered together, and the top N standard words (Top N) are output as the normalization result for that non-standard word.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention.
Fig. 2 is a flow chart of the random walk.
Fig. 3 shows the structure of the modified phonetic-stroke code.
Fig. 4 shows the phonetic-stroke code representation of a word.
Fig. 5 shows the weight settings used for feature extraction.
Fig. 6 shows the feature representation of a word.
Detailed description of embodiments
The present invention is a microblogging text normalization method based on context graph random walk and phonetic-stroke code. The overall flow is shown in Fig. 1 and comprises the following steps:
Step 1: perform word segmentation on the Chinese microblog text.
Step 2: identify the non-standard words in the microblog text using a standard dictionary, and extract the context of each word.
Step 3: construct a context graph from the words, their corresponding contexts, and the co-occurrence counts of words and contexts.
Step 4: perform random walks on the context graph to obtain a context-based normalization candidate set for each non-standard word.
Step 5: derive the phonetic-stroke code of each word from the phonetic-stroke codes of its individual Chinese characters.
Step 6: for each non-standard word, extract the feature vector of its phonetic-stroke code and input it to the phonetic-stroke code model, which outputs the feature vector of the predicted standard word corresponding to the non-standard word.
Step 7: compare this output with the feature vectors of the standard words in the standard dictionary, and find the k words whose feature vectors are closest to that of the predicted standard word. These are the top-k standard words by pronunciation-and-shape similarity, and they form the normalization candidate sequence based on pronunciation and glyph shape.
Step 8: process the two normalization candidate sets and output the top N standard words as the normalization result for the corresponding non-standard word.
In step 1, performing word segmentation on the Chinese microblog text specifically means: a word segmentation tool is applied to the Chinese microblog text to obtain the words it contains, in preparation for identifying non-standard words in the next step.
In step 2, identifying the non-standard words in the microblog text using a standard dictionary and extracting the context of each word specifically means: the words obtained after segmentation are compared with the words in the standard dictionary to identify the non-standard words and the standard words in the Chinese microblog text, and the contexts of both are found in preparation for building the context graph in the next step.
Here the invention defines the context of a word as the word sequence formed by the one word before it and the one word after it. In Fig. 2, for example, one context of the word "神马" ("shenma", a slang spelling of "什么", "what") is "inquiring into ... topic".
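Steps 1 and 2 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: it assumes the text has already been segmented into tokens, that the standard dictionary is a plain set, and that sentence boundaries are padded with the marker tokens `<s>` and `</s>` (the source does not say how boundary words are handled). The function name `extract_contexts` and the sample tokens are hypothetical.

```python
from typing import List, Set, Tuple

Context = Tuple[str, str]  # (word before, word after)


def extract_contexts(tokens: List[str], lexicon: Set[str]) -> List[Tuple[str, Context, bool]]:
    """Return (word, context, is_standard) for every token of one segmented text."""
    # Assumption: pad the boundaries so the first and last words also get a context.
    padded = ["<s>"] + tokens + ["</s>"]
    records = []
    for i in range(1, len(padded) - 1):
        word = padded[i]
        context = (padded[i - 1], padded[i + 1])
        # A word is treated as non-standard when the standard dictionary lacks it.
        records.append((word, context, word in lexicon))
    return records


# Hypothetical romanized tokens standing in for a segmented microblog post.
records = extract_contexts(["being", "shenma", "thing"], {"being", "what", "thing"})
```

Here `records[1]` pairs the out-of-dictionary token with the one word on either side of it, which is the context definition given above.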
In step 3, constructing the context graph specifically means: the words, their corresponding contexts, and the co-occurrence counts of words and contexts are used to construct a context graph G(W, C, E).
W contains all nodes that represent standard words and non-standard words, C contains all nodes that represent contexts, and E represents the edges connecting word nodes and context nodes in the graph. The weight of an edge is the co-occurrence count of the word and the context. In Fig. 2, the white nodes on the left are the contexts of the words "shenma" and "what", such as "being ... thing"; on the right, the grey word node is the non-standard word "shenma" and the white word node is the standard word "what". The weight 1 on the edge connecting "shenma" and "being ... thing" means that the phrase "being shenma thing" occurs once in the text.
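The bipartite graph G(W, C, E) described above can be held in two nested dictionaries of co-occurrence counts. The sketch below is one possible representation under that assumption; `build_context_graph` and the sample pairs are hypothetical, and contexts are written as single strings for brevity.

```python
from collections import defaultdict


def build_context_graph(pairs):
    """Build edge weights in both directions from (word, context) observations."""
    word_to_ctx = defaultdict(lambda: defaultdict(int))  # W -> C edge weights
    ctx_to_word = defaultdict(lambda: defaultdict(int))  # C -> W edge weights
    for word, context in pairs:
        # Each observation of a word inside a context adds 1 to the edge weight.
        word_to_ctx[word][context] += 1
        ctx_to_word[context][word] += 1
    return word_to_ctx, ctx_to_word


# Hypothetical observations: one (word, context) pair per occurrence in the corpus.
pairs = (
    [("shenma", "being ... thing")]
    + [("shenma", "for ... meeting")] * 3
    + [("what", "for ... meeting")] * 2
)
word_to_ctx, ctx_to_word = build_context_graph(pairs)
```

Keeping both directions makes the later walk cheap, since each step only needs the neighbours of the current node.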
In step 4, performing random walks on the context graph to obtain the context-based normalization candidate set of each non-standard word specifically means: K random walks are performed on the context graph for each non-standard word.
Each random walk starts from a node N_i that belongs to a non-standard word and then moves with probability P_ij to any context node M_j connected to it. Transitions between node pairs are governed by the transition probability P_ij. The transition probability between any two nodes i and j is defined as:
$$P_{ij} = \frac{W_{ij}}{\sum_{k} W_{ik}}$$
An example is shown in Fig. 2.
Step 1: starting from the non-standard word "shenma", its context nodes are found to be "being ... thing", "inquiring into ... topic", and "for ... meeting", with edge weights (1, 1, 3). From the definition of P_ij, the transition probabilities from "shenma" to these context nodes are [0.2, 0.2, 0.6]. A random number is then generated, say 0.61221; it falls in the interval (0.4, 1.0], so the index value is 2 and the selected context node is "for ... meeting".
Step 2: starting from the context node "for ... meeting", the word nodes connected to it are found to be "shenma" and "what". Because the invention does not allow a random walk to return to the previous node, the index value is 1 and the word node "what" is selected.
Step 3: two checks are made: whether the word node is a standard word, and whether the number of steps has reached the maximum S. The invention sets S = 4. If the word node is a standard word node or the number of steps is 4, this walk stops. If the word is not a standard word and the number of steps has not reached 4, the walk continues and steps 1 to 3 are repeated. In Fig. 2, "what" is a standard word, so the walk stops, having found the standard word "what" for "shenma".
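The walk described above can be sketched as follows. This is an illustration under several assumptions that the source leaves open: each edge traversal counts as one step, the move from a context node back to a word node is also weighted by co-occurrence count, and a walk that cannot move or that exhausts its S steps returns `None`. The names `transition_probs`, `weighted_step`, and `random_walk` are hypothetical.

```python
import random


def transition_probs(neighbors):
    """P_ij = W_ij / sum_k W_ik for the edges leaving one node."""
    total = sum(neighbors.values())
    return {node: weight / total for node, weight in neighbors.items()}


def weighted_step(neighbors, rng, exclude=None):
    """Pick one neighbour in proportion to edge weight, never the excluded node."""
    options = {n: w for n, w in neighbors.items() if n != exclude}
    if not options:
        return None
    nodes = list(options)
    # random.choices normalizes the weights, so this samples from P_ij.
    return rng.choices(nodes, weights=[options[n] for n in nodes], k=1)[0]


def random_walk(start, word_to_ctx, ctx_to_word, lexicon, max_steps=4, rng=None):
    """Return (standard word reached, steps taken), or None if the walk fails."""
    rng = rng or random.Random()
    word, prev_ctx, steps = start, None, 0
    while steps < max_steps:
        ctx = weighted_step(word_to_ctx.get(word, {}), rng, exclude=prev_ctx)
        if ctx is None:
            return None
        nxt = weighted_step(ctx_to_word.get(ctx, {}), rng, exclude=word)
        if nxt is None:
            return None
        steps += 2  # assumption: word -> context -> word counts as two steps
        if nxt in lexicon:
            return nxt, steps
        word, prev_ctx = nxt, ctx
    return None


probs = transition_probs({"being ... thing": 1, "inquiring into ... topic": 1, "for ... meeting": 3})
```

With the edge weights (1, 1, 3) from the example, `probs` holds the probabilities 0.2, 0.2, and 0.6. Calling `random_walk` K times per non-standard word and collecting the non-`None` results gives the raw material for the hitting-time statistics.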
K independent random walks are repeated, traversing the bipartite graph randomly according to the transition probability distribution. For any random walk, the number of steps taken between two nodes is called the hitting time. The hitting time between the pair (n, m) of a non-standard word node and a standard word node in the r-th random walk is therefore h_r(n, m). The cost between two nodes is defined as the average hitting time H(n, m) over all random walks that connect them:
$$H(n, m) = \frac{1}{R} \sum_{r=1}^{R} h_r(n, m)$$
where H(n, m) denotes the average hitting time of the node pair (n, m) over all random walks, h_r(n, m) denotes the hitting time of the node pair (n, m) in the r-th random walk, and R denotes the number of random walks that connect the node pair (n, m).
The context similarity of a non-standard word node and a standard word node (n, m) is defined as L(n, m): the average hitting time H(n, m) of the two nodes relative to the average hitting times of all standard word nodes connected to the non-standard word. L(n, m) is therefore computed as:
$$L(n, m) = \frac{H(n, m)}{\sum_{i} H(n, m_i)}$$
where L(n, m) denotes the context similarity of the node pair (n, m), and m_i ranges over the standard word nodes connected to the non-standard word n.
The context similarity between each non-standard word and its multiple candidate standard words is computed and sorted, giving the normalization candidate sequence based on context similarity.
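The averaging and normalization just described can be sketched as below. The sketch reads the text as H(n, m) being the mean hitting time over the walks from n that reached m, and L(n, m) being H(n, m) divided by the sum of H over all standard words reached from n; that reading, the function name `context_similarity`, and the sample walks are assumptions. The sort direction is not stated in the source, so the function returns unsorted scores.

```python
from collections import defaultdict


def context_similarity(walks):
    """walks: (standard word, hitting time) pairs from the successful walks of ONE non-standard word."""
    totals = defaultdict(int)
    counts = defaultdict(int)
    for target, steps in walks:
        totals[target] += steps
        counts[target] += 1
    # H(n, m): mean hitting time over the R walks that connected n and m.
    avg_hit = {m: totals[m] / counts[m] for m in totals}
    # L(n, m): H(n, m) relative to the sum over all standard words reached from n.
    denom = sum(avg_hit.values())
    similarity = {m: avg_hit[m] / denom for m in avg_hit}
    return avg_hit, similarity


# Hypothetical outcomes of three successful walks from one non-standard word.
avg_hit, similarity = context_similarity([("what", 2), ("what", 4), ("who", 4)])
```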
In step 5, deriving the phonetic-stroke code of each word from the codes of its individual Chinese characters specifically means: the phonetic-stroke code of each Chinese character is computed according to the modified phonetic-stroke code structure, and the phonetic-stroke codes of each non-standard word and standard word are then obtained from the codes of their characters.
As shown in Fig. 3, the modified phonetic-stroke code structure has two parts. The first part is the phonetic code, which represents the pinyin of the character and contains the initial, the final, and the auxiliary final. The second part is the shape code, which represents the glyph of the character and contains the structure, the four-corner code, and the stroke count.
The phonetic-stroke code of a single Chinese character is a 1 × 10 vector. Words are assumed to consist of at most four characters, so a word is represented as a 4 × 10 matrix. If a word has fewer than four characters, the matrix is zero-padded at the end. As shown in Fig. 4, the phonetic-stroke code of the word "鸭梨山大" ("yalishanda") is the 4 × 10 matrix formed from the phonetic-stroke codes of its four characters "鸭" (duck), "梨" (pear), "山" (mountain), and "大" (big).
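The word-level representation can be sketched as follows. The actual per-character encoding is defined in Fig. 3 and is not reproduced in the text, so the character codes below are made-up placeholder vectors; only the 4 × 10 shape and the trailing zero padding come from the description. The function name `word_code_matrix` is hypothetical.

```python
def word_code_matrix(char_codes, max_chars=4, code_len=10):
    """Stack per-character 1 x 10 codes into a 4 x 10 matrix, zero-padded at the end."""
    if len(char_codes) > max_chars:
        raise ValueError("words are assumed to have at most four characters")
    for code in char_codes:
        if len(code) != code_len:
            raise ValueError("each character code must have ten entries")
    rows = [list(code) for code in char_codes]
    # Pad short words with all-zero rows so every word has the same shape.
    rows += [[0] * code_len for _ in range(max_chars - len(rows))]
    return rows


# Placeholder codes for a two-character word (not real phonetic-stroke codes).
matrix = word_code_matrix([[1] * 10, [2] * 10])
```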
In step 6, each non-standard word is processed as follows: its phonetic-stroke code is multiplied by a weight matrix to extract a feature vector (a 1 × 4 vector), which is input to the phonetic-stroke code model. The model outputs the feature vector of the predicted standard word corresponding to the non-standard word.
Feature extraction uses the weight settings shown in Fig. 5. As shown in Fig. 6, multiplying the phonetic-stroke code representation of the word "yalishanda" by the weights gives the feature vector [6.86 4.77 9.33 2.34].
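One way to read this step is that each of the four rows of the 4 × 10 code matrix is combined with ten weights, giving one number per character and therefore a 1 × 4 feature vector. The sketch below implements that reading as a matrix-vector product. The weights of Fig. 5 are not given in the text, so the values used here are placeholders, and `extract_features` is a hypothetical name.

```python
def extract_features(code_matrix, weights):
    """Multiply a 4 x 10 code matrix by ten weights to get a 4-element feature vector."""
    if any(len(row) != len(weights) for row in code_matrix):
        raise ValueError("every row must have one entry per weight")
    # One dot product per character row.
    return [sum(value * w for value, w in zip(row, weights)) for row in code_matrix]


# Placeholder inputs: not the codes of Fig. 4 or the weights of Fig. 5.
placeholder_matrix = [
    [1, 2, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 3, 0, 0, 0, 0, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
]
placeholder_weights = [2, 1, 0, 0, 0, 0, 0, 0, 0, 4]
features = extract_features(placeholder_matrix, placeholder_weights)
```

Zero-padded rows contribute a zero feature, so words shorter than four characters still map to a vector of the same length.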
In step 7, the feature vector of the predicted standard word obtained in step 6 is compared with the feature vectors of the standard words in the standard dictionary, and a k-d tree algorithm is used to find the k words whose feature vectors are nearest to it. These are the top-k standard words by pronunciation-and-shape similarity, and they form the normalization candidate sequence based on pronunciation and glyph shape.
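The nearest-neighbour lookup can be illustrated with a brute-force search. Note that this sketch deliberately swaps the k-d tree named in the text for an exhaustive scan with Euclidean distance, which returns the same neighbours on small dictionaries but does not scale as well; the distance metric, the function name `nearest_standard_words`, and the sample vectors are assumptions.

```python
import heapq
import math


def nearest_standard_words(predicted, lexicon_vectors, k):
    """Return the k standard words whose feature vectors are closest to `predicted`."""
    # Brute-force stand-in for a k-d tree query: score every dictionary entry.
    return heapq.nsmallest(
        k,
        lexicon_vectors,
        key=lambda word: math.dist(predicted, lexicon_vectors[word]),
    )


# Hypothetical 1 x 4 feature vectors for three standard words.
lexicon_vectors = {
    "a": [0.0, 0.0, 0.0, 0.0],
    "b": [1.0, 0.0, 0.0, 0.0],
    "c": [5.0, 5.0, 5.0, 5.0],
}
candidates = nearest_standard_words([0.9, 0.0, 0.0, 0.0], lexicon_vectors, k=2)
```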
In step 8, for each non-standard word, the context-based normalization candidate set and the candidate set based on pronunciation and glyph shape are reordered together, and the top N standard words (Top N) are output as the normalization result for that non-standard word.
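The text does not say how the two candidate sets are reordered. Purely as an illustration, the sketch below merges two ranked lists by summing reciprocal ranks, so a word that ranks high in both lists comes first. This scoring rule is an assumption chosen for simplicity, not the patent's method, and `rerank` is a hypothetical name.

```python
from collections import defaultdict


def rerank(context_candidates, shape_candidates, top_n=1):
    """Merge two ranked candidate lists and return the top N words."""
    scores = defaultdict(float)
    for ranked in (context_candidates, shape_candidates):
        for rank, word in enumerate(ranked, start=1):
            # Assumed scoring rule: earlier positions contribute more.
            scores[word] += 1.0 / rank
    ordered = sorted(scores, key=lambda word: -scores[word])
    return ordered[:top_n]


# Hypothetical ranked candidates for one non-standard word.
best = rerank(["what", "who"], ["what", "when"], top_n=1)
```

Any other fusion rule, such as a weighted sum of the two similarity scores, would fit the same interface.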
The method of the invention has the following beneficial effects. First, on top of the context graph random walk, it introduces the phonetic-stroke code method, which captures the similarity in pronunciation and shape between non-standard words and standard words and thus takes the characteristics of the Chinese language into account. Second, it modifies the original phonetic-stroke code so that the code better fits the way microblog text is written and better supports the microblog text normalization task.
The microblogging text normalization method based on context graph random walk and phonetic-stroke code provided by the invention has been described in detail above, and its principle and embodiments have been set out herein. The description of the embodiments is intended only to help in understanding the method of the invention and its core idea.

Claims (9)

1. A microblogging text normalization method based on context graph random walk and phonetic-stroke code, characterized in that the method is applied to Chinese microblog text normalization and comprises the following steps:
Step 1: perform word segmentation on the Chinese microblog text;
Step 2: identify the non-standard words in the microblog text using a standard dictionary, and extract the context of each word;
Step 3: construct a context graph from the words, their corresponding contexts, and the co-occurrence counts of words and contexts;
Step 4: perform random walks on the context graph to obtain a context-based normalization candidate set for each non-standard word;
Step 5: derive the phonetic-stroke code of each word from the phonetic-stroke codes of its individual Chinese characters;
Step 6: for each non-standard word, extract the feature vector of its phonetic-stroke code and input it to the phonetic-stroke code model, which outputs the feature vector of the predicted standard word corresponding to the non-standard word;
Step 7: compare this output with the feature vectors of the standard words in the standard dictionary, and find the k words whose feature vectors are closest to that of the predicted standard word, these being the top-k standard words by pronunciation-and-shape similarity, to obtain the normalization candidate sequence based on pronunciation and glyph shape;
Step 8: process the two normalization candidate sets and output the top N standard words as the normalization result for the corresponding non-standard word.
2. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 1, performing word segmentation on the Chinese microblog text specifically means that a word segmentation tool is applied to the Chinese microblog text to obtain the words it contains, in preparation for identifying non-standard words in the next step.
3. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 2, identifying the non-standard words in the microblog text using a standard dictionary and extracting the context of each word specifically means that the words obtained after segmentation are compared with the words in the standard dictionary to identify the non-standard words and the standard words in the Chinese microblog text, and the contexts of both are found in preparation for building the context graph in the next step.
Here the invention defines the context of a word as the word sequence formed by the one word before it and the one word after it.
4. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 3, constructing the context graph specifically means that the words, their corresponding contexts, and the co-occurrence counts of words and contexts are used to construct a context graph G(W, C, E).
W contains all nodes that represent standard words and non-standard words, C contains all nodes that represent contexts, and E represents the edges connecting word nodes and context nodes in the graph; the weight of an edge is the co-occurrence count of the word and the context.
5. the microblogging text normalization method according to claim 1 based on context graph random walk and phonetic-stroke code, It is characterized in that: carrying out random walk in the step 4 on context graph, obtain each non-standard base in the specification of context Change Candidate Set specifically, carrying out K random walk on context graph to each non-standard word, obtaining each non-standard base In the standardization Candidate Set of context.
Random walk each time is all slaves to the node N of non-standard wordiStart, then with probability PijBe moved to arbitrarily with its phase Context node M evenj, the conversion between each node pair is by transition probability PijIt defines, any two node i, turns between j Move definition of probability are as follows:
Wherein, NiIndicate non-standard word node, MjIndicate context node, PijIndicate node Ni, MjBetween transition probability, Wij Indicate node Ni, MjBetween side right weight, WikExpression and NiConnected any one context node MkSide right weight.
The random walk is repeated K times independently, traversing the bipartite graph at random according to the transition probability distribution. For any random walk, the number of steps taken between two nodes is called the hit time. Thus the hit time of the r-th random walk between a non-standard word node and a modular word node (n, m) is h_r(n, m), and the cost between the two nodes is defined as the average hit time H(n, m) over all random walks connecting them:
$$H(n, m) = \frac{1}{R} \sum_{r=1}^{R} h_r(n, m)$$
where H(n, m) denotes the average hit time over all random walks for the node pair (n, m), h_r(n, m) denotes the hit time of the r-th random walk for the pair (n, m), and R denotes the number of random walks connecting the pair (n, m).
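The repeated walks and the average hit time can be sketched as below. Two points are assumptions of this sketch rather than statements of the claim: each walk stops at the first modular word it reaches, and a walk is abandoned after `max_steps` steps. The function name and the toy graph are hypothetical.

```python
import random
from collections import defaultdict

def average_hit_times(adjacency, start, standard_words, walks=100, max_steps=6, seed=0):
    """Estimate H(start, m) for every modular word m reached from `start`.

    `adjacency` maps each node to {neighbour: edge weight} and must hold both
    directions of every word/context edge.
    """
    rng = random.Random(seed)
    hit_sums = defaultdict(int)    # m -> sum of h_r(start, m)
    hit_counts = defaultdict(int)  # m -> R, number of walks that reached m
    for _ in range(walks):
        node = start
        for step in range(1, max_steps + 1):
            neighbours = list(adjacency.get(node, {}))
            if not neighbours:
                break  # dead end: this walk connects nothing
            weights = [adjacency[node][n] for n in neighbours]
            # Weighted choice is equivalent to sampling with P_ij = W_ij / sum_k W_ik.
            node = rng.choices(neighbours, weights=weights, k=1)[0]
            if node in standard_words:
                hit_sums[node] += step
                hit_counts[node] += 1
                break
    return {m: hit_sums[m] / hit_counts[m] for m in hit_sums}

# Toy bipartite graph: both words share the single context node c1.
adjacency = {
    "稀饭": {"c1": 1},
    "c1": {"稀饭": 1, "喜欢": 2},
    "喜欢": {"c1": 2},
}
H = average_hit_times(adjacency, "稀饭", {"喜欢"})
```

Because the graph is bipartite, a walk that starts at a word node can only reach another word node after an even number of steps, so every recorded hit time here is 2, 4, or 6.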
The contextual similarity of a non-standard word node and a modular word node (n, m) is defined as L(n, m), the relative frequency of the pair's average hit time H(n, m) with respect to all modular word nodes connected to that non-standard word. L(n, m) is therefore computed as:
$$L(n, m) = \frac{H(n, m)}{\sum_{i} H(n, m_i)}$$
where L(n, m) denotes the contextual similarity of the node pair (n, m) and m_i ranges over the modular word nodes connected to n.
The contextual similarity between each non-standard word and each of its candidate modular words is computed and the candidates are sorted by it, yielding a normalization candidate sequence based on contextual similarity.
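The scoring and sorting step can be sketched as follows. The sketch interprets "relative frequency" as H(n, m) divided by the sum of H over all modular words connected to n. The sort direction is not stated in the claim, so it is exposed as a flag; the default of descending L is an assumption, and the function name and values are hypothetical.

```python
def rank_by_context_similarity(hit_times, descending=True):
    """Rank candidate modular words for one non-standard word.

    `hit_times` maps each candidate m to its average hit time H(n, m).
    Returns [(m, L(n, m)), ...], where L is H normalised over all candidates.
    """
    total = sum(hit_times.values())
    if total == 0:
        return []
    scores = {m: h / total for m, h in hit_times.items()}
    # Assumption: larger L ranks first; pass descending=False to treat L as a cost.
    return sorted(scores.items(), key=lambda item: item[1], reverse=descending)

ranking = rank_by_context_similarity({"m1": 2.0, "m2": 6.0})
```

Whichever direction is used, the normalised scores of one non-standard word sum to 1, which makes candidates comparable across words with very different walk lengths.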
6. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 5, obtaining the phonetic-stroke code of a word from the phonetic-stroke codes of its individual Chinese characters specifically comprises computing the phonetic-stroke code of each Chinese character according to the modified phonetic-stroke code structure, and then, based on the codes of the individual characters, obtaining the phonetic-stroke code of each non-standard word and each modular word.
The phonetic-stroke code of a single Chinese character is a 1 × 10 vector. Words are assumed to contain at most four characters, so each word is represented as a 4 × 10 matrix; if a word has fewer than four characters, the matrix is zero-padded at the end.
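The stacking and zero-padding described above can be sketched as below. The per-character codes in `toy_codes` are placeholders, not real phonetic-stroke codes, and the function name is hypothetical; how the ten entries of a character's code are derived is not covered here.

```python
CODE_LENGTH = 10  # length of one character's code, per the claim
MAX_CHARS = 4     # words are assumed to have at most four characters

def word_code_matrix(word, char_codes):
    """Stack per-character codes into a 4 x 10 matrix, zero-padded at the end."""
    if len(word) > MAX_CHARS:
        raise ValueError("word longer than four characters")
    rows = []
    for ch in word:
        code = list(char_codes[ch])
        if len(code) != CODE_LENGTH:
            raise ValueError("each character code must have 10 entries")
        rows.append(code)
    while len(rows) < MAX_CHARS:
        rows.append([0] * CODE_LENGTH)  # padding rows for missing characters
    return rows

# Placeholder codes for illustration only.
toy_codes = {
    "稀": [1, 2, 3, 4, 5, 6, 7, 8, 9, 1],
    "饭": [2, 2, 0, 1, 5, 3, 7, 4, 4, 8],
}
matrix = word_code_matrix("稀饭", toy_codes)
```

A two-character word therefore yields two code rows followed by two all-zero rows, so every word has the same 4 × 10 shape regardless of its length.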
7. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 6, extracting the phonetic-stroke code feature vector of each non-standard word, inputting it into the phonetic-stroke code model, and outputting the feature vector of the predicted modular word corresponding to the non-standard word specifically comprises, for each non-standard word, multiplying its phonetic-stroke code by a weight matrix to extract a feature vector (a 1 × 4 vector), inputting that vector into the phonetic-stroke code model, and outputting the feature vector of the predicted modular word corresponding to the non-standard word.
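The feature extraction step can be sketched as a matrix-vector product. The shape of the weight matrix is not given in the claim; the sketch assumes a 10 × 1 weight column so that a 4 × 10 code matrix yields four values, matching the stated 1 × 4 feature vector. The weights, the function name, and the input matrix are hypothetical, and the phonetic-stroke code model itself is not modelled.

```python
def extract_features(code_matrix, weights):
    """Multiply a 4 x 10 code matrix by a length-10 weight column.

    Returns the four row-wise weighted sums as a flat list (a 1 x 4 vector).
    """
    if any(len(row) != len(weights) for row in code_matrix):
        raise ValueError("row length must match the number of weights")
    return [sum(c * w for c, w in zip(row, weights)) for row in code_matrix]

# Toy input: a two-character word, zero-padded to four rows.
matrix = [[1] * 10, [2] * 10, [0] * 10, [0] * 10]
weights = [1] * 10  # placeholder weights, not learned values
features = extract_features(matrix, weights)
```

Each entry of the result summarises one character position, so the padding rows contribute zeros.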
8. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 7, comparing against the feature vectors of the modular words in the standard dictionary, finding the k words closest to the feature vector of the predicted modular word corresponding to the non-standard word, namely the top-K modular words with the highest phonetic and glyph similarity, and obtaining the normalization candidate sequence based on pronunciation and glyph shape specifically comprises comparing the feature vector of the predicted modular word obtained in step 6 with the feature vectors of the modular words in the standard dictionary, and finding the k words whose feature vectors are closest to it; these are the top-K modular words with the highest phonetic and glyph similarity and form the normalization candidate sequence based on pronunciation and glyph shape.
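A minimal sketch of the nearest-neighbour lookup follows. The claim only says "closest", so the use of Euclidean distance is an assumption; the function name, the dictionary entries, and their feature values are hypothetical.

```python
import math

def top_k_similar(predicted, dictionary_features, k=3):
    """Return the k dictionary words whose feature vectors are closest to `predicted`."""
    def distance(vec):
        # Euclidean distance between the predicted vector and a dictionary vector.
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(predicted, vec)))
    ranked = sorted(dictionary_features, key=lambda w: distance(dictionary_features[w]))
    return ranked[:k]

# Toy standard dictionary with made-up 1 x 4 feature vectors.
dictionary = {
    "喜欢": [10, 20, 0, 0],
    "西方": [40, 5, 0, 0],
    "希望": [12, 25, 0, 0],
}
nearest = top_k_similar([11, 21, 0, 0], dictionary, k=2)
```

The returned list is already ordered from most to least similar, so it can serve directly as the pronunciation-and-glyph candidate sequence.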
9. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 8, processing the two normalization candidate sets and outputting the top N modular words as the normalization result of the corresponding non-standard word specifically comprises re-ranking, for each non-standard word, its context-based normalization candidate set and its pronunciation-and-glyph-based normalization candidate set, and outputting the top N modular words (Top N) as the normalization result for that non-standard word.
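The claim does not specify how the two candidate sets are re-ranked. Purely as an illustration, the sketch below merges them by summed reciprocal rank, which is a technique substituted here and not the method disclosed in the source; the function name and the candidate lists are hypothetical.

```python
def merge_candidates(context_ranked, phonetic_ranked, n=3):
    """Merge two ranked candidate lists and return the top n words.

    Assumed scoring: each word earns 1 / rank in every list that contains it
    (rank starts at 1), and words are ordered by their total score.
    """
    scores = {}
    for ranked in (context_ranked, phonetic_ranked):
        for rank, word in enumerate(ranked, start=1):
            scores[word] = scores.get(word, 0.0) + 1.0 / rank
    ordered = sorted(scores, key=lambda w: scores[w], reverse=True)
    return ordered[:n]

top = merge_candidates(["喜欢", "希望", "西方"], ["喜欢", "稀释", "希望"], n=2)
```

A word that appears near the top of both lists outranks a word that is first in only one of them, which is the behaviour one would want from any re-ranking scheme that combines the two signals.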
CN201910305628.5A 2019-04-16 2019-04-16 Microblogging text normalization method based on context graph random walk and phonetic-stroke code Pending CN110032738A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910305628.5A CN110032738A (en) 2019-04-16 2019-04-16 Microblogging text normalization method based on context graph random walk and phonetic-stroke code

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910305628.5A CN110032738A (en) 2019-04-16 2019-04-16 Microblogging text normalization method based on context graph random walk and phonetic-stroke code

Publications (1)

Publication Number Publication Date
CN110032738A true CN110032738A (en) 2019-07-19

Family

ID=67238712

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910305628.5A Pending CN110032738A (en) 2019-04-16 2019-04-16 Microblogging text normalization method based on context graph random walk and phonetic-stroke code

Country Status (1)

Country Link
CN (1) CN110032738A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110825852A (en) * 2019-11-07 2020-02-21 四川长虹电器股份有限公司 Long text-oriented semantic matching method and system
CN111767422A (en) * 2020-06-30 2020-10-13 平安国际智慧城市科技股份有限公司 Data auditing method, device, terminal and storage medium
CN112801425A (en) * 2021-03-31 2021-05-14 腾讯科技(深圳)有限公司 Method and device for determining information click rate, computer equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20120303355A1 (en) * 2011-05-27 2012-11-29 Robert Bosch Gmbh Method and System for Text Message Normalization Based on Character Transformation and Web Data
CN104536951A (en) * 2014-12-29 2015-04-22 北京牡丹电子集团有限责任公司数字电视技术中心 Microblog text normalizing, word segmenting and part-speech tagging method and system
CN107577668A (en) * 2017-09-15 2018-01-12 电子科技大学 Social media non-standard word correcting method based on semanteme
CN108009253A (en) * 2017-12-05 2018-05-08 昆明理工大学 A kind of improved character string Similar contrasts method
CN108681609A (en) * 2018-05-28 2018-10-19 盐城工学院 A kind of adaptively selected property text cluster integrated approach based on hierarchical clustering

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
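Song Yajun et al.: "An improved social media text normalization method" (一种改进的社交媒体文本规范化方法), Journal of Chinese Information Processing (《中文信息学报》) *
数据中国: "Chinese similarity matching algorithm" (中文相似度匹配算法), 《HTTPS://BLOG.CSDN.NET/CHNDATA/ARTICLE/DETAILS/41114771》 *
Deng Jiayuan et al.: "Twitter text normalization based on unsupervised learning algorithms" (基于无监督学习算法的推特文本规范化), Journal of Computer Applications (《计算机应用》) *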

Similar Documents

Publication Publication Date Title
WO2021068339A1 (en) Text classification method and device, and computer readable storage medium
CN107085581B (en) Short text classification method and device
CN106294593B (en) In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study
WO2019080863A1 (en) Text sentiment classification method, storage medium and computer
CN106599032B (en) Text event extraction method combining sparse coding and structure sensing machine
CN108763213A (en) Theme feature text key word extracting method
CN104778256B (en) A kind of the quick of field question answering system consulting can increment clustering method
CN103617290B (en) Chinese machine-reading system
CN106598940A (en) Text similarity solution algorithm based on global optimization of keyword quality
CN102043851A (en) Multiple-document automatic abstracting method based on frequent itemset
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN107423282A (en) Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character
CN106610951A (en) Improved text similarity solving algorithm based on semantic analysis
CN106599041A (en) Text processing and retrieval system based on big data platform
CN107273474A (en) Autoabstract abstracting method and system based on latent semantic analysis
CN110032738A (en) Microblogging text normalization method based on context graph random walk and phonetic-stroke code
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN110362678A (en) A kind of method and apparatus automatically extracting Chinese text keyword
CN112559684A (en) Keyword extraction and information retrieval method
CN110765755A (en) Semantic similarity feature extraction method based on double selection gates
CN104331523B (en) A kind of question sentence search method based on conceptual object model
CN106528621A (en) Improved density text clustering algorithm
CN106570112A (en) Improved ant colony algorithm-based text clustering realization method
CN103164399A (en) Punctuation addition method and device in speech recognition
CN104281565A (en) Semantic dictionary constructing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20190719
