CN110032738A - Microblogging text normalization method based on context graph random walk and phonetic-stroke code - Google Patents
- Publication number
- CN110032738A (application CN201910305628.5A)
- Authority
- CN
- China
- Prior art keywords
- word
- context
- standard
- phonetic
- modular
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- General Physics & Mathematics (AREA)
- General Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Business, Economics & Management (AREA)
- General Engineering & Computer Science (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Tourism & Hospitality (AREA)
- General Business, Economics & Management (AREA)
- Economics (AREA)
- Human Resources & Organizations (AREA)
- Computing Systems (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Primary Health Care (AREA)
- Machine Translation (AREA)
Abstract
The present invention provides a microblog text normalization method based on context graph random walk and phonetic-stroke code, belonging to the field of social media text content analysis and mining within computer technology. The method comprises: identifying non-standard words and extracting word contexts; constructing a context graph and performing random walks on it to obtain a context-based normalization candidate set; using the phonetic-stroke code of Chinese characters to obtain a normalization candidate set based on pronunciation and glyph shape; and processing the two candidate sets to obtain the final normalization result. The method overcomes the failure of conventional methods to fully consider the pronunciation and shape of Chinese characters. Social media text differs from formal written text such as news: it is full of non-standard abbreviations, homophones and homographs, so natural language processing tools perform poorly on microblog text. The invention therefore proposes a microblog text normalization method that combines phonetic-stroke code with context understanding, making it possible to analyze and mine the normalized text with natural language processing tools.
Description
Technical field
The invention belongs to the field of computer technology and specifically relates to a microblog text normalization method based on context graph random walk and phonetic-stroke code.
Background art
With the spread of social networks, new users keep joining, and tens of thousands of texts are generated on each social platform every day. Because it is instant, short and fast-spreading, microblogging has become one of the most important social network platforms today. It has also become an important medium through which people obtain news and current affairs, communicate, express themselves, share socially and take part in public life. Microblog data therefore has great research value. However, microblog text contains a large number of non-standard words, so existing natural language processing tools perform poorly when applied to it directly. If the non-standard words in microblog text can be normalized, the results of related natural language processing research can be improved to some extent.
In recent years, a variety of normalization methods have been proposed for English text, but applying them to Chinese text raises problems. For example, methods that use the longest common string ratio and edit distance to compute the literal similarity between a non-standard word and a standard word are well suited to English text but poorly suited to computing similarity in Chinese. Existing normalization methods therefore cannot meet the needs of Chinese microblog text normalization.
The present invention proposes a microblog text normalization method based on context graph random walk and phonetic-stroke code. To improve Chinese microblog text normalization, the invention addresses two aspects. First, on top of the context graph random walk, it introduces the phonetic-stroke code method, which captures the sound-shape similarity (similarity in pronunciation and glyph shape) between non-standard words and standard words and thus takes the characteristics of the Chinese language into account. Second, it modifies the original phonetic-stroke code so that it better fits the way microblog text is written and better serves the normalization task.
Summary of the invention
The purpose of the present invention is to provide a microblog text normalization method based on context graph random walk and phonetic-stroke code. By introducing the phonetic-stroke code method, the invention computes the sound-shape similarity between non-standard words and standard words, making the final normalization result more accurate.
The present invention is a microblog text normalization method based on context graph random walk and phonetic-stroke code, comprising the following steps:
Step 1: Segment the Chinese microblog text into words.
Step 2: Identify the non-standard words in the microblog text using a standard dictionary, and extract the context of each word.
Step 3: Construct a context graph from the words, their corresponding contexts, and the co-occurrence counts of words and contexts.
Step 4: Perform random walks on the context graph to obtain a context-based normalization candidate set for each non-standard word.
Step 5: Derive the phonetic-stroke code of each word from the phonetic-stroke codes of its individual Chinese characters.
Step 6: For each non-standard word, extract the feature vector of its phonetic-stroke code, input it into the phonetic-stroke code model, and output the feature vector of the predicted standard word corresponding to that non-standard word.
Step 7: Compare this vector with the feature vectors of the standard words in the standard dictionary and find the k words whose feature vectors are closest to that of the predicted standard word; these are the top K standard words by sound-shape similarity and form the normalization candidate sequence based on pronunciation and glyph shape.
Step 8: Process the two normalization candidate sets and output the top N standard words as the normalization result for the corresponding non-standard word.
In step 2, the non-standard words are identified and the contexts extracted as follows: the words obtained after segmentation are compared with the words in the standard dictionary to identify the non-standard words and standard words in the Chinese microblog text, and the contexts of the non-standard words and standard words are found in preparation for building the context graph in the next step.
Here, the invention defines the context of a word as the word sequence formed by the one word before it and the one word after it.
In step 3, the context graph G(W, C, E) is constructed from the words, their corresponding contexts, and the co-occurrence counts of words and contexts.
W contains all nodes representing standard words and non-standard words, C contains all nodes representing contexts, and E represents the edges in the graph that connect word nodes and context nodes. The weight of an edge is the number of times the word and the context co-occur.
In step 4, K random walks are performed on the context graph for each non-standard word to obtain its context-based normalization candidate set.
Each random walk starts from a node N_i belonging to a non-standard word and then moves with probability P_ij to any context node M_j connected to it. Transitions between node pairs are governed by the transition probability P_ij; the transition probability between any two nodes N_i and M_j is defined as:
$$P_{ij} = \frac{W_{ij}}{\sum_{k} W_{ik}}$$
where N_i denotes a non-standard word node, M_j a context node, P_ij the transition probability between N_i and M_j, W_ij the weight of the edge between N_i and M_j, and W_ik the weight of the edge between N_i and any context node M_k connected to it.
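As a worked illustration of this definition, take a non-standard word node N_i with three context neighbours whose edge weights are 1, 1 and 3 (the weights used in the Fig. 2 example of the embodiment). Reading the transition probability as the edge weight divided by the total weight of the edges leaving N_i gives:

```latex
\[
\sum_{k} W_{ik} = 1 + 1 + 3 = 5, \qquad
P_{i1} = \tfrac{1}{5} = 0.2, \quad
P_{i2} = \tfrac{1}{5} = 0.2, \quad
P_{i3} = \tfrac{3}{5} = 0.6
\]
```

The three probabilities sum to 1, as a transition distribution must.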
The random walk is repeated independently K times, traversing the bipartite graph at random according to the transition probability distribution. For any random walk, the number of steps taken between two nodes is called the hitting time. The hitting time between a non-standard word node and a standard word node (n, m) in the r-th random walk is therefore h_r(n, m). The cost between the two nodes is defined as the average hitting time H(n, m) over all random walks that connect them:
$$H(n, m) = \frac{1}{R} \sum_{r=1}^{R} h_r(n, m)$$
where H(n, m) denotes the average hitting time of the node pair (n, m) over all random walks, h_r(n, m) denotes the hitting time of the pair (n, m) in the r-th random walk, and R denotes the number of random walks that connect the pair (n, m).
The context similarity L(n, m) of a non-standard word node n and a standard word node m is defined as the relative frequency of the pair's average hitting time H(n, m) with respect to all standard word nodes connected to the non-standard word. L(n, m) is therefore computed as:
$$L(n, m) = \frac{H(n, m)}{\sum_{i} H(n, m_i)}$$
where L(n, m) denotes the context similarity of the node pair (n, m) and m_i ranges over the standard word nodes connected to n.
The context similarity of each of the standard words corresponding to a non-standard word is computed and the standard words are sorted by it, yielding the normalization candidate sequence based on context similarity.
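A small numeric illustration may help; the hitting times here are made up and do not come from the patent. Suppose a non-standard word n reaches the standard word m_1 in three walks with hitting times 2, 2 and 4, and reaches m_2 in one walk with hitting time 4. Reading the relative frequency as H(n, m) divided by the sum of H over all standard words reached from n (an interpretation of the wording above), the quantities are:

```latex
\[
H(n, m_1) = \frac{2 + 2 + 4}{3} = \frac{8}{3}, \qquad
H(n, m_2) = \frac{4}{1} = 4
\]
\[
L(n, m_1) = \frac{8/3}{8/3 + 4} = 0.4, \qquad
L(n, m_2) = \frac{4}{8/3 + 4} = 0.6
\]
```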
In step 5, the phonetic-stroke code of each Chinese character is computed according to the modified phonetic-stroke code structure, and the phonetic-stroke code of each non-standard word and standard word is then obtained from the codes of its individual characters.
The phonetic-stroke code of a single Chinese character is a 1 × 10 vector. Assuming that a word consists of at most four characters, a word is represented as a 4 × 10 matrix; if the word has fewer than four characters, the remaining rows at the end are zero-padded.
In step 6, each non-standard word is processed as follows: its phonetic-stroke code is multiplied by the weight matrix to extract a feature vector (a 1 × 4 vector), this vector is input into the phonetic-stroke code model, and the model outputs the feature vector of the predicted standard word corresponding to the non-standard word.
In step 7, the feature vector of the predicted standard word obtained in step 6 is compared with the feature vectors of the standard words in the standard dictionary, and the k words whose feature vectors are closest to it are found. These are the top K standard words by sound-shape similarity and form the normalization candidate sequence based on pronunciation and glyph shape.
In step 8, for each non-standard word, the context-based normalization candidate set and the normalization candidate set based on pronunciation and glyph shape are re-ranked together, and the top N standard words (Top N) are output as the normalization result for that non-standard word.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention.
Fig. 2 is a flow chart of the random walk.
Fig. 3 is a diagram of the modified phonetic-stroke code structure.
Fig. 4 shows the phonetic-stroke code representation of a word.
Fig. 5 shows the weight settings used in feature extraction.
Fig. 6 shows the feature representation of a word.
Specific embodiment
The present invention is a microblog text normalization method based on context graph random walk and phonetic-stroke code. Its overall flow is shown in Fig. 1 and comprises the following steps:
Step 1: Segment the Chinese microblog text into words.
Step 2: Identify the non-standard words in the microblog text using a standard dictionary, and extract the context of each word.
Step 3: Construct a context graph from the words, their corresponding contexts, and the co-occurrence counts of words and contexts.
Step 4: Perform random walks on the context graph to obtain a context-based normalization candidate set for each non-standard word.
Step 5: Derive the phonetic-stroke code of each word from the phonetic-stroke codes of its individual Chinese characters.
Step 6: For each non-standard word, extract the feature vector of its phonetic-stroke code, input it into the phonetic-stroke code model, and output the feature vector of the predicted standard word corresponding to that non-standard word.
Step 7: Compare this vector with the feature vectors of the standard words in the standard dictionary and find the k words whose feature vectors are closest to that of the predicted standard word; these are the top K standard words by sound-shape similarity and form the normalization candidate sequence based on pronunciation and glyph shape.
Step 8: Process the two normalization candidate sets and output the top N standard words as the normalization result for the corresponding non-standard word.
In step 1, a word segmentation tool is used to segment the Chinese microblog text and obtain the words it contains, in preparation for identifying non-standard words in the next step.
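The patent does not name the segmentation tool. As a minimal sketch only, the following stand-in uses greedy forward maximum matching against a vocabulary; the function name `segment`, the `max_len` parameter and the sample vocabulary are all assumptions made for illustration, and a production system would more likely rely on an established segmenter.

```python
def segment(text, vocabulary, max_len=4):
    """Greedy forward maximum matching; a simple stand-in for a real segmenter."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical vocabulary and sentence, chosen only to exercise the function.
words = segment("我爱北京天安门", {"北京", "天安门"})
```

Characters that match no vocabulary entry are emitted as single-character tokens, which is how unknown (possibly non-standard) material survives segmentation in this sketch.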
In step 2, the non-standard words are identified and the contexts extracted as follows: the words obtained after segmentation are compared with the words in the standard dictionary to identify the non-standard words and standard words in the Chinese microblog text, and the contexts of the non-standard words and standard words are found in preparation for building the context graph in the next step.
Here, the invention defines the context of a word as the word sequence formed by the one word before it and the one word after it. In Fig. 2, for example, one context of the word "refreshing horse" is "inquiring into topic".
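The identification and context extraction just described can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the function name `identify_and_extract` is hypothetical, the tokens reuse the translated glosses from the Fig. 2 example, and skipping the first and last token (which lack one neighbour) is an assumption, since the text does not say how sentence boundaries are handled.

```python
def identify_and_extract(tokens, standard_dictionary):
    """Return (non_standard_words, contexts) for one segmented text."""
    non_standard = set()
    contexts = []  # entries are (word, (previous_word, next_word))
    for i, word in enumerate(tokens):
        if word not in standard_dictionary:
            non_standard.add(word)
        # Only interior tokens have both a preceding and a following word.
        if 0 < i < len(tokens) - 1:
            contexts.append((word, (tokens[i - 1], tokens[i + 1])))
    return non_standard, contexts


# Toy input built from the glosses used in the Fig. 2 example.
tokens = ["being", "refreshing horse", "thing"]
dictionary = {"being", "thing", "what"}
non_standard, contexts = identify_and_extract(tokens, dictionary)
```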
In step 3, the context graph G(W, C, E) is constructed from the words, their corresponding contexts, and the co-occurrence counts of words and contexts.
W contains all nodes representing standard words and non-standard words, C contains all nodes representing contexts, and E represents the edges in the graph that connect word nodes and context nodes. The weight of an edge is the number of times the word and the context co-occur. In Fig. 2, for example, the white nodes on the left represent the contexts, such as "being thing", of the words "refreshing horse" and "what"; on the right, the gray word node represents the non-standard word "refreshing horse" and the white word node represents the standard word "what". The weight 1 on the edge connecting "refreshing horse" and "being thing" indicates that "being refreshing horse thing" occurs once in the text.
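A minimal sketch of the bipartite context graph, assuming it is stored as two nested dictionaries of co-occurrence counts (one indexed by word, one by context). The function name and this storage layout are illustrative choices; the edge weights for "refreshing horse" follow the Fig. 2 example, while the weight for "what" is invented.

```python
from collections import defaultdict


def build_context_graph(word_context_pairs):
    """Count co-occurrences and return (word_to_ctx, ctx_to_word) weight maps."""
    word_to_ctx = defaultdict(lambda: defaultdict(int))
    ctx_to_word = defaultdict(lambda: defaultdict(int))
    for word, context in word_context_pairs:
        # Each observed (word, context) pair adds 1 to the edge weight.
        word_to_ctx[word][context] += 1
        ctx_to_word[context][word] += 1
    return word_to_ctx, ctx_to_word


pairs = (
    [("refreshing horse", "being thing")]
    + [("refreshing horse", "inquiring into topic")]
    + [("refreshing horse", "for meeting")] * 3
    + [("what", "for meeting")] * 2  # hypothetical count
)
word_to_ctx, ctx_to_word = build_context_graph(pairs)
```

Keeping both directions makes each half-step of a walk (word to context, context to word) a single dictionary lookup.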
In step 4, K random walks are performed on the context graph for each non-standard word to obtain its context-based normalization candidate set.
Each random walk starts from a node N_i belonging to a non-standard word and then moves with probability P_ij to any context node M_j connected to it. Transitions between node pairs are governed by the transition probability P_ij; the transition probability between any two nodes N_i and M_j is defined as:
$$P_{ij} = \frac{W_{ij}}{\sum_{k} W_{ik}}$$
where W_ij is the weight of the edge between N_i and M_j and the sum runs over all context nodes M_k connected to N_i.
Fig. 2 gives an example. In step 1, starting from the non-standard word "refreshing horse", its context nodes are found to be "being thing", "inquiring into topic" and "for meeting", with edge weights (1, 1, 3). The transition probabilities from "refreshing horse" to each context node, computed from P_ij, are [0.2, 0.2, 0.6]. A random number is then generated; suppose it is 0.61221, which falls in the interval (0.4, 1.0], so the index value is 2 and the selected context node is "for meeting". In step 2, starting from the context node "for meeting", the word nodes connected to it are found to be "refreshing horse" and "what". Because the invention does not allow a random walk to return to the previous node, the index value is 1 and the word node "what" is selected. In step 3, two checks are made: whether the word node is a standard word, and whether the number of walk steps has reached the maximum S, which the invention sets to S = 4. If the word node is a standard word node or the number of steps is 4, the walk stops; if the word is not a standard word and the number of steps has not reached 4, the walk continues by repeating steps 1, 2 and 3. In Fig. 2, "what" is a standard word, so the walk stops, having found the standard word "what" for "refreshing horse".
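The walk in this example can be sketched as below. It is an illustration, not the patent's code: the function name, the dictionary-based graph layout and the return value are assumptions. The sketch does follow the stated rules: weighted transitions, no return to the node just left, a stop on reaching a standard word, and a maximum of S = 4 steps. Excluding the previous node means the weights are renormalized over the remaining neighbours, which is one reading of the rule.

```python
import random


def random_walk(start, word_to_ctx, ctx_to_word, standard_words,
                max_steps=4, rng=random):
    """One walk from a non-standard word; returns (standard_word, steps) or None."""
    current, previous, on_word_side = start, None, True
    for step in range(1, max_steps + 1):
        graph = word_to_ctx if on_word_side else ctx_to_word
        # The walk may not return to the node it just left.
        options = {n: w for n, w in graph.get(current, {}).items() if n != previous}
        if not options:
            return None  # dead end
        nodes = list(options)
        # A weighted choice realizes P_ij = W_ij / sum_k W_ik.
        nxt = rng.choices(nodes, weights=[options[n] for n in nodes], k=1)[0]
        previous, current, on_word_side = current, nxt, not on_word_side
        if on_word_side and current in standard_words:
            return current, step
    return None  # step limit reached without hitting a standard word


# Reduced version of the Fig. 2 graph with a single shared context.
w2c = {"refreshing horse": {"for meeting": 3}, "what": {"for meeting": 2}}
c2w = {"for meeting": {"refreshing horse": 3, "what": 2}}
result = random_walk("refreshing horse", w2c, c2w, {"what"})
```

In this reduced graph the walk has only one legal move at each step, so it always reaches "what" after two steps; on the full graph the outcome depends on the random draws.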
The random walk is repeated independently K times, traversing the bipartite graph at random according to the transition probability distribution. For any random walk, the number of steps taken between two nodes is called the hitting time. The hitting time between a non-standard word node and a standard word node (n, m) in the r-th random walk is therefore h_r(n, m). The cost between the two nodes is defined as the average hitting time H(n, m) over all random walks that connect them:
$$H(n, m) = \frac{1}{R} \sum_{r=1}^{R} h_r(n, m)$$
where H(n, m) denotes the average hitting time of the node pair (n, m) over all random walks, h_r(n, m) denotes the hitting time of the pair (n, m) in the r-th random walk, and R denotes the number of random walks that connect the pair (n, m).
The context similarity L(n, m) of a non-standard word node n and a standard word node m is defined as the relative frequency of the pair's average hitting time H(n, m) with respect to all standard word nodes connected to the non-standard word. L(n, m) is therefore computed as:
$$L(n, m) = \frac{H(n, m)}{\sum_{i} H(n, m_i)}$$
where L(n, m) denotes the context similarity of the node pair (n, m) and m_i ranges over the standard word nodes connected to n.
The context similarity of each of the standard words corresponding to a non-standard word is computed and the standard words are sorted by it, yielding the normalization candidate sequence based on context similarity.
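The averaging and ranking can be sketched as follows, assuming the K walks of one non-standard word have been collected as `(standard_word, steps)` pairs. The function name and input format are hypothetical. Two interpretive choices are baked in and should be treated as assumptions: L(n, m) is read as H(n, m) divided by the sum of H over all reached standard words, and candidates are sorted by descending L, since the text does not state the sort direction.

```python
from collections import defaultdict


def context_similarity(walk_results):
    """Rank standard words reached from ONE non-standard word by L(n, m)."""
    hits = defaultdict(list)
    for word, steps in walk_results:
        hits[word].append(steps)
    # H(n, m): mean hitting time over the R walks that reached m.
    average = {m: sum(h) / len(h) for m, h in hits.items()}
    total = sum(average.values())
    # L(n, m): H(n, m) relative to the sum over all reached standard words.
    similarity = {m: h / total for m, h in average.items()}
    return sorted(similarity.items(), key=lambda item: item[1], reverse=True)


# Made-up walk outcomes: m1 reached three times, m2 once.
ranking = context_similarity([("m1", 2), ("m1", 2), ("m1", 4), ("m2", 4)])
```

With these made-up inputs, H is 8/3 for m1 and 4 for m2, so L is 0.4 and 0.6 respectively.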
In step 5, the phonetic-stroke code of each Chinese character is computed according to the modified phonetic-stroke code structure, and the phonetic-stroke code of each non-standard word and standard word is then obtained from the codes of its individual characters.
As shown in Fig. 3, the modified phonetic-stroke code structure has two parts. The first part is the sound code, which represents the pinyin of the character and comprises the initial, the final and the auxiliary final. The second part is the shape code, which represents the glyph of the character and comprises the structure, the four-corner code and the stroke count.
The phonetic-stroke code of a single Chinese character is a 1 × 10 vector. Assuming that a word consists of at most four characters, a word is represented as a 4 × 10 matrix; if the word has fewer than four characters, the remaining rows at the end are zero-padded. In Fig. 4, for example, the phonetic-stroke code of the word "pear mountain is big" is the 4 × 10 matrix formed by the phonetic-stroke codes of its four characters "duck", "pears", "mountain" and "big".
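The 4 × 10 word representation with zero padding can be sketched as below. The per-character codes here are placeholder digits, not real phonetic-stroke codes, and the names `word_code_matrix`, `CODE_LENGTH` and `MAX_CHARS` are illustrative assumptions.

```python
CODE_LENGTH = 10  # one character's phonetic-stroke code is a 1 x 10 vector
MAX_CHARS = 4     # a word is assumed to have at most four characters


def word_code_matrix(word, char_codes):
    """Stack per-character codes into a 4 x 10 matrix, zero-padding at the end."""
    if len(word) > MAX_CHARS:
        raise ValueError("word longer than four characters")
    rows = [list(char_codes[ch]) for ch in word]
    for row in rows:
        if len(row) != CODE_LENGTH:
            raise ValueError("each character code must have 10 entries")
    # Append all-zero rows until the matrix has four rows.
    rows += [[0] * CODE_LENGTH for _ in range(MAX_CHARS - len(rows))]
    return rows


# Placeholder codes for two hypothetical characters "a" and "b".
codes = {"a": [1] * 10, "b": [2] * 10}
matrix = word_code_matrix("ab", codes)
```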
In step 6, each non-standard word is processed as follows: its phonetic-stroke code is multiplied by the weight matrix to extract a feature vector (a 1 × 4 vector), this vector is input into the phonetic-stroke code model, and the model outputs the feature vector of the predicted standard word corresponding to the non-standard word.
Feature extraction is carried out on a word according to the weight settings in Fig. 5. As shown in Fig. 6, multiplying the phonetic-stroke code representation of the word "pear mountain is big" by the weights yields the feature vector [6.86 4.77 9.33 2.34].
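The dimensions given (a 4 × 10 code matrix and a 1 × 4 feature vector) are consistent with one weight per code position, so that each character row collapses to a single weighted sum. The sketch below assumes exactly that; the weights in Fig. 5 are not reproduced in the text, so the values used here are placeholders, and the function name is hypothetical.

```python
def extract_features(code_matrix, weights):
    """Weighted sum of each row: (4 x 10) matrix times 10 weights gives 4 values."""
    if any(len(row) != len(weights) for row in code_matrix):
        raise ValueError("row length must match the number of weights")
    return [sum(v * w for v, w in zip(row, weights)) for row in code_matrix]


# Placeholder matrix and weights; real values would come from Fig. 4 and Fig. 5.
matrix = [[1] * 10, [2] * 10, [0] * 10, [0] * 10]
features = extract_features(matrix, [0.1] * 10)
```

Zero-padded rows always map to a feature value of 0, so shorter words remain distinguishable from four-character words in the resulting vector.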
In step 7, the feature vector of the predicted standard word obtained in step 6 is compared with the feature vectors of the standard words in the standard dictionary, and a k-d tree algorithm is used to find the k words nearest in distance to it. These are the top K standard words by sound-shape similarity and form the normalization candidate sequence based on pronunciation and glyph shape.
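The embodiment uses a k-d tree for this lookup. The sketch below deliberately swaps in a brute-force scan with Euclidean distance, which returns the same k nearest words on small dictionaries and needs only the standard library; a k-d tree would replace the scan when the dictionary is large. The function name, the distance metric and the sample vectors are assumptions.

```python
import heapq
import math


def top_k_nearest(predicted, dictionary_features, k):
    """Return the k dictionary words whose feature vectors are closest to `predicted`."""
    return heapq.nsmallest(
        k,
        dictionary_features,  # iterating a dict yields its keys (the words)
        key=lambda word: math.dist(predicted, dictionary_features[word]),
    )


# Hypothetical 1 x 4 feature vectors for three dictionary words.
features = {"a": [0, 0, 0, 0], "b": [1, 1, 1, 1], "c": [5, 5, 5, 5]}
nearest = top_k_nearest([0.9, 0.9, 0.9, 0.9], features, 2)
```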
In step 8, for each non-standard word, the context-based normalization candidate set and the normalization candidate set based on pronunciation and glyph shape are re-ranked together, and the top N standard words (Top N) are output as the normalization result for that non-standard word.
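The text does not say how the two candidate sets are combined during re-ranking. Purely as an illustration, the sketch below fuses the two ranked lists with a reciprocal-rank score, so that words ranked highly in either list, and especially in both, rise to the top. The scoring rule, the alphabetical tie-break and the function name are all assumptions, not the patent's method.

```python
def merge_candidates(context_ranked, sound_shape_ranked, n):
    """Fuse two ranked candidate lists and return the top n words."""
    scores = {}
    for ranked in (context_ranked, sound_shape_ranked):
        for rank, word in enumerate(ranked, start=1):
            # Reciprocal-rank score: an illustrative choice, not from the source.
            scores[word] = scores.get(word, 0.0) + 1.0 / rank
    # Highest score first; ties are broken alphabetically for determinism.
    ordered = sorted(scores, key=lambda w: (-scores[w], w))
    return ordered[:n]


# Hypothetical candidate lists for one non-standard word.
top = merge_candidates(["what", "which"], ["what", "who"], 2)
```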
The method of the invention has the following beneficial effects. First, on top of the context graph random walk, it introduces the phonetic-stroke code method, which captures the sound-shape similarity between non-standard words and standard words and thus takes the characteristics of the Chinese language into account. Second, it modifies the original phonetic-stroke code so that it better fits the way microblog text is written and better serves the microblog text normalization task.
The microblog text normalization method based on context graph random walk and phonetic-stroke code provided by the invention has been described in detail above. The principle and embodiments of the invention are set out herein, and the description of the embodiments above is intended only to aid understanding of the method of the invention and its core idea.
Claims (9)
1. A microblog text normalization method based on context graph random walk and phonetic-stroke code, characterized in that the method is applied to Chinese microblog text normalization and comprises the following steps:
Step 1: segmenting the Chinese microblog text into words;
Step 2: identifying the non-standard words in the microblog text using a standard dictionary, and extracting the context of each word;
Step 3: constructing a context graph from the words, their corresponding contexts, and the co-occurrence counts of words and contexts;
Step 4: performing random walks on the context graph to obtain a context-based normalization candidate set for each non-standard word;
Step 5: deriving the phonetic-stroke code of each word from the phonetic-stroke codes of its individual Chinese characters;
Step 6: for each non-standard word, extracting the feature vector of its phonetic-stroke code, inputting it into the phonetic-stroke code model, and outputting the feature vector of the predicted standard word corresponding to that non-standard word;
Step 7: comparing this vector with the feature vectors of the standard words in the standard dictionary and finding the k words whose feature vectors are closest to that of the predicted standard word, these being the top K standard words by sound-shape similarity, to obtain the normalization candidate sequence based on pronunciation and glyph shape;
Step 8: processing the two normalization candidate sets and outputting the top N standard words as the normalization result for the corresponding non-standard word.
2. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 1, a word segmentation tool is used to segment the Chinese microblog text and obtain the words it contains, in preparation for identifying non-standard words in the next step.
3. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 2, the words obtained after segmentation are compared with the words in the standard dictionary to identify the non-standard words and standard words in the Chinese microblog text, and the contexts of the non-standard words and standard words are found in preparation for building the context graph in the next step.
Here, the context of a word is defined as the word sequence formed by the one word before it and the one word after it.
4. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 3, the context graph G(W, C, E) is constructed from the words, their corresponding contexts, and the co-occurrence counts of words and contexts.
W contains all nodes representing standard words and non-standard words, C contains all nodes representing contexts, and E represents the edges in the graph that connect word nodes and context nodes; the weight of an edge is the number of times the word and the context co-occur.
5. The microblog text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 4, K random walks are performed on the context graph for each non-standard word to obtain its context-based normalization candidate set.
Each random walk starts from a node N_i belonging to a non-standard word and then moves with probability P_ij to any context node M_j connected to it; transitions between node pairs are governed by the transition probability P_ij, and the transition probability between any two nodes N_i and M_j is defined as:
$$P_{ij} = \frac{W_{ij}}{\sum_{k} W_{ik}}$$
where N_i denotes a non-standard word node, M_j a context node, P_ij the transition probability between N_i and M_j, W_ij the weight of the edge between N_i and M_j, and W_ik the weight of the edge between N_i and any context node M_k connected to it.
The random walk is repeated independently K times, traversing the bipartite graph at random according to the transition probability distribution; for any random walk, the number of steps taken between two nodes is called the hitting time, so the hitting time between a non-standard word node and a standard word node (n, m) in the r-th random walk is h_r(n, m), and the cost between the two nodes is defined as the average hitting time H(n, m) over all random walks that connect them:
$$H(n, m) = \frac{1}{R} \sum_{r=1}^{R} h_r(n, m)$$
where H(n, m) denotes the average hitting time of the node pair (n, m) over all random walks, h_r(n, m) denotes the hitting time of the pair (n, m) in the r-th random walk, and R denotes the number of random walks that connect the pair (n, m).
The context similarity L(n, m) of a non-standard word node n and a standard word node m is defined as the relative frequency of the pair's average hitting time H(n, m) with respect to all standard word nodes connected to the non-standard word, so L(n, m) is computed as:
$$L(n, m) = \frac{H(n, m)}{\sum_{i} H(n, m_i)}$$
where L(n, m) denotes the context similarity of the node pair (n, m) and m_i ranges over the standard word nodes connected to n.
The context similarity of each of the standard words corresponding to a non-standard word is computed and the standard words are sorted by it, yielding the normalization candidate sequence based on context similarity.
6. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 5, obtaining the phonetic-stroke code of a word from the phonetic-stroke codes of its individual Chinese characters specifically comprises: computing the phonetic-stroke code of each Chinese character according to the modified phonetic-stroke code structure, and then, from the codes of the individual characters, obtaining the phonetic-stroke code of each non-standard word and each standard word.
The phonetic-stroke code of an individual Chinese character is a 1 × 10 vector. Assuming that every word consists of at most four characters, a word is represented as a 4 × 10 matrix; if a word has fewer than four characters, the trailing rows are padded with zeros.
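The word-level representation described in this claim can be illustrated with a short sketch. Only the 1 × 10 and 4 × 10 shapes and the trailing zero padding come from the text; the function name and the digit values used for the per-character codes are hypothetical placeholders, since the actual character encoding is defined elsewhere in the patent.

```python
CODE_LEN = 10   # length of one character's phonetic-stroke code (from the claim)
MAX_CHARS = 4   # maximum number of characters per word (from the claim)


def word_code_matrix(char_codes):
    """Stack per-character codes into a 4 x 10 matrix, zero-padding at the end."""
    if len(char_codes) > MAX_CHARS:
        raise ValueError("a word is assumed to have at most four characters")
    for code in char_codes:
        if len(code) != CODE_LEN:
            raise ValueError("each character code must have length 10")
    padding = [[0] * CODE_LEN for _ in range(MAX_CHARS - len(char_codes))]
    return [list(code) for code in char_codes] + padding


# Hypothetical codes for a two-character word; the digits are placeholders.
two_char_word = [
    [1, 2, 0, 3, 4, 1, 0, 2, 5, 7],
    [3, 1, 2, 0, 1, 4, 2, 2, 0, 9],
]
matrix = word_code_matrix(two_char_word)
```

For the two-character example, rows 0 and 1 hold the character codes and rows 2 and 3 are all zeros.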
7. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 6, extracting the phonetic-stroke code feature vector of each non-standard word, inputting it into the phonetic-stroke code model, and outputting the feature vector of the predicted standard word corresponding to the non-standard word specifically comprises: for each non-standard word, multiplying its phonetic-stroke code by a weight matrix to extract a feature vector (a 1 × 4 vector), inputting the feature vector into the phonetic-stroke code model, and outputting the feature vector of the predicted standard word corresponding to the non-standard word.
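The feature-extraction step can be sketched as a plain matrix product. The claim only says that the phonetic-stroke code is multiplied by a weight matrix to give a 1 × 4 vector; the sketch below assumes the weight matrix is a single column of 10 weights, so that each of the four rows of the 4 × 10 word matrix collapses to one value. The weight values and the function name are hypothetical, and the phonetic-stroke code model itself is not shown.

```python
def extract_features(word_matrix, weights):
    """Multiply a 4 x 10 word matrix by a length-10 weight column -> 4 values."""
    if any(len(row) != len(weights) for row in word_matrix):
        raise ValueError("row length must match the number of weights")
    return [sum(v * w for v, w in zip(row, weights)) for row in word_matrix]


# Hypothetical inputs: a one-character word (three zero-padded rows) and
# made-up integer weights for the ten code positions.
word_matrix = [
    [1] * 10,
    [0] * 10,
    [0] * 10,
    [0] * 10,
]
weights = list(range(1, 11))
features = extract_features(word_matrix, weights)
```

With these inputs the first feature is the sum of the weights and the padded rows contribute zeros, so padding does not affect the extracted features.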
8. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 7, comparing against the feature vectors of the standard words in the standard dictionary, finding the k words whose feature vectors are closest to that of the predicted standard word corresponding to the non-standard word, i.e. the top K standard words with the highest sound-shape similarity, and obtaining the normalization candidate sequence based on pronunciation and glyph shape specifically comprises: comparing the feature vector of the predicted standard word obtained in step 6 for the non-standard word with the feature vectors of the standard words in the standard dictionary, and finding the k words whose feature vectors are closest to it. These are the top K standard words with the highest sound-shape similarity, and they form the normalization candidate sequence based on pronunciation and glyph shape.
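The nearest-neighbour lookup in this claim might look like the following sketch. The claim does not name a distance measure, so Euclidean distance is an assumption here, as are the dictionary entries, the feature values and the function name.

```python
import heapq
import math


def top_k_similar(predicted, dictionary_features, k):
    """Return the k dictionary words whose feature vectors are closest to `predicted`."""
    def distance(word):
        # Euclidean distance is an assumed choice of closeness measure.
        return math.dist(predicted, dictionary_features[word])
    return heapq.nsmallest(k, dictionary_features, key=distance)


# Hypothetical standard dictionary mapping words to 1 x 4 feature vectors.
dictionary_features = {
    "a": [0.0, 0.0, 0.0, 0.0],
    "b": [1.0, 1.0, 1.0, 1.0],
    "c": [5.0, 5.0, 5.0, 5.0],
}
candidates = top_k_similar([0.9, 1.0, 1.0, 1.0], dictionary_features, k=2)
```

`heapq.nsmallest` avoids sorting the whole dictionary when k is small. For a large dictionary an approximate nearest-neighbour index could be swapped in without changing the interface.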
9. The microblogging text normalization method based on context graph random walk and phonetic-stroke code according to claim 1, characterized in that: in step 8, processing the two normalization candidate sets and outputting the top N standard words as the normalization result of the corresponding non-standard word specifically comprises: for each non-standard word, re-ranking its context-based normalization candidate set together with its pronunciation-and-glyph-based normalization candidate set, and outputting the top N standard words (Top N) as the normalization result of that non-standard word.
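The claim does not say how the two candidate sets are re-ranked. Purely as an illustration, the sketch below merges them with reciprocal rank fusion, a generic rank-merging technique that is swapped in here as an assumption and is not taken from the patent. The candidate names, the constant `c` and the function name are likewise hypothetical.

```python
def fuse_candidates(context_ranked, phonetic_ranked, n, c=60):
    """Merge two ranked candidate lists and return the top n words.

    Each word scores 1 / (c + rank) per list it appears in (reciprocal rank
    fusion); ties are broken alphabetically to keep the output deterministic.
    """
    scores = {}
    for ranked in (context_ranked, phonetic_ranked):
        for rank, word in enumerate(ranked, start=1):
            scores[word] = scores.get(word, 0.0) + 1.0 / (c + rank)
    return sorted(scores, key=lambda w: (-scores[w], w))[:n]


# Hypothetical candidate sequences for one non-standard word.
context_ranked = ["m1", "m2", "m3"]
phonetic_ranked = ["m2", "m4", "m1"]
top_n = fuse_candidates(context_ranked, phonetic_ranked, n=3)
```

Words that appear in both lists accumulate two scores, so candidates supported by both context and sound-shape similarity rise to the top.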
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910305628.5A CN110032738A (en) | 2019-04-16 | 2019-04-16 | Microblogging text normalization method based on context graph random walk and phonetic-stroke code |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910305628.5A CN110032738A (en) | 2019-04-16 | 2019-04-16 | Microblogging text normalization method based on context graph random walk and phonetic-stroke code |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110032738A (en) | 2019-07-19 |
Family
ID=67238712
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910305628.5A Pending CN110032738A (en) | 2019-04-16 | 2019-04-16 | Microblogging text normalization method based on context graph random walk and phonetic-stroke code |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110032738A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110825852A (en) * | 2019-11-07 | 2020-02-21 | 四川长虹电器股份有限公司 | Long text-oriented semantic matching method and system |
CN111767422A (en) * | 2020-06-30 | 2020-10-13 | 平安国际智慧城市科技股份有限公司 | Data auditing method, device, terminal and storage medium |
CN112801425A (en) * | 2021-03-31 | 2021-05-14 | 腾讯科技(深圳)有限公司 | Method and device for determining information click rate, computer equipment and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120303355A1 (en) * | 2011-05-27 | 2012-11-29 | Robert Bosch Gmbh | Method and System for Text Message Normalization Based on Character Transformation and Web Data |
CN104536951A (en) * | 2014-12-29 | 2015-04-22 | 北京牡丹电子集团有限责任公司数字电视技术中心 | Microblog text normalizing, word segmenting and part-speech tagging method and system |
CN107577668A (en) * | 2017-09-15 | 2018-01-12 | 电子科技大学 | Social media non-standard word correcting method based on semanteme |
CN108009253A (en) * | 2017-12-05 | 2018-05-08 | 昆明理工大学 | A kind of improved character string Similar contrasts method |
CN108681609A (en) * | 2018-05-28 | 2018-10-19 | 盐城工学院 | A kind of adaptively selected property text cluster integrated approach based on hierarchical clustering |
Non-Patent Citations (3)
Title |
---|
宋亚军 等: "一种改进的社交媒体文本规范化方法", 《中文信息学报》 * |
数据中国: "中文相似度匹配算法", 《HTTPS://BLOG.CSDN.NET/CHNDATA/ARTICLE/DETAILS/41114771》 * |
邓加原 等: "基于无监督学习算法的推特文本规范化", 《计算机应用》 * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021068339A1 (en) | Text classification method and device, and computer readable storage medium | |
CN107085581B (en) | Short text classification method and device | |
CN106294593B (en) | In conjunction with the Relation extraction method of subordinate clause grade remote supervisory and semi-supervised integrated study | |
WO2019080863A1 (en) | Text sentiment classification method, storage medium and computer | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN108763213A (en) | Theme feature text key word extracting method | |
CN104778256B (en) | A kind of the quick of field question answering system consulting can increment clustering method | |
CN103617290B (en) | Chinese machine-reading system | |
CN106598940A (en) | Text similarity solution algorithm based on global optimization of keyword quality | |
CN102043851A (en) | Multiple-document automatic abstracting method based on frequent itemset | |
CN111858932A (en) | Multiple-feature Chinese and English emotion classification method and system based on Transformer | |
CN107423282A (en) | Semantic Coherence Sexual Themes and the concurrent extracting method of term vector in text based on composite character | |
CN106610951A (en) | Improved text similarity solving algorithm based on semantic analysis | |
CN106599041A (en) | Text processing and retrieval system based on big data platform | |
CN107273474A (en) | Autoabstract abstracting method and system based on latent semantic analysis | |
CN110032738A (en) | Microblogging text normalization method based on context graph random walk and phonetic-stroke code | |
CN108920482B (en) | Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model | |
CN110362678A (en) | A kind of method and apparatus automatically extracting Chinese text keyword | |
CN112559684A (en) | Keyword extraction and information retrieval method | |
CN110765755A (en) | Semantic similarity feature extraction method based on double selection gates | |
CN104331523B (en) | A kind of question sentence search method based on conceptual object model | |
CN106528621A (en) | Improved density text clustering algorithm | |
CN106570112A (en) | Improved ant colony algorithm-based text clustering realization method | |
CN103164399A (en) | Punctuation addition method and device in speech recognition | |
CN104281565A (en) | Semantic dictionary constructing method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20190719 |