CN107577668A - Social media non-standard word correcting method based on semanteme - Google Patents
Social media non-standard word correcting method based on semanteme
- Publication number
- CN107577668A (application CN201710829908.7A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Machine Translation (AREA)
Abstract
The invention discloses a semantics-based method for correcting non-standard words in social media. To address the current lack of effective methods for detecting and identifying non-standard words, the application characterizes the morphology of a word by minimum edit distance and the semantic similarity of words by the cosine distance between word vectors, and uses both criteria together to select the expected correct word that replaces a non-standard word. It also combines the existing PyEnchant and PyTypo tools, which greatly narrows the range of words whose semantic relatedness has to be compared and thereby speeds up word normalization.
Description
Technical field
The invention belongs to the field of data mining, and in particular relates to a technique for detecting and identifying non-standard words.
Background
With the rise of Web 2.0, the Internet has shifted from a model in which professionals build the web to one in which all users take part in building it. While this makes the medium more democratic, it also means that users generate large amounts of low-cost, low-quality information. Social media is an important platform on which users publish and spread information; it lets them share their lives and ideas anywhere and at any time. Twitter is a social networking site with a global audience, and one of its characteristics is a limit on the number of characters in each tweet. This leads users to express themselves with shorter, more convenient abbreviations or Internet slang, producing a large number of non-standard words that hinder researchers' subsequent analysis of tweets. Correcting non-standard words is therefore very important.
Professor Zhang Yangsen has pointed out that two kinds of word error mainly occur in English text: non-word errors and real-word errors. A non-word error is a non-standard written form that cannot be found in a dictionary; a real-word error is a word that can be found in a dictionary but does not fit its context, i.e. a grammatical-error type of word. The present invention addresses only non-word errors. Non-standard words arising from non-word errors generally comprise misspelled words, elongated words, and meaningless words.
As for detecting and identifying non-standard words, few effective methods exist at present. The most common is dictionary lookup: the dictionary is traversed to search for a word matching the word to be identified; if a match is found, the word is judged to be a standard word, otherwise a non-standard word. Some scholars identify non-standard words by looking up N-gram tables: they traverse certain N-gram tables and count the occurrences of the word, and judge the word to be non-standard when its frequency is below a certain threshold.
As for correcting non-standard words, a number of fairly effective methods and studies exist and have been applied in commercial products such as search engines and input methods. The most common are the minimum edit distance method, stemming, statistical methods, rule induction, and dictionary construction.
Before natural language can be handed to a machine learning algorithm, it must first be expressed mathematically. The most common approach is to represent each word as a word vector, which is a good way of expressing the semantics of a word. One-hot representation is the most intuitive: each word is represented as a very long vector in which only one dimension has the value 1, indicating the current word, and all other dimensions are 0. This method is simple but brings the curse of dimensionality. Scholars later proposed distributed representation, which represents a word as a low-dimensional real-valued vector. Models commonly used to obtain word vectors include: 1. the Word2Vector model; 2. the GloVe model; 3. the LSA matrix decomposition model; 4. the PLSA probabilistic latent semantic analysis model.
Social media such as Twitter constantly produces large amounts of meaningless noise and repeated, redundant information, such as users' chat and retweets. To make it easier for researchers to analyze social media data, the text must be denoised and normalized. With the development of natural language processing and in-depth research on word normalization and the semantic representation of words, many normalization systems for plain text have appeared in recent years, but most of these conventional methods are limited to the morphology of words, which greatly reduces their effectiveness on Twitter.
Summary of the invention
To solve the above technical problem, the application proposes a semantics-based method for correcting non-standard words in social media. On top of conventional spelling correction, it adds the semantic information of the non-standard word as a further factor, overcoming the difficulty that the minimum edit distance method has with non-standard words whose form differs greatly from the correct word.
The technical solution adopted by the invention is a semantics-based method for correcting non-standard words in social media, comprising:
S1. Semantic information construction: obtain the word vector of each word using the GloVe model, and compute the distance between any two word vectors;
S2. Non-standard word identification: preprocess the tweets to obtain a complete word list; compare each word in the list with the words in the dictionary set; if a match is found, the word is a standard word, otherwise it is a non-standard word;
S3. For each non-standard word identified in step S2, find the words corresponding to the N word vectors closest to its word vector; find the standard words among these words, and replace the non-standard word with the standard word whose word vector is at the smallest distance from that of the non-standard word.
Further, the distance between any two word vectors is computed as one of: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or the cosine of the included angle.
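The measures listed above can be sketched in a few lines of plain Python. This is an illustrative sketch, not code from the patent; the function names are chosen here, and Mahalanobis distance is left out because it additionally needs a covariance matrix estimated from data.

```python
import math


def minkowski(u, v, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(a - b) ** p for a, b in zip(u, v)) ** (1.0 / p)


def euclidean(u, v):
    return minkowski(u, v, 2)


def manhattan(u, v):
    return minkowski(u, v, 1)


def chebyshev(u, v):
    """Largest absolute difference over all coordinates."""
    return max(abs(a - b) for a, b in zip(u, v))


def cosine_similarity(u, v):
    """Cosine of the angle between u and v (1 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)
```

Note that the cosine measure grows with similarity while the other measures grow with dissimilarity, so it has to be converted (for example to 1 minus the cosine) before it can be used where "smaller means closer".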
Further, the preprocessing in step S2 is specifically:
A1. Filter tweet noise, which includes invalid characters and garbled text;
A2. Extract the words containing a topic and the words containing a user name, obtaining topic words and user names;
A3. Remove duplicate words;
A4. Segment English words using all non-letter characters as delimiters.
Further, the dictionary set in step S2 comprises at least: the common English dictionaries, and the topic words and user names obtained by preprocessing.
Further, step S3 also comprises:
B1. Represent morphological similarity by the minimum edit distance d between words;
B2. Set a word-sense parameter α that represents the quality of the word vectors, and represent semantic similarity by α multiplied by the distance l between word vectors;
B3. Compute, by the following formula, the standard word most closely related to the non-standard word, and correct the non-standard word with that standard word:
S(ω1, ω2) = d + β × l × α
where S(ω1, ω2) is the closeness of the relationship between the two words (the smaller the value of S(ω1, ω2), the closer the relationship) and β is the semantic weight.
A semantics-based method for correcting non-standard words in social media, comprising:
S1. Semantic information construction: obtain the word vector of each word using the GloVe model, and compute the distance between any two word vectors;
S2. Non-standard word identification: preprocess the tweets to obtain a complete word list; compare each word in the list with the words in the dictionary set; if a match is found, the word is a standard word, otherwise it is a non-standard word;
S3. For the non-standard words identified in step S2, process each non-standard word to be corrected with PyEnchant and PyTypo to obtain the corresponding list of suggested corrections;
Traverse the list of suggested corrections and compute the minimum edit distance between each suggested word and the non-standard word to be corrected, obtaining the morphological similarity; compute the distance between the word vector of each suggested word and that of the non-standard word to be corrected, obtaining the semantic similarity;
Combine the morphological similarity and the semantic similarity by the following formula to compute a score for each suggested word relative to the non-standard word to be corrected; for each non-standard word to be corrected, re-sort the list of suggested corrections by score from low to high and replace the non-standard word with the first suggestion; if the list of suggested corrections is empty, filter out the non-standard word directly;
S(ω1, ω2) = d + β × l × α
where S(ω1, ω2) is the closeness of the relationship between the two words (the smaller the value of S(ω1, ω2), the closer the relationship), β is the semantic weight, α is the word-sense parameter, and l is the distance between the word vectors.
Further, the distance between any two word vectors is computed as one of: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or the cosine of the included angle.
Further, the preprocessing in step S2 is specifically:
A1. Filter tweet noise, which includes invalid characters and garbled text;
A2. Extract the words containing a topic and the words containing a user name, obtaining topic words and user names;
A3. Remove duplicate words;
A4. Segment English words using all non-letter characters as delimiters.
Further, the dictionary set in step S2 comprises at least: the common English dictionaries, and the topic words and user names obtained by preprocessing.
Beneficial effects of the invention: the semantics-based method for correcting non-standard words in social media characterizes the morphology of a word by minimum edit distance and the semantic similarity of words by the cosine distance between word vectors, and uses both criteria together to select the expected correct word that replaces a non-standard word. It achieves the following beneficial effects:
1. The existing PyEnchant tool is used to obtain a preliminary list of suggested spelling corrections, which greatly reduces the amount of computation, speeds up obtaining the set of expected correct words, and helps locate the best replacement quickly;
2. When selecting the correct word, the method considers not only ordinary morphological similarity but also the semantic information carried by word vectors, measuring the semantic similarity between word vectors with cosine distance. The criterion for how close a correct word is to a non-standard word is therefore more comprehensive, which greatly increases the probability that the selected correct word is the intended one and improves the correction result;
3. Different types of non-standard word are all taken into account: elongated words, which many common methods correct poorly, are handled with PyTypo; meaningless words are filtered out and then judged manually and, where appropriate, added to the dictionary set; and the interference of topic words and user names with correction is considered during tweet preprocessing, where they are collected and added to the dictionary set, making it more complete.
Brief description of the drawings
Fig. 1 is the flow chart of the scheme of the application;
Fig. 2 is the flow chart of non-standard word identification;
Fig. 3 is the flow chart of word correction.
Detailed description of the embodiments
To help those skilled in the art understand the technical content of the invention, the invention is further explained below with reference to the drawings.
Fig. 1 shows the flow chart of the scheme of the application. The technical solution of the application is a semantics-based method for correcting non-standard words in social media, comprising:
S1. Semantic information construction: obtain the word vector of each word using the GloVe model, and compute the distance between any two word vectors;
GloVe encodes the ratio of the co-occurrence probabilities of two words as a vector difference in the vector space. The larger the value of P(k|i)/P(k|j), the closer the semantics of word k is to word i; conversely, the closer P(k|i)/P(k|j) is to 0, the closer the semantics of word k is to word j.
Define $P_{ij}$ as the probability that word j appears in the context of word i:
$$P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}, \qquad X_i = \sum_k X_{ik} \tag{1}$$
where $X_{ij}$ is the number of times word j occurs in the context of word i.
Define $F(\omega_i, \omega_j, \omega_k)$ as the ratio of the probability that word k appears in the context of word i to the probability that word k appears in the context of word j:
$$F(\omega_i, \omega_j, \omega_k) = \frac{P_{ik}}{P_{jk}} \tag{2}$$
where $\omega_i$, $\omega_j$, $\omega_k$ are the word vectors of words i, j, and k, each a real-valued vector in a space whose dimension is fixed in advance.
However, the purpose of GloVe is to encode the vector difference as the ratio of co-occurrence probabilities, so the above formula is revised to:
$$F(\omega_i - \omega_j, \omega_k) = \frac{P_{ik}}{P_{jk}} \tag{3}$$
In the above formula, the arguments on the left are vectors in the space, while the right side is a real number derived from the co-occurrence matrix counted over the corpus. To simplify the formula, the inner product of the vector $\omega_k$ with the vector difference $\omega_i - \omega_j$ converts the argument on the left side from vectors to a real number:
$$F\big((\omega_i - \omega_j)^{T} \omega_k\big) = \frac{P_{ik}}{P_{jk}} \tag{4}$$
To respect the symmetry of the co-occurrence matrix, the left side of the above equation is transformed as follows:
$$F\big((\omega_i - \omega_j)^{T} \omega_k\big) = \frac{F(\omega_i^{T} \omega_k)}{F(\omega_j^{T} \omega_k)} \tag{5}$$
The following equation is then obtained:
$$F(\omega_i^{T} \omega_k) = P_{ik} = \frac{X_{ik}}{X_i} \tag{6}$$
Solving the equation (with F the exponential function) gives:
$$\omega_i^{T} \omega_k = \log P_{ik} = \log X_{ik} - \log X_i \tag{7}$$
The above formula is transformed again by introducing bias terms; this follows from the properties of the logarithm and keeps the expression from diverging:
$$\omega_i^{T} \omega_k + b_i + b_k = \log X_{ik} \tag{8}$$
To make the above formula hold, it is optimized by least squares, and a weighting function $f(X_{ij})$ is introduced, giving the final objective function over all word pairs (i, j) of a vocabulary of size V:
$$J = \sum_{i,j=1}^{V} f(X_{ij}) \big(\omega_i^{T} \omega_j + b_i + b_j - \log X_{ij}\big)^2 \tag{9}$$
Here $f(X_{ij})$ depends on $X_{ij}$: when the co-occurrence count between two words is low, $f(X_{ij})$ should be low, to reduce the influence of that pair on the objective function; when the co-occurrence count is very high, $f(X_{ij})$ should stay constant once $X_{ij}$ reaches a certain threshold, to avoid distorting the overall training result.
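The behaviour required of the weighting function can be sketched as below. The patent text does not say which function is used; the form and the constants here (`x_max=100`, exponent `0.75`) are the ones proposed in the original GloVe paper and are an assumption for this illustration. The exponent is named `power` so that it is not confused with the word-sense parameter α used later in the scoring formula.

```python
import math


def glove_weight(x, x_max=100.0, power=0.75):
    """Weight f(X_ij): rises with the co-occurrence count, then stays at 1."""
    return (x / x_max) ** power if x < x_max else 1.0


def glove_pair_loss(w_i, w_j, b_i, b_j, x_ij):
    """One term of the weighted least-squares objective for a word pair.

    x_ij must be positive; pairs that never co-occur are skipped in practice.
    """
    dot = sum(a * b for a, b in zip(w_i, w_j))
    return glove_weight(x_ij) * (dot + b_i + b_j - math.log(x_ij)) ** 2
```

Summing `glove_pair_loss` over all co-occurring pairs gives the objective; rare pairs contribute little and very frequent pairs are capped at weight 1.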
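Common ways of measuring the distance between vectors include Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, and the cosine of the included angle. The most widely used are Euclidean distance and the cosine measure. After analysis and testing, the embodiment of the application measures vector distance by computing the cosine of the included angle, as follows:
$$\cos\theta = \frac{\omega_1 \cdot \omega_2}{\lVert \omega_1 \rVert \, \lVert \omega_2 \rVert} = \frac{\sum_{t=1}^{n} \omega_{1t}\, \omega_{2t}}{\sqrt{\sum_{t=1}^{n} \omega_{1t}^2}\, \sqrt{\sum_{t=1}^{n} \omega_{2t}^2}} \tag{10}$$
where $\omega_1$ and $\omega_2$ are two word vectors and n is their dimension.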
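As a worked illustration of the cosine measure, take two made-up three-dimensional vectors (chosen only for easy arithmetic, not taken from any trained model):

$$\omega_1 = (1, 2, 2), \qquad \omega_2 = (2, 1, 2)$$

$$\cos\theta = \frac{1 \cdot 2 + 2 \cdot 1 + 2 \cdot 2}{\sqrt{1^2 + 2^2 + 2^2}\, \sqrt{2^2 + 1^2 + 2^2}} = \frac{8}{3 \cdot 3} = \frac{8}{9} \approx 0.889$$

If the distance l is taken to be $1 - \cos\theta$ (an assumption; the text does not fix how the cosine is turned into a distance), these two vectors are at distance $1/9 \approx 0.111$, and identical directions are at distance 0.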
S2. Non-standard word identification: preprocess the tweets to obtain a complete word list; compare each word in the list with the words in the dictionary set; if a match is found, the word is a standard word, otherwise it is a non-standard word;
As shown in Fig. 2, a large number of tweets are obtained through the Twitter API, the raw text of the tweets is extracted from the database, and the following preprocessing is performed first:
1. filter text noise such as invalid characters and garbled text;
2. extract the words containing "#" (topics) and the user names following "@" (mentions), and set them aside for later use;
3. remove duplicates;
4. segment the text into words;
5. filter out symbols, Arabic numerals, and the like.
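The five preprocessing steps can be sketched with the standard library as follows. This is a minimal sketch under stated assumptions: "noise" is taken to mean non-ASCII or non-printable characters, duplicates are removed at the word level, and the function name `preprocess` is chosen here; none of these details is specified in the patent.

```python
import re


def preprocess(tweets):
    """Return (words, topics, users) from a list of raw tweet strings."""
    topics, users = set(), set()
    words, seen = [], set()
    for text in tweets:
        # Step 1: drop non-ASCII and non-printable characters (assumed noise).
        text = "".join(ch for ch in text if ch.isascii() and ch.isprintable())
        # Step 2: set aside hashtags (topics) and mentions (user names).
        topics.update(t.lower() for t in re.findall(r"#(\w+)", text))
        users.update(u.lower() for u in re.findall(r"@(\w+)", text))
        text = re.sub(r"[#@]\w+", " ", text)
        # Steps 4 and 5: split on every non-letter character, which also
        # discards symbols and digits.
        for token in re.split(r"[^A-Za-z]+", text):
            token = token.lower()
            # Step 3: keep only the first occurrence of each word.
            if token and token not in seen:
                seen.add(token)
                words.append(token)
    return words, topics, users
```

For example, `preprocess(["Soooo gr8 2day!! #MondayMood @alice"])` splits "gr8" into "gr" and "2day" into "day", which shows how aggressive splitting on non-letters is.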
When handling words that contain digits, the invention segments English words using all non-letter characters as delimiters. Because Arabic numerals are filtered out and segmentation is made more aggressive, compound words and words containing digits are effectively decomposed, which can itself be regarded as a form of normalization. The invention uses the open-source Python library WordSegment to decompose compound words.
Preprocessing yields a complete word list, and each word in the list is then checked for whether it is standard. As mentioned above, there is currently no good method for identifying such words, so only the most basic method of traversing a dictionary is used. Each word in the list is compared with each word in the dictionary; if the same word is found in the dictionary, the word under test is considered a standard word, otherwise a non-standard word. The dictionary set in the invention contains not only several common English dictionaries collected online, but also the topic words and user names obtained during preprocessing. Given the particular nature of social media, where new Internet slang and rare words constantly emerge, the invention improves the dictionary set by manually judging whether words are standard. Common English dictionaries include the Longman dictionary, the Oxford dictionary, the Collins dictionary, and so on.
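The identification step can be sketched as a set-membership test. This is an illustrative sketch: the function name `split_standard` and the toy word lists are made up, and a hash set is used in place of the literal dictionary traversal described above because it gives the same result with a faster lookup.

```python
def split_standard(words, dictionaries, topics=(), users=()):
    """Split words into (standard, non_standard) using a merged dictionary set."""
    dictionary_set = set()
    for dictionary in dictionaries:
        dictionary_set.update(w.lower() for w in dictionary)
    # Topic words and user names from preprocessing count as standard.
    dictionary_set.update(t.lower() for t in topics)
    dictionary_set.update(u.lower() for u in users)

    standard = [w for w in words if w.lower() in dictionary_set]
    non_standard = [w for w in words if w.lower() not in dictionary_set]
    return standard, non_standard
```

Words that a human reviewer later accepts as valid can simply be appended to one of the input dictionaries before the next run.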
This finally yields the list of non-standard words, which is then passed on for correction.
S3. Correct the non-standard words identified in step S2. The application proposes two solutions:
Scheme one: find the words corresponding to the N word vectors closest to the word vector of the non-standard word; find the standard words among them, and replace the non-standard word with the standard word whose word vector is at the smallest distance from that of the non-standard word. Specifically:
The preprocessing above has already normalized compound words and words containing digits, leaving misspelled words, elongated words, and meaningless words unprocessed. Meaningless words are handled by filtering. Misspelled and elongated words are corrected with the cosine distance between word vectors described above: for each incorrect word, find the words whose word vectors are closest to its own in cosine distance, and replace the non-standard word with a correct word among them.
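Scheme one can be sketched as a nearest-neighbour search over word vectors. This is a minimal sketch under stated assumptions: cosine distance is taken as 1 minus the cosine of the angle, `vectors` is a plain dict from word to vector (in practice loaded from trained GloVe vectors), and the function names are chosen for this example.

```python
import math


def cosine_distance(u, v):
    """1 - cos(angle); 0 for identical directions (assumed conversion)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm


def correct_by_neighbors(word, vectors, dictionary_set, n=10):
    """Return the nearest standard word among the n closest vectors, or None."""
    if word not in vectors:
        return None
    target = vectors[word]
    neighbors = sorted(
        (w for w in vectors if w != word),
        key=lambda w: cosine_distance(target, vectors[w]),
    )[:n]
    # Neighbors are ordered nearest first, so the first standard word wins.
    for candidate in neighbors:
        if candidate in dictionary_set:
            return candidate
    return None
```

Sorting the whole vocabulary for every query is the cost that scheme two later avoids by starting from a short suggestion list.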
However, selecting the correct word by cosine distance alone does not work well enough. Similarity between words can be divided into morphological similarity and semantic similarity. In traditional word correction, minimum edit distance is an effective measure of the morphological similarity between words; the minimum edit distance between two words is denoted d. As described above, the cosine distance between word vectors reflects the semantic similarity of words; semantic similarity is represented by the cosine distance l between the word vectors multiplied by a parameter α. The parameter α can be adjusted manually and represents the quality of the word vectors.
This gives the following scoring formula:
S(ω1, ω2) = d + β × l × α (11)
In the formula, β is the semantic weight. The lower the score between two words, the closer the relationship between them. Because the parameter α and the semantic weight β both scale the same term, let h = α × β to simplify the formula, which becomes:
S(ω1, ω2) = d + h × l (12)
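The simplified score d + h × l can be sketched directly. The edit distance below is the standard Levenshtein dynamic programme; the function names and the example cosine distances are made up for illustration, and suitable values of h have to be tuned, since the patent gives none.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def score(word_a, word_b, cosine_dist, h=1.0):
    """S = d + h * l; a lower score means a closer relationship."""
    return edit_distance(word_a, word_b) + h * cosine_dist
```

With h = 5 and made-up cosine distances of 0.1 to "tomorrow" and 0.9 to "trow", the word "tmrw" scores 4.5 against "tomorrow" and 6.5 against "trow", so the semantic term overrides the smaller edit distance of "trow".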
The above formula is used to find the correct word with the lowest score relative to the non-standard word, which then replaces it to achieve correction. The formula takes morphological features into account and, by also incorporating semantic information, effectively handles cases where the non-standard word and the correct word differ greatly in form; adjusting the parameter limits the effect of inaccurate lexical semantic information on the correction result.
Scheme two: if the formula alone is applied directly, the search range is too large, so the workload rises sharply and processing takes too long. Python has an open-source library, PyEnchant, which processes an incorrect word and returns a list of suggested corrections, and PyTypo handles elongated words directly and well.
As shown in Fig. 3, the non-standard word to be corrected is first processed with PyEnchant and PyTypo to obtain the corresponding list of suggested corrections. The list is then traversed: the minimum edit distance between each suggested word and the word to be corrected is computed, giving the morphological similarity, and the cosine distance between their word vectors is computed, giving the semantic similarity. Formula (12) combines the morphological similarity and the semantic similarity into a score. The suggestion list is re-sorted by score from low to high, so that the lower a suggestion's score, the earlier it appears, which gives a new list of suggested corrections. The first word of each new list replaces the non-standard word, achieving correction. If the suggestion list is empty, however, the word to be corrected is taken to be a meaningless word; the invention filters it out directly, and these meaningless words are later judged manually to decide whether they can serve as correct words and, if so, are added to the dictionary set.
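The re-ranking step of scheme two can be sketched as below. To keep the sketch self-contained, the suggestion source and the cosine-distance lookup are passed in as callables named `suggest` and `cosine_dist` (hypothetical parameter names); with PyEnchant installed, `enchant.Dict("en_US").suggest` could serve as `suggest`. How PyTypo's output is merged into the list is not shown, since the patent does not describe it.

```python
def levenshtein(a, b):
    """Minimum edit distance between two strings."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def rank_suggestions(word, suggestions, cosine_dist, h=1.0):
    """Sort suggestions by S = d + h * l, lowest (closest) first."""
    scored = [(levenshtein(word, s) + h * cosine_dist(word, s), s)
              for s in suggestions]
    return [s for _, s in sorted(scored)]


def correct(word, suggest, cosine_dist, h=1.0):
    """Return the best replacement, or None if there are no suggestions."""
    ranked = rank_suggestions(word, suggest(word), cosine_dist, h)
    # None marks a word to be filtered out and reviewed manually later.
    return ranked[0] if ranked else None
```

Only the handful of suggested words is scored, rather than the whole vocabulary, which is where the speed-up over scheme one comes from.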
PyEnchant is a powerful open-source Python toolkit based on the Enchant library, used mainly for word checking and spelling correction. It returns a list of suggested corrections, which is usually made up of several correct words; the words ranked first are the corrections the system recommends most strongly.
PyTypo is an open-source Python library designed specifically for processing non-standard words in tweets, elongated words in particular. It uses dictionary construction and adds the ability to remove repeated substrings, so it handles elongated words effectively. Its correction of misspelled words and meaningless words, however, is not satisfactory.
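The idea of removing repeated substrings from elongated words can be illustrated with a regular expression. This is not PyTypo's API or its algorithm, which the text does not describe; it is a stand-in sketch that shortens runs of three or more identical characters and returns both plausible forms so that a dictionary check can pick the valid one.

```python
import re


def collapse_candidates(word):
    """Return candidates with runs of 3+ identical characters cut to 2 and to 1."""
    two = re.sub(r"(.)\1{2,}", r"\1\1", word)  # "coooool" -> "cool"
    one = re.sub(r"(.)\1{2,}", r"\1", word)    # "soooo"   -> "so"
    # Words without long runs come back unchanged, as a single candidate.
    return [two] if two == one else [two, one]
```

Both forms are kept because English has legitimate double letters: "coooool" should become "cool", while "soooo" should become "so".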
The existing PyEnchant tool is used to obtain a preliminary list of suggested spelling corrections, which greatly reduces the amount of computation, speeds up obtaining the set of expected correct words, and helps locate the best replacement quickly.
When selecting the correct word, the method considers not only ordinary morphological similarity but also the semantic information carried by word vectors, measuring the semantic similarity between word vectors with cosine distance. The criterion for how close a correct word is to a non-standard word is therefore more comprehensive, which greatly increases the probability that the selected correct word is the intended one and improves the correction result. Different types of non-standard word are all taken into account: elongated words, which many common methods correct poorly, are handled with PyTypo; meaningless words are filtered out and then judged manually and, where appropriate, added to the dictionary set; and the interference of topic words and user names with correction is considered during tweet preprocessing, where they are collected and added to the dictionary set, making it more complete.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. Those skilled in the art may make various modifications and variations to the invention. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of the claims of the invention.
Claims (9)
1. A semantics-based method for correcting non-standard words in social media, characterized by comprising:
S1. Semantic information construction: obtaining the word vector of each word using the GloVe model, and computing the distance between any two word vectors;
S2. Non-standard word identification: preprocessing the tweets to obtain a complete word list; comparing each word in the list with the words in the dictionary set; if a match is found, the word is a standard word, otherwise it is a non-standard word;
S3. For each non-standard word identified in step S2, finding the words corresponding to the N word vectors closest to its word vector; finding the standard words among these words, and replacing the non-standard word with the standard word whose word vector is at the smallest distance from that of the non-standard word.
2. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that the distance between any two word vectors is computed as one of: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or the cosine of the included angle.
3. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that the preprocessing in step S2 is specifically:
A1. Filtering tweet noise, which includes invalid characters and garbled text;
A2. Extracting the words containing a topic and the words containing a user name, obtaining topic words and user names;
A3. Removing duplicate words;
A4. Segmenting English words using all non-letter characters as delimiters.
4. The semantics-based method for correcting non-standard words in social media according to claim 3, characterized in that the dictionary set in step S2 comprises at least: the common English dictionaries, and the topic words and user names obtained by preprocessing.
5. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that step S3 also comprises:
B1. Representing morphological similarity by the minimum edit distance d between words;
B2. Setting a word-sense parameter α that represents the quality of the word vectors, and representing semantic similarity by α multiplied by the distance l between word vectors;
B3. Computing, by the following formula, the standard word most closely related to the non-standard word, and correcting the non-standard word with that standard word:
S(ω1, ω2) = d + β × l × α
where S(ω1, ω2) is the closeness of the relationship between the two words (the smaller the value of S(ω1, ω2), the closer the relationship) and β is the semantic weight.
6. A semantics-based method for correcting non-standard words in social media, characterized by comprising:
S1. Semantic information construction: obtaining the word vector of each word using the GloVe model, and computing the distance between any two word vectors;
S2. Non-standard word identification: preprocessing the tweets to obtain a complete word list; comparing each word in the list with the words in the dictionary set; if a match is found, the word is a standard word, otherwise it is a non-standard word;
S3. For the non-standard words identified in step S2, processing each non-standard word to be corrected with PyEnchant and PyTypo to obtain the corresponding list of suggested corrections;
Traversing the list of suggested corrections and computing the minimum edit distance between each suggested word and the non-standard word to be corrected, obtaining the morphological similarity; computing the distance between the word vector of each suggested word and that of the non-standard word to be corrected, obtaining the semantic similarity;
Combining the morphological similarity and the semantic similarity by the following formula to compute a score for each suggested word relative to the non-standard word to be corrected; for each non-standard word to be corrected, re-sorting the list of suggested corrections by score from low to high and replacing the non-standard word with the first suggestion; if the list of suggested corrections is empty, filtering out the non-standard word directly;
S(ω1, ω2) = d + β × l × α
where S(ω1, ω2) is the closeness of the relationship between the two words (the smaller the value of S(ω1, ω2), the closer the relationship), β is the semantic weight, α is the word-sense parameter, and l is the distance between the word vectors.
7. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the distance between any two word vectors is computed as one of: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or the cosine of the included angle.
8. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the preprocessing in step S2 is specifically:
A1. Filtering tweet noise, which includes invalid characters and garbled text;
A2. Extracting the words containing a topic and the words containing a user name, obtaining topic words and user names;
A3. Removing duplicate words;
A4. Segmenting English words using all non-letter characters as delimiters.
9. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the dictionary set in step S2 comprises at least: the common English dictionaries, and the topic words and user names obtained by preprocessing.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710829908.7A CN107577668A (en) | 2017-09-15 | 2017-09-15 | Social media non-standard word correcting method based on semanteme |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710829908.7A CN107577668A (en) | 2017-09-15 | 2017-09-15 | Social media non-standard word correcting method based on semanteme |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107577668A true CN107577668A (en) | 2018-01-12 |
Family
ID=61036080
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710829908.7A Pending CN107577668A (en) | 2017-09-15 | 2017-09-15 | Social media non-standard word correcting method based on semanteme |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107577668A (en) |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105095204A (en) * | 2014-04-17 | 2015-11-25 | 阿里巴巴集团控股有限公司 | Method and device for obtaining synonym |
CN105824800A (en) * | 2016-03-15 | 2016-08-03 | 江苏科技大学 | Automatic Chinese real word error proofreading method |
CN106126494A (en) * | 2016-06-16 | 2016-11-16 | 上海智臻智能网络科技股份有限公司 | Synonym finds method and device, data processing method and device |
US20160335244A1 (en) * | 2015-05-14 | 2016-11-17 | Nice-Systems Ltd. | System and method for text normalization in noisy channels |
CN106506327A (en) * | 2016-10-11 | 2017-03-15 | 东软集团股份有限公司 | A kind of spam filtering method and device |
Non-Patent Citations (1)
Title |
---|
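宋亚军 (Song Yajun) et al.: "一种改进的社交媒体文本规范化方法" (An improved social media text normalization method), 《中文信息学报》 (Journal of Chinese Information Processing) * |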
Cited By (13)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108804423A (en) * | 2018-05-30 | 2018-11-13 | 平安医疗健康管理股份有限公司 | Medical Text character extraction and automatic matching method and system |
CN108804423B (en) * | 2018-05-30 | 2023-09-08 | 深圳平安医疗健康科技服务有限公司 | Medical text feature extraction and automatic matching method and system |
CN109670171B (en) * | 2018-11-23 | 2021-05-14 | 山西大学 | Word vector representation learning method based on word pair asymmetric co-occurrence |
CN109670171A (en) * | 2018-11-23 | 2019-04-23 | 山西大学 | A kind of word-based term vector expression learning method to asymmetric co-occurrence |
CN110032738A (en) * | 2019-04-16 | 2019-07-19 | 中森云链(成都)科技有限责任公司 | Microblogging text normalization method based on context graph random walk and phonetic-stroke code |
CN110348497A (en) * | 2019-06-28 | 2019-10-18 | 西安理工大学 | A kind of document representation method based on the building of WT-GloVe term vector |
CN110348497B (en) * | 2019-06-28 | 2021-09-10 | 西安理工大学 | Text representation method constructed based on WT-GloVe word vector |
WO2021129411A1 (en) * | 2019-12-23 | 2021-07-01 | 华为技术有限公司 | Text processing method and device |
CN113095072A (en) * | 2019-12-23 | 2021-07-09 | 华为技术有限公司 | Text processing method and device |
EP4060526A4 (en) * | 2019-12-23 | 2022-12-28 | Huawei Technologies Co., Ltd. | Text processing method and device |
CN112597373A (en) * | 2020-12-29 | 2021-04-02 | 科技谷(厦门)信息技术有限公司 | Data acquisition method based on distributed crawler engine |
CN112597373B (en) * | 2020-12-29 | 2023-09-15 | 科技谷(厦门)信息技术有限公司 | Data acquisition method based on distributed crawler engine |
CN112765962A (en) * | 2021-01-15 | 2021-05-07 | 上海微盟企业发展有限公司 | Text error correction method, device and medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107577668A (en) | Social media non-standard word correcting method based on semanteme | |
US10515090B2 (en) | Data extraction and transformation method and system | |
CN100517301C (en) | Systems and methods for improved spell checking | |
CN109684642B (en) | Abstract extraction method combining page parsing rule and NLP text vectorization | |
CN110543639A (en) | english sentence simplification algorithm based on pre-training Transformer language model | |
US10915707B2 (en) | Word replaceability through word vectors | |
CN107102983B (en) | Word vector representation method of Chinese concept based on network knowledge source | |
CN108073571B (en) | Multi-language text quality evaluation method and system and intelligent text processing system | |
CN108920482B (en) | Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model | |
CN110209818B (en) | Semantic sensitive word and sentence oriented analysis method | |
CN111967258B (en) | Method for constructing coreference resolution model, coreference resolution method and medium | |
CN106959943B (en) | Language identification updating method and device | |
CN114065758A (en) | Document keyword extraction method based on hypergraph random walk | |
CN111507093A (en) | Text attack method and device based on similar dictionary and storage medium | |
Lahbari et al. | Arabic question classification using machine learning approaches | |
WO2014002774A1 (en) | Synonym extraction system, method, and recording medium | |
CN113673252A (en) | Automatic join recommendation method for data table based on field semantics | |
CN114997288A (en) | Design resource association method | |
CN108319584A (en) | A kind of new word discovery method based on the microblogging class short text for improving FP-Growth algorithms | |
CN114722176A (en) | Intelligent question answering method, device, medium and electronic equipment | |
CN111178091A (en) | Multi-dimensional Chinese-English bilingual data cleaning method | |
CN107092595A (en) | New keyword extraction techniques | |
CN115828854B (en) | Efficient table entity linking method based on context disambiguation | |
CN104216880A (en) | Term definition discriminating and analysis method based on Internet | |
CN113254473B (en) | Method and device for acquiring weather service knowledge |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | RJ01 | Rejection of invention patent application after publication | Application publication date: 20180112 |