CN107577668A - Social media non-standard word correcting method based on semanteme - Google Patents

Social media non-standard word correcting method based on semanteme

Info

Publication number
CN107577668A
Authority
CN
China
Prior art keywords
word
standard
distance
term vector
list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710829908.7A
Other languages
Chinese (zh)
Inventor
费高雷
郑夏
李元磊
胡光岷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201710829908.7A priority Critical patent/CN107577668A/en
Publication of CN107577668A publication Critical patent/CN107577668A/en
Pending legal-status Critical Current

Landscapes

  • Machine Translation (AREA)

Abstract

The present invention discloses a semantics-based method for correcting non-standard words in social media. To address the current lack of effective methods for detecting and identifying non-standard words, the application characterizes the form of a word by the minimum edit distance and characterizes the similarity of semantic information between words by the cosine distance between their word vectors; the two measures jointly select the expected correct word to replace a non-standard word. The method is also combined with the existing tools PyEnchant and PyTypo, which greatly reduces the range of words whose semantic relatedness must be compared and thereby speeds up word normalization.

Description

Social media non-standard word correcting method based on semanteme
Technical field
The invention belongs to the field of data mining, and more particularly relates to a technique for detecting and identifying non-standard words.
Background
With the rise of Web 2.0, the Internet model changed from content being produced by professionals to content being produced by all users. While the form has become more democratic, this also means that a large amount of low-cost, low-quality user-generated information is produced. Social media is an important platform on which users publish and spread information; it lets users share their lives and ideas at any place and at any time. Twitter is a social networking site with global reach, characterized by a limit on the number of characters in each tweet. This leads users to express their views with shorter, more convenient abbreviations or Internet slang, which produces a large number of non-standard words and hampers researchers' subsequent analysis of tweets. Correcting non-standard words is therefore very important.
Professor Zhang Yangsen pointed out that two kinds of word errors mainly occur in English text: non-word errors and real-word errors. A non-word error is an irregularly written word that cannot be found in a dictionary; a real-word error is a word that can be found in a dictionary but does not fit the context, i.e. a grammatical-error type of word. The present invention addresses only non-word errors. Non-standard words with non-word errors are generally misspelled words, elongated words, and meaningless words.
First, for the detection and identification of non-standard words, there are currently few effective methods. The most common is dictionary lookup: the dictionary is traversed to find a word that matches the word to be identified; if one is found the word is judged to be standard, otherwise it is judged to be non-standard. Some scholars identify non-standard words by looking up N-gram tables, i.e. traversing existing N-gram tables and counting occurrences; a word whose frequency is below a certain threshold is judged to be non-standard.
For the correction of non-standard words, a number of fairly effective methods and studies exist, and they have been applied in commercial products such as search engines and input methods. The most common are the minimum edit distance method, stemming, statistical methods, rule induction, and dictionary construction.
To hand natural language to machine learning algorithms, the language must first be represented mathematically; the most common way is to represent each word as a word vector. A word vector is a good way to represent the semantics of a word. Among these, the one-hot representation is the most intuitive: each word is represented as a very long vector in which only one dimension has the value 1, denoting the current word, and all others are 0. Although simple, this method suffers from the curse of dimensionality, so scholars later proposed the distributed representation, a low-dimensional real-valued vector. Models commonly used to obtain word vectors include: 1. the Word2Vector model; 2. the GloVe model; 3. the LSA matrix factorization model; 4. the PLSA probabilistic latent semantic analysis model.
Social media such as Twitter constantly produces large amounts of meaningless noise and repeated, redundant information, such as users' chat and retweets. To make it easier for researchers to analyze social media data, denoising and normalizing the text is necessary. With the development of natural language processing and deeper research into word normalization and the semantic representation of words, many normalization systems for plain text have appeared in recent years, but these conventional methods are mostly limited to the form of a word, which greatly reduces their effectiveness on Twitter.
Summary of the invention
To solve the above technical problem, the application proposes a semantics-based method for correcting non-standard words in social media. On top of conventional spelling correction, it adds the semantic information of the non-standard word as a further factor, remedying the difficulty the minimum edit distance method has with non-standard words whose form differs greatly from the correct word.
The technical solution adopted by the invention is a semantics-based method for correcting non-standard words in social media, comprising:
S1, semantic information construction: obtain the word vector of each word using the GloVe model, and compute the distance between any two word vectors;
S2, non-standard word identification: preprocess the tweets to obtain a full word list; compare each word in the list with the words in a dictionary set; a word in the list that is matched successfully is a standard word, otherwise it is a non-standard word;
S3, for each non-standard word identified in step S2, find the words corresponding to the N word vectors closest to its word vector; find the standard words among these words, and select the standard word whose vector is closest to the vector of the non-standard word to replace it.
Further, the method for the distance between described calculating any two term vector is:Euclidean distance or it is bright can Paderewski distance or Chebyshev's distance or manhatton distance or Mahalanobis generalised distance or cosine angle.
Further, pretreatment is specially described in step S2:
A1, filtering push away literary noise;It is described to push away literary noise and include:Idle character and mess code;
The word of A2, extraction comprising topic and the word comprising user name, obtain topic topic word and user name;
A3, remove repetitor;
A4, using all non-letter characters to English word carry out word segmentation processing.
Further, the dictionary set described in step S2 comprises at least:Each conventional english dictionary, by pretreatment obtain Topic word and user name.
Further, step S3 also includes:
B1, use the minimum edit distance d between words to represent similarity of form;
B2, set a word-sense parameter α representing the quality of the word vectors, and multiply the distance l between word vectors by α to represent semantic similarity;
B3, compute, according to the following formula, the standard word most closely related to the non-standard word, and correct the non-standard word with that standard word;
S(ω1, ω2) = d + β × l × α
where S(ω1, ω2) denotes the closeness of the relation between two words (the smaller the value of S(ω1, ω2), the closer the relation), and β denotes the semantic weight.
The application also provides a semantics-based method for correcting non-standard words in social media, comprising:
S1, semantic information construction: obtain the word vector of each word using the GloVe model, and compute the distance between any two word vectors;
S2, non-standard word identification: preprocess the tweets to obtain a full word list; compare each word in the list with the words in a dictionary set; a word in the list that is matched successfully is a standard word, otherwise it is a non-standard word;
S3, for the non-standard words identified in step S2, process each non-standard word to be corrected with PyEnchant and PyTypo to obtain the corresponding list of suggested corrections;
traverse the suggestion list; compute the minimum edit distance between each suggested word in the list and the non-standard word to be corrected, giving the similarity of form; compute the distance between the word vector of each suggested word and that of the non-standard word to be corrected, giving the semantic similarity;
combine the similarity of form and the semantic similarity according to the following formula to compute a score for each suggested word against the non-standard word to be corrected; for each non-standard word to be corrected, re-sort the suggestion list by score from low to high and replace the non-standard word with the first suggested word; if the suggestion list is empty, filter out the non-standard word directly;
S(ω1, ω2) = d + β × l × α
where S(ω1, ω2) denotes the closeness of the relation between two words (the smaller the value of S(ω1, ω2), the closer the relation), β denotes the semantic weight, α is the word-sense parameter, and l is the distance between word vectors.
Further, the method for the distance between described calculating any two term vector is:Euclidean distance or it is bright can Paderewski distance or Chebyshev's distance or manhatton distance or Mahalanobis generalised distance or cosine angle.
Further, pretreatment is specially described in step S2:
A1, filtering push away literary noise;It is described to push away literary noise and include:Idle character and mess code;
The word of A2, extraction comprising topic and the word comprising user name, obtain topic topic word and user name;
A3, remove repetitor;
A4, using all non-letter characters to English word carry out word segmentation processing.
Further, the dictionary set described in step S2 comprises at least:Each conventional english dictionary, by pretreatment obtain Topic word and user name.
Beneficial effects of the invention: the method characterizes the form of a word by the minimum edit distance and the similarity of semantic information by the cosine distance between word vectors, and uses both to select the expected correct word to replace a non-standard word. It achieves the following:
1. The existing PyEnchant tool is used to obtain a preliminary list of suggested spelling corrections, which greatly reduces computation, speeds up obtaining the set of expected correct words, and helps locate the best replacement quickly;
2. When choosing the correct word, not only the usual similarity of form is considered but also the fact that word vectors capture a word's semantic information. The cosine distance characterizes the semantic similarity between word vectors, so the criterion for how close a correct word is to a non-standard word is designed more comprehensively; the probability that the selected correct word is the intended word is greatly increased, and correction works better;
3. Different types of non-standard words are all taken into account: elongated words, which many common methods correct poorly, are handled with PyTypo; meaningless words are filtered out, then judged manually and added to the dictionary set where appropriate; and the interference of topic words and user names with correction is considered during tweet preprocessing, so they are collected and added to the dictionary set, making it more complete.
Brief description of the drawings
Fig. 1 is a flowchart of the scheme of the application;
Fig. 2 is a flowchart of non-standard word identification;
Fig. 3 is a flowchart of word correction.
Detailed description
To help those skilled in the art understand the technical content of the invention, the invention is further explained below with reference to the accompanying drawings.
Fig. 1 shows the flowchart of the scheme of the application. The technical solution of the application is a semantics-based method for correcting non-standard words in social media, comprising:
S1, semantic information construction: obtain the word vector of each word using the GloVe model, and compute the distance between any two word vectors;
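GloVe encodes the ratio of the co-occurrence probabilities of two words as a vector difference in the embedding space. For example, the larger the value of P(k|i)/P(k|j), the closer the semantics of word k are to word i; conversely, the closer P(k|i)/P(k|j) is to 0, the closer the semantics of word k are to word j.
Define P_{ij} as the probability that word j appears in the context of word i:
    P_{ij} = P(j \mid i) = \frac{X_{ij}}{X_i}    (1)
where X_{ij} is the number of times word j occurs in the context of word i in the corpus and X_i = \sum_k X_{ik}.
Define F(\omega_i, \omega_j, \omega_k) as the ratio of the probability that word k appears in the context of word i to the probability that word k appears in the context of word j:
    F(\omega_i, \omega_j, \omega_k) = \frac{P_{ik}}{P_{jk}}    (2)
where \omega_i, \omega_j, \omega_k denote the word vectors of words i, j, k respectively; each is a real-valued vector in a space whose dimension d is set in advance.
However, the purpose of GloVe is to encode the ratio of word co-occurrence probabilities as a vector difference, so the above formula is revised to:
    F(\omega_i - \omega_j, \omega_k) = \frac{P_{ik}}{P_{jk}}    (3)
In the above formula, the arguments on the left-hand side are vectors, while the right-hand side is a real number obtained from the co-occurrence matrix counted over the corpus. To simplify, the inner product of the vector \omega_k with the vector difference \omega_i - \omega_j turns the argument on the left-hand side from a vector into a real number:
    F\big((\omega_i - \omega_j)^{T} \omega_k\big) = \frac{P_{ik}}{P_{jk}}    (4)
To respect the symmetry of the co-occurrence matrix, the left-hand side of the above equation is required to satisfy:
    F\big((\omega_i - \omega_j)^{T} \omega_k\big) = \frac{F(\omega_i^{T} \omega_k)}{F(\omega_j^{T} \omega_k)}    (5)
The following equation is then obtained:
    F(\omega_i^{T} \omega_k) = P_{ik} = \frac{X_{ik}}{X_i}    (6)
Solving the equation (with F taken as the exponential function) gives:
    \omega_i^{T} \omega_k = \log P_{ik} = \log X_{ik} - \log X_i    (7)
The above formula is transformed again by introducing the bias terms b_i and b_k; the term \log X_i does not depend on k and is absorbed into b_i, and b_k keeps the expression symmetric:
    \omega_i^{T} \omega_k + b_i + b_k = \log X_{ik}    (8)
To make the above formula hold as closely as possible, it is optimized as a least-squares problem with a weighting function f(X_{ij}), which also keeps the logarithm from diverging when a co-occurrence count is zero. The final objective function is:
    J = \sum_{i,j=1}^{V} f(X_{ij}) \big(\omega_i^{T} \omega_j + b_i + b_j - \log X_{ij}\big)^2    (9)
where V is the size of the vocabulary.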
Here f(X_{ij}) depends on X_{ij}. When the co-occurrence count between two words is low, the value of f(X_{ij}) should be small, to reduce the influence of that pair on the objective function; to keep very frequent co-occurrences from dominating the overall training, f(X_{ij}) should remain constant once X_{ij} reaches a certain threshold.
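The behaviour of the weighting function and the weighted least-squares objective described above can be sketched in a few lines of Python. This is only an illustration: the threshold `x_max = 100` and exponent `0.75` are the values commonly used with GloVe and are not stated in this document, and the separate context vectors `W_ctx` and biases `b_ctx` follow the usual GloVe formulation rather than anything specified here.

```python
import math

def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weight f(x): small for rare co-occurrences, constant (1.0) once x >= x_max.

    x_max and alpha are assumed values, not taken from the patent text.
    """
    if x <= 0:
        return 0.0
    return (x / x_max) ** alpha if x < x_max else 1.0

def glove_loss(X, W, W_ctx, b, b_ctx):
    """Weighted least-squares objective over the non-zero co-occurrence counts in X."""
    total = 0.0
    for i, row in enumerate(X):
        for j, x_ij in enumerate(row):
            if x_ij > 0:  # zero counts get weight 0 and are skipped, so log(0) never occurs
                dot = sum(wi * wj for wi, wj in zip(W[i], W_ctx[j]))
                err = dot + b[i] + b_ctx[j] - math.log(x_ij)
                total += glove_weight(x_ij) * err * err
    return total
```

Rare pairs contribute little to the loss, and pairs above the threshold all contribute with the same weight, which is the behaviour the paragraph above asks for.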
Common measures of vector distance include Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, and the cosine of the included angle. The most widely used are Euclidean distance and the cosine of the included angle. After analysis and experiment, this embodiment of the application measures vector distance by computing the cosine of the angle between two word vectors a and b. The formula is:
    \cos\theta = \frac{a \cdot b}{\|a\| \, \|b\|} = \frac{\sum_{n=1}^{d} a_n b_n}{\sqrt{\sum_{n=1}^{d} a_n^2} \, \sqrt{\sum_{n=1}^{d} b_n^2}}    (10)
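A minimal Python sketch of the cosine measure follows. The text does not say how the cosine is turned into the distance l used later, so taking `1 - cos θ` as the distance (smaller means more similar) is an assumption made for this example.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine is undefined for a zero vector")
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Assumed distance: 1 - cosine similarity, so parallel vectors have distance ~0."""
    return 1.0 - cosine_similarity(u, v)
```

With this convention, orthogonal vectors have distance 1 and vectors pointing in the same direction have distance close to 0, regardless of their lengths.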
S2, non-standard word identification: preprocess the tweets to obtain a full word list; compare each word in the list with the words in a dictionary set; a word in the list that is matched successfully is a standard word, otherwise it is a non-standard word;
As shown in Fig. 2, a large number of tweets are obtained through the Twitter API, the raw text of the tweets is extracted from the database, and the following preprocessing is performed first:
1. filter text noise such as invalid characters and garbled text;
2. extract words containing "#" (topics) and user names following "@" (mentions) as reserve material;
3. de-duplicate;
4. segment into words;
5. filter out symbols, Arabic numerals, and the like.
When handling words that contain digits, the invention segments English words using all non-letter characters as delimiters. Because Arabic numerals are filtered out and segmentation is made more aggressive, compound words and words containing digits are effectively decomposed, which can itself be regarded as a form of normalization. The invention uses the open-source Python library WordSegment to decompose compound words.
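The preprocessing steps above can be sketched with the Python standard library. This is an illustrative sketch, not the applicant's implementation: the function name `preprocess`, the treatment of non-ASCII or non-printable characters as noise, the removal of URLs, and the removal of topic words and user names from the running text are all assumptions, and the compound-word splitting done with WordSegment is left out.

```python
import re

URL_RE = re.compile(r"https?://\S+")   # assumed noise pattern
HASHTAG_RE = re.compile(r"#(\w+)")     # "#" marks a topic
MENTION_RE = re.compile(r"@(\w+)")     # "@" marks a user name

def preprocess(tweets):
    """Return (word list, topic words, user names) for a list of raw tweets."""
    topics, users = set(), set()
    words, seen = [], set()
    for text in tweets:
        # Step 1: replace invalid or garbled characters with a space.
        text = "".join(ch if ch.isascii() and ch.isprintable() else " " for ch in text)
        text = URL_RE.sub(" ", text)
        # Step 2: collect topic words and user names, then drop them from the text.
        topics.update(t.lower() for t in HASHTAG_RE.findall(text))
        users.update(u.lower() for u in MENTION_RE.findall(text))
        text = MENTION_RE.sub(" ", HASHTAG_RE.sub(" ", text))
        # Steps 4-5: split on every non-letter character, which also drops digits.
        for token in re.split(r"[^A-Za-z]+", text):
            token = token.lower()
            # Step 3: keep each distinct word once.
            if token and token not in seen:
                seen.add(token)
                words.append(token)
    return words, topics, users
```

For example, a token such as `gr8` becomes `gr` because the digit acts as a delimiter, which shows why the later correction step is still needed.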
Preprocessing yields a full word list, and each word in the list is then judged to be standard or not. As noted above, there is currently no particularly good method for identifying non-standard words, so only the most basic dictionary traversal is used. Each word in the list is compared with each word in the dictionary; if the same word is matched in the dictionary, the word under test is considered standard, otherwise it is considered non-standard. The dictionary set in the invention contains not only several commonly used English dictionaries collected online but also the topic words and user names obtained during preprocessing. Given the nature of social media, where new Internet slang and rare words keep appearing, the invention improves the dictionary set by manually judging whether words are standard. The commonly used English dictionaries include the Longman, Oxford, and Collins dictionaries.
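The dictionary check described above amounts to a set-membership test. The sketch below uses a hypothetical helper `split_by_dictionary`; a Python set is used in place of a literal traversal of the dictionary, which gives the same result with less work.

```python
def split_by_dictionary(words, dictionary, topics=(), users=()):
    """Split words into (standard, non_standard) by lookup in the dictionary set.

    The dictionary set is the union of the English dictionaries and the
    topic words and user names collected during preprocessing.
    """
    lexicon = {w.lower() for w in dictionary}
    lexicon |= {t.lower() for t in topics}
    lexicon |= {u.lower() for u in users}
    standard, non_standard = [], []
    for word in words:
        if word.lower() in lexicon:
            standard.append(word)
        else:
            non_standard.append(word)
    return standard, non_standard
```

Words that are later judged manually to be acceptable can simply be added to `dictionary` before the next run.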
The list of non-standard words is finally obtained and passed on to the subsequent correction step.
S3, correct the non-standard words identified in step S2. The application proposes two solutions:
Solution 1: find the words corresponding to the N word vectors closest to the vector of the non-standard word; find the standard words among them, and select the standard word whose vector is closest to the vector of the non-standard word as the replacement. Specifically:
Compound words and words containing digits were normalized during preprocessing; misspelled words, elongated words, and meaningless words remain unprocessed. Meaningless words are handled by filtering. Misspelled and elongated words are corrected using the cosine distance between word vectors described above: for each incorrect word, find the words whose vectors are closest in cosine distance to its vector, and replace the non-standard word with a correct word among them.
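Solution 1 can be illustrated as a nearest-neighbour search over word vectors. The toy two-dimensional vectors, the function name `correct_by_neighbors`, and the use of `1 - cos θ` as the distance are assumptions for the example; in practice the vectors would come from a trained GloVe model.

```python
import math

def cosine_distance(u, v):
    """Assumed distance between word vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def correct_by_neighbors(word, vectors, lexicon, n=3):
    """Return the closest standard word among the n nearest vectors, or None."""
    if word not in vectors:
        return None
    target = vectors[word]
    ranked = sorted(
        (cosine_distance(target, vec), other)
        for other, vec in vectors.items()
        if other != word
    )
    for _, other in ranked[:n]:
        if other in lexicon:  # the first hit is the standard word with the smallest distance
            return other
    return None

# Toy vectors for illustration only.
toy_vectors = {"luv": [0.9, 0.1], "lov": [0.9, 0.12], "love": [1.0, 0.1], "hate": [-1.0, 0.2]}
print(correct_by_neighbors("luv", toy_vectors, {"love", "hate"}, n=2))
```

If none of the n nearest neighbours is a standard word, the function returns `None`, which corresponds to a word that cannot be corrected by this solution alone.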
However, selecting the correct word by cosine distance alone does not work very well. Similarity between words can be divided into similarity of form and similarity of meaning. In traditional word correction, the minimum edit distance is an effective measure of how similar two words are in form; the minimum edit distance between words is denoted d. As noted above, the cosine distance between word vectors reflects how semantically similar two words are; the cosine distance l between word vectors is multiplied by a parameter α to represent semantic similarity. The parameter α can be tuned manually and represents the quality of the word vectors.
This gives the following scoring formula:
S(ω1, ω2) = d + β × l × α (11)
Here β denotes the semantic weight. The lower the score between two words, the closer their relation. Because the parameter α and the semantic weight β are related (they enter the formula only as a product), let h = α × β to simplify the formula, which becomes:
S(ω1, ω2) = d + h × l (12)
The correct word whose score with the non-standard word is lowest under the above formula is used as the replacement, which achieves the correction. The formula not only takes the form of the word into account but also incorporates semantic information, so that cases where the non-standard word and the correct word differ greatly in form can be handled effectively; the effect of inaccurate lexical semantic information on the correction can be limited by adjusting the parameter.
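The score can be sketched in Python as follows. The minimum edit distance is implemented here as the standard Levenshtein distance (insertions, deletions, and substitutions each cost 1), which is an assumption since the text does not spell out the edit operations; the value of h in the example is likewise illustrative.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (dynamic programming, two rows)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = cur
    return prev[-1]

def score(d, l, h=1.0):
    """Score S = d + h * l; a lower score means a closer relation."""
    return d + h * l
```

As a worked example with h = 4: a candidate with d = 2 and l = 0.1 scores 2.4, while a candidate that is closer in form but semantically distant, with d = 1 and l = 0.9, scores 4.6, so the semantically closer word is preferred.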
Solution 2: if the formula alone is applied directly, the search range is too large, which sharply increases the workload and takes too long. The open-source Python library PyEnchant can process incorrect words and return a list of suggested corrections, and PyTypo handles elongated words well directly.
As shown in Fig. 3, the non-standard words to be corrected are first processed with PyEnchant and PyTypo to obtain the corresponding lists of suggested corrections. Each suggestion list is then traversed: the minimum edit distance between each suggested word and the word to be corrected gives the similarity of form, and the cosine distance between their word vectors gives the semantic similarity. Formula (12) combines the similarity of form and the semantic similarity into a score. The suggestion list is re-sorted by score from low to high, so that the lower the score of a suggested word, the earlier it appears, which gives a new suggestion list. The first word of each new list replaces the non-standard word, completing the correction. If the suggestion list is empty, the word to be corrected is taken to be a meaningless word; the invention filters it out directly, and these meaningless words are later judged manually to decide whether they can count as correct words, in which case they are added to the dictionary set.
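The re-ranking step of Solution 2 can be sketched as below. The suggestion list is passed in as a plain Python list so the example runs without extra packages; in practice it could come from PyEnchant's `Dict.suggest`. The function name `rerank`, the value of h, the `1 - cos θ` distance, and the fallback distance for words without a vector are assumptions, not details given in the text.

```python
import math

def edit_distance(a, b):
    """Levenshtein distance (assumed form of the minimum edit distance)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def cosine_distance(u, v):
    """Assumed distance between word vectors: 1 - cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / norm

def rerank(word, suggestions, vectors, h=4.0, missing_distance=1.0):
    """Return (best replacement or None, suggestions sorted by score S = d + h * l)."""
    if not suggestions:
        return None, []  # empty list: treat the word as meaningless and filter it out
    scored = []
    for cand in suggestions:
        d = edit_distance(word, cand)
        if word in vectors and cand in vectors:
            l = cosine_distance(vectors[word], vectors[cand])
        else:
            l = missing_distance  # assumed penalty when a vector is unavailable
        scored.append((d + h * l, cand))
    scored.sort()
    return scored[0][1], [cand for _, cand in scored]
```

Because only the handful of suggested words is scored, the vector comparison is restricted to a short list instead of the whole vocabulary, which is the speed-up this solution aims for.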
PyEnchant is based on the Enchant library and is a powerful open-source Python toolkit, mainly used for word checking and spelling correction. It can return a list of suggested corrections, usually consisting of several correct words, with the best candidates recommended by the system ranked first.
PyTypo is an open-source Python library, a toolkit designed specifically for processing non-standard words in tweets, elongated words in particular. It not only uses the dictionary construction method but also adds the removal of repeated substrings, which handles elongated words effectively. Its correction of misspelled words and meaningless words, however, is unsatisfactory.
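The idea of handling elongated words by removing repetitions can be illustrated with a regular expression. This is not PyTypo's implementation or API; it is a small sketch under the assumption that only repeated single characters are collapsed, with the hypothetical helper `normalize_elongated` checking each collapsed form against a dictionary set.

```python
import re

def normalize_elongated(word, lexicon):
    """Collapse runs of repeated characters until a dictionary word is found."""
    candidates = (
        word,                                  # already a dictionary word
        re.sub(r"(.)\1{2,}", r"\1\1", word),   # runs of 3 or more -> 2 (e.g. "coooool" -> "cool")
        re.sub(r"(.)\1+", r"\1", word),        # runs of 2 or more -> 1 (e.g. "sooooo" -> "so")
    )
    for cand in candidates:
        if cand in lexicon:
            return cand
    return None  # no collapsed form is a known word
```

Trying the two-character collapse before the one-character collapse keeps legitimate double letters, as in `cool`, where they exist.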
The existing PyEnchant tool is used to obtain a preliminary list of suggested spelling corrections, which greatly reduces computation, speeds up obtaining the set of expected correct words, and helps locate the best replacement quickly;
When choosing the correct word, not only the usual similarity of form is considered but also the fact that word vectors capture a word's semantic information. The cosine distance characterizes the semantic similarity between word vectors, so the criterion for how close a correct word is to a non-standard word is designed more comprehensively; the probability that the selected correct word is the intended word is greatly increased, and correction works better. Different types of non-standard words are all taken into account: elongated words, which many common methods correct poorly, are handled with PyTypo; meaningless words are filtered out, then judged manually and added to the dictionary set where appropriate; and the interference of topic words and user names with correction is considered during tweet preprocessing, so they are collected and added to the dictionary set, making it more complete.
Those of ordinary skill in the art will appreciate that the embodiments described here are intended to help the reader understand the principles of the invention, and that the scope of protection of the invention is not limited to these specific statements and embodiments. For those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent substitution, or improvement made within the spirit and principles of the invention shall fall within the scope of the claims of the invention.

Claims (9)

1. A semantics-based method for correcting non-standard words in social media, characterized by comprising:
S1, semantic information construction: obtain the word vector of each word using the GloVe model, and compute the distance between any two word vectors;
S2, non-standard word identification: preprocess the tweets to obtain a full word list; compare each word in the list with the words in a dictionary set; a word in the list that is matched successfully is a standard word, otherwise it is a non-standard word;
S3, for each non-standard word identified in step S2, find the words corresponding to the N word vectors closest to its word vector; find the standard words among these words, and select the standard word whose vector is closest to the vector of the non-standard word to replace it.
2. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that the distance between any two word vectors is computed as one of: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or the cosine of the included angle.
3. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that the preprocessing in step S2 is specifically:
A1, filter tweet noise; the tweet noise includes invalid characters and garbled text;
A2, extract words containing topics and words containing user names, obtaining topic words and user names;
A3, remove duplicate words;
A4, segment English words using all non-letter characters as delimiters.
4. The semantics-based method for correcting non-standard words in social media according to claim 3, characterized in that the dictionary set in step S2 includes at least: commonly used English dictionaries, and the topic words and user names obtained by preprocessing.
5. The semantics-based method for correcting non-standard words in social media according to claim 1, characterized in that step S3 also includes:
B1, use the minimum edit distance d between words to represent similarity of form;
B2, set a word-sense parameter α representing the quality of the word vectors, and multiply the distance l between word vectors by α to represent semantic similarity;
B3, compute, according to the following formula, the standard word most closely related to the non-standard word, and correct the non-standard word with that standard word;
S(ω1, ω2) = d + β × l × α
where S(ω1, ω2) denotes the closeness of the relation between two words (the smaller the value of S(ω1, ω2), the closer the relation), and β denotes the semantic weight.
6. A semantics-based method for correcting non-standard words in social media, characterized by comprising:
S1, semantic information construction: obtain the word vector of each word using the GloVe model, and compute the distance between any two word vectors;
S2, non-standard word identification: preprocess the tweets to obtain a full word list; compare each word in the list with the words in a dictionary set; a word in the list that is matched successfully is a standard word, otherwise it is a non-standard word;
S3, for the non-standard words identified in step S2, process each non-standard word to be corrected with PyEnchant and PyTypo to obtain the corresponding list of suggested corrections;
traverse the suggestion list; compute the minimum edit distance between each suggested word in the list and the non-standard word to be corrected, giving the similarity of form; compute the distance between the word vector of each suggested word and that of the non-standard word to be corrected, giving the semantic similarity;
combine the similarity of form and the semantic similarity according to the following formula to compute a score for each suggested word against the non-standard word to be corrected; for each non-standard word to be corrected, re-sort the suggestion list by score from low to high and replace the non-standard word with the first suggested word; if the suggestion list is empty, filter out the non-standard word directly;
S(ω1, ω2) = d + β × l × α
where S(ω1, ω2) denotes the closeness of the relation between two words (the smaller the value of S(ω1, ω2), the closer the relation), β denotes the semantic weight, α is the word-sense parameter, and l is the distance between word vectors.
7. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the distance between any two word vectors is computed as one of: Euclidean distance, Minkowski distance, Chebyshev distance, Manhattan distance, Mahalanobis distance, or the cosine of the included angle.
8. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the preprocessing in step S2 is specifically:
A1, filter tweet noise; the tweet noise includes invalid characters and garbled text;
A2, extract words containing topics and words containing user names, obtaining topic words and user names;
A3, remove duplicate words;
A4, segment English words using all non-letter characters as delimiters.
9. The semantics-based method for correcting non-standard words in social media according to claim 6, characterized in that the dictionary set in step S2 includes at least: commonly used English dictionaries, and the topic words and user names obtained by preprocessing.
CN201710829908.7A 2017-09-15 2017-09-15 Social media non-standard word correcting method based on semanteme Pending CN107577668A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710829908.7A CN107577668A (en) 2017-09-15 2017-09-15 Social media non-standard word correcting method based on semanteme

Publications (1)

Publication Number Publication Date
CN107577668A true CN107577668A (en) 2018-01-12

Family

ID=61036080

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710829908.7A Pending CN107577668A (en) 2017-09-15 2017-09-15 Social media non-standard word correcting method based on semanteme

Country Status (1)

Country Link
CN (1) CN107577668A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105095204A (en) * 2014-04-17 2015-11-25 阿里巴巴集团控股有限公司 Method and device for obtaining synonym
CN105824800A (en) * 2016-03-15 2016-08-03 江苏科技大学 Automatic Chinese real word error proofreading method
CN106126494A (en) * 2016-06-16 2016-11-16 上海智臻智能网络科技股份有限公司 Synonym finds method and device, data processing method and device
US20160335244A1 (en) * 2015-05-14 2016-11-17 Nice-Systems Ltd. System and method for text normalization in noisy channels
CN106506327A (en) * 2016-10-11 2017-03-15 东软集团股份有限公司 A kind of spam filtering method and device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
宋亚军等: "一种改进的社交媒体文本规范化方法", 《中文信息学报》 *

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108804423A (en) * 2018-05-30 2018-11-13 平安医疗健康管理股份有限公司 Medical Text character extraction and automatic matching method and system
CN108804423B (en) * 2018-05-30 2023-09-08 深圳平安医疗健康科技服务有限公司 Medical text feature extraction and automatic matching method and system
CN109670171B (en) * 2018-11-23 2021-05-14 山西大学 Word vector representation learning method based on word pair asymmetric co-occurrence
CN109670171A (en) * 2018-11-23 2019-04-23 山西大学 A kind of word-based term vector expression learning method to asymmetric co-occurrence
CN110032738A (en) * 2019-04-16 2019-07-19 中森云链(成都)科技有限责任公司 Microblogging text normalization method based on context graph random walk and phonetic-stroke code
CN110348497A (en) * 2019-06-28 2019-10-18 西安理工大学 A kind of document representation method based on the building of WT-GloVe term vector
CN110348497B (en) * 2019-06-28 2021-09-10 西安理工大学 Text representation method constructed based on WT-GloVe word vector
WO2021129411A1 (en) * 2019-12-23 2021-07-01 华为技术有限公司 Text processing method and device
CN113095072A (en) * 2019-12-23 2021-07-09 华为技术有限公司 Text processing method and device
EP4060526A4 (en) * 2019-12-23 2022-12-28 Huawei Technologies Co., Ltd. Text processing method and device
CN112597373A (en) * 2020-12-29 2021-04-02 科技谷(厦门)信息技术有限公司 Data acquisition method based on distributed crawler engine
CN112597373B (en) * 2020-12-29 2023-09-15 科技谷(厦门)信息技术有限公司 Data acquisition method based on distributed crawler engine
CN112765962A (en) * 2021-01-15 2021-05-07 上海微盟企业发展有限公司 Text error correction method, device and medium

Similar Documents

Publication Publication Date Title
CN107577668A (en) Social media non-standard word correcting method based on semanteme
US10515090B2 (en) Data extraction and transformation method and system
CN100517301C (en) Systems and methods for improved spell checking
CN109684642B (en) Abstract extraction method combining page parsing rule and NLP text vectorization
CN110543639A (en) English sentence simplification algorithm based on pre-training Transformer language model
US10915707B2 (en) Word replaceability through word vectors
CN107102983B (en) Word vector representation method of Chinese concept based on network knowledge source
CN108073571B (en) Multi-language text quality evaluation method and system and intelligent text processing system
CN108920482B (en) Microblog short text classification method based on lexical chain feature extension and LDA (latent Dirichlet Allocation) model
CN110209818B (en) Semantic sensitive word and sentence oriented analysis method
CN111967258B (en) Method for constructing coreference resolution model, coreference resolution method and medium
CN106959943B (en) Language identification updating method and device
CN114065758A (en) Document keyword extraction method based on hypergraph random walk
CN111507093A (en) Text attack method and device based on similar dictionary and storage medium
Lahbari et al. Arabic question classification using machine learning approaches
WO2014002774A1 (en) Synonym extraction system, method, and recording medium
CN113673252A (en) Automatic join recommendation method for data table based on field semantics
CN114997288A (en) Design resource association method
CN108319584A (en) A kind of new word discovery method based on the microblogging class short text for improving FP-Growth algorithms
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
CN111178091A (en) Multi-dimensional Chinese-English bilingual data cleaning method
CN107092595A (en) New keyword extraction techniques
CN115828854B (en) Efficient table entity linking method based on context disambiguation
CN104216880A (en) Term definition discriminating and analysis method based on Internet
CN113254473B (en) Method and device for acquiring weather service knowledge

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180112
