CN101763352A

CN101763352A - Unnormalized language processing method based on web mining

Info

Publication number: CN101763352A
Application number: CN200810207672A
Authority: CN
Inventors: 张霄凯; 杨帆; 史天艺; 尹航
Original assignee: Individual
Current assignee: Individual
Priority date: 2008-12-24
Filing date: 2008-12-24
Publication date: 2010-06-30

Abstract

The invention provides an unnormalized language processing method based on web mining. It relates to the field of computer data mining, in particular to network emotion mining, and discloses a method for processing unnormalized language on the network using minimally supervised learning. The common types of unnormalized language are reduced from six kinds to two disjoint kinds: typical unnormalized language and ambiguous unnormalized language. For typical unnormalized language the invention provides a pattern matching algorithm based on sequence covering, and for ambiguous unnormalized language it provides a classification algorithm based on feature extraction. The result is fully normalized written text, which facilitates subjective opinion mining and allows information such as emotions, opinions and suggestions to be extracted well.

Description

An unnormalized language processing method based on web mining

Technical field

The present invention relates to applications of computer data mining, and in particular to techniques for network emotion mining.

Background art

In recent years the Internet has acquired a very large user base. Through this platform, users frequently post personal viewpoints and comments, that is, subjective text that does not describe plain fact and whose main content is the opinions, emotions and attitudes of individuals, groups and organizations. Such viewpoints are of real practical value both to other users and to the enterprise that produces a given product, since they offer good reference and guidance. Processing this kind of assertive or evaluative text is therefore of practical significance. Because most of these comments are expressed from the user's subjective point of view, they tend to contain a novel kind of language: unnormalized language. Unnormalized language and noise are two key characteristics of subjective text. Informal language is widely used in network-mediated communication such as chat rooms, BBS and email, and the special language used in these environments is called informal language (NIL) expression. For example, in Internet chat "偶" (literally "idol") is commonly used in place of "我" ("I"), and "8错" (literally "8 mistake") stands for "不错" ("not bad").

This shows that unnormalized language carries a large amount of personal subjective information, and this information plays a vital role in mining users' reactions to and opinions on a social phenomenon, a commodity or public opinion. In traditional web mining, however, these non-standard words are all treated as noise and filtered out. When mining user comments on a product, if the users write in popular Internet slang it becomes difficult to recover what the comments actually say. To mine users' emotional expression, unnormalized language must therefore first be normalized.

In some countries, typically English-speaking ones, researchers have already studied this kind of problem, whereas work on unnormalized language in China is still at an early stage. A widely used approach to unnormalized language is pattern matching, which comes mainly from Western researchers. The approach is simple and reasonably targeted, mainly because text in Western languages consists essentially of characters and character strings. In the present processing flow, the indispensable first step is collecting non-standard words. All corpora used for testing and evaluation in this patent come from Internet forums, which contain a large amount of subjective unnormalized language and are therefore convenient for this work. So that unnormalized language can ultimately be processed automatically, a supervised learning method is adopted: unnormalized language is first identified and judged manually, the machine then learns from these judgments, and finally the processing is carried out automatically.

All test corpora are ultimately drawn from text on Baidu's mhkc. Baidu's mhkc was chosen mainly because it is very popular and its content is almost entirely subjective, so it contains correspondingly more unnormalized language. To collect text rich in unnormalized language, a purpose-written web crawler downloads content from Baidu's mhkc. The aim is first to determine the kinds and forms of unnormalized language and then to obtain the data sets needed for training and testing. Crawling the required web pages is a very important step; its details can be found in other patents. For relevance and timeliness, only web pages from within the past year are extracted. Unnormalized language is then screened out of these pages and recorded in the form of a dictionary. Once the non-standard word dictionary has been built, corpora for training and testing can be collected in a targeted way, because the dictionary makes the crawl itself targeted and avoids searching for the required data set among large amounts of irrelevant text. This is done by adding a keyword-feature step to the crawling flow, where the keywords are the non-standard words previously screened into the NIL dictionary. The keywords can be stored in a text file; for efficiency, the program can hold them in a set. This guarantees that every crawled page contains at least one non-standard word and reduces tedious work such as manual filtering.
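
The keyword-feature step described above might look roughly like the following Python sketch. The function names and the toy pages are hypothetical and do not come from the patent, and plain substring matching is assumed here only for simplicity.

```python
def contains_nil_keyword(page_text, nil_keywords):
    """Return True if the page text contains at least one dictionary keyword."""
    return any(keyword in page_text for keyword in nil_keywords)


def filter_pages(pages, nil_keywords):
    """Keep only the pages that contain at least one non-standard word."""
    keyword_set = set(nil_keywords)  # keywords held in a set, as the text suggests
    return [page for page in pages if contains_nil_keyword(page, keyword_set)]


# Toy data (hypothetical): two crawled page texts and a small keyword set.
pages = [
    "3q for the help, OMG this phone is great",
    "The meeting starts at nine.",
]
kept = filter_pages(pages, {"3q", "OMG"})
```

With these toy inputs only the first page is kept. A real crawler would apply the same check before storing a page, and would probably need word-level rather than substring matching.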

Once the unnormalized language dictionary has been built, non-standard words can be identified and judged, and all subsequent work rests on this basis. The processing flow for unnormalized language is described next.

The concrete steps are:

1) Specify the files to be processed; they can be read in batches.

2) Convert the hypertext to plain text format and then apply the corresponding preprocessing.

3) Extract and normalize the words contained in the unnormalized language using the method proposed in this patent.

4) Output the processed content in batches.

5) Evaluate the output against the experimental evaluation criteria.

The evaluation criteria are three metrics commonly used in natural language processing: precision, recall and F value. These three metrics are used to verify the correctness and effectiveness of the proposed method. The method processes unnormalized language with minimal supervision and achieves good results.
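
The three metrics can be computed from raw counts as in the sketch below. The text does not state how the F value is weighted, so the balanced harmonic mean of precision and recall is assumed here; the function name and the example counts are illustrative only.

```python
def precision_recall_f(true_positives, false_positives, false_negatives):
    """Return (precision, recall, F value) from raw counts.

    Assumption: the F value is the balanced harmonic mean of precision and recall.
    """
    retrieved = true_positives + false_positives   # items the system marked as non-standard
    relevant = true_positives + false_negatives    # items that really are non-standard
    precision = true_positives / retrieved if retrieved else 0.0
    recall = true_positives / relevant if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_value = 2 * precision * recall / (precision + recall)
    return precision, recall, f_value


# Hypothetical counts: 8 correct detections, 2 false alarms, 8 misses.
precision, recall, f_value = precision_recall_f(8, 2, 8)
```

The figures reported in the embodiment are consistent with this balanced definition of the F value.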

Summary of the invention

The present invention proposes a minimally supervised method for processing unnormalized language, that is, a method that obtains good results from as little training data as possible. The method is easy to operate and does not require mass storage. It effectively solves the problem of unnormalized language in text mining, so that useful expressions of user emotion, opinion and comment can be extracted.

The proposed method divides unnormalized language into two parts, typical unnormalized language and ambiguous unnormalized language, and judges each with a different procedure. Words judged to be unnormalized language are finally replaced using the non-standard word dictionary that has been built, which achieves normalization. Two definitions are introduced first:

Typical unnormalized language: words that consist of letters, digits or mixed abbreviated forms and that stand for a meaning expressed in standard language. Such words are highly irregular in form and generally do not appear in normal text such as magazines, newspapers and periodicals. This kind of expression occurs only in communication on the Internet and generally not in formal writing, for example OMG (oh, my god) or PMP (flattering). Numeric and mixed forms such as 3q (thank you) are also placed in this category.

Ambiguous unnormalized language: an ambiguous non-standard word is one that is literally a word of normal form but that, in the context where it occurs, actually expresses the meaning of a corresponding non-standard word. Such words are also treated as non-standard and are assigned to the ambiguous unnormalized language category. This kind of expression appears not only on the network; the same word also occurs frequently in formal writing, so the standard word has become ambiguous. The main causes are inaccurate pronunciation by the user, or an input method that does not offer a suitable candidate, so that for the sake of fast communication the user substitutes a standard-looking form for the intended meaning. Examples of this kind of NIL expression include "稀饭" (literally "gruel", used for "喜欢", "to like"), "斑竹" (literally "mottled bamboo", used for "版主", "forum moderator") and "粉丝" (literally "bean vermicelli", used for "fans" of a singer or film), all fairly typical ambiguous non-standard words.

During processing, it is first determined whether a sentence contains a suspected non-standard word, and then whether that word really is non-standard. Two cases are distinguished here, because typical and ambiguous non-standard words are handled by different methods. If the word is judged to be non-standard, it is replaced with its equivalent expression from the previously built dictionary, which achieves normalization of the non-standard word.

The concrete processing procedure comprises the following steps:

1) For each word in the text, determine whether it appears in the non-standard word dictionary. This dictionary is built by manual identification and extraction during the initial web collection step, and it is stored with each non-standard word as the primary key.

2) If the word is in the unnormalized language dictionary, determine whether it is a typical or an ambiguous non-standard word. Non-standard words are of many kinds, and for a preliminary judgment they were divided into six broad classes. Handling these classes one by one is cumbersome and the classes overlap, so reducing them to two classes for ease of processing is the key to the method.

3) For typical non-standard words, judge and replace them using a pattern matching method based on the sequence covering algorithm. The algorithm is described in detail in the embodiment section, and the processing flow of the invention is shown in Fig. 1.

4) For ambiguous non-standard words, identify and judge them using a classification-based method. Solving this class of problem is the focus and the difficulty of Chinese non-standard word processing. As stated in the definition, the literal form alone cannot show whether the word is unnormalized language, because the same word also appears in standard magazines, newspapers and periodicals. Whether it is a non-standard word must be decided from context. Analyzing contextual relations involves semantic processing, which is tedious and complex, and because every word in Chinese is used differently, the workload of a semantic approach would be too large. The method proposed by the present invention solves this problem well and is described in detail in the embodiment section.
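
The four steps above can be sketched as a small dispatch loop. This is an illustration under stated assumptions, not the patent's implementation: the dictionary layout, the function names and the two stand-in judges are all hypothetical, and English glosses replace the Chinese examples.

```python
# Hypothetical dictionary keyed by non-standard word: word -> (type, standard replacement).
NIL_DICTIONARY = {
    "3q": ("typical", "thank you"),
    "OMG": ("typical", "oh my god"),
    "gruel": ("ambiguous", "like"),
}


def normalize_tokens(tokens, is_typical_nil, is_ambiguous_nil):
    """Replace dictionary words that the matching judge confirms as non-standard."""
    normalized = []
    for index, token in enumerate(tokens):
        entry = NIL_DICTIONARY.get(token)
        if entry is None:                  # step 1: word is not in the dictionary
            normalized.append(token)
            continue
        kind, replacement = entry          # step 2: typical or ambiguous
        judge = is_typical_nil if kind == "typical" else is_ambiguous_nil
        # steps 3 and 4: replace only if the judge confirms the word in this context
        normalized.append(replacement if judge(tokens, index) else token)
    return normalized


def always_yes(tokens, index):
    """Stand-in for the rule-based judge of typical non-standard words."""
    return True


def first_person_context(tokens, index):
    """Stand-in for the classifier: accept only when a first-person pronoun is present."""
    return "I" in tokens


result = normalize_tokens(["I", "gruel", "this", "phone", "3q"], always_yes, first_person_context)
```

Passing the two judges in as parameters keeps the dispatch separate from the rule learner and the classifier, which are described in the embodiment.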

Description of drawings

Fig. 1 is the processing flow chart of unnormalized language.

Embodiment

The invention is mainly a method for processing unnormalized language with a minimal workload. It is further described below.

For typical unnormalized language, the present invention adopts a pattern matching method based on the sequence covering algorithm, implemented as follows. The words to be handled here are typical non-standard words. To avoid being confined to one particular field, the collected data must not be concentrated in a single domain, for example non-standard words gathered only from a particular automobile forum or mobile phone forum. For fairness, the extracted data are domain-independent. The following algorithm is used to extract the rules for identifying this kind of non-standard NIL.

1) Let S be the training data set, sen an example sentence in S, and R the rule set, initially empty. If the keyword contained in sen is a non-standard word, sen is labeled a positive example; otherwise it is labeled a negative example.

2) Loop until the sentence set S is empty. For each sen ∈ S, proceed to 3) or 4).

3) If the word contained in sen is a positive example, extract a rule r and add r to R, then delete the other sentences that this rule covers.

4) Otherwise, extract a rule, simplify the rules in R according to it, and delete the other sentences that this rule covers.
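
One possible reading of steps 1) to 4) is sketched below. The text does not define what a rule looks like or how rules are simplified, so two assumptions are made: a rule is the keyword together with its immediate left and right neighbours, and "simplifying" R with a negative example means dropping any identical rule from R. All names are hypothetical.

```python
def extract_rule(tokens, index):
    """Assumed rule shape: (left neighbour, keyword, right neighbour)."""
    left = tokens[index - 1] if index > 0 else None
    right = tokens[index + 1] if index + 1 < len(tokens) else None
    return (left, tokens[index], right)


def covers(rule, example):
    """A rule covers an example whose keyword occurs in the same context."""
    tokens, index, _label = example
    return extract_rule(tokens, index) == rule


def learn_rules(examples):
    """Sequential covering over (tokens, keyword_index, is_positive) examples."""
    remaining = list(examples)                            # the sentence set S
    rules = []                                            # the rule set R, initially empty
    while remaining:                                      # step 2: loop until S is empty
        tokens, index, is_positive = remaining[0]
        rule = extract_rule(tokens, index)
        if is_positive:                                   # step 3: add the rule to R
            if rule not in rules:
                rules.append(rule)
        else:                                             # step 4: prune R (assumed meaning)
            rules = [r for r in rules if r != rule]
        # delete every sentence the rule covers, including the current one
        remaining = [ex for ex in remaining if not covers(rule, ex)]
    return rules


def matches(rules, tokens, index):
    """Judge a new occurrence: it is non-standard if some learned rule covers it."""
    return extract_rule(tokens, index) in rules


examples = [
    (["3q", "very", "much"], 0, True),
    (["3q", "very", "much"], 0, True),       # covered by the first rule and deleted
    (["OMG", "so", "cool"], 0, True),
    (["model", "PMP", "player"], 1, False),  # "PMP" used as an ordinary product name
]
rules = learn_rules(examples)
```

Because each extracted rule always covers the sentence it came from, the set shrinks on every pass and the loop terminates. Richer rule shapes, such as wider context windows or wildcards, would fit the same loop.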

As noted above, ambiguous unnormalized language cannot be handled with the method just described. A classification-based method is proposed for it instead, with the following flow:

Features are first extracted from sentences whose ambiguous non-standard words have been identified manually. The present invention extracts the following five features:

1) Typical unnormalized language. If the sentence already contains a confirmed non-standard expression, the author has, so to speak, a prior record of writing this way, so the likelihood that the sentence also contains an ambiguous NIL is greatly increased.

2) Words that express opinions or suggestions, or that carry emotion. A sentence containing these features is much more subjective. When an ambiguous NIL expression appears in subjective text, it is very likely to be a non-standard word in that sentence.

3) First and second person. Non-standard words come entirely from personal subjectivity and accompany comments and assertions about things, so commenters expressing their own views often add the first or second person to what they write.

4) Irregular use of punctuation marks. Users of non-standard words tend to write informally and fairly arbitrarily, and this also shows in highly irregular use of punctuation marks, which formal writing would not exhibit.

5) Emotionally colored words and punctuation marks. If a sentence containing an ambiguous NIL expression also contains symbols that express emotion, the likelihood that the ambiguous NIL is a non-standard expression is greatly increased. An exclamation mark, for example, indicates that the author is surprised or excited about what is being expressed, while a question mark indicates that the author has doubts about something. Both are emotional expressions of the individual commenter.
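
The five features can be turned into a binary vector along the following lines. The word lists, the tokenizing regular expression and the rule "two or more consecutive punctuation marks count as irregular" are assumptions made for this sketch; the text does not specify how each feature is detected.

```python
import re

# Hypothetical lexicons; real ones would come from the NIL dictionary and sentiment resources.
TYPICAL_NIL = {"3q", "OMG"}
OPINION_WORDS = {"like", "hate", "suggest", "think"}
PERSON_PRONOUNS = {"i", "we", "you", "me"}
EMOTION_MARKS = {"!", "?"}


def extract_features(sentence):
    """Map a sentence to the five binary features listed above."""
    tokens = re.findall(r"\w+|[^\w\s]", sentence)   # words and single punctuation marks
    lowered = [token.lower() for token in tokens]
    return [
        int(any(token in TYPICAL_NIL for token in tokens)),      # 1) typical NIL present
        int(any(token in OPINION_WORDS for token in lowered)),   # 2) opinion or emotion word
        int(any(token in PERSON_PRONOUNS for token in lowered)), # 3) first or second person
        int(bool(re.search(r"[^\w\s]{2,}", sentence))),          # 4) run of punctuation marks
        int(any(token in EMOTION_MARKS for token in tokens)),    # 5) emotional punctuation
    ]


informal = extract_features("I gruel this phone!!! 3q")
formal = extract_features("The gruel is hot.")
```

With these toy lexicons the first sentence triggers features 1, 3, 4 and 5, while the second triggers none. Feature 5 here checks punctuation only, since emotional words are already counted under feature 2.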

These features are classified with the naive Bayes method and a support vector machine. The concrete classification methods are covered extensively in other patents and are not part of this invention, so they are not described here.

Manually identified ambiguous non-standard words are first mapped into a vector space model, and the known classes are then learned. Each non-standard word to be processed is mapped into the same vector space and classified, which decides whether it is ambiguous unnormalized language. If it is, it is replaced with the corresponding standard word from the non-standard word dictionary, which achieves normalization.
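
As an illustration of the learn-then-classify flow, the sketch below trains a small Bernoulli naive Bayes model with Laplace smoothing on five-element binary feature vectors. It is a generic textbook formulation rather than the classifier used in the patent, the support vector machine variant is not shown, and the training vectors and labels are invented toy data.

```python
import math


def train_naive_bayes(vectors, labels):
    """Estimate a class prior and per-feature probabilities for each label."""
    model = {}
    n_features = len(vectors[0])
    for label in set(labels):
        rows = [vector for vector, y in zip(vectors, labels) if y == label]
        prior = len(rows) / len(vectors)
        # Laplace-smoothed probability that each binary feature equals 1 in this class.
        probs = [
            (sum(row[j] for row in rows) + 1) / (len(rows) + 2)
            for j in range(n_features)
        ]
        model[label] = (prior, probs)
    return model


def predict(model, vector):
    """Return the label with the highest log posterior score."""
    def log_score(label):
        prior, probs = model[label]
        return math.log(prior) + sum(
            math.log(p if x else 1 - p) for x, p in zip(vector, probs)
        )
    return max(model, key=log_score)


# Toy training set (hypothetical): True means the word is an ambiguous non-standard use.
vectors = [
    [1, 0, 1, 1, 1], [0, 1, 1, 0, 1], [1, 1, 0, 1, 0],
    [0, 0, 0, 0, 0], [0, 0, 0, 0, 0], [0, 1, 0, 0, 0],
]
labels = [True, True, True, False, False, False]
model = train_naive_bayes(vectors, labels)
```

A word classified as `True` would then be replaced with its standard form from the dictionary, and one classified as `False` would be left unchanged.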

The aim of the method is minimal supervision: only a small amount of training data is needed to reach good results. 9362 sentences were finally used as the training and test data set. For typical unnormalized language, the precision is 87.1%, the recall is 68.2% and the F value is 76.5%. For ambiguous unnormalized language, evaluated with 10-fold cross-validation, the precision is 81.3%, the recall is 88.5% and the F value is 84.7%. These results verify the validity and practicality of the invention.

Claims

1. A method for processing network unnormalized language, characterized in that good processing results are obtained from minimal training data. Unnormalized language commonly used on the network is divided into two broad classes, typical unnormalized language and ambiguous unnormalized language, and a different processing method is adopted for each type. The aim is to obtain the best possible normalization result from as little training data as possible.

2. A method for processing network unnormalized language, wherein typical unnormalized language is defined as words that consist of letters, digits or mixed abbreviated forms and that stand for a meaning expressed in standard language. Such words are highly irregular in form and generally do not appear in normal text such as magazines, newspapers and periodicals; this kind of expression occurs only in communication on the Internet and generally not in formal writing. Typical unnormalized language is processed with a pattern matching method based on sequence covering: for the sentences in the training set, a contextual rule for the occurrence of the non-standard word is extracted, the sentences that the rule covers are deleted, and this is repeated until the loop terminates.

3. A method for processing network unnormalized language, wherein ambiguous unnormalized language is defined as words that are literally of normal form but that, in the context where they occur, actually express the meaning of a corresponding non-standard word. For this class of unnormalized language the following five features are extracted: typical unnormalized language, words expressing opinion, first and second person pronouns, irregular use of punctuation marks, and punctuation marks that carry emotion. A classifier then uses these features to judge whether the word is unnormalized language.