CN106407235B

CN106407235B - A semantic dictionary construction method based on comment data

Info

Publication number: CN106407235B
Application number: CN201510469211.4A
Authority: CN
Inventors: 林小俊; 张猛; 暴筱; 焦宇
Original assignee: Beijing Zhong Hui Information Technology Ltd By Share Ltd
Current assignee: Beijing Yishang Huiping Network Technology Co ltd
Priority date: 2015-08-03
Filing date: 2015-08-03
Publication date: 2019-06-11
Anticipated expiration: 2035-08-03
Also published as: CN106407235A

Abstract

The present invention relates to a semantic dictionary construction method based on comment data. The steps include: 1) constructing a seed semantic dictionary from a small amount of comment data; 2) segmenting the comment data into words; 3) determining the semantic category of each word in the comment data and replacing the word with its semantic category label; 4) generating templates from the name of each semantic category and the concrete words each semantic category contains; 5) applying the templates to the comment data after semantic category label replacement to extract the semantic words of each semantic category; 6) scoring each template according to its importance, generalizability, and accuracy; 7) selecting the highest-scoring templates, calculating the score of the semantic words extracted by each template, and then selecting the highest-scoring semantic words to expand the semantic dictionary; 8) iterating steps 3) to 7), and obtaining the final semantic dictionary and template library after termination. The present invention can obtain a fairly large-scale semantic dictionary in a relatively short time, and can extract multiple semantic categories simultaneously.

Description

A semantic dictionary construction method based on comment data

Technical field

The invention belongs to the fields of information technology and data mining, and in particular relates to a semantic dictionary construction method based on comment data.

Background technique

With the rapid development of e-commerce, online comments have gradually entered public view, influencing users' choices and, increasingly, brands themselves. Taking the hotel industry as an example, hotels want to obtain user comment feedback by technical means to guide brand management and operations management and to improve brand image and service quality. Users want to read other users' comments to understand a hotel's strengths and weaknesses as an important reference for booking. TripAdvisor research shows that more than 85% of users attach great importance to a hotel's word-of-mouth reputation, and nearly 90% of users read user reviews before making a booking decision.

More and more users are happy to share their opinions or experiences on the Internet, and this kind of comment data is growing explosively. Manual methods alone can hardly cope with collecting and processing the massive volume of online comments. There is therefore an urgent need for computers to help users quickly obtain and organize this comment information, and sentiment analysis (Sentiment Analysis) technology has emerged in response. Sentiment analysis is not only a research hotspot in the field of information processing but has also attracted wide attention in industry.

To analyze the sentiment of a comment, the valuable sentiment information elements in the comment must first be identified. These include: 1) evaluation objects, such as "hotel" and "price"; 2) evaluative expressions, such as "very good" and "fairly clean". Evaluative expressions include sentiment words (such as "good" and "clean"), degree adverbs (such as "very"), common adverbs (such as "mostly"), and negation words (such as "not"). An evaluative expression not only conveys sentiment; its modifiers also strengthen, weaken, or reverse the sentiment polarity of the expression, which makes the expression of sentiment richer.

The importance of sentiment words in sentiment analysis is self-evident. In many cases, however, the polarity of an individual sentiment word is ambiguous. For example, "high" in "the restaurant's prices are very high" is negative when it describes "restaurant price", whereas "high" in "the restaurant staff's work efficiency is very high" is positive when it describes "work efficiency". Considering sentiment words alone is therefore far from enough in text sentiment analysis; the collocation of evaluation object and sentiment word must also be considered, for example binary collocations such as <price, high> and <work efficiency, high>.

All of the above semantic dictionaries, whether sentiment word dictionaries, degree adverb dictionaries, or collocation dictionaries, play a very important role in text sentiment analysis. Dictionary resources collected and compiled purely by hand are currently insufficient in scale and very inefficient to build. A better approach is a corpus-based statistical or machine learning method; although such a method introduces some noise, the cost of manual intervention at that stage is relatively low. Bootstrapping is a semi-supervised machine learning method that is widely used in information extraction and knowledge base construction, and it can also be applied to semantic dictionary construction.

Summary of the invention

In view of the above problems, the present invention provides a semantic dictionary construction method based on comment data that can generate a high-quality semantic dictionary and template library.

The technical solution adopted by the invention is as follows:

A semantic dictionary construction method based on comment data includes the following steps:

1) obtaining comment data, obtaining the semantic words of each semantic category from a small amount of comment data, and constructing a seed semantic dictionary;

2) performing word segmentation on the sentences of the comment data;

3) for the segmented comment data, determining the semantic category of each word and replacing the word with its semantic category label;

4) splitting the label-replaced comment data into sentences, and generating templates from the name of each semantic category and the concrete words each semantic category contains;

5) applying the templates to the comment data after semantic category label replacement to extract the semantic words of each semantic category;

6) scoring each template according to its importance, generalizability, and accuracy;

7) selecting the highest-scoring templates, calculating the score of the semantic words extracted by each template from the selected templates and their scores, and then selecting the highest-scoring semantic words to expand the semantic dictionary;

8) iterating steps 3) to 7), terminating the iteration when the selected semantic words are incorrect, obtaining the final semantic dictionary, and forming a template library from the templates.

Further, step 1) obtains online comment data from comment websites with a focused crawler, and a small number of comments are checked manually to collect the semantic words of each semantic category and form the seed dictionary.

Further, step 2) first segments the text with a dictionary-based maximum matching segmentation method, and then applies a sequence labeling segmentation method to the parts where segmentation is ambiguous to obtain the correct result. The sequence labeling segmentation method converts the word segmentation problem into a character classification problem: each character is assigned a position class label according to its position within a word, and the segmentation of the sentence is determined from the resulting label sequence.

Further, the semantic categories in step 3) include evaluation object words, evaluation attribute words, sentiment words, degree adverbs, common adverbs, negation words, and insertion words.

Further, step 4) splits sentences at the three punctuation marks ".", "!", and "?", and limits the template length to a minimum of 3 words and a maximum of 7 words.

Further, when step 5) extracts the semantic words of each semantic category, if the template corresponding to a comment segment differs from a template obtained in step 4) by only one word, that word is taken as an instance word of the corresponding semantic category.

Further, the highest-scoring templates in step 7) are the top 5% to 10% of templates by score, and the highest-scoring semantic words are the top 5% to 10% of semantic words by score.

Further, after step 8), the polarity of the sentiment words in the semantic dictionary, and the collocation polarity of sentiment words with evaluation object words and evaluation attribute words, are determined manually; during manual determination, the comment segments corresponding to the relevant template serve as the basis for judgment.

Compared with purely manual collection, the corpus-based method of the present invention is efficient and can compile a fairly large-scale semantic dictionary in a relatively short time. Compared with traditional Bootstrapping methods, the template scoring proposed by the present invention can effectively account for template nesting. Compared with traditional Bootstrapping-based semantic dictionary construction methods, the present invention can extract multiple semantic categories simultaneously.

Brief description of the drawings

Fig. 1 is a flow chart of the steps of the semantic dictionary construction method based on comment data of the present invention.

Specific embodiment

To make the above objects, features, and advantages of the present invention clearer and easier to understand, the invention is further described below through specific embodiments and the accompanying drawing.

To build the semantic dictionary, the present invention uses a method based on Bootstrapping. Bootstrapping, also called self-expansion, is a semi-supervised machine learning method that can extract a semantic dictionary and templates at the same time. The idea rests on the observation that extraction templates can be used to extract new instances, and these instances can in turn be used to extract new templates. The advantage of the method is that it needs no annotated training corpus, only a small number of seeds. Seed words are first initialized through manual intervention; templates are obtained from the seed words, new seed words are then obtained from the templates, and the process iterates. Each round of iteration generates new labeled data: the best words are added to the corresponding semantic dictionary, and the best templates are added to the template library. The model is relearned from the new labeled data, which in turn yields new data, and the cycle repeats until it converges and terminates, yielding more seed words and templates. This is the most basic Bootstrapping algorithm (or process).

The semantic categories of the semantic dictionary include evaluation object words (Obj), evaluation attribute words (Attr), sentiment words (Sent), degree adverbs (Dgr), common adverbs (Adv), negation words (Neg), insertion words (Inter), and so on. Each semantic category contains a number of words, and a template is a sequence made up of semantic category names or concrete words.

The present invention improves on the steps of the existing Bootstrapping method. Fig. 1 is a flow chart of the method of the present invention, and the specific implementation steps are as follows:

Step 1: data preparation. Online comment data are collected with a focused crawler from mainstream comment websites such as Ctrip.

Step 2: seed dictionary creation. A small number of comments (for example, 500) are checked manually, and the semantic words of each semantic category are collected to form the seed dictionary, denoted SemLex.

Step 3: comment segmentation. Chinese word segmentation is a basic step of Chinese natural language processing. The present invention combines dictionary-based segmentation with statistical segmentation: a dictionary-based maximum matching method is applied first, and a sequence labeling method is then applied to the parts where segmentation is ambiguous.

In dictionary-based maximum matching segmentation, given a dictionary and a Chinese character sequence to be segmented, the method repeatedly finds the longest matching dictionary word; a character with no match is treated as a single-character word, and this continues until the whole sequence has been processed. Depending on the scanning direction, the method is divided into forward maximum matching (matching from left to right) and reverse maximum matching (matching from right to left). For example, for the sequence "当原子结合成分子时" ("when atoms combine into molecules"), the forward maximum matching result is "当 | 原子 | 结合 | 成 | 分子 | 时" ("when | atom | combine | into | molecule | time"), while the reverse maximum matching result is "当 | 原子 | 结合 | 成分 | 子时" ("when | atom | combine | ingredient | midnight hour").
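
The two scanning directions described above can be sketched as follows. This is a minimal illustration rather than the implementation used in the invention: the function names, the `max_len` window limit, and the toy ASCII dictionary are all assumptions introduced here.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Greedy left-to-right longest-match segmentation."""
    words, i = [], 0
    while i < len(text):
        # Try the longest window first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


def backward_max_match(text, dictionary, max_len=4):
    """Greedy right-to-left longest-match segmentation."""
    words, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                words.append(piece)
                j -= size
                break
    return words[::-1]  # words were collected from the end backwards


toy_dictionary = {"ab", "bc"}  # hypothetical dictionary
forward = forward_max_match("abc", toy_dictionary)
backward = backward_max_match("abc", toy_dictionary)
```

With this toy dictionary the two directions disagree on "abc", which is the kind of disagreement that marks a potentially ambiguous span.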

Clearly, neither forward nor reverse maximum matching handles segmentation ambiguity well. The two can be combined into bidirectional maximum matching: the places where the forward and reverse results disagree are often the places of potential ambiguity. Resolving an ambiguity generally requires the specific context. Supervised sequence labeling can fully exploit rich contextual features, so the present invention introduces a sequence labeling method to disambiguate in ambiguous cases. The method converts the word segmentation problem into a character classification problem: each character is assigned a position class label according to its position within a word, namely word-initial, word-medial, word-final, or single-character word, denoted B (Begin), M (Middle), E (End), and S (Single) respectively. From such a label sequence the segmentation of the sentence is easily determined: given the label sequence of the characters, each character sequence whose labels match the regular expression "S" or "B(M)*E" forms one word, which completes the segmentation. To carry out the sequence labeling task, the present invention uses a conditional random field (Conditional Random Fields, CRF) model, which is widely used in natural language processing and has achieved great success. The specific features are: the previous character, the current character, the next character, the previous character together with the current character, the current character together with the next character, and binary features based on these unary features. Using these extracted features, the conditional random field model predicts the class label of each character.
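
The decoding step from a B/M/E/S label sequence to words can be sketched with the regular expression given above. The function name `tags_to_words` and the ASCII example are assumptions for illustration; predicting the labels themselves (the CRF model) is outside this sketch.

```python
import re


def tags_to_words(chars, tags):
    """Split a character sequence into words from its B/M/E/S label sequence."""
    if len(chars) != len(tags):
        raise ValueError("chars and tags must have the same length")
    tag_string = "".join(tags)
    words = []
    # A word is either a single "S", or "B", any number of "M", then "E".
    for match in re.finditer(r"S|BM*E", tag_string):
        words.append("".join(chars[match.start():match.end()]))
    return words


words = tags_to_words("abcdef", "BESBME")
```

Label runs that match neither pattern (for example a "B" with no closing "E") are skipped silently in this sketch; a production decoder would need a policy for such sequences.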

Both the dictionary for the maximum matching method and the training corpus for the supervised conditional random field model come from 100,000 hotel comments manually annotated for the present invention.

Step 4: semantic category label replacement. In the segmented comments, the semantic category of each word is determined and the word is replaced with its semantic category label. For example, "restaurant | | price | very | high" is replaced with "Obj | | Attr | Dgr | Sent". "Start" and "End" labels are added at the beginning and end of each comment, and punctuation marks in the comment other than ".", "!", and "?" are replaced with the "Punc" label.
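
The label replacement in Step 4 can be sketched as a dictionary lookup over the tokens. The function name, the layout of `sem_lex` (category label mapped to a set of words), and the particular set of "other" punctuation marks are assumptions made for this illustration.

```python
def replace_with_labels(tokens, sem_lex):
    """Replace known words with their semantic category label."""
    word_to_label = {w: label for label, words in sem_lex.items() for w in words}
    sentence_end = {".", "!", "?"}   # kept as-is for later sentence splitting
    other_punct = {",", ";", ":"}    # assumed set of remaining punctuation
    labeled = ["Start"]
    for tok in tokens:
        if tok in sentence_end:
            labeled.append(tok)
        elif tok in other_punct:
            labeled.append("Punc")
        else:
            # Words not yet in the dictionary stay as concrete words.
            labeled.append(word_to_label.get(tok, tok))
    labeled.append("End")
    return labeled


sem_lex = {"Obj": {"restaurant"}, "Attr": {"price"}, "Dgr": {"very"}, "Sent": {"high"}}
labeled = replace_with_labels(["restaurant", "price", "very", "high", "."], sem_lex)
```

Leaving unknown words untouched matters for the later extraction step, which looks for exactly those leftover concrete words.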

Step 5: template generation. This step splits the label-replaced comment data into sentences and generates templates from the name of each semantic category and the concrete words each semantic category contains. In this embodiment, sentences are split at the three punctuation marks ".", "!", and "?", the template length is limited to a minimum of 3 words and a maximum of 7 words, and the label-replaced comments are scanned to generate candidate templates.
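
One way to read the scanning in Step 5 is as a sliding window over each sentence. The sketch below assumes that every window of 3 to 7 tokens is a candidate template and counts how often each occurs; the text does not state how candidates are filtered, so that choice and the function name are assumptions.

```python
from collections import Counter


def generate_templates(labeled_tokens, min_len=3, max_len=7):
    """Count every window of min_len..max_len tokens inside each sentence."""
    sentences, current = [], []
    for tok in labeled_tokens:
        if tok in {".", "!", "?"}:   # sentence boundaries
            if current:
                sentences.append(current)
            current = []
        else:
            current.append(tok)
    if current:
        sentences.append(current)

    counts = Counter()
    for sent in sentences:
        for size in range(min_len, max_len + 1):
            for start in range(len(sent) - size + 1):
                counts[tuple(sent[start:start + size])] += 1
    return counts


counts = generate_templates(["Attr", "Dgr", "Sent", ".", "Obj", "Attr", "Dgr", "Sent"])
```

The resulting frequencies are the kind of count that the template scoring in Step 7 relies on.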

Step 6: semantic word extraction. The candidate templates are applied to the comments after semantic category label replacement. When the template corresponding to a comment segment differs from a candidate template by only one word, that word is taken as an instance word of the corresponding semantic category. For example, in the comment segment "price | very | high", "price" is an evaluation attribute word and "high" is a sentiment word, while "very" does not yet belong to any semantic category, so the corresponding template is "Attr | very | Sent". This differs from the candidate template "Attr | Dgr | Sent" only in the middle word, so "very" is extracted as an instance word of the degree adverb category.
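
The one-word-difference rule in Step 6 can be sketched as a position-by-position comparison. The function name and the requirement that the differing template slot be a category label (and the segment slot a plain word) are assumptions that follow the "Attr | very | Sent" example.

```python
LABELS = {"Obj", "Attr", "Sent", "Dgr", "Adv", "Neg", "Inter"}


def extract_instance(segment, template):
    """Return (label, word) if segment and template differ in exactly one
    position, where the template holds a category label and the segment a
    plain word; otherwise return None."""
    if len(segment) != len(template):
        return None
    diffs = [(s, t) for s, t in zip(segment, template) if s != t]
    if len(diffs) != 1:
        return None
    word, label = diffs[0]
    if label in LABELS and word not in LABELS:
        return label, word
    return None


result = extract_instance(["Attr", "very", "Sent"], ["Attr", "Dgr", "Sent"])
```

Here `result` pairs the category `Dgr` with the new instance word `very`, matching the worked example in the text.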

Step 7: template scoring. The present invention scores templates from two aspects: frequency measures the importance and generalizability of a template, and the hit rate in the semantic dictionary measures its accuracy.

The importance and generalization score S(pat_i) of template pat_i is calculated as follows:

[formula not reproduced in this text]

The accuracy score P(pat_i) of pat_i is calculated as follows:

[formula not reproduced in this text]

where T(pat_i) denotes the set of semantic words extracted by template pat_i, f(t) denotes the frequency of semantic word t, and SemLex is the seed semantic dictionary constructed in Step 2.
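
The formulas for S(pat_i) and P(pat_i) are not reproduced in this text. As a hedged reconstruction only: the symbols defined here and in claim 1 (|pat_i|, f(pat_i), C(pat_i), T(pat_i), f(t), SemLex) are consistent with a C-value style nesting correction for S and a frequency-weighted dictionary hit rate for P. The exact forms below, including the base-2 logarithm and the averaging over C(pat_i), are assumptions and may differ from the granted patent.

```latex
S(pat_i) =
\begin{cases}
\log_2 |pat_i| \cdot f(pat_i), & C(pat_i) = \emptyset \\[4pt]
\log_2 |pat_i| \cdot \Bigl( f(pat_i) - \dfrac{1}{|C(pat_i)|} \sum_{pat_j \in C(pat_i)} f(pat_j) \Bigr), & \text{otherwise}
\end{cases}

P(pat_i) = \frac{\sum_{t \in T(pat_i) \cap SemLex} f(t)}{\sum_{t \in T(pat_i)} f(t)}
```

Under this reading, a template that mostly occurs inside longer templates is discounted in S, and P is the share of extracted word occurrences that already appear in the dictionary.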

The Sigmoid function is used to normalize S(pat_i) to (0, 1), and the two scores are then merged to obtain F(pat_i), calculated as follows:

[formula not reproduced in this text]

where α is the weight of the importance and generalization score S(pat_i), with value range [0, 1]. The present invention places more emphasis on template accuracy, so α is set to 0.4; it can also be adjusted for the specific application.
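
The fusion formula itself is not reproduced in this text. A minimal sketch, assuming the standard logistic Sigmoid and a convex combination F = α·sigmoid(S) + (1 − α)·P, which is consistent with α being described as the weight of S; the function name `fuse_scores` is hypothetical.

```python
import math


def fuse_scores(s_score, p_score, alpha=0.4):
    """Weighted fusion of the normalized S score and the accuracy score P."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    s_norm = 1.0 / (1.0 + math.exp(-s_score))  # standard logistic sigmoid
    return alpha * s_norm + (1.0 - alpha) * p_score


f_value = fuse_scores(s_score=0.0, p_score=1.0)  # 0.4 * 0.5 + 0.6 * 1.0
```

With α = 0.4 the accuracy term carries the larger weight, in line with the stated emphasis on template accuracy.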

Step 8: template selection. The top 5% to 10% of templates by F(pat_i) are selected.
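
Selecting the top fraction can be sketched as a sort followed by a cut. The function name, the rounding down, and the rule of always keeping at least one item are assumptions; the same helper could be applied to the word scores in Step 10.

```python
import math


def select_top(scores, fraction=0.1):
    """Return the keys of the highest-scoring `fraction` of items (at least one)."""
    if not scores:
        return []
    keep = max(1, math.floor(len(scores) * fraction))
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:keep]


template_scores = {f"t{i}": float(i) for i in range(20)}  # hypothetical F values
best = select_top(template_scores, fraction=0.1)
```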

Step 9: semantic word scoring. From the selected templates pat_k and their scores, the score of each semantic word extracted by the templates is calculated as follows:

[formula not reproduced in this text]

Step 10: semantic dictionary expansion. The top 5% to 10% of words by score are added to the semantic dictionary SemLex.

Steps 4 to 10 are carried out iteratively. Iteration termination condition: the iteration ends when the selected semantic words are clearly incorrect.
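
The outer loop can be sketched as follows. The callables `run_round` (standing in for Steps 4 to 10) and `is_acceptable` (standing in for the manual check that the selected words are not clearly incorrect), as well as the `max_rounds` safety limit, are hypothetical names introduced for this illustration.

```python
def bootstrap(seed_lexicon, run_round, is_acceptable, max_rounds=50):
    """Iterate until the newly selected words stop being acceptable.

    run_round(lexicon) -> dict mapping a category label to a set of words.
    is_acceptable(added) -> bool, a stand-in for the manual check.
    """
    lexicon = {label: set(words) for label, words in seed_lexicon.items()}
    for _ in range(max_rounds):
        new_words = run_round(lexicon)
        # Keep only words that are not already in the dictionary.
        added = {
            label: words - lexicon.get(label, set())
            for label, words in new_words.items()
        }
        if not any(added.values()) or not is_acceptable(added):
            break
        for label, words in added.items():
            lexicon.setdefault(label, set()).update(words)
    return lexicon
```

The loop also stops when a round adds nothing new, which is an extra convergence guard beyond the termination condition stated in the text.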

Step 11: polarity determination. The polarity of sentiment words, and the collocation polarity of sentiment words with evaluation object words and evaluation attribute words, are determined manually. During manual determination, the comment segments corresponding to the relevant template serve as the basis for judgment.

The results show that the present invention performs well in both precision and recall and generates a high-quality semantic dictionary and template library.

Experimental results on 10 million hotel comments show that the semantic dictionary construction method proposed by the present invention is effective. The method extracted 4,835 evaluation object words, such as "breakfast" and "network"; 175 evaluation attribute words, such as "price" and "attitude"; 2,393 sentiment words, such as "comfortable" and "great"; 92 degree adverbs, such as "extremely" and "excessively"; 214 common adverbs, such as "very" and "excessively"; 28 negation words, such as "木有" (Internet slang for "do not have") and "will not"; and 143 insertion words, such as "feeling" and "generally speaking".

The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person of ordinary skill in the art may modify the technical solution of the present invention or replace it with equivalents without departing from the spirit and scope of the present invention. The scope of protection of the present invention shall be as defined in the claims.

Claims

1. A semantic dictionary construction method based on comment data, characterized by comprising the following steps:

1) obtaining comment data, obtaining the words of each semantic category from a small amount of comment data, and constructing a seed semantic dictionary;

4) splitting the label-replaced comment data into sentences, and generating templates from the name of each semantic category and the concrete words each semantic category contains;

7) selecting the highest-scoring templates, calculating the score of the semantic words extracted by each template from the selected templates and their scores, and then selecting the highest-scoring semantic words to expand the semantic dictionary;

8) iterating steps 3) to 7), terminating the iteration when the selected semantic words are incorrect, obtaining the final semantic dictionary, and forming a template library from the templates,

wherein the method by which step 6) scores each template is:

a) the importance and generalization score S(pat_i) of a template is calculated as follows:

[formula not reproduced in this text]

where |pat_i| is the length of template pat_i, counted in words, f(pat_i) denotes the frequency of template pat_i, and C(pat_i) denotes the set of templates in which pat_i is nested;

b) the accuracy score P(pat_i) of a template is calculated as follows:

[formula not reproduced in this text]

where T(pat_i) denotes the set of semantic words extracted by template pat_i, f(t) denotes the frequency of semantic word t, and SemLex is the seed semantic dictionary;

c) the two scores obtained in steps a) and b) are fused by weighting.

2. The method of claim 1, characterized in that: step 1) obtains online comment data from comment websites with a focused crawler, and a small number of comments are checked manually to collect the words of each semantic category and form the seed dictionary.

3. The method of claim 1, characterized in that: step 2) first segments the text with a dictionary-based maximum matching segmentation method, and then applies a sequence labeling segmentation method to the parts where segmentation is ambiguous to obtain the correct result; the sequence labeling segmentation method converts the word segmentation problem into a character classification problem, each character is assigned a position class label according to its position within a word, and the segmentation of the sentence is determined from the resulting label sequence.

4. The method of claim 3, characterized in that: the position class labels include word-initial, word-medial, word-final, and single-character word, and the sequence labeling task is carried out with a conditional random field model.

5. The method of claim 1, characterized in that: the semantic categories in step 3) include evaluation object words, evaluation attribute words, sentiment words, degree adverbs, common adverbs, negation words, and insertion words.

6. The method of claim 1, characterized in that: step 4) splits sentences at the three punctuation marks ".", "!", and "?", and limits the template length to a minimum of 3 words and a maximum of 7 words.

7. The method of claim 1, characterized in that: when step 5) extracts the semantic words of each semantic category, if the template corresponding to a comment segment differs from a template obtained in step 4) by only one word, that word is taken as an instance word of the corresponding semantic category.

8. The method of claim 1, characterized in that fusing by weighting the two scores obtained in steps a) and b) comprises:

using the Sigmoid function to normalize S(pat_i) to (0, 1), and then merging the two scores to obtain F(pat_i), calculated as follows:

[formula not reproduced in this text]

where α is the weight of the importance and generalization score S(pat_i), with value range [0, 1].

9. The method of claim 1, characterized in that: the highest-scoring templates in step 7) are the top 5% to 10% of templates by score, and the highest-scoring semantic words are the top 5% to 10% of semantic words by score.

10. The method of claim 1, characterized in that: after step 8), the polarity of the sentiment words in the semantic dictionary, and the collocation polarity of sentiment words with evaluation object words and evaluation attribute words, are determined manually; during manual determination, the comment segments corresponding to the relevant template serve as the basis for judgment.