CN104699766A

CN104699766A - Implicit attribute mining method integrating word correlation and context deduction

Info

Publication number: CN104699766A
Application number: CN201510082519.3A
Authority: CN
Inventors: 张宇; 刘妙
Original assignee: Zhejiang University of Technology ZJUT
Current assignee: Zhejiang University of Technology ZJUT
Priority date: 2015-02-15
Filing date: 2015-02-15
Publication date: 2015-06-10
Anticipated expiration: 2035-02-15
Also published as: CN104699766B

Abstract

The invention discloses an implicit attribute mining method integrating word correlation and context deduction. The method comprises the following steps: building a corpus and, from the corpus, building a reference comment dataset, an attribute word dictionary, a sentiment word dictionary, a notional word dictionary, an attribute word-sentiment word modification matrix and an attribute word-notional word co-occurrence matrix for the current product category; then, using these resources in combination with the context of each clause, sequentially mining the clauses in a comment dataset to be analyzed that require implicit attribute mining, so as to obtain implicit attribute mining results. The method combines two different word correlations, namely the attribute word-sentiment word modification relation and the attribute word-notional word co-occurrence relation, and uses the context of the clause for deduction, so that the accuracy of implicit attribute mining is greatly improved.

Description

An implicit attribute mining method integrating word correlation and context deduction

Technical field

The present invention relates to the technical field of data mining, and in particular to an implicit attribute mining method integrating word correlation and context deduction.

Background art

In the field of opinion mining, attribute word mining and sentiment word mining are two basic subtasks. Attribute word mining makes it possible to classify and summarize user opinions, and thereby to provide better decision support for users. At present, attribute word mining techniques for product reviews fall into two broad classes: explicit attribute mining and implicit attribute mining. Explicit attribute mining is relatively simple, and scholars have carried out a large amount of research on it. Implicit attribute mining is considerably more complex, and related research is still scarce.

For implicit attribute mining, Liu et al., in the document "Opinion observer: analyzing and comparing opinions on the Web", proposed establishing mappings between product attributes and attribute values by rule mining; for example, "heavy" is mapped to the attribute "weight" and "big" is mapped to the attribute "size", and implicit attributes are then mined through these mappings. However, establishing the mapping rules requires a certain amount of manual annotation, so the accuracy of implicit attribute mining is limited by the quality and quantity of the annotated rules. In addition, for a new domain the mapping rules have to be annotated manually again, which is time-consuming and makes accuracy difficult to guarantee.

Su et al., in the document "Hidden sentiment association in Chinese Web opinion mining", proposed an implicit attribute mining method based on the co-occurrence relation between attribute words and sentiment words. A mutual reinforcement clustering algorithm is applied iteratively to the attribute words and sentiment words to obtain attribute word clusters and sentiment word clusters, so that the association between a single attribute word and a single sentiment word is extended to an association between an attribute word cluster and a sentiment word cluster. However, their method does not consider associations between attribute words and words other than sentiment words.

Chou Guang et al., in the document "Implicit product attribute extraction based on regularized topic modeling", proposed an implicit attribute mining method based on the idea of regularized topic modeling. It mines implicit attributes from attribute-related words without requiring prior knowledge, but it does not consider the context of the comment clause.

Summary of the invention

In view of the deficiencies of the prior art, the present invention proposes an implicit attribute mining method integrating word correlation and context deduction.

An implicit attribute mining method integrating word correlation and context deduction comprises the following steps:

(1) building a corpus, and using the corpus to build a reference comment dataset, an attribute word dictionary, a sentiment word dictionary, a notional word dictionary, an attribute word-sentiment word modification matrix and an attribute word-notional word co-occurrence matrix for the current product category, specifically as follows:

(1-1) obtaining comment data for different product categories, and preprocessing the obtained comment data;

The detailed process is as follows:

(1-11) normalizing the comment data: converting traditional Chinese characters in the comment data to simplified characters, identifying and correcting wrongly written characters, and deleting comments that contain garbled text or unrecognizable foreign words;

(1-12) filtering spam comments: using regular expressions to filter out comments containing information such as QQ numbers, mobile phone numbers and website addresses;

(1-13) performing Chinese word segmentation and part-of-speech tagging on the comment data, then filtering stop words, and finally deleting comments that have no punctuation throughout and are excessively long.

(1-2) building the corpus from the preprocessed comment data;

In the present invention, the corpus is the set of all preprocessed comment data.

(1-3) for the current product category, taking the comment data of that category in the corpus as the reference comment dataset of the current product category, and building the attribute word dictionary, sentiment word dictionary and notional word dictionary of the current product category based on the reference comment dataset;

The present invention builds the attribute word dictionary, sentiment word dictionary and notional word dictionary according to the occurrences of each attribute word, sentiment word and notional word in the reference comment dataset, specifically as follows:

(a) The attribute word dictionary is built as follows:

Based on the reference comment dataset, an initial attribute word set F and an initial sentiment word set O are built by bidirectional iteration.

For each attribute word in the initial attribute word set F, its TF-IDF weight is calculated from its number of occurrences in the reference comment dataset using the following formula:

w_{i}^{T} = {tf}_{i} \times {idf}_{i} = {tf}_{i} \times \log (\frac{N}{n_{i}})

where w_i^T is the TF-IDF weight of the i-th attribute word f_i in the initial attribute word set F, 1≤i≤n_f, and n_f is the number of attribute words in F; tf_i is the normalized term frequency of f_i in the reference comment dataset (the ratio of the number of occurrences of f_i in the reference comment dataset to the number of occurrences of all notional words in the reference comment dataset); idf_i is the inverse document frequency of f_i, i.e. \log(N / n_i); N is the total number of comments across all product categories in the corpus; and n_i is the number of comments in the corpus that contain f_i.

Attribute words whose TF-IDF weight is greater than a first threshold are selected to form a domain attribute word set; then, from the attribute words remaining in the initial attribute word set F, 20 to 30 attribute words with relatively high term frequency are selected manually to form a public attribute word set;

The domain attribute word set and the public attribute word set are merged (i.e. their union is taken) to form the attribute word dictionary.

Using the TF-IDF weight of each attribute word in the initial attribute word set F, the present invention can select domain-specific attribute words with high discriminating power.

The value of the first threshold directly affects the construction of the domain attribute word set. Preferably, the first threshold is 0.01 to 0.02; more preferably, the first threshold is 0.015.

Optimally, 25 attribute words with relatively high term frequency are selected from the attribute words remaining in the initial attribute word set F to form the public attribute word set. In a specific implementation, the remaining attribute words are sorted by term frequency from high to low, and 25 attribute words that have high term frequency and are general across domains are selected manually to form the public attribute word set.
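The TF-IDF screening described above can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the token-list representation of comments, and the use of the natural logarithm (the text does not state a log base) are all assumptions made for illustration. The manual selection of public attribute words is not modeled.

```python
import math
from collections import Counter


def tfidf_weights(candidate_attrs, ref_comments, corpus_comments, notional_total):
    """Compute w_i = tf_i * log(N / n_i) for each candidate attribute word.

    ref_comments / corpus_comments: lists of comments, each a list of tokens
    (hypothetical representation). notional_total: total number of notional
    word occurrences in the reference comment dataset.
    """
    ref_counts = Counter(tok for comment in ref_comments for tok in comment)
    n_docs = len(corpus_comments)  # N: comments across all product categories
    weights = {}
    for f in candidate_attrs:
        n_i = sum(1 for comment in corpus_comments if f in comment)
        if n_i == 0:
            continue  # word never occurs in the corpus; no weight defined
        tf = ref_counts[f] / notional_total  # normalized term frequency
        weights[f] = tf * math.log(n_docs / n_i)  # natural log is an assumption
    return weights


def domain_attribute_words(weights, first_threshold=0.015):
    """Keep attribute words whose TF-IDF weight exceeds the first threshold."""
    return {f for f, w in weights.items() if w > first_threshold}
```

With a reference set drawn from one category and a corpus spanning several categories, words that are frequent in the reference set but rare elsewhere receive the highest weights, which is the behavior the text relies on to pick domain-specific attribute words.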

(b) The sentiment word dictionary is built as follows:

The initial sentiment word set O is cross-screened against the "sentiment analysis word set" of HowNet and the "sentiment vocabulary ontology" of Dalian University of Technology to form the sentiment word dictionary.

(c) The notional word dictionary is built as follows:

The term frequencies of all notional words in the reference comment dataset are counted and sorted in descending order, and the notional words whose term frequency is greater than a second threshold are selected to form the notional word dictionary.

Preferably, the second threshold is 50.
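As a rough sketch of the notional word dictionary construction, the following Python fragment counts tagged words and keeps those above the second threshold. The (word, tag) pair representation and the set of part-of-speech tags treated as notional are assumptions; the text does not specify either.

```python
from collections import Counter


def build_notional_dictionary(ref_comments, notional_tags=("n", "v", "a"), second_threshold=50):
    """Return notional words with frequency above the threshold, most frequent first.

    ref_comments: list of comments, each a list of (word, pos_tag) pairs.
    notional_tags: tags counted as notional words (an illustrative assumption).
    """
    counts = Counter(
        word
        for comment in ref_comments
        for word, tag in comment
        if tag in notional_tags
    )
    # most_common() yields words in descending order of frequency
    return [word for word, freq in counts.most_common() if freq > second_threshold]
```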

(1-4) based on the reference comment dataset, using the attribute word dictionary, sentiment word dictionary and notional word dictionary to build the attribute word-sentiment word modification matrix and the attribute word-notional word co-occurrence matrix;

Each value in the attribute word-sentiment word modification matrix is the number of times an attribute word and a sentiment word co-occur in the reference comment dataset, and each value in the attribute word-notional word co-occurrence matrix is the number of times an attribute word and a notional word co-occur in the reference comment dataset.

Building the attribute word-sentiment word modification matrix and the attribute word-notional word co-occurrence matrix specifically comprises the following operations:

(1-41) traversing the reference comment dataset and, using the attribute word dictionary, sentiment word dictionary and notional word dictionary, extracting attribute word-sentiment word modification pairs and attribute word-notional word co-occurrence pairs from the clauses in which an attribute word appears;

(1-42) building the attribute word-sentiment word modification matrix from the extracted attribute word-sentiment word modification pairs, and building the attribute word-notional word co-occurrence matrix from the extracted attribute word-notional word co-occurrence pairs.

In the present invention, attribute word-sentiment word modification pairs and attribute word-notional word co-occurrence pairs are extracted clause by clause, with each clause in the reference comment dataset processed in turn.

The implicit attribute mining method of the present invention builds a dedicated reference comment dataset, attribute word dictionary, sentiment word dictionary, notional word dictionary, attribute word-sentiment word modification matrix and attribute word-notional word co-occurrence matrix for each product category, which ensures the domain relevance of the attribute words and improves the accuracy of the implicit attribute mining results.

(2) processing each clause in the comment dataset to be analyzed in turn; when processing the current clause, first using the attribute word dictionary to judge whether the current clause requires implicit attribute mining; if not, proceeding directly to the next clause; otherwise, performing the following operations:

(2-1) using the sentiment word dictionary and the attribute word-sentiment word modification matrix to determine the candidate attribute word array A_f of the current clause;

(2-2) analyzing the context of the current clause: if an explicit attribute word f_i appears in the preceding clause or the following clause and f_i ∉ A_f, adding f_i to the candidate attribute word array A_f of the current clause and setting the context weight of f_i to 1; if f_i ∈ A_f, increasing the context weight w_{f_i} of f_i, where 1≤i≤n_f and n_f is the number of attribute words in A_f;

(2-3) using the sentiment word dictionary and the notional word dictionary to build the notional word array A_t of the current clause; for each attribute word in the candidate attribute word array A_f of the current clause, calculating the weighted association value between that attribute word and the notional words in A_t from the number of co-occurrences of the attribute word with each notional word in A_t, the number of occurrences of each notional word in A_t in the reference comment dataset, and the context weight of the attribute word; and selecting the candidate attribute word with the largest weighted association value as the implicit attribute mining result of the current clause.

The present invention judges whether the current clause requires implicit attribute mining as follows:

First, it is judged whether the clause is an opinion sentence. If it is not, implicit attribute mining is not required. If it is, regular expressions are used to judge whether the clause expresses an expectation, a wish or a hypothesis: if so, implicit attribute mining is not required; if not, implicit attribute mining is required.

In the present invention, the extent of each clause is determined by the pauses and punctuation of the comment text to be analyzed.
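One simple way to split a comment into clauses by punctuation is sketched below in Python. The particular delimiter set (Chinese and ASCII commas, semicolons, exclamation and question marks, and the Chinese full stop) is an assumption; the text only says that pauses and punctuation determine clause boundaries.

```python
import re

# Hypothetical delimiter set; the source does not enumerate the punctuation used.
_CLAUSE_DELIMS = re.compile(r"[,，。;；!！?？]+")


def split_clauses(comment):
    """Split a comment into non-empty clauses at punctuation marks."""
    parts = (part.strip() for part in _CLAUSE_DELIMS.split(comment))
    return [part for part in parts if part]
```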

Step (2-1) comprises the following operations:

(2-11) using the sentiment word dictionary, extracting all sentiment words in the current clause to form the sentiment word array A_o;

(2-12) calculating the pointwise mutual information (PMI) between each sentiment word in the sentiment word array A_o of the current clause and each attribute word f_i that it modifies, using the following formula:

PMI (f_{i}, o_{j}) = \log \frac{P (f_{i}, o_{j})}{P (f_{i}) P (o_{j})}

where 1≤i≤n, and n is the number of attribute words in the attribute word dictionary; o_j is a sentiment word in the sentiment word array A_o, 1≤j≤n_o, and n_o is the number of sentiment words in A_o; P(f_i, o_j) is the number of times attribute word f_i and sentiment word o_j co-occur in the reference comment dataset, read from the attribute word-sentiment word modification matrix; and P(f_i) and P(o_j) are the numbers of times f_i and o_j, respectively, occur in the reference comment dataset;

(2-13) for each sentiment word in the sentiment word array A_o, selecting the 3 attribute words with the highest PMI as candidate attribute words; then merging the candidate attribute words selected for all sentiment words in A_o, deleting duplicates to form the candidate attribute word array A_f of the current clause, and setting the initial context weight of each attribute word f_i in A_f to 1.
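Steps (2-11) to (2-13) can be sketched in Python as follows. The dictionary-based data structures and all names are hypothetical. Following the text, the P(·) terms are raw counts rather than normalized probabilities; for a fixed sentiment word, normalizing by a common total would only add a constant to every PMI value, so the ranking is unaffected. Pairs with zero co-occurrence are skipped because their logarithm is undefined.

```python
import math


def candidate_attributes(clause_sentiments, mod_counts, attr_counts, senti_counts, top_k=3):
    """Build the candidate attribute array A_f with initial context weight 1.

    clause_sentiments: sentiment words found in the current clause (A_o).
    mod_counts[(f, o)]: modification-matrix count for attribute f and sentiment o.
    attr_counts[f], senti_counts[o]: occurrence counts in the reference dataset.
    """
    candidates = {}
    for o in clause_sentiments:
        scored = []
        for (f, o2), c in mod_counts.items():
            if o2 != o or c == 0:
                continue  # unrelated pair, or log(0) would be undefined
            pmi = math.log(c / (attr_counts[f] * senti_counts[o]))
            scored.append((pmi, f))
        scored.sort(reverse=True)  # highest PMI first
        for _, f in scored[:top_k]:
            candidates.setdefault(f, 1.0)  # merging removes duplicates
    return candidates
```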

In step (2-2) of the present invention, f_i ∈ A_f means that the candidate attribute word array A_f mined through word correlation already contains the attribute word f_i inferred from the context, so f_i is more likely to be the implicit attribute word of the current clause and its context weight is increased. Preferably, in step (2-2), if f_i ∈ A_f, the context weight of f_i is increased to 2 times its original value.
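The context weight adjustment of step (2-2) amounts to a small update rule, sketched below in Python. The function and parameter names are hypothetical, and applying the doubling at most once per attribute word (even if the word appears in both neighboring clauses) is an assumption the text does not settle.

```python
def apply_context(candidates, neighbor_attrs, boost=2.0):
    """Return a copy of A_f with context weights adjusted.

    candidates: dict mapping candidate attribute word -> context weight.
    neighbor_attrs: explicit attribute words found in the preceding or
    following clause.
    """
    updated = dict(candidates)
    for f in set(neighbor_attrs):  # each word is considered once (assumption)
        if f in updated:
            updated[f] *= boost  # already a candidate: double its weight
        else:
            updated[f] = 1.0  # new candidate from context: weight 1
    return updated
```

For example, if A_f holds "screen" and "size" with weight 1 and the preceding clause mentions "screen" and "battery" explicitly, the result gives "screen" weight 2 and adds "battery" with weight 1.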

(2-31) using the notional word dictionary, extracting all notional words in the current clause to form the notional word array A_t, and deleting the sentiment words from A_t;

(2-32) calculating the association value between each attribute word in the candidate attribute word array A_f and all notional words in the notional word array A_t using the following formula:

T (f_{i}) = \sum_{k = 1}^{v} \frac{P (f_{i} | t_{k})}{v},

where T(f_i) is the association value between attribute word f_i and all notional words in the notional word array A_t; t_k is a notional word in A_t; 1≤i≤n_f, and n_f is the number of attribute words in the candidate attribute word array A_f; 1≤k≤v, and v is the number of notional words in A_t; and P(f_i | t_k) is the conditional probability that attribute word f_i co-occurs with notional word t_k in the reference comment dataset, calculated using the following formula:

P (f_{i} | t_{k}) = \frac{P (f_{i}, t_{k})}{P (t_{k})} = \frac{n_{c} / n_{n}}{n_{t_{k}} / n_{n}} = \frac{n_{c}}{n_{t_{k}}},

where n_c is the number of times attribute word f_i and notional word t_k co-occur in the reference comment dataset, read from the attribute word-notional word co-occurrence matrix; n_{t_k} is the number of times notional word t_k occurs in the reference comment dataset; and n_n is the number of times all notional words in the notional word dictionary occur in the reference comment dataset;

(2-33) for each candidate attribute word f_i in the candidate attribute word array A_f, calculating the weighted association value T'(f_i) between it and all notional words in the notional word array A_t using the following formula:

T^{'} (f_{i}) = w_{f_{i}} \times T (f_{i})

where w_{f_i} is the context weight of candidate attribute word f_i, 1≤i≤n_f, and n_f is the number of attribute words in the candidate attribute word array A_f. The candidate attribute word with the largest weighted association value is selected as the implicit attribute mining result of the current clause.
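The calculation in steps (2-32) and (2-33) can be illustrated with the following Python sketch, which computes T(f_i) as the mean of n_c / n_{t_k} over the notional words of the clause and multiplies it by the context weight. The data structures and names are hypothetical, and returning no result for a clause without notional words is an assumption.

```python
def weighted_association(candidates, clause_notionals, cooc_counts, notional_counts):
    """Pick the candidate attribute word with the largest T'(f) = w_f * T(f).

    candidates: dict mapping attribute word -> context weight w_f.
    clause_notionals: notional words of the current clause (A_t).
    cooc_counts[(f, t)]: co-occurrence matrix count n_c.
    notional_counts[t]: occurrences n_t of notional word t in the reference set.
    """
    v = len(clause_notionals)
    if v == 0 or not candidates:
        return None, {}  # nothing to score (assumed behavior)
    scores = {}
    for f, weight in candidates.items():
        # P(f | t_k) = n_c / n_{t_k}; T(f) is the average over the v notional words
        assoc = sum(
            cooc_counts.get((f, t_k), 0) / notional_counts[t_k]
            for t_k in clause_notionals
        ) / v
        scores[f] = weight * assoc
    best = max(scores, key=scores.get)
    return best, scores
```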

Unless otherwise specified, the term frequency of a word (a notional word, sentiment word or attribute word) in the present invention is the number of times that word occurs in the comment dataset of the current product category.

Unless otherwise specified, a comment in the present invention refers to a single obtained review, and comment data refers to a set of such comments.

Compared with the prior art, the present invention has the following advantages:

(1) clauses are first screened to identify non-opinion sentences and statements expressing an expectation, a wish or a hypothesis, and implicit attribute inference is not performed on these clauses, which both reduces the workload and improves the accuracy of implicit attribute mining;

(2) multiple candidate attribute words are obtained from the modification relation between sentiment words and attribute words, and implicit attributes are then mined from the co-occurrence relation between the candidate attribute words and notional words; by combining two different word correlations, the method can effectively improve the accuracy of implicit attribute mining;

(3) the context of the clause is taken into account, and adjusting the context weights of the candidate attribute words can further improve the accuracy of implicit attribute mining.

Brief description of the drawings

Fig. 1 is a flowchart of the implicit attribute mining method integrating word correlation and context deduction of the present embodiment;

Fig. 2 is a flowchart of the preprocessing of comment data;

Fig. 3 is a flowchart of building the attribute word dictionary, sentiment word dictionary and notional word dictionary;

Fig. 4 is a flowchart of calculating the context weights of candidate attribute words;

Fig. 5 is a flowchart of calculating the weighted association values of candidate attribute words.

Embodiment

Specific embodiments of the present invention are described in further detail below with reference to the drawings.

The present embodiment is described using reviews of mobile phone products crawled from Taobao as an example.

As shown in Fig. 1, the implicit attribute mining method integrating word correlation and context deduction of the present embodiment comprises the following steps:

(1) Comment data for different product categories, including clothing, jewelry, household appliances, mobile phones and digital products, are crawled from a website (Taobao in the present embodiment), and the obtained comment data are preprocessed to form the corpus S. The preprocessing flow is shown in Fig. 2 and comprises the following steps:

(1-1) Normalization of comment data: traditional Chinese characters in the comment data are converted to simplified characters, wrongly written characters are identified and corrected, and comments that contain garbled text or unrecognizable foreign words are deleted.

Examples are given below:

(a) Traditional-to-simplified conversion: in the clause "Father likes (Xi Huan) this mobile phone very much", the character "Huan" is written in its traditional form; after conversion, the clause is output entirely in simplified characters as "Father likes this mobile phone very much".

(b) Identification and correction of wrongly written characters: in the clause "the mobile phone reflection is very slow", "reflection" should be "response"; after identification and correction, the output is "the mobile phone response is very slow".

(c) Identification and deletion of garbled comments: "the not smoothgoing Yu letter of Pi fine jade adze umbrella shackles silicon Shi Hao toot В Mo harasses for a Qin swinging Yun adze An Mao whore DEG C Retained fermium Xue Previous-set Breathe heavily Acta+Ren Lian and borrow Chen adze An Jue 5 Liu Hai? ". This comment contains garbled text and is deleted directly.

(1-2) Filtering of spam comments: regular expressions are used to filter out comments containing information such as QQ numbers, mobile phone numbers and website addresses. The regular expression for identifying mobile phone numbers is "(13|18|15|17)[0-9]{9}", which matches comments containing an 11-digit string beginning with 13, 18, 15 or 17. The regular expression for identifying QQ numbers is ".*qq.*[1-9][0-9]{4,}|.*QQ.*[1-9][0-9]{4,}|.*扣扣.*[1-9][0-9]{4,}", where "[1-9][0-9]{4,}" denotes a run of 5 or more consecutive digits. If a keyword such as "QQ", "qq" or "扣扣" (a colloquial spelling of QQ) appears before the run of digits, the digits are judged to be a QQ number, and the comment is judged to be spam and deleted.

For example: "Got a rebate through [321fanli.cn] and found that this item gives a lot of money back through [321fanli.cn]! Remember to type [URL: 321fanli.cn] directly into the browser ----- helping them promote and review also earns a reward, contact QQ:15325973793." A website address and a QQ number appear in this comment, so it is spam; it is identified with the regular expressions above and deleted.
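The two regular expressions given above can be exercised with a short Python sketch. The function names are hypothetical. The keyword "扣扣" is the colloquial spelling of QQ that a literal translation renders as "button button". The text gives no pattern for website addresses, so none is included here.

```python
import re

# Patterns as stated in the text: 11-digit mobile numbers starting with
# 13, 18, 15 or 17, and a QQ keyword followed by 5 or more digits.
PHONE_RE = re.compile(r"(13|18|15|17)[0-9]{9}")
QQ_RE = re.compile(
    r".*qq.*[1-9][0-9]{4,}|.*QQ.*[1-9][0-9]{4,}|.*扣扣.*[1-9][0-9]{4,}"
)


def is_spam(comment):
    """Return True if the comment contains a phone number or a QQ number."""
    return bool(PHONE_RE.search(comment) or QQ_RE.search(comment))


def filter_spam(comments):
    """Drop comments flagged as spam, keeping the rest in order."""
    return [c for c in comments if not is_spam(c)]
```

Note that the phone pattern matches any 11-digit run with the listed prefixes, so the QQ number in the example comment above is also caught by it.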

(1-3) Chinese word segmentation and part-of-speech tagging are performed on the comment data, stop words are then filtered, and finally comments that have no punctuation throughout and are excessively long are deleted.

For example: "mobile phone/n buys/v/u is good/d for a long time/a/u/d comes/v comment/v shyly/a/y mobile phone/n very/d is handy/a with/v/u is several/m days/q/father u/n very/d likes/a". This comment has no punctuation throughout and is too long, so it easily leads to wrong analysis results and is therefore deleted.

(2) From the corpus S built in step (1), the reference comment dataset S_phone of mobile phone products is used to build the attribute word dictionary Dic_F, the sentiment word dictionary Dic_O and the notional word dictionary Dic_T for the mobile phone category; the specific steps are shown in Fig. 3:

(2-1) The initial attribute word set F and the initial sentiment word set O are built by bidirectional iteration:

First, 1 to 2 seed attribute words (2 in the present embodiment) are selected manually and added to the initial attribute word set F. For each attribute word f_i in F, the comments in the reference comment dataset S_phone of mobile phone products are traversed to find, one by one, the sentiment words o_j that modify f_i. If o_j ∉ O, o_j is added to the initial sentiment word set O;

Conversely, for each sentiment word o_j in the initial sentiment word set O, the comments in S_phone are traversed to find, one by one, the attribute words f_i that it modifies. If f_i ∉ F, f_i is added to the initial attribute word set F. This is repeated until the numbers of words in F and O no longer increase.

In the present embodiment, the corpus is the set of preprocessed comment data of all product categories obtained, and the mobile phone reference comment dataset is the set of all mobile phone product comment data in the corpus.

For example, with "mobile phone" and "service" chosen as seed words for bidirectional iteration, the initial attribute word set F and the initial sentiment word set O are finally obtained.
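The bidirectional iteration can be sketched in Python as a fixed-point loop over modification pairs. This sketch assumes the (attribute word, sentiment word) modification pairs have already been extracted from the reference comments; how a modification relation is detected is not described in this passage, and the function name is hypothetical.

```python
def bidirectional_iteration(seed_attrs, modification_pairs):
    """Grow attribute set F and sentiment set O until neither changes.

    seed_attrs: manually chosen seed attribute words.
    modification_pairs: iterable of (attribute_word, sentiment_word) pairs
    observed in the reference comment dataset (assumed precomputed).
    """
    pairs = list(modification_pairs)
    F, O = set(seed_attrs), set()
    changed = True
    while changed:
        changed = False
        for f, o in pairs:
            if f in F and o not in O:
                O.add(o)  # sentiment word modifying a known attribute word
                changed = True
            if o in O and f not in F:
                F.add(f)  # attribute word modified by a known sentiment word
                changed = True
    return F, O
```

Words connected to the seeds through any chain of modification pairs end up in F or O, while unrelated pairs are left out.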

(2-2) following formula is utilized:

w_{i}^{T} = {tf}_{i} \times {idf}_{i} = {tf}_{i} \times \log (\frac{N}{n_{i}})

the TF-IDF weight of each attribute word in the initial attribute word set F is calculated, where w_i^T is the TF-IDF weight of the i-th attribute word f_i in F, 1≤i≤n_f, and n_f is the number of attribute words in F; tf_i is the normalized term frequency of f_i in the mobile phone reference comment dataset S_phone (the ratio of the number of occurrences of f_i in S_phone to the number of occurrences of all notional words in S_phone); idf_i is the inverse document frequency of f_i, i.e. \log(N / n_i); N is the total number of comments across all product categories in the corpus S; and n_i is the number of comments in S that contain f_i.

Next, the attribute words are screened by threshold according to the calculated TF-IDF weights: attribute words whose weight is greater than the first threshold of 0.015 are selected to form the domain attribute word set. Attribute words whose weight is less than or equal to the first threshold are added to a public attribute word candidate set, which is screened manually to obtain the public attribute word set.

The manual screening is as follows: all attribute words remaining in the initial attribute word set F (i.e. the attribute words in the public attribute word candidate set) are sorted by term frequency (the number of times the attribute word occurs in the mobile phone reference comment dataset S_phone), in descending order in the present embodiment, and the attribute words that are general across domains are selected manually to form the public attribute word set.

Finally, the domain attribute word set and the public attribute word set are merged to form the attribute word dictionary Dic_F.

For example, the TF-IDF weights of words such as "mobile phone", "screen" and "button" are higher than the first threshold, so these words are added to the domain attribute word set. The TF-IDF weights of words such as "item" and "logistics" are lower than the first threshold, so these words are added to the public attribute word set after manual screening. The two sets are then merged to form the attribute word dictionary Dic_F.

(2-3) The initial sentiment word set O is cross-screened against the "sentiment analysis word set" of HowNet and the "sentiment vocabulary ontology" of Dalian University of Technology to form the sentiment word dictionary Dic_O.

Sentiment words that appear in both the initial sentiment word set O and the HowNet "sentiment analysis word set" are added to Dic_O. Likewise, sentiment words that appear in both O and the Dalian University of Technology "sentiment vocabulary ontology" are also added to Dic_O. Duplicate sentiment words in Dic_O are then deleted, completing the construction of Dic_O.
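In set terms, the cross-screening described above keeps the words of O that appear in at least one of the two external lexicons. A minimal Python sketch follows; the parameter names are hypothetical, and loading the two lexicons is outside its scope.

```python
def build_sentiment_dictionary(initial_sentiments, hownet_words, dut_words):
    """Return (O ∩ HowNet) ∪ (O ∩ DUT ontology); a set has no duplicates.

    initial_sentiments: initial sentiment word set O.
    hownet_words, dut_words: words from the two external lexicons.
    """
    O = set(initial_sentiments)
    return (O & set(hownet_words)) | (O & set(dut_words))
```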

(2-4) the reference comment data collection S of cell phone type product is added up _phonein the word frequency of all notional words (namely each notional word is at S _phonethe number of times of middle appearance) and by descending sort, filter out the notional word that word frequency is greater than Second Threshold (in the present embodiment, Second Threshold is 50), construct notional word dictionary Dic_T.

(3) the reference comment data collection S of cell phone type product is utilized _phonebuild attribute word-emotion word and modify matrix M ^fOwith attribute word-notional word co-occurrence matrix M ^fT:

(3-1) the reference comment data collection S of cell phone type product is traveled through _phone, utilize the dictionary (comprising attribute word dictionary Dic_F, emotion word dictionary Dic_O and notional word dictionary Dic_T) built in step (2), extract attribute word-emotion word modify to attribute word-notional word co-occurrence pair.

In this embodiment, taking "battery/n charging/v /u time/n very/d not/d stable/a ,/w" as an example, the extraction results are as follows:

Attribute word-emotion word modification pair: "battery-stable";

Attribute word-notional word co-occurrence pairs: "battery-charging", "battery-time", "battery-stable".

(3-2) Build the attribute word-emotion word modification matrix M^fO from the extracted attribute word-emotion word modification pairs, and build the attribute word-notional word co-occurrence matrix M^fT from the extracted attribute word-notional word co-occurrence pairs.

In this embodiment, for the attribute word-emotion word modification pair "battery-stable" extracted above, find the position i of "battery" in the attribute word dictionary Dic_F and the position j of "stable" in the emotion word dictionary Dic_O. Each time the modification pair "battery-stable" is extracted, the value of the element in row i, column j of matrix M^fO is increased by 1. Likewise, each time an attribute word-notional word co-occurrence pair is extracted, the value of the element at the corresponding position of the attribute word-notional word co-occurrence matrix M^fT is increased by 1.
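The counting procedure of step (3-2) can be sketched as follows. This is a minimal illustration that assumes the pairs have already been extracted in step (3-1); the function name `build_count_matrix` and the sample dictionaries are hypothetical. The same function serves for both M^fO and M^fT by passing Dic_O or Dic_T as the column dictionary.

```python
def build_count_matrix(pairs, row_dict, col_dict):
    """Count extracted (attribute word, other word) pairs into a dense matrix."""
    row_index = {word: i for i, word in enumerate(row_dict)}
    col_index = {word: j for j, word in enumerate(col_dict)}
    matrix = [[0] * len(col_dict) for _ in row_dict]
    for attr, other in pairs:
        # Pairs with a word outside either dictionary are ignored.
        if attr in row_index and other in col_index:
            matrix[row_index[attr]][col_index[other]] += 1
    return matrix


# Hypothetical sample dictionaries and extracted pairs.
dic_f = ["battery", "screen"]
dic_o = ["stable", "durable"]
pairs = [("battery", "stable"), ("battery", "stable"), ("screen", "durable")]
m_fo = build_count_matrix(pairs, dic_f, dic_o)
```

A dense list of lists is used only for readability; for large dictionaries a sparse mapping keyed by (i, j) would hold the same counts.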

(4) Crawl a further small amount of mobile phone product comment data from Taobao (5000 in this embodiment; this comment data is not included in S_phone), preprocess it according to the method in step (1), and build the comment data set D to be analyzed. Read the comment clauses in the comment data set D to be analyzed one by one and analyze each according to the following steps, until the last clause has been processed:

When processing the current comment clause, first read in the clause and match its words against the attribute word dictionary Dic_F. If no explicit attribute word exists in the clause, obtain the candidate attribute word array A_f according to the following steps:

An explicit attribute word is a product attribute word that appears explicitly in a comment clause. For example, in "the price is too expensive", "price" appears explicitly in the clause and can be extracted directly using the attribute word dictionary Dic_F, so it is an explicit attribute word. In the second clause of the comment "The mobile phone is fine, it is just too expensive!", "expensive" modifies "price", but the attribute word "price" does not appear explicitly in the clause and can only be obtained through implicit attribute mining, so it is an implicit attribute word.

(4-1) First judge whether the comment clause is an opinion clause:

If it is not an opinion clause, implicit attribute mining is not performed, and the next clause is read in;

If it is an opinion clause, regular expressions are used to make the following judgment:

If the comment clause expresses an expectation, a wish, or a hypothesis, implicit attribute mining is likewise not performed for this clause, and the next clause is read in;

Otherwise, implicit attribute mining is performed: all emotion words in the clause are extracted according to the emotion word dictionary Dic_O to form the emotion word array A_o of the clause.

Examples of each case are given below:

(a) Non-opinion clause: "I/r this/r several/m days/q go on business/v /y ." The current clause contains no emotion word, so it is a non-opinion clause and implicit attribute mining is not performed.

(b) Opinion clause: "if/c again/d cheap/a 1 point/m just/d good/a /u ./w". The hypothetical construction "if ... just ..." appears in the clause, so implicit attribute mining is not performed.

(c) For a clause that requires implicit attribute mining, all emotion words in it are extracted to form the emotion word array of the clause. For example, in "very/d not/d durable/a ./w", the emotion word "durable" appears but there is no explicit attribute word, so implicit attribute mining is required. "durable" is extracted from the clause to form the emotion word array A_o = {durable}.
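The decision logic of step (4-1) and cases (a) to (c) can be sketched as a single filter function. The embodiment does not list its regular expressions, so the pattern `HYPOTHETICAL` below is only an assumed stand-in for the "if ... just ..." construction, written against the English glosses of the tokens; the function name `emotion_words_to_mine` is likewise hypothetical.

```python
import re

# Assumed pattern for the hypothetical construction "if ... just ...".
HYPOTHETICAL = re.compile(r"\bif\b.*\bjust\b")


def emotion_words_to_mine(tokens, dic_f, dic_o):
    """Return the emotion word array A_o if the clause needs mining, else None."""
    if any(t in dic_f for t in tokens):
        return None  # an explicit attribute word is present
    a_o = [t for t in tokens if t in dic_o]
    if not a_o:
        return None  # case (a): non-opinion clause
    if HYPOTHETICAL.search(" ".join(tokens)):
        return None  # case (b): expectation, wish, or hypothesis
    return a_o       # case (c): implicit attribute mining is required


# Hypothetical sample dictionaries.
dic_f = {"battery", "price"}
dic_o = {"durable", "cheap", "good"}
case_a = emotion_words_to_mine(["I", "this", "several", "days", "go on business"], dic_f, dic_o)
case_b = emotion_words_to_mine(["if", "again", "cheap", "1 point", "just", "good"], dic_f, dic_o)
case_c = emotion_words_to_mine(["very", "not", "durable"], dic_f, dic_o)
```

Here `case_a` and `case_b` are `None`, while `case_c` is the emotion word array containing "durable".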

(4-2) Using the attribute word-emotion word modification matrix M^fO built in step (3), compute with the following formula the pointwise mutual information (PMI) value PMI(f_i, o_j) between each emotion word in the emotion word array A_o of this clause and any attribute word f_i that it modifies:

PMI (f_{i}, o_{j}) = \log \frac{P (f_{i}, o_{j})}{P (f_{i}) P (o_{j})}

Wherein, 1≤i≤n, n is the number of attribute words in the attribute word dictionary, o_j is an emotion word in the emotion word array A_o, P(f_i, o_j) is the number of times attribute word f_i and emotion word o_j co-occur in the reference comment data set S_phone of mobile phone products (read from the attribute word-emotion word modification matrix M^fO), and P(f_i) and P(o_j) are respectively the number of times (i.e., the word frequency) attribute word f_i and emotion word o_j appear in the reference comment data set S_phone of mobile phone products.

According to the calculation results (the PMI values between each emotion word and the attribute words it modifies), for each emotion word in the emotion word array A_o of this clause, the 3 attribute words with the highest PMI values are added to the candidate attribute word array A_f of this clause. After all additions are complete, duplicate attribute words are deleted to obtain the candidate attribute word array A_f of this clause, and the initial context weight of each candidate attribute word f_i in A_f is set to 1.

For example: compute the PMI values of all attribute words that have a modification relation with "durable", and select the 3 attribute words with the highest PMI values as the candidate attribute words of this clause:

PMI (battery)=log (918/6242)=-0.8325,

PMI (electroplax)=log (24/337)=-1.1474,

PMI (loom)=log (6/9616)=-3.2048.

The candidate attribute word array finally built is A_f = [battery, electroplax, loom].
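The candidate selection of step (4-2) can be sketched as below. The sketch follows the formula as written, using raw counts for P(f_i, o_j), P(f_i), and P(o_j). Two points are observations rather than statements from the embodiment: the worked values above are numerically consistent with a base-10 logarithm of the co-occurrence count divided by the attribute word frequency, and the frequency of the emotion word is the same for every candidate, so it does not change the ranking. The function names, the "screen" entry, and the emotion word frequency are hypothetical.

```python
import math


def pmi(n_fo, n_f, n_o):
    """PMI from raw counts, using a base-10 logarithm."""
    return math.log10(n_fo / (n_f * n_o))


def top_candidates(emotion_word, cooccur, attr_freq, emo_freq, k=3):
    """Return the k attribute words with the highest PMI for one emotion word."""
    scores = {}
    for attr, row in cooccur.items():
        n_fo = row.get(emotion_word, 0)
        if n_fo > 0:  # only attribute words with a modification relation
            scores[attr] = pmi(n_fo, attr_freq[attr], emo_freq[emotion_word])
    return sorted(scores, key=scores.get, reverse=True)[:k]


def build_candidates(a_o, cooccur, attr_freq, emo_freq):
    """Merge the top candidates of every emotion word; initial context weight is 1."""
    a_f = {}
    for o in a_o:
        for f in top_candidates(o, cooccur, attr_freq, emo_freq):
            a_f.setdefault(f, 1)  # duplicates are kept only once
    return a_f


# Counts for battery, electroplax, and loom follow the worked example;
# the "screen" row and the frequency of "durable" are hypothetical.
cooccur = {"battery": {"durable": 918}, "electroplax": {"durable": 24},
           "loom": {"durable": 6}, "screen": {"durable": 1}}
attr_freq = {"battery": 6242, "electroplax": 337, "loom": 9616, "screen": 8000}
emo_freq = {"durable": 1200}
candidates = build_candidates(["durable"], cooccur, attr_freq, emo_freq)
```

Storing A_f as a mapping from attribute word to context weight keeps the weight next to the word for the later steps.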

(5) Compute the context weight of attribute word f_i, as shown in Figure 4. First read in the context clauses (i.e., the clause immediately before and the clause immediately after the current clause), and judge whether an explicit attribute word exists in the context clauses:

If an explicit attribute word f_i exists in the context clauses and f_i ∉ A_f, extract this explicit attribute word f_i, add it to the candidate attribute word array A_f, and set its context weight to 1. If f_i ∈ A_f, double the context weight of f_i.

For example: in "battery/n charging/v /u time/n very/d not/d stable/a ,/w very/d not/d durable/a ./w", for the clause "very/d not/d durable/a ./w", the context yields the context attribute word "battery". Since "battery" ∈ A_f, the context weight of "battery" is doubled, i.e., w_battery = 2.
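The weight update of step (5) can be sketched as follows. The embodiment does not say what happens when the same explicit attribute word appears in both the preceding and the following clause; this sketch assumes each distinct explicit attribute word is applied once. The function name `apply_context` and the sample data are hypothetical.

```python
def apply_context(weights, context_clauses, dic_f):
    """Update context weights of A_f from explicit attribute words in the context."""
    # Assumption: each distinct explicit attribute word is counted once.
    explicit = {t for clause in context_clauses for t in clause if t in dic_f}
    updated = dict(weights)
    for attr in explicit:
        if attr in updated:
            updated[attr] *= 2   # already a candidate: double its weight
        else:
            updated[attr] = 1    # new candidate: weight is set to 1
    return updated


# A_f with initial weights, and the preceding clause from the example.
a_f = {"battery": 1, "electroplax": 1, "loom": 1}
dic_f = {"battery", "electroplax", "loom", "screen"}
previous_clause = ["battery", "charging", "time", "very", "not", "stable"]
weighted_a_f = apply_context(a_f, [previous_clause], dic_f)
```

With the preceding clause of the example, only the weight of "battery" changes, from 1 to 2.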

(6) Compute the association value between each candidate attribute word in the candidate attribute word array A_f and the notional words appearing in the current clause, as shown in Figure 5. The specific steps are as follows:

(6-1) Use the notional word dictionary Dic_T built in step (2) to extract all notional words in the current clause, and delete all emotion words among them according to the emotion word dictionary Dic_O, forming the notional word array A_t.

For example: in "battery/n too/d not/d to power/a ,/w once/m just/d do not have/v electricity/n /u very/d not/d durable/a ./w", extract all notional words in the second clause: "once", "not having", "electricity", "durable"; then delete the emotion word "durable" among them to form the notional word array A_t = [once, not having, electricity].
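Step (6-1) reduces to a dictionary lookup followed by an exclusion. A minimal sketch, assuming the clause is already tokenized; the function name `notional_word_array` is hypothetical and the sample dictionaries contain only the words from the example above.

```python
def notional_word_array(tokens, dic_t, dic_o):
    """Keep tokens found in Dic_T, then drop those that are emotion words."""
    return [t for t in tokens if t in dic_t and t not in dic_o]


# Tokens of the second clause in the example above.
tokens = ["once", "just", "not having", "electricity", "very", "not", "durable"]
dic_t = {"once", "not having", "electricity", "durable"}
dic_o = {"durable"}
a_t = notional_word_array(tokens, dic_t, dic_o)
```

The result preserves the order in which the notional words occur in the clause.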

(6-2) For each attribute word f_i in the candidate attribute word array A_f, compute its association value T(f_i) with all notional words in the notional word array A_t according to the following formula:

T (f_{i}) = \sum_{k = 1}^{v} \frac{P (f_{i} | t_{k})}{v}

Wherein, 1≤i≤n_f, where n_f is the number of candidate attribute words in the candidate attribute word array A_f; 1≤k≤v, where v is the number of notional words in the notional word array A_t; and P(f_i | t_k) is the conditional probability, in the reference comment data set S_phone of mobile phone products, of attribute word f_i co-occurring with notional word t_k of the notional word array A_t.

In this embodiment, P(f_i | t_k) is computed according to the following formula:

P (f_{i} | t_{k}) = \frac{P (f_{i}, t_{k})}{P (t_{k})} = \frac{n_{c} / n_{n}}{n_{t_{k}} / n_{n}} = \frac{n_{c}}{n_{t_{k}}}

Wherein, n_c is the number of times attribute word f_i and notional word t_k co-occur (read from the attribute word-notional word co-occurrence matrix M^fT), n_{t_k} is the number of times (i.e., the word frequency) notional word t_k appears in the reference comment data set S_phone, and n_n is the number of times all notional words in the notional word dictionary Dic_T appear in the reference comment data set S_phone.
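Because n_n cancels, T(f_i) is simply the mean of n_c / n_{t_k} over the notional words in A_t. The sketch below illustrates this; the function name `association_value`, the counts, and the choice to return 0.0 for an empty A_t are assumptions made for illustration.

```python
def association_value(attr, a_t, cooccur, notional_freq):
    """T(f_i): mean of P(f_i | t_k) = n_c / n_{t_k} over the notional words in A_t."""
    if not a_t:
        return 0.0  # assumption: no notional words means no association evidence
    total = 0.0
    for t in a_t:
        n_c = cooccur.get((attr, t), 0)    # co-occurrence count from M^fT
        total += n_c / notional_freq[t]    # n_n cancels out of the ratio
    return total / len(a_t)


# Hypothetical counts, for illustration only.
cooccur = {("battery", "once"): 10, ("battery", "electricity"): 30}
notional_freq = {"once": 100, "not having": 200, "electricity": 60}
a_t = ["once", "not having", "electricity"]
t_battery = association_value("battery", a_t, cooccur, notional_freq)
```

With these counts the three conditional probabilities are 0.1, 0, and 0.5, so their mean is 0.2 up to floating-point rounding.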

(6-3) For each candidate attribute word f_i in the candidate attribute word array A_f, compute its weighted association value T'(f_i) with all notional words in the notional word array A_t using the following formula:

T^{'} (f_{i}) = w_{f_{i}} \times T (f_{i})

Wherein, w_{f_i} is the context weight of candidate attribute word f_i, 1≤i≤n_f, and n_f is the number of attribute words in the candidate attribute word array A_f. According to the calculation results, the candidate attribute word with the largest weighted association value is chosen as the implicit attribute mining result and output.
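The final selection of step (6-3) multiplies each association value by its context weight and takes the maximum. A minimal sketch with hypothetical values; the function name `select_implicit_attribute` is an assumption, and ties are resolved by whichever candidate comes first, which the embodiment does not specify.

```python
def select_implicit_attribute(assoc, weights):
    """Return the candidate with the largest T'(f_i) = w_{f_i} * T(f_i)."""
    weighted = {f: weights.get(f, 1) * t for f, t in assoc.items()}
    best = max(weighted, key=weighted.get)
    return best, weighted


# Hypothetical association values T(f_i) and context weights w_{f_i}.
assoc = {"battery": 0.2, "electroplax": 0.25, "loom": 0.05}
weights = {"battery": 2, "electroplax": 1, "loom": 1}
best, weighted = select_implicit_attribute(assoc, weights)
```

In this illustration "electroplax" has the highest unweighted association value, but the doubled context weight moves "battery" to the top, which is the effect the context deduction is meant to have.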

The above embodiment describes the technical solution and beneficial effects of the present invention in detail. It should be understood that the foregoing is only the most preferred embodiment of the present invention and is not intended to limit the present invention; any modification, supplement, or equivalent replacement made within the spirit of the present invention shall fall within the protection scope of the present invention.

Claims

1. An implicit attribute mining method integrating word correlation and context deduction, characterized in that it comprises the following steps:

(1) build a corpus, and use the corpus to build the reference comment data set, attribute word dictionary, emotion word dictionary, notional word dictionary, attribute word-emotion word modification matrix, and attribute word-notional word co-occurrence matrix of the current product category;

(2) process each clause in the comment data set to be analyzed in turn; when processing the current clause, first use the attribute word dictionary to judge whether the current clause requires implicit attribute mining; if not, directly process the next clause; otherwise, proceed as follows:

(2-3) use the emotion word dictionary and the notional word dictionary to generate the notional word array A_t of the current clause; for each attribute word in the candidate attribute word array A_f of the current clause, compute the weighted association value between this attribute word and all notional words in the notional word array A_t according to the co-occurrence count of the attribute word and each notional word, the occurrence of each notional word of A_t in the reference comment data set, and the context weight of the attribute word; and select the candidate attribute word with the largest weighted association value as the implicit attribute mining result of the current clause.

2. The implicit attribute mining method integrating word correlation and context deduction as claimed in claim 1, characterized in that step (1) comprises the following operations:

(1-2) build the corpus from all preprocessed comment data;

3. The implicit attribute mining method integrating word correlation and context deduction as claimed in claim 2, characterized in that step (1-1) preprocesses the comment data as follows:

(1-12) spam comment filtering: use regular expressions to filter out comment sentences containing QQ numbers, mobile phone numbers, or website information;

4. The implicit attribute mining method integrating word correlation and context deduction as claimed in claim 2, characterized in that step (1-3) builds the attribute word dictionary, emotion word dictionary, and notional word dictionary according to the occurrence of each notional word, attribute word, and emotion word in the reference comment data set.

5. The implicit attribute mining method integrating word correlation and context deduction as claimed in claim 2, characterized in that step (1-4) comprises the following operations:

6. The implicit attribute mining method integrating word correlation and context deduction as claimed in claim 1, characterized in that step (2-1) comprises the following operations:

PMI (f_{i}, o_{j}) = \log \frac{P (f_{i}, o_{j})}{P (f_{i}) P (o_{j})}

Wherein, 1≤i≤n, n is the number of attribute words in the attribute word dictionary, o_j is an emotion word in the emotion word array A_o, 1≤j≤n_o, n_o is the number of emotion words in the emotion word array A_o, P(f_i, o_j) is the number of times attribute word f_i and emotion word o_j co-occur in the reference comment data set and is read from the attribute word-emotion word modification matrix, and P(f_i) and P(o_j) are respectively the number of times attribute word f_i and emotion word o_j appear in the reference comment data set;

7. The implicit attribute mining method integrating word correlation and context deduction as claimed in any one of claims 1 to 6, characterized in that step (2-3) comprises the following operations:

(2-32) compute the association value between each attribute word f_i in the candidate attribute word array A_f and all notional words in the notional word array A_t using the following formula:

T (f_{i}) = \sum_{k = 1}^{v} \frac{P (f_{i} | t_{k})}{v},

P (f_{i} | t_{k}) = \frac{P (f_{i}, t_{k})}{P (t_{k})} = \frac{n_{c} / n_{n}}{n_{t_{k}} / n_{n}} = \frac{n_{c}}{n_{t_{k}}},

T^{'} (f_{i}) = w_{f_{i}} \times T (f_{i})