CN106250365A

CN106250365A - Method for extracting product attribute feature words from consumer reviews based on text analysis

Info

Publication number: CN106250365A
Application number: CN201610580612.1A
Authority: CN
Inventors: 陈峥; 张婷; 梁恒; 张永生
Original assignee: Chengdu De Maian Science And Technology Ltd
Current assignee: Chengdu De Maian Science And Technology Ltd
Priority date: 2016-07-21
Filing date: 2016-07-21
Publication date: 2016-12-21

Abstract

The invention discloses a method for extracting product attribute feature words from consumer reviews based on text analysis, comprising: determining a target product and obtaining review data for the target product; preprocessing the review data; obtaining part-of-speech sequence samples from the preprocessed review data; matching the part-of-speech sequence samples against all of the review data, extracting feature words from the review data according to the position of the feature word in the formal representation model of the part-of-speech sequence samples, and recording the frequency of each feature word, all of the feature words forming a feature-word pre-candidate set; preprocessing the feature-word pre-candidate set; and computing the similarity of every pair of feature words in the pre-candidate set and merging any two feature words whose similarity exceeds a threshold. The invention merges similar feature words using semantic similarity based on information content, which removes redundant feature words and reduces the amount of data involved in analyzing the feature words.

Description

Method for extracting product attribute feature words from consumer reviews based on text analysis

Technical field

The invention relates to the technical field of information processing, and in particular to a method for extracting product attribute feature words from consumer reviews based on text analysis.

Background technology

The development of the Internet and information technology has given ordinary consumers the opportunity to share their shopping experiences online. The resulting large volume of review data gives platforms a good opportunity to analyze the market, learn how users rate products, and make recommendations to users; for consumers, knowing other users' attitudes toward a product helps them make purchasing decisions. Extracting attribute feature words from product review data is an important step in this kind of data mining.

The quality of the attribute feature words extracted from product review data has a large impact on both platforms and users. Good feature words let a platform understand which product characteristics users care about, so that it can improve or maintain those characteristics and increase sales; they also let users learn the true state of the product characteristics they care about.

Many methods already exist for extracting feature words from product review data. They fall into two broad classes: rule-based feature extraction and probability-based feature extraction. Examples include part-of-speech template matching extended with grammatical rules, and hidden Markov models and conditional random fields based on word-sequence labeling. All of these perform only a preliminary extraction of the feature words in the review data. Research has found that, because consumers differ in education level, cultural background, and style of expression, the same attribute of the same product may be described in different words whose overall meaning is nevertheless close. If feature words are extracted using rule-based matching alone, the extracted feature words will be redundant.

Summary of the invention

The object of the invention is to overcome the deficiencies of the prior art by providing a method for extracting product attribute feature words from consumer reviews based on text analysis, which merges similar feature words using semantic similarity based on information content, removes redundant feature words, and reduces the amount of data involved in analyzing the feature words.

This object is achieved through the following technical solution: a method for extracting product attribute feature words from consumer reviews based on text analysis, comprising: determining a target product and obtaining review data for the target product; preprocessing the review data; obtaining part-of-speech sequence samples from the preprocessed review data; matching the part-of-speech sequence samples against all of the review data, extracting feature words from the review data according to the position of the feature word in the formal representation model of the part-of-speech sequence samples, and recording the frequency of each feature word, all of the feature words forming a feature-word pre-candidate set; and computing the similarity of every pair of feature words in the pre-candidate set and merging any two feature words whose similarity exceeds a threshold.

The review data for the target product is obtained by using a crawler algorithm to crawl the review data for the target product from a preset website.

The review data is preprocessed as follows: each review is split into multiple sentences at punctuation marks; each sentence is segmented into individual words; and each word is tagged with its part of speech.

The preprocessing of the review data also includes removing stop words.

The part-of-speech sequence samples are obtained as follows:

A product review sentence that contains a product attribute feature word is defined as a feature sentence, and preprocessed feature sentences are chosen as the part-of-speech sequence samples;

The formal representation model of a part-of-speech sequence sample is:

(BF₃, BF₂, BF₁, feature_i, AF₁, AF₂, AF₃, Pos:i)

where feature_i is a feature word, BF_i is the i-th word before the feature word, AF_i is the i-th word after the feature word, and Pos is the position of the feature word in the feature sentence.

Further, the method also includes a step of preprocessing the feature-word pre-candidate set:

Determine whether each feature word in the pre-candidate set satisfies a preset rule; if it does, keep the feature word, otherwise delete it.

The preset rule is: the word is at most four characters long, and its frequency is within a preset range.

The similarity of the feature words in the pre-candidate set is computed as follows: the information content of each feature word in the pre-candidate set is computed on the basis of HowNet, and the similarity of every pair of feature words in the pre-candidate set is then computed.

Feature words are merged as follows: two feature words whose similarity exceeds the threshold are merged into a single feature word, namely whichever of the two has the higher frequency.

The beneficial effect of the invention is that it merges similar feature words using semantic similarity based on information content, which removes redundant feature words and reduces the amount of data involved in analyzing the feature words.

Brief description of the drawings

Fig. 1 is a flow chart of one embodiment of the invention.

Detailed description of the embodiments

The technical solution of the invention is described in further detail below with reference to the accompanying drawing, but the scope of protection of the invention is not limited to what follows.

As shown in Fig. 1, the method for extracting product attribute feature words from consumer reviews based on text analysis comprises the following steps:

Step 1: Determine the target product and obtain review data for the target product.

Step 2: Preprocess the review data.

The review data is preprocessed as follows: each review is split into multiple sentences at punctuation marks; word segmentation: each sentence is segmented into individual words; part-of-speech tagging: each word is tagged with its part of speech. Word segmentation means cutting a sentence into individual words, that is, recombining a continuous character sequence into a word sequence according to a given specification. Part-of-speech tagging means assigning the correct part of speech to each word in the segmentation result, that is, determining whether each word is a noun, verb, adjective, or other part of speech.

The preprocessing of the review data also includes removing stop words. Stop words are words that carry no real meaning in a sentence, such as pronouns, digits, and mathematical symbols. The invention may use the open-source tool HanLP or the word segmentation system NLPIR to preprocess the review data. For example, the review "The mobile phone feels good in the hand, the sound quality is good, and the charging speed is fast" becomes, after preprocessing with HanLP: "mobile phone/n feel/n good/a sound quality/n good/a charging/v speed/n fast/a". Here n denotes a noun, a an adjective, v a verb, and d an adverb. In addition to the part-of-speech tag set defined by HanLP, some custom words may be added as needed.
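The sentence splitting and stop-word removal described above can be sketched as follows. This is a minimal illustration that uses only the Java standard library and assumes the segmentation and tagging output is already available as whitespace-separated "word/tag" tokens; the class `ReviewPreprocessor`, its methods, the punctuation set, and the stop list are hypothetical names and choices, not part of HanLP or of the patent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Sketch of the preprocessing step; tagging itself is assumed to come from an external tool. */
final class ReviewPreprocessor {
    /** One segmented word with its part-of-speech tag, for example "screen" with tag "n". */
    static final class TaggedWord {
        final String word;
        final String tag;
        TaggedWord(String word, String tag) { this.word = word; this.tag = tag; }
        @Override public String toString() { return word + "/" + tag; }
    }

    /** Splits one review into sentences at full-width (Chinese) or ASCII punctuation marks. */
    static List<String> splitClauses(String review) {
        List<String> clauses = new ArrayList<>();
        String punctuation = "[\\uFF0C\\u3002\\uFF01\\uFF1F\\uFF1B\\u3001,.!?;]+";
        for (String part : review.split(punctuation)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) clauses.add(trimmed);
        }
        return clauses;
    }

    /** Parses "word/tag" tokens and drops every word that appears in the stop list. */
    static List<TaggedWord> parseTagged(String tagged, Set<String> stopWords) {
        List<TaggedWord> result = new ArrayList<>();
        for (String token : tagged.trim().split("\\s+")) {
            int slash = token.lastIndexOf('/');
            if (slash <= 0) continue;                 // skip malformed or empty tokens
            String word = token.substring(0, slash);
            if (!stopWords.contains(word)) {
                result.add(new TaggedWord(word, token.substring(slash + 1)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(splitClauses("feel is good, sound quality is good, charging is fast"));
        Set<String> stop = new HashSet<>(Arrays.asList("the"));   // hypothetical stop list
        System.out.println(parseTagged("use/v the/r screen/n", stop));
    }
}
```

The multi-word tokens of the English glosses (such as "mobile phone") would need a different separator than whitespace; with the original Chinese text each token is a single unbroken string, so the whitespace split is sufficient there.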

Step 3: Obtain part-of-speech sequence samples from the preprocessed review data.

The part-of-speech sequence samples are obtained as follows: a product review sentence that contains a product attribute feature word is defined as a feature sentence, and preprocessed feature sentences are chosen as the part-of-speech sequence samples. The formal representation model of a part-of-speech sequence sample is:

(BF₃, BF₂, BF₁, feature_i, AF₁, AF₂, AF₃, Pos:i)

Step 4: Match the part-of-speech sequence samples against all of the review data, extract feature words from the review data according to the position of the feature word in the formal representation model of the part-of-speech sequence samples, and record the frequency of each feature word. All of the feature words form the feature-word pre-candidate set.
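One way to picture the matching in Step 4 is the sketch below. It assumes a simple policy that the text does not spell out: a pattern matches a sentence only when the sentence's tag sequence equals the pattern's tag sequence exactly, and the first matching pattern wins. The class `PosPatternMatcher`, the `PosPattern` type, and the space used to join multi-word features are illustrative choices.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Sketch: whole-sentence matching of part-of-speech patterns (an assumed matching policy). */
final class PosPatternMatcher {
    /** tags[i] is the required tag of slot i; feature[i] is true for feature-word slots. */
    static final class PosPattern {
        final String[] tags;
        final boolean[] feature;
        PosPattern(String[] tags, boolean[] feature) {
            if (tags.length != feature.length) throw new IllegalArgumentException("length mismatch");
            this.tags = tags;
            this.feature = feature;
        }
    }

    /** Returns the feature word if the tag sequence equals the pattern's, otherwise null. */
    static String match(String[][] clause, PosPattern p) {   // clause[i] = {word, tag}
        if (clause.length != p.tags.length) return null;
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < clause.length; i++) {
            if (!clause[i][1].equals(p.tags[i])) return null;
            if (p.feature[i]) {
                if (word.length() > 0) word.append(' ');     // Chinese text would join without a space
                word.append(clause[i][0]);
            }
        }
        return word.length() == 0 ? null : word.toString();
    }

    /** Counts how often each feature word is extracted; the first matching pattern wins. */
    static Map<String, Integer> extract(List<String[][]> clauses, List<PosPattern> patterns) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        for (String[][] clause : clauses) {
            for (PosPattern p : patterns) {
                String word = match(clause, p);
                if (word != null) {
                    freq.merge(word, 1, Integer::sum);
                    break;
                }
            }
        }
        return freq;
    }

    public static void main(String[] args) {
        PosPattern nna = new PosPattern(new String[] {"n", "n", "a"}, new boolean[] {true, true, false});
        PosPattern na = new PosPattern(new String[] {"n", "a"}, new boolean[] {true, false});
        List<String[][]> clauses = Arrays.asList(
            new String[][] {{"mobile phone", "n"}, {"pixel", "n"}, {"very good", "a"}},
            new String[][] {{"pixel", "n"}, {"high", "a"}});
        System.out.println(extract(clauses, Arrays.asList(nna, na)));
    }
}
```

Here the pattern `nna` corresponds to a model of the form {feature₁/n feature₂/n AF₁/a, Pos:1,2} and `na` to {feature/n AF₁/a, Pos:1}. A sliding-window match over longer sentences would be an equally plausible reading of the step.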

Step 5: Preprocess the feature-word pre-candidate set. Determine whether each feature word in the pre-candidate set satisfies a preset rule; feature words that satisfy the rule are kept in the pre-candidate set, and feature words that do not are deleted from it. The preset rule is: the word is at most four characters long, and its frequency is within a preset range.
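The preset rule of Step 5 amounts to a simple filter over the pre-candidate set. The sketch below is one possible form; the class `CandidateFilter` and the parameter names are hypothetical, and the frequency bounds used in `main` are placeholders because the text does not state the preset range.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Sketch of the Step 5 filter: keep words that are short enough and frequent enough. */
final class CandidateFilter {
    static Map<String, Integer> filter(Map<String, Integer> candidates,
                                       int maxChars, int minFreq, int maxFreq) {
        Map<String, Integer> kept = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : candidates.entrySet()) {
            String word = e.getKey();
            int chars = word.codePointCount(0, word.length()); // characters, not UTF-16 units
            int freq = e.getValue();
            if (chars <= maxChars && freq >= minFreq && freq <= maxFreq) {
                kept.put(word, freq);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> candidates = new LinkedHashMap<>();
        candidates.put("\u5c4f\u5e55", 4);                          // two characters
        candidates.put("\u624b\u673a\u5c4f\u5e55\u5927\u5c0f", 1);  // six characters, too long
        // Bounds 1 and 100 are placeholder values for the unspecified preset range.
        System.out.println(filter(candidates, 4, 1, 100).size());
    }
}
```

Length is counted in characters because the rule is stated for Chinese words; applied to English glosses the limit of four would need a different unit.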

Step 6: Compute the similarity of every pair of feature words in the pre-candidate set, and merge any two feature words whose similarity exceeds a threshold.

Embodiment 1

The following reviews were selected for analysis from the review text of a mobile phone on an e-commerce website:

A. "The mobile phone feels good in the hand, the sound quality is good, the charging speed is fast, and it is the same as the one my best friend bought."

B. "The mobile phone's pixels are very good, fingerprint unlock is super fast, and the quality is also good."

C. "The mobile phone screen is big enough, the pixels are high, the performance is good, the customer service attitude is super good, I really like it, and next time I buy a mobile phone I will come to this shop again."

D. "After using it for a while: the screen size is suitable, it feels good in the hand, the earphone sound quality is very good, the volume is loud enough, very nice, and the battery is also durable."

E. "Logistics was very fast, the mobile phone screen is suitable, I am very satisfied with the clarity, the pixels are high, and customer service is very good."

Each review is split into sentences at punctuation marks and preprocessed with HanLP, for example: "mobile phone/n pixel/n very good/a fingerprint/n unlock/v super/d fast/a quality/n also/d good/a", where n denotes a noun, a an adjective, v a verb, and d an adverb.

HanLP is invoked for preprocessing as follows:

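import java.util.List;
import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.NLPTokenizer;

// Segment the sentence and tag each word with its part of speech.
List<Term> termList = NLPTokenizer.segment(sentence);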

Of the five examples A, B, C, D, and E chosen above, each sentence in A, B, and C is selected as a feature sentence.

All of the text in the examples is preprocessed:

"mobile phone/n, feel/n, good/a, sound quality/n, good/a, charging/vi, speed/n, fast/a, and/cc, best friend/nz, buy/v, 的/ude1, same/uyy".

"mobile phone/n, pixel/n, very good/a, fingerprint/n, unlock/v, super/d, fast/a, quality/n, also/d, good/a".

"mobile phone/n, screen/n, enough/v, big/a, pixel/n, high/a, performance/n, good/a, customer service/n, attitude/n, super/d, good/a, super/b, like/vi, next time/t, buy/v, mobile phone/n, also/d, come/vf, this/rzv, shop/q".

"use/v, 了/ule, a period/mq, time/n, 了/ule, screen/n, size/n, suitable/a, feel/n, good/a, earphone/n, sound quality/n, very/d, good/a, volume/n, enough/v, big/a, very/d, good/a, battery/n, also/d, durable/a".

"logistics/n, very/d, fast/a, mobile phone/n, screen/n, suitable/a, clarity/n, very/d, satisfied/v, pixel/n, high/a, customer service/n, very/d, good/a".

The part-of-speech sequence formal models of examples A, B, C, D, and E can be expressed respectively as follows (sentences that do not contain a feature word are marked with their parts of speech only):

{feature₁/n feature₂/n AF₁/a, Pos:1,2}, {feature/n AF₁/a, Pos:1}, {feature₁/vi feature₂/n AF₁/a, Pos:1,2}, {/cc, /nz, /v, /ude1, /uyy}.

{feature₁/n feature₂/n AF₁/a, Pos:1,2}, {feature₁/n feature₂/v AF₁/d AF₂/a, Pos:1,2}, {feature/n AF₁/d AF₂/a, Pos:1}.

{feature₁/n feature₂/n AF₁/v AF₂/a, Pos:1,2}, {feature/n AF₁/a, Pos:1}, {feature/n AF₁/a, Pos:1}, {BF₁/n feature/n AF₁/d AF₂/a, Pos:1,2}, {/b, /vi}, {/t, /v, /n, /d, /v, /rzv, /q}.

{feature/n AF₁/a, Pos:1}, {feature/n AF₁/a, Pos:1}, {feature₁/n feature₂/n AF₁/d AF₂/a, Pos:1,2}, {feature/n AF₁/v AF₂/a, Pos:1}, {/d, /a}, {feature/n AF₁/d AF₂/a, Pos:1}.

{feature/n AF₁/d AF₂/a, Pos:1}, {feature₁/n feature₂/n AF₁/a, Pos:1,2}, {feature/n AF₁/d AF₂/v, Pos:2}, {feature/n AF₁/a, Pos:1}, {feature/n AF₁/d AF₂/a, Pos:1}.

After matching against the sample part-of-speech sequences, the feature words in the pre-candidate set and their frequencies are: {mobile phone screen: 2, sound quality: 1, charging speed: 1, mobile phone pixel: 1, fingerprint unlock: 1, quality: 1, pixel: 2, performance: 1, customer service attitude: 1, screen: 1, earphone sound quality: 1, volume: 1, battery: 1, logistics: 1, Mobile phone screen: 1, clarity: 1, customer service: 1}.

The following rule is then applied: if one word is contained in another word, the shorter word is used as the feature word; that is, if word1.contains(word2), then word2 is retained as the feature word. After this preliminary processing, the pre-candidate set becomes {screen: 4, sound quality: 2, charging speed: 1, pixel: 3, fingerprint unlock: 1, quality: 1, performance: 1, customer service: 2, volume: 1, battery: 1, logistics: 1, clarity: 1}.
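The containment rule can be sketched as below. The example counts suggest that the frequency of a longer word is added to the shorter word it contains (for instance 1 + 1 = 2 for sound quality), so the sketch sums frequencies; that summation is inferred from the numbers rather than stated by the rule. The class `ContainmentMerger` is a hypothetical name, and the input is a subset of the pre-candidate set.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/** Sketch: fold each word into the shortest candidate word that it contains. */
final class ContainmentMerger {
    static Map<String, Integer> mergeContained(Map<String, Integer> candidates) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : candidates.entrySet()) {
            String target = shortestContained(e.getKey(), candidates.keySet());
            merged.merge(target, e.getValue(), Integer::sum);   // assumed: frequencies add up
        }
        return merged;
    }

    /** Returns the shortest candidate contained in word, or word itself if there is none. */
    private static String shortestContained(String word, Set<String> all) {
        String best = word;
        for (String other : all) {
            if (!other.isEmpty() && word.contains(other) && other.length() < best.length()) {
                best = other;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        Map<String, Integer> candidates = new LinkedHashMap<>();
        candidates.put("earphone sound quality", 1);
        candidates.put("sound quality", 1);
        candidates.put("mobile phone pixel", 1);
        candidates.put("pixel", 2);
        candidates.put("battery", 1);
        System.out.println(mergeContained(candidates));
    }
}
```

Substring tests on English glosses can produce merges that the Chinese originals would not (the gloss "sound quality" contains "quality", for example), which is why the sample input leaves "quality" out.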

The basic record format of the HowNet dictionary is:

Word: W_C=

Word example: E_C=

Part of speech: G_C=

Concept definition (sense): DEF=

An example HowNet record is as follows:

Basic concepts in HowNet: a sememe is the basic unit used to describe a sense; a sense is one of the different meanings of a word.

Assume that sense n_1 has n sememes, N_1 = {P_11, P_12, ..., P_1n}, and that sense n_2 has m sememes, N_2 = {P_21, P_22, ..., P_2m}. des(P) is the number of descendant sememes of sememe P, and max(P) is the number of sememes in the sememe system to which the tree containing P belongs, that is, the sememe sample space. We select the 2216 sememes contained in the entity, event, attribute, attribute-value, and secondary-feature classes of HowNet as the sample space. The information content of sememe P is computed as:

IC(P) = -\log \frac{des(P) + 1}{max(P)}
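A few worked values may help to show how the formula behaves. The calculation below assumes the natural logarithm (the base is not stated in the text) and max(P) = 2216; the three sememes are hypothetical cases, not actual HowNet entries.

```latex
% Assumptions: natural logarithm; max(P) = 2216; descendant counts are invented.
\begin{align*}
IC(P_{\text{leaf}}) &= -\log\frac{0 + 1}{2216} = \log 2216 \approx 7.70
  && \text{(no descendants: most specific)}\\
IC(P_{99}) &= -\log\frac{99 + 1}{2216} = \log 22.16 \approx 3.10
  && \text{(99 descendants)}\\
IC(P_{\text{top}}) &= -\log\frac{2215 + 1}{2216} = 0
  && \text{(limiting case: every other sememe is a descendant)}
\end{align*}
```

The more descendants a sememe has, the more general it is and the less information it carries.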

The similarity of two sememes depends on what they have in common and on what distinguishes them. Their commonality is defined as follows: on a sememe tree, suppose the nearest common ancestor node of sememes P₁ and P₂ is P_a; then P_a is the minimal commonality of P₁ and P₂. The sememe similarity is computed as:

Sim_{ori}(P_1, P_2) = \frac{IC(P_a)}{IC(P_1) + IC(P_2)}
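The information content and the sememe similarity can be tried out on a toy tree, as in the sketch below. The five sememe names and the tree shape are invented for illustration and are not HowNet data; the class `SememeSimilarity` is hypothetical, the natural logarithm is assumed, and the similarity follows the formula exactly as printed above.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Sketch: information content and sememe similarity on a small child-to-parent tree. */
final class SememeSimilarity {
    private final Map<String, String> parent = new HashMap<>();      // roots have no entry
    private final Map<String, Integer> descendants = new HashMap<>();
    private final int sampleSpaceSize;                                // plays the role of max(P)

    /** The map must describe a tree or forest (no cycles). */
    SememeSimilarity(Map<String, String> childToParent, int sampleSpaceSize) {
        parent.putAll(childToParent);
        this.sampleSpaceSize = sampleSpaceSize;
        // Each node counts as one descendant of every one of its ancestors.
        for (String node : childToParent.keySet()) {
            for (String a = parent.get(node); a != null; a = parent.get(a)) {
                descendants.merge(a, 1, Integer::sum);
            }
        }
    }

    /** IC(P) = -log((des(P) + 1) / max(P)), written as log(max / (des + 1)). */
    double ic(String p) {
        int des = descendants.getOrDefault(p, 0);
        return Math.log(sampleSpaceSize / (des + 1.0));
    }

    /** Nearest node that lies on both paths to the root, or null if there is none. */
    String nearestCommonAncestor(String p1, String p2) {
        Set<String> chain = new HashSet<>();
        for (String a = p1; a != null; a = parent.get(a)) chain.add(a);
        for (String a = p2; a != null; a = parent.get(a)) {
            if (chain.contains(a)) return a;
        }
        return null;
    }

    /** Sim_ori(P1, P2) = IC(Pa) / (IC(P1) + IC(P2)), as printed in the text. */
    double similarity(String p1, String p2) {
        String pa = nearestCommonAncestor(p1, p2);
        double denominator = ic(p1) + ic(p2);
        if (pa == null || denominator == 0.0) return 0.0;
        return ic(pa) / denominator;
    }

    public static void main(String[] args) {
        Map<String, String> tree = new HashMap<>();
        tree.put("sound", "attribute");
        tree.put("appearance", "attribute");
        tree.put("volume", "sound");
        tree.put("tone", "sound");
        SememeSimilarity s = new SememeSimilarity(tree, 5);   // five sememes in the toy space
        System.out.println(s.similarity("volume", "tone"));
        System.out.println(s.similarity("volume", "appearance"));
    }
}
```

In this sketch a node counts as its own ancestor, so with the formula as printed a sememe compared with itself scores 0.5, and two sememes whose only common ancestor is the root of a space-filling tree score 0.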

By computing the information content of each sememe in senses n₁ and n₂, the set similarity Sim_L(N₁, N₂) can be obtained: the similarity of two sets equals the arithmetic mean of the similarities of their element pairs. C₁ and C₂ denote the number of records in senses n₁ and n₂ respectively. The similarity between senses is computed as:

Sim_{ite}(n_1, n_2) = Sim_L(N_1, N_2) \frac{\min(C_1, C_2)}{\sqrt{C_1 C_2}}

For two words w₁ and w₂, assume that w₁ has k senses, w₁ = (n₁₁, n₁₂, …, n_1k), and that w₂ has r senses, w₂ = (n₂₁, n₂₂, …, n_2r). The similarity of words w₁ and w₂ can then be obtained from the sense similarities computed above using the following formula:

Sim_{wor}(w_1, w_2) = \frac{\sum_{i=1}^{k} \sum_{j=1}^{r} Sim_{ite}(n_{1i}, n_{2j})}{k r}
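The three levels of averaging (sememe pairs, senses, words) can be sketched as follows. The sketch assumes that C₁ and C₂ are the numbers of sememes recorded in the two senses, which is one reading of "number of records"; the sememe similarity is passed in as a function so that any implementation of Sim_ori can be plugged in. The class `WordSimilarity` and the stand-in similarity used in `main` are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

/** Sketch: sense and word similarity built on top of a given sememe similarity. */
final class WordSimilarity {
    private final ToDoubleBiFunction<String, String> sememeSim;

    WordSimilarity(ToDoubleBiFunction<String, String> sememeSim) {
        this.sememeSim = sememeSim;
    }

    /** Sim_L: arithmetic mean of the sememe similarity over all element pairs. */
    double setSimilarity(List<String> n1, List<String> n2) {
        if (n1.isEmpty() || n2.isEmpty()) return 0.0;
        double sum = 0.0;
        for (String p1 : n1) {
            for (String p2 : n2) sum += sememeSim.applyAsDouble(p1, p2);
        }
        return sum / (n1.size() * n2.size());
    }

    /** Sim_ite: Sim_L scaled by min(C1, C2) / sqrt(C1 * C2), with C = sememe count (assumed). */
    double senseSimilarity(List<String> n1, List<String> n2) {
        double c1 = n1.size();
        double c2 = n2.size();
        if (c1 == 0 || c2 == 0) return 0.0;
        return setSimilarity(n1, n2) * Math.min(c1, c2) / Math.sqrt(c1 * c2);
    }

    /** Sim_wor: mean of Sim_ite over all k * r sense pairs of the two words. */
    double wordSimilarity(List<List<String>> w1, List<List<String>> w2) {
        if (w1.isEmpty() || w2.isEmpty()) return 0.0;
        double sum = 0.0;
        for (List<String> s1 : w1) {
            for (List<String> s2 : w2) sum += senseSimilarity(s1, s2);
        }
        return sum / (w1.size() * w2.size());
    }

    public static void main(String[] args) {
        // Stand-in sememe similarity: 1 for identical sememes, 0 otherwise.
        WordSimilarity ws = new WordSimilarity((a, b) -> a.equals(b) ? 1.0 : 0.0);
        List<List<String>> w1 = Arrays.asList(Arrays.asList("x", "y"));
        List<List<String>> w2 = Arrays.asList(Arrays.asList("x"), Arrays.asList("z"));
        System.out.println(ws.wordSimilarity(w1, w2));
    }
}
```

The factor min(C₁, C₂)/√(C₁C₂) equals 1 when both senses have the same number of records and shrinks as the sizes diverge, so senses of very different size are penalized.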

Using the formulas above, the similarity of every pair of words in the pre-candidate set is computed; the results are as follows:

Based on a comparison of the similarity values, a similarity threshold β is set; here we assume β = 0.310. The similarity between the feature words "sound quality" and "volume" exceeds the threshold β, so the two are merged into whichever word has the higher frequency, namely "sound quality", and its frequency becomes the sum of the two words' frequencies. The resulting feature-word set is {screen: 4, sound quality: 3, charging speed: 1, pixel: 3, fingerprint unlock: 1, quality: 1, performance: 1, customer service: 2, battery: 1, logistics: 1, clarity: 1}.
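The threshold merge can be sketched as below. The pairwise similarity values are made up for the illustration (0.35 for the pair "sound quality" and "volume", 0.10 for every other pair) because the similarity results of the embodiment are not available here; only the threshold 0.310 and the frequencies come from the text. The class `SimilarityMerger` and the tie-breaking choice (on equal frequency the earlier word is kept) are assumptions.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.ToDoubleBiFunction;

/** Sketch: merge every pair of feature words whose similarity exceeds the threshold. */
final class SimilarityMerger {
    static Map<String, Integer> merge(Map<String, Integer> freq,
                                      ToDoubleBiFunction<String, String> sim,
                                      double threshold) {
        Map<String, Integer> result = new LinkedHashMap<>(freq);
        List<String> words = new ArrayList<>(freq.keySet());
        for (int i = 0; i < words.size(); i++) {
            for (int j = i + 1; j < words.size(); j++) {
                String a = words.get(i);
                String b = words.get(j);
                if (!result.containsKey(a) || !result.containsKey(b)) continue; // already merged
                if (sim.applyAsDouble(a, b) > threshold) {
                    int total = result.get(a) + result.get(b);
                    String keep = result.get(a) >= result.get(b) ? a : b;  // higher frequency wins
                    String drop = keep.equals(a) ? b : a;
                    result.put(keep, total);                               // frequencies are summed
                    result.remove(drop);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, Integer> freq = new LinkedHashMap<>();
        freq.put("screen", 4);
        freq.put("sound quality", 2);
        freq.put("pixel", 3);
        freq.put("volume", 1);
        freq.put("battery", 1);
        // Made-up similarity values for illustration only.
        ToDoubleBiFunction<String, String> sim = (a, b) ->
            ((a.equals("sound quality") && b.equals("volume"))
                || (a.equals("volume") && b.equals("sound quality"))) ? 0.35 : 0.10;
        System.out.println(merge(freq, sim, 0.310));
    }
}
```

With these inputs "volume" is folded into "sound quality", whose frequency becomes 3, matching the merged set given in the embodiment.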

The above is only a preferred embodiment of the invention. It should be understood that the invention is not limited to the form disclosed herein, which should not be regarded as excluding other embodiments; the invention may be used in various other combinations, modifications, and environments, and may be modified within the scope of the concept described herein through the above teachings or through the skill or knowledge of the related art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the scope of protection of the appended claims.

Claims

1. A method for extracting product attribute feature words from consumer reviews based on text analysis, characterized in that it comprises:

determining a target product and obtaining review data for the target product;

preprocessing the review data;

obtaining part-of-speech sequence samples from the preprocessed review data;

matching the part-of-speech sequence samples against all of the review data, extracting feature words from the review data according to the position of the feature word in the formal representation model of the part-of-speech sequence samples, and recording the frequency of each feature word, all of the feature words forming a feature-word pre-candidate set; and

computing the similarity of every pair of feature words in the feature-word pre-candidate set, and merging any two feature words whose similarity exceeds a threshold.

2. The method according to claim 1, characterized in that the review data for the target product is obtained by using a crawler algorithm to crawl the review data for the target product from a preset website.

3. The method according to claim 1, characterized in that the review data is preprocessed by:

splitting each review into multiple sentences at punctuation marks;

segmenting each sentence into individual words; and

tagging each word with its part of speech.

4. The method according to claim 3, characterized in that the preprocessing of the review data further includes removing stop words.

5. The method according to claim 1, characterized in that the part-of-speech sequence samples are obtained as follows:

a product review sentence that contains a product attribute feature word is defined as a feature sentence, and preprocessed feature sentences are chosen as the part-of-speech sequence samples;

the formal representation model of a part-of-speech sequence sample is:

(BF₃, BF₂, BF₁, feature_i, AF₁, AF₂, AF₃, Pos:i)

6. The method according to claim 1, characterized in that it further comprises a step of preprocessing the feature-word pre-candidate set:

determining whether each feature word in the feature-word pre-candidate set satisfies a preset rule; if it does, keeping the feature word, otherwise deleting it.

7. The method according to claim 6, characterized in that the preset rule is: the word is at most four characters long, and its frequency is within a preset range.

8. The method according to claim 1, characterized in that the similarity of the feature words in the feature-word pre-candidate set is computed as follows: the information content of each feature word in the pre-candidate set is computed on the basis of HowNet, and the similarity of every pair of feature words in the pre-candidate set is then computed.

9. The method according to claim 1, characterized in that feature words are merged as follows: two feature words whose similarity exceeds the threshold are merged into a single feature word, namely whichever of the two has the higher frequency.