CN107480257A

CN107480257A - Product feature extraction method based on pattern matching

Info

Publication number: CN107480257A
Application number: CN201710694361.4A
Authority: CN
Inventors: 徐新胜; 林静
Original assignee: China Jiliang University
Current assignee: China Jiliang University
Priority date: 2017-08-14
Filing date: 2017-08-14
Publication date: 2017-12-15

Abstract

The present invention proposes a product feature extraction method based on pattern matching, comprising the following steps: 1, comment corpus acquisition; 2, Chinese natural language processing; 3, product feature extraction. The innovation of the method lies in the five criteria, proposed in step 3, that a product feature must satisfy; steps 1 and 2 are the groundwork for product feature extraction. The invention aims to provide a convenient and efficient method for extracting product features and is an extension of existing product feature extraction methods. Using the invention, researchers can extract product features quickly and effectively while improving the precision, recall and F-value of product feature extraction.

Description

Product feature extraction method based on pattern matching

Technical field:

The invention belongs to the field of text mining and relates to a product feature extraction method based on pattern matching, which is an unsupervised product feature extraction method.

Background:

With the development of network technology and the diversification of network media, people can obtain or share information through electronic devices anytime and anywhere, and the user-centred Web 2.0 era has quietly arrived. Modern life is fast-paced and workloads are heavy, so online shopping, being convenient and fast, attracts more and more people to buy products over the Internet; as a result, e-commerce has grown vigorously in China. As of December 2016, China had 731 million Internet users and an Internet penetration rate of 53.2%, of whom 467 million were online shoppers, accounting for 63.8% of Internet users. So that manufacturers and e-commerce operators can better understand how products fare in the market, e-commerce websites generally allow consumers to post comments about products. These product comment texts contain rich and valuable information: using them effectively can help manufacturers improve product design, raise product quality and become more competitive, and can help e-commerce operators adopt suitable operating and sales strategies and increase sales volume.

To provide manufacturers and e-commerce operators with more automated and intelligent text mining tools, experts and scholars in China and abroad have carried out a great deal of research. For mining and using English online comment text, researchers abroad have proposed many effective mining methods and achieved substantial results. Mining of Chinese online comment text started later; at present the research focuses mainly on product feature extraction, judging the sentiment polarity and intensity of comments, and analysing comment results. Of these, product feature extraction is the basic task of product comment text mining, and the quality of the extracted product features directly affects the follow-up research.

The present invention proposes a product feature extraction method based on pattern matching. It is an unsupervised extraction method that can improve the precision, recall and F-value of product feature extraction.

Summary of the invention:

To extract genuine product features quickly and efficiently from massive, multi-source, heterogeneous product comment text, the present invention provides a product feature extraction method based on pattern matching. It is an efficient and convenient product feature extraction method and an extension of existing product feature extraction methods.

The technical solution adopted by the present invention to solve this technical problem is described below:

A product feature extraction method based on pattern matching, characterised in that the method comprises the following steps:

Step 1, comment corpus acquisition: using a web crawler tool, collect the usage comments for a specified product from a large e-commerce platform and save them to a local database; then preprocess the saved comments to reduce noise in the data, obtaining a true, reliable, unstructured comment corpus;

Step 2, Chinese natural language processing: using Chinese natural language processing tools, perform in turn initial word segmentation and part-of-speech tagging, new word recognition, optimised word segmentation and part-of-speech tagging, syntactic analysis and sentiment analysis on the comment corpus, obtaining structured sentiment analysis results that are saved to the database;

Step 3, product feature extraction: define five criteria for product features, annotate product features in the sentiment analysis results according to these five criteria, extract the words annotated as product features, and generate the product feature set.

In the above product feature extraction method based on pattern matching, in step 1, because the network is open and online comments are diverse and scattered, the comment text captured from the e-commerce platform contains a large amount of noise; if product features were extracted from it directly, the results could deviate considerably from reality. To obtain realistic results, the original comment set therefore needs to be filtered and cleaned to reduce noise. Data preprocessing includes deleting blank and useless comments, deleting superfluous punctuation in comments, deleting redundant words in comments, deleting comments shorter than 4 characters, correcting wrongly written characters, replacing traditional Chinese characters with simplified ones, and deleting duplicate comments.

In the above product feature extraction method based on pattern matching, in step 3, the five criteria for product features are as follows:

First, a product feature cannot be a stop word;

Second, a product feature is a noun or noun phrase that occurs frequently in the comment corpus;

Third, the dependency relation between a product feature and its governing word is "SBV", and the governing word is a sentiment word;

Fourth, a product feature is a word that satisfies the seven extraction rules;

Fifth, a product feature is a domain word that is not a single character.

In the above product feature extraction method based on pattern matching, the seven extraction rules that a product feature satisfies fall into two categories according to the part of speech of the centre word, as follows:

First, when the centre word is an adjective:

① when the relation between a word and the centre word is "SBV", that is, the governing word of the word is the centre word itself, the word is a product feature; ② when the governing word of a word is not the centre word but a direct "COO" dependency relation exists between the governing word and the centre word, the word is a product feature; ③ when the governing word of a word is not the centre word but an indirect "COO" dependency relation exists between the governing word and the centre word, the word is a product feature;

Second, when the centre word is a verb and the governing word of the word is not the centre word:

④ when a direct "COO" dependency relation exists between the governing word of a word and the centre word, the word is a product feature; ⑤ when a direct "VOB" dependency relation exists between the governing word of a word and the centre word, the word is a product feature; ⑥ when an indirect "COO" dependency relation exists between the governing word of a word and the centre word, the word is a product feature; ⑦ when an indirect "VOB" dependency relation exists between the governing word of a word and the centre word, the word is a product feature.

The invention can use a web crawler tool to obtain massive, multi-source, heterogeneous product usage comment text from e-commerce platform websites, turn the unstructured data into structured data through shallow and deep Chinese text processing techniques, and annotate and extract product features using the five defined criteria. Using the method of the invention, researchers can extract product features quickly and effectively while improving the precision, recall and F-value of product feature extraction.

Brief description of the drawings:

Fig. 1 is the overall flow chart of the present invention.

Fig. 2 is the technical route map of product feature extraction of the present invention.

Fig. 3 shows how the fields of the results change during the product feature extraction process of the present invention.

Fig. 4 is the flow chart of comment corpus acquisition of the present invention.

Fig. 5 is an example of the syntactic analysis result of a comment sentence of the present invention.

Fig. 6 illustrates the types of dependency relation between words of the present invention.

Fig. 7 illustrates the seven extraction rules for product features of the present invention.

Fig. 8 shows part of the product feature annotation results of the present invention.

Detailed description of the embodiments:

The invention is further described below with reference to the accompanying drawings.

The invention uses a web crawler tool to capture information from a large e-commerce platform, obtains massive, multi-source, heterogeneous Chinese online user comment text, applies Chinese natural language processing to it, and extracts product features according to the five defined criteria, improving the precision, recall and F-value of product feature extraction.

The product feature extraction method based on pattern matching comprises three steps: comment corpus acquisition, Chinese natural language processing and product feature extraction, as shown in Fig. 1.

The techniques involved in the product feature extraction method based on pattern matching and their technical route are shown in Fig. 2, which also indicates the result produced after each technique is applied. Data acquisition and data preprocessing are the techniques used in step 1; initial word segmentation and part-of-speech tagging, optimised word segmentation and part-of-speech tagging, syntactic analysis and sentiment analysis are basic natural language processing techniques and belong to step 2; product feature annotation and extraction are the techniques of step 3.

The results produced over the whole extraction process of the method, and the changes in their fields, are shown in Fig. 3. The comment corpus has only two fields: sequence number and comment text. The initial and the optimised segmentation and part-of-speech tagging results each have 3 fields: sequence number, word form and part of speech. The syntactic analysis result has 6 fields: sequence number, word form, part of speech, dependency relation, governing word and governing word part of speech. The sentiment analysis result has 7 fields: sequence number, word form, part of speech, dependency relation, governing word, governing word part of speech and sentiment mark. The product feature annotation result has 8 fields: sequence number, word form, part of speech, dependency relation, governing word, governing word part of speech, sentiment mark and product feature mark. The product feature set has two fields: sequence number and product feature.
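The way a record gains fields from stage to stage can be pictured with a small sketch. The class name `TokenRecord` and the helper `filled_fields` are hypothetical; the field names follow the abbreviations given for Fig. 8 in this description, and leaving later-stage fields empty until they are produced is an assumption made for illustration only.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class TokenRecord:
    """One word of a comment, as it moves through the pipeline (illustrative)."""
    no: int                      # sequence number
    tk: str                      # word form
    pos: str                     # part of speech
    pRel: Optional[str] = None   # dependency relation (from syntactic analysis)
    pWd: Optional[str] = None    # governing word (from syntactic analysis)
    pPos: Optional[str] = None   # governing word part of speech
    isOp: Optional[str] = None   # sentiment mark (from sentiment analysis)
    isPF: Optional[str] = None   # product feature mark (from annotation)


def filled_fields(rec: TokenRecord) -> List[str]:
    """Names of the fields that have been produced so far."""
    return [f.name for f in fields(rec) if getattr(rec, f.name) is not None]


# After segmentation and tagging: 3 fields.
rec = TokenRecord(no=1, tk="mobile phone", pos="n")
# After syntactic analysis: 6 fields.
rec.pRel, rec.pWd, rec.pPos = "SBV", "good", "a"
```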

Each step is described in detail below.

Step 1, comment corpus acquisition: using a web crawler tool, collect the usage comments for a specified product from a large e-commerce platform and save them to a local database; then preprocess the saved comments to reduce noise in the data, obtaining a true, reliable, unstructured comment corpus.

The process of comment corpus acquisition is shown in Fig. 4. Crawling rules are defined for the web crawler tool, data are captured from the target large e-commerce platform, and the captured results are stored in the local database as the original comment text; the original comment text is then preprocessed to generate the comment corpus, which is also stored in the database.

Because the network is open and online comments are diverse and scattered, the comment text captured from the e-commerce platform contains a large amount of noise; if text mining were applied to it directly, the results could deviate considerably from reality. To obtain realistic results, the original comment set therefore needs to be filtered and cleaned to reduce noise. Preprocessing includes deleting blank and useless comments, deleting superfluous punctuation in comments, deleting redundant words in comments, deleting comments shorter than 4 characters, correcting wrongly written characters, replacing traditional Chinese characters with simplified ones, and deleting duplicate comments.
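A minimal sketch of part of this cleaning is given here, under stated assumptions. The function name `preprocess` and the two regular expressions are illustrative choices, not the patented procedure; only the 4-character threshold comes from the text. Correcting wrongly written characters and converting traditional to simplified characters need external resources and are left out.

```python
import re

MIN_CHARS = 4  # comments shorter than 4 characters are deleted (from the text)


def preprocess(comments):
    """Clean raw comments: trim, collapse repeats, drop short and duplicate ones."""
    cleaned, seen = [], set()
    for text in comments:
        text = text.strip()
        # Superfluous punctuation: collapse a run of one mark, e.g. "!!!" -> "!".
        text = re.sub(r"([!?。，！？~.,])\1+", r"\1", text)
        # Redundant words (heuristic): a character repeated 3+ times is kept once.
        text = re.sub(r"(.)\1{2,}", r"\1", text)
        if len(text) < MIN_CHARS:   # blank or too short
            continue
        if text in seen:            # duplicate comment
            continue
        seen.add(text)
        cleaned.append(text)
    return cleaned
```

Deduplication is applied after cleaning so that comments differing only in repeated punctuation are treated as the same comment; this ordering is a design choice of the sketch.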

Step 2, Chinese natural language processing: using Chinese natural language processing tools, perform in turn initial word segmentation and part-of-speech tagging, new word recognition, optimised word segmentation and part-of-speech tagging, syntactic analysis and sentiment analysis on the comment corpus, obtaining structured sentiment analysis results that are saved to the database.

2.1) Word segmentation and part-of-speech tagging

Customers post comments on e-commerce platforms in order to communicate and share, and the comments are unstructured natural language in text form. To mine valuable information from them, they must first be converted into structured data by word segmentation. The tool used to segment the comment corpus is ICTCLAS, and the tool used to tag the segmented corpus with parts of speech is also ICTCLAS. To improve the precision of product feature extraction, the finer-grained second-level tag set is chosen for part-of-speech tagging.

With the rapid development of society, many new words have appeared. A Chinese word segmenter that has not been updated cannot recognise these new words correctly and splits them wrongly during segmentation; for example, ICTCLAS splits the word for "cost performance" (性价比) into three single-character words, "性", "价" and "比". To solve this problem and improve segmentation accuracy, new word discovery is carried out on the initial segmentation result, the recognised domain new words are added to the user dictionary, and ICTCLAS is then used to perform optimised segmentation and second-level part-of-speech tagging on the comment corpus.

New word discovery comprises four steps: constructing repeated strings, frequency filtering, cohesion filtering, and left and right entropy filtering. Repeated strings are constructed with the N-Gram algorithm, combined with exclusion lists such as a filter word list, a filter part-of-speech list and a stop word list. Frequency filtering removes repeated strings whose frequency is below a threshold. Cohesion filtering removes repeated strings whose cohesion is below a threshold; the cohesion of a repeated string is expressed by its mutual information (Mutual Information, MI), which is calculated as:
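The formula itself did not survive in this text. A standard mutual information form that is consistent with the symbols defined in the next paragraph is offered here as an assumption; the left-hand notation MI(x, y) and the unspecified logarithm base are not taken from the source.

```latex
\mathrm{MI}(x, y) = \log \frac{P_{xy}}{P_{x}\, P_{y}}
```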

where x and y denote the two substrings that make up the repeated string R, P_xy denotes the probability that the repeated string R occurs in the initial segmentation result, and P_x and P_y denote the probabilities that the substrings x and y occur individually in the initial segmentation result. Left and right entropy filtering removes repeated strings whose left entropy or right entropy is below a threshold; the left entropy and right entropy of a repeated string are calculated respectively as:
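These two formulas are also missing from this text. The usual entropy form over the left-adjacent words a and the right-adjacent words b of R, consistent with the conditional probabilities defined in the next paragraph, would be as follows; the symbols H_L and H_R and the logarithm base are assumptions.

```latex
H_{L}(R) = -\sum_{a} p(a \mid R)\, \log p(a \mid R), \qquad
H_{R}(R) = -\sum_{b} p(b \mid R)\, \log p(b \mid R)
```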

where p(a | R) denotes the probability that the string a is the left-adjacent word of the repeated string R, and p(b | R) denotes the probability that the string b is the right-adjacent word of the repeated string R.
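The cohesion and entropy filters can be sketched for a two-part repeated string R = x + y as follows. This is a minimal illustration that assumes base-2 logarithms, probabilities estimated from raw counts over one token list, and hypothetical function names; the thresholds themselves are not specified in the text and are left to the caller.

```python
import math
from collections import Counter


def mutual_information(tokens, x, y):
    """MI of the adjacent pair (x, y), estimated from counts in tokens."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    p_xy = bigrams[(x, y)] / (n - 1)
    if p_xy == 0:
        return float("-inf")   # the pair never occurs
    p_x, p_y = unigrams[x] / n, unigrams[y] / n
    return math.log2(p_xy / (p_x * p_y))


def branch_entropies(tokens, x, y):
    """(left entropy, right entropy) of the repeated string R = x + y."""
    left, right = Counter(), Counter()
    for i in range(len(tokens) - 1):
        if tokens[i] == x and tokens[i + 1] == y:
            if i > 0:
                left[tokens[i - 1]] += 1       # left-adjacent word a
            if i + 2 < len(tokens):
                right[tokens[i + 2]] += 1      # right-adjacent word b

    def entropy(counter):
        total = sum(counter.values())
        if total == 0:
            return 0.0
        return -sum(c / total * math.log2(c / total) for c in counter.values())

    return entropy(left), entropy(right)
```

A candidate would be kept as a new word only if its frequency, its mutual information and both entropies are at or above the chosen thresholds.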

2.2) Syntactic analysis

Dependency parsing is one of the key techniques in natural language processing. It identifies the grammatical components of a sentence, such as subject, predicate and object, and attributive, adverbial and complement, and analyses the relations between these components. Here the Language Technology Platform (LTP) developed by Harbin Institute of Technology is used to determine the dependency relations between the components of a sentence. Because ICTCLAS and LTP use different part-of-speech tag sets, the tag set is converted before dependency parsing is carried out.

Fig. 5 shows the dependency parsing result of one comment sentence. As the result in Fig. 5 shows, dependency relations hold directly between the words of a sentence, and each dependency relation connects two words, one called the governing word and the other the dependent. A dependency relation is represented by a directed arc, called a dependency arc, which points from the governing word to the dependent. Each dependency arc carries a label, the relation type, which indicates what kind of dependency relation exists between the governing word and the dependent. A dependent, a relation type and a governing word form a dependency pair, meaning that the dependent depends on the governing word through that relation. As shown in Fig. 5, (mobile phone, SBV, good) is a dependency pair in which "mobile phone" is the dependent, "good" is the governing word and "SBV" indicates that an "SBV" dependency relation exists between "mobile phone" and "good"; the pair expresses that "mobile phone" depends on "good" through "SBV".

The centre word of a sentence is not governed by any other component; that is, the word whose dependency relation with "Root" is "HED" is the centre word. In Fig. 5 the dependency relation between "good" and "Root" is "HED", so "good" is the centre word of this comment.
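The dependency pair and the centre word can be pictured with a short sketch. The type `DepPair` and the function `centre_word` are hypothetical names, and the sample contains only the two pairs mentioned in the text for Fig. 5, not the full parse.

```python
from typing import List, NamedTuple, Optional


class DepPair(NamedTuple):
    """(dependent, relation type, governing word), as in (mobile phone, SBV, good)."""
    dependent: str
    relation: str
    governor: str


def centre_word(pairs: List[DepPair]) -> Optional[str]:
    """The word whose relation with "Root" is "HED", or None if there is none."""
    for pair in pairs:
        if pair.relation == "HED" and pair.governor == "Root":
            return pair.dependent
    return None


pairs = [
    DepPair("mobile phone", "SBV", "good"),
    DepPair("good", "HED", "Root"),
]
print(centre_word(pairs))
```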

2.3) Sentiment analysis

Analysis of massive, multi-source, heterogeneous Chinese online comment text shows that the comments users post are evaluations of their experience with the goods they purchased, and that users generally express their opinions with adjectives, nouns or verbs. A sentiment word dictionary is compiled here and used to judge whether the governing word of each word in the syntactic analysis result is a sentiment word: if the governing word of a word is a sentiment word, the sentiment mark isOp of that word is set to "Y"; otherwise it is set to "N".
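As a rough sketch of this marking step, each row of the syntactic analysis result can be looked up against the dictionary by its governing word. The function name `mark_sentiment`, the dictionary-of-fields row layout and the one-word lexicon are assumptions for illustration; the patent's own sentiment dictionary is not reproduced here.

```python
def mark_sentiment(rows, sentiment_words):
    """Return copies of rows with isOp = "Y" if the governing word is a sentiment word."""
    return [
        dict(row, isOp="Y" if row["pWd"] in sentiment_words else "N")
        for row in rows
    ]


# Hypothetical lexicon and rows, using the pair from Fig. 5.
lexicon = {"good"}
rows = [
    {"tk": "mobile phone", "pRel": "SBV", "pWd": "good"},
    {"tk": "good", "pRel": "HED", "pWd": "Root"},
]
marked = mark_sentiment(rows, lexicon)
```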

In Chinese product comments the dependency relations between two words are very complex, so two types of dependency relation are defined to describe the grammatical link between words: the direct dependency relation and the indirect dependency relation. A direct dependency relation means that one word depends directly on another, as shown in (a) of Fig. 6, where A depends directly on B through a dependency relation. An indirect dependency relation means that one word depends on another through one or more intermediate words, as shown in (b) and (c) of Fig. 6, where A depends directly on an intermediate word through a dependency relation and the intermediate word in turn depends directly on B through one or more "COO" relations, so that A depends indirectly on B.
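The two relation types can be told apart by walking up the dependency arcs. The sketch assumes a parse stored as a mapping from each word to its (relation, governing word) pair, which only works when words are unique within the sentence; a real implementation would use token positions. The function name `dependency_type` is hypothetical.

```python
def dependency_type(tree, a, b):
    """Return "direct", "indirect" or None for the dependency of a on b.

    tree maps each word to (relation, governing word).
    """
    _, gov = tree[a]
    if gov == b:
        return "direct"
    # Indirect: the intermediate words must reach b through "COO" arcs only.
    seen = {a}
    while gov in tree and gov not in seen:
        seen.add(gov)
        rel, nxt = tree[gov]
        if rel != "COO":
            return None
        if nxt == b:
            return "indirect"
        gov = nxt
    return None
```

For example, with A attached to M1 by "SBV", M1 to M2 by "COO" and M2 to B by "COO", A depends indirectly on B, while M2 depends directly on B.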

3.1) Product feature annotation

Analysis of a large amount of Chinese comment text leads to the conclusion that a product feature must satisfy the following five criteria:

First, a product feature cannot be a stop word

Stop words are usually words that are used very frequently but carry no real meaning of their own and only play a role when placed in a complete sentence, for example function words such as "and". A product feature, by contrast, is a content word: it has lexical and semantic meaning and can serve as a syntactic constituent in a sentence. A product feature therefore cannot be a stop word.

Second, a product feature is a noun or noun phrase that occurs frequently in the comment corpus

Third, the dependency relation between a product feature and its governing word is "SBV", and the governing word is a sentiment word

Fourth, a product feature is a word that satisfies the seven extraction rules

Fifth, a product feature is a domain word that is not a single character

The seven extraction rules in criterion four were summarised by combining the definitions of the dependency relation types with the sentiment analysis results, according to whether a direct or an indirect dependency relation exists between the governing word of a word and the centre word, as shown in Fig. 7.

These seven rules fall into two categories according to the part of speech of the centre word, as follows:

(1) When the centre word is an adjective

① When the relation between a word and the centre word is "SBV", that is, the governing word of the word is the centre word itself, the word is a product feature, as shown in (a) of Fig. 7.

② When the governing word of a word is not the centre word but a direct "COO" dependency relation exists between the governing word and the centre word, the word is a product feature, as shown in (b) of Fig. 7.

③ When the governing word of a word is not the centre word but an indirect "COO" dependency relation exists between the governing word and the centre word, the word is a product feature, as shown in (c) of Fig. 7.

(2) When the centre word is a verb

④ When a direct "COO" dependency relation exists between the governing word of a word and the centre word, the word is a product feature, as shown in (d) of Fig. 7.

⑤ When a direct "VOB" dependency relation exists between the governing word of a word and the centre word, the word is a product feature, as shown in (f) of Fig. 7.

⑥ When an indirect "COO" dependency relation exists between the governing word of a word and the centre word, the word is a product feature, as shown in (e) of Fig. 7.

⑦ When an indirect "VOB" dependency relation exists between the governing word of a word and the centre word, the word is a product feature, as shown in (g) of Fig. 7.
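One possible reading of the seven rules is sketched here as a pattern match on the arcs between the governing word and the centre word. Several points are assumptions rather than statements from the source: the word itself is required to attach to its governing word by "SBV" (criterion three); an indirect "COO" relation is read as two or more "COO" arcs; an indirect "VOB" relation is read as one "VOB" arc followed by one or more "COO" arcs; and the tags "a" and "v" stand for adjective and verb. The function names and the word-keyed tree are hypothetical.

```python
def chain_to_centre(tree, gov, centre):
    """Relations on the path from gov up to centre, or None if centre is not reached."""
    rels, seen = [], set()
    while gov != centre:
        if gov not in tree or gov in seen:
            return None
        seen.add(gov)
        rel, gov = tree[gov]
        rels.append(rel)
    return rels


def matching_rule(tree, word, centre, centre_pos):
    """Number (1-7) of the extraction rule the word satisfies, or None."""
    rel, gov = tree[word]
    if rel != "SBV":                       # assumed from criterion three
        return None
    rels = chain_to_centre(tree, gov, centre)
    if rels is None:
        return None
    all_coo = bool(rels) and all(r == "COO" for r in rels)
    if centre_pos == "a":                  # centre word is an adjective
        if not rels:
            return 1                       # governing word is the centre word
        if all_coo:
            return 2 if len(rels) == 1 else 3
    elif centre_pos == "v":                # centre word is a verb
        if all_coo:
            return 4 if len(rels) == 1 else 6
        if rels and rels[0] == "VOB":
            if len(rels) == 1:
                return 5
            if all(r == "COO" for r in rels[1:]):
                return 7
    return None
```

A word for which `matching_rule` returns a number would satisfy criterion four; the remaining criteria still have to be checked separately.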

Fig. 8 shows part of the product feature annotation results, which have 8 fields: no is the sequence number, tk the word form, pos the part of speech, pRel the dependency relation, pWd the governing word, pPos the part of speech of the governing word, isOp the sentiment mark and isPF the product feature mark. The word form and part of speech are produced by word segmentation and part-of-speech tagging; the dependency relation, governing word and governing word part of speech are produced by syntactic analysis; the sentiment mark is produced by sentiment analysis; and the product feature mark is produced by product feature annotation.

3.2) Product feature extraction

The words annotated as product features in the product feature annotation set are extracted to generate the product feature set.
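This final step amounts to a filter over the annotated rows. The sketch assumes that the product feature mark isPF uses "Y" for a feature, by analogy with isOp, and that repeated words are stored once; neither detail is stated in the text, and the function name is hypothetical.

```python
def extract_feature_set(rows):
    """Return [(sequence number, product feature), ...] from annotated rows."""
    features = []
    for row in rows:
        # Keep each word marked as a product feature once, in order of appearance.
        if row["isPF"] == "Y" and row["tk"] not in features:
            features.append(row["tk"])
    return list(enumerate(features, start=1))
```

The result has the two fields described for the product feature set: sequence number and product feature.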

The invention can use a web crawler tool to capture the user comment text related to a specified product on a large e-commerce platform, apply a series of processing steps to it, annotate and extract product features according to the five defined criteria, and generate the product feature set. Using the method of the invention, product features can be extracted efficiently, and the precision, recall and F-value of product feature extraction are improved.

Claims

1. A product feature extraction method based on pattern matching, characterised in that the method comprises the following steps:

Step 1: Comment corpus acquisition

Using a web crawler tool, collect the usage comments for a specified product from a large e-commerce platform and save them to a local database; then preprocess the saved comments to reduce noise in the data, obtaining a true, reliable, unstructured comment corpus;

Step 2: Chinese natural language processing

Using Chinese natural language processing tools, perform in turn initial word segmentation and part-of-speech tagging, new word recognition, optimised word segmentation and part-of-speech tagging, syntactic analysis and sentiment analysis on the comment corpus, obtaining structured sentiment analysis results that are saved to the database;

Step 3: Product feature extraction

Define five criteria for product features, annotate product features in the sentiment analysis results according to these five criteria, extract the words annotated as product features, and generate the product feature set.

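2. The product feature extraction method based on pattern matching as claimed in claim 1, characterised in that, in step 3, the five criteria for product features are as follows:

First, a product feature cannot be a stop word;

Second, a product feature is a noun or noun phrase that occurs frequently in the comment corpus;

Third, the dependency relation between a product feature and its governing word is "SBV", and the governing word is a sentiment word;

Fourth, a product feature is a word that satisfies the seven extraction rules;

Fifth, a product feature is a domain word that is not a single character.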

3. The product feature extraction method based on pattern matching as claimed in claim 2, characterised in that the seven extraction rules that a product feature satisfies fall into two categories according to the part of speech of the centre word, as follows:

First, when the centre word is an adjective:

① when the relation between a word and the centre word is "SBV", that is, the governing word of the word is the centre word itself, the word is a product feature; ② when the governing word of a word is not the centre word but a direct "COO" dependency relation exists between the governing word and the centre word, the word is a product feature; ③ when the governing word of a word is not the centre word but an indirect "COO" dependency relation exists between the governing word and the centre word, the word is a product feature;

Second, when the centre word is a verb and the governing word of the word is not the centre word:

④ when a direct "COO" dependency relation exists between the governing word of a word and the centre word, the word is a product feature; ⑤ when a direct "VOB" dependency relation exists between the governing word of a word and the centre word, the word is a product feature; ⑥ when an indirect "COO" dependency relation exists between the governing word of a word and the centre word, the word is a product feature; ⑦ when an indirect "VOB" dependency relation exists between the governing word of a word and the centre word, the word is a product feature.