CN109871447A

CN109871447A - Clustering method for unsupervised learning on Chinese comments, computer program product, and server system

Info

Publication number: CN109871447A
Application number: CN201910163711.3A
Authority: CN
Inventors: 杨帆; 于巨明; 尚应
Original assignee: Nanjing Zhenshi Intelligent Technology Co Ltd
Current assignee: Nanjing Zhenshi Intelligent Technology Co Ltd
Priority date: 2019-03-05
Filing date: 2019-03-05
Publication date: 2019-06-11

Abstract

The present invention provides a clustering method for unsupervised learning on Chinese comments, a computer program product, and a server system. The clustering method includes: acquiring comment data and organizing it into a corpus; preprocessing the comment content in the corpus, then performing word segmentation and word-vector training; extracting candidate labels; deduplicating the candidate label library; filtering the deduplicated candidate labels by emotion words; performing a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtaining the magnitude of every candidate label, and sorting the clustering results in descending order of quantity; and finally counting the magnitude of each cluster and outputting the TopN. The invention proposes a clustering approach based on unsupervised learning that overcomes the difficulty earlier label clustering methods had in expressing comment results objectively. It refines and learns labels autonomously and without supervision from the actual content of the comments and labels, yielding clustering results that are more objective and that reflect what the comments actually say.

Description

Clustering method for unsupervised learning on Chinese comments, computer program product, and server system

Technical field

The present invention relates to the technical field of data mining and processing, and in particular to a clustering method for unsupervised learning on Chinese comments, a computer program product, and a server system.

Background art

At present, e-commerce platforms and forums often use technical means to extract and display labels from evaluations of goods or services, so that potential users can see the most direct evaluation of a product or service at a glance. There are two main ways of generating these labels. The first is extraction: the words or phrases with the highest frequency of occurrence are extracted statistically to form labels, which are then ranked by frequency. This approach introduces considerable noise into the labels and, because it relies on statistics alone, often produces very strange results (labels) that do not truly reflect the characteristics of the comments or the product. The second is generation from predefined custom labels: the comments are searched for each custom label and every occurrence adds 1 to its count; once all comments have been searched, the accumulated counts of the custom labels are obtained and the top N are taken as the final labeling result. This approach usually requires considerable manual labor, is inefficient at labeling, and can only accumulate counts for the custom labels, so it is often ineffective for new comments or keywords.

Both of the above approaches amount to supervised clustering, and their common characteristic is that they struggle to reflect the true situation.

Summary of the invention

The purpose of the present invention is to address the problems of prior-art supervised clustering by proposing a clustering method for unsupervised learning on Chinese comments, a computer program product, and a server system. Labels obtained through unsupervised clustering can be updated and learned autonomously and reflect the true state of the comments and of the commented object more deeply, so that the clustering result is more objective.

To achieve the above object, the technical solution adopted by the present invention is as follows:

A clustering method for unsupervised learning on Chinese comments, comprising the following steps:

Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing comment content information stored in order;

Step 2: preprocess the comment content information in the corpus, and perform word segmentation and word-vector training to obtain the word vectors corresponding to the segmentation results;

Step 3: extract candidate labels according to natural-language-based label extraction rules to form a candidate label library;

Step 4: deduplicate the candidate label library, removing duplicate candidate labels;

Step 5: apply emotion-word filtering to the deduplicated candidate labels, removing invalid labels;

Step 6: perform a DBSCAN-based clustering operation on the candidate labels remaining after the invalid labels are removed, obtain the magnitude of every candidate label, and sort the clustering results in descending order of quantity;

Step 7: count the magnitude of each cluster and output the TopN.

According to another aspect of the present disclosure, a computer program product is also proposed, comprising one or more non-transitory machine-readable media encoded with instructions which, when executed by one or more processors, cause a process to be carried out, the process performing unsupervised clustering of the acquired Chinese comment data and including the foregoing procedure.

According to a third aspect of the present disclosure, a server system is also proposed, comprising:

An interface configured to obtain comment data for at least one product or service;

At least one processor;

At least one memory configured to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing a process of unsupervised clustering of the acquired comment data, the process including:

Step 1: organize the acquired comment data into a corpus, the corpus containing comment content information stored in order;

Step 7: count the magnitude of each cluster and output the TopN.

In a more preferred example, the process further includes:

In step 5, emotion-word filtering is applied to each candidate label to generate a filtered candidate label library, whose data structure includes the candidate label string and the candidate label string vector;

Then, in step 6, the candidate labels are input to the DBSCAN clustering algorithm for the clustering operation: starting from the first candidate label, the similarity between the chosen candidate label and every other candidate label in the candidate label library is calculated with the cosine similarity algorithm, each similarity value is compared with a preset similarity threshold, and the set of labels whose similarity exceeds the threshold is determined; next, if the size of that label set is greater than the configured minimum number of neighbors, the number of labels in the set is counted as the magnitude of this label, otherwise processing of this label ends;

Then, the above clustering calculation is repeated until all candidate labels have been clustered;

Finally, the clustering results are sorted in descending order of quantity according to all the labels obtained and their magnitudes.

Combining the foregoing technical solutions and their implementation, the significant beneficial effects of the present invention are:

1. Overall, a clustering approach based on unsupervised learning is proposed. It overcomes the problem that earlier supervised clustering, based on simple statistics or predefined labels, cannot learn autonomously, so that the displayed labels struggle to express true and objective comment results. With the unsupervised clustering approach of the invention, after word segmentation and rule-based selection of candidate labels, the clustering refines and learns autonomously and without supervision (no custom labels, nothing specified in advance) from the actual content of the comments and labels. The final clustering process and result better reflect objective comment results, and no human factors or interference are introduced before or after learning;

2. On the data used for clustering, label extraction based on natural language is performed. Chinese dependency syntactic analysis explains the syntactic structure of a language unit through the dependency relations between its components. Its principle is that the core verb of a sentence is the central component that governs the other components while not itself being governed by any other component, and every governed component is subordinate to its governor through some relation. Extraction under different rules can therefore be built on this analysis, for example the five classes of extraction rule used in the embodiment, "noun subject + adverbial, noun subject + adverbial + adverbial, adverbial + adverbial, adverbial + adjective, adverbial", so that candidate labels are extracted effectively and objectively from the content that occurs most often in Chinese comments;

3. The basic processing of the data also includes emotion-word filtering of the candidate labels. Matching is performed against a preferably combined emotion-word lexicon, so that many invalid and meaningless labels are filtered out. This avoids wasted processing and low efficiency in the later clustering, and avoids the resulting defect of clustering results that cannot objectively reflect the comments;

4. After the invalid labels have been filtered out, the word vectors of the tokens that were split and recombined are averaged. This facilitates the later clustering based on the cosine similarity algorithm, which yields the label sets from which the final magnitudes are determined, and improves clustering efficiency.

Brief description of the drawings

Fig. 1 is a flow diagram of the clustering method for unsupervised learning on Chinese comments according to the present invention.

Detailed description of the embodiments

In order that the technical content of the present invention may be better understood, specific embodiments are described below with reference to the accompanying drawings.

Aspects of the present invention are described in this disclosure with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. The embodiments of this disclosure are not intended to cover all aspects of the invention. It should be understood that the various concepts and embodiments introduced above, as well as those described in more detail below, can be implemented in any of many ways.

With reference to Fig. 1, the clustering method for unsupervised learning on Chinese comments according to the disclosed embodiment is intended to cluster the acquired comments on a company's product or service and obtain the TopN comment labels that best reflect the comment results. These serve as a reference that helps users understand past evaluations of the product or service as quickly as possible, or that helps improve the follow-up service for the product or service.

The statistical approach (keyword identification and accumulation) and the custom-label approach (custom keywords) used in the past cannot cover what actually appears in the comments, lack scalability, and produce highly homogeneous label content. The solution of the present invention, by contrast, operates in an unsupervised manner: it can learn and adjust in real time according to the actual comment content, continuously updating the labels and the label clustering results, and provides clustering results that are more objective and that reflect the true comment results.

With reference to Fig. 1, the unsupervised-learning clustering method proposed by the present invention generally comprises the following procedure:

Step 7: count the magnitude of each cluster and output the TopN.

The above procedure thus relies on unsupervised deep learning and natural language processing techniques. Clustering and a label extraction model automate the extraction of labels from customers' comments on a subject, giving a comprehensive and objective view of users' comments on that subject and mining the deeper information latent in them.

With reference to Fig. 1, an exemplary implementation of the clustering method of the embodiment is described in more detail below.

Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing comment content information stored in order.

In some concrete implementations, the original comment data for a product or service can be obtained through e-commerce, customer service, and other channels. This example uses a garment on sale this season, the "floral one-piece dress", as an illustration, but those skilled in the art should understand that implementation of the invention is not limited to it.

From the comment data we organize the Chinese comments into a corpus in which users' comments on the "floral one-piece dress", in particular their text content, are arranged in a certain order. In other embodiments, of course, voice comment data may also be included; it can be converted into text by speech-to-text conversion.

In some examples, all the text content is organized, for instance in order of comment time, into a corpus stored line by line for subsequent processing.
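The line-by-line, time-ordered corpus described above can be sketched as follows. This is only an illustration under stated assumptions: the `(timestamp, text)` input format, the sample comments, the `build_corpus` name, and the choice to drop blank comments are all hypothetical and are not specified by the patent.

```python
from datetime import datetime

# Hypothetical raw comments as (timestamp, text) pairs; the format is assumed.
raw_comments = [
    ("2019-03-02 10:15:00", "Like the clothes very much"),
    ("2019-03-01 09:00:00", "The fabric is comfortable"),
    ("2019-03-03 18:30:00", "   "),  # blank comment, dropped below (assumption)
]

def build_corpus(comments):
    """Return comment texts ordered by comment time, skipping blank texts."""
    ordered = sorted(
        comments,
        key=lambda c: datetime.strptime(c[0], "%Y-%m-%d %H:%M:%S"),
    )
    return [text.strip() for _, text in ordered if text.strip()]

corpus_lines = build_corpus(raw_comments)
# One comment per line, ready to be written to a corpus file.
corpus_text = "\n".join(corpus_lines)
```

Storing one comment per line keeps later steps simple, since each line can be preprocessed and segmented independently.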

Step 2: preprocess the comment content information in the corpus, and perform word segmentation and word-vector training to obtain the word vectors corresponding to the segmentation results.

In an embodiment of the present invention, step 2 mainly carries out the following processing:

Step 2-1: preprocess the text content in the corpus. Preprocessing here refers specifically to removing stop words; for example, the stop words (such as "very") are removed from the text "like the clothes very much" and the comment content "like clothes" is retained, which reduces the indexing and computation required by the subsequent segmentation, calculation, and clustering;

Step 2-2: after the stop words are removed, segment the text content in the corpus in the order in which it is stored; for example, segmentation yields the tokens "like" and "clothes";

Step 2-3: perform word-vector training on the tokens to obtain the word vectors corresponding to the segmentation results.

In the processing of steps 2-2 and 2-3, given the clustering requirements for comment content, we optionally segment with HanLP and train word vectors on the segmentation results with word2vec. The trained word vectors are used in the subsequent clustering based on the cosine similarity algorithm.
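The stop-word removal of step 2-1 can be illustrated with a minimal sketch. For simplicity it assumes the comment has already been split into tokens, whereas the steps above remove stop words before segmenting. The stop-word list, the sample tokens, and the `remove_stop_words` name are hypothetical; the token list is an illustrative segmentation and not actual HanLP output.

```python
# Hypothetical stop-word list; a real system would load a complete list from a file.
STOP_WORDS = {"的", "很", "了"}

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop stop words from a segmented comment while keeping token order."""
    return [tok for tok in tokens if tok not in stop_words]

# Illustrative segmentation of a comment meaning "like the clothes very much".
tokens = ["很", "喜欢", "的", "衣服"]
filtered = remove_stop_words(tokens)  # the tokens for "like" and "clothes" remain
```

The filtered token lists are what a word-vector trainer such as word2vec would then consume, one list per comment.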

Step 3: extract candidate labels according to natural-language-based label extraction rules to form a candidate label library.

In implementing the invention, label extraction based on natural language is performed on the data used for clustering. Chinese dependency syntactic analysis explains the syntactic structure of a language unit through the dependency relations between its components. Its principle is that the core verb of a sentence is the central component that governs the other components while not itself being governed by any other component, and every governed component is subordinate to its governor through some relation. Extraction under different rules can therefore be built on this analysis, for example the five classes of extraction rule used in the embodiment, "noun subject + adverbial, noun subject + adverbial + adverbial, adverbial + adverbial, adverbial + adjective, adverbial", so that candidate labels are extracted effectively and objectively from the content that occurs most often in Chinese comments.

Of course, in other embodiments, other extraction rules or combinations of them can be selected for different clustering scenarios and requirements.

Step 4: deduplicate the candidate label library, removing duplicate candidate labels.

In a preferred example, the candidate labels in the candidate label library are deduplicated with the simhash algorithm, removing labels whose content is substantially the same.

For example, the comment contents "like clothes" and "like clothes" are labels that express essentially the same meaning. To simplify the subsequent unified clustering, one of them is removed and only one is retained. In the subsequent clustering, labels that express essentially the same meaning are then clustered under the same label, which improves the computational efficiency and objectivity of the clustering and avoids confusion and repeated clustering of near-synonyms.
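A simhash-style deduplication can be sketched as below. The patent names the simhash algorithm but gives no parameters, so everything specific here is an assumption made for illustration: single characters as features, MD5 as the per-feature hash, a 64-bit fingerprint, and a Hamming-distance cutoff of 3. The function names are hypothetical.

```python
import hashlib

def simhash(text, bits=64):
    """Compute a SimHash fingerprint using single characters as features (assumption)."""
    weights = [0] * bits
    for ch in text:
        digest = hashlib.md5(ch.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # 64-bit hash of the feature
        for i in range(bits):
            weights[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set where the summed weight is positive.
    return sum(1 << i for i in range(bits) if weights[i] > 0)

def hamming(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")

def dedup(labels, max_distance=3):
    """Keep the first label of each group of near-identical fingerprints."""
    kept, fingerprints = [], []
    for label in labels:
        fp = simhash(label)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(label)
            fingerprints.append(fp)
    return kept

# Two labels built from the same characters in a different order share a
# fingerprint under the character-feature assumption, so only the first is kept.
unique_labels = dedup(["喜欢衣服", "衣服喜欢", "喜欢衣服"])
```

With character features the fingerprint ignores character order, which is one reason a production system might prefer tokens or n-grams as features instead.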

Step 5: apply emotion-word filtering to the deduplicated candidate labels, removing invalid labels.

Although the labels have been extracted, preprocessed, and deduplicated by the steps above, in practice some comment content still makes no substantive comment on the commented object, the "floral one-piece dress". A comment such as "I am uncomfortable today" does not really reflect the commodity being commented on; it is not a meaningful comment on the commented object, and it contains no emotion word. We therefore want to filter out comments of this kind before clustering, so that what is actually clustered is genuine evaluation content about the commodity.

In a preferred embodiment, we filter using a combined emotion-word lexicon.

Specifically, the emotion-word filtering in step 5 includes the following procedure:

Step 5-1: set up the combined emotion lexicon;

Step 5-2: load the emotion lexicon into a set. Starting from the first candidate label, split the candidate label into multiple words with the jieba segmentation algorithm, and match every resulting word for equality against the emotion words in the lexicon one by one. If a match succeeds, the candidate label is marked as containing an emotion word; otherwise it is marked as not containing one;

Step 5-3: if the candidate label is determined to contain an emotion word, recombine the split words into the candidate label, look up the word vector of every token of the candidate label in the word-vector library of step 1, and calculate the average of the word vectors; if it contains no emotion word, filter it out directly;

Step 5-4: apply the emotion-word filtering of steps 5-2 and 5-3 to every candidate label. When processing is complete, the filtered candidate label library is generated; its data structure includes the candidate label string and the candidate label string vector.

In particular, in step 5-1 we preferably combine multiple emotion-word lexicons so that the lexicon's coverage is broader and more comprehensive. This avoids the case where a single emotion lexicon is insufficient and comments that should take part in clustering are filtered out by mistake.

For example, Tsinghua University's Li Jun Chinese commendatory and derogatory term dictionary and the Dalian University of Technology Chinese emotion vocabulary ontology (without auxiliary emotion classification) are added to the original emotion vocabulary. Under otherwise equal conditions, the combined emotion vocabularies give a better labeling result.
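Steps 5-2 to 5-4 can be sketched as follows. The sketch assumes each candidate label has already been split into tokens (the embodiment uses jieba for this), and it uses a tiny hand-written lexicon and word-vector table in place of the combined dictionaries and the trained word2vec model. All names and values are hypothetical, and skipping labels for which no token has a vector is an added assumption.

```python
# Hypothetical merged emotion lexicon: step 5-1 combines several dictionaries,
# modeled here as a union of two small sets.
EMOTION_WORDS = {"喜欢", "漂亮"} | {"失望"}

# Hypothetical word-vector table standing in for the trained word2vec model.
WORD_VECTORS = {
    "喜欢": [1.0, 0.0],
    "衣服": [0.0, 1.0],
    "今天": [0.5, 0.5],
}

def mean_vector(tokens, vectors):
    """Average the vectors of the tokens that have one; None if none do."""
    found = [vectors[t] for t in tokens if t in vectors]
    if not found:
        return None
    dim = len(found[0])
    return [sum(v[i] for v in found) / len(found) for i in range(dim)]

def filter_labels(segmented_labels, emotion_words, vectors):
    """Keep labels that contain an emotion word; return (string, mean vector) pairs."""
    library = []
    for tokens in segmented_labels:
        if not any(tok in emotion_words for tok in tokens):
            continue  # no emotion word: the label is filtered out
        vec = mean_vector(tokens, vectors)
        if vec is not None:
            library.append(("".join(tokens), vec))
    return library

# The first label contains an emotion word; the second does not and is dropped.
segmented = [["喜欢", "衣服"], ["今天", "收到"]]
label_library = filter_labels(segmented, EMOTION_WORDS, WORD_VECTORS)
```

Each entry of `label_library` pairs the recombined label string with its averaged vector, matching the two fields the filtered candidate label library is said to contain.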

Step 6: perform a DBSCAN-based clustering operation on the candidate labels remaining after the invalid labels are removed, obtain the magnitude of every candidate label, and sort the clustering results in descending order of quantity.

In the embodiment of the present invention, the DBSCAN-based clustering operation specifically includes the following procedure:

Step 6-1: load the candidate labels, that is, obtain the candidate label library produced in step 5-4;

Step 6-2: input the candidate labels to the DBSCAN clustering algorithm for the clustering operation. Starting from the first candidate label, calculate the similarity between the chosen candidate label and every other candidate label in the candidate label library with the cosine similarity algorithm, compare each similarity value with a preset similarity threshold, and determine the set of labels whose similarity exceeds the threshold;

Step 6-3: if the size of the label set is greater than the configured minimum number of neighbors, count the number of labels in the set as the magnitude of this label; otherwise processing of this label ends;

Step 6-4: process every candidate label in the candidate label library in turn according to steps 6-2 and 6-3, until all candidate labels have been clustered;

Step 6-5: sort the clustering results in descending order of quantity according to all the labels obtained and their magnitudes.
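Steps 6-2 to 6-5, together with the TopN output of step 7, can be sketched as follows. This follows the neighborhood counting described in the steps rather than a full DBSCAN with cluster expansion. The threshold, the minimum neighbor count, the sample vectors, and the function names are hypothetical, and excluding a label from its own neighbor set is an assumption, since the text does not say whether it counts.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_magnitudes(library, threshold=0.8, min_neighbors=1):
    """Count, per label, the other labels whose similarity exceeds the threshold.

    A label is kept only if its neighbor count is greater than min_neighbors;
    the result is sorted by magnitude in descending order.
    """
    ranked = []
    for i, (label, vec) in enumerate(library):
        neighbors = [
            other for j, (other, other_vec) in enumerate(library)
            if j != i and cosine(vec, other_vec) > threshold
        ]
        if len(neighbors) > min_neighbors:
            ranked.append((label, len(neighbors)))
    return sorted(ranked, key=lambda item: item[1], reverse=True)

# Hypothetical filtered label library: (label string, averaged word vector).
library = [
    ("喜欢衣服", [1.0, 0.1]),
    ("很喜欢", [1.0, 0.0]),
    ("非常喜欢", [0.9, 0.1]),
    ("质量好", [0.1, 1.0]),
    ("质量不错", [0.0, 1.0]),
]
ranked = label_magnitudes(library, threshold=0.8, min_neighbors=0)
top_2 = ranked[:2]  # TopN output with N = 2
```

Because every member of a dense group reports the same neighbor count, near-synonymous labels appear together at the top of the ranking; a full DBSCAN would instead merge them into a single cluster.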

In connection with the implementation of the invention, in some embodiments we also propose a computer program product comprising one or more non-transitory machine-readable media encoded with instructions which, when executed by one or more processors, cause a process to be carried out. The process performs unsupervised clustering of the acquired Chinese comment data and includes the foregoing method, in particular the method shown in Fig. 1 and the processing described above with reference to Fig. 1.

It is worth noting that Fig. 1 and the foregoing processing described with reference to Fig. 1, that is, the clustering method based on unsupervised learning, can be implemented on a local server, a local computer system, or a cloud server.

An implementation on a cloud server is taken as an example below.

The server system according to the present disclosure comprises:

At least one processor;

At least one memory configured to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing a process of unsupervised clustering of the acquired comment data, the process including:

Step 7: count the magnitude of each cluster and output the TopN.

In a particularly preferred embodiment, the aforementioned process further includes:

Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Those of ordinary skill in the art to which the invention belongs can make various changes and modifications without departing from the spirit and scope of the invention. The scope of protection of the invention is therefore defined by the claims.

Claims

1. A clustering method for unsupervised learning on Chinese comments, characterized by comprising the following steps:

Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing comment content information stored in order;

Step 2: preprocess the comment content information in the corpus, and perform word segmentation and word-vector training to obtain the word vectors corresponding to the segmentation results;

Step 6: perform a DBSCAN-based clustering operation on the candidate labels remaining after the invalid labels are removed, obtain the magnitude of every candidate label, and sort the clustering results in descending order of quantity;

Step 7: count the magnitude of each cluster and output the TopN.

2. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that the preprocessing in step 2 includes removing stop words.

3. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that, in step 2, segmentation is performed with HanLP and word vectors are trained on the segmentation results with word2vec.

4. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that the label extraction rules used in step 3 include five classes of extraction rule, namely noun subject + adverbial, noun subject + adverbial + adverbial, adverbial + adverbial, adverbial + adjective, and adverbial, by which the candidate labels are obtained.

5. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that, in step 4, the candidate labels in the candidate label library are deduplicated with the simhash algorithm, removing labels whose content is substantially the same.

6. The clustering method for unsupervised learning on Chinese comments according to claim 1, characterized in that the emotion-word filtering in step 5 specifically includes:

Step 5-1: set up the combined emotion lexicon;

Step 5-2: load the emotion lexicon into a set; starting from the first candidate label, split the candidate label into multiple words with the jieba segmentation algorithm, and match every resulting word for equality against the emotion words in the lexicon one by one; if a match succeeds, the candidate label is marked as containing an emotion word, otherwise it is marked as not containing one;

Step 5-3: if the candidate label is determined to contain an emotion word, recombine the split words into the candidate label, look up the word vector of every token of the candidate label in the word-vector library of step 1, and calculate the average of the word vectors; if it contains no emotion word, filter it out directly;

Step 5-4: apply the emotion-word filtering of steps 5-2 and 5-3 to every candidate label; when processing is complete, the filtered candidate label library is generated, its data structure including the candidate label string and the candidate label string vector.

7. The clustering method for unsupervised learning on Chinese comments according to claim 6, characterized in that the clustering operation in step 6 comprises the following steps:

Step 6-1: load the candidate labels, that is, obtain the candidate label library of step 5-4;

Step 6-2: input the candidate labels to the DBSCAN clustering algorithm for the clustering operation; starting from the first candidate label, calculate the similarity between the chosen candidate label and every other candidate label in the candidate label library with the cosine similarity algorithm, compare each similarity value with a preset similarity threshold, and determine the set of labels whose similarity exceeds the threshold;

Step 6-3: if the size of the label set is greater than the configured minimum number of neighbors, count the number of labels in the set as the magnitude of this label; otherwise processing of this label ends;

8. A computer program product comprising one or more non-transitory machine-readable media encoded with instructions which, when executed by one or more processors, cause a process to be carried out, the process performing unsupervised clustering of the acquired Chinese comment data and including the procedure of the method according to any one of the preceding claims 1-7.

9. A server system, characterized by comprising:

At least one processor;

At least one memory configured to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing a process of unsupervised clustering of the acquired comment data, the process including:

Step 1: organize the acquired comment data into a corpus, the corpus containing comment content information stored in order;

Step 7: count the magnitude of each cluster and output the TopN.

10. The server system according to claim 9, characterized in that the process further includes:

In step 5, emotion-word filtering is applied to each candidate label to generate a filtered candidate label library, whose data structure includes the candidate label string and the candidate label string vector;

Then, in step 6, the candidate labels are input to the DBSCAN clustering algorithm for the clustering operation: starting from the first candidate label, the similarity between the chosen candidate label and every other candidate label in the candidate label library is calculated with the cosine similarity algorithm, each similarity value is compared with a preset similarity threshold, and the set of labels whose similarity exceeds the threshold is determined; next, if the size of that label set is greater than the configured minimum number of neighbors, the number of labels in the set is counted as the magnitude of this label, otherwise processing of this label ends;