CN109871447A - Clustering method, computer program product, and server system for unsupervised learning on Chinese comments - Google Patents

Clustering method, computer program product, and server system for unsupervised learning on Chinese comments

Info

Publication number
CN109871447A
Authority
CN
China
Prior art keywords
candidate
label
comment
tag
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910163711.3A
Other languages
Chinese (zh)
Inventor
杨帆
于巨明
尚应
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhenshi Intelligent Technology Co Ltd
Original Assignee
Nanjing Zhenshi Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhenshi Intelligent Technology Co Ltd filed Critical Nanjing Zhenshi Intelligent Technology Co Ltd
Priority to CN201910163711.3A priority Critical patent/CN109871447A/en
Publication of CN109871447A publication Critical patent/CN109871447A/en
Pending legal-status Critical Current

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a clustering method, a computer program product, and a server system for unsupervised learning on Chinese comments. The clustering method includes: obtaining comment data and organizing it into a corpus; preprocessing the comment content in the corpus, then performing word segmentation and word-vector training; extracting candidate labels; deduplicating the candidate label library; applying sentiment-word filtering to the deduplicated candidate labels; performing a DBSCAN-based clustering operation on the candidate labels that remain after invalid labels are removed, obtaining the magnitude (count) of every candidate label, and sorting the clustering results in descending order of magnitude; and finally tallying the magnitude of each cluster and outputting the Top N. By clustering in an unsupervised manner, the invention overcomes the difficulty that earlier label clustering methods have in expressing comment results objectively: it refines and learns labels autonomously, without supervision, from the actual content of the comments and labels, and yields clustering results that are more objective and that reflect the actual comments.

Description

Clustering method, computer program product, and server system for unsupervised learning on Chinese comments
Technical Field
The present invention relates to the field of data mining and processing, and in particular to a clustering method, a computer program product, and a server system for unsupervised learning on Chinese comments.
Background
On e-commerce platforms and forums, labels are commonly extracted from product or service reviews and displayed so that prospective users can see the most direct evaluations at a glance. Existing approaches to generating these labels fall into two main types. The first is extraction: the most frequent words or phrases are identified statistically, turned into labels, and ranked by frequency. This produces considerable noise, and because it relies on statistics alone it often yields odd labels that do not reflect the comments or the product. The second relies on predefined labels: the comments are searched for each predefined label and a counter is incremented by 1 for each occurrence; once all comments have been scanned, the accumulated counts are ranked and the top N are taken as the final labeling result. This approach requires considerable manual labeling effort, is inefficient, and can only count the predefined labels, so it generally has no effect on new comments or new keywords.
Both approaches amount to clustering in a supervised manner, and both have difficulty reflecting the true situation.
Summary of the Invention
The purpose of the present invention is to address the problems of prior-art supervised clustering by proposing a clustering method, a computer program product, and a server system for unsupervised learning on Chinese comments. Labels obtained through unsupervised clustering can be updated and learned autonomously and reflect the comments and the commented object more deeply, making the clustering results more objective.
To achieve the above purpose, the present invention adopts the following technical solution:
A clustering method for unsupervised learning on Chinese comments, comprising the following steps:
Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vectors corresponding to the segmentation results;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library to remove duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels to remove invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels remaining after invalid labels are removed, obtain the magnitude of every candidate label, and sort the clustering results in descending order of magnitude;
Step 7: tally the magnitude of each cluster and output the Top N.
According to another aspect of the present disclosure, a computer program product is also proposed, comprising one or more non-transitory machine-readable media encoded with instructions that, when executed by one or more processors, cause a process to be carried out, the process performing unsupervised clustering of the obtained Chinese comment data and comprising the foregoing steps.
According to a third aspect of the present disclosure, a server system is also proposed, comprising:
An interface configured to obtain comment data for at least one product or service;
At least one processor;
At least one memory configured to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing a process of unsupervised clustering of the obtained comment data, the process comprising:
Step 1: organize the obtained comment data into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vectors corresponding to the segmentation results;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library to remove duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels to remove invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels remaining after invalid labels are removed, obtain the magnitude of every candidate label, and sort the clustering results in descending order of magnitude;
Step 7: tally the magnitude of each cluster and output the Top N.
In a more preferred example, the process further comprises:
In step 5, sentiment-word filtering is applied to each candidate label to generate a filtered candidate label library, whose data structure comprises the candidate label string and the candidate label string vector;
Then, in step 6, the candidate labels are input to the DBSCAN clustering algorithm for the clustering operation: starting from the first candidate label, a candidate label is selected and its similarity to every other candidate label in the library is calculated with the cosine similarity algorithm; each similarity value is compared with a preset similarity threshold to determine the set of labels whose similarity exceeds the threshold; if the size of this label set exceeds the defined minimum number of neighbors, the number of labels in the set is counted as the magnitude of this label; otherwise, processing ends;
Then, the above clustering calculation is repeated until all candidate labels have been clustered;
Finally, based on all the labels obtained and their magnitudes, the clustering results are sorted in descending order of magnitude.
Combining the foregoing technical solution and its implementation, the notable beneficial effects of the present invention are:
1. Overall, a clustering mode based on unsupervised learning is proposed. It overcomes the problem that earlier supervised clustering, relying on simple statistics or predefined labels, cannot learn autonomously, so that the displayed labels struggle to express the true, objective comment results. With the unsupervised clustering mode of the present invention, after word segmentation and rule-based selection of candidate labels, the clustering refines and learns autonomously and without supervision (no customization, no labels specified in advance) from the actual content of the comments and labels; the final clustering process and results better reflect the objective comment results, and no human factors or interventions are introduced before or after learning;
2. On the data basis for clustering, natural-language-based label extraction is performed. Chinese dependency parsing explains syntactic structure through the dependency relations between the components of a linguistic unit. Its principle is that the core verb of a sentence is the central component that governs all other components and is itself governed by none, and every governed component is subordinate to its governor through some relation. Different extraction rules can be built on this basis, such as the five classes of extraction rules used in the embodiment (noun subject + adverbial, noun subject + adverbial + adverbial, adverbial + adverbial, adverbial + adjective, and adverbial), which extract candidate labels effectively and objectively from the kinds of content that occur most often in Chinese comments;
3. The basic data processing also includes sentiment-word filtering of the candidate labels. Matching against a preferably combined set of sentiment lexicons filters out many invalid and meaningless labels, avoiding wasted computation and low efficiency in the later clustering and avoiding clustering results that fail to reflect the comments objectively;
4. In the later stage of filtering out invalid labels, the word vectors of the segmented words that were split and recombined are averaged, which facilitates the later clustering based on the cosine similarity algorithm to obtain label sets, with the final magnitude determined from those sets, improving clustering efficiency.
Description of the Drawings
Fig. 1 is a flow diagram of the clustering method for unsupervised learning on Chinese comments of the present invention.
Detailed Description
For a better understanding of the technical content of the present invention, specific embodiments are described below with reference to the accompanying drawing.
Aspects of the present invention are described in this disclosure with reference to the accompanying drawing, which shows a number of illustrative embodiments. The embodiments of this disclosure are not intended to cover all aspects of the invention. It should be understood that the concepts and embodiments introduced above, and those described in more detail below, can be implemented in any of many ways.
With reference to Fig. 1, the clustering method for unsupervised learning on Chinese comments according to the disclosed embodiment is intended to cluster the comments obtained on a company's product or service and obtain the Top N comment labels that best reflect the comments. These serve as a reference that helps users quickly understand past evaluations of the product or service, and as a basis for improving it.
The statistical approach (keyword identification and accumulation) and predefined labels (custom keywords) used in the past cannot cover the comments that actually occur, lack scalability, and produce highly homogeneous labels. The solution of the present invention, by contrast, works in an unsupervised manner: it learns and adjusts in real time according to the actual comment content, continually updating the labels and the label clustering results, and provides clustering results that are more objective and reflect the actual comments.
As shown in Fig. 1, the unsupervised-learning clustering method proposed by the present invention generally comprises the following process:
Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vectors corresponding to the segmentation results;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library to remove duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels to remove invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels remaining after invalid labels are removed, obtain the magnitude of every candidate label, and sort the clustering results in descending order of magnitude;
Step 7: tally the magnitude of each cluster and output the Top N.
The above process relies on unsupervised deep learning and natural language processing techniques: clustering and a label extraction model automate the extraction of labels from customer comments on a given subject, so that the potential deeper information in users' comments on that subject can be mined and presented comprehensively and objectively.
An exemplary implementation of the clustering method of the embodiment is described in more detail below with reference to Fig. 1.
Step 1: obtain comment data for a product or service and organize it into a corpus, the corpus containing the comment content stored in order.
In some implementations, the raw comment data for a product or service can be obtained through e-commerce platforms, customer service, and other channels. This example uses a piece of clothing sold this season, a "floral-print dress", for illustration, but those skilled in the art should understand that implementation of the present invention is not limited to it.
From the comment data we organize the Chinese comments into a corpus, which arranges in a certain order the users' comment content on the "floral-print dress", in particular the text content. In other embodiments, the data may of course also include voice comments, which can be converted into text through speech-to-text conversion.
In some examples, all the text content is organized, for instance in order of comment time, into a corpus stored line by line for subsequent processing.
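The corpus organization of step 1 can be sketched roughly as follows. This is only an illustrative sketch: the (timestamp, text) record layout, the timestamp format, the sample comments, and the decision to drop blank comments are assumptions not stated in the source.

```python
from datetime import datetime

# Hypothetical raw comments as (timestamp, text) pairs; this layout is an assumption.
raw_comments = [
    ("2019-03-02 10:15:00", "裙子很漂亮"),
    ("2019-03-01 09:00:00", "非常喜欢的衣服"),
    ("2019-03-03 18:30:00", "  "),  # blank comment, dropped below (assumption)
    ("2019-03-02 21:40:00", "面料很舒服"),
]

def build_corpus(comments):
    """Return the comment texts ordered by comment time, one text per corpus line."""
    ordered = sorted(
        comments,
        key=lambda c: datetime.strptime(c[0], "%Y-%m-%d %H:%M:%S"),
    )
    return [text.strip() for _, text in ordered if text.strip()]

corpus = build_corpus(raw_comments)
```

Each element of `corpus` corresponds to one line of the stored corpus and can be written out line by line for the later steps.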
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vectors corresponding to the segmentation results.
In the embodiment of the present invention, step 2 mainly comprises the following processing:
Step 2-1: preprocess the text content in the corpus. Preprocessing here refers specifically to removing stop words: for example, stop words such as "very" are removed from the text "like the clothes very much", leaving the comment content "like clothes", which reduces the indexing and computation required by the subsequent segmentation, calculation, and clustering;
Step 2-2: after stop words are removed, segment the text content in the corpus in the order in which it is stored; for example, segmentation produces the result "like", "clothes";
Step 2-3: train word vectors on the segmented words to obtain the word vectors corresponding to the segmentation results.
Optionally, in the segmentation of step 2-2, to meet the clustering needs of comment content, we segment with HanLP and train word vectors on the segmentation results with word2vec; the trained word vectors are used in the subsequent clustering based on the cosine similarity algorithm.
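Steps 2-1 and 2-2 can be illustrated with the small sketch below. It is not the HanLP or word2vec processing named in the embodiment: the stop-word list and the toy dictionary are assumptions, and a simple forward-maximum-matching segmenter stands in for HanLP so that the example is self-contained. Word-vector training (step 2-3) is omitted.

```python
STOP_WORDS = {"非常", "的"}              # illustrative stop-word list (assumption)
VOCAB = {"喜欢", "衣服", "裙子", "漂亮"}  # toy dictionary standing in for a real segmenter

def remove_stop_words(text, stop_words=STOP_WORDS):
    """Delete every stop-word occurrence from the raw text (step 2-1)."""
    for word in sorted(stop_words, key=len, reverse=True):
        text = text.replace(word, "")
    return text

def segment(text, vocab=VOCAB, max_len=4):
    """Forward maximum matching, a simple stand-in for HanLP segmentation (step 2-2)."""
    tokens, i = [], 0
    while i < len(text):
        # Try the longest window first; a single character always matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

cleaned = remove_stop_words("非常喜欢的衣服")
tokens = segment(cleaned)
```

Plain string replacement can damage words that merely contain a stop word, so a production pipeline would more likely filter stop words after segmentation; the order here simply follows the steps as described.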
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library.
In implementing the present invention, natural-language-based label extraction is performed on the data basis for clustering. Chinese dependency parsing explains syntactic structure through the dependency relations between the components of a linguistic unit. Its principle is that the core verb of a sentence is the central component that governs all other components and is itself governed by none, and every governed component is subordinate to its governor through some relation. Different extraction rules can be built on this basis, such as the five classes of extraction rules used in the embodiment (noun subject + adverbial, noun subject + adverbial + adverbial, adverbial + adverbial, adverbial + adjective, and adverbial), which extract candidate labels effectively and objectively from the kinds of content that occur most often in Chinese comments.
Of course, in other embodiments, other extraction rules or combinations of them can be selected for different clustering scenarios and needs.
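One way to picture rule-based extraction is as pattern matching over the grammatical roles assigned by a dependency parser. The sketch below assumes the parser output is already available as (word, role) pairs; the role names "nsubj", "adv", and "adj", the longest-match-first strategy, and the sample sentences are hypothetical choices made for illustration, not details given in the source.

```python
# The five rule classes from the embodiment, written over hypothetical role tags
# and ordered longest first so that longer patterns take priority.
RULES = [
    ("nsubj", "adv", "adv"),
    ("nsubj", "adv"),
    ("adv", "adv"),
    ("adv", "adj"),
    ("adv",),
]

def extract_candidates(parsed):
    """parsed: list of (word, role) pairs. Return the label strings matched by the rules."""
    labels, i = [], 0
    while i < len(parsed):
        for rule in RULES:
            window = parsed[i:i + len(rule)]
            if tuple(role for _, role in window) == rule:
                labels.append("".join(word for word, _ in window))
                i += len(rule)
                break
        else:
            i += 1  # no rule starts at this position
    return labels

sample = [("面料", "other"), ("很", "adv"), ("舒服", "adj")]
candidates = extract_candidates(sample)
```

Swapping in other rules, as the text allows, only requires editing the `RULES` list.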
Step 4: deduplicate the candidate label library to remove duplicate candidate labels.
In a preferred example, the candidate labels in the candidate label library are deduplicated with the simhash algorithm, removing labels whose content is essentially the same.
For example, two slightly differently worded labels that both mean "like the clothes" express essentially the same meaning. To simplify the subsequent unified clustering, one of them is removed and only one is kept, so that in later clustering, labels expressing essentially the same meaning are clustered under a single label. This improves the computational efficiency and objectivity of the clustering and avoids confusion and repeated clustering of near-synonyms.
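A minimal, standard-library sketch of simhash-based deduplication is given below. The source names only the simhash algorithm; the use of character bigrams as features, MD5 as the underlying hash, the 64-bit fingerprint size, and the Hamming-distance threshold of 3 are all assumptions chosen for illustration.

```python
import hashlib

def simhash(text, bits=64):
    """SimHash fingerprint over character bigrams (feature choice is an assumption)."""
    features = [text[i:i + 2] for i in range(max(len(text) - 1, 1))]
    weights = [0] * bits
    for feature in features:
        digest = hashlib.md5(feature.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")
        for bit in range(bits):
            weights[bit] += 1 if (value >> bit) & 1 else -1
    # A fingerprint bit is set when more features voted 1 than 0.
    return sum(1 << bit for bit in range(bits) if weights[bit] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

def deduplicate(labels, max_distance=3):
    """Keep a label only if its fingerprint is far from every label already kept."""
    kept, fingerprints = [], []
    for label in labels:
        fp = simhash(label)
        if all(hamming(fp, seen) > max_distance for seen in fingerprints):
            kept.append(label)
            fingerprints.append(fp)
    return kept

unique_labels = deduplicate(["喜欢衣服", "喜欢衣服"])
```

Raising `max_distance` merges more loosely similar labels, at the risk of discarding labels that are genuinely different.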
Step 5: apply sentiment-word filtering to the deduplicated candidate labels to remove invalid labels.
Although labels have been extracted, preprocessed, and deduplicated through the above steps, some comment content in practice still yields labels that make no substantive comment on the commented object, the "floral-print dress". A comment such as "I am not feeling well today", for example, says nothing about the product being reviewed; it is not a meaningful comment on the commented object and contains no sentiment word. We therefore want to filter out such comments before clustering, so that what is actually clustered is genuine evaluation of the product.
In a preferred embodiment, we filter using a combined sentiment lexicon.
Specifically, the sentiment-word filtering in step 5 comprises the following process:
Step 5-1: set up a combined sentiment lexicon;
Step 5-2: load the sentiment lexicon into a set; starting from the first candidate label, split the candidate label into words with the jieba segmentation algorithm and match every resulting word, one by one, for equality against the sentiment words in the lexicon; if a match succeeds, mark the candidate label as containing a sentiment word, otherwise mark it as containing none;
Step 5-3: if the candidate label is determined to contain a sentiment word, recombine the split words into the candidate label, look up the word vector of every segmented word of the label in the word-vector library of step 1, and calculate the average of those word vectors; if it contains no sentiment word, filter it out directly;
Step 5-4: apply the sentiment-word filtering of steps 5-2 and 5-3 to every candidate label; when processing is complete, generate the filtered candidate label library, whose data structure comprises the candidate label string and the candidate label string vector.
In particular, in step 5-1 we preferably combine multiple sentiment lexicons, so that the lexicon's coverage is broader and more comprehensive and a single, insufficient lexicon does not wrongly filter out comments that should be clustered.
For example, on top of the original sentiment vocabulary we added the Tsinghua University Li Jun Chinese commendatory and derogatory term dictionary and the Dalian University of Technology Chinese affective lexicon ontology (without auxiliary sentiment classification); under otherwise equal conditions, the combination of multiple sentiment vocabularies yields better labels.
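Steps 5-1 to 5-4 might be sketched as follows. The tiny lexicons, the two-dimensional word vectors, and the sample labels are toy data, and the candidate labels are assumed to be already split into words (the embodiment uses jieba for that split); none of these values come from the source.

```python
# Step 5-1: merge several lexicons into one set (toy contents, assumption).
LEXICON_A = {"喜欢", "漂亮"}
LEXICON_B = {"舒服", "失望"}
SENTIMENT_WORDS = LEXICON_A | LEXICON_B

# Toy word vectors standing in for the trained word-vector library.
WORD_VECTORS = {"喜欢": [1.0, 0.0], "衣服": [0.0, 1.0], "今天": [0.5, 0.5]}

def mean_vector(words, vectors):
    """Average the vectors of the words that have one; None if no word has a vector."""
    rows = [vectors[w] for w in words if w in vectors]
    if not rows:
        return None
    return [sum(column) / len(rows) for column in zip(*rows)]

def filter_by_sentiment(candidates, lexicon, vectors):
    """candidates: list of word lists. Keep labels that contain a sentiment word."""
    library = []
    for words in candidates:
        if not any(word in lexicon for word in words):  # step 5-2: exact match
            continue                                    # step 5-3: filter out
        vector = mean_vector(words, vectors)
        if vector is not None:
            # Step 5-4: store the label string together with its averaged vector.
            library.append({"label": "".join(words), "vector": vector})
    return library

library = filter_by_sentiment(
    [["喜欢", "衣服"], ["今天", "下雨"]], SENTIMENT_WORDS, WORD_VECTORS
)
```

The resulting list of label/vector records mirrors the candidate label library structure described in step 5-4.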
Step 6: perform a DBSCAN-based clustering operation on the candidate labels remaining after invalid labels are removed, obtain the magnitude of every candidate label, and sort the clustering results in descending order of magnitude.
In the embodiment of the present invention, the DBSCAN-based clustering operation comprises the following process:
Step 6-1: load the candidate labels, that is, obtain the candidate label library produced in step 5-4;
Step 6-2: input the candidate labels to the DBSCAN clustering algorithm for the clustering operation: starting from the first candidate label, select a candidate label, calculate its similarity to every other candidate label in the library with the cosine similarity algorithm, compare each similarity value with the preset similarity threshold, and determine the set of labels whose similarity exceeds the threshold;
Step 6-3: if the size of the label set exceeds the defined minimum number of neighbors, count the number of labels in the set as the magnitude of this label; otherwise, processing ends;
Step 6-4: process every candidate label in the library in turn according to steps 6-2 and 6-3 until all candidate labels have been clustered;
Step 6-5: based on all the labels obtained and their magnitudes, sort the clustering results in descending order of magnitude.
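The neighborhood counting of steps 6-2 to 6-5, followed by the Top N output of step 7, can be sketched as below. The sketch follows the steps as literally described and does not implement the cluster-expansion phase of full DBSCAN. The threshold of 0.8, the minimum neighbor count of 1, the choice to exclude a label from its own neighbor set, and the sample vectors are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_magnitudes(library, threshold=0.8, min_neighbors=1):
    """Return (label, magnitude) pairs sorted by magnitude, largest first."""
    results = []
    for i, item in enumerate(library):
        # Step 6-2: labels whose similarity to this label exceeds the threshold.
        neighbors = [
            other for j, other in enumerate(library)
            if j != i and cosine(item["vector"], other["vector"]) > threshold
        ]
        # Step 6-3: keep the label only if it has more than min_neighbors neighbors.
        if len(neighbors) > min_neighbors:
            results.append((item["label"], len(neighbors)))
    # Step 6-5: descending order of magnitude (ties keep their original order).
    return sorted(results, key=lambda pair: pair[1], reverse=True)

def top_n(results, n):
    """Step 7: output the first n entries of the sorted result."""
    return results[:n]

sample_library = [
    {"label": "A", "vector": [1.0, 0.0]},
    {"label": "B", "vector": [0.9, 0.1]},
    {"label": "C", "vector": [0.95, 0.05]},
    {"label": "D", "vector": [0.0, 1.0]},
]
ranked = label_magnitudes(sample_library)
```

Here labels A, B, and C are mutual neighbors and each receives a magnitude of 2, while D has no neighbor above the threshold and is left out of the ranking.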
In connection with the implementation described above, some embodiments also propose a computer program product comprising one or more non-transitory machine-readable media encoded with instructions that, when executed by one or more processors, cause a process to be carried out, the process performing unsupervised clustering of the obtained Chinese comment data and comprising the foregoing method, in particular the method shown in Fig. 1 and the processing described above with reference to Fig. 1.
It is worth noting that the processing shown in Fig. 1 and described above with reference to it, that is, the clustering method based on unsupervised learning, can be implemented on a local server, a local computer system, or a cloud server.
An implementation on a cloud server is described below as an example.
The server system according to the present disclosure comprises:
An interface configured to obtain comment data for at least one product or service;
At least one processor;
At least one memory configured to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing a process of unsupervised clustering of the obtained comment data, the aforementioned process comprising:
Step 1: organize the obtained comment data into a corpus, the corpus containing the comment content stored in order;
Step 2: preprocess the comment content in the corpus, then perform word segmentation and word-vector training to obtain the word vectors corresponding to the segmentation results;
Step 3: extract candidate labels using natural-language-based label extraction rules to form a candidate label library;
Step 4: deduplicate the candidate label library to remove duplicate candidate labels;
Step 5: apply sentiment-word filtering to the deduplicated candidate labels to remove invalid labels;
Step 6: perform a DBSCAN-based clustering operation on the candidate labels remaining after invalid labels are removed, obtain the magnitude of every candidate label, and sort the clustering results in descending order of magnitude;
Step 7: tally the magnitude of each cluster and output the Top N.
In a particularly preferred embodiment, the aforementioned process further comprises:
In step 5, sentiment-word filtering is applied to each candidate label to generate a filtered candidate label library, whose data structure comprises the candidate label string and the candidate label string vector;
Then, in step 6, the candidate labels are input to the DBSCAN clustering algorithm for the clustering operation: starting from the first candidate label, a candidate label is selected and its similarity to every other candidate label in the library is calculated with the cosine similarity algorithm; each similarity value is compared with a preset similarity threshold to determine the set of labels whose similarity exceeds the threshold; if the size of this label set exceeds the defined minimum number of neighbors, the number of labels in the set is counted as the magnitude of this label; otherwise, processing ends;
Then, the above clustering calculation is repeated until all candidate labels have been clustered;
Finally, based on all the labels obtained and their magnitudes, the clustering results are sorted in descending order of magnitude.
Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit it. Those of ordinary skill in the art to which the invention belongs may make various changes and modifications without departing from its spirit and scope. The scope of protection of the present invention is therefore defined by the claims.

Claims (10)

1. a kind of clustering method of Chinese comment unsupervised learning, which comprises the following steps:
Step 1 obtains the comment data for being directed to a product or service, and it includes to press in the corpus that arrangement, which obtains corpus, The comment content information of sequential storage;
Step 2 pre-processes the comment content information in corpus, and carries out participle and term vector training, is directed to The correspondence term vector of word segmentation result;
Step 3, tag extraction Rule Extraction candidate's label based on natural language form candidate tag library;
Step 4, disappeared to the candidate tag library is handled again, removes duplicate candidate label;
Step 5 offsets the candidate label progress emotion word filtering after weight, removes invalid tag;
Step 6 is signed the cluster operation based on DBSCAN to the candidate mark after removal invalid tag, obtains all candidate labels Magnitude, to cluster result according to quantity carry out descending arrangement;
Step 7, each cluster magnitude of statistics, export TopN.
2. the clustering method of Chinese comment unsupervised learning according to claim 1, which is characterized in that in the step 2 Pretreatment include removal stop words.
3. the clustering method of Chinese comment unsupervised learning according to claim 1, which is characterized in that in the step 2, It is segmented using hanLP, and word2vec training term vector is based on to word segmentation result.
4. the clustering method of Chinese comment unsupervised learning according to claim 1, which is characterized in that in the step 3 The label lot rule used includes: the noun subject+adverbial modifier, the noun subject+adverbial modifier+adverbial modifier, the adverbial modifier+adverbial modifier, the adverbial modifier+describe Word, 5 class decimation rule of the adverbial modifier obtain candidate label.
5. the clustering method of Chinese comment unsupervised learning according to claim 1, which is characterized in that in the step 4, To the candidate label in candidate tag library, the weight that disappears is carried out based on simhash algorithm, removes identical label on content.
6. The clustering method for unsupervised learning of Chinese comments according to claim 1, characterized in that the emotion word filtering in step 5 specifically comprises:
Step 5-1: setting up a combined emotion dictionary;
Step 5-2: loading the emotion dictionary into a set; starting from the first candidate label, splitting the candidate label into multiple words with the jieba word segmentation algorithm, and matching every resulting word against the emotion words in the emotion dictionary one by one for exact equality; if a match succeeds, the candidate label is marked as containing an emotion word, otherwise it is marked as not containing an emotion word;
Step 5-3: if the candidate label is determined to contain an emotion word, recombining the split words into the candidate label, looking up the word vector of each segmented word of the candidate label in the word vector library of step 1, and calculating the average of these word vectors; if it contains no emotion word, filtering the candidate label out directly;
Step 5-4: performing the emotion word filtering of steps 5-2 and 5-3 on every candidate label; when processing is complete, a filtered candidate tag library is generated, whose data structure comprises the candidate label string and the candidate label string vector.
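Steps 5-2 to 5-4 can be sketched as below. To keep the example self-contained, the segmenter is passed in as a callable (`segment`, a stand-in for the jieba segmentation named in the claim), and the emotion dictionary and word vectors are small hypothetical values. Skipping labels that have no known word vector is an assumption not stated in the claim.

```python
def filter_by_emotion(labels, segment, emotion_words, vectors):
    """Keep labels containing an emotion word and attach the mean of their word vectors."""
    library = []
    for label in labels:
        words = segment(label)
        if not any(w in emotion_words for w in words):
            continue  # step 5-3: no emotion word, so the label is filtered out
        known = [vectors[w] for w in words if w in vectors]
        if not known:
            continue  # assumption: drop labels with no known word vector
        mean = [sum(column) / len(known) for column in zip(*known)]
        library.append({"label": "".join(words), "vector": mean})
    return library

# Hypothetical inputs for illustration only.
segmentations = {"环境不错": ["环境", "不错"], "房间": ["房间"]}
emotion_words = {"不错"}
vectors = {"环境": [1.0, 0.0], "不错": [0.0, 1.0], "房间": [1.0, 1.0]}

library = filter_by_emotion(["环境不错", "房间"], segmentations.get, emotion_words, vectors)
```

Each entry of `library` pairs the label string with its averaged vector, mirroring the data structure described in step 5-4.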
7. The clustering method for unsupervised learning of Chinese comments according to claim 6, characterized in that the clustering operation in step 6 comprises the following steps:
Step 6-1: loading the candidate labels by obtaining the candidate tag library of step 5-4;
Step 6-2: inputting the candidate labels into the DBSCAN clustering algorithm for clustering; starting from the first candidate label, calculating the similarity between the selected candidate label and every other candidate label in the candidate tag library with the cosine similarity algorithm, comparing each similarity value with a preset similarity threshold, and determining the set of labels whose similarity is greater than the threshold;
Step 6-3: if the size of the label set is greater than the defined minimum number of neighbours, taking the number of labels in the set as the magnitude of this label; otherwise ending;
Step 6-4: processing all candidate labels in the candidate tag library in turn according to steps 6-2 and 6-3, until clustering of all candidate labels is finished;
Step 6-5: sorting the clustering result in descending order by count, according to the obtained labels and their magnitudes.
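Steps 6-2 to 6-5 can be sketched as a neighbour count over cosine similarities. Note that this follows the steps as worded in the claim, which correspond to the neighbourhood test of DBSCAN rather than to a full DBSCAN with cluster expansion. The threshold values, the sample vectors and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def label_magnitudes(library, threshold=0.8, min_neighbours=1):
    """Return (label, magnitude) pairs sorted by magnitude in descending order."""
    results = []
    for i, item in enumerate(library):
        # Step 6-2: labels whose similarity to this label exceeds the threshold.
        neighbours = [
            other for j, other in enumerate(library)
            if j != i and cosine(item["vector"], other["vector"]) > threshold
        ]
        # Step 6-3: keep the label only if it has more than min_neighbours neighbours.
        if len(neighbours) > min_neighbours:
            results.append((item["label"], len(neighbours)))
    # Step 6-5: descending order by magnitude.
    return sorted(results, key=lambda pair: pair[1], reverse=True)

library = [
    {"label": "A", "vector": [1.0, 0.0]},
    {"label": "B", "vector": [0.9, 0.1]},
    {"label": "C", "vector": [0.95, 0.05]},
    {"label": "D", "vector": [0.0, 1.0]},
]
ranked = label_magnitudes(library)
```

In this toy library, A, B and C are mutually similar and each receives a magnitude of 2, while D has no neighbour above the threshold and is dropped.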
8. A computer program product comprising one or more non-transitory machine-readable media encoded with instructions which, when executed by one or more processors, cause a process to be carried out, the process performing unsupervised clustering of acquired Chinese comment data and comprising the process of the method according to any one of the preceding claims 1-7.
9. A server system, characterized by comprising:
An interface arranged to obtain comment data for at least one product or service;
At least one processor;
At least one memory arranged to store encoded instructions executable by the at least one processor, the instructions, when executed by the at least one processor, implementing an unsupervised clustering process on the acquired comment data, the process comprising:
Step 1: collating the acquired comment data to obtain a corpus, the corpus containing the comment content information stored in order;
Step 2: preprocessing the comment content information in the corpus, and performing word segmentation and word vector training to obtain the word vectors corresponding to the segmentation result;
Step 3: extracting candidate labels based on natural-language label extraction rules to form a candidate tag library;
Step 4: deduplicating the candidate tag library to remove duplicate candidate labels;
Step 5: performing emotion word filtering on the deduplicated candidate labels to remove invalid labels;
Step 6: performing a DBSCAN-based clustering operation on the candidate labels remaining after removal of invalid labels to obtain the magnitude of every candidate label, and sorting the clustering result in descending order by count;
Step 7: counting the magnitude of each cluster and outputting the TopN.
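The TopN output of step 7 can be sketched with the standard library. The magnitudes shown are made-up values, and the function name `top_n` is an assumption.

```python
import heapq

def top_n(magnitudes, n):
    """Return the n (label, magnitude) pairs with the largest magnitudes, largest first."""
    return heapq.nlargest(n, magnitudes.items(), key=lambda pair: pair[1])

# Hypothetical cluster magnitudes for illustration only.
magnitudes = {"环境不错": 12, "服务很好": 30, "价格偏高": 7}
best = top_n(magnitudes, 2)
```

Using `heapq.nlargest` avoids sorting the whole result when only a few top labels are needed; for small label sets a plain sort followed by a slice would work equally well.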
10. The server system according to claim 9, characterized in that the process further comprises:
In step 5, performing emotion word filtering on each candidate label to generate a filtered candidate tag library, whose data structure comprises the candidate label string and the candidate label string vector;
Then, in step 6, inputting the candidate labels into the DBSCAN clustering algorithm for clustering: starting from the first candidate label, calculating the similarity between the selected candidate label and every other candidate label in the candidate tag library with the cosine similarity algorithm, comparing each similarity value with a preset similarity threshold, and determining the set of labels whose similarity is greater than the threshold; then, if the size of the label set is greater than the defined minimum number of neighbours, taking the number of labels in the set as the magnitude of this label, and otherwise ending;
Then, repeating the above clustering calculation until clustering of all candidate labels is finished;
Finally, sorting the clustering result in descending order by count, according to the obtained labels and their magnitudes.
CN201910163711.3A 2019-03-05 2019-03-05 Clustering method, computer program product and the server system of Chinese comment unsupervised learning Pending CN109871447A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910163711.3A CN109871447A (en) 2019-03-05 2019-03-05 Clustering method, computer program product and the server system of Chinese comment unsupervised learning

Publications (1)

Publication Number Publication Date
CN109871447A (en) 2019-06-11

Family

ID=66919802

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910163711.3A Pending CN109871447A (en) 2019-03-05 2019-03-05 Clustering method, computer program product and the server system of Chinese comment unsupervised learning

Country Status (1)

Country Link
CN (1) CN109871447A (en)

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104462132A (en) * 2013-09-23 2015-03-25 华为技术有限公司 Comment information display method and device
WO2016159453A1 (en) * 2015-03-27 2016-10-06 주식회사 비주얼다이브 Method for providing social activity integration service
CN105550269A (en) * 2015-12-10 2016-05-04 复旦大学 Product comment analyzing method and system with learning supervising function
CN107633007A (en) * 2017-08-09 2018-01-26 五邑大学 A kind of comment on commodity data label system and method based on stratification AP clusters
CN108009228A (en) * 2017-11-27 2018-05-08 咪咕互动娱乐有限公司 A kind of method to set up of content tab, device and storage medium
CN108363725A (en) * 2018-01-08 2018-08-03 浙江大学 A kind of method of the extraction of user comment viewpoint and the generation of viewpoint label
CN109255027A (en) * 2018-08-27 2019-01-22 上海宝尊电子商务有限公司 A kind of method and apparatus of electric business comment sentiment analysis noise reduction

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SHIJING888: "CommentsMining", 《HTTPS://GITHUB.COM/SHIJING888/COMMENTSMINING》 *
李丕绩;马军;张冬梅;韩晓辉: "用户评论中的标签抽取以及排序", 《中文信息学报》 *
火贪三刀: "用户评论标签的抽取", 《HTTPS://BLOG.CSDN.NET/SHIJING_0214/ARTICLE/DETAILS/71036808/》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750646A (en) * 2019-10-16 2020-02-04 乐山师范学院 Attribute description extracting method for hotel comment text
CN110928981A (en) * 2019-11-18 2020-03-27 佰聆数据股份有限公司 Method, system and storage medium for establishing and perfecting iteration of text label system
CN112184323A (en) * 2020-10-13 2021-01-05 上海风秩科技有限公司 Evaluation label generation method and device, storage medium and electronic equipment
CN112148881A (en) * 2020-10-22 2020-12-29 北京百度网讯科技有限公司 Method and apparatus for outputting information
CN112148881B (en) * 2020-10-22 2023-09-22 北京百度网讯科技有限公司 Method and device for outputting information
CN112579738A (en) * 2020-12-23 2021-03-30 广州博冠信息科技有限公司 Target object label processing method, device, equipment and storage medium
CN112818660A (en) * 2021-01-26 2021-05-18 山西三友和智慧信息技术股份有限公司 Product description generation method based on user evaluation

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20190611
