CN104462480A - Typicality-based big comment data mining method - Google Patents


Info

Publication number
CN104462480A
Authority
CN
China
Prior art keywords
comment
similarity
concept
vector
norm
Prior art date
Legal status
Granted
Application number
CN201410796566.XA
Other languages
Chinese (zh)
Other versions
CN104462480B (en)
Inventor
刘耀强
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN201410796566.XA
Publication of CN104462480A
Application granted
Publication of CN104462480B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a typicality-based big comment data mining method. The method comprises the steps of: (1) comment typicality mining modeling, in which comment typicality calculation and the minimum representative comment set mining problem are modeled and formally defined; (2) automatic construction of typicality comment prototypes; (3) minimum comment set mining, in which a minimum comment set mining algorithm screens out one minimum comment set; (4) BigSimDet parallel computation, in which the computing nodes of a distributed cluster are called to process similar-comment detection tasks in parallel. Drawing on the two viewpoints of cognitive psychology and opinion mining, the method studies how to measure the typicality of user comments and, on that basis, mines a representative minimum comment set, thereby helping potential customers understand a given commodity more comprehensively and from multiple angles, helping users screen out the commodities they need more accurately, and improving the user's purchase experience.

Description

Typicality-based big comment data mining method
Technical field
The present invention relates to the field of data mining research, and in particular to a typicality-based big comment data mining method.
Background technology
With the rapid development of the Internet in China, the comments scattered across e-commerce websites, social networks and online forums have grown explosively. These petabytes (PB) of big comment data (Big Data) reveal users' personal views on a wide range of subjects such as consumer products, organizations, people and social events. Such commodity comments not only let enterprises understand the real needs of the customers they care about, existing and potential, but also provide useful guidance for consumers' shopping decisions. According to CNNIC data from 2014, more than 90% of online shoppers leave comments under the commodities on shopping websites, and more than half of online shoppers read the relevant commodity comments before each purchase. For example, Ctrip provides a platform on which customers can publish comments on the hotels they have stayed in; the hotel comments published through this platform not only give other customers a reference for choosing a suitable hotel, but also let hotel managers steadily improve their service according to the online feedback and thereby attract more customers at home and abroad. In addition, analyzing these online comments can help government departments understand public sentiment in different regions more quickly and broadly, and learn the public's views and opinions on government policies and social development. In general, from the user's perspective, comments help a user understand a commodity more comprehensively and from multiple angles, and thus decide whether to buy it; they also let users learn which commodities can meet their needs. From the enterprise's perspective, manufacturers and service providers need to know users' views of their products, namely which aspects are strengths and which are weaknesses from the angle of user experience; this helps producers obtain more and fuller user feedback and thus improve their goods and services. In summary, online comments contain abundant valuable information and deserve deep mining and analysis.
Although online comments are highly meaningful and useful to enterprises, regulators and commodity users, in the big data era it is almost impossible to browse and analyze the huge volume of online comments manually, and traditional comment mining methods struggle to analyze and summarize big comment data in real time, so the resulting analyses are unsatisfactory. Against this big data background, building an intelligent online opinion mining system has high research and application value. For example, by mining a minimum representative comment set from big comment data, a system can let its users quickly grasp the different viewpoints in the comments and thus monitor market trends or regional public sentiment quickly and effectively.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a typicality-based big comment data mining method. The method designs a comment typicality calculation using the basic level concept (Basic Level Concept) of cognitive psychology and multi-prototype theory, mines a representative minimum comment set on this basis, and uses the Hadoop platform to process the big comment data mining in parallel.
To achieve the above object, the present invention adopts the following technical solution:
A typicality-based big comment data mining method comprises the steps of:
(1) comment typicality mining modeling: modeling and formally defining comment typicality calculation and the minimum representative comment set mining problem;
(2) automatic construction of typicality comment prototypes: designing the comment typicality calculation method based on the basic level concept of cognitive psychology and multi-prototype theory, and guiding the generation of comment prototypes with the category utility from basic level concept theory;
(3) minimum comment set mining: applying the minimum comment set mining algorithm to screen out one minimum comment set with the following features: every comment in the set is distinct and represents the viewpoint of a considerable proportion of users, and together the comments in this minimum set cover and represent the viewpoints of all comments on the commodity, so that a user only needs to browse the comments in this minimum set to understand all user viewpoints on the commodity;
(4) BigSimDet parallel computation: calling the computing nodes of a distributed cluster to process similar-comment detection tasks in parallel.
Preferably, in step (1), the concrete steps of comment typicality mining modeling are:
(1-1) all comments on a commodity x are regarded as one "concept", the "concept" being the comments on commodity x, and each comment is one "instance" of this "concept"; every comment then has a different typicality within the "concept". In addition, a minimum representative comment set is extracted from all comments on commodity x; this comment set has the following two properties:
(1-1-1) the n comments contained in the set represent the different types of user viewpoints to the greatest possible extent;
(1-1-2) the number of comments n in the set is as small as possible, so that a user only needs to browse a small number n of comments to understand, fairly comprehensively, all viewpoints and suggestions about commodity x;
(1-2) each commodity comment is formally represented by its "aspects":

$$\vec{p}_a = (s_{a,1}:v_{a,1},\ s_{a,2}:v_{a,2},\ \ldots,\ s_{a,k}:v_{a,k})$$

where $s_{a,i}$ is an "aspect" belonging to commodity a, and $v_{a,i}$ is the sentiment polarity value of $s_{a,i}$ in the comment, i.e. the sentiment orientation value of that aspect;
(1-3) the comment typicality calculation problem can be regarded as the following function:

$$\chi: R_i \to T_i$$

where $R_i$ is the set of comments belonging to commodity i, and $T_i$ is the comment set sorted by comment typicality;
For the minimum representative comment set mining problem, according to multi-prototype theory, the commodity comments are first clustered, and then one comment prototype is extracted from each comment cluster to represent that class of comments; therefore, all comments on commodity x can be represented by n comment prototypes, that is:

$$\vec{t}_c = (\vec{t}_{c,1},\ \vec{t}_{c,2},\ \ldots,\ \vec{t}_{c,n})$$

where $\vec{t}_{c,j}$ is a comment prototype, which can be expressed as:

$$\vec{t}_{c,j} = (s_{j,1}^c:v_{j,1}^c,\ s_{j,2}^c:v_{j,2}^c,\ \ldots,\ s_{j,m}^c:v_{j,m}^c)$$

(1-4) the minimum representative comment set mining problem can be expressed as a function:

$$\theta: R_i \to L_i$$

where $R_i$ is the set of comments belonging to commodity i, and $L_i$ is the minimum representative comment set of commodity i.
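The formal model of step (1) can be sketched in Python. The toy comments (aspect-to-sentiment mappings) and the typicality values are invented for illustration only, since the real typicality $\tau_c$ is defined in step (2); $\chi$ is simply an ordering of $R_i$ by that score.

```python
# Sketch of the formal model: a comment is an "aspect: sentiment orientation
# value" mapping, and chi: R_i -> T_i orders the comment set by typicality.

def chi(comment_indices, typicality_of):
    """Return comment indices sorted by comment typicality (descending)."""
    return sorted(comment_indices, key=lambda i: typicality_of[i], reverse=True)

# Toy comments on one commodity: aspect -> sentiment orientation v in [-1, 1].
comments = [
    {"battery": 0.9, "screen": 0.7},
    {"battery": -0.8, "price": 0.5},
    {"screen": 0.6, "price": 0.4},
]
# Hypothetical typicality values standing in for the real tau_c calculation.
scores = {0: 0.8, 1: 0.3, 2: 0.6}

ranked = chi(range(len(comments)), scores)
print(comments[ranked[0]])   # → {'battery': 0.9, 'screen': 0.7}
```

A minimum representative comment set $L_i$ would then be drawn from the top of this ordering, one comment per prototype cluster.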
Preferably, in step (2), the concrete method for automatically constructing typicality comment prototypes is:
(2-1) in multi-prototype theory, a concept can be represented by several abstract objects; these representative objects are prototypes, and each prototype represents one group of similar objects and is an abstract representation of those objects;
(2-2) according to cognitive psychology research, the typicality of an object is determined by two factors: Central Tendency and Frequency of Instantiation. The former is the similarity between the object and the members of the concept: if an object is very similar to the other object instances in a concept and very dissimilar to the concepts other than it, the typicality of the object in that concept is high. The latter is the frequency with which a person encounters an object and classifies it into a concept;
(2-3) the Central Tendency of an object within a concept is determined by two aspects: the similarity between the object and the other objects in the concept, i.e. internal similarity, and the dissimilarity between the object and the objects of other concepts, i.e. external dissimilarity. The internal similarity can be expressed as:

$$\beta(\vec{p}_a, \vec{t}_c) = sim(\vec{p}_a, \vec{t}_{c,s})$$

where $\vec{p}_a$ represents an object a, and $\vec{t}_{c,s}$ represents the prototype s of concept c that is most similar to a;
(2-4) different similarity functions can be adopted to measure the similarity between a prototype and an object. The external dissimilarity of an object with respect to a concept is regarded as the aggregated dissimilarity between the object and the other concepts, computed by the following formula:

$$\delta(\vec{p}_a, \vec{t}_c) = \frac{\sum_x dissimilar(\vec{p}_a, \vec{t}_{x,s})}{N_\Delta - 1}, \quad x \in C \ \text{and}\ x \neq c$$
(2-5) internal similarity and external dissimilarity are combined by an aggregation function to obtain the Central Tendency of object a in concept c;
The objects belonging to a concept are clustered, so that multiple prototype representations of the concept are obtained. For the second factor affecting object typicality, Frequency of Instantiation, a prototype significance vector is defined to represent the Frequency of Instantiation of the different prototypes in a concept, as follows:

$$\vec{w}_c = (w_{c,1},\ w_{c,2},\ \ldots,\ w_{c,n}), \quad 0 < w_{c,i} \le 1$$

Each value in this vector is the percentage of the number of objects contained in the cluster corresponding to a prototype relative to the total number of objects belonging to the concept;
After obtaining the Central Tendency and the Frequency of Instantiation of an object in a concept, the two are combined according to an aggregation function; this process is formalized as follows:

$$\tau_c(a) = \Phi(w_{c,s},\ \alpha(\vec{p}_a, \vec{t}_c));$$
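The computation of $\tau_c(a)$ can be sketched as follows. Cosine similarity stands in for $sim()$, $dissimilar()$ is taken as $1 - sim()$, and the aggregation functions (an equal-weight sum for $\alpha$, a product for $\Phi$) are illustrative assumptions, not the aggregations fixed by the invention; the toy prototypes and weights are invented.

```python
# Sketch: Central Tendency = alpha(internal similarity, external dissimilarity),
# then tau_c(a) = Phi(prototype weight w_{c,s}, Central Tendency).
import math

def cos_sim(p, t):
    dot = sum(p[k] * t[k] for k in p if k in t)
    np_ = math.sqrt(sum(v * v for v in p.values()))
    nt = math.sqrt(sum(v * v for v in t.values()))
    return dot / (np_ * nt) if np_ and nt else 0.0

def central_tendency(p, prototypes_c, prototypes_other):
    beta = max(cos_sim(p, t) for t in prototypes_c)   # internal similarity
    if prototypes_other:                              # external dissimilarity
        delta = sum(1 - max(cos_sim(p, t) for t in ts)
                    for ts in prototypes_other) / len(prototypes_other)
    else:
        delta = 0.0
    return 0.5 * beta + 0.5 * delta                   # alpha: equal weights

def typicality(p, prototypes_c, weights_c, prototypes_other):
    # Phi: weight of the most similar prototype times the Central Tendency.
    s = max(range(len(prototypes_c)), key=lambda i: cos_sim(p, prototypes_c[i]))
    return weights_c[s] * central_tendency(p, prototypes_c, prototypes_other)

protos_c = [{"battery": 1.0}, {"screen": 1.0}]   # prototypes of concept c
w_c = [0.7, 0.3]                                 # prototype significance vector
protos_other = [[{"price": 1.0}]]                # prototypes of the other concept
print(round(typicality({"battery": 0.9}, protos_c, w_c, protos_other), 3))  # → 0.7
```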
(2-6) in the process of constructing object prototypes, the objects must be clustered, and the multiple prototypes of a concept are obtained by an automatic object clustering method. The "category utility" of cognitive psychology is adopted as the benchmark for designing an automatic concept multi-prototype generation algorithm, so that the generated prototypes all lie at the basic level of the concept; the basic level is a special division of concepts that people form in the process of cognition. According to cognitive psychology theory, the category utility on comments is as follows:

$$cu(C, F) = \frac{1}{m}\sum_{k=1}^{m} p(c_k)\left[\frac{\sum_{i=1}^{n_k} w_i\, p(t_i \mid c_k)^2}{n_k} - \frac{\sum_{i=1}^{n} w_i\, p(t_i)^2}{n}\right]$$

where C is the concept set, F is the aspect set, $p(t_i \mid c_k)$ is the probability that a concept $c_k$ has aspect $t_i$, $p(c_k)$ is the probability that an object belongs to a concept, and $p(t_i)$ is the probability that an object has aspect $t_i$;
(2-7) the basic level concept mining algorithm is adopted to automatically mine the basic sub-concepts of a concept. Using this algorithm, all comments on a commodity are automatically clustered into several classes according to the change in category utility; the partition with the maximum category utility among the clustering results is finally chosen, and each class of this partition lies at the Basic Level.
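The category utility formula can be sketched in Python under simplifying assumptions: comments are reduced to aspect sets, all aspect weights $w_i$ are set to 1, and the normalizations by $n_k$ and $n$ are both read as division by the number of aspects; these are one reading of the formula, not the invention's exact definition.

```python
# Sketch of cu(C, F): partition is a list of clusters, each cluster a list of
# comments, each comment a set of aspects.

def category_utility(partition, aspects):
    n_total = sum(len(cluster) for cluster in partition)
    # p(t_i): probability that a comment has aspect t_i, over all comments.
    p_t = {t: sum(t in r for c in partition for r in c) / n_total for t in aspects}
    base = sum(p_t[t] ** 2 for t in aspects) / len(aspects)
    cu = 0.0
    for cluster in partition:
        p_c = len(cluster) / n_total           # p(c_k)
        # p(t_i | c_k): probability that a comment in this cluster has aspect t_i.
        within = sum((sum(t in r for r in cluster) / len(cluster)) ** 2
                     for t in aspects) / len(aspects)
        cu += p_c * (within - base)
    return cu / len(partition)                 # the 1/m factor

part = [[{"battery"}, {"battery"}], [{"screen"}]]
print(round(category_utility(part, ["battery", "screen"]), 3))   # → 0.111
```

A clustering loop would compare this score across candidate partitions and keep the one with maximum category utility, as step (2-7) describes.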
Preferably, in step (3), the concrete steps of minimum comment set mining are:
(3-1) similar comment detection is performed, and similar comments are removed from the set. The similar comment detection problem can be formalized as follows: the product comments are represented by a set of vectors $D = \{d_1, d_2, \ldots, d_{|D|}\}$, where each vector $d_i = \langle w_{i,1}, w_{i,2}, \ldots, w_{i,|L|}\rangle$ contains $|L|$ term weights, L being the lexicon of the product comment corpus, i.e. the set of distinct terms, and each vector is normalized to unit length. If the cosine similarity of a pair of vectors is greater than or equal to a threshold $\mu_{SIM}$, the pair is considered a pair of similar comments. Given normalized vectors, the cosine similarity $sim(d_i, d_j)$ of a pair of vectors $(d_i, d_j)$ is the dot product of the vectors:

$$sim(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$

(3-2) Definition 1 (similar comment detection problem): given the set D of text vectors representing the comments on a product and a similarity threshold $\mu_{SIM}$, the similar comment detection problem is to find all comment pairs $d_i, d_j \in D$ whose similarity satisfies $sim(d_i, d_j) \ge \mu_{SIM}$;
(3-3) the given similarity threshold is used to reduce the number of candidate pairs generated in the candidate generation phase: if the maximum term weight of a document vector is too small, its cosine similarity with any vector will be below the given threshold, so the similarity computation need not be invoked for it. Given two document vectors $d_i, d_j \in D$, let $\|d_i\|_1$ denote the 1-norm of a vector; in addition, the maximum term weight of a vector $d_j$ is $w_j^{max} = \max(\{w_{j,1}, w_{j,2}, \ldots, w_{j,|L|}\})$, also called the ∞-norm $\|d_j\|_\infty$ of the vector. Then, from $sim(d_i, d_j)$, the following inequality can be obtained:

$$sim(d_i, d_j) \le \min(\|d_i\|_1 \times \|d_j\|_\infty,\ \|d_j\|_1 \times \|d_i\|_\infty)$$

If $\|d_i\|_1 \times \|d_j\|_\infty < \mu_{SIM}$, then certainly $sim(d_i, d_j) < \mu_{SIM}$, i.e. $d_i$ and $d_j$ are not similar comments.
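The pruning inequality of step (3-3) can be demonstrated with a minimal sketch on invented vectors: `maybe_similar` applies the norm bound so that a pair is scored only when the bound reaches the threshold. Note the bound is necessary but not sufficient; a pair can pass the bound and still be dissimilar.

```python
# Sketch of the norm-based candidate pruning: for unit-length vectors, the dot
# product is bounded by min(||d_i||_1 * ||d_j||_inf, ||d_j||_1 * ||d_i||_inf).
import math

def normalize(d):
    n = math.sqrt(sum(v * v for v in d.values()))
    return {t: v / n for t, v in d.items()}

def norms(d):
    return sum(d.values()), max(d.values())   # (1-norm, inf-norm)

def maybe_similar(di, dj, mu_sim):
    n1_i, ninf_i = norms(di)
    n1_j, ninf_j = norms(dj)
    return min(n1_i * ninf_j, n1_j * ninf_i) >= mu_sim

def cosine(di, dj):
    return sum(w * dj.get(t, 0.0) for t, w in di.items())

d1 = normalize({"battery": 1.0, "screen": 1.0})
d2 = normalize({"price": 1.0, "shipping": 1.0})   # passes the bound, cosine 0
d3 = normalize({t: 1.0 for t in "abcdefghijklmnop"})   # flat vector, inf-norm 0.25
print(maybe_similar(d1, d3, 0.9))   # → False
```

Here `d3` is pruned without computing its dot product with `d1`, while `d1` versus `d2` still requires the full similarity computation.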
Preferably, in step (4), the concrete steps of the BigSimDet parallel computing method are:
(4-1) the 1-norm $\|d_i\|_1$ and ∞-norm $\|d_i\|_\infty$ of every document vector $d_i \in D$ are computed, and the vectors are distributed into groups according to the group size parameter $\tau_G$ of the MapReduce parallel computing model; an ordered sequence of groups $G = (g_1, g_2, \ldots)$ is built, and each vector group $g_i \in G$ is sorted by 1-norm value;
(4-2) for each pair of vector groups $g_i, g_j \in G$, the non-similarity relation is determined by evaluating whether the product of the maximum 1-norm value of $g_i$ and the maximum ∞-norm value of $g_j$ exceeds the threshold $\mu_{SIM}$;
(4-3) for each group $g_i \in G$, an initial MapReduce partition computes the similar pairs between $g_i$ and the other vector groups; in this process, the non-similarity relations and the partition size limit $\tau_{maxG}$ must also be taken into account;
(4-4) a MapReduce task is executed according to the partition size limit $\tau_{maxG}$: if a vector group $g_i \in G$ has multiple potentially similar vector groups, $g_i$ is used as a seed, and additional partitions are formed according to the partition size limit $\tau_{maxG}$;
(4-5) for each partition, a parallel MapReduce processing task is started to detect similar comments.
Preferably, in step (4-1), the concrete implementation steps of the MapReduce parallel computing model are:
A MapReduce job is invoked to compute the 1-norm and ∞-norm values of every document vector. All vectors are then arranged in ascending order of their 1-norm values and placed into the corresponding groups according to the predefined group size $\tau_G$; the maximum group size $\tau_G$ is one of the system parameters. The whole document collection D is divided evenly into several input partitions, and each Mapper processes the document vectors of its own partition in parallel. The Mappers generate intermediate key-value pairs, namely the 1-norm value and the document ID, which are input to the Reducers after shuffling. By using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key value; the first Reducer processes one range of keys, the data of the next range are input to the second Reducer, and so on.
The concrete implementation of the MapReduce job is as follows: each Mapper reads every input key-value pair, i.e. a document ID and a document vector, from its input partition; the Mapper then produces the corresponding document vector composed of the normalized term weights, and computes the 1-norm $\|d_i\|_1$ and ∞-norm $\|d_i\|_\infty$ values; finally, it outputs intermediate key-value pairs, namely the 1-norm value together with a tuple composed of the document ID and the ∞-norm. Each Reducer takes the sorted intermediate key-value pairs as input, sorts the documents by 1-norm value according to the group size parameter $\tau_G$, and finally outputs key-value pairs; each output key-value pair comprises a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
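The norm-computation job above can be simulated in plain Python without a Hadoop cluster. The `mapper`/`reducer` function names and the in-memory shuffle are illustrative stand-ins for the actual MapReduce runtime, and the TotalOrderPartitioner behavior is reduced to a single sort.

```python
# Simulation of the step (4-1) job: the "Mapper" emits (1-norm, (doc_id,
# inf-norm)) pairs, the shuffle sorts them by key, and the "Reducer" packs
# runs of tau_G documents into ordered groups.

def mapper(doc_id, vec):
    one_norm = sum(vec.values())
    inf_norm = max(vec.values())
    return (one_norm, (doc_id, inf_norm))

def reducer(sorted_pairs, tau_g):
    groups = []
    for i in range(0, len(sorted_pairs), tau_g):
        chunk = sorted_pairs[i:i + tau_g]
        groups.append((len(groups),   # group ID
                       [(d, n1, ninf) for n1, (d, ninf) in chunk]))
    return groups

docs = {"r1": {"a": 0.6, "b": 0.8}, "r2": {"a": 1.0}, "r3": {"b": 0.5, "c": 0.5}}
intermediate = sorted(mapper(d, v) for d, v in docs.items())   # shuffle phase
groups = reducer(intermediate, tau_g=2)
print([gid for gid, _ in groups])   # → [0, 1]
```

Each emitted group carries the ordered document ID list with its 1-norm and ∞-norm values, which is exactly what the group-level pruning in step (4-2) consumes.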
Preferably, step (4-2) is specifically:
When one group is found, by weighing 1-norms and ∞-norms, to be dissimilar to another group, other dissimilar document groups can be determined at the same time. For each group $g_i$, the document with the maximum 1-norm value and the document with the maximum ∞-norm value are recorded. When the dissimilarity relation between two groups is determined, the maximum ∞-norm value of document group $g_j$ can be applied against the maximum 1-norm value of another group $g_i$: if the product falls below the threshold, the documents in the two groups cannot be similar. Because the preceding MapReduce job has produced an ordered queue of all document groups, and the documents within each group are also arranged in ascending order of 1-norm value, the groups ranked lower than $g_i$ (such as $g_1$ and $g_2$) can also be marked as dissimilar to $g_j$.
Preferably, step (4-3) is specifically:
For each group $g_i \in G$, the document partitions are initialized by merging potentially similar document groups in parallel while excluding dissimilar document groups; excluding dissimilar document groups greatly saves computation time in the similar comment detection phase. However, the document partition sizes after initialization may be very uneven, because some document groups may contain many potentially similar document groups while others have only a few. Unbalanced document partitions cause the execution performance of the parallel similar comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the computing node whose similar comment detection task runs the longest.
Preferably, in step (4-4):
A smoothing task is invoked to split over-large document partitions into several smaller partitions according to the partition size limit $\tau_{maxG}$. The value of $\tau_{maxG}$ is determined by the average local memory capacity of the nodes of the computing cluster: if a partition exceeds the local memory capacity of a computing node, document groups must be retrieved from remote nodes, and every remote disk access consumes much more time than a local disk access, so the total execution time may increase greatly. For each group $g_i \in G$ whose initial document partition exceeds the limit $\tau_{maxG}$, the group $g_i$ is used as a seed and merged with other potentially similar document groups such that no generated partition exceeds the limit; finally, the similar comment detection tasks can be distributed evenly among all the computing nodes in the cluster.
Preferably, in step (4-5):
The similar comment detection tasks are processed in parallel by calling the computing nodes of the distributed cluster. For each document partition, one parallel MapReduce similar comment detection task is started. Before the similarity computation, an inverted index job is run on each local partition, and the similar comment job is run after the indexing. In the Map and Reduce functions that compute document vector similarity within each partition, each Mapper accepts local key-value pairs whose key is a term t and whose value is the list of document IDs and corresponding normalized term weights. The Map function is invoked such that each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one term in the inverted index. For each document vector $d_i$, an associative array H keeps the similarity scores between $d_i$ and the potentially similar documents. Finally, the Map function outputs the document ID and the associative array H, which minimizes the number of intermediate keys generated, and these are shuffled. The Reduce function accepts the generated intermediate key-value pairs and sums the partial similarity scores stored in the associative arrays; finally, the Reduce function outputs key-value pairs comprising the document IDs in the local partition and the final document similarity scores.
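The per-partition inverted-index job of step (4-5) can be sketched as follows. A plain dictionary plays the role of the associative array H, the documents are invented toy data, and the whole computation runs in one process rather than across Mappers and Reducers.

```python
# Sketch: a local inverted index maps each term to (doc_id, weight) postings,
# and an accumulator gathers partial dot products so that only pairs sharing
# at least one indexed term are ever scored.
from collections import defaultdict

def similar_pairs(docs, mu_sim):
    index = defaultdict(list)            # term -> [(doc_id, weight)]
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    h = defaultdict(float)               # the associative array H
    for postings in index.values():
        for i in range(len(postings)):
            for j in range(i + 1, len(postings)):
                (a, wa), (b, wb) = postings[i], postings[j]
                h[(min(a, b), max(a, b))] += wa * wb   # each pair keyed once
    return {pair: score for pair, score in h.items() if score >= mu_sim}

docs = {"r1": {"battery": 0.8, "screen": 0.6},
        "r2": {"battery": 0.8, "screen": 0.6},
        "r3": {"price": 1.0}}
print(sorted(similar_pairs(docs, 0.9)))   # → [('r1', 'r2')]
```

Pairs with no shared term, such as (r1, r3), never enter the accumulator at all, which is the saving the inverted index provides.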
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The present invention is an interdisciplinary research method: it applies the multi-prototype concept representation theory and the Basic Level Concept theory of cognitive psychology to measure the typicality of a comment, and discloses a new comment typicality calculation method. In addition, the category utility proposed by cognitive psychologists is used as the clustering objective to assist the generation of comment prototypes, so that the prototypes produced by clustering, and the comment typicality computed from them, are closer to people's true cognition.
(2) Unlike existing opinion mining technology, which concentrates mainly on sentiment analysis and opinion summarization of comments, the present invention studies the new problem of mining a minimum representative comment set from big comment data. By mining a minimum representative comment set, users can easily and comprehensively grasp the overall picture of all comments and the diversity of viewpoints without having to browse the big comment data; this fills a research gap on the problem and strengthens, to a greater extent, the reference value of meaningful comments in big comment data for users.
(3) The research results disclosed by the invention can help potential customers understand a commodity more comprehensively and from multiple angles, help users screen out the commodities they need more accurately, and improve the user's purchase experience. From the perspective of manufacturers and service providers, they can more fully understand users' views of their products, namely which aspects are strengths and which are weaknesses from the angle of user experience; this helps producers obtain more and fuller user feedback, so as to improve their commodities and promote sales. Furthermore, the disclosed method can also help government departments understand public sentiment in different regions more quickly, broadly and comprehensively, and learn the public's typical views and the various representative views on government policies.
(4) The method disclosed by the invention targets big comment data on the Internet; by parallelizing the disclosed comment typicality calculation method and minimum representative comment mining algorithm under Hadoop and MapReduce, it realizes them in a distributed manner so as to cope with applications in a big data environment.
Accompanying drawing explanation
Fig. 1 is the processing flow chart of the mining method of the present invention;
Fig. 2 shows the comment vector sorting and grouping design;
Fig. 3 shows the document vector group dissimilarity calculation.
Embodiment
The present invention is described in further detail below with reference to the embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, the typicality-based big comment data mining method comprises the steps of:
(1) comment typicality mining modeling: modeling and formally defining comment typicality calculation and the minimum representative comment set mining problem;
(2) automatic construction of typicality comment prototypes: designing the comment typicality calculation method based on the basic level concept of cognitive psychology and multi-prototype theory, and guiding the generation of comment prototypes with the category utility from basic level concept theory;
(3) minimum comment set mining: applying the minimum comment set mining algorithm to screen out one minimum comment set with the following features: every comment in the set is distinct and represents the viewpoint of a considerable proportion of users, and together the comments in this minimum set cover and represent the viewpoints of all comments on the commodity, so that a user only needs to browse the comments in this minimum set to understand all user viewpoints on the commodity;
(4) BigSimDet parallel computation: calling the computing nodes of a distributed cluster to process similar-comment detection tasks in parallel.
In the present embodiment, in comment typicality mining modeling, each commodity comment is formally represented by its "aspects":

$$\vec{p}_a = (s_{a,1}:v_{a,1},\ s_{a,2}:v_{a,2},\ \ldots,\ s_{a,k}:v_{a,k})$$

where $s_{a,i}$ is an "aspect" belonging to commodity a, and $v_{a,i}$ is the sentiment polarity value of $s_{a,i}$ in the comment, i.e. the sentiment orientation value of that aspect.
The comment typicality calculation problem can be regarded as the following function:

$$\chi: R_i \to T_i$$

where $R_i$ is the set of comments belonging to commodity i, and $T_i$ is the comment set sorted by comment typicality.
For the minimum representative comment set mining problem, according to multi-prototype theory, the commodity comments are first clustered, and then one comment prototype is extracted from each comment cluster to represent that class of comments. Therefore, all comments on commodity x can be represented by n comment prototypes, that is:

$$\vec{t}_c = (\vec{t}_{c,1},\ \vec{t}_{c,2},\ \ldots,\ \vec{t}_{c,n})$$

where $\vec{t}_{c,j}$ is a comment prototype, which can be expressed as:

$$\vec{t}_{c,j} = (s_{j,1}^c:v_{j,1}^c,\ s_{j,2}^c:v_{j,2}^c,\ \ldots,\ s_{j,m}^c:v_{j,m}^c)$$

The minimum representative comment set mining problem can be expressed as a function:

$$\theta: R_i \to L_i$$

where $R_i$ is the set of comments belonging to commodity i, and $L_i$ is the minimum representative comment set of commodity i.
In the present embodiment, typicalness comment prototype automatically builds and comprises the following steps:
1. for the Central Tendency of an object in concept, determined by two aspects: the similarity (internal similarity) of other objects in this object and concept, and and other concepts in the dissimilarity (outside dissimilarity) of object.Internal similarity can be expressed as:
&beta; ( p &RightArrow; a , t &RightArrow; c ) = sim ( p &RightArrow; a , t &RightArrow; c , s )
Wherein, represent an object a, represent a concept c, the prototype s the most similar with a.
2. in order to weigh the similarity of prototype and object, the present invention takes different similarity functions, such as Cosine similarity function, Jaccard similarity function etc.For the outside dissimilarity of an object in certain concept, it is regarded as the integration value of the dissimilarity of this object and other concepts, uses following formulae discovery:
&delta; ( p &RightArrow; a , t &RightArrow; c ) = &Sigma; x dissimilar ( p &RightArrow; a , t &RightArrow; x , s ) N &Delta; - 1 , x &Element; Candx &NotEqual; c
3. The present invention designs an aggregation function to combine the internal similarity and external dissimilarity, yielding the central tendency of object a in concept c. For the second factor affecting object typicality, frequency of instantiation, the present invention defines a prototype salience vector to represent the frequency of instantiation of the different prototypes within a concept:

$$\vec{w}_c = (w_{c,1}, w_{c,2}, \ldots, w_{c,n}), \quad 0 < w_{c,i} \le 1$$

Each value in this vector is the number of objects in the cluster corresponding to one prototype, as a percentage of the total number of objects belonging to the concept.
4. After obtaining an object's central tendency within a concept and its frequency of instantiation, the two are combined by an aggregation function. This process is formalized as:

$$\tau_c(a) = \Phi(w_{c,s}, \alpha(\vec{p}_a, \vec{t}_c))$$
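The steps above can be sketched as follows. The patent leaves the aggregation functions α and Φ abstract, so the mean and product used below are illustrative stand-ins, and cosine similarity is one of the similarity functions the text mentions:

```python
import math

def cosine(u, v):
    """Cosine similarity of two aspect->weight dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def typicality(obj, concept, other_concepts, weights):
    """Hedged sketch of the typicality computation described above.

    concept: list of prototype dicts; other_concepts: list of prototype
    lists; weights: the prototype salience vector w_c. The choices of
    alpha (mean of beta and delta) and Phi (product with w_{c,s}) are
    illustrative assumptions, not the patent's definitions.
    """
    # Internal similarity beta: similarity to the most similar prototype s.
    sims = [cosine(obj, t) for t in concept]
    s = max(range(len(concept)), key=lambda i: sims[i])
    beta = sims[s]
    # External dissimilarity delta: mean dissimilarity to other concepts' prototypes.
    others = [t for c in other_concepts for t in c]
    delta = sum(1.0 - cosine(obj, t) for t in others) / len(others) if others else 1.0
    alpha = (beta + delta) / 2.0   # illustrative aggregation of beta and delta
    return weights[s] * alpha      # Phi(w_{c,s}, alpha) as a simple product
```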
5. The "category utility" measure from cognitive psychology is adopted as the criterion for designing an automatic concept multi-prototype construction algorithm, so that the generated prototypes all lie at the basic level of the concept. Following cognitive-psychology theory, the category utility over comments is:

$$cu(C, T) = \frac{1}{m}\sum_{k=1}^{m} p(c_k)\left[\frac{\sum_{i=1}^{n_k} w_i\, p(t_i \mid c_k)^2}{n_k} - \frac{\sum_{i=1}^{n} w_i\, p(t_i)^2}{n}\right]$$

where C is the concept set, T is the aspect set, $p(t_i \mid c_k)$ is the probability that a concept $c_k$ has aspect $t_i$, $p(c_k)$ is the probability that an object belongs to concept $c_k$, and $p(t_i)$ is the probability that an object has aspect $t_i$.
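A minimal sketch of the category utility formula above, under two simplifying assumptions that the text does not pin down: the inner sum is taken over the same global aspect set in both terms, and p(c_k) is estimated from cluster sizes:

```python
def category_utility(clusters, aspects, weights):
    """Weighted category utility cu(C, T) over comment clusters.

    clusters: list of clusters, each a list of comments, each comment a
    set of aspects. weights: dict of aspect weights w_i. Assumptions:
    p(c_k) = |c_k| / N, and both inner sums range over `aspects`.
    """
    n_total = sum(len(c) for c in clusters)
    m = len(clusters)
    n = len(aspects)

    def p_t(t, reviews):
        # Probability that a comment in `reviews` has aspect t.
        return sum(t in r for r in reviews) / len(reviews)

    all_reviews = [r for c in clusters for r in c]
    base = sum(weights[t] * p_t(t, all_reviews) ** 2 for t in aspects) / n
    cu = 0.0
    for c in clusters:
        within = sum(weights[t] * p_t(t, c) ** 2 for t in aspects) / n
        cu += (len(c) / n_total) * (within - base)
    return cu / m
```

A clustering that separates aspects cleanly scores higher than the trivial single cluster, which is what drives the basic-level clustering described next.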
6. The mining algorithm for basic-level concepts is shown in Table 1; it automatically mines the basic sub-concepts of a given concept. With this algorithm, all comments on a commodity are automatically clustered into several classes according to the change in category utility, and the clustering with the maximum category utility among the clusterings formed is finally selected; each class in that clustering belongs to the basic level.
Table 1
In the present embodiment, minimum comment set mining specifically comprises the following steps:
1. Similar comment detection: similar comments are removed from the set. The similar comment detection problem can be formalized as follows. Product reviews (i.e., documents) are represented by a set of vectors D = {d_1, d_2, ..., d_{|D|}}. Each vector d_i = <w_{i,1}, w_{i,2}, ..., w_{i,|L|}> contains |L| term weights, where L is the lexicon (the set of distinct words) of the product review corpus. Each vector is normalized to unit length. If the cosine similarity of a pair of vectors is at least a threshold μ_SIM, the pair is considered a similar comment pair. Given the normalized vectors, the cosine similarity sim(d_i, d_j) of a pair of vectors (d_i, d_j) is their dot product:

$$\mathrm{sim}(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$
2. The given similarity threshold is used to reduce the number of candidate pairs produced in the candidate generation phase. If the weight of the largest term in a document vector is too small, its cosine similarity with any vector will be below the given threshold, so the similarity computation need not be invoked at all. Given two document vectors d_i, d_j ∈ D, let ||d_i||_1 denote the 1-norm of a vector. Further, the maximum term weight in a vector d_j is $w_j^{\max} = \max(\{w_{j,1}, w_{j,2}, \ldots, w_{j,|L|}\})$, also called the ∞-norm ||d_j||_∞ of the vector. The following inequality then holds for sim(d_i, d_j):

$$\mathrm{sim}(d_i, d_j) \le \min\big(\,\|d_i\|_1 \times \|d_j\|_\infty,\; \|d_j\|_1 \times \|d_i\|_\infty\,\big)$$

If ||d_i||_1 × ||d_j||_∞ < μ_SIM, then certainly sim(d_i, d_j) < μ_SIM, i.e., d_i and d_j are not a similar comment pair.
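The pruning bound above follows from sim(d_i, d_j) = Σ_t w_{i,t} w_{j,t} ≤ ||d_j||_∞ Σ_t w_{i,t} = ||d_i||_1 × ||d_j||_∞ (and symmetrically). A small sketch, with dense lists standing in for the sparse vectors:

```python
def cosine_sim(di, dj):
    # Vectors are assumed already normalized to unit length, so the
    # cosine similarity is simply the dot product.
    return sum(a * b for a, b in zip(di, dj))

def norm_prune_ok(di, dj, mu_sim):
    """Return True if the pair must still be checked: the upper bound
    min(||d_i||_1 * ||d_j||_inf, ||d_j||_1 * ||d_i||_inf) reaches mu_sim.
    Return False when the bound already rules the pair out."""
    bound = min(sum(di) * max(dj), sum(dj) * max(di))
    return bound >= mu_sim
```

Pairs for which `norm_prune_ok` is False can be skipped without ever computing the dot product.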
In the present embodiment, the BigSimDet parallel algorithm comprises the following steps:
1. A first MapReduce job is invoked to compute the 1-norm and ∞-norm value of each document vector. All vectors are then arranged in ascending order of their 1-norm values and placed into groups according to a predefined group size τ_g. The maximum group size τ_g is one of the system's parameters. In general, a small τ_g costs more time in group construction but allows more dissimilar vector-group relations to be found. Figure 2 shows the parallel processing scheme of this first MapReduce job. The whole document collection D is divided evenly into input partitions. Each Mapper processes the document vectors of its own partition in parallel (e.g., computes the 1-norm and ∞-norm values). The intermediate key-value pairs generated by the Mappers (i.e., a 1-norm value and a document ID) are shuffled and fed into the Reducers. By using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key in the first Reducer, narrowing the processing range; the narrowed data are then fed into the second Reducer for processing, and so on.
The details of this first MapReduce job are shown in Table 2; it is responsible for sorting and grouping the document vectors in parallel. Each Mapper reads input key-value pairs (i.e., a document ID and a document vector) from its input partition. The Mapper then produces the corresponding document vector of normalized term weights and computes the 1-norm ||d_i||_1 and ∞-norm ||d_i||_∞ values. Finally, it outputs an intermediate key-value pair (i.e., the 1-norm value as key, and a tuple of document ID and ∞-norm as value). Each Reducer takes the sorted intermediate key-value pairs as input, sorts the documents by 1-norm value, and packs them according to the group size parameter τ_g. Finally, it outputs key-value pairs, each comprising a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
Table 2
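Since Table 2 is not reproduced in this text, the first job can be sketched as a sequential simulation of its Map, shuffle, and Reduce phases; the packing of sorted documents into groups of at most τ_g is an illustrative reading of the grouping step:

```python
def job1_sort_and_group(docs, tau_g):
    """Simulate the first MapReduce job: compute per-document 1-norm and
    inf-norm, totally order documents by ascending 1-norm, and pack them
    into groups of at most tau_g documents.

    docs: {doc_id: list of term weights}. Returns a list of group records.
    """
    # Map phase: emit (1-norm, (doc_id, inf-norm)) per document.
    intermediate = [(sum(v), (doc_id, max(v))) for doc_id, v in docs.items()]
    # Shuffle/sort: total order on the 1-norm key (TotalOrderPartitioner's role).
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: pack the sorted documents into groups of size tau_g.
    groups = []
    for i in range(0, len(intermediate), tau_g):
        chunk = intermediate[i:i + tau_g]
        groups.append({
            "group_id": len(groups),
            "docs": [doc_id for _, (doc_id, _) in chunk],
            "l1": [l1 for l1, _ in chunk],
            "linf": [inf for _, (_, inf) in chunk],
        })
    return groups
```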
2. Dissimilar document groups are identified, saving a large amount of time otherwise spent on unnecessary document similarity computations. Figure 3 illustrates the dissimilar-group search process for a given threshold μ_SIM = 0.5. When a group is found, by examining 1-norms and ∞-norms, to be dissimilar to other comment groups (such as g_3 and g_5 in Figure 3), further dissimilar group pairs can be determined at the same time (e.g., g_5 and g_2, g_5 and g_1). For each group g_i, denote the document with the maximum 1-norm value by $d_i^{L_1}$ and the document with the maximum ∞-norm by $d_i^{L_\infty}$. When determining the dissimilarity relation between two groups, the ∞-norm value in group g_j (e.g., d_10 in g_5) is multiplied by the maximum 1-norm of the other group g_i (e.g., d_6 in g_3). If $\|d_i^{L_1}\|_1 \times \|d_j^{L_\infty}\|_\infty < \mu_{SIM}$, no document in the two groups (e.g., g_3 and g_5) can be similar. Because the preceding MapReduce job produced a totally ordered queue of all document groups, with the documents in each group also arranged in ascending order of 1-norm, every group ranked below g_i (such as g_1 and g_2) is immediately marked dissimilar to g_j as well: for any such group g_{i-k}, $\|d_{i-k}^{L_1}\|_1 \times \|d_j^{L_\infty}\|_\infty < \mu_{SIM}$, because $\|d_{i-k}^{L_1}\|_1 \le \|d_i^{L_1}\|_1$. In practice, the total ordering of document groups in step 1 significantly reduces the number of group comparisons in step 2. Moreover, since the per-group 1-norm and ∞-norm values ($\|d_i^{L_1}\|_1$ and $\|d_j^{L_\infty}\|_\infty$) are computed independently in parallel, evaluating $\|d_{i-k}^{L_1}\|_1 \times \|d_j^{L_\infty}\|_\infty < \mu_{SIM}$ instead of the dot product between $d_{i-k}$ and $d_j$ saves substantial computation time, particularly when processing big data with a very large vocabulary.
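The group-level pruning just described can be sketched over the group records of the first job (the `l1`/`linf` field names are carried over from the earlier sketch and are assumptions of this illustration):

```python
def dissimilar_groups(groups, mu_sim):
    """Find dissimilar group pairs (i, j), i < j, given groups sorted by
    ascending 1-norm. Once the bound max-l1(g_i) * max-linf(g_j) reaches
    mu_sim, every higher-ranked g_i also reaches it, so the scan stops."""
    dissim = set()
    for j, gj in enumerate(groups):
        max_inf_j = max(gj["linf"])
        # Scan from the smallest 1-norms upward; stop at the first
        # potentially similar group.
        for i in range(j):
            if max(groups[i]["l1"]) * max_inf_j < mu_sim:
                dissim.add((i, j))
            else:
                break
    return dissim
```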
3. For each group g_i ∈ G, document partitions are initialized by merging potentially similar document groups in parallel while excluding dissimilar ones; excluding dissimilar document groups greatly reduces the computation time of the similarity comment detection phase. However, the sizes of the initial document partitions may be very uneven, because some document groups may contain many potentially similar groups while others have only a few. Unbalanced partitions cause the execution performance of the parallel similarity comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the compute node whose similarity detection task runs longest.
4. A smoothing task is invoked to split oversized partitions into several smaller ones according to a partition size limit τ_maxG. The value of τ_maxG is determined by the average local memory capacity of the compute cluster nodes. In general, if the local memory capacity of a compute node is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes much more time than a local disk access; as a result, the total execution time may increase greatly. For each group g_i ∈ G whose initial document partition exceeds the τ_maxG limit, g_i is used as a seed and its potentially similar document groups are merged into other partitions such that no generated partition exceeds the limit. Finally, the similar comment detection tasks can be distributed evenly among all compute nodes in the cluster.
The similarity comment detection tasks are processed in parallel by the compute nodes of the distributed cluster. For each document partition, a parallel MapReduce similarity comment detection task is started. Before the similarity computation, an inverted-index job is run in each local partition; the similar comment job runs after the indexing job. Table 3 shows the Map and Reduce functions that compute document vector similarity within each partition. Each Mapper accepts local key-value pairs: the key is a term t, and the value is a list of document IDs with their corresponding normalized term weights. The Map function is invoked so that each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one identical term in the inverted index. For each document vector d_i (with document ID n_i), an associative array H keeps the similarity scores between d_i and its potentially similar documents. Finally, the Map function outputs the document ID n_i together with the associative array H to minimize the number of intermediate keys generated, and these are shuffled. The Reduce function accepts the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays. Finally, the Reduce function outputs key-value pairs comprising the document IDs in the local partition and the final document similarity scores.
Table 3
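Since Table 3 is not reproduced in this text, the per-partition similarity job can be sketched as a sequential simulation: build the inverted index, accumulate partial dot products per pair (the role of the associative array H), then filter by the threshold in the reduce step:

```python
from collections import defaultdict

def similar_pairs_in_partition(docs, mu_sim):
    """Sketch of the per-partition similarity detection job above.

    docs: {doc_id: {term: normalized weight}}, vectors unit-normalized.
    Returns {(id_a, id_b): similarity} for pairs at or above mu_sim.
    """
    # Inverted-index job: term -> postings list of (doc_id, weight).
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    # Map phase: each shared term contributes a partial score to a pair,
    # so only pairs sharing at least one term are ever considered.
    scores = defaultdict(float)  # the role of the associative array H
    for postings in index.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (di, wi), (dj, wj) = postings[a], postings[b]
                pair = tuple(sorted((di, dj)))
                scores[pair] += wi * wj
    # Reduce phase: the sums are folded in; keep pairs above the threshold.
    return {pair: s for pair, s in scores.items() if s >= mu_sim}
```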
The embodiments described above are preferred embodiments of the present invention, but embodiments of the present invention are not restricted to them; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent substitute and is included within the protection scope of the present invention.

Claims (10)

1. A typicality-based big comment data mining method, characterized by comprising the following steps:
(1) comment typicality mining modeling: modeling and formally defining the comment typicality computation and the minimum representative comment set mining problem;
(2) automatic construction of typicality comment prototypes: designing the comment typicality computation method based on the "basic-level concept" theory of cognitive psychology and on multi-prototype theory, with the category utility of "basic-level concept" theory guiding the generation of comment prototypes;
(3) minimum comment set mining: a minimum comment set mining algorithm filters out a minimum comment set with the following features: each comment in the set is distinct and represents the viewpoint of a sizable fraction of users, and the comments in this minimum set together cover and represent the viewpoints of all comments on the commodity, so that a user need browse only the comments in this minimum set to understand the user viewpoints of all comments on the commodity;
(4) adopting the BigSimDet parallel computing method, processing the similarity comment detection tasks in parallel by invoking the compute nodes of a distributed cluster.
2. The typicality-based big comment data mining method according to claim 1, characterized in that, in step (1), the concrete steps of comment typicality mining modeling are:
(1-1) all comments on a commodity x are regarded as one "concept", the "concept" being the comments on commodity x, and each comment is one "instance" of this "concept", so each comment has a different typicality within the "concept"; moreover, a minimum representative comment set is extracted from all comments on commodity x, this comment set having the following two attributes:
(1-1-1) the n comments in the set represent, to the greatest extent possible, the different viewpoints of all users;
(1-1-2) the number of comments n in the set is as small as possible, so that a user need browse only a handful of n comments to understand fairly comprehensively all viewpoints and opinions on commodity x;
(1-2) "aspects" are adopted to represent product comments formally:

$$\vec{p}_a = (s_{a,1}\!:\!v_{a,1},\; s_{a,2}\!:\!v_{a,2},\; \ldots,\; s_{a,k}\!:\!v_{a,k})$$

where s_{a,i} is an "aspect" of commodity a and v_{a,i} is the sentiment polarity value of s_{a,i} in the comment, i.e., the sentiment orientation value of that aspect;
(1-3) the comment typicality computation problem can be regarded as a function:

$$\chi: R_i \to T_i$$

where R_i is the set of comments on commodity i and T_i is that set sorted by comment typicality;
for the minimum representative comment set mining problem, according to multi-prototype theory, the product comments are first clustered, and one comment prototype is then extracted from each comment cluster to represent that cluster; all comments on commodity x can therefore be represented by n comment prototypes:

$$\vec{t}_c = (\vec{t}_{c,1}, \vec{t}_{c,2}, \ldots, \vec{t}_{c,n})$$

where $\vec{t}_{c,j}$ is a comment prototype, which can be expressed as:

$$\vec{t}_{c,j} = (s^c_{j,1}\!:\!v^c_{j,1},\; s^c_{j,2}\!:\!v^c_{j,2},\; \ldots,\; s^c_{j,m}\!:\!v^c_{j,m})$$

(1-4) the minimum representative comment set mining problem can be expressed as a function:

$$\theta: R_i \to L_i$$

where R_i is the set of comments on commodity i and L_i is the minimum representative comment set of commodity i.
3. The typicality-based big comment data mining method according to claim 1, characterized in that, in step (2), the concrete method for automatic construction of typicality comment prototypes is:
(2-1) in multi-prototype theory, a concept can be represented by multiple abstract objects, called prototypes; each prototype represents one group of similar objects and is an abstract representation of those objects;
(2-2) according to cognitive-psychology research, the typicality of an object is determined by two factors: central tendency and frequency of instantiation; the former is the similarity between the object and the members of the concept, i.e., if an object is very similar to the other object instances within a concept and very dissimilar to the concepts outside it, its typicality in that concept is high; the latter is the frequency with which a person encounters the object and classifies it into the concept;
(2-3) the central tendency of an object within a concept is determined by two aspects: its similarity to the other objects in the concept, i.e., the internal similarity, and its dissimilarity to the objects in other concepts, i.e., the external dissimilarity; the internal similarity can be expressed as:

$$\beta(\vec{p}_a, \vec{t}_c) = \mathrm{sim}(\vec{p}_a, \vec{t}_{c,s})$$

where $\vec{p}_a$ represents an object a, $\vec{t}_c$ represents a concept c, and $\vec{t}_{c,s}$ is the prototype of c most similar to a;
(2-4) to measure the similarity between a prototype and an object, different similarity functions may be adopted; the external dissimilarity of an object with respect to a concept is taken as the aggregated dissimilarity between the object and the other concepts, computed with the following formula:

$$\delta(\vec{p}_a, \vec{t}_c) = \frac{\sum_{x} \mathrm{dissimilar}(\vec{p}_a, \vec{t}_{x,s})}{N_\Delta - 1}, \quad x \in C \text{ and } x \neq c$$

(2-5) the internal similarity and external dissimilarity are combined by an aggregation function, yielding the central tendency of object a in concept c;
for the multiple objects belonging to a concept, these objects are clustered, yielding multiple prototype representations of the concept; for the second factor affecting object typicality, frequency of instantiation, a prototype salience vector is defined to represent the frequency of instantiation of the different prototypes in a concept:

$$\vec{w}_c = (w_{c,1}, w_{c,2}, \ldots, w_{c,n}), \quad 0 < w_{c,i} \le 1$$

each value in this vector being the number of objects in the cluster corresponding to one prototype as a percentage of the total number of objects belonging to the concept;
after obtaining an object's central tendency within a concept and its frequency of instantiation, the two are combined by an aggregation function, formalized as:

$$\tau_c(a) = \Phi(w_{c,s}, \alpha(\vec{p}_a, \vec{t}_c))$$

(2-6) in the course of building object prototypes, the objects need to be clustered, and the multiple prototypes of a concept are obtained by an automatic object clustering method; the "category utility" of cognitive psychology is adopted as the criterion for designing an automatic concept multi-prototype construction algorithm, so that the generated prototypes all lie at the basic level of the concept, the basic level being a special conceptual partition that arises in the human process of cognition; following cognitive-psychology theory, the category utility over comments is:

$$cu(C, T) = \frac{1}{m}\sum_{k=1}^{m} p(c_k)\left[\frac{\sum_{i=1}^{n_k} w_i\, p(t_i \mid c_k)^2}{n_k} - \frac{\sum_{i=1}^{n} w_i\, p(t_i)^2}{n}\right]$$

where C is the concept set, T is the aspect set, $p(t_i \mid c_k)$ is the probability that a concept $c_k$ has aspect $t_i$, $p(c_k)$ is the probability that an object belongs to concept $c_k$, and $p(t_i)$ is the probability that an object has aspect $t_i$;
(2-7) the basic-level concept mining algorithm is adopted to automatically mine the basic sub-concepts of a given concept; with this algorithm, all comments on a commodity are automatically clustered into several classes according to the change in category utility, and the clustering with the maximum category utility among the clusterings formed is finally selected, each class in that clustering belonging to the basic level.
4. The typicality-based big comment data mining method according to claim 1, characterized in that, in step (3), the concrete steps of the minimum comment set mining are:
(3-1) similar comment detection is performed, and similar comments are removed from the set; the similar comment detection problem can be formalized as follows: product reviews are represented by a set of vectors D = {d_1, d_2, ..., d_{|D|}}; each vector d_i = <w_{i,1}, w_{i,2}, ..., w_{i,|L|}> contains |L| term weights, where L is the lexicon of the product review corpus, the lexicon being the set of distinct words; each vector is normalized to unit length; if the cosine similarity of a pair of vectors is at least the threshold μ_SIM, the pair is considered a similar comment pair; given the normalized vectors, the cosine similarity sim(d_i, d_j) of a pair of vectors (d_i, d_j) is their dot product:

$$\mathrm{sim}(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$

(3-2) Definition 1 (similar comment detection problem): given the set D of text vectors representing the corresponding product reviews and the similarity threshold μ_SIM, the similar comment detection problem is to determine the comment pairs d_i, d_j ∈ D with similarity sim(d_i, d_j) ≥ μ_SIM;
(3-3) the given similarity threshold is used to reduce the number of candidate pairs produced in the candidate generation phase: if the weight of the largest term in a document vector is too small, its cosine similarity with any vector will be below the given threshold, so the similarity computation need not be invoked; given two document vectors d_i, d_j ∈ D, let ||d_i||_1 denote the 1-norm of a vector; further, the maximum term weight in a vector d_j is $w_j^{\max} = \max(\{w_{j,1}, w_{j,2}, \ldots, w_{j,|L|}\})$, also called the ∞-norm ||d_j||_∞ of the vector; the following inequality then holds for sim(d_i, d_j):

$$\mathrm{sim}(d_i, d_j) \le \min\big(\,\|d_i\|_1 \times \|d_j\|_\infty,\; \|d_j\|_1 \times \|d_i\|_\infty\,\big)$$

if ||d_i||_1 × ||d_j||_∞ < μ_SIM, then certainly sim(d_i, d_j) < μ_SIM, i.e., d_i and d_j are not a similar comment pair.
5. The typicality-based big comment data mining method according to claim 1, characterized in that, in step (4), the concrete steps of the BigSimDet parallel computing method are:
(4-1) compute the 1-norm ||d_i||_1 and ∞-norm ||d_i||_∞ values of each document vector d_i ∈ D, and distribute the vectors to groups according to the group size parameter τ_g of the MapReduce parallel computing model; establish an ordering such that each vector group g_i ∈ G is sorted by 1-norm value;
(4-2) for each pair of vector groups g_i, g_j ∈ G, determine their dissimilarity relation by assessing whether the product of the maximum 1-norm value of g_i and the maximum ∞-norm value of g_j reaches the threshold μ_SIM;
(4-3) for each group g_i ∈ G, an initial MapReduce partition computes the similar pairs of g_i and the other vector groups; in this process, the dissimilarity relations and the partition size limit τ_maxG also need to be considered;
(4-4) execute a MapReduce task according to the partition size limit τ_maxG: if a vector group g_i ∈ G has multiple potentially similar vector groups, g_i is used as a seed, and additional partitions are formed according to the partition size limit τ_maxG;
(4-5) for each partition, start a parallel MapReduce processing task to detect similar comments.
6. The typicality-based big comment data mining method according to claim 5, characterized in that, in step (4-1), the concrete implementation steps of the MapReduce parallel computing model are:
a MapReduce job is invoked to compute the 1-norm and ∞-norm value of each document vector; all vectors are then arranged in ascending order of their 1-norm values and placed into groups according to a predefined group size τ_g, the maximum group size τ_g being one of the system's parameters; the whole document collection D is divided evenly into input partitions; each Mapper processes the document vectors of its own partition in parallel; the intermediate key-value pairs generated by the Mappers, i.e., a 1-norm value and a document ID, are shuffled and fed into the Reducers; by using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key in the first Reducer, narrowing the processing range, and the narrowed data are fed into the second Reducer for processing, and so on;
the concrete implementation process of the MapReduce job is as follows: each Mapper reads input key-value pairs, i.e., a document ID and a document vector, from its input partition; the Mapper then produces the corresponding document vector of normalized term weights; the Mapper then computes the 1-norm ||d_i||_1 and ∞-norm ||d_i||_∞ values; finally, it outputs an intermediate key-value pair, i.e., the 1-norm value and a tuple of document ID and ∞-norm; furthermore, each Reducer takes the sorted intermediate key-value pairs as input, then sorts the documents by 1-norm value and packs them according to the group size parameter τ_g; finally, it outputs key-value pairs, each comprising a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
7. The typicality-based big comment data mining method according to claim 5, characterized in that step (4-2) is specifically:
when a group is found, by weighing 1-norms and ∞-norms, to be dissimilar to other comment groups, further dissimilar document groups can be determined at the same time; for each group g_i, the document with the maximum 1-norm value is denoted $d_i^{L_1}$ and the document with the maximum ∞-norm is denoted $d_i^{L_\infty}$; when determining the dissimilarity relation between two groups, the ∞-norm value in group g_j is multiplied by the maximum 1-norm of the other group g_i; if $\|d_i^{L_1}\|_1 \times \|d_j^{L_\infty}\|_\infty < \mu_{SIM}$, no document in the two groups can be similar; because the preceding MapReduce job produced a totally ordered queue of all document groups, with the documents in each group also arranged in ascending order of 1-norm value, every group ranked below g_i (such as g_1 and g_2) is marked dissimilar to g_j as well.
8. The typicality-based big comment data mining method according to claim 5, characterized in that step (4-3) is specifically:
for each group g_i ∈ G, document partitions are initialized by merging potentially similar document groups in parallel while excluding dissimilar ones; excluding dissimilar document groups greatly reduces the computation time of the similarity comment detection phase; however, the sizes of the initial document partitions may be very uneven, because some document groups may contain many potentially similar groups while others have only a few; unbalanced partitions cause the execution performance of the parallel similarity comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the compute node whose similarity detection task runs longest.
9. The typicality-based big comment data mining method according to claim 5, characterized in that, in step (4-4),
a smoothing task is invoked to split oversized partitions into several smaller ones according to the partition size limit τ_maxG; the value of τ_maxG is determined by the average local memory capacity of the compute cluster nodes; if the local memory capacity of a compute node is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes much more time than a local disk access, so the total execution time may increase greatly; for each group g_i ∈ G whose initial document partition exceeds the τ_maxG limit, g_i is used as a seed and its potentially similar document groups are merged into other partitions such that no generated partition exceeds the limit; finally, the similar comment detection tasks can be distributed evenly among all compute nodes in the cluster.
10. The typicality-based big comment data mining method according to claim 5, characterized in that, in step (4-5),
the similarity comment detection tasks are processed in parallel by the compute nodes of the distributed cluster; for each document partition, a parallel MapReduce similarity comment detection task is started; before the similarity computation, an inverted-index job is run in each local partition, and the similar comment job runs after the indexing job; for the Map and Reduce functions that compute document vector similarity within each partition: each Mapper accepts local key-value pairs, the key being a term t and the value a list of document IDs with their corresponding normalized term weights; the Map function is invoked so that each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one identical term in the inverted index; for each document vector d_i, an associative array H keeps the similarity scores between d_i and its potentially similar documents; finally, the Map function outputs the document ID n_i together with the associative array H to minimize the number of intermediate keys generated, and these are shuffled; furthermore, the Reduce function accepts the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays; finally, the Reduce function outputs key-value pairs comprising the document IDs in the local partition and the final document similarity scores.
CN201410796566.XA 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness Active CN104462480B (en)

Publications (2)

Publication Number Publication Date
CN104462480A true CN104462480A (en) 2015-03-25
CN104462480B CN104462480B (en) 2017-11-10

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955957A (en) * 2016-05-05 2016-09-21 北京邮电大学 Determining method and device for aspect score in general comment of merchant
CN106446264A (en) * 2016-10-18 2017-02-22 哈尔滨工业大学深圳研究生院 Text representation method and system
CN109903851A (en) * 2019-01-24 2019-06-18 暨南大学 A kind of automatic Observation technology of the psychological abnormality variation based on social networks

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945268A (en) * 2012-10-25 2013-02-27 北京腾逸科技发展有限公司 Method and system for excavating comments on characteristics of product
US20130080208A1 (en) * 2011-09-23 2013-03-28 Fujitsu Limited User-Centric Opinion Analysis for Customer Relationship Management
CN103399916A (en) * 2013-07-31 2013-11-20 清华大学 Internet comment and opinion mining method and system on basis of product features
CN103778214A (en) * 2014-01-16 2014-05-07 北京理工大学 Commodity property clustering method based on user comments

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GERNOT HORSTMANN: "Facial Expressions of Emotion: Does the Prototype Represent Central Tendency, Frequency of Instantiation, or an Ideal?", 《EMOTION》 *
LAWRENCE W. BARSALOU: "Ideals, Central Tendency, and Frequency of Instantiation as Determinants of Graded Structure in Categories", 《JOURNAL OF EXPERIMENTAL PSYCHOLOGY: LEARNING, MEMORY, AND COGNITION》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955957A (en) * 2016-05-05 2016-09-21 北京邮电大学 Determining method and device for aspect score in general comment of merchant
CN105955957B (en) * 2016-05-05 2019-01-25 北京邮电大学 The determination method and device that aspect scores in a kind of businessman's general comment
CN106446264A (en) * 2016-10-18 2017-02-22 哈尔滨工业大学深圳研究生院 Text representation method and system
CN106446264B (en) * 2016-10-18 2019-08-27 哈尔滨工业大学深圳研究生院 Document representation method and system
CN109903851A (en) * 2019-01-24 2019-06-18 暨南大学 A kind of automatic Observation technology of the psychological abnormality variation based on social networks
CN109903851B (en) * 2019-01-24 2023-05-23 暨南大学 Automatic observation method for psychological abnormal change based on social network

Also Published As

Publication number Publication date
CN104462480B (en) 2017-11-10

Similar Documents

Publication Publication Date Title
US10019442B2 (en) Method and system for peer detection
Érdi et al. Prediction of emerging technologies based on analysis of the US patent citation network
Halibas et al. Determining the intervening effects of exploratory data analysis and feature engineering in telecoms customer churn modelling
Thao et al. A new multi-criteria decision making algorithm for medical diagnosis and classification problems using divergence measure of picture fuzzy sets
CN105786810B (en) The method for building up and device of classification mapping relations
CN109783633A (en) Data analysis service procedural model recommended method
Kaur Outlier detection using kmeans and fuzzy min max neural network in network data
Jauhar et al. Supply chain and the sustainability management: selection of suppliers for sustainable operations in the manufacturing industry
CN104462480A (en) Typicality-based big comment data mining method
Peng et al. Optimization research of decision support system based on data mining algorithm
Canetta* et al. Applying two-stage SOM-based clustering approaches to industrial data analysis
Abbasimehr et al. Trust prediction in online communities employing neurofuzzy approach
Kumar Efficient k-mean clustering algorithm for large datasets using data mining standard score normalization
Shi et al. Research on Fast Recommendation Algorithm of Library Personalized Information Based on Density Clustering.
Huang et al. Two-stage fuzzy cross-efficiency aggregation model using a fuzzy information retrieval method
Singh et al. Knowledge based retrieval scheme from big data for aviation industry
Ji A heuristic collaborative filtering recommendation algorithm based on book personalized Recommendation
Gavrilenko et al. Application of association rules for formation of public (administrative) services portfolio
Best et al. Atypical behavior identification in large-scale network traffic
Badyal et al. Insightful Business Analytics Using Artificial Intelligence-A Decision Support System for E-Businesses
Fisun et al. Methods of Searching for Association Dependencies in Multidimensional Databases
Matsunaga et al. Data mining applications and techniques: A systematic review
Kaur et al. Ranking based comparative analysis of graph centrality measures to detect negative nodes in online social networks
Osial et al. Smartphone recommendation system using web data integration techniques
Han et al. Information Flow Monitoring System

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant