CN104462480B - Typicality-based review big data mining method - Google Patents

Typicality-based review big data mining method

Info

Publication number
CN104462480B
CN104462480B CN201410796566.XA
Authority
CN
China
Prior art keywords
comment
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410796566.XA
Other languages
Chinese (zh)
Other versions
CN104462480A (en)
Inventor
刘耀强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201410796566.XA priority Critical patent/CN104462480B/en
Publication of CN104462480A publication Critical patent/CN104462480A/en
Application granted granted Critical
Publication of CN104462480B publication Critical patent/CN104462480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a typicality-based method for mining review big data, comprising the following steps: (1) modeling review-typicality mining: modeling and formally defining the problems of review typicality computation and minimal representative review set mining; (2) automatically building typicality review prototypes; (3) mining the minimal review set: using a minimal-review-set mining algorithm to select a minimal review set; (4) using the BigSimDet parallel computation method, which invokes the compute nodes of a distributed cluster to process similar-review detection tasks in parallel. The present invention studies how to measure the typicality of user reviews from the two angles of cognitive psychology and opinion mining, and on that basis mines a minimal representative review set, thereby helping potential buyers of a product understand it more comprehensively and from multiple angles, helping users screen the products they need more accurately, and improving the users' shopping experience.

Description

Typicality-based review big data mining method
Technical field
The present invention relates to the field of data mining, and more particularly to a typicality-based method for mining review big data.
Background technology
With the rapid development of China's Internet, reviews published on e-commerce websites, social networks, and online forums have grown explosively. This review big data, measured in petabytes (PB), reveals users' personal opinions on a broad range of subjects such as consumer products, organizations, people, and social events. Product reviews not only let enterprises understand the real needs of their existing or potential customers, but also provide useful guidance for consumers' shopping decisions. According to 2014 CNNIC data, more than 90% of online shoppers leave reviews under products on shopping websites, and more than half report that they read related product reviews before each purchase. For example, Ctrip provides a platform on which customers can publish reviews of the hotels they have stayed in; the hotel reviews published on the platform not only give other customers a reference for choosing a suitable hotel, but also let hotel management continuously improve its service according to the online feedback, thereby attracting more guests from home and abroad. In addition, analyzing such online opinions may help government departments quickly and broadly understand the conditions of the people in different regions, and learn the public's views and opinions on government policy or community development. In short, from the user's perspective, reviews help a user understand a product comprehensively and from multiple angles, so as to decide whether to buy it; they also let users learn which products can meet their needs. From the enterprise's perspective, manufacturers and service providers need to know users' opinions of their products, i.e. which features are strengths and which are weaknesses from the standpoint of user experience; this helps producers obtain more complete user feedback and thus improve their goods and services. In summary, online reviews contain rich and valuable information that is well worth deep mining and analysis.
Although online reviews are highly important to enterprises, regulators, and product users, in the big data era it is practically impossible to manually browse and analyze such a huge volume of online reviews. Traditional review mining methods struggle to analyze and summarize review big data in real time, and the resulting analyses are unsatisfactory. Building an intelligent online review opinion mining system for the big data setting therefore has high research and application value. For example, by mining a minimal representative review set from review big data, a system can let its users quickly grasp the different viewpoints in the reviews, and thus quickly and effectively monitor market trends or the conditions of the people in different regions.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a typicality-based method for mining review big data. The method uses the Basic Level Concept theory and the multi-prototype theory of cognitive psychology to design the computation of review typicality, so as to mine a minimal representative review set, and processes review big data mining in parallel on the Hadoop platform.
To achieve the above object, the present invention adopts the following technical scheme:
The typicality-based review big data mining method comprises the following steps:
(1) modeling review-typicality mining: modeling and formally defining the problems of review typicality computation and minimal representative review set mining;
(2) automatically building typicality review prototypes: designing the review typicality computation method based on the Basic Level Concept theory and the multi-prototype theory of cognitive psychology, and guiding the generation of review prototypes with the category utility from the Basic Level Concept theory;
(3) mining the minimal review set: using the minimal-review-set mining algorithm to select a minimal review set with the following features: every review in the set is distinct and represents the viewpoint of a sizable share of users; the reviews in the minimal set together cover and represent the viewpoints of all reviews of the product; and a user need only browse the reviews in this minimal set to understand the user viewpoints of all reviews of the product;
(4) using the BigSimDet parallel computation method: invoking the compute nodes of a distributed cluster to process similar-review detection tasks in parallel.
Preferably, in step (1), the specific steps of modeling review-typicality mining are:
(1-1) treating all reviews of a product x as one "concept", the concept being the reviews of product x, with each review being one "example" of this concept, so that each review has a different typicality within the concept; further, from all the reviews of product x, extracting a minimal representative review set having the following two attributes:
(1-1-1) the n reviews contained in the set represent, to the greatest possible extent, the different types of viewpoints of all users;
(1-1-2) the number of reviews n in the set is as small as possible, so that a user need only browse these few n reviews to gain a fairly complete understanding of all viewpoints and opinions about product x;
(1-2) representing a product review formally in terms of "aspects", for example as r_a = {(s_a,1, v_a,1), ..., (s_a,m, v_a,m)},
where s_a,i is an "aspect" belonging to product a, and v_a,i is the sentiment polarity value for s_a,i in the review, i.e. the sentiment orientation toward that aspect.
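The aspect-based representation above can be sketched in code. This is a minimal illustration, not from the patent text; the aspect names and polarity values are made up:

```python
# Illustrative sketch: a review of product a maps each aspect s_a_i
# to its sentiment polarity value v_a_i (assumed here to lie in [-1, 1]).

def make_review(aspect_sentiments):
    """Build the aspect -> sentiment-polarity mapping for one review."""
    return dict(aspect_sentiments)

review = make_review([("battery", 0.8), ("screen", -0.3), ("price", 0.5)])

# Aspects the reviewer felt positively about:
positive_aspects = [s for s, v in review.items() if v > 0]
```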
(1-3) the review typicality computation problem can be viewed as the following function:
χ: Ri → Ti
where Ri is the set of reviews belonging to product i, and Ti is the same review set ordered by review typicality;
For the minimal representative review set mining problem, following the multi-prototype theory, the product reviews are first clustered, and then one review prototype is extracted from each review cluster to represent that cluster; hence all reviews of product x can be represented by n review prototypes, i.e. Rx ≈ {p1, p2, ..., pn},
where each pj is a review prototype, itself expressible in the aspect-based representation of step (1-2).
(1-4) the minimal representative review set mining problem can be expressed as a function:
θ: Ri → Li
where Ri is the set of reviews belonging to product i, and Li is the minimal representative review set of product i.
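The cluster-then-extract idea behind the prototypes can be sketched as follows. This is an illustrative sketch under assumed conventions (not the patented algorithm): clusters are given, vectors are plain lists, and the prototype of a cluster is taken to be the member closest to the cluster centroid:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def prototypes(clusters):
    """One representative member per cluster: the member nearest the centroid."""
    result = []
    for cluster in clusters:
        c = centroid(cluster)
        result.append(min(cluster, key=lambda v: dist(v, c)))
    return result

# Two toy review clusters; each inner list is a review feature vector.
clusters = [
    [[1.0, 0.0], [0.8, 0.2], [0.9, 0.0]],
    [[0.0, 1.0], [0.2, 0.8], [0.0, 0.9]],
]
reps = prototypes(clusters)  # one representative review vector per cluster
```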
Preferably, in step (2), the specific method of automatically building typicality review prototypes is:
(2-1) in the multi-prototype theory, a concept can be represented by several abstract objects called prototypes; each prototype represents one group of similar objects and is the abstract representation of those objects;
(2-2) according to cognitive psychology research, the typicality of an object is determined by two factors: Central Tendency and Frequency of Instantiation. The former is the similarity of the object to the members of its concept: if an object is very similar to the other object instances in a concept and very dissimilar to the other concepts, then its typicality in that concept is high. The latter is the frequency with which a person encounters a certain object and classifies it into a certain concept;
(2-3) the Central Tendency of an object within a concept is determined by two aspects: its similarity to the other objects in the concept, i.e. the internal similarity, and its dissimilarity to the objects in other concepts, i.e. the external dissimilarity. The internal similarity can be expressed as InternalSim(a, c) = sim(a, s),
where a denotes an object, c denotes a concept, and s is the prototype in c most similar to a;
(2-4) different similarity functions can be adopted to measure the similarity between a prototype and an object; the external dissimilarity of an object with respect to a concept is regarded as an aggregate of the object's dissimilarities to the other concepts, computed by aggregating the dissimilarity values between the object and every other concept;
(2-5) an aggregation function integrates the internal similarity and the external dissimilarity to obtain the Central Tendency of object a in concept c;
For the multiple objects belonging to a concept, these objects are clustered so as to obtain multiple prototype representations of the concept. For the second factor influencing object typicality, Frequency of Instantiation, a prototype saliency vector is defined to represent the Frequency of Instantiation of the different prototypes within a concept: each value in the vector is the number of objects contained in the cluster corresponding to one prototype, expressed as a percentage of the total number of objects belonging to the concept;
After obtaining an object's Central Tendency and Frequency of Instantiation values in a concept, the two are integrated by an aggregation function into the object's typicality score;
(2-5) building the object prototypes requires clustering the objects; multiple prototypes within a concept are obtained by an automatic object-clustering method. The "category utility" from cognitive psychology is used as the criterion for designing the automatic construction algorithm of a concept's multiple prototypes, so that the generated prototypes all lie at the basic level of the concept, the basic level being a special level of concept partitioning that people use in cognition. Following cognitive psychology theory, the Category Utility over reviews is
CU(C) = (1/|C|) Σ_{ck ∈ C} p(ck) Σ_{ti ∈ F} [p(ti|ck)² − p(ti)²]
where C is the concept set, F is the aspect set, p(ti|ck)² is the squared probability that concept ck possesses aspect ti, p(ck) is the probability that an object belongs to concept ck, and p(ti) is the probability that an object possesses aspect ti;
(2-6) the Basic Level concept mining algorithm automatically mines the basic-level sub-concepts of a concept; using this algorithm, all reviews of a product are automatically clustered into several classes according to the change in category utility, the clustering result with the maximum category utility is finally chosen, and each class in that clustering belongs to the Basic Level.
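The category utility measure named above can be sketched in code. This follows the standard Gluck–Corter-style definition matching the terms named in the text (p(ck), p(ti|ck)², p(ti)); it is an illustrative sketch with made-up aspect data, since the patent's exact variant is not fully reproduced here:

```python
def category_utility(clusters, aspects):
    """clusters: list of clusters; each cluster is a list of reviews,
    and each review is a set of the aspects it mentions."""
    n = sum(len(c) for c in clusters)
    all_reviews = [r for c in clusters for r in c]
    # p(t_i): probability that a review possesses aspect t_i.
    p_t = {t: sum(1 for r in all_reviews if t in r) / n for t in aspects}
    cu = 0.0
    for c in clusters:
        p_c = len(c) / n  # p(c_k)
        inner = 0.0
        for t in aspects:
            p_t_c = sum(1 for r in c if t in r) / len(c)  # p(t_i | c_k)
            inner += p_t_c ** 2 - p_t[t] ** 2
        cu += p_c * inner
    return cu / len(clusters)

clusters = [
    [{"battery", "price"}, {"battery"}],   # reviews about battery
    [{"screen"}, {"screen", "price"}],     # reviews about screen
]
cu = category_utility(clusters, ["battery", "screen", "price"])
```

A clustering that groups reviews sharing the same aspects yields a higher category utility than one that scatters them, which is what drives the basic-level clustering.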
Preferably, in step (3), the specific steps of mining the minimal review set are:
(3-1) performing similar-review detection and removing similar reviews from the set; the similar-review detection problem can be formalized as follows: the product reviews are represented by a set of vectors D = {d1, d2, ..., d|D|}; each vector di = <wi,1, wi,2, ..., wi,|L|> contains |L| term weights, where |L| is the total vocabulary size of the review corpus, the vocabulary being the set of distinct terms; each vector is normalized to unit length; if the cosine similarity of a pair of vectors is greater than or equal to the threshold μSIM, the pair is considered a similar review pair; given normalized vectors, the cosine similarity sim(di, dj) of a pair of vectors (di, dj) is the dot product of the vectors: sim(di, dj) = di · dj = Σt wi,t × wj,t;
(3-2) Definition 1 (similar-review detection problem): given the set D of text vectors representing the reviews of a product and a similarity threshold μSIM, the similar-review detection problem is to find all review pairs di, dj ∈ D with similarity sim(di, dj) ≥ μSIM;
(3-3) using a given similarity threshold to reduce the number of candidate pairs produced in the candidate generation phase: if the weight of the largest term in a document vector is too small, then its cosine similarity with any other vector will be below the given threshold, and the similarity computation can be skipped. Given two document vectors di, dj ∈ D, let ||di||1 denote the 1-norm of a vector; further, the maximum term weight in a vector dj, written ||dj||∞, is also called the vector's ∞-norm. Then sim(di, dj) satisfies the following inequality:
sim(di, dj) ≤ min(||di||1 × ||dj||∞, ||dj||1 × ||di||∞)
If ||di||1 × ||dj||∞ < μSIM, then certainly sim(di, dj) < μSIM, i.e. di and dj are not similar reviews.
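The pruning rule above can be sketched in code. This is an illustrative sketch with made-up vectors: for unit-length vectors the cosine similarity is the dot product, and the norm-product bound lets a pair be skipped safely whenever the bound falls below μSIM:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(di, dj):
    # Dot product; equals cosine similarity for unit-length vectors.
    return sum(a * b for a, b in zip(di, dj))

def norm1(v):
    return sum(abs(x) for x in v)

def norm_inf(v):
    return max(abs(x) for x in v)

def may_be_similar(di, dj, mu_sim):
    # sim(di, dj) <= min(||di||_1 * ||dj||_inf, ||dj||_1 * ||di||_inf),
    # so a pair whose bound is below mu_sim cannot be similar.
    bound = min(norm1(di) * norm_inf(dj), norm1(dj) * norm_inf(di))
    return bound >= mu_sim

di = normalize([1.0, 1.0, 0.0])
dj = normalize([1.0, 0.0, 1.0])
sim = cosine(di, dj)                            # ~0.5 for these vectors
candidate = may_be_similar(di, dj, mu_sim=0.9)  # bound does not prune here
```

The bound is safe but not tight: here the pair survives pruning even though its true similarity is below the threshold, so the exact dot product is still computed for surviving candidates.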
Preferably, in step (4), the specific steps of the BigSimDet parallel computation method are:
(4-1) computing the 1-norm ||di||1 and ∞-norm ||di||∞ of each document vector di ∈ D, and distributing the vectors into groups according to the group size parameter τG of the MapReduce parallel computation model; an ordering is established so that the vectors within each group gi ∈ G are sorted by their 1-norm values;
(4-2) for each pair of vector groups gi, gj ∈ G with gi ≠ gj, determining their non-similarity relation by assessing whether the product of the maximum 1-norm value of gi and the maximum ∞-norm value of gj exceeds the threshold μSIM;
(4-3) for each group gi ∈ G, initializing a MapReduce partition to compute the similar pairs between gi and the other vector groups; in this process, the non-similarity relations and the partition size limit τMaxG must also be considered;
(4-4) according to the partition size limit τMaxG, executing the MapReduce tasks: if a vector group gi ∈ G has multiple potentially similar vector groups, gi is used as a seed, and extra partitions are formed according to the partition size limit τMaxG;
(4-5) for each partition, starting one parallel MapReduce processing task to detect similar reviews.
Preferably, in step (4-1), the specific implementation steps of the MapReduce parallel computation model are:
A MapReduce job is invoked to compute the 1-norm and ∞-norm value of each document vector. All vectors are then arranged in ascending order of their 1-norm values and, according to the predefined group size τG, placed into corresponding groups; the maximum group size τG is one of the system parameters. The whole document collection D is evenly divided into several input partitions, and each Mapper concurrently processes the document vectors of its own partition; the intermediate key-value pairs produced by the Mappers, i.e. 1-norm values and document IDs, are shuffled and then input into the Reducers. By using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted in the first Reducer according to the magnitude of each key value to narrow the processing range, and the range-narrowed data are then input into the second Reducer for processing, and so on.
The specific implementation of the MapReduce job is as follows: a Mapper reads each input key-value pair, i.e. a document ID and a document vector, from its input partition; the Mapper then produces the corresponding document vector composed of normalized term weights; next, the Mapper computes the 1-norm ||di||1 and ∞-norm ||di||∞ values; finally, it outputs the intermediate key-value pairs, i.e. the 1-norm value as key and a tuple composed of the document ID and the ∞-norm as value. In addition, each Reducer takes the sorted intermediate key-value pairs as input, then sorts the documents by 1-norm value according to the group size parameter τG, and finally outputs key-value pairs, each containing a group ID and a tuple holding the ordered document ID list with the corresponding 1-norm and ∞-norm values.
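The norm computation, sorting, and grouping performed by the job above can be sketched sequentially. This is an illustrative single-process sketch with made-up vectors, not the distributed implementation; each output entry corresponds to one emitted (id, 1-norm, ∞-norm) tuple:

```python
def group_by_norm(docs, tau_g):
    """docs: dict mapping doc_id -> vector (list of weights).
    Returns a list of groups, each a list of (doc_id, l1_norm, inf_norm)
    tuples, with documents sorted by ascending 1-norm and at most
    tau_g documents per group."""
    stats = [(doc_id, sum(abs(x) for x in v), max(abs(x) for x in v))
             for doc_id, v in docs.items()]
    stats.sort(key=lambda s: s[1])  # ascending 1-norm, as in the job
    return [stats[k:k + tau_g] for k in range(0, len(stats), tau_g)]

docs = {"d1": [0.2, 0.1], "d2": [0.9, 0.4], "d3": [0.5, 0.5]}
groups = group_by_norm(docs, tau_g=2)  # [[d1, d3], [d2]] by 1-norm order
```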
Preferably, step (4-2) is specifically:
When a group is found, by weighing 1-norms and ∞-norms, to be dissimilar to other reviews, the other dissimilar document groups can be determined at the same time. For each group gi, the document with the maximum 1-norm value and the document with the maximum ∞-norm value are recorded. To determine the dissimilarity relation between two groups, the maximum ∞-norm value in group gj is combined with the maximum 1-norm value in group gi: if their product is below μSIM, the documents in the two groups cannot be similar. Because the earlier MapReduce job has produced an ordered queue of all document groups, with the documents in each group also arranged in ascending order of 1-norm value, the groups ranked lower than gi (such as g1 and g2) are likewise determined to be dissimilar to gj.
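The group-level dissimilarity check above can be sketched as follows, reusing (doc_id, 1-norm, ∞-norm) tuples; the example data are made up:

```python
def groups_may_match(gi, gj, mu_sim):
    """gi, gj: lists of (doc_id, l1_norm, inf_norm) tuples.
    Two groups cannot contain a similar pair if, in both directions,
    (max 1-norm of one group) * (max inf-norm of the other) < mu_sim."""
    max_l1_i = max(s[1] for s in gi)
    max_inf_i = max(s[2] for s in gi)
    max_l1_j = max(s[1] for s in gj)
    max_inf_j = max(s[2] for s in gj)
    return min(max_l1_i * max_inf_j, max_l1_j * max_inf_i) >= mu_sim

gi = [("d1", 0.3, 0.2)]
gj = [("d2", 1.3, 0.9), ("d3", 1.0, 0.5)]
skip = not groups_may_match(gi, gj, mu_sim=0.5)  # whole group pair pruned
self_match = groups_may_match(gj, gj, mu_sim=0.5)
```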
Preferably, step (4-3) is specifically:
For each group gi ∈ G, dissimilar document groups are excluded by merging the potentially similar document groups in parallel, so as to initialize the document partition; excluding dissimilar document groups greatly reduces the computation time of the similar-review detection phase. However, the sizes of the initialized document partitions may be very uneven, because some document groups may have many potentially similar document groups while others have only a few. Unbalanced document partitions sharply degrade the execution performance of the parallel similar-review detection tasks under MapReduce, because the total execution time of a similar-review detection job is governed by the compute node running the longest similar-review detection task.
Preferably, in step (4-4),
a balancing task is invoked to divide large partitions into several smaller ones according to the partition size limit τMaxG. The value of τMaxG is determined by the average local storage capacity of the cluster's compute nodes: if the local storage capacity of a compute node is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes much more time than a local disk access, so the total execution time may increase greatly. For each group gi ∈ G, if its initial document partition exceeds the τMaxG limit, gi is used as a seed and is merged with the other potentially similar document groups such that no generated partition exceeds the limit; finally, the similar-review detection tasks can be evenly distributed among all compute nodes of the cluster.
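The seed-based partition splitting can be sketched as follows. This is an illustrative sketch under assumed conventions: the seed group is replicated into every partition, and each partition holds at most tau_max_g groups:

```python
def split_partition(seed, partners, tau_max_g):
    """Split an oversized partition: return partitions of the form
    [seed] + chunk, where each chunk holds at most (tau_max_g - 1)
    partner groups, so no partition exceeds tau_max_g groups."""
    chunk = tau_max_g - 1
    return [[seed] + partners[k:k + chunk]
            for k in range(0, len(partners), chunk)]

# g1 has four potentially similar groups but each partition may hold
# at most three groups, so g1 is replicated as the seed of two partitions.
parts = split_partition("g1", ["g2", "g3", "g4", "g5"], tau_max_g=3)
```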
Preferably, in step (4-5),
similar-review detection tasks are processed in parallel by invoking the compute nodes in the distributed cluster. For each document partition, one parallel MapReduce similar-review detection task is started. Before the similarity computation, an inverted-index operation must be performed within each local partition, and the similar-review detection job runs after the indexing. In the Map and Reduce functions that compute document vector similarity within each partition, each Mapper receives local key-value pairs whose key is a term t and whose value is a list of document IDs with the corresponding normalized term weights. When the Map function is called, each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one identical term in the inverted index. For each document vector di, an associative array H keeps the potentially similar documents and their similarity scores with di. Finally, the Map function outputs the intermediate keys, i.e. document IDs, together with the associative array H, and shuffles them. The Reduce function receives the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays; finally, the Reduce function outputs key-value pairs containing the document IDs and the final similarity scores of the documents in the local partition.
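The inverted-index similarity computation can be sketched in a single process as follows. This is an illustrative sketch (not the distributed implementation) with made-up documents: a term-to-postings index is built, and dot-product contributions are accumulated only for document pairs sharing at least one term, mirroring the accumulator H:

```python
from collections import defaultdict

def similar_pairs(docs, mu_sim):
    """docs: dict mapping doc_id -> {term: normalized weight}.
    Returns {(id_a, id_b): score} for pairs with score >= mu_sim."""
    index = defaultdict(list)            # inverted index: term -> [(id, w)]
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    scores = defaultdict(float)          # accumulator over candidate pairs
    for postings in index.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (i, wi), (j, wj) = postings[a], postings[b]
                scores[(min(i, j), max(i, j))] += wi * wj
    return {pair: s for pair, s in scores.items() if s >= mu_sim}

docs = {
    "d1": {"good": 0.8, "battery": 0.6},
    "d2": {"good": 0.8, "battery": 0.6},
    "d3": {"bad": 1.0},
}
pairs = similar_pairs(docs, mu_sim=0.9)  # d1 and d2 are near-duplicates
```

Documents sharing no terms (such as d3 here) never become candidates, which is what makes the inverted index cheaper than all-pairs comparison.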
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The present invention is an interdisciplinary research method: it applies the multi-prototype concept representation theory and the Basic Level Concept theory of cognitive psychology to measure the typicality of a review, disclosing a new review typicality computation method. In addition, the category utility proposed by cognitive psychologists is used as the clustering objective to guide the generation of review prototypes, so that the prototypes produced by clustering, and the review typicality computed from them, are closer to people's true cognition.
(2) Unlike existing opinion mining techniques, which mainly focus on sentiment analysis and opinion summarization of reviews, the present invention addresses the new problem of mining a minimal representative review set from review big data. By mining the minimal representative review set, users can easily and comprehensively grasp the overall picture of all reviews and the diversity of viewpoints in them without browsing the full review big data; this fills the research gap on the problem and further strengthens the reference value of significant reviews in review big data for users.
(3) The research results disclosed by the invention can help potential customers understand a product more comprehensively and from multiple angles, help users screen the products they need more accurately, and improve the users' shopping experience. Moreover, from the perspective of manufacturers and service providers, they can more fully understand users' views of their products, i.e. which features are strengths and which are weaknesses from the standpoint of user experience; this helps producers obtain more complete user feedback, improve their goods, and increase sales. In addition, the disclosed method may also help government departments understand the conditions of the people in different regions more quickly, broadly, and comprehensively, and learn both the typical views and the various representative views of the public on government policy.
(4) The method disclosed by the invention operates on web review big data: the disclosed review typicality computation method and minimal representative review mining algorithm are implemented in parallel and in a distributed manner under Hadoop and MapReduce, so as to cope with applications in the big data environment.
Brief description of the drawings
Fig. 1 is the processing flowchart of the mining method of the present invention;
Fig. 2 shows the review-vector sorting and grouping scheme;
Fig. 3 shows the computation of dissimilarity between document vector groups.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, the typicality-based review big data mining method comprises the following steps:
(1) modeling review-typicality mining: modeling and formally defining the problems of review typicality computation and minimal representative review set mining;
(2) automatically building typicality review prototypes: designing the review typicality computation method based on the Basic Level Concept theory and the multi-prototype theory of cognitive psychology, and guiding the generation of review prototypes with the category utility from the Basic Level Concept theory;
(3) mining the minimal review set: using the minimal-review-set mining algorithm to select a minimal review set with the following features: every review in the set is distinct and represents the viewpoint of a sizable share of users; the reviews in the minimal set together cover and represent the viewpoints of all reviews of the product; and a user need only browse the reviews in this minimal set to understand the user viewpoints of all reviews of the product;
(4) using the BigSimDet parallel computation method: invoking the compute nodes of a distributed cluster to process similar-review detection tasks in parallel.
In this embodiment, in the modeling of review-typicality mining, a product review is formally represented in terms of "aspects", for example as r_a = {(s_a,1, v_a,1), ..., (s_a,m, v_a,m)},
where s_a,i is an "aspect" belonging to product a, and v_a,i is the sentiment polarity value for s_a,i in the review, i.e. the sentiment orientation toward that aspect.
The review typicality computation problem can be viewed as the following function:
χ: Ri → Ti
where Ri is the set of reviews belonging to product i, and Ti is the same review set ordered by review typicality.
For the minimal representative review set mining problem, following the multi-prototype theory, the product reviews are first clustered, and then one review prototype is extracted from each review cluster to represent that cluster. Hence all reviews of product x can be represented by n review prototypes, i.e. Rx ≈ {p1, p2, ..., pn},
where each pj is a review prototype, itself expressible in the aspect-based representation above.
The minimal representative review set mining problem can be expressed as a function:
θ: Ri → Li
where Ri is the set of reviews belonging to product i, and Li is the minimal representative review set of product i.
In this embodiment, automatically building typicality review prototypes comprises the following steps:
1. The Central Tendency of an object within a concept is determined by two aspects: its similarity to the other objects in the concept (internal similarity), and its dissimilarity to the objects in other concepts (external dissimilarity). The internal similarity can be expressed as InternalSim(a, c) = sim(a, s),
where a denotes an object, c denotes a concept, and s is the prototype in c most similar to a.
2. To measure the similarity between a prototype and an object, the present invention adopts different similarity functions, such as cosine similarity or Jaccard similarity. The external dissimilarity of an object with respect to a concept is regarded as an aggregate of the object's dissimilarities to the other concepts, computed by aggregating the dissimilarity values between the object and every other concept.
3. The present invention designs an aggregation function to integrate the internal similarity and the external dissimilarity, so as to obtain the Central Tendency of object a in concept c. For the second factor influencing object typicality, Frequency of Instantiation, the present invention defines a prototype saliency vector to represent the Frequency of Instantiation of the different prototypes within a concept: each value in the vector is the number of objects contained in the cluster corresponding to one prototype, expressed as a percentage of the total number of objects belonging to the concept.
4. After obtaining an object's Central Tendency and Frequency of Instantiation values in a concept, the two are integrated by an aggregation function into the object's typicality score.
5. The "category utility" from cognitive psychology is used as the criterion for designing the automatic construction algorithm of a concept's multiple prototypes, so that the generated prototypes all lie at the basic level of the concept. Following cognitive psychology theory, the Category Utility over reviews is
CU(C) = (1/|C|) Σ_{ck ∈ C} p(ck) Σ_{ti ∈ F} [p(ti|ck)² − p(ti)²]
where C is the concept set, F is the aspect set, p(ti|ck)² is the squared probability that concept ck possesses aspect ti, p(ck) is the probability that an object belongs to concept ck, and p(ti) is the probability that an object possesses aspect ti.
6. The Basic Level concept mining algorithm is shown in Table 1; it can automatically mine the basic sub-concepts of a given concept. Using this algorithm, all comments on a product are automatically clustered into several classes according to the change in category utility; the clustering with the maximum category utility is finally chosen from the clustering results, and each class in that clustering belongs to the Basic Level.
Table 1
In the present embodiment, minimum comment set mining specifically includes the following steps:
1. Similar comment detection: similar comments are removed from the set. The similar comment detection problem can be formalized as follows. Product reviews (i.e., documents) are represented by a group of vectors D = {d1, d2, ..., d|D|}. Each vector di = <wi,1, wi,2, ..., wi,|L|> contains |L| term weights, where L is the vocabulary (i.e., the set of non-repeating words in the product review corpus). Each vector is normalized to unit length. If the cosine similarity of a pair of vectors is greater than or equal to a threshold μSIM, the pair is considered similar comments. Given normalized vectors, the cosine similarity sim(di, dj) of a pair of vectors (di, dj) is the dot product of those vectors:

$$\mathrm{sim}(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$
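A small sketch of this normalized-cosine test; the threshold value 0.8, the sparse-dict representation, and all function names are hypothetical:

```python
import math

def normalize(d):
    # Scale a sparse {term: weight} vector to unit length.
    n = math.sqrt(sum(w * w for w in d.values()))
    return {t: w / n for t, w in d.items()}

def sim(di, dj):
    # For unit-length vectors, cosine similarity is just the dot product.
    return sum(w * dj.get(t, 0.0) for t, w in di.items())

def is_similar_comment(di, dj, mu_sim=0.8):
    # mu_sim is a hypothetical threshold value.
    return sim(normalize(di), normalize(dj)) >= mu_sim
```

Normalizing once up front is what lets the later MapReduce stages treat similarity as a plain dot product.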
2. A given similarity threshold is used to reduce the number of candidate pairs produced in the candidate generation phase. If the weight of the maximal term in a document vector is too small, then its cosine similarity with any vector will be below the given threshold, so the full similarity computation need not be carried out. Given two document vectors di, dj ∈ D, $\|d_i\|_1$ denotes the vector's 1-norm. In addition, the maximal term weight in a vector dj is $\|d_j\|_\infty$, also called the vector's ∞-norm. The following inequality then holds for sim(di, dj):
$$\mathrm{sim}(d_i,d_j) \leq \min\left(\|d_i\|_1 \times \|d_j\|_\infty,\ \|d_j\|_1 \times \|d_i\|_\infty\right)$$
If $\|d_i\|_1 \times \|d_j\|_\infty < \mu_{SIM}$, then certainly sim(di, dj) < μSIM, i.e., di and dj are not similar comments.
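The pruning rule above might be sketched as follows; `may_be_similar` and the dictionary representation are illustrative assumptions:

```python
def norm_1(d):
    # 1-norm of a sparse {term: weight} vector.
    return sum(abs(w) for w in d.values())

def norm_inf(d):
    # inf-norm: the maximal term weight.
    return max(abs(w) for w in d.values())

def may_be_similar(di, dj, mu_sim):
    # sim(di,dj) <= min(||di||_1 * ||dj||_inf, ||dj||_1 * ||di||_inf),
    # so if that bound is already below mu_sim the pair can be pruned
    # without ever computing the dot product.
    bound = min(norm_1(di) * norm_inf(dj), norm_1(dj) * norm_inf(di))
    return bound >= mu_sim
```

The bound follows because each product term wi,t · wj,t is at most wi,t · ||dj||∞, and summing over t gives ||di||1 · ||dj||∞.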
In the present embodiment, the BigSimDet parallel algorithm comprises the following steps:
1. A MapReduce job is called to compute the 1-norm and ∞-norm values of each document vector. All vectors are then arranged in ascending order of their 1-norm value and placed into corresponding groups according to a predefined group size τG. The maximum group size τG is one of the system's parameters. Generally, a small τG allows more dissimilar vector-group relations to be found, at the cost of constructing more groups. Fig. 2 shows the parallel processing pattern of this first MapReduce job. The whole document collection D is divided evenly into several input partitions. Each Mapper concurrently processes the document vectors of its own partition (for example, computing the 1-norm and ∞-norm values). Running the Mappers yields intermediate key-value pairs (i.e., 1-norm value and document ID), which are then shuffled and input into the Reducers. By using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key in the first Reducer to narrow the processing range, the narrowed data are input into the second Reducer for processing, and so on.
The algorithmic details of this first MapReduce job are shown in Table 2; it is mainly responsible for concurrently sorting and grouping the document vectors. Each Mapper obtains input key-value pairs (i.e., a document ID and a document vector) from its input partition. The Mapper then produces the corresponding document vector composed of the normalized term weights. Next, the Mapper computes the values of the 1-norm ||di||1 and the ∞-norm ||di||∞. Finally, it outputs the intermediate key-value pairs (i.e., the 1-norm value, and a tuple composed of the document ID and the ∞-norm). Each Reducer takes the sorted intermediate key-value pairs as input, then sorts the documents by 1-norm value and groups them according to the group size parameter τG. Finally, it outputs key-value pairs, each comprising a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
Table 2
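A single-machine emulation of what this first job computes — both norms per document, an ascending sort by 1-norm, and a cut into groups of at most τG — offered as a sketch, not the actual MapReduce code of Table 2:

```python
def sort_and_group(docs, tau_g):
    # docs: {doc_id: sparse {term: weight} vector}.
    stats = sorted(
        ((doc_id,
          sum(abs(w) for w in vec.values()),      # 1-norm
          max(abs(w) for w in vec.values()))      # inf-norm
         for doc_id, vec in docs.items()),
        key=lambda rec: rec[1],                   # ascending by 1-norm
    )
    # Cut the ordered list into groups of at most tau_g records.
    return [stats[i:i + tau_g] for i in range(0, len(stats), tau_g)]
```

In the real system the sort is distributed across Reducers via TotalOrderPartitioner; the group boundaries here correspond to the group IDs emitted by the Reducers.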
2. Dissimilar document groups are found, thereby saving a large amount of time otherwise spent on unnecessary document similarity computation. Consider the dissimilar document group lookup for a given threshold μSIM = 0.5. When some group, by weighing its 1-norm and ∞-norm, is found to be dissimilar to other comment groups (e.g., g3 and g5 in Fig. 3), other dissimilar group pairs can be determined at the same time (e.g., g5 and g2, g5 and g1). For each group gi, the document with the maximum 1-norm value and the document with the maximum ∞-norm value are recorded. When determining the dissimilarity relation between two groups, the ∞-norm value of the maximal document in group gj (e.g., d10 in g5) is applied against the maximum 1-norm in the other group gi (e.g., d6 in g3). If the product of the maximum 1-norm of gi and the maximum ∞-norm of gj is below μSIM, no documents in the two groups (e.g., g3 and g5) can be similar. Since the earlier MapReduce job produced an ordered queue of all document groups, and the documents within each group are also arranged in ascending order of 1-norm value, comparing gi also establishes that the lower-ranked groups (e.g., g1 and g2) are dissimilar to gj: for any group gi−k, its maximum 1-norm does not exceed that of gi, so the product bound still falls below μSIM. In effect, the full ordering of document groups from step 1 significantly reduces the number of group comparisons in step 2. Furthermore, since the 1-norm and ∞-norm values of each group are computed independently in parallel, evaluating the product of two scalar norms rather than the dot product between document vectors saves significant computation time, particularly when processing big data with a very large vocabulary.
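The group-level pruning with early termination could look like the following sketch, assuming groups sorted in ascending order of 1-norm as produced by the first job:

```python
def dissimilar_group_pairs(groups, mu_sim):
    # groups: lists of (doc_id, norm1, norm_inf) records, groups sorted
    # ascending by their maximum 1-norm (output of the first job).
    # Because maximum 1-norms only grow with the group index, once one
    # pair exceeds the bound every later pair does too, so break early.
    pairs = []
    for j, gj in enumerate(groups):
        max_inf_j = max(rec[2] for rec in gj)
        for i in range(j):
            max_1_i = max(rec[1] for rec in groups[i])
            if max_1_i * max_inf_j < mu_sim:
                pairs.append((i, j))   # no document pair across gi, gj can be similar
            else:
                break
    return pairs
```

Only scalar products of pre-computed norms are evaluated here, which is the saving the text describes over per-document dot products.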
3. For each group gi ∈ G, the document partition is initialized by concurrently merging similar document groups and excluding dissimilar ones. Excluding dissimilar document groups greatly reduces the computation time of the similarity comment detection phase. However, the sizes of the document partitions at initialization may be very uneven, because some document groups may contain many potentially similar document groups while others have only a few. Unbalanced document partitions cause the execution performance of the parallel similarity comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the compute node running the longest similarity comment detection task.
4. A smoothing task is called to divide large partitions into several smaller ones according to the partition size limit τMaxG. The value of τMaxG is determined by the average local storage capacity of the compute cluster nodes. In general, if a compute node's local storage capacity is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes far more time than a local disk access; consequently, the total execution time may grow substantially. For each group gi ∈ G, if its initial document partition exceeds the τMaxG limit, the group gi serves as a seed and is merged with other potentially similar document groups so that no generated partition exceeds the limit. Finally, the similar comment detection tasks can be distributed evenly among all compute nodes in the cluster.
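A toy sketch of the smoothing step, reduced to splitting oversized partitions by document count (real partition sizes would be storage bytes, and the seed-merging of similar groups is omitted):

```python
def smooth_partitions(partition_sizes, tau_max_g):
    # partition_sizes: {partition_id: number of documents}.
    # Any partition larger than tau_max_g is split into chunks of at
    # most tau_max_g documents, so no single node gets an outsized task.
    result = {}
    next_id = 0
    for _, size in partition_sizes.items():
        while size > tau_max_g:
            result[next_id] = tau_max_g
            next_id += 1
            size -= tau_max_g
        result[next_id] = size
        next_id += 1
    return result
```

After smoothing, every partition fits within the limit, so detection tasks of roughly equal cost can be scheduled across the cluster.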
Similarity comment detection tasks are handled in a parallel fashion by invoking the compute nodes in the distributed cluster. For each document partition, a parallel MapReduce similarity comment detection task is started. Before the similarity computation, an inverted-index operation must be carried out within each local partition; the similarity comment job then runs over the index. Table 3 shows the Map and Reduce functions that compute document vector similarity within each partition. Each Mapper receives key-value pairs from the local partition: the key is a term t, and the value is a list of document IDs with the corresponding normalized term weights. When the Map function is called, each pair of document vectors is computed only once; the similarity between two vectors is computed only when they share at least one term in the inverted index. For each document vector di (with document ID ni), an associative array H keeps the similarity scores between di and its potentially similar documents. Finally, the Map function outputs the document ID ni with the associative array H, which minimizes the number of intermediate keys generated, and shuffles them. The Reduce function receives the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays. Finally, the Reduce function outputs key-value pairs comprising document IDs and the documents' final similarity scores within the local partition.
Table 3
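A single-machine emulation of the per-partition Map and Reduce computation described above, using a local inverted index (hypothetical Python, not the patent's Table 3 pseudocode):

```python
from collections import defaultdict

def map_phase(partition):
    # partition: {doc_id: unit-normalized sparse {term: weight} vector}.
    # Build a local inverted index, then emit one partial dot product
    # per term shared by a pair of documents.
    inverted = defaultdict(list)
    for doc_id, vec in partition.items():
        for term, w in vec.items():
            inverted[term].append((doc_id, w))
    for postings in inverted.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (di, wi), (dj, wj) = postings[a], postings[b]
                yield (min(di, dj), max(di, dj)), wi * wj

def reduce_phase(intermediate):
    # Sum the partial scores for each document pair.
    scores = defaultdict(float)
    for pair, partial in intermediate:
        scores[pair] += partial
    return dict(scores)
```

Only pairs sharing at least one term ever appear in the intermediate stream, which mirrors the inverted-index optimization: pairs with no common term implicitly get similarity zero at no cost.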
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. A typicality-based comment big data mining method, characterized by comprising the following steps:
(1) comment typicality mining modeling: the problems of comment typicality computation and minimum representative comment set mining are modeled and formally defined;
(2) automatic construction of typical comment prototypes: a comment typicality computation method is designed based on the "basic level concept" theory of cognitive psychology and multi-prototype theory, and the generation of comment prototypes is guided by the category utility in the "basic level concept" theory;
(3) minimum comment set mining: a minimum comment set mining algorithm is used to filter out a minimum comment set having the following features: each comment in the set is distinct and can represent the viewpoint of a considerable portion of users; all comments in the minimum comment set can cover and represent the viewpoints of all comments on a given product; and a user only needs to browse the comments in the minimum comment set to understand the user viewpoints of all comments on the product;
(4) the BigSimDet parallel computing method is used to handle similarity comment detection tasks in a parallel fashion by invoking the compute nodes in a distributed cluster.
2. The typicality-based comment big data mining method according to claim 1, characterized in that in step (1), the specific steps of comment typicality mining modeling are:
(1-1) all comments on some product x are regarded as one "concept", the "concept" being product x's comments, and each comment is one "instance" of this "concept"; each comment thus has a different typicality within the "concept"; in addition, from all comments on product x, a minimum representative comment set is extracted, the comment set having the following two attributes:
(1-1-1) the n comments contained in the set can represent the different types of viewpoints of all users to the fullest extent;
(1-1-2) the number of comments n in the set is as small as possible; a user only needs to browse the small number of n comments to comprehensively understand all viewpoints and opinions about product x;
(1-2) product comments are formally represented using "aspects":
$$\vec{p}_a = (s_{a,1}{:}v_{a,1},\ s_{a,2}{:}v_{a,2},\ \ldots,\ s_{a,k}{:}v_{a,k})$$
where sa,i is an "aspect" belonging to product a, and va,i is the sentiment polarity value for sa,i in the comment, i.e., the sentiment orientation value of some aspect;
(1-3) the comment typicality computation problem can be regarded as the following function:
χ:Ri→Ti
where Ri is the set of comments belonging to product i, and Ti is that comment set sorted by comment typicality;
for the minimum representative comment set mining problem, according to multi-prototype theory, the product comments are clustered first, and then one comment prototype is extracted from each comment class to represent that class of comments; therefore, all comments on product x can be represented by n comment prototypes:
$$\vec{t}_c = (\vec{t}_{c,1},\ \vec{t}_{c,2},\ \ldots,\ \vec{t}_{c,n})$$
where $\vec{t}_{c,j}$ is a comment prototype, which can be expressed as:
$$\vec{t}_{c,j} = (s^{c}_{j,1}{:}v^{c}_{j,1},\ s^{c}_{j,2}{:}v^{c}_{j,2},\ \ldots,\ s^{c}_{j,m}{:}v^{c}_{j,m})$$
(1-4) the minimum representative comment set mining problem can be expressed as a function:
θ:Ri→Li
where Ri is the set of comments belonging to product i, and Li is the minimum representative comment set of product i.
3. The typicality-based comment big data mining method according to claim 1, characterized in that in step (2), the specific method for the automatic construction of typical comment prototypes is:
(2-1) in multi-prototype theory, a concept can be represented by multiple abstract objects; these object representations are the prototypes, and each prototype represents one group of similar objects and is the abstract representation of those objects;
(2-2) according to research in cognitive psychology, the typicality of an object is determined by two factors: Central Tendency and Frequency of Instantiation; the former is the similarity between the object and the concept's members, i.e., if an object is very similar to the other object instances in some concept and very dissimilar to the concepts beyond that concept, then this object's typicality in that concept is very high; the latter is the frequency with which a person encounters a certain object and classifies it into some concept;
(2-3) the Central Tendency of an object in a concept is determined by two aspects: the similarity to the other objects in the concept, i.e., internal similarity, and the dissimilarity to the objects in other concepts, i.e., external dissimilarity; the internal similarity can be expressed as:
$$\beta(\vec{p}_a, \vec{t}_c) = \mathrm{sim}(\vec{p}_a, \vec{t}_{c,s})$$
where $\vec{p}_a$ denotes an object a, $\vec{t}_c$ denotes a concept c, and $\vec{t}_{c,s}$ is the prototype s most similar to a;
(2-4) different similarity functions are adopted to measure the similarity between a prototype and an object; the external dissimilarity of an object within some concept is regarded as the aggregated value of the object's dissimilarity to the other concepts, calculated with the following formula:
$$\delta(\vec{p}_a, \vec{t}_c) = \frac{\sum_{x} \mathrm{dissimilar}(\vec{p}_a, \vec{t}_{x,s})}{N_\Delta - 1}, \quad x \in C \ \text{and}\ x \neq c$$
(2-5) the internal similarity and the external dissimilarity are combined by an aggregation function to obtain the Central Tendency of an object a in a concept c;
the multiple objects belonging to some concept are clustered so as to obtain the multiple prototype representations of the concept; for the second factor influencing an object's typicality, Frequency of Instantiation, a prototype salience vector is defined to represent the instantiation frequencies of the different prototypes in a concept, as follows:
$$\vec{w}_c = (w_{c,1}, w_{c,2}, \ldots, w_{c,n}), \quad 0 < w_{c,i} \leq 1$$
each value in the vector is the number of objects contained in the cluster corresponding to one prototype, expressed as a percentage of all objects belonging to the concept;
after obtaining an object's Central Tendency and Frequency of Instantiation values in some concept, the two are combined by an aggregation function; the process is formally represented as follows:
$$\tau_c(a) = \Phi\big(w_{c,s},\ \alpha(\vec{p}_a, \vec{t}_c)\big);$$
(2-6) the objects need to be clustered during the construction of object prototypes, and the multiple prototypes of a concept are obtained by an automatic object clustering method; taking the "category utility" from cognitive psychology as the criterion, an automatic construction algorithm for a concept's multiple prototypes is designed, so that the generated prototypes all lie at the basic level of the concept, the basic level being a special kind of concept partition that people form during cognition; according to the theory of cognitive psychology, the Category Utility of a set of comments is as follows:
$$cu(C,T) = \frac{1}{m}\sum_{k=1}^{m} p(c_k)\left[\frac{\sum_{i=1}^{n_k} w_i\, p(t_i \mid c_k)^2}{n_k} - \frac{\sum_{i=1}^{n} w_i\, p(t_i)^2}{n}\right]$$
where C is the set of concepts, T is the set of aspects, $p(t_i \mid c_k)$ is the probability that a concept possesses aspect ti, p(ck) is the probability that an object belongs to a concept, and p(ti) is the probability that an object possesses aspect ti;
(2-7) the Basic Level concept mining algorithm is used to automatically mine the basic sub-concepts of some concept; using the algorithm, all comments on a product are automatically clustered into several classes according to the change in category utility; the clustering with the maximum category utility is finally chosen from the clustering results, and each class in that clustering belongs to the Basic Level.
4. The typicality-based comment big data mining method according to claim 1, characterized in that in step (3), the specific steps of the minimum comment set mining are:
(3-1) similar comment detection is carried out, and similar comments are removed from the set; the similar comment detection problem can be formalized as follows: product reviews are represented by a group of vectors D = {d1, ..., d|D|}; each vector di = <wi,1, wi,2, ..., wi,|L|> contains |L| term weights, where L is the vocabulary of the product review corpus, the vocabulary being the set of non-repeating words; each vector is normalized to unit length; if the cosine similarity of a pair of vectors is greater than or equal to a threshold μSIM, the pair is considered similar comments; given normalized vectors, the cosine similarity sim(di, dj) of a pair of vectors (di, dj) is the dot product of those vectors:
$$\mathrm{sim}(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$
(3-2) the similar comment detection problem is defined: given the set D of text vectors representing the corresponding product comments and the similarity threshold μSIM, the similar comment detection problem is to determine the comment pairs di, dj ∈ D whose similarity satisfies sim(di, dj) ≥ μSIM;
(3-3) a given similarity threshold is used to reduce the number of candidate pairs produced in the candidate set generation phase; if the weight of the maximal term in a document vector is too small, then its cosine similarity with any vector will be below the given threshold, so the full similarity computation need not be carried out; given two document vectors di, dj ∈ D, ||di||1 denotes the vector's 1-norm; in addition, the maximal term weight in a vector dj is ||dj||∞, also called the vector's ∞-norm; the following inequality then holds for sim(di, dj):
$$\mathrm{sim}(d_i,d_j) \leq \min\left(\|d_i\|_1 \times \|d_j\|_\infty,\ \|d_j\|_1 \times \|d_i\|_\infty\right)$$
if $\|d_i\|_1 \times \|d_j\|_\infty < \mu_{SIM}$, then certainly sim(di, dj) < μSIM, i.e., di and dj are not similar comments.
5. The typicality-based comment big data mining method according to claim 1, characterized in that in step (4), the specific steps of the BigSimDet parallel computing method are:
(4-1) the 1-norm ||di||1 and ∞-norm ||di||∞ values of each document vector di ∈ D are computed, and these vectors are distributed into groups according to the group size parameter τG of the MapReduce parallel computation model; an ordering (G, ≺) is established such that the vector groups gi ∈ G are sorted according to their 1-norm values;
(4-2) for each pair of vector groups gi, gj ∈ G with gi ≺ gj, the non-similarity relation is determined by assessing whether the product of gi's maximum 1-norm value and gj's maximum ∞-norm value is below the threshold μSIM;
(4-3) for each group gi ∈ G, the initial MapReduce partition is constructed to compute gi's similar pairs with the other vector groups; in this process, the non-similarity relations and the partition size limit τMaxG also need to be considered;
(4-4) a MapReduce task is executed according to the partition size limit τMaxG; if a vector group gi ∈ G has multiple potentially similar vector groups, gi is used as a seed and extra partitions are formed according to the partition size limit τMaxG;
(4-5) for each partition, a MapReduce parallel processing task is started to detect similar comments.
6. The typicality-based comment big data mining method according to claim 5, characterized in that in step (4-1), the specific implementation steps of the MapReduce parallel computation model are:
a MapReduce job is called to compute the 1-norm and ∞-norm values of each document vector; all vectors are then arranged in ascending order of their 1-norm values and placed into corresponding groups according to a predefined group size τG, the maximum group size τG being one of the system's parameters; the whole document collection D is divided evenly into several input partitions, and each Mapper concurrently processes the document vectors of its own partition; running the Mappers yields intermediate key-value pairs, i.e., 1-norm values and document IDs, which are then shuffled and input into the Reducers; by using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key in the first Reducer to narrow the processing range, the narrowed data are input into the second Reducer for processing, and so on;
wherein the specific implementation process of the MapReduce job is as follows: a Mapper obtains each input key-value pair, i.e., a document ID and a document vector, from its input partition; the Mapper then produces the corresponding document vector composed of the normalized term weights; next, the Mapper computes the values of the 1-norm ||di||1 and the ∞-norm ||di||∞; finally, it outputs the intermediate key-value pairs, i.e., the 1-norm value and a tuple composed of the document ID and the ∞-norm; in addition, each Reducer takes the sorted intermediate key-value pairs as input, then sorts the documents by 1-norm value according to the group size parameter τG; finally, it outputs key-value pairs, each comprising a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
7. The typicality-based comment big data mining method according to claim 5, characterized in that step (4-2) is specifically:
when some group, by weighing its 1-norm and ∞-norm, is found to be dissimilar to other groups, other dissimilar document group pairs can be determined at the same time; for each group gi, the document with the maximum 1-norm value and the document with the maximum ∞-norm value are recorded; when determining the dissimilarity relation between two groups, the maximum ∞-norm value in group gj is applied against the maximum 1-norm in the other group gi; if their product is below μSIM, the documents in the two groups cannot be similar; since the earlier MapReduce job produced an ordered queue of all document groups, and the documents in each group are also arranged in ascending order of 1-norm value, comparing gi also establishes that the lower-ranked groups are dissimilar to gj.
8. The typicality-based comment big data mining method according to claim 5, characterized in that step (4-3) is specifically:
for each group gi ∈ G, the document partition is initialized by concurrently merging similar document groups and excluding dissimilar ones; excluding dissimilar document groups greatly reduces the computation time of the similarity comment detection phase; however, the sizes of the document partitions at initialization may be very uneven, because some document groups may contain many potentially similar document groups while others have only a few; unbalanced document partitions cause the execution performance of the parallel similarity comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the compute node running the longest similarity comment detection task.
9. The typicality-based comment big data mining method according to claim 5, characterized in that in step (4-4),
a smoothing task is called to divide large partitions into several smaller ones according to the partition size limit τMaxG; the value of τMaxG is determined by the average local storage capacity of the compute cluster nodes; if a compute node's local storage capacity is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes far more time than a local disk access, so the total execution time may grow substantially; for each group gi ∈ G, if its initial document partition exceeds the τMaxG limit, the group gi serves as a seed and is merged with other potentially similar document groups so that no generated partition exceeds the limit; finally, the similar comment detection tasks can be distributed evenly among all compute nodes in the cluster.
10. The typicality-based comment big data mining method according to claim 5, characterized in that in step (4-5),
similarity comment detection tasks are handled in a parallel fashion by invoking the compute nodes in the distributed cluster; for each document partition, a parallel MapReduce similarity comment detection task is started; before the similarity computation, an inverted-index operation needs to be carried out within each local partition, and the comment similarity computation is performed after indexing; the Map and Reduce functions compute document vector similarity within each partition; each Mapper receives key-value pairs from the local partition, the key being a term t and the value being a list of document IDs with the corresponding normalized term weights; when the Map function is called, each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one term in the inverted index; for each document vector di, an associative array H keeps the similarity scores between di and its potentially similar documents; finally, the Map function outputs the document ID ni with the associative array H, which minimizes the number of intermediate keys generated, and shuffles them; in addition, the Reduce function receives the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays; finally, the Reduce function outputs key-value pairs comprising document IDs and the documents' final similarity scores within the local partition.
CN201410796566.XA 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness Active CN104462480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410796566.XA CN104462480B (en) 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410796566.XA CN104462480B (en) 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness

Publications (2)

Publication Number Publication Date
CN104462480A CN104462480A (en) 2015-03-25
CN104462480B true CN104462480B (en) 2017-11-10

Family

ID=52908515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410796566.XA Active CN104462480B (en) 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness

Country Status (1)

Country Link
CN (1) CN104462480B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955957B (en) * 2016-05-05 2019-01-25 北京邮电大学 The determination method and device that aspect scores in a kind of businessman's general comment
CN106446264B (en) * 2016-10-18 2019-08-27 哈尔滨工业大学深圳研究生院 Document representation method and system
CN109903851B (en) * 2019-01-24 2023-05-23 暨南大学 Automatic observation method for psychological abnormal change based on social network

Citations (3)

Publication number Priority date Publication date Assignee Title
CN102945268A (en) * 2012-10-25 2013-02-27 北京腾逸科技发展有限公司 Method and system for excavating comments on characteristics of product
CN103399916A (en) * 2013-07-31 2013-11-20 清华大学 Internet comment and opinion mining method and system on basis of product features
CN103778214A (en) * 2014-01-16 2014-05-07 北京理工大学 Commodity property clustering method based on user comments

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20130080208A1 (en) * 2011-09-23 2013-03-28 Fujitsu Limited User-Centric Opinion Analysis for Customer Relationship Management


Non-Patent Citations (2)

Title
Gernot Horstmann, "Facial Expressions of Emotion: Does the Prototype Represent Central Tendency, Frequency of Instantiation, or an Ideal?", Emotion, vol. 2, no. 3, 31 Dec. 2002, pp. 297-305 *
Lawrence W. Barsalou, "Ideals, Central Tendency, and Frequency of Instantiation as Determinants of Graded Structure in Categories", Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 11, no. 4, 31 Oct. 1985, pp. 629-654 *

Also Published As

Publication number Publication date
CN104462480A (en) 2015-03-25

Similar Documents

Publication Publication Date Title
US10019442B2 (en) Method and system for peer detection
CN103678672B (en) Method for recommending information
CN105320719B (en) A kind of crowd based on item label and graphics relationship raises website item recommended method
CN108920527A (en) A kind of personalized recommendation method of knowledge based map
Li et al. A comparative analysis of evolutionary and memetic algorithms for community detection from signed social networks
CN106127546A (en) A kind of Method of Commodity Recommendation based on the big data in intelligence community
CN105760443B (en) Item recommendation system, project recommendation device and item recommendation method
Calders et al. What is data mining and how does it work?
CN107895038A (en) A kind of link prediction relation recommends method and device
Halibas et al. Determining the intervening effects of exploratory data analysis and feature engineering in telecoms customer churn modelling
CN106547864A (en) A kind of Personalized search based on query expansion
Tiwari et al. Amalgamating contextual information into recommender system
CN104462480B (en) Comment big data method for digging based on typicalness
CN105447117B (en) A kind of method and apparatus of user&#39;s cluster
Kaur et al. Advanced eclat algorithm for frequent itemsets generation
Sharma et al. Predicting purchase probability of retail items using an ensemble learning approach and historical data
Sethi et al. Data mining: current applications & trends
CN106649380A (en) Hot spot recommendation method and system based on tag
CN113762703A (en) Method and device for determining enterprise portrait, computing equipment and storage medium
CN113569162A (en) Data processing method, device, equipment and storage medium
Kumar et al. Cuisine prediction based on ingredients using tree boosting algorithms
Khanday et al. A comparative analysis of identifying influential users in online social networks
Muruganantham et al. Discovering and ranking influential users in social media networks using Multi-Criteria Decision Making (MCDM) Methods
CN104572880B (en) The Parallel Implementation method and system of collaborative filtering based on user
CN104102654B (en) A kind of method and device of words clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant