CN104462480B - Typicality-based review big data mining method - Google Patents

Typicality-based review big data mining method

Info

Publication number
CN104462480B
CN104462480B CN201410796566.XA
Authority
CN
China
Prior art keywords
comment
concept
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410796566.XA
Other languages
Chinese (zh)
Other versions
CN104462480A (en)
Inventor
刘耀强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to CN201410796566.XA priority Critical patent/CN104462480B/en
Publication of CN104462480A publication Critical patent/CN104462480A/en
Application granted granted Critical
Publication of CN104462480B publication Critical patent/CN104462480B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90Details of database functions independent of the retrieved data types
    • G06F16/95Retrieval from the web
    • G06F16/953Querying, e.g. by the use of web search engines
    • G06F16/9535Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a typicality-based method for mining review big data, comprising the following steps: (1) modeling review-typicality mining: modeling and formally defining the problems of review typicality computation and minimal representative review set mining; (2) automatically building typicality review prototypes; (3) mining the minimal review set: using a minimal-review-set mining algorithm to select a minimal review set; (4) using the BigSimDet parallel computation method, which invokes the compute nodes of a distributed cluster to process similar-review detection tasks in parallel. The present invention studies how to measure the typicality of user reviews from the two angles of cognitive psychology and opinion mining, and on that basis mines a minimal representative review set, thereby helping potential buyers of a product understand it more comprehensively and from multiple angles, helping users screen the products they need more accurately, and improving the users' shopping experience.

Description

Typicality-based review big data mining method
Technical field
The present invention relates to the field of data mining, and more particularly to a typicality-based method for mining review big data.
Background technology
With the rapid development of China's Internet, reviews published on e-commerce websites, social networks, and online forums have grown explosively. This review big data, measured in petabytes (PB), reveals users' personal opinions on a broad range of subjects such as consumer products, organizations, people, and social events. Product reviews not only let enterprises understand the real needs of their existing or potential customers, but also provide useful guidance for consumers' shopping decisions. According to 2014 CNNIC data, more than 90% of online shoppers leave reviews under products on shopping websites, and more than half report that they read related product reviews before each purchase. For example, Ctrip provides a platform on which customers can publish reviews of the hotels they have stayed in; the hotel reviews published on the platform not only give other customers a reference for choosing a suitable hotel, but also let hotel management continuously improve its service according to the online feedback, thereby attracting more guests from home and abroad. In addition, analyzing such online opinions may help government departments quickly and broadly understand the conditions of the people in different regions, and learn the public's views and opinions on government policy or community development. In short, from the user's perspective, reviews help a user understand a product comprehensively and from multiple angles, so as to decide whether to buy it; they also let users learn which products can meet their needs. From the enterprise's perspective, manufacturers and service providers need to know users' opinions of their products, i.e. which features are strengths and which are weaknesses from the standpoint of user experience; this helps producers obtain more complete user feedback and thus improve their goods and services. In summary, online reviews contain rich and valuable information that is well worth deep mining and analysis.
Although online reviews are highly important to enterprises, regulators, and product users, in the big data era it is practically impossible to manually browse and analyze such a huge volume of online reviews. Traditional review mining methods struggle to analyze and summarize review big data in real time, and the resulting analyses are unsatisfactory. Building an intelligent online review opinion mining system for the big data setting therefore has high research and application value. For example, by mining a minimal representative review set from review big data, a system can let its users quickly grasp the different viewpoints in the reviews, and thus quickly and effectively monitor market trends or the conditions of the people in different regions.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a typicality-based method for mining review big data. The method uses the Basic Level Concept theory and the multi-prototype theory of cognitive psychology to design the computation of review typicality, so as to mine a minimal representative review set, and processes review big data mining in parallel on the Hadoop platform.
To achieve the above object, the present invention adopts the following technical scheme:
The typicality-based review big data mining method comprises the following steps:
(1) modeling review-typicality mining: modeling and formally defining the problems of review typicality computation and minimal representative review set mining;
(2) automatically building typicality review prototypes: designing the review typicality computation method based on the Basic Level Concept theory and the multi-prototype theory of cognitive psychology, and guiding the generation of review prototypes with the category utility from the Basic Level Concept theory;
(3) mining the minimal review set: using the minimal-review-set mining algorithm to select a minimal review set with the following features: every review in the set is distinct and represents the viewpoint of a sizable share of users; the reviews in the minimal set together cover and represent the viewpoints of all reviews of the product; and a user need only browse the reviews in this minimal set to understand the user viewpoints of all reviews of the product;
(4) using the BigSimDet parallel computation method: invoking the compute nodes of a distributed cluster to process similar-review detection tasks in parallel.
Preferably, in step (1), the specific steps of modeling review-typicality mining are:
(1-1) treating all reviews of a product x as one "concept", the concept being the reviews of product x, with each review being one "example" of this concept, so that each review has a different typicality within the concept; further, from all the reviews of product x, extracting a minimal representative review set having the following two attributes:
(1-1-1) the n reviews contained in the set represent, to the greatest possible extent, the different types of viewpoints of all users;
(1-1-2) the number of reviews n in the set is as small as possible, so that a user need only browse these few n reviews to gain a fairly complete understanding of all viewpoints and opinions about product x;
(1-2) representing a product review formally in terms of "aspects", for example as r_a = {(s_a,1, v_a,1), ..., (s_a,m, v_a,m)},
where s_a,i is an "aspect" belonging to product a, and v_a,i is the sentiment polarity value for s_a,i in the review, i.e. the sentiment orientation toward that aspect.
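The aspect-based representation above can be sketched in code. This is a minimal illustration, not from the patent text; the aspect names and polarity values are made up:

```python
# Illustrative sketch: a review of product a maps each aspect s_a_i
# to its sentiment polarity value v_a_i (assumed here to lie in [-1, 1]).

def make_review(aspect_sentiments):
    """Build the aspect -> sentiment-polarity mapping for one review."""
    return dict(aspect_sentiments)

review = make_review([("battery", 0.8), ("screen", -0.3), ("price", 0.5)])

# Aspects the reviewer felt positively about:
positive_aspects = [s for s, v in review.items() if v > 0]
```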
(1-3) the review typicality computation problem can be viewed as the following function:
χ: Ri → Ti
where Ri is the set of reviews belonging to product i, and Ti is the same review set ordered by review typicality;
For the minimal representative review set mining problem, following the multi-prototype theory, the product reviews are first clustered, and then one review prototype is extracted from each review cluster to represent that cluster; hence all reviews of product x can be represented by n review prototypes, i.e. Rx ≈ {p1, p2, ..., pn},
where each pj is a review prototype, itself expressible in the aspect-based representation of step (1-2).
(1-4) the minimal representative review set mining problem can be expressed as a function:
θ: Ri → Li
where Ri is the set of reviews belonging to product i, and Li is the minimal representative review set of product i.
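The cluster-then-extract idea behind the prototypes can be sketched as follows. This is an illustrative sketch under assumed conventions (not the patented algorithm): clusters are given, vectors are plain lists, and the prototype of a cluster is taken to be the member closest to the cluster centroid:

```python
import math

def centroid(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(v[k] for v in vectors) / n for k in range(len(vectors[0]))]

def dist(u, v):
    """Euclidean distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def prototypes(clusters):
    """One representative member per cluster: the member nearest the centroid."""
    result = []
    for cluster in clusters:
        c = centroid(cluster)
        result.append(min(cluster, key=lambda v: dist(v, c)))
    return result

# Two toy review clusters; each inner list is a review feature vector.
clusters = [
    [[1.0, 0.0], [0.8, 0.2], [0.9, 0.0]],
    [[0.0, 1.0], [0.2, 0.8], [0.0, 0.9]],
]
reps = prototypes(clusters)  # one representative review vector per cluster
```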
Preferably, in step (2), the specific method of automatically building typicality review prototypes is:
(2-1) in the multi-prototype theory, a concept can be represented by several abstract objects called prototypes; each prototype represents one group of similar objects and is the abstract representation of those objects;
(2-2) according to cognitive psychology research, the typicality of an object is determined by two factors: Central Tendency and Frequency of Instantiation. The former is the similarity of the object to the members of its concept: if an object is very similar to the other object instances in a concept and very dissimilar to the other concepts, then its typicality in that concept is high. The latter is the frequency with which a person encounters a certain object and classifies it into a certain concept;
(2-3) the Central Tendency of an object within a concept is determined by two aspects: its similarity to the other objects in the concept, i.e. the internal similarity, and its dissimilarity to the objects in other concepts, i.e. the external dissimilarity. The internal similarity can be expressed as InternalSim(a, c) = sim(a, s),
where a denotes an object, c denotes a concept, and s is the prototype in c most similar to a;
(2-4) different similarity functions can be adopted to measure the similarity between a prototype and an object; the external dissimilarity of an object with respect to a concept is regarded as an aggregate of the object's dissimilarities to the other concepts, computed by aggregating the dissimilarity values between the object and every other concept;
(2-5) an aggregation function integrates the internal similarity and the external dissimilarity to obtain the Central Tendency of object a in concept c;
For the multiple objects belonging to a concept, these objects are clustered so as to obtain multiple prototype representations of the concept. For the second factor influencing object typicality, Frequency of Instantiation, a prototype saliency vector is defined to represent the Frequency of Instantiation of the different prototypes within a concept: each value in the vector is the number of objects contained in the cluster corresponding to one prototype, expressed as a percentage of the total number of objects belonging to the concept;
After obtaining an object's Central Tendency and Frequency of Instantiation values in a concept, the two are integrated by an aggregation function into the object's typicality score;
(2-5) building the object prototypes requires clustering the objects; multiple prototypes within a concept are obtained by an automatic object-clustering method. The "category utility" from cognitive psychology is used as the criterion for designing the automatic construction algorithm of a concept's multiple prototypes, so that the generated prototypes all lie at the basic level of the concept, the basic level being a special level of concept partitioning that people use in cognition. Following cognitive psychology theory, the Category Utility over reviews is
CU(C) = (1/|C|) Σ_{ck ∈ C} p(ck) Σ_{ti ∈ F} [p(ti|ck)² − p(ti)²]
where C is the concept set, F is the aspect set, p(ti|ck)² is the squared probability that concept ck possesses aspect ti, p(ck) is the probability that an object belongs to concept ck, and p(ti) is the probability that an object possesses aspect ti;
(2-6) the Basic Level concept mining algorithm automatically mines the basic-level sub-concepts of a concept; using this algorithm, all reviews of a product are automatically clustered into several classes according to the change in category utility, the clustering result with the maximum category utility is finally chosen, and each class in that clustering belongs to the Basic Level.
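The category utility measure named above can be sketched in code. This follows the standard Gluck–Corter-style definition matching the terms named in the text (p(ck), p(ti|ck)², p(ti)); it is an illustrative sketch with made-up aspect data, since the patent's exact variant is not fully reproduced here:

```python
def category_utility(clusters, aspects):
    """clusters: list of clusters; each cluster is a list of reviews,
    and each review is a set of the aspects it mentions."""
    n = sum(len(c) for c in clusters)
    all_reviews = [r for c in clusters for r in c]
    # p(t_i): probability that a review possesses aspect t_i.
    p_t = {t: sum(1 for r in all_reviews if t in r) / n for t in aspects}
    cu = 0.0
    for c in clusters:
        p_c = len(c) / n  # p(c_k)
        inner = 0.0
        for t in aspects:
            p_t_c = sum(1 for r in c if t in r) / len(c)  # p(t_i | c_k)
            inner += p_t_c ** 2 - p_t[t] ** 2
        cu += p_c * inner
    return cu / len(clusters)

clusters = [
    [{"battery", "price"}, {"battery"}],   # reviews about battery
    [{"screen"}, {"screen", "price"}],     # reviews about screen
]
cu = category_utility(clusters, ["battery", "screen", "price"])
```

A clustering that groups reviews sharing the same aspects yields a higher category utility than one that scatters them, which is what drives the basic-level clustering.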
Preferably, in step (3), the specific steps of mining the minimal review set are:
(3-1) performing similar-review detection and removing similar reviews from the set; the similar-review detection problem can be formalized as follows: the product reviews are represented by a set of vectors D = {d1, d2, ..., d|D|}; each vector di = <wi,1, wi,2, ..., wi,|L|> contains |L| term weights, where |L| is the total vocabulary size of the review corpus, the vocabulary being the set of distinct terms; each vector is normalized to unit length; if the cosine similarity of a pair of vectors is greater than or equal to the threshold μSIM, the pair is considered a similar review pair; given normalized vectors, the cosine similarity sim(di, dj) of a pair of vectors (di, dj) is the dot product of the vectors: sim(di, dj) = di · dj = Σt wi,t × wj,t;
(3-2) Definition 1 (similar-review detection problem): given the set D of text vectors representing the reviews of a product and a similarity threshold μSIM, the similar-review detection problem is to find all review pairs di, dj ∈ D with similarity sim(di, dj) ≥ μSIM;
(3-3) using a given similarity threshold to reduce the number of candidate pairs produced in the candidate generation phase: if the weight of the largest term in a document vector is too small, then its cosine similarity with any other vector will be below the given threshold, and the similarity computation can be skipped. Given two document vectors di, dj ∈ D, let ||di||1 denote the 1-norm of a vector; further, the maximum term weight in a vector dj, written ||dj||∞, is also called the vector's ∞-norm. Then sim(di, dj) satisfies the following inequality:
sim(di, dj) ≤ min(||di||1 × ||dj||∞, ||dj||1 × ||di||∞)
If ||di||1 × ||dj||∞ < μSIM, then certainly sim(di, dj) < μSIM, i.e. di and dj are not similar reviews.
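The pruning rule above can be sketched in code. This is an illustrative sketch with made-up vectors: for unit-length vectors the cosine similarity is the dot product, and the norm-product bound lets a pair be skipped safely whenever the bound falls below μSIM:

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(di, dj):
    # Dot product; equals cosine similarity for unit-length vectors.
    return sum(a * b for a, b in zip(di, dj))

def norm1(v):
    return sum(abs(x) for x in v)

def norm_inf(v):
    return max(abs(x) for x in v)

def may_be_similar(di, dj, mu_sim):
    # sim(di, dj) <= min(||di||_1 * ||dj||_inf, ||dj||_1 * ||di||_inf),
    # so a pair whose bound is below mu_sim cannot be similar.
    bound = min(norm1(di) * norm_inf(dj), norm1(dj) * norm_inf(di))
    return bound >= mu_sim

di = normalize([1.0, 1.0, 0.0])
dj = normalize([1.0, 0.0, 1.0])
sim = cosine(di, dj)                            # ~0.5 for these vectors
candidate = may_be_similar(di, dj, mu_sim=0.9)  # bound does not prune here
```

The bound is safe but not tight: here the pair survives pruning even though its true similarity is below the threshold, so the exact dot product is still computed for surviving candidates.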
Preferably, in step (4), the specific steps of the BigSimDet parallel computation method are:
(4-1) computing the 1-norm ||di||1 and ∞-norm ||di||∞ of each document vector di ∈ D, and distributing the vectors into groups according to the group size parameter τG of the MapReduce parallel computation model; an ordering is established so that the vectors within each group gi ∈ G are sorted by their 1-norm values;
(4-2) for each pair of vector groups gi, gj ∈ G with gi ≠ gj, determining their non-similarity relation by assessing whether the product of the maximum 1-norm value of gi and the maximum ∞-norm value of gj exceeds the threshold μSIM;
(4-3) for each group gi ∈ G, initializing a MapReduce partition to compute the similar pairs between gi and the other vector groups; in this process, the non-similarity relations and the partition size limit τMaxG must also be considered;
(4-4) according to the partition size limit τMaxG, executing the MapReduce tasks: if a vector group gi ∈ G has multiple potentially similar vector groups, gi is used as a seed, and extra partitions are formed according to the partition size limit τMaxG;
(4-5) for each partition, starting one parallel MapReduce processing task to detect similar reviews.
Preferably, in step (4-1), the specific implementation steps of the MapReduce parallel computation model are:
A MapReduce job is invoked to compute the 1-norm and ∞-norm value of each document vector. All vectors are then arranged in ascending order of their 1-norm values and, according to the predefined group size τG, placed into corresponding groups; the maximum group size τG is one of the system parameters. The whole document collection D is evenly divided into several input partitions, and each Mapper concurrently processes the document vectors of its own partition; the intermediate key-value pairs produced by the Mappers, i.e. 1-norm values and document IDs, are shuffled and then input into the Reducers. By using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted in the first Reducer according to the magnitude of each key value to narrow the processing range, and the range-narrowed data are then input into the second Reducer for processing, and so on.
The specific implementation of the MapReduce job is as follows: a Mapper reads each input key-value pair, i.e. a document ID and a document vector, from its input partition; the Mapper then produces the corresponding document vector composed of normalized term weights; next, the Mapper computes the 1-norm ||di||1 and ∞-norm ||di||∞ values; finally, it outputs the intermediate key-value pairs, i.e. the 1-norm value as key and a tuple composed of the document ID and the ∞-norm as value. In addition, each Reducer takes the sorted intermediate key-value pairs as input, then sorts the documents by 1-norm value according to the group size parameter τG, and finally outputs key-value pairs, each containing a group ID and a tuple holding the ordered document ID list with the corresponding 1-norm and ∞-norm values.
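The norm computation, sorting, and grouping performed by the job above can be sketched sequentially. This is an illustrative single-process sketch with made-up vectors, not the distributed implementation; each output entry corresponds to one emitted (id, 1-norm, ∞-norm) tuple:

```python
def group_by_norm(docs, tau_g):
    """docs: dict mapping doc_id -> vector (list of weights).
    Returns a list of groups, each a list of (doc_id, l1_norm, inf_norm)
    tuples, with documents sorted by ascending 1-norm and at most
    tau_g documents per group."""
    stats = [(doc_id, sum(abs(x) for x in v), max(abs(x) for x in v))
             for doc_id, v in docs.items()]
    stats.sort(key=lambda s: s[1])  # ascending 1-norm, as in the job
    return [stats[k:k + tau_g] for k in range(0, len(stats), tau_g)]

docs = {"d1": [0.2, 0.1], "d2": [0.9, 0.4], "d3": [0.5, 0.5]}
groups = group_by_norm(docs, tau_g=2)  # [[d1, d3], [d2]] by 1-norm order
```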
Preferably, step (4-2) is specifically:
When a group is found, by weighing 1-norms and ∞-norms, to be dissimilar to other reviews, the other dissimilar document groups can be determined at the same time. For each group gi, the document with the maximum 1-norm value and the document with the maximum ∞-norm value are recorded. To determine the dissimilarity relation between two groups, the maximum ∞-norm value in group gj is combined with the maximum 1-norm value in group gi: if their product is below μSIM, the documents in the two groups cannot be similar. Because the earlier MapReduce job has produced an ordered queue of all document groups, with the documents in each group also arranged in ascending order of 1-norm value, the groups ranked lower than gi (such as g1 and g2) are likewise determined to be dissimilar to gj.
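The group-level dissimilarity check above can be sketched as follows, reusing (doc_id, 1-norm, ∞-norm) tuples; the example data are made up:

```python
def groups_may_match(gi, gj, mu_sim):
    """gi, gj: lists of (doc_id, l1_norm, inf_norm) tuples.
    Two groups cannot contain a similar pair if, in both directions,
    (max 1-norm of one group) * (max inf-norm of the other) < mu_sim."""
    max_l1_i = max(s[1] for s in gi)
    max_inf_i = max(s[2] for s in gi)
    max_l1_j = max(s[1] for s in gj)
    max_inf_j = max(s[2] for s in gj)
    return min(max_l1_i * max_inf_j, max_l1_j * max_inf_i) >= mu_sim

gi = [("d1", 0.3, 0.2)]
gj = [("d2", 1.3, 0.9), ("d3", 1.0, 0.5)]
skip = not groups_may_match(gi, gj, mu_sim=0.5)  # whole group pair pruned
self_match = groups_may_match(gj, gj, mu_sim=0.5)
```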
Preferably, step (4-3) is specifically:
For each group gi ∈ G, dissimilar document groups are excluded by merging the potentially similar document groups in parallel, so as to initialize the document partition; excluding dissimilar document groups greatly reduces the computation time of the similar-review detection phase. However, the sizes of the initialized document partitions may be very uneven, because some document groups may have many potentially similar document groups while others have only a few. Unbalanced document partitions sharply degrade the execution performance of the parallel similar-review detection tasks under MapReduce, because the total execution time of a similar-review detection job is governed by the compute node running the longest similar-review detection task.
Preferably, in step (4-4),
a balancing task is invoked to divide large partitions into several smaller ones according to the partition size limit τMaxG. The value of τMaxG is determined by the average local storage capacity of the cluster's compute nodes: if the local storage capacity of a compute node is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes much more time than a local disk access, so the total execution time may increase greatly. For each group gi ∈ G, if its initial document partition exceeds the τMaxG limit, gi is used as a seed and is merged with the other potentially similar document groups such that no generated partition exceeds the limit; finally, the similar-review detection tasks can be evenly distributed among all compute nodes of the cluster.
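The seed-based partition splitting can be sketched as follows. This is an illustrative sketch under assumed conventions: the seed group is replicated into every partition, and each partition holds at most tau_max_g groups:

```python
def split_partition(seed, partners, tau_max_g):
    """Split an oversized partition: return partitions of the form
    [seed] + chunk, where each chunk holds at most (tau_max_g - 1)
    partner groups, so no partition exceeds tau_max_g groups."""
    chunk = tau_max_g - 1
    return [[seed] + partners[k:k + chunk]
            for k in range(0, len(partners), chunk)]

# g1 has four potentially similar groups but each partition may hold
# at most three groups, so g1 is replicated as the seed of two partitions.
parts = split_partition("g1", ["g2", "g3", "g4", "g5"], tau_max_g=3)
```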
Preferably, in step (4-5),
similar-review detection tasks are processed in parallel by invoking the compute nodes in the distributed cluster. For each document partition, one parallel MapReduce similar-review detection task is started. Before the similarity computation, an inverted-index operation must be performed within each local partition, and the similar-review detection job runs after the indexing. In the Map and Reduce functions that compute document vector similarity within each partition, each Mapper receives local key-value pairs whose key is a term t and whose value is a list of document IDs with the corresponding normalized term weights. When the Map function is called, each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one identical term in the inverted index. For each document vector di, an associative array H keeps the potentially similar documents and their similarity scores with di. Finally, the Map function outputs the intermediate keys, i.e. document IDs, together with the associative array H, and shuffles them. The Reduce function receives the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays; finally, the Reduce function outputs key-value pairs containing the document IDs and the final similarity scores of the documents in the local partition.
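The inverted-index similarity computation can be sketched in a single process as follows. This is an illustrative sketch (not the distributed implementation) with made-up documents: a term-to-postings index is built, and dot-product contributions are accumulated only for document pairs sharing at least one term, mirroring the accumulator H:

```python
from collections import defaultdict

def similar_pairs(docs, mu_sim):
    """docs: dict mapping doc_id -> {term: normalized weight}.
    Returns {(id_a, id_b): score} for pairs with score >= mu_sim."""
    index = defaultdict(list)            # inverted index: term -> [(id, w)]
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    scores = defaultdict(float)          # accumulator over candidate pairs
    for postings in index.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (i, wi), (j, wj) = postings[a], postings[b]
                scores[(min(i, j), max(i, j))] += wi * wj
    return {pair: s for pair, s in scores.items() if s >= mu_sim}

docs = {
    "d1": {"good": 0.8, "battery": 0.6},
    "d2": {"good": 0.8, "battery": 0.6},
    "d3": {"bad": 1.0},
}
pairs = similar_pairs(docs, mu_sim=0.9)  # d1 and d2 are near-duplicates
```

Documents sharing no terms (such as d3 here) never become candidates, which is what makes the inverted index cheaper than all-pairs comparison.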
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The present invention is an interdisciplinary research method: it applies the multi-prototype concept representation theory and the Basic Level Concept theory of cognitive psychology to measure the typicality of a review, disclosing a new review typicality computation method. In addition, the category utility proposed by cognitive psychologists is used as the clustering objective to guide the generation of review prototypes, so that the prototypes produced by clustering, and the review typicality computed from them, are closer to people's true cognition.
(2) Unlike existing opinion mining techniques, which mainly focus on sentiment analysis and opinion summarization of reviews, the present invention addresses the new problem of mining a minimal representative review set from review big data. By mining the minimal representative review set, users can easily and comprehensively grasp the overall picture of all reviews and the diversity of viewpoints in them without browsing the full review big data; this fills the research gap on the problem and further strengthens the reference value of significant reviews in review big data for users.
(3) The research results disclosed by the invention can help potential customers understand a product more comprehensively and from multiple angles, help users screen the products they need more accurately, and improve the users' shopping experience. Moreover, from the perspective of manufacturers and service providers, they can more fully understand users' views of their products, i.e. which features are strengths and which are weaknesses from the standpoint of user experience; this helps producers obtain more complete user feedback, improve their goods, and increase sales. In addition, the disclosed method may also help government departments understand the conditions of the people in different regions more quickly, broadly, and comprehensively, and learn both the typical views and the various representative views of the public on government policy.
(4) The method disclosed by the invention operates on web review big data: the disclosed review typicality computation method and minimal representative review mining algorithm are implemented in parallel and in a distributed manner under Hadoop and MapReduce, so as to cope with applications in the big data environment.
Brief description of the drawings
Fig. 1 is the processing flowchart of the mining method of the present invention;
Fig. 2 shows the review-vector sorting and grouping scheme;
Fig. 3 shows the computation of dissimilarity between document vector groups.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, the typicality-based review big data mining method comprises the following steps:
(1) modeling review-typicality mining: modeling and formally defining the problems of review typicality computation and minimal representative review set mining;
(2) automatically building typicality review prototypes: designing the review typicality computation method based on the Basic Level Concept theory and the multi-prototype theory of cognitive psychology, and guiding the generation of review prototypes with the category utility from the Basic Level Concept theory;
(3) mining the minimal review set: using the minimal-review-set mining algorithm to select a minimal review set with the following features: every review in the set is distinct and represents the viewpoint of a sizable share of users; the reviews in the minimal set together cover and represent the viewpoints of all reviews of the product; and a user need only browse the reviews in this minimal set to understand the user viewpoints of all reviews of the product;
(4) using the BigSimDet parallel computation method: invoking the compute nodes of a distributed cluster to process similar-review detection tasks in parallel.
In this embodiment, in the modeling of review-typicality mining, a product review is formally represented in terms of "aspects", for example as r_a = {(s_a,1, v_a,1), ..., (s_a,m, v_a,m)},
where s_a,i is an "aspect" belonging to product a, and v_a,i is the sentiment polarity value for s_a,i in the review, i.e. the sentiment orientation toward that aspect.
The review typicality computation problem can be viewed as the following function:
χ: Ri → Ti
where Ri is the set of reviews belonging to product i, and Ti is the same review set ordered by review typicality.
For the minimal representative review set mining problem, following the multi-prototype theory, the product reviews are first clustered, and then one review prototype is extracted from each review cluster to represent that cluster. Hence all reviews of product x can be represented by n review prototypes, i.e. Rx ≈ {p1, p2, ..., pn},
where each pj is a review prototype, itself expressible in the aspect-based representation above.
The minimal representative review set mining problem can be expressed as a function:
θ: Ri → Li
where Ri is the set of reviews belonging to product i, and Li is the minimal representative review set of product i.
In this embodiment, automatically building typicality review prototypes comprises the following steps:
1. The Central Tendency of an object within a concept is determined by two aspects: its similarity to the other objects in the concept (internal similarity), and its dissimilarity to the objects in other concepts (external dissimilarity). The internal similarity can be expressed as InternalSim(a, c) = sim(a, s),
where a denotes an object, c denotes a concept, and s is the prototype in c most similar to a.
2. To measure the similarity between a prototype and an object, the present invention adopts different similarity functions, such as cosine similarity or Jaccard similarity. The external dissimilarity of an object with respect to a concept is regarded as an aggregate of the object's dissimilarities to the other concepts, computed by aggregating the dissimilarity values between the object and every other concept.
3. The present invention designs an aggregation function to integrate the internal similarity and the external dissimilarity, so as to obtain the Central Tendency of object a in concept c. For the second factor influencing object typicality, Frequency of Instantiation, the present invention defines a prototype saliency vector to represent the Frequency of Instantiation of the different prototypes within a concept: each value in the vector is the number of objects contained in the cluster corresponding to one prototype, expressed as a percentage of the total number of objects belonging to the concept.
4. After obtaining an object's Central Tendency and Frequency of Instantiation values in a concept, the two are integrated by an aggregation function into the object's typicality score.
5. The "category utility" from cognitive psychology is used as the criterion for designing the automatic construction algorithm of a concept's multiple prototypes, so that the generated prototypes all lie at the basic level of the concept. Following cognitive psychology theory, the Category Utility over reviews is
CU(C) = (1/|C|) Σ_{ck ∈ C} p(ck) Σ_{ti ∈ F} [p(ti|ck)² − p(ti)²]
where C is the concept set, F is the aspect set, p(ti|ck)² is the squared probability that concept ck possesses aspect ti, p(ck) is the probability that an object belongs to concept ck, and p(ti) is the probability that an object possesses aspect ti.
6. The Basic Level concept mining algorithm is shown in Table 1; it can automatically mine the basic sub-concepts of a given concept. Using this algorithm, all comments on a product are automatically clustered into several classes according to the change in category utility; the clustering with the maximum category utility is finally chosen from the clustering results, and each class in that clustering belongs to the Basic Level.
Table 1
In the present embodiment, minimum comment set mining specifically includes the following steps:
1. Similar comment detection: similar comments are removed from the set. The similar comment detection problem can be formalized as follows. Product reviews (i.e., documents) are represented by a group of vectors D = {d1, d2, ..., d|D|}. Each vector di = <wi,1, wi,2, ..., wi,|L|> contains |L| term weights, where L is the vocabulary (i.e., the set of non-repeating words in the product review corpus). Each vector is normalized to unit length. If the cosine similarity of a pair of vectors is greater than or equal to a threshold μSIM, the pair is considered similar comments. Given normalized vectors, the cosine similarity sim(di, dj) of a pair of vectors (di, dj) is the dot product of those vectors:

$$\mathrm{sim}(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$
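A small sketch of this normalized-cosine test; the threshold value 0.8, the sparse-dict representation, and all function names are hypothetical:

```python
import math

def normalize(d):
    # Scale a sparse {term: weight} vector to unit length.
    n = math.sqrt(sum(w * w for w in d.values()))
    return {t: w / n for t, w in d.items()}

def sim(di, dj):
    # For unit-length vectors, cosine similarity is just the dot product.
    return sum(w * dj.get(t, 0.0) for t, w in di.items())

def is_similar_comment(di, dj, mu_sim=0.8):
    # mu_sim is a hypothetical threshold value.
    return sim(normalize(di), normalize(dj)) >= mu_sim
```

Normalizing once up front is what lets the later MapReduce stages treat similarity as a plain dot product.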
2. A given similarity threshold is used to reduce the number of candidate pairs produced in the candidate generation phase. If the weight of the maximal term in a document vector is too small, then its cosine similarity with any vector will be below the given threshold, so the full similarity computation need not be carried out. Given two document vectors di, dj ∈ D, $\|d_i\|_1$ denotes the vector's 1-norm. In addition, the maximal term weight in a vector dj is $\|d_j\|_\infty$, also called the vector's ∞-norm. The following inequality then holds for sim(di, dj):
$$\mathrm{sim}(d_i,d_j) \leq \min\left(\|d_i\|_1 \times \|d_j\|_\infty,\ \|d_j\|_1 \times \|d_i\|_\infty\right)$$
If $\|d_i\|_1 \times \|d_j\|_\infty < \mu_{SIM}$, then certainly sim(di, dj) < μSIM, i.e., di and dj are not similar comments.
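The pruning rule above might be sketched as follows; `may_be_similar` and the dictionary representation are illustrative assumptions:

```python
def norm_1(d):
    # 1-norm of a sparse {term: weight} vector.
    return sum(abs(w) for w in d.values())

def norm_inf(d):
    # inf-norm: the maximal term weight.
    return max(abs(w) for w in d.values())

def may_be_similar(di, dj, mu_sim):
    # sim(di,dj) <= min(||di||_1 * ||dj||_inf, ||dj||_1 * ||di||_inf),
    # so if that bound is already below mu_sim the pair can be pruned
    # without ever computing the dot product.
    bound = min(norm_1(di) * norm_inf(dj), norm_1(dj) * norm_inf(di))
    return bound >= mu_sim
```

The bound follows because each product term wi,t · wj,t is at most wi,t · ||dj||∞, and summing over t gives ||di||1 · ||dj||∞.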
In the present embodiment, the BigSimDet parallel algorithm comprises the following steps:
1. A MapReduce job is called to compute the 1-norm and ∞-norm values of each document vector. All vectors are then arranged in ascending order of their 1-norm value and placed into corresponding groups according to a predefined group size τG. The maximum group size τG is one of the system's parameters. Generally, a small τG allows more dissimilar vector-group relations to be found, at the cost of constructing more groups. Fig. 2 shows the parallel processing pattern of this first MapReduce job. The whole document collection D is divided evenly into several input partitions. Each Mapper concurrently processes the document vectors of its own partition (for example, computing the 1-norm and ∞-norm values). Running the Mappers yields intermediate key-value pairs (i.e., 1-norm value and document ID), which are then shuffled and input into the Reducers. By using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key in the first Reducer to narrow the processing range, the narrowed data are input into the second Reducer for processing, and so on.
The algorithmic details of this first MapReduce job are shown in Table 2; it is mainly responsible for concurrently sorting and grouping the document vectors. Each Mapper obtains input key-value pairs (i.e., a document ID and a document vector) from its input partition. The Mapper then produces the corresponding document vector composed of the normalized term weights. Next, the Mapper computes the values of the 1-norm ||di||1 and the ∞-norm ||di||∞. Finally, it outputs the intermediate key-value pairs (i.e., the 1-norm value, and a tuple composed of the document ID and the ∞-norm). Each Reducer takes the sorted intermediate key-value pairs as input, then sorts the documents by 1-norm value and groups them according to the group size parameter τG. Finally, it outputs key-value pairs, each comprising a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
Table 2
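A single-machine emulation of what this first job computes — both norms per document, an ascending sort by 1-norm, and a cut into groups of at most τG — offered as a sketch, not the actual MapReduce code of Table 2:

```python
def sort_and_group(docs, tau_g):
    # docs: {doc_id: sparse {term: weight} vector}.
    stats = sorted(
        ((doc_id,
          sum(abs(w) for w in vec.values()),      # 1-norm
          max(abs(w) for w in vec.values()))      # inf-norm
         for doc_id, vec in docs.items()),
        key=lambda rec: rec[1],                   # ascending by 1-norm
    )
    # Cut the ordered list into groups of at most tau_g records.
    return [stats[i:i + tau_g] for i in range(0, len(stats), tau_g)]
```

In the real system the sort is distributed across Reducers via TotalOrderPartitioner; the group boundaries here correspond to the group IDs emitted by the Reducers.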
2. Dissimilar document groups are found, thereby saving a large amount of time otherwise spent on unnecessary document similarity computation. Consider the dissimilar document group lookup for a given threshold μSIM = 0.5. When some group, by weighing its 1-norm and ∞-norm, is found to be dissimilar to other comment groups (e.g., g3 and g5 in Fig. 3), other dissimilar group pairs can be determined at the same time (e.g., g5 and g2, g5 and g1). For each group gi, the document with the maximum 1-norm value and the document with the maximum ∞-norm value are recorded. When determining the dissimilarity relation between two groups, the ∞-norm value of the maximal document in group gj (e.g., d10 in g5) is applied against the maximum 1-norm in the other group gi (e.g., d6 in g3). If the product of the maximum 1-norm of gi and the maximum ∞-norm of gj is below μSIM, no documents in the two groups (e.g., g3 and g5) can be similar. Since the earlier MapReduce job produced an ordered queue of all document groups, and the documents within each group are also arranged in ascending order of 1-norm value, comparing gi also establishes that the lower-ranked groups (e.g., g1 and g2) are dissimilar to gj: for any group gi−k, its maximum 1-norm does not exceed that of gi, so the product bound still falls below μSIM. In effect, the full ordering of document groups from step 1 significantly reduces the number of group comparisons in step 2. Furthermore, since the 1-norm and ∞-norm values of each group are computed independently in parallel, evaluating the product of two scalar norms rather than the dot product between document vectors saves significant computation time, particularly when processing big data with a very large vocabulary.
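The group-level pruning with early termination could look like the following sketch, assuming groups sorted in ascending order of 1-norm as produced by the first job:

```python
def dissimilar_group_pairs(groups, mu_sim):
    # groups: lists of (doc_id, norm1, norm_inf) records, groups sorted
    # ascending by their maximum 1-norm (output of the first job).
    # Because maximum 1-norms only grow with the group index, once one
    # pair exceeds the bound every later pair does too, so break early.
    pairs = []
    for j, gj in enumerate(groups):
        max_inf_j = max(rec[2] for rec in gj)
        for i in range(j):
            max_1_i = max(rec[1] for rec in groups[i])
            if max_1_i * max_inf_j < mu_sim:
                pairs.append((i, j))   # no document pair across gi, gj can be similar
            else:
                break
    return pairs
```

Only scalar products of pre-computed norms are evaluated here, which is the saving the text describes over per-document dot products.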
3. For each group gi ∈ G, the document partition is initialized by concurrently merging similar document groups and excluding dissimilar ones. Excluding dissimilar document groups greatly reduces the computation time of the similarity comment detection phase. However, the sizes of the document partitions at initialization may be very uneven, because some document groups may contain many potentially similar document groups while others have only a few. Unbalanced document partitions cause the execution performance of the parallel similarity comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the compute node running the longest similarity comment detection task.
4. A smoothing task is called to divide large partitions into several smaller ones according to the partition size limit τMaxG. The value of τMaxG is determined by the average local storage capacity of the compute cluster nodes. In general, if a compute node's local storage capacity is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes far more time than a local disk access; consequently, the total execution time may grow substantially. For each group gi ∈ G, if its initial document partition exceeds the τMaxG limit, the group gi serves as a seed and is merged with other potentially similar document groups so that no generated partition exceeds the limit. Finally, the similar comment detection tasks can be distributed evenly among all compute nodes in the cluster.
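A toy sketch of the smoothing step, reduced to splitting oversized partitions by document count (real partition sizes would be storage bytes, and the seed-merging of similar groups is omitted):

```python
def smooth_partitions(partition_sizes, tau_max_g):
    # partition_sizes: {partition_id: number of documents}.
    # Any partition larger than tau_max_g is split into chunks of at
    # most tau_max_g documents, so no single node gets an outsized task.
    result = {}
    next_id = 0
    for _, size in partition_sizes.items():
        while size > tau_max_g:
            result[next_id] = tau_max_g
            next_id += 1
            size -= tau_max_g
        result[next_id] = size
        next_id += 1
    return result
```

After smoothing, every partition fits within the limit, so detection tasks of roughly equal cost can be scheduled across the cluster.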
Similarity comment detection tasks are handled in a parallel fashion by invoking the compute nodes in the distributed cluster. For each document partition, a parallel MapReduce similarity comment detection task is started. Before the similarity computation, an inverted-index operation must be carried out within each local partition; the similarity comment job then runs over the index. Table 3 shows the Map and Reduce functions that compute document vector similarity within each partition. Each Mapper receives key-value pairs from the local partition: the key is a term t, and the value is a list of document IDs with the corresponding normalized term weights. When the Map function is called, each pair of document vectors is computed only once; the similarity between two vectors is computed only when they share at least one term in the inverted index. For each document vector di (with document ID ni), an associative array H keeps the similarity scores between di and its potentially similar documents. Finally, the Map function outputs the document ID ni with the associative array H, which minimizes the number of intermediate keys generated, and shuffles them. The Reduce function receives the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays. Finally, the Reduce function outputs key-value pairs comprising document IDs and the documents' final similarity scores within the local partition.
Table 3
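A single-machine emulation of the per-partition Map and Reduce computation described above, using a local inverted index (hypothetical Python, not the patent's Table 3 pseudocode):

```python
from collections import defaultdict

def map_phase(partition):
    # partition: {doc_id: unit-normalized sparse {term: weight} vector}.
    # Build a local inverted index, then emit one partial dot product
    # per term shared by a pair of documents.
    inverted = defaultdict(list)
    for doc_id, vec in partition.items():
        for term, w in vec.items():
            inverted[term].append((doc_id, w))
    for postings in inverted.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (di, wi), (dj, wj) = postings[a], postings[b]
                yield (min(di, dj), max(di, dj)), wi * wj

def reduce_phase(intermediate):
    # Sum the partial scores for each document pair.
    scores = defaultdict(float)
    for pair, partial in intermediate:
        scores[pair] += partial
    return dict(scores)
```

Only pairs sharing at least one term ever appear in the intermediate stream, which mirrors the inverted-index optimization: pairs with no common term implicitly get similarity zero at no cost.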
The above embodiment is a preferred embodiment of the present invention, but embodiments of the present invention are not limited by it; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be an equivalent replacement and is included within the protection scope of the present invention.

Claims (10)

1. A typicality-based comment big data mining method, characterized by comprising the following steps:
(1) comment typicality mining modeling: the problems of comment typicality computation and minimum representative comment set mining are modeled and formally defined;
(2) automatic construction of typical comment prototypes: a comment typicality computation method is designed based on the "basic level concept" theory of cognitive psychology and multi-prototype theory, and the generation of comment prototypes is guided by the category utility in the "basic level concept" theory;
(3) minimum comment set mining: a minimum comment set mining algorithm is used to filter out a minimum comment set having the following features: each comment in the set is distinct and can represent the viewpoint of a considerable portion of users; all comments in the minimum comment set can cover and represent the viewpoints of all comments on a given product; and a user only needs to browse the comments in the minimum comment set to understand the user viewpoints of all comments on the product;
(4) the BigSimDet parallel computing method is used to handle similarity comment detection tasks in a parallel fashion by invoking the compute nodes in a distributed cluster.
2. The typicality-based comment big data mining method according to claim 1, characterized in that in step (1), the specific steps of comment typicality mining modeling are:
(1-1) all comments on some product x are regarded as one "concept", the "concept" being product x's comments, and each comment is one "instance" of this "concept"; each comment thus has a different typicality within the "concept"; in addition, from all comments on product x, a minimum representative comment set is extracted, the comment set having the following two attributes:
(1-1-1) the n comments contained in the set can represent the different types of viewpoints of all users to the fullest extent;
(1-1-2) the number of comments n in the set is as small as possible; a user only needs to browse the small number of n comments to comprehensively understand all viewpoints and opinions about product x;
(1-2) product comments are formally represented using "aspects":
$$\vec{p}_a = (s_{a,1}{:}v_{a,1},\ s_{a,2}{:}v_{a,2},\ \ldots,\ s_{a,k}{:}v_{a,k})$$
where sa,i is an "aspect" belonging to product a, and va,i is the sentiment polarity value for sa,i in the comment, i.e., the sentiment orientation value of some aspect;
(1-3) the comment typicality computation problem can be regarded as the following function:
χ:Ri→Ti
where Ri is the set of comments belonging to product i, and Ti is that comment set sorted by comment typicality;
for the minimum representative comment set mining problem, according to multi-prototype theory, the product comments are clustered first, and then one comment prototype is extracted from each comment class to represent that class of comments; therefore, all comments on product x can be represented by n comment prototypes:
$$\vec{t}_c = (\vec{t}_{c,1},\ \vec{t}_{c,2},\ \ldots,\ \vec{t}_{c,n})$$
where $\vec{t}_{c,j}$ is a comment prototype, which can be expressed as:
$$\vec{t}_{c,j} = (s^{c}_{j,1}{:}v^{c}_{j,1},\ s^{c}_{j,2}{:}v^{c}_{j,2},\ \ldots,\ s^{c}_{j,m}{:}v^{c}_{j,m})$$
(1-4) the minimum representative comment set mining problem can be expressed as a function:
θ:Ri→Li
where Ri is the set of comments belonging to product i, and Li is the minimum representative comment set of product i.
3. The typicality-based comment big data mining method according to claim 1, characterized in that in step (2), the specific method for the automatic construction of typical comment prototypes is:
(2-1) in multi-prototype theory, a concept can be represented by multiple abstract objects; these object representations are the prototypes, and each prototype represents one group of similar objects and is the abstract representation of those objects;
(2-2) according to research in cognitive psychology, the typicality of an object is determined by two factors: Central Tendency and Frequency of Instantiation; the former is the similarity between the object and the concept's members, i.e., if an object is very similar to the other object instances in some concept and very dissimilar to the concepts beyond that concept, then this object's typicality in that concept is very high; the latter is the frequency with which a person encounters a certain object and classifies it into some concept;
(2-3) the Central Tendency of an object in a concept is determined by two aspects: the similarity to the other objects in the concept, i.e., internal similarity, and the dissimilarity to the objects in other concepts, i.e., external dissimilarity; the internal similarity can be expressed as:
$$\beta(\vec{p}_a, \vec{t}_c) = \mathrm{sim}(\vec{p}_a, \vec{t}_{c,s})$$
where $\vec{p}_a$ denotes an object a, $\vec{t}_c$ denotes a concept c, and $\vec{t}_{c,s}$ is the prototype s most similar to a;
(2-4) different similarity functions are adopted to measure the similarity between a prototype and an object; the external dissimilarity of an object within some concept is regarded as the aggregated value of the object's dissimilarity to the other concepts, calculated with the following formula:
$$\delta(\vec{p}_a, \vec{t}_c) = \frac{\sum_{x} \mathrm{dissimilar}(\vec{p}_a, \vec{t}_{x,s})}{N_\Delta - 1}, \quad x \in C \ \text{and}\ x \neq c$$
(2-5) the internal similarity and the external dissimilarity are combined by an aggregation function to obtain the Central Tendency of an object a in a concept c;
the multiple objects belonging to some concept are clustered so as to obtain the multiple prototype representations of the concept; for the second factor influencing an object's typicality, Frequency of Instantiation, a prototype salience vector is defined to represent the instantiation frequencies of the different prototypes in a concept, as follows:
$$\vec{w}_c = (w_{c,1}, w_{c,2}, \ldots, w_{c,n}), \quad 0 < w_{c,i} \leq 1$$
each value in the vector is the number of objects contained in the cluster corresponding to one prototype, expressed as a percentage of all objects belonging to the concept;
after obtaining an object's Central Tendency and Frequency of Instantiation values in some concept, the two are combined by an aggregation function; the process is formally represented as follows:
$$\tau_c(a) = \Phi\big(w_{c,s},\ \alpha(\vec{p}_a, \vec{t}_c)\big);$$
(2-6) the objects need to be clustered during the construction of object prototypes, and the multiple prototypes of a concept are obtained by an automatic object clustering method; taking the "category utility" from cognitive psychology as the criterion, an automatic construction algorithm for a concept's multiple prototypes is designed, so that the generated prototypes all lie at the basic level of the concept, the basic level being a special kind of concept partition that people form during cognition; according to the theory of cognitive psychology, the Category Utility of a set of comments is as follows:
$$cu(C,T) = \frac{1}{m}\sum_{k=1}^{m} p(c_k)\left[\frac{\sum_{i=1}^{n_k} w_i\, p(t_i \mid c_k)^2}{n_k} - \frac{\sum_{i=1}^{n} w_i\, p(t_i)^2}{n}\right]$$
where C is the set of concepts, T is the set of aspects, $p(t_i \mid c_k)$ is the probability that a concept possesses aspect ti, p(ck) is the probability that an object belongs to a concept, and p(ti) is the probability that an object possesses aspect ti;
(2-7) the Basic Level concept mining algorithm is used to automatically mine the basic sub-concepts of some concept; using the algorithm, all comments on a product are automatically clustered into several classes according to the change in category utility; the clustering with the maximum category utility is finally chosen from the clustering results, and each class in that clustering belongs to the Basic Level.
4. The typicality-based comment big data mining method according to claim 1, characterized in that in step (3), the specific steps of the minimum comment set mining are:
(3-1) similar comment detection is carried out, and similar comments are removed from the set; the similar comment detection problem can be formalized as follows: product reviews are represented by a group of vectors D = {d1, ..., d|D|}; each vector di = <wi,1, wi,2, ..., wi,|L|> contains |L| term weights, where L is the vocabulary of the product review corpus, the vocabulary being the set of non-repeating words; each vector is normalized to unit length; if the cosine similarity of a pair of vectors is greater than or equal to a threshold μSIM, the pair is considered similar comments; given normalized vectors, the cosine similarity sim(di, dj) of a pair of vectors (di, dj) is the dot product of those vectors:
$$\mathrm{sim}(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$
(3-2) the similar comment detection problem is defined: given the set D of text vectors representing the corresponding product comments and the similarity threshold μSIM, the similar comment detection problem is to determine the comment pairs di, dj ∈ D whose similarity satisfies sim(di, dj) ≥ μSIM;
(3-3) a given similarity threshold is used to reduce the number of candidate pairs produced in the candidate set generation phase; if the weight of the maximal term in a document vector is too small, then its cosine similarity with any vector will be below the given threshold, so the full similarity computation need not be carried out; given two document vectors di, dj ∈ D, ||di||1 denotes the vector's 1-norm; in addition, the maximal term weight in a vector dj is ||dj||∞, also called the vector's ∞-norm; the following inequality then holds for sim(di, dj):
$$\mathrm{sim}(d_i,d_j) \leq \min\left(\|d_i\|_1 \times \|d_j\|_\infty,\ \|d_j\|_1 \times \|d_i\|_\infty\right)$$
if $\|d_i\|_1 \times \|d_j\|_\infty < \mu_{SIM}$, then certainly sim(di, dj) < μSIM, i.e., di and dj are not similar comments.
5. The typicality-based comment big data mining method according to claim 1, characterized in that in step (4), the specific steps of the BigSimDet parallel computing method are:
(4-1) the 1-norm ||di||1 and ∞-norm ||di||∞ values of each document vector di ∈ D are computed, and these vectors are distributed into groups according to the group size parameter τG of the MapReduce parallel computation model; an ordering (G, ≺) is established such that the vector groups gi ∈ G are sorted according to their 1-norm values;
(4-2) for each pair of vector groups gi, gj ∈ G with gi ≺ gj, the non-similarity relation is determined by assessing whether the product of gi's maximum 1-norm value and gj's maximum ∞-norm value is below the threshold μSIM;
(4-3) for each group gi ∈ G, the initial MapReduce partition is constructed to compute gi's similar pairs with the other vector groups; in this process, the non-similarity relations and the partition size limit τMaxG also need to be considered;
(4-4) a MapReduce task is executed according to the partition size limit τMaxG; if a vector group gi ∈ G has multiple potentially similar vector groups, gi is used as a seed and extra partitions are formed according to the partition size limit τMaxG;
(4-5) for each partition, a MapReduce parallel processing task is started to detect similar comments.
6. The typicality-based comment big data mining method according to claim 5, characterized in that in step (4-1), the specific implementation steps of the MapReduce parallel computation model are:
a MapReduce job is called to compute the 1-norm and ∞-norm values of each document vector; all vectors are then arranged in ascending order of their 1-norm values and placed into corresponding groups according to a predefined group size τG, the maximum group size τG being one of the system's parameters; the whole document collection D is divided evenly into several input partitions, and each Mapper concurrently processes the document vectors of its own partition; running the Mappers yields intermediate key-value pairs, i.e., 1-norm values and document IDs, which are then shuffled and input into the Reducers; by using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key in the first Reducer to narrow the processing range, the narrowed data are input into the second Reducer for processing, and so on;
wherein the specific implementation process of the MapReduce job is as follows: a Mapper obtains each input key-value pair, i.e., a document ID and a document vector, from its input partition; the Mapper then produces the corresponding document vector composed of the normalized term weights; next, the Mapper computes the values of the 1-norm ||di||1 and the ∞-norm ||di||∞; finally, it outputs the intermediate key-value pairs, i.e., the 1-norm value and a tuple composed of the document ID and the ∞-norm; in addition, each Reducer takes the sorted intermediate key-value pairs as input, then sorts the documents by 1-norm value according to the group size parameter τG; finally, it outputs key-value pairs, each comprising a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
7. The typicality-based comment big data mining method according to claim 5, characterized in that step (4-2) is specifically:
when some group, by weighing its 1-norm and ∞-norm, is found to be dissimilar to other groups, other dissimilar document group pairs can be determined at the same time; for each group gi, the document with the maximum 1-norm value and the document with the maximum ∞-norm value are recorded; when determining the dissimilarity relation between two groups, the maximum ∞-norm value in group gj is applied against the maximum 1-norm in the other group gi; if their product is below μSIM, the documents in the two groups cannot be similar; since the earlier MapReduce job produced an ordered queue of all document groups, and the documents in each group are also arranged in ascending order of 1-norm value, comparing gi also establishes that the lower-ranked groups are dissimilar to gj.
8. The typicality-based comment big data mining method according to claim 5, characterized in that step (4-3) is specifically:
for each group gi ∈ G, the document partition is initialized by concurrently merging similar document groups and excluding dissimilar ones; excluding dissimilar document groups greatly reduces the computation time of the similarity comment detection phase; however, the sizes of the document partitions at initialization may be very uneven, because some document groups may contain many potentially similar document groups while others have only a few; unbalanced document partitions cause the execution performance of the parallel similarity comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the compute node running the longest similarity comment detection task.
9. The typicality-based comment big data mining method according to claim 5, characterized in that in step (4-4),
a smoothing task is called to divide large partitions into several smaller ones according to the partition size limit τMaxG; the value of τMaxG is determined by the average local storage capacity of the compute cluster nodes; if a compute node's local storage capacity is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes far more time than a local disk access, so the total execution time may grow substantially; for each group gi ∈ G, if its initial document partition exceeds the τMaxG limit, the group gi serves as a seed and is merged with other potentially similar document groups so that no generated partition exceeds the limit; finally, the similar comment detection tasks can be distributed evenly among all compute nodes in the cluster.
10. The typicality-based comment big data mining method according to claim 5, characterized in that in step (4-5),
similarity comment detection tasks are handled in a parallel fashion by invoking the compute nodes in the distributed cluster; for each document partition, a parallel MapReduce similarity comment detection task is started; before the similarity computation, an inverted-index operation needs to be carried out within each local partition, and the comment similarity computation is performed after indexing; the Map and Reduce functions compute document vector similarity within each partition; each Mapper receives key-value pairs from the local partition, the key being a term t and the value being a list of document IDs with the corresponding normalized term weights; when the Map function is called, each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one term in the inverted index; for each document vector di, an associative array H keeps the similarity scores between di and its potentially similar documents; finally, the Map function outputs the document ID ni with the associative array H, which minimizes the number of intermediate keys generated, and shuffles them; in addition, the Reduce function receives the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays; finally, the Reduce function outputs key-value pairs comprising document IDs and the documents' final similarity scores within the local partition.
CN201410796566.XA 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness Active CN104462480B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410796566.XA CN104462480B (en) 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410796566.XA CN104462480B (en) 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness

Publications (2)

Publication Number Publication Date
CN104462480A CN104462480A (en) 2015-03-25
CN104462480B true CN104462480B (en) 2017-11-10

Family

ID=52908515

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410796566.XA Active CN104462480B (en) 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness

Country Status (1)

Country Link
CN (1) CN104462480B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955957B (en) * 2016-05-05 2019-01-25 北京邮电大学 The determination method and device that aspect scores in a kind of businessman's general comment
CN106446264B (en) * 2016-10-18 2019-08-27 哈尔滨工业大学深圳研究生院 Document representation method and system
CN109903851B (en) * 2019-01-24 2023-05-23 暨南大学 Automatic observation method for psychological abnormal change based on social network

Citations (3)

Publication number Priority date Publication date Assignee Title
CN102945268A (en) * 2012-10-25 2013-02-27 北京腾逸科技发展有限公司 Method and system for excavating comments on characteristics of product
CN103399916A (en) * 2013-07-31 2013-11-20 清华大学 Internet comment and opinion mining method and system on basis of product features
CN103778214A (en) * 2014-01-16 2014-05-07 北京理工大学 Commodity property clustering method based on user comments

Family Cites Families (1)

Publication number Priority date Publication date Assignee Title
US20130080208A1 (en) * 2011-09-23 2013-03-28 Fujitsu Limited User-Centric Opinion Analysis for Customer Relationship Management


Non-Patent Citations (2)

Title
Gernot Horstmann, "Facial Expressions of Emotion: Does the Prototype Represent Central Tendency, Frequency of Instantiation, or an Ideal?", Emotion, vol. 2, no. 3, 31 Dec. 2002, pp. 297-305 *
Lawrence W. Barsalou, "Ideals, Central Tendency, and Frequency of Instantiation as Determinants of Graded Structure in Categories", Journal of Experimental Psychology: Learning, Memory, and Cognition, vol. 11, no. 4, 31 Oct. 1985, pp. 629-654 *

Also Published As

Publication number Publication date
CN104462480A (en) 2015-03-25

Similar Documents

Publication Publication Date Title
US10019442B2 (en) Method and system for peer detection
CN103678672B (en) Method for recommending information
CN105320719B (en) A kind of crowd based on item label and graphics relationship raises website item recommended method
CN108920527A (en) A kind of personalized recommendation method of knowledge based map
Li et al. A comparative analysis of evolutionary and memetic algorithms for community detection from signed social networks
CN106127546A (en) A kind of Method of Commodity Recommendation based on the big data in intelligence community
CN105760443B (en) Item recommendation system, project recommendation device and item recommendation method
Calders et al. What is data mining and how does it work?
CN107895038A (en) A kind of link prediction relation recommends method and device
Halibas et al. Determining the intervening effects of exploratory data analysis and feature engineering in telecoms customer churn modelling
CN106547864A (en) A kind of Personalized search based on query expansion
Tiwari et al. Amalgamating contextual information into recommender system
CN104462480B (en) Comment big data method for digging based on typicalness
CN105447117B (en) A kind of method and apparatus of user&#39;s cluster
Kaur et al. Advanced eclat algorithm for frequent itemsets generation
Sharma et al. Predicting purchase probability of retail items using an ensemble learning approach and historical data
Sethi et al. Data mining: current applications & trends
CN106649380A (en) Hot spot recommendation method and system based on tag
CN113762703A (en) Method and device for determining enterprise portrait, computing equipment and storage medium
CN113569162A (en) Data processing method, device, equipment and storage medium
Kumar et al. Cuisine prediction based on ingredients using tree boosting algorithms
Khanday et al. A comparative analysis of identifying influential users in online social networks
Muruganantham et al. Discovering and ranking influential users in social media networks using Multi-Criteria Decision Making (MCDM) Methods
CN104572880B (en) The Parallel Implementation method and system of collaborative filtering based on user
CN104102654B (en) A kind of method and device of words clustering

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant