CN104462480A - Typicality-based big comment data mining method - Google Patents


Info

Publication number
CN104462480A
Authority
CN
China
Prior art keywords
comment
similarity
concept
vector
norm
Prior art date
Legal status
Granted
Application number
CN201410796566.XA
Other languages
Chinese (zh)
Other versions
CN104462480B (en)
Inventor
刘耀强
Current Assignee
Individual
Original Assignee
Individual
Priority date
Filing date
Publication date
Application filed by Individual
Priority to CN201410796566.XA
Publication of CN104462480A
Application granted
Publication of CN104462480B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a typicality-based big comment data mining method. The method comprises the steps of: (1) comment typicality mining modeling, in which comment typicality calculation and the minimum representative comment set mining problem are modeled and formally defined; (2) automatic construction of typicality comment prototypes; (3) minimum comment set mining, in which a minimum comment set mining algorithm screens out one minimum comment set; (4) BigSimDet parallel computation, in which the computing nodes of a distributed cluster are called to process similar-comment detection tasks in parallel. Drawing on the two viewpoints of cognitive psychology and opinion mining, the method studies how to measure the typicality of user comments and, on that basis, mines a representative minimum comment set, thereby helping potential customers understand a given commodity more comprehensively and from multiple angles, helping users screen out the commodities they need more accurately, and improving the user's purchase experience.

Description

Typicality-based big comment data mining method
Technical field
The present invention relates to the field of data mining research, and in particular to a typicality-based big comment data mining method.
Background technology
With the rapid development of the Internet in China, the comments scattered across e-commerce websites, social networks and online forums have grown explosively. These petabytes (PB) of big comment data (Big Data) reveal users' personal views on a wide range of subjects such as consumer products, organizations, people and social events. Such commodity comments not only let enterprises understand the real needs of the customers they care about, existing and potential, but also provide useful guidance for consumers' shopping decisions. According to CNNIC data from 2014, more than 90% of online shoppers leave comments under the commodities on shopping websites, and more than half of online shoppers read the relevant commodity comments before each purchase. For example, Ctrip provides a platform on which customers can publish comments on the hotels they have stayed in; the hotel comments published through this platform not only give other customers a reference for choosing a suitable hotel, but also let hotel managers steadily improve their service according to the online feedback and thereby attract more customers at home and abroad. In addition, analyzing these online comments can help government departments understand public sentiment in different regions more quickly and broadly, and learn the public's views and opinions on government policies and social development. In general, from the user's perspective, comments help a user understand a commodity more comprehensively and from multiple angles, and thus decide whether to buy it; they also let users learn which commodities can meet their needs. From the enterprise's perspective, manufacturers and service providers need to know users' views of their products, namely which aspects are strengths and which are weaknesses from the angle of user experience; this helps producers obtain more and fuller user feedback and thus improve their goods and services. In summary, online comments contain abundant valuable information and deserve deep mining and analysis.
Although online comments are highly meaningful and useful to enterprises, regulators and commodity users, in the big data era it is almost impossible to browse and analyze the huge volume of online comments manually, and traditional comment mining methods struggle to analyze and summarize big comment data in real time, so the resulting analyses are unsatisfactory. Against this big data background, building an intelligent online opinion mining system has high research and application value. For example, by mining a minimum representative comment set from big comment data, a system can let its users quickly grasp the different viewpoints in the comments and thus monitor market trends or regional public sentiment quickly and effectively.
Summary of the invention
The primary object of the present invention is to overcome the shortcomings and deficiencies of the prior art by providing a typicality-based big comment data mining method. The method designs a comment typicality calculation using the basic level concept (Basic Level Concept) of cognitive psychology and multi-prototype theory, mines a representative minimum comment set on this basis, and uses the Hadoop platform to process the big comment data mining in parallel.
To achieve the above object, the present invention adopts the following technical solution:
A typicality-based big comment data mining method comprises the steps of:
(1) comment typicality mining modeling: modeling and formally defining comment typicality calculation and the minimum representative comment set mining problem;
(2) automatic construction of typicality comment prototypes: designing the comment typicality calculation method based on the basic level concept of cognitive psychology and multi-prototype theory, and guiding the generation of comment prototypes with the category utility from basic level concept theory;
(3) minimum comment set mining: applying the minimum comment set mining algorithm to screen out one minimum comment set with the following features: every comment in the set is distinct and represents the viewpoint of a considerable proportion of users, and together the comments in this minimum set cover and represent the viewpoints of all comments on the commodity, so that a user only needs to browse the comments in this minimum set to understand all user viewpoints on the commodity;
(4) BigSimDet parallel computation: calling the computing nodes of a distributed cluster to process similar-comment detection tasks in parallel.
Preferably, in step (1), the concrete steps of comment typicality mining modeling are:
(1-1) all comments on a commodity x are regarded as one "concept", the "concept" being the comments on commodity x, and each comment is one "instance" of this "concept"; every comment then has a different typicality within the "concept". In addition, a minimum representative comment set is extracted from all comments on commodity x; this comment set has the following two properties:
(1-1-1) the n comments contained in the set represent the different types of user viewpoints to the greatest possible extent;
(1-1-2) the number of comments n in the set is as small as possible, so that a user only needs to browse a small number n of comments to understand, fairly comprehensively, all viewpoints and suggestions about commodity x;
(1-2) each commodity comment is formally represented by its "aspects":

$$\vec{p}_a = (s_{a,1}:v_{a,1},\ s_{a,2}:v_{a,2},\ \ldots,\ s_{a,k}:v_{a,k})$$

where $s_{a,i}$ is an "aspect" belonging to commodity a, and $v_{a,i}$ is the sentiment polarity value of $s_{a,i}$ in the comment, i.e. the sentiment orientation value of that aspect;
(1-3) the comment typicality calculation problem can be regarded as the following function:

$$\chi: R_i \to T_i$$

where $R_i$ is the set of comments belonging to commodity i, and $T_i$ is the comment set sorted by comment typicality;
For the minimum representative comment set mining problem, according to multi-prototype theory, the commodity comments are first clustered, and then one comment prototype is extracted from each comment cluster to represent that class of comments; therefore, all comments on commodity x can be represented by n comment prototypes, that is:

$$\vec{t}_c = (\vec{t}_{c,1},\ \vec{t}_{c,2},\ \ldots,\ \vec{t}_{c,n})$$

where $\vec{t}_{c,j}$ is a comment prototype, which can be expressed as:

$$\vec{t}_{c,j} = (s_{j,1}^c:v_{j,1}^c,\ s_{j,2}^c:v_{j,2}^c,\ \ldots,\ s_{j,m}^c:v_{j,m}^c)$$

(1-4) the minimum representative comment set mining problem can be expressed as a function:

$$\theta: R_i \to L_i$$

where $R_i$ is the set of comments belonging to commodity i, and $L_i$ is the minimum representative comment set of commodity i.
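The formal model of step (1) can be sketched in Python. The toy comments (aspect-to-sentiment mappings) and the typicality values are invented for illustration only, since the real typicality $\tau_c$ is defined in step (2); $\chi$ is simply an ordering of $R_i$ by that score.

```python
# Sketch of the formal model: a comment is an "aspect: sentiment orientation
# value" mapping, and chi: R_i -> T_i orders the comment set by typicality.

def chi(comment_indices, typicality_of):
    """Return comment indices sorted by comment typicality (descending)."""
    return sorted(comment_indices, key=lambda i: typicality_of[i], reverse=True)

# Toy comments on one commodity: aspect -> sentiment orientation v in [-1, 1].
comments = [
    {"battery": 0.9, "screen": 0.7},
    {"battery": -0.8, "price": 0.5},
    {"screen": 0.6, "price": 0.4},
]
# Hypothetical typicality values standing in for the real tau_c calculation.
scores = {0: 0.8, 1: 0.3, 2: 0.6}

ranked = chi(range(len(comments)), scores)
print(comments[ranked[0]])   # → {'battery': 0.9, 'screen': 0.7}
```

A minimum representative comment set $L_i$ would then be drawn from the top of this ordering, one comment per prototype cluster.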
Preferably, in step (2), the concrete method for automatically constructing typicality comment prototypes is:
(2-1) in multi-prototype theory, a concept can be represented by several abstract objects; these representative objects are prototypes, and each prototype represents one group of similar objects and is an abstract representation of those objects;
(2-2) according to cognitive psychology research, the typicality of an object is determined by two factors: Central Tendency and Frequency of Instantiation. The former is the similarity between the object and the members of the concept: if an object is very similar to the other object instances in a concept and very dissimilar to the concepts other than it, the typicality of the object in that concept is high. The latter is the frequency with which a person encounters an object and classifies it into a concept;
(2-3) the Central Tendency of an object within a concept is determined by two aspects: the similarity between the object and the other objects in the concept, i.e. internal similarity, and the dissimilarity between the object and the objects of other concepts, i.e. external dissimilarity. The internal similarity can be expressed as:

$$\beta(\vec{p}_a, \vec{t}_c) = sim(\vec{p}_a, \vec{t}_{c,s})$$

where $\vec{p}_a$ represents an object a, and $\vec{t}_{c,s}$ represents the prototype s of concept c that is most similar to a;
(2-4) different similarity functions can be adopted to measure the similarity between a prototype and an object. The external dissimilarity of an object with respect to a concept is regarded as the aggregated dissimilarity between the object and the other concepts, computed by the following formula:

$$\delta(\vec{p}_a, \vec{t}_c) = \frac{\sum_x dissimilar(\vec{p}_a, \vec{t}_{x,s})}{N_\Delta - 1}, \quad x \in C \ \text{and}\ x \neq c$$
(2-5) internal similarity and external dissimilarity are combined by an aggregation function to obtain the Central Tendency of object a in concept c;
The objects belonging to a concept are clustered, so that multiple prototype representations of the concept are obtained. For the second factor affecting object typicality, Frequency of Instantiation, a prototype significance vector is defined to represent the Frequency of Instantiation of the different prototypes in a concept, as follows:

$$\vec{w}_c = (w_{c,1},\ w_{c,2},\ \ldots,\ w_{c,n}), \quad 0 < w_{c,i} \le 1$$

Each value in this vector is the percentage of the number of objects contained in the cluster corresponding to a prototype relative to the total number of objects belonging to the concept;
After obtaining the Central Tendency and the Frequency of Instantiation of an object in a concept, the two are combined according to an aggregation function; this process is formalized as follows:

$$\tau_c(a) = \Phi(w_{c,s},\ \alpha(\vec{p}_a, \vec{t}_c));$$
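The computation of $\tau_c(a)$ can be sketched as follows. Cosine similarity stands in for $sim()$, $dissimilar()$ is taken as $1 - sim()$, and the aggregation functions (an equal-weight sum for $\alpha$, a product for $\Phi$) are illustrative assumptions, not the aggregations fixed by the invention; the toy prototypes and weights are invented.

```python
# Sketch: Central Tendency = alpha(internal similarity, external dissimilarity),
# then tau_c(a) = Phi(prototype weight w_{c,s}, Central Tendency).
import math

def cos_sim(p, t):
    dot = sum(p[k] * t[k] for k in p if k in t)
    np_ = math.sqrt(sum(v * v for v in p.values()))
    nt = math.sqrt(sum(v * v for v in t.values()))
    return dot / (np_ * nt) if np_ and nt else 0.0

def central_tendency(p, prototypes_c, prototypes_other):
    beta = max(cos_sim(p, t) for t in prototypes_c)   # internal similarity
    if prototypes_other:                              # external dissimilarity
        delta = sum(1 - max(cos_sim(p, t) for t in ts)
                    for ts in prototypes_other) / len(prototypes_other)
    else:
        delta = 0.0
    return 0.5 * beta + 0.5 * delta                   # alpha: equal weights

def typicality(p, prototypes_c, weights_c, prototypes_other):
    # Phi: weight of the most similar prototype times the Central Tendency.
    s = max(range(len(prototypes_c)), key=lambda i: cos_sim(p, prototypes_c[i]))
    return weights_c[s] * central_tendency(p, prototypes_c, prototypes_other)

protos_c = [{"battery": 1.0}, {"screen": 1.0}]   # prototypes of concept c
w_c = [0.7, 0.3]                                 # prototype significance vector
protos_other = [[{"price": 1.0}]]                # prototypes of the other concept
print(round(typicality({"battery": 0.9}, protos_c, w_c, protos_other), 3))  # → 0.7
```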
(2-6) in the process of constructing object prototypes, the objects must be clustered, and the multiple prototypes of a concept are obtained by an automatic object clustering method. The "category utility" of cognitive psychology is adopted as the benchmark for designing an automatic concept multi-prototype generation algorithm, so that the generated prototypes all lie at the basic level of the concept; the basic level is a special division of concepts that people form in the process of cognition. According to cognitive psychology theory, the category utility on comments is as follows:

$$cu(C, F) = \frac{1}{m}\sum_{k=1}^{m} p(c_k)\left[\frac{\sum_{i=1}^{n_k} w_i\, p(t_i \mid c_k)^2}{n_k} - \frac{\sum_{i=1}^{n} w_i\, p(t_i)^2}{n}\right]$$

where C is the concept set, F is the aspect set, $p(t_i \mid c_k)$ is the probability that a concept $c_k$ has aspect $t_i$, $p(c_k)$ is the probability that an object belongs to a concept, and $p(t_i)$ is the probability that an object has aspect $t_i$;
(2-7) the basic level concept mining algorithm is adopted to automatically mine the basic sub-concepts of a concept. Using this algorithm, all comments on a commodity are automatically clustered into several classes according to the change in category utility; the partition with the maximum category utility among the clustering results is finally chosen, and each class of this partition lies at the Basic Level.
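The category utility formula can be sketched in Python under simplifying assumptions: comments are reduced to aspect sets, all aspect weights $w_i$ are set to 1, and the normalizations by $n_k$ and $n$ are both read as division by the number of aspects; these are one reading of the formula, not the invention's exact definition.

```python
# Sketch of cu(C, F): partition is a list of clusters, each cluster a list of
# comments, each comment a set of aspects.

def category_utility(partition, aspects):
    n_total = sum(len(cluster) for cluster in partition)
    # p(t_i): probability that a comment has aspect t_i, over all comments.
    p_t = {t: sum(t in r for c in partition for r in c) / n_total for t in aspects}
    base = sum(p_t[t] ** 2 for t in aspects) / len(aspects)
    cu = 0.0
    for cluster in partition:
        p_c = len(cluster) / n_total           # p(c_k)
        # p(t_i | c_k): probability that a comment in this cluster has aspect t_i.
        within = sum((sum(t in r for r in cluster) / len(cluster)) ** 2
                     for t in aspects) / len(aspects)
        cu += p_c * (within - base)
    return cu / len(partition)                 # the 1/m factor

part = [[{"battery"}, {"battery"}], [{"screen"}]]
print(round(category_utility(part, ["battery", "screen"]), 3))   # → 0.111
```

A clustering loop would compare this score across candidate partitions and keep the one with maximum category utility, as step (2-7) describes.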
Preferably, in step (3), the concrete steps of minimum comment set mining are:
(3-1) similar comment detection is performed, and similar comments are removed from the set. The similar comment detection problem can be formalized as follows: the product comments are represented by a set of vectors $D = \{d_1, d_2, \ldots, d_{|D|}\}$, where each vector $d_i = \langle w_{i,1}, w_{i,2}, \ldots, w_{i,|L|}\rangle$ contains $|L|$ term weights, L being the lexicon of the product comment corpus, i.e. the set of distinct terms, and each vector is normalized to unit length. If the cosine similarity of a pair of vectors is greater than or equal to a threshold $\mu_{SIM}$, the pair is considered a pair of similar comments. Given normalized vectors, the cosine similarity $sim(d_i, d_j)$ of a pair of vectors $(d_i, d_j)$ is the dot product of the vectors:

$$sim(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$

(3-2) Definition 1 (similar comment detection problem): given the set D of text vectors representing the comments on a product and a similarity threshold $\mu_{SIM}$, the similar comment detection problem is to find all comment pairs $d_i, d_j \in D$ whose similarity satisfies $sim(d_i, d_j) \ge \mu_{SIM}$;
(3-3) the given similarity threshold is used to reduce the number of candidate pairs generated in the candidate generation phase: if the maximum term weight of a document vector is too small, its cosine similarity with any vector will be below the given threshold, so the similarity computation need not be invoked for it. Given two document vectors $d_i, d_j \in D$, let $\|d_i\|_1$ denote the 1-norm of a vector; in addition, the maximum term weight of a vector $d_j$ is $w_j^{max} = \max(\{w_{j,1}, w_{j,2}, \ldots, w_{j,|L|}\})$, also called the ∞-norm $\|d_j\|_\infty$ of the vector. Then, from $sim(d_i, d_j)$, the following inequality can be obtained:

$$sim(d_i, d_j) \le \min(\|d_i\|_1 \times \|d_j\|_\infty,\ \|d_j\|_1 \times \|d_i\|_\infty)$$

If $\|d_i\|_1 \times \|d_j\|_\infty < \mu_{SIM}$, then certainly $sim(d_i, d_j) < \mu_{SIM}$, i.e. $d_i$ and $d_j$ are not similar comments.
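The pruning inequality of step (3-3) can be demonstrated with a minimal sketch on invented vectors: `maybe_similar` applies the norm bound so that a pair is scored only when the bound reaches the threshold. Note the bound is necessary but not sufficient; a pair can pass the bound and still be dissimilar.

```python
# Sketch of the norm-based candidate pruning: for unit-length vectors, the dot
# product is bounded by min(||d_i||_1 * ||d_j||_inf, ||d_j||_1 * ||d_i||_inf).
import math

def normalize(d):
    n = math.sqrt(sum(v * v for v in d.values()))
    return {t: v / n for t, v in d.items()}

def norms(d):
    return sum(d.values()), max(d.values())   # (1-norm, inf-norm)

def maybe_similar(di, dj, mu_sim):
    n1_i, ninf_i = norms(di)
    n1_j, ninf_j = norms(dj)
    return min(n1_i * ninf_j, n1_j * ninf_i) >= mu_sim

def cosine(di, dj):
    return sum(w * dj.get(t, 0.0) for t, w in di.items())

d1 = normalize({"battery": 1.0, "screen": 1.0})
d2 = normalize({"price": 1.0, "shipping": 1.0})   # passes the bound, cosine 0
d3 = normalize({t: 1.0 for t in "abcdefghijklmnop"})   # flat vector, inf-norm 0.25
print(maybe_similar(d1, d3, 0.9))   # → False
```

Here `d3` is pruned without computing its dot product with `d1`, while `d1` versus `d2` still requires the full similarity computation.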
Preferably, in step (4), the concrete steps of the BigSimDet parallel computing method are:
(4-1) the 1-norm $\|d_i\|_1$ and ∞-norm $\|d_i\|_\infty$ of every document vector $d_i \in D$ are computed, and the vectors are distributed into groups according to the group size parameter $\tau_G$ of the MapReduce parallel computing model; an ordered sequence of groups $G = (g_1, g_2, \ldots)$ is built, and each vector group $g_i \in G$ is sorted by 1-norm value;
(4-2) for each pair of vector groups $g_i, g_j \in G$, the non-similarity relation is determined by evaluating whether the product of the maximum 1-norm value of $g_i$ and the maximum ∞-norm value of $g_j$ exceeds the threshold $\mu_{SIM}$;
(4-3) for each group $g_i \in G$, an initial MapReduce partition computes the similar pairs between $g_i$ and the other vector groups; in this process, the non-similarity relations and the partition size limit $\tau_{maxG}$ must also be taken into account;
(4-4) a MapReduce task is executed according to the partition size limit $\tau_{maxG}$: if a vector group $g_i \in G$ has multiple potentially similar vector groups, $g_i$ is used as a seed, and additional partitions are formed according to the partition size limit $\tau_{maxG}$;
(4-5) for each partition, a parallel MapReduce processing task is started to detect similar comments.
Preferably, in step (4-1), the concrete implementation steps of the MapReduce parallel computing model are:
A MapReduce job is invoked to compute the 1-norm and ∞-norm values of every document vector. All vectors are then arranged in ascending order of their 1-norm values and placed into the corresponding groups according to the predefined group size $\tau_G$; the maximum group size $\tau_G$ is one of the system parameters. The whole document collection D is divided evenly into several input partitions, and each Mapper processes the document vectors of its own partition in parallel. The Mappers generate intermediate key-value pairs, namely the 1-norm value and the document ID, which are input to the Reducers after shuffling. By using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key value; the first Reducer processes one range of keys, the data of the next range are input to the second Reducer, and so on.
The concrete implementation of the MapReduce job is as follows: each Mapper reads every input key-value pair, i.e. a document ID and a document vector, from its input partition; the Mapper then produces the corresponding document vector composed of the normalized term weights, and computes the 1-norm $\|d_i\|_1$ and ∞-norm $\|d_i\|_\infty$ values; finally, it outputs intermediate key-value pairs, namely the 1-norm value together with a tuple composed of the document ID and the ∞-norm. Each Reducer takes the sorted intermediate key-value pairs as input, sorts the documents by 1-norm value according to the group size parameter $\tau_G$, and finally outputs key-value pairs; each output key-value pair comprises a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
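The norm-computation job above can be simulated in plain Python without a Hadoop cluster. The `mapper`/`reducer` function names and the in-memory shuffle are illustrative stand-ins for the actual MapReduce runtime, and the TotalOrderPartitioner behavior is reduced to a single sort.

```python
# Simulation of the step (4-1) job: the "Mapper" emits (1-norm, (doc_id,
# inf-norm)) pairs, the shuffle sorts them by key, and the "Reducer" packs
# runs of tau_G documents into ordered groups.

def mapper(doc_id, vec):
    one_norm = sum(vec.values())
    inf_norm = max(vec.values())
    return (one_norm, (doc_id, inf_norm))

def reducer(sorted_pairs, tau_g):
    groups = []
    for i in range(0, len(sorted_pairs), tau_g):
        chunk = sorted_pairs[i:i + tau_g]
        groups.append((len(groups),   # group ID
                       [(d, n1, ninf) for n1, (d, ninf) in chunk]))
    return groups

docs = {"r1": {"a": 0.6, "b": 0.8}, "r2": {"a": 1.0}, "r3": {"b": 0.5, "c": 0.5}}
intermediate = sorted(mapper(d, v) for d, v in docs.items())   # shuffle phase
groups = reducer(intermediate, tau_g=2)
print([gid for gid, _ in groups])   # → [0, 1]
```

Each emitted group carries the ordered document ID list with its 1-norm and ∞-norm values, which is exactly what the group-level pruning in step (4-2) consumes.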
Preferably, step (4-2) is specifically:
When one group is found, by weighing 1-norms and ∞-norms, to be dissimilar to another group, other dissimilar document groups can be determined at the same time. For each group $g_i$, the document with the maximum 1-norm value and the document with the maximum ∞-norm value are recorded. When the dissimilarity relation between two groups is determined, the maximum ∞-norm value of document group $g_j$ can be applied against the maximum 1-norm value of another group $g_i$: if the product falls below the threshold, the documents in the two groups cannot be similar. Because the preceding MapReduce job has produced an ordered queue of all document groups, and the documents within each group are also arranged in ascending order of 1-norm value, the groups ranked lower than $g_i$ (such as $g_1$ and $g_2$) can also be marked as dissimilar to $g_j$.
Preferably, step (4-3) is specifically:
For each group $g_i \in G$, the document partitions are initialized by merging potentially similar document groups in parallel while excluding dissimilar document groups; excluding dissimilar document groups greatly saves computation time in the similar comment detection phase. However, the document partition sizes after initialization may be very uneven, because some document groups may contain many potentially similar document groups while others have only a few. Unbalanced document partitions cause the execution performance of the parallel similar comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the computing node whose similar comment detection task runs the longest.
Preferably, in step (4-4):
A smoothing task is invoked to split over-large document partitions into several smaller partitions according to the partition size limit $\tau_{maxG}$. The value of $\tau_{maxG}$ is determined by the average local memory capacity of the nodes of the computing cluster: if a partition exceeds the local memory capacity of a computing node, document groups must be retrieved from remote nodes, and every remote disk access consumes much more time than a local disk access, so the total execution time may increase greatly. For each group $g_i \in G$ whose initial document partition exceeds the limit $\tau_{maxG}$, the group $g_i$ is used as a seed and merged with other potentially similar document groups such that no generated partition exceeds the limit; finally, the similar comment detection tasks can be distributed evenly among all the computing nodes in the cluster.
Preferably, in step (4-5):
The similar comment detection tasks are processed in parallel by calling the computing nodes of the distributed cluster. For each document partition, one parallel MapReduce similar comment detection task is started. Before the similarity computation, an inverted index job is run on each local partition, and the similar comment job is run after the indexing. In the Map and Reduce functions that compute document vector similarity within each partition, each Mapper accepts local key-value pairs whose key is a term t and whose value is the list of document IDs and corresponding normalized term weights. The Map function is invoked such that each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one term in the inverted index. For each document vector $d_i$, an associative array H keeps the similarity scores between $d_i$ and the potentially similar documents. Finally, the Map function outputs the document ID and the associative array H, which minimizes the number of intermediate keys generated, and these are shuffled. The Reduce function accepts the generated intermediate key-value pairs and sums the partial similarity scores stored in the associative arrays; finally, the Reduce function outputs key-value pairs comprising the document IDs in the local partition and the final document similarity scores.
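The per-partition inverted-index job of step (4-5) can be sketched as follows. A plain dictionary plays the role of the associative array H, the documents are invented toy data, and the whole computation runs in one process rather than across Mappers and Reducers.

```python
# Sketch: a local inverted index maps each term to (doc_id, weight) postings,
# and an accumulator gathers partial dot products so that only pairs sharing
# at least one indexed term are ever scored.
from collections import defaultdict

def similar_pairs(docs, mu_sim):
    index = defaultdict(list)            # term -> [(doc_id, weight)]
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    h = defaultdict(float)               # the associative array H
    for postings in index.values():
        for i in range(len(postings)):
            for j in range(i + 1, len(postings)):
                (a, wa), (b, wb) = postings[i], postings[j]
                h[(min(a, b), max(a, b))] += wa * wb   # each pair keyed once
    return {pair: score for pair, score in h.items() if score >= mu_sim}

docs = {"r1": {"battery": 0.8, "screen": 0.6},
        "r2": {"battery": 0.8, "screen": 0.6},
        "r3": {"price": 1.0}}
print(sorted(similar_pairs(docs, 0.9)))   # → [('r1', 'r2')]
```

Pairs with no shared term, such as (r1, r3), never enter the accumulator at all, which is the saving the inverted index provides.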
Compared with the prior art, the present invention has the following advantages and beneficial effects:
(1) The present invention is an interdisciplinary research method: it applies the multi-prototype concept representation theory and the Basic Level Concept theory of cognitive psychology to measure the typicality of a comment, and discloses a new comment typicality calculation method. In addition, the category utility proposed by cognitive psychologists is used as the clustering objective to assist the generation of comment prototypes, so that the prototypes produced by clustering, and the comment typicality computed from them, are closer to people's true cognition.
(2) Unlike existing opinion mining technology, which concentrates mainly on sentiment analysis and opinion summarization of comments, the present invention studies the new problem of mining a minimum representative comment set from big comment data. By mining a minimum representative comment set, users can easily and comprehensively grasp the overall picture of all comments and the diversity of viewpoints without having to browse the big comment data; this fills a research gap on the problem and strengthens, to a greater extent, the reference value of meaningful comments in big comment data for users.
(3) The research results disclosed by the invention can help potential customers understand a commodity more comprehensively and from multiple angles, help users screen out the commodities they need more accurately, and improve the user's purchase experience. From the perspective of manufacturers and service providers, they can more fully understand users' views of their products, namely which aspects are strengths and which are weaknesses from the angle of user experience; this helps producers obtain more and fuller user feedback, so as to improve their commodities and promote sales. Furthermore, the disclosed method can also help government departments understand public sentiment in different regions more quickly, broadly and comprehensively, and learn the public's typical views and the various representative views on government policies.
(4) The method disclosed by the invention targets big comment data on the Internet; by parallelizing the disclosed comment typicality calculation method and minimum representative comment mining algorithm under Hadoop and MapReduce, it realizes them in a distributed manner so as to cope with applications in a big data environment.
Accompanying drawing explanation
Fig. 1 is the processing flow chart of the mining method of the present invention;
Fig. 2 shows the comment vector sorting and grouping design;
Fig. 3 shows the document vector group dissimilarity calculation.
Embodiment
The present invention is described in further detail below with reference to the embodiment and the accompanying drawings, but the embodiments of the present invention are not limited thereto.
Embodiment
As shown in Fig. 1, the typicality-based big comment data mining method comprises the steps of:
(1) comment typicality mining modeling: modeling and formally defining comment typicality calculation and the minimum representative comment set mining problem;
(2) automatic construction of typicality comment prototypes: designing the comment typicality calculation method based on the basic level concept of cognitive psychology and multi-prototype theory, and guiding the generation of comment prototypes with the category utility from basic level concept theory;
(3) minimum comment set mining: applying the minimum comment set mining algorithm to screen out one minimum comment set with the following features: every comment in the set is distinct and represents the viewpoint of a considerable proportion of users, and together the comments in this minimum set cover and represent the viewpoints of all comments on the commodity, so that a user only needs to browse the comments in this minimum set to understand all user viewpoints on the commodity;
(4) BigSimDet parallel computation: calling the computing nodes of a distributed cluster to process similar-comment detection tasks in parallel.
In the present embodiment, in comment typicality mining modeling, each commodity comment is formally represented by its "aspects":

$$\vec{p}_a = (s_{a,1}:v_{a,1},\ s_{a,2}:v_{a,2},\ \ldots,\ s_{a,k}:v_{a,k})$$

where $s_{a,i}$ is an "aspect" belonging to commodity a, and $v_{a,i}$ is the sentiment polarity value of $s_{a,i}$ in the comment, i.e. the sentiment orientation value of that aspect.
The comment typicality calculation problem can be regarded as the following function:

$$\chi: R_i \to T_i$$

where $R_i$ is the set of comments belonging to commodity i, and $T_i$ is the comment set sorted by comment typicality.
For the minimum representative comment set mining problem, according to multi-prototype theory, the commodity comments are first clustered, and then one comment prototype is extracted from each comment cluster to represent that class of comments. Therefore, all comments on commodity x can be represented by n comment prototypes, that is:

$$\vec{t}_c = (\vec{t}_{c,1},\ \vec{t}_{c,2},\ \ldots,\ \vec{t}_{c,n})$$

where $\vec{t}_{c,j}$ is a comment prototype, which can be expressed as:

$$\vec{t}_{c,j} = (s_{j,1}^c:v_{j,1}^c,\ s_{j,2}^c:v_{j,2}^c,\ \ldots,\ s_{j,m}^c:v_{j,m}^c)$$

The minimum representative comment set mining problem can be expressed as a function:

$$\theta: R_i \to L_i$$

where $R_i$ is the set of comments belonging to commodity i, and $L_i$ is the minimum representative comment set of commodity i.
In the present embodiment, typicalness comment prototype automatically builds and comprises the following steps:
1. for the Central Tendency of an object in concept, determined by two aspects: the similarity (internal similarity) of other objects in this object and concept, and and other concepts in the dissimilarity (outside dissimilarity) of object.Internal similarity can be expressed as:
&beta; ( p &RightArrow; a , t &RightArrow; c ) = sim ( p &RightArrow; a , t &RightArrow; c , s )
Wherein, represent an object a, represent a concept c, the prototype s the most similar with a.
2. in order to weigh the similarity of prototype and object, the present invention takes different similarity functions, such as Cosine similarity function, Jaccard similarity function etc.For the outside dissimilarity of an object in certain concept, it is regarded as the integration value of the dissimilarity of this object and other concepts, uses following formulae discovery:
&delta; ( p &RightArrow; a , t &RightArrow; c ) = &Sigma; x dissimilar ( p &RightArrow; a , t &RightArrow; x , s ) N &Delta; - 1 , x &Element; Candx &NotEqual; c
3. The present invention designs an aggregation function to combine the internal similarity and external dissimilarity, yielding the central tendency of object a in concept c. For the second factor affecting object typicality, frequency of instantiation, the present invention defines a prototype salience vector to represent the frequency of instantiation of the different prototypes within a concept:

$$\vec{w}_c = (w_{c,1}, w_{c,2}, \ldots, w_{c,n}), \quad 0 < w_{c,i} \le 1$$

Each value in this vector is the number of objects in the cluster corresponding to one prototype, as a percentage of the total number of objects belonging to the concept.
4. After obtaining an object's central tendency within a concept and its frequency of instantiation, the two are combined by an aggregation function. This process is formalized as:

$$\tau_c(a) = \Phi(w_{c,s}, \alpha(\vec{p}_a, \vec{t}_c))$$
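The steps above can be sketched as follows. The patent leaves the aggregation functions α and Φ abstract, so the mean and product used below are illustrative stand-ins, and cosine similarity is one of the similarity functions the text mentions:

```python
import math

def cosine(u, v):
    """Cosine similarity of two aspect->weight dicts."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def typicality(obj, concept, other_concepts, weights):
    """Hedged sketch of the typicality computation described above.

    concept: list of prototype dicts; other_concepts: list of prototype
    lists; weights: the prototype salience vector w_c. The choices of
    alpha (mean of beta and delta) and Phi (product with w_{c,s}) are
    illustrative assumptions, not the patent's definitions.
    """
    # Internal similarity beta: similarity to the most similar prototype s.
    sims = [cosine(obj, t) for t in concept]
    s = max(range(len(concept)), key=lambda i: sims[i])
    beta = sims[s]
    # External dissimilarity delta: mean dissimilarity to other concepts' prototypes.
    others = [t for c in other_concepts for t in c]
    delta = sum(1.0 - cosine(obj, t) for t in others) / len(others) if others else 1.0
    alpha = (beta + delta) / 2.0   # illustrative aggregation of beta and delta
    return weights[s] * alpha      # Phi(w_{c,s}, alpha) as a simple product
```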
5. The "category utility" measure from cognitive psychology is adopted as the criterion for designing an automatic concept multi-prototype construction algorithm, so that the generated prototypes all lie at the basic level of the concept. Following cognitive-psychology theory, the category utility over comments is:

$$cu(C, T) = \frac{1}{m}\sum_{k=1}^{m} p(c_k)\left[\frac{\sum_{i=1}^{n_k} w_i\, p(t_i \mid c_k)^2}{n_k} - \frac{\sum_{i=1}^{n} w_i\, p(t_i)^2}{n}\right]$$

where C is the concept set, T is the aspect set, $p(t_i \mid c_k)$ is the probability that a concept $c_k$ has aspect $t_i$, $p(c_k)$ is the probability that an object belongs to concept $c_k$, and $p(t_i)$ is the probability that an object has aspect $t_i$.
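A minimal sketch of the category utility formula above, under two simplifying assumptions that the text does not pin down: the inner sum is taken over the same global aspect set in both terms, and p(c_k) is estimated from cluster sizes:

```python
def category_utility(clusters, aspects, weights):
    """Weighted category utility cu(C, T) over comment clusters.

    clusters: list of clusters, each a list of comments, each comment a
    set of aspects. weights: dict of aspect weights w_i. Assumptions:
    p(c_k) = |c_k| / N, and both inner sums range over `aspects`.
    """
    n_total = sum(len(c) for c in clusters)
    m = len(clusters)
    n = len(aspects)

    def p_t(t, reviews):
        # Probability that a comment in `reviews` has aspect t.
        return sum(t in r for r in reviews) / len(reviews)

    all_reviews = [r for c in clusters for r in c]
    base = sum(weights[t] * p_t(t, all_reviews) ** 2 for t in aspects) / n
    cu = 0.0
    for c in clusters:
        within = sum(weights[t] * p_t(t, c) ** 2 for t in aspects) / n
        cu += (len(c) / n_total) * (within - base)
    return cu / m
```

A clustering that separates aspects cleanly scores higher than the trivial single cluster, which is what drives the basic-level clustering described next.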
6. The mining algorithm for basic-level concepts is shown in Table 1; it automatically mines the basic sub-concepts of a given concept. With this algorithm, all comments on a commodity are automatically clustered into several classes according to the change in category utility, and the clustering with the maximum category utility among the clusterings formed is finally selected; each class in that clustering belongs to the basic level.
Table 1
In the present embodiment, minimum comment set mining specifically comprises the following steps:
1. Similar comment detection: similar comments are removed from the set. The similar comment detection problem can be formalized as follows. Product reviews (i.e., documents) are represented by a set of vectors D = {d_1, d_2, ..., d_{|D|}}. Each vector d_i = <w_{i,1}, w_{i,2}, ..., w_{i,|L|}> contains |L| term weights, where L is the lexicon (the set of distinct words) of the product review corpus. Each vector is normalized to unit length. If the cosine similarity of a pair of vectors is at least a threshold μ_SIM, the pair is considered a similar comment pair. Given the normalized vectors, the cosine similarity sim(d_i, d_j) of a pair of vectors (d_i, d_j) is their dot product:

$$\mathrm{sim}(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$
2. The given similarity threshold is used to reduce the number of candidate pairs produced in the candidate generation phase. If the weight of the largest term in a document vector is too small, its cosine similarity with any vector will be below the given threshold, so the similarity computation need not be invoked at all. Given two document vectors d_i, d_j ∈ D, let ||d_i||_1 denote the 1-norm of a vector. Further, the maximum term weight in a vector d_j is $w_j^{\max} = \max(\{w_{j,1}, w_{j,2}, \ldots, w_{j,|L|}\})$, also called the ∞-norm ||d_j||_∞ of the vector. The following inequality then holds for sim(d_i, d_j):

$$\mathrm{sim}(d_i, d_j) \le \min\big(\,\|d_i\|_1 \times \|d_j\|_\infty,\; \|d_j\|_1 \times \|d_i\|_\infty\,\big)$$

If ||d_i||_1 × ||d_j||_∞ < μ_SIM, then certainly sim(d_i, d_j) < μ_SIM, i.e., d_i and d_j are not a similar comment pair.
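The pruning bound above follows from sim(d_i, d_j) = Σ_t w_{i,t} w_{j,t} ≤ ||d_j||_∞ Σ_t w_{i,t} = ||d_i||_1 × ||d_j||_∞ (and symmetrically). A small sketch, with dense lists standing in for the sparse vectors:

```python
def cosine_sim(di, dj):
    # Vectors are assumed already normalized to unit length, so the
    # cosine similarity is simply the dot product.
    return sum(a * b for a, b in zip(di, dj))

def norm_prune_ok(di, dj, mu_sim):
    """Return True if the pair must still be checked: the upper bound
    min(||d_i||_1 * ||d_j||_inf, ||d_j||_1 * ||d_i||_inf) reaches mu_sim.
    Return False when the bound already rules the pair out."""
    bound = min(sum(di) * max(dj), sum(dj) * max(di))
    return bound >= mu_sim
```

Pairs for which `norm_prune_ok` is False can be skipped without ever computing the dot product.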
In the present embodiment, the BigSimDet parallel algorithm comprises the following steps:
1. A first MapReduce job is invoked to compute the 1-norm and ∞-norm value of each document vector. All vectors are then arranged in ascending order of their 1-norm values and placed into groups according to a predefined group size τ_g. The maximum group size τ_g is one of the system's parameters. In general, a small τ_g costs more time in group construction but allows more dissimilar vector-group relations to be found. Figure 2 shows the parallel processing scheme of this first MapReduce job. The whole document collection D is divided evenly into input partitions. Each Mapper processes the document vectors of its own partition in parallel (e.g., computes the 1-norm and ∞-norm values). The intermediate key-value pairs generated by the Mappers (i.e., a 1-norm value and a document ID) are shuffled and fed into the Reducers. By using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key in the first Reducer, narrowing the processing range; the narrowed data are then fed into the second Reducer for processing, and so on.
The details of this first MapReduce job are shown in Table 2; it is responsible for sorting and grouping the document vectors in parallel. Each Mapper reads input key-value pairs (i.e., a document ID and a document vector) from its input partition. The Mapper then produces the corresponding document vector of normalized term weights and computes the 1-norm ||d_i||_1 and ∞-norm ||d_i||_∞ values. Finally, it outputs an intermediate key-value pair (i.e., the 1-norm value as key, and a tuple of document ID and ∞-norm as value). Each Reducer takes the sorted intermediate key-value pairs as input, sorts the documents by 1-norm value, and packs them according to the group size parameter τ_g. Finally, it outputs key-value pairs, each comprising a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
Table 2
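Since Table 2 is not reproduced in this text, the first job can be sketched as a sequential simulation of its Map, shuffle, and Reduce phases; the packing of sorted documents into groups of at most τ_g is an illustrative reading of the grouping step:

```python
def job1_sort_and_group(docs, tau_g):
    """Simulate the first MapReduce job: compute per-document 1-norm and
    inf-norm, totally order documents by ascending 1-norm, and pack them
    into groups of at most tau_g documents.

    docs: {doc_id: list of term weights}. Returns a list of group records.
    """
    # Map phase: emit (1-norm, (doc_id, inf-norm)) per document.
    intermediate = [(sum(v), (doc_id, max(v))) for doc_id, v in docs.items()]
    # Shuffle/sort: total order on the 1-norm key (TotalOrderPartitioner's role).
    intermediate.sort(key=lambda kv: kv[0])
    # Reduce phase: pack the sorted documents into groups of size tau_g.
    groups = []
    for i in range(0, len(intermediate), tau_g):
        chunk = intermediate[i:i + tau_g]
        groups.append({
            "group_id": len(groups),
            "docs": [doc_id for _, (doc_id, _) in chunk],
            "l1": [l1 for l1, _ in chunk],
            "linf": [inf for _, (_, inf) in chunk],
        })
    return groups
```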
2. Dissimilar document groups are identified, saving a large amount of time otherwise spent on unnecessary document similarity computations. Figure 3 illustrates the dissimilar-group search process for a given threshold μ_SIM = 0.5. When a group is found, by examining 1-norms and ∞-norms, to be dissimilar to other comment groups (such as g_3 and g_5 in Figure 3), further dissimilar group pairs can be determined at the same time (e.g., g_5 and g_2, g_5 and g_1). For each group g_i, denote the document with the maximum 1-norm value by $d_i^{L_1}$ and the document with the maximum ∞-norm by $d_i^{L_\infty}$. When determining the dissimilarity relation between two groups, the ∞-norm value in group g_j (e.g., d_10 in g_5) is multiplied by the maximum 1-norm of the other group g_i (e.g., d_6 in g_3). If $\|d_i^{L_1}\|_1 \times \|d_j^{L_\infty}\|_\infty < \mu_{SIM}$, no document in the two groups (e.g., g_3 and g_5) can be similar. Because the preceding MapReduce job produced a totally ordered queue of all document groups, with the documents in each group also arranged in ascending order of 1-norm, every group ranked below g_i (such as g_1 and g_2) is immediately marked dissimilar to g_j as well: for any such group g_{i-k}, $\|d_{i-k}^{L_1}\|_1 \times \|d_j^{L_\infty}\|_\infty < \mu_{SIM}$, because $\|d_{i-k}^{L_1}\|_1 \le \|d_i^{L_1}\|_1$. In practice, the total ordering of document groups in step 1 significantly reduces the number of group comparisons in step 2. Moreover, since the per-group 1-norm and ∞-norm values ($\|d_i^{L_1}\|_1$ and $\|d_j^{L_\infty}\|_\infty$) are computed independently in parallel, evaluating $\|d_{i-k}^{L_1}\|_1 \times \|d_j^{L_\infty}\|_\infty < \mu_{SIM}$ instead of the dot product between $d_{i-k}$ and $d_j$ saves substantial computation time, particularly when processing big data with a very large vocabulary.
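The group-level pruning just described can be sketched over the group records of the first job (the `l1`/`linf` field names are carried over from the earlier sketch and are assumptions of this illustration):

```python
def dissimilar_groups(groups, mu_sim):
    """Find dissimilar group pairs (i, j), i < j, given groups sorted by
    ascending 1-norm. Once the bound max-l1(g_i) * max-linf(g_j) reaches
    mu_sim, every higher-ranked g_i also reaches it, so the scan stops."""
    dissim = set()
    for j, gj in enumerate(groups):
        max_inf_j = max(gj["linf"])
        # Scan from the smallest 1-norms upward; stop at the first
        # potentially similar group.
        for i in range(j):
            if max(groups[i]["l1"]) * max_inf_j < mu_sim:
                dissim.add((i, j))
            else:
                break
    return dissim
```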
3. For each group g_i ∈ G, document partitions are initialized by merging potentially similar document groups in parallel while excluding dissimilar ones; excluding dissimilar document groups greatly reduces the computation time of the similarity comment detection phase. However, the sizes of the initial document partitions may be very uneven, because some document groups may contain many potentially similar groups while others have only a few. Unbalanced partitions cause the execution performance of the parallel similarity comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the compute node whose similarity detection task runs longest.
4. A smoothing task is invoked to split oversized partitions into several smaller ones according to a partition size limit τ_maxG. The value of τ_maxG is determined by the average local memory capacity of the compute cluster nodes. In general, if the local memory capacity of a compute node is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes much more time than a local disk access; as a result, the total execution time may increase greatly. For each group g_i ∈ G whose initial document partition exceeds the τ_maxG limit, g_i is used as a seed and its potentially similar document groups are merged into other partitions such that no generated partition exceeds the limit. Finally, the similar comment detection tasks can be distributed evenly among all compute nodes in the cluster.
The similarity comment detection tasks are processed in parallel by the compute nodes of the distributed cluster. For each document partition, a parallel MapReduce similarity comment detection task is started. Before the similarity computation, an inverted-index job is run in each local partition; the similar comment job runs after the indexing job. Table 3 shows the Map and Reduce functions that compute document vector similarity within each partition. Each Mapper accepts local key-value pairs: the key is a term t, and the value is a list of document IDs with their corresponding normalized term weights. The Map function is invoked so that each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one identical term in the inverted index. For each document vector d_i (with document ID n_i), an associative array H keeps the similarity scores between d_i and its potentially similar documents. Finally, the Map function outputs the document ID n_i together with the associative array H to minimize the number of intermediate keys generated, and these are shuffled. The Reduce function accepts the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays. Finally, the Reduce function outputs key-value pairs comprising the document IDs in the local partition and the final document similarity scores.
Table 3
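Since Table 3 is not reproduced in this text, the per-partition similarity job can be sketched as a sequential simulation: build the inverted index, accumulate partial dot products per pair (the role of the associative array H), then filter by the threshold in the reduce step:

```python
from collections import defaultdict

def similar_pairs_in_partition(docs, mu_sim):
    """Sketch of the per-partition similarity detection job above.

    docs: {doc_id: {term: normalized weight}}, vectors unit-normalized.
    Returns {(id_a, id_b): similarity} for pairs at or above mu_sim.
    """
    # Inverted-index job: term -> postings list of (doc_id, weight).
    index = defaultdict(list)
    for doc_id, vec in docs.items():
        for term, w in vec.items():
            index[term].append((doc_id, w))
    # Map phase: each shared term contributes a partial score to a pair,
    # so only pairs sharing at least one term are ever considered.
    scores = defaultdict(float)  # the role of the associative array H
    for postings in index.values():
        for a in range(len(postings)):
            for b in range(a + 1, len(postings)):
                (di, wi), (dj, wj) = postings[a], postings[b]
                pair = tuple(sorted((di, dj)))
                scores[pair] += wi * wj
    # Reduce phase: the sums are folded in; keep pairs above the threshold.
    return {pair: s for pair, s in scores.items() if s >= mu_sim}
```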
The embodiments described above are preferred embodiments of the present invention, but embodiments of the present invention are not restricted to them; any change, modification, substitution, combination, or simplification made without departing from the spirit and principle of the present invention shall be regarded as an equivalent substitute and is included within the protection scope of the present invention.

Claims (10)

1. A typicality-based big comment data mining method, characterized by comprising the following steps:
(1) comment typicality mining modeling: modeling and formally defining the comment typicality computation and the minimum representative comment set mining problem;
(2) automatic construction of typicality comment prototypes: designing the comment typicality computation method based on the "basic-level concept" theory of cognitive psychology and on multi-prototype theory, with the category utility of "basic-level concept" theory guiding the generation of comment prototypes;
(3) minimum comment set mining: a minimum comment set mining algorithm filters out a minimum comment set with the following features: each comment in the set is distinct and represents the viewpoint of a sizable fraction of users, and the comments in this minimum set together cover and represent the viewpoints of all comments on the commodity, so that a user need browse only the comments in this minimum set to understand the user viewpoints of all comments on the commodity;
(4) adopting the BigSimDet parallel computing method, processing the similarity comment detection tasks in parallel by invoking the compute nodes of a distributed cluster.
2. The typicality-based big comment data mining method according to claim 1, characterized in that, in step (1), the concrete steps of comment typicality mining modeling are:
(1-1) all comments on a commodity x are regarded as one "concept", the "concept" being the comments on commodity x, and each comment is one "instance" of this "concept", so each comment has a different typicality within the "concept"; moreover, a minimum representative comment set is extracted from all comments on commodity x, this comment set having the following two attributes:
(1-1-1) the n comments in the set represent, to the greatest extent possible, the different viewpoints of all users;
(1-1-2) the number of comments n in the set is as small as possible, so that a user need browse only a handful of n comments to understand fairly comprehensively all viewpoints and opinions on commodity x;
(1-2) "aspects" are adopted to represent product comments formally:

$$\vec{p}_a = (s_{a,1}\!:\!v_{a,1},\; s_{a,2}\!:\!v_{a,2},\; \ldots,\; s_{a,k}\!:\!v_{a,k})$$

where s_{a,i} is an "aspect" of commodity a and v_{a,i} is the sentiment polarity value of s_{a,i} in the comment, i.e., the sentiment orientation value of that aspect;
(1-3) the comment typicality computation problem can be regarded as a function:

$$\chi: R_i \to T_i$$

where R_i is the set of comments on commodity i and T_i is that set sorted by comment typicality;
for the minimum representative comment set mining problem, according to multi-prototype theory, the product comments are first clustered, and one comment prototype is then extracted from each comment cluster to represent that cluster; all comments on commodity x can therefore be represented by n comment prototypes:

$$\vec{t}_c = (\vec{t}_{c,1}, \vec{t}_{c,2}, \ldots, \vec{t}_{c,n})$$

where $\vec{t}_{c,j}$ is a comment prototype, which can be expressed as:

$$\vec{t}_{c,j} = (s^c_{j,1}\!:\!v^c_{j,1},\; s^c_{j,2}\!:\!v^c_{j,2},\; \ldots,\; s^c_{j,m}\!:\!v^c_{j,m})$$

(1-4) the minimum representative comment set mining problem can be expressed as a function:

$$\theta: R_i \to L_i$$

where R_i is the set of comments on commodity i and L_i is the minimum representative comment set of commodity i.
3. The typicality-based big comment data mining method according to claim 1, characterized in that, in step (2), the concrete method for automatic construction of typicality comment prototypes is:
(2-1) in multi-prototype theory, a concept can be represented by multiple abstract objects, called prototypes; each prototype represents one group of similar objects and is an abstract representation of those objects;
(2-2) according to cognitive-psychology research, the typicality of an object is determined by two factors: central tendency and frequency of instantiation; the former is the similarity between the object and the members of the concept, i.e., if an object is very similar to the other object instances within a concept and very dissimilar to the concepts outside it, its typicality in that concept is high; the latter is the frequency with which a person encounters the object and classifies it into the concept;
(2-3) the central tendency of an object within a concept is determined by two aspects: its similarity to the other objects in the concept, i.e., the internal similarity, and its dissimilarity to the objects in other concepts, i.e., the external dissimilarity; the internal similarity can be expressed as:

$$\beta(\vec{p}_a, \vec{t}_c) = \mathrm{sim}(\vec{p}_a, \vec{t}_{c,s})$$

where $\vec{p}_a$ represents an object a, $\vec{t}_c$ represents a concept c, and $\vec{t}_{c,s}$ is the prototype of c most similar to a;
(2-4) to measure the similarity between a prototype and an object, different similarity functions may be adopted; the external dissimilarity of an object with respect to a concept is taken as the aggregated dissimilarity between the object and the other concepts, computed with the following formula:

$$\delta(\vec{p}_a, \vec{t}_c) = \frac{\sum_{x} \mathrm{dissimilar}(\vec{p}_a, \vec{t}_{x,s})}{N_\Delta - 1}, \quad x \in C \text{ and } x \neq c$$

(2-5) the internal similarity and external dissimilarity are combined by an aggregation function, yielding the central tendency of object a in concept c;
for the multiple objects belonging to a concept, these objects are clustered, yielding multiple prototype representations of the concept; for the second factor affecting object typicality, frequency of instantiation, a prototype salience vector is defined to represent the frequency of instantiation of the different prototypes in a concept:

$$\vec{w}_c = (w_{c,1}, w_{c,2}, \ldots, w_{c,n}), \quad 0 < w_{c,i} \le 1$$

each value in this vector being the number of objects in the cluster corresponding to one prototype as a percentage of the total number of objects belonging to the concept;
after obtaining an object's central tendency within a concept and its frequency of instantiation, the two are combined by an aggregation function, formalized as:

$$\tau_c(a) = \Phi(w_{c,s}, \alpha(\vec{p}_a, \vec{t}_c))$$

(2-6) in the course of building object prototypes, the objects need to be clustered, and the multiple prototypes of a concept are obtained by an automatic object clustering method; the "category utility" of cognitive psychology is adopted as the criterion for designing an automatic concept multi-prototype construction algorithm, so that the generated prototypes all lie at the basic level of the concept, the basic level being a special conceptual partition that arises in the human process of cognition; following cognitive-psychology theory, the category utility over comments is:

$$cu(C, T) = \frac{1}{m}\sum_{k=1}^{m} p(c_k)\left[\frac{\sum_{i=1}^{n_k} w_i\, p(t_i \mid c_k)^2}{n_k} - \frac{\sum_{i=1}^{n} w_i\, p(t_i)^2}{n}\right]$$

where C is the concept set, T is the aspect set, $p(t_i \mid c_k)$ is the probability that a concept $c_k$ has aspect $t_i$, $p(c_k)$ is the probability that an object belongs to concept $c_k$, and $p(t_i)$ is the probability that an object has aspect $t_i$;
(2-7) the basic-level concept mining algorithm is adopted to automatically mine the basic sub-concepts of a given concept; with this algorithm, all comments on a commodity are automatically clustered into several classes according to the change in category utility, and the clustering with the maximum category utility among the clusterings formed is finally selected, each class in that clustering belonging to the basic level.
4. The typicality-based big comment data mining method according to claim 1, characterized in that, in step (3), the concrete steps of the minimum comment set mining are:
(3-1) similar comment detection is performed, and similar comments are removed from the set; the similar comment detection problem can be formalized as follows: product reviews are represented by a set of vectors D = {d_1, d_2, ..., d_{|D|}}; each vector d_i = <w_{i,1}, w_{i,2}, ..., w_{i,|L|}> contains |L| term weights, where L is the lexicon of the product review corpus, the lexicon being the set of distinct words; each vector is normalized to unit length; if the cosine similarity of a pair of vectors is at least the threshold μ_SIM, the pair is considered a similar comment pair; given the normalized vectors, the cosine similarity sim(d_i, d_j) of a pair of vectors (d_i, d_j) is their dot product:

$$\mathrm{sim}(d_i, d_j) = \sum_{t=1}^{|L|} w_{i,t} \cdot w_{j,t}$$

(3-2) Definition 1 (similar comment detection problem): given the set D of text vectors representing the corresponding product reviews and the similarity threshold μ_SIM, the similar comment detection problem is to determine the comment pairs d_i, d_j ∈ D with similarity sim(d_i, d_j) ≥ μ_SIM;
(3-3) the given similarity threshold is used to reduce the number of candidate pairs produced in the candidate generation phase: if the weight of the largest term in a document vector is too small, its cosine similarity with any vector will be below the given threshold, so the similarity computation need not be invoked; given two document vectors d_i, d_j ∈ D, let ||d_i||_1 denote the 1-norm of a vector; further, the maximum term weight in a vector d_j is $w_j^{\max} = \max(\{w_{j,1}, w_{j,2}, \ldots, w_{j,|L|}\})$, also called the ∞-norm ||d_j||_∞ of the vector; the following inequality then holds for sim(d_i, d_j):

$$\mathrm{sim}(d_i, d_j) \le \min\big(\,\|d_i\|_1 \times \|d_j\|_\infty,\; \|d_j\|_1 \times \|d_i\|_\infty\,\big)$$

if ||d_i||_1 × ||d_j||_∞ < μ_SIM, then certainly sim(d_i, d_j) < μ_SIM, i.e., d_i and d_j are not a similar comment pair.
5. The typicality-based big comment data mining method according to claim 1, characterized in that, in step (4), the concrete steps of the BigSimDet parallel computing method are:
(4-1) compute the 1-norm ||d_i||_1 and ∞-norm ||d_i||_∞ values of each document vector d_i ∈ D, and distribute the vectors to groups according to the group size parameter τ_g of the MapReduce parallel computing model; establish an ordering such that each vector group g_i ∈ G is sorted by 1-norm value;
(4-2) for each pair of vector groups g_i, g_j ∈ G, determine their dissimilarity relation by assessing whether the product of the maximum 1-norm value of g_i and the maximum ∞-norm value of g_j reaches the threshold μ_SIM;
(4-3) for each group g_i ∈ G, an initial MapReduce partition computes the similar pairs of g_i and the other vector groups; in this process, the dissimilarity relations and the partition size limit τ_maxG also need to be considered;
(4-4) execute a MapReduce task according to the partition size limit τ_maxG: if a vector group g_i ∈ G has multiple potentially similar vector groups, g_i is used as a seed, and additional partitions are formed according to the partition size limit τ_maxG;
(4-5) for each partition, start a parallel MapReduce processing task to detect similar comments.
6. The typicality-based big comment data mining method according to claim 5, characterized in that, in step (4-1), the concrete implementation steps of the MapReduce parallel computing model are:
a MapReduce job is invoked to compute the 1-norm and ∞-norm value of each document vector; all vectors are then arranged in ascending order of their 1-norm values and placed into groups according to a predefined group size τ_g, the maximum group size τ_g being one of the system's parameters; the whole document collection D is divided evenly into input partitions; each Mapper processes the document vectors of its own partition in parallel; the intermediate key-value pairs generated by the Mappers, i.e., a 1-norm value and a document ID, are shuffled and fed into the Reducers; by using the custom partitioner named "TotalOrderPartitioner" in MapReduce, the intermediate key-value pairs are sorted by key in the first Reducer, narrowing the processing range, and the narrowed data are fed into the second Reducer for processing, and so on;
the concrete implementation process of the MapReduce job is as follows: each Mapper reads input key-value pairs, i.e., a document ID and a document vector, from its input partition; the Mapper then produces the corresponding document vector of normalized term weights; the Mapper then computes the 1-norm ||d_i||_1 and ∞-norm ||d_i||_∞ values; finally, it outputs an intermediate key-value pair, i.e., the 1-norm value and a tuple of document ID and ∞-norm; furthermore, each Reducer takes the sorted intermediate key-value pairs as input, then sorts the documents by 1-norm value and packs them according to the group size parameter τ_g; finally, it outputs key-value pairs, each comprising a group ID and a tuple containing the ordered document ID list with the corresponding 1-norm and ∞-norm values.
7. The typicality-based big comment data mining method according to claim 5, characterized in that step (4-2) is specifically:
when a group is found, by weighing 1-norms and ∞-norms, to be dissimilar to other comment groups, further dissimilar document groups can be determined at the same time; for each group g_i, the document with the maximum 1-norm value is denoted $d_i^{L_1}$ and the document with the maximum ∞-norm is denoted $d_i^{L_\infty}$; when determining the dissimilarity relation between two groups, the ∞-norm value in group g_j is multiplied by the maximum 1-norm of the other group g_i; if $\|d_i^{L_1}\|_1 \times \|d_j^{L_\infty}\|_\infty < \mu_{SIM}$, no document in the two groups can be similar; because the preceding MapReduce job produced a totally ordered queue of all document groups, with the documents in each group also arranged in ascending order of 1-norm value, every group ranked below g_i (such as g_1 and g_2) is marked dissimilar to g_j as well.
8. The typicality-based big comment data mining method according to claim 5, characterized in that step (4-3) is specifically:
for each group g_i ∈ G, document partitions are initialized by merging potentially similar document groups in parallel while excluding dissimilar ones; excluding dissimilar document groups greatly reduces the computation time of the similarity comment detection phase; however, the sizes of the initial document partitions may be very uneven, because some document groups may contain many potentially similar groups while others have only a few; unbalanced partitions cause the execution performance of the parallel similarity comment detection tasks under MapReduce to decline dramatically, because the total execution time of a similar comment detection job is governed by the compute node whose similarity detection task runs longest.
9. The typicality-based big comment data mining method according to claim 5, characterized in that, in step (4-4),
a smoothing task is invoked to split oversized partitions into several smaller ones according to the partition size limit τ_maxG; the value of τ_maxG is determined by the average local memory capacity of the compute cluster nodes; if the local memory capacity of a compute node is exceeded, document groups must be retrieved from remote nodes, and each remote disk access consumes much more time than a local disk access, so the total execution time may increase greatly; for each group g_i ∈ G whose initial document partition exceeds the τ_maxG limit, g_i is used as a seed and its potentially similar document groups are merged into other partitions such that no generated partition exceeds the limit; finally, the similar comment detection tasks can be distributed evenly among all compute nodes in the cluster.
10. The typicality-based big comment data mining method according to claim 5, characterized in that, in step (4-5),
the similarity comment detection tasks are processed in parallel by the compute nodes of the distributed cluster; for each document partition, a parallel MapReduce similarity comment detection task is started; before the similarity computation, an inverted-index job is run in each local partition, and the similar comment job runs after the indexing job; for the Map and Reduce functions that compute document vector similarity within each partition: each Mapper accepts local key-value pairs, the key being a term t and the value a list of document IDs with their corresponding normalized term weights; the Map function is invoked so that each pair of document vectors is computed only once, and the similarity between two vectors is computed only when they share at least one identical term in the inverted index; for each document vector d_i, an associative array H keeps the similarity scores between d_i and its potentially similar documents; finally, the Map function outputs the document ID n_i together with the associative array H to minimize the number of intermediate keys generated, and these are shuffled; furthermore, the Reduce function accepts the intermediate key-value pairs and sums the local similarity scores stored in the associative arrays; finally, the Reduce function outputs key-value pairs comprising the document IDs in the local partition and the final document similarity scores.
CN201410796566.XA 2014-12-18 2014-12-18 Comment big data method for digging based on typicalness Active CN104462480B (en)

Publications (2)

Publication Number Publication Date
CN104462480A true CN104462480A (en) 2015-03-25
CN104462480B CN104462480B (en) 2017-11-10

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955957A (en) * 2016-05-05 2016-09-21 北京邮电大学 Determining method and device for aspect score in general comment of merchant
CN106446264A (en) * 2016-10-18 2017-02-22 哈尔滨工业大学深圳研究生院 Text representation method and system
CN109903851A (en) * 2019-01-24 2019-06-18 暨南大学 A kind of automatic Observation technology of the psychological abnormality variation based on social networks

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102945268A (en) * 2012-10-25 2013-02-27 北京腾逸科技发展有限公司 Method and system for excavating comments on characteristics of product
US20130080208A1 (en) * 2011-09-23 2013-03-28 Fujitsu Limited User-Centric Opinion Analysis for Customer Relationship Management
CN103399916A (en) * 2013-07-31 2013-11-20 清华大学 Internet comment and opinion mining method and system on basis of product features
CN103778214A (en) * 2014-01-16 2014-05-07 北京理工大学 Commodity property clustering method based on user comments

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
GERNOT HORSTMANN: "Facial Expressions of Emotion: Does the Prototype Represent Central Tendency, Frequency of Instantiation, or an Ideal?", 《EMOTION》 *
LAWRENCE W. BARSALOU: "Ideals, Central Tendency, and Frequency of Instantiation as Determinants of Graded Structure in Categories", 《JOURNAL OF EXPERIMENTAL PSYCHOLOGY: LEARNING, MEMORY, AND COGNITION》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105955957A (en) * 2016-05-05 2016-09-21 北京邮电大学 Determining method and device for aspect score in general comment of merchant
CN105955957B (en) * 2016-05-05 2019-01-25 北京邮电大学 The determination method and device that aspect scores in a kind of businessman's general comment
CN106446264A (en) * 2016-10-18 2017-02-22 哈尔滨工业大学深圳研究生院 Text representation method and system
CN106446264B (en) * 2016-10-18 2019-08-27 哈尔滨工业大学深圳研究生院 Document representation method and system
CN109903851A (en) * 2019-01-24 2019-06-18 暨南大学 A kind of automatic Observation technology of the psychological abnormality variation based on social networks
CN109903851B (en) * 2019-01-24 2023-05-23 暨南大学 Automatic observation method for psychological abnormal change based on social network

Also Published As

Publication number Publication date
CN104462480B (en) 2017-11-10

Similar Documents

Publication Publication Date Title
US10019442B2 (en) Method and system for peer detection
Érdi et al. Prediction of emerging technologies based on analysis of the US patent citation network
Halibas et al. Determining the intervening effects of exploratory data analysis and feature engineering in telecoms customer churn modelling
Thao et al. A new multi-criteria decision making algorithm for medical diagnosis and classification problems using divergence measure of picture fuzzy sets
CN105786810B (en) The method for building up and device of classification mapping relations
CN109783633A (en) Data analysis service procedural model recommended method
Kaur Outlier detection using kmeans and fuzzy min max neural network in network data
Jauhar et al. Supply chain and the sustainability management: selection of suppliers for sustainable operations in the manufacturing industry
CN104462480A (en) Typicality-based big comment data mining method
Peng et al. Optimization research of decision support system based on data mining algorithm
Canetta* et al. Applying two-stage SOM-based clustering approaches to industrial data analysis
Abbasimehr et al. Trust prediction in online communities employing neurofuzzy approach
Kumar Efficient k-mean clustering algorithm for large datasets using data mining standard score normalization
Shi et al. Research on Fast Recommendation Algorithm of Library Personalized Information Based on Density Clustering.
Huang et al. Two-stage fuzzy cross-efficiency aggregation model using a fuzzy information retrieval method
Singh et al. Knowledge based retrieval scheme from big data for aviation industry
Ji A heuristic collaborative filtering recommendation algorithm based on book personalized Recommendation
Gavrilenko et al. Application of association rules for formation of public (administrative) services portfolio
Best et al. Atypical behavior identification in large-scale network traffic
Badyal et al. Insightful Business Analytics Using Artificial Intelligence-A Decision Support System for E-Businesses
Fisun et al. Methods of Searching for Association Dependencies in Multidimensional Databases
Matsunaga et al. Data mining applications and techniques: A systematic review
Kaur et al. Ranking based comparative analysis of graph centrality measures to detect negative nodes in online social networks
Osial et al. Smartphone recommendation system using web data integration techniques
Han et al. Information Flow Monitoring System

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant