CN104615608A - Data mining processing system and method - Google Patents

Publication number
CN104615608A
Authority
CN
China
Prior art keywords
data
vector
customer relationship
indicative character
seed words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410174489.4A
Other languages
Chinese (zh)
Other versions
CN104615608B (en)
Inventor
余建兴
高瀚
司徒志远
黄华伟
高岩
贺鹏
陈川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410174489.4A
Publication of CN104615608A
Application granted
Publication of CN104615608B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data mining processing system and method. The system comprises a data acquiring unit, a data classifying unit and a data processing unit. The data acquiring unit acquires data and outputs the data to the data classifying unit; the data are divided into multiple data types that can characterize, from different dimensions, the user relationships with indicative features in a user relationship chain. The data classifying unit comprehensively analyzes the multiple types of data according to a classification policy, so as to extract the user relationships with indicative features from the data, and outputs them to the data processing unit. The data processing unit collects information according to the user relationships with indicative features and sends recommendation information according to the analysis result of that information.

Description

Data mining processing system and method
Technical field
The present invention relates to data mining techniques for internet communication, and in particular to a data mining processing system and method.
Background
In the course of implementing the technical solutions of the embodiments of this application, the inventors found at least the following technical problems in the related art:
With the rapid development of internet technology and changes in social structure, more and more people communicate, make contact and socialize over the network and on mobile phones. This produces a massive volume of interpersonal interaction behavior, from which multiple types of relationship chains between users can be obtained. These relationship chains can be applied in different social fields, and service providers serve users through various applications, for example a meal-ordering application on a mobile client.
Analyzing the multiple types of relationship chains between users gives a better understanding of user needs and so allows better service: for example, recommending the shopping apps a user needs, guiding the user to the goods they are looking for, or recommending restaurants, restaurant specialties or health products. In short, once these relationship chains are obtained accurately, a database built on them can be used to serve the user well and to recommend useful applications accurately. At the same time, while providing the service, the service provider can update the application's own database by evaluating the recommendations and the user's purchasing power.
Among the multiple types of relationship chains between users there are user relationships with indicative features, for example relationships that indicate kinship. Users who are relatives may be interested in the services offered by the same application or the same class of application, so kinship plays a decisive role in improving an application's database and, through that database, in recommending information to users accurately. If the user relationships with indicative features in a user relationship chain can be mined, they can be used as valid data to improve data validity, which avoids the data redundancy caused by large amounts of invalid data occupying the database and in turn makes recommendations to users more accurate. How to mine such user relationships so as to improve the accuracy of the information recommended to users is the technical problem to be solved.
However, mining user relationships with indicative features from the vast data of internet communication looks simple but is not easy in practice, and ensuring that the mined relationships are accurate is harder still. Taking kinship as the indicative feature again, the prior art relies on simple keyword matching. For example, if a user's address book has one contact remarked as "father" and another remarked as "aunt", those two contacts may be relatives of each other. There are, however, many words that express kinship, and "father" alone can be written in several different ways, so keyword matching can hardly enumerate every possible keyword. The related art therefore offers no effective solution to this problem.
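The limitation of keyword matching described above can be illustrated with a minimal Python sketch. The keyword set, the sample remarks and the function name `is_kinship_remark` are hypothetical and chosen only for illustration; they do not come from the patent text.

```python
# Hypothetical keyword list; a real deployment would need far more entries,
# which is exactly the enumeration problem described in the background.
KINSHIP_KEYWORDS = {"father", "aunt"}


def is_kinship_remark(remark: str) -> bool:
    """Return True when the contact remark contains a listed kinship keyword."""
    words = remark.lower().split()
    return any(word in KINSHIP_KEYWORDS for word in words)


remarks = ["Father", "my aunt", "dad", "papa"]
matched = [r for r in remarks if is_kinship_remark(r)]
missed = [r for r in remarks if not is_kinship_remark(r)]
```

Here "dad" and "papa" are missed even though they express the same relation as "father", because they were never added to the keyword list.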
Summary of the invention
In view of this, the embodiments of the present invention aim to provide a data mining processing system and method that can mine, from the vast data of internet communication, the specific user relationships with indicative features in a user relationship chain, so as to improve the accuracy of the information recommended to users.
The technical solutions of the embodiments of the present invention are implemented as follows:
A data mining processing system according to an embodiment of the present invention comprises a data acquiring unit, a data classifying unit and a data processing unit, wherein:
The data acquiring unit is configured to acquire data and output the data to the data classifying unit; the data are divided into multiple data types that can characterize, from different dimensions, the user relationships with indicative features in a user relationship chain;
The data classifying unit is configured to comprehensively analyze the multiple data types according to a classification policy, so as to extract the user relationships with indicative features from the data, and to output those relationships to the data processing unit;
The data processing unit is configured to collect information according to the user relationships with indicative features and to send recommendation information according to the analysis result of that information.
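The three units above can be pictured as a small pipeline. The following is a minimal sketch, assuming each unit can be modelled as a plain callable; the class name `MiningSystem`, the record fields `remark` and `chat`, and the sample data are hypothetical and not taken from the patent.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]    # (user_a, user_b)
Record = Dict[str, str]   # data type name -> raw value (field names are hypothetical)


@dataclass
class MiningSystem:
    acquire: Callable[[], Dict[Pair, Record]]  # stands in for the data acquiring unit
    classify: Callable[[Record], bool]         # stands in for the data classifying unit
    recommend: Callable[[Pair], str]           # stands in for the data processing unit

    def run(self) -> List[str]:
        data = self.acquire()
        # Keep only the user pairs whose record is classified as an indicative relationship.
        relations = sorted(pair for pair, record in data.items() if self.classify(record))
        return [self.recommend(pair) for pair in relations]


def acquire_sample() -> Dict[Pair, Record]:
    """Toy data source with two user pairs, each described by two data types."""
    return {
        ("u1", "u2"): {"remark": "father", "chat": "see you at dinner"},
        ("u1", "u3"): {"remark": "plumber", "chat": "invoice attached"},
    }


system = MiningSystem(
    acquire=acquire_sample,
    classify=lambda record: record["remark"] in {"father", "mother"},
    recommend=lambda pair: f"recommend family plan to {pair[0]} and {pair[1]}",
)
messages = system.run()
```

The classifier here is a trivial placeholder; the preferred embodiments that follow describe how the classifying unit is actually meant to work.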
Preferably, the multiple data types comprise at least two of: data characterizing user personal attributes, data characterizing the user's social topology, and data characterizing user interaction behavior.
Preferably, the data classifying unit comprises:
A policy selection subunit, configured to parse the characteristic parameters of the multiple data types and, when the characteristic parameter of every data type is below a preset threshold, to determine that the data are short text data and to select a first strategy as the classification policy;
A strategy execution subunit, configured, when the first strategy is used to identify the user relationships with indicative features in the short text data, to randomly extract seed words that can characterize such relationships, to take the seed words as the reference benchmark, and to compare the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships with indicative features from the data.
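The policy selection step can be sketched as a threshold test. This is a minimal sketch under stated assumptions: the characteristic parameter is taken to be the average token count per sample (the text does not define it here), the threshold value is invented, and the "second" branch reflects the mixed short/long case that the patent describes in a later preferred embodiment.

```python
from typing import Dict, List

LENGTH_THRESHOLD = 10  # hypothetical preset threshold, in tokens


def characteristic_parameter(samples: List[str]) -> float:
    """Average number of whitespace-separated tokens per sample (an assumed metric)."""
    if not samples:
        return 0.0
    return sum(len(s.split()) for s in samples) / len(samples)


def select_strategy(data_by_type: Dict[str, List[str]]) -> str:
    """Pick a classification policy from the per-data-type characteristic parameters."""
    params = [characteristic_parameter(v) for v in data_by_type.values()]
    # Values equal to the threshold are treated as long here; the text leaves that case open.
    is_short = [p < LENGTH_THRESHOLD for p in params]
    if all(is_short):
        return "first"        # every data type is short text
    if any(is_short):
        return "second"       # mix of short and long text
    return "unspecified"      # all-long case is not covered by the text


short_only = {"remark": ["father", "aunt"], "nickname": ["big bro"]}
mixed = {
    "remark": ["father", "aunt"],
    "chat": ["are you coming home for dinner with grandma this weekend or next"],
}
```

With these inputs, `select_strategy(short_only)` yields the first strategy and `select_strategy(mixed)` yields the second.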
Preferably, the strategy execution subunit comprises:
A vector generation module, configured to represent the data as vectors in a vector space according to the vector space model; each word in the data serves as one dimension of the vector, and the total number of dimensions equals the total number of words in the data;
A classification training module, configured to determine a separating plane according to the positions, in the vector space, of the data vectors and of the vectors corresponding to the seed words, so as to identify the user relationships with indicative features;
An analysis result output module, configured to output the identified user relationships with indicative features.
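The vector generation and classification training modules can be illustrated with a bag-of-words representation and a linear classifier. The sketch below is not the patent's training procedure: it substitutes a plain perceptron as one simple way to find a separating plane, and the labelled seed texts (including the negative examples) are hypothetical.

```python
from typing import Dict, List, Tuple


def build_vocabulary(texts: List[str]) -> Dict[str, int]:
    """Assign one vector dimension to each distinct word in the data."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(text: str, vocab: Dict[str, int]) -> List[float]:
    """Vector space model: word counts, one dimension per vocabulary word."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec


def train_perceptron(vectors: List[List[float]], labels: List[int],
                     epochs: int = 20) -> Tuple[List[float], float]:
    """Learn weights and bias of a separating plane; labels are +1 or -1."""
    weights, bias = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * score <= 0:  # misclassified or on the plane: move the plane
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
    return weights, bias


def predict(text: str, vocab: Dict[str, int], weights: List[float], bias: float) -> bool:
    x = to_vector(text, vocab)
    return sum(w * xi for w, xi in zip(weights, x)) + bias > 0


# Hypothetical seed texts: +1 marks kinship, -1 marks other relationships.
seeds = [("father", 1), ("dear mother", 1), ("colleague", -1), ("gym buddy", -1)]
vocab = build_vocabulary([text for text, _ in seeds])
weights, bias = train_perceptron([to_vector(t, vocab) for t, _ in seeds],
                                 [label for _, label in seeds])
```

Note that the vocabulary, and hence the vector dimension, grows with the number of distinct words, which is the property the alternative fixed-dimension embodiment avoids.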
Preferably, the strategy execution subunit comprises:
A vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and the vector space model; the fixed dimension is obtained from the contextual information of each word in the data;
A classification training module, configured to determine a separating plane according to the positions, in the vector space, of the data vectors and of the vectors corresponding to the seed words, so as to identify the user relationships with indicative features;
An analysis result output module, configured to output the identified user relationships with indicative features.
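This passage does not say how the fixed-dimension vectors are derived from context. As a stand-in only, the sketch below builds each word's vector from the words that appear next to it, hashed into a fixed number of buckets, and averages word vectors to represent a text. The dimension `DIM`, the window size and the hashing scheme are all assumptions made for illustration.

```python
import zlib
from collections import defaultdict
from typing import Dict, List

DIM = 8  # hypothetical preset fixed dimension


def bucket(word: str) -> int:
    """Deterministically map a context word to one of DIM buckets."""
    return zlib.crc32(word.encode("utf-8")) % DIM


def context_vectors(corpus: List[str], window: int = 1) -> Dict[str, List[float]]:
    """Describe each word by the hashed counts of its neighbouring words."""
    vectors: Dict[str, List[float]] = defaultdict(lambda: [0.0] * DIM)
    for text in corpus:
        tokens = text.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][bucket(tokens[j])] += 1.0
    return dict(vectors)


def text_vector(text: str, word_vecs: Dict[str, List[float]]) -> List[float]:
    """Average the word vectors of known words; the result always has DIM entries."""
    known = [t for t in text.lower().split() if t in word_vecs]
    total = [0.0] * DIM
    for token in known:
        total = [a + b for a, b in zip(total, word_vecs[token])]
    return [v / len(known) for v in total] if known else total


word_vecs = context_vectors(["my father called", "my dad called"])
```

Because "father" and "dad" occur in the same contexts in this toy corpus, they receive identical vectors, which hints at why context-based vectors can help with synonyms that keyword lists miss.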
Preferably, the data classifying unit comprises:
A policy selection subunit, configured to parse the characteristic parameters of the multiple data types; when the characteristic parameter of some of the data types is below a preset threshold, those data are determined to be short text data, and when the characteristic parameter of some of the data types is above the preset threshold, those data are determined to be long text data; in that case a second strategy is selected as the classification policy;
A strategy execution subunit, configured, when the second strategy is used to identify the user relationships with indicative features in the long text data, to construct seed words from the user relationships with indicative features obtained by applying the first strategy to the short text data, to take the seed words as the reference benchmark, and to compare the data of the multiple data types, as training samples to be analyzed, with the seed words by similarity to perform classification training, so as to identify the user relationships with indicative features from the data.
Preferably, the strategy execution subunit comprises:
A seed word constructing module, configured, when constructing seed words from the user relationships with indicative features obtained by applying the first strategy to the short text data, to take as positive sample seed words the user relationship data pairs that are identified as user relationships with indicative features in multiple dimensions simultaneously, and to take as negative sample seed words the user relationship data pairs that are not identified as such in any dimension;
A vector generation module, configured to represent the data as vectors in a vector space according to the vector space model; each word in the data serves as one dimension of the vector, and the total number of dimensions equals the total number of words in the data;
A classification training module, configured to determine a separating plane according to the positions, in the vector space, of the data vectors and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships with indicative features;
An analysis result output module, configured to output the identified user relationships with indicative features.
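The seed word constructing rule above can be sketched as a simple count over dimensions. In this minimal sketch the dimension names, the user pairs and the `min_dimensions` parameter (set to 2 as one reading of "multiple dimensions simultaneously") are hypothetical.

```python
from typing import Dict, Set, Tuple

Pair = Tuple[str, str]  # (user_a, user_b)


def build_seeds(hits_by_dimension: Dict[str, Set[Pair]], all_pairs: Set[Pair],
                min_dimensions: int = 2) -> Tuple[Set[Pair], Set[Pair]]:
    """Split user pairs into positive and negative seeds by first-strategy results."""
    positive: Set[Pair] = set()
    negative: Set[Pair] = set()
    for pair in all_pairs:
        count = sum(pair in hits for hits in hits_by_dimension.values())
        if count >= min_dimensions:
            positive.add(pair)   # identified in several dimensions at once
        elif count == 0:
            negative.add(pair)   # not identified in any dimension
        # Pairs identified in exactly one dimension are left out of the seeds.
    return positive, negative


# Hypothetical first-strategy output for three short-text dimensions.
hits = {
    "remark": {("u1", "u2"), ("u1", "u4")},
    "group_name": {("u1", "u2")},
    "nickname": set(),
}
pairs = {("u1", "u2"), ("u1", "u3"), ("u1", "u4")}
positive_seeds, negative_seeds = build_seeds(hits, pairs)
```

Pairs with ambiguous evidence, such as `("u1", "u4")` here, are deliberately excluded so that both seed sets stay as clean as possible.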
Preferably, the strategy execution subunit comprises:
A seed word constructing module, configured, when constructing seed words from the user relationships with indicative features obtained by applying the first strategy to the short text data, to take as positive sample seed words the user relationship data pairs that are identified as user relationships with indicative features in multiple dimensions simultaneously, and to take as negative sample seed words the user relationship data pairs that are not identified as such in any dimension;
A vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and the vector space model; the fixed dimension is obtained from the contextual information of each word in the data;
A classification training module, configured to determine a separating plane according to the positions, in the vector space, of the data vectors and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships with indicative features;
An analysis result output module, configured to output the identified user relationships with indicative features.
Preferably, the system further comprises a data diffusion unit located between the data classifying unit and the data processing unit;
The data diffusion unit is configured to further analyze the user relationships with indicative features according to forward and inverse relations and transitive relations, so as to obtain user information related to the user relationships with indicative features.
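The diffusion step can be sketched as rule-based inference over relation triples. This passage does not spell out the rules, so the `INVERSE` and `COMPOSE` tables below are assumptions chosen for illustration (for example, the inverse of "parent" is "child", and a parent of a parent is a grandparent).

```python
from typing import Dict, Set, Tuple

# (user_a, relation, user_b) reads "user_a is the <relation> of user_b".
Triple = Tuple[str, str, str]

# Hypothetical rule tables; the patent text here does not enumerate them.
INVERSE: Dict[str, str] = {"parent": "child", "child": "parent", "sibling": "sibling"}
COMPOSE: Dict[Tuple[str, str], str] = {("parent", "parent"): "grandparent"}


def diffuse(relations: Set[Triple]) -> Set[Triple]:
    """Extend known relations with inverse relations and one round of transitive ones."""
    result = set(relations)
    # Forward/inverse diffusion: if A is the parent of B, then B is the child of A.
    for a, rel, b in relations:
        if rel in INVERSE:
            result.add((b, INVERSE[rel], a))
    # Transitive diffusion: chain A -> B and B -> C into A -> C where a rule exists.
    derived: Set[Triple] = set()
    for a, r1, b in result:
        for b2, r2, c in result:
            if b == b2 and a != c and (r1, r2) in COMPOSE:
                derived.add((a, COMPOSE[(r1, r2)], c))
    return result | derived


known = {("ann", "parent", "bob"), ("bob", "parent", "cat")}
expanded = diffuse(known)
```

Starting from two identified relations, the sketch derives three more: two inverse "child" relations and one "grandparent" relation.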
A data mining processing method according to an embodiment of the present invention comprises:
Acquiring data, the data being divided into multiple data types that can characterize, from different dimensions, the user relationships with indicative features in a user relationship chain;
Comprehensively analyzing the multiple data types according to a classification policy, so as to extract the user relationships with indicative features from the data;
Collecting information according to the user relationships with indicative features, and sending recommendation information according to the analysis result of that information.
Preferably, the multiple data types comprise at least two of: data characterizing user personal attributes, data characterizing the user's social topology, and data characterizing user interaction behavior.
Preferably, comprehensively analyzing the multiple data types according to a classification policy, so as to extract the user relationships with indicative features from the data, comprises:
Parsing the characteristic parameters of the multiple data types and, when the characteristic parameter of every data type is below a preset threshold, determining that the data are short text data and selecting a first strategy as the classification policy;
Executing the first strategy by randomly extracting seed words that can characterize the user relationships with indicative features;
Taking the seed words as the reference benchmark, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships with indicative features from the data.
Preferably, taking the seed words as the reference benchmark and comparing the data with the seed words to perform classification training, so as to identify the user relationships with indicative features from the data, comprises:
Representing the data as vectors in a vector space according to the vector space model; each word in the data serves as one dimension of the vector, and the total number of dimensions equals the total number of words in the data;
Determining a separating plane according to the positions, in the vector space, of the data vectors and of the vectors corresponding to the seed words, so as to identify the user relationships with indicative features.
Preferably, taking the seed words as the reference benchmark and comparing the data with the seed words to perform classification training, so as to identify the user relationships with indicative features from the data, comprises:
Representing the data as vectors in a vector space according to a preset fixed dimension and the vector space model; the fixed dimension is obtained from the contextual information of each word in the data;
Determining a separating plane according to the positions, in the vector space, of the data vectors and of the vectors corresponding to the seed words, so as to identify the user relationships with indicative features.
Preferably, comprehensively analyzing the multiple data types according to a classification policy, so as to extract the user relationships with indicative features from the data, comprises:
Parsing the characteristic parameters of the multiple data types; when the characteristic parameter of some of the data types is below a preset threshold, determining that those data are short text data, and when the characteristic parameter of some of the data types is above the preset threshold, determining that those data are long text data; and selecting a second strategy as the classification policy;
Executing the second strategy by constructing seed words from the user relationships with indicative features obtained by applying the first strategy to the short text data;
Taking the seed words as the reference benchmark, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words by similarity to perform classification training, so as to identify the user relationships with indicative features from the data.
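The similarity comparison in the second strategy can be sketched with cosine similarity over word-count vectors. The text does not name a similarity measure, so cosine similarity and the nearest-seed decision rule are assumptions, and the seed texts are invented examples.

```python
import math
from collections import Counter
from typing import List


def bag(text: str) -> Counter:
    """Word-count representation of a text."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two word-count vectors (0.0 if either is empty)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def is_indicative(text: str, positive_seeds: List[str], negative_seeds: List[str]) -> bool:
    """Label a long text by whichever seed set contains its most similar example."""
    vec = bag(text)
    best_pos = max((cosine(vec, bag(s)) for s in positive_seeds), default=0.0)
    best_neg = max((cosine(vec, bag(s)) for s in negative_seeds), default=0.0)
    return best_pos > best_neg


# Hypothetical seeds, standing in for those built from the short-text results.
positive = ["love you mom", "dinner with dad"]
negative = ["meeting at noon", "invoice attached"]
```

A text that is equally similar to both seed sets, or similar to neither, is treated as not indicative in this sketch.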
Preferably, constructing seed words from the user relationships with indicative features obtained by applying the first strategy to the short text data comprises:
Taking as positive sample seed words the user relationship data pairs that are identified as user relationships with indicative features in multiple dimensions simultaneously, and taking as negative sample seed words the user relationship data pairs that are not identified as such in any dimension.
Preferably, taking the seed words as the reference benchmark and comparing the data with the seed words by similarity to perform classification training, so as to identify the user relationships with indicative features from the data, comprises:
Representing the data as vectors in a vector space according to the vector space model; each word in the data serves as one dimension of the vector, and the total number of dimensions equals the total number of words in the data;
Determining a separating plane according to the positions, in the vector space, of the data vectors and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships with indicative features.
Preferably, taking the seed words as the reference benchmark and comparing the data with the seed words by similarity to perform classification training, so as to identify the user relationships with indicative features from the data, comprises:
Representing the data as vectors in a vector space according to a preset fixed dimension and the vector space model; the fixed dimension is obtained from the contextual information of each word in the data;
Determining a separating plane according to the positions, in the vector space, of the data vectors and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships with indicative features.
Preferably, the method further comprises:
Further analyzing the user relationships with indicative features according to forward and inverse relations and transitive relations, so as to obtain user information related to the user relationships with indicative features.
The data mining processing system of the embodiment of the present invention comprises a data acquiring unit, a data classifying unit and a data processing unit. The data acquiring unit is configured to acquire data and output the data to the data classifying unit; the data are divided into multiple data types that can characterize, from different dimensions, the user relationships with indicative features in a user relationship chain. The data classifying unit is configured to comprehensively analyze the multiple data types according to a classification policy, so as to extract the user relationships with indicative features from the data, and to output those relationships to the data processing unit. The data processing unit is configured to collect information according to the user relationships with indicative features and to send recommendation information according to the analysis result of that information.
With the embodiments of the present invention, the acquired data have multiple data types, and these data types can characterize, from different dimensions, the user relationships with indicative features in a user relationship chain. In other words, dividing the data by data type turns the acquired data into a comprehensive indicator. The data of multiple types are then analyzed comprehensively according to a classification policy to extract the user relationships with indicative features. As a result, the specific user relationships with indicative features in a user relationship chain can be mined from the vast data of internet communication, and the accuracy of identifying them is also bound to improve. Collecting information according to these relationships and sending recommendation information according to the analysis result of that information therefore improves the accuracy of the information recommended to users.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a system embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a system embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a system embodiment of the present invention;
Fig. 4 is a schematic diagram of an application scenario using a system embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a system embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the strategy execution subunit in Fig. 5;
Fig. 7 is a schematic diagram of an application scenario using the modules in Fig. 6;
Fig. 8 is a schematic diagram of different data points being separated by a separating plane to achieve classification;
Fig. 9 is a schematic structural diagram of the strategy execution subunit in Fig. 5;
Fig. 10 is a schematic diagram of an application scenario using the modules in Fig. 9;
Fig. 11 is a schematic diagram of one implementation of the functional modules of the kinship diffusion unit in Fig. 4;
Fig. 12 is a schematic diagram of forward and inverse relation diffusion;
Fig. 13 is a schematic diagram of transitive relation diffusion;
Fig. 14 is an implementation flowchart of a method embodiment of the present invention;
Fig. 15 is an implementation flowchart of a method embodiment of the present invention;
Fig. 16 is an implementation flowchart of a method embodiment of the present invention.
Embodiment
Be described in further detail below in conjunction with the enforcement of accompanying drawing to technical scheme.
System embodiment one:
A kind of data mining disposal system of the embodiment of the present invention, as shown in Figure 1, described system comprises: data capture unit, data sorting unit, data processing unit.Wherein, data capture unit is for obtaining data, and export described data to described data sorting unit, described data are divided into numerous types of data, can have the customer relationship of indicative character from characterizing consumer relation chain different dimensions.Data sorting unit is used for comprehensively analyzing according to classification policy described numerous types of data, obtains the customer relationship with indicative character to analyze from described data, has the customer relationship of indicative character to described data processing unit described in output.The customer relationship that data processing unit is used for having described in basis indicative character collects information, sends recommendation information with basis to the analysis result of described information.
Adopt the embodiment of the present invention, because the described data obtained have numerous types of data, and these data types can have the customer relationship of indicative character from characterizing consumer relation chain different dimensions, that is, it is an overall target that data divide by different types of data the data itself obtained, again by comprehensively analyzing according to classification policy the data with numerous types of data, the customer relationship with indicative character is obtained to analyze from described data, therefore, specific this can not only to be excavated in customer relationship chain from the data of the internet communication of vastness there is the customer relationship of indicative character, also certainly will can improve and identify that this has the accuracy of the customer relationship of indicative character, the customer relationship so described in basis with indicative character collects information, with basis, recommendation information is sent to the analysis result of described information, the accuracy of user's recommendation information must be risen to.
In the embodiment of the present invention one preferred implementation, described numerous types of data comprises at least two kinds of data types in characterizing consumer personal attribute, the social topological structure of characterizing consumer, characterizing consumer mutual-action behavior.
In the embodiment of the present invention one preferred implementation, as shown in Figure 2, described system also comprises: data diffusion unit, described data diffusion unit is between described data sorting unit and described data processing unit, described data diffusion unit is used for according to positive inverse relation and transitive relation, the described customer relationship with indicative character is analyzed further, obtains the user profile relevant to the described customer relationship with indicative character.
In a preferred implementation of this embodiment, as shown in Fig. 3, the system further comprises a data outputting unit located between the data diffusion unit and the data processing unit. The data outputting unit outputs to the data processing unit, for processing, both the customer relationship with indicative character obtained by the data sorting unit and the related user information obtained by the data diffusion unit.
Fig. 4 is a schematic diagram of an application scenario of this system embodiment. Fig. 4 comprises a data capture unit, a kinship classification unit (a specific implementation of the data sorting unit in Fig. 3), a kinship diffusion unit (a specific implementation of the data diffusion unit in Fig. 3), a kinship output unit (a specific implementation of the data outputting unit in Fig. 3), and a data processing unit. The data capture unit obtains, from multiple data sources, the data used to analyze the customer relationship with indicative character; in this scenario that relationship is kinship, that is, a relationship between relatives. The identified kinship passes through the kinship classification unit, the kinship diffusion unit and the kinship output unit in turn and is delivered to the data processing unit. The data processing unit collects information according to the kinship to update the databases of N applications and, according to the analysis result of that information, sends recommendation information through the different applications, which improves the accuracy of the information recommended to users. The N applications include an IM friend recommendation application, an IM friend intimacy estimation application, and various advertisement recommendation platforms such as Guangdiantong.
The multiple data sources in this application scenario include:
Data type one: offline data of instant messaging (IM) applications;
Data type two: contact data of local communication applications, such as a mobile phone address book;
Data type three: interaction data generated when users interact on forums, on interactive platforms such as Sogou Wenwen, and on microblogs such as Sina Weibo.
Data type one and data type two usually characterize users' personal attributes. For example, a user may annotate contacts in an IM application with remarks such as "father", "mother" or "aunt", and these remarks show whether kinship exists between certain users. Data type two can use remarks in the same way. Because it allows more remark fields and more text than data type one, it can also record personal attributes such as a home address or postal code. If several users are annotated with the same home address, this indicates that kinship exists between them; a postal code showing that several users live in the same area or street can also inform the judgment of kinship. In general, data type one and data type two are both large in volume and short in text content; in other words, both are short text types.
Data type three consists of interaction data generated when users interact on forums, on interactive platforms such as Sogou Wenwen, and on microblogs such as Sina Weibo, for example "where has father gone" or "what time are you coming home for dinner". It is small in volume and long in text content; in other words, data type three is a long text type.
In addition, data types one to three can all reveal users' social topology.
Taking the above data sources as an example, the data capture unit accesses data from multiple sources, including the offline data of IM friends, the contact library of the mobile IM address book, and interactive status posts in the IM space (including comments and reposts). The offline data of IM friends include IM users' personal attributes (such as friend remarks and friend groupings), IM circle information, IM group information (such as group names), and the IM social relationship chain. These data indicate kinship on different dimensions. For example, if an IM group is named "relatives group", its members are probably relatives of one another.
In summary, the data used to analyze the customer relationship with indicative character (for example, kinship) come from multiple data sources, and each data source corresponds to one data type, so the data are divided into multiple data types. These include at least two of the following: data characterizing users' personal attributes, data characterizing users' social topology, and data characterizing users' interactive behavior. Because users' personal attribute features, social topology and social interaction information are considered together, data with multiple data types can characterize the customer relationship with indicative character from different dimensions of the customer relationship chain. Analysis based on such data is therefore a comprehensive analysis that ensures the relationship is identified with sufficient accuracy. This distinguishes the embodiment from, and makes it superior to, the single keyword-matching mechanism of the prior art.
Taking kinship as the customer relationship with indicative character, the shortcomings of the prior art's single keyword-matching mechanism are as follows:
One, it fails to consider and reasonably analyze the various factors that can indicate kinship:
Many factors bear on whether kinship exists: a user annotates an IM friend as "father"; a user joins a group named "relatives"; in the social topology, a relative of a relative may also be a relative; and so on. Analyzing each factor accurately requires a method targeted at it. Judging kinship by simple keyword matching across data of different natures is too crude and works poorly. For example, in IM space interactions, keyword matching would wrongly conclude that the users interacting on a post containing "where father goes" are relatives. In addition, the factors differ in how strongly they indicate kinship. A friend annotated as "father" in a mobile phone address book is more likely to be the user's relative than a friend who merely mentions "father" in an IM space interaction. The single keyword-matching mechanism cannot weigh these different factors.
Two, insufficient coverage of the mined kinship:
Many words express kinship. "Father" alone has several formal and colloquial variants (two of which are rendered literally in this text as "father's ratio" and "old beans"). It is difficult for the single keyword-matching mechanism to enumerate every possible keyword. In interactions in particular, some text contains no kinship keyword yet still indicates kinship; for example, in an IM space post, the two parties to "when are you coming back for dinner" may well be relatives.
This embodiment, by contrast, integrates data of various data types that characterize the customer relationship with indicative character from different dimensions of the customer relationship chain, and applies a comprehensive analysis mechanism. It thus avoids the above shortcomings of the prior art, identifies the customer relationship with indicative character (such as kinship) accurately, and supports more accurate information pushing.
The various social interactions between users imply many opportunities for information recommendation; for example, on holidays, friends and family exchange large numbers of greetings. At the same time, many kinds of people take part in social interaction, including one's relatives, teachers, classmates, colleagues, strangers, and even agents doing promotion. Among these groups, users linked by kinship offer great potential for information recommendation. For example, advertisers (such as restaurants or health care providers) can target users who are relatives, helping them find suitable applications, products or services more easily. Relatives can also be recommended to a user to help extend the user's existing relationship chain and increase user stickiness, which improves the user experience.
The subsequent embodiments also admit the various possible combinations described in system embodiment one; to simplify the description, they are not repeated.
System embodiment two:
A data mining processing system according to this embodiment of the present invention, as shown in Fig. 5, comprises a data capture unit, a data sorting unit and a data processing unit. The data capture unit obtains data and outputs the data to the data sorting unit; the data are divided into multiple data types that characterize the customer relationship with indicative character from different dimensions of the customer relationship chain. The data sorting unit comprehensively analyzes the multiple data types according to a classification policy to obtain the customer relationship with indicative character from the data, and outputs that relationship to the data processing unit. The data processing unit collects information according to the customer relationship with indicative character and sends recommendation information according to the analysis result of that information.
It should be noted that the data sorting unit comprises a policy selection subelement and a strategy execution subelement. The policy selection subelement parses the characteristic parameters of the multiple data types; when the characteristic parameter of every data type is below a preset threshold, it determines that the data are short text data and selects the first strategy as the classification policy. When the strategy execution subelement applies the first strategy to identify the customer relationship with indicative character in the short text data, it randomly extracts seed words, which characterize the customer relationship with indicative character. Using the seed words as the reference, it compares the data with multiple data types, taken as the training samples to be analyzed, against the seed words to perform classification training and thereby identify the customer relationship with indicative character from the data.
With this embodiment of the present invention, the acquired data fall into multiple data types, and these data types characterize the customer relationship with indicative character from different dimensions of the customer relationship chain. In other words, the data, divided by data type, together form a composite indicator. The data with multiple data types are then comprehensively analyzed according to a classification policy to obtain the customer relationship with indicative character from the data. As a result, the specific customer relationship with indicative character can be mined from the vast volume of Internet communication data, and the accuracy of identifying that relationship is improved. Information is then collected according to the customer relationship with indicative character, and recommendation information is sent according to the analysis result of that information, which improves the accuracy of the information recommended to users.
Furthermore, the data sorting unit is subdivided into the policy selection subelement and the strategy execution subelement, and the policy selection subelement selects different classification policies for different data. This embodiment addresses the short text type mentioned in system embodiment one, which is large in volume and short in text content; its characteristic parameter characterizes exactly this property. The policy selection subelement parses the characteristic parameter, compares it with the preset threshold, determines that the data are of the short text type, and selects the first strategy as the classification policy. The strategy execution subelement then executes the first strategy: seed words that characterize the customer relationship with indicative character are extracted at random and used as the reference; the data with multiple data types, taken as the training samples to be analyzed, are compared with the seed words to perform classification training, thereby identifying the customer relationship with indicative character from the data.
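The policy selection logic can be sketched as follows. The patent does not say how the characteristic parameter is computed, so this sketch assumes it is the average number of whitespace-separated tokens per text; the function name `select_strategy`, the data type names and the threshold value are all hypothetical.

```python
def select_strategy(texts_by_type, threshold=10):
    """Pick a classification strategy from per-data-type text length.

    Assumptions (not from the source): the characteristic parameter is the
    average token count per text, and `threshold` is an arbitrary preset.
    Every data type is expected to contain at least one text.
    """
    def average_length(texts):
        return sum(len(t.split()) for t in texts) / len(texts)

    params = {name: average_length(texts) for name, texts in texts_by_type.items()}
    # All data types below the threshold: short text only, so first strategy.
    if all(value < threshold for value in params.values()):
        return "first"
    # Otherwise some data type is long text, so the second strategy applies.
    return "second"

short_only = {
    "im_remarks": ["father", "aunt", "relatives group"],
    "phone_contacts": ["mother home", "uncle"],
}
with_long_text = dict(short_only)
with_long_text["space_posts"] = [
    "what time are you coming home for dinner tonight we are all waiting"
]
```

Calling `select_strategy(short_only)` returns `"first"`, while `select_strategy(with_long_text)` returns `"second"` because one data type exceeds the threshold.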
Fig. 6 is a schematic structural diagram of the strategy execution subelement in Fig. 5. The strategy execution subelement has two implementations: in the first, the vector generation module does not use a fixed dimension; in the second, the vector generation module uses a fixed dimension.
The first implementation of the strategy execution subelement is:
A vector generation module, configured to represent the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of the vector, and the total dimension of the vector is the total number of words in the data.
A classification training module, configured to determine a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the seed words, so as to identify the customer relationship with indicative character.
An analysis result output module, configured to output the identified customer relationship with indicative character.
The second implementation of the strategy execution subelement is:
A vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed-dimension vectors are obtained from the context information of each word in the data.
A classification training module, configured to determine a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the seed words, so as to identify the customer relationship with indicative character.
An analysis result output module, configured to output the identified customer relationship with indicative character.
Fig. 7 is a schematic diagram of an application scenario of the strategy execution subelement in Fig. 6. It comprises a semantic vector generation module (a specific implementation of the vector generation module in Fig. 6), a classification training module, and a predicted-kinship output module (a specific implementation of the analysis result output module in Fig. 6).
Taking kinship as the customer relationship with indicative character, the data sorting unit shown in Fig. 5, composed of the policy selection subelement and the strategy execution subelement, can be implemented as the kinship classification unit in Fig. 4. This unit predicts users' kinship separately from each data source. Because data from different sources have different characteristics, different processing logic is applied to data sources of different natures. One processing logic (the first strategy as the classification policy) is applied to the short text types, namely data type one and data type two of system embodiment one; another (the second strategy as the classification policy) is applied to the long text type, namely data type three of system embodiment one. This embodiment describes execution of the first strategy; for the second strategy, see system embodiment three below.
The distinguishing feature of the first strategy in this embodiment is that seed words are taken at random. The data concerned are the offline data of IM friends and the contacts of the mobile IM address book, for example IM users' personal attributes (friend remarks, friend groupings and so on), IM circle names and IM group names. Such text is very short, usually only a few words, and belongs to the short text type. Kinship seed words are taken at random and fed to the classification training module for classification training. The classification training module may be a classifier trained with support vector machine (SVM) techniques, which uses the kinship seed words to identify the kinship present in data of these two data types.
First, the semantic vector generation module represents the data as vectors in a vector space; the classification training module then classifies the kinship present in the data. Specifically, based on the vector space model (VSM), the semantic vector generation module uses the 0/1 representation to express each piece of data as a vector (a point) in the vector space, and the classification training module then finds a separating plane in that space.
In the 0/1 representation, each word serves as one element (one dimension) of the vector, and the total dimension of the vector is the total number of words in the full text collection. When a text is represented as a vector, the value of a dimension is 1 if the corresponding word occurs in the text and 0 otherwise. For example, word segmentation of the text "when father goes home" yields four words: "father", "what", "time" and "go home", so the vector representing this text has four dimensions set to 1. The 0/1 representation uses all Chinese words as attributes; if there are 100,000 Chinese words, the text is represented as a 100,000-dimensional vector [0, 0, 0, 1, ..., 1, ..., 0, ..., 1, ..., 1, 0, 0], in which only the dimensions corresponding to "father", "what", "time" and "go home" are 1 and all others are 0. For the large volume of short text data, vectors in the 0/1 representation therefore have very high dimension, because the dimension of the vector equals the total number of words.
Because the 0/1 representation has such high dimension, computation is difficult, and the very high dimension seriously harms the efficiency and performance of the classification training module. Moreover, it cannot reflect similarity between texts that use synonyms or have similar meanings: semantically close words are not reflected in the cosine similarity of their vectors. For example, when two synonymous words for "father" are represented as 0/1 vectors, their cosine similarity is 0 even though their meanings are nearly identical, which badly degrades classification.
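The synonym problem of the 0/1 representation can be shown with a small sketch. The five-word vocabulary is a toy stand-in for the 100,000-word vocabulary mentioned above, and "dad" is used here as a hypothetical synonym of "father"; the helper names are illustrative only.

```python
import math

def zero_one_vector(words, vocab):
    """0/1 representation: 1 if the vocabulary word occurs, else 0."""
    return [1 if w in words else 0 for w in vocab]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy vocabulary (assumption); a real vocabulary would have ~100,000 entries.
vocab = ["father", "dad", "what", "time", "go home"]

text_vec = zero_one_vector({"father", "what", "time", "go home"}, vocab)
father_vec = zero_one_vector({"father"}, vocab)
dad_vec = zero_one_vector({"dad"}, vocab)

# Synonyms occupy different dimensions, so their vectors are orthogonal.
similarity = cosine(father_vec, dad_vec)
```

Here `text_vec` is `[1, 0, 1, 1, 1]` and `similarity` is `0.0`: the representation carries no information that the two words mean the same thing.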
Given these shortcomings of the 0/1 representation, the improved scheme uses a semantic vector representation of fixed dimension instead of using the total number of words as the vector dimension.
In this improved scheme, the text of the data is first learned to obtain, for each word, a semantic vector of fixed dimension (for example, 200 dimensions). The following describes how the semantic vectors are built.
For example, word segmentation of the text "when father goes home" yields four words, "father", "what", "time" and "go home", each with a corresponding semantic vector: "father" corresponds to [0.1, 0.2, 0.1, ..., 0.5], "what" to [0.2, 0.1, 0.3, ..., 0.3], "time" to [0.1, 0.2, 0.2, ..., 0.1], and "go home" to [0.0, 0.1, 0.0, ..., 0.1]. The whole text is then represented by a single semantic vector, which is the sum of the semantic vectors of its words: [0.1, 0.2, 0.1, ..., 0.5] + [0.2, 0.1, 0.3, ..., 0.3] + [0.1, 0.2, 0.2, ..., 0.1] + [0.0, 0.1, 0.0, ..., 0.1] = [0.4, 0.6, 0.6, ..., 1.0]. After normalization, [0.4, 0.6, 0.6, ..., 1.0] becomes [0.2, 0.3, 0.3, ..., 0.5].
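The summation step can be reproduced with a short sketch that keeps only the four components shown in the example (the elided components are simply dropped, which is an assumption for illustration). The patent does not state which normalization it uses, so the L2 normalization below is also an assumption, and its output will not match the illustrative figures [0.2, 0.3, 0.3, ..., 0.5] in the text.

```python
import math

# Only the components visible in the example; the real vectors would have
# a fixed dimension such as 200.
word_vectors = {
    "father": [0.1, 0.2, 0.1, 0.5],
    "what": [0.2, 0.1, 0.3, 0.3],
    "time": [0.1, 0.2, 0.2, 0.1],
    "go home": [0.0, 0.1, 0.0, 0.1],
}

def text_vector(words, vectors):
    """Represent a text as the element-wise sum of its word vectors."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return total

def l2_normalize(vec):
    """Scale a vector to unit length (assumed normalization scheme)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

summed = text_vector(["father", "what", "time", "go home"], word_vectors)
unit = l2_normalize(summed)
```

Up to floating-point rounding, `summed` equals [0.4, 0.6, 0.6, 1.0], matching the sum given in the text.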
Thus, for the same text, the vector of 100,000 or more dimensions [0, 0, 0, 1, ..., 1, ..., 0, ..., 1, ..., 1, 0, 0] produced by the 0/1 representation becomes a fixed-dimension vector (for example, 200 dimensions) [0.2, 0.3, 0.3, ..., 0.5]. The dimension is greatly reduced, and so is the amount of computation, which improves the efficiency and performance of the classification training module. In addition, because semantic vectors better capture the context in which words occur, they measure similarity better. For example, they can recognize that "father" and "old beans" are similar in certain contexts, and can therefore better compute the similarity of the two texts "when father goes home" and "when old beans go home".
In short, semantic vectors use a neural network to find, for each word, a representation in a vector space. They take into account the context of each word and use the frequency with which words co-occur in the same context to characterize how words are related. For example, "cat" and "dog" often co-occur in the same context, so the distance between their semantic vectors is smaller than the distance between those of "cat" and "apple".
Specifically, a semantic vector needs to capture the context information of a word, so that semantically similar words have vectors with a larger cosine similarity. We characterize the context of a word with a conditional probability $P$: the probability of each word depends only on the words that precede it, i.e. $P(w_i \mid w_1, \ldots, w_{i-1})$. To simplify computation, usually only the influence of the preceding $n-1$ words is considered, i.e. $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$. A good semantic vector should maximize the conditional probability $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ of each word. We optimize this probability with a three-layer neural network model. The input layer of the network is the preceding $n-1$ words, each with its semantic vector, denoted $C(w_{i-n+1}), \ldots, C(w_{i-1})$, where $C$ is the set of all word vectors and each vector has dimension $m$. These $n-1$ vectors are concatenated end to end into an $(n-1)m$-dimensional vector, denoted $x$. A nonlinear hidden layer then models $x$ as $\tanh(Hx + d)$, where $d$ is a bias term and $\tanh$ is the activation function. The output layer of the network is a $|V|$-dimensional prediction, where $V$ is the set of words, given by formula (1):
$$y = \mathrm{softmax}\big(U \tanh(Hx + d) + Wx + b\big) \qquad (1)$$
Here softmax is the activation function; $U$ (a $|V| \times h$ matrix, where $h$ is the number of hidden-layer units) holds the parameters from the hidden layer to the output layer; $W$ (a $|V| \times (n-1)m$ matrix) is a direct linear transformation from the input layer to the output layer; and $b$ is the output-layer bias term. The $i$-th component $y_i$ of the prediction $y$ is the probability that the next word is word $i$, i.e. $y_i = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$.
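A forward pass through formula (1) can be sketched in plain Python to make the matrix shapes concrete. The sizes (|V| = 5, m = 3, n = 3, h = 4), the random initial weights and all function names are assumptions chosen for illustration; the sketch shows only the prediction step, not the back-propagation training.

```python
import math
import random

def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vec)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def forward(context_ids, C, H, d, U, W, b):
    """Compute y = softmax(U tanh(Hx + d) + Wx + b) for n-1 context words."""
    # x: the n-1 word vectors concatenated end to end, (n-1)*m values.
    x = [value for word_id in context_ids for value in C[word_id]]
    hidden = [math.tanh(s + bias) for s, bias in zip(matvec(H, x), d)]
    logits = [u + w + bias
              for u, w, bias in zip(matvec(U, hidden), matvec(W, x), b)]
    return softmax(logits)

# Hypothetical sizes: vocabulary 5, vector dimension 3, n = 3, hidden units 4.
V, m, n, h = 5, 3, 3, 4
rng = random.Random(0)
C = random_matrix(V, m, rng)            # one m-dimensional vector per word
H = random_matrix(h, (n - 1) * m, rng)  # input layer to hidden layer
U = random_matrix(V, h, rng)            # hidden layer to output layer
W = random_matrix(V, (n - 1) * m, rng)  # direct input-to-output connection
d, b = [0.0] * h, [0.0] * V

y = forward([1, 3], C, H, d, U, W, b)   # probabilities for the next word
```

The result `y` has one entry per vocabulary word and sums to 1, so `y[i]` plays the role of $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ in the text.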
The network is solved with the back-propagation algorithm, which yields the set $C$ of word semantic vectors (the semantic vector of word $w_i$ is $C(w_i)$). Solving requires statistics on the $n-1$ context words preceding each word and the associated frequency information; we use IM space status posts as the corpus for gathering this frequency information.
The embodiment of the present invention adopts the following approach:
The prior art matches keywords against the raw text, which requires finding a large number of keywords; this is laborious, and an incomplete keyword list means accuracy cannot be guaranteed. To classify more accurately, the embodiment does not classify on the text directly but represents the text in a vector form that can be analyzed mathematically. The text is first segmented into words, and the resulting words are then processed. The text is represented as a vector by the VSM, a statistical model that maps each text in the data to a data point (a point vector) in a vector space spanned by a set of orthonormal term vectors. Once texts are in vector form, they can be classified based on probability or based on distance. In the distance-based case, each text is treated as a data point in the vector space, and classification is carried out by computing the distances between data points. Classification is a machine learning process: the data points are points in n-dimensional real space, and the classification training module finds a separating plane in the vector space, as shown in Fig. 8, that separates data points of different classes. Preferably, the data points can be separated by an (n-1)-dimensional hyperplane; such a classifier is usually called a linear classifier. The approach is not limited to the SVM used in this embodiment, as many classifiers meet this requirement. Classification works best when an optimal separating plane (the maximum-margin hyperplane) is found, that is, the plane that maximizes the margin between the data points of the two classes.
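The idea of finding a separating hyperplane can be illustrated with a small linear classifier. This is a sketch, not the patent's classifier: it trains with sub-gradient descent on the hinge loss with L2 regularization, a simple stand-in for a full SVM solver, on made-up two-dimensional points. The function names, hyperparameters and data are all hypothetical.

```python
def train_linear_svm(points, labels, epochs=200, lr=0.1, reg=0.01):
    """Fit w, b so that sign(w.x + b) separates labels +1 and -1.

    Uses per-sample sub-gradient steps on the regularized hinge loss,
    which pushes toward a large-margin separating hyperplane.
    """
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(points, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            # Regularization step: shrink w toward zero.
            w = [wi - lr * reg * wi for wi in w]
            # Hinge loss step: only points inside the margin move the plane.
            if margin < 1:
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the hyperplane."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1 if score >= 0 else -1

# Toy data: +1 could stand for "kinship", -1 for "no kinship".
points = [(2.0, 2.0), (3.0, 3.0), (2.5, 3.0),
          (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]

w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, p) for p in points]
```

On this linearly separable toy set the learned hyperplane classifies every training point correctly; in the embodiment the inputs would instead be the text vectors produced by the vector generation module.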
System embodiment three:
A data mining processing system according to this embodiment of the present invention, as shown in Fig. 5, comprises a data capture unit, a data sorting unit and a data processing unit. The data capture unit obtains data and outputs the data to the data sorting unit; the data are divided into multiple data types that characterize the customer relationship with indicative character from different dimensions of the customer relationship chain. The data sorting unit comprehensively analyzes the multiple data types according to a classification policy to obtain the customer relationship with indicative character from the data, and outputs that relationship to the data processing unit. The data processing unit collects information according to the customer relationship with indicative character and sends recommendation information according to the analysis result of that information.
It should be noted that the data sorting unit comprises a policy selection subelement and a strategy execution subelement. The policy selection subelement parses the characteristic parameters of the multiple data types. When the characteristic parameter of some data types is below the preset threshold, it determines that those data are short text data; when the characteristic parameter of other data types is above the preset threshold, it determines that those data are long text data and selects the second strategy as the classification policy. When the strategy execution subelement applies the second strategy to identify the customer relationship with indicative character in the long text data, it constructs seed words from the customer relationship with indicative character that was obtained by applying the first strategy to the short text data. Using the seed words as the reference, it compares the data with multiple data types, taken as the training samples to be analyzed, with the seed words for similarity to perform classification training and thereby identify the customer relationship with indicative character from the data.
With this embodiment of the present invention, the acquired data fall into multiple data types, and these data types characterize the customer relationship with indicative character from different dimensions of the customer relationship chain. In other words, the data, divided by data type, together form a composite indicator. The data with multiple data types are then comprehensively analyzed according to a classification policy to obtain the customer relationship with indicative character from the data. As a result, the specific customer relationship with indicative character can be mined from the vast volume of Internet communication data, and the accuracy of identifying that relationship is improved. Information is then collected according to the customer relationship with indicative character, and recommendation information is sent according to the analysis result of that information, which improves the accuracy of the information recommended to users.
Furthermore, the data sorting unit is subdivided into the policy selection subelement and the strategy execution subelement, and the policy selection subelement selects different classification policies for different data. This embodiment addresses the long text type mentioned in system embodiment one, which is small in volume and long in text content; its characteristic parameter characterizes exactly this property. The policy selection subelement parses the characteristic parameter, compares it with the preset threshold, determines that the data are of the long text type, and selects the second strategy as the classification policy. The strategy execution subelement then executes the second strategy: seed words are constructed from the customer relationship with indicative character obtained by applying the first strategy to the short text data and are used as the reference; the data with multiple data types, taken as the training samples to be analyzed, are compared with the seed words for similarity to perform classification training, thereby identifying the customer relationship with indicative character from the data.
Figure 9 is a schematic structural diagram of the strategy execution subunit in Fig. 5. The strategy execution subunit has the following two implementations: in the first implementation, the vector generation module does not use a fixed dimension; in the second implementation, the vector generation module uses a fixed dimension.
The first implementation of the strategy execution subunit is as follows:
Seed words construction module, configured to, when constructing seed words from the user relationships with an indicative feature obtained by identifying the short text data with the first strategy, take the user relationship data pairs that are identified as a user relationship with an indicative feature in multiple dimensions simultaneously as positive sample seed words, and take the user relationship data pairs that are not identified as a user relationship with an indicative feature in any dimension as negative sample seed words.
Vector generation module, configured to represent the data as a vector in a vector space according to the vector space model; each word in the data serves as one dimension of the vector, and the total dimension of the vector is the total number of words in the data.
Classification training module, configured to determine a segmentation plane according to the positions, in the vector space, of the vector and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships with an indicative feature.
Analysis result output module, configured to output the identified user relationships with an indicative feature.
The second implementation of the strategy execution subunit is as follows:
Seed words construction module, configured to, when constructing seed words from the user relationships with an indicative feature obtained by identifying the short text data with the first strategy, take the user relationship data pairs that are identified as a user relationship with an indicative feature in multiple dimensions simultaneously as positive sample seed words, and take the user relationship data pairs that are not identified as a user relationship with an indicative feature in any dimension as negative sample seed words;
Vector generation module, configured to represent the data as a vector in a vector space according to a preset fixed dimension and the vector space model; the fixed dimension is obtained based on the context information of each word in the data;
Classification training module, configured to determine a segmentation plane according to the positions, in the vector space, of the vector and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships with an indicative feature;
Analysis result output module, configured to output the identified user relationships with an indicative feature.
Figure 10 is a schematic diagram of an application scenario of the strategy execution subunit in Fig. 9. It comprises a semantic vector generation module (a specific implementation of the vector generation module in Fig. 9), a classification training module, and a predicted kinship output module (a specific implementation of the analysis result output module in Fig. 9), and further comprises a high-confidence kinship extraction module (a specific implementation of the seed words construction module in Fig. 9).
Taking kinship (the relationship between relatives) as an example of the user relationship with an indicative feature, the data classification unit shown in Fig. 5, composed of the strategy selection subunit and the strategy execution subunit, may specifically be the kinship classification module in Fig. 4. This module can predict users' kinship separately from multiple data sources. Because data sources differ in their characteristics, different processing logic is needed for data sources of different natures: one processing logic (the first strategy as the classification strategy) is applied to the short text of data type one and data type two mentioned in system embodiment one, and another processing logic (the second strategy as the classification strategy) is applied to the long text of data type three mentioned in system embodiment one. This embodiment concerns execution of the second strategy.
The biggest feature of the second strategy in this embodiment is that the seed words are not taken at random; instead, they are constructed from the user relationships with an indicative feature (such as kinship) obtained by identifying the short text data (data type one and data type two) with the first strategy.
The data concerned is forum-style interaction data, such as IM space "talk" interaction data. Its text is long (54 words on average) and contains many noise words, and the probability distribution of its relative categories differs from the distributions of the IM friend offline data and the mobile IM address book described in system embodiment two. For this reason, the second strategy is adopted to identify kinship in the IM space "talk" interaction data more effectively. The key point is that the seed words are not chosen at random: the kinship identified from the IM friend offline data and the mobile IM address book is used as seed words. After selection by the high-confidence kinship extraction module, positive sample seed words and negative sample seed words are obtained and input to the classification training module for classification training. It should be noted that the classification training module may be a classifier trained with support vector machine (SVM) technology.
The positive and negative sample seed words for training the classifier are constructed as follows:
From the kinship recognition results previously generated per Fig. 6 from the two kinds of data (IM friend offline data and mobile IM address book), extract the user pairs that are predicted to be relatives in multiple dimensions simultaneously, for example pairs predicted to be relatives at the same time by the words in several dimensions such as IM friend remarks and IM friend groups. These kinship pairs have high confidence. The interaction records of these pairs in the IM space "talk" interaction data (comment and forwarding text) are taken as positive sample seed words. Correspondingly, the pairs that are not predicted to be relatives in any dimension are extracted from the kinship recognition results generated per Fig. 6, and their interaction records are taken as negative sample seed words. Semantic vectors are generated by the semantic vector generation module, and the semantic vectors corresponding to the positive and negative samples are input to the classifier for classification training.
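The seed construction rule above can be sketched as follows. This is only an illustration under assumptions: the input layout (a mapping from a user pair to per-dimension prediction flags), the function name `build_seed_pairs`, and the reading of "multiple dimensions" as "at least two" are all hypothetical.

```python
def build_seed_pairs(predictions, min_positive_dims=2):
    """Split user pairs into positive and negative seeds.

    predictions maps a user pair to {dimension name: predicted-as-relative flag}.
    """
    positive, negative = [], []
    for pair, by_dimension in predictions.items():
        hits = sum(1 for flag in by_dimension.values() if flag)
        if hits >= min_positive_dims:
            positive.append(pair)   # predicted as relatives in several dimensions
        elif hits == 0:
            negative.append(pair)   # not predicted as relatives in any dimension
        # Pairs with some hits but fewer than min_positive_dims are left out.
    return positive, negative


predictions = {
    ("A", "B"): {"friend_remark": True, "friend_group": True},
    ("A", "C"): {"friend_remark": True, "friend_group": False},
    ("B", "D"): {"friend_remark": False, "friend_group": False},
}
positive_pairs, negative_pairs = build_seed_pairs(predictions)
```

Here the pair ("A", "B") becomes a positive seed, ("B", "D") becomes a negative seed, and the ambiguous pair ("A", "C") is used for neither, which mirrors the high-confidence selection described above.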
First, the semantic vector generation module represents the data as vectors in a vector space; the classification training module then classifies the kinship present in the data. Specifically, the semantic vector generation module, based on the vector space model (VSM), uses the 0/1 representation to express the data as a space vector (which may be a point vector) in the vector space, and the classification training module then finds a segmentation plane in the vector space.
In the 0/1 representation, each word in the data (for example, in a text) serves as one element of the vector (also called one dimension of the vector), and the total dimension of the vector is the total number of words in the whole text. When a text is represented as a vector, the value of a dimension is 1 if the word corresponding to that dimension occurs in the text, and 0 otherwise. For example, word segmentation of the text "when father goes home" yields four words: "father", "what", "time", and "go home"; if only this text is represented as a vector, the vector has four dimensions. The 0/1 representation, however, uses all Chinese words as attributes. If there are 100,000 Chinese words, the vector has 100,000 dimensions, and the text is represented as [0, 0, 0, 1, ..., 1, ..., 0, ..., 1, ..., 1, 0, 0], where the value is 1 only on the dimensions corresponding to the four words "father", "what", "time", and "go home", and 0 on all others. For massive short-text data, the dimension of a 0/1 vector is therefore very large (because the dimension of the vector is the total number of words in the text).
Because the dimension of the 0/1 vector representation is very large, computation is difficult, and the similarity between texts that contain synonyms or have similar meanings cannot be reflected. The ultra-high dimension seriously harms the processing efficiency and performance of the classification training module. Moreover, with the 0/1 representation, semantically close words are not reflected in the cosine of the angle between their vectors. For example, when two semantically similar words for "father" (for example, "father" and "dad") are represented as 0/1 vectors, the cosine between their vectors is 0, which badly affects classification.
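The 0/1 representation and its weakness with synonyms can be illustrated with a short sketch. The six-word vocabulary is a toy stand-in for the full word list, the word "dad" is used as an assumed synonym of "father", and the function names are hypothetical.

```python
import math


def binary_vector(words, vocabulary):
    """0/1 representation: 1 where the vocabulary word occurs in the text, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy vocabulary; a real one would have on the order of 100,000 entries.
vocabulary = ["father", "dad", "what", "time", "go home", "apple"]
text_vec = binary_vector(["father", "what", "time", "go home"], vocabulary)
father_vec = binary_vector(["father"], vocabulary)
dad_vec = binary_vector(["dad"], vocabulary)
synonym_similarity = cosine(father_vec, dad_vec)
```

Because "father" and "dad" occupy different dimensions, their 0/1 vectors are orthogonal and `synonym_similarity` is 0, even though the two words mean the same thing.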
In view of these shortcomings of the 0/1 vector representation, the improved scheme adopts a semantic vector representation with a fixed dimension, instead of using the total number of words in the whole text as the total dimension of the vector.
In this improved scheme, the text of the data is first learned to obtain, for each word, a semantic vector with a fixed dimension (for example, 200 dimensions). The following describes how the semantic vectors are established.
For example, word segmentation of the text "when father goes home" yields four words, "father", "what", "time", and "go home", and each word corresponds to a semantic vector: "father" corresponds to [0.1, 0.2, 0.1, ..., 0.5], "what" to [0.2, 0.1, 0.3, ..., 0.3], "time" to [0.1, 0.2, 0.2, ..., 0.1], and "go home" to [0.0, 0.1, 0.0, ..., 0.1]. The whole text "when father goes home" is then represented as one semantic vector, which is the sum of the semantic vectors of the words in the text: [0.1, 0.2, 0.1, ..., 0.5] + [0.2, 0.1, 0.3, ..., 0.3] + [0.1, 0.2, 0.2, ..., 0.1] + [0.0, 0.1, 0.0, ..., 0.1] = [0.4, 0.6, 0.6, ..., 1]. After normalization, [0.4, 0.6, 0.6, ..., 1] becomes [0.2, 0.3, 0.3, ..., 0.5].
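The summation in this example can be reproduced with a short sketch. The four-component vectors below keep only the components that are written out in the example (the elided middle components are dropped), and the names are hypothetical. The description does not state which normalization is used, so L2 normalization is shown purely as an assumption; it does not reproduce the illustrative values [0.2, 0.3, 0.3, ..., 0.5].

```python
import math

# Visible components of the example word vectors (elided components omitted).
word_vectors = {
    "father": [0.1, 0.2, 0.1, 0.5],
    "what": [0.2, 0.1, 0.3, 0.3],
    "time": [0.1, 0.2, 0.2, 0.1],
    "go home": [0.0, 0.1, 0.0, 0.1],
}


def text_vector(words, vectors):
    """Sum the semantic vectors of the words in a text, component by component."""
    dims = len(next(iter(vectors.values())))
    total = [0.0] * dims
    for word in words:
        for i, value in enumerate(vectors[word]):
            total[i] += value
    return total


def l2_normalize(vec):
    """Scale a vector to unit length (assumed normalization scheme)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


summed = text_vector(["father", "what", "time", "go home"], word_vectors)
unit = l2_normalize(summed)
```

Up to floating-point rounding, `summed` equals [0.4, 0.6, 0.6, 1.0], matching the sum given in the example, and the dimension of the text vector stays fixed no matter how many words the text contains.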
As can be seen, the same text that the 0/1 representation expresses as a vector of more than 100,000 dimensions, [0, 0, 0, 1, ..., 1, ..., 0, ..., 1, ..., 1, 0, 0], becomes a vector with a fixed dimension (for example, 200 dimensions), [0.2, 0.3, 0.3, ..., 0.5]. The dimension is greatly reduced, and the amount of computation drops accordingly, which improves the processing efficiency and performance of the classification training module. In addition, because semantic vectors better capture the context between words, they yield better similarity measures. For example, they can recognize that "father" and "old beans" (a colloquial word for father) are similar in a given context, so the similarity between the two texts "when father goes home" and "when old beans goes home" can be computed more accurately.
In short, a semantic vector uses a neural network to find, for each word, a representation in a vector space. It takes the context of a word into account and uses the frequency with which words co-occur in the same context to characterize how words are related. For example, "cat" and "dog" often occur in the same context, so the distance between their semantic vectors is smaller than the distance between the semantic vectors of "cat" and "apple".
Specifically, a semantic vector needs to contain the context information of a word, so that semantically similar words have vectors with a larger cosine value. We characterize the context of a word with a conditional probability $P$: the probability of each word is affected only by the words that occur before it, i.e. $P(w_i \mid w_1, \ldots, w_{i-1})$. To simplify computation, usually only the influence of the preceding $n-1$ words is considered, i.e. $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$. A good semantic vector should maximize the conditional probability $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ of each word. We maximize this probability with a three-layer neural network model. The input layer of the network is the preceding $n-1$ words; each word corresponds to a semantic vector, denoted $C(w_{i-n+1}), \ldots, C(w_{i-1})$, where $C$ is the set of all word vectors and each vector has dimension $m$. These $n-1$ vectors are concatenated end to end to form a vector of dimension $(n-1)m$, denoted $x$. A nonlinear hidden layer then models $x$, i.e. $\tanh(Hx + d)$, where $d$ is the bias term and $\tanh$ is the activation function. The output layer of the network is a $|V|$-dimensional prediction, where $V$ is the set of words; see formula (1):
y=softmax(U·tanh(Hx+d)+Wx+b) (1)
Here softmax is the activation function; $U$ (a $|V| \times h$ matrix, where $h$ is the number of hidden-layer units) holds the parameters from the hidden layer to the output layer; $W$ (a $|V| \times (n-1)m$ matrix) is a direct linear transformation from the input layer to the output layer; and $b$ is the output-layer bias term. The $i$-th component $y_i$ of the prediction $y$ is the probability that the next word is word $i$, i.e. $y_i = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$.
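A forward pass through formula (1) can be sketched in plain Python as follows. The toy sizes, the random initialization, and the helper names are assumptions for illustration only; training by back propagation is not shown.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def forward(context_ids, C, H, d, U, W, b):
    """Formula (1): y = softmax(U * tanh(Hx + d) + Wx + b)."""
    # x is the concatenation of the n-1 context word vectors.
    x = [value for idx in context_ids for value in C[idx]]
    hidden = [math.tanh(a + bias) for a, bias in zip(matvec(H, x), d)]
    scores = [u + w + bias
              for u, w, bias in zip(matvec(U, hidden), matvec(W, x), b)]
    return softmax(scores)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
V, m, n, h = 5, 3, 3, 4            # toy |V|, vector dimension, n, hidden units
C = random_matrix(V, m, rng)       # one m-dimensional vector per word
H = random_matrix(h, (n - 1) * m, rng)
U = random_matrix(V, h, rng)
W = random_matrix(V, (n - 1) * m, rng)
d, b = [0.0] * h, [0.0] * V
y = forward([1, 3], C, H, d, U, W, b)   # probabilities for the next word
```

The result `y` has one entry per word in the vocabulary, and the entries sum to 1, as expected of the softmax output $y_i = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$.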
The network is solved with the back propagation algorithm, which yields the set $C$ of semantic vectors of the words (the semantic vector of word $w_i$ is $C(w_i)$). During solving, the $n-1$ context words before each word and the related frequency information must be counted; we use the IM space "talk" data as the corpus for these frequency statistics.
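The frequency statistics mentioned above can be gathered along the following lines. This is a minimal sketch assuming the corpus is already segmented into word lists; the function name `context_counts` and the two toy sentences are hypothetical.

```python
from collections import Counter


def context_counts(sentences, n=3):
    """Count how often each word follows each (n-1)-word context."""
    counts = Counter()
    for words in sentences:
        for i in range(n - 1, len(words)):
            context = tuple(words[i - n + 1:i])   # the n-1 preceding words
            counts[(context, words[i])] += 1
    return counts


corpus = [
    ["when", "does", "father", "go", "home"],
    ["when", "does", "dad", "go", "home"],
]
counts = context_counts(corpus, n=3)
```

In this toy corpus, "father" and "dad" both follow the context ("when", "does"), which is the kind of shared-context evidence that lets the learned vectors of the two words end up close together.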
The embodiment of the present invention adopts the following approach:
The prior art relies on keyword matching applied directly to text. It requires finding many keywords, which is laborious, and the keywords found may be incomplete, so accuracy cannot be guaranteed. To classify more accurately, the embodiment of the present invention does not classify the raw text directly; instead it represents the text in a vector form that can be analyzed mathematically. The text is first segmented into terms, and the words that make up the text are then processed. The text is represented as a vector by the VSM, a statistical model mainly used to map a text in the data to a data point (point vector) in the vector space spanned by a set of normalized orthogonal term vectors. Once the text is in vector form, it can be classified based on probability or based on distance. In the distance-based case, each text is regarded as a data point in the vector space, and classification is done by computing the distances between data points. Classification is a machine learning process. The data points (point vectors) are points in an n-dimensional real space, and the classification training module finds a segmentation plane in the vector space, as shown in Fig. 8, that separates data points of different classes and thereby classifies the data. Preferably the data points can be separated by an (n-1)-dimensional hyperplane, which is usually called a linear classifier; this is not limited to the SVM of the embodiment of the present invention, because many classifiers meet this requirement. If an optimal classification plane (the maximum-margin hyperplane) can be found, that is, the plane that maximizes the margin between the data points of the two classes, the classification effect is better.
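As the text notes, many linear classifiers can find a separating hyperplane. The sketch below uses a simple perceptron rather than an SVM, so it finds some separating hyperplane but not necessarily the maximum-margin one; the two-dimensional sample points stand in for semantic vectors of positive and negative seeds and are made up for illustration.

```python
def train_perceptron(samples, labels, epochs=100):
    """Find w, b such that label * (w.x + b) > 0 on linearly separable data."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            if y * (sum(wi * xi for wi, xi in zip(w, x)) + b) <= 0:
                # Misclassified point: move the hyperplane toward it.
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
                errors += 1
        if errors == 0:      # every training point is on the correct side
            break
    return w, b


def predict(w, b, x):
    """Return +1 (relative) or -1 (not relative) for a vector x."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


positive = [[0.9, 0.8], [0.8, 0.7], [0.7, 0.9]]   # toy positive seed vectors
negative = [[0.1, 0.2], [0.2, 0.1], [0.3, 0.2]]   # toy negative seed vectors
samples = positive + negative
labels = [1] * len(positive) + [-1] * len(negative)
w, b = train_perceptron(samples, labels)
```

After training, the hyperplane defined by `w` and `b` places all toy positive points on one side and all negative points on the other; an SVM would additionally choose the hyperplane with the largest margin between the two groups.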
Based on system embodiments one to three, the system further comprises a data diffusion unit, configured to further analyze the user relationships with an indicative feature according to forward-reverse relations and transitive relations, so as to obtain user information related to those user relationships. Taking kinship as the example of a user relationship with an indicative feature, the description is as follows:
Figure 11 is a schematic diagram of a specific implementation of the functional modules in the kinship diffusion unit in Fig. 4. The kinship diffusion unit obtains the relatives of relatives through diffusion relations. A diffusion relation table is shown in Table 1 below.
| | Father | Brother | Cousins | Aunt | Son | Aunt |
|---|---|---|---|---|---|---|
| Father | Grandfather | The cousin | Uncle | Relative | Brother | Relative |
| Brother | Father | Brother | Cousins | Aunt | Nephew | Aunt |
| Cousins | Relative | Cousins | 0 | 0 | Nephew | 0 |
| Aunt | Granddad | Uncle | 0 | 0 | Cousins | 0 |
| Son | Man and wife | Children | Relative | Relative | Relative | Relative |
| Aunt | 0 | 0 | 0 | 0 | Cousins | 0 |
Table 1
Table 1 may also be called a diffusion relation matrix. The kinship classification unit in Fig. 4 can judge whether kinship exists according to users' personal attribute information and the language used in interactions between users. However, some users' information is missing, and some users who are relatives do not interact in the IM space. Therefore, the kinship diffusion unit in Fig. 4 further diffuses the users' kinship chains to obtain the relatives of relatives. Based on the kinship identified by the kinship classification unit and the topology of the users' social network, the kinship diffusion unit diffuses kinship to improve the coverage of relative identification. As shown in Fig. 11, the kinship diffusion module comprises an IM user relationship chain extraction module, a forward-reverse relation diffusion module, a general relation diffusion module, and a confidence-based relative recognition result pruning module. The IM user relationship chain extraction module extracts kinship from the identified relationships. The forward-reverse relation diffusion module diffuses to the relatives of relatives using forward-reverse relations according to the diffusion relation table shown in Table 1. The general relation diffusion module diffuses to the relatives of relatives using transitive relations according to the diffusion relation table shown in Table 1. The confidence-based relative recognition result pruning module optimizes the diffusion results with high-confidence rules to reduce the misjudgment rate.
For the forward-reverse relation, as in the example of Fig. 12, diffusion is applied to both parties of a relation: if user A is a relative of user B, then after diffusion user B is obtained as a relative of user A. For the transitive relation (second-degree relation diffusion), as in the example of Fig. 13, kinship is passed on: if user A is user B's "father" and user B is user C's "younger brother", then user A has kinship with user C.
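The two diffusion rules can be sketched on plain "is a relative of" pairs. This is a simplified illustration: it ignores the specific relation labels of Table 1 (whose row and column semantics are not spelled out here), and the function names are hypothetical.

```python
def reverse_diffusion(relatives):
    """Forward-reverse rule: if A is a relative of B, B is a relative of A."""
    return set(relatives) | {(b, a) for a, b in relatives}


def transitive_diffusion(relatives):
    """Second-degree rule: A-B and B-C give the candidate pair A-C."""
    found = set(relatives)
    for a, b in relatives:
        for b2, c in relatives:
            if b == b2 and a != c:
                found.add((a, c))
    return found


known = {("A", "B"), ("B", "C")}
symmetric = reverse_diffusion(known)
expanded = transitive_diffusion(symmetric)
```

Starting from the two known pairs, the sketch first adds the reversed pairs and then the second-degree pairs ("A", "C") and ("C", "A"), which are candidates for the relatives of relatives.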
As for the confidence-based relative recognition result pruning module: diffusing kinship may lower accuracy. For example, if user A is user B's "cousin" and user B is user C's "elder male relative on the father's side", user A may have no kinship with user C, or only a very distant one. In particular, if the kinship classification module has misjudged user B and user C as relatives, then after second-degree relation diffusion the error is amplified, and user A and user C are further misjudged as relatives. To improve the accuracy of relative identification, the relative recognition results are optimized with a method based on confidence rules. For example, if during diffusion user A and user C share a surname or are in the same region, the confidence of the diffusion is weighted up; likewise, if user A and user C are judged to be relatives in multiple dimensions simultaneously, such as IM friend remarks, IM group names, and IM circle names, the confidence that this relation is kinship is also weighted up.
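A rule-based pruning step of this kind might look like the sketch below. The base score, the weight increments, and the threshold are all invented for illustration, as are the function names; the description only states that a shared surname, a shared region, or agreement across multiple dimensions raises the confidence.

```python
def diffusion_confidence(base, same_surname=False, same_region=False,
                         agreeing_dimensions=0):
    """Raise a base confidence according to the rules in the text (toy weights)."""
    score = base
    if same_surname:
        score += 0.2
    if same_region:
        score += 0.1
    if agreeing_dimensions >= 2:   # judged as relatives in several dimensions
        score += 0.2
    return min(score, 1.0)


def prune(candidates, threshold=0.6):
    """Keep only the diffused pairs whose confidence reaches the threshold."""
    return [pair for pair, score in candidates if score >= threshold]


candidates = [
    (("A", "C"), diffusion_confidence(0.4, same_surname=True, same_region=True)),
    (("A", "D"), diffusion_confidence(0.4)),
]
kept = prune(candidates)
```

With these toy numbers, the diffused pair ("A", "C") is kept because the shared surname and region raise its confidence, while ("A", "D") is pruned.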
It should be noted that the following description of the method is similar to the description of the system above; the beneficial effects are the same as those of the system and are not repeated. For technical details not disclosed in the method embodiments of the present invention, refer to the description of the system embodiments of the present invention.
Method embodiment one:
The data mining processing method of the embodiment of the present invention, as shown in Figure 14, comprises:
Step 101: acquire data; the data is divided into multiple data types that can characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain.
Step 102: comprehensively analyze the multiple data types according to a classification strategy to obtain the user relationships with an indicative feature from the data.
Step 103: collect information according to the user relationships with an indicative feature, and send recommendation information according to the analysis result of that information.
With this embodiment of the present invention, the acquired data has multiple data types, and these data types characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain. In other words, the acquired data, divided by data type, is itself a composite indicator. The data with multiple data types is then comprehensively analyzed according to a classification strategy to obtain the user relationships with an indicative feature from the data. As a result, these specific user relationships can be mined from the vast amount of Internet communication data, and the accuracy of identifying them is improved. Information is then collected according to the user relationships with an indicative feature, and recommendation information is sent according to the analysis result of that information, which improves the accuracy of the information recommended to users.
In a preferred implementation of the embodiment of the present invention, the multiple data types comprise at least two of the following: a data type characterizing user personal attributes, a data type characterizing the user's social topology, and a data type characterizing user interaction behavior.
Method embodiment two:
The data mining processing method of the embodiment of the present invention, as shown in Figure 15, comprises:
Step 201: acquire data; the data is divided into multiple data types that can characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain.
Step 202: parse the characteristic parameters of the multiple data types; when the characteristic parameter of every data type among the multiple data types is lower than a preset threshold, determine that the data is short text data, and select the first strategy as the classification strategy.
Step 203: execute the first strategy by extracting seed words at random; the seed words can characterize the user relationships with an indicative feature.
Step 204: with the seed words as the reference, compare the data with the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships with an indicative feature from the data.
Step 205: collect information according to the user relationships with an indicative feature, and send recommendation information according to the analysis result of that information.
With this embodiment of the present invention, the acquired data has multiple data types, and these data types characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain. In other words, the acquired data, divided by data type, is itself a composite indicator. The data with multiple data types is then comprehensively analyzed according to a classification strategy to obtain the user relationships with an indicative feature from the data. As a result, these specific user relationships can be mined from the vast amount of Internet communication data, and the accuracy of identifying them is improved. Information is then collected according to the user relationships with an indicative feature, and recommendation information is sent according to the analysis result of that information, which improves the accuracy of the information recommended to users.
In addition, step 202 determines that the data is short text data and selects the first strategy as the classification strategy, and steps 203 to 204 use randomly selected seed words to identify the user relationships with an indicative feature in the data.
In a preferred implementation of the embodiment of the present invention, step 204 specifically comprises:
Step 2041a: represent the data as a vector in a vector space according to the vector space model; each word in the data serves as one dimension of the vector, and the total dimension of the vector is the total number of words in the data;
Step 2041b: determine a segmentation plane according to the positions, in the vector space, of the vector and of the vector corresponding to the seed words, so as to identify the user relationships with an indicative feature.
In another preferred implementation of the embodiment of the present invention, step 204 specifically comprises:
Step 2042a: represent the data as a vector in a vector space according to a preset fixed dimension and the vector space model; the fixed dimension is obtained based on the context information of each word in the data;
Step 2042b: determine a segmentation plane according to the positions, in the vector space, of the vector and of the vector corresponding to the seed words, so as to identify the user relationships with an indicative feature.
Method embodiment three:
The data mining processing method of the embodiment of the present invention, as shown in Figure 16, comprises:
Step 301: acquire data; the data is divided into multiple data types that can characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain.
Step 302: parse the characteristic parameters of the multiple data types; when the characteristic parameter of some of the data types is lower than a preset threshold, determine that data of those types is short text data; when the characteristic parameter of some of the data types is higher than the preset threshold, determine that data of those types is long text data, and select the second strategy as the classification strategy.
Step 303: execute the second strategy by constructing seed words from the user relationships with an indicative feature obtained by identifying the short text data with the first strategy.
Step 304: with the seed words as the reference, compare the data with the multiple data types, as training samples to be analyzed, for similarity with the seed words to perform classification training, so as to identify the user relationships with an indicative feature from the data.
Step 305: collect information according to the user relationships with an indicative feature, and send recommendation information according to the analysis result of that information.
With this embodiment of the present invention, the acquired data has multiple data types, and these data types characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain. In other words, the acquired data, divided by data type, is itself a composite indicator. The data with multiple data types is then comprehensively analyzed according to a classification strategy to obtain the user relationships with an indicative feature from the data. As a result, these specific user relationships can be mined from the vast amount of Internet communication data, and the accuracy of identifying them is improved. Information is then collected according to the user relationships with an indicative feature, and recommendation information is sent according to the analysis result of that information, which improves the accuracy of the information recommended to users.
In addition, step 302 determines that the data includes long text data and selects the second strategy as the classification strategy, and steps 303 to 304 use seed words constructed from the recognition results of the first strategy, rather than randomly selected seed words, to identify the user relationships with an indicative feature in the data.
In a preferred implementation of the embodiment of the present invention, step 303 specifically comprises:
taking the user relationship data pairs that are identified as a user relationship with an indicative feature in multiple dimensions simultaneously as positive sample seed words, and taking the user relationship data pairs that are not identified as a user relationship with an indicative feature in any dimension as negative sample seed words.
In a preferred implementation of the embodiment of the present invention, step 304 specifically comprises:
Step 3041a: represent the data as a vector in a vector space according to the vector space model; each word in the data serves as one dimension of the vector, and the total dimension of the vector is the total number of words in the data;
Step 3041b: determine a segmentation plane according to the positions, in the vector space, of the vector and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships with an indicative feature.
In another preferred implementation of the embodiment of the present invention, step 304 specifically comprises:
Step 3042a: represent the data as a vector in a vector space according to a preset fixed dimension and the vector space model; the fixed dimension is obtained based on the context information of each word in the data;
Step 3042b: determine a segmentation plane according to the positions, in the vector space, of the vector and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships with an indicative feature.
Based on method embodiments one to three of the present invention, the method further comprises: further analyzing the user relationship having the indicative feature according to forward and reverse relations and transitive relations, to obtain user information related to the user relationship having the indicative feature.
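The diffusion by forward/reverse and transitive relations can be sketched as a closure over relation triples. The relation names and the two rule tables below are hypothetical examples chosen for illustration; the source does not list concrete relations or rules.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

# Hypothetical reverse-relation rules: if (a, r, b) holds, (b, INVERSE[r], a) holds.
INVERSE: Dict[str, str] = {
    "parent_of": "child_of",
    "child_of": "parent_of",
    "spouse_of": "spouse_of",
}

# Hypothetical transitive rules: (a, r1, b) and (b, r2, c) imply (a, COMPOSE[r1, r2], c).
COMPOSE: Dict[Tuple[str, str], str] = {
    ("spouse_of", "parent_of"): "parent_of",
}


def diffuse(known: Set[Triple]) -> Set[Triple]:
    """Expand known relations until no rule yields a new triple."""
    facts = set(known)
    while True:
        derived: Set[Triple] = set()
        for a, r, b in facts:
            if r in INVERSE:
                derived.add((b, INVERSE[r], a))
        for a, r1, b in facts:
            for b2, r2, c in facts:
                if b == b2 and a != c and (r1, r2) in COMPOSE:
                    derived.add((a, COMPOSE[(r1, r2)], c))
        if derived <= facts:
            return facts
        facts |= derived
```

Starting from "alice is the spouse of bob" and "bob is a parent of carol", the closure would also contain "alice is a parent of carol" and the corresponding reverse relations.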
If the integrated modules described in the embodiments of the present invention are implemented in the form of software function modules and sold or used as independent products, they may also be stored in a computer-readable storage medium. Based on this understanding, those skilled in the art should appreciate that the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media containing computer-usable program code, the storage media including but not limited to a USB flash drive, a portable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk storage, CD-ROM, optical storage, and the like.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce a computer-implemented process, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. The appended claims are therefore intended to be construed as covering the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Accordingly, an embodiment of the present invention further provides a computer storage medium storing a computer program, the computer program being used to execute the data mining processing system and method of the embodiments of the present invention.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit the protection scope of the present invention.

Claims (19)

1. A data mining processing system, characterized in that the system comprises a data acquisition unit, a data classification unit, and a data processing unit, wherein:
the data acquisition unit is configured to acquire data and output the data to the data classification unit, the data being divided into multiple data types that can characterize, from different dimensions of a user relationship chain, a user relationship having an indicative feature;
the data classification unit is configured to comprehensively analyze the multiple data types according to a classification strategy, so as to obtain by analysis from the data the user relationship having the indicative feature, and to output the user relationship having the indicative feature to the data processing unit; and
the data processing unit is configured to collect information according to the user relationship having the indicative feature, and to send recommendation information according to a result of analyzing the collected information.
2. The system according to claim 1, characterized in that the multiple data types comprise at least two of the following: data characterizing users' personal attributes, data characterizing users' social topology, and data characterizing users' interaction behavior.
3. The system according to claim 1, characterized in that the data classification unit comprises:
a strategy selection subunit, configured to parse characteristic parameters of the multiple data types and, when the characteristic parameter of every data type among the multiple data types is below a preset threshold, determine that the data type is short text data and select a first strategy as the classification strategy; and
a strategy execution subunit, configured to, when using the first strategy to identify the user relationship having the indicative feature in the short text data, randomly extract seed words capable of characterizing the user relationship having the indicative feature, take the seed words as a reference baseline, and compare the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data.
4. The system according to claim 3, characterized in that the strategy execution subunit comprises:
a vector generation module, configured to represent the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data;
a classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vector and of the vector corresponding to the seed words, so as to identify the user relationship having the indicative feature; and
an analysis result output module, configured to output the identified user relationship having the indicative feature.
5. The system according to claim 3, characterized in that the strategy execution subunit comprises:
a vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained based on the context information of each word in the data;
a classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vector and of the vector corresponding to the seed words, so as to identify the user relationship having the indicative feature; and
an analysis result output module, configured to output the identified user relationship having the indicative feature.
6. The system according to claim 1, characterized in that the data classification unit comprises:
a strategy selection subunit, configured to parse characteristic parameters of the multiple data types, determine that a data type is short text data when the characteristic parameter of some of the data types is below a preset threshold, determine that a data type is long text data when the characteristic parameter of some of the data types is above the preset threshold, and select a second strategy as the classification strategy; and
a strategy execution subunit, configured to, when using the second strategy to identify the user relationship having the indicative feature in the long text data, construct seed words from the user relationship having the indicative feature that was obtained by identifying the short text data with the first strategy, take the seed words as a reference baseline, and perform a similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data.
7. The system according to claim 6, characterized in that the strategy execution subunit comprises:
a seed word construction module, configured to, when constructing seed words from the user relationship having the indicative feature that was obtained by identifying the short text data with the first strategy, take as positive-sample seed words the user relationship data pairs formed by user relationships that are identified as having the indicative feature in multiple dimensions simultaneously, and take as negative-sample seed words the user relationship data pairs formed by user relationships that are not identified as having the indicative feature in any dimension;
a vector generation module, configured to represent the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data;
a classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vector and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationship having the indicative feature; and
an analysis result output module, configured to output the identified user relationship having the indicative feature.
8. The system according to claim 6, characterized in that the strategy execution subunit comprises:
a seed word construction module, configured to, when constructing seed words from the user relationship having the indicative feature that was obtained by identifying the short text data with the first strategy, take as positive-sample seed words the user relationship data pairs formed by user relationships that are identified as having the indicative feature in multiple dimensions simultaneously, and take as negative-sample seed words the user relationship data pairs formed by user relationships that are not identified as having the indicative feature in any dimension;
a vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained based on the context information of each word in the data;
a classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vector and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationship having the indicative feature; and
an analysis result output module, configured to output the identified user relationship having the indicative feature.
9. The system according to any one of claims 1 to 8, characterized in that the system further comprises a data diffusion unit located between the data classification unit and the data processing unit;
the data diffusion unit is configured to further analyze the user relationship having the indicative feature according to forward and reverse relations and transitive relations, to obtain user information related to the user relationship having the indicative feature.
10. A data mining processing method, characterized in that the method comprises:
acquiring data, the data being divided into multiple data types that can characterize, from different dimensions of a user relationship chain, a user relationship having an indicative feature;
comprehensively analyzing the multiple data types according to a classification strategy, so as to obtain by analysis from the data the user relationship having the indicative feature; and
collecting information according to the user relationship having the indicative feature, and sending recommendation information according to a result of analyzing the collected information.
11. The method according to claim 10, characterized in that the multiple data types comprise at least two of the following: data characterizing users' personal attributes, data characterizing users' social topology, and data characterizing users' interaction behavior.
12. The method according to claim 10, characterized in that comprehensively analyzing the multiple data types according to a classification strategy, so as to obtain by analysis from the data the user relationship having the indicative feature, comprises:
parsing characteristic parameters of the multiple data types and, when the characteristic parameter of every data type among the multiple data types is below a preset threshold, determining that the data type is short text data and selecting a first strategy as the classification strategy;
executing the first strategy by randomly extracting seed words capable of characterizing the user relationship having the indicative feature; and
taking the seed words as a reference baseline, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data.
13. The method according to claim 12, characterized in that taking the seed words as a reference baseline, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data, comprises:
representing the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data; and
determining a separating plane according to the distribution positions, in the vector space, of the vector and of the vector corresponding to the seed words, so as to identify the user relationship having the indicative feature.
14. The method according to claim 12, characterized in that taking the seed words as a reference baseline, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data, comprises:
representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained based on the context information of each word in the data; and
determining a separating plane according to the distribution positions, in the vector space, of the vector and of the vector corresponding to the seed words, so as to identify the user relationship having the indicative feature.
15. The method according to claim 10, characterized in that comprehensively analyzing the multiple data types according to a classification strategy, so as to obtain by analysis from the data the user relationship having the indicative feature, comprises:
parsing characteristic parameters of the multiple data types, determining that a data type is short text data when the characteristic parameter of some of the data types is below a preset threshold, determining that a data type is long text data when the characteristic parameter of some of the data types is above the preset threshold, and selecting a second strategy as the classification strategy;
executing the second strategy by constructing seed words from the user relationship having the indicative feature that was obtained by identifying the short text data with the first strategy; and
taking the seed words as a reference baseline, and performing a similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data.
16. The method according to claim 15, characterized in that constructing seed words from the user relationship having the indicative feature that was obtained by identifying the short text data with the first strategy comprises:
taking, as positive-sample seed words, the user relationship data pairs formed by user relationships that are identified as having the indicative feature in multiple dimensions simultaneously, and taking, as negative-sample seed words, the user relationship data pairs formed by user relationships that are not identified as having the indicative feature in any dimension.
17. The method according to claim 16, characterized in that taking the seed words as a reference baseline, and performing a similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data, comprises:
representing the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data; and
determining a separating plane according to the distribution positions, in the vector space, of the vector and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationship having the indicative feature.
18. The method according to claim 16, characterized in that taking the seed words as a reference baseline, and performing a similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data, comprises:
representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained based on the context information of each word in the data; and
determining a separating plane according to the distribution positions, in the vector space, of the vector and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationship having the indicative feature.
19. The method according to any one of claims 10 to 18, characterized in that the method further comprises:
further analyzing the user relationship having the indicative feature according to forward and reverse relations and transitive relations, to obtain user information related to the user relationship having the indicative feature.
CN201410174489.4A 2014-04-28 2014-04-28 A kind of data mining processing system and method Active CN104615608B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410174489.4A CN104615608B (en) 2014-04-28 2014-04-28 A kind of data mining processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410174489.4A CN104615608B (en) 2014-04-28 2014-04-28 A kind of data mining processing system and method

Publications (2)

Publication Number Publication Date
CN104615608A true CN104615608A (en) 2015-05-13
CN104615608B CN104615608B (en) 2018-05-15

Family

ID=53150057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410174489.4A Active CN104615608B (en) 2014-04-28 2014-04-28 A kind of data mining processing system and method

Country Status (1)

Country Link
CN (1) CN104615608B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set
CN102098332A (en) * 2010-12-30 2011-06-15 北京新媒传信科技有限公司 Method and device for examining and verifying contents
CN103425686A (en) * 2012-05-21 2013-12-04 微梦创科网络科技(中国)有限公司 Information publishing method and device
CN103617157A (en) * 2013-12-10 2014-03-05 东北师范大学 Text similarity calculation method based on semantics

Cited By (30)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106453030A (en) * 2015-08-12 2017-02-22 大连民族学院 Method and device for getting social relationship chain
CN106453030B (en) * 2015-08-12 2019-10-11 大连民族学院 A kind of method and device obtaining social networks chain
US10827012B2 (en) 2015-09-30 2020-11-03 Baidu Online Network Technology (Beijing) Co., Ltd. Method and apparatus for recognizing user relationship, storage medium and server
WO2017054343A1 (en) * 2015-09-30 2017-04-06 百度在线网络技术(北京)有限公司 Recognition method and apparatus for user relationship, storage medium, and server
CN105468723A (en) * 2015-11-20 2016-04-06 小米科技有限责任公司 Information recommendation method and device
CN106157114A (en) * 2016-07-06 2016-11-23 商宴通(上海)网络科技有限公司 Have dinner based on user the homepage proposed algorithm of behavior modeling
CN107800608A (en) * 2016-09-05 2018-03-13 腾讯科技(深圳)有限公司 A kind of processing method and processing device of user profile
CN106547856B (en) * 2016-10-19 2020-03-17 天脉聚源(北京)科技有限公司 Method and device for sharing data by application
CN106547856A (en) * 2016-10-19 2017-03-29 天脉聚源(北京)科技有限公司 A kind of method and device of Application share data
CN108874821B (en) * 2017-05-11 2021-06-15 腾讯科技(深圳)有限公司 Application recommendation method and device and server
CN108874821A (en) * 2017-05-11 2018-11-23 腾讯科技(深圳)有限公司 A kind of application recommended method, device and server
CN107392781A (en) * 2017-06-20 2017-11-24 挖财网络技术有限公司 The recognition methods of customer relationship, the recognition methods of object relationship and device
CN107464141A (en) * 2017-08-07 2017-12-12 北京京东尚科信息技术有限公司 For the method, apparatus of information popularization, electronic equipment and computer-readable medium
CN107464141B (en) * 2017-08-07 2021-09-07 北京京东尚科信息技术有限公司 Method and device for information popularization, electronic equipment and computer readable medium
WO2019051962A1 (en) * 2017-09-14 2019-03-21 平安科技(深圳)有限公司 Real relationship matching method and apparatus for social platform users, and readable storage medium
CN109767278A (en) * 2017-11-09 2019-05-17 北京京东尚科信息技术有限公司 Method and apparatus for output information
CN109767278B (en) * 2017-11-09 2021-03-30 北京京东尚科信息技术有限公司 Method and apparatus for outputting information
CN107948255A (en) * 2017-11-13 2018-04-20 苏州达家迎信息技术有限公司 The method for pushing and computer-readable recording medium of APP
WO2019091367A1 (en) * 2017-11-13 2019-05-16 苏州达家迎信息技术有限公司 App pushing method, device, electronic device and computer-readable storage medium
CN107948255B (en) * 2017-11-13 2019-09-03 苏州达家迎信息技术有限公司 The method for pushing and computer readable storage medium of APP
US11379206B2 (en) 2017-11-13 2022-07-05 Suzhou Dajiaying Information Technology Co., Ltd. APP pushing method and device, electronic device and computer-readable storage medium
CN108170725A (en) * 2017-12-11 2018-06-15 仲恺农业工程学院 The social network user relationship strength computational methods and device of integrated multicharacteristic information
CN110020420A (en) * 2018-01-10 2019-07-16 腾讯科技(深圳)有限公司 Text handling method, device, computer equipment and storage medium
CN108737506A (en) * 2018-04-27 2018-11-02 苏州达家迎信息技术有限公司 A kind of application method for pushing, equipment, storage medium and system
CN109241048A (en) * 2018-06-29 2019-01-18 深圳市彬讯科技有限公司 For the data processing method of data statistics, server and storage medium
CN112514358A (en) * 2018-09-26 2021-03-16 深圳市欢太科技有限公司 Game page switching method and related product
CN110751284A (en) * 2019-06-06 2020-02-04 北京嘀嘀无限科技发展有限公司 Heterogeneous information network embedding method and device, electronic equipment and storage medium
WO2021022900A1 (en) * 2019-08-02 2021-02-11 华为技术有限公司 Method and device for recognizing text
CN110851491A (en) * 2019-10-17 2020-02-28 天津大学 Network link prediction method based on multiple semantic influences of multiple neighbor nodes
CN110851491B (en) * 2019-10-17 2023-06-30 天津大学 Network link prediction method based on multiple semantic influence of multiple neighbor nodes

Also Published As

Publication number Publication date
CN104615608B (en) 2018-05-15

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230705

Address after: 35th Floor, Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen, Guangdong Province 518057

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: Room 403, East Block, Building 2, SEG Science and Technology Park, Zhenxing Road, Futian District, Shenzhen, Guangdong 518000

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

TR01 Transfer of patent right