CN104615608B - Data mining processing system and method - Google Patents

Data mining processing system and method

Info

Publication number
CN104615608B
CN104615608B CN201410174489.4A CN201410174489A CN104615608B CN 104615608 B CN104615608 B CN 104615608B CN 201410174489 A CN201410174489 A CN 201410174489A CN 104615608 B CN104615608 B CN 104615608B
Authority
CN
China
Prior art keywords
data
seed words
customer relationship
vector
indicative character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410174489.4A
Other languages
Chinese (zh)
Other versions
CN104615608A (en
Inventor
余建兴
高瀚
司徒志远
黄华伟
高岩
贺鹏
陈川
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201410174489.4A priority Critical patent/CN104615608B/en
Publication of CN104615608A publication Critical patent/CN104615608A/en
Application granted granted Critical
Publication of CN104615608B publication Critical patent/CN104615608B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/283 Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data mining processing system and method. The system comprises a data acquisition unit, a data classification unit, and a data processing unit. The data acquisition unit is configured to acquire data and output the data to the data classification unit; the data is divided into multiple data types that characterize, from different dimensions, the user relationships having an indicative feature in a user relationship chain. The data classification unit is configured to perform a comprehensive analysis on the multiple data types according to a classification strategy, so as to obtain from the data the user relationships having an indicative feature, and to output those user relationships to the data processing unit. The data processing unit is configured to collect information according to the user relationships having an indicative feature, and to send recommendation information according to the result of analyzing that information.

Description

Data mining processing system and method
Technical field
The present invention relates to mining technology for Internet communication, and in particular to a data mining processing system and method.
Background
In the course of implementing the technical solutions of the embodiments of the present application, the inventors found at least the following technical problems in the related art:
With the rapid development of Internet technology and changes in social structure, more and more people seek to communicate, connect, and socialize over networks and mobile phones, producing a massive volume of interpersonal interaction behavior. Multiple types of relationship chains between users can be obtained from this interaction behavior. These relationship chains can be applied to every aspect of social life: service providers offer services to users through various applications, for example a meal-booking application on a mobile phone client.
Analyzing the multiple types of relationship chains between users makes it possible to understand user needs better and therefore to provide better services. For example, a shopping APP that the user needs can be recommended to guide the user in purchasing required items; as another example, a suitable restaurant and its special services, or health products, can be recommended to the user. In short, once the multiple types of relationship chains between users can be obtained accurately, a database built from those relationship chains can be used to provide the best service to the user, achieving the goal of accurately recommending useful applications. Meanwhile, in the course of providing the service, the service provider can also update its own application's database through such recommendations and through an assessment of the user's purchasing power.
Among the multiple types of relationship chains between users, some user relationships have an indicative feature. One example is a relationship that indicates kinship: users who are relatives may be interested in the services provided by the same application or the same class of applications. Such kinship relationships are therefore decisive for improving an application's own database and, through that database, for recommending information accurately to users. If the user relationships having an indicative feature can be mined from the user relationship chain, they can serve as valid data and improve data validity, which avoids the data redundancy caused by large amounts of invalid data occupying the database and, in turn, makes accurate recommendation possible. How to mine user relationships having an indicative feature, so as to improve the accuracy of the information recommended to users, is the technical problem to be solved.
However, mining user relationships having an indicative feature from the vast volume of Internet communication data only appears simple. It is not easy in practice, and guaranteeing the accuracy of the mined relationships is harder still. Taking kinship as the indicative feature again, the existing technology relies on simple keyword matching: for example, if one user is annotated as "father" and another as "aunt" in an address book, the two users may be relatives. Moreover, there are many words that express kinship; "father" alone has several synonymous forms, and keyword matching can hardly enumerate every possible keyword. The related art therefore offers no effective solution to these problems.
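The limitation of keyword matching described above can be illustrated with a minimal sketch. The keyword set, the remark strings, and the function name are hypothetical and chosen only for illustration; the patent does not give a concrete keyword list.

```python
# Hypothetical keyword list; the text above only says such lists cannot be exhaustive.
KINSHIP_KEYWORDS = {"father", "aunt"}

def is_kinship_remark(remark: str) -> bool:
    """Naive matcher: true only if the remark is exactly a listed keyword."""
    return remark.strip().lower() in KINSHIP_KEYWORDS

# Assumed address-book remarks; "dad" and "papa" stand in for unlisted synonyms.
remarks = ["Father", "aunt", "dad", "papa"]
matched = [r for r in remarks if is_kinship_remark(r)]
missed = [r for r in remarks if not is_kinship_remark(r)]
```

In this sketch every synonym that is not in the list ends up in `missed`, which is the enumeration problem the background section points out.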
Summary of the invention
In view of this, the embodiments of the present invention aim to provide a data mining processing system and method that can mine, from the vast volume of Internet communication data, the specific user relationships having an indicative feature in a user relationship chain, so as to improve the accuracy of the information recommended to users.
The technical solutions of the embodiments of the present invention are implemented as follows:
A data mining processing system according to an embodiment of the present invention comprises a data acquisition unit, a data classification unit, and a data processing unit, wherein:
The data acquisition unit is configured to acquire data and output the data to the data classification unit; the data is divided into multiple data types that characterize, from different dimensions, the user relationships having an indicative feature in a user relationship chain;
The data classification unit is configured to perform a comprehensive analysis on the multiple data types according to a classification strategy, so as to obtain from the data the user relationships having an indicative feature, and to output those user relationships to the data processing unit;
The data processing unit is configured to collect information according to the user relationships having an indicative feature, and to send recommendation information according to the result of analyzing that information.
Preferably, the multiple data types include at least two of the following: data characterizing personal attributes of users, data characterizing the social topology of users, and data characterizing user interaction behavior.
Preferably, the data classification unit comprises:
A strategy selection subunit, configured to parse a characteristic parameter of the multiple data types and, when the characteristic parameter of every data type among the multiple data types is below a preset threshold, to determine that the data types are short text data and select a first strategy as the classification strategy;
A strategy execution subunit, configured, when using the first strategy to identify the user relationships having an indicative feature in the short text data, to extract seed words at random, the seed words being able to characterize the user relationships having an indicative feature; to take the seed words as a reference benchmark; and to compare the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships having an indicative feature from the data.
Preferably, the strategy execution subunit comprises:
A vector generation module, configured to represent the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data;
A classification training module, configured to determine a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationships having an indicative feature;
An analysis result output module, configured to output the identified user relationships having an indicative feature.
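The vector generation and classification training modules above can be sketched as follows. This is a minimal illustration, not the patented implementation: the text does not name a classifier, so a simple perceptron is swapped in to find a separating plane; whitespace tokenization, the seed texts, and all function names are assumptions (Chinese text would need a word segmenter instead).

```python
from typing import Dict, List, Tuple

def build_vocab(texts: List[str]) -> Dict[str, int]:
    """Assign one vector dimension to each distinct word in the data."""
    vocab: Dict[str, int] = {}
    for text in texts:
        for word in text.lower().split():
            vocab.setdefault(word, len(vocab))
    return vocab

def vectorize(text: str, vocab: Dict[str, int]) -> List[float]:
    """Bag-of-words vector; words outside the vocabulary are ignored."""
    vec = [0.0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec

def train_perceptron(xs: List[List[float]], ys: List[int],
                     epochs: int = 20) -> Tuple[List[float], float]:
    """Learn a separating plane w.x + b = 0 (perceptron as a stand-in classifier)."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the plane: move the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b

def predict(x: List[float], w: List[float], b: float) -> int:
    """Return +1 (has the indicative feature) or -1 (does not)."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1

# Hypothetical seed samples: +1 marks kinship remarks, -1 marks other remarks.
seed_texts = ["father", "dear aunt", "colleague", "office plumber"]
seed_labels = [1, 1, -1, -1]
vocab = build_vocab(seed_texts)
w, b = train_perceptron([vectorize(t, vocab) for t in seed_texts], seed_labels)
predictions = [predict(vectorize(t, vocab), w, b) for t in seed_texts]
```

Because every distinct word gets its own dimension, the vector length grows with the vocabulary, which is the property the first vector generation variant describes.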
Preferably, the strategy execution subunit comprises:
A vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained from the contextual information of each word in the data;
A classification training module, configured to determine a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationships having an indicative feature;
An analysis result output module, configured to output the identified user relationships having an indicative feature.
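A fixed-dimension representation built from word context might look like the sketch below. The patent does not say how contextual information is mapped to the fixed dimensions, so this example swaps in feature hashing of neighbouring words as a stand-in; the dimension `DIM`, the window size, and the function names are all assumptions.

```python
import zlib
from typing import List

DIM = 8  # assumed preset fixed dimension

def context_vector(tokens: List[str], index: int,
                   window: int = 1, dim: int = DIM) -> List[float]:
    """Fixed-length vector for tokens[index], built from its neighbouring words."""
    vec = [0.0] * dim
    lo = max(0, index - window)
    hi = min(len(tokens), index + window + 1)
    for j in range(lo, hi):
        if j != index:
            # crc32 is deterministic across runs, unlike the built-in hash() on str
            vec[zlib.crc32(tokens[j].encode("utf-8")) % dim] += 1.0
    return vec

def text_vector(text: str, window: int = 1, dim: int = DIM) -> List[float]:
    """Sum the context vectors of all words; the length is dim for any input."""
    tokens = text.lower().split()
    total = [0.0] * dim
    for i in range(len(tokens)):
        for k, value in enumerate(context_vector(tokens, i, window, dim)):
            total[k] += value
    return total

short_vec = text_vector("dad")
long_vec = text_vector("what time are you coming home for dinner")
```

Unlike the one-dimension-per-word variant, the vector length here stays at `DIM` whether the input is a one-word remark or a full sentence.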
Preferably, the data classification unit comprises:
A strategy selection subunit, configured to parse a characteristic parameter of the multiple data types; when the characteristic parameter of some of the data types is below a preset threshold, to determine that those data types are short text data; when the characteristic parameter of others is above the preset threshold, to determine that those data types are long text data; and to select a second strategy as the classification strategy;
A strategy execution subunit, configured, when using the second strategy to identify the user relationships having an indicative feature in the long text data, to construct seed words from the user relationships having an indicative feature that were identified in the short text data using the first strategy; to take the seed words as a reference benchmark; and to compare, by similarity, the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships having an indicative feature from the data.
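The two selection rules described for the strategy selection subunit can be sketched as below. The patent does not define the characteristic parameter, so mean text length in words is assumed here; the threshold value, the data-type names, and the function names are hypothetical, and the case where every data type is long is not covered by the text and is reported as undetermined.

```python
from typing import Dict, List

THRESHOLD = 3.0  # hypothetical preset threshold, in words per text

def mean_length(texts: List[str]) -> float:
    """Assumed characteristic parameter: average number of words per text."""
    return sum(len(t.split()) for t in texts) / len(texts)

def select_strategy(data_by_type: Dict[str, List[str]],
                    threshold: float = THRESHOLD) -> str:
    """Pick 'first' if all types are short text, 'second' if short and long are mixed."""
    params = {name: mean_length(texts) for name, texts in data_by_type.items()}
    short = [name for name, p in params.items() if p < threshold]
    long_ = [name for name, p in params.items() if p > threshold]
    if len(short) == len(params):
        return "first"
    if short and long_:
        return "second"
    return "undetermined"  # not specified in the source text

all_short = {"im": ["father", "aunt"], "address_book": ["mother"]}
mixed = {"im": ["father"], "forum": ["what time are you coming home for dinner"]}
choice_short = select_strategy(all_short)
choice_mixed = select_strategy(mixed)
```

Each data type is assumed to contain at least one text; an empty list would need separate handling before `mean_length` is called.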
Preferably, the strategy execution subunit comprises:
A seed word construction module, configured, when constructing seed words from the user relationships having an indicative feature identified in the short text data using the first strategy, to take as positive-sample seed words the user relationship data pairs formed by user relationships identified as having an indicative feature in multiple dimensions simultaneously, and to take as negative-sample seed words the user relationship data pairs formed by user relationships not identified as having an indicative feature in any dimension;
A vector generation module, configured to represent the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data;
A classification training module, configured to determine a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the positive-sample and negative-sample seed words, so as to identify the user relationships having an indicative feature;
An analysis result output module, configured to output the identified user relationships having an indicative feature.
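The seed word construction rule above (positive if identified in multiple dimensions at once, negative if identified in none) can be sketched as follows. Reading "multiple" as "at least two" is an assumption, as are the dimension names, the user identifiers, and the function names.

```python
from typing import Dict, FrozenSet, Set, Tuple

Pair = FrozenSet[str]  # an unordered pair of user identifiers

def pair(a: str, b: str) -> Pair:
    return frozenset((a, b))

def build_seeds(candidates: Set[Pair],
                hits_by_dimension: Dict[str, Set[Pair]],
                min_dims: int = 2) -> Tuple[Set[Pair], Set[Pair]]:
    """Split candidate pairs into positive and negative seed samples."""
    positive: Set[Pair] = set()
    negative: Set[Pair] = set()
    for p in candidates:
        # count in how many dimensions the first strategy flagged this pair
        n = sum(p in hits for hits in hits_by_dimension.values())
        if n >= min_dims:
            positive.add(p)
        elif n == 0:
            negative.add(p)
        # pairs flagged in exactly one dimension are left out as ambiguous
    return positive, negative

candidates = {pair("u1", "u2"), pair("u1", "u3"), pair("u2", "u4")}
hits = {
    "im": {pair("u1", "u2"), pair("u1", "u3")},
    "address_book": {pair("u1", "u2")},
}
positive, negative = build_seeds(candidates, hits)
```

Here the pair flagged in both dimensions becomes a positive seed, the pair flagged in neither becomes a negative seed, and the pair flagged only once is used for neither.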
Preferably, the strategy execution subunit comprises:
A seed word construction module, configured, when constructing seed words from the user relationships having an indicative feature identified in the short text data using the first strategy, to take as positive-sample seed words the user relationship data pairs formed by user relationships identified as having an indicative feature in multiple dimensions simultaneously, and to take as negative-sample seed words the user relationship data pairs formed by user relationships not identified as having an indicative feature in any dimension;
A vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained from the contextual information of each word in the data;
A classification training module, configured to determine a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the positive-sample and negative-sample seed words, so as to identify the user relationships having an indicative feature;
An analysis result output module, configured to output the identified user relationships having an indicative feature.
Preferably, the system further comprises a data diffusion unit located between the data classification unit and the data processing unit;
The data diffusion unit is configured to further analyze the user relationships having an indicative feature according to forward-inverse relations and transitive relations, and to obtain user information related to the user relationships having an indicative feature.
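Diffusion over forward-inverse and transitive relations can be sketched as a small closure computation. The rule tables, relation labels, and user names below are hypothetical examples for kinship; the patent text here does not list concrete rules.

```python
from typing import Dict, Set, Tuple

Fact = Tuple[str, str, str]  # (user_a, relation, user_b)

# Hypothetical forward-inverse rules: a REL b implies b INVERSE[REL] a.
INVERSE: Dict[str, str] = {
    "parent_of": "child_of",
    "child_of": "parent_of",
    "spouse_of": "spouse_of",
}
# Hypothetical transitive rules: a R1 b and b R2 c imply a TRANSITIVE[(R1, R2)] c.
TRANSITIVE: Dict[Tuple[str, str], str] = {
    ("spouse_of", "parent_of"): "parent_of",
    ("parent_of", "parent_of"): "grandparent_of",
}

def diffuse(facts: Set[Fact]) -> Set[Fact]:
    """Repeatedly apply both rule tables until no new relation is derived."""
    known = set(facts)
    while True:
        new: Set[Fact] = set()
        for a, rel, b in known:
            if rel in INVERSE:
                new.add((b, INVERSE[rel], a))
        for a, r1, b in known:
            for b2, r2, c in known:
                if b == b2 and a != c and (r1, r2) in TRANSITIVE:
                    new.add((a, TRANSITIVE[(r1, r2)], c))
        if new <= known:  # fixed point reached
            return known
        known |= new

seed_facts = {("alice", "spouse_of", "bob"), ("bob", "parent_of", "carol")}
expanded = diffuse(seed_facts)
```

Starting from two identified relations, the sketch derives that alice is also a parent of carol, plus the inverse of every relation, which is the kind of related user information the diffusion unit is described as obtaining.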
A data mining processing method according to an embodiment of the present invention comprises:
Acquiring data, where the data is divided into multiple data types that characterize, from different dimensions, the user relationships having an indicative feature in a user relationship chain;
Performing a comprehensive analysis on the multiple data types according to a classification strategy, so as to obtain from the data the user relationships having an indicative feature;
Collecting information according to the user relationships having an indicative feature, and sending recommendation information according to the result of analyzing that information.
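The three method steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the record fields (`owner`, `contact`, `remark`), the function names, and the toy strategy are hypothetical and do not come from the patent.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, str]          # one acquired data item
Relation = Tuple[str, str, str]  # (user_a, user_b, relation_label)

def acquire(sources: List[List[Record]]) -> List[Record]:
    """Step 1: gather records of several data types into one list."""
    return [record for source in sources for record in source]

def classify(records: List[Record],
             strategy: Callable[[Record], str]) -> List[Relation]:
    """Step 2: apply a classification strategy and keep the records it labels."""
    relations: List[Relation] = []
    for record in records:
        label = strategy(record)
        if label:
            relations.append((record["owner"], record["contact"], label))
    return relations

def recommend(relations: List[Relation]) -> Dict[str, List[str]]:
    """Step 3: collect per-user information that a recommender could act on."""
    info: Dict[str, List[str]] = {}
    for a, b, label in relations:
        info.setdefault(a, []).append(f"{b} ({label})")
    return info

def toy_strategy(record: Record) -> str:
    """Placeholder strategy; a trained classifier would go here."""
    return "kinship" if record["remark"] in {"father", "aunt"} else ""

sources = [
    [{"owner": "u1", "contact": "u2", "remark": "father"}],     # e.g. IM data
    [{"owner": "u1", "contact": "u3", "remark": "colleague"}],  # e.g. address book
]
result = recommend(classify(acquire(sources), toy_strategy))
```

Passing the strategy as a function keeps the classification step replaceable, which matches the way the method treats the first and second strategies as interchangeable classification policies.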
Preferably, the multiple data types include at least two of the following: data characterizing personal attributes of users, data characterizing the social topology of users, and data characterizing user interaction behavior.
Preferably, performing the comprehensive analysis on the multiple data types according to the classification strategy, so as to obtain from the data the user relationships having an indicative feature, comprises:
Parsing a characteristic parameter of the multiple data types and, when the characteristic parameter of every data type among the multiple data types is below a preset threshold, determining that the data types are short text data and selecting a first strategy as the classification strategy;
Executing the first strategy by extracting seed words at random, the seed words being able to characterize the user relationships having an indicative feature;
Taking the seed words as a reference benchmark, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships having an indicative feature from the data.
Preferably, taking the seed words as a reference benchmark and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships having an indicative feature from the data, comprises:
Representing the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data;
Determining a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationships having an indicative feature.
Preferably, taking the seed words as a reference benchmark and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships having an indicative feature from the data, comprises:
Representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained from the contextual information of each word in the data;
Determining a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationships having an indicative feature.
Preferably, performing the comprehensive analysis on the multiple data types according to the classification strategy, so as to obtain from the data the user relationships having an indicative feature, comprises:
Parsing a characteristic parameter of the multiple data types; when the characteristic parameter of some of the data types is below a preset threshold, determining that those data types are short text data; when the characteristic parameter of others is above the preset threshold, determining that those data types are long text data; and selecting a second strategy as the classification strategy;
Executing the second strategy by constructing seed words from the user relationships having an indicative feature that were identified in the short text data using the first strategy;
Taking the seed words as a reference benchmark, and comparing, by similarity, the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships having an indicative feature from the data.
Preferably, constructing seed words from the user relationships having an indicative feature identified in the short text data using the first strategy comprises:
Taking as positive-sample seed words the user relationship data pairs formed by user relationships identified as having an indicative feature in multiple dimensions simultaneously, and taking as negative-sample seed words the user relationship data pairs formed by user relationships not identified as having an indicative feature in any dimension.
Preferably, taking the seed words as a reference benchmark and comparing, by similarity, the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships having an indicative feature from the data, comprises:
Representing the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data;
Determining a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the positive-sample and negative-sample seed words, so as to identify the user relationships having an indicative feature.
Preferably, taking the seed words as a reference benchmark and comparing, by similarity, the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationships having an indicative feature from the data, comprises:
Representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained from the contextual information of each word in the data;
Determining a separating plane according to the positions in the vector space of the vectors and of the vectors corresponding to the positive-sample and negative-sample seed words, so as to identify the user relationships having an indicative feature.
Preferably, the method further comprises:
Further analyzing the user relationships having an indicative feature according to forward-inverse relations and transitive relations, and obtaining user information related to the user relationships having an indicative feature.
The data mining processing system of the embodiments of the present invention comprises a data acquisition unit, a data classification unit, and a data processing unit. The data acquisition unit is configured to acquire data and output the data to the data classification unit; the data is divided into multiple data types that characterize, from different dimensions, the user relationships having an indicative feature in a user relationship chain. The data classification unit is configured to perform a comprehensive analysis on the multiple data types according to a classification strategy, so as to obtain from the data the user relationships having an indicative feature, and to output those user relationships to the data processing unit. The data processing unit is configured to collect information according to the user relationships having an indicative feature, and to send recommendation information according to the result of analyzing that information.
With the embodiments of the present invention, the acquired data has multiple data types, and these data types characterize, from different dimensions, the user relationships having an indicative feature in a user relationship chain. In other words, data divided by different data types is itself a composite indicator. The data with multiple data types is then analyzed comprehensively according to a classification strategy to obtain the user relationships having an indicative feature. As a result, the specific user relationships having an indicative feature in a user relationship chain can be mined from the vast volume of Internet communication data, and the accuracy of identifying them is improved. Information is then collected according to those user relationships and recommendation information is sent according to the result of analyzing that information, which improves the accuracy of the information recommended to users.
Brief description of the drawings
Fig. 1 is a schematic structural diagram of a system embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a system embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a system embodiment of the present invention;
Fig. 4 is a schematic diagram of an application scenario of a system embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a system embodiment of the present invention;
Fig. 6 is a schematic structural diagram of the strategy execution subunit in Fig. 5;
Fig. 7 is a schematic diagram of an application scenario of the modules in Fig. 6;
Fig. 8 is a schematic diagram of a separating plane dividing different data points to achieve classification;
Fig. 9 is a schematic structural diagram of the strategy execution subunit in Fig. 5;
Fig. 10 is a schematic diagram of an application scenario of the modules in Fig. 9;
Fig. 11 is a schematic diagram of an implementation of the function modules of the kinship diffusion unit in Fig. 4;
Fig. 12 is a schematic diagram of forward-inverse relation diffusion;
Fig. 13 is a schematic diagram of transitive relation diffusion;
Fig. 14 is an implementation flowchart of a method embodiment of the present invention;
Fig. 15 is an implementation flowchart of a method embodiment of the present invention;
Fig. 16 is an implementation flowchart of a method embodiment of the present invention.
Detailed description
The implementation of the technical solutions is described in further detail below with reference to the accompanying drawings.
System embodiment one:
A data mining processing system according to an embodiment of the present invention, as shown in Fig. 1, comprises a data acquisition unit, a data classification unit, and a data processing unit. The data acquisition unit is configured to acquire data and output the data to the data classification unit; the data is divided into multiple data types that characterize, from different dimensions, the user relationships having an indicative feature in a user relationship chain. The data classification unit is configured to perform a comprehensive analysis on the multiple data types according to a classification strategy, so as to obtain from the data the user relationships having an indicative feature, and to output those user relationships to the data processing unit. The data processing unit is configured to collect information according to the user relationships having an indicative feature, and to send recommendation information according to the result of analyzing that information.
With the embodiments of the present invention, the acquired data has multiple data types, and these data types characterize, from different dimensions, the user relationships having an indicative feature in a user relationship chain. In other words, data divided by different data types is itself a composite indicator. The data with multiple data types is then analyzed comprehensively according to a classification strategy to obtain the user relationships having an indicative feature. As a result, the specific user relationships having an indicative feature in a user relationship chain can be mined from the vast volume of Internet communication data, and the accuracy of identifying them is improved. Information is then collected according to those user relationships and recommendation information is sent according to the result of analyzing that information, which improves the accuracy of the information recommended to users.
In a preferred implementation of the embodiment of the present invention, the multiple data types include at least two of the following: data characterizing personal attributes of users, data characterizing the social topology of users, and data characterizing user interaction behavior.
In a preferred implementation of the embodiment of the present invention, as shown in Fig. 2, the system further comprises a data diffusion unit located between the data classification unit and the data processing unit. The data diffusion unit is configured to further analyze the user relationships having an indicative feature according to forward-inverse relations and transitive relations, and to obtain user information related to the user relationships having an indicative feature.
In a preferred implementation of the embodiment of the present invention, as shown in Fig. 3, the system further comprises a data output unit located between the data diffusion unit and the data processing unit. The data output unit is configured to output to the data processing unit both the user relationships having an indicative feature obtained by the data classification unit and the related user information obtained through further processing by the data diffusion unit.
Fig. 4 is a schematic diagram of an application scenario of a system embodiment of the present invention. Fig. 4 includes a data acquisition unit, a kinship classification unit (a specific implementation of the data classification unit in Fig. 3), a kinship diffusion unit (a specific implementation of the data diffusion unit in Fig. 3), a kinship output unit (a specific implementation of the data output unit in Fig. 3), and a data processing unit. The data acquisition unit acquires, from multiple data sources, the data used to analyze user relationships having an indicative feature. In this application scenario the indicative feature is kinship: the data passes through the kinship classification unit, the kinship diffusion unit, and the kinship output unit, and the identified kinship relationships are sent to the data processing unit for processing. The data processing unit collects information according to the kinship relationships to update the databases of N applications and, according to the result of analyzing that information, sends recommendation information through the different applications, which improves the accuracy of the information recommended to users. The N applications include an IM friend recommendation application, an IM friend closeness estimation application, and various advertisement recommendation platforms such as Guangdiantong (广点通).
The multiple data sources in this application scenario include:
Data type one: the offline data of instant messaging (IM) applications;
Data type two: the contact data of local communication applications, such as a mobile phone address book;
Data type three: the interaction data generated when users interact on forums, on interactive platforms such as Sogou Wenwen, and on microblogs such as Sina Weibo.
Data type one and data type two usually characterize personal attributes of a user. For example, in an IM application a user may annotate contacts with remarks such as "father", "mother" or "aunt", and from such remarks it can be determined whether a kinship relationship exists between certain users. Data type two can use remarks in the same way. Because data type two allows more remark fields and more text than data type one, the remarked personal attributes may also include the user's home address, postal code and the like. If several users are remarked with the same home address, this indicates a kinship relationship between them; likewise, learning from the postal code that several users live in the same district or street can influence the judgment of a kinship relationship. In general, data type one and data type two are data types with a large data volume and short text content; in other words, both are of the short text type.
Data type three consists of the interaction data generated when users interact on forums, on interactive platforms such as Sogou Wenwen, and on microblogs such as Sina Weibo, for example "Dad, where are we going" or "what time are you coming home for dinner". It is a data type with a small data volume and long text content; in other words, data type three is of the long text type.
In addition, data type one through data type three can reveal the social topology of users.
Taking the above data sources as an example, the data acquisition unit can access data from multiple data sources, including the offline data of IM friends, the contact list of the mobile phone IM address book, and the interactive "talk-about" posts in IM spaces (including comments and reposts). The offline data of IM friends include IM users' personal attributes (such as friend remarks and friend groupings), IM circle information, IM group information (such as group names), IM social relationship chains and so on. These data indicate kinship relationships in different dimensions; for example, if an IM group is named "relatives group", the members of that group are likely to be relatives of one another.
In summary, the data used to analyze the user relationship with an indicative feature, for example the data used to analyze kinship relationships, come from multiple data sources, and each data source corresponds to one data type; the data are therefore divided into multiple data types. The multiple data types include at least two of the following: a type characterizing personal attributes of users, a type characterizing the social topology of users, and a type characterizing the interactive behavior of users. Because the personal attribute features, the social topology and the social network interaction information of users can all be considered together, data with multiple data types can characterize, in different dimensions, the user relationship with an indicative feature in a user relationship chain. Analyzing the user relationship with an indicative feature on the basis of such data, as in the embodiment of the present invention, is therefore a comprehensive analysis that can ensure sufficiently accurate identification of that user relationship, and it is superior to the single keyword matching mechanism of the prior art.
Taking the case where the user relationship with an indicative feature is a kinship relationship as an example, the shortcomings of the single keyword matching mechanism of the prior art are as follows:
First, it fails to consider and reasonably analyze the various factors by which a kinship relationship can be judged:
Many factors bear on whether a kinship relationship exists: for example, a user is remarked as "father" by an IM friend; a user joins a group named "relatives"; or, in the social topology, a relative of a relative may also be a relative. To analyze each factor accurately, the analysis method must be targeted. Simply judging kinship by keyword matching across data of different natures is too coarse and performs poorly. For example, in the interactions of IM space users, keyword matching can wrongly conclude that the users associated with the post "Dad, where are we going" are relatives, merely because the post contains a kinship word. Moreover, the factors differ in how strongly they indicate kinship. For example, a contact remarked as "father" in a mobile phone address book is more likely to be a relative of the user than a friend who merely mentions "father" in an IM space interaction. The single mechanism of keyword matching cannot take these various factors into account.
Second, the coverage of the mined kinship relationships is insufficient:
Many words express a kinship relationship. "Father", for example, may also be written as "dad" or "papa", or even in colloquial forms such as "daddy" and "old beans". The single mechanism of keyword matching can hardly enumerate all possible keywords. In particular, some expressions in interactions contain no kinship keyword at all yet can still indicate a kinship relationship; for example, the two parties to an IM space post such as "when are you coming back for dinner" may well be relatives.
By contrast, the embodiment of the present invention combines data of various data types, which characterize, in different dimensions, the user relationship with an indicative feature in a user relationship chain. By using a comprehensive analysis mechanism it avoids the above shortcomings of the prior art, so that user relationships with an indicative feature, such as kinship relationships, can be identified precisely, which in turn safeguards the accuracy of pushed information.
The various social interactions between users imply a large number of information recommendation opportunities; for example, on every holiday a large amount of mutual greeting behavior arises among friends and family. On the other hand, many types of people take part in social interaction, including one's own relatives, teachers, classmates, colleagues, strangers and even intermediaries. Among these groups, users in kinship relationships offer a particularly large potential for information recommendation. For example, advertisers (e.g. restaurants, health care services) can deliver advertisements in a targeted manner to the relevant users, helping them find suitable applications, products or services more easily. Relatives can also be recommended to a user to help extend the user's existing relationship chain, increase user stickiness, recommend information to the user and improve the user experience.
The following embodiments may also be combined in various ways with system embodiment one above; to simplify the description, these combinations are not repeated here.
System embodiment two:
A data mining processing system according to an embodiment of the present invention, as shown in Fig. 5, includes a data acquisition unit, a data sorting unit and a data processing unit. The data acquisition unit is used to obtain data and output the data to the data sorting unit; the data are divided into multiple data types and can characterize, in different dimensions, the user relationship with an indicative feature in a user relationship chain. The data sorting unit is used to comprehensively analyze the multiple data types according to a classification policy, so as to obtain the user relationship with an indicative feature from the data, and to output that user relationship to the data processing unit. The data processing unit is used to collect information according to the user relationship with an indicative feature and to send recommendation information according to the result of analyzing that information.
It should be noted here that the data sorting unit includes a policy selection subunit and a strategy execution subunit. The policy selection subunit is used to parse the characteristic parameters of the multiple data types; when the characteristic parameter of every data type among the multiple data types is below a preset threshold, it determines that the data types are short text data and selects the first strategy as the classification policy. The strategy execution subunit is used, when identifying the user relationship with an indicative feature in the short text data using the first strategy, to extract seed words at random, the seed words being able to characterize the user relationship with an indicative feature. It takes the seed words as a reference benchmark and compares the data having the multiple data types, as training samples to be analyzed, against the seed words to carry out classification training, so as to identify the user relationship with an indicative feature from the data.
With the embodiment of the present invention, the obtained data have multiple data types, and these data types characterize, in different dimensions, the user relationship with an indicative feature in a user relationship chain. In other words, data divided by different data types are themselves a composite indicator. The data with multiple data types are then comprehensively analyzed according to the classification policy to obtain the user relationship with an indicative feature. As a result, the specific user relationship with an indicative feature in a user relationship chain can be mined from the vast amount of Internet communication data, and the accuracy of identifying it is improved. Information is then collected according to the user relationship with an indicative feature, and recommendation information is sent according to the result of analyzing that information, which improves the accuracy of the information recommended to users.
Moreover, the data sorting unit is subdivided into a policy selection subunit and a strategy execution subunit, and the policy selection subunit selects different classification policies for different data. This embodiment takes the short text type mentioned in system embodiment one as an example. The short text type is a data type with a large data volume and short text content; that is, its characteristic parameter characterizes the large data volume and the short text content. The policy selection subunit parses this characteristic parameter, compares it with a preset threshold, determines that the data are of the short text type, and then selects the first strategy as the classification policy, which is executed by the strategy execution subunit. The first strategy is: extract seed words at random, the seed words being able to characterize the user relationship with an indicative feature; take the seed words as a reference benchmark; and compare the data having the multiple data types, as training samples to be analyzed, against the seed words to carry out classification training, so as to identify the user relationship with an indicative feature from the data.
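The threshold comparison performed by the policy selection subunit can be illustrated with a minimal Python sketch. The patent does not say how the characteristic parameter is computed, so the sketch assumes it is the average number of whitespace-separated tokens per record; the names `LENGTH_THRESHOLD`, `characteristic_parameter` and `select_strategy`, the threshold value and the sample records are all hypothetical. The sketch also assumes that any data type at or above the threshold leads to the second strategy.

```python
from statistics import mean

# Hypothetical preset threshold: average tokens per record (value not from the patent).
LENGTH_THRESHOLD = 6.0

def characteristic_parameter(records):
    """Assumed characteristic parameter: average token count per record."""
    return mean(len(record.split()) for record in records)

def select_strategy(data_by_type, threshold=LENGTH_THRESHOLD):
    """Label each data type short or long and pick a classification policy.

    Returns ("first", kinds) when every data type is short text,
    otherwise ("second", kinds).
    """
    kinds = {
        name: "short" if characteristic_parameter(records) < threshold else "long"
        for name, records in data_by_type.items()
    }
    strategy = "first" if all(kind == "short" for kind in kinds.values()) else "second"
    return strategy, kinds

# Friend remarks and address book entries are short, so the first strategy is chosen.
strategy, kinds = select_strategy({
    "im_remarks": ["father", "mother", "aunt"],
    "address_book": ["father home", "uncle"],
})
```

With only remark-like records the sketch selects the first strategy; adding a data type made of longer interaction posts would switch it to the second strategy.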
Fig. 6 is a schematic diagram of the composition of the strategy execution subunit in Fig. 5. The strategy execution subunit has the following two implementations: in the first, the vector generation module does not use a fixed dimension; in the second, the vector generation module uses a fixed dimension.
The first implementation of the strategy execution subunit is:
A vector generation module, used to represent the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of a vector, and the total dimension of a vector is the total number of words in the data.
A classification training module, used to determine a separating plane according to the positions at which the vectors and the vectors corresponding to the seed words are distributed in the vector space, so as to identify the user relationship with an indicative feature.
An analysis result output module, used to output the identified user relationship with an indicative feature.
The second implementation of the strategy execution subunit is:
A vector generation module, used to represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is obtained on the basis of the contextual information of each word in the data.
A classification training module, used to determine a separating plane according to the positions at which the vectors and the vectors corresponding to the seed words are distributed in the vector space, so as to identify the user relationship with an indicative feature.
An analysis result output module, used to output the identified user relationship with an indicative feature.
Fig. 7 is a schematic diagram of an application scenario of the strategy execution subunit in Fig. 6, which includes a semantic vector generation module (a specific implementation of the vector generation module in Fig. 6), a classification training module, and a predicted kinship output module (a specific implementation of the analysis result output module in Fig. 6).
Taking the case where the user relationship with an indicative feature is a kinship relationship as an example, the data sorting unit formed by the policy selection subunit and the strategy execution subunit as shown in Fig. 5 may specifically be the kinship classification unit in Fig. 4. The kinship classification unit can predict users' kinship relationships separately from each of the multiple data sources. Because data from different sources have different characteristics, data sources of different natures must be handled in a targeted manner with different processing logic: one processing logic (the first strategy as the classification policy) is applied to the short text type of data type one and data type two mentioned in system embodiment one, and another processing logic (the second strategy as the classification policy) is applied to the long text type of data type three mentioned in system embodiment one. This embodiment concerns execution of the first strategy; for the second strategy, see the description of system embodiment three below, which is not repeated here.
The most notable feature of the first strategy in this embodiment is that seed words are taken at random. The data concerned are the offline data of IM friends and the contacts in the mobile phone IM address book, for example IM users' personal attributes (friend remarks, friend groupings and the like), IM circle names and IM group names. Texts of this kind are very short (generally only a few words) and belong to the short text type. Kinship seed words are taken at random and fed into the classification training module for classification training. It should be noted here that the classification training module may be a classifier trained on the basis of support vector machine (SVM) technology, which uses the kinship seed words to identify the kinship relationships present in the data of these two data types.
First, the semantic vector generation module represents the data as vectors in a vector space; the classification training module is then used to identify and classify the kinship relationships present in the data. Specifically, the semantic vector generation module is based on the vector space model (VSM): it represents the data as space vectors (each may be a single vector) in the vector space using the 0/1 representation, and the classification training module then finds a separating plane in the vector space.
In the 0/1 representation, each word in a piece of data such as a text serves as one element of the vector (also called one dimension of the vector), and the total dimension of the vector is the total number of words in the full text. When a given text is represented as a vector, the value of a dimension is 1 if the word corresponding to that dimension appears in the text and 0 otherwise. As an example of the 0/1 representation, the text "when does father go home" is segmented into the four words "father", "what", "when" and "go home", so that four dimensions of the vector are set when the text is represented as a vector. The 0/1 representation uses all Chinese words as attributes. If there are 100,000 Chinese words, and hence 100,000 dimensions, the vector representing this text is [0,0,0,1,...,1,...,0,...,1,...,1,0,0], in which only the dimensions corresponding to the four words "father", "what", "when" and "go home" have the value 1 and all others are 0. For short text data in massive quantities, the dimension of a 0/1 vector representation is therefore very large (because the dimension of the vector is the total number of words of the text).
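The 0/1 representation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny three-text corpus, its English tokens and the helper names `build_vocabulary` and `binary_vector` are hypothetical, and word segmentation is assumed to have been done already.

```python
def build_vocabulary(token_lists):
    """Assign one vector dimension to every distinct word in the corpus."""
    vocabulary = {}
    for tokens in token_lists:
        for token in tokens:
            vocabulary.setdefault(token, len(vocabulary))
    return vocabulary

def binary_vector(tokens, vocabulary):
    """Return the 0/1 vector of a text: 1 where the word occurs, 0 elsewhere."""
    vector = [0] * len(vocabulary)
    for token in tokens:
        if token in vocabulary:
            vector[vocabulary[token]] = 1
    return vector

# Hypothetical, already segmented corpus.
corpus = [
    ["father", "what", "when", "go home"],
    ["mother", "go home"],
    ["dad", "what", "when", "go home"],
]
vocabulary = build_vocabulary(corpus)
vector = binary_vector(corpus[0], vocabulary)
```

The vector length equals the vocabulary size, so it grows with every new word in the corpus, which is the dimensionality problem the text goes on to discuss.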
Because the dimension of the 0/1 vector representation is very large, computation is difficult, and the representation cannot reflect the similarity between texts that are synonymous or close in meaning. The ultra-high dimension seriously harms the processing efficiency and performance of the classification training module. Moreover, under the 0/1 representation, the semantic similarity of words is not reflected in the cosine of the angle between their vectors. For example, when "father" and "dad" are each represented as vectors with the 0/1 representation, the cosine between these two semantically similar words is nevertheless 0, which adversely affects classification.
In view of the above shortcomings of the 0/1 vector representation, the improved scheme is to use a semantic vector representation of fixed dimension, rather than using the total number of words of the full text as the total dimension of the vector.
In this improved scheme, the text of the data is first learned to obtain, for each word, a semantic vector of fixed dimension (for example 200 dimensions). How the semantic vectors are established is described below.
For example, the text "when does father go home" is segmented into the four words "father", "what", "when" and "go home", each of which corresponds to a semantic vector: "father" corresponds to [0.1,0.2,0.1,...,0.5], "what" to [0.2,0.1,0.3,...,0.3], "when" to [0.1,0.2,0.2,...,0.1], and "go home" to [0.0,0.1,0.0,...,0.1]. The whole text "when does father go home" is then represented as a single semantic vector obtained by adding up the semantic vectors of the words in the text: [0.1,0.2,0.1,...,0.5] + [0.2,0.1,0.3,...,0.3] + [0.1,0.2,0.2,...,0.1] + [0.0,0.1,0.0,...,0.1] = [0.4,0.6,0.6,...,1]. After normalization, [0.4,0.6,0.6,...,1] becomes [0.2,0.3,0.3,...,0.5].
It can be seen that, for the same text, the 100,000-odd-dimensional vector [0,0,0,1,...,1,...,0,...,1,...,1,0,0] of the 0/1 representation becomes the fixed-dimension (for example 200-dimensional) vector [0.2,0.3,0.3,...,0.5]. The dimension is greatly reduced and the amount of computation falls accordingly, which improves the processing efficiency and performance of the classification training module. In addition, because semantic vectors better capture the context of words, similarity can be computed more accurately. For example, it can be recognized that "father" and "old beans" are similar in certain contexts, so that the similarity of the two texts "when does father go home" and "when does old beans go home" can be computed more accurately.
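The contrast between the two representations can be made concrete with a small Python sketch. It reuses only the four visible components of each example vector above, so the 4-dimensional vectors are a toy truncation rather than real 200-dimensional semantic vectors; the vector for "old beans", the `EMBEDDINGS` and `ONE_HOT` tables and the helper names are hypothetical. The patent does not state which normalization it applies, so L2 normalization is assumed here; any positive rescaling leaves the cosine unchanged.

```python
import math

# Toy 4-dimensional semantic vectors; the "old beans" entry is a hypothetical
# vector chosen to lie close to "father".
EMBEDDINGS = {
    "father":    [0.1, 0.2, 0.1, 0.5],
    "old beans": [0.1, 0.2, 0.1, 0.4],
    "what":      [0.2, 0.1, 0.3, 0.3],
    "when":      [0.1, 0.2, 0.2, 0.1],
    "go home":   [0.0, 0.1, 0.0, 0.1],
}
# 0/1 vectors of the two synonyms over a two-word vocabulary.
ONE_HOT = {"father": [1, 0], "old beans": [0, 1]}

def add_vectors(vectors):
    """Component-wise sum of equally long vectors."""
    return [sum(components) for components in zip(*vectors)]

def l2_normalize(vector):
    """Scale a vector to unit length (assumed normalization)."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector] if norm else list(vector)

def cosine(a, b):
    """Cosine of the angle between two vectors, 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def text_vector(tokens):
    """Semantic vector of a text: normalized sum of its word vectors."""
    return l2_normalize(add_vectors([EMBEDDINGS[t] for t in tokens]))

summed = add_vectors([EMBEDDINGS[t] for t in ["father", "what", "when", "go home"]])
text_a = text_vector(["father", "what", "when", "go home"])
text_b = text_vector(["old beans", "what", "when", "go home"])
similarity_dense = cosine(text_a, text_b)
similarity_one_hot = cosine(ONE_HOT["father"], ONE_HOT["old beans"])
```

In this toy setting the 0/1 vectors of the two synonyms are orthogonal, while the two summed semantic vectors point in almost the same direction.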
In short, a semantic vector is a representation of each word in a vector space, found using a neural network. It takes into account the context in which a word appears and characterizes the correlation between words by the frequency with which they occur together in the same context. For example, "cat" and "dog" often occur together in the same context, so their distance based on semantic vectors is smaller than the corresponding distance between "cat" and "apple".
Specifically, a semantic vector needs to capture the contextual information of a word, so that the cosine between the vectors of semantically similar words is larger. We characterize the context of a word by a conditional probability P, namely that the probability of each word is influenced only by the words that precede it: $P(w_i \mid w_1, \ldots, w_{i-1})$. To simplify computation, each word is generally taken to be influenced only by its preceding n-1 words: $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$. A good semantic vector should maximize the conditional probability $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ of each word. We use a three-layer neural network model to optimize this probability. The input layer of the neural network is the preceding n-1 words, each of which corresponds to a semantic vector, denoted $C(w_{i-n+1}), \ldots, C(w_{i-1})$, where C is the set of all word vectors and each vector has dimension m. These n-1 vectors are concatenated end to end to form a vector of dimension (n-1)m, denoted x. A nonlinear hidden layer then models x as $\tanh(Hx + d)$, where d is a bias term and tanh is the activation function. The output layer of the neural network is a prediction result of dimension |V|, where V is the set of words; see formula (1):
$$y = \mathrm{softmax}\big(U \tanh(Hx + d) + Wx + b\big) \qquad (1)$$
where softmax is the activation function; U (a $|V| \times h$ matrix, where h is the number of hidden-layer units) is the parameter from the hidden layer to the output layer; and W (a $|V| \times (n-1)m$ matrix) is a direct linear transformation from the input layer to the output layer. The i-th dimension $y_i$ of the prediction result y represents the probability that the next word is i, i.e. $y_i = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$.
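The forward computation of formula (1) can be sketched in plain Python as follows. This is a minimal illustration, not the trained model: the sizes (|V| = 5, m = 4, n-1 = 2, h = 3), the random initial parameters and the helper names are hypothetical, H is assumed to be an h × (n-1)m matrix so that Hx is defined, and b is taken to be a bias vector of dimension |V|.

```python
import math
import random

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    highest = max(scores)
    exps = [math.exp(s - highest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def forward(context_ids, C, H, d, U, W, b):
    """Compute y = softmax(U tanh(Hx + d) + Wx + b) for one context."""
    # x: the n-1 context word vectors concatenated end to end, dimension (n-1)m.
    x = [component for word_id in context_ids for component in C[word_id]]
    hidden = [math.tanh(a + bias) for a, bias in zip(matvec(H, x), d)]
    logits = [u + w + bias
              for u, w, bias in zip(matvec(U, hidden), matvec(W, x), b)]
    return softmax(logits)

# Hypothetical toy sizes and randomly initialised parameters.
rng = random.Random(0)
V, m, context_size, h = 5, 4, 2, 3

def rand_matrix(rows, cols):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

C = rand_matrix(V, m)                 # one m-dimensional vector per word
H = rand_matrix(h, context_size * m)  # input layer to hidden layer
d = [0.0] * h                         # hidden-layer bias
U = rand_matrix(V, h)                 # hidden layer to output layer
W = rand_matrix(V, context_size * m)  # direct input-to-output connection
b = [0.0] * V                         # output bias
y = forward([1, 3], C, H, d, U, W, b)
```

Each entry of `y` is the model's probability for one candidate next word given the two context words; training would adjust C, H, d, U, W and b so that the observed next words receive high probability.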
This neural network is solved with the back-propagation algorithm, which yields the set C of semantic vectors of words (the semantic vector corresponding to word $w_i$ is $C(w_i)$). The solution process requires statistics on the (n-1) context words preceding each word and the associated frequency information; we use the "talk-about" posts of IM spaces as the corpus from which this frequency information is counted.
The benefits of representing text as vectors in the embodiment of the present invention are as follows:
The prior art matches by keyword: it processes the text directly and needs to find many keywords, which is laborious and, because the keywords found may be incomplete, cannot guarantee accuracy. To classify more accurately, the embodiment of the present invention does not simply classify by text but represents the text in a vector form that can be analyzed and processed mathematically; this requires first segmenting the text into terms and then processing the words that make up the text. The text is represented in vector form by the VSM. The VSM is a statistical model mainly used to map a text in the data to a data point (point vector) in a vector space spanned by a set of normalized orthogonal term vectors. Once the text is represented in a mathematically tractable vector form, classification can be performed on the basis of probability or of distance. In the distance-based case, for example, each text is regarded as a data point in the vector space, and classification is performed by computing the distances between data points. The classification process is a machine learning process. The data points (point vectors) are points in an n-dimensional real space, and the classification training module finds a separating plane in the vector space, such as the one shown in Fig. 8, that separates data points of different classes so as to classify the data. Preferably the data points can be separated by a hyperplane of dimension n-1; such a classifier is usually called a linear classifier. The classifier is not limited to the SVM of the embodiment of the present invention, as many classifiers meet this requirement. If an optimal separating plane (the maximum-margin hyperplane) can be found, that is, the plane that maximizes the margin between data points of the two different classes, the classification effect is better.
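How a linear classifier finds a separating plane between two classes of data points can be sketched as follows. The sketch is not the classifier of the embodiment: it trains a linear soft-margin SVM by plain subgradient descent on the regularized hinge loss, which only approximates the maximum-margin hyperplane. The two-dimensional points stand in for text vectors, and the function names, the learning rate `lr`, the regularization weight `lam` and the epoch count are hypothetical choices.

```python
def train_linear_svm(samples, labels, epochs=50, lr=0.1, lam=0.01):
    """Subgradient descent on the L2-regularized hinge loss; returns (w, b)."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:
                # The point is inside the margin or misclassified:
                # move the plane so that it is classified with more margin.
                w = [wi - lr * (lam * wi - y * xi) for wi, xi in zip(w, x)]
                b += lr * y
            else:
                # The point is safely classified: apply only the shrink
                # that comes from the regularization term.
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def predict(w, b, x):
    """Return +1 or -1 depending on the side of the plane w.x + b = 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Hypothetical toy data: +1 for kinship, -1 for no kinship.
points = [(2.0, 2.0), (3.0, 3.0), (2.0, 3.0),
          (-2.0, -2.0), (-3.0, -1.0), (-1.0, -3.0)]
labels = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(points, labels)
predictions = [predict(w, b, p) for p in points]
```

In practice the input points would be the semantic vectors of the texts and the labels would come from the seed words; a library SVM implementation would normally replace this hand-written training loop.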
System embodiment three:
A data mining processing system according to an embodiment of the present invention, as shown in Fig. 5, includes a data acquisition unit, a data sorting unit and a data processing unit. The data acquisition unit is used to obtain data and output the data to the data sorting unit; the data are divided into multiple data types and can characterize, in different dimensions, the user relationship with an indicative feature in a user relationship chain. The data sorting unit is used to comprehensively analyze the multiple data types according to a classification policy, so as to obtain the user relationship with an indicative feature from the data, and to output that user relationship to the data processing unit. The data processing unit is used to collect information according to the user relationship with an indicative feature and to send recommendation information according to the result of analyzing that information.
It should be noted here that the data sorting unit includes a policy selection subunit and a strategy execution subunit. The policy selection subunit is used to parse the characteristic parameters of the multiple data types; when the characteristic parameters of some of the data types are below a preset threshold, it determines that those data types are short text data, and when the characteristic parameters of other data types are above the preset threshold, it determines that those data types are long text data, and it selects the second strategy as the classification policy. The strategy execution subunit is used, when identifying the user relationship with an indicative feature in the long text data using the second strategy, to construct seed words from the user relationship with an indicative feature obtained by identifying the short text data with the first strategy. It takes the seed words as a reference benchmark and compares, for similarity, the data having the multiple data types, as training samples to be analyzed, against the seed words to carry out classification training, so as to identify the user relationship with an indicative feature from the data.
With the embodiment of the present invention, the obtained data have multiple data types, and these data types characterize, in different dimensions, the user relationship with an indicative feature in a user relationship chain. In other words, data divided by different data types are themselves a composite indicator. The data with multiple data types are then comprehensively analyzed according to the classification policy to obtain the user relationship with an indicative feature. As a result, the specific user relationship with an indicative feature in a user relationship chain can be mined from the vast amount of Internet communication data, and the accuracy of identifying it is improved. Information is then collected according to the user relationship with an indicative feature, and recommendation information is sent according to the result of analyzing that information, which improves the accuracy of the information recommended to users.
Moreover, the data sorting unit is subdivided into a policy selection subunit and a strategy execution subunit, and the policy selection subunit selects different classification policies for different data. This embodiment takes the long text type mentioned in system embodiment one as an example. The long text type is a data type with a small data volume and long text content; that is, its characteristic parameter characterizes the small data volume and the long text content. The policy selection subunit parses this characteristic parameter, compares it with a preset threshold, determines that the data are of the long text type, and then selects the second strategy as the classification policy, which is executed by the strategy execution subunit. The second strategy is: construct seed words from the user relationship with an indicative feature obtained by identifying the short text data with the first strategy; take the seed words as a reference benchmark; and compare, for similarity, the data having the multiple data types, as training samples to be analyzed, against the seed words to carry out classification training, so as to identify the user relationship with an indicative feature from the data.
Fig. 9 is a schematic diagram of the composition of the strategy execution subunit in Fig. 5. The strategy execution subunit has the following two implementations: in the first, the vector generation module does not use a fixed dimension; in the second, the vector generation module uses a fixed dimension.
The first implementation of the strategy execution subunit is:
A seed word construction module, used, when constructing seed words from the user relationship with an indicative feature obtained by identifying the short text data with the first strategy, to take user relationship data pairs that are identified as a user relationship with an indicative feature in multiple dimensions simultaneously as positive sample seed words, and to take user relationship data pairs that are not identified as a user relationship with an indicative feature in any dimension as negative sample seed words.
A vector generation module, used to represent the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of a vector, and the total dimension of a vector is the total number of words in the data.
A classification training module, used to determine a separating plane according to the positions at which the vectors and the vectors corresponding to the positive sample seed words and negative sample seed words are distributed in the vector space, so as to identify the user relationship with an indicative feature.
An analysis result output module, used to output the identified user relationship with an indicative feature.
The second implementation of the strategy execution subunit is:
A seed word construction module, configured to, when constructing seed words from the user relationships with the indicative feature obtained by applying the first strategy to the short text data, take the user relationship data pairs formed by user relationships identified as having the indicative feature in multiple dimensions simultaneously as positive-sample seed words, and take the user relationship data pairs formed by user relationships not identified as having the indicative feature in any dimension as negative-sample seed words;
A vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained from the context information of each word in the data;
A classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the positive-sample seed words and negative-sample seed words, so as to identify the user relationships with the indicative feature;
An analysis result output module, configured to output the identified user relationships with the indicative feature.
Fig. 10 is a schematic diagram of an application scenario of the strategy execution subunit of Fig. 9. It includes a semantic vector generation module (a specific implementation of the vector generation module in Fig. 9), a classification training module, and a predicted-relationship output module (a specific implementation of the analysis result output module in Fig. 9), and further includes a high-confidence relationship extraction module (a specific implementation of the seed word construction module in Fig. 9).
Taking kinship as an example of the user relationship with the indicative feature, the data classification unit composed of the strategy selection subunit and the strategy execution subunit shown in Fig. 5 may specifically be the kinship classification module in Fig. 4. The kinship classification module can predict users' kinship separately from multiple data sources. Because different data sources have different characteristics, data sources of different natures must be handled with different, targeted processing logic: one processing logic (the first strategy as the classification strategy) is applied to the short text types, data type one and data type two, mentioned in system embodiment one, and another processing logic (the second strategy as the classification strategy) is applied to the long text type, data type three, mentioned in system embodiment one. This embodiment concerns execution of the second strategy.
The most notable feature of the second strategy in this embodiment is that the seed words are not picked at random; instead, they are constructed from the user relationships with the indicative feature (such as kinship) that the first strategy identified in the short text data (data type one and data type two).
The data in question is interaction data from forums, for example the interaction data of IM-space status posts. Its texts are long (54 words on average) and contain relatively many noise words, and the probability distribution of its kinship categories differs from the corresponding probability distributions of the IM friend offline data and the mobile IM address book mentioned in the description of system embodiment two. For this reason, the second strategy is used to identify kinship in the IM-space status-post interaction data more effectively. The key point is that the seed words are not selected at random but are based on the kinship identified from the IM friend offline data and the mobile IM address book. After selection by the high-confidence relationship extraction module, positive-sample seed words and negative-sample seed words are obtained and input to the classification training module for classification training. It should be noted that the classification training module may be a trained classifier based on support vector machine (SVM) technology.
The positive-sample and negative-sample seed words for training the classifier are constructed as follows:
Based on the kinship recognition results generated according to Fig. 6 for the two preceding classes of data (IM friend offline data and mobile IM address book), the user relationship pairs that are predicted to be kin in multiple dimensions simultaneously are extracted, for example relation pairs predicted to be relatives at the same time by the words of several dimensions such as IM friend remarks and IM friend groups. These kinship pairs have high confidence. The interaction records (comment and forwarding text) of these relation pairs in the IM-space status-post data are taken as positive-sample seed words. Correspondingly, the relation pairs that are not predicted to be relatives in any dimension are extracted from the kinship recognition results generated according to Fig. 6, and their interaction records are used as negative-sample seed words. The semantic vector generation module then generates the corresponding semantic vectors for the positive and negative samples, and these vectors are input to the classifier for classification training.
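The selection of high-confidence seed pairs described above can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the function name `build_seed_pairs`, the dimension names, and the choice of "at least two dimensions" as the meaning of "multiple dimensions simultaneously" are hypothetical.

```python
from typing import Dict, List, Tuple

Pair = Tuple[str, str]


def build_seed_pairs(
    predictions: Dict[Pair, Dict[str, bool]],
    min_positive_dims: int = 2,  # assumption: "multiple dimensions" means two or more
) -> Tuple[List[Pair], List[Pair]]:
    """Split user pairs into positive and negative seed pairs.

    predictions maps a user pair to per-dimension kinship predictions made by
    the first strategy (for example friend remarks, friend groups).
    """
    positives: List[Pair] = []
    negatives: List[Pair] = []
    for pair, by_dimension in predictions.items():
        hits = sum(1 for is_kin in by_dimension.values() if is_kin)
        if hits >= min_positive_dims:
            positives.append(pair)   # predicted kin in several dimensions at once
        elif hits == 0:
            negatives.append(pair)   # not predicted kin in any dimension
        # pairs with exactly one hit are ambiguous and are left out of the seeds
    return positives, negatives


# Hypothetical first-strategy output for three user pairs.
predictions = {
    ("A", "B"): {"friend_remark": True, "friend_group": True},
    ("A", "C"): {"friend_remark": False, "friend_group": False},
    ("B", "D"): {"friend_remark": True, "friend_group": False},
}
positive_pairs, negative_pairs = build_seed_pairs(predictions)
```

The interaction records of the pairs in `positive_pairs` and `negative_pairs` would then serve as the positive-sample and negative-sample seed words.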
First, the semantic vector generation module represents the data as vectors in a vector space; the classification training module then identifies and classifies the kinship present in the data. Specifically, based on the vector space model (VSM), the semantic vector generation module can use the 0/1 representation to express the data as a vector in the vector space, and the classification training module then finds a separating plane in the vector space.
The 0/1 representation treats each word in a piece of data, such as a text, as one element of the vector (also called one dimension of the vector); the total dimension of the vector is the total number of words in the full text. When a given text is represented as a vector, the value of a dimension is 1 if the word corresponding to that dimension occurs in the text and 0 otherwise. For example, the text "when does father go home" yields four words after word segmentation: "father", "what", "when", and "go home"; if this text is represented as a vector, the vector has four dimensions. The 0/1 representation, however, uses all Chinese words as attributes. If Chinese has 100,000 words, the vector for this text is [0,0,0,1,...,1,...,0,...,1,...,1,0,0], whose value is 1 only in the dimensions corresponding to the four words "father", "what", "when", and "go home" and 0 everywhere else. For short text types with massive data volume, the dimension of a 0/1 vector is therefore very large (because the dimension of the vector is the total number of words in the text).
Because the dimension of the 0/1 vector representation is very large, computation is difficult, and the similarity between synonymous or near-synonymous texts cannot be reflected. The ultra-high dimension seriously harms the processing efficiency and performance of the classification training module. Moreover, under the 0/1 representation, the cosine of the angle between the vectors of semantically similar words does not reflect their similarity. For example, if "dad" and "father" are each represented as a vector with the 0/1 representation, the cosine of the angle between these two semantically similar words is 0, which has a very negative effect on classification.
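The 0/1 representation and its weakness for synonyms can be illustrated with a short sketch. The six-word vocabulary below is a toy stand-in for the full vocabulary, and the helper names `binary_vector` and `cosine` are hypothetical.

```python
import math
from typing import List, Sequence


def binary_vector(tokens: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with one dimension per vocabulary word."""
    present = set(tokens)
    return [1 if word in present else 0 for word in vocabulary]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy vocabulary; a real one would have on the order of 100,000 entries.
vocabulary = ["dad", "father", "what", "when", "go_home", "apple"]

text_vec = binary_vector(["father", "what", "when", "go_home"], vocabulary)
dad_vec = binary_vector(["dad"], vocabulary)
father_vec = binary_vector(["father"], vocabulary)

# Two synonyms occupy different dimensions, so their vectors are orthogonal.
synonym_similarity = cosine(dad_vec, father_vec)
```

Because each word has its own dimension, any two distinct words have cosine 0 regardless of meaning, which is the limitation the fixed-dimension semantic vectors are meant to address.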
In view of the shortcomings of the 0/1 vector representation, the improved scheme uses a semantic vector representation of fixed dimension instead of using the total number of words in the full text as the total dimension of the vector.
In this improved scheme, the text of the data is learned first to obtain, for each word, a semantic vector of fixed dimension (for example 200 dimensions). How the semantic vectors are established is described below.
For example, the text "when does father go home" yields four words after word segmentation: "father", "what", "when", and "go home". Each word corresponds to one semantic vector: for instance, "father" corresponds to [0.1,0.2,0.1,...,0.5], "what" to [0.2,0.1,0.3,...,0.3], "when" to [0.1,0.2,0.2,...,0.1], and "go home" to [0.0,0.1,0.0,...,0.1]. The whole text is then represented as a single semantic vector obtained by adding up the semantic vectors of the words in the text: [0.1,0.2,0.1,...,0.5] + [0.2,0.1,0.3,...,0.3] + [0.1,0.2,0.2,...,0.1] + [0.0,0.1,0.0,...,0.1] = [0.4,0.6,0.6,...,1]. After normalization, [0.4,0.6,0.6,...,1] becomes [0.2,0.3,0.3,...,0.5].
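The summation of word vectors into a text vector can be sketched as follows. The four-dimensional vectors reuse only the components displayed in the example above (the elided middle components are omitted, not guessed). The text does not state which normalization is used, so L2 normalization here is an assumption and its output will not match the illustrative figures [0.2,0.3,0.3,...,0.5]; the function names are hypothetical.

```python
import math
from typing import Dict, List


def text_vector(tokens: List[str], embeddings: Dict[str, List[float]]) -> List[float]:
    """Add up the semantic vectors of the words in a text, skipping unknown words."""
    dim = len(next(iter(embeddings.values())))
    total = [0.0] * dim
    for token in tokens:
        vec = embeddings.get(token)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += value
    return total


def l2_normalize(vec: List[float]) -> List[float]:
    """Scale a vector to unit length (assumed normalization scheme)."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


# Displayed components of the example vectors; real vectors would have ~200 dimensions.
embeddings = {
    "father": [0.1, 0.2, 0.1, 0.5],
    "what": [0.2, 0.1, 0.3, 0.3],
    "when": [0.1, 0.2, 0.2, 0.1],
    "go_home": [0.0, 0.1, 0.0, 0.1],
}

summed = text_vector(["father", "what", "when", "go_home"], embeddings)
unit = l2_normalize(summed)
```

Whatever normalization is chosen, the text vector keeps the fixed dimension of the word vectors, independent of vocabulary size.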
It can be seen that, for the same text, the vector of more than 100,000 dimensions [0,0,0,1,...,1,...,0,...,1,...,1,0,0] produced by the 0/1 representation becomes a fixed-dimension (for example 200-dimensional) vector [0.2,0.3,0.3,...,0.5]. The dimension is greatly reduced and the amount of computation falls accordingly, which improves the processing efficiency and performance of the classification training module. In addition, because semantic vectors better capture the context between words, similarity can be computed better. For example, it can be recognized that "father" and "old bean" (a colloquial word for father) are similar in certain contexts, so the similarity of the two texts "when does father go home" and "when does old bean go home" can be computed better.
In summary, a semantic vector is a representation, found with a neural network, of each word in a vector space. It takes into account the context in which a word appears and uses the frequency with which words co-occur in the same context to characterize the correlation between words. For example, "cat" and "dog" often occur together in the same context, so their distance based on semantic vectors is smaller than the corresponding distance between "cat" and "apple".
Specifically, a semantic vector needs to cover the context information of a word, so that the cosine of the angle between the vectors of semantically similar words is larger. The context of a word is characterized with a conditional probability P: the probability of each word is affected only by the words that occur before it, i.e. $P(w_i \mid w_1, \ldots, w_{i-1})$. To simplify computation, each word is generally assumed to be affected only by the n-1 words preceding it, i.e. $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$. A good semantic vector should maximize the conditional probability $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$ of each word. A three-layer neural network model is used to maximize this probability. The input layer of the neural network is the n-1 preceding words; each word corresponds to one semantic vector, denoted $C(w_{i-n+1}), \ldots, C(w_{i-1})$, where C is the set of all word vectors and the dimension of each vector is m. These n-1 vectors are concatenated end to end to form a vector of (n-1)m dimensions, denoted x. A nonlinear hidden layer then models x as $\tanh(Hx + d)$, where d is the bias term and tanh is the activation function. The output layer of the neural network is a prediction result of |V| dimensions, where V is the set of words; see formula (1) below:
$$y = \mathrm{softmax}\bigl(U \tanh(Hx + d) + Wx + b\bigr) \tag{1}$$
Here softmax is the activation function; U (a $|V| \times h$ matrix, where h is the number of hidden-layer units) holds the parameters from the hidden layer to the output layer; W (a $|V| \times (n-1)m$ matrix) is a linear transformation directly from the input layer to the output layer; and b is the bias term of the output layer. The i-th dimension $y_i$ of the prediction result y represents the probability that the next word is word i, i.e. $y_i = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$.
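The forward pass of formula (1) can be sketched in plain Python as follows. This is a toy illustration: the sizes (|V| = 5, m = 3, n = 3, h = 4), the random initial parameters, and the function names are all assumptions, and no training is performed.

```python
import math
import random
from typing import List

Matrix = List[List[float]]


def matvec(matrix: Matrix, vec: List[float]) -> List[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, vec)) for row in matrix]


def softmax(logits: List[float]) -> List[float]:
    """Numerically stable softmax."""
    top = max(logits)
    exps = [math.exp(v - top) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def nnlm_forward(context_ids: List[int], C: Matrix, H: Matrix, d: List[float],
                 U: Matrix, W: Matrix, b: List[float]) -> List[float]:
    """Compute y = softmax(U tanh(Hx + d) + Wx + b) for n-1 context words."""
    # x: the n-1 context word vectors concatenated end to end, (n-1)*m values.
    x = [value for word_id in context_ids for value in C[word_id]]
    hidden = [math.tanh(a + bias) for a, bias in zip(matvec(H, x), d)]
    logits = [u + w + bias
              for u, w, bias in zip(matvec(U, hidden), matvec(W, x), b)]
    return softmax(logits)


rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


V, m, n, h = 5, 3, 3, 4           # toy sizes (assumed)
C = rand_matrix(V, m)             # one m-dimensional semantic vector per word
H = rand_matrix(h, (n - 1) * m)   # input-to-hidden weights
d = [0.0] * h                     # hidden bias
U = rand_matrix(V, h)             # hidden-to-output weights
W = rand_matrix(V, (n - 1) * m)   # direct input-to-output weights
b = [0.0] * V                     # output bias

y = nnlm_forward([1, 3], C, H, d, U, W, b)  # distribution over the next word
```

Training would adjust C, H, d, U, W, and b to raise the probability assigned to the word that actually follows each context; the rows of C are then the semantic vectors.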
This neural network is solved with the back propagation algorithm, which yields the set C of semantic vectors of the words (the semantic vector of word $w_i$ is $C(w_i)$). The solving process requires statistics on the n-1 context words preceding each word and their related frequency information; the IM-space status-post data is used as the corpus for collecting this frequency information.
The benefits of representing text as vectors in the embodiment of the present invention are as follows:
The prior art relies on keyword matching and operates on the text directly. It requires finding many keywords, which is laborious, and the keywords found may be incomplete, so accuracy cannot be guaranteed. To classify more accurately, the embodiment of the present invention does not classify on raw text; it converts the text into a vector form that can be analyzed and processed mathematically. This requires first segmenting the text into terms and then processing the words that make up the text. The text is represented in vector form by the VSM, a statistical model that maps each text in the data to a data point (point vector) in a vector space spanned by a set of normalized orthogonal term vectors. Once the text is in a mathematically tractable vector form, classification can be based on probability or on distance. In distance-based classification, for example, each text is regarded as a data point in the vector space, and classification is performed by computing the distances between data points. Classification is a machine learning process. The data points (point vectors) are points in n-dimensional real space, and the classification training module finds a separating plane in the vector space; Fig. 8 shows such a separating plane, which separates data points of different classes to classify the data. Preferably, the data points can be separated by a hyperplane of n-1 dimensions; a classifier of this kind is usually called a linear classifier. It is not limited to the SVM of the embodiment of the present invention, since many classifiers meet this requirement. If an optimal separating plane (the maximum-margin hyperplane) can be found, that is, the plane that maximizes the margin between the data points of the two classes, the classification effect is better.
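Finding a separating hyperplane for labelled point vectors can be sketched with a simple linear classifier. The sketch below uses a perceptron, which is swapped in here for brevity and is not the SVM mentioned in the text: it finds some separating hyperplane on linearly separable data but does not maximize the margin. The two-dimensional points and all names are hypothetical.

```python
from typing import List, Sequence, Tuple

Sample = Tuple[Sequence[float], int]  # (point vector, label in {+1, -1})


def train_perceptron(samples: List[Sample], max_epochs: int = 1000
                     ) -> Tuple[List[float], float]:
    """Return (w, b) of a hyperplane w.x + b = 0 separating the two classes."""
    dim = len(samples[0][0])
    w = [0.0] * dim
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in samples:
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if label * score <= 0:          # misclassified or on the plane
                w = [wi + label * xi for wi, xi in zip(w, x)]
                b += label
                mistakes += 1
        if mistakes == 0:                   # every training point is separated
            break
    return w, b


def predict(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Toy stand-ins for the vectors of positive-sample and negative-sample seed words.
samples: List[Sample] = [
    ((2.0, 2.0), 1), ((0.0, 0.0), -1),
    ((3.0, 1.5), 1), ((-1.0, 0.5), -1),
    ((2.5, 3.0), 1), ((0.5, -1.0), -1),
]
w, b = train_perceptron(samples)
predictions = [predict(w, b, x) for x, _ in samples]
```

An SVM would solve the same separation problem while additionally choosing the hyperplane with the largest margin between the two classes.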
Based on system embodiments one to three above, the system further includes a data diffusion unit configured to further analyze the user relationships with the indicative feature according to the forward-inverse relation and the transitive relation, so as to obtain user information related to those user relationships. Taking kinship as the user relationship with the indicative feature, this is described as follows:
Fig. 11 is a schematic diagram of a specific implementation of the functional modules in the kinship expansion unit of Fig. 4. The kinship diffusion unit obtains the relatives of relatives through diffusion relations. A diffusion relation table is shown in Table 1 below.
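|         | Father       | Brother  | Cousins  | Aunt     | Son      | Aunt     |
|---------|--------------|----------|----------|----------|----------|----------|
| Father  | Grandfather  | Cousin   | Uncle    | Relative | Brother  | Relative |
| Brother | Father       | Brother  | Cousins  | Aunt     | Nephew   | Aunt     |
| Cousins | Relative     | Cousins  | 0        | 0        | Nephew   | 0        |
| Aunt    | Granddad     | Uncle    | 0        | 0        | Cousins  | 0        |
| Son     | Man and wife | Children | Relative | Relative | Relative | Relative |
| Aunt    | 0            | 0        | 0        | 0        | Cousins  | 0        |

Table 1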
Table 1 can also be expressed as a diffusion relation matrix. The kinship classification unit in Fig. 4 can determine whether kinship exists between users from the users' personal attribute information and their interaction text. However, information is missing for some users, and some users who are kin do not interact in the IM space. The kinship diffusion unit of Fig. 4 therefore further diffuses the users' kinship chains to obtain the relatives of relatives. Based on the kinship identified by the kinship classification unit and the topology of the users' social network, the kinship diffusion unit diffuses the kinship so as to improve the coverage of relative identification. As shown in Fig. 11, a specific implementation of the kinship diffusion module includes an IM user relationship chain extraction module, a forward-backward relation diffusion module, a general relation diffusion module, and a confidence-based relative-recognition-result pruning module. The IM user relationship chain extraction module extracts kinship from the identified relationships. The forward-backward relation diffusion module diffuses the relatives of relatives using the forward-inverse relation according to the diffusion relation table shown in Table 1. The general relation diffusion module diffuses the relatives of relatives using the transitive relation according to the diffusion relation table shown in Table 1. The confidence-based relative-recognition-result pruning module optimizes the diffusion results according to high-confidence rules in order to reduce the misjudgment rate.
For the forward-inverse relation (forward-backward relation), an example is shown in Fig. 12: forward-inverse diffusion is applied to both parties of a kinship relationship. For example, if user A is a relative of user B, then after diffusion user B is a relative of user A. For the transitive relation (second-degree relation diffusion), an example is shown in Fig. 13: the transitive relation propagates kinship. For example, if user A is the "father" of user B and user B is the "younger brother" of user C, then user A has a kinship relationship with user C.
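The two kinds of diffusion can be sketched as follows. This is an illustrative sketch only: the lookup table holds a few entries taken from Table 1, and reading a row label as the first hop and a column label as the second hop is an assumption, as are the function names and the edge encoding (an edge (A, B) labelled r means "B is A's r").

```python
from typing import Dict, Set, Tuple

Edge = Tuple[str, str]

# Subset of Table 1, read as (first hop, second hop) -> diffused relation (assumed).
DIFFUSION: Dict[Tuple[str, str], str] = {
    ("father", "father"): "grandfather",
    ("father", "son"): "brother",
    ("brother", "father"): "father",
    ("brother", "son"): "nephew",
}


def diffuse_forward_inverse(relatives: Set[Edge]) -> Set[Edge]:
    """If A is a relative of B, then B is a relative of A."""
    return relatives | {(b, a) for a, b in relatives}


def diffuse_transitive(edges: Dict[Edge, str]) -> Dict[Edge, str]:
    """Infer second-degree relations A -> C from A -> B and B -> C."""
    inferred: Dict[Edge, str] = {}
    for (a, b), first_hop in edges.items():
        for (b2, c), second_hop in edges.items():
            if b2 != b or c == a or (a, c) in edges:
                continue
            relation = DIFFUSION.get((first_hop, second_hop))
            if relation is not None:      # a "0" cell or missing entry: no diffusion
                inferred[(a, c)] = relation
    return inferred


symmetric = diffuse_forward_inverse({("A", "B")})
second_degree = diffuse_transitive({("A", "B"): "father", ("B", "C"): "father"})
```

In the usage above, B is A's father and C is B's father, so the table lookup yields C as A's grandfather.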
As for the confidence-based relative-recognition-result pruning module: diffusing kinship may lower accuracy. For example, if user A is the "cousin" of user B and user B is the "paternal uncle" of user C, user A may have no kinship with user C, or only a very distant one. In particular, the kinship classification module may misjudge user B and user C as kin; after second-degree relation diffusion, the error is propagated, and user A and user C are further misjudged as relatives. To improve the accuracy of relative identification, the relative recognition results are optimized with a method based on confidence rules. For example, during diffusion, if user A and user C share the same surname or are in the same region, the confidence of this diffusion is weighted upward; likewise, if user A and user C are judged to be relatives simultaneously in multiple dimensions such as IM friend remarks, IM group names, and IM circle names, the confidence that the relation is kinship is also weighted upward.
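Confidence-based pruning of diffused pairs can be sketched as follows. The rule set follows the examples in the paragraph above (same surname, same region, agreement across several dimensions), but the integer scores, the threshold, and every name in the sketch are hypothetical; the text does not specify actual weights.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class DiffusedPair:
    same_surname: bool
    same_region: bool
    agreeing_dimensions: int  # dimensions in which the pair was judged kin


def confidence(pair: DiffusedPair, base: int = 2) -> int:
    """Hypothetical integer score: each supporting rule adds one point."""
    score = base
    if pair.same_surname:
        score += 1
    if pair.same_region:
        score += 1
    if pair.agreeing_dimensions >= 2:
        score += 1
    return score


def prune(pairs: Dict[Tuple[str, str], DiffusedPair],
          threshold: int = 3) -> Dict[Tuple[str, str], DiffusedPair]:
    """Keep only diffused pairs whose confidence reaches the threshold."""
    return {key: pair for key, pair in pairs.items()
            if confidence(pair) >= threshold}


candidates = {
    ("A", "C"): DiffusedPair(same_surname=True, same_region=False, agreeing_dimensions=0),
    ("A", "D"): DiffusedPair(same_surname=False, same_region=False, agreeing_dimensions=0),
}
kept = prune(candidates)
```

Pairs supported by no rule stay at the base score and are dropped, which limits the spread of an upstream misjudgment through second-degree diffusion.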
It should be noted that the following description of the method is similar to the above description of the system; the beneficial effects, being the same as those of the system, are not repeated. For technical details not disclosed in the method embodiments of the present invention, refer to the description of the system embodiments of the present invention.
Method embodiment one:
The data mining processing method of the embodiment of the present invention, as shown in Fig. 14, includes:
Step 101: obtain data, the data being divided into multiple data types that can characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain.
Step 102: perform comprehensive analysis on the multiple data types according to a classification strategy, so as to obtain the user relationships with the indicative feature from the data.
Step 103: collect information according to the user relationships with the indicative feature, and send recommendation information according to the analysis result of that information.
With the embodiment of the present invention, the acquired data has multiple data types, and these data types can characterize, from different dimensions, the user relationships with an indicative feature in the user relationship chain; that is, data divided by different data types is in itself a comprehensive indicator. The data having the multiple data types is then analyzed comprehensively according to the classification strategy to obtain the user relationships with the indicative feature from the data. Therefore, the user relationships with the indicative feature in the user relationship chain can be mined from the vast data of internet communication, and the accuracy of identifying those user relationships is necessarily improved as well. Information is then collected according to the user relationships with the indicative feature, and recommendation information is sent according to the analysis result of that information, which necessarily improves the accuracy of the information recommended to users.
In a preferred implementation of the embodiment of the present invention, the multiple data types include at least two of the following: a data type characterizing users' personal attributes, a data type characterizing users' social topology, and a data type characterizing users' interaction behavior.
Method embodiment two:
The data mining processing method of the embodiment of the present invention, as shown in Fig. 15, includes:
Step 201: obtain data, the data being divided into multiple data types that can characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain.
Step 202: parse the characteristic parameters of the multiple data types; when the characteristic parameter of every data type among the multiple data types is below a preset threshold, determine that the data type is short text data and select the first strategy as the classification strategy.
Step 203: execute the first strategy by extracting seed words at random, the seed words being able to characterize the user relationships with the indicative feature.
Step 204: using the seed words as a reference benchmark, take the data having the multiple data types as training samples to be analyzed and compare them with the seed words to carry out classification training, so as to identify the user relationships with the indicative feature from the data.
Step 205: collect information according to the user relationships with the indicative feature, and send recommendation information according to the analysis result of that information.
With the embodiment of the present invention, the acquired data has multiple data types, and these data types can characterize, from different dimensions, the user relationships with an indicative feature in the user relationship chain; that is, data divided by different data types is in itself a comprehensive indicator. The data having the multiple data types is then analyzed comprehensively according to the classification strategy to obtain the user relationships with the indicative feature from the data. Therefore, the user relationships with the indicative feature in the user relationship chain can be mined from the vast data of internet communication, and the accuracy of identifying those user relationships is necessarily improved as well. Information is then collected according to the user relationships with the indicative feature, and recommendation information is sent according to the analysis result of that information, which necessarily improves the accuracy of the information recommended to users.
Moreover, step 202 determines that the data type is short text data and selects the first strategy as the classification strategy, and steps 203 to 204 use the randomly selected seed words to identify the user relationships with the indicative feature in the data.
In a preferred implementation of the embodiment of the present invention, step 204 specifically includes:
Step 2041a: represent the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data;
Step 2041b: determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationships with the indicative feature.
In another preferred implementation of the embodiment of the present invention, step 204 specifically includes:
Step 2042a: represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained from the context information of each word in the data;
Step 2042b: determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationships with the indicative feature.
Method embodiment three:
The data mining processing method of the embodiment of the present invention, as shown in Fig. 16, includes:
Step 301: obtain data, the data being divided into multiple data types that can characterize, from different dimensions, the user relationships with an indicative feature in a user relationship chain.
Step 302: parse the characteristic parameters of the multiple data types; when the characteristic parameters of some data types among the multiple data types are below a preset threshold, determine that those data types are short text data, and when the characteristic parameters of other data types are above the preset threshold, determine that those data types are long text data; select the second strategy as the classification strategy.
Step 303: execute the second strategy by constructing seed words from the user relationships with the indicative feature obtained by applying the first strategy to the short text data.
Step 304: using the seed words as a reference benchmark, take the data having the multiple data types as training samples to be analyzed and compare them with the seed words for similarity to carry out classification training, so as to identify the user relationships with the indicative feature from the data.
Step 305: collect information according to the user relationships with the indicative feature, and send recommendation information according to the analysis result of that information.
With the embodiment of the present invention, the acquired data has multiple data types, and these data types can characterize, from different dimensions, the user relationships with an indicative feature in the user relationship chain; that is, data divided by different data types is in itself a comprehensive indicator. The data having the multiple data types is then analyzed comprehensively according to the classification strategy to obtain the user relationships with the indicative feature from the data. Therefore, the user relationships with the indicative feature in the user relationship chain can be mined from the vast data of internet communication, and the accuracy of identifying those user relationships is necessarily improved as well. Information is then collected according to the user relationships with the indicative feature, and recommendation information is sent according to the analysis result of that information, which necessarily improves the accuracy of the information recommended to users.
Moreover, step 302 determines that the data types include long text data and selects the second strategy as the classification strategy, and steps 303 to 304 use the seed words constructed from the recognition results of the first strategy to identify the user relationships with the indicative feature in the data.
In a preferred implementation of the embodiment of the present invention, step 303 specifically includes:
Taking the user relationship data pairs formed by user relationships identified as having the indicative feature in multiple dimensions simultaneously as positive-sample seed words, and taking the user relationship data pairs formed by user relationships not identified as having the indicative feature in any dimension as negative-sample seed words.
In a preferred implementation of the embodiment of the present invention, step 304 specifically includes:
Step 3041a: represent the data as vectors in a vector space according to a vector space model, where each word in the data serves as one dimension of the vector and the total dimension of the vector is the total number of words in the data;
Step 3041b: determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationships with the indicative feature.
In another preferred implementation of the embodiment of the present invention, step 304 specifically includes:
Step 3042a: represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model, where the fixed dimension is obtained from the context information of each word in the data;
Step 3042b: determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationships with the indicative feature.
Based on method embodiments one to three of the present invention, the method further includes: further analyzing the user relationships with the indicative feature according to the forward-inverse relation and the transitive relation, so as to obtain user information related to the user relationships with the indicative feature.
If the integrated module described in the embodiments of the present invention is implemented as a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium. Based on this understanding, those skilled in the art should understand that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media containing computer-usable program code, the storage media including but not limited to a USB flash drive, a removable hard disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a magnetic disk memory, a CD-ROM, an optical memory, and the like.
The present application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to the embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction apparatus, the instruction apparatus implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or another programmable data processing device, such that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art can make additional changes and modifications to these embodiments once they learn of the basic inventive concept. Therefore, the appended claims are intended to be construed as including the preferred embodiments and all changes and modifications that fall within the scope of the present application.
Correspondingly, an embodiment of the present invention further provides a computer storage medium in which a computer program is stored, the computer program being used to implement the data mining processing system and method of the embodiments of the present invention.
The foregoing is only a preferred embodiment of the present invention, is not intended to limit the scope of the present invention.

Claims (17)

  1. A data mining processing system, characterized in that the system comprises: a data capture unit, a data sorting unit, and a data processing unit; wherein,
    The data capture unit is configured to obtain data from multiple data sources and output the data to the data sorting unit; the data is divided into multiple data types that reveal the users' social topology and can characterize, in different dimensions, the user relationships having an indicative feature in the user relationship chain;
    The data sorting unit is configured to perform comprehensive analysis on the multiple data types according to a classification policy and, when identifying the user relationship having the indicative feature from short text data, to obtain the user relationship having the indicative feature by analyzing the data either by randomly extracting seed words or by constructing seed words, and to output the user relationship having the indicative feature to the data processing unit; wherein constructing seed words includes: taking user relationship data pairs formed by user relationships identified as having the indicative feature in multiple dimensions simultaneously as positive-sample seed words; taking user relationship data pairs formed by user relationships not identified as having the indicative feature in any dimension as negative-sample seed words; representing the positive-sample seed words and the negative-sample seed words as vectors in a vector space, to generate semantic vectors respectively corresponding to the positive-sample seed words and the negative-sample seed words; and inputting the semantic vectors respectively corresponding to the positive-sample seed words and the negative-sample seed words into a classifier for classification training, and then identifying and classifying user relationships, to identify the user relationship having the indicative feature; wherein the short text data includes: among the data characterizing users' personal attributes, data types with a large data volume and short text content;
    The data processing unit is configured to collect information according to the user relationship having the indicative feature, so as to send recommendation information according to an analysis result of the information;
    The multiple data types include at least two of the following data types: data characterizing users' personal attributes, data characterizing users' social topology, and data characterizing users' interaction behavior.
  2. The system according to claim 1, characterized in that the data sorting unit comprises:
    A policy selection subunit, configured to parse characteristic parameters of the multiple data types and, when the characteristic parameter of every data type among the multiple data types is lower than a preset threshold, determine that the data types are short text data and select a first strategy as the classification policy;
    A policy execution subunit, configured to, when using the first strategy to identify the user relationship having the indicative feature from the short text data, randomly extract the seed words, the seed words being able to characterize the user relationship having the indicative feature; take the seed words as a reference benchmark; and compare the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data.
  3. The system according to claim 2, characterized in that the policy execution subunit comprises:
    A vector generation module, configured to express the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of a vector, and the total dimension of the vector is the total number of words in the data;
    A classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationship having the indicative feature;
    An analysis result output module, configured to output the identified user relationship having the indicative feature.
  4. The system according to claim 2, characterized in that the policy execution subunit comprises:
    A vector generation module, configured to express the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is obtained based on the context information of each word in the data;
    A classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationship having the indicative feature;
    An analysis result output module, configured to output the identified user relationship having the indicative feature.
  5. The system according to claim 1, characterized in that the data sorting unit comprises:
    A policy selection subunit, configured to parse characteristic parameters of the multiple data types; when the characteristic parameters of some of the data types are lower than a preset threshold, determine that those data types are short text data; when the characteristic parameters of some of the data types are higher than the preset threshold, determine that those data types are long text data; and select a second strategy as the classification policy;
    A policy execution subunit, configured to, when using the second strategy to identify the user relationship having the indicative feature from the long text data, construct the seed words from the user relationship having the indicative feature that was identified from the short text data using the first strategy; take the seed words as a reference benchmark; and perform similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data.
  6. The system according to claim 5, characterized in that the policy execution subunit comprises:
    A seed word construction module, configured to obtain the positive-sample seed words and the negative-sample seed words when constructing the seed words from the user relationship having the indicative feature that was identified from the short text data using the first strategy;
    A vector generation module, configured to express the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of a vector, and the total dimension of the vector is the total number of words in the data;
    A classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationship having the indicative feature;
    An analysis result output module, configured to output the identified user relationship having the indicative feature.
  7. The system according to claim 5, characterized in that the policy execution subunit comprises:
    A seed word construction module, configured to obtain the positive-sample seed words and the negative-sample seed words when constructing the seed words from the user relationship having the indicative feature that was identified from the short text data using the first strategy;
    A vector generation module, configured to express the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is obtained based on the context information of each word in the data;
    A classification training module, configured to determine a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationship having the indicative feature;
    An analysis result output module, configured to output the identified user relationship having the indicative feature.
  8. The system according to any one of claims 1 to 7, characterized in that the system further comprises a data diffusion unit, the data diffusion unit being located between the data sorting unit and the data processing unit;
    The data diffusion unit is configured to further analyze the user relationship having the indicative feature according to forward and inverse relations and transitive relations, to obtain user information related to the user relationship having the indicative feature.
  9. A data mining processing method, characterized in that the method comprises:
    Obtaining data from multiple data sources, the data being divided into multiple data types that reveal the users' social topology and can characterize, in different dimensions, the user relationships having an indicative feature in the user relationship chain;
    Performing comprehensive analysis on the multiple data types according to a classification policy and, when identifying the user relationship having the indicative feature from short text data, obtaining the user relationship having the indicative feature by analyzing the data either by randomly extracting seed words or by constructing seed words; wherein constructing seed words includes: taking user relationship data pairs formed by user relationships identified as having the indicative feature in multiple dimensions simultaneously as positive-sample seed words; taking user relationship data pairs formed by user relationships not identified as having the indicative feature in any dimension as negative-sample seed words; representing the positive-sample seed words and the negative-sample seed words as vectors in a vector space, to generate semantic vectors respectively corresponding to the positive-sample seed words and the negative-sample seed words; and inputting the semantic vectors respectively corresponding to the positive-sample seed words and the negative-sample seed words into a classifier for classification training, and then identifying and classifying user relationships, to identify the user relationship having the indicative feature; wherein the short text data includes: among the data characterizing users' personal attributes, data types with a large data volume and short text content;
    Collecting information according to the user relationship having the indicative feature, so as to send recommendation information according to an analysis result of the information;
    The multiple data types include at least two of the following data types: data characterizing users' personal attributes, data characterizing users' social topology, and data characterizing users' interaction behavior.
  10. The method according to claim 9, characterized in that performing comprehensive analysis on the multiple data types according to a classification policy comprises:
    Parsing characteristic parameters of the multiple data types and, when the characteristic parameter of every data type among the multiple data types is lower than a preset threshold, determining that the data types are short text data and selecting a first strategy as the classification policy;
    Randomly extracting the seed words when executing the first strategy, the seed words being able to characterize the user relationship having the indicative feature;
    Taking the seed words as a reference benchmark, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data.
  11. The method according to claim 10, characterized in that taking the seed words as a reference benchmark, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data, comprises:
    Expressing the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of a vector, and the total dimension of the vector is the total number of words in the data;
    Determining a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationship having the indicative feature.
  12. The method according to claim 10, characterized in that taking the seed words as a reference benchmark, and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data, comprises:
    Expressing the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is obtained based on the context information of each word in the data;
    Determining a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the seed words, so as to identify the user relationship having the indicative feature.
  13. The method according to claim 9, characterized in that performing comprehensive analysis on the multiple data types according to a classification policy comprises:
    Parsing characteristic parameters of the multiple data types; when the characteristic parameters of some of the data types are lower than a preset threshold, determining that those data types are short text data; when the characteristic parameters of some of the data types are higher than the preset threshold, determining that those data types are long text data; and selecting a second strategy as the classification policy;
    When executing the second strategy, constructing the seed words from the user relationship having the indicative feature that was identified from the short text data using the first strategy;
    Taking the seed words as a reference benchmark, and performing similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data.
  14. The method according to claim 13, characterized in that the method further comprises: obtaining the positive-sample seed words and the negative-sample seed words when constructing the seed words from the user relationship having the indicative feature that was identified from the short text data using the first strategy.
  15. The method according to claim 14, characterized in that taking the seed words as a reference benchmark, and performing similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data, comprises:
    Expressing the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of a vector, and the total dimension of the vector is the total number of words in the data;
    Determining a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationship having the indicative feature.
  16. The method according to claim 14, characterized in that taking the seed words as a reference benchmark, and performing similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to perform classification training, so as to identify the user relationship having the indicative feature from the data, comprises:
    Expressing the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is obtained based on the context information of each word in the data;
    Determining a separating plane according to the distribution positions, in the vector space, of the vectors and of the vectors corresponding to the positive-sample seed words and the negative-sample seed words, so as to identify the user relationship having the indicative feature.
  17. The method according to any one of claims 9 to 16, characterized in that the method further comprises:
    Further analyzing the user relationship having the indicative feature according to forward and inverse relations and transitive relations, to obtain user information related to the user relationship having the indicative feature.
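The strategy selection and seed-word construction recited in the claims above can be sketched as follows. This is a minimal illustration under stated assumptions: the claims do not say what the characteristic parameter is (average text length is assumed here), "multiple dimensions" is read as at least two, the case where a parameter equals the threshold is not covered by the claims and is treated as the second strategy, and all names and sample values are hypothetical.

```python
def select_strategy(characteristic_params, threshold):
    """Pick the classification strategy from per-data-type characteristic parameters.

    Returns "first" when every data type is below the threshold (all short text),
    otherwise "second" (assumption: some long text is present).
    """
    if all(value < threshold for value in characteristic_params.values()):
        return "first"
    return "second"

def build_seed_pairs(candidate_pairs, hits_by_dimension, min_dimensions=2):
    """Split user pairs into positive-sample and negative-sample seeds.

    hits_by_dimension maps a dimension name to the set of user pairs that were
    identified in that dimension as having the indicative feature.
    """
    positive, negative = set(), set()
    for pair in candidate_pairs:
        hit_count = sum(pair in hits for hits in hits_by_dimension.values())
        if hit_count >= min_dimensions:
            positive.add(pair)   # identified in several dimensions at once
        elif hit_count == 0:
            negative.add(pair)   # not identified in any dimension
        # pairs identified in exactly one dimension are left out of both sets
    return positive, negative

# Hypothetical identification results in three dimensions.
hits_by_dimension = {
    "personal_attributes": {("u1", "u2"), ("u3", "u4")},
    "social_topology": {("u1", "u2")},
    "interaction_behavior": {("u1", "u2"), ("u5", "u6")},
}
candidate_pairs = [("u1", "u2"), ("u3", "u4"), ("u5", "u6"), ("u7", "u8")]
positive, negative = build_seed_pairs(candidate_pairs, hits_by_dimension)

short_only = select_strategy({"remark_name": 4, "group_name": 6}, threshold=20)
mixed = select_strategy({"remark_name": 4, "chat_log": 180}, threshold=20)
```

Leaving single-dimension pairs out of both sets keeps the seeds high-confidence on each side, which seems to be the intent of requiring simultaneous identification in multiple dimensions for positive samples.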
CN201410174489.4A 2014-04-28 2014-04-28 A kind of data mining processing system and method Active CN104615608B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410174489.4A CN104615608B (en) 2014-04-28 2014-04-28 A kind of data mining processing system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410174489.4A CN104615608B (en) 2014-04-28 2014-04-28 A kind of data mining processing system and method

Publications (2)

Publication Number Publication Date
CN104615608A CN104615608A (en) 2015-05-13
CN104615608B true CN104615608B (en) 2018-05-15

Family

ID=53150057

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410174489.4A Active CN104615608B (en) 2014-04-28 2014-04-28 A kind of data mining processing system and method

Country Status (1)

Country Link
CN (1) CN104615608B (en)

Families Citing this family (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106453030B (en) * 2015-08-12 2019-10-11 大连民族学院 A kind of method and device obtaining social networks chain
CN106557942B (en) 2015-09-30 2020-07-10 百度在线网络技术(北京)有限公司 User relationship identification method and device
CN105468723B (en) * 2015-11-20 2019-08-20 小米科技有限责任公司 Information recommendation method and device
CN106157114A (en) * 2016-07-06 2016-11-23 商宴通(上海)网络科技有限公司 Have dinner based on user the homepage proposed algorithm of behavior modeling
CN107800608A (en) * 2016-09-05 2018-03-13 腾讯科技(深圳)有限公司 A kind of processing method and processing device of user profile
CN106547856B (en) * 2016-10-19 2020-03-17 天脉聚源(北京)科技有限公司 Method and device for sharing data by application
CN108874821B (en) * 2017-05-11 2021-06-15 腾讯科技(深圳)有限公司 Application recommendation method and device and server
CN107392781B (en) * 2017-06-20 2021-11-02 挖财网络技术有限公司 User relationship identification method, object relationship identification method and device
CN107464141B (en) * 2017-08-07 2021-09-07 北京京东尚科信息技术有限公司 Method and device for information popularization, electronic equipment and computer readable medium
CN107741953B (en) * 2017-09-14 2020-01-21 平安科技(深圳)有限公司 Method and device for matching realistic relationship of social platform user and readable storage medium
CN109767278B (en) * 2017-11-09 2021-03-30 北京京东尚科信息技术有限公司 Method and apparatus for outputting information
CN107948255B (en) 2017-11-13 2019-09-03 苏州达家迎信息技术有限公司 The method for pushing and computer readable storage medium of APP
CN108170725A (en) * 2017-12-11 2018-06-15 仲恺农业工程学院 The social network user relationship strength computational methods and device of integrated multicharacteristic information
CN110020420B (en) * 2018-01-10 2023-07-21 腾讯科技(深圳)有限公司 Text processing method, device, computer equipment and storage medium
CN108737506A (en) * 2018-04-27 2018-11-02 苏州达家迎信息技术有限公司 A kind of application method for pushing, equipment, storage medium and system
CN109241048A (en) * 2018-06-29 2019-01-18 深圳市彬讯科技有限公司 For the data processing method of data statistics, server and storage medium
WO2020061815A1 (en) * 2018-09-26 2020-04-02 深圳市欢太科技有限公司 Method for switching game page, and related product
CN110751284B (en) * 2019-06-06 2020-12-25 北京嘀嘀无限科技发展有限公司 Heterogeneous information network embedding method and device, electronic equipment and storage medium
CN110880013A (en) * 2019-08-02 2020-03-13 华为技术有限公司 Text recognition method and device
CN110851491B (en) * 2019-10-17 2023-06-30 天津大学 Network link prediction method based on multiple semantic influence of multiple neighbor nodes

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101609450A (en) * 2009-04-10 2009-12-23 南京邮电大学 Web page classification method based on training set
CN102098332A (en) * 2010-12-30 2011-06-15 北京新媒传信科技有限公司 Method and device for examining and verifying contents
CN103425686A (en) * 2012-05-21 2013-12-04 微梦创科网络科技(中国)有限公司 Information publishing method and device
CN103617157A (en) * 2013-12-10 2014-03-05 东北师范大学 Text similarity calculation method based on semantics

Also Published As

Publication number Publication date
CN104615608A (en) 2015-05-13

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230705

Address after: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 518000 Room 403, East, Building 2, SEG Science and Technology Park, Zhenxing Road, Futian District, Shenzhen, Guangdong

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

TR01 Transfer of patent right