CN104615608B - A kind of data mining processing system and method - Google Patents
A data mining processing system and method
- Publication number: CN104615608B
- Application number: CN201410174489.4A
- Authority: CN (China)
- Prior art keywords: data, user, word, indication, vector
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/283—Multi-dimensional databases or data warehouses, e.g. MOLAP or ROLAP
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a data mining processing system and method. The system comprises a data acquisition unit, a data classification unit and a data processing unit. The data acquisition unit is used for acquiring data and outputting the data to the data classification unit; the data is divided into a plurality of data types, which can characterize, from different dimensions, the user relationships with indication features in a user relationship chain. The data classification unit is used for comprehensively analyzing the plurality of data types according to a classification strategy, so as to obtain a user relationship with an indication feature from the data through analysis, and for outputting the user relationship with the indication feature to the data processing unit. The data processing unit is used for collecting information according to the user relationship with the indication feature, so as to send recommendation information according to the analysis result of the information.
Description
Technical Field
The invention relates to the mining technology of internet communication, in particular to a data mining processing system and a data mining processing method.
Background
In implementing the technical solution of the embodiments of the present application, the inventors found at least the following technical problems in the related art:
With the rapid development of internet technology and changes in social structure, more and more people communicate, socialize and interact over the internet, including from mobile phones. This generates massive amounts of interactive behavior between people, from which various types of relationship chains among users can be obtained. These relationship chains can be applied to many aspects of social life, and service providers serve users through various applications, such as a reservation and ordering application on a mobile phone client.
Various types of relationship chains among users allow user requirements to be analyzed better, so that better services can be provided: for example, recommending a shopping APP (application) that the user needs and guiding the user to the required articles, or recommending special restaurant services, health products and the like.
Among the multiple types of relationship chains there are some user relationships with indication features, for example a relative relationship. Users in such a relationship may be interested in services provided by the same application or by the same type of application, so using the relationship plays a decisive role in improving the application's own database and, through that improvement, in recommending information accurately. Therefore, if the user relationships with indication features in the user relationship chain can be mined, they can be used as effective data to improve data effectiveness. This avoids the data redundancy caused by a large amount of invalid data occupying the database, and the improved data effectiveness makes it possible to recommend information accurately to the user. How to mine the user relationships with indication features so as to improve the accuracy of recommending information for the user is the technical problem to be solved.
However, although digging the user relationship with the indication feature out of the vast data of internet communication seems simple, it is not easy in practice, and the accuracy of the mined relationship is hard to guarantee. Still taking the relative relationship as an example, the prior art relies on simple keyword matching: for example, if one user is noted as "father" in an address book and another user is noted as "gunny", a relative relationship may exist between the two users. There are, however, many words expressing relative relationships; for instance, "dad" can also be expressed as "die", "father" and the like, so keyword matching can hardly enumerate all possible keywords. The related art therefore offers no effective solution to the above problems.
Disclosure of Invention
In view of the above, it is desirable to provide a data mining system and method, which can extract a specific user relationship with an indication feature from a vast amount of data of internet communication, so as to improve the accuracy of recommending information for a user.
The technical scheme of the embodiment of the invention is realized as follows:
the data mining processing system of the embodiment of the invention comprises: a data acquisition unit, a data classification unit and a data processing unit; wherein,
the data acquisition unit is used for acquiring data and outputting the data to the data classification unit, wherein the data is divided into a plurality of data types that can characterize, from different dimensions, the user relationships with indication features in a user relationship chain;
the data classification unit is used for comprehensively analyzing the plurality of data types according to a classification strategy, so as to obtain a user relationship with an indication feature from the data through analysis, and for outputting the user relationship with the indication feature to the data processing unit;
the data processing unit is used for collecting information according to the user relationship with the indication feature, so as to send recommendation information according to the analysis result of the information.
Preferably, the plurality of data types comprise at least two of the following: data characterizing personal attributes of the user, data characterizing the social topology of the user, and data characterizing interaction behaviors of the user.
Preferably, the data classification unit includes:
the strategy selection subunit is used for analyzing the characteristic parameters of the multiple data types, determining the data types to be short text data when the characteristic parameters of each data type in the multiple data types are lower than a preset threshold value, and selecting a first strategy as the classification strategy;
and the strategy executing subunit is configured to, when the first strategy is adopted to identify the user relationship with the indication feature for the short text data, randomly extract a seed word, where the seed word can represent the user relationship with the indication feature, use the seed word as a reference, compare the data with the multiple data types as training samples to be analyzed with the seed word to implement classification training, and identify the user relationship with the indication feature from the data.
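The strategy selection described above can be sketched as follows. This is only an illustrative sketch: the text does not say what the characteristic parameter is, so the mean text length, the threshold value and all function names here are assumptions. The text also does not cover the case where no data type is short, or where a parameter equals the threshold, so the sketch treats "not lower" as long text and returns None when nothing is short.

```python
from typing import Dict, List, Optional

SHORT_TEXT_THRESHOLD = 20.0  # hypothetical preset threshold, in characters


def characteristic_parameter(texts: List[str]) -> float:
    """Assumed characteristic parameter: mean length of the texts of one data type."""
    return sum(len(t) for t in texts) / len(texts) if texts else 0.0


def select_strategy(data_by_type: Dict[str, List[str]],
                    threshold: float = SHORT_TEXT_THRESHOLD) -> Optional[str]:
    """Pick a classification strategy from the per-type characteristic parameters."""
    is_short = [characteristic_parameter(texts) < threshold
                for texts in data_by_type.values()]
    if all(is_short):
        return "first"    # every data type is short text
    if any(is_short):
        return "second"   # some data types are short text, some are long text
    return None           # no short text at all: not covered by the text


# Remarks and nicknames are short; chat messages are long.
print(select_strategy({"remark": ["dad", "mom"], "nickname": ["Li"]}))
print(select_strategy({"remark": ["dad"], "chat": ["x" * 50]}))
```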
Preferably, the strategy executing subunit includes:
a vector generation module, used for representing the data as vectors in a vector space according to a vector space model, wherein each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total number of words in the data;
a classification training module, used for determining a segmentation plane according to the positions, in the vector space, of the vectors representing the data and of the vectors corresponding to the seed words, so as to identify the user relationship with the indication features;
and an analysis result output module, used for outputting the identified user relationship with the indication features.
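As a minimal sketch of the vector generation and classification training modules above, the following builds one-dimension-per-word vectors and fits a separating plane. The perceptron is a stand-in chosen for brevity, not the training method stated in the text; whitespace tokenization, the negative examples and all names are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def build_vocabulary(documents: List[str]) -> Dict[str, int]:
    """Assign one vector dimension to each distinct word (whitespace tokens assumed)."""
    vocab: Dict[str, int] = {}
    for doc in documents:
        for word in doc.split():
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(doc: str, vocab: Dict[str, int]) -> List[float]:
    """Represent a text as word counts; the dimension equals the vocabulary size."""
    vec = [0.0] * len(vocab)
    for word in doc.split():
        if word in vocab:
            vec[vocab[word]] += 1.0
    return vec


def train_plane(vectors: List[List[float]], labels: List[int],
                epochs: int = 20) -> Tuple[List[float], float]:
    """Fit a plane w.x + b = 0 with the perceptron rule (labels are +1 or -1)."""
    w, b = [0.0] * len(vectors[0]), 0.0
    for _ in range(epochs):
        for x, y in zip(vectors, labels):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            if y * score <= 0:  # misclassified or on the plane: move the plane
                w = [wi + y * xi for wi, xi in zip(w, x)]
                b += y
    return w, b


def predict(x: List[float], w: List[float], b: float) -> int:
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1


# Seed-like remarks (+1) and hypothetical unrelated remarks (-1).
samples = ["dad", "father", "dad home", "colleague", "boss", "client zhang"]
labels = [1, 1, 1, -1, -1, -1]
vocab = build_vocabulary(samples)
w, b = train_plane([to_vector(s, vocab) for s in samples], labels)
print(predict(to_vector("father home", vocab), w, b))
```

Because every distinct word adds a dimension, the vectors grow with the vocabulary and are mostly zeros for short remarks.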
Preferably, the strategy executing subunit includes:
a vector generation module, used for representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model, wherein the fixed dimension is derived from context information of each word in the data;
a classification training module, used for determining a segmentation plane according to the positions, in the vector space, of the vectors representing the data and of the vectors corresponding to the seed words, so as to identify the user relationship with the indication features;
and an analysis result output module, used for outputting the identified user relationship with the indication features.
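One way to picture fixed-dimension vectors derived from context information is the sketch below. It is not the method of the text, which does not specify how the vectors are computed: here each word's vector simply counts its neighboring words, hashed into a fixed number of buckets. The dimension, window size, hashing scheme and all names are assumptions.

```python
import zlib
from typing import Dict, List

DIM = 8  # hypothetical preset fixed dimension


def bucket(word: str, dim: int = DIM) -> int:
    """Map a word to one of `dim` buckets with a hash that is stable across runs."""
    return zlib.crc32(word.encode("utf-8")) % dim


def context_vectors(sentences: List[str], window: int = 1,
                    dim: int = DIM) -> Dict[str, List[float]]:
    """Build one fixed-length vector per word from the words that surround it."""
    vectors: Dict[str, List[float]] = {}
    for sentence in sentences:
        words = sentence.split()
        for i, word in enumerate(words):
            vec = vectors.setdefault(word, [0.0] * dim)
            for j in range(max(0, i - window), min(len(words), i + window + 1)):
                if j != i:
                    vec[bucket(words[j], dim)] += 1.0
    return vectors


def text_vector(text: str, vectors: Dict[str, List[float]],
                dim: int = DIM) -> List[float]:
    """Average the vectors of the known words so every text has length `dim`."""
    known = [vectors[w] for w in text.split() if w in vectors]
    if not known:
        return [0.0] * dim
    return [sum(col) / len(known) for col in zip(*known)]


vecs = context_vectors(["my dad called", "my father called"])
# Words used in the same contexts receive the same vector.
print(vecs["dad"] == vecs["father"])
```

Unlike the one-dimension-per-word representation, the vector length here stays at `DIM` however large the vocabulary becomes, and words that appear in similar contexts end up close together even when they are different keywords.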
Preferably, the data classification unit includes:
the strategy selection subunit is used for analyzing the characteristic parameters of the multiple data types, determining that the data types are short text data when the characteristic parameters of part of the data types in the multiple data types are lower than a preset threshold value, determining that the data types are long text data when the characteristic parameters of part of the data types are higher than the preset threshold value, and selecting a second strategy as the classification strategy;
and the strategy executing subunit is configured to, when the second strategy is adopted to identify the user relationship with the indication feature for the long text data, construct a seed word from the user relationship with the indication feature obtained by identifying the short text data with the first strategy, use the seed word as a reference, and perform similarity comparison between the data with the plurality of data types as a training sample to be analyzed and the seed word to implement classification training, so as to identify the user relationship with the indication feature from the data.
Preferably, the strategy executing subunit includes:
a seed word construction module, used for constructing seed words from the user relationships with the indication features obtained by identifying the short text data with the first strategy, wherein a user relationship data pair formed by a user relationship identified as having the indication features in multiple dimensions simultaneously is used as a positive sample seed word, and a user relationship data pair formed by a user relationship not identified as having the indication features in any dimension is used as a negative sample seed word;
a vector generation module, used for representing the data as vectors in a vector space according to a vector space model, wherein each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total number of words in the data;
a classification training module, used for determining a segmentation plane according to the positions, in the vector space, of the vectors representing the data and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationship with the indication features;
and an analysis result output module, used for outputting the identified user relationship with the indication features.
Preferably, the strategy executing subunit includes:
a seed word construction module, used for constructing seed words from the user relationships with the indication features obtained by identifying the short text data with the first strategy, wherein a user relationship data pair formed by a user relationship identified as having the indication features in multiple dimensions simultaneously is used as a positive sample seed word, and a user relationship data pair formed by a user relationship not identified as having the indication features in any dimension is used as a negative sample seed word;
a vector generation module, used for representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model, wherein the fixed dimension is derived from context information of each word in the data;
a classification training module, used for determining a segmentation plane according to the positions, in the vector space, of the vectors representing the data and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationship with the indication features;
and an analysis result output module, used for outputting the identified user relationship with the indication features.
Preferably, the system further comprises a data diffusion unit located between the data classification unit and the data processing unit;
the data diffusion unit is used for further analyzing the user relationship with the indication features according to the positive and negative relationship and the transfer relationship, to obtain user information related to the user relationship with the indication features.
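The text here does not define the positive and negative relationship or the transfer relationship in detail. Under one possible reading, the first derives the reverse direction of a known relationship and the second chains two known relationships through a shared user. The sketch below illustrates that reading only; the relation labels, the rule tables and the function names are hypothetical.

```python
from typing import Set, Tuple

Relation = Tuple[str, str, str]  # (user_a, relation label, user_b)

# Hypothetical rule tables, for illustration only.
INVERSE = {"parent": "child", "child": "parent", "spouse": "spouse"}
COMPOSE = {("parent", "parent"): "grandparent"}


def diffuse_inverse(relations: Set[Relation]) -> Set[Relation]:
    """Add the reverse direction of every relation that has a known inverse."""
    out = set(relations)
    for a, rel, b in relations:
        if rel in INVERSE:
            out.add((b, INVERSE[rel], a))
    return out


def diffuse_transfer(relations: Set[Relation]) -> Set[Relation]:
    """Chain a->b and b->c into a->c when the pair of labels has a known composition."""
    out = set(relations)
    for a, r1, b in relations:
        for b2, r2, c in relations:
            if b == b2 and a != c and (r1, r2) in COMPOSE:
                out.add((a, COMPOSE[(r1, r2)], c))
    return out


known = {("alice", "parent", "bob"), ("bob", "parent", "carol")}
print(sorted(diffuse_transfer(known) - known))
```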
The data mining processing method of the embodiment of the invention comprises the following steps:
acquiring data, wherein the data is divided into a plurality of data types and can be used for representing user relations with indication characteristics in a user relation chain from different dimensions;
comprehensively analyzing the multiple data types according to a classification strategy so as to analyze the data to obtain a user relationship with an indication characteristic;
and collecting information according to the user relationship with the indication characteristics to send recommendation information according to the analysis result of the information.
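The three method steps above can be wired together as in the following sketch. The stage functions are placeholders with made-up data and do not implement the classification strategy itself; every name and data layout here is an assumption.

```python
from typing import Callable, Dict, List, Tuple

Data = Dict[str, List[str]]      # data type name -> texts (hypothetical layout)
Relation = Tuple[str, str, str]  # (user_a, user_b, relation label)


def run_pipeline(acquire: Callable[[], Data],
                 classify: Callable[[Data], List[Relation]],
                 recommend: Callable[[List[Relation]], List[str]]) -> List[str]:
    """Acquire data, identify relationships, then produce recommendation information."""
    data = acquire()
    relations = classify(data)
    return recommend(relations)


def demo_acquire() -> Data:
    # Two data types describing the same pair of users from different dimensions.
    return {"remarks": ["u1 notes u2 as dad"], "interactions": ["u1 calls u2 daily"]}


def demo_classify(data: Data) -> List[Relation]:
    # Placeholder: a real classifier would apply the selected classification strategy.
    return [("u1", "u2", "relative")] if data.get("remarks") else []


def demo_recommend(relations: List[Relation]) -> List[str]:
    return [f"recommend family plan to {a} and {b}" for a, b, _ in relations]


print(run_pipeline(demo_acquire, demo_classify, demo_recommend))
```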
Preferably, the plurality of data types comprise at least two of the following: data characterizing personal attributes of the user, data characterizing the social topology of the user, and data characterizing interaction behaviors of the user.
Preferably, the comprehensively analyzing the plurality of data types according to a classification policy to analyze the data to obtain the user relationship with the indication feature includes:
analyzing the characteristic parameters of the multiple data types, determining the data types to be short text data when the characteristic parameters of each data type in the multiple data types are lower than a preset threshold value, and selecting a first strategy as the classification strategy;
executing the first strategy, and randomly extracting seed words which can represent user relations with indicating characteristics;
and taking the seed words as reference bases, and comparing the data with the multiple data types as training samples to be analyzed with the seed words to realize classification training so as to identify the user relationship with the indication features from the data.
Preferably, taking the seed words as a reference and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to implement classification training, so as to identify the user relationship with the indication features from the data, includes:
representing the data as vectors in a vector space according to a vector space model, wherein each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total number of words in the data;
and determining a segmentation plane according to the positions, in the vector space, of the vectors representing the data and of the vectors corresponding to the seed words, so as to identify the user relationship with the indication features.
Preferably, taking the seed words as a reference and comparing the data of the multiple data types, as training samples to be analyzed, with the seed words to implement classification training, so as to identify the user relationship with the indication features from the data, includes:
representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model, wherein the fixed dimension is derived from context information of each word in the data;
and determining a segmentation plane according to the positions, in the vector space, of the vectors representing the data and of the vectors corresponding to the seed words, so as to identify the user relationship with the indication features.
Preferably, the comprehensively analyzing the plurality of data types according to a classification policy to analyze the data to obtain the user relationship with the indication feature includes:
analyzing the characteristic parameters of the multiple data types, determining that the data types are short text data when the characteristic parameters of part of the data types are lower than a preset threshold value, determining that the data types are long text data when the characteristic parameters of part of the data types are higher than the preset threshold value, and selecting a second strategy as the classification strategy;
executing the second strategy, and constructing seed words by using the user relationship with the indication characteristics obtained by identifying the short text data by adopting the first strategy;
and taking the seed words as reference bases, and performing similarity comparison on the data with the multiple data types as training samples to be analyzed and the seed words to realize classification training so as to identify the user relationship with the indication characteristics from the data.
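The similarity comparison between a training sample and the seed words can be illustrated with cosine similarity over word counts. The text does not name a similarity measure, so cosine similarity, the nearest-seed decision rule and the example sentences are all assumptions made for this sketch.

```python
import math
from collections import Counter
from typing import List


def bow(text: str) -> Counter:
    """Bag-of-words counts (whitespace tokens assumed)."""
    return Counter(text.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two count vectors; 0.0 if either one is empty."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def resembles_positive(text: str, positive_seeds: List[str],
                       negative_seeds: List[str]) -> bool:
    """True when the text is closer to some positive seed than to any negative seed."""
    vec = bow(text)
    pos = max((cosine(vec, bow(s)) for s in positive_seeds), default=0.0)
    neg = max((cosine(vec, bow(s)) for s in negative_seeds), default=0.0)
    return pos > neg


positive = ["dad come home for dinner"]
negative = ["please send the quarterly report"]
print(resembles_positive("come home early dad", positive, negative))
print(resembles_positive("send the report today", positive, negative))
```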
Preferably, the step of constructing seed words by using the user relationship with the indication feature obtained by identifying the short text data by using the first strategy comprises the following steps:
and the user relationship data pairs formed by the user relationships which are simultaneously identified as having the indicating characteristics in a plurality of dimensions are used as positive sample seed words, and the user relationship data pairs formed by the user relationships which are not identified as having the indicating characteristics in any dimension are used as negative sample seed words.
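The construction of positive and negative sample seeds can be sketched as below. The sketch assumes that "a plurality of dimensions" means at least two, and that each dimension contributes a set of user pairs identified as having the indication features; the dimension names and user identifiers are made up.

```python
from typing import Dict, Iterable, Set, Tuple

Pair = Tuple[str, str]  # a user relationship data pair


def construct_seed_pairs(identified_by_dimension: Dict[str, Set[Pair]],
                         all_pairs: Iterable[Pair],
                         min_dimensions: int = 2) -> Tuple[Set[Pair], Set[Pair]]:
    """Split candidate pairs into positive and negative sample seeds."""
    positive: Set[Pair] = set()
    negative: Set[Pair] = set()
    for pair in all_pairs:
        hits = sum(1 for found in identified_by_dimension.values() if pair in found)
        if hits >= min_dimensions:
            positive.add(pair)   # identified in several dimensions at once
        elif hits == 0:
            negative.add(pair)   # identified in no dimension
        # pairs identified in exactly one dimension are left out of both sets
    return positive, negative


identified = {
    "personal_attributes": {("a", "b"), ("c", "d")},
    "social_topology": {("a", "b")},
    "interaction_behavior": {("a", "b"), ("e", "f")},
}
pos, neg = construct_seed_pairs(identified, [("a", "b"), ("c", "d"), ("e", "f"), ("g", "h")])
print(pos, neg)
```

Leaving out pairs seen in only one dimension keeps ambiguous cases away from both seed sets, which is one plausible motivation for the rule.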
Preferably, taking the seed words as a reference and performing similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to implement classification training, so as to identify the user relationship with the indication features from the data, includes:
representing the data as vectors in a vector space according to a vector space model, wherein each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total number of words in the data;
and determining a segmentation plane according to the positions, in the vector space, of the vectors representing the data and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationship with the indication features.
Preferably, taking the seed words as a reference and performing similarity comparison between the data of the multiple data types, as training samples to be analyzed, and the seed words to implement classification training, so as to identify the user relationship with the indication features from the data, includes:
representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model, wherein the fixed dimension is derived from context information of each word in the data;
and determining a segmentation plane according to the positions, in the vector space, of the vectors representing the data and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationship with the indication features.
Preferably, the method further comprises:
and further analyzing the user relationship with the indication characteristic according to the positive and negative relationship and the transfer relationship to obtain user information related to the user relationship with the indication characteristic.
The data mining processing system of the embodiment of the invention comprises a data acquisition unit, a data classification unit and a data processing unit. The data acquisition unit is used for acquiring data and outputting the data to the data classification unit; the data is divided into a plurality of data types that can characterize, from different dimensions, the user relationships with indication features in the user relationship chain. The data classification unit is used for comprehensively analyzing the plurality of data types according to a classification strategy, so as to obtain a user relationship with an indication feature from the data through analysis, and for outputting the user relationship with the indication feature to the data processing unit. The data processing unit is used for collecting information according to the user relationship with the indication feature, so as to send recommendation information according to the analysis result of the information.
With the embodiment of the present invention, the acquired data has a plurality of data types, and these data types can characterize, from different dimensions, the user relationship with the indication feature in the user relationship chain; that is, the data divided by the different data types serves as a comprehensive index. The data with the plurality of data types is then comprehensively analyzed according to the classification strategy, so that the user relationship with the indication feature is obtained from the data. In this way, the specific user relationship with the indication feature can be mined from the vast data of internet communication, and the accuracy of identifying that relationship is improved. Furthermore, because information is collected according to the user relationship with the indication feature and recommendation information is sent according to the analysis result of the information, the accuracy of recommending information for the user is also improved.
Drawings
FIG. 1 is a schematic diagram of an exemplary system of the present invention;
FIG. 2 is a schematic diagram of an exemplary system of the present invention;
FIG. 3 is a schematic diagram of an exemplary system of the present invention;
FIG. 4 is a diagram illustrating an application scenario in which embodiments of the system of the present invention are applied;
FIG. 5 is a schematic diagram of an exemplary system of the present invention;
FIG. 6 is a schematic diagram of a structure of the strategy executing subunit of FIG. 5;
FIG. 7 is a diagram of an application scenario in which the modules of FIG. 6 are applied;
FIG. 8 is a schematic diagram of a segmentation plane separating different data points for classification;
FIG. 9 is a schematic diagram of a structure of the strategy executing subunit of FIG. 5;
FIG. 10 is a diagram of an application scenario in which the modules of FIG. 9 are applied;
FIG. 11 is a schematic diagram illustrating an implementation of each functional module of the relative relationship diffusion unit in FIG. 4;
FIG. 12 is a schematic view of positive and negative relationship diffusion;
FIG. 13 is a schematic view of transfer relationship diffusion;
FIG. 14 is a flow chart of an implementation of a method embodiment of the present invention;
FIG. 15 is a flow chart of an implementation of a method embodiment of the present invention;
FIG. 16 is a flow chart of an implementation of a method embodiment of the present invention.
Detailed Description
The following describes the embodiments in further detail with reference to the accompanying drawings.
The first embodiment of the system:
As shown in FIG. 1, a data mining processing system according to an embodiment of the present invention includes a data acquisition unit, a data classification unit and a data processing unit. The data acquisition unit is used for acquiring data and outputting the data to the data classification unit; the data is divided into a plurality of data types that can characterize, from different dimensions, the user relationships with indication features in the user relationship chain. The data classification unit is used for comprehensively analyzing the plurality of data types according to a classification strategy, so as to obtain a user relationship with an indication feature from the data through analysis, and for outputting the user relationship with the indication feature to the data processing unit. The data processing unit is used for collecting information according to the user relationship with the indication feature, so as to send recommendation information according to the analysis result of the information.
With the embodiment of the present invention, the acquired data has a plurality of data types, and these data types can characterize, from different dimensions, the user relationship with the indication feature in the user relationship chain; that is, the data divided by the different data types serves as a comprehensive index. The data with the plurality of data types is then comprehensively analyzed according to the classification strategy, so that the user relationship with the indication feature is obtained from the data. In this way, the specific user relationship with the indication feature can be mined from the vast data of internet communication, and the accuracy of identifying that relationship is improved. Furthermore, because information is collected according to the user relationship with the indication feature and recommendation information is sent according to the analysis result of the information, the accuracy of recommending information for the user is also improved.
In a preferred implementation manner of the embodiment of the present invention, the plurality of data types include at least two of the following: data characterizing personal attributes of the user, data characterizing the social topology of the user, and data characterizing interaction behaviors of the user.
In a preferred implementation of the embodiment of the present invention, as shown in fig. 2, the system further includes a data diffusion unit, located between the data classification unit and the data processing unit, configured to further analyze the user relationship with the indication feature according to the positive and negative relationship and the transfer relationship, so as to obtain user information related to the user relationship with the indication feature.
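The diffusion step can be illustrated with a minimal sketch. It assumes, as an interpretation not spelled out in the text, that the "positive and negative relationship" means the relationship holds in both directions and that the "transfer relationship" means it is transitive (a relative of a relative is treated as a relative). The function name `diffuse_relations` and the user identifiers are hypothetical.

```python
from collections import defaultdict


def diffuse_relations(known_pairs):
    """Expand identified relative pairs by symmetry and transitivity.

    known_pairs: iterable of (user_a, user_b) pairs already identified
    as relatives. Returns every unordered pair implied when the relation
    is treated as symmetric and transitive (an assumption of this sketch).
    """
    # An undirected graph covers the reverse direction of every pair.
    graph = defaultdict(set)
    for a, b in known_pairs:
        graph[a].add(b)
        graph[b].add(a)

    seen = set()
    result = set()
    for start in graph:
        if start in seen:
            continue
        # Collect the connected component that contains `start`.
        component, stack = [], [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in graph[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        # By transitivity, every pair inside a component is related.
        component.sort()
        for i, a in enumerate(component):
            for b in component[i + 1:]:
                result.add((a, b))
    return result


pairs = diffuse_relations([("u1", "u2"), ("u2", "u3"), ("u4", "u5")])
```

Here `("u1", "u3")` is added because both users are related to `u2`. A real system would probably weight or limit such inferred pairs rather than accept the full closure.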
In a preferred implementation of the embodiment of the present invention, as shown in fig. 3, the system further includes a data output unit, located between the data diffusion unit and the data processing unit, configured to output to the data processing unit, for processing, both the user relationship with the indication feature obtained by the data classification unit and the user information related to that relationship further obtained by the data diffusion unit.
Fig. 4 is a schematic diagram of an application scenario of the system of the present invention. Fig. 4 includes a data acquisition unit, a relative relationship classification unit (a specific implementation of the data classification unit in fig. 3), a relative relationship diffusion unit (a specific implementation of the data diffusion unit in fig. 3), a relative relationship output unit (a specific implementation of the data output unit in fig. 3), and a data processing unit. The data acquisition unit acquires, from multiple data sources, the data used to analyze the user relationship with the indication feature; in this application scenario, the user relationship with the indication feature is exemplified by a relative relationship. The identified relative relationship passes through the relative relationship classification unit, the relative relationship diffusion unit and the relative relationship output unit to the data processing unit for processing. The data processing unit collects information according to the relative relationship to update the databases of N applications, and the different applications send recommendation information according to the analysis result of the information, so that the accuracy of the information recommended to the user can be improved. The N applications include an IM friend recommendation application, an IM friend intimacy estimation application, and various advertisement recommendation platforms such as Guangdiantong.
The multiple data sources in the application scenario include:
data type one: offline data for Instant Messaging (IM) applications;
data type two: data of local communication applications, such as contact data in a mobile phone address book;
data type three: interactive data generated when users interact on various large forums, on interactive platforms such as Sogou Wenwen, and on microblogs such as Sina Weibo.
Data type one and data type two generally characterize the personal attributes of the user. For example, if contacts in the IM application carry remarks such as "dad", "mom", "pall" and the like, these remarks indicate whether a relative relationship exists between certain users. Remarks can likewise be used for data type two, where the items that can be remarked and the amount of text are larger than for data type one, for example the user's home address, postal code and the like. If several users are remarked with the same home address, a relative relationship may exist between them; similarly, a postal code showing that several users live in the same area or on the same street can also help in judging the relative relationship. In general, data type one and data type two are data types with a large data volume and short text content; in other words, both are short text types.
Data type three: interactive data generated when users interact on various large forums, on interactive platforms such as Sogou Wenwen, and on microblogs such as Sina Weibo, for example "where is dad going" or "what time are you coming home to eat". These data have a small data volume and long text content, so data type three is a long text type.
In addition, data types one through three can all reveal the user's social topology.
For example, the data acquisition unit can access the data of multiple data sources, including the offline data of IM friends, the contact library of the mobile phone IM address book, and interactive posts in the IM space (including comments and forwards). The offline data of IM friends include IM user personal attributes (such as friend remarks and friend groups), IM circle information, IM group information (such as group names), IM social relationship chains, and the like. These data indicate relative relationships in different dimensions; for example, in an IM group named "parent", the group members are likely to be relatives of one another.
In summary, the data used to analyze the user relationship with the indication feature, such as a relative relationship, come from multiple data sources, and each data source corresponds to one data type, so the data are divided into multiple data types. The multiple data types include at least two of: a data type characterizing the personal attributes of the user, a data type characterizing the social topology of the user, and a data type characterizing the interaction behavior of the user. The user's personal attributes, social topology and social network interaction information can thus be considered together, and data of multiple data types can characterize the user relationship with the indication feature in the user relationship chain from different dimensions. By analyzing the user relationship with the indication feature on the basis of such data, the embodiment of the present invention realizes a comprehensive analysis that identifies the relationship more accurately than the single keyword matching mechanism of the prior art.
Taking a relative relationship as an example of the user relationship with the indication feature, the single keyword matching mechanism of the prior art has the following disadvantages:
First, the various factors that bear on a relative relationship cannot be comprehensively considered and reasonably analyzed:
Many factors indicate whether a relative relationship exists: a user is remarked as "dad" by an IM friend; a user joins a group called "relatives"; a relative of a relative in the social topology may also be a relative; and so on. To analyze each factor accurately, the analysis method must be targeted. Judging relative relationships across data of such different natures by simple keyword matching is too crude and performs poorly. For example, for user interactions in the IM space, keyword matching would wrongly conclude that the users interacting on the post "where is dad going" have a relative relationship. In addition, the factors do not all indicate a relative relationship equally strongly. For example, a contact remarked as "dad" in the phone address book is more likely to be a relative of the user than a friend who mentions "dad" in an interaction in the IM space. The existing single keyword matching mechanism cannot weigh these factors together.
Second, the coverage of the mined relative relationships is insufficient:
Many words can express the same relative relationship; for example, "dad" may also appear as "die", "father", or even "dad is", "old bean", and so on. It is difficult for the existing single keyword matching mechanism to enumerate all possible keywords. In particular, some phrases contain no relative-related keyword at all and yet indicate a relative relationship; for example, two parties interacting in the IM space on a post such as "when are you coming back for dinner" may well be relatives.
In the embodiment of the present invention, data of multiple data types are combined, and these data characterize the user relationship with the indication feature in the user relationship chain from different dimensions. The comprehensive analysis mechanism overcomes the above defects of the prior art, so that a user relationship with the indication feature, such as a relative relationship, can be identified accurately, which in turn supports more accurate information pushing.
The various social interaction relationships among users imply many opportunities for information recommendation; for example, relatives and friends exchange a great many greetings on every holiday. On the other hand, many kinds of people take part in social interactions, including a user's relatives, teachers, classmates, colleagues, strangers, and even agents and promoters. Among these groups, relatives offer great potential for information recommendation. For example, advertisers (such as restaurants or health care product vendors) can target relatives, helping them find suitable applications, products or services more easily; relatives can also be recommended to a user, helping the user expand the existing user relationship chain and increasing user stickiness; and recommending information to the user in this way improves the user experience.
The various combinations described in the first system embodiment are also possible in the following embodiments and, for brevity, are not described again.
The second embodiment of the system:
As shown in fig. 5, a data mining processing system according to an embodiment of the present invention includes a data acquisition unit, a data classification unit and a data processing unit. The data acquisition unit is configured to acquire data and output the data to the data classification unit; the data are divided into multiple data types that can characterize, from different dimensions, the user relationships with indication features in a user relationship chain. The data classification unit is configured to comprehensively analyze the multiple data types according to a classification strategy, so as to obtain the user relationship with the indication feature from the data, and to output that user relationship to the data processing unit. The data processing unit is configured to collect information according to the user relationship with the indication feature, so as to send recommendation information according to the analysis result of the information.
It should be noted that the data classification unit includes a strategy selection subunit and a strategy execution subunit. The strategy selection subunit is configured to analyze the characteristic parameters of the multiple data types, determine that the data are short text data when the characteristic parameter of each of the multiple data types is lower than a preset threshold, and select a first strategy as the classification strategy. The strategy execution subunit is configured to, when the first strategy is used to identify the user relationship with the indication feature in the short text data, randomly extract seed words that can characterize the user relationship with the indication feature, take the seed words as a reference, and compare the data of the multiple data types, used as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship with the indication feature from the data.
With the embodiment of the present invention, the acquired data have multiple data types, and these data types can characterize the user relationship with the indication feature in the user relationship chain from different dimensions; that is, data divided by data type form a comprehensive indicator. The data of the multiple data types are then comprehensively analyzed according to the classification strategy to obtain the user relationship with the indication feature. In this way, a specific user relationship with the indication feature can be mined from the user relationship chain in massive internet communication data, and the accuracy of identifying that relationship is improved. Moreover, because information is collected according to the user relationship with the indication feature and recommendation information is sent according to the analysis result of that information, the accuracy of the information recommended to the user is improved as well.
This embodiment takes the short text type mentioned in the first system embodiment as an example. The short text type is a data type with a large data volume and short text content; in other words, its characteristic parameter reflects a large data volume and short text content. The strategy selection subunit analyzes the characteristic parameter and, by comparing it with a preset threshold, determines that the data are of the short text type and selects the first strategy as the classification strategy. The strategy execution subunit then executes the first strategy, which is: randomly extract seed words that can characterize the user relationship with the indication feature, take the seed words as a reference, and compare the data of the multiple data types, used as training samples to be analyzed, with the seed words to perform classification training, so as to identify the user relationship with the indication feature from the data.
Fig. 6 is a schematic diagram of the structure of the strategy execution subunit in fig. 5. The strategy execution subunit has the following two implementation schemes: in the first, the vector generation module does not use fixed dimensions; in the second, the vector generation module uses fixed dimensions.
The first implementation scheme of the strategy execution subunit is as follows:
A vector generation module, configured to represent the data as a vector in a vector space according to a vector space model; each word in the data serves as one dimension of the vector, and the total dimension of the vector is the total number of words in the data.
A classification training module, configured to determine a segmentation plane according to the positions, in the vector space, of the vector and of the vectors corresponding to the seed words, so as to identify the user relationship with the indication feature.
An analysis result output module, configured to output the identified user relationship with the indication feature.
The second implementation scheme of the strategy execution subunit is as follows:
A vector generation module, configured to represent the data as a vector in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is obtained on the basis of the context information of each word in the data.
A classification training module, configured to determine a segmentation plane according to the positions, in the vector space, of the vector and of the vectors corresponding to the seed words, so as to identify the user relationship with the indication feature.
An analysis result output module, configured to output the identified user relationship with the indication feature.
Fig. 7 is a schematic diagram of an application scenario of the strategy execution subunit of fig. 6, including: a semantic vector generation module (a specific implementation of the vector generation module in fig. 6), a classification training module, and a predicted relative relationship output module (a specific implementation of the analysis result output module in fig. 6).
Taking a relative relationship as an example of the user relationship with the indication feature, the data classification unit shown in fig. 5, composed of the strategy selection subunit and the strategy execution subunit, may specifically be the relative relationship classification unit in fig. 4. The relative relationship classification unit predicts the relative relationships of users separately for each of the multiple data sources. Because data characteristics differ between data sources, different processing logic is needed for data sources of different natures: one processing logic (the first strategy as the classification strategy) is used for the short text types, namely data type one and data type two mentioned in the first system embodiment, and another processing logic (the second strategy as the classification strategy) is used for the long text type, namely data type three mentioned in the first system embodiment. This embodiment describes the first strategy; the second strategy is described in the third system embodiment below and is not repeated here.
This embodiment concerns the first strategy, whose most notable feature is that seed words are chosen randomly. The data concerned are the offline data of IM friends and the contacts of the mobile phone IM address book, such as IM user personal attributes (friend remarks, friend groups and the like), IM circle names and IM group names. Because the text of these data is very short (generally only a few words), the data belong to the short text type. A classification training module, which may be a classifier trained with Support Vector Machine (SVM) technology, performs classification training and uses relative-relationship seed words to identify the relative relationships present in the data of these two data types.
First, the semantic vector generation module represents the data as vectors in a vector space; then the classification training module identifies and classifies the relative relationships present in the data. Specifically, based on a Vector Space Model (VSM), the semantic vector generation module represents the data in a 0/1 representation as a space vector (which may be a point vector) in the vector space, and the classification training module then finds a segmentation plane in the vector space.
In the 0/1 representation, each word in the data, for example in a collection of texts, serves as one element of the vector (also called a dimension of the vector), and the total dimension of the vector is the total number of words in the whole collection. When a text is represented as a vector, the value of a dimension is 1 if the word corresponding to that dimension appears in the text and 0 otherwise. For example, word segmentation of the text "when does dad get home" yields four words: "dad", "what", "time" and "get home". If this text alone defined the vocabulary, the vector would have four dimensions. The 0/1 representation, however, takes all Chinese words as attributes; if the Chinese vocabulary has 100,000 words, the text is represented as a 100,000-dimensional vector [0, 0, 0, 1, ..., 1, ..., 0, ..., 1, 0, 0], which has the value 1 only in the dimensions corresponding to "dad", "what", "time" and "get home" and 0 everywhere else. For short text types with a large data volume, the dimension of a 0/1 vector is therefore very large (since the dimension of the vector is the total number of words).
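The 0/1 representation described above can be sketched in a few lines. The six-word vocabulary and the function name `binary_vector` are hypothetical stand-ins; a real vocabulary would have on the order of 100,000 entries, and the word list is assumed to come from a separate word segmentation step.

```python
def binary_vector(words, vocabulary):
    """Return the 0/1 vector of a segmented text over a fixed vocabulary."""
    present = set(words)
    # One dimension per vocabulary word: 1 if the word occurs, else 0.
    return [1 if term in present else 0 for term in vocabulary]


# Hypothetical toy vocabulary standing in for the full word list.
vocabulary = ["apple", "dad", "father", "get home", "time", "what"]

# Assumed output of word segmentation for "when does dad get home".
text_words = ["dad", "what", "time", "get home"]

vector = binary_vector(text_words, vocabulary)
```

The vector length always equals the vocabulary size, which is why the dimension grows with the total number of words rather than with the length of each text.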
The 0/1 representation has two drawbacks. Because its dimension is very large, computation is difficult, and the ultra-high dimension seriously harms the processing efficiency and performance of the classification training module. Moreover, it cannot reflect the similarity between texts with the same or similar meaning: the cosine of the angle between the vectors of semantically similar words does not reflect their similarity. For example, if "dad" and "father" are each represented as a 0/1 vector, the cosine similarity of these two semantically similar words is 0, which degrades the classification result.
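The zero cosine similarity mentioned above follows directly from the fact that two different words occupy different dimensions. The sketch below checks this on a hypothetical three-word vocabulary; the dense vectors used for contrast are made-up values, not learned ones.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention for an all-zero vector
    return dot / (norm_u * norm_v)


# 0/1 vectors over the hypothetical vocabulary ["apple", "dad", "father"].
dad = [0, 1, 0]
father = [0, 0, 1]
sim_one_hot = cosine_similarity(dad, father)

# Made-up dense vectors, only to show that a dense representation
# can assign a non-zero similarity to the same pair of words.
dad_dense = [0.1, 0.2, 0.5]
father_dense = [0.1, 0.3, 0.4]
sim_dense = cosine_similarity(dad_dense, father_dense)
```

With 0/1 vectors the dot product of two distinct single words is always zero, so no amount of training data changes the result.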
In view of these drawbacks of the 0/1 vector representation, the improved scheme uses a semantic vector representation with fixed dimensions, instead of using the total number of words in all texts as the total dimension of the vector.
For this improvement, the text of the data is first learned, resulting in a semantic vector of one fixed dimension (e.g., 200 dimensions) for each word. How to build the semantic vector is described below.
For example, word segmentation of the text "when does dad get home" yields four words, "dad", "what", "time" and "go home", each of which corresponds to a semantic vector: "dad" corresponds to [0.1, 0.2, 0.1, ..., 0.5]; "what" corresponds to [0.2, 0.1, 0.3, ..., 0.3]; "time" corresponds to [0.1, 0.2, 0.2, ..., 0.1]; and "go home" corresponds to [0.0, 0.1, 0.0, ..., 0.1]. The whole text is then represented as a semantic vector that is the sum of the semantic vectors of its words: [0.1, 0.2, 0.1, ..., 0.5] + [0.2, 0.1, 0.3, ..., 0.3] + [0.1, 0.2, 0.2, ..., 0.1] + [0.0, 0.1, 0.0, ..., 0.1] = [0.4, 0.6, 0.6, ..., 1.0]. After normalization, [0.4, 0.6, 0.6, ..., 1.0] becomes [0.2, 0.3, 0.3, ..., 0.5].
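The summation step can be sketched as follows, using only the four components that the example spells out for each word. The text does not say which normalization is applied, so this sketch assumes L2 normalization; its output therefore differs from the normalized values quoted above. The function name `text_vector` is hypothetical.

```python
import math


def text_vector(words, word_vectors):
    """Sum the semantic vectors of the words, then L2-normalize the sum."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    for word in words:
        for k, value in enumerate(word_vectors[word]):
            total[k] += value
    # Assumption: L2 normalization (the source does not name the method).
    norm = math.sqrt(sum(x * x for x in total))
    return [x / norm for x in total] if norm > 0 else total


# Four-dimensional truncation of the example vectors; real semantic
# vectors would have a fixed dimension such as 200.
word_vectors = {
    "dad": [0.1, 0.2, 0.1, 0.5],
    "what": [0.2, 0.1, 0.3, 0.3],
    "time": [0.1, 0.2, 0.2, 0.1],
    "go home": [0.0, 0.1, 0.0, 0.1],
}

vec = text_vector(["dad", "what", "time", "go home"], word_vectors)
```

Whatever normalization is chosen, the dimension of the text vector equals the dimension of the word vectors and does not depend on the vocabulary size.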
It can be seen that, for the same text, the 0/1 representation gives a vector of more than 100,000 dimensions, [0, 0, 0, 1, ..., 1, ..., 0, ..., 1, 0, 0], whereas the semantic representation gives a vector of fixed dimension (e.g., 200 dimensions), [0.2, 0.3, 0.3, ..., 0.5]. The dimension and the amount of computation are greatly reduced, which improves the processing efficiency and performance of the classification training module. In addition, because semantic vectors capture the contextual relationship between words, they support better similarity calculation; for example, "dad" and "old bean" can be recognized as similar in a given context, so the similarity of the two texts "dad gets home" and "old bean gets home" can be calculated more accurately.
In general terms, a semantic vector is a representation of a word in a continuous vector space, learned with a neural network. It takes the context of each word into account and describes the relatedness of words by how often they appear together in the same context; for example, the semantic-vector distance between two semantically related words (such as "dad" and "father" in the example above) is smaller than the distance between either of them and "apple".
In particular, a semantic vector needs to capture the context information of a word, so that the cosine of the angle between the vectors of semantically similar words is larger. The context of a word is characterized by a conditional probability P: the probability of each word is influenced only by the words that precede it, i.e. P(w_i | w_1, ..., w_{i-1}). To simplify the calculation, only the influence of the preceding n-1 words on each word is generally considered, i.e. P(w_i | w_{i-n+1}, ..., w_{i-1}). A good semantic vector should maximize the conditional probability P(w_i | w_{i-n+1}, ..., w_{i-1}) of each word. A three-layer neural network model is used to maximize this probability. The input layer of the neural network is the n-1 preceding words, each of which corresponds to a semantic vector, denoted C(w_{i-n+1}), ..., C(w_{i-1}), where C is the set of all word vectors and each vector has dimension m. The n-1 vectors are concatenated end to end to form an (n-1)m-dimensional vector, denoted x. A nonlinear hidden layer then models x as tanh(Hx + d), where H is the weight matrix of the hidden layer, d is the bias term and tanh is the activation function. The output layer of the neural network is a |V|-dimensional prediction, where V is the set of words, as given by equation (1):
y=softmax(U·tanh(Hx+d)+Wx+b) (1)
where softmax is the activation function of the output layer; U (a |V| × h matrix, where h is the number of units in the hidden layer) holds the parameters from the hidden layer to the output layer; W (a |V| × (n-1)m matrix) is a linear transformation directly from the input layer to the output layer; and b is the bias term of the output layer. The i-th component y_i of the prediction y represents the probability that the next word is word i, i.e. y_i = P(w_i | w_{i-n+1}, ..., w_{i-1}).
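The forward pass of equation (1) can be sketched in plain Python to make the matrix shapes concrete. The sizes (|V| = 5, m = 3, n = 3, h = 4), the random parameter values and the helper names are all assumptions for illustration; training by back propagation is not shown.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_next_word(context_ids, C, H, d, U, W, b):
    """Evaluate y = softmax(U * tanh(Hx + d) + Wx + b)."""
    # x: the n-1 context word vectors concatenated end to end.
    x = [value for word_id in context_ids for value in C[word_id]]
    hidden = [math.tanh(a + bias) for a, bias in zip(matvec(H, x), d)]
    scores = [
        u + w + bias
        for u, w, bias in zip(matvec(U, hidden), matvec(W, x), b)
    ]
    return softmax(scores)


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
V, m, n, h = 5, 3, 3, 4                  # toy sizes (assumptions)
C = random_matrix(V, m, rng)             # one m-dimensional vector per word
H = random_matrix(h, (n - 1) * m, rng)   # hidden layer weights
d = [0.0] * h                            # hidden layer bias
U = random_matrix(V, h, rng)             # hidden layer to output layer
W = random_matrix(V, (n - 1) * m, rng)   # direct input-to-output connection
b = [0.0] * V                            # output layer bias

y = predict_next_word([1, 4], C, H, d, U, W, b)
```

The result `y` has one entry per vocabulary word and sums to 1, so `y[i]` can be read as the predicted probability that word i follows the two context words.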
The neural network is solved with the back propagation algorithm, yielding the set C of semantic vectors of the words (the semantic vector of word w_i is C(w_i)). Solving requires counting, for each word, the n-1 context words that precede it and the associated frequency information; these statistics are gathered using the posts in the IM space as the corpus.
Representing text as vectors gives the embodiment of the present invention the following advantages:
In the prior art, text is processed by keyword matching, which requires finding many keywords; this is laborious, and because the keyword list is never complete, accuracy cannot be guaranteed. To classify more accurately, the embodiment of the present invention does not classify the raw text directly; instead, it segments the text into terms to obtain the words that make up the text and then represents the text in a vector form that can be analyzed mathematically. The text is represented in vector form by a VSM, a statistical model mainly used to map a text in the data to a data point (point vector) in a vector space spanned by a set of orthonormal term vectors. Once the text is in vector form, it can be classified on the basis of probability or distance. With distance-based classification, for example, each text is regarded as a data point in the vector space, and classification is performed by calculating the distances between data points. The classification process is a machine learning process. The data points (point vectors) are points in an n-dimensional real space, and the classification training module finds a segmentation plane in the vector space, as shown in fig. 8, that separates the data points of different classes and thereby classifies the data. Preferably, the data points are separated by an (n-1)-dimensional hyperplane; such a classifier is generally called a linear classifier. The classifier is not limited to the SVM of the embodiment of the present invention, since many classifiers meet this requirement. If the best separating plane (the maximum-margin hyperplane) can be found, that is, the plane with the largest margin to the data points of the two different classes, the classification effect is better.
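The idea of a segmentation plane can be illustrated with a minimal linear classifier. The sketch below uses a perceptron, not an SVM: it finds some separating hyperplane for linearly separable data but does not maximize the margin. The two-dimensional sample points and labels (+1 for relative, -1 for non-relative) are made up for illustration.

```python
def train_perceptron(samples, labels, epochs=100, lr=1.0):
    """Find a hyperplane w.x + bias = 0 separating the two classes."""
    w = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        errors = 0
        for x, y in zip(samples, labels):
            activation = sum(wi * xi for wi, xi in zip(w, x)) + bias
            if y * activation <= 0:
                # Misclassified point: move the plane toward it.
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                bias += lr * y
                errors += 1
        if errors == 0:
            break  # every training point is on the correct side
    return w, bias


def classify(x, w, bias):
    """Return +1 or -1 depending on the side of the plane."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + bias > 0 else -1


# Made-up text vectors: label +1 = relative, -1 = not a relative.
samples = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
labels = [1, 1, -1, -1]

w, bias = train_perceptron(samples, labels)
predictions = [classify(x, w, bias) for x in samples]
```

An SVM would choose, among all such separating planes, the one with the largest margin to the nearest points of each class, which is the preferred case described above.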
The third embodiment of the system:
As shown in fig. 5, a data mining processing system according to an embodiment of the present invention includes a data acquisition unit, a data classification unit and a data processing unit. The data acquisition unit is configured to acquire data and output the data to the data classification unit; the data are divided into multiple data types that can characterize, from different dimensions, the user relationships with indication features in a user relationship chain. The data classification unit is configured to comprehensively analyze the multiple data types according to a classification strategy, so as to obtain the user relationship with the indication feature from the data, and to output that user relationship to the data processing unit. The data processing unit is configured to collect information according to the user relationship with the indication feature, so as to send recommendation information according to the analysis result of the information.
It should be noted that the data classification unit includes a strategy selection subunit and a strategy execution subunit. The strategy selection subunit is configured to analyze the characteristic parameters of the multiple data types, determine that the data types whose characteristic parameters are lower than a preset threshold are short text data and that the data types whose characteristic parameters are higher than the preset threshold are long text data, and select a second strategy as the classification strategy. The strategy execution subunit is configured to, when the second strategy is used to identify the user relationship with the indication feature in the long text data, construct seed words from the user relationships with the indication feature obtained by identifying the short text data with the first strategy, take the seed words as a reference, and compare the data of the multiple data types, used as training samples to be analyzed, with the seed words for similarity to perform classification training, so as to identify the user relationship with the indication feature from the data.
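The threshold comparison performed by the selection subunit can be sketched as follows. The text does not define the characteristic parameter, so this sketch assumes it is the average number of words per text; the data type names, the threshold value and the handling of a value exactly equal to the threshold are likewise assumptions.

```python
def select_strategy(feature_by_type, threshold):
    """Split data types into short and long text and pick a strategy.

    feature_by_type: mapping of data type name to its characteristic
    parameter (assumed here to be the average words per text).
    """
    short_types, long_types = [], []
    for name, value in feature_by_type.items():
        # Assumption: a value equal to the threshold counts as long text.
        if value < threshold:
            short_types.append(name)
        else:
            long_types.append(name)
    # First strategy when every type is short text; otherwise the second.
    strategy = "second" if long_types else "first"
    return strategy, short_types, long_types


# Hypothetical characteristic parameters for the three data types.
features = {"im_offline": 3.0, "address_book": 4.5, "space_posts": 18.0}
strategy, short_types, long_types = select_strategy(features, threshold=10.0)
```

With only the two short text types present, the same function returns the first strategy, which matches the selection rule of the second system embodiment.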
With the embodiment of the present invention, the acquired data have multiple data types, and these data types can characterize the user relationship with the indication feature in the user relationship chain from different dimensions; that is, data divided by data type form a comprehensive indicator. The data of the multiple data types are then comprehensively analyzed according to the classification strategy to obtain the user relationship with the indication feature. In this way, a specific user relationship with the indication feature can be mined from the user relationship chain in massive internet communication data, and the accuracy of identifying that relationship is improved. Moreover, because information is collected according to the user relationship with the indication feature and recommendation information is sent according to the analysis result of that information, the accuracy of the information recommended to the user is improved as well.
This embodiment takes the long text type mentioned in the first system embodiment as an example. The long text type is a data type with a small data volume and long text content; in other words, its characteristic parameter reflects a small data volume and long text content. The strategy selection subunit analyzes the characteristic parameter and, by comparing it with a preset threshold, determines that the data are of the long text type and selects the second strategy as the classification strategy. The strategy execution subunit then executes the second strategy, which is: construct seed words from the user relationships with the indication feature obtained by identifying the short text data with the first strategy, take the seed words as a reference, and compare the data of the multiple data types, used as training samples to be analyzed, with the seed words for similarity to perform classification training, so as to identify the user relationship with the indication feature from the data.
Fig. 9 is a schematic structural diagram of the policy execution subunit in fig. 5. The policy execution subunit has two implementation schemes: in the first, the vector generation module does not use fixed dimensions; in the second, the vector generation module uses fixed dimensions.
The first implementation scheme of the policy execution subunit is as follows:
A seed word construction module, configured to construct seed words from the user relationships having the indication feature that were obtained by identifying the short text data with the first policy: a user relationship data pair formed by a user relationship that is recognized as having the indication feature in a plurality of dimensions simultaneously is taken as a positive sample seed word, and a user relationship data pair formed by a user relationship that is not recognized as having the indication feature in any dimension is taken as a negative sample seed word.
A vector generation module, configured to represent the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of the vector, and the total dimension of the vector is the total number of words in the data.
A classification training module, configured to determine a segmentation plane according to the distribution positions, in the vector space, of the data vectors and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships having the indication feature.
An analysis result output module, configured to output the identified user relationships having the indication feature.
The second implementation scheme of the policy execution subunit is as follows:
A seed word construction module, configured to construct seed words from the user relationships having the indication feature that were obtained by identifying the short text data with the first policy: a user relationship data pair formed by a user relationship that is recognized as having the indication feature in a plurality of dimensions simultaneously is taken as a positive sample seed word, and a user relationship data pair formed by a user relationship that is not recognized as having the indication feature in any dimension is taken as a negative sample seed word.
A vector generation module, configured to represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is derived from the context information of each word in the data.
A classification training module, configured to determine a segmentation plane according to the distribution positions, in the vector space, of the data vectors and of the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationships having the indication feature.
An analysis result output module, configured to output the identified user relationships having the indication feature.
Fig. 10 is a schematic diagram of an application scenario of the policy execution subunit of fig. 9, which includes: a semantic vector generation module (a specific implementation of the vector generation module in fig. 9), a classification training module, a predicted relative relationship output module (a specific implementation of the analysis result output module in fig. 9), and a high-confidence relative relationship extraction module (a specific implementation of the seed word construction module in fig. 9).
Taking the case where the user relationship having the indication feature is a relative relationship as an example, the data classification unit shown in fig. 5, composed of the policy selection subunit and the policy execution subunit, may specifically be the relative relationship classification module in fig. 4. The relative relationship classification module predicts the relative relationships of users separately from multiple data sources. Because different data sources have different data characteristics, different processing logic is needed for data sources of different natures: one processing logic (the first policy as the classification policy) is used for the short text types, namely data type one and data type two mentioned in the system embodiments, and another processing logic (the second policy as the classification policy) is used for the long text type, namely data type three mentioned in the system embodiments. The present embodiment concerns the second policy.
The second policy does not select seed words at random; instead, the seed words are constructed from the user relationships having the indication feature (e.g., relative relationships) obtained by identifying the short text data (data type one and data type two) with the first policy.
Data type three is forum-style interaction data, such as the interaction data of "say" posts in the IM space. Its texts are long (54 words on average) and contain many noise words, and the probability distribution of its relative categories differs from that of the IM friend offline data and the mobile phone IM address book mentioned in the second system embodiment. The second policy is therefore adopted to identify relative relationships in the IM space interaction data more effectively. The key point is that the seed words are not selected at random: based on the relative relationships identified from the IM friend offline data and the mobile phone IM address book, the high-confidence relative relationship extraction module selects positive sample seed words and negative sample seed words, which are input into the classification training module for classification training. The classification training module may be a classifier trained with Support Vector Machine (SVM) technology.
The positive and negative sample seed words for training the classifier are constructed as follows:
From the relative relationship recognition results for the first two types of data (the IM friend offline data and the mobile phone IM address book) generated as in fig. 6, the user relationship pairs that are predicted to be relatives in multiple dimensions simultaneously are extracted, for example pairs predicted to be relatives in both the IM friend remark dimension and the IM friend grouping dimension. These relative pairs have high confidence, and their interaction records in the IM space "say" data (comments and forwards) are taken as positive sample seed words. Correspondingly, the relationship pairs that are not predicted to be relatives in any dimension are extracted from the recognition results generated as in fig. 6, and their interaction records are taken as negative sample seed words. The semantic vector generation module then generates the corresponding semantic vectors for the positive and negative samples, which are input into the classifier for classification training.
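The construction of positive and negative seed samples described above might look like the following sketch. The function names, the dimension labels, and the choice of "at least two dimensions" as the meaning of "multiple dimensions simultaneously" are assumptions made for illustration; the user identifiers and texts are invented sample data.

```python
from typing import Dict, List, Set, Tuple

Pair = Tuple[str, str]  # (user_a, user_b)


def build_seed_pairs(
    predictions: Dict[str, Set[Pair]],
    candidates: Set[Pair],
    min_dimensions: int = 2,  # assumed reading of "multiple dimensions simultaneously"
) -> Tuple[Set[Pair], Set[Pair]]:
    """Split candidate pairs into high-confidence positives and negatives."""
    votes = {pair: sum(pair in hits for hits in predictions.values()) for pair in candidates}
    positive = {pair for pair, count in votes.items() if count >= min_dimensions}
    negative = {pair for pair, count in votes.items() if count == 0}
    return positive, negative


def seed_texts(pairs: Set[Pair], interactions: Dict[Pair, List[str]]) -> List[str]:
    """Collect the interaction records of the given pairs as seed samples."""
    return [text for pair in sorted(pairs) for text in interactions.get(pair, [])]


# Pairs predicted as relatives in each dimension by the first policy (sample data).
predictions = {
    "friend_remark": {("u1", "u2"), ("u3", "u4")},
    "friend_group": {("u1", "u2")},
}
candidates = {("u1", "u2"), ("u3", "u4"), ("u5", "u6")}
interactions = {
    ("u1", "u2"): ["dad when do you get home"],
    ("u5", "u6"): ["nice photo"],
}

positive_pairs, negative_pairs = build_seed_pairs(predictions, candidates)
positive_seeds = seed_texts(positive_pairs, interactions)
negative_seeds = seed_texts(negative_pairs, interactions)
```

Pairs that are predicted in exactly one dimension, such as `("u3", "u4")` here, fall into neither set, which matches the intent of keeping only high-confidence samples.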
First, the semantic vector generation module represents the data as vectors in a vector space; then the classification training module identifies and classifies the relative relationships present in the data. Specifically, based on the Vector Space Model (VSM), the semantic vector generation module represents the data in a 0/1 representation as a space vector (which may be a point vector) in the vector space, and the classification training module then finds a segmentation plane in that vector space.
In the 0/1 representation, each word that occurs in the data (for example, in a collection of texts) is taken as one element of the vector (also called one dimension of the vector), and the total dimension of the vector is the total number of words in the whole data. When a text is represented as a vector, the value of a dimension is 1 if the word corresponding to that dimension appears in the text, and 0 otherwise. For example, the text "when does dad get home" yields four words after word segmentation: "dad", "what", "time", and "get home"; if only this text were considered, its vector would have four dimensions. The 0/1 representation, however, takes all Chinese words as attributes. If the Chinese vocabulary has 100,000 words, the text is represented as a 100,000-dimensional vector such as [0,0,0,1,...,1,...,0,...,1,0,0], which is 1 only in the dimensions corresponding to the four words "dad", "what", "time", and "get home" and 0 everywhere else. For short text types with large amounts of data, the 0/1 vector representation therefore has a very large dimension (since the dimension of the vector is the total number of words in the data).
Because the 0/1 vector representation has a very large dimension, computation is difficult: the ultra-high dimension seriously degrades the processing efficiency and performance of the classification training module. In addition, the representation cannot reflect the similarity between texts with the same or similar meanings, because the cosine of the angle between the vectors of semantically similar words does not reflect their similarity. For example, if "dad" and "father" are each represented as 0/1 vectors, the cosine similarity of these two semantically similar words is 0, which harms the classification result.
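The 0/1 representation and its weakness can be illustrated with a minimal sketch. The helper names below are hypothetical, and the vocabulary is built from just two toy texts rather than from a full 100,000-word vocabulary.

```python
import math
from typing import List, Sequence


def build_vocabulary(texts: Sequence[Sequence[str]]) -> List[str]:
    """One dimension per distinct word in the data."""
    return sorted({word for text in texts for word in text})


def to_binary_vector(words: Sequence[str], vocabulary: Sequence[str]) -> List[int]:
    """0/1 representation: 1 if the word of a dimension occurs in the text, else 0."""
    present = set(words)
    return [1 if term in present else 0 for term in vocabulary]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


text_a = ["dad", "what", "time", "get home"]
text_b = ["father", "what", "time", "get home"]
vocabulary = build_vocabulary([text_a, text_b])

# "dad" and "father" occupy different dimensions, so their cosine is 0
# even though the two words mean the same thing.
dad_vs_father = cosine(
    to_binary_vector(["dad"], vocabulary),
    to_binary_vector(["father"], vocabulary),
)
text_similarity = cosine(
    to_binary_vector(text_a, vocabulary),
    to_binary_vector(text_b, vocabulary),
)
```

The two texts are still judged partly similar, but only because they share three literal words; the synonym pair itself contributes nothing.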
In view of these drawbacks of the 0/1 vector representation, the improved scheme adopts a semantic vector representation with a fixed dimension, instead of using the total number of words in all texts as the total dimension of the vector.
In this improved scheme, the text of the data is first learned to obtain, for each word, a semantic vector of a fixed dimension (e.g., 200 dimensions). How the semantic vectors are built is described below.
For example, the text "when does dad get home" yields four words after word segmentation: "dad", "what", "time", and "get home". Each word corresponds to a semantic vector: "dad" corresponds to [0.1,0.2,0.1,...,0.5], "what" to [0.2,0.1,0.3,...,0.3], "time" to [0.1,0.2,0.2,...,0.1], and "get home" to [0.0,0.1,0.0,...,0.1]. The entire text is then represented as one semantic vector, namely the sum of the semantic vectors of the words in the text: [0.1,0.2,0.1,...,0.5] + [0.2,0.1,0.3,...,0.3] + [0.1,0.2,0.2,...,0.1] + [0.0,0.1,0.0,...,0.1] = [0.4,0.6,0.6,...,1.0]. After normalization, [0.4,0.6,0.6,...,1.0] becomes [0.2,0.3,0.3,...,0.5].
It can be seen that, for the same text, the 0/1 representation yields a vector of more than 100,000 dimensions, [0,0,0,1,...,1,...,0,...,1,0,0], whereas the semantic representation yields a vector of fixed dimension (e.g., 200 dimensions), [0.2,0.3,0.3,...,0.5]. The number of dimensions is reduced and the amount of computation falls accordingly, which improves the processing efficiency and performance of the classification training module. In addition, because semantic vectors better capture the contextual relationship between words, they allow similarity to be computed more accurately. For example, "dad" can be recognized as similar to "old bean" (a colloquial word for father) in a certain context, so the similarity of the two texts "dad gets home" and "old bean gets home" can be computed more accurately.
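The summation of word vectors into a text vector can be sketched as below. The four-dimensional vectors reuse only the components that are visible in the example (the elided middle components are unknown), and the L2 normalization is an assumption, since the text does not say which normalization is applied; the names `EMBEDDINGS`, `text_vector`, and `l2_normalize` are hypothetical.

```python
import math
from typing import Dict, List, Sequence

# Toy 4-dimensional vectors built from the visible components of the example;
# real semantic vectors would have a fixed dimension such as 200.
EMBEDDINGS: Dict[str, List[float]] = {
    "dad": [0.1, 0.2, 0.1, 0.5],
    "what": [0.2, 0.1, 0.3, 0.3],
    "time": [0.1, 0.2, 0.2, 0.1],
    "get home": [0.0, 0.1, 0.0, 0.1],
}


def text_vector(words: Sequence[str], embeddings: Dict[str, List[float]]) -> List[float]:
    """Represent a text as the sum of the semantic vectors of its words."""
    dimension = len(next(iter(embeddings.values())))
    total = [0.0] * dimension
    for word in words:
        vector = embeddings.get(word)
        if vector is None:
            continue  # skipping out-of-vocabulary words is an assumption
        total = [t + v for t, v in zip(total, vector)]
    return total


def l2_normalize(vector: Sequence[float]) -> List[float]:
    """Scale the vector to unit length (assumed normalization)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


summed = text_vector(["dad", "what", "time", "get home"], EMBEDDINGS)
unit = l2_normalize(summed)
```

Whatever the number of words in the text, the resulting vector keeps the fixed dimension of the word vectors, which is the point of the improved scheme.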
In general terms, a semantic vector is a representation of a word in a continuous vector space, learned with a neural network. It takes the contextual relationships between words into account and describes the relatedness of words by how frequently they occur together in the same context; for example, under semantic vectors the distance between two related words is smaller than the distance between either of them and an unrelated word such as "apple".
Specifically, a semantic vector needs to capture the contextual information of a word, so that the cosine of the angle between the vectors of semantically similar words is large. The context of a word is described by a conditional probability $P$: the probability of each word is influenced only by the words that appeared before it, i.e., $P(w_i \mid w_1, \ldots, w_{i-1})$. To simplify the calculation, usually only the influence of the preceding $n-1$ words is considered, i.e., $P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$. A good set of semantic vectors should maximize this conditional probability for each word. A three-layer neural network model is used to maximize this probability. The input layer of the neural network is the preceding $n-1$ words, each of which corresponds to a semantic vector, denoted $C(w_{i-n+1}), \ldots, C(w_{i-1})$, where $C$ is the set of all word vectors and each vector has dimension $m$. The $n-1$ vectors are concatenated end to end to form an $(n-1)m$-dimensional vector, denoted $x$. A nonlinear hidden layer then models $x$ as $\tanh(Hx + d)$, where $d$ is the bias term and $\tanh$ is the activation function. The output layer of the neural network is a $|V|$-dimensional prediction, where $V$ is the set of words, given by equation (1):
$$y = \mathrm{softmax}\big(U \cdot \tanh(Hx + d) + Wx + b\big) \qquad (1)$$
where softmax is the activation function; $U$ (a $|V| \times h$ matrix, where $h$ is the number of hidden-layer units) holds the parameters from the hidden layer to the output layer; $W$ (a $|V| \times (n-1)m$ matrix) is a linear transformation directly from the input layer to the output layer; and $b$ is the output-layer bias term. The $i$-th component $y_i$ of the prediction $y$ represents the probability that the next word is word $i$, i.e., $y_i = P(w_i \mid w_{i-n+1}, \ldots, w_{i-1})$.
The neural network is solved with the back propagation algorithm, which yields the set $C$ of semantic vectors of the words (the semantic vector corresponding to word $w_i$ is $C(w_i)$). The solving process requires statistics on the $n-1$ context words preceding each word and the related frequency information; these statistics are collected using the IM space "say" data as the corpus.
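The forward pass of equation (1) can be sketched in plain Python as follows. This covers only the prediction step, not back propagation; the sizes, the random initialization, and the helper names (`matvec`, `random_matrix`, `forward`) are assumptions chosen for illustration.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]


def softmax(z: Vector) -> Vector:
    top = max(z)  # subtract the maximum for numerical stability
    exps = [math.exp(value - top) for value in z]
    total = sum(exps)
    return [e / total for e in exps]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def forward(context_ids: List[int], C: Matrix, H: Matrix, d: Vector,
            U: Matrix, W: Matrix, b: Vector) -> Vector:
    """y = softmax(U * tanh(Hx + d) + Wx + b), as in equation (1)."""
    # x: the n-1 context word vectors concatenated end to end, (n-1)*m values.
    x = [value for word_id in context_ids for value in C[word_id]]
    hidden = [math.tanh(s + bias) for s, bias in zip(matvec(H, x), d)]
    logits = [u + w + bias for u, w, bias in zip(matvec(U, hidden), matvec(W, x), b)]
    return softmax(logits)


rng = random.Random(0)
V, m, n, h = 5, 3, 3, 4           # vocabulary size, vector size, n-gram order, hidden units
C = random_matrix(V, m, rng)      # one m-dimensional semantic vector per word
H = random_matrix(h, (n - 1) * m, rng)
d = [0.0] * h
U = random_matrix(V, h, rng)
W = random_matrix(V, (n - 1) * m, rng)
b = [0.0] * V

y = forward([1, 3], C, H, d, U, W, b)  # distribution over the next word
```

Training would adjust `C`, `H`, `d`, `U`, `W`, and `b` so that the component of `y` for the word that actually follows the context becomes as large as possible; the rows of `C` are then the semantic vectors.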
Representing text as vectors in the embodiment of the present invention has the following advantages:
In the prior art, text is processed by keyword matching, which requires finding many keywords; this is laborious, and accuracy cannot be guaranteed because the keyword list is never complete. To make classification more accurate, the embodiment of the present invention does not classify the raw text directly but represents it in a vector form that can be analyzed and processed mathematically; to do so, the text is first segmented into terms to obtain the words that make up the text. Text is represented in vector form by the VSM, a statistical model mainly used to map a text in the data to a data point (point vector) in a vector space spanned by a set of orthonormal term vectors. Once the text is in vector form, classification is performed based on probability or distance. In the distance-based case, for example, each text is regarded as a data point in the vector space and classification is performed by calculating the distances between data points. The classification process is a machine learning process: the data points (point vectors) are points in an n-dimensional real space, and the classification training module finds a segmentation plane in the vector space, as shown in fig. 8, that separates the data points of different classes and thereby classifies the data. Preferably, the data points are separated by an (n-1)-dimensional hyperplane; such a classifier is generally called a linear classifier. The classifier is not limited to the SVM of the embodiment of the present invention, since many classifiers meet this requirement. If the best separating plane (the maximum-margin hyperplane) can be found, i.e., the plane with the largest margin between the data points of the two classes, the classification effect is better.
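Finding a segmentation plane can be illustrated with a small linear classifier trained on the hinge loss by stochastic subgradient descent. This is a simplified stand-in for an SVM solver, not the classifier of the embodiment; the two-dimensional sample points, the hyperparameters, and the function names are invented for the sketch.

```python
from typing import List, Sequence, Tuple


def train_linear_classifier(
    points: Sequence[Sequence[float]],
    labels: Sequence[int],          # +1 for positive seeds, -1 for negative seeds
    epochs: int = 200,
    learning_rate: float = 0.1,
    regularization: float = 0.01,
) -> Tuple[List[float], float]:
    """Learn a hyperplane w.x + b = 0 with L2-regularized hinge loss."""
    weights = [0.0] * len(points[0])
    bias = 0.0
    for _ in range(epochs):
        for point, label in zip(points, labels):
            score = sum(w * x for w, x in zip(weights, point)) + bias
            # Shrink the weights (regularization favours a wide margin).
            weights = [w - learning_rate * regularization * w for w in weights]
            if label * score < 1:  # the point is inside the margin or misclassified
                weights = [w + learning_rate * label * x for w, x in zip(weights, point)]
                bias += learning_rate * label
    return weights, bias


def predict(weights: Sequence[float], bias: float, point: Sequence[float]) -> int:
    score = sum(w * x for w, x in zip(weights, point)) + bias
    return 1 if score >= 0 else -1


positive_points = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7]]
negative_points = [[0.1, 0.2], [0.2, 0.1], [0.0, 0.3]]
points = positive_points + negative_points
labels = [1, 1, 1, -1, -1, -1]

weights, bias = train_linear_classifier(points, labels)
```

After training, `predict(weights, bias, vector)` assigns a new text vector to one side of the learned plane; in practice a dedicated SVM library would normally replace this hand-written loop.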
Based on the first to third system embodiments, the system further includes a data diffusion unit, configured to further analyze the user relationships having the indication feature according to a positive-negative relationship and a transfer relationship, so as to obtain user information related to those user relationships. The following description takes the case where the user relationship having the indication feature is a relative relationship as an example.
Fig. 11 is a schematic diagram of the functional modules of the relative relationship diffusion unit in fig. 4; the relative relationship diffusion unit is configured to obtain the relatives of relatives through diffusion relationships. The diffusion relationships are shown in table 1 below.
| | Father | Brother | Cousin | Aunt | Son | Maternal uncle's wife |
| Father | Grandfather | Paternal uncle | Uncle (father's cousin) | Relative | Brother | Relative |
| Brother | Father | Brother | Cousin | Aunt | Nephew | Maternal uncle's wife |
| Cousin | Relative | Cousin | 0 | 0 | Nephew (cousin's son) | 0 |
| Aunt | Maternal grandfather | Maternal uncle | 0 | 0 | Cousin | 0 |
| Son | Spouse | Child | Relative | Relative | Relative | Relative |
| Maternal uncle's wife | 0 | 0 | 0 | 0 | Cousin | 0 |
TABLE 1
Table 1 may also be expressed as a diffusion relationship matrix. Whether a relative relationship exists between users can be determined by the relative relationship classification unit in fig. 4 from the users' personal attribute information and the words used in interactions between users. However, some users have missing information, and some users who are relatives do not interact with each other in the IM space; the relative relationship diffusion unit in fig. 4 therefore further diffuses a user's relative relationship chain to obtain the relatives of relatives. Based on the relative relationships identified by the relative relationship classification unit and on the topology of the user's social network, the relative relationship diffusion unit diffuses the relative relationships so as to improve the coverage of relative relationship identification. As shown in fig. 11, the relative relationship diffusion unit includes an IM user relationship chain extraction module, a forward-backward relationship diffusion module, a general relationship diffusion module, and a confidence-based relative recognition result pruning module. The IM user relationship chain extraction module extracts the relative relationships from the identified relationships; the forward-backward relationship diffusion module diffuses the relatives of relatives using the positive-negative relationship according to the diffusion relationship table in table 1; the general relationship diffusion module diffuses the relatives of relatives using the transfer relationship according to the diffusion relationship table in table 1; and the confidence-based relative recognition result pruning module optimizes the diffusion result based on high-confidence rules so as to reduce the misjudgment rate.
For the positive-negative relationship (forward-backward relationship), as shown in fig. 12, diffusion is performed in both directions between two users who have a relative relationship: for example, if user A is a relative of user B, then diffusion yields that user B is a relative of user A. For the transfer relationship (second-degree relationship diffusion), as shown in fig. 13, the relative relationship is passed along a chain: for example, if user A is the "dad" of user B and user B is the "brother" of user C, then user A and user C have a relative relationship.
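The two kinds of diffusion can be sketched with a lookup table. The excerpt of table 1 below and, in particular, its orientation (row: what B is to C; column: what A is to B; cell: what A is to C) are an assumed reading chosen so that the "dad"/"brother" example works out; the function and variable names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Optional, Set, Tuple

Edge = Tuple[str, str, str]  # (a, b, relation) means "a is the <relation> of b"

# Excerpt of table 1 under the assumed orientation DIFFUSION[b_to_c][a_to_b] = a_to_c.
# None stands for the "0" cells, where no relationship is derived.
DIFFUSION: Dict[str, Dict[str, Optional[str]]] = {
    "father": {"father": "grandfather", "brother": "paternal uncle", "son": "brother"},
    "brother": {"father": "father", "brother": "brother", "cousin": "cousin"},
    "cousin": {"father": "relative", "brother": "cousin", "cousin": None},
}


def forward_backward(edges: List[Edge]) -> Set[FrozenSet[str]]:
    """Positive-negative diffusion: if a is a relative of b, b is a relative of a."""
    return {frozenset((a, b)) for a, b, _ in edges}


def second_degree(edges: List[Edge]) -> List[Edge]:
    """Transfer diffusion: combine a->b and b->c into a->c via the table."""
    derived: List[Edge] = []
    for a, b, a_to_b in edges:
        for middle, c, b_to_c in edges:
            if middle != b or c == a:
                continue
            a_to_c = DIFFUSION.get(b_to_c, {}).get(a_to_b)
            if a_to_c:
                derived.append((a, c, a_to_c))
    return derived


edges = [("A", "B", "father"), ("B", "C", "brother")]
relative_pairs = forward_backward(edges)
derived_edges = second_degree(edges)
```

In this reading, A being the father of B and B being the brother of C yields A as the father of C, while two chained "cousin" edges yield nothing, which corresponds to a 0 cell.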
The confidence-based relative recognition result pruning module is needed because diffusion of relative relationships may reduce accuracy. For example, if user A is a "cousin" of user B and user B is a "cousin" of user C, user A may have no relative relationship with user C, or only a very distant one. In particular, if the relative relationship classification module misjudges user B and user C as relatives, second-degree diffusion compounds the error, and user A and user C are also misjudged as relatives. To improve the accuracy of relative identification, the identification results are optimized with confidence rules. For example, during diffusion the confidence of a diffused relationship is weighted up when user A and user C share the same family name or are in the same region; likewise, if user A and user C are judged to be relatives simultaneously in several dimensions, such as the IM friend remark, the IM group name, and the IM circle name, the confidence that they are relatives is also weighted up.
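A confidence-based pruning rule of this kind might be sketched as follows. The text gives no numeric weights or threshold, so every number below, as well as the `Candidate` structure and the function names, is a hypothetical placeholder.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    user_a: str
    user_c: str
    same_family_name: bool
    same_region: bool
    agreeing_dimensions: int  # e.g. friend remark, group name, circle name


def confidence(candidate: Candidate) -> float:
    """Weighted confidence of a diffused relationship (all weights are assumed)."""
    score = 0.5
    if candidate.same_family_name:
        score += 0.2
    if candidate.same_region:
        score += 0.1
    if candidate.agreeing_dimensions >= 2:
        score += 0.2
    return min(score, 1.0)


def prune(candidates: List[Candidate], threshold: float = 0.65) -> List[Candidate]:
    """Keep only diffused relationships whose confidence reaches the threshold."""
    return [c for c in candidates if confidence(c) >= threshold]


candidates = [
    Candidate("A", "C", same_family_name=True, same_region=True, agreeing_dimensions=0),
    Candidate("D", "F", same_family_name=False, same_region=False, agreeing_dimensions=0),
]
kept = prune(candidates)
```

Here the pair sharing a family name and a region survives pruning, while the pair with no supporting evidence is dropped.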
Here, it should be noted that the following description of the method embodiments is similar to the above description of the system embodiments; the beneficial effects, being the same as those of the system embodiments, are not described again. For technical details not disclosed in the method embodiments of the present invention, refer to the description of the system embodiments of the present invention.
Method embodiment one:
As shown in fig. 14, the data mining processing method according to this embodiment of the present invention includes:
Step 101: acquire data, where the data is divided into a plurality of data types that allow the user relationships having the indication feature in a user relationship chain to be evaluated from different dimensions.
Step 102: perform integrated analysis on the data of the plurality of data types according to a classification strategy, so as to obtain from the data the user relationships having the indication feature.
Step 103: collect information according to the user relationships having the indication feature, and send recommendation information according to the analysis result of the information.
In a preferred implementation of this embodiment of the present invention, the plurality of data types includes at least two of: a data type characterizing the personal attributes of the user, a data type characterizing the social topology of the user, and a data type characterizing the interaction behavior of the user.
Method embodiment two:
As shown in fig. 15, the data mining processing method according to this embodiment of the present invention includes:
Step 201: acquire data, where the data is divided into a plurality of data types that allow the user relationships having the indication feature in a user relationship chain to be evaluated from different dimensions.
Step 202: analyze the characteristic parameters of the plurality of data types; when the characteristic parameter of each of the data types is lower than a preset threshold, determine that the data is short text data and select the first policy as the classification policy.
Step 203: execute the first policy by randomly extracting seed words that can characterize the user relationships having the indication feature.
Step 204: with the seed words as the reference standard, compare the data of the plurality of data types, as training samples to be analyzed, with the seed words to carry out classification training, and identify the user relationships having the indication feature from the data.
Step 205: collect information according to the user relationships having the indication feature, and send recommendation information according to the analysis result of the information.
In addition, step 202 determines that the data is short text data and selects the first policy as the classification policy, and steps 203-204 identify the user relationships having the indication feature in the data by means of randomly selected seed words.
In a preferred implementation of this embodiment of the present invention, step 204 specifically includes:
Step 2041a: represent the data as vectors in a vector space according to a vector space model; each word in the data serves as one dimension of the vector, and the total dimension of the vector is the total number of words in the data.
Step 2041b: determine a segmentation plane according to the distribution positions, in the vector space, of the data vectors and of the vectors corresponding to the seed words, so as to identify the user relationships having the indication feature.
In another preferred implementation of this embodiment of the present invention, step 204 specifically includes:
Step 2042a: represent the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is derived from the context information of each word in the data.
Step 2042b: determine a segmentation plane according to the distribution positions, in the vector space, of the data vectors and of the vectors corresponding to the seed words, so as to identify the user relationships having the indication feature.
Method embodiment three:
As shown in fig. 16, the data mining processing method according to this embodiment of the present invention includes:
Step 301: acquire data, where the data is divided into a plurality of data types that allow the user relationships having the indication feature in a user relationship chain to be evaluated from different dimensions.
Step 302: analyze the characteristic parameters of the plurality of data types; when the characteristic parameters of some of the data types are lower than a preset threshold, determine that those data types are short text data; when the characteristic parameters of other data types are higher than the preset threshold, determine that those data types are long text data, and select the second policy as the classification policy.
Step 303: execute the second policy by constructing seed words from the user relationships having the indication feature that were obtained by identifying the short text data with the first policy.
Step 304: with the seed words as the reference standard, compare the data of the plurality of data types, as training samples to be analyzed, with the seed words for similarity to carry out classification training, so as to identify the user relationships having the indication feature from the data.
Step 305: collect information according to the user relationships having the indication feature, and send recommendation information according to the analysis result of the information.
Furthermore, the data type is determined to be long text data by step 302, a second policy is selected as the classification policy, and the user relationship with the indicated feature in the data is identified by randomly selected seed words by steps 303-304.
In a preferred implementation manner of the embodiment of the present invention, step 303 specifically includes:
and the user relationship data pairs formed by the user relationships which are simultaneously identified as having the indicating characteristics in a plurality of dimensions are used as positive sample seed words, and the user relationship data pairs formed by the user relationships which are not identified as having the indicating characteristics in any dimension are used as negative sample seed words.
In a preferred implementation manner of the embodiment of the present invention, step 304 specifically includes:
step 3041 a: representing the data as vectors in a vector space according to a vector space model; each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total word number of the data;
step 3041 b: and determining a segmentation plane according to the distribution positions of the vectors corresponding to the vectors and the positive sample seed words and the negative sample seed words in the vector space so as to identify the user relationship with the indication characteristics.
In a preferred implementation manner of the embodiment of the present invention, step 304 specifically includes:
step 3042 a: representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is derived based on context information of each word in the data;
step 3042 b: determining a segmentation plane according to the distribution positions, in the vector space, of the vectors representing the data and the vectors corresponding to the positive sample seed words and the negative sample seed words, so as to identify the user relationship with the indication feature.
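The difference from the previous variant is that the vector length is fixed in advance rather than growing with the vocabulary. One simple way to obtain such fixed-dimension, context-derived vectors is sketched below: each word accumulates counts of its neighbouring words, hashed into a fixed number of buckets. This hashing scheme is an assumption chosen for brevity, not the method stated in the patent, and the names `context_vectors`, `doc_vector`, `dim` and `window` are hypothetical.

```python
import zlib
from typing import Dict, List, Sequence


def context_vectors(
    docs: Sequence[str], dim: int = 8, window: int = 2
) -> Dict[str, List[float]]:
    """Build a fixed-dimension vector for each word from its context words.

    Every neighbour within `window` positions is hashed into one of `dim`
    buckets, so the vector length does not depend on the vocabulary size.
    """
    vecs: Dict[str, List[float]] = {}
    for doc in docs:
        tokens = doc.split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j == i:
                    continue
                # crc32 is used because it is stable across Python processes
                bucket = zlib.crc32(tokens[j].encode("utf-8")) % dim
                vecs.setdefault(word, [0.0] * dim)[bucket] += 1.0
    return vecs


def doc_vector(
    doc: str, word_vecs: Dict[str, List[float]], dim: int = 8
) -> List[float]:
    """Average the context vectors of the known words in a text."""
    total = [0.0] * dim
    count = 0
    for word in doc.split():
        if word in word_vecs:
            total = [t + v for t, v in zip(total, word_vecs[word])]
            count += 1
    return [t / count for t in total] if count else total
```

Words that occur in the same contexts receive the same vector under this scheme, which is the property a context-based representation is meant to capture. A learned embedding model would normally replace the hashing step in practice.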
Based on the first to third embodiments of the method of the present invention, the method further includes: further analyzing the user relationship with the indication feature according to the positive and negative relationship and the transfer relationship, to obtain user information related to the user relationship with the indication feature.
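The passage above does not spell out how the two kinds of relationship are applied. The sketch below assumes one possible reading: the "positive and negative relationship" is taken as a relation and its inverse, and the "transfer relationship" as composing two relations that share a middle user. The relation names and both rule tables are hypothetical examples, not taken from the patent.

```python
from typing import Dict, Set, Tuple

Triple = Tuple[str, str, str]  # (user_a, relation, user_b)

# Assumed rule tables, for illustration only.
INVERSE: Dict[str, str] = {
    "parent_of": "child_of",
    "child_of": "parent_of",
    "spouse_of": "spouse_of",
}
COMPOSE: Dict[Tuple[str, str], str] = {
    ("spouse_of", "parent_of"): "parent_of",
}


def diffuse(relations: Set[Triple]) -> Set[Triple]:
    """Expand identified relations until no new relation can be derived."""
    known = set(relations)
    while True:
        derived: Set[Triple] = set()
        for a, rel, b in known:
            inverse = INVERSE.get(rel)
            if inverse is not None:
                derived.add((b, inverse, a))          # inverse relation
        for a, rel1, b in known:
            for b2, rel2, c in known:
                if b == b2 and a != c:
                    composed = COMPOSE.get((rel1, rel2))
                    if composed is not None:
                        derived.add((a, composed, c))  # transferred relation
        if derived <= known:
            return known
        known |= derived
```

Given a spouse relation between two users and a parent relation from one of them to a third user, this sketch derives the parent relation for the other spouse as well as the corresponding inverse relations.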
The integrated module according to the embodiment of the present invention may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as an independent product. Based on such an understanding, it will be apparent to one skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media having computer-usable program code embodied therewith, including but not limited to a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random-Access Memory (RAM), a magnetic disk storage, a CD-ROM, an optical storage device, and the like.
The present application is described in terms of flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
Correspondingly, the embodiment of the invention also provides a computer storage medium, wherein a computer program is stored in the computer storage medium, and the computer program is used for executing the data mining processing system and the data mining processing method of the embodiment of the invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the scope of the present invention.
Claims (17)
1. A data mining processing system, the system comprising: a data acquisition unit, a data classification unit and a data processing unit; wherein,
the data acquisition unit is used for acquiring data from a plurality of data sources and outputting the data to the data classification unit, the data are divided into a plurality of data types used for revealing a social topological structure of a user, and the user relation with an indication characteristic in a user relation chain can be represented from different dimensions;
the data classification unit is used for performing comprehensive analysis on the multiple data types according to a classification strategy; when identifying the user relationship with the indication feature for short text data, analyzing the data to obtain the user relationship with the indication feature either by randomly extracting seed words or by constructing seed words; and outputting the user relationship with the indication feature to the data processing unit, wherein constructing the seed words comprises: taking user relationship data pairs formed by user relationships that are simultaneously identified as having the indication feature in multiple dimensions as positive sample seed words; taking user relationship data pairs formed by user relationships that are not identified as having the indication feature in any dimension as negative sample seed words; representing the positive sample seed words and the negative sample seed words as vectors in a vector space to generate the semantic vectors corresponding to the positive sample seed words and the negative sample seed words respectively; inputting these semantic vectors into a classifier for classification training; and then identifying and classifying the user relationships to identify the user relationships with the indication feature; wherein the short text data comprises: data of the data type that characterizes the personal attributes of the user, which has a large data volume and short text content;
the data processing unit is used for collecting information according to the user relationship with the indication characteristic so as to send recommendation information according to the analysis result of the information;
the multiple data types comprise at least two of: a data type characterizing personal attributes of the user, a data type characterizing the social topological structure of the user, and a data type characterizing interactive behaviors of the user.
2. The system of claim 1, wherein the data classification unit comprises:
the strategy selection subunit is used for analyzing the characteristic parameters of the multiple data types, determining the data types to be short text data when the characteristic parameters of each data type in the multiple data types are lower than a preset threshold value, and selecting a first strategy as the classification strategy;
and the strategy executing subunit is configured to randomly extract the seed word when the first strategy is adopted to identify the user relationship with the indication feature for the short text data, the seed word can represent the user relationship with the indication feature, the seed word is used as a reference, the data with the multiple data types is used as a training sample to be analyzed, and the training sample is compared with the seed word to realize classification training, so that the user relationship with the indication feature is identified from the data.
3. The system of claim 2, wherein the policy enforcement subunit comprises:
a vector generation module to represent the data as vectors in a vector space according to a vector space model; each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total word number of the data;
the classification training module is used for determining a segmentation plane according to the distribution positions of the vectors corresponding to the vector and the seed words in the vector space so as to identify the user relationship with the indication characteristics;
and the analysis result output module outputs the identified user relationship with the indication characteristics.
4. The system of claim 2, wherein the policy enforcement subunit comprises:
the vector generation module is used for representing the data as a vector in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is derived based on context information of each word in the data;
the classification training module is used for determining a segmentation plane according to the distribution positions of the vectors corresponding to the vector and the seed words in the vector space so as to identify the user relationship with the indication characteristics;
and the analysis result output module outputs the identified user relationship with the indication characteristics.
5. The system of claim 1, wherein the data classification unit comprises:
the strategy selection subunit is used for analyzing the characteristic parameters of the multiple data types, determining that the data types are short text data when the characteristic parameters of part of the data types in the multiple data types are lower than a preset threshold value, determining that the data types are long text data when the characteristic parameters of part of the data types are higher than the preset threshold value, and selecting a second strategy as the classification strategy;
and the strategy executing subunit is configured to, when the second strategy is adopted to identify the user relationship with the indication feature for the long text data, construct the seed word from the user relationship with the indication feature obtained by identifying the short text data with the first strategy, use the seed word as a reference, and perform similarity comparison between the data with the plurality of data types as a training sample to be analyzed and the seed word to implement classification training, so as to identify the user relationship with the indication feature from the data.
6. The system of claim 5, wherein the policy enforcement subunit comprises:
a seed word construction module, configured to obtain the positive sample seed word and the negative sample seed word when the seed word is constructed by using the user relationship with the indication feature obtained by identifying the short text data by using a first policy;
a vector generation module to represent the data as vectors in a vector space according to a vector space model; each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total word number of the data;
the classification training module is used for determining a segmentation plane according to the distribution positions of the vectors corresponding to the vectors, the positive sample seed words and the negative sample seed words in the vector space so as to identify the user relationship with the indication features;
and the analysis result output module outputs the identified user relationship with the indication characteristics.
7. The system of claim 5, wherein the policy enforcement subunit comprises:
a seed word construction module, configured to obtain the positive sample seed word and the negative sample seed word when the seed word is constructed by using the user relationship with the indication feature obtained by identifying the short text data by using a first policy;
the vector generation module is used for representing the data as a vector in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is derived based on context information of each word in the data;
the classification training module is used for determining a segmentation plane according to the distribution positions of the vectors corresponding to the vectors, the positive sample seed words and the negative sample seed words in the vector space so as to identify the user relationship with the indication features;
and the analysis result output module outputs the identified user relationship with the indication characteristics.
8. The system of any one of claims 1 to 7, further comprising: the data diffusion unit is positioned between the data classification unit and the data processing unit;
and the data diffusion unit is used for further analyzing the user relationship with the indication characteristic according to the positive and negative relationship and the transfer relationship to obtain the user information related to the user relationship with the indication characteristic.
9. A data mining processing method, characterized in that the method comprises:
acquiring data from a plurality of data sources, wherein the data is divided into a plurality of data types for revealing social topological structures of users, and user relations with indication characteristics in user relation chains can be evaluated from different dimensions;
comprehensively analyzing the multiple data types according to a classification strategy, and, when identifying the user relationship with the indication feature for short text data, analyzing the data to obtain the user relationship with the indication feature either by randomly extracting seed words or by constructing seed words, wherein constructing the seed words comprises: taking user relationship data pairs formed by user relationships that are simultaneously identified as having the indication feature in multiple dimensions as positive sample seed words; taking user relationship data pairs formed by user relationships that are not identified as having the indication feature in any dimension as negative sample seed words; representing the positive sample seed words and the negative sample seed words as vectors in a vector space to generate the semantic vectors corresponding to the positive sample seed words and the negative sample seed words respectively; inputting these semantic vectors into a classifier for classification training; and then identifying and classifying the user relationships to identify the user relationships with the indication feature; wherein the short text data comprises: data of the data type that characterizes the personal attributes of the user, which has a large data volume and short text content;
collecting information according to the user relationship with the indication feature to send recommendation information according to the analysis result of the information;
the multiple data types comprise at least two of: a data type characterizing personal attributes of the user, a data type characterizing the social topological structure of the user, and a data type characterizing interactive behaviors of the user.
10. The method of claim 9, wherein the analyzing the plurality of data types according to a classification policy comprises:
analyzing the characteristic parameters of the multiple data types, determining the data types to be short text data when the characteristic parameters of each data type in the multiple data types are lower than a preset threshold value, and selecting a first strategy as the classification strategy;
randomly extracting the seed words when the first strategy is executed, wherein the seed words can represent user relations with indication characteristics;
and taking the seed words as reference bases, and comparing the data with the multiple data types as training samples to be analyzed with the seed words to realize classification training so as to identify the user relationship with the indication features from the data.
11. The method of claim 10, wherein the taking the seed word as a reference basis, comparing the data with the plurality of data types as training samples to be analyzed with the seed word to realize classification training, so as to identify the user relationship with the indicated feature from the data, comprises:
representing the data as vectors in a vector space according to a vector space model; each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total word number of the data;
and determining a segmentation plane according to the distribution positions of the vectors corresponding to the seed words in the vector space so as to identify the user relationship with the indication features.
12. The method of claim 10, wherein the taking the seed word as a reference basis, comparing the data with the plurality of data types as training samples to be analyzed with the seed word to realize classification training, so as to identify the user relationship with the indicated feature from the data, comprises:
representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is derived based on context information of each word in the data;
and determining a segmentation plane according to the distribution positions of the vectors corresponding to the seed words in the vector space so as to identify the user relationship with the indication features.
13. The method of claim 9, wherein the analyzing the plurality of data types according to a classification policy comprises:
analyzing the characteristic parameters of the multiple data types, determining that the data types are short text data when the characteristic parameters of part of the data types are lower than a preset threshold value, determining that the data types are long text data when the characteristic parameters of part of the data types are higher than the preset threshold value, and selecting a second strategy as the classification strategy;
when the second strategy is executed, the seed words are constructed by the user relation with the indication characteristics obtained by identifying the short text data by adopting the first strategy;
and taking the seed words as reference bases, and performing similarity comparison on the data with the multiple data types as training samples to be analyzed and the seed words to realize classification training so as to identify the user relationship with the indication characteristics from the data.
14. The method of claim 13, further comprising: and when a seed word is constructed by the user relationship with the indication characteristics obtained by identifying the short text data by adopting a first strategy, obtaining the positive sample seed word and the negative sample seed word.
15. The method according to claim 14, wherein the performing classification training by using the seed word as a reference and using the data with the plurality of data types as training samples to be analyzed and the seed word for similarity comparison to identify the user relationship with the indicated features from the data comprises:
representing the data as vectors in a vector space according to a vector space model; each word in the data is used as one dimension of the vector, and the total dimension of the vector is the total word number of the data;
and determining a segmentation plane according to the distribution positions of the vectors corresponding to the vectors and the positive sample seed words and the negative sample seed words in the vector space so as to identify the user relationship with the indication characteristics.
16. The method according to claim 14, wherein the performing classification training by using the seed word as a reference and using the data with the plurality of data types as training samples to be analyzed and the seed word for similarity comparison to identify the user relationship with the indicated features from the data comprises:
representing the data as vectors in a vector space according to a preset fixed dimension and a vector space model; the fixed dimension is derived based on context information of each word in the data;
and determining a segmentation plane according to the distribution positions of the vectors corresponding to the vectors and the positive sample seed words and the negative sample seed words in the vector space so as to identify the user relationship with the indication characteristics.
17. The method according to any one of claims 9 to 16, further comprising:
and further analyzing the user relationship with the indication characteristic according to the positive and negative relationship and the transfer relationship to obtain user information related to the user relationship with the indication characteristic.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410174489.4A CN104615608B (en) | 2014-04-28 | 2014-04-28 | A kind of data mining processing system and method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410174489.4A CN104615608B (en) | 2014-04-28 | 2014-04-28 | A kind of data mining processing system and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104615608A (en) | 2015-05-13 |
CN104615608B (en) | 2018-05-15 |
Family
ID=53150057
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410174489.4A Active CN104615608B (en) | 2014-04-28 | 2014-04-28 | A kind of data mining processing system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104615608B (en) |
Families Citing this family (20)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106453030B (en) * | 2015-08-12 | 2019-10-11 | 大连民族学院 | A kind of method and device obtaining social networks chain |
CN106557942B (en) * | 2015-09-30 | 2020-07-10 | 百度在线网络技术(北京)有限公司 | User relationship identification method and device |
CN105468723B (en) * | 2015-11-20 | 2019-08-20 | 小米科技有限责任公司 | Information recommendation method and device |
CN106157114A (en) * | 2016-07-06 | 2016-11-23 | 商宴通(上海)网络科技有限公司 | Have dinner based on user the homepage proposed algorithm of behavior modeling |
CN107800608A (en) * | 2016-09-05 | 2018-03-13 | 腾讯科技(深圳)有限公司 | A kind of processing method and processing device of user profile |
CN106547856B (en) * | 2016-10-19 | 2020-03-17 | 天脉聚源(北京)科技有限公司 | Method and device for sharing data by application |
CN108874821B (en) * | 2017-05-11 | 2021-06-15 | 腾讯科技(深圳)有限公司 | Application recommendation method and device and server |
CN107392781B (en) * | 2017-06-20 | 2021-11-02 | 挖财网络技术有限公司 | User relationship identification method, object relationship identification method and device |
CN107464141B (en) * | 2017-08-07 | 2021-09-07 | 北京京东尚科信息技术有限公司 | Method and device for information popularization, electronic equipment and computer readable medium |
CN107741953B (en) * | 2017-09-14 | 2020-01-21 | 平安科技(深圳)有限公司 | Method and device for matching realistic relationship of social platform user and readable storage medium |
CN109767278B (en) * | 2017-11-09 | 2021-03-30 | 北京京东尚科信息技术有限公司 | Method and apparatus for outputting information |
CN107948255B (en) | 2017-11-13 | 2019-09-03 | 苏州达家迎信息技术有限公司 | The method for pushing and computer readable storage medium of APP |
CN108170725A (en) * | 2017-12-11 | 2018-06-15 | 仲恺农业工程学院 | Social network user relationship strength calculation method and device integrating multi-feature information |
CN110020420B (en) * | 2018-01-10 | 2023-07-21 | 腾讯科技(深圳)有限公司 | Text processing method, device, computer equipment and storage medium |
CN108737506A (en) * | 2018-04-27 | 2018-11-02 | 苏州达家迎信息技术有限公司 | A kind of application method for pushing, equipment, storage medium and system |
CN109241048A (en) * | 2018-06-29 | 2019-01-18 | 深圳市彬讯科技有限公司 | For the data processing method of data statistics, server and storage medium |
WO2020061815A1 (en) * | 2018-09-26 | 2020-04-02 | 深圳市欢太科技有限公司 | Method for switching game page, and related product |
CN110751284B (en) * | 2019-06-06 | 2020-12-25 | 北京嘀嘀无限科技发展有限公司 | Heterogeneous information network embedding method and device, electronic equipment and storage medium |
CN110880013A (en) * | 2019-08-02 | 2020-03-13 | 华为技术有限公司 | Text recognition method and device |
CN110851491B (en) * | 2019-10-17 | 2023-06-30 | 天津大学 | Network link prediction method based on multiple semantic influence of multiple neighbor nodes |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101609450A (en) * | 2009-04-10 | 2009-12-23 | 南京邮电大学 | Web page classification method based on training set |
CN102098332A (en) * | 2010-12-30 | 2011-06-15 | 北京新媒传信科技有限公司 | Method and device for examining and verifying contents |
CN103425686A (en) * | 2012-05-21 | 2013-12-04 | 微梦创科网络科技(中国)有限公司 | Information publishing method and device |
CN103617157A (en) * | 2013-12-10 | 2014-03-05 | 东北师范大学 | Text similarity calculation method based on semantics |
Also Published As
Publication number | Publication date |
---|---|
CN104615608A (en) | 2015-05-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN104615608B (en) | A kind of data mining processing system and method | |
Umer et al. | Sentiment analysis of tweets using a unified convolutional neural network‐long short‐term memory network model | |
US10026021B2 (en) | Training image-recognition systems using a joint embedding model on online social networks | |
Volkova et al. | Inferring user political preferences from streaming communications | |
CN110888990B (en) | Text recommendation method, device, equipment and medium | |
KR102288249B1 (en) | Information processing method, terminal, and computer storage medium | |
WO2015185019A1 (en) | Semantic comprehension-based expression input method and apparatus | |
CA3009758A1 (en) | Systems and methods for suggesting emoji | |
US20180089542A1 (en) | Training Image-Recognition Systems Based on Search Queries on Online Social Networks | |
CN110990683B (en) | Microblog rumor integrated identification method and device based on region and emotional characteristics | |
CN112307351A (en) | Model training and recommending method, device and equipment for user behavior | |
US20230214679A1 (en) | Extracting and classifying entities from digital content items | |
Windiatmoko et al. | Developing facebook chatbot based on deep learning using rasa framework for university enquiries | |
Yang et al. | Enhanced twitter sentiment analysis by using feature selection and combination | |
Al Maruf et al. | Emotion detection from text and sentiment analysis of Ukraine Russia war using machine learning technique | |
Kabra et al. | Convolutional neural network based sentiment analysis with tf-idf based vectorization | |
CN113268667A (en) | Chinese comment emotion guidance-based sequence recommendation method and system | |
Trisal et al. | K-RCC: A novel approach to reduce the computational complexity of KNN algorithm for detecting human behavior on social networks | |
Ijaz et al. | Biasness identification of talk show's host by using twitter data | |
Yenkikar et al. | Sentimlbench: Benchmark evaluation of machine learning algorithms for sentiment analysis | |
Walha et al. | A Lexicon approach to multidimensional analysis of tweets opinion | |
Tarasova | Classification of hate tweets and their reasons using svm | |
Handayanto et al. | Corpus usage for sentiment analysis of a hashtag twitter | |
Suresh | An innovative and efficient method for Twitter sentiment analysis | |
CN113505293B (en) | Information pushing method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20230705; Address after: 518057 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 floors; Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.; Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.; Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District; Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd. |