CN107122455A - A microblog-based enhanced representation method for network users - Google Patents
- Publication number
- CN107122455A (application CN201710283853.4A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
Abstract
The invention discloses a microblog-based enhanced network representation method. It belongs to the field of microblog data mining and relates in particular to network representation learning for microblog data. The method takes the colloquial character of short microblog texts into account and preprocesses the text accordingly to reduce the influence of noisy data. It generates feature representations of each user's historical post text with an LDA topic model and computes the cosine similarity between the post features of any two users to build a latent friend relationship network. It then integrates the structural information of the original network and fuses the latent friend relationships into it to obtain a corrected network structure. By correcting the original network topology with the latent friend relationship network extracted from user-generated text, the invention obtains more accurate feature representations of microblog user nodes. Compared with network representation learning methods that consider only network structure, accuracy improves significantly on both the gender inference and age inference tasks.
Description
Technical field
The invention belongs to the field of microblog data mining and relates in particular to network representation learning methods for microblog data.
Background art
The Internet of the Web 2.0 era is gradually developing into a ubiquitous platform for spreading information, and new social media oriented to social networking services (SNS), such as Twitter and Weibo, have quickly won public favor. The latest statistics show that Twitter has 310 million monthly active users and Sina Weibo has 297 million. People express opinions, share information, and interact through social media, and social media spreads and diffuses messages through social networks, with a profound influence on politics, the economy, culture, education, and other fields. The huge scale, varied forms, complexity, and dynamic change of online social network data, together with the far-reaching guiding effect of trending public opinion, give online social network analysis important research value. Taking Sina Weibo as an example, a user can publish original posts of up to 140 characters, which may include pictures, hyperlinks, video, audio, and other forms, and can also browse, forward, and comment on the posts of friends they follow. Microblog data is multi-source and heterogeneous: user-generated text, user attribute information, and network topology are all important data sources, so fusing multi-source microblog information to compute the feature representation of user nodes becomes crucial.
Representation learning is an important research problem in machine learning. It obtains effective feature representations by automatically learning a transformation from the raw input data to a new feature representation. Network representation learning learns feature representations of network nodes in a low-dimensional space, achieving both feature quantization and dimensionality reduction.
Many research results have already appeared in network representation learning. Traditional manifold learning methods recover a low-dimensional manifold structure from high-dimensional data in order to find a low-dimensional embedding of high-dimensional network data. For example, the Isomap algorithm, built on the MDS framework, uses the geodesic distance between any two points as the geometric description of the manifold. The LLE (Locally Linear Embedding) algorithm assumes that a manifold is approximately linear within a very small local neighborhood and uses the coefficients of this linear fit to characterize the local geometry of the manifold. The basic idea of the LE (Laplacian Eigenmaps) algorithm is to describe a manifold with an undirected weighted graph and then find a low-dimensional representation by graph embedding, that is, to redraw the graph from the high-dimensional space in the low-dimensional space while preserving its local adjacency relations.
In recent years, deep learning has offered new ideas for network representation learning. Network representation models based on deep learning keep emerging for large-scale network structure data with rich node information.
Inspired by the word2vec model, the DeepWalk model considers only the topology of the network. It treats nodes in the network as words in a corpus and generated node sequences as sentences, produces standard input sequences by random walks, and then models the sequences with the Skip-gram model to obtain vector representations of the network nodes. However, the DeepWalk algorithm does not define an explicit objective function, cannot learn node representations for weighted directed graphs, and, because its node sequences are generated at random, is strongly affected by noise.
The LINE model considers both the first-order and the second-order similarity of the network topology. First-order similarity is the pairwise similarity between two nodes in the network, given by the weight of the edge between them. Second-order similarity rests on the assumption that nodes sharing similar neighbors tend to be similar, and is characterized by the common neighbors of two nodes. After the models based on first-order and second-order similarity are built, the node representations of the network are obtained with edge-based negative sampling. The GraRep model considers higher-order similarity information, models the local information of each order separately, and obtains the vector representations of network nodes by SVD matrix factorization; it is suitable for large-scale network structures.
To address the randomness of node sequence generation in DeepWalk, the node2vec model improves the way neighbor nodes are found. It holds that nodes in a network exhibit both content similarity and structural similarity. Content similarity is mainly the similarity between adjacent nodes, so neighbors with homogeneity are found by breadth-first search. Structurally similar nodes are not necessarily adjacent, so neighbors with structural homogeneity are found by depth-first search. The Skip-gram method is then applied to the resulting node sequences to extract the vector representations of the nodes.
The studies above all start from network structure, but an online social network such as Sina Weibo is more than a network topology: its nodes also carry a large amount of information in other forms. Considering the diversity of node information, the TADW (Text-Associated DeepWalk) method uses an inductive matrix completion algorithm to model text features and network structure together and obtains better node representations. The GENE model observes that users of online social networks can create groups and join groups created by others, and that nodes in the same group may be intrinsically related even without a direct edge, so it brings group information into network representation learning. The Multi-faceted Representations model considers user-generated text, node attribute information, and network topology together to obtain more faithful node representations.
However, real-world networks are usually sparse, that is, few node pairs are directly connected, and it is difficult to learn an accurate network representation from the limited initial structural information alone. For users of an online social network, similarity in the text they generate can suggest shared interests and therefore a possible latent friend relationship. Existing research has not yet extended the network topology with the text information of nodes in order to improve network representation learning.
Summary of the invention
To address the sparsity of network structure, and based on the assumption above, the present invention establishes an enhanced representation learning method for network users that incorporates user-generated text, and uses the resulting user feature representations to perform gender and age inference.
The specific implementation steps of the invention are as follows:
Step 1: Preprocess user-generated posts using existing methods for processing short microblog texts, so as to eliminate the influence of noisy data;
Step 2: Using relevant natural language processing techniques, generate feature vectors for the preprocessed post text of each user, compute the similarity between post vectors with a similarity measure, extract latent friend relationships based on user-generated text, and build the latent friend relationship network;
Step 3: Considering the first-order and second-order similarity of the network structure, integrate the structural information of the original network and expand the original microblog topology;
Step 4: Fuse the latent friend relationship network extracted from post information into the integrated network topology to correct the original network structure. Two kinds of correction occur: adding new edges and increasing the weights of existing edges;
Step 5: Learn the feature representations of users in the enhanced microblog network using existing network representation learning techniques;
Step 6: To compare the representation vectors of the enhanced network with those of the original network, apply the learned representations to gender and age inference tasks and compare the accuracy of the inference results with the baseline method.
Compared with the prior art, the advantage of the invention is as follows. To address the sparsity of network topology, and considering that two users who publish similar posts in an online social network tend to have similar interests, the invention proposes an enhanced network representation learning method that incorporates user-generated text. It characterizes the users of online social networks more accurately and improves the accuracy of microblog user attribute inference.
Brief description of the drawings
Fig. 1 is a flow chart of the enhanced network representation method incorporating user-generated text
Fig. 2 is a schematic diagram of the enhanced network representation in an embodiment of the invention
Fig. 3 is a distribution plot of the text features extracted by LDA in an embodiment of the invention
Fig. 4 is a visualization of the latent network structure extracted from user-generated text in an embodiment of the invention
Fig. 5 is a visualization of the enhanced network topology in an embodiment of the invention
Fig. 6 is a comparison of experimental results for the age inference task in an embodiment of the invention
Detailed description of the embodiments
To address the sparsity of network structure, and based on the assumption above, the present invention establishes an enhanced representation learning method for network users that incorporates user-generated text, and uses the resulting user feature representations to perform gender and age inference.
The invention is described below with reference to the drawings and a specific embodiment. First, the following formal definitions are given:
In a social network, each node corresponds to a user, and each node is associated with a large amount of text representing that user's historical posts. Let G denote the network; then G = (V, E, T), where V = {v_i} is the set of user nodes, E = {(v_i, v_j)} is the set of binary edges, each with a weight w ∈ {0, 1}, and T = {t_i} is the set of posts generated by the users. The goal of the invention is to capture the feature information of text from user-generated posts, use it to correct the original network, and learn a low-dimensional representation of each node in the corrected network G''.
Preprocessing of short microblog texts. A Sina Weibo post is a short text of no more than 140 characters. First, the historical posts of each user are merged into one text paragraph. The colloquial style of posts means that microblog text contains a large amount of noise. Preprocessing removes this noise by filtering stop words, replacing unusual words, segmenting words, and so on, which makes text feature extraction more effective. The specific preprocessing operations the invention applies to microblog text are as follows:
1) On Sina Weibo the text between two "#" characters is the topic of the post and reflects the user's interests, so the text between two "#" characters is extracted directly as a keyword and is not segmented further;
2) "@" indicates a mention of a user, so the text following "@" is a user nickname and is not segmented further;
3) Punctuation marks and other special characters in the original text are filtered out;
4) All unusual words in the text are replaced by consulting an unusual-word table. Unusual words are Internet expressions widely accepted and commonly used by netizens, including abbreviations and spliced words. For example, "thank you" may be written as "3Q" or "3q", and, for expressive purposes, the word for "harmony" (和谐) may be split into its component characters (禾口言皆);
5) All traditional Chinese characters are replaced with the corresponding simplified characters by consulting a traditional-simplified table;
6) The remaining microblog text is segmented with the HanLP word segmentation tool;
7) Stop words listed in the stop-word table are filtered out;
8) The TF-IDF values of all words are computed and low-frequency words are filtered out.
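The preprocessing operations above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions: the lookup tables `UNUSUAL_WORDS` and `STOP_WORDS` are hypothetical stand-ins for the tables the patent refers to but does not reproduce, the `segment` argument stands in for a real segmenter such as HanLP, and the traditional-to-simplified conversion and the TF-IDF filter (operations 5 and 8) are left out.

```python
import re

# Hypothetical lookup tables; the patent's actual tables are not given.
UNUSUAL_WORDS = {"3Q": "谢谢", "3q": "谢谢", "禾口言皆": "和谐"}
STOP_WORDS = {"的", "了"}

TOPIC_RE = re.compile(r"#([^#]+)#")    # operation 1: text between two '#' marks
MENTION_RE = re.compile(r"@([\w-]+)")  # operation 2: nickname following '@'
NON_WORD_RE = re.compile(r"[^\w\s]")   # operation 3: punctuation, special characters


def preprocess(post, segment=str.split):
    """Return (topics, mentions, tokens) for one post.

    `segment` is a placeholder for a real word segmenter; the default
    simply splits on whitespace.
    """
    topics = TOPIC_RE.findall(post)        # kept whole, never segmented
    rest = TOPIC_RE.sub(" ", post)
    mentions = MENTION_RE.findall(rest)    # kept whole, never segmented
    rest = MENTION_RE.sub(" ", rest)
    rest = NON_WORD_RE.sub(" ", rest)
    for unusual, normal in UNUSUAL_WORDS.items():  # operation 4
        rest = rest.replace(unusual, normal)
    tokens = [t for t in segment(rest) if t not in STOP_WORDS]  # operations 6 and 7
    return topics, mentions, tokens
```

Calling `preprocess("#世界杯# @小明 3Q 了 你好！")` separates the topic and the mentioned nickname from the remaining tokens before any segmentation takes place.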
Extraction of latent friend relationships from user-generated text. Similar post content can reflect shared interests between users; in other words, users whose posts are similar are relatively likely to have a latent friend relationship. The user relationships extracted from user-generated text are therefore called latent friend relationships.
Extracting latent friend relationships is essentially a text similarity problem. First, the LDA topic model generates a feature vector for each user's microblog text. Then the cosine similarity between the post vectors of any two users is computed and used as the weight of the corresponding latent edge, which yields the latent friend relationship network.
LDA is a generative probabilistic model involving three levels: documents, topics, and words. A document is regarded as a random mixture of K latent topics, where each topic is a multinomial distribution over words and each document is a multinomial distribution over the K topics. For every document in the corpus, the generative process is as follows:
1) For each document M_i, draw θ ~ Dir(α), where Dir(α) is the Dirichlet distribution with parameter α and θ is a topic vector whose elements give the probability of each topic appearing in the document;
2) For the j-th word w_ij in the i-th document, select a latent topic z_i from the topic vector θ according to the conditional probability p(z_i | θ), and then generate the word w_j according to the conditional probability p(w_j | z_i, β);
3) Given the parameters α and β, the model defines a joint distribution over θ, the topics, and the words, in which w is the observed variable and θ is a hidden variable. The parameters α and β are learned with the expectation-maximization (EM) algorithm.
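The formula for the joint distribution is not reproduced in this text. For reference, the standard LDA factorization, which the omitted formula presumably matches, is written below for a single document. The word count N and the indexing of topics by word position j are notational assumptions made here, not taken from the patent.

```latex
p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)
  = p(\theta \mid \alpha) \prod_{j=1}^{N} p(z_j \mid \theta)\, p(w_j \mid z_j, \beta)
```

Here $\mathbf{w} = (w_1, \dots, w_N)$ are the observed words of the document and $\mathbf{z} = (z_1, \dots, z_N)$ are the latent topic assignments drawn in step 2.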
Assuming the top T topics are retained, each text paragraph is embedded as a T-dimensional vector whose i-th component w_i is the weight of the i-th topic, that is, the likelihood that the text generated by the user belongs to the i-th topic. Fig. 2 is the distribution plot of the text features: for the text generated by each user, the top three topics are selected and their weights are used as the coordinates on three axes, so that each point corresponds to the vector representation of one text.
Finally, each feature vector represents the topics associated with the text generated by one user, in other words the interests extracted from the posts that user has published. Latent friend relationships are then extracted from these representation vectors with cosine similarity; other similarity functions could of course be used to compute the similarity between vectors. Given the representation vectors of two users v_i and v_j, the latent friend relationship between them is defined as the cosine similarity of the two vectors. The latent adjacency matrix extracted from user-generated text can therefore be described as a matrix W' in which each element w'_ij ∈ [0, 1].
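As a minimal sketch of this step, the following Python code builds a latent adjacency matrix from per-user topic vectors with cosine similarity. The function names are illustrative, and leaving the diagonal at zero (no self-loops) is an assumption; the topic vectors would in practice come from a trained LDA model.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def latent_adjacency(topic_vectors):
    """Symmetric matrix of pairwise cosine similarities between users.

    Topic weights are non-negative, so every entry lies in [0, 1].
    The diagonal is left at 0 (assumption: no self-loops).
    """
    n = len(topic_vectors)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = cosine_similarity(topic_vectors[i], topic_vectors[j])
    return w
```

Two users whose posts concentrate on the same topics receive a weight close to 1, while users with disjoint topics receive a weight of 0.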
Integration of the original network structure information. Real-world social networks are usually sparse, because only some pairs of users follow each other directly. Direct friend relationships are usually added by users according to their own interests, so direct follow relationships play an important role in network embedding methods that consider only network structure. However, direct friend relationships are not enough to describe the whole network structure: two people who are not friends may still share common characteristics. In fact, two users in a social network who have common friends tend to have the same interests and characteristics.
LINE takes both facts into account and was the first to propose the concepts of first-order and second-order similarity to characterize the local and global information of the network structure.
1) First-order similarity:
Given the edge set E, for each node pair in it the weight of the corresponding edge represents the first-order similarity. Each element of the first-order similarity matrix W_1 is therefore the weight of the edge between the two nodes when that edge is in E, and 0 otherwise.
2) Second-order similarity:
The number of common neighbors of a node pair is used to define the second-order similarity, which describes how similar the neighborhood structures of two users are in the social network. Given the neighbor sets of user v_i and user v_j, the number of common friends is counted and the second-order similarity is defined from it.
We now consider the first-order and second-order similarity together and fuse them into the adjacency matrix extracted from the network structure. We introduce W to denote the integrated adjacency matrix, each element of which is composed of the two similarity values, with normalization coefficients λ and μ whose values are determined by repeated experimental tuning.
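The integration step can be illustrated with the sketch below. Several points are assumptions rather than the patent's stated formulas, which are not reproduced in this text: edges are treated as undirected, the second-order similarity is taken to be the raw count of common neighbors, and the two similarities are combined linearly as λ·w1 + μ·w2.

```python
def first_order(n, edges):
    """First-order similarity: 1 where an edge exists, 0 elsewhere."""
    w1 = [[0.0] * n for _ in range(n)]
    for i, j in edges:
        w1[i][j] = w1[j][i] = 1.0   # assumption: undirected edges
    return w1


def second_order(n, edges):
    """Second-order similarity: number of common neighbors of each pair."""
    neighbours = [set() for _ in range(n)]
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)
    w2 = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w2[i][j] = w2[j][i] = float(len(neighbours[i] & neighbours[j]))
    return w2


def integrate(w1, w2, lam, mu):
    """Assumed linear combination lam * w1 + mu * w2, element by element."""
    return [[lam * a + mu * b for a, b in zip(r1, r2)] for r1, r2 in zip(w1, w2)]
```

For a path 0-1-2, users 0 and 2 are not directly connected but share neighbor 1, so the integrated matrix gives them a non-zero weight through the second-order term.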
Correcting the network structure with latent friend relationships. The network structure is first corrected with the latent friend relationships extracted from text, and the LINE model then learns latent representations of the extended network structure. The extension produces two kinds of change: first, an edge weight is created where none existed, that is, it changes from 0 to 1; second, an existing weight increases. In Fig. 1, the subgraph of gray nodes is the original network structure, and the colored nodes are isolated at this point, that is, they have no relation to other nodes in the network. After the network structure is corrected with the latent friend relationships, the new dashed edges are new friend relationships extracted from microblog text, and the bold solid edges indicate that an edge weight in the original network structure has increased, that is, the friend relationship has been strengthened. Figs. 3 and 4 show the microblog friend relationship topology before and after the correction.
Let W'' be the adjacency matrix of the corrected network, where each element w''_ij is obtained by fusing the corresponding element of the integrated matrix W with the latent friend weight w'_ij.
However, some elements of the corrected adjacency matrix are too small, so a threshold is set and all elements below it are deleted. The final corrected adjacency matrix is then used as the input to LINE to compute the low-dimensional representations. LINE first introduces first-order and second-order similarity, learns a representation vector for each node based on each of them separately, and then describes how to fuse the two vector representations into a single final node representation.
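The correction and thresholding can be sketched as follows. The exact formula for w''_ij is not reproduced in this text, so the element-wise sum used here is purely an assumption chosen to show the two effects described above (new edges and strengthened edges); the function name and the threshold value are likewise illustrative.

```python
def correct_network(w, w_latent, threshold):
    """Fuse structural and latent weights, then drop entries below threshold.

    Assumption: fusion is an element-wise sum of the two matrices.
    """
    n = len(w)
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue                      # no self-loops
            value = w[i][j] + w_latent[i][j]
            if value >= threshold:            # delete elements that are too small
                corrected[i][j] = value
    return corrected
```

With an original edge between users 0 and 1 only, a latent weight on (0, 1) strengthens that edge, a sufficiently large latent weight on (0, 2) creates a new edge, and a latent weight below the threshold on (1, 2) is discarded.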
In essence, first-order similarity is the weight of the edge between a pair of nodes. To model first-order similarity, LINE builds the empirical probability directly from the edge weights, constructs the joint probability from the representation vectors, and uses the K-L divergence to measure the difference between the empirical and joint probabilities, which gives the objective function. A similar objective function is built for second-order similarity. The node vector representations under the two similarities are obtained with a negative-sampling optimization algorithm. Finally, the two vectors are simply concatenated to obtain the final network representation.
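For reference, the first-order objective described above can be written out as in the LINE model. The symbols are assumptions introduced here for illustration: $\vec{u}_i$ is the representation vector of node $v_i$, $w_{ij}$ is the weight of edge $(v_i, v_j)$ in the adjacency matrix given to LINE, and $S$ is the sum of all edge weights.

```latex
p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^{\top} \vec{u}_j)}, \qquad
\hat{p}_1(v_i, v_j) = \frac{w_{ij}}{S}, \quad S = \sum_{(i,j) \in E} w_{ij}

O_1 = \mathrm{KL}\left(\hat{p}_1 \,\|\, p_1\right)
    \;\Rightarrow\;
O_1 = -\sum_{(i,j) \in E} w_{ij} \log p_1(v_i, v_j) \quad \text{(constants dropped)}
```

Minimizing $O_1$ pulls together the vectors of nodes joined by heavy edges, which is why increasing an edge weight during the correction step directly affects the learned representation.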
Gender inference for microblog users can be treated as a supervised binary classification problem based on the user feature representations. We use an SVM with a linear kernel and train a gender classifier with the final representation vectors as the extracted features. The experimental results of the baseline method are shown in Table 1 and those of the method of the invention in Table 2.
Table 1: Experimental results of the gender inference task (baseline method)
Table 2: Experimental results of the gender inference task (method of the invention)
The tables show that average accuracy improves by about 4 percentage points. Moreover, accuracy increases as the sample size of the test set increases, which can be explained as follows: the more training samples there are, the more accurate the classifier obtained by SVM training.
Age inference is a supervised multi-class classification problem. To infer the age of test samples more accurately, we divide user ages into 4 intervals according to the distribution of birth dates in the user profiles. The statistics show that most users are young people between 18 and 30 years old. We then infer user age with two SVM extension schemes, "one-versus-one" and "one-versus-rest". The experimental results are shown in Table 3 and Table 4.
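The difference between the two extension schemes lies in how binary decisions are combined into a multi-class prediction. The sketch below shows only that combination logic; the binary classifiers themselves are passed in as plain callables (hypothetical placeholders for trained SVMs), and the function names are illustrative.

```python
from collections import Counter
from itertools import combinations


def predict_one_vs_one(pairwise, classes, x):
    """Majority vote over one binary classifier per pair of classes.

    pairwise[(a, b)](x) must return either a or b.
    """
    votes = Counter(pairwise[(a, b)](x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def predict_one_vs_rest(scorers, x):
    """Pick the class whose class-versus-rest scorer is most confident.

    scorers[c](x) must return a real-valued confidence for class c.
    """
    return max(scorers, key=lambda c: scorers[c](x))
```

With 4 age intervals, the one-versus-one scheme needs 6 pairwise classifiers and the one-versus-rest scheme needs 4, one per interval.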
The experimental result (pedestal method) of 3 age of table reasoning task
The experimental result (method of the invention) of 4 age of table reasoning task
The SVM classifier that the first behavior of accuracy rate is extended by the way of " one-to-one " in two tables realizes age reasoning
As a result, the SVM classifier that the second behavior is extended by the way of " a pair remaining " realizes the experimental result of age reasoning.From table
The expression vector that data can be seen that obtained by network enhancing is represented has than the classification performance for the expression vector that reference scheme is obtained
Very big raising, such as, when correspondence Percentage is 10% or so, the accuracy rate of the first expansion scheme is from 69.03%
Bring up to 76.25%.Accompanying drawing 6 shows the Comparative result curve map of age reasoning, it is seen that the vector table obtained by network enhancing expression
Show the more preferable classification results of vector representation obtained really than pedestal method.
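The two multi-class extension schemes named above can be sketched as follows. This is only an illustration under stated assumptions, not the implementation used in the experiments: the binary deciders are hypothetical nearest-centroid rules on a one-dimensional feature, standing in for trained binary SVMs, and the labels 0 to 3 are placeholders for the four age intervals.

```python
from collections import Counter
from itertools import combinations


def one_vs_one_predict(x, classes, pairwise_decide):
    """One-versus-one: each of the K(K-1)/2 pairwise classifiers casts one vote."""
    votes = Counter(pairwise_decide(a, b, x) for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def one_vs_rest_predict(x, classes, score):
    """One-versus-rest: each of the K classifiers scores 'class c against the rest'."""
    return max(classes, key=lambda c: score(c, x))


# Hypothetical stand-ins for trained binary SVMs: one centroid per class.
CENTROIDS = {0: 0.0, 1: 1.0, 2: 2.0, 3: 3.0}


def pairwise_decide(a, b, x):
    """Pick whichever of the two classes has the closer centroid."""
    return a if abs(x - CENTROIDS[a]) <= abs(x - CENTROIDS[b]) else b


def score(c, x):
    """Higher score means x is closer to the centroid of class c."""
    return -abs(x - CENTROIDS[c])


classes = sorted(CENTROIDS)
n_pairwise = len(list(combinations(classes, 2)))  # 6 classifiers for 4 classes
ovo_label = one_vs_one_predict(1.2, classes, pairwise_decide)
ovr_label = one_vs_rest_predict(1.2, classes, score)
```

With four classes, one-versus-one needs six binary classifiers and one-versus-rest needs four, which is the practical trade-off between the two schemes.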
In summary, to address the sparsity problem of online social networks in the real world, and based on the fact that two users who publish similar blog posts have a latent friend relationship, we propose a network-enhanced representation learning method that fuses the text information of the nodes. Specifically, a latent friend relationship network is extracted from user-generated text and used to correct the original network topology, yielding more accurate network node representations. Compared with network representation learning that considers only the network topology, accuracy improves significantly in both the gender and age inference tasks.
Therefore, the microblog-based network-enhanced representation method proposed by the invention has significant practical value for network user feature representation and for subsequent classification and inference tasks.
To illustrate the content and implementation of the invention, this specification gives a specific embodiment. The details introduced in the embodiment are not intended to limit the scope of the claims but to aid understanding of the method of the invention. Those skilled in the art should understand that various modifications, changes, or substitutions to the steps of the preferred embodiment are possible without departing from the spirit and scope of the invention and its appended claims. Therefore, the invention should not be limited to what is disclosed in the preferred embodiment and the accompanying drawings.
Claims (6)
1. A network user enhanced representation method based on microblog, characterized by comprising the following steps:
Step 1: with reference to existing microblog short-text processing methods, preprocess the user-generated blog posts so as to eliminate the influence of noisy data;
Step 2: with reference to relevant natural language processing techniques, generate the feature vectors of the preprocessed user blog text, compute the similarity between blog-post vectors with a similarity measurement function, extract latent friend relationships based on the user-generated text, and construct a latent friend relationship network;
Step 3: taking into account the first-order and second-order similarity of the network structure, integrate the original network structure information to expand the topological relationship network among microblog users;
Step 4: fuse the latent friend relationship network extracted from the blog-post information into the integrated network topology to correct the original network structure information, in two ways: adding some latent edges, and increasing the weights of some existing edges;
Step 5: with reference to existing network representation learning techniques, learn the feature representations of the users of the enhanced microblog network;
Step 6: to compare the effectiveness of the representation vectors of the enhanced network with those of the original network, apply the above representation learning results to the gender and age inference tasks, and compare the accuracy of the inference results with the baseline method.
2. The network user enhanced representation method based on microblog according to claim 1, characterized in that, in the social network, each node corresponds to a user and is associated with a large amount of text information representing the historical blog posts of that user. Let $G$ denote the network; then $G = (V, E, T)$, where $V = \{v_i\}$ is the set of user nodes, $E = \{(v_i, v_j)\}$ is the set of binary edges, each edge having a weight $w \in \{0, 1\}$, and $T = \{t_i\}$ is the set of user-generated blog posts. The invention captures textual feature information from the user-generated blog posts and uses it to correct the original network, so as to learn a low-dimensional representation of each node in the corrected network $G''$.
3. The network user enhanced representation method based on microblog according to claim 1, characterized in that the microblog short-text preprocessing method in step 2 comprises the following:
(1) extract the text between two "#" characters and use it directly as a keyword;
(2) extract the text following "@";
(3) filter out special characters such as punctuation marks from the original text;
(4) check the text against the unusual-word list and replace all unusual words in the text;
(5) segment the retained microblog text into words using the HanLP word segmentation tool;
(6) filter out the stop words listed in the stop-word list;
(7) compute the TF-IDF values of all words and filter out the low-frequency words among them.
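The seven preprocessing steps above can be sketched in a few lines. This is a minimal sketch under several assumptions that are not in the source: whitespace splitting stands in for the HanLP segmenter, the stop-word list and unusual-word table are passed in by the caller, and "low-frequency" is interpreted as a TF-IDF score below a hypothetical `min_score` threshold.

```python
import math
import re
from collections import Counter

HASHTAG = re.compile(r"#([^#]+)#")   # step 1: text between two '#'
MENTION = re.compile(r"@(\w+)")      # step 2: text following '@'
NON_WORD = re.compile(r"[^\w\s]")    # step 3: punctuation and special characters


def preprocess(post, stop_words, replacements, segment=str.split):
    """Apply steps 1-6 to one post; `segment` is a placeholder for HanLP."""
    keywords = HASHTAG.findall(post)
    mentions = MENTION.findall(post)
    text = MENTION.sub(" ", HASHTAG.sub(" ", post))
    text = NON_WORD.sub(" ", text)
    for unusual, normal in replacements.items():   # step 4: replace unusual words
        text = text.replace(unusual, normal)
    tokens = segment(text)                          # step 5: word segmentation
    tokens = [t for t in tokens if t not in stop_words]  # step 6: stop words
    return keywords, mentions, tokens


def tfidf_filter(docs, min_score):
    """Step 7 (assumed form): drop tokens whose TF-IDF is below min_score."""
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    filtered = []
    for doc in docs:
        term_freq = Counter(doc)
        filtered.append([
            t for t in doc
            if (term_freq[t] / len(doc)) * math.log(n_docs / doc_freq[t]) >= min_score
        ])
    return filtered


keywords, mentions, tokens = preprocess(
    "#AI# great talk by @alice !!! gr8 stuff",
    stop_words={"by"},
    replacements={"gr8": "great"},
)
```

A word that occurs in every document gets a TF-IDF of zero under this definition, so it is removed for any positive `min_score`.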
4. The network user enhanced representation method based on microblog according to claim 2, characterized in that the method in step 2 for extracting latent friend relationships based on user-generated text is as follows:
(1) Generate the feature vectors of the users' microblog text using the LDA topic model:
LDA is a generative probabilistic model involving three levels: documents, topics, and words. A document can be regarded as a random mixture of $K$ latent topics, where each topic follows a multinomial distribution over words and each document follows a multinomial distribution over the $K$ topics. For each document in the corpus, the generative process is as follows:
For each document $M_i$, choose $\theta \sim \mathrm{Dir}(\alpha)$, where $\mathrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$ and $\theta$ is a topic vector, each element of which is the probability that the corresponding topic appears in the document;
For the $j$-th word $w_{ij}$ in the $i$-th document, select a latent topic $z_i$ from the topic vector $\theta$ through the conditional probability $p(z_i \mid \theta)$, then generate the word $w_j$ through the conditional probability $p(w_j \mid z_i, \beta)$.
Given the parameters $\alpha$ and $\beta$, the joint distribution of the model is
$$
p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{i=1}^{N} p(z_i \mid \theta)\, p(w_j \mid z_i, \beta)
$$
where $w$ is the observed variable and $\theta$ is the hidden variable. We then learn the parameters $\alpha$ and $\beta$ with the expectation-maximization (EM) algorithm.
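The generative process described above can be simulated directly. The sketch below is illustrative only and makes its own assumptions: `beta` is taken to be a topic-by-word probability table, the Dirichlet draw is obtained by normalizing independent Gamma draws, and the function names are hypothetical. It covers sampling only, not the EM parameter learning.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalizing independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(alpha, beta, n_words, rng):
    """Generate one document of n_words words.

    beta[k][v] is assumed to be the probability of word v under topic k.
    Returns the topic mixture theta and a list of (topic, word) index pairs.
    """
    theta = sample_dirichlet(alpha, rng)
    topics = range(len(alpha))
    vocabulary = range(len(beta[0]))
    document = []
    for _ in range(n_words):
        z = rng.choices(topics, weights=theta)[0]        # z ~ p(z | theta)
        w = rng.choices(vocabulary, weights=beta[z])[0]  # w ~ p(w | z, beta)
        document.append((z, w))
    return theta, document


rng = random.Random(0)
# Toy table in which topic k always emits word k, so each word reveals its topic.
beta = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
theta, document = generate_document([0.5, 0.5, 0.5], beta, 20, rng)
```

In the toy table each topic emits exactly one word, which makes it easy to check that words are drawn from the topic that was selected for them.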
Assuming that the top $T$ topics are retained, each text fragment is embedded as a vector of topic weights, where $w_i$ is the weight of the $i$-th topic and represents the likelihood that the text generated by user $v_i$ belongs to the $i$-th topic.
(2) Compute the cosine similarity between the blog-post vectors of every pair of users to characterize the weight of the corresponding latent relationship edge, thereby constructing the latent friend relationship network:
Latent friend relationships are extracted from these representation vectors by cosine similarity. Given the representation vectors of two users $v_i$ and $v_j$, the latent friend relationship between them is defined as the cosine similarity of the two vectors.
The latent adjacency matrix extracted from the user-generated text can therefore be described as a matrix in which each element $w'_{ij} \in [0, 1]$.
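The construction of the latent adjacency matrix from topic vectors can be sketched as below. This is a minimal sketch, assuming each user is already represented by a non-negative topic-weight vector; the function names are hypothetical, and leaving the diagonal at zero (no self-loops) is an assumption not stated in the source.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def latent_adjacency(topic_vectors):
    """Symmetric matrix of pairwise cosine similarities, with a zero diagonal."""
    n = len(topic_vectors)
    w_latent = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w_latent[i][j] = w_latent[j][i] = cosine(topic_vectors[i], topic_vectors[j])
    return w_latent


w_latent = latent_adjacency([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
```

Because topic weights are non-negative, every similarity falls in the interval [0, 1], matching the range stated for the matrix elements.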
5. The network user enhanced representation method based on microblog according to claim 2, characterized in that the method of integrating the original network structure information in step 3 is as follows:
Two users with common friends in a social network tend to share the same interests and characteristics. LINE takes the above two facts into account and was the first to propose the concepts of first-order and second-order similarity, which fully characterize the local and global information of the network structure.
(1) First-order similarity:
Given the edge set $E$, for each node pair in it, the weight of the corresponding edge represents the first-order similarity. The elements of the first-order similarity matrix $W^1$ can be defined as
$$
w_{ij}^{1} =
\begin{cases}
1, & \text{if } (v_i, v_j) \in E \\
0, & \text{otherwise}
\end{cases}
$$
(2) Second-order similarity:
The number of common neighbors of any node pair is used to define the second-order similarity, which describes how similar the neighborhood structures of two users in the social network are. Given the neighbor node sets of user $v_i$ and user $v_j$, the number of common friends is computed and the second-order similarity $w_{ij}^{2}$ is defined from it.
We now take both the first-order and second-order similarity into account and fuse them into the adjacency matrix extracted from the network structure. We therefore introduce $W$ to denote the integrated neighbor matrix, each element of which is composed of the two similarity values:
$$
w_{ij} = \lambda \cdot w_{ij}^{1} + \mu \cdot w_{ij}^{2}
$$
where $\lambda$ and $\mu$ are normalization coefficients whose specific values are determined by repeated experimental tuning.
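The integration of the two similarities can be sketched as follows. Two points here are assumptions rather than the source's definitions: the second-order similarity is taken to be the raw count of common neighbors (the exact formula is not reproduced in this text), and edges are treated as undirected. The function names and the example values of `lam` and `mu` are hypothetical.

```python
def first_order(n, edges):
    """w1[i][j] = 1 if (v_i, v_j) is an edge, else 0 (edges treated as undirected)."""
    w1 = [[0] * n for _ in range(n)]
    for i, j in edges:
        w1[i][j] = w1[j][i] = 1
    return w1


def second_order(n, edges):
    """Assumed form: w2[i][j] = number of neighbors shared by v_i and v_j."""
    neighbors = [set() for _ in range(n)]
    for i, j in edges:
        neighbors[i].add(j)
        neighbors[j].add(i)
    w2 = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                w2[i][j] = len(neighbors[i] & neighbors[j])
    return w2


def integrate(w1, w2, lam, mu):
    """w[i][j] = lam * w1[i][j] + mu * w2[i][j]."""
    return [[lam * a + mu * b for a, b in zip(row1, row2)]
            for row1, row2 in zip(w1, w2)]


# Path graph 0 - 1 - 2: nodes 0 and 2 are not linked but share neighbor 1.
edges = [(0, 1), (1, 2)]
w1 = first_order(3, edges)
w2 = second_order(3, edges)
w = integrate(w1, w2, lam=1.0, mu=0.5)
```

In the path graph, nodes 0 and 2 receive a non-zero integrated weight purely from their shared neighbor, which is how the second-order term densifies a sparse network.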
6. The network user enhanced representation method based on microblog according to claim 2, characterized in that the method in step 4 of correcting the original network structure with the latent friend relationships is as follows:
Let $W''$ be the adjacency matrix of the corrected network, where each element $w''_{ij}$ is
$$
w''_{ij} = \frac{w_{ij} + w'_{ij}}{\max\limits_{i,j}\{w_{ij},\, w'_{ij}\}}
$$
However, some elements of the corrected adjacency matrix are too small, so a threshold must be set and all elements below it deleted. The final corrected adjacency matrix is then used as the input to LINE to compute the low-dimensional representations of the network's user nodes.
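The correction and thresholding step can be sketched as below. This is a sketch under stated assumptions: the denominator is read as the single largest entry across both matrices, "deleting" an element is modeled by setting it to zero, and the function name and threshold value are hypothetical. Feeding the result to LINE is outside the scope of the sketch.

```python
def correct_network(w, w_latent, threshold):
    """Compute w''[i][j] = (w[i][j] + w'[i][j]) / max over all entries of w and w',
    then zero out every entry that falls below `threshold`."""
    denom = max(max(max(row) for row in w),
                max(max(row) for row in w_latent))
    if denom == 0:
        # Both matrices are all zeros, so there is nothing to correct.
        return [[0.0] * len(row) for row in w]
    corrected = [[(a + b) / denom for a, b in zip(row_w, row_l)]
                 for row_w, row_l in zip(w, w_latent)]
    return [[x if x >= threshold else 0.0 for x in row] for row in corrected]


# Integrated structural matrix and latent text-based matrix for three users.
w = [[0.0, 2.0, 0.0], [2.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
w_latent = [[0.0, 0.0, 0.2], [0.0, 0.0, 0.0], [0.2, 0.0, 0.0]]
w_corrected = correct_network(w, w_latent, threshold=0.5)
```

Here the weak latent link between users 0 and 2 scales to 0.1 and is removed by the threshold, while the existing edge between users 0 and 1 is kept.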
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710283853.4A CN107122455B (en) | 2017-04-26 | 2017-04-26 | Network user enhanced representation method based on microblog |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107122455A true CN107122455A (en) | 2017-09-01 |
CN107122455B CN107122455B (en) | 2019-12-31 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |