CN107122455A - A microblog-based enhanced representation method for network users - Google Patents

Info

Publication number
CN107122455A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710283853.4A
Other languages
Chinese (zh)
Other versions
CN107122455B (en)
Inventor
胡玥
贾焰
周斌
杨树强
韩伟红
李爱平
黄九鸣
江荣
全拥
邓璐
刘强
张涛
童咏之
刘心
韩文祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
National University of Defense Technology
Original Assignee
National University of Defense Technology
Application filed by National University of Defense Technology filed Critical National University of Defense Technology
Priority to CN201710283853.4A priority Critical patent/CN107122455B/en
Publication of CN107122455A publication Critical patent/CN107122455A/en
Application granted granted Critical
Publication of CN107122455B publication Critical patent/CN107122455B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/01 Social networking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computing Systems (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • Evolutionary Computation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a microblog-based enhanced network representation method. It belongs to the field of microblog data mining and in particular concerns network representation learning for microblog data. The method takes the colloquial character of short microblog texts into account and preprocesses the text accordingly, reducing the influence of noisy data. It uses an LDA topic model to generate feature representations of each user's historical posts and computes the cosine similarity between the post features of every pair of users, thereby building a potential friend relationship network. It then integrates the structural information of the original network and fuses the potential friend relationships into it to obtain a corrected network structure. By correcting the original network topology with the potential friend relationship network extracted from user-generated text, the invention obtains more accurate feature representations of microblog user nodes. Compared with network representation learning methods that consider only network structure, accuracy improves markedly on the two tasks of gender and age inference.

Description

A microblog-based enhanced representation method for network users
Technical field
The invention belongs to the field of microblog data mining, and in particular relates to a network representation learning method for microblog data.
Background
The Internet of the Web 2.0 era is gradually developing into a ubiquitous platform for information dissemination, and new social media oriented to social networking services (SNS), such as Twitter and microblogs, have quickly won public favor. The latest statistics show that Twitter has reached 310 million monthly active users and Sina Weibo 297 million. People use social media to express opinions, share information, and interact, and social media spreads and amplifies messages through social networks, with a profound influence on politics, the economy, culture, education, and other fields. The huge scale, varied forms, complexity, and dynamic change of online social network data, together with the far-reaching guiding effect of trending public opinion, give online social network analysis significant research value. Taking Sina Weibo as an example, a user can publish original posts of up to 140 characters, which may include pictures, hyperlinks, video, audio, and other forms, and can also browse, repost, and comment on the posts of the friends they follow. Microblog data is therefore multi-source and heterogeneous: user-generated text, user attribute lists, and network topology are all important data sources, and fusing this multi-source information to compute feature representations of user nodes becomes crucial.
Representation learning is an important research problem in machine learning: it obtains effective feature representations by automatically learning a transformation from the raw input data to a new feature representation. Network representation learning learns feature representations of network nodes in a low-dimensional space, achieving both quantified features and dimensionality reduction.
Many research results have already appeared in the field of network representation learning. Traditional manifold learning methods recover a low-dimensional manifold structure from high-dimensional data in order to find a low-dimensional embedding of high-dimensional network data. For example, the Isomap algorithm is based on the MDS framework and uses the geodesic distance between any two points as the geometric description of the manifold. The LLE (Locally Linear Embedding) algorithm assumes that a manifold is approximately linear within a very small local neighborhood and uses the coefficients of this linear fit to characterize the manifold's local geometry. The basic idea of the LE (Laplacian Eigenmaps) algorithm is to describe a manifold with an undirected weighted graph and then find a low-dimensional representation by graph embedding, that is, to redraw the graph in the low-dimensional space while preserving its local adjacency relations.
In recent years, deep learning has offered new ideas for network representation learning, and network representation models based on deep learning keep emerging to handle large-scale network structure data and rich node information.
Inspired by the word2vec model, the DeepWalk model considers only the topology of the network. It treats nodes in the network as words in a corpus and node sequences as sentences, generates the input sequences by random walks, and then models the sequences with the Skip-gram model to obtain vector representations of the network nodes. However, the DeepWalk algorithm does not define an explicit objective function, cannot learn node representations for weighted directed graphs, and is strongly affected by noise because the node sequences are generated at random.
The LINE model considers both the first-order and the second-order similarity of the network topology. First-order similarity is the pairwise similarity between two nodes, given by the weight of the edge between them. Second-order similarity rests on the assumption that nodes sharing similar neighbors tend to be similar, and is characterized by the common neighbors of two nodes. Once the models based on first-order and second-order similarity are built, node representations are obtained with an edge-based negative sampling method. The GraRep model considers higher-order similarity information, models the local information of each order separately, and obtains node vectors by SVD matrix factorization; it is suitable for large-scale network structures.
To address the randomness of the node sequences generated by DeepWalk, the node2vec model improves the way neighbor nodes are found. It holds that nodes in a network exhibit both content similarity and structural similarity. Content similarity is mainly similarity between adjacent nodes, and homophilous neighbors are found by breadth-first search; structurally similar nodes are not necessarily adjacent, and structurally equivalent neighbors are found by depth-first search. The Skip-gram method is then applied to the resulting node sequences to extract the node vectors.
The studies above all start from network structure, but an online social network such as Sina Weibo is more than a network topology: its nodes also carry a large amount of information in other forms. Considering this diversity of node information, the TADW (Text-associated Deepwalk) method uses an inductive matrix completion algorithm to model text features and network structure jointly and obtains better node representations. The GENE model observes that users of online social networks can create groups and join groups created by others, and that nodes in the same group may be intrinsically related even when no edge connects them directly, so it brings group information into network representation learning. The Multi-faceted Representations model considers three kinds of information, namely user-generated text, node attributes, and network topology, to obtain more faithful node representations.
However, real-world networks are usually sparse, that is, the number of directly connected edges is very small, and it is difficult to learn an accurate network representation from the limited initial structure alone. For users of an online social network, similarity in the text they generate can indicate shared interests, and hence a possible potential friend relationship. Existing research has not yet used the text information of nodes to extend the network topology and thereby improve network representation learning.
Summary of the invention
Addressing the sparsity of network structure and based on the above assumption, the present invention establishes an enhanced representation learning method for network users that incorporates user-generated text, and uses the resulting user feature representations to perform gender and age inference.
The specific implementation steps of the invention are as follows:
Step 1: Preprocess the user-generated posts using existing microblog short-text processing methods, so as to eliminate the influence of noisy data;
Step 2: Using relevant natural language processing techniques, generate feature vectors for the preprocessed user post text, compute the similarity between post vectors with a similarity measure, extract potential friend relationships based on user-generated text, and build a potential friend relationship network;
Step 3: Considering the first-order and second-order similarity of the network structure, integrate the original network structure information and expand the original microblog topological relationship network;
Step 4: Fuse the potential friend relationship network extracted from the post information into the integrated network topology to correct the original network structure information, using two kinds of correction: adding some edges and increasing the weights of some existing edges;
Step 5: Using existing network representation learning techniques, learn the feature representations of users in the enhanced microblog network;
Step 6: To compare the representation vectors of the enhanced network with those of the original network, apply the learned representations to gender and age inference tasks and compare the accuracy of the inference results with the baseline method.
Compared with the prior art, the advantage of the invention is as follows. Addressing the sparsity of network topology and considering the fact that two users who publish similar posts in an online social network have similar interests, the invention proposes an enhanced network representation learning method that incorporates user-generated text, characterizes users of online social networks more accurately, and improves the accuracy of microblog user attribute inference tasks.
Brief description of the drawings
Fig. 1 is a flow chart of the enhanced network representation method incorporating user-generated text
Fig. 2 is a schematic diagram of the enhanced network representation in an embodiment of the invention
Fig. 3 is a distribution plot of the text features extracted by LDA in an embodiment of the invention
Fig. 4 is a visualization of the potential network structure extracted from user-generated text in an embodiment of the invention
Fig. 5 is a visualization of the enhanced network topology in an embodiment of the invention
Fig. 6 is a comparison of the experimental results of the age inference task in an embodiment of the invention
Detailed description:
Addressing the sparsity of network structure and based on the above assumption, the present invention establishes an enhanced representation learning method for network users that incorporates user-generated text, and uses the resulting user feature representations to perform gender and age inference.
The invention is described below with reference to the drawings and a specific embodiment. First, the following formal definitions are given:
In a social network, each node corresponds to a user and is associated with a large amount of text, namely that user's historical posts. Let $G$ denote the network; then $G = (V, E, T)$, where $V = \{v_i\}$ is the set of user nodes, $E = \{(v_i, v_j)\}$ is the set of binary edges, each with weight $w \in \{0, 1\}$, and $T = \{t_i\}$ is the set of posts generated by users. The goal of the invention is to capture the feature information of the text in user-generated posts and use it to correct the original network, so as to learn a low-dimensional representation of each node in the corrected network $G''$.
Microblog short-text preprocessing. A Sina Weibo post is a short text of no more than 140 characters. First, each user's historical posts are merged into one text passage. The colloquial style of posts means that microblog text contains a large amount of noisy data; preprocessing filters stop words, replaces abnormal words, segments the text into words, and so on, to remove this noise and make text features easier to extract. The specific preprocessing operations applied to microblog text are as follows:
1) On Sina Weibo, the text between two "#" characters is the topic of the post and reflects the user's interests, so it is extracted directly as a keyword without further segmentation;
2) "@" denotes a mention of a user, so the text after "@" is a user nickname and is not segmented further;
3) Punctuation marks and other special characters in the original text are filtered out;
4) All abnormal words in the text are replaced by consulting an abnormal-word table. Abnormal words are widely accepted, commonly used Internet slang, including abbreviations and spliced words. For example, "thank you" may be written as "3Q" or "3q"; likewise, for certain expressive purposes the word 和谐 ("harmony") may be split into its component characters 禾口言皆;
5) All traditional Chinese characters are replaced with the corresponding simplified characters by consulting a traditional-simplified table;
6) The retained microblog text is segmented into words with the HanLP word segmentation tool;
7) Stop words in the stop-word list are filtered out;
8) The TF-IDF values of all words are computed, and low-frequency words are filtered out.
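The preprocessing steps above can be illustrated with a minimal sketch. It covers only steps 1 to 4 and 7; the lookup tables `ABNORMAL_WORDS` and `STOP_WORDS`, the function name `preprocess`, and the use of whitespace splitting in place of the HanLP segmenter are all assumptions made for illustration, and the traditional-to-simplified conversion and TF-IDF filtering are omitted.

```python
import re

# Hypothetical lookup tables; real tables would be far larger.
ABNORMAL_WORDS = {"3Q": "thank you", "3q": "thank you"}
STOP_WORDS = {"the", "a", "of"}

TOPIC_RE = re.compile(r"#([^#]+)#")   # text between two '#' characters
MENTION_RE = re.compile(r"@(\w+)")    # nickname following '@'


def preprocess(post):
    """Return (topics, mentions, tokens) for one post."""
    topics = TOPIC_RE.findall(post)        # step 1: keep topic text whole
    rest = TOPIC_RE.sub(" ", post)
    mentions = MENTION_RE.findall(rest)    # step 2: keep nicknames whole
    rest = MENTION_RE.sub(" ", rest)
    rest = re.sub(r"[^\w\s]", " ", rest)   # step 3: drop punctuation and symbols
    tokens = []
    # Whitespace splitting stands in for a real word segmenter (step 6).
    for tok in rest.split():
        replaced = ABNORMAL_WORDS.get(tok, tok)   # step 4: replace abnormal words
        tokens.extend(replaced.split())
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # step 7
    return topics, mentions, tokens
```

Returning topics and mentions separately from the ordinary tokens keeps them out of the segmenter, which mirrors the requirement that they are not cut further.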
Extracting potential friend relationships from user-generated text. Similar posts can reflect interests shared between users; in other words, users with similar posts are relatively likely to be potential friends. The user relationships extracted from user-generated text are therefore called potential friend relationships.
Extracting potential friend relationships essentially reduces to a text similarity problem. First, an LDA topic model generates a feature vector for each user's microblog text; then the cosine similarity between the post vectors of every pair of users is computed and used as the weight of the corresponding potential relationship edge, thereby building the potential friend relationship network.
LDA is a generative probabilistic model involving three levels: documents, topics, and words. A document is regarded as a random mixture of K latent topics, where each topic follows a multinomial distribution over words and each document follows a multinomial distribution over the K topics. For each document in the corpus, the generative process is as follows:
1) For each document $M_i$, draw $\theta \sim \mathrm{Dir}(\alpha)$, where $\mathrm{Dir}(\alpha)$ is the Dirichlet distribution with parameter $\alpha$ and $\theta$ is a topic vector whose elements give the probability of each topic appearing in the document;
2) For the j-th word $w_{ij}$ of the i-th document, select a latent topic $z_{ij}$ from the topic vector $\theta$ according to the conditional probability $p(z_{ij} \mid \theta)$, then generate the word $w_{ij}$ according to the conditional probability $p(w_{ij} \mid z_{ij}, \beta)$;
3) Given the parameters $\alpha$ and $\beta$, the joint distribution of the model for document $M_i$ is
$$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{j} p(z_{ij} \mid \theta)\, p(w_{ij} \mid z_{ij}, \beta)$$
where $\mathbf{w}$ is the observed variable and $\theta$ is a latent variable. The parameters $\alpha$ and $\beta$ are learned with the expectation-maximization (EM) algorithm.
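The generative process can be made concrete with a small sampling sketch. It only simulates how one document would be generated for given parameters and does not perform the EM fitting; the function names and the representation of $\beta$ as a list of per-topic word probability rows are assumptions chosen for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from Dir(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_categorical(probs, rng):
    """Return an index drawn with the given probabilities."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(alpha, beta, n_words, rng):
    """Simulate one document.

    alpha: list of K Dirichlet parameters.
    beta:  K rows, each a probability distribution over the vocabulary.
    """
    theta = sample_dirichlet(alpha, rng)          # step 1: topic mixture
    topics, words = [], []
    for _ in range(n_words):
        z = sample_categorical(theta, rng)        # step 2: pick a latent topic
        w = sample_categorical(beta[z], rng)      # then a word from that topic
        topics.append(z)
        words.append(w)
    return theta, topics, words
```

Fitting reverses this process: only the words are observed, and the topic mixture of each document is inferred from them.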
Assuming the top T topics are retained, each text passage is embedded as the vector $(w_1, w_2, \dots, w_T)$, where $w_i$ is the weight of the i-th topic and represents the likelihood that the text generated by the user belongs to the i-th topic. Fig. 2 shows the distribution of the text features: for the text generated by each user, the top three topics are selected and the corresponding values on three coordinate axes are computed, so that each point corresponds to the vector representation of one text.
Finally, each feature vector represents the topics associated with the text generated by a user, in other words the interests extracted from the posts the user has published. Cosine similarity is then used to extract potential friend relationships from these representation vectors; other similarity functions could also be used to compute the similarity between vectors. Given the representation vectors $\mathbf{x}_i$ and $\mathbf{x}_j$ of two users $v_i$ and $v_j$, the potential friend relationship between them can be defined as
$$w'_{ij} = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\lVert \mathbf{x}_i \rVert \, \lVert \mathbf{x}_j \rVert}$$
The potential adjacency matrix extracted from user-generated text can therefore be written as the matrix $W' = (w'_{ij})_{|V| \times |V|}$, where each element $w'_{ij} \in [0, 1]$.
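As a small illustration of how the potential adjacency matrix could be built from topic vectors, the sketch below computes pairwise cosine similarity in plain Python. The function names are hypothetical, and leaving the diagonal at zero (no self-loops) is an assumption; the text does not say how self-similarity is handled.

```python
import math


def cosine(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm > 0 else 0.0


def potential_adjacency(topic_vectors):
    """Symmetric matrix of pairwise cosine similarities, with a zero diagonal."""
    n = len(topic_vectors)
    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            w[i][j] = w[j][i] = cosine(topic_vectors[i], topic_vectors[j])
    return w
```

Because topic weights are non-negative, every entry falls in [0, 1], which matches the stated range of the potential edge weights.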
Integrating the original network structure information. Real-world social networks are usually sparse, because only some users follow each other directly. Moreover, direct friend relationships are generally added by users themselves according to their own interests, so direct follow relationships play an important role in network embedding problems that consider only network structure. However, direct friend relationships are not enough to describe the whole network structure: two people who are not friends may still share common characteristics. In fact, two users in a social network who have common friends tend to have the same interests and characteristics.
LINE takes both facts into account and was the first to propose the concepts of first-order and second-order similarity, which together characterize the local and global information of the network structure.
1) First-order similarity:
Given the edge set E, for each node pair in it the weight of the corresponding edge represents the first-order similarity. An element $w^{(1)}_{ij}$ of the first-order similarity matrix $W_1$ can be defined as
$$w^{(1)}_{ij} = \begin{cases} 1, & (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases}$$
2) Second-order similarity:
The number of common neighbors of any node pair is used to define the second-order similarity, which describes how similar the neighborhood structures of two users are in the social network. Given the neighbor sets $N(v_i)$ and $N(v_j)$ of users $v_i$ and $v_j$, the number of common friends $|N(v_i) \cap N(v_j)|$ is computed, and the second-order similarity $w^{(2)}_{ij}$ is defined from it.
Both first-order and second-order similarity are now considered and fused into the adjacency matrix extracted from the network structure. $W$ denotes the integrated adjacency matrix, each element of which is composed of the two similarity values,
$$w_{ij} = \lambda\, w^{(1)}_{ij} + \mu\, w^{(2)}_{ij}$$
where $\lambda$ and $\mu$ are normalization coefficients whose values are determined by repeated tuning in experiments.
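A minimal sketch of this integration step follows. It assumes undirected edges, takes the raw common-neighbor count as the second-order similarity, and combines the two terms as a weighted sum with coefficients `lam` and `mu`; all three choices are assumptions made for illustration, since the exact definitions are not recoverable from the text.

```python
def integrate(n, edges, lam, mu):
    """Fuse first- and second-order similarity into one weight matrix.

    n:     number of nodes, labelled 0..n-1.
    edges: iterable of (i, j) pairs, treated as undirected.
    """
    neighbours = [set() for _ in range(n)]
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)

    w = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            first = 1.0 if j in neighbours[i] else 0.0            # direct edge
            second = float(len(neighbours[i] & neighbours[j]))    # common neighbours
            w[i][j] = lam * first + mu * second
    return w
```

On the path 0-1-2-3, nodes 0 and 2 share neighbor 1, so they receive a nonzero weight even though no edge joins them, which is the densifying effect this step is meant to have.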
Correcting the network structure with potential friend relationships. The potential friend relationships extracted from text are first used to correct the network structure, and the LINE model then learns the latent representation of the extended network structure. The extension brings two kinds of change: first, a weight appears where there was none, that is, it changes from 0 to 1; second, an existing weight increases. As shown in Fig. 1, the subgraph of gray nodes is the original network structure, and the colored nodes are at this point isolated nodes, that is, they have no relationship with other nodes in the network. After the network structure is corrected with the potential friend relationships, the newly created dashed edges are new friend relationships extracted from microblog text, and the bold solid edges indicate increased edge weights in the original structure, that is, strengthened friend relationships. Figs. 3 and 4 show the microblog friend relationship topology before and after the correction, respectively.
Let $W''$ be the adjacency matrix of the corrected network, where each element $w''_{ij}$ is obtained by combining the integrated weight $w_{ij}$ with the potential friend weight $w'_{ij}$.
However, some elements of the corrected adjacency matrix are too small, so a threshold is set and all elements below it are deleted. The final corrected adjacency matrix is then used as the input to LINE to compute the low-dimensional representation. LINE first introduces first-order and second-order similarity, learns a representation vector for each node under each of the two, and then describes how to fuse the two vectors into one final node representation.
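The correction and thresholding can be sketched as below. The additive combination of the two matrices is an assumption, as the exact rule for combining them is not given here, and the function and parameter names are hypothetical.

```python
def correct_network(w, w_text, threshold):
    """Combine structural and text-derived weights, then prune small entries.

    w:         integrated structural weight matrix.
    w_text:    potential friend weight matrix of the same shape.
    threshold: entries strictly below this value are set to 0.
    """
    n = len(w)
    corrected = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            value = w[i][j] + w_text[i][j]   # assumed additive fusion
            corrected[i][j] = value if value >= threshold else 0.0
    return corrected
```

Under this rule an existing edge gains weight when the two users also write similar text, while a pair with no edge gets a new one only if their text similarity clears the threshold.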
In essence, first-order similarity is the weight of the edge between a node pair in the network. To model first-order similarity, the LINE model builds an empirical probability directly from the weights, constructs a joint probability from the representation vectors, and uses the K-L divergence to measure the discrepancy between the empirical and joint probabilities, which yields the objective function. A similar objective function can be built for second-order similarity. The node vector representations under the two similarities are each obtained with a negative-sampling optimization algorithm, and the two vectors are finally simply concatenated to obtain the final network representation.
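For reference, the two objectives can be written out as they are usually stated for LINE, here with the corrected weights $w''_{ij}$ as edge weights. The symbols $\vec{u}_i$ (node vector), $\vec{u}'_j$ (context vector) and $E''$ (the edge set of the corrected network) are notation assumed for this sketch and do not come from the text above.

```latex
% First-order: joint probability of an edge and the resulting objective
p_1(v_i, v_j) = \frac{1}{1 + \exp(-\vec{u}_i^{\top} \vec{u}_j)}, \qquad
O_1 = -\sum_{(i,j) \in E''} w''_{ij} \, \log p_1(v_i, v_j)

% Second-order: probability of context v_j given node v_i, and its objective
p_2(v_j \mid v_i) = \frac{\exp(\vec{u}'^{\top}_j \vec{u}_i)}
                         {\sum_{k=1}^{|V|} \exp(\vec{u}'^{\top}_k \vec{u}_i)}, \qquad
O_2 = -\sum_{(i,j) \in E''} w''_{ij} \, \log p_2(v_j \mid v_i)
```

Each objective is what remains of the K-L divergence between the empirical and model distributions once constant terms are dropped; negative sampling avoids evaluating the full sum in the denominator of $p_2$.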
The gender inference task for microblog users can be regarded as a supervised binary classification problem based on user feature representations. An SVM with a linear kernel is used, and the final representation vectors serve as the extracted features for training the gender classifier. The experimental results of the baseline method are shown in Table 1 and those of the method of the invention in Table 2.
Table 1: Experimental results of the gender inference task (baseline method)
Table 2: Experimental results of the gender inference task (method of the invention)
As the tables show, average accuracy improves by about 4 percentage points. Moreover, accuracy increases with the test-set sample size; one explanation is that the more training samples there are, the more accurate the classifier obtained by SVM training.
Age inference is a supervised multi-class classification problem. To infer the age of test samples more accurately, user ages are divided into 4 intervals according to the distribution of birth dates in the user profiles. The statistics show that most users are young people between 18 and 30 years old. User age is then inferred with two schemes for extending SVM to multiple classes, "one-versus-one" and "one-versus-rest". The experimental results are shown in Tables 3 and 4.
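The difference between the two extension schemes lies in how the multi-class problem is broken into binary ones, which the sketch below enumerates without training any classifier. The class labels "A" to "D" are placeholders for the four age intervals, whose boundaries are not given in the text, and the function names are hypothetical.

```python
from itertools import combinations


def one_vs_one_tasks(classes):
    """One binary problem per unordered pair of classes."""
    return list(combinations(classes, 2))


def one_vs_rest_tasks(classes):
    """One binary problem per class: that class against all the others."""
    return [(c, [o for o in classes if o != c]) for c in classes]


def vote(pairwise_winners):
    """Majority vote over one-vs-one winners; ties go to the first-seen class."""
    counts = {}
    for winner in pairwise_winners:
        counts[winner] = counts.get(winner, 0) + 1
    return max(counts, key=counts.get)
```

With four age intervals, one-versus-one trains six pairwise classifiers and predicts by voting, while one-versus-rest trains four and picks the class whose classifier is most confident.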
Table 3: Experimental results of the age inference task (baseline method)
Table 4: Experimental results of the age inference task (method of the invention)
In both tables, the first row of accuracies gives the age inference results of the SVM classifier extended in the "one-versus-one" manner, and the second row gives those of the SVM classifier extended in the "one-versus-rest" manner. The data show that the representation vectors obtained by the enhanced network representation classify much better than those obtained by the baseline scheme; for example, when the corresponding percentage is about 10%, the accuracy of the first extension scheme rises from 69.03% to 76.25%. Fig. 6 shows the comparison curves for age inference, from which it can be seen that the vectors obtained by the enhanced network representation do give better classification results than those of the baseline method.
In summary, addressing the sparsity of real-world online social networks and based on the fact that two users who publish similar posts have a potential friend relationship, we propose an enhanced network representation learning method that fuses node text information. Specifically, the potential friend relationship network extracted from user-generated text is used to correct the original network topology, yielding more accurate node representations. Compared with network representation learning that considers only network topology, accuracy improves markedly on both gender and age inference.
The microblog-based enhanced network representation method proposed by the invention therefore has significant practical value for feature representation of network users and for subsequent classification and inference tasks.
To illustrate the content and implementation of the invention, this specification gives a specific embodiment. The details introduced in the embodiment are not intended to limit the scope of the claims but to help in understanding the method of the invention. Those skilled in the art should understand that various modifications, changes, or substitutions to the steps of the preferred embodiment are possible without departing from the spirit and scope of the invention and its appended claims. The invention should therefore not be limited to what is disclosed in the preferred embodiment and the drawings.

Claims (6)

1. A microblog-based network user enhanced representation method, characterized by comprising the following steps:
Step 1: With reference to existing microblog short-text processing methods, preprocess the user-generated posts to eliminate the influence of noisy data;
Step 2: With reference to relevant natural language processing techniques, generate feature vectors for the preprocessed user post text, compute the similarity between post vectors using a similarity measure function, extract potential friendships based on user-generated text, and build a potential friendship network;
Step 3: Consider the first-order and second-order similarity of the network structure, and integrate the original network structure information to expand the topological relationship network among microblog users;
Step 4: Fuse the potential friendship network extracted from post information into the integrated network topology to correct the original network structure information, using two correction modes: adding some potential edges and strengthening the weights of some existing edges;
Step 5: With reference to existing network representation learning techniques, learn the feature representations of the users in the enhanced microblog network;
Step 6: To compare the effectiveness of the representation vectors of the enhanced network with those of the original network, apply the above representation learning results to gender and age inference tasks, and compare the accuracy of the inference results with baseline methods.
2. The microblog-based network user enhanced representation method according to claim 1, characterized in that, in the social network, each node corresponds to a user and is associated with a large amount of text information representing that user's historical posts. Let G denote the network, with G = (V, E, T), where V = {v_i} is the set of user nodes, E = {(v_i, v_j)} is the set of binary edges, each edge having a weight w ∈ {0, 1}, and T = {t_i} is the set of user-generated posts. The invention captures the feature information of the text from user-generated posts and uses it to correct the original network, so as to learn a low-dimensional representation of each node in the corrected network G''.
3. The microblog-based network user enhanced representation method according to claim 1, characterized in that the microblog short-text preprocessing method in step 2 comprises the following:
(1) Extract the text between two "#" marks and use it directly as a keyword;
(2) Extract the text following "@";
(3) Filter out special characters such as punctuation marks from the original text;
(4) Replace all uncommon words in the text by consulting an uncommon-word table;
(5) Segment the retained microblog text into words using the HanLP word segmentation tool;
(6) Filter out the stop words listed in the stop-word table;
(7) Compute the TF-IDF values of all words and filter out the low-frequency words.
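The preprocessing steps above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions: the names `UNCOMMON_WORDS`, `STOP_WORDS`, `tokenize`, and `tfidf_filter` are hypothetical, a whitespace split stands in for the HanLP segmenter, and reading "low-frequency words" as words with a low TF-IDF score is an interpretation rather than something the claim states.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for the uncommon-word table and stop-word table.
UNCOMMON_WORDS = {"gr8": "great"}
STOP_WORDS = {"the", "a", "is", "to", "of"}

def extract_hashtags(text):
    """Step (1): text enclosed between two '#' marks, used as keywords."""
    return re.findall(r"#([^#]+)#", text)

def extract_mentions(text):
    """Step (2): the word characters that follow each '@'."""
    return re.findall(r"@(\w+)", text)

def tokenize(text):
    """Steps (3) to (6), with a whitespace split standing in for HanLP."""
    text = re.sub(r"[^\w\s]", " ", text)                 # (3) drop punctuation
    tokens = text.lower().split()                        # (5) placeholder segmenter
    tokens = [UNCOMMON_WORDS.get(t, t) for t in tokens]  # (4) applied per token here
    return [t for t in tokens if t not in STOP_WORDS]    # (6) drop stop words

def tfidf_filter(docs, min_score):
    """Step (7): keep tokens whose TF-IDF score reaches min_score (assumed reading)."""
    n_docs = len(docs)
    doc_freq = Counter(t for doc in docs for t in set(doc))
    kept = []
    for doc in docs:
        term_freq = Counter(doc)
        kept.append([
            t for t in doc
            if (term_freq[t] / len(doc)) * math.log(n_docs / doc_freq[t]) >= min_score
        ])
    return kept
```

In a real pipeline the segmenter, the two word tables, and the score cut-off would come from the tools and resources named in the claim; the values here exist only to make the sketch runnable.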
4. The microblog-based network user enhanced representation method according to claim 2, characterized in that the method for extracting potential friendships based on user-generated text in step 2 is as follows:
(1) Generate the feature vectors of user microblog text using the LDA topic model:
LDA is a generative probabilistic model involving three levels: documents, topics, and words. A document is represented as a random mixture of K latent topics, where each topic follows a multinomial distribution over words and each document follows a multinomial distribution over the K topics. For each document in the corpus, the generative process is as follows:
For each document M_i, choose θ ~ Dir(α), where Dir(α) is the Dirichlet distribution with parameter α and θ is a topic vector, each element of which is the probability that the corresponding topic appears in the document;
For the j-th word w_ij in the i-th document, choose a latent topic z_i from the topic vector θ according to the conditional probability p(z_i | θ), then generate the word w_j according to the conditional probability p(w_j | z_i, β).
Given the parameters α and β, the joint distribution of the model is
$$p(\theta, z, w \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{i=1}^{N} p(z_i \mid \theta)\, p(w_j \mid z_i, \beta)$$
where w is the observed variable and θ is the latent variable. The parameters α and β are then learned with the expectation-maximization (EM) algorithm.
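The generative process described above can be illustrated with a short sampling sketch. It only simulates how LDA assumes a document is produced; it does not perform the EM parameter learning. The function names and the layout of `beta` as one word-probability row per topic are assumptions made for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw theta ~ Dir(alpha) by normalising independent Gamma(alpha_k, 1) draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(alpha, beta, vocab, n_words, rng):
    """Generate one document following the LDA generative story.

    alpha: Dirichlet parameter, one positive value per topic.
    beta:  beta[k][v] is the probability of word vocab[v] under topic k.
    """
    theta = sample_dirichlet(alpha, rng)        # topic mixture of this document
    topic_ids = list(range(len(alpha)))
    words = []
    for _ in range(n_words):
        z = rng.choices(topic_ids, weights=theta)[0]      # z ~ p(z | theta)
        words.append(rng.choices(vocab, weights=beta[z])[0])  # w ~ p(w | z, beta)
    return theta, words
```

For example, `generate_document([0.5, 0.5], [[1.0, 0.0, 0.0], [0.0, 0.5, 0.5]], ["a", "b", "c"], 20, random.Random(0))` returns a topic mixture that sums to one together with 20 sampled words.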
Assuming that the top T topics are retained, each user's text is embedded as a vector of topic weights, where w_i is the weight of the i-th topic and represents the likelihood that the text generated by the user belongs to the i-th topic.
(2) Compute the cosine similarity between the post vectors of every pair of users to characterize the weight of the corresponding potential-relationship edge, thereby building the potential friendship network;
Potential friendships are extracted from these representation vectors using cosine similarity. Given the representation vectors of two users v_i and v_j, the weight w'_ij of their potential friendship is defined as the cosine similarity of the two vectors.
The potential adjacency matrix extracted from user-generated text can therefore be described as a matrix W' in which each element w'_ij ∈ [0, 1].
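As a minimal sketch of how the potential adjacency matrix could be built from per-user topic vectors, the following Python computes pairwise cosine similarity. The function names are hypothetical, and leaving the diagonal at zero (no self-loops) is an assumption the claim does not spell out. Because topic weights are non-negative, the resulting values fall in [0, 1].

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def potential_adjacency(topic_vectors):
    """Symmetric matrix of pairwise cosine similarities between user topic vectors."""
    n = len(topic_vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            sim = cosine_similarity(topic_vectors[i], topic_vectors[j])
            matrix[i][j] = matrix[j][i] = sim
    return matrix
```

This brute-force loop is quadratic in the number of users; for a large network some form of candidate pruning would probably be needed, which the claim does not discuss.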
5. The microblog-based network user enhanced representation method according to claim 2, characterized in that the method for integrating the original network structure information in step 3 is as follows:
In a social network, two users with common friends tend to have the same interests and characteristics. Building on such observations, LINE first proposed the concepts of first-order and second-order similarity to fully characterize the local and global information of the network structure.
(1) First-order similarity:
Given the edge set E, for each node pair in it, the weight of the corresponding edge represents the first-order similarity. Each element of the first-order similarity matrix W^1 is defined as
$$w^{1}_{ij} = \begin{cases} 1, & \text{if } (v_i, v_j) \in E \\ 0, & \text{otherwise} \end{cases}$$
(2) Second-order similarity:
The number of common neighbors of any node pair is used to define the second-order similarity, which describes how similar the neighborhood structures of two users are in the social network. Given the neighbor sets of users v_i and v_j, the number of their common friends is computed, and the second-order similarity w^2_ij is defined from this count.
The first-order and second-order similarities are then fused into the adjacency matrix extracted from the network structure. We introduce W to denote the integrated adjacency matrix, each element of which combines the two similarity values:
$$w_{ij} = \lambda \cdot w^{1}_{ij} + \mu \cdot w^{2}_{ij}$$
where λ and μ are normalization coefficients whose values are determined by repeated experimental tuning.
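The integration of first-order and second-order similarity can be sketched as below. Several points are assumptions rather than statements from the claim: edges are treated as undirected, nodes are indexed 0 to n-1, the raw count of common neighbors is used directly as the second-order value (the exact formula is not preserved in the text), and the name `integrate_structure` is hypothetical.

```python
def integrate_structure(n_nodes, edges, lam, mu):
    """Build W with w_ij = lam * w1_ij + mu * w2_ij.

    w1_ij is 1 if (i, j) is an edge and 0 otherwise; w2_ij is taken here to be
    the number of neighbours shared by i and j (an assumed reading).
    """
    neighbours = [set() for _ in range(n_nodes)]
    for i, j in edges:                 # undirected edges assumed
        neighbours[i].add(j)
        neighbours[j].add(i)

    W = [[0.0] * n_nodes for _ in range(n_nodes)]
    for i in range(n_nodes):
        for j in range(n_nodes):
            if i == j:
                continue               # no self-loops
            first_order = 1.0 if j in neighbours[i] else 0.0
            second_order = len(neighbours[i] & neighbours[j])
            W[i][j] = lam * first_order + mu * second_order
    return W
```

On a four-node cycle with `lam=1.0` and `mu=0.5`, two opposite nodes that are not directly connected still receive a non-zero weight because they share two neighbors, which is how the second-order term densifies a sparse network.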
6. The microblog-based network user enhanced representation method according to claim 2, characterized in that the method for correcting the original network structure with potential friendships in step 4 is as follows:
Let W'' be the adjacency matrix of the corrected network, where each element w''_ij is
$$w''_{ij} = \frac{w_{ij} + w'_{ij}}{\max_{i,j}\{w_{ij}, w'_{ij}\}}$$
However, some elements of the corrected adjacency matrix are too small, so a threshold is set and all elements below it are deleted. The final corrected adjacency matrix is then used as the input to LINE to compute the low-dimensional representations of the network's user nodes.
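A minimal sketch of the correction and thresholding step follows. It assumes that the maximum in the denominator is taken over every entry of both matrices, which is one reading of the formula, and that "deleting" an element means setting it to zero. The name `correct_network` is hypothetical, and the LINE training step itself is not shown.

```python
def correct_network(W, W_text, threshold):
    """Fuse structural weights W with text-based weights W_text, then prune.

    Each entry is (w_ij + w'_ij) divided by the largest entry found in either
    matrix (assumed reading of the max); entries below `threshold` become 0.
    """
    n = len(W)
    denom = max(max(max(row) for row in W), max(max(row) for row in W_text))
    corrected = [[0.0] * n for _ in range(n)]
    if denom == 0:
        return corrected               # both matrices are empty of weight
    for i in range(n):
        for j in range(n):
            value = (W[i][j] + W_text[i][j]) / denom
            corrected[i][j] = value if value >= threshold else 0.0
    return corrected
```

With a structural edge only between nodes 0 and 1 and text similarities 0.8, 0.1, and 0.4 for the pairs (0, 1), (0, 2), and (1, 2), a threshold of 0.2 strengthens the existing edge, adds a new edge between nodes 1 and 2, and drops the weak pair (0, 2). These correspond to the two correction modes named in claim 1.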
CN201710283853.4A 2017-04-26 2017-04-26 Network user enhanced representation method based on microblog Active CN107122455B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710283853.4A CN107122455B (en) 2017-04-26 2017-04-26 Network user enhanced representation method based on microblog

Publications (2)

Publication Number Publication Date
CN107122455A true CN107122455A (en) 2017-09-01
CN107122455B CN107122455B (en) 2019-12-31

Family

ID=59724978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710283853.4A Active CN107122455B (en) 2017-04-26 2017-04-26 Network user enhanced representation method based on microblog

Country Status (1)

Country Link
CN (1) CN107122455B (en)

Also Published As

Publication number Publication date
CN107122455B (en) 2019-12-31

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant