CN107292750A - Information collection method and information collection apparatus for social networks - Google Patents
- Publication number: CN107292750A
- Application number: CN201610203819.7A
- Authority: CN (China)
- Prior art keywords: user, active, content, active user, candidate
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/01—Social networking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
Abstract
The invention provides an information collection method and an information collection apparatus for social networks. The invention uses the inherent links between social network users to select the users to be extracted and, according to each user's extraction priority, preferentially extracts the content published by higher-value users, thereby improving the efficiency of information collection on social networks.
Description
Technical field
The present invention relates to the technical field of web content collection, and in particular to an information collection method and an information collection apparatus for social networks.
Background
Social networks have become deeply embedded in daily life and take many forms, such as the various microblog platforms. A microblog is a platform for sharing, propagating and obtaining information based on user relationships. Through WEB, WAP and various other clients, a user can set up a personal community, publish content such as text, pictures or video there to update information, and share it instantly. User A of a microblog platform can follow user B and thereby become a follower of user B, so as to receive user B's information updates in a timely manner.
As a widely used social network, microblogs contain massive amounts of data that are of great significance for many application scenarios. To obtain this data, a web crawler can be used to collect information from microblog websites. However, both the number of microblog users and the amount of data on microblog platforms are huge, and existing information collection methods take too long to collect the data. A scheme that improves the efficiency of information collection is therefore urgently needed.
Summary of the invention
The technical problem to be solved by the embodiments of the present invention is to provide an information collection method and an information collection apparatus for social networks that collect the required information from a social network more accurately and more quickly.
According to one aspect of the embodiments of the present invention, an information collection method for a social network is provided, including:
searching the social network, based on a seed word, for content that includes the seed word, obtaining a first search result and saving it in a collection result database, where the social network includes multiple users, the user sub-network of each user, the content published by each user on its user sub-network, and the inter-user topological relationship between users;
determining the users who published the content in the first search result to obtain first-level users, and adding the first-level users to a candidate user set;
extracting, one by one, the content published by each user in the candidate user set on its user sub-network and saving it in the collection result database until a preset extraction stop condition is reached, where, when the content published by the current user on its user sub-network is extracted, the next-level users of the current user are determined according to the inter-user topological relationship and added to the candidate user set.
According to one aspect of the embodiments of the present invention, in the above information collection method, the step of extracting, one by one, the content published by each user in the candidate user set on its user sub-network and saving it in the collection result database until the preset extraction stop condition is reached includes:
selecting a user from the candidate user set as the current user, where, if a first-level user exists in the candidate user set, a first-level user is selected as the current user; otherwise, the user with the highest extraction priority in the candidate user set is selected as the current user;
extracting the content published by the current user on its user sub-network and saving it in the collection result database;
determining the next-level users of the current user according to the inter-user topological relationship, deleting the current user from the candidate user set, and adding the next-level users to the candidate user set;
judging whether the extraction stop condition is met; if so, ending the procedure; otherwise, returning to the step of selecting a user from the candidate user set as the current user.
According to one aspect of the embodiments of the present invention, in the above information collection method, before the step of deleting the current user from the candidate user set and adding the next-level users to the candidate user set, the method further includes:
extracting the user tags that represent the characteristics of the current user, and calculating a first correlation between the user tags of the current user and the seed word to obtain the tag quality of the current user;
calculating a second correlation between the content published by the current user on its user sub-network and the first search result to obtain the content quality of the current user;
fusing the tag quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.
According to one aspect of the embodiments of the present invention, in the above information collection method, the first correlation is the cosine distance between the word vector corresponding to the user tags and the word vector corresponding to the seed word;
the second correlation is the cosine distance between a first bag-of-words feature vector and a second bag-of-words feature vector, where the first bag-of-words feature vector is built from the content published by the current user on its user sub-network and the second bag-of-words feature vector is built from the first search result.
According to one aspect of the embodiments of the present invention, in the above information collection method, the step of fusing the tag quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user includes: calculating the sum of the tag quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.
According to one aspect of the embodiments of the present invention, in the above information collection method, the step of fusing the tag quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user includes:
obtaining a first fusion component of the current user according to the rank of the current user's tag quality among all users in the candidate user set;
obtaining a second fusion component of the current user according to the rank of the current user's content quality among all users in the candidate user set;
calculating the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the current user.
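The two fusion variants described above can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function names and the dictionary layout are hypothetical, and the mapping from a rank to a fusion component (here `1 / rank`, so that a better rank gives a larger component) is an assumption, because the text only states that each component is obtained from the ranking.

```python
def fuse_by_sum(tag_quality, content_quality):
    """Variant 1: the extraction priority is the sum of the two qualities."""
    return tag_quality + content_quality


def rank_component(scores, user):
    """Fusion component derived from a user's rank among all candidates.

    Assumption: rank 1 is the highest score and the component is 1 / rank.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return 1.0 / (ordered.index(user) + 1)


def fuse_by_rank(tag_scores, content_scores, user):
    """Variant 2: sum of the rank-based components for tag and content quality."""
    return rank_component(tag_scores, user) + rank_component(content_scores, user)


# Hypothetical quality scores for three candidate users.
tag_scores = {"u1": 0.9, "u2": 0.4, "u3": 0.1}
content_scores = {"u1": 0.2, "u2": 0.8, "u3": 0.5}
priority_u2 = fuse_by_rank(tag_scores, content_scores, "u2")
```

Ranking makes the two qualities comparable even when their raw scales differ, which may be one reason to prefer the second variant over the plain sum.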
According to one aspect of the embodiments of the present invention, in the above information collection method, after the next-level users are added to the candidate user set, the method further includes: initialising the tag quality of the next-level users of the current user according to the first correlation; and initialising the content quality of the next-level users of the current user according to the second correlation.
According to one aspect of the embodiments of the present invention, in the above information collection method, the extraction stop condition includes at least one of the following conditions: a predetermined extraction time threshold is reached; a predetermined user-level depth has been extracted; the extraction priority of the current user is lower than a predetermined threshold.
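A stop-condition check along these lines might look like the sketch below. The class name, field names and the choice to evaluate all three conditions together are assumptions made for illustration; the text only requires that at least one of the conditions be part of the stop condition.

```python
import time
from dataclasses import dataclass


@dataclass
class StopCondition:
    max_seconds: float    # predetermined extraction time threshold
    max_depth: int        # predetermined user-level depth
    min_priority: float   # predetermined extraction priority threshold

    def reached(self, started_at, depth, priority, now=None):
        """Return True as soon as any one of the three conditions holds."""
        now = time.monotonic() if now is None else now
        return (
            now - started_at >= self.max_seconds
            or depth >= self.max_depth
            or priority < self.min_priority
        )


# Hypothetical thresholds: 60 seconds, 3 user levels, priority of at least 0.2.
cond = StopCondition(max_seconds=60.0, max_depth=3, min_priority=0.2)
```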
According to another aspect of the present invention, an information collection apparatus for a social network is provided, the information collection apparatus including:
a search unit, configured to search the social network, based on a seed word, for content that includes the seed word, obtain a first search result and save it in a collection result database, where the social network includes multiple users, the user sub-network of each user, the content published by each user on its user sub-network, and the inter-user topological relationship between users;
a candidate user generation unit, configured to determine the users who published the content in the first search result to obtain first-level users, and add the first-level users to a candidate user set;
an extraction unit, configured to extract, one by one, the content published by each user in the candidate user set on its user sub-network and save it in the collection result database until a preset extraction stop condition is reached, where, when the content published by the current user on its user sub-network is extracted, the next-level users of the current user are determined according to the inter-user topological relationship and added to the candidate user set.
According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the extraction unit includes:
a selecting unit, configured to select a user from the candidate user set as the current user, where, if a first-level user exists in the candidate user set, a first-level user is selected as the current user; otherwise, the user with the highest extraction priority in the candidate user set is selected as the current user;
a collecting unit, configured to extract the content published by the current user on its user sub-network and save it in the collection result database;
an updating unit, configured to determine the next-level users of the current user according to the inter-user topological relationship, delete the current user from the candidate user set, and add the next-level users to the candidate user set;
a judging unit, configured to judge whether the extraction stop condition is met; if so, content extraction ends; otherwise, the selecting unit is triggered again.
According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the updating unit includes:
a determining unit, configured to determine the next-level users of the current user according to the inter-user topological relationship;
a priority calculation unit, configured to extract the user tags that represent the characteristics of the current user and calculate a first correlation between the user tags of the current user and the seed word to obtain the tag quality of the current user; calculate a second correlation between the content published by the current user on its user sub-network and the first search result to obtain the content quality of the current user; and fuse the tag quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user;
a candidate user set maintenance unit, configured to delete the current user from the candidate user set and add the next-level users to the candidate user set after the priority calculation unit has obtained the extraction priority of the next-level users of the current user.
According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the priority calculation unit is specifically configured to calculate the sum of the tag quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.
According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the priority calculation unit is specifically configured to obtain a first fusion component of the current user according to the rank of the current user's tag quality among all users in the candidate user set; obtain a second fusion component of the current user according to the rank of the current user's content quality among all users in the candidate user set; and calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the current user.
According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the updating unit further includes:
an initialisation unit, configured to, after the candidate user set maintenance unit has added the next-level users of the current user to the candidate user set, initialise the tag quality of the next-level users of the current user according to the first correlation, and initialise the content quality of the next-level users of the current user according to the second correlation.
Compared with the prior art, the information collection method and information collection apparatus for social networks provided by the embodiments of the present invention use the inherent links between social network users to select the users to be extracted and add them to a candidate user set, and, according to the users' extraction priorities, preferentially extract the content published by the higher-value users in the candidate user set, which can improve the efficiency of information collection on social networks.
Brief description of the drawings
Fig. 1 is a schematic flowchart of an information collection method for a social network provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of user-by-user content extraction in an embodiment of the present invention;
Fig. 3 is a schematic flowchart of calculating the extraction priority in an embodiment of the present invention;
Fig. 4 is a functional structure diagram of an information collection apparatus provided by an embodiment of the present invention;
Fig. 5 is a functional structure diagram of the extraction unit of the information collection apparatus provided by an embodiment of the present invention;
Fig. 6 is a functional structure diagram of the updating unit of the information collection apparatus provided by an embodiment of the present invention;
Fig. 7 is a hardware structure diagram of an information collection apparatus provided by an embodiment of the present invention.
Detailed description
To make the technical problem to be solved, the technical solutions and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as particular configurations and components are provided only to aid a full understanding of the embodiments of the present invention. It will therefore be apparent to those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and brevity.
It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present invention. Accordingly, occurrences of "in one embodiment" or "in an embodiment" throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures or characteristics can be combined in any suitable manner in one or more embodiments.
In the various embodiments of the present invention, it should be understood that the sequence numbers of the processes described do not imply an order of execution. The order of execution of the processes should be determined by their functions and internal logic, and the sequence numbers should not limit the implementation of the embodiments of the present invention in any way.
It should be understood that the term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible. For example, "A and/or B" can mean: A alone, both A and B, or B alone.
A social network generally includes the following elements:
Users, which can be represented, for example, by user identifiers;
The user sub-networks of users, such as the personal community a user sets up on a microblog; the user sub-network of a given user can generally be represented by a Uniform Resource Locator (URL);
The content a user publishes on its user sub-network, which may take the form of text, pictures, video or audio. This document mainly uses the collection of text content as an example.
The inter-user topological relationship, which is established between users by following, such as the follower relationship on a microblog platform. When a second user follows a first user on the social network, a connection from the first user to the second user is formed in the inter-user topological relationship, and this connection is directional. For ease of description, the connection from the first user to the second user is expressed herein as: the next-level users of the first user include the second user.
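The directional relationship described above can be modelled with a small adjacency structure. The sketch below is illustrative only; the class and method names are hypothetical and an in-memory dictionary stands in for whatever storage a real collector would use.

```python
from collections import defaultdict


class FollowGraph:
    """Directed inter-user topology: an edge runs from a user to each follower."""

    def __init__(self):
        self._followers = defaultdict(set)

    def follow(self, follower, followee):
        """Record that `follower` follows `followee` (edge: followee -> follower)."""
        self._followers[followee].add(follower)

    def next_level_users(self, user):
        """Next-level users of `user`, that is, the users who follow `user`."""
        return set(self._followers.get(user, ()))


graph = FollowGraph()
graph.follow(follower="second_user", followee="first_user")
```

Because the edge is directional, `first_user` has `second_user` as a next-level user, but not the other way round.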
The embodiment of the present invention collects text content on the social network described above. During information collection, it takes into account the inter-user topological relationship of the social network and the links between content and users to determine which user to collect next, so that the required information can be collected faster and more accurately.
Referring to Fig. 1, an information collection method for a social network provided by an embodiment of the present invention includes the following steps:
Step 11: based on a seed word, search the social network for content that includes the seed word, obtain a first search result and save it in a collection result database.
In the embodiment of the present invention, the social network generally includes multiple users, the user sub-network of each user, the content published by each user on its user sub-network, and the inter-user topological relationship between users. Content in this embodiment refers to text content in one or more natural languages, such as Chinese text or text in other languages. The seed word generally includes at least one search keyword and the logical relationship between search keywords, such as an AND or an OR relationship. A search keyword can be a word specified by the user to indicate the topic of the search, so that the required content can be found on the social network.
In the embodiment of the present invention, the search can be performed with any of various existing search algorithms to obtain the content that contains the seed word.
Step 12: determine the users who published the content in the first search result to obtain first-level users, and add the first-level users to a candidate user set.
Because the content published by first-level users directly contains the seed word, first-level users are higher-value users, and the content they publish is likely to contain information highly correlated with the seed word. Therefore, on obtaining the first search result, the embodiment of the present invention can further extract the users who published the content in the first search result and add them to the candidate user set.
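Steps 11 and 12 can be sketched on in-memory data as follows. The function names, the `(user, text)` tuple layout and the use of plain substring matching are assumptions for illustration; the text leaves the search algorithm open.

```python
def matches_seed(text, keywords, mode="or"):
    """Return True if `text` satisfies the seed word.

    `keywords` are the search keywords and `mode` is the logical relationship
    between them ("and" or "or"). Substring matching is an assumption.
    """
    hits = [keyword in text for keyword in keywords]
    return all(hits) if mode == "and" else any(hits)


def search(posts, keywords, mode="or"):
    """Step 11: collect every (user, text) post that contains the seed word."""
    return [(user, text) for user, text in posts
            if matches_seed(text, keywords, mode)]


def first_level_users(first_search_result):
    """Step 12: the publishers of the content in the first search result."""
    return {user for user, _ in first_search_result}


# Hypothetical posts on a social network.
posts = [
    ("a", "machine learning news"),
    ("b", "cooking tips"),
    ("c", "deep learning and machine vision"),
]
result = search(posts, ["machine", "learning"], mode="and")
candidates = first_level_users(result)
```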
Step 13: extract, one by one, the content published by each user in the candidate user set on its user sub-network and save it in the collection result database until a preset extraction stop condition is reached, where, when the content published by the current user on its user sub-network is extracted, the next-level users of the current user are determined according to the inter-user topological relationship and added to the candidate user set.
Here, the embodiment of the present invention performs content extraction for each user in the candidate user set one by one, extracting the content the user published on its user sub-network, for example all of the content or the content whose publication time falls within a preset time period, and saving the extracted content in the collection result database. During the content extraction for the current user (specifically, before, during or after the extraction), the next-level users of the current user, that is, the users who follow the current user, are further extracted according to the inter-user topological relationship and added to the candidate user set. For example, the followers of the current user can be extracted according to the follower relationship of the microblog platform and added to the candidate user set.
In the above method, a search is first performed based on the seed word to obtain the first search result and the first-level users; the content published by the first-level users and by their lower-level users is then further extracted. The content published by these users generally has a high correlation with the seed word. When the predetermined extraction stop condition is reached, the extraction process stops, and the content in the collection result database at that point is the information collected by this collection process. Here, the extraction stop condition can be any one or more of the following: a predetermined extraction time threshold is reached; a predetermined user-level depth has been extracted; the extraction priority of the current user is lower than a predetermined threshold.
With the above method, the embodiment of the present invention can directly collect content that contains the seed word and can also collect content that is highly correlated with the seed word, which improves the accuracy of information collection, reduces the time required for collection, and improves the efficiency of information collection.
In step 13 above, when the content published by each user in the candidate user set is extracted one by one, a corresponding extraction priority can be set according to the importance of each user. The content published by first-level users directly contains the seed word, so these users have a higher priority. As one implementation, in step 13 the content of the first-level users can be collected first, and content extraction is then performed one by one according to the extraction priority of the other users. In this case, as shown in Fig. 2, step 13 can specifically include:
Step 131: select a user from the candidate user set as the current user, where, if a first-level user exists in the candidate user set, a first-level user is selected as the current user; otherwise, the user with the highest extraction priority in the candidate user set is selected as the current user;
Step 132: extract the content published by the current user on its user sub-network and save it in the collection result database;
Step 133: determine the next-level users of the current user according to the inter-user topological relationship, delete the current user from the candidate user set, and add the next-level users to the candidate user set;
Step 134: judge whether the extraction stop condition is met; if so, end the procedure; otherwise, return to step 131.
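The loop formed by steps 131 to 134 can be sketched as below. This is a simplified illustration on in-memory data, not the patent's implementation: the function and parameter names are hypothetical, a limit on the number of extracted users stands in for the extraction stop condition, and skipping users that were already seen is an added assumption to keep the loop from revisiting them.

```python
def collect(first_level, posts_by_user, followers, priority_for_children, max_users):
    """Run steps 131-134 and return (extraction order, collected content).

    first_level: list of first-level users, extracted before all others
    posts_by_user: dict mapping a user to the texts they published
    followers: dict mapping a user to their next-level users
    priority_for_children: callable giving the priority assigned to a user's
        next-level users
    max_users: stand-in stop condition (number of users to extract)
    """
    pending_first = list(first_level)
    candidates = {}                  # non-first-level user -> extraction priority
    seen = set(first_level)
    order, collected = [], []
    # Step 134 is the loop condition: stop when the limit is hit or nobody is left.
    while (pending_first or candidates) and len(order) < max_users:
        # Step 131: first-level users first, then the highest extraction priority.
        if pending_first:
            current = pending_first.pop(0)
        else:
            current = max(candidates, key=candidates.get)
            del candidates[current]  # the current user leaves the candidate set
        # Step 132: extract and save the current user's content.
        collected.extend(posts_by_user.get(current, []))
        order.append(current)
        # Step 133: add the next-level users with the priority derived from current.
        child_priority = priority_for_children(current)
        for child in followers.get(current, []):
            if child not in seen:
                seen.add(child)
                candidates[child] = child_priority
    return order, collected


# Hypothetical network: a and b are first-level users.
followers = {"a": ["c", "d"], "b": ["e"], "c": ["f"]}
posts_by_user = {u: ["post by " + u] for u in "abcdef"}
scores = {"a": 0.3, "b": 0.9, "c": 0.5}
order, collected = collect(["a", "b"], posts_by_user, followers,
                           lambda user: scores.get(user, 0.0), max_users=10)
```

In this example the follower of `b` is extracted before the followers of `a`, because `b` passes on a higher priority.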
For users in the candidate user set other than first-level users, the embodiment of the present invention can determine the order of extraction based on the extraction priority and preferentially extract the content of higher-value users.
In a social network, users connected in the topology generally have a high correlation. For example, on a microblog platform, if user A is a follower of user B, user A is likely to share interests with user B, so the microblogs they publish are more likely to concern the same topics. The embodiment of the present invention therefore calculates a quality score for the current user from the content the current user publishes and from its user tags, and uses that quality score as the extraction priority of the current user's next-level users. The calculation can be performed in step 133 above; in this case, as shown in Fig. 3, step 133 specifically includes:
Step 1331: determine the next-level users of the current user according to the inter-user topological relationship.
Step 1332: extract the user tags that represent the characteristics of the current user and calculate a first correlation between the user tags of the current user and the seed word to obtain the tag quality of the current user; calculate a second correlation between the content published by the current user on its user sub-network and the first search result to obtain the content quality of the current user.
User tags are tags that users set for themselves and generally represent the characteristics of the user, such as interests, geographical location or age. The social network may also set user tags for a user according to the user's behavioural characteristics. User tags are usually also in text form and shown on the user's sub-network, so this embodiment can extract them from the user sub-network. If the extracted user tags are empty, the first correlation can be set to 0; if the content published by the current user is empty, the second correlation can be set to 0.
Here, the first correlation can be characterised by the cosine distance between the word vector corresponding to the user tags and the word vector corresponding to the seed word; the larger this cosine value, the higher the first correlation. A calculation formula provided by the embodiment of the present invention is as follows:
In the above formula, uq is the tag quality of the current user: the higher the first correlation, the higher the tag quality score and the better the tag quality. label_vec is the word vector corresponding to the user tags of the current user U; label_vec1, label_vec2, label_vec3, …, label_vecn are the feature vectors corresponding to the individual user tags of the current user U, where n is the number of user tags of the current user U. seed_vec is the word vector of the seed word; if the seed word includes multiple search keywords, seed_vec is the sum of the word vectors of the search keywords. label_vec and seed_vec can be obtained with a prior-art word2vec model, which can be trained in advance on a corpus of social network content (such as a microblog corpus).
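The formula itself does not appear in this text. From the symbol definitions given, one consistent reading is the cosine between the tag vector and the seed vector, written out below. This is a reconstruction under assumptions rather than the original formula: in particular, combining the n tag vectors by summation mirrors how seed_vec is built from its search keywords, but the text does not state how label_vec is formed.

```latex
uq = \cos(label\_vec,\ seed\_vec)
   = \frac{label\_vec \cdot seed\_vec}
          {\lVert label\_vec \rVert \, \lVert seed\_vec \rVert},
\qquad
label\_vec = \sum_{i=1}^{n} label\_vec_i
```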
Here, the second correlation can be characterised by the cosine distance between the first bag-of-words feature vector and the second bag-of-words feature vector; the larger this cosine value, the higher the second correlation. The first bag-of-words feature vector is built from the content published by the current user on its user sub-network, and the second bag-of-words feature vector is built from the first search result. A calculation formula for the second correlation provided by the embodiment of the present invention is as follows:
In above-mentioned formula, mq is the content quality of active user, it is clear that the second correlation is higher, the score of content quality
It is higher, represent that content quality is also better;B_vec is the first bag of words characteristic vector;R_vec is the second bag of words characteristic vector.
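As an illustration of the content quality computation, the sketch below builds the two bag-of-words vectors and compares them by cosine similarity. It is a sketch under stated assumptions, not the formula of the embodiment: the function names are hypothetical, and whitespace tokenization is a simplification (microblog text in Chinese would need a word segmenter).

```python
import math
from collections import Counter

def bag_of_words(texts):
    """Count lower-cased, whitespace-separated tokens over a list of texts."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts

def content_quality(user_posts, search_results):
    """Return mq, the similarity between B_vec (user posts) and R_vec (first search result)."""
    if not user_posts:
        # Empty published content: the second correlation is defined as 0.
        return 0.0
    b_vec = bag_of_words(user_posts)
    r_vec = bag_of_words(search_results)
    # A Counter returns 0 for tokens it has not seen, so missing tokens add nothing.
    dot = sum(count * r_vec[token] for token, count in b_vec.items())
    norm_b = math.sqrt(sum(c * c for c in b_vec.values()))
    norm_r = math.sqrt(sum(c * c for c in r_vec.values()))
    if norm_b == 0.0 or norm_r == 0.0:
        return 0.0
    return dot / (norm_b * norm_r)
```

Because both vectors hold non-negative counts, the result lies between 0 (no shared tokens) and 1 (identical token distribution).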
Step 1333: fuse the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user. Here, the extraction priority is positively correlated with both the label quality and the content quality of the current user: the better the label quality of the current user, the higher the extraction priority; the better the content quality of the current user, the higher the extraction priority.
Step 1334: delete the current user from the candidate user set, and add the next-level users to the candidate user set.
When the next-level users of the current user are added to the candidate user set, the embodiment of the present invention may also initialize the label quality of the next-level users of the current user according to the first correlation, and initialize the content quality of the next-level users of the current user according to the second correlation. That is, the label quality of the next-level users of the current user is initialized to the value of the first correlation of the current user, and the content quality of the next-level users of the current user is initialized to the value of the second correlation of the current user.
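The set update and initialization just described can be sketched as below. The Candidate record, the dictionary keyed by user id, and the choice not to overwrite a user who is already queued are all assumptions made for illustration; the text does not specify a data structure.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    user_id: str
    level: int                    # 1 for first-level users
    label_quality: float = 0.0
    content_quality: float = 0.0

def add_next_level(candidates, current, next_level_ids, first_corr, second_corr):
    """Delete the current user and queue its next-level users.

    candidates:  dict mapping user id to Candidate (the candidate user set).
    first_corr:  first correlation of the current user (tags vs. seed words).
    second_corr: second correlation of the current user (posts vs. search result).
    """
    candidates.pop(current.user_id, None)
    for user_id in next_level_ids:
        if user_id in candidates:
            continue  # assumption: keep the entry of an already queued user
        candidates[user_id] = Candidate(
            user_id=user_id,
            level=current.level + 1,
            label_quality=first_corr,      # inherited from the current user
            content_quality=second_corr,   # inherited from the current user
        )
    return candidates
```

Inheriting the parent's correlations gives a newly discovered user a usable ranking position before any of its own tags or posts have been examined.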
In step 1333 above, the label quality and the content quality of the current user are fused to obtain the extraction priority of the next-level users. In this way, even if the user tags of the current user are empty or the published content is empty, the extraction priority can still be calculated from the other quality factor.
One implementation of the above fusion provided by the embodiment of the present invention is to directly calculate the sum of the label quality and the content quality of the current user, and to use this sum as the extraction priority of the next-level users of the current user.
Another implementation of the above fusion provided by the embodiment of the present invention can proceed as follows:
1) According to the ranking of the current user's label quality among all users in the candidate user set, obtain the first fusion component of the current user. Here, it is preset that the value of the first fusion component corresponding to a user with better label quality is not less than the value of the first fusion component corresponding to a user with poorer label quality.
2) According to the ranking of the current user's content quality among all users in the candidate user set, obtain the second fusion component of the current user. Here, it is preset that the value of the second fusion component corresponding to a user with better content quality is not less than the value of the second fusion component corresponding to a user with poorer content quality.
3) Calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the current user.
With the above calculation, the larger the sum, the higher the extraction priority. A specific fusion example is given below:
Assume it is predefined that users whose label quality ranks in the top M of the candidate user set have a first fusion component of x, and users ranked after the M-th (not including the M-th) have a first fusion component of 0; users whose content quality ranks in the top L of the candidate user set have a second fusion component of y, and users ranked after the L-th (not including the L-th) have a second fusion component of 0. During fusion, the values of the first and second fusion components are determined from the rankings of the current user's label quality and content quality in the candidate user set; the sum of the first fusion component and the second fusion component is then calculated to obtain the extraction priority of the next-level users of the current user.
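The top-M / top-L example above can be sketched as follows. The function names and the dictionaries mapping user ids to quality scores are hypothetical, and ties are broken by insertion order, which is an assumption the text does not address.

```python
def rank_component(scores, user_id, top_n, value):
    """Return `value` if user_id ranks within the top_n scores (descending), else 0.0."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return value if user_id in ranked[:top_n] else 0.0

def fused_priority(label_q, content_q, user_id, top_m, top_l, x, y):
    """Extraction priority for the next-level users of `user_id`.

    label_q / content_q: dicts mapping every user in the candidate set
    to its label quality / content quality.
    """
    first = rank_component(label_q, user_id, top_m, x)     # first fusion component
    second = rank_component(content_q, user_id, top_l, y)  # second fusion component
    return first + second
```

With M = 1, L = 2, x = 2.0 and y = 1.0, a user ranked first by label quality but outside the top two by content quality receives a priority of 2.0, while a user inside the top two by content quality only receives 1.0.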
The above is only one example of fusion. When the opposite fusion scheme is used, for example when it is preset that the value of the first fusion component corresponding to a user with better label quality is not greater than that corresponding to a user with poorer label quality, and the value of the second fusion component corresponding to a user with better content quality is not greater than that corresponding to a user with poorer content quality, then the smaller the sum, the higher the extraction priority.
As can be seen from the above, the method of the embodiment of the present invention uses the inherent connections between social network users to select the users to be extracted and, according to the extraction priority of the users, preferentially extracts the content published by users of higher value, so that content collection is carried out more efficiently.
Referring to Fig. 4, the embodiment of the present invention further provides a schematic functional structure diagram of an information collection apparatus for a social network. As shown in Fig. 4, the information collection apparatus 40 includes:
A search unit 41, configured to search the social network, based on seed words, for content containing the seed words, obtain a first search result, and save it to a collection result database;
A candidate user generation unit 42, configured to determine the users who published the content in the first search result, obtain first-level users, and add the first-level users to a candidate user set;
An extraction unit 43, configured to extract, one by one, the content published by each user in the candidate user set on its user subnet and save it to the collection result database until a preset extraction stop condition is reached, wherein, when the content published by the current user on its user subnet is extracted, the next-level users of the current user are determined according to the topological relationships between users and are added to the candidate user set.
Referring to Fig. 5, according to one aspect of the embodiment of the present invention, in the above information collection apparatus 40, the extraction unit 43 includes:
A selecting unit 431, configured to select a user from the candidate user set as the current user, wherein, when first-level users exist in the candidate user set, a first-level user is selected as the current user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the current user;
A collector unit 432, configured to extract the content published by the current user on its user subnet and save it to the collection result database;
An updating unit 433, configured to determine the next-level users of the current user according to the topological relationships between users, delete the current user from the candidate user set, and add the next-level users to the candidate user set;
A judging unit 434, configured to judge whether the extraction stop condition is met; if so, content extraction ends; otherwise, the selecting unit 431 is triggered again.
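The cooperation of the selecting, collecting, updating, and judging units can be sketched as a single loop. This is a simplified illustration under stated assumptions: fetch_posts, next_level_of, and priority_of are hypothetical callables standing in for the social network access and the priority calculation, and max_users is a hypothetical stand-in for the extraction stop condition.

```python
def collect(first_level_users, fetch_posts, next_level_of, priority_of, max_users):
    """Collect posts, visiting first-level users first, then users by descending priority."""
    candidates = {user: 1 for user in first_level_users}  # user id -> level
    priorities = {}   # user id -> extraction priority inherited from its parent
    visited = set()
    collected = []    # stands in for the collection result database

    # Judging step: stop when no candidates remain or the limit is reached.
    while candidates and len(visited) < max_users:
        # Selecting step: first-level users take precedence.
        first_level = [u for u, level in candidates.items() if level == 1]
        if first_level:
            current = first_level[0]
        else:
            current = max(candidates, key=lambda u: priorities.get(u, 0.0))

        # Collecting step: extract the current user's posts.
        level = candidates.pop(current)
        visited.add(current)
        collected.extend(fetch_posts(current))

        # Updating step: queue the next-level users with the current user's priority.
        priority = priority_of(current)
        for nxt in next_level_of(current):
            if nxt not in visited and nxt not in candidates:
                candidates[nxt] = level + 1
                priorities[nxt] = priority
    return collected
```

Skipping users that were already visited or queued is an added safeguard against cycles in the user graph; the text does not state how repeated users are handled.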
Referring to Fig. 5, according to another aspect of the embodiment of the present invention, in the above information collection apparatus 40, the updating unit 433 includes:
A determining unit 4331, configured to determine the next-level users of the current user according to the topological relationships between users;
A priority calculation unit 4332, configured to extract the user tags used to represent the characteristics of the current user, and calculate the first correlation between the user tags of the current user and the seed words to obtain the label quality of the current user; calculate the second correlation between the content published by the current user on its user subnet and the first search result to obtain the content quality of the current user; and fuse the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user;
A candidate user set maintenance unit 4333, configured to delete the current user from the candidate user set and add the next-level users to the candidate user set after the priority calculation unit obtains the extraction priority of the next-level users of the current user;
An initialization unit 4334, configured to initialize, after the candidate user set maintenance unit 4333 adds the next-level users of the current user to the candidate user set, the label quality of the next-level users of the current user according to the first correlation, and the content quality of the next-level users of the current user according to the second correlation.
In the embodiment of the present invention, the priority calculation unit 4332 can calculate the extraction priority of the next-level users of the current user in several ways. One possible way is: the priority calculation unit 4332 is specifically configured to calculate the sum of the label quality and the content quality of the current user and use this sum as the extraction priority of the next-level users of the current user. Another possible way is: the priority calculation unit 4332 is specifically configured to obtain the first fusion component of the current user according to the ranking of the current user's label quality among all users in the candidate user set; obtain the second fusion component of the current user according to the ranking of the current user's content quality among all users in the candidate user set; and calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the current user.
Fig. 7 shows a schematic hardware structure diagram of the information collection apparatus of the embodiment of the present invention. The information collection apparatus can be deployed in a computer system 70, and the computer system 70 includes:
A processor 71, a RAM 72, a ROM 73, a hard disk 74, an input device 75, a display device 76, and a bus architecture 77 that connects the above devices.
Here, the input device 75 can include a mouse, a keyboard, and various handwriting input or touch input devices; the display device 76 includes various displays, projection devices, and the like. The bus architecture 77 can include any number of interconnected buses and bridges, which link together various circuits of the one or more processors represented by the processor 71 and the one or more memories represented by the RAM 72 and the ROM 73. Intermediate results of the computation of the processor 71 can be stored in the RAM 72, and the data of the finally obtained collection result database can be stored in the hard disk 74. The bus architecture 77 can also link together various other circuits such as peripheral devices, voltage regulators, and power management circuits, all of which are well known in the art and are therefore not described further here.
When the processor 71 calls and executes the programs and data stored in the RAM 72 and/or the ROM 73, the following functional modules can be implemented:
A search unit, configured to search the social network, based on seed words, for content containing the seed words, obtain a first search result, and save it to a collection result database;
A candidate user generation unit, configured to determine the users who published the content in the first search result, obtain first-level users, and add the first-level users to a candidate user set;
An extraction unit, configured to extract, one by one, the content published by each user in the candidate user set on its user subnet and save it to the collection result database until a preset extraction stop condition is reached, wherein, when the content published by the current user on its user subnet is extracted, the next-level users of the current user are determined according to the topological relationships between users and are added to the candidate user set.
In summary, the information collection method and apparatus provided by the embodiment of the present invention use the inherent connections between social network users to select the users to be extracted and, according to the extraction priority of the users, preferentially extract the content published by users of higher value, thereby improving the efficiency of information collection on the social network.
The above describes preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (14)
1. An information collection method for a social network, characterized by including:
Based on seed words, searching the social network for content containing the seed words, obtaining a first search result, and saving it to a collection result database, wherein the social network includes multiple users, a user subnet of each user, the content published by each user on its user subnet, and the topological relationships between users;
Determining the users who published the content in the first search result, obtaining first-level users, and adding the first-level users to a candidate user set;
Extracting, one by one, the content published by each user in the candidate user set on its user subnet and saving it to the collection result database until a preset extraction stop condition is reached, wherein, when the content published by the current user on its user subnet is extracted, the next-level users of the current user are determined according to the topological relationships between users and are added to the candidate user set.
2. The information collection method as claimed in claim 1, characterized in that the step of extracting, one by one, the content published by each user in the candidate user set on its user subnet and saving it to the collection result database until a preset extraction stop condition is reached includes:
Selecting a user from the candidate user set as the current user, wherein, when first-level users exist in the candidate user set, a first-level user is selected as the current user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the current user;
Extracting the content published by the current user on its user subnet and saving it to the collection result database;
Determining the next-level users of the current user according to the topological relationships between users, deleting the current user from the candidate user set, and adding the next-level users to the candidate user set;
Judging whether the extraction stop condition is met; if so, ending the flow; otherwise, returning to the step of selecting a user from the candidate user set as the current user.
3. The information collection method as claimed in claim 2, characterized in that, before the step of deleting the current user from the candidate user set and adding the next-level users to the candidate user set, the method further includes:
Extracting the user tags used to represent the characteristics of the current user, and calculating the first correlation between the user tags of the current user and the seed words to obtain the label quality of the current user;
Calculating the second correlation between the content published by the current user on its user subnet and the first search result to obtain the content quality of the current user;
Fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.
4. The information collection method as claimed in claim 3, characterized in that
The first correlation is the cosine distance between the word vector corresponding to the user tags and the word vector corresponding to the seed words;
The second correlation is the cosine distance between a first bag-of-words feature vector and a second bag-of-words feature vector, wherein the first bag-of-words feature vector is built from the content published by the current user on its user subnet, and the second bag-of-words feature vector is built from the first search result.
5. The information collection method as claimed in claim 3, characterized in that the step of fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user includes: calculating the sum of the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.
6. The information collection method as claimed in claim 3, characterized in that the step of fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user includes:
Obtaining the first fusion component of the current user according to the ranking of the current user's label quality among all users in the candidate user set;
Obtaining the second fusion component of the current user according to the ranking of the current user's content quality among all users in the candidate user set;
Calculating the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the current user.
7. The information collection method as claimed in claim 3, characterized in that
After the next-level users are added to the candidate user set, the method further includes: initializing the label quality of the next-level users of the current user according to the first correlation; and initializing the content quality of the next-level users of the current user according to the second correlation.
8. The information collection method as claimed in claim 1, characterized in that the extraction stop condition includes at least one of the following conditions: a predetermined extraction time threshold is reached; a predetermined user level depth has been extracted; the extraction priority of the current user is lower than a predetermined threshold.
9. An information collection apparatus for a social network, characterized in that the information collection apparatus includes:
A search unit, configured to search the social network, based on seed words, for content containing the seed words, obtain a first search result, and save it to a collection result database, wherein the social network includes multiple users, a user subnet of each user, the content published by each user on its user subnet, and the topological relationships between users;
A candidate user generation unit, configured to determine the users who published the content in the first search result, obtain first-level users, and add the first-level users to a candidate user set;
An extraction unit, configured to extract, one by one, the content published by each user in the candidate user set on its user subnet and save it to the collection result database until a preset extraction stop condition is reached, wherein, when the content published by the current user on its user subnet is extracted, the next-level users of the current user are determined according to the topological relationships between users and are added to the candidate user set.
10. The information collection apparatus as claimed in claim 9, characterized in that the extraction unit includes:
A selecting unit, configured to select a user from the candidate user set as the current user, wherein, when first-level users exist in the candidate user set, a first-level user is selected as the current user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the current user;
A collector unit, configured to extract the content published by the current user on its user subnet and save it to the collection result database;
An updating unit, configured to determine the next-level users of the current user according to the topological relationships between users, delete the current user from the candidate user set, and add the next-level users to the candidate user set;
A judging unit, configured to judge whether the extraction stop condition is met; if so, content extraction ends; otherwise, the selecting unit is triggered again.
11. The information collection apparatus as claimed in claim 10, characterized in that the updating unit includes:
A determining unit, configured to determine the next-level users of the current user according to the topological relationships between users;
A priority calculation unit, configured to extract the user tags used to represent the characteristics of the current user, and calculate the first correlation between the user tags of the current user and the seed words to obtain the label quality of the current user; calculate the second correlation between the content published by the current user on its user subnet and the first search result to obtain the content quality of the current user; and fuse the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user;
A candidate user set maintenance unit, configured to delete the current user from the candidate user set and add the next-level users to the candidate user set after the priority calculation unit obtains the extraction priority of the next-level users of the current user.
12. The information collection apparatus as claimed in claim 11, characterized in that
The priority calculation unit is specifically configured to calculate the sum of the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.
13. The information collection apparatus as claimed in claim 11, characterized in that
The priority calculation unit is specifically configured to obtain the first fusion component of the current user according to the ranking of the current user's label quality among all users in the candidate user set; obtain the second fusion component of the current user according to the ranking of the current user's content quality among all users in the candidate user set; and calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the current user.
14. The information collection apparatus as claimed in claim 11, characterized in that the updating unit further includes:
An initialization unit, configured to initialize, after the candidate user set maintenance unit adds the next-level users of the current user to the candidate user set, the label quality of the next-level users of the current user according to the first correlation, and the content quality of the next-level users of the current user according to the second correlation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610203819.7A CN107292750B (en) | 2016-04-01 | 2016-04-01 | Information collection method and information collection device for social network |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610203819.7A CN107292750B (en) | 2016-04-01 | 2016-04-01 | Information collection method and information collection device for social network |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107292750A (en) | 2017-10-24 |
CN107292750B CN107292750B (en) | 2020-08-18 |
Family
ID=60086931
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610203819.7A Active CN107292750B (en) | 2016-04-01 | 2016-04-01 | Information collection method and information collection device for social network |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107292750B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113946731A (en) * | 2021-10-28 | 2022-01-18 | 行吟信息科技(上海)有限公司 | Method, system, device and medium for positioning user-generated content |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102402582A (en) * | 2010-09-30 | 2012-04-04 | 微软公司 | Providing associations between objects and individuals associated with relevant media items |
CN102902696A (en) * | 2011-07-29 | 2013-01-30 | 国际商业机器公司 | Method and equipment for managing content of social network |
CN103309957A (en) * | 2013-05-28 | 2013-09-18 | 华东师范大学 | Social network expert locating method introducing levy flight |
CN103631898A (en) * | 2013-11-19 | 2014-03-12 | 西安电子科技大学 | Multimedia social network reputation value calculating method based on strong and weak contact feedback |
CN103914491A (en) * | 2013-01-09 | 2014-07-09 | 腾讯科技(北京)有限公司 | Data excavating method and system for high quality user generation content (UGC) |
CN105302844A (en) * | 2014-08-01 | 2016-02-03 | 腾讯科技(深圳)有限公司 | Internet monitoring method, device and system |
Non-Patent Citations (1)
Title |
---|
Deng Xiawei: "Research on User Behavior Based on Social Networks", China Master's Theses Full-text Database, Information Science and Technology * |
Also Published As
Publication number | Publication date |
---|---|
CN107292750B (en) | 2020-08-18 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |