CN107292750A

CN107292750A - Information collection method and information collection apparatus for social networks

Info

Publication number: CN107292750A
Application number: CN201610203819.7A
Authority: CN
Inventors: 童毅轩; 姜珊珊; 白瑞峰; 郑继川; 董滨
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2016-04-01
Filing date: 2016-04-01
Publication date: 2017-10-24
Anticipated expiration: 2036-04-01
Also published as: CN107292750B

Abstract

The invention provides an information collection method and an information collection apparatus for social networks. The invention uses the inherent relationships between social network users to select the users to be extracted and, according to each user's extraction priority, preferentially extracts the content published by higher-value users, thereby improving the efficiency of information collection on social networks.

Description

Information collection method and information collection apparatus for social networks

Technical field

The present invention relates to the technical field of web content collection, and in particular to an information collection method and an information collection apparatus for social networks.

Background art

Social networks have become deeply embedded in daily life and take many forms, such as the various microblog platforms. A microblog is a platform for sharing, disseminating, and obtaining information based on user relationships. Users can set up personal communities through WEB, WAP, and various other clients, publish content such as text, pictures, or video in their personal communities to update information, and share it instantly. User A of a microblog platform can follow user B and thereby become a follower of user B, in order to receive user B's updates promptly.

Microblogs are a widely used form of social network, and the massive data they contain is of great significance for many application scenarios. To obtain this data, information on microblog websites can be collected by web crawlers. However, both the number of microblog users and the volume of data on microblog platforms are very large, and existing information collection methods take too long to collect the data. A scheme that improves the efficiency of information collection is therefore urgently needed.

Summary of the invention

The technical problem to be solved by the embodiments of the present invention is to provide an information collection method and an information collection apparatus for social networks that collect the required information from a social network more accurately and more quickly.

According to one aspect of the embodiments of the present invention, an information collection method for social networks is provided, including:

searching, based on seed words, the social network for content containing the seed words, obtaining a first search result, and saving it in a collection result database, where the social network includes multiple users, the user subnet of each user, the content each user publishes on its user subnet, and the inter-user topological relations between users;

determining the users who published the content in the first search result to obtain first-level users, and adding the first-level users to a candidate user set;

extracting, one by one, the content that each user in the candidate user set has published on its user subnet and saving it in the collection result database until a preset extraction stop condition is reached, where, when the content published by the current user on its user subnet is extracted, the next-level users of the current user are determined according to the inter-user topological relations and added to the candidate user set.

According to one aspect of the embodiments of the present invention, in the above information collection method, the step of extracting, one by one, the content that each user in the candidate user set has published on its user subnet and saving it in the collection result database until a preset extraction stop condition is reached includes:

selecting a user from the candidate user set as the current user, where, when first-level users exist in the candidate user set, one first-level user is selected as the current user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the current user;

extracting the content published by the current user on its user subnet and saving it to the collection result database;

determining the next-level users of the current user according to the inter-user topological relations, deleting the current user from the candidate user set, and adding the next-level users to the candidate user set;

judging whether the extraction stop condition is met; if so, ending the flow; otherwise, returning to the step of selecting a user from the candidate user set as the current user.

According to one aspect of the embodiments of the present invention, in the above information collection method, before the step of deleting the current user from the candidate user set and adding the next-level users to the candidate user set, the method further includes:

extracting the user tags that represent characteristics of the current user, and calculating a first correlation between the user tags of the current user and the seed words to obtain the label quality of the current user;

calculating a second correlation between the content published by the current user on its user subnet and the first search result to obtain the content quality of the current user;

fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.

According to one aspect of the embodiments of the present invention, in the above information collection method, the first correlation is the cosine distance between the word vector corresponding to the user tags and the word vector corresponding to the seed words;

the second correlation is the cosine distance between a first bag-of-words feature vector and a second bag-of-words feature vector, where the first bag-of-words feature vector is built from the content published by the current user on its user subnet, and the second bag-of-words feature vector is built from the first search result.

According to one aspect of the embodiments of the present invention, in the above information collection method, the step of fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user includes: calculating the sum of the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.

According to one aspect of the embodiments of the present invention, in the above information collection method, the step of fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user includes:

obtaining a first fusion component of the current user according to the ranking of the current user's label quality among all users in the candidate user set;

obtaining a second fusion component of the current user according to the ranking of the current user's content quality among all users in the candidate user set;

calculating the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the current user.

According to one aspect of the embodiments of the present invention, in the above information collection method, after the next-level users are added to the candidate user set, the method further includes: initializing the label quality of the next-level users of the current user according to the first correlation; and initializing the content quality of the next-level users of the current user according to the second correlation.

According to one aspect of the embodiments of the present invention, in the above information collection method, the extraction stop condition includes at least one of the following: a predetermined extraction time threshold is reached; extraction has reached a predetermined user-level depth; the extraction priority of the current user is below a predetermined threshold.

According to another aspect of the present invention, an information collection apparatus for social networks is provided, the information collection apparatus including:

a search unit, configured to search, based on seed words, the social network for content containing the seed words, obtain a first search result, and save it in a collection result database, where the social network includes multiple users, the user subnet of each user, the content each user publishes on its user subnet, and the inter-user topological relations between users;

a candidate user generation unit, configured to determine the users who published the content in the first search result to obtain first-level users, and to add the first-level users to a candidate user set;

an extraction unit, configured to extract, one by one, the content that each user in the candidate user set has published on its user subnet and save it in the collection result database until a preset extraction stop condition is reached, where, when the content published by the current user on its user subnet is extracted, the next-level users of the current user are determined according to the inter-user topological relations and added to the candidate user set.

According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the extraction unit includes:

a selection unit, configured to select a user from the candidate user set as the current user, where, when first-level users exist in the candidate user set, one first-level user is selected as the current user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the current user;

a collection unit, configured to extract the content published by the current user on its user subnet and save it to the collection result database;

an update unit, configured to determine the next-level users of the current user according to the inter-user topological relations, delete the current user from the candidate user set, and add the next-level users to the candidate user set;

a judgment unit, configured to judge whether the extraction stop condition is met; if so, content extraction ends; otherwise, the selection unit is triggered again.

According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the update unit includes:

a determination unit, configured to determine the next-level users of the current user according to the inter-user topological relations;

a priority calculation unit, configured to extract the user tags that represent characteristics of the current user and calculate a first correlation between the user tags of the current user and the seed words to obtain the label quality of the current user; to calculate a second correlation between the content published by the current user on its user subnet and the first search result to obtain the content quality of the current user; and to fuse the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user;

a candidate user set maintenance unit, configured to delete the current user from the candidate user set and add the next-level users to the candidate user set after the priority calculation unit has obtained the extraction priority of the next-level users of the current user.

According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the priority calculation unit is specifically configured to calculate the sum of the label quality and the content quality of the current user to obtain the extraction priority of the next-level users of the current user.

According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the priority calculation unit is specifically configured to obtain a first fusion component of the current user according to the ranking of the current user's label quality among all users in the candidate user set; to obtain a second fusion component of the current user according to the ranking of the current user's content quality among all users in the candidate user set; and to calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the current user.

According to one aspect of the embodiments of the present invention, in the above information collection apparatus, the update unit further includes:

an initialization unit, configured to, after the candidate user set maintenance unit has added the next-level users of the current user to the candidate user set, initialize the label quality of the next-level users of the current user according to the first correlation, and initialize the content quality of the next-level users of the current user according to the second correlation.

Compared with the prior art, the information collection method and information collection apparatus for social networks provided by the embodiments of the present invention use the inherent relationships between social network users to select the users to be extracted and add them to the candidate user set, and, according to the users' extraction priorities, preferentially extract the content published by the higher-value users in the candidate user set, which can improve the efficiency of information collection on social networks.

Brief description of the drawings

Fig. 1 is a schematic flowchart of an information collection method for social networks provided by an embodiment of the present invention;

Fig. 2 is a schematic flowchart of extracting content user by user in an embodiment of the present invention;

Fig. 3 is a schematic flowchart of calculating the extraction priority in an embodiment of the present invention;

Fig. 4 is a schematic diagram of the functional structure of the information collection apparatus provided by an embodiment of the present invention;

Fig. 5 is a schematic diagram of the functional structure of the extraction unit of the information collection apparatus provided by an embodiment of the present invention;

Fig. 6 is a schematic diagram of the functional structure of the update unit of the information collection apparatus provided by an embodiment of the present invention;

Fig. 7 is a schematic diagram of a hardware structure of the information collection apparatus provided by an embodiment of the present invention.

Detailed description

To make the technical problem to be solved, the technical solutions, and the advantages of the present invention clearer, a detailed description is given below with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as particular configurations and components are provided only to aid a thorough understanding of the embodiments of the present invention. It will therefore be apparent to those skilled in the art that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and brevity.

It should be understood that references throughout the specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present invention. Thus, occurrences of "in one embodiment" or "in an embodiment" throughout the specification do not necessarily refer to the same embodiment. Furthermore, these particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In the various embodiments of the present invention, it should be understood that the sequence numbers of the processes do not imply an order of execution. The order of execution of each process should be determined by its function and internal logic, and the numbering should not constitute any limitation on the implementation of the embodiments of the present invention.

It should be understood that the term "and/or" herein merely describes an association between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone.

A social network generally includes the following elements:

Users, which can be represented, for example, by user identifiers;

The user subnet of a user, such as the personal community that a user sets up on a microblog; the user subnet of a given user can generally be represented by a Uniform Resource Locator (URL);

The content that a user publishes on its user subnet, which may take the form of text, pictures, video, audio, and so on. This document mainly uses the collection of text content as an example.

The inter-user topological relations, such as the follower relationships that users establish on a microblog platform by following one another. When a second user follows a first user on the social network, a connection from the first user to the second user is formed in the inter-user topological relations, and this connection is directed. For ease of description, the connection from the first user to the second user is expressed herein as: the next-level users of the first user include the second user.
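The directed relation described above can be held in a simple adjacency mapping. The following is a minimal sketch for illustration only; the names `FollowGraph`, `add_follow`, and `next_level_users` are hypothetical and do not come from the source.

```python
from collections import defaultdict


class FollowGraph:
    """Directed inter-user topology: an edge runs from a followed user to each follower."""

    def __init__(self):
        # Followed user ID -> set of IDs of the users who follow that user.
        self._followers = defaultdict(set)

    def add_follow(self, follower, followed):
        """Record that `follower` has followed `followed` on the social network."""
        self._followers[followed].add(follower)

    def next_level_users(self, user):
        """Return the next-level users of `user`, i.e. the users who follow `user`."""
        return set(self._followers.get(user, set()))


graph = FollowGraph()
graph.add_follow(follower="second_user", followed="first_user")
graph.add_follow(follower="third_user", followed="first_user")
```

Here `graph.next_level_users("first_user")` yields the second and third users, while a user with no followers yields an empty set.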

The embodiments of the present invention collect text content on the social network described above. During information collection, the inter-user topological relations of the social network and the links between content and users are taken into account to determine which user to collect from next, so that the required information can be collected more quickly and more accurately.

Referring to Fig. 1, an information collection method for social networks provided by an embodiment of the present invention includes the following steps:

Step 11, based on seed words, search the social network for content containing the seed words, obtain a first search result, and save it in a collection result database.

In the embodiments of the present invention, the social network generally includes multiple users, the user subnet of each user, the content each user publishes on its user subnet, and the inter-user topological relations between users. Content in this embodiment refers to text content in one or more natural languages, such as Chinese text or text in other languages. The seed words generally include at least one search keyword and the logical relations between search keywords, such as an AND or an OR relation. A search keyword may be a word specified by the user to indicate the topic on which the search focuses, so that the required content can be found on the social network.

In the embodiments of the present invention, the search can be performed with various existing search algorithms to obtain the content containing the seed words.
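As an illustration of seed words that combine search keywords with an AND or OR relation, the sketch below filters a small in-memory list of posts. It is only a stand-in for a real search: plain substring matching, the function name `matches_seed_words`, and the sample posts are all assumptions made for this example.

```python
def matches_seed_words(text, keywords, relation="or"):
    """Return True if `text` satisfies the seed words.

    `keywords` is a list of search keywords and `relation` is "and" or "or".
    Matching by substring containment is an assumption of this sketch.
    """
    hits = [keyword in text for keyword in keywords]
    if relation == "and":
        return all(hits)
    if relation == "or":
        return any(hits)
    raise ValueError(f"unsupported relation: {relation!r}")


# Hypothetical posts standing in for content on the social network.
posts = [
    {"user": "u1", "text": "new printer firmware released"},
    {"user": "u2", "text": "weekend hiking photos"},
    {"user": "u3", "text": "printer review: scanner and copier"},
]

# First search result for the seed words "printer" AND "scanner".
first_search_result = [
    post for post in posts
    if matches_seed_words(post["text"], ["printer", "scanner"], "and")
]
```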

Step 12, determine the users who published the content in the first search result to obtain first-level users, and add the first-level users to a candidate user set.

Because the content published by first-level users directly contains the seed words, first-level users are higher-value users, and the content they publish is likely to contain information that is highly correlated with the seed words. Therefore, on obtaining the first search result, the embodiments of the present invention can further extract the users who published the content in the first search result and add them to the candidate user set.
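Step 12 amounts to collecting the distinct publishers of the posts in the first search result. A minimal sketch follows, assuming each post is a dictionary with a `user` field; that record layout and the function name `first_level_users` are hypothetical.

```python
def first_level_users(first_search_result):
    """Return the distinct publishers of the searched posts, in first-seen order."""
    # dict.fromkeys removes duplicates while preserving insertion order.
    return list(dict.fromkeys(post["user"] for post in first_search_result))


# Hypothetical first search result: two posts by u3 and one by u1.
search_result = [
    {"user": "u3", "text": "printer review"},
    {"user": "u1", "text": "printer firmware"},
    {"user": "u3", "text": "printer jam again"},
]
candidate_users = first_level_users(search_result)
```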

Step 13, extract, one by one, the content that each user in the candidate user set has published on its user subnet and save it in the collection result database until a preset extraction stop condition is reached, where, when the content published by the current user on its user subnet is extracted, the next-level users of the current user are determined according to the inter-user topological relations and added to the candidate user set.

Here, the embodiment performs content extraction one by one for each user in the candidate user set, extracting the content that the user has published on its user subnet, for example, all content or only content published within a preset time period, and saves the extracted content to the collection result database. During content extraction for the current user (specifically, before, during, or after the extraction), the next-level users of the current user, that is, the users who follow the current user, are further extracted according to the inter-user topological relations and added to the candidate user set. For example, the followers of the current user can be extracted according to the follower relationships of the microblog platform and added to the candidate user set.

In the above method, a search is first performed based on the seed words to obtain the first search result and the first-level users, and then the content published by the first-level users and by their lower-level users is further extracted; the content published by these users usually has a high correlation with the seed words. When the predetermined extraction stop condition is reached, the extraction process stops, and the content in the collection result database at that point is the information collected by this collection process. Here, the extraction stop condition may be any one or more of the following: a predetermined extraction time threshold is reached; extraction has reached a predetermined user-level depth; the extraction priority of the current user is below a predetermined threshold.
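The three stop conditions listed above can be checked together as sketched below. The class `StopCondition`, its field names, and the choice to enable all three conditions at once and combine them with a logical OR are assumptions of this example; the text only requires that one or more of them be used.

```python
import time
from dataclasses import dataclass


@dataclass
class StopCondition:
    max_seconds: float   # predetermined extraction time threshold
    max_depth: int       # predetermined user-level depth
    min_priority: float  # predetermined extraction priority threshold

    def reached(self, started_at, depth, priority, now=None):
        """Return True as soon as any one of the three conditions holds."""
        now = time.monotonic() if now is None else now
        return (
            now - started_at >= self.max_seconds
            or depth >= self.max_depth
            or priority < self.min_priority
        )


stop = StopCondition(max_seconds=60.0, max_depth=3, min_priority=0.2)
```

Passing `now` explicitly, as in `stop.reached(started_at=0.0, depth=1, priority=0.5, now=10.0)`, keeps the check deterministic for testing.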

With the above method, the embodiments of the present invention can collect not only content that directly contains the seed words but also content that is highly correlated with the seed words, which can improve the accuracy of information collection, reduce the time required for collection, and improve information collection efficiency.

In step 13 above, when the content published by each user in the candidate user set is extracted one by one, a corresponding extraction priority can be set according to the importance of each user. The content published by first-level users directly contains the seed words, so first-level users have a higher priority. As one implementation, in step 13 this embodiment can first collect the content of the first-level users and then extract content from the other users one by one according to their extraction priorities. In this case, as shown in Fig. 2, step 13 can specifically include:

Step 131, select a user from the candidate user set as the current user, where, when first-level users exist in the candidate user set, one first-level user is selected as the current user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the current user;

Step 132, extract the content published by the current user on its user subnet and save it to the collection result database;

Step 133, determine the next-level users of the current user according to the inter-user topological relations, delete the current user from the candidate user set, and add the next-level users to the candidate user set;

Step 134, judge whether the extraction stop condition is met; if so, end the flow; otherwise, return to step 131.
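Steps 131 to 134 can be sketched as a loop over a candidate set. The code below is an illustration under stated assumptions, not the patented implementation: the function `collect`, its callback parameters, the `visited` set that prevents a user from being extracted twice, and ending the flow when the candidate set becomes empty are all choices made for this example.

```python
def collect(first_level, fetch_posts, next_level_users, priority_of, should_stop):
    """Run the extraction loop of steps 131 to 134 and return the collected content."""
    # Candidate user set: user -> extraction priority; None marks a first-level user.
    candidates = {user: None for user in first_level}
    visited = set()   # assumption: a user is extracted at most once
    collected = []    # stands in for the collection result database

    while candidates:  # assumption: an empty candidate set also ends the flow
        # Step 131: prefer a first-level user, otherwise take the highest priority.
        first = [user for user, priority in candidates.items() if priority is None]
        current = first[0] if first else max(candidates, key=candidates.get)

        # Step 132: extract and save the content published by the current user.
        collected.extend(fetch_posts(current))

        # Step 133: remove the current user and add its next-level users, which
        # receive the priority computed from the current user.
        del candidates[current]
        visited.add(current)
        priority = priority_of(current)
        for user in next_level_users(current):
            if user not in visited and user not in candidates:
                candidates[user] = priority

        # Step 134: check the extraction stop condition.
        if should_stop():
            break
    return collected


# Hypothetical toy network: b follows a, c follows e.
posts = {"a": ["a1"], "e": ["e1"], "b": ["b1"], "c": ["c1"]}
followers = {"a": ["b"], "e": ["c"], "b": [], "c": []}
quality = {"a": 0.3, "e": 0.8, "b": 0.0, "c": 0.0}

result = collect(
    first_level=["a", "e"],
    fetch_posts=posts.__getitem__,
    next_level_users=followers.__getitem__,
    priority_of=quality.__getitem__,
    should_stop=lambda: False,
)
```

In this toy run both first-level users are extracted first, and the follower of `e` is then extracted before the follower of `a` because `e` yields the higher priority.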

For users in the candidate user set other than first-level users, the embodiments of the present invention can determine the order of extraction based on extraction priority, preferentially extracting the content of higher-value users.

In a social network, users between whom a topological connection has been established usually have a high correlation. On a microblog platform, for example, if user A is a follower of user B, user A is likely to share user B's interests, so the microblogs they publish are more likely to concern the same topics. The embodiments of the present invention therefore calculate a quality score for the current user from the content the current user has published and from its user tags, and use that quality score as the extraction priority of the current user's next-level users. The specific calculation can be performed in step 133 above; in this case, as shown in Fig. 3, step 133 specifically includes:

Step 1331, determine the next-level users of the current user according to the inter-user topological relations.

Step 1332, extract the user tags that represent characteristics of the current user, and calculate a first correlation between the user tags of the current user and the seed words to obtain the label quality of the current user; calculate a second correlation between the content published by the current user on its user subnet and the first search result to obtain the content quality of the current user.

A user tag is a label that users set for themselves and generally represents characteristics of the user, such as interests, geographic location, and age group. Of course, the social network may also set user tags for a user according to the user's behavioral characteristics. User tags are usually also presented as text content on the user's user subnet, and this embodiment can extract a user's tags from the user subnet. If the extracted user tags are empty, the first correlation can be set to 0; if the content published by the current user is empty, the second correlation can be set to 0.

Here, the first correlation can be characterized by the cosine distance between the word vector corresponding to the user tags and the word vector corresponding to the seed words; the larger the cosine distance, the higher the first correlation. One calculation formula provided by the embodiments of the present invention is as follows:
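The formula itself does not appear in this text. A form consistent with the surrounding description is given below as a reconstruction rather than a quotation; combining the per-tag vectors by summation is an assumption that mirrors what the text states for seed_vec, and because the cosine is unchanged by scaling, an average would give the same value.

```latex
uq = \frac{label\_vec \cdot seed\_vec}{\lVert label\_vec \rVert \, \lVert seed\_vec \rVert},
\qquad
label\_vec = \sum_{i=1}^{n} label\_vec_i
```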

In the above formula, uq is the label quality of the current user; clearly, the higher the first correlation, the higher the label quality score and the better the label quality. label_vec is the word vector corresponding to the user tags of the current user U, and label_vec_1, label_vec_2, label_vec_3, ..., label_vec_n are the feature vectors corresponding to the individual user tags of the current user U, where n is the number of user tags of the current user U. seed_vec is the word vector of the seed words; if the seed words include multiple search keywords, seed_vec is the sum of the word vectors of the search keywords. label_vec and seed_vec can be obtained with a prior-art word2vec model, which can be trained in advance on a corpus of social network content (such as a microblog corpus).
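The label quality computation can be sketched with plain Python lists standing in for word2vec vectors. The three-dimensional toy vectors, the function names, and the summation of per-tag vectors are assumptions of this example. Because a larger value means a higher correlation, the quantity the text calls cosine distance is computed here as the cosine of the angle between the two vectors.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sum_vectors(vectors):
    """Component-wise sum of a non-empty list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def label_quality(tag_vectors, keyword_vectors):
    """Compute uq from per-tag word vectors and per-keyword word vectors."""
    if not tag_vectors:
        return 0.0  # empty user tags: the first correlation is taken as 0
    label_vec = sum_vectors(tag_vectors)      # assumption: per-tag vectors are summed
    seed_vec = sum_vectors(keyword_vectors)   # sum over search keywords, as in the text
    return cosine(label_vec, seed_vec)


# Hypothetical 3-dimensional vectors in place of real word2vec output.
uq_related = label_quality([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]], [[1.0, 1.0, 0.0]])
uq_unrelated = label_quality([[0.0, 0.0, 1.0]], [[1.0, 1.0, 0.0]])
```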

Here, the second correlation can be characterized by the cosine distance between the first bag-of-words feature vector and the second bag-of-words feature vector; the larger the cosine distance, the higher the second correlation. The first bag-of-words feature vector is built from the content published by the current user on its user subnet, and the second bag-of-words feature vector is built from the first search result. One calculation formula for the second correlation provided by the embodiments of the present invention is as follows:
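The formula itself does not appear in this text. Based on the description of the second correlation as the cosine measure between the two bag-of-words feature vectors, a consistent form, offered as a reconstruction rather than a quotation, is:

```latex
mq = \frac{B\_vec \cdot R\_vec}{\lVert B\_vec \rVert \, \lVert R\_vec \rVert}
```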

In the above formula, mq is the content quality of the current user; clearly, the higher the second correlation, the higher the content quality score and the better the content quality. B_vec is the first bag-of-words feature vector, and R_vec is the second bag-of-words feature vector.
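The content quality computation can be sketched with word counts as the bag-of-words features. Whitespace tokenization and lower-casing are assumptions of this example (Chinese text would need a word segmenter), and the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Build a word-count vector from a list of texts, splitting on whitespace."""
    counts = Counter()
    for text in texts:
        counts.update(text.lower().split())
    return counts


def content_quality(user_posts, first_search_result):
    """Compute mq from the current user's posts and the first search result."""
    b_vec = bag_of_words(user_posts)            # first bag-of-words feature vector
    r_vec = bag_of_words(first_search_result)   # second bag-of-words feature vector
    if not b_vec or not r_vec:
        return 0.0  # no published content: the second correlation is taken as 0
    # Counter returns 0 for words missing from r_vec, so only shared words contribute.
    dot = sum(count * r_vec[word] for word, count in b_vec.items())
    norm_b = math.sqrt(sum(count * count for count in b_vec.values()))
    norm_r = math.sqrt(sum(count * count for count in r_vec.values()))
    return dot / (norm_b * norm_r)


mq_partial = content_quality(["printer jam"], ["printer news"])
```

With one shared word out of two on each side, `mq_partial` comes out at about 0.5.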

Step 1333, fuse the label quality and the content quality of the current user to obtain the extraction priority of the current user's next-level users. Here, the extraction priority is positively correlated with both the label quality and the content quality of the current user: the better the label quality of the current user, the higher the extraction priority; the better the content quality of the current user, the higher the extraction priority.

Step 1334, delete the current user from the candidate user set, and add the next-level users to the candidate user set.

When the next-level users of the current user are added to the candidate user set, the embodiments of the present invention can also initialize the label quality of those next-level users according to the first correlation and initialize their content quality according to the second correlation. That is, the label quality of the current user's next-level users is initialized to the value of the current user's first correlation, and their content quality is initialized to the value of the current user's second correlation.
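The initialization just described can be sketched as follows. The dictionary layout of the candidate set, the function name `add_next_level_users`, and the decision to leave users who are already candidates untouched are assumptions made for this example.

```python
def add_next_level_users(candidates, next_level, current_uq, current_mq):
    """Add next-level users, seeding their qualities with the current user's correlations."""
    for user in next_level:
        # Assumption: users already in the candidate set keep their existing values.
        candidates.setdefault(
            user, {"label_quality": current_uq, "content_quality": current_mq}
        )
    return candidates


# Hypothetical candidate set that already contains u2.
candidates = {"u2": {"label_quality": 0.9, "content_quality": 0.4}}
add_next_level_users(candidates, ["u2", "u3"], current_uq=0.6, current_mq=0.7)
```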

In step 1333 above, the label quality and the content quality of the current user are fused to obtain the extraction priority of the next-level users. In this way, even if the user tags of the current user are empty or the current user has published no content, the other quality factor can still be used to calculate the extraction priority.

One implementation of the above fusion provided by the embodiments of the present invention is to directly calculate the sum of the label quality and the content quality of the current user and use this sum as the extraction priority of the current user's next-level users.

Another implementation of the above fusion provided by the embodiments of the present invention can proceed as follows:

1) Obtain the first fusion component of the current user according to the ranking of the current user's label quality among all users in the candidate user set. Here, it is preset that the value of the first fusion component for a user with better label quality is not less than the value of the first fusion component for a user with worse label quality.

2) Obtain the second fusion component of the current user according to the ranking of the current user's content quality among all users in the candidate user set. Here, it is preset that the value of the second fusion component for a user with better content quality is not less than the value of the second fusion component for a user with worse content quality.

3) Calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the current user's next-level users.

With the above calculation, the larger the sum, the higher the extraction priority. A specific fusion example is given below:

Assume the following is predefined: for a user whose label quality ranks in the top M in the candidate user set, the value of the first fusion component is x; for users ranked after M (excluding rank M), the value of the first fusion component is 0; for a user whose content quality ranks in the top L in the candidate user set, the value of the second fusion component is y; for users ranked after L (excluding rank L), the value of the second fusion component is 0. During fusion, the values of the first and second fusion components are determined from the rankings of the current user's label quality and content quality in the candidate user set, and the sum of the two components is then calculated to obtain the extraction priority of the current user's next-level users.
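Both fusion variants can be sketched in a few lines. The function names and the tie-handling rule (equal values share the better rank) are assumptions of this example, as is passing the quality values of the other candidate users as plain lists; M, x, L, and y follow the example above.

```python
def fuse_by_sum(uq, mq):
    """First variant: the priority is the sum of label quality and content quality."""
    return uq + mq


def rank_of(value, others):
    """1-based rank of `value` against `others`, highest first; ties share the better rank."""
    return 1 + sum(1 for other in others if other > value)


def fuse_by_rank(uq, mq, candidate_uqs, candidate_mqs, M, x, L, y):
    """Second variant: sum of two rank-based fusion components.

    `candidate_uqs` and `candidate_mqs` hold the label and content quality
    values of the other users in the candidate user set.
    """
    first_component = x if rank_of(uq, candidate_uqs) <= M else 0
    second_component = y if rank_of(mq, candidate_mqs) <= L else 0
    return first_component + second_component


# Hypothetical quality values of three other candidate users.
others_uq = [0.5, 0.7, 0.3]
others_mq = [0.5, 0.7, 0.3]
priority = fuse_by_rank(0.9, 0.1, others_uq, others_mq, M=2, x=1.0, L=2, y=0.5)
```

Here the label quality ranks first (within the top M = 2) and contributes x, while the content quality ranks fourth and contributes 0, so `priority` equals x.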

The above is only one example of fusion. The opposite fusion mode may also be used. For example, when it is preset that the value of the first fusion component corresponding to a user with better label quality is not greater than the value of the first fusion component corresponding to a user with poorer label quality, and that the value of the second fusion component corresponding to a user with better content quality is not greater than the value of the second fusion component corresponding to a user with poorer content quality, then the smaller the above sum, the higher the extraction priority may be.

As can be seen from the above, the method of the embodiment of the present invention uses the inherent links between social network users to select the users to be extracted and, according to the extraction priorities of the users, preferentially extracts the content published by users of higher value, so that content is collected more efficiently.

Referring to Fig. 4, the embodiment of the present invention further provides a functional structure diagram of an information collection apparatus for a social network. As shown in Fig. 4, the information collection apparatus 40 includes:

Search unit 41, configured to search the social network, based on seed words, for content containing the seed words, obtain a first search result, and save it in a collection result database;

Candidate user generation unit 42, configured to determine the users who published the content in the first search result, obtain first-level users, and add the first-level users to a candidate user set;

Extraction unit 43, configured to extract, one by one, the content published by each user in the candidate user set on that user's user subnet and save it in the collection result database, until a preset extraction stop condition is reached, wherein, when extracting the content published by the active user on its user subnet, the next-level users of the active user are determined according to the inter-user topological relations and added to the candidate user set.

Referring to Fig. 5, according to one aspect of the embodiment of the present invention, in the above information collection apparatus 40, the extraction unit 43 includes:

Selecting unit 431, configured to select a user from the candidate user set as the active user, wherein, when first-level users exist in the candidate user set, a first-level user is selected as the active user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the active user;

Collecting unit 432, configured to extract the content published by the active user on its user subnet and save it to the collection result database;

Updating unit 433, configured to determine the next-level users of the active user according to the inter-user topological relations, delete the active user from the candidate user set, and add the next-level users to the candidate user set;

Judging unit 434, configured to judge whether the extraction stop condition is met; if so, content extraction ends; otherwise, the selecting unit 431 is triggered again.
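The cooperation of the selecting, collecting, updating and judging units can be sketched as a single loop. The following Python sketch is illustrative only and rests on several assumptions that are not part of the embodiment: the callables `fetch_posts`, `next_level_of` and `priority_for_children` are hypothetical stand-ins for the social network access and the priority calculation, the stop condition is simplified to a maximum number of extracted users (`max_users`), and a `seen` set is added so that no user is extracted twice.

```python
from typing import Callable, Dict, List, Set


def collect(
    first_level: List[str],
    fetch_posts: Callable[[str], List[str]],
    next_level_of: Callable[[str], List[str]],
    priority_for_children: Callable[[str], float],
    max_users: int,
) -> List[str]:
    results: List[str] = []             # stands in for the collection result database
    pending_first = list(first_level)   # first-level users are always served first
    candidates: Dict[str, float] = {}   # other candidate users -> extraction priority
    seen: Set[str] = set(first_level)   # assumption: never extract a user twice
    extracted = 0

    # Judging step: stop when nothing is left or the (simplified) limit is hit.
    while (pending_first or candidates) and extracted < max_users:
        # Selecting step: a first-level user if any remain, otherwise the
        # candidate with the highest extraction priority.
        if pending_first:
            current = pending_first.pop(0)
        else:
            current = max(candidates, key=lambda u: (candidates[u], u))
            del candidates[current]     # remove the active user from the set

        # Collecting step: save the content published by the active user.
        results.extend(fetch_posts(current))
        extracted += 1

        # Updating step: add next-level users with the priority derived
        # from the active user's label quality and content quality.
        priority = priority_for_children(current)
        for child in next_level_of(current):
            if child not in seen:
                seen.add(child)
                candidates[child] = priority
    return results
```

A time limit, a user level depth limit, or a priority threshold could replace `max_users` in the loop condition without changing the structure of the loop.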

Referring to Fig. 5, according to another aspect of the embodiment of the present invention, in the above information collection apparatus 40, the updating unit 433 includes:

Determining unit 4331, configured to determine the next-level users of the active user according to the inter-user topological relations;

Priority calculation unit 4332, configured to extract the user tag that represents the characteristics of the active user, and calculate a first correlation between the user tag of the active user and the seed words to obtain the label quality of the active user; calculate a second correlation between the content published by the active user on its user subnet and the first search result to obtain the content quality of the active user; and fuse the label quality and the content quality of the active user to obtain the extraction priority of the next-level users of the active user;

Candidate user set maintenance unit 4333, configured to, after the priority calculation unit obtains the extraction priority of the next-level users of the active user, delete the active user from the candidate user set and add the next-level users to the candidate user set;

Initialization unit 4334, configured to, after the candidate user set maintenance unit 4333 adds the next-level users of the active user to the candidate user set, initialize the label quality of the next-level users of the active user according to the first correlation, and initialize the content quality of the next-level users of the active user according to the second correlation.
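Claim 4 specifies the second correlation used by the priority calculation unit 4332 as a cosine distance between two bag-of-words feature vectors. A minimal Python sketch of such a measure is given below. It assumes that the content has already been split into tokens, and it interprets "cosine distance" as cosine similarity, so that a larger value means stronger relevance; both points, as well as the name `bow_cosine`, are assumptions of this sketch rather than details stated in the embodiment.

```python
import math
from collections import Counter
from typing import Iterable


def bow_cosine(tokens_a: Iterable[str], tokens_b: Iterable[str]) -> float:
    """Cosine similarity between two bag-of-words term-count vectors."""
    a, b = Counter(tokens_a), Counter(tokens_b)
    # Only terms present in both bags contribute to the dot product.
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty bag is treated as unrelated (assumption)
    return dot / (norm_a * norm_b)
```

Under these assumptions, the content quality of the active user would be obtained by passing the tokens of the content published on its user subnet and the tokens of the first search result to `bow_cosine`.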

In the embodiment of the present invention, the priority calculation unit 4332 may calculate the extraction priority of the next-level users of the active user in several ways. One possible calculation is as follows: the priority calculation unit 4332 is specifically configured to calculate the sum of the label quality and the content quality of the active user, and to use this sum as the extraction priority of the next-level users of the active user. Another possible calculation is as follows: the priority calculation unit 4332 is specifically configured to obtain the first fusion component of the active user according to the ranking of the active user's label quality among all users in the candidate user set; obtain the second fusion component of the active user according to the ranking of the active user's content quality among all users in the candidate user set; and calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the active user.

Fig. 7 shows a hardware structure diagram of the information collection apparatus of the embodiment of the present invention. The information collection apparatus may be deployed in a computer system 70, and the computer system 70 includes:

A processor 71, a RAM 72, a ROM 73, a hard disk 74, an input device 75, a display device 76, and a bus architecture 77 that connects the above devices.

Here, the input device 75 may include a mouse, a keyboard, and various handwriting-input or touch-input devices; the display device 76 includes various displays, projection devices, and the like; the bus architecture 77 may include any number of interconnected buses and bridges, which link together various circuits of the one or more processor units represented by the processor 71 and the one or more memories represented by the RAM 72 and the ROM 73. Intermediate results of the computations of the processor 71 may be stored in the RAM 72, and the data of the finally obtained collection result database may be stored in the hard disk 74. The bus architecture 77 may also link together various other circuits, such as peripheral devices, voltage regulators, and power management circuits, all of which are well known in the art and are therefore not described further here.

When the processor 71 calls and executes the programs and data stored in the RAM 72 and/or the ROM 73, the following functional modules can be implemented:

Search unit, configured to search the social network, based on seed words, for content containing the seed words, obtain a first search result, and save it in a collection result database;

In summary, the information collection method and apparatus provided by the embodiment of the present invention use the inherent links between social network users to select the users to be extracted and, according to the extraction priorities of the users, preferentially extract the content published by users of higher value, thereby improving the information collection efficiency for the social network.

The above are preferred embodiments of the present invention. It should be noted that a person of ordinary skill in the art may make a number of improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the scope of protection of the present invention.

Claims

1. An information collection method for a social network, characterized by comprising:

Based on seed words, searching the social network for content containing the seed words, obtaining a first search result, and saving it in a collection result database, the social network comprising multiple users, the user subnet of each user, the content published by each user on its user subnet, and the inter-user topological relations between users;

Determining the users who published the content in the first search result to obtain first-level users, and adding the first-level users to a candidate user set;

Extracting, one by one, the content published by each user in the candidate user set on that user's user subnet and saving it in the collection result database, until a preset extraction stop condition is reached, wherein, when extracting the content published by the active user on its user subnet, the next-level users of the active user are determined according to the inter-user topological relations, and the next-level users are added to the candidate user set.

2. The information collection method as claimed in claim 1, characterized in that the step of extracting, one by one, the content published by each user in the candidate user set on that user's user subnet and saving it in the collection result database, until a preset extraction stop condition is reached, comprises:

Selecting a user from the candidate user set as the active user, wherein, when first-level users exist in the candidate user set, a first-level user is selected as the active user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the active user;

Determining the next-level users of the active user according to the inter-user topological relations, deleting the active user from the candidate user set, and adding the next-level users to the candidate user set;

Judging whether the extraction stop condition is met; if so, ending the flow; otherwise, returning to the step of selecting a user from the candidate user set as the active user.

3. The information collection method as claimed in claim 2, characterized in that, before the step of deleting the active user from the candidate user set and adding the next-level users to the candidate user set, the method further comprises:

Extracting the user tag that represents the characteristics of the active user, and calculating a first correlation between the user tag of the active user and the seed words to obtain the label quality of the active user;

Calculating a second correlation between the content published by the active user on its user subnet and the first search result to obtain the content quality of the active user;

Fusing the label quality and the content quality of the active user to obtain the extraction priority of the next-level users of the active user.

4. The information collection method as claimed in claim 3, characterized in that

The first correlation is the cosine distance between the word vector corresponding to the user tag and the word vector corresponding to the seed words;

The second correlation is the cosine distance between a first bag-of-words feature vector and a second bag-of-words feature vector, the first bag-of-words feature vector being a bag-of-words feature vector built from the content published by the active user on its user subnet, and the second bag-of-words feature vector being a bag-of-words feature vector built from the first search result.

5. The information collection method as claimed in claim 3, characterized in that the step of fusing the label quality and the content quality of the active user to obtain the extraction priority of the next-level users of the active user comprises: calculating the sum of the label quality and the content quality of the active user to obtain the extraction priority of the next-level users of the active user.

6. The information collection method as claimed in claim 3, characterized in that the step of fusing the label quality and the content quality of the active user to obtain the extraction priority of the next-level users of the active user comprises:

Obtaining the first fusion component of the active user according to the ranking of the active user's label quality among all users in the candidate user set;

Obtaining the second fusion component of the active user according to the ranking of the active user's content quality among all users in the candidate user set;

Calculating the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the active user.

7. The information collection method as claimed in claim 3, characterized in that

After the next-level users are added to the candidate user set, the method further comprises: initializing the label quality of the next-level users of the active user according to the first correlation; and initializing the content quality of the next-level users of the active user according to the second correlation.

8. The information collection method as claimed in claim 1, characterized in that the extraction stop condition comprises at least one of the following conditions: a predetermined extraction time threshold is reached; a predetermined user level depth has been extracted; the extraction priority of the active user is lower than a predetermined threshold.

9. An information collection apparatus for a social network, characterized in that the information collection apparatus comprises:

A search unit, configured to search the social network, based on seed words, for content containing the seed words, obtain a first search result, and save it in a collection result database, the social network comprising multiple users, the user subnet of each user, the content published by each user on its user subnet, and the inter-user topological relations between users;

A candidate user generation unit, configured to determine the users who published the content in the first search result, obtain first-level users, and add the first-level users to a candidate user set;

An extraction unit, configured to extract, one by one, the content published by each user in the candidate user set on that user's user subnet and save it in the collection result database, until a preset extraction stop condition is reached, wherein, when extracting the content published by the active user on its user subnet, the next-level users of the active user are determined according to the inter-user topological relations, and the next-level users are added to the candidate user set.

10. The information collection apparatus as claimed in claim 9, characterized in that the extraction unit comprises:

A selecting unit, configured to select a user from the candidate user set as the active user, wherein, when first-level users exist in the candidate user set, a first-level user is selected as the active user; otherwise, the user with the highest extraction priority is selected from the candidate user set as the active user;

A collecting unit, configured to extract the content published by the active user on its user subnet and save it to the collection result database;

An updating unit, configured to determine the next-level users of the active user according to the inter-user topological relations, delete the active user from the candidate user set, and add the next-level users to the candidate user set;

A judging unit, configured to judge whether the extraction stop condition is met; if so, content extraction ends; otherwise, the selecting unit is triggered again.

11. The information collection apparatus as claimed in claim 10, characterized in that the updating unit comprises:

A priority calculation unit, configured to extract the user tag that represents the characteristics of the active user, and calculate a first correlation between the user tag of the active user and the seed words to obtain the label quality of the active user; calculate a second correlation between the content published by the active user on its user subnet and the first search result to obtain the content quality of the active user; and fuse the label quality and the content quality of the active user to obtain the extraction priority of the next-level users of the active user;

A candidate user set maintenance unit, configured to, after the priority calculation unit obtains the extraction priority of the next-level users of the active user, delete the active user from the candidate user set and add the next-level users to the candidate user set.

12. The information collection apparatus as claimed in claim 11, characterized in that

The priority calculation unit is specifically configured to calculate the sum of the label quality and the content quality of the active user to obtain the extraction priority of the next-level users of the active user.

13. The information collection apparatus as claimed in claim 11, characterized in that

The priority calculation unit is specifically configured to obtain the first fusion component of the active user according to the ranking of the active user's label quality among all users in the candidate user set; obtain the second fusion component of the active user according to the ranking of the active user's content quality among all users in the candidate user set; and calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level users of the active user.

14. The information collection apparatus as claimed in claim 11, characterized in that the updating unit further comprises:

An initialization unit, configured to, after the candidate user set maintenance unit adds the next-level users of the active user to the candidate user set, initialize the label quality of the next-level users of the active user according to the first correlation, and initialize the content quality of the next-level users of the active user according to the second correlation.