CN107292750B

CN107292750B - Information collection method and information collection device for social network

Info

Publication number: CN107292750B
Application number: CN201610203819.7A
Authority: CN
Inventors: 童毅轩; 姜珊珊; 白瑞峰; 郑继川; 董滨
Original assignee: Ricoh Co Ltd
Current assignee: Ricoh Co Ltd
Priority date: 2016-04-01
Filing date: 2016-04-01
Publication date: 2020-08-18
Anticipated expiration: 2036-04-01
Also published as: CN107292750A

Abstract

The invention provides an information collection method and an information collection device of a social network. According to the method and the device, the internal connection among the social network users is utilized, the users to be extracted are selected, and the contents published by the users with higher values are preferentially extracted according to the extraction priority of the users, so that the information collection efficiency of the social network is improved.

Description

Information collection method and information collection device for social network

Technical Field

The invention relates to the technical field of network content collection, in particular to an information collection method and an information collection device of a social network.

Background

Social networks have become part of people's daily lives and take various forms, such as microblog platforms. A microblog is a platform for sharing, spreading and acquiring information based on user relationships. Users can build a personal community through various clients such as WEB and WAP, and publish content such as text, pictures or videos there to update information and share it instantly. User A of a microblog platform can become a fan of user B by following user B, and thereby receive user B's updates in time.

As a widely used social network, the microblog contains massive data that is of great significance for many application scenarios. To obtain this data, information on the microblog website can be collected by a web crawler. However, the number of microblog users and the volume of data on the microblog platform are huge, and existing information collection methods take too long to collect the data, so a scheme that improves information collection efficiency is urgently needed.

Disclosure of Invention

The technical problem to be solved by the embodiments of the present invention is to provide an information collecting method and an information collecting device for a social network, so as to collect required information from the social network more accurately and more quickly.

According to an aspect of the embodiments of the present invention, there is provided an information collecting method for a social network, including:

searching the content containing the seed words in a social network based on the seed words, obtaining a first search result and storing the first search result in a collected result database, wherein the social network comprises a plurality of users, user subnets of the users, content published by the users on the user subnets of the users and topological relations among the users;

determining users who issue the content in the first search result to obtain first-level users, and adding the first-level users into a candidate user set;

and extracting the contents published by the users on the user sub-network in the candidate user set one by one, storing the contents in the collection result database until reaching a preset extraction stop condition, wherein when the contents published by the current user on the user sub-network are extracted, the next-level user of the current user is determined according to the topological relation among the users, and the next-level user is added into the candidate user set.

According to an aspect of an embodiment of the present invention, in the information collecting method, the step of extracting, one by one, contents published by each user in the candidate user set on the user subnet thereof and storing the contents in the collection result database until a preset extraction stop condition is reached includes:

selecting one user from the candidate user set as a current user, wherein when a first-level user exists in the candidate user set, the first-level user is selected as the current user, otherwise, the user with the highest extraction priority is selected from the candidate user set as the current user;

extracting the content published by the current user on the user sub-network and storing the content in the collection result database;

determining the next-level user of the current user according to the topological relation among the users, deleting the current user from the candidate user set, and adding the next-level user into the candidate user set;

judging whether the extraction stopping condition is met, if so, ending the process; otherwise, returning to the step of selecting one user from the candidate user set as the current user.

According to an aspect of an embodiment of the present invention, in the above information collecting method, before the step of deleting the current user from the candidate user set and adding the next-level user to the candidate user set, the method further includes:

extracting a user label for representing the characteristics of the current user, and calculating a first correlation between the user label of the current user and the seed word to obtain the label quality of the current user;

calculating a second correlation between the content published by the current user on the user sub-network and the first search result to obtain the content quality of the current user;

and fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user.

According to an aspect of an embodiment of the present invention, in the information collecting method, the first correlation is a cosine distance between a word vector corresponding to the user tag and a word vector corresponding to the seed word;

the second correlation is a cosine distance between a first bag-of-words feature vector and a second bag-of-words feature vector, the first bag-of-words feature vector is a bag-of-words feature vector constructed based on content published by a current user on a user subnet of the current user, and the second bag-of-words feature vector is a bag-of-words feature vector constructed based on the first search result.

According to an aspect of an embodiment of the present invention, in the information collecting method, the step of fusing the label quality and the content quality of the current user to obtain the extraction priority of the user at the next level of the current user includes: and calculating the sum of the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user.

According to an aspect of an embodiment of the present invention, in the information collecting method, the step of fusing the label quality and the content quality of the current user to obtain the extraction priority of the user at the next level of the current user includes:

ranking the current user among all users in the candidate user set according to the label quality of the current user, to obtain a first fusion component of the current user;

ranking the current user among all users in the candidate user set according to the content quality of the current user, to obtain a second fusion component of the current user;

and calculating the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level user of the current user.

According to an aspect of an embodiment of the present invention, in the above information collecting method, after the adding the next-level user to the candidate user set, the method further includes: initializing the label quality of a next-level user of the current user according to the first correlation; and initializing the content quality of the next-level user of the current user according to the second correlation.

According to an aspect of an embodiment of the present invention, in the above information collecting method, the extraction stop condition includes at least one of the following conditions: a predetermined extraction time threshold is reached; a predetermined user level depth is reached; the extraction priority of the current user is lower than a predetermined threshold.

According to another aspect of the present invention, there is provided an information collecting apparatus of a social network, the information collecting apparatus including:

the search unit is used for searching the content containing the seed words in the social network based on the seed words, obtaining a first search result and storing the first search result in a collected result database, wherein the social network comprises a plurality of users, user subnets of the users, content published by the users on the user subnets of the users and topological relations among the users;

the candidate user generating unit is used for determining the users who issue the contents in the first search result, obtaining first-level users and adding the first-level users into a candidate user set;

and the extraction unit is used for extracting the contents published by the users on the user sub-network in the candidate user set one by one and storing the contents in the collection result database until a preset extraction stopping condition is reached, wherein when the contents published by the current user on the user sub-network are extracted, the next-level user of the current user is determined according to the topological relation among the users, and the next-level user is added into the candidate user set.

According to an aspect of an embodiment of the present invention, in the information collecting apparatus, the extracting unit includes:

a selecting unit, configured to select a user from the candidate user set as a current user, where when a first-level user exists in the candidate user set, the first-level user is selected as the current user, and otherwise, a user with the highest extraction priority is selected from the candidate user set as the current user;

the collection unit is used for extracting the content published by the current user on the user sub-network and storing the content in the collection result database;

the updating unit is used for determining the next-level user of the current user according to the topological relation among the users, deleting the current user from the candidate user set and adding the next-level user into the candidate user set;

a judging unit, configured to judge whether the extraction stop condition is satisfied, and if so, end content extraction; otherwise, the selection unit is continuously triggered.

According to an aspect of an embodiment of the present invention, in the information collecting apparatus, the updating unit includes:

the determining unit is used for determining the next-level user of the current user according to the topological relation among the users;

the priority calculating unit is used for extracting a user label for representing the characteristics of the current user, calculating a first correlation between the user label of the current user and the seed word, and obtaining the label quality of the current user; calculating a second correlation between the content published by the current user on the user sub-network and the first search result to obtain the content quality of the current user; fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user;

and the candidate user set maintenance unit is used for deleting the current user from the candidate user set and adding the next-level user into the candidate user set after the priority calculation unit obtains the extraction priority of the next-level user of the current user.

According to an aspect of an embodiment of the present invention, in the information collecting apparatus, the priority calculating unit is specifically configured to calculate a sum of the tag quality and the content quality of the current user, and obtain the extraction priority of the next-level user of the current user.

According to an aspect of an embodiment of the present invention, in the information collecting apparatus, the priority calculating unit is specifically configured to rank the current user among all users in the candidate user set according to the label quality of the current user, to obtain a first fusion component of the current user; rank the current user among all users in the candidate user set according to the content quality of the current user, to obtain a second fusion component of the current user; and calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level user of the current user.

According to an aspect of an embodiment of the present invention, in the information collecting apparatus, the updating unit further includes:

an initialization unit, configured to initialize the tag quality of the next-level user of the current user according to the first correlation after the candidate user set maintenance unit adds the next-level user of the current user to the candidate user set; and initializing the content quality of the next-level user of the current user according to the second correlation.

Compared with the prior art, the information collection method and the information collection device for the social network provided by the embodiment of the invention have the advantages that the internal connection among the social network users is utilized, the user to be extracted is selected to be added into the candidate user set, the content published by the user with higher value in the candidate user set is preferentially extracted according to the extraction priority of the user, and the information collection efficiency of the social network can be improved.

Drawings

FIG. 1 is a schematic flowchart of an information collecting method for a social network according to an embodiment of the present invention;

FIG. 2 is a schematic flowchart of content extraction performed on a user-by-user basis according to an embodiment of the present invention;

FIG. 3 is a flowchart illustrating the calculation of extraction priorities according to an embodiment of the present invention;

FIG. 4 is a functional structure diagram of an information collecting apparatus according to an embodiment of the present invention;

FIG. 5 is a functional structure diagram of an extracting unit of the information collecting apparatus according to an embodiment of the present invention;

FIG. 6 is a functional structure diagram of an updating unit of the information collecting apparatus according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of a hardware structure of an information collecting apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical problems, technical solutions and advantages of the present invention more apparent, the following detailed description is given with reference to the accompanying drawings and specific embodiments. In the following description, specific details such as specific configurations and components are provided only to help the full understanding of the embodiments of the present invention. Thus, it will be apparent to those skilled in the art that various changes and modifications may be made to the embodiments described herein without departing from the scope and spirit of the invention. In addition, descriptions of well-known functions and constructions are omitted for clarity and conciseness.

It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

In various embodiments of the present invention, it should be understood that the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.

It should be understood that the term "and/or" herein merely describes an association relationship between associated objects, meaning that three relationships may exist. For example, A and/or B may mean: A exists alone, A and B exist simultaneously, or B exists alone.

Social networks typically include the following elements:

users, each of which may be represented, for example, by a user identity;

user subnets of users, such as the personal community a user establishes on a microblog; a user subnet of a given user can generally be represented by a Uniform Resource Locator (URL);

the content published by users on their user subnets, which can take the form of text, pictures, video or audio. This document mainly takes the collection of text content as an example;

topological relationships among users, such as the fan relationships established on a microblog platform by following. When a second user follows a first user on the social network, a connection from the first user to the second user is formed in the topological relationship among the users, and this connection is directed. For convenience of description, the connection from the first user to the second user is expressed as: the next-level users of the first user include the second user.
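The directed relationship described above can be illustrated with a small sketch. The class name `Topology` and its methods are hypothetical and do not appear in the original text; the sketch only assumes that a follow action by one user creates a directed edge from the followed user to the follower.

```python
from collections import defaultdict


class Topology:
    """Directed follow graph (hypothetical helper, not from the patent text).

    An edge first -> second means "second follows first", so the
    next-level users of `first` include `second`.
    """

    def __init__(self):
        self._next_level = defaultdict(set)

    def add_follow(self, follower, followed):
        # The follower becomes a next-level user of the followed user.
        self._next_level[followed].add(follower)

    def next_level_users(self, user):
        # Return a copy so callers cannot mutate the graph by accident.
        return set(self._next_level.get(user, set()))


topology = Topology()
topology.add_follow(follower="A", followed="B")  # A becomes a fan of B
topology.add_follow(follower="C", followed="B")  # C becomes a fan of B
```

With this representation, looking up the next-level users of B yields A and C, while A, which nobody follows, has no next-level users.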

The embodiment of the invention collects the text content on the social network, and determines the next collected user by considering the topological relation among the users of the social network and the connection between the content and the user in the information collection process, thereby collecting the required information more quickly and accurately.

Referring to fig. 1, an embodiment of the present invention provides a method for collecting information of a social network, including the following steps:

Step 11, searching the social network for content containing the seed words, obtaining a first search result and storing the first search result in a collection result database.

In the embodiment of the present invention, the social network generally includes a plurality of users, a user subnet of each user, content published by each user on the user subnet of that user, and topological relationships among users. The content described in this embodiment refers to text content in one or more natural languages, such as Chinese text content or text content in other languages. The seed words typically include at least one search keyword and the logical relationships between the search keywords, such as AND or OR. A search keyword may be a word specified by the person performing the collection to indicate the subject matter on which the search focuses, so that the desired content can be found on the social network.
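As a rough illustration of steps 11 and 12, the sketch below models seed words as a set of keywords joined by a single AND or OR relationship and filters posts by plain substring matching. The names `SeedTerm` and `first_search`, the single-operator restriction, and the sample posts are all assumptions made for this example; the text only says that existing search algorithms may be used.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class SeedTerm:
    """Seed words: keywords plus one logical relationship (assumed shape)."""
    keywords: Tuple[str, ...]
    operator: str = "or"  # "and" or "or"

    def __post_init__(self):
        if not self.keywords:
            raise ValueError("at least one search keyword is required")
        if self.operator not in ("and", "or"):
            raise ValueError("operator must be 'and' or 'or'")

    def matches(self, text: str) -> bool:
        # Substring matching is a simplification standing in for a real search.
        hits = [kw in text for kw in self.keywords]
        return all(hits) if self.operator == "and" else any(hits)


def first_search(posts, seed):
    """Return the (user_id, text) pairs whose text matches the seed words."""
    return [(user, text) for user, text in posts if seed.matches(text)]


# Made-up sample data for illustration only.
posts = [
    ("u1", "new printer firmware released"),
    ("u2", "printer jam again"),
    ("u3", "lunch photos"),
]
seed = SeedTerm(keywords=("printer", "firmware"), operator="and")
first_result = first_search(posts, seed)
# The publishers of the matching posts become the first-level users.
first_level_users = sorted({user for user, _ in first_result})
```

Switching the operator to "or" would also admit the second post, so both u1 and u2 would become first-level users.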

In the embodiment of the present invention, the search may be performed by applying various existing search algorithms to obtain the content including the seed word.

Step 12, determining the users who published the content in the first search result to obtain first-level users, and adding the first-level users into a candidate user set.

Because the content published by first-level users directly contains the seed words, first-level users are all users of higher value, and the other content they publish is likely to contain information highly relevant to the seed words. Therefore, once the first search result is obtained, the embodiment of the invention further determines the users who published the content in the first search result and adds them to the candidate user set.

Step 13, extracting, one by one, the content published by each user in the candidate user set on the user subnet of that user, and storing the content in the collection result database until a preset extraction stop condition is reached, wherein when the content published by the current user on the user subnet is extracted, the next-level user of the current user is determined according to the topological relationship among the users, and the next-level user is added into the candidate user set.

Here, the embodiment of the present invention performs content extraction for each user in the candidate user set one by one: it extracts content that the user has published on the user subnet of that user, for example all content or only content whose publication time falls within a preset time period, and saves the extracted content in the collection result database. In the course of extracting the content of the current user (specifically, before, during or after the content is extracted), the next-level users of the current user, that is, the users who follow the current user, are further determined according to the topological relationship among the users and added into the candidate user set. For example, the fans of the current user can be determined according to the fan relationships of the microblog platform and added into the candidate user set.

In the above method, a search is first performed based on the seed words to obtain a first search result and first-level users; the content published by the first-level users and by their subordinate users is then further extracted, and this content generally has a higher relevance to the seed words. When a preset extraction stop condition is reached, the extraction processing stops, and the content in the collection result database is the information collected in the current collection process. Here, the extraction stop condition may be any one or more of the following conditions: a predetermined extraction time threshold is reached; a predetermined user level depth is reached; the extraction priority of the current user is lower than a predetermined threshold.
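The three stop conditions listed above can be combined into one check, sketched below. The class `StopCondition`, its field names, and the exact comparisons (for example, treating the depth condition as met once the current user's level exceeds the configured depth) are assumptions of this sketch; the text does not fix these details.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class StopCondition:
    """Any configured condition being met stops extraction (assumed semantics)."""
    max_seconds: Optional[float] = None   # extraction time threshold
    max_depth: Optional[int] = None       # user level depth (first level = 1)
    min_priority: Optional[float] = None  # extraction priority threshold

    def reached(self, started_at, now, depth, priority):
        # Time threshold: elapsed time has reached the limit.
        if self.max_seconds is not None and now - started_at >= self.max_seconds:
            return True
        # Depth threshold: the current user lies beyond the allowed level.
        if self.max_depth is not None and depth > self.max_depth:
            return True
        # Priority threshold: first-level users carry no priority (None).
        if (self.min_priority is not None and priority is not None
                and priority < self.min_priority):
            return True
        return False
```

In a running crawler, `started_at` and `now` could be taken from `time.monotonic()`; they are passed in explicitly here so that the check stays deterministic.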

Through the method, the embodiment of the invention can directly collect the content containing the seed words and can also collect the content with higher relevance with the seed words, thereby improving the accuracy of information collection, reducing the time required by collection and improving the information collection efficiency.

In step 13, when the content published by each user in the candidate user set is extracted one by one, a corresponding extraction priority may be set according to the importance of each user. As one implementation, step 13 of this embodiment may first collect the content of the first-level users and then extract content one by one according to the extraction priorities of the other users. In this case, as shown in FIG. 2, step 13 may specifically include:

step 131, selecting a user from the candidate user set as a current user, wherein when a first-level user exists in the candidate user set, selecting the first-level user as the current user, otherwise, selecting the user with the highest extraction priority from the candidate user set as the current user;

step 132, extracting the content published by the current user on the user sub-network and storing the content in the collection result database;

step 133, determining a next-level user of the current user according to the topological relation between the users, deleting the current user from the candidate user set, and adding the next-level user to the candidate user set;

step 134, judging whether the extraction stopping condition is met, if yes, ending the process; otherwise, return to step 131.

For users in the candidate user set other than first-level users, the embodiment of the invention determines the extraction order based on the extraction priority, so that the content of users with higher value is extracted first.

In a social network, users who have established topological connections often have a high relevance to each other. For example, on a microblog platform, if user A is a fan of user B, user A is more likely to have interests similar to those of user B, and the microblogs they post are therefore more likely to relate to the same topic. Therefore, in the embodiment of the present invention, the quality score of the current user is calculated from the content published by the current user and the user tag of the current user, and the quality score is used as the extraction priority of the next-level users of the current user. The calculation may be performed in step 133; as shown in FIG. 3, step 133 specifically includes:
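Steps 131 to 134 can be sketched as the loop below. It is a simplified illustration under several assumptions that are not in the original text: posts and follower lists are plain dictionaries, `quality_of` is a hypothetical callable returning the quality score that the current user passes on as the extraction priority of its next-level users, a user reachable from several parents keeps the highest inherited priority, already-extracted users are not re-added, and the only stop condition modelled is the priority threshold.

```python
def collect(first_level, posts, followers, quality_of, min_priority):
    """Sketch of steps 131-134 (assumed data shapes, see lead-in)."""
    first_pending = list(dict.fromkeys(first_level))  # ordered, de-duplicated
    candidates = {}   # non-first-level user -> extraction priority
    visited = set()
    order, collected = [], []
    while first_pending or candidates:
        # Step 131: first-level users go first, then the highest priority.
        if first_pending:
            current, priority = first_pending.pop(0), None
        else:
            current = max(candidates, key=candidates.get)
            priority = candidates.pop(current)
        # Step 132: extract and store the current user's content.
        collected.extend(posts.get(current, []))
        order.append(current)
        visited.add(current)
        # Step 133: next-level users inherit the current user's quality.
        inherited = quality_of(current)
        for nxt in followers.get(current, []):
            if nxt in visited or nxt in first_pending:
                continue
            candidates[nxt] = max(inherited, candidates.get(nxt, inherited))
        # Step 134: stop once the current user's priority is below threshold.
        if priority is not None and priority < min_priority:
            break
    return order, collected


# Made-up sample data for illustration only.
followers = {"u1": ["u3"], "u2": ["u4"], "u3": ["u5"], "u4": ["u6"]}
posts = {u: ["post by " + u] for u in ["u1", "u2", "u3", "u4", "u5", "u6"]}
quality = {"u1": 0.3, "u2": 0.8, "u3": 0.1, "u4": 0.7}
order, collected = collect(
    ["u1", "u2"], posts, followers,
    quality_of=lambda u: quality.get(u, 0.0), min_priority=0.5,
)
```

In this run the two first-level users are extracted first, then u4 and u6 (which inherit the higher scores from u2 and u4), and finally u3, whose inherited priority of 0.3 falls below the threshold and ends the loop before u5 is visited.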

step 1331, determining the next level user of the current user according to the topological relation among the users.

Step 1332, extracting a user tag used for representing the characteristics of the current user, and calculating a first correlation between the user tag of the current user and the seed word to obtain the tag quality of the current user; and calculating the second correlation between the content published by the current user on the user sub-network and the first search result to obtain the content quality of the current user.

The user label is a label set by the user himself/herself, and generally represents characteristics of the user, such as interest, geographical position, age and other classifications. Of course, the social network may also set a corresponding user tag for the user according to the behavior characteristics of the user. Typically, the user tags are also represented on the user's user sub-network by means of text content. The embodiment may extract the user tag of the user from the user sub-network. If the extracted user tag is null, the first correlation may be represented as 0; if the content published by the current user is empty, the second correlation may be represented as 0.

Here, the first correlation may be characterized by the cosine distance between the word vector corresponding to the user tag and the word vector corresponding to the seed words; a larger cosine distance indicates a higher first correlation. The embodiment of the invention provides a calculation formula as follows:

$$uq = \cos(\mathrm{label\_vec}, \mathrm{seed\_vec}) = \frac{\mathrm{label\_vec} \cdot \mathrm{seed\_vec}}{\lVert \mathrm{label\_vec} \rVert \, \lVert \mathrm{seed\_vec} \rVert}$$

In the above formula, uq is the label quality of the current user; the higher the first correlation, the higher the label quality score and the better the label quality. label_vec is the word vector corresponding to the user labels of the current user U, formed from label_vec_1, label_vec_2, label_vec_3, …, label_vec_n, which are the feature vectors corresponding to the individual user tags of the current user U, where n is the number of user tags of the current user U. seed_vec is the word vector of the seed words; if the seed words include a plurality of search keywords, seed_vec is the sum of the word vectors of the search keywords. label_vec and seed_vec can be obtained through a prior-art word2vec model, which can be trained in advance on a content corpus of the social network (such as a microblog corpus).
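The label quality computation can be sketched as follows. The sketch takes the first correlation to be the cosine of the two vectors, as the surrounding text describes. Summing the per-tag vectors into label_vec is an assumption made by analogy with how seed_vec is described, and the toy two-dimensional vectors stand in for real word2vec embeddings; the function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def sum_vectors(vectors):
    """Element-wise sum of a non-empty list of equal-length vectors."""
    return [sum(components) for components in zip(*vectors)]


def label_quality(tag_vectors, keyword_vectors):
    """uq: cosine between the combined tag vector and the seed vector."""
    if not tag_vectors:
        return 0.0  # empty user tag: the first correlation is taken as 0
    label_vec = sum_vectors(tag_vectors)      # assumption: tags are summed
    seed_vec = sum_vectors(keyword_vectors)   # sum over keywords, per the text
    return cosine(label_vec, seed_vec)


# Toy vectors for illustration only.
uq_aligned = label_quality([[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]])
uq_orthogonal = label_quality([[1.0, 0.0]], [[0.0, 1.0]])
uq_empty = label_quality([], [[1.0, 1.0]])
```

If the tag vectors were averaged rather than summed, the cosine would be unchanged, since scaling a vector by a positive constant does not alter its direction.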

Here, the second correlation may be characterized by the cosine distance between a first bag-of-words feature vector and a second bag-of-words feature vector; the larger the cosine distance, the higher the second correlation. The first bag-of-words feature vector is constructed from the content published by the current user on the user subnet of the current user, and the second bag-of-words feature vector is constructed from the first search result. A calculation formula of the second correlation provided in the embodiment of the present invention is as follows:

$$mq = \cos(\mathrm{b\_vec}, \mathrm{r\_vec}) = \frac{\mathrm{b\_vec} \cdot \mathrm{r\_vec}}{\lVert \mathrm{b\_vec} \rVert \, \lVert \mathrm{r\_vec} \rVert}$$

In the above formula, mq is the content quality of the current user; the higher the second correlation, the higher the content quality score and the better the content quality. b_vec is the first bag-of-words feature vector; r_vec is the second bag-of-words feature vector.
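A minimal sketch of the content quality computation is given below. It builds both bag-of-words vectors as word-count dictionaries and takes their cosine. Whitespace tokenization is an assumption made to keep the example self-contained; Chinese microblog text would need a word segmenter instead, and the function names are hypothetical.

```python
import math
from collections import Counter


def bag_of_words(texts):
    """Word-count vector over a list of texts (whitespace tokens assumed)."""
    counts = Counter()
    for text in texts:
        counts.update(text.split())
    return counts


def content_quality(user_posts, first_search_result):
    """mq: cosine between the user's bag of words and the search result's."""
    b_vec = bag_of_words(user_posts)
    r_vec = bag_of_words(first_search_result)
    if not b_vec or not r_vec:
        return 0.0  # empty content: the second correlation is taken as 0
    # Counter returns 0 for missing words, so only shared words contribute.
    dot = sum(count * r_vec[word] for word, count in b_vec.items())
    norm_b = math.sqrt(sum(c * c for c in b_vec.values()))
    norm_r = math.sqrt(sum(c * c for c in r_vec.values()))
    return dot / (norm_b * norm_r)


# Toy texts for illustration only.
mq_same = content_quality(["printer firmware"], ["printer firmware"])
mq_disjoint = content_quality(["lunch photos"], ["printer firmware"])
mq_empty = content_quality([], ["printer firmware"])
```

Identical word distributions give a value of 1, texts with no shared words give 0, and a user with no published content gets 0, matching the handling of empty content described earlier.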

Step 1333, fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user. Here, the extraction priority is positively correlated with both the tag quality and the content quality of the current user, that is, the better the tag quality of the current user, the higher the extraction priority; the better the content quality of the current user, the higher the extraction priority.

Step 1334, deleting the current user from the candidate user set, and adding the next-level user to the candidate user set.

When the next-level user of the current user is added into the candidate user set, the embodiment of the present invention may further initialize the label quality of the next-level user of the current user according to the first correlation, and initialize the content quality of the next-level user of the current user according to the second correlation. That is, the label quality of the next-level user is initialized to the value of the first correlation of the current user, and the content quality of the next-level user is initialized to the value of the second correlation of the current user.

In the above step 1333, the label quality and the content quality of the current user are fused to obtain the extraction priority of the next-level user. In this way, even if the user tag of the current user is empty or the published content is empty, the other quality factor can still be used to calculate the extraction priority.

One implementation of the above fusion processing provided by the embodiment of the present invention is to directly calculate the sum of the label quality and the content quality of the current user, and take the sum as the extraction priority of the next-level user of the current user.
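The sum-based fusion and the initialization of next-level users can be sketched together. The `Quality` record, the function names, and the dictionary layout of the candidate set are assumptions of this example; the text only specifies that the priority is the sum of the two qualities and that next-level users start with the current user's two correlation values.

```python
from dataclasses import dataclass


@dataclass
class Quality:
    tag: float = 0.0      # label quality (first correlation)
    content: float = 0.0  # content quality (second correlation)


def sum_fusion(quality):
    """Extraction priority as the plain sum of the two qualities."""
    return quality.tag + quality.content


def enqueue_next_level(current_quality, next_level_users, candidates):
    """Add next-level users with an inherited priority and initial qualities."""
    priority = sum_fusion(current_quality)
    for user in next_level_users:
        candidates[user] = {
            "priority": priority,
            # Initial qualities are copied from the current user.
            "quality": Quality(current_quality.tag, current_quality.content),
        }


# Made-up values for illustration only.
candidates = {}
enqueue_next_level(Quality(tag=0.25, content=0.5), ["u3", "u4"], candidates)
tag_only_priority = sum_fusion(Quality(tag=0.25, content=0.0))
```

The last line shows the case mentioned above: a user with empty content still yields a usable priority from the tag quality alone.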

Another implementation of the fusion process provided in the embodiment of the present invention may be performed in the following manner:

1) According to the label quality of the current user, ranking in all users in the candidate user set to obtain a first fusion component of the current user. Here, the value of the first fusion component corresponding to a user with better tag quality is set in advance to be not smaller than the value of the first fusion component corresponding to a user with poorer tag quality.

2) And according to the content quality of the current user, ranking in all users of the candidate user set to obtain a second fusion component of the current user. Here, the value of the second fused component corresponding to the user with better content quality is set in advance to be not smaller than the value of the second fused component corresponding to the user with poorer content quality.

3) And calculating the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level user of the current user.

In the above calculation manner, the larger the above sum value is, the higher the extraction priority is. A specific example of fusion is given below:

Assume that the M users ranked highest by label quality in the candidate user set each have a first fusion component of x, and that the users ranked after the Mth (excluding the Mth) each have a first fusion component of 0. Likewise, the L users ranked highest by content quality in the candidate user set each have a second fusion component of y, and the users ranked after the Lth (excluding the Lth) each have a second fusion component of 0. During fusion processing, the values of the first fusion component and the second fusion component are determined according to the rankings of the label quality and the content quality of the current user in the candidate user set, and the sum of the first fusion component and the second fusion component is then calculated to obtain the extraction priority of the next-level user of the current user.

The above is merely one example of fusion. When the opposite convention is adopted, that is, when the value of the first fusion component corresponding to a user with better label quality is preset to be not greater than the value corresponding to a user with poorer label quality, and the value of the second fusion component corresponding to a user with better content quality is preset to be not greater than the value corresponding to a user with poorer content quality, the smaller the sum is, the higher the extraction priority is.
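The rank-based fusion with thresholds M and L and component values x and y can be sketched as follows, using the convention in which a larger sum means a higher priority. The helper names, the dictionary layout of a candidate, and the handling of ties (users with equal quality share a rank) are assumptions made for illustration and are not specified by the embodiment.

```python
def fusion_component(value, all_values, top_n, bonus):
    """Return `bonus` if `value` ranks within the top `top_n`, otherwise 0."""
    # Rank is 1 plus the number of candidates with strictly better quality,
    # so users with equal quality share the same rank (an assumption).
    rank = 1 + sum(1 for v in all_values if v > value)
    return bonus if rank <= top_n else 0.0


def fuse_by_rank(user, candidates, M, L, x, y):
    """Sum of the first (label) and second (content) fusion components."""
    label_values = [c["label_quality"] for c in candidates]
    content_values = [c["content_quality"] for c in candidates]
    first = fusion_component(user["label_quality"], label_values, M, x)
    second = fusion_component(user["content_quality"], content_values, L, y)
    return first + second


# Illustrative candidate user set with made-up quality values.
candidates = [
    {"id": "a", "label_quality": 0.9, "content_quality": 0.1},
    {"id": "b", "label_quality": 0.5, "content_quality": 0.8},
    {"id": "c", "label_quality": 0.2, "content_quality": 0.4},
]
priorities = {
    c["id"]: fuse_by_rank(c, candidates, M=1, L=2, x=2.0, y=1.0)
    for c in candidates
}
```

With M = 1 and L = 2, user a receives only the first component (2.0), while users b and c receive only the second component (1.0 each).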

It can be seen from the above description that, in the method of the embodiment of the present invention, the user to be extracted is selected by using the internal connection between the social network users, and the content published by the user with higher value is preferentially extracted according to the extraction priority of the user, so that the content collection can be performed more effectively.

An embodiment of the present invention further provides an information collecting device of a social network. As shown in fig. 4, the information collecting device 40 includes:

the searching unit 41 is configured to search, based on the seed word, content in the social network that includes the seed word, obtain a first search result, and store the first search result in a collected result database;

a candidate user generating unit 42, configured to determine a user who issues the content in the first search result, obtain a first-level user, and add the first-level user to a candidate user set;

and an extracting unit 43, configured to extract, one by one, the content that each user in the candidate user set has published on that user's sub-network, and store the extracted content in the collection result database until a preset extraction stop condition is reached, where, when the content published by a current user on the current user's sub-network is extracted, a next-level user of the current user is determined according to the topological relation between users, and the next-level user is added to the candidate user set.

Referring to fig. 5, in the above information collecting apparatus 40, according to an aspect of an embodiment of the present invention, the extracting unit 43 includes:

a selecting unit 431, configured to select a user from the candidate user set as a current user, where when a first-level user exists in the candidate user set, the first-level user is selected as the current user, and otherwise, the user with the highest extraction priority is selected from the candidate user set as the current user;

a collecting unit 432, configured to extract content published by a current user on a user subnet of the current user and store the content in the collection result database;

an updating unit 433, configured to determine a next-level user of the current user according to the topological relation between the users, delete the current user from the candidate user set, and add the next-level user to the candidate user set;

a determining unit 434, configured to determine whether the extraction stop condition is met, and if yes, end content extraction; otherwise, the selection unit 431 continues to be triggered.
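The cooperation of the selecting, collecting, updating and determining units can be sketched as a single loop. This is an illustrative sketch under several assumptions: `fetch_content`, `next_level_of`, `priority_of` and `should_stop` are hypothetical callables standing in for the crawler, the topological relation, the fused priority and the extraction stop condition; a binary heap is one possible way to pick the highest-priority candidate; and skipping users that were already seen is an added safeguard not stated in the embodiment.

```python
import heapq
from collections import deque


def collect(first_level, fetch_content, next_level_of, priority_of, should_stop):
    first_queue = deque(first_level)  # first-level users are selected first
    heap = []                         # entries: (-priority, insertion order, user)
    seen = set(first_level)           # assumption: each user is queued only once
    results = []                      # stands in for the collection result database
    order = 0
    while first_queue or heap:
        # Selecting unit: first-level users first, then highest priority.
        if first_queue:
            current = first_queue.popleft()
        else:
            current = heapq.heappop(heap)[2]
        # Collecting unit: store the content published by the current user.
        results.append((current, fetch_content(current)))
        # Updating unit: the current user has left the candidate set;
        # its next-level users join with the priority derived from it.
        priority = priority_of(current)
        for user in next_level_of(current):
            if user not in seen:
                seen.add(user)
                heapq.heappush(heap, (-priority, order, user))
                order += 1
        # Determining unit: check the extraction stop condition.
        if should_stop(len(results)):
            break
    return results


# Illustrative data: a and b are first-level users; c follows from a, d from b.
graph = {"a": ["c"], "b": ["d"], "c": [], "d": []}
quality = {"a": 0.2, "b": 0.9, "c": 0.0, "d": 0.0}
results = collect(
    ["a", "b"],
    fetch_content=lambda u: f"posts of {u}",
    next_level_of=lambda u: graph[u],
    priority_of=lambda u: quality[u],
    should_stop=lambda n: n >= 3,     # stop after three users, for the example
)
visited = [user for user, _ in results]
```

In this example, user d is extracted before user c because d was reached through b, whose fused quality is higher.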

Referring to fig. 5, according to another aspect of the embodiment of the present invention, in the above information collecting apparatus 40, the updating unit 433 includes:

a determining unit 4331, configured to determine, according to the topological relation between the users, a next-level user of the current user;

a priority calculating unit 4332, configured to extract a user tag used for representing characteristics of a current user, and calculate a first correlation between the user tag of the current user and the seed word, so as to obtain a tag quality of the current user; calculating a second correlation between the content published by the current user on the user sub-network and the first search result to obtain the content quality of the current user; fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user;

a candidate user set maintaining unit 4333, configured to delete the current user from the candidate user set and add the next-level user to the candidate user set after the priority calculating unit 4332 obtains the extraction priority of the next-level user of the current user;

an initializing unit 4334, configured to initialize, according to the first correlation, a tag quality of a next-level user of the current user after the candidate user set maintaining unit 4333 adds the next-level user of the current user to the candidate user set; and initializing the content quality of the next-level user of the current user according to the second correlation.

In this embodiment of the present invention, the priority calculating unit 4332 may calculate the extraction priority of the next-level user of the current user in a plurality of different manners. In one possible manner, the priority calculating unit 4332 is specifically configured to calculate the sum of the label quality and the content quality of the current user, and use the sum as the extraction priority of the next-level user of the current user. In another possible manner, the priority calculating unit 4332 is specifically configured to rank the current user among all users in the candidate user set according to the label quality of the current user to obtain a first fusion component of the current user; rank the current user among all users in the candidate user set according to the content quality of the current user to obtain a second fusion component of the current user; and calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level user of the current user.

Fig. 7 is a schematic diagram of a hardware structure of an information collecting apparatus according to an embodiment of the present invention, where the information collecting apparatus may be disposed in a computer system 70, and the computer system 70 includes:

a processor 71, a RAM 72, a ROM 73, a hard disk 74, an input device 75, a display device 76, and a bus structure 77 connecting the above devices.

Here, the input device 75 may include a mouse, a keyboard, and various handwriting or touch input devices; the display device 76 includes various displays, projection devices, and the like; the bus architecture 77 may be any suitable type of bus or bridge and may include any number of interconnects, connecting together one or more processors, represented by the processor 71, and various circuits of one or more memories, represented by the RAM 72 and the ROM 73. Intermediate results of the operations of the processor 71 may be stored in the RAM 72, and the resulting data of the collection result database may be stored in the hard disk 74. The bus architecture 77 may also connect various other circuits such as peripherals, voltage regulators, and power management circuits, which are well known in the art and are therefore not described in detail herein.

When the processor 71 calls and executes the programs and data stored in the RAM 72 and/or the ROM 73, the following functional modules may be implemented:

the search unit is used for searching the content containing the seed words in the social network based on the seed words, obtaining a first search result and storing the first search result in a collected result database;

In summary, the information collection method and apparatus provided by the embodiments of the present invention select a user to be extracted by using the internal connection between users in the social network, and preferentially extract the content published by the user with a higher value according to the extraction priority of the user, thereby improving the information collection efficiency of the social network.

While the foregoing is directed to the preferred embodiment of the present invention, it will be understood by those skilled in the art that various changes and modifications may be made without departing from the spirit and scope of the invention as defined in the appended claims.

Claims

1. An information collection method for a social network, comprising:

extracting, one by one, the content published by each user in the candidate user set on that user's sub-network, and storing the extracted content in the collection result database until a preset extraction stop condition is reached, wherein, when the content published by the current user on the current user's sub-network is extracted, the next-level user of the current user is determined according to the topological relation among the users, and the next-level user is added to the candidate user set;

the step of extracting, one by one, the content published by each user in the candidate user set on that user's sub-network and storing the content in the collection result database until the preset extraction stop condition is reached comprises:

judging whether the extraction stopping condition is met, if so, ending the process; otherwise, returning to the step of selecting one user from the candidate user set as the current user;

before the step of removing the current user from the set of candidate users and adding the next level user to the set of candidate users, the method further comprises:

extracting a user label for representing the characteristics of the current user, and calculating a first correlation between the user label of the current user and the seed word to obtain a label quality of the current user represented by formula (1), wherein:

in the above formula (1), uq is the label quality of the current user; label_vec_1, label_vec_2, label_vec_3, …, label_vec_n are the feature vectors corresponding to the plurality of user tags of the current user U, wherein n is the number of user tags of the current user U; seed_vec is the word vector of the seed word, and if the seed word includes a plurality of search keywords, seed_vec is the sum of the word vectors of the search keywords;

calculating a second correlation between the content published by the current user on the user sub-network thereof and the first search result, and obtaining the content quality of the current user represented by formula (2):

in the above formula (2), mq is the content quality of the current user; b_vec is a first bag-of-words feature vector; r_vec is a second bag-of-words feature vector;

2. The information collecting method according to claim 1,

the first correlation is the cosine distance between the word vector corresponding to the user label and the word vector corresponding to the seed word;

3. The information collecting method according to claim 1, wherein the step of fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user comprises: calculating the sum of the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user.

4. The information collecting method according to claim 1, wherein the step of fusing the label quality and the content quality of the current user to obtain the extraction priority of the user next to the current user comprises:

5. The information collecting method according to claim 1,

after the adding of the next-level user to the candidate user set, the method further comprises: initializing the label quality of the next-level user of the current user according to the first correlation; and initializing the content quality of the next-level user of the current user according to the second correlation.

6. The information collection method according to claim 1, wherein the extraction stop condition includes at least one of the following conditions: a predetermined extraction time threshold is reached; a predetermined user-level depth is reached; the extraction priority of the current user is lower than a predetermined threshold.

7. An information collection apparatus of a social network, the information collection apparatus comprising:

an extraction unit, configured to extract, one by one, the content published by each user in the candidate user set on that user's sub-network, and store the content in the collection result database until a preset extraction stop condition is reached, wherein, when the content published by the current user on the current user's sub-network is extracted, the next-level user of the current user is determined according to the topological relation among the users, and the next-level user is added to the candidate user set;

the extraction unit includes:

a judging unit, configured to judge whether the extraction stop condition is satisfied, and if so, end content extraction; otherwise, continuing to trigger the selection unit;

the update unit includes:

the priority calculating unit is used for extracting a user label for representing the characteristics of the current user, calculating a first correlation between the user label of the current user and the seed word, and obtaining the label quality of the current user represented by the formula (1); calculating a second correlation between the content published by the current user on the user sub-network of the current user and the first search result to obtain the content quality of the current user represented by the formula (2); fusing the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user, wherein:

in the above formula (1), uq is the label quality of the current user; label_vec_1, label_vec_2, label_vec_3, …, label_vec_n are the feature vectors corresponding to the plurality of user tags of the current user U, wherein n is the number of user tags of the current user U; seed_vec is the word vector of the seed word, and if the seed word includes a plurality of search keywords, seed_vec is the sum of the word vectors of the search keywords;

8. The information collecting apparatus according to claim 7,

the priority calculating unit is specifically configured to calculate the sum of the label quality and the content quality of the current user to obtain the extraction priority of the next-level user of the current user.

9. The information collecting apparatus according to claim 7,

the priority calculating unit is specifically configured to rank the current user among all users in the candidate user set according to the label quality of the current user to obtain a first fusion component of the current user; rank the current user among all users in the candidate user set according to the content quality of the current user to obtain a second fusion component of the current user; and calculate the sum of the first fusion component and the second fusion component to obtain the extraction priority of the next-level user of the current user.

10. The information collection apparatus according to claim 7, wherein the update unit further comprises: