CN103377207A - Method for acquiring microblog user relationships on basis of script engines

Info

Publication number: CN103377207A
Application number: CN201210114869XA
Authority: CN
Inventors: 都云程
Original assignee: BEIJING TRS INFORMATION TECHNOLOGY Co Ltd
Current assignee: TOLS INFORMATION TECHNOLOGY Co.,Ltd.
Priority date: 2012-04-17
Filing date: 2012-04-17
Publication date: 2013-10-30
Anticipated expiration: 2032-04-17
Also published as: CN103377207B

Abstract

The invention relates to the technical field of information acquisition and discloses a method for acquiring microblog user relationships based on a script engine. The method comprises the steps of: S1, automatically logging in to a microblog website using script engine technology; S2, crawling the content pages corresponding to specific accounts by web page collection; S3, parsing the metadata to obtain a user list, user behavior mechanisms and basic user information; S4, extracting the user relationships; and S5, performing a breadth-first traversal of the user list to enrich the user association relationships. The method avoids API (application program interface) access restrictions, facilitates large-scale acquisition of microblog information, and improves acquisition accuracy.

Description

Method for acquiring microblog user relationships based on a script engine

Technical field

The invention belongs to the field of information technology and, specifically, relates to a method for acquiring microblog user relationships based on a script engine.

Background art

With the rapid development of Web information technology, research on social networks of entities has attracted close attention from academia and industry. Social networks have expanded exponentially with the rise of a new Internet model, the microblog. Platforms such as Facebook, LinkedIn and Sina contain a large number of user relationships, and these relationships hold considerable commercial value.

Extracting microblog user relationships is a basic task in the real-time collection of massive microblog information. The user relationships help define the strategy for updating collected microblog information, serve as leads for real-time updates during large-scale collection, and are a basic resource for in-depth microblog research.

At present, microblog user relationship extraction relies mainly on the open API of the microblog platform and follows the platform's own "Following" and "Followed" rules. The quantity, scope and frequency of the information obtained are therefore limited by the microblog API. This approach has several drawbacks. First, the acquisition system can obtain only limited data, within the frequency and scope granted to the application. Second, different APIs impose different access-frequency limits, which hinders dynamic updating of the data. Third, the extracted user information and user relationships are incomplete, so the acquisition rate is low.

Summary of the invention

(1) Technical problem to be solved

The technical problem to be solved by the invention is: how to collect user information from microblogs at scale, improve the acquisition rate, and build a comparatively complete set of user relationships.

(2) Technical solution

To solve the above technical problem, the invention provides a method for acquiring user relationships based on a script engine, the method comprising the following steps:

S1, using script engine technology to log in to the microblog website automatically, enabling high-accuracy collection from the microblog website;

S2, crawling, by web page collection, the content pages corresponding to specific account information;

S3, using metadata parsing to parse the user information and user behavior mechanisms in those pages, obtaining the user information;

S4, on the basis of S3, using the user behavior mechanisms to extract the user association relationships, and storing them;

S5, traversing the user list breadth-first and repeating the above steps for each collected user id, continually enriching the user relationship list with the collected information.

Wherein, in step S1, JavaScript is used to implement the scripting function of the configuration software, and SpiderMonkey is used as the embedded engine of the configuration software's script module; only the scripts in the page that relate to link generation and microblog content are parsed.

In step S1, the interpreter in the script engine is extended so that it supports two execution modes, interpretation and compilation.

The goal of the script engine framework design is to embed SpiderMonkey in the engine module of the configuration software so that it has basic JavaScript processing capability. The implementation steps specifically comprise:

S11, creating the engine wrapper class JSEngine;

S12, implementing the exported initialization function InitScript() of the script engine;

S13, implementing the exported unloading function UnInitScript() of the script engine.

Wherein, the implementation of step S3 specifically comprises:

S31, summarizing the HTML document structure of each microblog web page and identifying how the tags of different nodes differ;

S32, filtering out invalid information according to the HTML document structure rules found in S31, converting the HTML into standard XHTML, parsing the document into a DOM tree, and building the metadata feature templates;

S33, template matching: designing an algorithm, based on the characteristics of the XHTML document, that matches the document against the templates in the template set;

S34, extracting the needed information according to the node paths of the matched template, and storing it in a defined format.

In step S3, the user behavior mechanisms comprise: the follow mechanism ("following" and "being followed"), and the repost and comment mechanisms for information pushed by users.

In step S4, user relationship extraction is implemented by the following steps:

S41, finding the user URLs under Following and Follower;

S42, filtering the URLs and putting them into the URL queue as objects to be collected;

S43, combining the current user's URL with the user behavior mechanisms from S3 to build the user association relationship table, and storing it.

Wherein, in step S5, the user list is traversed breadth-first, the information in each user list is collected locally, and users are deduplicated before being stored in the database.

Description of the drawings

Fig. 1 is a schematic flow chart of the method for acquiring microblog user relationships based on a script engine provided by the invention.

Embodiment

Specific embodiments of the invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the invention but do not limit its scope.

Fig. 1 shows a schematic flow chart of the method for acquiring microblog user relationships provided by an embodiment of the invention. As shown in Fig. 1, the method comprises the following steps:

S1: log in to the microblog website automatically using script engine technology.

A microblog is a kind of community website, and information can be collected only after logging in with a user identity. To protect user security, the user authentication process of a microblog website is complex and strict. Microblogs are a Web 2.0 form of presentation and, at the implementation level, make heavy use of scripts. A microblog collector does not need to parse every script in the page; it only needs to parse the scripts related to link generation and microblog content.

SpiderMonkey is used as the embedded engine of the configuration software's script module, giving the configuration software JavaScript processing capability. The concrete implementation is as follows:

S11: download the latest version of SpiderMonkey and copy js32d.dll and js32.dll from the src directory to the system directory; the dynamic link library js32d.dll supports the Debug build, while js32.dll supports the Release build. Create the engine wrapper class JSEngine, which encapsulates the most basic engine settings and initialization functions. Calling its core function Create() initializes the elements of the script engine, including creating the runtime environment (Runtime), creating the context (Context), and creating and initializing the global object (GlobalObject).

S12: implement the exported initialization function InitScript() of the script engine, which calls the member function Create() of the JSEngine class to initialize the script environment.

S13: implement the unloading function UnInitScript() of the script engine, which frees the memory allocated during script engine initialization when the application exits. The main steps are: release the created context handle; release the runtime environment together with the variables and object space on it; and release the space allocated for global structures during initialization.
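The lifecycle in S11 to S13 can be sketched as follows. This is a minimal Python model of the call order only, not a SpiderMonkey binding: the class below keeps plain dictionaries where a real embedding would hold engine handles, and the names create, destroy, init_script, uninit_script and ready, as well as the sizes, are illustrative assumptions.

```python
class JSEngine:
    """Toy model of the wrapper class; it holds no real SpiderMonkey state."""

    def __init__(self):
        self.runtime = None
        self.context = None
        self.global_object = None

    def create(self):
        # Same order as in S11: runtime, then context, then global object.
        self.runtime = {"heap_bytes": 8 * 1024 * 1024}  # illustrative size
        self.context = {"stack_chunk": 8192}            # illustrative size
        self.global_object = {}
        return True

    def destroy(self):
        # Same order as in S13: context, then runtime, then global structures.
        self.context = None
        self.runtime = None
        self.global_object = None

    @property
    def ready(self):
        return None not in (self.runtime, self.context, self.global_object)


_engine = None  # single engine instance owned by this module


def init_script():
    """Counterpart of InitScript(): create the engine once and reuse it."""
    global _engine
    if _engine is None:
        engine = JSEngine()
        engine.create()
        _engine = engine
    return _engine


def uninit_script():
    """Counterpart of UnInitScript(): safe to call more than once."""
    global _engine
    if _engine is not None:
        _engine.destroy()
        _engine = None
```

Keeping the teardown idempotent means the application can call it unconditionally on exit, whether or not initialization ever ran.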

Implementing the scripting function of the configuration software in JavaScript involves: defining the symbol table; implementing the lexical and syntax analyzers; implementing the semantic checker, intermediate code generator, code optimizer and code generator; and finally producing the script engine virtual machine. The advantages of this approach are simple development, flexible functionality and good platform independence.
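As an illustration of the first stage in that list, the lexical analyzer, the sketch below splits a small JavaScript-like statement into tokens. It is a deliberately tiny example written for this description, not the analyzer used by SpiderMonkey or by the invention; the token classes and the tokenize function are assumptions.

```python
import re

# Token classes for a very small JavaScript-like subset (illustrative only).
TOKEN_SPEC = [
    ("NUMBER", r"\d+(?:\.\d+)?"),
    ("IDENT", r"[A-Za-z_$][A-Za-z0-9_$]*"),
    ("STRING", r"\"[^\"\n]*\"|'[^'\n]*'"),
    ("OP", r"[=+\-*/(){};,.<>!]"),
    ("SKIP", r"\s+"),
]
TOKEN_RE = re.compile("|".join("(?P<%s>%s)" % (name, pattern) for name, pattern in TOKEN_SPEC))


def tokenize(source):
    """Return a list of (kind, text) pairs, dropping whitespace."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = TOKEN_RE.match(source, pos)
        if match is None:
            raise SyntaxError("unexpected character %r at offset %d" % (source[pos], pos))
        if match.lastgroup != "SKIP":
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens
```

The later stages (parsing, semantic checks, code generation) would consume this token list; they are out of scope for the sketch.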

S2: using web page collection technology, take the user pages available after login as source data, including the user's home page, the comment/repost pages, the follower page and so on, and crawl the page source.

S3: parse the source data of the crawled page source.

Although the information pages of a microblog are also standard HTML files, their content differs greatly from that of ordinary web pages. They store a large amount of information with a certain structure, including: the poster, the posting time, the posting platform, the push relationships and corresponding push frequency, whether the publisher is verified, the user's follow relationships, and so on. The implementation is described as follows:

S31: study the HTML document structure of each microblog web page and summarize the HTML of the post pages of the various microblog sites. Locate the nodes holding the needed posts and replies, author information, posting time, upload channel and so on; summarize how the tags of nodes holding unneeded information differ from the tags of nodes holding needed information; and derive workable rules. For example, in the page structure of the Classmate Net site, the published message lies between the <p> and </p> tags, and the information source and user location lie between the <span> and </span> tags. In the page structure of Sina Weibo, the message lies between the <p> and </p> tags; the user id, which uniquely identifies the user, lies between <li> and </li>; and the information source lies between <strong> and </strong>.
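The tag rules summarized for Sina Weibo can be turned into a small extractor. The following is a minimal sketch using Python's standard html.parser module; the mapping of tags to field names and the function extract_fields are assumptions made for illustration, and a real page would need more specific rules than bare tag names.

```python
from html.parser import HTMLParser

# Tag-to-field mapping taken from the Sina Weibo example in the text.
FIELD_BY_TAG = {"p": "message", "li": "user_id", "strong": "source"}


class PostFieldParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.fields = {name: [] for name in FIELD_BY_TAG.values()}
        self._open = []  # stack of currently open tags of interest

    def handle_starttag(self, tag, attrs):
        if tag in FIELD_BY_TAG:
            self._open.append(tag)
            self.fields[FIELD_BY_TAG[tag]].append("")

    def handle_endtag(self, tag):
        if self._open and self._open[-1] == tag:
            self._open.pop()

    def handle_data(self, data):
        # Text is credited to the innermost open tag of interest.
        if self._open:
            self.fields[FIELD_BY_TAG[self._open[-1]]][-1] += data


def extract_fields(html):
    parser = PostFieldParser()
    parser.feed(html)
    parser.close()
    return {name: [v.strip() for v in values] for name, values in parser.fields.items()}
```

For instance, extract_fields("<li>42</li><p>Hello</p>") yields the user id "42" and the message "Hello", with an empty source list.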

S32: preprocessing. Using the page rules summarized above, simplify the document structure by removing the nodes that hold unneeded information, which reduces the depth and hierarchy of the document. Convert the simplified HTML into XHTML, parse the document into a DOM tree, derive statistics-based rules, and build the metadata feature templates.
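A minimal sketch of this preprocessing step, using only the Python standard library, is shown below. It assumes that script and style elements are the unneeded nodes, that the page has a single root element, and that attribute names are already valid XML names; the class XhtmlBuilder and the function html_to_dom are hypothetical names, and a production system would more likely use a dedicated HTML tidying library.

```python
from html import escape
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

DROP = {"script", "style"}  # assumed "unneeded" nodes
VOID = {"br", "hr", "img", "input", "meta", "link"}


class XhtmlBuilder(HTMLParser):
    """Re-emit HTML as well-formed XHTML, dropping unneeded nodes."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.stack = []      # open elements that still need a closing tag
        self.skip_depth = 0  # greater than zero while inside a dropped node

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        parts = []
        for name, value in attrs:
            parts.append(' %s="%s"' % (name, escape(value or "", quote=True)))
        if tag in VOID:
            self.out.append("<%s%s/>" % (tag, "".join(parts)))
        else:
            self.out.append("<%s%s>" % (tag, "".join(parts)))
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in DROP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth or tag not in self.stack:
            return
        # Close any unclosed descendants first, then the tag itself.
        while self.stack:
            open_tag = self.stack.pop()
            self.out.append("</%s>" % open_tag)
            if open_tag == tag:
                break

    def handle_data(self, data):
        if not self.skip_depth:
            self.out.append(escape(data, quote=False))

    def result(self):
        self.close()
        while self.stack:
            self.out.append("</%s>" % self.stack.pop())
        return "".join(self.out)


def html_to_dom(html):
    """Return (xhtml_text, root_element) for a single-rooted HTML page."""
    builder = XhtmlBuilder()
    builder.feed(html)
    xhtml = builder.result()
    return xhtml, ET.fromstring(xhtml)
```

The resulting element tree is what the template paths in the following steps would be evaluated against.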

S33: template matching. Based on the characteristics of the XHTML document obtained after simplification, design a corresponding algorithm that matches the document against the templates in the template set. If a template matches, the nodes holding the needed information can be located through the paths marked in the template. If none matches, the template set contains no template that can extract this document, and step S32 is performed again to update the templates in the template set.
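One simple way to realize this matching is to treat a template as a set of node paths and accept the first template whose paths all resolve in the document. The sketch below does this with xml.etree.ElementTree; the two templates, their paths and the function match_template are invented for illustration, and the text does not specify the actual matching algorithm.

```python
import xml.etree.ElementTree as ET

# Hypothetical template set: each template maps a field to a node path.
TEMPLATES = [
    {"name": "layout_a",
     "paths": {"user_id": "./body/ul/li", "message": "./body/p", "source": "./body/strong"}},
    {"name": "layout_b",
     "paths": {"user_id": "./body/div/li", "message": "./body/div/p", "source": "./body/div/span"}},
]


def match_template(root, templates):
    """Return the first template whose paths all exist under root, else None.

    None signals that the template set has to be updated (back to S32).
    """
    for template in templates:
        if all(root.find(path) is not None for path in template["paths"].values()):
            return template
    return None
```

Returning None rather than raising keeps the "no match, rebuild templates" branch an ordinary control-flow case for the caller.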

S34: information extraction. According to the node paths of the matched template, extract the needed information, which falls into three categories: the user ID list, basic user information, and user behavior mechanisms. Store it in a defined format.
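The extraction step might look like the following sketch, which reads the three categories through the paths of an already matched template and serializes the record as JSON. The template contents, the class attribute values and the choice of JSON as the storage format are all assumptions; the text only says that the information is stored in a defined format.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical matched template, grouped by the three categories in S34.
TEMPLATE = {
    "user_ids": ".//ul[@class='follow']/li",
    "basic_info": {"name": ".//span[@class='name']", "location": ".//span[@class='loc']"},
    "behaviour": {"reposts": ".//em[@class='repost']", "comments": ".//em[@class='comment']"},
}


def text_of(root, path):
    """Return the stripped text of the first node at path, or None if absent."""
    node = root.find(path)
    return (node.text or "").strip() if node is not None else None


def extract_record(root, template):
    return {
        "user_ids": [(li.text or "").strip() for li in root.findall(template["user_ids"])],
        "basic_info": {k: text_of(root, p) for k, p in template["basic_info"].items()},
        "behaviour": {k: text_of(root, p) for k, p in template["behaviour"].items()},
    }


def to_storage_line(record):
    """Serialize one record as a single JSON line (assumed storage format)."""
    return json.dumps(record, ensure_ascii=False, sort_keys=True)
```

Fields whose path finds no node are stored as null, which makes gaps in a page visible instead of silently dropping them.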

S4: user relationship extraction.

Parse the web page to find the user URLs under Following and Follower, filter the URLs, and put them into the URL queue as the next objects to collect. Combining the current user's URL with the user behavior mechanisms extracted in S3, build the user association relationship table and store it.
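The sketch below illustrates one possible shape for this step: links are normalized and filtered, new users are queued, and each relationship is recorded as a (follower, followee) pair. The base URL weibo.example, the /u/<digits> profile pattern and the function names are hypothetical; the text does not specify the filtering rules or the table layout.

```python
from collections import deque
from urllib.parse import urljoin, urlparse

BASE = "https://weibo.example/"  # hypothetical site, not a real endpoint


def normalise_user_url(href, base=BASE):
    """Return a canonical profile URL, or None if href is not a user page."""
    parts = urlparse(urljoin(base, href))
    if parts.netloc != urlparse(base).netloc:
        return None
    segments = [s for s in parts.path.split("/") if s]
    # Assumed profile pattern: /u/<numeric id>
    if len(segments) != 2 or segments[0] != "u" or not segments[1].isdigit():
        return None
    return "%s://%s/u/%s" % (parts.scheme, parts.netloc, segments[1])


def extract_relations(current_url, following_hrefs, follower_hrefs, queue, seen):
    """Queue unseen users and return deduplicated (follower, followee) edges."""
    edges = []
    for hrefs, outgoing in ((following_hrefs, True), (follower_hrefs, False)):
        for href in hrefs:
            url = normalise_user_url(href)
            if url is None or url == current_url:
                continue
            edge = (current_url, url) if outgoing else (url, current_url)
            if edge not in edges:
                edges.append(edge)
            if url not in seen:
                seen.add(url)
                queue.append(url)
    return edges
```

Storing directed pairs keeps "following" and "being followed" distinguishable in the relationship table.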

S5: traverse the user list breadth-first, collect each user's information, and deduplicate users before storing them in the database.
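A breadth-first traversal with deduplication can be sketched as follows. The callback fetch_neighbours stands in for steps S2 to S4 (crawl, parse, extract) and is an assumption, as is the max_users cap; here it is expected to return a pair of lists, the users followed by the given user and the users following them.

```python
from collections import deque


def crawl_breadth_first(seed, fetch_neighbours, max_users=1000):
    """Visit users breadth-first from seed; return (visit order, edge set).

    Edges are (follower, followee) pairs. A user is queued at most once,
    which is the deduplication step before storage.
    """
    visited = {seed}
    queue = deque([seed])
    order = []
    edges = set()
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)
        following, followers = fetch_neighbours(user)
        for other in following:
            edges.add((user, other))
        for other in followers:
            edges.add((other, user))
        for other in list(following) + list(followers):
            if other not in visited:
                visited.add(other)
                queue.append(other)
    return order, edges
```

Marking a user as visited when it is queued, rather than when it is dequeued, prevents the same user from being queued by several neighbours.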

Claims

1. A method for acquiring microblog user relationships based on a script engine, characterized by comprising the following steps:

S1, using script engine technology to log in to the microblog website automatically;

S3, using metadata parsing to parse the user ids and user behavior mechanisms therein, obtaining the user information;

S4, extracting the user association relationships according to the user behavior mechanisms;

S5, traversing the user list breadth-first and repeating the above steps for each collected user id, enriching the set of user relationships.

2. The acquisition method according to claim 1, characterized in that, in step S1, script engine technology is used to log in to the microblog website automatically.

3. The acquisition method according to claim 1, characterized in that web page collection is used to fetch content page information from the microblog website after automatic login.

4. The acquisition method according to claim 1, characterized in that, in step S3, metadata parsing is used to obtain information such as the user list, basic user information and user behavior mechanisms.

5. The acquisition method according to claim 1, characterized in that the user behavior mechanisms comprise: the follow mechanism ("following" and "being followed"), and the repost and comment mechanisms for information pushed by users.

6. The acquisition method according to claim 1, characterized in that, in step S4, the acquisition of user association relationships is influenced by the basic user information and the user behavior mechanisms.

7. The acquisition method according to claim 1, characterized in that the user list is traversed breadth-first to achieve incremental collection.

8. The acquisition method according to claim 2, characterized in that JavaScript is used to implement the scripting function of the configuration software, and SpiderMonkey is used as the embedded engine of the configuration software's script module.

9. The acquisition method according to claim 4, characterized in that, according to certain structural rules, the HTML document is converted into standard XHTML, the document is parsed into a DOM tree, and the metadata feature templates are built.

10. The acquisition method according to claim 4, characterized in that a matching algorithm is designed according to the characteristics of the simplified XHTML document, in order to locate the needed information or to update the templates in the template set.

11. The acquisition method according to claim 4, characterized in that the matched information comprises three categories: user ID, basic user information, and user behavior mechanisms.