CN103377207A - Method for acquiring microblog user relationships on basis of script engines - Google Patents

Method for acquiring microblog user relationships on basis of script engines

Info

Publication number
CN103377207A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210114869XA
Other languages
Chinese (zh)
Other versions
CN103377207B (en)
Inventor
都云程
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
TOLS INFORMATION TECHNOLOGY Co.,Ltd.
Original Assignee
BEIJING TRS INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING TRS INFORMATION TECHNOLOGY Co Ltd
Priority to CN201210114869.XA
Publication of CN103377207A
Application granted
Publication of CN103377207B
Current legal status: Active
Anticipated expiration

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention relates to the technical field of information acquisition, and discloses a method for acquiring microblog user relationships on the basis of script engines. The method includes steps of S1, automatically logging in a microblog website by the aid of a script engine technology; S2, crawling on specific account information in a website acquisition mode to acquire content page information corresponding to the specific account information; S3, analyzing metadata and acquiring a user list, user behavior mechanisms and user basic information; S4, extracting the user relationships; S5, performing breadth-first searching on the user list and enriching the user association relationships. The method has the advantages that the problem of API (application program interface) access restriction is solved, the method is favorable for acquiring microblog information on a large scale, and the information acquisition accuracy is improved.

Description

Method for acquiring microblog user relationships based on a script engine
Technical field
The invention belongs to the field of information technology and, specifically, relates to a method for acquiring microblog user relationships based on a script engine.
Background art
With the rapid development of Web information technology, research on social networks of entities has drawn close attention from academia and industry. Social networks have expanded exponentially with the rise of a new Internet model, the microblog. Services such as Facebook, LinkedIn and Sina contain large numbers of user relationships, and these relationships hold considerable commercial value.
Extracting microblog user relationships is a basic task in the real-time collection of massive microblog data. User relationships help define the strategy for updating collected microblog information, can serve as leads for real-time update collection during large-scale collection, and are a basic resource for in-depth microblog research.
At present, microblog user relationships are extracted mainly through the open APIs of microblog services, relying on the "Following" and "Followed" rules specific to microblogs. The quantity, scope and frequency of the information obtained are therefore limited by the microblog API. This approach has several shortcomings. First, the acquisition system can obtain only limited data, according to the frequency and scope it has applied for. Second, different APIs impose different access-frequency limits, which hinders dynamic updating of the data. Third, the extracted user information and user relationships are incomplete, so the acquisition rate is low.
Summary of the invention
(1) Technical problem to be solved
The technical problem to be solved by the present invention is how to collect user information from microblogs at scale, improve the acquisition rate, and build comparatively complete user relationships.
(2) Technical solution
To solve the above technical problem, the invention provides a method for acquiring user relationships based on a script engine, the method comprising the following steps:
S1: use script engine technology to log in to the microblog website automatically, enabling high-accuracy collection from the microblog website;
S2: use web page acquisition to crawl the content pages corresponding to specific account information;
S3: use metadata parsing to parse the user information and user behavior mechanisms in those pages and obtain the user information;
S4: on the basis of S3, use the user behavior mechanisms to extract user association relationships, and store them;
S5: traverse the user list breadth-first, repeat the above steps for each collected user id, and continually enrich the user relationship list with the collected information.
In step S1, JavaScript is used to implement the script function of the configuration software, and SpiderMonkey is used as the embedded engine of the configuration software's script module; only the scripts in the page that relate to link generation and microblog content are parsed.
In step S1, the interpreter in the script engine is extended so that it supports both interpreted and compiled execution modes.
The goal of the script engine framework design is to embed SpiderMonkey in the engine module of the configuration software so that it has basic JavaScript processing capability. The implementation steps specifically comprise:
S11: create the engine wrapper class JSEngine;
S12: implement the initialization export function InitScript() of the script engine;
S13: implement the unload export function UnInitScript() of the script engine.
Step S3 is specifically implemented as follows:
S31: summarize the HTML document structure of each microblog web page and identify how the tags of different nodes differ;
S32: filter out invalid information according to the HTML structure rules from S31, convert the HTML into XHTML to obtain a standard XHTML document, parse the document into a DOM tree, and build metadata feature templates;
S33: match templates: design an algorithm, based on the characteristics of the XHTML document, that matches the document against the templates in the template set;
S34: extract the required information according to the node paths of the matched template, and store it in a defined format.
In step S3, the user behavior mechanisms comprise the follow mechanism ("following" and "being followed") and the forwarding and comment mechanisms for information pushed by users.
In step S4, user relationships are extracted as follows:
S41: find the user URLs of Following and Follower;
S42: filter the URLs and put them into the URL queue as objects to be collected;
S43: combine them with the current user's URL and, according to the user behavior mechanisms from S3, build the user association table and store it.
In step S5, the user list is traversed breadth-first, the information of each user in the list is collected locally, and users are de-duplicated before being stored in the database.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the script-engine-based method for acquiring microblog user relationships provided by the invention.
Detailed description of the embodiments
Specific embodiments of the invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the invention but do not limit its scope.
Fig. 1 shows the schematic flow chart of the microblog user relationship acquisition method provided by an embodiment of the invention. As shown in Fig. 1, the method comprises the following steps:
S1: use script engine technology to log in to the microblog website automatically.
A microblog is a kind of community website, and information can be collected only after logging in with a user identity. To protect users, the authentication process of microblog websites is complex and strict. Microblogs are a Web 2.0 form of presentation and, at the implementation level, make heavy use of scripts. A microblog collector does not need to parse every script in a page; it only needs to parse the scripts related to link generation and microblog content.
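The idea of parsing only the scripts that matter can be illustrated with a minimal Python sketch. It is not the patented implementation: the keyword list `KEYWORDS` and the names `ScriptCollector` and `relevant_scripts` are assumptions made for illustration, and a real collector would need site-specific rules for recognizing link-generating scripts.

```python
from html.parser import HTMLParser

# Hypothetical hints that a script builds links or renders microblog content.
KEYWORDS = ("location.href", "window.location", "follow", "feed")


class ScriptCollector(HTMLParser):
    """Collects the text of every <script> element in a page."""

    def __init__(self):
        super().__init__()
        self.scripts = []
        self._buf = None  # list of text chunks while inside <script>, else None

    def handle_starttag(self, tag, attrs):
        if tag == "script":
            self._buf = []

    def handle_data(self, data):
        if self._buf is not None:
            self._buf.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._buf is not None:
            self.scripts.append("".join(self._buf))
            self._buf = None


def relevant_scripts(html, keywords=KEYWORDS):
    """Return only the scripts that mention one of the keywords."""
    collector = ScriptCollector()
    collector.feed(html)
    collector.close()
    return [s for s in collector.scripts if any(k in s for k in keywords)]
```

Only the scripts returned by `relevant_scripts` would then be handed to the script engine, which keeps the amount of JavaScript executed per page small.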
SpiderMonkey is used as the embedded engine of the configuration software's script module, giving the configuration software JavaScript processing capability. The specific implementation is as follows:
S11: download the latest version of SpiderMonkey and copy js32d.dll and js32.dll from the src directory into the system directory; the dynamic link library js32d.dll supports the Debug build and js32.dll supports the Release build. Create the engine wrapper class JSEngine, which encapsulates the most basic settings and initialization functions for initializing the engine. Calling its core function Create() completes the initialization of the script engine's elements, including creating the runtime environment (Runtime), creating the context (Context), and creating and initializing the global object (GlobalObject).
S12: implement the initialization export function InitScript() of the script engine, which calls the member function Create() of the JSEngine class to initialize the script environment.
S13: implement the unload function UnInitScript() of the script engine, which frees the memory allocated during script engine initialization when the application exits. The main steps are: release the created context handle; release the runtime environment together with the variables and object space on it; and release the space allocated during initialization for global structures and the like.
JavaScript is used to implement the script function of the configuration software. This involves defining the symbol table and implementing the lexical and syntax analyzers, the semantic checker, the intermediate code generator, the code optimizer and the code generator, and finally producing the script engine virtual machine. The advantages are simple development, flexible functionality and good platform independence.
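The create/initialize/unload lifecycle of S11 to S13 can be sketched as follows. This is a pure Python stand-in that does not bind to SpiderMonkey: the dictionaries merely represent the runtime, context and global object, and the function names `init_script`, `uninit_script` and `engine_ready` are illustrative counterparts of InitScript() and UnInitScript(), not the patent's code. The point it shows is the ordering: resources are created in dependency order and released in reverse.

```python
class JSEngine:
    """Hypothetical stand-in for the engine wrapper class described in S11."""

    def __init__(self):
        self.runtime = None
        self.context = None
        self.global_object = None

    def create(self):
        # Acquire in dependency order: runtime, then context, then global object.
        self.runtime = {"name": "runtime"}
        self.context = {"runtime": self.runtime}
        self.global_object = {}
        return True

    def destroy(self):
        # Release in the reverse order of creation.
        self.global_object = None
        self.context = None
        self.runtime = None


_engine = None  # single engine instance owned by this module


def init_script():
    """Counterpart of InitScript(): build the engine once and report success."""
    global _engine
    if _engine is None:
        engine = JSEngine()
        if not engine.create():
            return False
        _engine = engine
    return True


def uninit_script():
    """Counterpart of UnInitScript(): free what init_script() allocated."""
    global _engine
    if _engine is not None:
        _engine.destroy()
        _engine = None


def engine_ready():
    """True while an initialized engine exists."""
    return _engine is not None
```

Making both functions safe to call more than once is a design choice of this sketch; it avoids double initialization and double release when the host application's startup and shutdown paths overlap.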
S2: use web page acquisition technology, taking the user's pages after login as source data, including the user's home page, comment/forward pages and follower pages, and crawl the page source.
S3: parse the source data of the crawled page source:
Although an information page on a microblog is also a standard HTML file, its content differs greatly from that of an ordinary web page. It stores a large amount of information with definite structural features, including the poster, posting time, posting platform, push relationships and the corresponding push frequency, whether the publisher is verified, and the user's follow relationships. The implementation is described below:
S31: study the HTML document structure of each microblog web page, and analyze and summarize the HTML of the post pages of the various microblog sites. Find the nodes that hold the required posts and replies, author information, posting time, upload channel and so on; summarize how the tags of nodes holding unwanted information differ from the tags of nodes holding wanted information; and derive workable rules. For example, in the page structure of the "classmate net" site, the published message lies between the <p> and </p> tags, and the information source and user location lie between the <span> and </span> tags. In the page structure of Sina Weibo, the message lies between <p> and </p>; the user id, which uniquely identifies the user, lies between <li> and </li>; and the information source lies between <strong> and </strong>.
S32: preprocessing. Using the summarized page rules, simplify the document structure by removing the nodes that hold unwanted information, which reduces the depth and nesting of the document. After converting the simplified HTML into XHTML, parse the document into a DOM tree, summarize statistically based rules, and build the metadata feature templates.
S33: based on the characteristics of the simplified XHTML document, design an algorithm that matches the document against the templates in the template set. If a template matches, the nodes holding the required information can be located through the paths marked in the template. If none matches, the template set has no template that can extract this file, and step S32 is repeated to update the templates in the template set.
S34: information extraction. Following the node paths of the matched template, extract the required information, which falls into three categories: the user ID list, basic user information and user behavior mechanisms. Store it in a defined format.
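Steps S32 to S34 can be sketched in Python using only the standard library. This is an illustration under stated assumptions, not the patented algorithm: instead of emitting XHTML text and re-parsing it, the sketch builds a simplified element tree directly from lenient HTML; the template `post_v1`, its class names `uid` and `msg`, and the rule "a template matches when every one of its paths resolves" are all hypothetical.

```python
from html.parser import HTMLParser
import xml.etree.ElementTree as ET

DROP = {"script", "style"}  # subtrees treated as unwanted information (assumption)
VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags without a closing tag

# Hypothetical template set: field name -> node path in the simplified tree.
TEMPLATES = {
    "post_v1": {
        "user_id": ".//li[@class='uid']",
        "message": ".//p[@class='msg']",
        "source": ".//strong",
    },
}


class TreeBuilder(HTMLParser):
    """S32: builds a simplified element tree, skipping DROP subtrees."""

    def __init__(self):
        super().__init__()
        self.root = ET.Element("root")
        self.stack = [self.root]
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP:
            self.skip_depth += 1
            return
        if self.skip_depth:
            return
        # Valueless attributes arrive as None; store them as empty strings.
        node = ET.SubElement(self.stack[-1], tag, {k: v or "" for k, v in attrs})
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        if tag in DROP:
            self.skip_depth = max(0, self.skip_depth - 1)
            return
        if self.skip_depth:
            return
        # Close back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.skip_depth or not data.strip():
            return
        node = self.stack[-1]
        if len(node):  # text after a closed child belongs to that child's tail
            node[-1].tail = (node[-1].tail or "") + data
        else:
            node.text = (node.text or "") + data


def match_template(tree, templates):
    """S33: name of the first template whose every path resolves, else None."""
    for name, paths in templates.items():
        if all(tree.find(p) is not None for p in paths.values()):
            return name
    return None


def parse_page(html, templates=TEMPLATES):
    """S32-S34 end to end; None means no template fits and S32 must be redone."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    name = match_template(builder.root, templates)
    if name is None:
        return None
    # S34: follow the matched template's node paths and collect the text.
    return {field: "".join(builder.root.find(p).itertext()).strip()
            for field, p in templates[name].items()}
```

Returning `None` when nothing matches mirrors the fallback in S33: the caller can queue the page for template derivation rather than silently extracting nothing.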
S4: extract user relationships:
Parse the web page to find the user URLs of Following and Follower, filter the URLs and put them into the URL queue as the next acquisition targets. Combine them with the current user's URL and, according to the user behavior mechanisms extracted in S3, build the user association table and store it.
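A minimal Python sketch of this step is given below. The URL shape `/u/<digits>`, the host `weibo.example` used in the usage note, and the function names are assumptions for illustration; the source does not specify how URLs are filtered or how the association table is laid out. Here the table is a set of directed edges `(follower, followed)`.

```python
from collections import deque
from urllib.parse import urljoin, urlparse


def normalize(base_url, href):
    """Resolve a link against the current page and drop query and fragment."""
    parts = urlparse(urljoin(base_url, href))
    return f"{parts.scheme}://{parts.netloc}{parts.path.rstrip('/')}"


def is_user_url(url, host):
    """Hypothetical filter: keep only /u/<digits> profile URLs on the same host."""
    parts = urlparse(url)
    segs = parts.path.strip("/").split("/")
    return (parts.netloc == host and len(segs) == 2
            and segs[0] == "u" and segs[1].isdigit())


def extract_relations(current_url, following, followers, queue, seen, table):
    """Add follow edges for the current user and enqueue newly seen users."""
    host = urlparse(current_url).netloc
    for hrefs, kind in ((following, "following"), (followers, "follower")):
        for href in hrefs:
            url = normalize(current_url, href)
            if not is_user_url(url, host) or url == current_url:
                continue
            # Edge direction is (follower, followed).
            edge = (current_url, url) if kind == "following" else (url, current_url)
            table.add(edge)
            if url not in seen:
                seen.add(url)
                queue.append(url)
```

For example, calling `extract_relations("http://weibo.example/u/1", ["/u/2"], ["/u/3"], queue, seen, table)` with an empty `deque`, a `seen` set containing the current URL and an empty `table` set records that user 1 follows user 2 and that user 3 follows user 1, and enqueues users 2 and 3 for later collection.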
S5: traverse the user list breadth-first, collect each user's information, and de-duplicate users before storing them in the database.
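The breadth-first traversal with de-duplication in S5 can be sketched as follows. The callback `fetch_neighbors`, which stands for steps S2 to S4 applied to one user, and the `max_users` cap are assumptions added for illustration.

```python
from collections import deque


def crawl_bfs(seed, fetch_neighbors, max_users=1000):
    """Visit users level by level from seed; returns user ids in visit order."""
    seen = {seed}          # every user ever enqueued, used for de-duplication
    queue = deque([seed])
    order = []
    while queue and len(order) < max_users:
        user = queue.popleft()
        order.append(user)  # this is where the user's data would be stored
        for nxt in fetch_neighbors(user):
            if nxt not in seen:  # skip users already queued or visited
                seen.add(nxt)
                queue.append(nxt)
    return order
```

Marking a user as seen at enqueue time, rather than at visit time, keeps the same user from entering the queue twice when several already-visited users follow them.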

Claims (11)

1. A method for acquiring microblog user relationships based on a script engine, characterized in that it comprises the following steps:
S1: use script engine technology to log in to the microblog website automatically;
S2: use web page acquisition to crawl the content pages corresponding to specific account information;
S3: use metadata parsing to parse the user id and user behavior mechanisms in those pages and obtain the user information;
S4: extract user association relationships according to the user behavior mechanisms;
S5: traverse the user list breadth-first, repeat the above steps for each collected user id, and enrich the user relationship set.
2. The acquisition method according to claim 1, characterized in that, in step S1, script engine technology is used to log in to the microblog website automatically.
3. The acquisition method according to claim 1, characterized in that web page acquisition is used to capture content page information from the microblog website after automatic login.
4. The acquisition method according to claim 1, characterized in that, in step S3, metadata parsing is used to obtain information such as the user list, basic user information and user behavior mechanisms.
5. The acquisition method according to claim 1, characterized in that the user behavior mechanisms comprise the follow mechanism ("following" and "being followed") and the forwarding and comment mechanisms for information pushed by users.
6. The acquisition method according to claim 1, characterized in that, in step S4, the acquisition of user association relationships is influenced by basic user information and user behavior mechanisms.
7. The acquisition method according to claim 1, characterized in that the user list is traversed breadth-first to achieve incremental collection.
8. The acquisition method according to claim 2, characterized in that JavaScript is used to implement the script function of the configuration software, and SpiderMonkey is used as the embedded engine of the configuration software's script module.
9. The acquisition method according to claim 4, characterized in that, according to certain structural rules, the HTML document is converted into standard XHTML, the document is parsed into a DOM tree, and metadata feature templates are built.
10. The acquisition method according to claim 4, characterized in that a matching algorithm is designed according to the characteristics of the simplified XHTML document, to locate the required information or to update the templates in the template set.
11. The acquisition method according to claim 4, characterized in that the matched information comprises three categories: user ID, basic user information and user behavior mechanisms.
CN201210114869.XA 2012-04-17 2012-04-17 Microblog users relation acquisition method based on script engine Active CN103377207B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210114869.XA CN103377207B (en) 2012-04-17 2012-04-17 Microblog users relation acquisition method based on script engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210114869.XA CN103377207B (en) 2012-04-17 2012-04-17 Microblog users relation acquisition method based on script engine

Publications (2)

Publication Number Publication Date
CN103377207A (en) 2013-10-30
CN103377207B (en) 2016-12-07

Family

ID=49462335

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210114869.XA Active CN103377207B (en) 2012-04-17 2012-04-17 Microblog users relation acquisition method based on script engine

Country Status (1)

Country Link
CN (1) CN103377207B (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090187515A1 (en) * 2008-01-17 2009-07-23 Microsoft Corporation Query suggestion generation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
康书龙: "基于用户行为及关系的社交网络节点影响力评价", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *
王晓光: "微博社区交流结构及其特征研究", 《中国优秀硕士学位论文全文数据库 信息科技辑》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105045803A (en) * 2015-05-27 2015-11-11 国家计算机网络与信息安全管理中心 Acquisition method and system of social network relationships
CN108256106A (en) * 2018-02-06 2018-07-06 深圳鼎智通讯股份有限公司 A kind of analog access website adapter system
CN108256106B (en) * 2018-02-06 2021-11-02 深圳鼎智通讯股份有限公司 Simulation access website adapter system
CN109492149A (en) * 2018-11-29 2019-03-19 深圳墨世科技有限公司 Crawler task processing method and device
CN109492149B (en) * 2018-11-29 2021-04-09 深圳大宇无限科技有限公司 Crawler task processing method and device

Also Published As

Publication number Publication date
CN103377207B (en) 2016-12-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CP01 Change in the name or title of a patent holder

Address after: 100088 Beijing city Haidian District No. 6 Zhichun Road Jinqiu International Building 14 floor 14B04

Patentee after: TOLS INFORMATION TECHNOLOGY Co.,Ltd.

Address before: 100088 Beijing city Haidian District No. 6 Zhichun Road Jinqiu International Building 14 floor 14B04

Patentee before: BEIJING TRS INFORMATION TECHNOLOGY Co.,Ltd.

CP01 Change in the name or title of a patent holder