CN107818145A - User behavior classification tag extraction method based on a dynamic crawler - Google Patents

Info

Publication number
CN107818145A
Authority
CN
China
Prior art keywords
classification
crawler
tag
user behavior
resource identifier
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710969018.6A
Other languages
Chinese (zh)
Inventor
王攀
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Post Mdt Infotech Ltd
Original Assignee
Nanjing Post Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Post Mdt Infotech Ltd filed Critical Nanjing Post Mdt Infotech Ltd
Priority to CN201710969018.6A priority Critical patent/CN107818145A/en
Publication of CN107818145A publication Critical patent/CN107818145A/en
Pending legal-status Critical Current

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9562 Bookmark management
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a user behavior classification tag extraction method based on a dynamic crawler, comprising the following steps: sensing users' Internet traffic using deep packet inspection, and identifying the resource identifiers of the resources users access to obtain a resource identifier database; inserting the resource identifiers into new request URLs, crawling the resources accessed by users with a crawler, obtaining the classification tags on the pages, and matching them to the corresponding users; extracting, according to a set occurrence frequency, the resource identifiers of the accessed resources and the user behavior classification tags, and storing them in a static crawler library; deleting outdated information from the static crawler library; and, when a user behavior classification tag cannot be matched, extracting it again from the resource identifier of the accessed resource. The invention completes tag matching and extraction by means of a dynamic crawler, provides a data basis for user behavior analysis, and can improve the accuracy of labeling user behavior classification tags.

Description

User behavior classification tag extraction method based on a dynamic crawler
Technical field
The present invention relates to a user behavior classification tag extraction method based on a dynamic crawler, and belongs to the technical field of Internet user data mining.
Background
In recent years, operators that have obtained basic data on users' website visits have carried out behavior analysis on those users and combined it with online marketing strategies, so as to discover problems that may exist in current online marketing activities and to provide a basis for revising or reformulating those strategies. Classification tags are the basic data of user behavior analysis, so how to extract user tags accurately has become a problem to be solved.
Current mainstream implementations of user behavior classification tags all classify user behavior against a static crawler database, and proceed in three main steps. First, all resources on resource websites such as e-commerce and video sites are crawled to build a huge static database of resource names, resource websites, and resource categories. Second, the resource data a user accesses is identified and matched against the static database built in the first step; the classification tag of the matched resource becomes the tag for the user's behavior. Third, the static database is updated at regular intervals by crawling the resource data on the resource websites again.
This mainstream implementation has to crawl every resource on a website. Mainstream resource websites such as e-commerce and video sites are now numerous, and each has hundreds of millions of resources. Each site therefore requires hundreds of millions of crawl requests and ultimately yields hundreds of millions of records. This consumes a great deal of disk space, and because crawling takes so long, it easily triggers the site's anti-crawler measures. The static database also contains many useless resources: many records are never matched by any resource a user accesses, so match utilization is low, which in turn makes matching slow and inefficient. Finally, whenever a resource website adds new resources or changes the resource identifiers of existing ones, the resource library must be updated; if it is not updated in time, the accuracy of user tag extraction drops.
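The low match utilization described above can be made concrete with a small calculation. The following is only an illustrative sketch: the function name and the toy identifiers are hypothetical and do not come from the patent.

```python
def static_library_utilization(static_ids, accessed_ids):
    """Fraction of static-library records that at least one user actually accessed."""
    static_set = set(static_ids)
    if not static_set:
        return 0.0
    matched = static_set & set(accessed_ids)
    return len(matched) / len(static_set)


# Toy data (hypothetical): the site lists five resources,
# but users only ever touched two of them.
crawled = ["r1", "r2", "r3", "r4", "r5"]
accessed = ["r2", "r2", "r5"]
utilization = static_library_utilization(crawled, accessed)  # 2 of 5 records are used
```

With a real site holding hundreds of millions of resources, the unused share of the static database would be correspondingly larger, which is the overhead the method sets out to avoid.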
Summary of the invention
The technical problem to be solved by the invention is to overcome the deficiencies of the prior art by providing a user behavior classification tag extraction method based on a dynamic crawler. The method addresses the storage overhead of static crawler databases and the need for high-frequency crawling to keep some site resources up to date, and, by using a dynamic crawler, improves the accuracy of labeling user behavior classification tags.
To solve the above technical problem, the invention adopts the following technical solution:
A user behavior classification tag extraction method based on a dynamic crawler, characterized by comprising the following steps:
Step 1: sense users' Internet traffic using deep packet inspection, and identify the resource identifiers of the resources users access to obtain a resource identifier database;
Step 2: insert the resource identifiers from the database obtained in Step 1 into new request URLs, crawl the new request URLs with a crawler, obtain the classification tags on the resource web pages, and match those tags to the corresponding users as user behavior classification tags;
Step 3: according to a set occurrence frequency, extract the resource identifiers of accessed resources obtained in Step 1 and the user behavior classification tags obtained in Step 2, and store them in a static crawler library; delete outdated information from the static crawler library to obtain an updated resource classification library for the next round of extraction and matching; when a user behavior classification tag cannot be matched, use Step 2 to extract it again from the resource identifier of the accessed resource.
Further, as a preferred technical solution of the invention, Step 1 also includes deduplicating the resource identifier database.
Further, as a preferred technical solution of the invention, Step 1 also includes sorting the resource identifiers in the resource identifier database by occurrence frequency (popularity).
By adopting the above technical solution, the invention can produce the following technical effects:
The invention provides a method for efficiently extracting user behavior tags: by means of a dynamic crawler, tags are extracted and matched to users, providing a data basis for user behavior analysis. It addresses the storage overhead of static crawler databases and the need for high-frequency crawling to update some site resources, and improves the accuracy of labeling user behavior classification tags.
The invention therefore has the following advantages:
(1) The method only needs to build a resource classification library of frequently accessed (hot) resources, which avoids the large storage overhead of a static crawler database and can improve tag matching performance;
(2) The method gives the system a better ability to resist signal collisions, and both system throughput and propagation delay are improved.
Brief description of the drawings
Fig. 1 is a schematic diagram of the principle of the user behavior classification tag extraction method based on a dynamic crawler.
Fig. 2 is a schematic diagram of user access records in the invention.
Fig. 3 is a schematic diagram of the popularity ranking of resource identifiers in the invention.
Fig. 4 is a schematic diagram of the hot, high-frequency static crawler database in the invention.
Fig. 5 is a schematic diagram of the matching results in the invention.
Detailed description
Embodiments of the invention are described below with reference to the accompanying drawings.
As shown in Fig. 1, the invention provides a user behavior classification tag extraction method based on a dynamic crawler, which specifically comprises the following steps:
Step 1: sense users' Internet traffic using deep packet inspection, and identify the resource identifiers of the resources users access, such as e-commerce items and videos, to obtain a resource identifier database.
Preferably, this step also includes deduplicating the resource identifier database to reduce the data volume, and sorting the identifiers by occurrence frequency (popularity).
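The deduplication and popularity ranking in Step 1 can be sketched as follows. This is a minimal illustration, assuming access records arrive as (user_id, resource_id) pairs; that record layout and the function name are assumptions, not something the patent specifies.

```python
from collections import Counter


def rank_resource_ids(access_records):
    """Deduplicate resource identifiers and rank them by access count.

    access_records: iterable of (user_id, resource_id) pairs (assumed layout).
    Returns a list of (resource_id, count), most frequent first; ties are
    broken by resource_id so the order is deterministic.
    """
    counts = Counter(resource_id for _user, resource_id in access_records)
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical access records.
records = [("u1", "A"), ("u2", "A"), ("u1", "B"),
           ("u3", "C"), ("u2", "C"), ("u3", "A")]
ranking = rank_resource_ids(records)
# ranking == [("A", 3), ("C", 2), ("B", 1)]
```

Counting and deduplicating in one pass keeps each identifier once while preserving the frequency information that Step 3 later uses to decide what to store.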
Step 2: insert the resource identifiers from the database obtained in Step 1 into new request URLs and crawl the new request URLs with a crawler, for example crawling the e-commerce and video resources users accessed by their resource identifiers. Obtain the classification tags on the resource web pages and match them to the corresponding users as user behavior classification tags, providing the data basis for user behavior analysis.
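A minimal sketch of Step 2 follows, using only the Python standard library. The URL template, the `class="category"` markup, and the injected `fetch` callable are all hypothetical stand-ins: the patent does not say how a site's request URL is formed or where the classification tag sits on the page.

```python
from html.parser import HTMLParser
from urllib.parse import quote


class CategoryParser(HTMLParser):
    """Collects the first text found inside an element with class="category"
    (hypothetical markup; a real site needs its own extraction rule)."""

    def __init__(self):
        super().__init__()
        self._inside = False
        self.category = None

    def handle_starttag(self, tag, attrs):
        if self.category is None and dict(attrs).get("class") == "category":
            self._inside = True

    def handle_data(self, data):
        if self._inside and data.strip():
            self.category = data.strip()
            self._inside = False

    def handle_endtag(self, tag):
        self._inside = False


def build_request_url(template, resource_id):
    """Insert a resource identifier into a request URL template containing {rid}."""
    return template.format(rid=quote(str(resource_id), safe=""))


def crawl_tag(resource_id, template, fetch):
    """Fetch the resource page and return its classification tag, or None.

    fetch: callable taking a URL and returning the page HTML as a string.
    It is injected so the sketch stays independent of any HTTP client.
    """
    html = fetch(build_request_url(template, resource_id))
    parser = CategoryParser()
    parser.feed(html)
    return parser.category
```

In practice `fetch` would wrap an HTTP client with rate limiting; passing it in as a parameter also makes the extraction logic easy to exercise against saved pages.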
Step 3: according to a set occurrence frequency, extract the resource identifiers of accessed resources obtained in Step 1 and the user behavior classification tags obtained in Step 2, and store them in a static crawler library. Periodically, or daily, delete outdated information from the static crawler library, yielding an updated classification library of recently hot, high-frequency e-commerce, video, and other resources for the next round of extraction and matching.
Finally, when a user behavior classification tag cannot be matched, use Step 2 to dynamically extract it again from the resource identifier of the accessed resource.
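The interplay between the hot static library and the dynamic fallback can be sketched as below. This is a streaming simplification made for illustration: the class name, the `min_count` threshold parameter, and the `crawl` callable (standing in for Step 2) are assumptions, and the patent itself describes selecting entries by a set occurrence frequency rather than this exact bookkeeping.

```python
class HotTagLibrary:
    """Small library of tags for frequently seen resources, with a crawl fallback."""

    def __init__(self, min_count, crawl):
        self.min_count = min_count  # occurrences required before a tag is stored
        self.crawl = crawl          # callable: resource_id -> tag or None (Step 2)
        self.tags = {}              # resource_id -> classification tag
        self.counts = {}            # resource_id -> number of times seen

    def lookup(self, resource_id):
        """Return the tag for resource_id, crawling only when the library misses."""
        self.counts[resource_id] = self.counts.get(resource_id, 0) + 1
        tag = self.tags.get(resource_id)
        if tag is None:
            # No match in the static library: fall back to the dynamic crawl.
            tag = self.crawl(resource_id)
            if tag is not None and self.counts[resource_id] >= self.min_count:
                self.tags[resource_id] = tag
        return tag
```

Only resources that reach the threshold are stored, so the library stays small, while rarely accessed resources are still tagged on demand through the crawl.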
To verify that the method of the invention can extract user behavior tags efficiently, a verification example is described below.
Step 1: first, using deep packet inspection as in the method of the invention, identify the URLs of the resource websites users access, obtaining the user access records shown in Fig. 2.
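Turning an observed URL into a resource identifier might look like the sketch below. It assumes, purely for illustration, that the identifier is carried in a query parameter named `id`; real sites encode identifiers in many ways, and the patent does not specify a format.

```python
from urllib.parse import parse_qs, urlsplit


def extract_resource_id(url, id_param="id"):
    """Return (site, resource_id) for a URL, or None if no identifier is present.

    id_param is a hypothetical query parameter name.
    """
    parts = urlsplit(url)
    values = parse_qs(parts.query).get(id_param)
    if not values:
        return None
    return parts.netloc, values[0]


# Hypothetical URL as it might appear in an access record.
record = extract_resource_id("https://shop.example.com/item?id=12345&ref=home")
# record == ("shop.example.com", "12345")
```

Keeping the site name alongside the identifier matters because the same identifier value can refer to different resources on different sites.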
Then, the resource identifiers of the e-commerce, video, and other resources identified in user accesses are deduplicated to reduce the data volume and sorted by occurrence frequency (popularity), giving the ranking shown in Fig. 3.
Step 2: crawl, by resource identifier, the e-commerce, video, and other resources users accessed, obtain the classification tags on the pages, and match them to the corresponding users as user behavior classification tags.
Step 3: extract the resource identifiers with relatively high occurrence frequency together with their classification tags, and place them in the static crawler library, forming the static crawler database shown in Fig. 4. Delete outdated information from the static crawler library daily, yielding a classification library of recently hot, high-frequency e-commerce, video, and other resources for the next round of user behavior classification tag extraction and matching. Finally, for user behavior classification tags that cannot be matched, use Step 2 to extract them again.
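The daily deletion of outdated information could be implemented as a simple age-based purge. The sketch below assumes each library entry records when the resource was last seen; the entry layout and the seven-day window in the example are hypothetical, since the patent does not define what counts as outdated.

```python
from datetime import datetime, timedelta


def purge_outdated(library, now, max_age):
    """Return a new library containing only entries seen within max_age of now.

    library: dict mapping resource_id -> (tag, last_seen datetime) (assumed layout).
    """
    return {
        resource_id: (tag, last_seen)
        for resource_id, (tag, last_seen) in library.items()
        if now - last_seen <= max_age
    }


# Hypothetical library contents.
library = {
    "A": ("Phones", datetime(2017, 10, 17)),
    "B": ("Drama", datetime(2017, 10, 1)),
}
fresh = purge_outdated(library, now=datetime(2017, 10, 18), max_age=timedelta(days=7))
# fresh keeps only "A"; "B" was last seen more than seven days earlier
```

Returning a new dictionary rather than mutating in place lets a scheduled job swap the library atomically once the purge has finished.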
The final user behavior classification tag extraction results are shown in Fig. 5, where the extracted user behavior classification tags are listed by user identity.
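Grouping the extracted tags by user identity can be sketched as follows. The record layout and the sample tags are hypothetical and are not taken from Fig. 5.

```python
from collections import defaultdict


def tags_by_user(access_records, tag_of):
    """Map each user to the sorted set of tags of the resources they accessed.

    access_records: iterable of (user_id, resource_id) pairs (assumed layout).
    tag_of: dict mapping resource_id -> classification tag; resources without
    a known tag are skipped.
    """
    result = defaultdict(set)
    for user_id, resource_id in access_records:
        tag = tag_of.get(resource_id)
        if tag is not None:
            result[user_id].add(tag)
    return {user: sorted(tags) for user, tags in result.items()}


# Hypothetical data.
records = [("u1", "A"), ("u1", "B"), ("u2", "A"), ("u2", "Z")]
tag_of = {"A": "Phones", "B": "Drama"}
profile = tags_by_user(records, tag_of)
# profile == {"u1": ["Drama", "Phones"], "u2": ["Phones"]}
```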
In summary, the invention uses a dynamic crawler to extract tags and match them to users, providing a data basis for user behavior analysis. It addresses the storage overhead of static crawler databases and the need for high-frequency crawling to update some site resources, and improves the accuracy of labeling user behavior classification tags.
The embodiments of the invention have been explained in detail above with reference to the drawings, but the invention is not limited to these embodiments; within the knowledge possessed by a person of ordinary skill in the art, various changes can be made without departing from the spirit of the invention.

Claims (4)

1. A user behavior classification tag extraction method based on a dynamic crawler, characterized by comprising the following steps:
Step 1: sense users' Internet traffic, and identify the resource identifiers of the resources users access to obtain a resource identifier database;
Step 2: insert the resource identifiers from the database obtained in Step 1 into new request URLs, crawl the new request URLs with a crawler, obtain the classification tags on the resource web pages, and match those tags to the corresponding users as user behavior classification tags;
Step 3: according to a set occurrence frequency, extract the resource identifiers of accessed resources obtained in Step 1 and the user behavior classification tags obtained in Step 2, and store them in a static crawler library; at the same time, delete outdated information from the static crawler library to obtain an updated resource classification library for the next round of extraction and matching; when a user behavior classification tag cannot be matched, use Step 2 to extract it again from the resource identifier of the accessed resource.
2. The user behavior classification tag extraction method based on a dynamic crawler according to claim 1, characterized in that Step 1 further includes deduplicating the resource identifier database.
3. The user behavior classification tag extraction method based on a dynamic crawler according to claim 1, characterized in that Step 1 further includes sorting the resource identifiers in the resource identifier database by occurrence frequency (popularity).
4. The user behavior classification tag extraction method based on a dynamic crawler according to claim 1, characterized in that in Step 1 users' Internet traffic is sensed using deep packet inspection.
CN201710969018.6A 2017-10-18 2017-10-18 User behavior classification tag extraction method based on a dynamic crawler Pending CN107818145A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710969018.6A CN107818145A (en) 2017-10-18 2017-10-18 User behavior classification tag extraction method based on a dynamic crawler

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710969018.6A CN107818145A (en) 2017-10-18 2017-10-18 User behavior classification tag extraction method based on a dynamic crawler

Publications (1)

Publication Number Publication Date
CN107818145A true CN107818145A (en) 2018-03-20

Family

ID=61608110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710969018.6A Pending CN107818145A (en) 2017-10-18 2017-10-18 User behavior classification tag extraction method based on a dynamic crawler

Country Status (1)

Country Link
CN (1) CN107818145A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875781A (en) * 2018-05-07 2018-11-23 腾讯科技(深圳)有限公司 A kind of labeling method, apparatus, electronic equipment and storage medium
CN111400627A (en) * 2020-03-09 2020-07-10 政采云有限公司 Information acquisition method and device, electronic equipment and readable storage medium
CN112000748A (en) * 2020-07-14 2020-11-27 北京神州泰岳智能数据技术有限公司 Data processing method and device, electronic equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102143224A (en) * 2011-01-25 2011-08-03 张金海 Mobile phone Internet accessing-based user behavior analysis method and device
CN106446115A (en) * 2016-09-18 2017-02-22 成都九鼎瑞信科技股份有限公司 Mobile Internet user classification method and device
CN106484889A (en) * 2016-10-18 2017-03-08 合信息技术(北京)有限公司 The flooding method and apparatus of Internet resources
CN106815297A (en) * 2016-12-09 2017-06-09 宁波大学 A kind of academic resources recommendation service system and method
CN107124653A (en) * 2017-05-16 2017-09-01 四川长虹电器股份有限公司 The construction method of TV user portrait

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108875781A (en) * 2018-05-07 2018-11-23 腾讯科技(深圳)有限公司 A kind of labeling method, apparatus, electronic equipment and storage medium
CN108875781B (en) * 2018-05-07 2022-08-19 腾讯科技(深圳)有限公司 Label classification method and device, electronic equipment and storage medium
CN111400627A (en) * 2020-03-09 2020-07-10 政采云有限公司 Information acquisition method and device, electronic equipment and readable storage medium
CN111400627B (en) * 2020-03-09 2023-07-07 政采云有限公司 Information acquisition method and device, electronic equipment and readable storage medium
CN112000748A (en) * 2020-07-14 2020-11-27 北京神州泰岳智能数据技术有限公司 Data processing method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN102929928B (en) Multidimensional-similarity-based personalized news recommendation method
US9405746B2 (en) User behavior models based on source domain
CN103546326A (en) Website traffic statistic method
CN104869009B (en) The system and method for website data statistics
CN103218431B A system capable of automatically identifying and collecting web page information
CN102164186B (en) Method and system for realizing cloud search service
CN102831114B (en) Realize method and the device of internet user access Statistic Analysis
CN104394118A (en) User identity identification method and system
CN102831193A (en) Topic detecting device and topic detecting method based on distributed multistage cluster
CN104899273A (en) Personalized webpage recommendation method based on topic and relative entropy
WO2017084205A1 (en) Network user identity authentication method and system
CN101409690A (en) Method and system for obtaining internet user behaviors
CN103530429B (en) Webpage content extracting method
CN104216889B (en) Data dissemination analyzing and predicting method and system based on cloud service
CN107818145A User behavior classification tag extraction method based on a dynamic crawler
CN105183873A (en) Malicious clicking behavior detection method and device
CN102710795A (en) Hotspot collecting method and device
CN102968510B (en) The searching method of internet personage information and system
CN102769818A (en) Method and device for pushing information in mobile internet
CN109947935A (en) The generation method and device of media event
CN104699851A (en) Service tag extension method in big data environment
CN104765823A (en) Method and device for collecting website data
CN103745383A (en) Method and system of realizing redirection service based on operator data
CN107086925B (en) Deep learning-based internet traffic big data analysis method
CN105653550B (en) Webpage filtering method and device

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180320

WD01 Invention patent application deemed withdrawn after publication