CN110275998A - Method and device for determining webpage attribute data - Google Patents

Method and device for determining webpage attribute data

Info

Publication number
CN110275998A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201810219804.9A
Other languages
Chinese (zh)
Other versions
CN110275998B (en
Inventor
王蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201810219804.9A
Publication of CN110275998A
Application granted
Publication of CN110275998B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Abstract

The invention discloses a method and device for determining webpage attribute data. The method comprises: determining multiple target webpages; crawling data from each of the multiple target webpages to obtain a data crawling result; obtaining, according to the data crawling result, multiple pieces of annotation data on each target webpage, where each piece of annotation data includes the number of times each element of the target webpage occurs; and determining the attribute data of a target element according to the number of times each element of the target webpage occurs. The invention solves the technical problem in the related art that miscommunication causes the crawled webpage data to deviate substantially from what was expected.

Description

Method and device for determining webpage attribute data
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for determining webpage attribute data.
Background
In the related art, when business personnel or clients need to obtain certain fields and element attribute data from webpages, business personnel and technical staff must communicate repeatedly: the business personnel tell the technical staff which webpage fields or attribute data they want, and the technical staff crawl according to their own understanding. This process requires the technical staff to grasp promptly and accurately what the business personnel or clients are asking for; only then can the desired webpage content be crawled. In practice, business personnel or clients may express themselves unclearly, or technical staff may misunderstand, so that the crawled webpage attribute data or webpage elements differ greatly from what the client (or business personnel) expected, and the crawl has to be redone.
No effective solution has yet been proposed for this technical problem in the related art, namely that miscommunication causes the crawled webpage elements to deviate substantially from what was expected.
Summary of the invention
Embodiments of the present invention provide a method and device for determining webpage attribute data, so as to at least solve the technical problem in the related art that miscommunication causes the crawled webpage data to deviate substantially from what was expected.
According to one aspect of the embodiments of the present invention, a method for determining webpage attribute data is provided, comprising: determining multiple target webpages; crawling data from each of the multiple target webpages to obtain a data crawling result; obtaining, according to the data crawling result, multiple pieces of annotation data on each target webpage, where each piece of annotation data includes the number of times each element of the target webpage occurs; and determining the attribute data of a target element according to the number of times each element of the target webpage occurs.
Further, after crawling data from each of the multiple target webpages to obtain the data crawling result, the method comprises: injecting capture-element code into each target webpage, where the capture-element code is used to capture the annotated target webpages and each annotated element on each target webpage.
Further, after the multiple target webpages have been annotated, obtaining the multiple pieces of annotation data on each target webpage according to the data crawling result comprises: obtaining the capture-element code according to the data crawling result; capturing, by means of the capture-element code, each annotated element on each target webpage together with its element attribute data, to obtain a capture result; and determining the multiple pieces of annotation data from the capture result.
Further, determining the attribute data of the target element according to the number of times each element of the target webpage occurs comprises: counting the total number of times each webpage element of the multiple target webpages occurs across the multiple target webpages, to obtain a statistical result; determining, according to the statistical result, the target elements, namely the annotated webpage elements whose count is greater than or equal to a preset threshold; and obtaining, for each target element, the multiple attributes corresponding to it, so as to determine the attribute data of the target element.
Further, counting the total number of times each webpage element of the multiple target webpages occurs across the multiple target webpages to obtain the statistical result comprises: counting the total number of web page access sessions, where a web page access session is the session corresponding to each visit to a webpage; filtering out webpage elements that are repeated within a web page access session, to obtain a first filtering result; and determining, according to the first filtering result, the total number of times each webpage element occurs, so as to obtain the statistical result.
Further, counting the total number of times each webpage element of the multiple target webpages occurs across the multiple target webpages to obtain the statistical result comprises: collecting the user data of the users who click each webpage element locator during webpage access; filtering out, according to the user data, the data of webpage elements that a user clicks repeatedly through the same webpage element locator, to obtain a second filtering result; and determining, according to the second filtering result, the total number of times each webpage element occurs, so as to obtain the statistical result.
Further, before determining the multiple target webpages, the method also comprises: receiving a business requirement parameter; and obtaining the multiple target webpages according to the business requirement parameter, where marker code is embedded in each element of each target webpage while the target webpages are being obtained, and the marker code is used to record the data of the annotation operations that users perform on the target webpage.
According to another aspect of the embodiments of the present invention, a device for determining webpage attribute data is also provided, comprising: a first determination unit, for determining multiple target webpages; a crawling unit, for crawling data from each of the multiple target webpages to obtain a data crawling result; an acquiring unit, for obtaining, according to the data crawling result, multiple pieces of annotation data on each target webpage, where each piece of annotation data includes the number of times each element of the target webpage occurs; and a second determination unit, for determining the attribute data of a target element according to the number of times each element of the target webpage occurs.
Further, the device also comprises: an injection unit, for injecting capture-element code into each target webpage after data has been crawled from each of the multiple target webpages and the data crawling result has been obtained, where the capture-element code is used to capture the annotated target webpages and each annotated element on each target webpage.
Further, the acquiring unit comprises: a first obtaining module, for obtaining the capture-element code according to the data crawling result after the multiple target webpages have been annotated, where the capture-element code is used to capture the annotated target webpages and each annotated element on each target webpage; a capture module, for capturing, by means of the capture-element code, each annotated element on each target webpage together with its element attribute data, to obtain a capture result; and a first determining module, for determining the multiple pieces of annotation data from the capture result.
Further, the second determination unit comprises: a statistical module, for counting the total number of times each webpage element of the multiple target webpages occurs across the multiple target webpages, to obtain a statistical result; a second determining module, for determining, according to the statistical result, the target elements, namely the annotated webpage elements whose count is greater than or equal to a preset threshold; and a second obtaining module, for obtaining, for each target element, the multiple attributes corresponding to it, so as to determine the attribute data of the target element.
Further, the statistical module comprises: a first statistic submodule, for counting the total number of web page access sessions, where a web page access session is the session corresponding to each visit to a webpage; a first filtering submodule, for filtering out webpage elements that are repeated within a web page access session, to obtain a first filtering result; and a first determining submodule, for determining, according to the first filtering result, the total number of times each webpage element occurs, so as to obtain the statistical result.
Further, the statistical module also comprises: a second statistic submodule, for collecting the user data of the users who click each webpage element locator during webpage access; a second filtering submodule, for filtering out, according to the user data, the data of webpage elements that a user clicks repeatedly through the same webpage element locator, to obtain a second filtering result; and a second determining submodule, for determining, according to the second filtering result, the total number of times each webpage element occurs, so as to obtain the statistical result.
Further, the device also comprises: a receiving module, for receiving a business requirement parameter before the multiple target webpages are determined; and an obtaining module, for obtaining the multiple target webpages according to the business requirement parameter, where marker code is embedded in each element of each target webpage while the target webpages are being obtained, and the marker code is used to record the data of the annotation operations that users perform on the target webpage.
According to another aspect of the embodiments of the present invention, a storage medium is also provided. The storage medium is used to store a program, and when the program runs, it controls the device on which the storage medium resides to execute any one of the above methods for determining webpage attribute data.
According to another aspect of the embodiments of the present invention, a processor is also provided. The processor is used to run a program, and when the program runs, it executes any one of the above methods for determining webpage attribute data.
In the present invention, the selected target webpages are determined and data is crawled from each of them; the data crawling result is then used to obtain multiple pieces of annotation data on each target webpage. The annotation data can carry the information of each element of the webpage, including the number of times each element occurs in the webpage, so that the target element and its corresponding attribute data can be determined from these counts. That is, in this embodiment, data is crawled from the selected target webpages and the information of the elements on them is obtained, which yields the attribute data of the target element. The expected data is obtained without back-and-forth communication about the business requirement, which solves the technical problem in the related art that miscommunication causes the crawled webpage data to deviate substantially from what was expected.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of a method for determining webpage attribute data according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a device for determining webpage attribute data according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and so on in the description, the claims and the above drawings are used to distinguish similar objects and do not describe a particular order or sequence. Data labeled in this way are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have", and any variations of them, are intended to cover non-exclusive inclusion: a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
According to an embodiment of the present invention, a method embodiment for determining webpage attribute data is provided. It should be noted that the steps shown in the flowchart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described can be executed in a different order.
The present invention can be applied to settings such as parsing webpage fields, extracting webpage elements and parsing webpage element attributes, and in particular to the many different webpages on the Internet. In their work, clients or business personnel need technical support to obtain webpage fields or attributes. In the present invention, business personnel or clients can annotate the content of the selected webpages, and the back end can parse the annotated content directly, so that they obtain the webpage content they want.
The following embodiment is a preferred method embodiment of the present invention. Fig. 1 is a flowchart of a method for determining webpage attribute data according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: determine multiple target webpages.
Optionally, the target webpages in the present invention can be webpages chosen on a terminal according to the business requirement. The types of webpage can include, but are not limited to: shopping webpages (for example, the shopping pages of Taobao or JD.com), travel webpages (for example, Tongcheng Travel or Qunar), appliance webpages (for example, Gree Electric Appliances) and technical webpages (for example, Baidu Baike). Before the multiple target webpages are determined, a business requirement parameter can be received, and the multiple target webpages can be obtained according to it. While the target webpages are being obtained, marker code is embedded in each element of each target webpage; the marker code records the data of the annotation operations that users perform on the target webpage, and it can also guide the capture-element code so that the capture-element code records the annotated data more easily. In other words, the desired webpages are chosen from all webpages on the Internet according to the business requirement parameter, and marker code is embedded in the chosen webpages. The marker code can be JavaScript code that collects the clicks of data users (such as the clients or business personnel mentioned above) on the webpage, so that the webpages, webpage elements and webpage element attributes that the data users are interested in can be determined from the collected data.
After the target webpages are determined, data users can annotate the selected webpages by clicking the data they are interested in. After a data user clicks a position on a webpage, the back end automatically records information such as which webpage was clicked and which element of the webpage was clicked.
Step S104: crawl data from each of the multiple target webpages to obtain a data crawling result.
The data crawling process in this embodiment can refer to crawling the annotated webpages and webpage content in order to obtain the annotated content, that is, collecting the data annotated by the data users. Here, the annotated content can be determined through the capture-element code: the capture-element code triggered by each click of a data user is collected, and the annotated content of the webpage is captured through that code.
Step S106: obtain, according to the data crawling result, multiple pieces of annotation data on each target webpage, where each piece of annotation data includes the number of times each element of the target webpage occurs.
Through the above steps, multiple pieces of annotation data can be obtained for each webpage. When the annotation data on each target webpage is obtained, the data crawling result can be used to obtain the capture-element code. In this embodiment, business personnel annotate the target webpages; after annotation, the annotation results can be returned to data processing personnel, who determine all annotated data from the capture-element code and the capture log records. The capture-element code captures each annotated element on each target webpage together with its element attribute data, giving a capture result, and the multiple pieces of annotation data are determined from the capture result. The capture-element code can thus capture both the elements of the annotated webpages and their element attribute data. Optionally, the present invention does not limit which webpages are captured, nor which elements of a webpage are clicked; for example, the webpage element can be an explanatory document on the webpage or a shopping item on the webpage. The element attribute data in this embodiment is likewise not specifically limited; for example, if the captured webpage element is a microblog post, its element attribute data can include the microblog avatar, the microblog title, the person's gender, and so on.
In addition, the capture-element code can obtain the position of the clicked webpage element and determine its location information, which can be expressed as webpage text or as a webpage URL (uniform resource locator). The captured location information and the related access information are sent to a server, and the server determines the target element and the attribute data of the target element. The related access information can include, but is not limited to: the time of the click, the duration of the webpage access session, the session ID of the webpage access, and the information of the user accessing the webpage (including the user account and/or password).
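As an illustration of the information described above, the following is a minimal sketch of the record that capture-element code might send to the server for each click. The `ClickRecord` type, its field names and the JSON encoding are assumptions made for this example; the source only lists the kinds of information involved (element position, click time, session ID, user information).

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClickRecord:
    """One annotation click (all field names are hypothetical)."""
    page_url: str      # URL of the target webpage that was clicked
    locator: str       # position of the clicked element, e.g. a CSS selector
    clicked_at: float  # time of the click as a Unix timestamp
    session_id: str    # ID of the web page access session
    user_id: str       # identifies the data user, e.g. taken from a cookie


def encode(record: ClickRecord) -> str:
    """Serialize a record to JSON so it can be sent to the server."""
    return json.dumps(asdict(record), sort_keys=True)


def decode(payload: str) -> ClickRecord:
    """Rebuild a record from its JSON payload on the server side."""
    return ClickRecord(**json.loads(payload))
```

Keeping the session ID and the user ID on every record is what later allows repeated clicks to be collapsed per session or per user.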
The total number of times each element occurs across the different webpages can then be counted to obtain the corresponding statistical result.
Step S108: determine the attribute data of the target element according to the number of times each element of the target webpage occurs.
In the above embodiment, the selected target webpages are determined first and data is crawled from each of them; the data crawling result is then used to obtain multiple pieces of annotation data on each target webpage. The annotation data includes the number of times each element of the target webpage occurs, and the target element and its corresponding attribute data are determined from these counts. That is, in this embodiment, data is crawled from the selected target webpages, the information of the elements on them is obtained, and the attribute data of the target element is obtained by analyzing the content of each element. The expected data is obtained without back-and-forth communication about the business requirement, which solves the technical problem in the related art that miscommunication causes the crawled webpage data to deviate substantially from what was expected.
In the above embodiment, the attribute data of the target element can be determined from the number of times each element of the target webpage occurs as follows: count the total number of times each webpage element of the multiple target webpages occurs across the multiple target webpages, to obtain a statistical result; determine, according to the statistical result, the target elements, namely the annotated webpage elements whose count is greater than or equal to a preset threshold; and obtain, for each target element, the multiple attributes corresponding to it, so as to determine the attribute data of the target element. The statistical result can include, but is not limited to, the number of annotated target webpages and the count of each webpage element.
When the element information and element attribute data of the webpages are obtained, the target element can be determined from the number of times each element was selected. There can be one target element or several. The preset threshold can be set by the user according to actual usage; for example, it can be set to 3 or 5. Once the total count of an element reaches or exceeds the preset threshold, the element is taken as a target element, and the link information of each target element is collected to obtain the attribute data corresponding to it.
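The threshold step can be sketched as follows, assuming the statistical result is a mapping from an element identifier to its total count. The function name, the mapping shape and the default threshold of 3 (one of the example values mentioned above) are illustrative assumptions.

```python
from typing import Dict, List


def select_target_elements(total_counts: Dict[str, int], threshold: int = 3) -> List[str]:
    """Return the elements whose total count is >= threshold, most frequent first."""
    selected = [element for element, count in total_counts.items() if count >= threshold]
    # Sort by descending count; break ties by name so the order is deterministic.
    return sorted(selected, key=lambda element: (-total_counts[element], element))
```

For example, with counts of 7, 3 and 1 for three elements and a threshold of 3, the first two elements are selected and the third is dropped.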
Below, the click frequency of each element on the webpages clicked by data users is counted in order to determine the target element and the element attribute data. In this count, the number of occurrences of an element can be determined by the number of sessions or by the number of users.
To count the total number of times each webpage element of the multiple target webpages occurs across the multiple target webpages and obtain the statistical result, the total number of web page access sessions can be counted, where a web page access session is the session corresponding to each visit to a webpage; webpage elements that are repeated within a web page access session are filtered out to obtain a first filtering result; and the total number of times each webpage element occurs is determined from the first filtering result to obtain the statistical result.
Web page access sessions are counted so that one person's operations do not have an excessive influence on the statistical result. A web page access session is the process from entering a webpage to closing it within one visit; for example, the period from a data user entering Taobao to closing Taobao is one web page access session. Elements that are clicked repeatedly within a web page access session are filtered out so that unimportant webpage elements do not distort the overall statistical result.
Alternatively, to obtain the statistical result, the user data of the users who click each webpage element locator during webpage access can be collected; according to the user data, the data of webpage elements that a user clicks repeatedly through the same webpage element locator is filtered out to obtain a second filtering result; and the total number of times each webpage element occurs is determined from the second filtering result to obtain the statistical result.
The user data of the users who click the webpage element locators is counted so that multiple clicks by one user on the same webpage element count as a single click. This prevents repeated clicks on an unimportant element from inflating its count.
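Both filtering variants (per session and per user) come down to counting each locator at most once per group. A minimal sketch, assuming each click has already been reduced to a `(group_id, locator)` pair, where `group_id` is a session ID for the first variant or a user ID for the second; the names are illustrative and not taken from the source.

```python
from collections import Counter
from typing import Iterable, Tuple

# Each click is (group_id, locator); group_id is a session ID or a user ID.
Click = Tuple[str, str]


def count_distinct(clicks: Iterable[Click]) -> Counter:
    """Count, per locator, how many distinct groups clicked it at least once."""
    # set() collapses repeated clicks on the same locator within one group.
    return Counter(locator for _group, locator in set(clicks))
```

With this counting, ten clicks on an advertisement within one session contribute 1 to its count, the same as a single click would.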
In the above embodiment of the present invention, the elements located by the webpage element locators can be parsed from the webpages according to the element counts, and the attribute data corresponding to those elements can be saved.
The following is a method for parsing webpage fields according to an embodiment of the present invention. The method comprises:
11. Download the webpages to be crawled and randomly select a portion of them for annotation (that is, for clicking the important data on the page). JavaScript code that collects user clicks can be embedded in this randomly selected portion of the webpages.
12. The users of the data click the data they care about on the webpages to be annotated.
13. Collect the data annotated by the annotators (that is, the data users mentioned above). Each click by an annotator triggers the relevant code, which works much like a browser's "inspect element" function. The code obtains the css selector of the clicked element; a css selector controls the elements of a webpage and is a way of locating an element in the DOM tree. The css selector is sent to the server together with other information related to the click, such as the click time, the session ID (which identifies the web page access session) and the cookie (which identifies the user and is used to count users).
14. On the server side, count the frequency with which each css selector occurs and pick the most important css selectors.
15. From the frequency counts, take the three most frequently clicked elements as the important elements.
In the count, to prevent one person's abnormal operations from having an excessive influence on the result, the count can be made per webpage session, that is, by counting sessions. For each css selector, count the number of sessions in which a user clicked that css selector. Multiple clicks within one session are then counted only once, which prevents a user who repeatedly clicks an unimportant element from distorting the overall result.
Alternatively, the frequency can be counted per user by means of the cookie: for each css selector, count how many users clicked it. Multiple clicks by one user on an element are then counted as one, which again prevents a user who repeatedly clicks an unimportant element from distorting the overall result.
16. Using the css selectors obtained in the previous step, parse the elements corresponding to those css selectors (the target elements described above) from all downloaded pages. When the target elements of a webpage are saved, all attribute values of the elements can also be saved to the database, which completes the parsing of the webpage annotation data.
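Steps 14 and 15 above can be sketched as follows, assuming the server has already reduced the clicks to one css selector string per session (or per user). The function name `top_selectors` and the parameter `k` are illustrative; the value 3 follows step 15.

```python
from collections import Counter
from typing import Iterable, List


def top_selectors(selectors: Iterable[str], k: int = 3) -> List[str]:
    """Return the k most frequent css selectors, most frequent first."""
    return [selector for selector, _count in Counter(selectors).most_common(k)]
```

Note that `Counter.most_common` keeps selectors with equal counts in the order they were first seen, so a tie at the cut-off is resolved by arrival order in this sketch.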
The above embodiment is illustrated with e-commerce product webpages. Technical staff can download all the e-commerce product pages and then pull out a few hundred of them (the target webpages) for business personnel to annotate by clicking the important elements. The business personnel may be interested in the product title, so the product title is clicked many times. In the present invention, the css selector that identifies the title can then be obtained from the webpages that the users clicked and the location information of the elements on those webpages. That css selector is used to parse all downloaded e-commerce product pages, and the product titles are stored in the database. At this point, business personnel can view the product titles in the database.
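The title extraction in this example can be sketched with Python's standard `html.parser` module. This is a deliberately simplified illustration that only understands selectors of the form `tag.class`; general css selectors need a full selector engine, and the class and function names are assumptions made for this example.

```python
from html.parser import HTMLParser
from typing import List


class SelectorTextExtractor(HTMLParser):
    """Collect the text of every element matching a 'tag.class' selector."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.target_tag, _, self.target_class = selector.partition(".")
        self.depth = 0                # > 0 while inside a matching element
        self.parts: List[str] = []    # text pieces of the current match
        self.results: List[str] = []  # text of every completed match

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.target_tag:
                self.depth += 1       # nested element with the same tag name
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.target_tag and self.target_class in classes:
            self.depth = 1
            self.parts = []

    def handle_endtag(self, tag):
        if self.depth and tag == self.target_tag:
            self.depth -= 1
            if self.depth == 0:
                self.results.append("".join(self.parts).strip())

    def handle_data(self, data):
        if self.depth:
            self.parts.append(data)


def extract_text(html: str, selector: str) -> List[str]:
    """Return the text of all elements in html that match the selector."""
    parser = SelectorTextExtractor(selector)
    parser.feed(html)
    parser.close()
    return parser.results
```

Running `extract_text` with the selected title selector over every downloaded product page would yield the titles to store in the database.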
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium is used to store a program, wherein, when the program runs, the device where the storage medium is located is controlled to execute any one of the above methods for determining webpage attribute data.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is used to run a program, wherein, when the program runs, it executes any one of the above methods for determining webpage attribute data.
Fig. 2 is a schematic diagram of a device for determining webpage attribute data according to an embodiment of the present invention. As shown in Fig. 2, the device may include: a first determination unit 21, configured to determine multiple target webpages; a crawling unit 23, configured to crawl data from each of the multiple target webpages to obtain a data crawling result; an acquisition unit 25, configured to obtain, according to the data crawling result, multiple labeled data on each target webpage, wherein each labeled data includes the number of times each element in the target webpage occurs; and a second determination unit 27, configured to determine the attribute data of the target elements according to the number of times each element in the target webpage occurs.
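One way to picture how the four units of Fig. 2 fit together is the sketch below. It is only an illustration: the function name, the injected `fetch` and `collect_marks` callables, and the fake data are assumptions, and the final lookup of attribute values for the selected elements is left out.

```python
from collections import Counter

def run_pipeline(urls, fetch, collect_marks, threshold):
    """Chain the units: target pages -> crawl -> labeled data -> target elements.

    urls: the target webpages chosen by the first determination unit.
    fetch: hypothetical crawling function, url -> html.
    collect_marks: hypothetical acquisition function, (url, html) -> list of marked selectors.
    """
    # Crawling unit: download each target page.
    crawl_result = {url: fetch(url) for url in urls}
    # Acquisition unit: gather the selectors marked on each page and count them.
    counts = Counter()
    for url, html in crawl_result.items():
        counts.update(collect_marks(url, html))
    # Second determination unit: keep elements marked at least `threshold` times.
    return sorted(selector for selector, n in counts.items() if n >= threshold)

# Fake data standing in for real crawling and real user marks.
fake_pages = {"p1": "<html>1</html>", "p2": "<html>2</html>"}
fake_marks = {"p1": ["h1.title", "span.price"], "p2": ["h1.title"]}
targets = run_pipeline(
    ["p1", "p2"],
    fetch=fake_pages.__getitem__,
    collect_marks=lambda url, html: fake_marks[url],
    threshold=2,
)
```

Passing the crawling and acquisition steps in as callables keeps the sketch testable without network access.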
With the above device, the selected target webpages can be determined by the first determination unit 21, and data can be crawled from each target webpage by the crawling unit 23. Using the data crawling result, the acquisition unit 25 obtains multiple labeled data on each target webpage, where the labeled data includes the number of times each element in the webpage occurs. Finally, the second determination unit 27 determines the attribute data of the target elements according to the number of times each element in the target webpage occurs. That is, in this embodiment, data can be crawled from the selected target webpages, the element information on the webpages can be acquired to obtain the labeled data, and the attribute data of the target elements can be determined from the labeled data. In the embodiments of the present invention, the attribute data of the target elements can be obtained solely from the labeled data and element information corresponding to the elements marked on the webpages. The expected data can thus be obtained without business communication, which solves the technical problem in the related art that deviations arising during communication cause large deviations in the crawled webpage data.
Optionally, the above device further includes: a marking unit, configured to inject element-capturing code into each target webpage after data has been crawled from each of the multiple target webpages and the data crawling result has been obtained, wherein the element-capturing code is used to capture the marked target webpages and each marked element on each target webpage.
Optionally, the above acquisition unit 25 includes: a first acquisition module, configured to obtain the element-capturing code according to the data crawling result after the multiple target webpages have been marked; a capture module, configured to capture, by means of the element-capturing code, each marked element on each target webpage and its element attribute data, obtaining a capture result; and a first determination module, configured to determine the multiple labeled data using the capture result.
In addition, the second determination unit 27 includes: a statistical module, configured to count the total number of times each web page element in the multiple target webpages occurs in the multiple target webpages, obtaining a statistical result; a second determination module, configured to determine, according to the statistical result, the target elements, namely the marked web page elements whose count is greater than or equal to a preset threshold; and a second acquisition module, configured to obtain, according to the target elements, the multiple attributes corresponding to each target element, so as to determine the attribute data of the target elements.
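The threshold step performed by the second determination module can be sketched as below. The counts, the threshold value, and the function name are illustrative assumptions; the source does not state how the preset threshold is chosen.

```python
def select_target_elements(counts, threshold):
    """Return the selectors whose count is >= threshold, most frequent first."""
    return sorted(
        (selector for selector, count in counts.items() if count >= threshold),
        key=lambda selector: (-counts[selector], selector),
    )

# Hypothetical deduplicated counts per CSS selector.
counts = {"h1.title": 42, "span.price": 17, "div.footer": 2}
targets = select_target_elements(counts, threshold=10)
```

With these example numbers the footer selector falls below the threshold and is dropped, while the title and price selectors are kept as target elements.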
The above statistical module may include: a first statistics submodule, configured to count the total number of times web page access sessions occur, wherein a web page access session is the session corresponding to each access to a webpage; a first filtering submodule, configured to filter out web page elements that are repeated within a web page access session, obtaining a first filtering result; and a first determination submodule, configured to determine, according to the first filtering result, the total number of times each web page element occurs, so as to obtain the statistical result.
In addition, the above statistical module further includes: a second statistics submodule, configured to collect the user data corresponding to the users who click each web page element locator during web page access; a second filtering submodule, configured to filter out, according to the user data, the data of web page elements that are clicked repeatedly while the web page element locator is clicked, obtaining a second filtering result; and a second determination submodule, configured to determine, according to the second filtering result, the total number of times each web page element occurs, so as to obtain the statistical result.
Optionally, the device further includes: a receiving module, configured to receive a business demand parameter before the multiple target webpages are determined; and an obtaining module, configured to obtain the multiple target webpages according to the business demand parameter, wherein, while the target webpages are being obtained, marker code is embedded in each element of each target webpage, and the marker code is used to record the data of the marking operations that users perform on the target webpage.
The above device for determining webpage attribute data may further include a processor and a memory. The first determination unit 21, the crawling unit 23, the acquisition unit 25, the second determination unit 27, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels may be provided. By adjusting kernel parameters, the attribute data of the target elements is obtained from the labeled data and element information corresponding to the elements marked on the webpages.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
An embodiment of the present invention provides a device. The device includes a processor, a memory, and a program that is stored in the memory and can run on the processor. When executing the program, the processor implements the following steps: determining multiple target webpages; crawling data from each of the multiple target webpages to obtain a data crawling result; obtaining, according to the data crawling result, multiple labeled data on each target webpage, wherein each labeled data includes the number of times each element in the target webpage occurs; and determining the attribute data of the target elements according to the number of times each element in the target webpage occurs.
Optionally, when executing the program, the processor may also: obtain element-capturing code according to the data crawling result, wherein the element-capturing code is used to capture the marked target webpages and each marked element on each target webpage; capture, by means of the element-capturing code, each marked element on each target webpage and its element attribute data, obtaining a capture result; and determine the multiple labeled data using the capture result.
Optionally, when executing the program, the processor may also: count the total number of times each web page element in the multiple target webpages occurs in the multiple target webpages, obtaining a statistical result; determine, according to the statistical result, the target elements, namely the marked web page elements whose count is greater than or equal to a preset threshold; and obtain, according to the target elements, the multiple attributes corresponding to each target element, so as to determine the attribute data of the target elements.
Optionally, when executing the program, the processor may: count the total number of times web page access sessions occur, wherein a web page access session is the session corresponding to each access to a webpage; filter out web page elements that are repeated within a web page access session, obtaining a first filtering result; and determine, according to the first filtering result, the total number of times each web page element occurs, so as to obtain the statistical result.
Optionally, when executing the program, the processor may: collect the user data corresponding to the users who click each web page element locator during web page access; filter out, according to the user data, the data of web page elements that are clicked repeatedly while the web page element locator is clicked, obtaining a second filtering result; and determine, according to the second filtering result, the total number of times each web page element occurs, so as to obtain the statistical result.
Optionally, when executing the program, the processor may also: receive a business demand parameter; and obtain the multiple target webpages according to the business demand parameter, wherein, while the target webpages are being obtained, marker code is embedded in each element of each target webpage, and the marker code is used to record the data of the marking operations that users perform on the target webpage.
The serial numbers of the above embodiments of the present invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (10)

1. A method for determining webpage attribute data, characterized by comprising:
determining multiple target webpages;
crawling data from each target webpage of the multiple target webpages to obtain a data crawling result;
obtaining, according to the data crawling result, multiple labeled data on each target webpage, wherein each labeled data includes the number of times each element in the target webpage occurs;
determining the attribute data of target elements according to the number of times each element in the target webpage occurs.
2. The method according to claim 1, characterized in that, after crawling data from each target webpage of the multiple target webpages to obtain the data crawling result, the method further comprises:
injecting element-capturing code into each target webpage, wherein the element-capturing code is used to capture the marked target webpages and each marked element on each target webpage.
3. The method according to claim 2, characterized in that, after the multiple target webpages are marked, obtaining the multiple labeled data on each target webpage according to the data crawling result comprises:
obtaining the element-capturing code according to the data crawling result;
capturing, by means of the element-capturing code, each marked element on each target webpage and its element attribute data, to obtain a capture result;
determining the multiple labeled data using the capture result.
4. The method according to claim 1, characterized in that determining the attribute data of target elements according to the number of times each element in the target webpage occurs comprises:
counting the total number of times each web page element in the multiple target webpages occurs in the multiple target webpages, to obtain a statistical result;
determining, according to the statistical result, the target elements, namely the marked web page elements whose count is greater than or equal to a preset threshold;
obtaining, according to the target elements, the multiple attributes corresponding to each target element, so as to determine the attribute data of the target elements.
5. The method according to claim 4, characterized in that counting the total number of times each web page element in the multiple target webpages occurs in the multiple target webpages to obtain the statistical result comprises:
counting the total number of times web page access sessions occur, wherein a web page access session is the session corresponding to each access to a webpage;
filtering out web page elements that are repeated within a web page access session, to obtain a first filtering result;
determining, according to the first filtering result, the total number of times each web page element occurs, so as to obtain the statistical result.
6. The method according to claim 4, characterized in that counting the total number of times each web page element in the multiple target webpages occurs in the multiple target webpages to obtain the statistical result comprises:
collecting the user data corresponding to the users who click each web page element locator during web page access;
filtering out, according to the user data, the data of web page elements that are clicked repeatedly while the web page element locator is clicked, to obtain a second filtering result;
determining, according to the second filtering result, the total number of times each web page element occurs, so as to obtain the statistical result.
7. The method according to claim 1, characterized in that, before determining the multiple target webpages, the method further comprises:
receiving a business demand parameter;
obtaining the multiple target webpages according to the business demand parameter, wherein, while the target webpages are being obtained, marker code is embedded in each element of each target webpage, and the marker code is used to record the data of the marking operations that users perform on the target webpage.
8. A device for determining webpage attribute data, characterized by comprising:
a first determination unit, configured to determine multiple target webpages;
a crawling unit, configured to crawl data from each target webpage of the multiple target webpages to obtain a data crawling result;
an acquisition unit, configured to obtain, according to the data crawling result, multiple labeled data on each target webpage, wherein each labeled data includes the number of times each element in the target webpage occurs;
a second determination unit, configured to determine the attribute data of target elements according to the number of times each element in the target webpage occurs.
9. A storage medium, characterized in that the storage medium is used to store a program, wherein, when the program runs, the device where the storage medium is located is controlled to execute the method for determining webpage attribute data according to any one of claims 1 to 7.
10. A processor, characterized in that the processor is used to run a program, wherein, when the program runs, it executes the method for determining webpage attribute data according to any one of claims 1 to 7.
CN201810219804.9A 2018-03-16 2018-03-16 Method and device for determining webpage attribute data Active CN110275998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810219804.9A CN110275998B (en) 2018-03-16 2018-03-16 Method and device for determining webpage attribute data

Publications (2)

Publication Number Publication Date
CN110275998A true CN110275998A (en) 2019-09-24
CN110275998B CN110275998B (en) 2021-07-30

Family

ID=67957841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810219804.9A Active CN110275998B (en) 2018-03-16 2018-03-16 Method and device for determining webpage attribute data

Country Status (1)

Country Link
CN (1) CN110275998B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836316A (en) * 2021-09-23 2021-12-24 北京百度网讯科技有限公司 Processing method, training method, device, equipment and medium for ternary group data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007041800A1 (en) * 2005-10-14 2007-04-19 Panscient Inc Information extraction system
CN103294711A (en) * 2012-02-28 2013-09-11 阿里巴巴集团控股有限公司 Method and device for determining page elements in web page
CN103916293A (en) * 2014-04-15 2014-07-09 浪潮软件股份有限公司 Method for monitoring and analyzing website user behaviors
CN104021185A (en) * 2014-06-11 2014-09-03 北京奇虎科技有限公司 Method and device for identifying information attributes of data in web pages
CN105447139A (en) * 2015-11-20 2016-03-30 广州华多网络科技有限公司 Data acquisition statistical method, and system, terminal and service equipment thereof
CN107562620A (en) * 2017-08-24 2018-01-09 阿里巴巴集团控股有限公司 One kind buries an automatic setting method and device

Also Published As

Publication number Publication date
CN110275998B (en) 2021-07-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant