CN110275998A - The determination method and device of webpage attribute data - Google Patents
Method and device for determining web page attribute data
- Publication number
- CN110275998A (application number CN201810219804.9A)
- Authority
- CN
- China
- Prior art keywords
- data
- webpage
- result
- target webpage
- web page
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
- G06F16/986—Document structures and storage, e.g. HTML extensions
Abstract
The invention discloses a method and device for determining web page attribute data. The method comprises: determining multiple target web pages; crawling data from each of the target web pages to obtain a data crawling result; obtaining, according to the data crawling result, multiple pieces of annotation data on each target web page, wherein each piece of annotation data includes the number of times each element in the target web page appears; and determining attribute data of a target element according to the number of times each element in the target web page appears. The invention solves the technical problem in the related art that deviations arising during communication cause large deviations in the crawled web page data.
Description
Technical field
The present invention relates to the field of Internet technology, and in particular to a method and device for determining web page attribute data.
Background
In the related art, when business personnel or clients need to obtain certain fields and element attribute data from web pages, the business personnel and the technical staff must communicate repeatedly: the business personnel tell the technical staff which web page fields or attribute data they want, and the technical staff crawl the pages according to their own understanding. This process requires the technical staff to have strong comprehension and to grasp promptly what the business personnel or clients are asking for; only then can they crawl the web page content that is actually wanted. In practice, business personnel or clients may express themselves unclearly, or the technical staff may misunderstand, so that the crawled web page attribute data or web page elements differ greatly from what the client (or business personnel) expected, and the crawl has to be redone.
No effective solution has yet been proposed for the above technical problem in the related art, namely that deviations arising during communication cause large deviations in the crawled web page elements.
Summary of the invention
Embodiments of the present invention provide a method and device for determining web page attribute data, so as to at least solve the technical problem in the related art that deviations arising during communication cause large deviations in the crawled web page data.
According to one aspect of the embodiments of the present invention, a method for determining web page attribute data is provided, comprising: determining multiple target web pages; crawling data from each target web page among the multiple target web pages to obtain a data crawling result; obtaining, according to the data crawling result, multiple pieces of annotation data on each target web page, wherein each piece of annotation data includes the number of times each element in the target web page appears; and determining attribute data of a target element according to the number of times each element in the target web page appears.
Further, after crawling data from each target web page among the multiple target web pages to obtain the data crawling result, the method comprises: injecting element capture code into each target web page, wherein the element capture code is used to capture the target web pages that have been marked and each element marked on each target web page.
Further, after the multiple target web pages have been marked, obtaining multiple pieces of annotation data on each target web page according to the data crawling result comprises: obtaining the element capture code according to the data crawling result; capturing, by means of the element capture code, each marked element on each target web page and its element attribute data, to obtain a capture result; and determining the multiple pieces of annotation data using the capture result.
Further, determining the attribute data of the target element according to the number of times each element in the target web page appears comprises: counting the total number of times each web page element in the multiple target web pages appears across the multiple target web pages, to obtain a statistical result; determining, according to the statistical result, target elements, that is, marked web page elements whose count is greater than or equal to a preset threshold; and obtaining, according to the target elements, the multiple attributes corresponding to each target element, so as to determine the attribute data of the target element.
Further, counting the total number of times each web page element in the multiple target web pages appears across the multiple target web pages to obtain the statistical result comprises: counting the total number of web page access sessions, wherein a web page access session is the session corresponding to each visit to a web page; filtering out web page elements that are repeated within a web page access session, to obtain a first filtering result; and determining, according to the first filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
Further, counting the total number of times each web page element in the multiple target web pages appears across the multiple target web pages to obtain the statistical result comprises: collecting user data corresponding to the users who click each web page element locator during web page access; filtering out, according to the user data, data on elements clicked repeatedly while clicking the web page element locator, to obtain a second filtering result; and determining, according to the second filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
Further, before determining the multiple target web pages, the method further comprises: receiving a business requirement parameter; and obtaining the multiple target web pages according to the business requirement parameter, wherein, in the course of obtaining the target web pages, marking code is embedded in each element of each target web page, the marking code being used to record data on the marking operations a user performs on the target web page.
According to another aspect of the embodiments of the present invention, a device for determining web page attribute data is further provided, comprising: a first determination unit, configured to determine multiple target web pages; a crawling unit, configured to crawl data from each target web page among the multiple target web pages to obtain a data crawling result; an acquisition unit, configured to obtain, according to the data crawling result, multiple pieces of annotation data on each target web page, wherein each piece of annotation data includes the number of times each element in the target web page appears; and a second determination unit, configured to determine attribute data of a target element according to the number of times each element in the target web page appears.
Further, the device further comprises: an injection unit, configured to inject element capture code into each target web page after data has been crawled from each target web page among the multiple target web pages and the data crawling result obtained, wherein the element capture code is used to capture the target web pages that have been marked and each element marked on each target web page.
Further, the acquisition unit comprises: a first acquisition module, configured to obtain the element capture code according to the data crawling result after the multiple target web pages have been marked, wherein the element capture code is used to capture the target web pages that have been marked and each element marked on each target web page; a capture module, configured to capture, by means of the element capture code, each marked element on each target web page and its element attribute data, to obtain a capture result; and a first determination module, configured to determine the multiple pieces of annotation data using the capture result.
Further, the second determination unit comprises: a statistics module, configured to count the total number of times each web page element in the multiple target web pages appears across the multiple target web pages, to obtain a statistical result; a second determination module, configured to determine, according to the statistical result, target elements, that is, marked web page elements whose count is greater than or equal to a preset threshold; and a second acquisition module, configured to obtain, according to the target elements, the multiple attributes corresponding to each target element, so as to determine the attribute data of the target element.
Further, the statistics module comprises: a first statistics submodule, configured to count the total number of web page access sessions, wherein a web page access session is the session corresponding to each visit to a web page; a first filtering submodule, configured to filter out web page elements that are repeated within a web page access session, to obtain a first filtering result; and a first determination submodule, configured to determine, according to the first filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
Further, the statistics module further comprises: a second statistics submodule, configured to collect user data corresponding to the users who click each web page element locator during web page access; a second filtering submodule, configured to filter out, according to the user data, data on elements clicked repeatedly while clicking the web page element locator, to obtain a second filtering result; and a second determination submodule, configured to determine, according to the second filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
Further, the device further comprises: a receiving module, configured to receive a business requirement parameter before the multiple target web pages are determined; and an obtaining module, configured to obtain the multiple target web pages according to the business requirement parameter, wherein, in the course of obtaining the target web pages, marking code is embedded in each element of each target web page, the marking code being used to record data on the marking operations a user performs on the target web page.
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium is used to store a program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute any one of the above methods for determining web page attribute data.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is used to run a program, wherein, when the program runs, any one of the above methods for determining web page attribute data is executed.
In the present invention, multiple selected target web pages are determined and data is crawled from each of them; the data crawling result is then used to obtain multiple pieces of annotation data on each target web page. The annotation data corresponds to information on each element of the web page and also includes the number of times each element appears in the web page, so that the target element and the data of its corresponding attributes can be determined from those counts. That is, in this embodiment, data can be crawled from the selected target web pages and information on the elements of those pages obtained, which yields the attribute data of the target element. The expected data is obtained without back-and-forth communication with business personnel, thereby solving the technical problem in the related art that deviations arising during communication cause large deviations in the crawled web page data.
Brief description of the drawings
The drawings described herein are provided for further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flow chart of a method for determining web page attribute data according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of a device for determining web page attribute data according to an embodiment of the present invention.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
According to an embodiment of the present invention, a method embodiment for determining web page attribute data is provided. It should be noted that the steps shown in the flow chart of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flow chart, in some cases the steps shown or described can be executed in an order different from the one given here.
The present invention can be applied to environments such as the parsing of various web page fields, the extraction of web page elements, and the parsing of web page element attributes. Specifically, it can be used across the Internet, in particular for different web pages on the Internet. In the course of their work, clients or business personnel need technical support to obtain web page fields or attributes. In the present invention, the business personnel or clients can mark content on the selected web pages, and the back end can parse the marked content directly according to those marks, so as to obtain the web page content they want.
The following embodiment is a preferred method embodiment of the present invention. Fig. 1 is a flow chart of a method for determining web page attribute data according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S102: determine multiple target web pages.
Optionally, the target web pages in the present invention may be web pages selected by a terminal according to business requirements. The specific types of web page may include, but are not limited to: shopping pages (for example, shopping pages on Taobao or JD.com), travel pages (for example, pages of Tongcheng Travel or Qunar), appliance pages (for example, pages of Gree Electric Appliances), and technical pages (for example, Baidu Baike pages). Before the multiple target web pages are determined, a business requirement parameter may be received and the multiple target web pages obtained according to it. In the course of obtaining the target web pages, marking code is embedded in each element of each target web page (the marking code can serve to guide the element capture code, making it easier for the element capture code to record the marked data); the marking code is used to record data on the marking operations a user performs on the target web page. The desired web pages can be selected from all the web pages on the Internet according to the business requirement parameter, and marking code embedded in the selected pages. The marking code may be JavaScript code that collects the clicks made on the web page by a data user (such as the client or business personnel mentioned above), so that the collected data can be used to determine the web pages, web page elements and web page element attributes the data user is interested in.
After the target web pages are determined, the data user can mark the selected web pages by clicking the data of interest. After the data user clicks the relevant position on a web page, the back end can automatically record information such as the web page clicked and the web page element clicked.
Step S104: crawl data from each target web page among the multiple target web pages to obtain a data crawling result.
The data crawling process in the embodiments of the present invention may refer to crawling the marked web pages and web page content to obtain the marked content, that is, collecting the data marked by the data user. Here, the marked content can be determined by the element capture code: the element capture code triggered each time the data user clicks is collected, and the marked content of the web page is captured through that code.
Step S106: obtain, according to the data crawling result, multiple pieces of annotation data on each target web page, wherein each piece of annotation data includes the number of times each element in the target web page appears.
Through the above steps, multiple pieces of annotation data on each web page can be obtained. When obtaining the annotation data on each target web page, the data crawling result can be used to obtain the element capture code. In the embodiments of the present invention, business personnel can mark the target web pages; after marking, the marking results are returned to the data processing staff, who determine all the marked data from the element capture code and the capture log records. By means of the element capture code, each marked element on each target web page and its element attribute data are captured to obtain a capture result, and the multiple pieces of annotation data are determined using the capture result. The element capture code may capture the elements of the marked web pages and their element attribute data. Optionally, the present invention places no limit on which web pages are captured, nor on which web page elements are clicked; for example, a web page element may be an explanatory document on the page or a shopping item on the page. The attribute data of an element is likewise not specifically limited in the embodiments of the present invention; for example, if the captured web page element is a microblog post, its element attribute data may include the microblog avatar, the microblog title, the user's gender, and so on.
In addition, the position of the clicked web page element can be obtained through the element capture code to determine location information, which may be represented by web page text or a web page URL (uniform resource locator). The captured location information and the related access information are sent to a server, and the server determines the target element and its attribute data. The related access information may include, but is not limited to: the time of the click, the duration of the web page access session, the session ID of the web page access, and information on the user accessing the web page (including a user account and/or password).
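As a rough illustration of the record that might be sent to the server for each click, the sketch below bundles the location information with the related access information and serializes it as JSON. The class name `ClickRecord`, its field names, the JSON encoding and the example URL are all assumptions made for illustration; the source does not specify a wire format. The sketch carries only an opaque user identifier rather than account credentials.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class ClickRecord:
    """One click on a marked page (hypothetical field names)."""
    page_url: str           # web page that was clicked
    selector: str           # location of the clicked element
    clicked_at: str         # time of the click, as an ISO 8601 string
    session_id: str         # ID of the web page access session
    session_seconds: float  # duration of the access session so far
    user_id: str            # opaque identifier of the visiting user


def to_payload(record: ClickRecord) -> str:
    """Serialize a click record for sending to the server."""
    return json.dumps(asdict(record), sort_keys=True)


record = ClickRecord(
    page_url="https://shop.example/item/1",
    selector="h1.title",
    clicked_at="2018-03-16T10:00:00Z",
    session_id="s-1",
    session_seconds=42.5,
    user_id="u-1",
)
payload = to_payload(record)
```

On the server side, `ClickRecord(**json.loads(payload))` would rebuild the same record, which keeps the counting steps independent of how the clicks were transported.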
The total number of times each element appears across different web pages can be counted to obtain the corresponding statistical result.
Step S108: determine attribute data of a target element according to the number of times each element in the target web page appears.
In the above embodiment, the multiple selected target web pages are first determined and data is crawled from each of them; the data crawling result is then used to obtain multiple pieces of annotation data on each target web page. The annotation data includes the number of times each element in the target web page appears, and these counts are used to determine the target element and its corresponding attribute data. That is, in this embodiment, data can be crawled from the selected target web pages and information on the elements of those pages obtained; by analysing the content of each element in the web pages, the attribute data of the target element is obtained. The expected data is obtained without back-and-forth communication with business personnel, thereby solving the technical problem in the related art that deviations arising during communication cause large deviations in the crawled web page data.
In the above embodiment, when determining the attribute data of the target element according to the number of times each element in the target web page appears, the total number of times each web page element in the multiple target web pages appears across the multiple target web pages can be counted to obtain a statistical result; target elements, that is, marked web page elements whose count is greater than or equal to a preset threshold, are determined according to the statistical result; and the multiple attributes corresponding to each target element are obtained according to the target elements, so as to determine the attribute data of the target element. The statistical result may include, but is not limited to, the number of specific target web pages marked and the count for each web page element.
When obtaining the element information and element attribute data of the web pages, the target element can be determined from the number of times an element appears during selection; the target element in the present invention may be one or more elements. The preset threshold can be set by the user according to actual usage; for example, it may be set to 3 or 5. Once the total number of times an element appears is determined to reach or exceed the preset threshold, the target element is obtained, and the link information of each target element is collected to obtain the attribute data corresponding to the target element.
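The threshold rule can be sketched in a few lines. This is a minimal illustration, assuming each marked element is identified by a string such as a selector; the function name `select_target_elements` and the sample marks are hypothetical.

```python
from collections import Counter


def select_target_elements(marked_elements, threshold):
    """Return {element: count} for elements marked at least `threshold` times."""
    counts = Counter(marked_elements)
    return {element: n for element, n in counts.items() if n >= threshold}


# Hypothetical marks collected across several target web pages.
marks = [
    "h1.title", "span.price", "h1.title", "h1.title",
    "div.ad", "span.price", "span.price",
]
targets = select_target_elements(marks, threshold=3)
```

With a threshold of 3, `h1.title` and `span.price` (three marks each) are kept and `div.ad` (one mark) is dropped; raising the threshold to 5 would leave no target elements for this sample.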
The following describes counting how often the data user clicks each web page element in order to determine the target elements and element attribute data. In the statistics, the number of times an element appears can be determined by number of sessions or by number of users.
When counting the total number of times each web page element in the multiple target web pages appears across the multiple target web pages to obtain the statistical result, the total number of web page access sessions can be counted, wherein a web page access session is the session corresponding to each visit to a web page; web page elements that are repeated within a web page access session are filtered out to obtain a first filtering result; and the total number of times each web page element appears is determined according to the first filtering result, so as to obtain the statistical result.
Counting by web page access session prevents an individual's operations from unduly influencing the statistical result: what is counted is the number of sessions accessing the web page. A web page access session may be a single visit, from entering the web page to closing it; for example, the period from the data user entering Taobao to closing Taobao is one web page access session. Elements clicked repeatedly within a web page access session are filtered out, so that unimportant web page elements do not affect the overall statistical result.
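A minimal sketch of session-based counting follows, assuming each click arrives as a `(session_id, selector)` pair; the function name `count_by_session` and the sample data are hypothetical. Collecting session IDs into a set per element is one simple way to make repeated clicks within a session count once.

```python
from collections import defaultdict


def count_by_session(clicks):
    """Count, for each element, the number of distinct sessions that clicked it.

    `clicks` is an iterable of (session_id, selector) pairs.
    """
    sessions_per_selector = defaultdict(set)
    for session_id, selector in clicks:
        # A set absorbs repeated clicks made within the same session.
        sessions_per_selector[selector].add(session_id)
    return {sel: len(ids) for sel, ids in sessions_per_selector.items()}


clicks = [
    ("s1", "div.ad"), ("s1", "div.ad"), ("s1", "div.ad"),  # one session, three clicks
    ("s1", "h1.title"),
    ("s2", "h1.title"),
]
session_counts = count_by_session(clicks)
```

Here `div.ad` was clicked three times but only within session `s1`, so it counts once, while `h1.title` counts twice because two different sessions clicked it.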
Alternatively, when counting the total number of times each web page element in the multiple target web pages appears across the multiple target web pages to obtain the statistical result, user data corresponding to the users who click each web page element locator during web page access can be collected; data on elements clicked repeatedly while clicking the web page element locator is filtered out according to the user data to obtain a second filtering result; and the total number of times each web page element appears is determined according to the second filtering result, so as to obtain the statistical result.
User data can be collected for the users who click web page element locators, with multiple clicks by one user on the same web page element counted once, which prevents repeated clicks on an unimportant element from distorting the result.
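User-based counting differs from session-based counting when one user returns in several sessions. The sketch below assumes each click is a dictionary with hypothetical keys `cookie`, `session` and `selector`, with the cookie value standing in for the user; it deduplicates on the (user, element) pair before counting.

```python
from collections import Counter


def count_by_user(clicks):
    """Count, for each element, the number of distinct users that clicked it."""
    # Deduplicate on (user, element) so repeated clicks by one user count once,
    # even when they are spread over several sessions.
    unique_pairs = {(c["cookie"], c["selector"]) for c in clicks}
    return Counter(selector for _, selector in unique_pairs)


clicks = [
    {"cookie": "userA", "session": "s1", "selector": "div.ad"},
    {"cookie": "userA", "session": "s2", "selector": "div.ad"},  # same user, new session
    {"cookie": "userA", "session": "s1", "selector": "h1.title"},
    {"cookie": "userB", "session": "s3", "selector": "h1.title"},
]
user_counts = count_by_user(clicks)
```

Counting by session would give `div.ad` a count of 2 for this sample, whereas counting by user gives 1, so the user-based rule is the stricter of the two against one person's repeated clicks.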
In the above embodiment of the present invention, statistics can be computed over the collected web page elements, the elements corresponding to the resulting web page element locators can be parsed from the pages, and the attribute data corresponding to those elements saved.
The following is a method of web page field parsing according to an embodiment of the present invention, the method comprising:
11. Download the web pages to be crawled and randomly select a portion of them for marking (clicking the important data on the page). JavaScript code that collects user clicks can be embedded in this randomly selected portion of the pages.
12. The users of the data click the data they care about on the web pages to be marked.
13. Collect the data marked by the marking personnel (that is, the data users above). Each click by the marking personnel triggers the relevant code (similar to the inspect-element function of a browser), which obtains the CSS selector of the clicked element (a CSS element selector controls the elements of a web page and is a way of locating an element in the DOM tree) and sends this CSS selector to the server together with other click-related information, for example the click time, the session ID (to identify the web page session), and the cookie (to identify the user).
14. On the server side, count how often each CSS selector occurs and select the most important CSS selectors.
15. Count the frequencies and take the three most frequently clicked elements as the important elements.
In the statistics, to prevent an individual's abnormal operations from unduly influencing the result, counting can be done by web page session, that is, by number of sessions. For each CSS selector, count the number of sessions in which a user clicked that selector. Multiple clicks within one session are then counted only once, which prevents a user's repeated clicks on an unimportant element from affecting the overall result.
Alternatively, the frequency can be counted by visiting user (cookie): for each CSS selector, count how many users clicked it. Multiple clicks by one user on an element are then counted once, which prevents a user's repeated clicks on an unimportant element from affecting the overall result.
16. Using the CSS selectors obtained in the previous step, parse the elements corresponding to those CSS selectors (the target elements above) from all the downloaded pages. When saving the target elements of the web pages, the attribute values of all the elements can also be saved to a database, completing the parsing of the web page annotation data.
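Step 16 can be sketched with the Python standard library alone. The source does not name a parser, so everything here is an assumption for illustration: the class `SimpleSelectorExtractor` handles only selectors of the form `tag` or `tag.class`, not full CSS, and the sample pages are made up. Saving to a database is left out; the function returns the text and attribute values that would be stored.

```python
from html.parser import HTMLParser


class SimpleSelectorExtractor(HTMLParser):
    """Collect text and attributes of elements matching 'tag' or 'tag.class'."""

    def __init__(self, selector):
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.matches = []  # one {"attrs": ..., "text": ...} per matching element
        self._depth = 0    # > 0 while inside a matching element

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Track nested tags of the same name so the match closes correctly.
            if tag == self.tag:
                self._depth += 1
            return
        attr_map = dict(attrs)
        classes = (attr_map.get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.matches.append({"attrs": attr_map, "text": ""})
            self._depth = 1

    def handle_endtag(self, tag):
        if self._depth and tag == self.tag:
            self._depth -= 1

    def handle_data(self, data):
        if self._depth:
            self.matches[-1]["text"] += data


def extract(pages, selector):
    """Apply one selector to every downloaded page: {url: [match, ...]}."""
    results = {}
    for url, html in pages.items():
        parser = SimpleSelectorExtractor(selector)
        parser.feed(html)
        parser.close()
        results[url] = [
            {"attrs": m["attrs"], "text": m["text"].strip()} for m in parser.matches
        ]
    return results


pages = {
    "p1": '<body><h1 class="title main" id="t1">Red <b>Kettle</b></h1>'
          '<p class="title">x</p></body>',
    "p2": '<h1 class="title">Blue Mug</h1>',
}
extracted = extract(pages, "h1.title")
```

In practice a library with real CSS selector support would likely replace the hand-written matcher; the point of the sketch is only that one selector learned from the marked sample is applied uniformly to every downloaded page.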
The above embodiment is illustrated with e-commerce product web pages. The technical staff can download all the e-commerce product pages and then extract several hundred of them (corresponding to the target web pages) for the business personnel to mark by clicking the important elements. The business personnel may be interested in product titles, so the product titles will be clicked many times. In the present invention, the CSS selector representing the title can thus be obtained from the web pages the users clicked and the location information of the elements in those pages. That CSS selector can then be used to parse all the downloaded e-commerce product pages, and the product titles stored in a database. At this point, the business personnel can view the product titles in the database.
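For the e-commerce scenario, the rule of keeping the three most frequently clicked elements might look like the sketch below. The selector names and the counts are invented for illustration, and `top_selectors` is a hypothetical helper; the counts are assumed to be per-session or per-user totals that have already been deduplicated.

```python
from collections import Counter


def top_selectors(counts, n=3):
    """Return the n selectors with the highest click counts, most frequent first."""
    return [selector for selector, _ in Counter(counts).most_common(n)]


# Made-up counts from a few hundred marked product pages.
counts = {
    "h1.product-title": 180,
    "span.price": 95,
    "img.main-photo": 60,
    "div.banner": 4,
    "a.footer-link": 1,
}
important = top_selectors(counts)
```

With these figures the title selector ranks first, so it is among the selectors that would be applied to every downloaded product page.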
According to another aspect of the embodiments of the present invention, a storage medium is further provided. The storage medium is used to store a program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute any one of the above methods for determining web page attribute data.
According to another aspect of the embodiments of the present invention, a processor is further provided. The processor is used to run a program, wherein, when the program runs, any one of the above methods for determining web page attribute data is executed.
Fig. 2 is a schematic diagram of a device for determining web page attribute data according to an embodiment of the present invention. As shown in Fig. 2, the device may comprise: a first determination unit 21, configured to determine multiple target web pages; a crawling unit 23, configured to crawl data from each target web page among the multiple target web pages to obtain a data crawling result; an acquisition unit 25, configured to obtain, according to the data crawling result, multiple pieces of annotation data on each target web page, wherein each piece of annotation data includes the number of times each element in the target web page appears; and a second determination unit 27, configured to determine attribute data of a target element according to the number of times each element in the target web page appears.
With the above device, the first determining unit 21 determines the selected target webpages, and the crawling unit 23 crawls data from each of them. Using the data crawling result, the acquiring unit 25 obtains multiple pieces of labeled data on each target webpage, where the labeled data includes the number of times each element appears in the webpage. Finally, the second determining unit 27 determines the attribute data of the target element according to the number of times each element in the target webpage appears. In other words, in this embodiment, data is crawled from the selected target webpages, the information of the elements on those webpages is acquired to obtain labeled data, and the attribute data of the target element is determined from the labeled data. In the embodiments of the present invention, the attribute data of the target element can be obtained solely from the labeled data and element information corresponding to the elements marked on the webpages. The expected data can therefore be obtained without the need for business communication, which solves the technical problem in the related art that deviations arising during communication cause large deviations in the crawled webpage data.
Optionally, the above device further includes a marking unit, configured to inject element-capture code into each target webpage after data has been crawled from each of the multiple target webpages and the data crawling result has been obtained, where the element-capture code is used to capture the target webpages that are marked and each element that is marked on each target webpage.
Optionally, the above acquiring unit 25 includes: a first acquiring module, configured to obtain the element-capture code according to the data crawling result after the multiple target webpages have been marked; a capturing module, configured to capture, by means of the element-capture code, each marked element on each target webpage and the attribute data of that element, to obtain a capture result; and a first determining module, configured to determine the multiple pieces of labeled data by using the capture result.
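As a rough sketch of how a capture result might be turned into labeled data, the Python snippet below groups captured marks by page and records, for each marked element, how many times it was marked and which attributes were captured. The event layout (`page`, `selector`, `attributes`) and the function name `build_labeled_data` are assumptions made for illustration; the patent does not specify a data format.

```python
from collections import Counter, defaultdict


def build_labeled_data(capture_result):
    """Group captured marks by page: per-element mark counts plus element attributes."""
    labeled = defaultdict(lambda: {"counts": Counter(), "attributes": {}})
    for event in capture_result:
        page = labeled[event["page"]]
        page["counts"][event["selector"]] += 1
        # Keep the most recently captured attributes for this element.
        page["attributes"][event["selector"]] = event["attributes"]
    return dict(labeled)


# Hypothetical capture result produced by the element-capture code.
capture_result = [
    {"page": "/item/1", "selector": "h1.title", "attributes": {"tag": "h1", "text": "Red mug"}},
    {"page": "/item/1", "selector": "h1.title", "attributes": {"tag": "h1", "text": "Red mug"}},
    {"page": "/item/1", "selector": "span.price", "attributes": {"tag": "span", "text": "9"}},
    {"page": "/item/2", "selector": "h1.title", "attributes": {"tag": "h1", "text": "Blue bowl"}},
]
labeled_data = build_labeled_data(capture_result)
```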
In addition, the second determining unit 27 includes: a statistics module, configured to count the total number of times each web page element in the multiple target webpages appears across the multiple target webpages, to obtain a statistical result; a second determining module, configured to determine, according to the statistical result, the target elements, namely the marked web page elements whose counts are greater than or equal to a preset threshold; and a second acquiring module, configured to obtain, according to the target elements, the multiple attributes corresponding to each target element, so as to determine the attribute data of the target elements.
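The three modules of the second determining unit can be sketched as three small Python functions: one sums the mark counts across pages, one applies the preset threshold, and one gathers the attributes recorded for each selected element. The per-page layout (`counts`, `attributes`), the function names, and the threshold value of 3 are hypothetical choices for this illustration only.

```python
from collections import Counter


def count_marks(labeled_pages):
    """Statistics module: total marks per element across all target pages."""
    totals = Counter()
    for page in labeled_pages.values():
        totals.update(page["counts"])
    return totals


def select_target_elements(totals, threshold):
    """Second determining module: keep elements marked at least `threshold` times."""
    return sorted(sel for sel, n in totals.items() if n >= threshold)


def collect_attributes(targets, labeled_pages):
    """Second acquiring module: gather the recorded attributes of each target element."""
    data = {sel: [] for sel in targets}
    for page in labeled_pages.values():
        for sel, attrs in page["attributes"].items():
            if sel in data:
                data[sel].append(attrs)
    return data


# Hypothetical labeled data for two target pages.
labeled_pages = {
    "/item/1": {"counts": {"h1.title": 2, "span.price": 1},
                "attributes": {"h1.title": {"text": "Red mug"}, "span.price": {"text": "9"}}},
    "/item/2": {"counts": {"h1.title": 1},
                "attributes": {"h1.title": {"text": "Blue bowl"}}},
}
totals = count_marks(labeled_pages)
targets = select_target_elements(totals, threshold=3)
attribute_data = collect_attributes(targets, labeled_pages)
```

With these inputs only the title element reaches the threshold, so the price element is excluded from the attribute data.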
The above statistics module may include: a first statistics submodule, configured to count the total number of web page access sessions, where a web page access session is the session corresponding to each visit to a webpage; a first filtering submodule, configured to filter out web page elements that are repeated within a web page access session, to obtain a first filtering result; and a first determining submodule, configured to determine, according to the first filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
In addition, the above statistics module may further include: a second statistics submodule, configured to collect the user data corresponding to each user who clicks a web page element locator during web page access; a second filtering submodule, configured to filter out, according to the user data, the data of web page elements that are clicked repeatedly while the web page element locator is being clicked, to obtain a second filtering result; and a second determining submodule, configured to determine, according to the second filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
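The two filtering strategies described for the statistics module, one keyed on the access session and one keyed on the user, can be sketched with a single Python helper that counts each element at most once per session or per user. The record layout (`session`, `user`, `selector`) and the function name `count_unique_marks` are assumptions for illustration; the patent does not define how sessions or users are identified.

```python
from collections import Counter


def count_unique_marks(records, key):
    """Count each element at most once per distinct value of `key`.

    records: dicts with "session", "user" and "selector" fields (hypothetical layout).
    key: "session" or "user", selecting which kind of repetition to filter out.
    """
    unique = {(rec[key], rec["selector"]) for rec in records}  # filtering result
    totals = Counter(selector for _group, selector in unique)
    groups = len({rec[key] for rec in records})
    return groups, totals


records = [
    {"session": "s1", "user": "alice", "selector": "h1.title"},
    {"session": "s1", "user": "alice", "selector": "h1.title"},   # repeat within session s1
    {"session": "s2", "user": "alice", "selector": "h1.title"},   # same user, new session
    {"session": "s3", "user": "bob", "selector": "h1.title"},
    {"session": "s3", "user": "bob", "selector": "span.price"},
]
session_count, per_session = count_unique_marks(records, "session")
user_count, per_user = count_unique_marks(records, "user")
```

Filtering per user is the stricter of the two: a user who marks the same element in several sessions contributes only one count.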
Optionally, the device further includes: a receiving module, configured to receive a business demand parameter before the multiple target webpages are determined; and an obtaining module, configured to obtain the multiple target webpages according to the business demand parameter, where, while the target webpages are being obtained, marker code is embedded in each element of each target webpage, and the marker code is used to record data on the marking operations that users perform on the target webpage.
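One possible reading of the receiving and obtaining modules is sketched below in Python: a business demand parameter names the kind of page wanted and how many pages to hand to annotators, and the target webpages are drawn at random from the downloaded pages that match. The parameter keys `category` and `sample_size`, the page layout, and the function name `select_target_pages` are all hypothetical; embedding the marker code into each element is not shown.

```python
import random


def select_target_pages(downloaded_pages, demand, seed=None):
    """Pick the pages to be marked, based on hypothetical business demand parameters."""
    candidates = [p for p in downloaded_pages if p["category"] == demand["category"]]
    sample_size = min(demand["sample_size"], len(candidates))
    # Random sampling without replacement; `seed` makes the draw reproducible.
    return random.Random(seed).sample(candidates, sample_size)


# Hypothetical crawl output: 1000 product pages plus one unrelated page.
downloaded_pages = [{"url": f"/item/{i}", "category": "product"} for i in range(1000)]
downloaded_pages.append({"url": "/about", "category": "info"})
demand = {"category": "product", "sample_size": 300}
target_pages = select_target_pages(downloaded_pages, demand, seed=0)
```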
The above device for determining webpage attribute data may also include a processor and a memory. The first determining unit 21, the crawling unit 23, the acquiring unit 25, the second determining unit 27, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel fetches the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting kernel parameters, the attribute data of the target element is obtained according to the labeled data and element information corresponding to the elements marked on the webpages.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). The memory includes at least one memory chip.
An embodiment of the present invention provides a device that includes a processor, a memory, and a program that is stored in the memory and can run on the processor. When executing the program, the processor performs the following steps: determining multiple target webpages; crawling data from each of the multiple target webpages to obtain a data crawling result; obtaining, according to the data crawling result, multiple pieces of labeled data on each target webpage, where each piece of labeled data includes the number of times each element in the target webpage appears; and determining the attribute data of a target element according to the number of times each element in the target webpage appears.
Optionally, when executing the program, the above processor may also: obtain element-capture code according to the data crawling result, where the element-capture code is used to capture the target webpages that are marked and each element that is marked on each target webpage; capture, by means of the element-capture code, each marked element on each target webpage and the attribute data of that element, to obtain a capture result; and determine the multiple pieces of labeled data by using the capture result.
Optionally, when executing the program, the above processor may also: count the total number of times each web page element in the multiple target webpages appears across the multiple target webpages, to obtain a statistical result; determine, according to the statistical result, the target elements, namely the marked web page elements whose counts are greater than or equal to a preset threshold; and obtain, according to the target elements, the multiple attributes corresponding to each target element, so as to determine the attribute data of the target elements.
Optionally, when executing the program, the above processor may: count the total number of web page access sessions, where a web page access session is the session corresponding to each visit to a webpage; filter out web page elements that are repeated within a web page access session, to obtain a first filtering result; and determine, according to the first filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
Optionally, when executing the program, the above processor may: collect the user data corresponding to each user who clicks a web page element locator during web page access; filter out, according to the user data, the data of web page elements that are clicked repeatedly while the web page element locator is being clicked, to obtain a second filtering result; and determine, according to the second filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
Optionally, when executing the program, the above processor may also: receive a business demand parameter; and obtain the multiple target webpages according to the business demand parameter, where, while the target webpages are being obtained, marker code is embedded in each element of each target webpage, and the marker code is used to record data on the marking operations that users perform on the target webpage.
The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
In the above embodiments of the present invention, the description of each embodiment has its own emphasis. For parts that are not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units may be a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence, or the part of it that contributes over the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the present invention, and these improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (10)
1. A method for determining webpage attribute data, characterized by comprising:
determining multiple target webpages;
crawling data from each target webpage of the multiple target webpages to obtain a data crawling result;
obtaining, according to the data crawling result, multiple pieces of labeled data on each target webpage, wherein each piece of labeled data includes the number of times each element in the target webpage appears;
determining the attribute data of a target element according to the number of times each element in the target webpage appears.
2. The method according to claim 1, characterized in that, after crawling data from each target webpage of the multiple target webpages to obtain the data crawling result, the method comprises:
injecting element-capture code into each target webpage, wherein the element-capture code is used to capture the target webpages that are marked and each element that is marked on each target webpage.
3. The method according to claim 2, characterized in that, after the multiple target webpages are marked, obtaining, according to the data crawling result, the multiple pieces of labeled data on each target webpage comprises:
obtaining the element-capture code according to the data crawling result;
capturing, by means of the element-capture code, each marked element on each target webpage and the attribute data of that element, to obtain a capture result;
determining the multiple pieces of labeled data by using the capture result.
4. The method according to claim 1, characterized in that determining the attribute data of the target element according to the number of times each element in the target webpage appears comprises:
counting the total number of times each web page element in the multiple target webpages appears across the multiple target webpages, to obtain a statistical result;
determining, according to the statistical result, the target elements, namely the marked web page elements whose counts are greater than or equal to a preset threshold;
obtaining, according to the target elements, the multiple attributes corresponding to each target element, so as to determine the attribute data of the target elements.
5. The method according to claim 4, characterized in that counting the total number of times each web page element in the multiple target webpages appears across the multiple target webpages, to obtain the statistical result, comprises:
counting the total number of web page access sessions, wherein a web page access session is the session corresponding to each visit to a webpage;
filtering out web page elements that are repeated within a web page access session, to obtain a first filtering result;
determining, according to the first filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
6. The method according to claim 4, characterized in that counting the total number of times each web page element in the multiple target webpages appears across the multiple target webpages, to obtain the statistical result, comprises:
collecting the user data corresponding to each user who clicks a web page element locator during web page access;
filtering out, according to the user data, the data of web page elements that are clicked repeatedly while the web page element locator is being clicked, to obtain a second filtering result;
determining, according to the second filtering result, the total number of times each web page element appears, so as to obtain the statistical result.
7. The method according to claim 1, characterized in that, before determining the multiple target webpages, the method further comprises:
receiving a business demand parameter;
obtaining the multiple target webpages according to the business demand parameter, wherein, while the target webpages are being obtained, marker code is embedded in each element of each target webpage, and the marker code is used to record data on the marking operations that users perform on the target webpage.
8. A device for determining webpage attribute data, characterized by comprising:
a first determining unit, configured to determine multiple target webpages;
a crawling unit, configured to crawl data from each target webpage of the multiple target webpages to obtain a data crawling result;
an acquiring unit, configured to obtain, according to the data crawling result, multiple pieces of labeled data on each target webpage, wherein each piece of labeled data includes the number of times each element in the target webpage appears;
a second determining unit, configured to determine the attribute data of a target element according to the number of times each element in the target webpage appears.
9. A storage medium, characterized in that the storage medium is used to store a program, wherein, when the program runs, it controls the device in which the storage medium is located to execute the method for determining webpage attribute data according to any one of claims 1 to 7.
10. A processor, characterized in that the processor is used to run a program, wherein, when the program runs, it executes the method for determining webpage attribute data according to any one of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810219804.9A CN110275998B (en) | 2018-03-16 | 2018-03-16 | Method and device for determining webpage attribute data |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810219804.9A CN110275998B (en) | 2018-03-16 | 2018-03-16 | Method and device for determining webpage attribute data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110275998A true CN110275998A (en) | 2019-09-24 |
CN110275998B CN110275998B (en) | 2021-07-30 |
Family
ID=67957841
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810219804.9A Active CN110275998B (en) | 2018-03-16 | 2018-03-16 | Method and device for determining webpage attribute data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110275998B (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113836316A (en) * | 2021-09-23 | 2021-12-24 | 北京百度网讯科技有限公司 | Processing method, training method, device, equipment and medium for ternary group data |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007041800A1 (en) * | 2005-10-14 | 2007-04-19 | Panscient Inc | Information extraction system |
CN103294711A (en) * | 2012-02-28 | 2013-09-11 | 阿里巴巴集团控股有限公司 | Method and device for determining page elements in web page |
CN103916293A (en) * | 2014-04-15 | 2014-07-09 | 浪潮软件股份有限公司 | Method for monitoring and analyzing website user behaviors |
CN104021185A (en) * | 2014-06-11 | 2014-09-03 | 北京奇虎科技有限公司 | Method and device for identifying information attributes of data in web pages |
CN105447139A (en) * | 2015-11-20 | 2016-03-30 | 广州华多网络科技有限公司 | Data acquisition statistical method, and system, terminal and service equipment thereof |
CN107562620A (en) * | 2017-08-24 | 2018-01-09 | 阿里巴巴集团控股有限公司 | One kind buries an automatic setting method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110275998B (en) | 2021-07-30 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN105959371B (en) | Webpage share system | |
CN103218431B (en) | A kind ofly can identify the system that info web gathers automatically | |
CN106844522B (en) | A kind of network data crawling method and device | |
CN104077415B (en) | Searching method and device | |
CN109242553A (en) | A kind of user behavior data recommended method, server and computer-readable medium | |
CN108304410A (en) | A kind of detection method, device and the data analysing method of the abnormal access page | |
CN106570013A (en) | Method and device for processing page access data | |
CN105260414B (en) | User behavior similarity calculation method and device | |
CN102970348B (en) | Network application method for pushing, system and network application server | |
CN103186670A (en) | Method and system for integrally acquiring webpage information | |
CN107783993A (en) | The storage method and device of data | |
CN104899306B (en) | Information processing method, information display method and device | |
CN104298782B (en) | Internet user actively accesses the analysis method of action trail | |
CN104391953B (en) | Detect the method and device of webpage renewal | |
CN110222253A (en) | A kind of collecting method, equipment and computer readable storage medium | |
CN107995092A (en) | Handle the method and device of the information of network social intercourse platform issue | |
JP6286559B2 (en) | Method and device for adding sign icons in interactive applications | |
CN104731937B (en) | The processing method and processing device of user behavior data | |
CN104967698B (en) | A kind of method and apparatus crawling network data | |
CN104462242B (en) | Webpage capacity of returns statistical method and device | |
CN109145194A (en) | The acquisition method and device of user behavior data | |
CN103838728B (en) | The processing method and browser of info web | |
CN110275998A (en) | The determination method and device of webpage attribute data | |
CN106815248A (en) | Web analytics method and device | |
CN106897297B (en) | Method and device for determining access path between website columns |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||