CN110275998B - Method and device for determining webpage attribute data - Google Patents


Publication number
CN110275998B
Authority
CN
China
Prior art keywords
target
webpage
data
determining
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810219804.9A
Other languages
Chinese (zh)
Other versions
CN110275998A (en
Inventor
王蒙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd filed Critical Beijing Gridsum Technology Co Ltd
Priority to CN201810219804.9A priority Critical patent/CN110275998B/en
Publication of CN110275998A publication Critical patent/CN110275998A/en
Application granted granted Critical
Publication of CN110275998B publication Critical patent/CN110275998B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G06F16/986 Document structures and storage, e.g. HTML extensions

Abstract

The invention discloses a method and a device for determining webpage attribute data. The method comprises the following steps: determining a plurality of target webpages; performing data crawling on each of the plurality of target webpages to obtain a data crawling result; acquiring a plurality of annotation data on each target webpage according to the data crawling result, wherein each annotation data comprises the number of occurrences of each element in the target webpage; and determining attribute data of the target elements according to the number of occurrences of each element in the target webpage. The method and the device solve the technical problem in the related art that crawled webpage data deviates greatly from what was wanted because of misunderstandings in the communication process.

Description

Method and device for determining webpage attribute data
Technical Field
The invention relates to the technical field of internet, in particular to a method and a device for determining webpage attribute data.
Background
In the related art, when a business person or a client needs attribute data for certain fields and elements in a web page, the business person and a technician must communicate repeatedly: the business person tells the technician which web page fields or attribute data are wanted, and the technician crawls the pages according to his or her own understanding. This requires the technician to grasp the requirements quickly and accurately in order to crawl the content that the business person or client actually wants. In practice, the business person or client may express the requirements unclearly, or the technician may misunderstand them, so that the crawled web pages or the attribute data of their elements deviate greatly from what the client (or business person) expected, and the crawl has to be repeated.
For the technical problem in the related art that the elements of crawled web pages deviate greatly from expectations because of misunderstandings in the communication process, no effective solution has yet been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for determining webpage attribute data, which are used for at least solving the technical problem of large deviation of crawled webpage data caused by deviation in a communication process in the related technology.
According to an aspect of the embodiments of the present invention, a method for determining attribute data of a web page is provided, including: determining a plurality of target webpages; performing data crawling on each target webpage in the plurality of target webpages to obtain a data crawling result; acquiring a plurality of labeled data on each target webpage according to the data crawling result, wherein each labeled data comprises the occurrence frequency of each element in the target webpage; and determining attribute data of the target elements according to the occurrence times of each element in the target webpage.
Further, after data crawling is performed on each of the plurality of target webpages to obtain the data crawling result, the method further includes: injecting a capture element code into each target webpage, wherein the capture element code is used to capture the annotated target webpages and the annotated elements on each target webpage.
Further, after the target webpages have been annotated, acquiring the plurality of annotation data on each target webpage according to the data crawling result includes: acquiring the capture element code according to the data crawling result; capturing, by means of the capture element code, each annotated element and its element attribute data on each target webpage to obtain a capture result; and determining the plurality of annotation data using the capture result.
Further, determining the attribute data of the target elements according to the number of occurrences of each element in the target webpage includes: counting the total number of times each webpage element appears across the plurality of target webpages to obtain a statistical result; determining, according to the statistical result, the target elements, namely the annotated webpage elements whose number of occurrences is greater than or equal to a preset threshold; and acquiring the attributes corresponding to each target element to determine the attribute data of the target elements.
Further, counting the total number of times each webpage element appears across the plurality of target webpages to obtain the statistical result includes: counting the total number of webpage access sessions, wherein a webpage access session is the session corresponding to one access to a webpage; filtering out webpage elements that appear repeatedly within a webpage access session to obtain a first filtering result; and determining the total number of occurrences of each webpage element according to the first filtering result to obtain the statistical result.
Further, counting the total number of times each webpage element appears across the plurality of target webpages to obtain the statistical result includes: collecting the user data corresponding to the users who click each webpage element locator during webpage access; filtering out, according to the user data, the data of webpage elements clicked repeatedly by the same user to obtain a second filtering result; and determining the total number of occurrences of each webpage element according to the second filtering result to obtain the statistical result.
Further, before determining the plurality of target webpages, the method further includes: receiving a business requirement parameter; and acquiring the plurality of target webpages according to the business requirement parameter, wherein, while the target webpages are being acquired, a mark code is embedded into each element of each target webpage, and the mark code is used to record the data of the annotation operations a user performs on the target webpages.
According to another aspect of the embodiments of the present invention, a device for determining webpage attribute data is also provided, including: a first determining unit, configured to determine a plurality of target webpages; a crawling unit, configured to perform data crawling on each of the plurality of target webpages to obtain a data crawling result; an acquisition unit, configured to acquire a plurality of annotation data on each target webpage according to the data crawling result, wherein each annotation data includes the number of occurrences of each element in the target webpage; and a second determining unit, configured to determine the attribute data of the target elements according to the number of occurrences of each element in the target webpage.
Further, the device also includes: an injection unit, configured to inject a capture element code into each target webpage after data crawling has been performed on each of the plurality of target webpages to obtain the data crawling result, wherein the capture element code is used to capture the annotated target webpages and the annotated elements on each target webpage.
Further, the acquisition unit includes: a first acquisition module, configured to acquire the capture element code according to the data crawling result after the target webpages have been annotated, wherein the capture element code is used to capture the annotated target webpages and the annotated elements on each target webpage; a capture module, configured to capture, by means of the capture element code, each annotated element and its element attribute data on each target webpage to obtain a capture result; and a first determining module, configured to determine the plurality of annotation data using the capture result.
Further, the second determining unit includes: a statistics module, configured to count the total number of times each webpage element appears across the plurality of target webpages to obtain a statistical result; a second determining module, configured to determine, according to the statistical result, the target elements, namely the annotated webpage elements whose number of occurrences is greater than or equal to a preset threshold; and a second acquisition module, configured to acquire the attributes corresponding to each target element to determine the attribute data of the target elements.
Further, the statistics module includes: a first statistics submodule, configured to count the total number of webpage access sessions, wherein a webpage access session is the session corresponding to one access to a webpage; a first filtering submodule, configured to filter out webpage elements that appear repeatedly within a webpage access session to obtain a first filtering result; and a first determining submodule, configured to determine the total number of occurrences of each webpage element according to the first filtering result to obtain the statistical result.
Further, the statistics module also includes: a second statistics submodule, configured to collect the user data corresponding to the users who click each webpage element locator during webpage access; a second filtering submodule, configured to filter out, according to the user data, the data of webpage elements clicked repeatedly by the same user to obtain a second filtering result; and a second determining submodule, configured to determine the total number of occurrences of each webpage element according to the second filtering result to obtain the statistical result.
Further, the device also includes: a receiving module, configured to receive a business requirement parameter before the plurality of target webpages is determined; and an acquisition module, configured to acquire the plurality of target webpages according to the business requirement parameter, wherein, while the target webpages are being acquired, a mark code is embedded into each element of each target webpage, and the mark code is used to record the data of the annotation operations a user performs on the target webpages.
According to another aspect of the embodiments of the present invention, there is also provided a storage medium, where the storage medium is used for storing a program, and when the program runs, a device on which the storage medium is located is controlled to execute any one of the above methods for determining web page attribute data.
According to another aspect of the embodiments of the present invention, there is also provided a processor, where the processor is configured to execute a program, where the program executes the method for determining the web page attribute data described in any one of the above.
In the invention, a plurality of selected target webpages can be determined, data crawling is carried out on each target webpage, and then a plurality of labeled data on each target webpage can be obtained by using the data crawling result, wherein the labeled data can correspond to the information of each element of the webpage and also comprises the occurrence frequency of each element in the webpage, so that the data of the target element and the attribute corresponding to the target element can be determined according to the occurrence frequency of each element. In other words, in this embodiment, data crawling may be performed on the selected target webpage, and information of elements on the webpage is obtained, that is, attribute data of the target element may be obtained, and data that is expected to be obtained may be obtained without business communication, so as to solve a technical problem in the related art that a deviation of data of the crawled webpage is large due to a deviation occurring in a communication process.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of a method of determining web page attribute data according to an embodiment of the present invention;
FIG. 2 is a schematic diagram of an apparatus for determining web page attribute data according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an embodiment of the present invention, a method embodiment for determining web page attribute data is provided. It should be noted that the steps illustrated in the flowchart of the drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowchart, in some cases the steps shown or described may be performed in a different order.
The method can be applied to scenarios such as field analysis of various web pages, extraction of web page elements, and attribute analysis of web page elements. It applies in particular to the many different web pages on the internet, where clients or business personnel need technical support to obtain web page fields or attributes in their work.
The following is a preferred method embodiment according to the present invention. FIG. 1 is a flowchart of a method for determining web page attribute data according to an embodiment of the present invention. As shown in FIG. 1, the method includes the following steps:
Step S102: determine a plurality of target web pages.
Optionally, a target web page in the present invention may be a web page selected by the terminal according to a business requirement. The type of web page may include, but is not limited to: shopping web pages (e.g., shopping pages on Taobao or Jingdong), travel web pages (e.g., pages of online travel booking sites), appliance web pages (e.g., web pages for Gree appliances), technology web pages (e.g., Baidu Baike encyclopedia pages), and so on. Before the plurality of target web pages is determined, a business requirement parameter may be received, and the target web pages are acquired according to that parameter. While the target web pages are being acquired, a mark code is embedded into each element of each target web page; the mark code records the data of the annotation operations the user performs on the target web page, and it can also guide the capture element code so that the annotation data is easier to record. In other words, the web pages the user needs are selected from all web pages on the internet according to the business requirement parameter, and mark code is embedded into the selected pages. The mark code (for example, JavaScript code) collects the clicks that the data user (such as a client or a business person) makes on the page, and the collected data is used to determine the web pages, web page elements, and element attributes the data user is interested in.
After the target webpage is determined, the data user can mark the selected webpage and click the interested data, and after the data user clicks the relevant position of the webpage, the background can automatically record the information of the webpage clicked by the data user, the elements of the clicked webpage and the like.
Step S104: perform data crawling on each of the plurality of target web pages to obtain a data crawling result.
In the embodiment of the present invention, data crawling may refer to crawling the annotated web pages and their content to obtain the annotated content, that is, collecting the data user's annotation data. The annotated content may be determined by the capture element code: the code is triggered by each click of the data user and captures the annotated content of the web page.
Step S106: acquire a plurality of annotation data on each target web page according to the data crawling result, wherein each annotation data includes the number of occurrences of each element in the target web page.
Through the above steps, a plurality of annotation data on each web page can be obtained. When the annotation data on each target web page is acquired, the capture element code can be obtained from the data crawling result. In the embodiment of the present invention, business personnel annotate the target web pages and the annotation results are returned to the data processing personnel, who can determine each item of annotation data from the capture element code and the log records. Specifically, the capture element code captures each annotated element and its element attribute data on each target web page to obtain a capture result, and the plurality of annotation data is determined from the capture result. Optionally, the present invention limits neither the web pages that are captured nor the elements clicked within them; for example, an element may be a description file in the web page or a shopping element in the web page. The attribute data of an element is likewise not specifically limited; for example, if the element captured in the web page is a microblog, the corresponding element attribute data may include the microblog avatar, the microblog name, the user's gender, and the like.
In addition, the capture element code can obtain the position of a clicked web page element and thereby determine position information, which may be represented by web page text or a web page URL (uniform resource locator). The captured position information and the related access information are sent to the server, and the server determines the target elements and their attribute data. The related access information may include, but is not limited to: the point in time at which the web page was clicked, the duration of the session accessing the web page, the session ID of that session, and user information for accessing the web page (including a user account and/or password).
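As a minimal sketch of the data sent to the server, the record below groups the position information and the access information listed above. The class name, field names, and JSON encoding are illustrative assumptions; the patent does not specify a data format.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ClickRecord:
    """One annotated click (hypothetical structure, not defined by the patent)."""
    locator: str            # position information of the clicked element
    page_url: str           # URL of the web page that was clicked
    clicked_at: float       # point in time of the click (Unix timestamp)
    session_id: str         # ID of the web page access session
    session_seconds: float  # duration of the session so far
    user_id: str            # identifier of the data user


def to_payload(record: ClickRecord) -> str:
    """Serialize a click record to JSON for sending to the server."""
    return json.dumps(asdict(record), sort_keys=True)


record = ClickRecord(
    locator="h1.title",
    page_url="https://example.com/item/1",
    clicked_at=1520000000.0,
    session_id="s-1",
    session_seconds=42.5,
    user_id="u-1",
)
payload = to_payload(record)
```

The sketch deliberately carries only a user identifier rather than account credentials; which user fields are actually transmitted is left open here.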
The total times of occurrence of each element in different webpages can be counted to obtain corresponding statistical results.
Step S108: determine the attribute data of the target elements according to the number of occurrences of each element in the target web page.
By the embodiment, the selected target webpages can be determined firstly, and data crawling is performed on each target webpage, so that a plurality of labeled data on each target webpage are obtained by using data crawling results, the labeled data comprise the occurrence frequency of each element in the target webpage, and the target elements and the attribute data corresponding to the target elements are determined according to the occurrence frequency of each element in the webpage. In other words, in this embodiment, data crawling may be performed on the selected target webpage, information of elements on the webpage may be obtained, attribute data of the target element may be obtained by analyzing contents of each element in the webpage, and data that is expected to be obtained may be obtained without business communication, so as to solve a technical problem in the related art that a deviation of crawled webpage data is large due to a deviation occurring in a communication process.
For the above embodiment, the attribute data of the target elements may be determined from the number of occurrences of each element in the target web pages as follows: count the total number of times each web page element appears across the plurality of target web pages to obtain a statistical result; according to the statistical result, determine the target elements, namely the annotated web page elements whose number of occurrences is greater than or equal to a preset threshold; and acquire the attributes corresponding to each target element to determine the attribute data of the target elements. The statistical result may include, but is not limited to, the number of times each target web page is annotated and the number of times each web page element is annotated.
That is, once the element information and element attribute data of the web pages have been obtained, the target elements are determined by the number of times each element appears during selection; there may be one or more target elements. The preset threshold may be set according to actual use, for example to 3 or 5. Once the total number of occurrences of an element reaches the preset threshold, the element is taken as a target element, and the link information of each target element is collected to obtain the corresponding attribute data.
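The threshold selection described above can be sketched as follows. The function name and the sample locators and counts are illustrative assumptions; the comparison uses "greater than or equal to", as in the summary of the method.

```python
from collections import Counter
from typing import List, Mapping


def select_target_elements(counts: Mapping[str, int], threshold: int = 3) -> List[str]:
    """Return the locators whose occurrence count is >= threshold, sorted by name."""
    return sorted(locator for locator, n in counts.items() if n >= threshold)


# Hypothetical occurrence counts for three annotated elements.
counts = Counter({"h1.title": 5, "span.price": 3, "div.ad": 1})
targets = select_target_elements(counts, threshold=3)
```

With a threshold of 3, the element annotated once is dropped and the other two are kept as target elements.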
The target element and the element attribute data are determined by counting the frequency of clicking each element of the webpage by the data user, wherein the frequency of occurrence of the elements can be determined according to the number of sessions or the number of users during counting.
When a statistical result is obtained by counting the total times of occurrence of each webpage element in a plurality of target webpages, the total times of occurrence of a webpage access session can be counted, wherein the webpage access session is a session corresponding to each webpage access; filtering out elements of repeated webpages appearing in the webpage access session process to obtain a first filtering result; and determining the total occurrence frequency of each webpage element according to the first filtering result so as to obtain a statistical result.
Counting by web page access session prevents one person's operations from having an excessive influence on the statistical result. A web page access session may be the process from entering a web page to closing it within a single visit; for example, the process from the data user entering the Taobao website to closing it counts as one web page access session. Elements clicked repeatedly within one web page access session are filtered out, so that unimportant web page elements do not distort the overall statistical result.
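A minimal sketch of session-based counting, assuming each click arrives as a (session ID, element locator) pair; the function name and sample data are hypothetical. Repeated clicks on the same element within one session collapse to a single occurrence.

```python
from collections import Counter
from typing import Iterable, Tuple


def count_sessions_per_element(clicks: Iterable[Tuple[str, str]]) -> Counter:
    """Count, for each element locator, the number of distinct sessions that clicked it."""
    unique_pairs = set(clicks)  # drops repeated clicks within the same session
    return Counter(locator for _session_id, locator in unique_pairs)


clicks = [
    ("s1", "h1.title"),
    ("s1", "h1.title"),  # repeated click in session s1, filtered out
    ("s2", "h1.title"),
    ("s1", "div.ad"),
]
session_counts = count_sessions_per_element(clicks)
```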
In addition, when the total times of the webpage elements in the target webpages appearing in the target webpages are counted to obtain a statistical result, user data corresponding to a user clicking each webpage element locator in the webpage access process can be counted; according to the user data, filtering data of elements of the repeatedly clicked webpage in the process of clicking the webpage element locator to obtain a second filtering result; and determining the total occurrence frequency of each webpage element according to the second filtering result so as to obtain a statistical result.
That is, the user data of the visitors who click each web page element locator can be collected, and multiple clicks by one user on the same web page element are counted as a single click, so that repeated clicks on an unimportant element do not distort the result.
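User-based counting can be sketched in the same spirit, here keeping the set of users per element so that it is also possible to see who clicked what. The record keys `locator` and `user_id` are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, Mapping, Set


def count_users_per_element(records: Iterable[Mapping[str, str]]) -> Dict[str, int]:
    """Count, for each element locator, the number of distinct users who clicked it."""
    users: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        users[record["locator"]].add(record["user_id"])  # a set ignores repeat clicks
    return {locator: len(ids) for locator, ids in users.items()}


records = [
    {"locator": "h1.title", "user_id": "u1"},
    {"locator": "h1.title", "user_id": "u2"},
    {"locator": "div.ad", "user_id": "u1"},
    {"locator": "div.ad", "user_id": "u1"},  # same user again, counted once
    {"locator": "div.ad", "user_id": "u1"},
]
user_counts = count_users_per_element(records)
```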
In the embodiment of the present invention, once these statistics have been computed, the elements located by the resulting element locators are parsed from the web pages, and the attribute data corresponding to those elements is stored.
The following is a method for analyzing a web page field according to an embodiment of the present invention, where the method includes:
11. Download the web pages to be crawled and randomly select a portion of them for annotation (that is, for clicking the important data on the page). JavaScript code that collects the user's clicks is embedded in the randomly selected pages.
12. The data user clicks the data of interest on the web pages to be annotated.
13. Collect the data annotated by the annotator (that is, the data user) and send it to the server. Each click by the annotator triggers the related code (similar to a browser's Inspect Element function), which captures the CSS selector of the clicked element. A CSS selector selects elements of a web page and is a way of locating an element's position in the DOM (document object model) tree. The CSS selector is sent to the server together with other information about the click, such as the click time, the session ID (identifying the web page session), and the cookie (used to distinguish users when counting them).
14. On the server side, count the number of occurrences of each CSS selector and select the most important CSS selectors.
15. Count the frequencies and set the three elements clicked most often as the important elements.
During counting, to prevent one person's abnormal operations from having an excessive influence on the result, the statistics may be computed per web page session; that is, sessions are counted. For each CSS selector, count the number of sessions in which the user clicked that selector, so that multiple clicks within one session are counted only once. This prevents a user who clicks an unimportant element many times from distorting the overall result.
Alternatively, the frequency may be counted by visiting user: for each CSS selector, count the number of users who clicked it, so that multiple clicks by one user on an element are counted only once. This likewise prevents a user who clicks an unimportant element many times from distorting the overall result.
16. According to the CSS selectors obtained in the previous step, parse the elements (corresponding to the target elements) matched by those selectors in all downloaded pages, and store the attribute values of those elements in a database, which completes the analysis of the web page annotation data.
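Steps 14 and 15 can be sketched with a frequency table from which the three most frequent selectors are taken. The function name and the sample counts are illustrative assumptions; the counts are taken to be already de-duplicated per session or per user.

```python
from collections import Counter
from typing import List, Mapping


def top_selectors(counts: Mapping[str, int], k: int = 3) -> List[str]:
    """Return the k selectors with the highest counts, most frequent first."""
    return [selector for selector, _n in Counter(counts).most_common(k)]


# Hypothetical per-session counts for four CSS selectors.
selector_counts = {"h1.title": 40, "span.price": 25, "img.main": 12, "div.ad": 2}
important = top_selectors(selector_counts)
```

Selectors with equal counts are returned in the order they were first encountered, so a real deployment may want an explicit tie-breaking rule.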
The above embodiment is explained below using e-commerce product pages as an example. The technician may first download all web pages of the e-commerce products. Several hundred pages are then extracted (corresponding to the target web pages) for the business person to annotate by clicking the important elements. The business person may be interested in the product title, so the title receives a large number of clicks. In the present invention, the CSS selector representing the title can therefore be obtained from the web pages the user clicked and the position information of the elements within them; all downloaded e-commerce product pages can then be parsed with that selector and the product titles stored in the database. At this point, the business person can view the product titles in the database.
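The final parsing and storage step of this example might look like the sketch below. It is a simplification under stated assumptions: only selectors of the form `tag` or `tag.class` are supported (a real system would use a full CSS selector engine), the table layout and page contents are invented for illustration, and an in-memory SQLite database stands in for the unspecified database.

```python
import sqlite3
from html.parser import HTMLParser
from typing import Dict, List


class SimpleSelectorExtractor(HTMLParser):
    """Collect the text of elements matching a 'tag' or 'tag.class' selector."""

    def __init__(self, selector: str) -> None:
        super().__init__()
        self.tag, _, self.cls = selector.partition(".")
        self.depth = 0  # greater than 0 while inside a matching element
        self.buffer: List[str] = []
        self.results: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            if tag == self.tag:  # nested element with the same tag name
                self.depth += 1
            return
        classes = (dict(attrs).get("class") or "").split()
        if tag == self.tag and (not self.cls or self.cls in classes):
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.tag:
            self.depth -= 1
            if self.depth == 0:  # the matching element has closed
                self.results.append("".join(self.buffer).strip())
                self.buffer = []

    def handle_data(self, data):
        if self.depth:
            self.buffer.append(data)


def extract(html: str, selector: str) -> List[str]:
    parser = SimpleSelectorExtractor(selector)
    parser.feed(html)
    parser.close()
    return parser.results


def store_titles(pages: Dict[str, str], selector: str, conn: sqlite3.Connection) -> None:
    """Parse every downloaded page with the selector and store the matches."""
    conn.execute("CREATE TABLE IF NOT EXISTS titles (url TEXT, title TEXT)")
    for url, html in pages.items():
        for text in extract(html, selector):
            conn.execute("INSERT INTO titles VALUES (?, ?)", (url, text))
    conn.commit()


# Hypothetical downloaded product pages.
pages = {
    "https://example.com/item/1": '<h1 class="title">Red Kettle</h1><p class="price">19</p>',
    "https://example.com/item/2": '<h1 class="title big">Blue <b>Mug</b></h1>',
}
conn = sqlite3.connect(":memory:")
store_titles(pages, "h1.title", conn)
rows = conn.execute("SELECT url, title FROM titles ORDER BY url").fetchall()
```

After this runs, `rows` holds one (URL, title) pair per product page, which is what the business person would look up in the database.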
According to another aspect of the embodiments of the present invention, there is also provided a storage medium, where the storage medium is used to store a program, and when the program runs, a device in which the storage medium is located is controlled to execute any one of the above methods for determining web page attribute data.
According to another aspect of the embodiments of the present invention, there is also provided a processor configured to run a program, where the program, when running, executes any one of the above methods for determining web page attribute data.
Fig. 2 is a schematic diagram of an apparatus for determining web page attribute data according to an embodiment of the present invention, and as shown in fig. 2, the apparatus may include: a first determining unit 21 for determining a plurality of target web pages; the crawling unit 23 is configured to perform data crawling on each target webpage in the multiple target webpages to obtain a data crawling result; the obtaining unit 25 is configured to obtain, according to the data crawling result, a plurality of label data on each target webpage, where each label data includes the number of times that each element in the target webpage appears; and a second determining unit 27, configured to determine attribute data of the target element according to the number of times that each element in the target web page appears.
With this apparatus, the first determining unit 21 determines the selected target webpages, and the crawling unit 23 performs data crawling on each target webpage. The obtaining unit 25 then obtains a plurality of label data on each target webpage according to the data crawling result, where the label data include the number of times each element appears in the webpage. Finally, the second determining unit 27 determines the attribute data of the target element according to the number of times each element appears in the target webpage. That is, in this embodiment, data crawling is performed on the selected target webpages, information about the elements on each webpage is obtained, the label data are obtained, and the attribute data of the target element are determined from the label data. In the embodiment of the invention, the attribute data of the target element can be obtained from the label data and the element information corresponding to the elements labeled on the webpage alone, and the expected data can be obtained without business communication, which solves the technical problem in the related art that crawled webpage data deviate substantially because of deviations introduced during communication.
Optionally, the above apparatus further comprises: a labeling unit, configured to inject capture element code into each target webpage after data crawling has been performed on each of the plurality of target webpages to obtain the data crawling result, where the capture element code is used for capturing the labeled target webpages and the elements labeled on each target webpage.
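One possible way to inject capture element code into a crawled page is sketched below in Python. The snippet path `/static/capture-elements.js` and the choice to insert it immediately before the closing `</body>` tag are assumptions for illustration; the embodiment does not specify where or how the code is injected.

```python
import re

# Hypothetical capture snippet; the script path is an assumption.
CAPTURE_SNIPPET = '<script src="/static/capture-elements.js"></script>'


def inject_capture_code(html, snippet=CAPTURE_SNIPPET):
    """Insert `snippet` just before the last closing </body> tag.

    If the page has no </body> tag, the snippet is appended at the end.
    """
    matches = list(re.finditer(r"</body\s*>", html, flags=re.IGNORECASE))
    if not matches:
        return html + snippet
    pos = matches[-1].start()
    return html[:pos] + snippet + html[pos:]


page = "<html><body><h1 class='title'>Red Kettle</h1></BODY></html>"
injected = inject_capture_code(page)
```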
Optionally, the obtaining unit 25 includes: a first acquisition module, configured to acquire the capture element code according to the data crawling result after the plurality of target webpages are labeled; a capturing module, configured to capture, through the capture element code, the elements labeled on each target webpage and their element attribute data to obtain a capture result; and a first determining module, configured to determine a plurality of annotation data by using the capture result.
In addition, the second determination unit 27 includes: a statistical module, configured to count the total number of times each webpage element in the plurality of target webpages appears in the plurality of target webpages to obtain a statistical result; a second determining module, configured to determine, according to the statistical result, the target elements, namely the labeled webpage elements whose number of occurrences is greater than or equal to a preset threshold; and a second acquisition module, configured to acquire, according to the target elements, a plurality of attributes corresponding to each target element so as to determine the attribute data of the target elements.
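The three modules of the second determination unit can be illustrated with a short Python sketch. The per-page annotation layout (a mapping from css selector to a count), the threshold value of 3, and the `element_attributes` lookup table are assumptions for illustration only.

```python
from collections import Counter


def select_target_elements(annotations, threshold):
    """Sum the per-page counts of each element and keep the elements
    whose total is greater than or equal to `threshold`."""
    totals = Counter()
    for page_annotations in annotations:
        totals.update(page_annotations)  # adds the counts of this page
    return sorted(sel for sel, n in totals.items() if n >= threshold)


# Hypothetical annotation data: one mapping per target webpage.
annotations = [
    {"h1.title": 3, "span.price": 1},
    {"h1.title": 2, "a.footer-link": 1},
    {"h1.title": 4, "span.price": 2},
]

# Hypothetical attributes captured for each labeled element.
element_attributes = {
    "h1.title": {"class": "title"},
    "span.price": {"class": "price", "data-currency": "CNY"},
    "a.footer-link": {"class": "footer-link", "href": "/about"},
}

targets = select_target_elements(annotations, threshold=3)
target_attributes = {sel: element_attributes[sel] for sel in targets}
```

Here `a.footer-link` is labeled only once in total and is therefore excluded, while `h1.title` and `span.price` reach the threshold and keep their attributes.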
The above statistical module may include: a first statistical submodule, configured to count the total number of webpage access sessions, where a webpage access session is the session corresponding to each access to a webpage; a first filtering submodule, configured to filter out webpage elements that are repeated within a webpage access session to obtain a first filtering result; and a first determining submodule, configured to determine the total number of occurrences of each webpage element according to the first filtering result so as to obtain the statistical result.
In addition, the statistical module further comprises: the second statistical submodule is used for counting user data corresponding to a user clicking each webpage element locator in the webpage access process; the second filtering submodule is used for filtering data of elements of the repeatedly clicked webpage in the process of clicking the webpage element locator according to the user data to obtain a second filtering result; and the second determining submodule is used for determining the total occurrence frequency of each webpage element according to the second filtering result so as to obtain a statistical result.
Optionally, the apparatus further comprises: the receiving module is used for receiving the service requirement parameters before determining the target webpages; the acquisition module is used for acquiring a plurality of target webpages according to the service demand parameters, wherein mark codes are embedded into each element in each target webpage in the process of acquiring the target webpages, and the mark codes are used for recording data of marking operation of a user on the target webpages.
The device for determining the web page attribute data may further include a processor and a memory, where the first determining unit 21, the crawling unit 23, the obtaining unit 25, the second determining unit 27, and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to implement corresponding functions.
The processor comprises a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided, and by adjusting the kernel parameters, the attribute data of the target element are obtained according to the annotation data and the element information corresponding to the elements labeled on the webpage.
The memory may include forms of computer readable media such as volatile memory, Random Access Memory (RAM), and/or nonvolatile memory such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The embodiment of the invention provides equipment, which comprises a processor, a memory and a program which is stored on the memory and can run on the processor, wherein the processor executes the program and realizes the following steps: determining a plurality of target webpages; performing data crawling on each target webpage in the plurality of target webpages to obtain a data crawling result; acquiring a plurality of label data on each target webpage according to the data crawling result, wherein each label data comprises the occurrence frequency of each element in the target webpage; and determining attribute data of the target elements according to the occurrence times of each element in the target webpage.
Optionally, when the processor executes the program, the capture element code may be further obtained according to a data crawling result, where the capture element code is used to capture the labeled target web page and each element labeled on each target web page; capturing various elements and element attribute data labeled on each target webpage through capturing element codes to obtain a capturing result; using the capture results, a plurality of annotation data is determined.
Optionally, when the processor executes the program, the total number of times that each web page element in the multiple target web pages appears in the multiple target web pages may be counted to obtain a statistical result; according to the statistical result, the target elements are determined, namely the labeled webpage elements whose number of occurrences is greater than or equal to a preset threshold; and a plurality of attributes corresponding to each target element are acquired according to the target elements so as to determine attribute data of the target elements.
Optionally, when the processor executes the program, the total number of times of occurrence of the web page access session may also be counted, where the web page access session is a session corresponding to each time of accessing the web page; filtering out elements of repeated webpages appearing in the webpage access session process to obtain a first filtering result; and determining the total occurrence frequency of each webpage element according to the first filtering result so as to obtain a statistical result.
Optionally, when the processor executes the program, user data corresponding to a user clicking each webpage element locator in the webpage access process may also be counted; according to the user data, filtering data of elements of the repeatedly clicked webpage in the process of clicking the webpage element locator to obtain a second filtering result; and determining the total occurrence frequency of each webpage element according to the second filtering result so as to obtain a statistical result.
Optionally, when executing the program, the processor may further receive a service requirement parameter; and acquiring a plurality of target web pages according to the service demand parameters, wherein mark codes are embedded into each element in each target web page in the process of acquiring the target web pages, and the mark codes are used for recording data of marking operation of a user on the target web pages.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention, and these modifications and refinements should also be regarded as falling within the protection scope of the present invention.

Claims (7)

1. A method for determining attribute data of a web page, comprising:
determining a plurality of target webpages;
performing data crawling on each target webpage in the plurality of target webpages to obtain a data crawling result;
and acquiring a plurality of labeled data on each target webpage according to the data crawling result, wherein each labeled data comprises the occurrence frequency of each webpage element in the target webpage, and the webpage elements at least comprise one of the following elements: a description file in each target webpage and a shopping element in the target webpage;
determining attribute data of the target elements according to the occurrence times of all the web page elements in the target web page;
determining attribute data of the target element according to the occurrence frequency of each webpage element in the target webpage comprises the following steps: counting the total times of the webpage elements in the target webpages appearing in the target webpages to obtain a statistical result; according to the statistical result, determining a target element with the frequency of the labeled webpage element being more than or equal to a preset threshold value; according to the target elements, acquiring a plurality of attributes corresponding to each target element to determine attribute data of the target elements;
after data crawling each target webpage in the plurality of target webpages to obtain a data crawling result, the method comprises the following steps: injecting a capturing element code into each target webpage, wherein the capturing element code is used for capturing the labeled target webpage and each webpage element labeled on each target webpage;
after the target webpages are labeled, acquiring a plurality of labeled data on each target webpage according to the data crawling result comprises the following steps: acquiring the capture element code according to the data crawling result; capturing various webpage elements and element attribute data labeled on each target webpage through the capture element codes to obtain a capture result; determining the plurality of annotation data using the capture results.
2. The method of claim 1, wherein counting a total number of occurrences of each web page element in a plurality of target web pages in the plurality of target web pages comprises:
counting the total times of occurrence of webpage access sessions, wherein the webpage access sessions are corresponding sessions when a webpage is accessed each time;
filtering webpage elements of repeated webpages appearing in the webpage access session process to obtain a first filtering result;
and determining the total occurrence frequency of each webpage element according to the first filtering result so as to obtain the statistical result.
3. The method of claim 1, wherein counting a total number of occurrences of each web page element in a plurality of target web pages in the plurality of target web pages comprises:
counting user data corresponding to a user clicking each webpage element locator in the webpage access process;
according to the user data, filtering data of webpage elements of repeatedly clicked webpages in the process of clicking the webpage element locator to obtain a second filtering result;
and determining the total occurrence frequency of each webpage element according to the second filtering result so as to obtain the statistical result.
4. The method of claim 1, wherein prior to determining the plurality of target web pages, the method further comprises:
receiving a service demand parameter;
and acquiring the target webpages according to the service demand parameters, wherein mark codes are embedded into each webpage element in each target webpage in the process of acquiring the target webpages, and the mark codes are used for recording data of marking operation of a user on the target webpages.
5. An apparatus for determining attribute data of a web page, comprising:
a first determining unit for determining a plurality of target web pages;
the crawling unit is used for crawling data of each target webpage in the target webpages to obtain data crawling results;
an obtaining unit, configured to obtain, according to the data crawling result, multiple pieces of annotation data on each target webpage, where each piece of annotation data includes a number of times that each webpage element in the target webpage appears, and the webpage elements at least include one of the following: a description file in each target webpage and a shopping element in the target webpage;
the second determining unit is used for determining attribute data of the target element according to the occurrence frequency of each webpage element in the target webpage;
the second determination unit includes: the statistical module is used for counting the total times of the webpage elements in the target webpages to obtain statistical results; the second determining module is used for determining the target elements with the times of the labeled webpage elements being more than or equal to a preset threshold value according to the statistical result; a second obtaining module, configured to obtain, according to the target element, multiple attributes corresponding to each target element, so as to determine attribute data of the target element;
the labeling unit is used for performing data crawling on each target webpage in the target webpages to obtain a data crawling result and then injecting capturing element codes into each target webpage, wherein the capturing element codes are used for capturing the labeled target webpages and all webpage elements labeled on each target webpage;
the acquisition unit includes: the first acquisition module is used for acquiring the capture element codes according to the data crawling result after the target webpages are labeled; the capturing module is used for capturing each webpage element and element attribute data labeled on each target webpage through the capturing element code to obtain a capturing result; a first determining module for determining the plurality of annotation data using the captured result.
6. A storage medium for storing a program, wherein the program is executed to control a device on which the storage medium is located to execute the method for determining web page attribute data according to any one of claims 1 to 4.
7. A processor, characterized in that the processor is configured to run a program, wherein the program is configured to execute the method for determining web page property data according to any one of claims 1 to 4 when running.
CN201810219804.9A 2018-03-16 2018-03-16 Method and device for determining webpage attribute data Active CN110275998B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810219804.9A CN110275998B (en) 2018-03-16 2018-03-16 Method and device for determining webpage attribute data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810219804.9A CN110275998B (en) 2018-03-16 2018-03-16 Method and device for determining webpage attribute data

Publications (2)

Publication Number Publication Date
CN110275998A CN110275998A (en) 2019-09-24
CN110275998B true CN110275998B (en) 2021-07-30

Family

ID=67957841

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810219804.9A Active CN110275998B (en) 2018-03-16 2018-03-16 Method and device for determining webpage attribute data

Country Status (1)

Country Link
CN (1) CN110275998B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113836316B (en) * 2021-09-23 2023-01-03 北京百度网讯科技有限公司 Processing method, training method, device, equipment and medium for ternary group data

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007041800A1 (en) * 2005-10-14 2007-04-19 Panscient Inc Information extraction system
CN103294711A (en) * 2012-02-28 2013-09-11 阿里巴巴集团控股有限公司 Method and device for determining page elements in web page
CN103916293A (en) * 2014-04-15 2014-07-09 浪潮软件股份有限公司 Method for monitoring and analyzing website user behaviors
CN104021185A (en) * 2014-06-11 2014-09-03 北京奇虎科技有限公司 Method and device for identifying information attributes of data in web pages
CN105447139A (en) * 2015-11-20 2016-03-30 广州华多网络科技有限公司 Data acquisition statistical method, and system, terminal and service equipment thereof
CN107562620A (en) * 2017-08-24 2018-01-09 阿里巴巴集团控股有限公司 One kind buries an automatic setting method and device

Also Published As

Publication number Publication date
CN110275998A (en) 2019-09-24

Similar Documents

Publication Publication Date Title
CN107562620B (en) Automatic buried point setting method and device
CN108664375B (en) Method for detecting abnormal behavior of computer network system user
CN106294648B (en) Processing method and device for page access path
CN108304410B (en) Method and device for detecting abnormal access page and data analysis method
CN107800591B (en) Unified log data analysis method
CN107797894B (en) APP user behavior analysis method and device
CN106295382B (en) A kind of Information Risk preventing control method and device
CN107797908A (en) A kind of behavioral data acquisition method of website user
CN102752288A (en) Method and device for identifying network access action
CN106570013B (en) Method and device for processing page access data
CN110020339B (en) Webpage data acquisition method and device based on non-buried point
CN106202101B (en) Advertisement identification method and device
CN105721578B (en) A kind of user behavior data acquisition method and system
CN106033579A (en) Data processing method and apparatus thereof
US20190266206A1 (en) Data processing method, server, and computer storage medium
CN110263070B (en) Event reporting method and device
CN110881131B (en) Classification method of live review videos and related device thereof
CN106933916B (en) JSON character string processing method and device
CN107832333A (en) Method and system based on distributed treatment and DPI data structure user network data fingerprint
CN110275998B (en) Method and device for determining webpage attribute data
CN104281629A (en) Method and device for extracting picture from webpage and client equipment
Chitraa et al. An efficient path completion technique for web log mining
CN109558305B (en) Log data sorting method and device
CN103605742A (en) Method and device for recognizing network resource entity content page
CN106815248A (en) Web analytics method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant