CN110874434A - Webpage data acquisition method and device, storage medium and electronic equipment - Google Patents

Publication number
CN110874434A
Authority
CN
China
Prior art keywords
website
websites
title
empty
acquiring
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811010439.7A
Other languages
Chinese (zh)
Inventor
张诗茹
李春光
谭泽汉
孙秀丹
仲丽君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Gree Electric Appliances Inc of Zhuhai
Original Assignee
Gree Electric Appliances Inc of Zhuhai
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Gree Electric Appliances Inc of Zhuhai filed Critical Gree Electric Appliances Inc of Zhuhai
Priority to CN201811010439.7A priority Critical patent/CN110874434A/en
Publication of CN110874434A publication Critical patent/CN110874434A/en

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The application relates to a webpage data acquisition method, a webpage data acquisition device, a storage medium and electronic equipment, and belongs to the technical field of webpage data acquisition. The method comprises the steps of obtaining a first website set; cleaning the websites in the first website set to obtain a second website set; and acquiring the webpage data according to the second website set. Redundant data in the collected data are effectively reduced, and further the cleaning pressure of the collected data is favorably reduced.

Description

Webpage data acquisition method and device, storage medium and electronic equipment
Technical Field
The application belongs to the technical field of webpage data acquisition, and particularly relates to a webpage data acquisition method, a webpage data acquisition device, a storage medium and electronic equipment.
Background
With the rapid development of the internet and the advent of the big data era, more and more data are generated on the network, and the technology of collecting data from mass data becomes more and more important.
In the related art of data collection, a data collection tool starts, for example, from a specified website and continuously traverses, collects and stores page data by a horizontal or vertical method. However, the collected data contain a large amount of redundant data, such as irrelevant data and duplicate data, so the user must spend considerable effort and time cleaning the collected data.
Disclosure of Invention
In order to overcome the problems in the related art at least to a certain extent, the application provides a method, a device, a storage medium and electronic equipment for acquiring webpage data, which are beneficial to solving the problem that a large amount of redundant data exists in the acquired data.
In order to achieve the purpose, the following technical scheme is adopted in the application:
in a first aspect,
the application provides a webpage data acquisition method, which comprises the following steps:
acquiring a first website set;
cleaning the websites in the first website set to obtain a second website set;
and acquiring webpage data according to the second website set.
Further,
the acquiring of the first website set includes:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
Further,
the cleaning of the websites in the first website set includes:
processing the websites in the first website set by one or more of the following items:
screening, deduplication and classification.
Further, the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
Further,
the performing domain name matching on the websites in the first website set according to the preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
Further,
the de-duplication comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
Further,
the deduplication of the website based on the title that is not empty includes:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
Further,
the classification includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
Further,
the classifying the website based on the title which is not empty comprises the following steps:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
Further, the acquiring the web page data according to the second website set further includes:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
Further,
the acquiring of the webpage data according to the second website set comprises:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
Further, the method further comprises:
and storing the second website set in a pre-constructed website pool.
In a second aspect of the present invention,
the application provides a webpage data acquisition device, includes:
the obtaining module is used for acquiring a first website set;
the cleaning module is used for cleaning the websites in the first website set to obtain a second website set;
and the acquisition module is used for acquiring the webpage data according to the second website set.
Further, the obtaining module is specifically configured to:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
Further, the cleaning module is specifically configured to:
processing the websites in the first website set by one or more of the following items:
screening, deduplication and classification.
Further, the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
Further, the performing domain name matching on the websites in the first website set according to a preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
Further, the de-duplicating includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
Further, the deduplication of the website based on the title that is not empty includes:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
Further, the classifying includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
Further, the classifying the website based on the title that is not empty includes:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
Further, the acquisition module is specifically configured to:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
Further, if the cleaning of the websites in the first website set includes classifying the websites in the first website set, the acquisition module is further specifically configured to:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
Further, the apparatus further comprises:
and the website pool module is used for storing the second website set in a website pool which is constructed in advance.
In a third aspect,
a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the method of any of the above.
In a fourth aspect of the present invention,
the application provides an electronic device, including:
a computer readable storage medium as described above; and
one or more processors to execute the program in the computer-readable storage medium.
By adopting the above technical scheme, the application has at least the following beneficial effects:
according to the method and the device, the websites in the first website set are cleaned firstly, and then the webpage data are collected according to the second website set, so that redundant data in the collected data are effectively reduced, and further the cleaning pressure of the collected data is reduced.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a schematic flowchart of a web page data acquisition method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart illustrating a process of cleaning websites in the first website set according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a web page data acquisition device according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a web page data acquisition device according to another embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail below. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of a web page data acquisition method according to an embodiment of the present application, and as shown in fig. 1, the web page data acquisition method includes the following steps:
Step S101, acquiring a first website set.
In one embodiment, the obtaining the first set of web addresses includes:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
In practical application, the first website set can be obtained through a search engine. For example, if the search information is "Gree", a search engine returns a massive number of webpages containing "Gree", and the websites corresponding to these webpages form the first website set. These webpages include a large number of redundant webpages, such as webpages irrelevant to the content to be acquired and webpages with repeated content.
Step S102, cleaning the websites in the first website set to obtain a second website set.
In this step, cleaning the websites in the first website set eliminates the websites that the user does not need, so that the collected data come closer to what the user actually needs.
In one embodiment, the cleaning the websites in the first set of websites includes:
processing the websites in the first website set by one or more of the following items:
screening, deduplication and classification.
The screening, de-duplication, and classification will be further described below.
In one embodiment, the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
In a specific application, a user may set one or more preset domain names as needed. Domain name matching against the preset domain names selects, from the first website set, the websites related to those domain names; the matched websites are the ones from which the user needs to acquire content. Domain name matching therefore determines the range of websites to collect from, and excluding the unrelated websites also excludes unrelated data.
In one embodiment, the performing domain name matching on the websites in the first website set according to a preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
A regular expression is a pattern for operating on character strings: a pattern string is formed from predefined special characters, and other character strings are filtered and screened against it. In the scheme of this embodiment, a regular expression is set according to the preset domain name and used to perform domain name matching on the websites in the first website set; the matched websites are the websites from which the user needs to acquire content.
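The screening step can be sketched as follows. This is only an illustration under stated assumptions: the application does not give a concrete regular expression, so the pattern here (a preset domain or any of its subdomains) and the names `build_domain_pattern`, `screen_urls` and the `example.com` addresses are hypothetical.

```python
import re
from urllib.parse import urlparse

def build_domain_pattern(preset_domains):
    """Compile a regex matching a preset domain or any of its subdomains.

    Assumption: "related to the preset domain name" is read here as
    "the host is the preset domain or one of its subdomains".
    """
    alternatives = "|".join(re.escape(d.lower()) for d in preset_domains)
    return re.compile(r"^(?:[a-z0-9-]+\.)*(?:" + alternatives + r")$")

def screen_urls(urls, preset_domains):
    """Keep only the URLs whose host matches one of the preset domains."""
    if not preset_domains:
        return []
    pattern = build_domain_pattern(preset_domains)
    kept = []
    for url in urls:
        host = (urlparse(url).hostname or "").lower()
        if pattern.match(host):
            kept.append(url)
    return kept

# Hypothetical first website set.
first_set = [
    "https://www.example.com/news/1",
    "https://shop.example.com/item/2",
    "https://example.org/page",
    "https://notexample.com/page",
]
screened = screen_urls(first_set, ["example.com"])
```

Matching on the parsed host rather than on the whole URL string avoids accepting an address merely because the preset domain appears somewhere in its path or query.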
In one embodiment, the deduplication comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
It can be understood that the web address is deduplicated based on the title that is not empty, and redundant web pages can be eliminated, so that the collection of redundant web page contents is avoided.
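Acquiring the title and judging whether it is empty might look like the sketch below. It assumes the page's HTML has already been downloaded; the application does not say how the title is extracted, so the use of Python's standard `html.parser` and the names `TitleParser` and `extract_title` are illustrative choices.

```python
from html.parser import HTMLParser

class TitleParser(HTMLParser):
    """Collect the text of the first <title> element in a document."""

    def __init__(self):
        super().__init__()
        self._in_title = False
        self._done = False
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "title" and not self._done:
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title" and self._in_title:
            self._in_title = False
            self._done = True  # ignore any later <title> elements

    def handle_data(self, data):
        if self._in_title:
            self.parts.append(data)

def extract_title(html):
    """Return the stripped page title, or an empty string if there is none."""
    parser = TitleParser()
    parser.feed(html)
    return "".join(parser.parts).strip()

with_title = extract_title("<html><head><title> Gree news </title></head></html>")
without_title = extract_title("<html><head></head><body>text</body></html>")
is_empty = not without_title
```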
In one embodiment, the deduplication based on titles that are not empty includes:
and only carrying out deduplication on the website based on the title which is not empty.
It will be appreciated that, when deduplication is based only on the non-empty titles, websites whose webpages have the same title are treated as duplicates and only one of them is kept.
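A minimal sketch of title-only deduplication follows. Two points are assumptions, not statements from the application: websites with an empty title are dropped rather than passed through, and the first website seen for each title is the one kept. The function name `dedupe_by_title` is hypothetical.

```python
def dedupe_by_title(pages):
    """Deduplicate (url, title) pairs, keeping the first URL per title."""
    seen_titles = set()
    kept = []
    for url, title in pages:
        normalized = (title or "").strip()
        if not normalized:
            continue  # assumption: empty-title websites are not kept
        if normalized in seen_titles:
            continue  # same title as an earlier page: treat as duplicate
        seen_titles.add(normalized)
        kept.append(url)
    return kept

# Hypothetical input: titles already fetched for each website.
pages = [
    ("https://example.com/a", "Gree releases annual report"),
    ("https://example.com/b", "Gree releases annual report"),
    ("https://example.com/c", ""),
    ("https://example.com/d", "Industry outlook"),
]
unique_urls = dedupe_by_title(pages)
```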
In another embodiment, the deduplication based on titles that are not empty includes:
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
It can be understood that the title of the web page is obtained through the website, and meanwhile the publishing time and the author information of the web page are also obtained, so that the uniqueness of the web page can be effectively determined.
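The stricter variant can be sketched by using the combination of title, release time and author as the deduplication key. The `PageMeta` record, its field names and the date format are hypothetical; the application only states which three pieces of information are used.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PageMeta:
    url: str
    title: str
    published: str  # release time, format assumed for illustration
    author: str

def dedupe_by_metadata(pages):
    """Keep one URL per (title, release time, author) combination."""
    seen = set()
    kept = []
    for page in pages:
        title = page.title.strip()
        if not title:
            continue  # assumption: empty-title websites are not kept
        key = (title, page.published, page.author)
        if key not in seen:
            seen.add(key)
            kept.append(page.url)
    return kept

pages = [
    PageMeta("https://example.com/a", "Weekly review", "2018-08-01", "Zhang"),
    PageMeta("https://example.com/b", "Weekly review", "2018-08-01", "Zhang"),
    PageMeta("https://example.com/c", "Weekly review", "2018-08-08", "Zhang"),
    PageMeta("https://example.com/d", "Weekly review", "2018-08-01", "Li"),
]
unique_urls = dedupe_by_metadata(pages)
```

Compared with title-only deduplication, this keeps pages that share a title but differ in release time or author, such as the recurring "Weekly review" entries above.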
In one embodiment, the classifying includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
Through the scheme of the embodiment, the websites are classified according to the non-empty titles, a plurality of classified website sets can be obtained, such as science and technology website sets, financial website sets and the like, data collection is performed according to the classified website sets, the classified data can be directly output, convenience is provided for processing the collected data, and a large amount of time can be saved when further classification is performed on the basis of the classified data.
In one embodiment, the classifying the web addresses based on the titles that are not empty includes:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
In practical application, the semantic similarity between the titles can be calculated by using a semantic similarity algorithm, and the websites are classified, so that the titles of the websites in each category of website collection have certain semantic similarity.
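The application does not name a semantic similarity algorithm. The sketch below substitutes a deliberately simple stand-in, Jaccard overlap of title words with greedy grouping, purely to show the shape of the step; a real system would likely use a proper semantic model. The names `jaccard`, `classify_by_title` and the threshold value are assumptions.

```python
def jaccard(a, b):
    """Word-overlap similarity in [0, 1]; a crude stand-in for semantic similarity."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def classify_by_title(pages, threshold=0.3):
    """Greedily group (url, title) pairs whose titles are similar enough.

    Each group is represented by the title of its first member.
    """
    groups = []  # list of (representative_title, [urls])
    for url, title in pages:
        if not title.strip():
            continue  # only non-empty titles take part in classification
        for representative, urls in groups:
            if jaccard(representative, title) >= threshold:
                urls.append(url)
                break
        else:
            groups.append((title, [url]))
    return [urls for _, urls in groups]

pages = [
    ("u1", "Gree air conditioner review"),
    ("u2", "Gree air conditioner price"),
    ("u3", "stock market news today"),
    ("u4", ""),
]
classes = classify_by_title(pages)
```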
As can be seen from the above respective descriptions of the further embodiments of screening, deduplication and classification, each embodiment of screening, deduplication and classification can clean the websites in the first website set to a certain extent.
Fig. 2 is a schematic flow chart of cleaning the websites in the first website set according to an embodiment of the present application, and as shown in fig. 2, the cleaning the websites in the first website set includes the following steps:
step S201, performing domain name matching on the websites in the first website set according to a preset domain name, matching out the websites related to the preset domain name, and executing step S202 on the matched websites;
step S202, acquiring a title corresponding to the website, judging whether the acquired title is empty, removing the duplicate of the website based on the title which is not empty, and executing step S203 on the duplicate-removed website;
and step S203, classifying the websites according to the titles of the websites.
With regard to the above-described embodiment solutions,
in step S201, the performing domain name matching on the websites in the first website set according to the preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
In step S202, the removing duplicate addresses based on the non-empty titles includes:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
In step S203, the classifying the websites according to the titles of the websites includes:
and calculating semantic similarity between the titles, and classifying the websites according to the semantic similarity.
The embodiment scheme for cleaning the websites in the first website set by sequentially screening, removing the duplicate and classifying is provided. In the three steps of step S201, step S202 and step S203, the previous step is taken as a premise of the next step, the execution of the next step is triggered by the completion of the previous step, and the cleaning effect on the first website set is improved by the mutual engagement and cooperation of the three steps, which is helpful for realizing that the webpage content corresponding to each website in the second website set obtained after cleaning has a collection value.
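The chaining of steps S201 to S203, where each step consumes the output of the previous one, can be sketched as below. The three stage functions are trivial stand-ins chosen for brevity (exact host comparison, exact title comparison, grouping by the first title word) and are not the application's methods; all names and addresses are hypothetical.

```python
from urllib.parse import urlparse

def clean_websites(pages, screen, dedupe, classify):
    """Run the three cleaning steps in order."""
    screened = screen(pages)    # S201: domain name matching
    unique = dedupe(screened)   # S202: deduplication on non-empty titles
    return classify(unique)     # S203: classification by title

def screen(pages):
    # Stand-in: keep pages hosted on one preset domain.
    return [p for p in pages if urlparse(p["url"]).hostname == "example.com"]

def dedupe(pages):
    # Stand-in: drop empty titles, keep the first page per title.
    seen, kept = set(), []
    for p in pages:
        title = p["title"].strip()
        if title and title not in seen:
            seen.add(title)
            kept.append(p)
    return kept

def classify(pages):
    # Stand-in: group by the first word of the title.
    groups = {}
    for p in pages:
        key = p["title"].split()[0].lower()
        groups.setdefault(key, []).append(p["url"])
    return groups

pages = [
    {"url": "https://example.com/a", "title": "Gree launches new model"},
    {"url": "https://example.com/b", "title": "Gree launches new model"},
    {"url": "https://example.com/c", "title": "Market report"},
    {"url": "https://other.test/d", "title": "Gree launches new model"},
    {"url": "https://example.com/e", "title": ""},
]
second_set = clean_websites(pages, screen, dedupe, classify)
```

Running the cheap domain filter first means titles only have to be compared for websites that survived screening.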
In a specific application, a website pool may be constructed in advance, and the second website set obtained by cleaning through one or more of screening, deduplication and classification may be stored in the website pool.
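The application does not describe the structure of the website pool. One possible in-memory form, given here only as an assumption, is a first-in-first-out queue that also remembers which websites it has already accepted; the class name `UrlPool` and its methods are hypothetical.

```python
from collections import deque

class UrlPool:
    """In-memory pool of cleaned websites awaiting collection."""

    def __init__(self):
        self._queue = deque()
        self._known = set()

    def add_all(self, urls):
        """Store websites, silently skipping any already accepted."""
        for url in urls:
            if url not in self._known:
                self._known.add(url)
                self._queue.append(url)

    def take(self):
        """Return the next website to collect, or None when the pool is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)

pool = UrlPool()
pool.add_all(["https://example.com/a", "https://example.com/b"])
pool.add_all(["https://example.com/a"])  # already known, ignored
size_before = len(pool)
first = pool.take()
```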
Step S103, acquiring webpage data according to the second website set.
In practical application, the web crawler may be used to collect the content of the web page corresponding to the web address in the second web address set.
It can be understood that each website in the second website set has been cleaned and is therefore more closely targeted to the user's needs. Collecting the webpage content corresponding to each website in the second website set effectively reduces redundant data in the collected data, which helps reduce the cleaning pressure on the collected data.
In one embodiment, the collecting the web page data according to the second website set includes:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
Through the scheme of this embodiment, webpage data can be output directly and stored by class. Compared with the related art, in which data are collected first and classified afterwards, collecting data according to the website classification classifies the data as they are collected. Moving classification forward to the data collection stage relieves the pressure of further classifying the data later and reduces time and labor costs.
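Classified collection and storage can be sketched as follows. The storage layout (a dictionary keyed by class name) and the injected `fetch` callable are assumptions made so the example runs without network access; the application does not specify either.

```python
def collect_by_category(classified_urls, fetch):
    """Collect page data per class so that results are stored already classified.

    classified_urls maps a class name to the websites in that class;
    fetch is any callable that returns the data for one website.
    """
    store = {}
    for category, urls in classified_urls.items():
        store[category] = [fetch(url) for url in urls]
    return store

def fake_fetch(url):
    # Stand-in for a real crawler request.
    return "content of " + url

# Hypothetical classified website sets.
classified = {
    "technology": ["https://example.com/t1", "https://example.com/t2"],
    "finance": ["https://example.com/f1"],
}
stored = collect_by_category(classified, fake_fetch)
```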
In one embodiment, the collecting the web page data according to the second website set includes:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
In the above embodiment, distributed acquisition can simulate a large number of ordinary users visiting a website normally, which helps prevent the collector from being blocked by the website's blocking program and so avoids wasting data acquisition time.
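The application gives no architecture for distributed acquisition. As a single-machine stand-in only, the sketch below spreads the websites over a pool of worker threads; a genuinely distributed deployment would use several machines or network exits, which this example does not attempt. The function name `collect_concurrently` and the injected `fetch` callable are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor

def collect_concurrently(urls, fetch, max_workers=4):
    """Fetch every website using a pool of worker threads.

    Executor.map yields results in input order, so each result can be
    paired with the website it came from.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = list(pool.map(fetch, urls))
    return dict(zip(urls, results))

def fake_fetch(url):
    # Stand-in for a real network request.
    return len(url)

second_set = ["https://example.com/a", "https://example.com/bb"]
collected = collect_concurrently(second_set, fake_fetch)
```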
Fig. 3 is a schematic structural diagram of a web page data acquisition apparatus according to an embodiment of the present application, and as shown in fig. 3, the web page data acquisition apparatus 3 includes:
an obtaining module 31, configured to obtain a first website set;
a cleaning module 32, configured to clean websites in the first website set to obtain a second website set;
and the acquisition module 33 is configured to acquire the web page data according to the second website set.
Further, the obtaining module 31 is specifically configured to:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
Further, the cleaning module 32 is specifically configured to:
processing the websites in the first website set by one or more of the following items:
screening, deduplication and classification.
Further, the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
Further, the performing domain name matching on the websites in the first website set according to a preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
Further, the de-duplicating includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
Further, the deduplication of the website based on the title that is not empty includes:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
Further, the classifying includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
Further, the classifying the website based on the title that is not empty includes:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
Further, the acquisition module 33 is specifically configured to:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
Further, the acquisition module 33 is specifically configured to:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
Fig. 4 is a schematic structural diagram of a web page data acquisition apparatus according to another embodiment of the present application, and as shown in fig. 4, the web page data acquisition apparatus 3 further includes:
and a website pool module 34, configured to store the second website set in a website pool that is constructed in advance.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
In an exemplary embodiment, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the above.
With regard to the computer-readable storage medium in the above-described embodiments, the specific manner in which the stored computer program performs the operations has been described in detail in relation to the embodiments of the method, and will not be described in detail herein.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 5, the electronic device 5 includes:
the computer-readable storage medium 51 as described above; and
one or more processors 52 for executing the programs in the computer-readable storage medium 51.
With regard to the electronic device in the above embodiments, the specific manner in which the processor thereof executes the program in the computer-readable storage medium has been described in detail in the embodiments related to the method, and will not be elaborated here.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims (26)

1. A webpage data acquisition method is characterized by comprising the following steps:
acquiring a first website set;
cleaning the websites in the first website set to obtain a second website set;
and acquiring webpage data according to the second website set.
2. The method of claim 1,
the acquiring of the first website set includes:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
3. The method of claim 1,
the cleaning of the websites in the first website set includes:
processing the websites in the first website set by one or more of the following items:
screening, de-duplicating and classifying.
4. The method of claim 3, wherein the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
5. The method according to claim 4, wherein the performing domain name matching on the websites in the first website set according to a preset domain name comprises:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
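By way of non-limiting illustration, the screening of claims 4 and 5 may be sketched in Python as follows. The function names `build_domain_pattern` and `screen_by_domain` are hypothetical and do not appear in the application, and the choice to accept subdomains of the preset domain is an assumption of this sketch rather than a requirement of the claims.

```python
import re
from urllib.parse import urlsplit


def build_domain_pattern(preset_domain: str) -> re.Pattern:
    """Compile a regular expression for the preset domain and its subdomains."""
    # re.escape keeps the dots in the domain literal; (?:.+\.)? permits
    # an optional subdomain prefix (an assumption of this sketch).
    return re.compile(r"^(?:.+\.)?" + re.escape(preset_domain) + r"$", re.IGNORECASE)


def screen_by_domain(urls, preset_domain):
    """Return the URLs whose host name matches the preset domain."""
    pattern = build_domain_pattern(preset_domain)
    kept = []
    for url in urls:
        # hostname is None for strings that carry no network location.
        host = urlsplit(url).hostname or ""
        if pattern.match(host):
            kept.append(url)
    return kept
```

Matching is performed on the parsed host name rather than on the whole string, so that a preset domain appearing only in a path or query string does not cause a website to be retained.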
6. The method of claim 3, wherein the de-duplicating comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
7. The method of claim 6, wherein the de-duplicating the web address based on the title that is not empty comprises:
de-duplicating the websites based only on the titles that are not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
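As a minimal sketch of the two alternatives of claim 7, the following Python function de-duplicates either on the title alone or on the combination of title, release time and author. The dictionary keys `url`, `title`, `published` and `author` and the function name `deduplicate` are hypothetical, and discarding websites whose title is empty is an assumption of this sketch.

```python
def deduplicate(pages, use_time_and_author=False):
    """Keep the first website for each distinct key.

    Each page is a dict with the hypothetical keys "url", "title",
    "published" and "author".
    """
    seen = set()
    kept_urls = []
    for page in pages:
        title = (page.get("title") or "").strip()
        if not title:
            # Assumption: a website whose title is empty is discarded.
            continue
        if use_time_and_author:
            key = (title, page.get("published"), page.get("author"))
        else:
            key = title
        if key not in seen:
            seen.add(key)
            kept_urls.append(page["url"])
    return kept_urls
```

The second alternative is the stricter one: two webpages that share a title but differ in release time or author are both retained.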
8. The method of claim 3, wherein the classifying comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
9. The method of claim 8, wherein the classifying of the websites based on the titles that are not empty comprises:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
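The grouping step of claim 9 may be illustrated, without limitation, by the Python sketch below. The application does not specify how semantic similarity is computed; the sketch substitutes a simple lexical measure (Jaccard overlap of word sets) as a stand-in, and the names `token_similarity` and `classify_by_title` as well as the threshold of 0.5 are assumptions.

```python
def token_similarity(title_a, title_b):
    """Jaccard overlap of lower-cased word sets.

    This is a lexical stand-in, not a true semantic similarity measure.
    """
    a, b = set(title_a.lower().split()), set(title_b.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def classify_by_title(titled_urls, threshold=0.5):
    """Greedily group (url, title) pairs by similarity to each group's first title."""
    groups = []  # list of (representative_title, [urls])
    for url, title in titled_urls:
        for representative, members in groups:
            if token_similarity(representative, title) >= threshold:
                members.append(url)
                break
        else:
            # No existing group is similar enough, so start a new class.
            groups.append((title, [url]))
    return [members for _, members in groups]
```

In practice the similarity function could be replaced by any measure that returns a score between 0 and 1 without changing the grouping loop.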
10. The method of claim 8,
the acquiring of the webpage data according to the second website set comprises:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
11. The method according to any one of claims 1-10, wherein collecting web page data according to the second set of web addresses comprises:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
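The collection step of claim 11 may be approximated, purely for illustration, by the Python sketch below. A thread pool on a single machine is used here as a stand-in for a genuinely distributed set of collectors, which the application does not describe in detail; the function name `collect_pages` and the caller-supplied `fetch` function are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_pages(urls, fetch, max_workers=4):
    """Fetch every URL concurrently and return a {url: data} mapping.

    `fetch` is a caller-supplied function that takes a URL and returns
    page data. A URL whose fetch raises an exception is mapped to None.
    """
    def safe_fetch(url):
        try:
            return fetch(url)
        except Exception:
            return None

    urls = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in the order of the input URLs.
        results = list(pool.map(safe_fetch, urls))
    return dict(zip(urls, results))
```

Passing the fetch function as a parameter keeps the sketch independent of any particular HTTP library and allows one failed website to be skipped without interrupting the others.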
12. The method according to any one of claims 1-9, further comprising:
and storing the second website set in a pre-constructed website pool.
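One possible form of the website pool of claim 12 is sketched below in Python. The application does not define the structure of the pool; the class name `UrlPool`, its first-in-first-out ordering and its refusal to store a website twice are all assumptions of this sketch.

```python
from collections import deque


class UrlPool:
    """First-in-first-out pool that stores each website at most once."""

    def __init__(self):
        self._queue = deque()
        self._known = set()

    def add_all(self, urls):
        """Add the URLs not seen before and return how many were added."""
        added = 0
        for url in urls:
            if url not in self._known:
                self._known.add(url)
                self._queue.append(url)
                added += 1
        return added

    def take(self):
        """Remove and return the oldest pending URL, or None if the pool is empty."""
        return self._queue.popleft() if self._queue else None

    def __len__(self):
        return len(self._queue)
```

A website that has already been taken from the pool remains known to it, so adding the same website again has no effect.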
13. A web page data acquisition device, comprising:
the acquisition module is used for acquiring a first website set;
the cleaning module is used for cleaning the websites in the first website set to obtain a second website set;
and the acquisition module is used for acquiring the webpage data according to the second website set.
14. The apparatus of claim 13,
the acquisition module is specifically configured to:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
15. The apparatus of claim 13,
the cleaning module is specifically configured to:
processing the websites in the first website set by one or more of the following items:
screening, de-duplicating and classifying.
16. The apparatus of claim 15, wherein the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
17. The apparatus of claim 16, wherein the performing domain name matching on the websites in the first website set according to a preset domain name comprises:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
18. The apparatus of claim 15, wherein the de-duplication comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
19. The apparatus of claim 18, wherein the de-duplicating the web address based on the title that is not empty comprises:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
20. The apparatus of claim 15, wherein the classifying comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
21. The apparatus of claim 20, wherein the classifying the web addresses based on the non-empty titles comprises:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
22. The apparatus of claim 20,
the acquisition module is specifically configured to:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
23. The apparatus of any one of claims 13-22,
the acquisition module is specifically configured to:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
24. The apparatus of any one of claims 13-22, further comprising:
and the website pool module is used for storing the second website set in a website pool which is constructed in advance.
25. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1-12.
26. An electronic device, comprising:
the computer-readable storage medium of claim 25; and
one or more processors to execute the program in the computer-readable storage medium.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811010439.7A CN110874434A (en) 2018-08-31 2018-08-31 Webpage data acquisition method and device, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN110874434A true CN110874434A (en) 2020-03-10

Family

ID=69715769

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811010439.7A Pending CN110874434A (en) 2018-08-31 2018-08-31 Webpage data acquisition method and device, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110874434A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104038363A (en) * 2013-10-24 2014-09-10 南京汇吉递特网络科技有限公司 Method for acquiring and counting CCDN provider information
CN104951512A (en) * 2015-05-27 2015-09-30 中国科学院信息工程研究所 Public sentiment data collection method and system based on Internet
CN107045507A (en) * 2016-02-05 2017-08-15 北京国双科技有限公司 Web page crawl method and device
CN107704515A (en) * 2017-09-01 2018-02-16 安徽简道科技有限公司 Data grab method based on internet data grasping system
CN108121743A (en) * 2016-11-30 2018-06-05 中移(苏州)软件技术有限公司 A kind of generation of generic web pages masterplate and application method, system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Hao Hui (郝慧): "A cross-database retrieval de-duplication algorithm based on sci-tech novelty search" (一种基于科技查新的跨库检索去重算法), 《现代图书情报技术》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117972179A (en) * 2024-01-05 2024-05-03 深圳中泓在线股份有限公司 Directional data acquisition normalization method, system and storage medium
CN117972179B (en) * 2024-01-05 2024-10-22 深圳中泓在线股份有限公司 Directional data acquisition normalization method, system and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20200310)