CN110874434A

CN110874434A - Webpage data acquisition method and device, storage medium and electronic equipment

Info

Publication number: CN110874434A
Application number: CN201811010439.7A
Authority: CN
Inventors: 张诗茹; 李春光; 谭泽汉; 孙秀丹; 仲丽君
Original assignee: Gree Electric Appliances Inc of Zhuhai
Current assignee: Gree Electric Appliances Inc of Zhuhai
Priority date: 2018-08-31
Filing date: 2018-08-31
Publication date: 2020-03-10

Abstract

The application relates to a webpage data acquisition method, a webpage data acquisition device, a storage medium and electronic equipment, and belongs to the technical field of webpage data acquisition. The method comprises the steps of obtaining a first website set; cleaning the websites in the first website set to obtain a second website set; and acquiring the webpage data according to the second website set. Redundant data in the collected data is effectively reduced, which in turn helps reduce the burden of cleaning the collected data.

Description

Webpage data acquisition method and device, storage medium and electronic equipment

Technical Field

The application belongs to the technical field of webpage data acquisition, and particularly relates to a webpage data acquisition method, a webpage data acquisition device, a storage medium and electronic equipment.

Background

With the rapid development of the internet and the advent of the big data era, more and more data are generated on the network, and the technology of collecting data from mass data becomes more and more important.

In the related art of data collection, a data collection tool typically starts from a specified website and continuously traverses, collects and stores page data in a horizontal or vertical manner. However, the data acquired in this way contains a large amount of redundant data, such as irrelevant data and duplicate data, so the user must spend a great deal of effort and time cleaning the acquired data.

Disclosure of Invention

In order to overcome the problems in the related art at least to a certain extent, the application provides a method, a device, a storage medium and electronic equipment for acquiring webpage data, which are beneficial to solving the problem that a large amount of redundant data exists in the acquired data.

In order to achieve the purpose, the following technical scheme is adopted in the application:

in a first aspect,

the application provides a webpage data acquisition method, which comprises the following steps:

acquiring a first website set;

cleaning the websites in the first website set to obtain a second website set;

and acquiring webpage data according to the second website set.

Further,

the acquiring of the first website set includes:

receiving search information input by a user;

and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.

Further,

the cleaning of the websites in the first website set includes:

processing the websites in the first website set by one or more of the following items:

screening, deduplication and classification.

Further, the screening comprises:

and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.

Further,

the performing domain name matching on the websites in the first website set according to the preset domain name includes:

setting a regular expression according to the preset domain name;

and performing domain name matching on the websites in the first website set through the regular expression.

Further,

the deduplication comprises:

acquiring the title of the webpage corresponding to each website in the first website set;

and judging whether the acquired title is empty, and deduplicating the websites based on the titles that are not empty.

Further,

the deduplication of the websites based on the titles that are not empty includes:

deduplicating the websites based only on the titles that are not empty; or,

acquiring the release time and the author information of the webpage corresponding to each website whose title is not empty, and deduplicating the websites according to the acquired title, release time and author information.

Further,

the classification includes:

and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.

Further,

the classifying the website based on the title which is not empty comprises the following steps:

and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.

Further, the acquiring the web page data according to the second website set further includes:

and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.

Further,

the acquiring of the webpage data according to the second website set comprises:

and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.

Further, the method further comprises:

and storing the second website set in a pre-constructed website pool.

In a second aspect of the present invention,

the application provides a webpage data acquisition device, comprising:

an obtaining module, used for obtaining a first website set;

a cleaning module, used for cleaning the websites in the first website set to obtain a second website set;

and an acquisition module, used for acquiring the webpage data according to the second website set.

Further, the obtaining module is specifically configured to:

receiving search information input by a user;

Further, the cleaning module is specifically configured to:

screening, deduplication and classification.

Further, the screening comprises:

Further, the performing domain name matching on the websites in the first website set according to a preset domain name includes:

setting a regular expression according to the preset domain name;

Further, the de-duplicating includes:

Further, the deduplication of the website based on the title that is not empty includes:

deduplication of a web site based only on the title that is not empty; or,

Further, the classifying includes:

Further, the classifying the website based on the title that is not empty includes:

Further, the acquisition module is specifically configured to:

and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.

Further, if the cleaning of the websites in the first website set includes classifying the websites in the first website set, the acquisition module is further specifically configured to:

Further, the apparatus further comprises:

and the website pool module is used for storing the second website set in a website pool which is constructed in advance.

In a third aspect,

a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the method of any of the above.

In a fourth aspect of the present invention,

the application provides an electronic device, including:

a computer readable storage medium as described above; and

one or more processors to execute the program in the computer-readable storage medium.

By adopting the above technical scheme, the application has at least the following beneficial effects:

according to the method and the device, the websites in the first website set are cleaned firstly, and then the webpage data are collected according to the second website set, so that redundant data in the collected data are effectively reduced, and further the cleaning pressure of the collected data is reduced.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present application; those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a web page data acquisition method according to an embodiment of the present application;

fig. 2 is a schematic flowchart illustrating a process of cleaning websites in the first website set according to an embodiment of the present application;

fig. 3 is a schematic structural diagram of a web page data acquisition device according to an embodiment of the present application;

fig. 4 is a schematic structural diagram of a web page data acquisition device according to another embodiment of the present application;

fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail below. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present application.

Fig. 1 is a schematic flow chart of a web page data acquisition method according to an embodiment of the present application, and as shown in fig. 1, the web page data acquisition method includes the following steps:

Step S101, acquiring a first website set.

In one embodiment, the obtaining the first set of web addresses includes:

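receiving search information input by a user;

and searching according to the search information to obtain webpages containing the search information, and acquiring the websites corresponding to those webpages to form the first website set.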

In practical application, the first website set can be obtained through a search engine tool. For example, if the search information is "Gree", a search engine returns a massive number of webpages containing "Gree", and the websites corresponding to these webpages form the first website set. These webpages include a large number of redundant webpages, such as webpages irrelevant to the content to be acquired and webpages with repeated content.

Step S102, cleaning the websites in the first website set to obtain a second website set.

Through this step, the websites in the first website set are cleaned and the websites the user does not need are eliminated, so that the collected data is closer to what the user needs.

In one embodiment, the cleaning the websites in the first set of websites includes:

screening, deduplication and classification.

The screening, de-duplication, and classification will be further described below.

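In one embodiment, the screening comprises:

performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.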

In a specific application, the user may set one or more preset domain names as needed. Domain name matching against the preset domain name picks out the websites in the first website set that are related to it; these matched websites are the ones from which the user needs to acquire content. Domain name matching thus determines the range of websites to collect from, and eliminating irrelevant websites also eliminates irrelevant data.

In one embodiment, the performing domain name matching on the websites in the first website set according to a preset domain name includes:

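setting a regular expression according to the preset domain name;

and performing domain name matching on the websites in the first website set through the regular expression.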

A regular expression is a logical formula for operating on character strings: predefined special characters are combined into a pattern string, which is then used to filter and screen other character strings. In this embodiment, a regular expression is set according to the preset domain name and used to perform domain name matching on the websites in the first website set; the matched websites are the ones from which the user needs to acquire content.
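The domain name matching described above can be sketched in Python as follows. This is a minimal illustration, not the implementation of the application: the function names `build_domain_pattern` and `screen_by_domain`, the sample URLs, and the choice to also accept subdomains of a preset domain are all assumptions.

```python
import re
from urllib.parse import urlparse


def build_domain_pattern(domains):
    """Compile a regex matching a host equal to, or a subdomain of, any preset domain."""
    alternatives = "|".join(re.escape(d) for d in domains)
    return re.compile(rf"^(?:[a-z0-9-]+\.)*(?:{alternatives})$", re.IGNORECASE)


def screen_by_domain(urls, domains):
    """Keep only the URLs whose host matches one of the preset domains."""
    pattern = build_domain_pattern(domains)
    kept = []
    for url in urls:
        host = urlparse(url).hostname or ""
        if pattern.match(host):
            kept.append(url)
    return kept


# Hypothetical first website set.
first_set = [
    "https://www.example.com/news/1",
    "https://shop.example.com/item/2",
    "https://example.org/post/3",
    "https://notexample.com/page",
]
screened = screen_by_domain(first_set, ["example.com"])
```

Matching against the parsed host rather than the whole URL string avoids accepting a URL that merely mentions the preset domain in its path or query.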

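In one embodiment, the deduplication comprises:

acquiring the title of the webpage corresponding to each website in the first website set;

and judging whether the acquired title is empty, and deduplicating the websites based on the titles that are not empty.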

It can be understood that the web address is deduplicated based on the title that is not empty, and redundant web pages can be eliminated, so that the collection of redundant web page contents is avoided.

In one embodiment, the deduplication based on titles that are not empty includes:

and only carrying out deduplication on the website based on the title which is not empty.

It will be appreciated that when websites are deduplicated based only on non-empty titles, websites whose titles have the same content are treated as duplicates.
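A minimal Python sketch of title-only deduplication is given below. The text does not say what happens to websites whose title is empty; dropping them here is an assumption, as are the function name `dedupe_by_title` and the sample data.

```python
def dedupe_by_title(pages):
    """pages: iterable of (url, title) pairs. Return URLs with distinct non-empty titles."""
    seen = set()
    kept = []
    for url, title in pages:
        normalized = (title or "").strip()
        if not normalized:
            continue  # assumption: websites with an empty title are dropped
        if normalized in seen:
            continue  # same title content as an earlier website: treat as duplicate
        seen.add(normalized)
        kept.append(url)
    return kept


# Hypothetical (url, title) pairs.
pages = [
    ("https://a.example/1", "Gree releases new product"),
    ("https://b.example/2", "Gree releases new product"),
    ("https://c.example/3", ""),
    ("https://d.example/4", None),
    ("https://e.example/5", "Quarterly results"),
]
unique_urls = dedupe_by_title(pages)
```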

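In another embodiment, the deduplication based on titles that are not empty includes:

acquiring the release time and the author information of the webpage corresponding to each website whose title is not empty, and deduplicating the websites according to the acquired title, release time and author information.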

It can be understood that the title of the web page is obtained through the website, and meanwhile the publishing time and the author information of the web page are also obtained, so that the uniqueness of the web page can be effectively determined.
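The combined check can be sketched as a composite key of title, release time and author. The `PageMeta` record and its field names are hypothetical, and dropping records with an empty title is again an assumption rather than something the text specifies.

```python
from typing import NamedTuple, Optional


class PageMeta(NamedTuple):
    url: str
    title: Optional[str]
    published: Optional[str]  # release time, kept as a plain string in this sketch
    author: Optional[str]


def dedupe_by_title_time_author(pages):
    """Keep the first URL for each distinct (title, release time, author) combination."""
    seen = set()
    kept = []
    for page in pages:
        title = (page.title or "").strip()
        if not title:
            continue  # assumption: websites with an empty title are dropped
        key = (title, page.published, page.author)
        if key not in seen:
            seen.add(key)
            kept.append(page.url)
    return kept


# Hypothetical metadata records.
records = [
    PageMeta("https://a.example/1", "Annual report", "2018-08-01", "Zhang"),
    PageMeta("https://b.example/2", "Annual report", "2018-08-01", "Zhang"),
    PageMeta("https://c.example/3", "Annual report", "2018-08-02", "Zhang"),
    PageMeta("https://d.example/4", "", "2018-08-03", "Li"),
]
unique = dedupe_by_title_time_author(records)
```

Compared with title-only deduplication, two pages that share a title but differ in release time or author are both kept.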

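In one embodiment, the classifying includes:

judging whether the acquired title is empty, and classifying the websites based on the titles that are not empty.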

Through the scheme of this embodiment, the websites are classified according to their non-empty titles, yielding a plurality of classified website sets, such as a science and technology website set and a financial website set. When data is collected according to the classified website sets, the collected data is output already classified, which makes it more convenient to process and saves a large amount of time if further classification is needed.

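In one embodiment, the classifying the web addresses based on the titles that are not empty includes:

calculating the semantic similarity between the titles that are not empty, and classifying the websites according to the semantic similarity.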

In practical application, the semantic similarity between the titles can be calculated by using a semantic similarity algorithm, and the websites are classified, so that the titles of the websites in each category of website collection have certain semantic similarity.
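The application does not name a particular semantic similarity algorithm. The sketch below substitutes a simple bag-of-words cosine similarity as a stand-in and groups titles greedily against the first title of each class; the threshold of 0.5, the function names and the sample titles are all assumptions. A word-based measure like this only approximates semantic similarity and would need a proper tokenizer for titles written without spaces.

```python
import math
import re
from collections import Counter


def title_vector(title):
    """Bag-of-words term counts for a title (stand-in for a semantic representation)."""
    return Counter(re.findall(r"\w+", title.lower()))


def cosine_similarity(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def classify_by_title(pages, threshold=0.5):
    """Greedy grouping: a URL joins the first class whose first title is similar enough."""
    classes = []  # list of (representative_vector, urls)
    for url, title in pages:
        vec = title_vector(title)
        for representative, urls in classes:
            if cosine_similarity(vec, representative) >= threshold:
                urls.append(url)
                break
        else:
            classes.append((vec, [url]))
    return [urls for _, urls in classes]


# Hypothetical (url, non-empty title) pairs.
titled_pages = [
    ("https://a.example/1", "Stock market rally continues"),
    ("https://a.example/2", "Stock market rally fades"),
    ("https://a.example/3", "New smartphone chip announced"),
]
groups = classify_by_title(titled_pages)
```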

As can be seen from the above respective descriptions of the further embodiments of screening, deduplication and classification, each embodiment of screening, deduplication and classification can clean the websites in the first website set to a certain extent.

Fig. 2 is a schematic flow chart of cleaning the websites in the first website set according to an embodiment of the present application, and as shown in fig. 2, the cleaning the websites in the first website set includes the following steps:

step S201, performing domain name matching on the websites in the first website set according to a preset domain name, matching out the websites related to the preset domain name, and executing step S202 on the matched websites;

step S202, acquiring a title corresponding to the website, judging whether the acquired title is empty, removing the duplicate of the website based on the title which is not empty, and executing step S203 on the duplicate-removed website;

and step S203, classifying the websites according to the titles of the websites.

With regard to the above-described embodiment solutions,

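in step S201, the performing domain name matching on the websites in the first website set according to the preset domain name includes:

setting a regular expression according to the preset domain name;

and performing domain name matching on the websites in the first website set through the regular expression.

In step S202, the deduplicating the websites based on the non-empty titles includes:

deduplicating the websites based only on the titles that are not empty; or,

acquiring the release time and the author information of the webpage corresponding to each website whose title is not empty, and deduplicating the websites according to the acquired title, release time and author information.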

In step S203, the classifying the websites according to the titles of the websites includes:

and calculating semantic similarity between the titles, and classifying the websites according to the semantic similarity.

This embodiment cleans the websites in the first website set by sequentially screening, deduplicating and classifying them. Among steps S201, S202 and S203, each step is the premise of the next, and the completion of one step triggers the execution of the next. The three steps work together to improve the cleaning effect on the first website set, which helps ensure that the webpage content corresponding to each website in the second website set obtained after cleaning is worth collecting.
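The sequential flow of steps S201 to S203 can be sketched as three chained functions, each consuming the output of the previous one. Everything here is illustrative: the record layout, the preset domain `example.com`, dropping empty titles, and in particular the grouping by the first word of the title, which is only a placeholder for the semantic similarity classification of step S203.

```python
from urllib.parse import urlparse


def screen(records, domain="example.com"):
    # S201: keep records whose host is the preset domain or one of its subdomains.
    kept = []
    for record in records:
        host = urlparse(record["url"]).hostname or ""
        if host == domain or host.endswith("." + domain):
            kept.append(record)
    return kept


def dedupe(records):
    # S202: drop empty titles (assumption), then keep the first record per title.
    seen, kept = set(), []
    for record in records:
        title = (record.get("title") or "").strip()
        if title and title not in seen:
            seen.add(title)
            kept.append(record)
    return kept


def classify(records):
    # S203: placeholder grouping by the first word of the title.
    groups = {}
    for record in records:
        key = record["title"].split()[0].lower()
        groups.setdefault(key, []).append(record["url"])
    return groups


def clean(records):
    """Run S201, S202 and S203 in order; each step starts when the previous one is done."""
    return classify(dedupe(screen(records)))


# Hypothetical records for the first website set.
raw_records = [
    {"url": "https://news.example.com/a", "title": "Market update"},
    {"url": "https://news.example.com/b", "title": "Market update"},
    {"url": "https://other.test/c", "title": "Market outlook"},
    {"url": "https://example.com/d", "title": ""},
    {"url": "https://example.com/e", "title": "Chip launch"},
]
cleaned = clean(raw_records)
```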

In a specific application, a website pool may be constructed, and the second website set obtained by cleaning through one or more of screening, deduplication and classification may be stored in the website pool.
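A website pool could be as simple as an in-memory mapping from each website to its class label. The `UrlPool` class below is a hypothetical sketch; the application does not specify how the pool is built, and a real pool might instead be backed by a database or a queue.

```python
class UrlPool:
    """In-memory website pool holding the cleaned second website set."""

    def __init__(self):
        self._categories = {}  # url -> class label (None if unclassified)

    def add_all(self, urls, category=None):
        for url in urls:
            # setdefault keeps the first entry, so re-adding a URL is a no-op.
            self._categories.setdefault(url, category)

    def urls(self, category=None):
        """All URLs when category is None, else only those stored under that label."""
        if category is None:
            return list(self._categories)
        return [u for u, c in self._categories.items() if c == category]

    def __len__(self):
        return len(self._categories)


pool = UrlPool()
pool.add_all(["https://a.example/1", "https://a.example/2"], category="finance")
pool.add_all(["https://a.example/2", "https://b.example/3"], category="technology")
```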

Step S103, acquiring webpage data according to the second website set.

In practical application, the web crawler may be used to collect the content of the web page corresponding to the web address in the second web address set.

It can be understood that each website in the second website set has been cleaned and is therefore more targeted to the user's needs. Collecting the content of the webpages corresponding to the websites in the second website set effectively reduces redundant data in the collected data, which helps reduce the burden of cleaning the collected data.

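In one embodiment, if the cleaning of the websites in the first website set includes classifying them, the collecting the web page data according to the second website set includes:

acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.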

Through the scheme of this embodiment, the webpage data can be output directly and stored by class. In the prior art, data is collected first and classified afterwards. Here, data is collected according to the website classification, so it is classified as it is collected. Moving classification forward into the collection stage relieves the burden of subsequent classification and reduces time and labor costs.
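Collecting by class can be sketched as iterating over the classified websites and storing each fetched page under its class. The `fetch` callable is supplied by the caller; the `fake_fetch` used here is only a stand-in for a real crawler request, and all names and URLs are hypothetical.

```python
def collect_by_class(classified_urls, fetch):
    """classified_urls: dict mapping class label -> list of URLs; fetch: url -> page data."""
    store = {}
    for category, urls in classified_urls.items():
        # Pages from websites of the same class are stored under the same key.
        store[category] = {url: fetch(url) for url in urls}
    return store


def fake_fetch(url):
    # Stand-in for a real crawler request; returns a placeholder string.
    return f"<html>content of {url}</html>"


classified = {
    "finance": ["https://a.example/1", "https://a.example/2"],
    "technology": ["https://b.example/3"],
}
stored = collect_by_class(classified, fake_fetch)
```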

In one embodiment, the collecting the web page data according to the second website set includes:

acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.

In the above embodiment, the distributed acquisition mode simulates a large number of ordinary users visiting a website normally, so that the collection is less likely to be blocked by the website's blocking program, which avoids wasted effort and effectively saves data acquisition time.
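The application does not describe how the distributed acquisition is deployed. As a rough local analogue only, the sketch below spreads fetches over a thread pool on one machine; a genuinely distributed setup would spread the work across multiple machines or network identities. The function `collect_concurrently`, the `fetch` callable and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def collect_concurrently(urls, fetch, workers=4):
    """Fetch every URL using a pool of worker threads and return {url: page data}."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        # executor.map yields results in the same order as the input URLs.
        pages = list(executor.map(fetch, urls))
    return dict(zip(urls, pages))


def fake_fetch(url):
    # Stand-in for a real HTTP request made by a crawler worker.
    return f"data for {url}"


second_set = ["https://a.example/1", "https://a.example/2", "https://b.example/3"]
collected = collect_concurrently(second_set, fake_fetch)
```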

Fig. 3 is a schematic structural diagram of a web page data acquisition apparatus according to an embodiment of the present application, and as shown in fig. 3, the web page data acquisition apparatus 3 includes:

an obtaining module 31, configured to obtain a first website set;

a cleaning module 32, configured to clean websites in the first website set to obtain a second website set;

and the acquisition module 33 is configured to acquire the web page data according to the second website set.
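The cooperation of the three modules can be sketched as one object that chains three callables. The class name `WebPageDataAcquisitionDevice` and the lambda stand-ins for modules 31, 32 and 33 are hypothetical and only show the data flow from the first website set to the collected data.

```python
class WebPageDataAcquisitionDevice:
    """Chains an obtaining step, a cleaning step and an acquisition step."""

    def __init__(self, obtain, clean, acquire):
        self.obtain = obtain    # role of obtaining module 31
        self.clean = clean      # role of cleaning module 32
        self.acquire = acquire  # role of acquisition module 33

    def run(self, search_info):
        first_set = self.obtain(search_info)
        second_set = self.clean(first_set)
        return self.acquire(second_set)


device = WebPageDataAcquisitionDevice(
    # Stand-in search: returns a duplicate and an unrelated website on purpose.
    obtain=lambda q: [
        f"https://example.com/{q}",
        f"https://example.com/{q}",
        "https://other.test/x",
    ],
    # Stand-in cleaning: drop exact duplicates, keep only URLs containing example.com.
    clean=lambda urls: [u for u in dict.fromkeys(urls) if "example.com" in u],
    # Stand-in acquisition: return placeholder data per website.
    acquire=lambda urls: {u: f"data for {u}" for u in urls},
)
result = device.run("gree")
```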

Further, the obtaining module 31 is specifically configured to:

receiving search information input by a user;

Further, the cleaning module 32 is specifically configured to:

screening, deduplication and classification.

Further, the screening comprises:

setting a regular expression according to the preset domain name;

Further, the de-duplicating includes:

deduplication of a web site based only on the title that is not empty; or,

Further, the classifying includes:

Further, the acquisition module 33 is specifically configured to:

and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.

Fig. 4 is a schematic structural diagram of a web page data acquisition apparatus according to another embodiment of the present application, and as shown in fig. 4, the web page data acquisition apparatus 3 further includes:

and a website pool module 34, configured to store the second website set in a website pool that is constructed in advance.

With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.

In an exemplary embodiment, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the above.

With regard to the computer-readable storage medium in the above-described embodiments, the specific manner in which the stored computer program performs the operations has been described in detail in relation to the embodiments of the method, and will not be described in detail herein.

Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 5, the electronic device 5 includes:

the computer-readable storage medium 51 as described above; and

one or more processors 52 for executing the programs in the computer-readable storage medium 51.

With regard to the electronic device in the above embodiments, the specific manner in which the processor thereof executes the program in the computer-readable storage medium has been described in detail in the embodiments related to the method, and will not be elaborated here.

It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.

It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.

Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.

It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.

It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.

In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.

The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.

In the description herein, reference to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.

Claims

1. A webpage data acquisition method is characterized by comprising the following steps:

acquiring a first website set;

cleaning the websites in the first website set to obtain a second website set;

and acquiring webpage data according to the second website set.

2. The method of claim 1,

the acquiring of the first website set includes:

receiving search information input by a user;

3. The method of claim 1,

the cleaning of the websites in the first website set includes:

screening, deduplication and classification.

4. The method of claim 3, wherein the screening comprises:

5. The method according to claim 4, wherein the performing domain name matching on the websites in the first website set according to a preset domain name comprises:

setting a regular expression according to the preset domain name;

6. The method of claim 3, wherein the de-duplicating comprises:

7. The method of claim 6, wherein the de-duplicating the web address based on the title that is not empty comprises:

deduplication of a web site based only on the title that is not empty; or,

8. The method of claim 3, wherein the classifying comprises:

9. The method of claim 8, wherein classifying web addresses based on non-empty headings comprises:

10. The method of claim 8,

11. The method according to any one of claims 1-10, wherein collecting web page data according to the second set of web addresses comprises:

12. The method according to any one of claims 1-9, further comprising:

and storing the second website set in a pre-constructed website pool.

13. A web page data acquisition device, comprising:

the obtaining module is used for obtaining a first website set;

14. The apparatus of claim 13,

the obtaining module is specifically configured to:

receiving search information input by a user;

15. The apparatus of claim 13,

the cleaning module is specifically configured to:

screening, deduplication and classification.

16. The apparatus of claim 15, wherein the screening comprises:

17. The apparatus of claim 16, wherein the performing domain name matching on the websites in the first website set according to a preset domain name comprises:

setting a regular expression according to the preset domain name;

18. The apparatus of claim 15, wherein the de-duplication comprises:

19. The apparatus of claim 18, wherein the de-duplicating the web address based on the title that is not empty comprises:

deduplication of a web site based only on the title that is not empty; or,

20. The apparatus of claim 15, wherein the classifying comprises:

21. The apparatus of claim 20, wherein the classifying the web addresses based on the non-empty titles comprises:

22. The apparatus of claim 20,

the acquisition module is specifically configured to:

23. The apparatus of any one of claims 13-22,

the acquisition module is specifically configured to:

24. The apparatus of any one of claims 13-22, further comprising:

25. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1-12.

26. An electronic device, comprising:

the computer-readable storage medium of claim 25; and