CN110874434A - Webpage data acquisition method and device, storage medium and electronic equipment - Google Patents
Webpage data acquisition method and device, storage medium and electronic equipment
- Publication number
- CN110874434A (application CN201811010439.7A)
- Authority
- CN
- China
- Prior art keywords
- website
- websites
- title
- empty
- acquiring
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Transfer Between Computers (AREA)
Abstract
The application relates to a webpage data acquisition method, a webpage data acquisition device, a storage medium and electronic equipment, and belongs to the technical field of webpage data acquisition. The method comprises the steps of obtaining a first website set; cleaning the websites in the first website set to obtain a second website set; and acquiring the webpage data according to the second website set. This effectively reduces redundant data in the collected data, which in turn helps reduce the burden of cleaning the collected data.
Description
Technical Field
The application belongs to the technical field of webpage data acquisition, and particularly relates to a webpage data acquisition method, a webpage data acquisition device, a storage medium and electronic equipment.
Background
With the rapid development of the internet and the advent of the big data era, more and more data are generated on the network, and the technology of collecting data from mass data becomes more and more important.
In the related art of data collection, a data collection tool typically starts from a specified website and continuously traverses, collects and stores page data in a horizontal or vertical manner. However, the data acquired in this way contains a large amount of redundant data, such as irrelevant data and duplicated data, so the user has to spend a great deal of effort and time cleaning the acquired data.
Disclosure of Invention
In order to overcome the problems in the related art at least to a certain extent, the application provides a method, a device, a storage medium and electronic equipment for acquiring webpage data, which are beneficial to solving the problem that a large amount of redundant data exists in the acquired data.
In order to achieve the purpose, the following technical scheme is adopted in the application:
in a first aspect,
the application provides a webpage data acquisition method, which comprises the following steps:
acquiring a first website set;
cleaning the websites in the first website set to obtain a second website set;
and acquiring webpage data according to the second website set.
Further,
the acquiring of the first website set includes:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
Further,
the cleaning of the websites in the first website set includes:
processing the websites in the first website set by one or more of the following items:
screening, deduplication and classification.
Further, the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
Further,
the performing domain name matching on the websites in the first website set according to the preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
Further,
the de-duplication comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
Further,
the deduplication of the website based on the title that is not empty includes:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
Further,
the classification includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
Further,
the classifying the website based on the title which is not empty comprises the following steps:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
Further, the acquiring the web page data according to the second website set further includes:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
Further,
the acquiring of the webpage data according to the second website set comprises:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
Further, the method further comprises:
and storing the second website set in a pre-constructed website pool.
In a second aspect of the present invention,
the application provides a webpage data acquisition device, includes:
the obtaining module is used for obtaining a first website set;
the cleaning module is used for cleaning the websites in the first website set to obtain a second website set;
and the acquisition module is used for acquiring the webpage data according to the second website set.
Further, the obtaining module is specifically configured to:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
Further, the cleaning module is specifically configured to:
processing the websites in the first website set by one or more of the following items:
screening, deduplication and classification.
Further, the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
Further, the performing domain name matching on the websites in the first website set according to a preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
Further, the de-duplicating includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
Further, the deduplication of the website based on the title that is not empty includes:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
Further, the classifying includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
Further, the classifying the website based on the title that is not empty includes:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
Further, the acquisition module is specifically configured to:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
Further, if the cleaning of the websites in the first website set includes classifying the websites in the first website set, the acquisition module is further specifically configured to:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
Further, the apparatus further comprises:
and the website pool module is used for storing the second website set in a website pool which is constructed in advance.
In a third aspect,
a computer-readable storage medium is provided, which stores a computer program that, when executed by a processor, implements the method of any of the above.
In a fourth aspect of the present invention,
the application provides an electronic device, including:
a computer readable storage medium as described above; and
one or more processors to execute the program in the computer-readable storage medium.
By adopting the above technical solutions, the application has at least the following beneficial effects:
according to the method and the device, the websites in the first website set are cleaned first, and the webpage data are then collected according to the resulting second website set, so that redundant data in the collected data are effectively reduced and the burden of cleaning the collected data is reduced in turn.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. Obviously, the drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a web page data acquisition method according to an embodiment of the present application;
Fig. 2 is a schematic flowchart of cleaning the websites in the first website set according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a web page data acquisition device according to an embodiment of the present application;
Fig. 4 is a schematic structural diagram of a web page data acquisition device according to another embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be described in detail below. It is to be understood that the embodiments described are only a few embodiments of the present application and not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the examples given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of a web page data acquisition method according to an embodiment of the present application, and as shown in fig. 1, the web page data acquisition method includes the following steps:
Step S101: acquiring a first website set.
In one embodiment, the obtaining the first set of web addresses includes:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
In practical application, the first website set can be obtained through a search engine tool. For example, if the search information is "Gree", a search engine can find a massive number of webpages containing "Gree", and the websites corresponding to these webpages form the first website set. However, these webpages include a large number of redundant webpages, such as webpages irrelevant to the content to be acquired, webpages with repeated content, and so on.
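Step S101 can be sketched as follows. This is a minimal illustration only: the `search` callable, the `fake_search` data and the query string are hypothetical stand-ins for a real search engine, which the application does not specify, and "website" is treated here as a web address (URL).

```python
from typing import Callable, Iterable, List, Tuple

# Hypothetical search backend (an assumption, not from the application):
# it takes the search information and yields (website, page_text) pairs.
SearchFn = Callable[[str], Iterable[Tuple[str, str]]]


def build_first_website_set(search_info: str, search: SearchFn) -> List[str]:
    """Keep the website of every returned page whose text contains the search information."""
    websites: List[str] = []
    for website, page_text in search(search_info):
        if search_info in page_text:
            websites.append(website)
    return websites


def fake_search(query: str) -> List[Tuple[str, str]]:
    # Stand-in data for illustration only.
    return [
        ("https://example.com/a", f"news about {query}"),
        ("https://example.org/b", "an unrelated page"),
        ("https://example.com/a?ref=1", f"news about {query}"),
    ]


first_set = build_first_website_set("sample keyword", fake_search)
```

Note that the resulting set still contains two websites with the same page content, which is the kind of redundancy the cleaning step is meant to remove.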
Step S102: cleaning the websites in the first website set to obtain a second website set.
Through this step, the websites in the first website set are cleaned and the websites that the user does not need can be eliminated, so that the collected data converge on what the user actually needs.
In one embodiment, the cleaning the websites in the first set of websites includes:
processing the websites in the first website set by one or more of the following items:
screening, deduplication and classification.
The screening, de-duplication, and classification will be further described below.
In one embodiment, the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
In a specific application, the user may set one or more preset domain names as needed. Domain name matching against the preset domain names picks out, from the first website set, the websites related to the preset domain names; the matched websites are the ones from which the user needs to acquire content. Domain name matching thus determines the range of websites to be collected, and eliminating the unrelated websites also eliminates the irrelevant data.
In one embodiment, the performing domain name matching on the websites in the first website set according to a preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
In the related art, a regular expression is a pattern for operating on character strings: a rule string is formed from predefined special characters, and other character strings are filtered and screened with it. In the scheme of this embodiment, a regular expression is set according to the preset domain name and used to perform domain name matching on the websites in the first website set; the matched websites are the ones from which the user needs to acquire content.
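The screening described above can be sketched with Python's standard `re` module. The application does not give a concrete pattern, so the pattern shape below (scheme, optional subdomains, preset domain, optional port) and the function names `build_domain_pattern` and `screen` are assumptions made for illustration.

```python
import re
from typing import Iterable, List


def build_domain_pattern(preset_domains: Iterable[str]) -> "re.Pattern[str]":
    """Build one regular expression matching any preset domain or one of its subdomains."""
    # re.escape keeps the dots in a domain name from acting as wildcards.
    alternatives = "|".join(re.escape(d) for d in preset_domains)
    return re.compile(
        rf"^https?://(?:[\w-]+\.)*(?:{alternatives})(?::\d+)?(?:[/?#]|$)",
        re.IGNORECASE,
    )


def screen(websites: Iterable[str], preset_domains: Iterable[str]) -> List[str]:
    """Return only the websites whose host matches a preset domain."""
    pattern = build_domain_pattern(preset_domains)
    return [w for w in websites if pattern.match(w)]


candidates = [
    "https://news.example.com/a",
    "http://example.com",
    "https://example.com.evil.net/x",  # host merely starts with the preset domain
    "https://other.org/",
]
screened = screen(candidates, ["example.com"])
```

Anchoring the pattern at the end of the host part is a design choice of this sketch: it keeps a host that only begins with the preset domain from being matched.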
In one embodiment, the deduplication comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
It can be understood that the web address is deduplicated based on the title that is not empty, and redundant web pages can be eliminated, so that the collection of redundant web page contents is avoided.
In one embodiment, the deduplication based on titles that are not empty includes:
and only carrying out deduplication on the website based on the title which is not empty.
It will be appreciated that, when the websites are deduplicated based only on the non-empty titles, websites whose titles have identical content are treated as duplicates.
In another embodiment, the deduplication based on titles that are not empty includes:
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
It can be understood that the title of the web page is obtained through the website, and meanwhile the publishing time and the author information of the web page are also obtained, so that the uniqueness of the web page can be effectively determined.
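Both deduplication variants can be sketched as below. The `Page` record and its field names are hypothetical, and the application does not say what happens to websites whose title is empty; this sketch assumes they are dropped.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Page:
    # Hypothetical record: metadata already fetched for one website.
    website: str
    title: Optional[str]
    published: Optional[str] = None
    author: Optional[str] = None


def deduplicate(pages: Iterable[Page], use_time_and_author: bool = False) -> List[str]:
    """Keep the first website for each distinct key built from non-empty titles."""
    seen: Set[Tuple[Optional[str], ...]] = set()
    kept: List[str] = []
    for page in pages:
        title = (page.title or "").strip()
        if not title:
            continue  # assumption: websites with an empty title are dropped
        if use_time_and_author:
            key = (title, page.published, page.author)
        else:
            key = (title,)
        if key not in seen:
            seen.add(key)
            kept.append(page.website)
    return kept


pages = [
    Page("u1", "Quarterly report", "2018-08-01", "x"),
    Page("u2", "Quarterly report", "2018-08-02", "y"),
    Page("u3", ""),
    Page("u4", "Quarterly report", "2018-08-01", "x"),
]
by_title = deduplicate(pages)
by_title_time_author = deduplicate(pages, use_time_and_author=True)
```

With the title alone, `u2` is treated as a duplicate of `u1`; adding publishing time and author keeps it, since two distinct pages may share a title.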
In one embodiment, the classifying includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
Through the scheme of this embodiment, the websites are classified according to their non-empty titles, yielding several classified website sets, such as a science and technology website set and a financial website set. Collecting data according to the classified website sets allows classified data to be output directly, which makes the collected data easier to process and saves a large amount of time when the data are further classified later.
In one embodiment, the classifying the web addresses based on the titles that are not empty includes:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
In practical application, a semantic similarity algorithm can be used to calculate the semantic similarity between the titles and classify the websites accordingly, so that the titles of the websites within each classified website set share a certain semantic similarity.
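The application does not name a particular semantic similarity algorithm. The sketch below substitutes a simple Jaccard overlap of title words for it and groups websites greedily; the `jaccard` and `classify` functions and the threshold value are assumptions, and a real system would likely plug in a stronger similarity measure.

```python
from typing import List, Tuple


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity in [0, 1]; a crude stand-in for semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


def classify(titled: List[Tuple[str, str]], threshold: float = 0.3) -> List[List[str]]:
    """Greedy grouping: a website joins the first class whose representative title is similar enough."""
    groups: List[Tuple[str, List[str]]] = []  # (representative title, member websites)
    for website, title in titled:
        for representative, members in groups:
            if jaccard(representative, title) >= threshold:
                members.append(website)
                break
        else:
            # No existing class is similar enough, so start a new one.
            groups.append((title, [website]))
    return [members for _, members in groups]


titled = [
    ("u1", "new phone chip released"),
    ("u2", "stock market falls today"),
    ("u3", "phone chip prices released"),
    ("u4", "stock market rallies"),
]
classes = classify(titled)
```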
As can be seen from the above respective descriptions of the further embodiments of screening, deduplication and classification, each embodiment of screening, deduplication and classification can clean the websites in the first website set to a certain extent.
Fig. 2 is a schematic flow chart of cleaning the websites in the first website set according to an embodiment of the present application, and as shown in fig. 2, the cleaning the websites in the first website set includes the following steps:
Step S201: performing domain name matching on the websites in the first website set according to a preset domain name, matching out the websites related to the preset domain name, and executing step S202 on the matched websites;
Step S202: acquiring the title corresponding to each website, judging whether the acquired title is empty, deduplicating the websites based on the titles that are not empty, and executing step S203 on the deduplicated websites;
Step S203: classifying the websites according to the titles of the websites.
With regard to the above embodiment,
in step S201, the performing domain name matching on the websites in the first website set according to the preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
In step S202, the removing duplicate addresses based on the non-empty titles includes:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
In step S203, the classifying the websites according to the titles of the websites includes:
and calculating semantic similarity between the titles, and classifying the websites according to the semantic similarity.
The above embodiment cleans the websites in the first website set by screening, deduplication and classification in sequence. Among steps S201, S202 and S203, each step is the premise of the next, and the completion of one step triggers the execution of the next. The three steps work together to improve the cleaning of the first website set, which helps ensure that the webpage content corresponding to each website in the resulting second website set is worth collecting.
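The chaining of steps S201 to S203 can be sketched in a single function in which each step consumes the output of the previous one. The `titles` lookup table and the function name are hypothetical, and the classification step uses a deliberately crude placeholder (the first word of the title) where the application calls for semantic similarity.

```python
import re
from typing import Dict, List


def clean_websites(first_set: List[str], titles: Dict[str, str], preset_domain: str) -> Dict[str, List[str]]:
    """Run screening, deduplication and classification in sequence."""
    # S201: domain name matching with a regular expression built from the preset domain.
    pattern = re.compile(rf"^https?://(?:[\w-]+\.)*{re.escape(preset_domain)}(?:[/:?#]|$)")
    matched = [w for w in first_set if pattern.match(w)]

    # S202: deduplicate the matched websites by non-empty title
    # (assumption: websites with an empty title are dropped).
    seen = set()
    unique: List[str] = []
    for website in matched:
        title = titles.get(website, "").strip()
        if title and title not in seen:
            seen.add(title)
            unique.append(website)

    # S203: classify the deduplicated websites.
    # Placeholder only: the first title word stands in for a semantic similarity measure.
    classes: Dict[str, List[str]] = {}
    for website in unique:
        label = titles[website].split()[0].lower()
        classes.setdefault(label, []).append(website)
    return classes


first_set = [
    "https://a.example.com/1",
    "https://a.example.com/2",
    "https://other.net/3",
    "https://example.com/4",
    "https://example.com/5",
]
titles = {
    "https://a.example.com/1": "Finance weekly report",
    "https://a.example.com/2": "Finance weekly report",
    "https://other.net/3": "Finance elsewhere",
    "https://example.com/4": "Tech launch",
    "https://example.com/5": "",
}
second_set_by_class = clean_websites(first_set, titles, "example.com")
```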
In a specific application, a website pool may be constructed, and the second website set obtained by cleaning with one or more of screening, deduplication and classification may be stored in the website pool.
Step S103: acquiring webpage data according to the second website set.
In practical application, a web crawler may be used to collect the content of the webpages corresponding to the websites in the second website set.
It can be understood that each website in the second website set has passed the cleaning, so the websites in the second website set are better targeted to the user's needs. Collecting the content of the webpages corresponding to these websites effectively reduces redundant data in the collected data and thereby helps reduce the burden of cleaning the collected data.
In one embodiment, the collecting the web page data according to the second website set includes:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
Through the scheme of this embodiment, webpage data can be output directly and stored by class. Compared with the prior-art approach of collecting data first and classifying it afterwards, collecting data according to the website classification classifies the data during collection. Moving classification forward to the data collection stage relieves the burden of further classifying the data later and reduces time and labor costs.
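Classified storage during collection can be sketched as follows. The `fetch` callable is a hypothetical stand-in for a web crawler, and the nested dictionary is only one possible storage layout; the application does not prescribe either.

```python
from typing import Callable, Dict, List


def collect_by_class(
    classified: Dict[str, List[str]],
    fetch: Callable[[str], str],
) -> Dict[str, Dict[str, str]]:
    """Collect each website's page data and store it under the website's class."""
    store: Dict[str, Dict[str, str]] = {}
    for label, websites in classified.items():
        bucket = store.setdefault(label, {})
        for website in websites:
            bucket[website] = fetch(website)
    return store


def fake_fetch(website: str) -> str:
    # Stand-in for a crawler request; returns placeholder content.
    return f"<html>content of {website}</html>"


classified = {"finance": ["u1", "u2"], "tech": ["u3"]}
stored = collect_by_class(classified, fake_fetch)
```

Because the class label is known before the page is fetched, no separate classification pass over the collected data is needed.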
In one embodiment, the collecting the web page data according to the second website set includes:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
In the above embodiment, the distributed acquisition mode can simulate a large number of ordinary users visiting a website normally. This helps prevent the collection from being blocked by a blocking program, so that data acquisition time is not wasted.
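The application does not describe how the distributed acquisition is implemented. As a rough single-machine approximation, the sketch below fans the websites out to a pool of worker threads with the standard `concurrent.futures` module; a real deployment would presumably spread the work over several machines or network identities. The `fetch` callable and the worker count are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def collect_concurrently(
    websites: List[str],
    fetch: Callable[[str], str],
    workers: int = 4,
) -> Dict[str, str]:
    """Fetch all websites through a worker pool and map each website to its page data."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map preserves input order, so results line up with the websites.
        pages = list(pool.map(fetch, websites))
    return dict(zip(websites, pages))


def fake_fetch(website: str) -> str:
    # Stand-in for a crawler request; returns placeholder content.
    return f"page:{website}"


collected = collect_concurrently(["u1", "u2", "u3"], fake_fetch, workers=2)
```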
Fig. 3 is a schematic structural diagram of a web page data acquisition apparatus according to an embodiment of the present application, and as shown in fig. 3, the web page data acquisition apparatus 3 includes:
an obtaining module 31, configured to obtain a first website set;
a cleaning module 32, configured to clean websites in the first website set to obtain a second website set;
and the acquisition module 33 is configured to acquire the web page data according to the second website set.
Further, the obtaining module 31 is specifically configured to:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
Further, the cleaning module 32 is specifically configured to:
processing the websites in the first website set by one or more of the following items:
screening, deduplication and classification.
Further, the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, and matching out the websites related to the preset domain name.
Further, the performing domain name matching on the websites in the first website set according to a preset domain name includes:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
Further, the de-duplicating includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and removing the duplicate of the website based on the title which is not empty.
Further, the deduplication of the website based on the title that is not empty includes:
deduplication of a web site based only on the title that is not empty; or,
and acquiring the release time and the author information of the webpage corresponding to the website with the title not being empty, and removing the duplicate of the website according to the acquired title, release time and author information.
Further, the classifying includes:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
Further, the classifying the website based on the title that is not empty includes:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
Further, the acquisition module 33 is specifically configured to:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
Further, the acquisition module 33 is specifically configured to:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
Fig. 4 is a schematic structural diagram of a web page data acquisition apparatus according to another embodiment of the present application, and as shown in fig. 4, the web page data acquisition apparatus 3 further includes:
and a website pool module 34, configured to store the second website set in a website pool that is constructed in advance.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
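The module structure of the apparatus can be pictured as three pluggable components wired in sequence. The class and attribute names below are illustrative assumptions that mirror the obtaining module 31, the cleaning module 32 and the acquisition module 33; the lambdas are stand-ins, not the application's implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class WebpageDataAcquisitionDevice:
    # Each module is modelled as a callable so that implementations can be swapped.
    obtaining_module: Callable[[str], List[str]]               # cf. module 31
    cleaning_module: Callable[[List[str]], List[str]]          # cf. module 32
    acquisition_module: Callable[[List[str]], Dict[str, str]]  # cf. module 33

    def run(self, search_info: str) -> Dict[str, str]:
        first_set = self.obtaining_module(search_info)
        second_set = self.cleaning_module(first_set)
        return self.acquisition_module(second_set)


device = WebpageDataAcquisitionDevice(
    obtaining_module=lambda q: [f"https://example.com/{q}", f"https://example.com/{q}"],
    cleaning_module=lambda ws: list(dict.fromkeys(ws)),  # stand-in: drop exact repeats
    acquisition_module=lambda ws: {w: f"page:{w}" for w in ws},
)
result = device.run("topic")
```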
In an exemplary embodiment, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of any of the above.
With regard to the computer-readable storage medium in the above-described embodiments, the specific manner in which the stored computer program performs the operations has been described in detail in relation to the embodiments of the method, and will not be described in detail herein.
Fig. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application, and as shown in fig. 5, the electronic device 5 includes:
the computer-readable storage medium 51 as described above; and
one or more processors 52 for executing the programs in the computer-readable storage medium 51.
With regard to the electronic device in the above embodiments, the specific manner in which the processor thereof executes the program in the computer-readable storage medium has been described in detail in the embodiments related to the method, and will not be elaborated here.
It is understood that the same or similar parts in the above embodiments may be mutually referred to, and the same or similar parts in other embodiments may be referred to for the content which is not described in detail in some embodiments.
It should be noted that, in the description of the present application, the terms "first", "second", etc. are used for descriptive purposes only and are not to be construed as indicating or implying relative importance. Further, in the description of the present application, the meaning of "a plurality" means at least two unless otherwise specified.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process, and the scope of the preferred embodiments of the present application includes other implementations in which functions may be executed out of order from that shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved, as would be understood by those reasonably skilled in the art of the present application.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or combination of the following techniques, which are known in the art, may be used: a discrete logic circuit having a logic gate circuit for implementing a logic function on a data signal, an application specific integrated circuit having an appropriate combinational logic gate circuit, a Programmable Gate Array (PGA), a Field Programmable Gate Array (FPGA), or the like.
It will be understood by those skilled in the art that all or part of the steps carried by the method for implementing the above embodiments may be implemented by hardware related to instructions of a program, which may be stored in a computer readable storage medium, and when the program is executed, the program includes one or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module, if implemented in the form of a software functional module and sold or used as a stand-alone product, may also be stored in a computer readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic or optical disk, etc.
In this description, reference to the terms "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the application. Such uses of these terms in this specification do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it is understood that the above embodiments are exemplary and should not be construed as limiting the present application, and that variations, modifications, substitutions and alterations may be made to the above embodiments by those of ordinary skill in the art within the scope of the present application.
Claims (26)
1. A webpage data acquisition method, characterized by comprising the following steps:
acquiring a first website set;
cleaning the websites in the first website set to obtain a second website set;
and acquiring webpage data according to the second website set.
2. The method of claim 1,
the acquiring of the first website set includes:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
3. The method of claim 1,
the cleaning of the websites in the first website set includes:
processing the websites in the first website set by one or more of the following items:
screening, de-duplicating and classifying.
4. The method of claim 3, wherein the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, to select the websites related to the preset domain name.
5. The method according to claim 4, wherein the performing domain name matching on the websites in the first website set according to a preset domain name comprises:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
6. The method of claim 3, wherein the de-duplicating comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty, and de-duplicating the websites based on the titles which are not empty.
7. The method of claim 6, wherein the de-duplicating the web address based on the title that is not empty comprises:
de-duplicating the websites based only on the titles that are not empty; or,
acquiring the release time and the author information of the webpage corresponding to each website whose title is not empty, and de-duplicating the websites according to the acquired title, release time and author information.
8. The method of claim 3, wherein the classifying comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
9. The method of claim 8, wherein the classifying the website based on the title which is not empty comprises:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
10. The method of claim 8,
the acquiring of the webpage data according to the second website set comprises:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
11. The method according to any one of claims 1-10, wherein collecting web page data according to the second set of web addresses comprises:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
12. The method according to any one of claims 1-9, further comprising:
and storing the second website set in a pre-constructed website pool.
13. A web page data acquisition device, comprising:
the acquisition module is used for acquiring a first website set;
the cleaning module is used for cleaning the websites in the first website set to obtain a second website set;
and the acquisition module is used for acquiring the webpage data according to the second website set.
14. The apparatus of claim 13,
the acquisition module is specifically configured to:
receiving search information input by a user;
and searching according to the search information to obtain a webpage containing the search information, and acquiring the website corresponding to the webpage containing the search information to form the first website set.
15. The apparatus of claim 13,
the cleaning module is specifically configured to:
processing the websites in the first website set by one or more of the following items:
screening, de-duplicating and classifying.
16. The apparatus of claim 15, wherein the screening comprises:
and performing domain name matching on the websites in the first website set according to a preset domain name, to select the websites related to the preset domain name.
17. The apparatus of claim 16, wherein the performing domain name matching on the websites in the first website set according to a preset domain name comprises:
setting a regular expression according to the preset domain name;
and performing domain name matching on the websites in the first website set through the regular expression.
18. The apparatus of claim 15, wherein the de-duplication comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty, and de-duplicating the websites based on the titles which are not empty.
19. The apparatus of claim 18, wherein the de-duplicating the web address based on the title that is not empty comprises:
de-duplicating the websites based only on the titles that are not empty; or,
acquiring the release time and the author information of the webpage corresponding to each website whose title is not empty, and de-duplicating the websites according to the acquired title, release time and author information.
20. The apparatus of claim 15, wherein the classifying comprises:
acquiring the title of a webpage corresponding to the website in the first website set;
and judging whether the acquired title is empty or not, and classifying the website based on the title which is not empty.
21. The apparatus of claim 20, wherein the classifying the web addresses based on the non-empty titles comprises:
and calculating the semantic similarity between the titles which are not empty, and classifying the websites according to the semantic similarity.
22. The apparatus of claim 20,
the acquisition module is specifically configured to:
and acquiring the webpage data according to the classified websites, and storing the webpage data corresponding to the websites of the same class into one class.
23. The apparatus of any one of claims 13-22,
the acquisition module is specifically configured to:
and acquiring the webpage data corresponding to each website in the second website set by adopting a distributed acquisition mode.
24. The apparatus of any one of claims 13-22, further comprising:
and the website pool module is used for storing the second website set in a website pool which is constructed in advance.
25. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1-12.
26. An electronic device, comprising:
the computer-readable storage medium of claim 25; and
one or more processors to execute the program in the computer-readable storage medium.
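The cleaning steps recited in claims 3 to 9 (screening, de-duplicating and classifying) can be sketched roughly as follows. This is a minimal illustration and not the applicant's implementation: it reads each "website" as a URL, and the `Page` record, the function names, the subdomain-tolerant regular expression, the choice to drop entries with empty titles, the similarity threshold and the example domain are all assumptions. In particular, the word-overlap (Jaccard) score is only a simple stand-in for the semantic similarity of claim 9.

```python
import re
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass(frozen=True)
class Page:
    """Hypothetical record: a URL plus metadata already fetched for its page."""
    url: str
    title: Optional[str] = None
    published: Optional[str] = None
    author: Optional[str] = None


def screen(pages: List[Page], preset_domain: str) -> List[Page]:
    """Screening (claims 4-5): keep URLs whose host matches the preset domain."""
    # Regular expression built from the preset domain. Accepting subdomains
    # is an assumption; the claims only say a regular expression is used.
    pattern = re.compile(r"(^|\.)" + re.escape(preset_domain) + r"$", re.IGNORECASE)
    return [p for p in pages if pattern.search(urlparse(p.url).hostname or "")]


def deduplicate(pages: List[Page], use_time_and_author: bool = False) -> List[Page]:
    """De-duplicating (claims 6-7): keep the first page seen for each key."""
    seen = set()
    kept = []
    for p in pages:
        title = (p.title or "").strip()
        if not title:
            # Assumption: entries with an empty title are dropped here.
            continue
        key = (title, p.published, p.author) if use_time_and_author else (title,)
        if key not in seen:
            seen.add(key)
            kept.append(p)
    return kept


def word_overlap(a: str, b: str) -> float:
    """Jaccard overlap of lowercase word sets; a stand-in, not semantic similarity."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def classify(pages: List[Page], threshold: float = 0.5) -> List[List[Page]]:
    """Classifying (claims 8-9): greedy grouping of pages with non-empty titles."""
    groups: List[List[Page]] = []
    for p in pages:
        for group in groups:
            # Compare against the first member of each existing group.
            if word_overlap(p.title or "", group[0].title or "") >= threshold:
                group.append(p)
                break
        else:
            groups.append([p])
    return groups


def clean(first_set: List[Page], preset_domain: str) -> List[List[Page]]:
    """First website set in, classified second website set out."""
    screened = screen(first_set, preset_domain)
    unique = deduplicate(screened, use_time_and_author=True)
    return classify(unique)
```

With title-only keys, two different articles that happen to share a headline collapse into one entry; adding release time and author to the key, as in the second branch of claim 7, makes that less likely.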
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811010439.7A CN110874434A (en) | 2018-08-31 | 2018-08-31 | Webpage data acquisition method and device, storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811010439.7A CN110874434A (en) | 2018-08-31 | 2018-08-31 | Webpage data acquisition method and device, storage medium and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110874434A (en) | 2020-03-10 |
Family
ID=69715769
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811010439.7A Pending CN110874434A (en) | 2018-08-31 | 2018-08-31 | Webpage data acquisition method and device, storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110874434A (en) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117972179A (en) * | 2024-01-05 | 2024-05-03 | 深圳中泓在线股份有限公司 | Directional data acquisition normalization method, system and storage medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104038363A (en) * | 2013-10-24 | 2014-09-10 | 南京汇吉递特网络科技有限公司 | Method for acquiring and counting CCDN provider information |
CN104951512A (en) * | 2015-05-27 | 2015-09-30 | 中国科学院信息工程研究所 | Public sentiment data collection method and system based on Internet |
CN107045507A (en) * | 2016-02-05 | 2017-08-15 | 北京国双科技有限公司 | Web page crawl method and device |
CN107704515A (en) * | 2017-09-01 | 2018-02-16 | 安徽简道科技有限公司 | Data grab method based on internet data grasping system |
CN108121743A (en) * | 2016-11-30 | 2018-06-05 | 中移(苏州)软件技术有限公司 | A kind of generation of generic web pages masterplate and application method, system |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104038363A (en) * | 2013-10-24 | 2014-09-10 | 南京汇吉递特网络科技有限公司 | Method for acquiring and counting CCDN provider information |
CN104951512A (en) * | 2015-05-27 | 2015-09-30 | 中国科学院信息工程研究所 | Public sentiment data collection method and system based on Internet |
CN107045507A (en) * | 2016-02-05 | 2017-08-15 | 北京国双科技有限公司 | Web page crawl method and device |
CN108121743A (en) * | 2016-11-30 | 2018-06-05 | 中移(苏州)软件技术有限公司 | A kind of generation of generic web pages masterplate and application method, system |
CN107704515A (en) * | 2017-09-01 | 2018-02-16 | 安徽简道科技有限公司 | Data grab method based on internet data grasping system |
Non-Patent Citations (1)
Title |
---|
Hao Hui (郝慧): "A cross-database retrieval deduplication algorithm based on sci-tech novelty search" (一种基于科技查新的跨库检索去重算法), 《现代图书情报技术》 * |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN117972179A (en) * | 2024-01-05 | 2024-05-03 | 深圳中泓在线股份有限公司 | Directional data acquisition normalization method, system and storage medium |
CN117972179B (en) * | 2024-01-05 | 2024-10-22 | 深圳中泓在线股份有限公司 | Directional data acquisition normalization method, system and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
EP2823410B1 (en) | Entity augmentation service from latent relational data | |
JP5864586B2 (en) | Method and apparatus for ranking search results | |
TWI524193B (en) | Computer-readable media and computer-implemented method for semantic table of contents for search results | |
CN102930059B (en) | Method for designing focused crawler | |
JP5995409B2 (en) | Graphical model for representing text documents for computer analysis | |
JP5616444B2 (en) | Method and system for document indexing and data querying | |
CN108520007B (en) | Web page information extracting method, storage medium and computer equipment | |
Sisodia et al. | Fast prediction of web user browsing behaviours using most interesting patterns | |
CN111125298A (en) | Method, equipment and storage medium for reconstructing NTFS file directory tree | |
JP2017532655A (en) | Compress cascading style sheet files | |
CN110889023A (en) | Distributed multifunctional search engine of elastic search | |
CN107704620B (en) | Archive management method, device, equipment and storage medium | |
Bagade et al. | The Kauwa-Kaate fake news detection system | |
JP2013041385A (en) | Document retrieval method, document retrieval device, and document retrieval program | |
CN110874434A (en) | Webpage data acquisition method and device, storage medium and electronic equipment | |
JP4750628B2 (en) | Information ranking method and apparatus, program, and computer-readable recording medium | |
Rome et al. | Towards a formal concept analysis approach to exploring communities on the world wide web | |
CN105224583B (en) | Method and device for cleaning log files | |
CN105512232B (en) | Data storage method and device | |
JP6727097B2 (en) | Information processing apparatus, information processing method, and program | |
JP4189387B2 (en) | Knowledge search system, knowledge search method and program | |
CN118152506A (en) | Similar text filtering method and electronic device | |
CN109190003B (en) | Method and apparatus for determining list page nodes | |
CN105512230B (en) | Data storage method and device | |
CN109948015A (en) | A kind of Meta Search Engine tabulating result abstracting method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20200310 |