CN110704713B - Thesis data crawling method and system based on multiple data sources - Google Patents

Thesis data crawling method and system based on multiple data sources

Info

Publication number
CN110704713B
CN110704713B (application CN201910916820.8A)
Authority
CN
China
Prior art keywords
data
task
source
thesis
paper
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910916820.8A
Other languages
Chinese (zh)
Other versions
CN110704713A (en)
Inventor
Cui Jia
Zhang Yangsen
Li Chao
Ji Yuchun
Ma Huan
Miao Yanan
Hou Jinsheng
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Information Science and Technology University
National Computer Network and Information Security Management Center
Original Assignee
Beijing Information Science and Technology University
National Computer Network and Information Security Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Information Science and Technology University and National Computer Network and Information Security Management Center
Priority to CN201910916820.8A
Publication of CN110704713A
Application granted
Publication of CN110704713B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a thesis data crawling method and system based on multiple data sources, used to capture paper data for batches of keywords. Before a crawling task is executed, URLs are spliced from the keywords or from basic paper information and added to the queues to be crawled; during execution, the program is divided into several crawling sub-threads that take tasks from the multiple to-be-crawled queues, balanced by a task scheduling algorithm, and capture page source code; after execution, the required fields are parsed from the captured web page source code and the results are stored in a database to build a paper database. Compared with the prior art, the invention provides a more efficient and comprehensive paper crawling function: it responds quickly to user retrieval requests and presents the fused query results of all data sources to the user, so that users no longer need to screen and compare the retrieval results of each data source, which greatly facilitates use and saves users' time.

Description

Thesis data crawling method and system based on multiple data sources
Technical Field
The invention relates to a method and system for implementing paper crawler technology, and in particular to a method and system for crawling paper data based on multiple data sources, belonging to the technical field of information retrieval.
Background
Big data technology has developed from its early rise to maturity and now plays a major role in many industries. The field of artificial intelligence, revitalized by big data, has made major breakthroughs in recent years and created great value in scientific research and application; people have gradually come to realize that data is today's most important resource.
Now that the internet is highly developed, acquiring data has become very convenient. However, the sheer volume of information and platforms on the internet also causes trouble. Taking paper data as an example, the resources held by the various paper data websites are not identical: if a user searches only a single website, not all high-value papers can be found, while searching several websites wastes the user's time and energy on screening and comparing a large number of results.
Paper data contains enormous value. Combined with data mining and machine learning techniques, it can be analyzed for information such as the state of research, research hotspots, and trend changes in each subject field, and used to evaluate the research level and strength of schools, organizations, and scholars. It is therefore necessary to provide a comprehensive and efficient crawling method for acquiring paper data.
At present, data collection on the network is mainly realized with web crawler technology. Since Matthew Gray built the first internet crawler, the World Wide Web Wanderer, in 1993, industry and academia have researched web crawler systems continuously, pursuing ever higher crawling efficiency and stronger stability. Shaoxing increased the number of program threads to design a concurrent web crawler system based on Java multithreading, combining a Bloom filter with Redis caching to collect network data efficiently. Wang Shufen et al. designed and implemented a distributed topic crawler framework that avoids the unsuitability of the Hadoop distributed computing platform for wide-area-network deployment, uses message middleware for reliable distributed communication, and improves crawler efficiency through distribution. In related research abroad, Md. Abu Kausar et al. introduced mobile agent technology into crawler system implementation, so that web page source code analysis can be completed locally rather than remotely, improving the overall efficiency of the crawler system. Manish Kumar et al. designed a crawler algorithm that passes keywords to the search query interface of the website behind a URL in a query-based manner, aiming to quickly obtain the links a user cares about most.
In the acquisition and application of paper data, one research group designed a paper recommendation system based on a collaborative filtering algorithm, using a distributed crawler system to capture paper data and a customized collaborative filtering algorithm to compute recommendations for users. Others crawled the accepted papers of five conferences (ACL, ACM MM, ICML, KDD and SIGIR) and, after statistical analysis, summarized the research hotspots and trends in the field of information retrieval. Still others used CNKI as a data source and, after analyzing the crawled paper data with software such as SPSS, studied the problem of duplicate publication.
However, both at home and abroad there is relatively little research on paper crawling technology and applications that fuse multiple data sources.
Disclosure of Invention
In order to solve the problems in the prior art, the invention creatively provides a thesis data crawling method and system based on multiple data sources.
In order to achieve the above technical object, the present invention adopts the following technical means.
A thesis data crawling method based on multiple data sources comprises the following steps:
Step1, acquiring the keywords of the task to be captured, organizing them into a task, and sending the task to the to-be-captured keyword queue of the web page source code capture module;
Specifically, the keywords of the task to be crawled may come from a user's search input or be read from a keyword configuration file.
And step2, substituting the parameters required by the task to be captured into the retrieval result page URLs of each data source to complete construction of the specified page URLs, and distributing the tasks into the corresponding to-be-downloaded task queues according to data source.
A URL (Uniform Resource Locator) is the address of a standard resource on the Internet; it carries important information such as the website to be accessed, the access method, and the access parameters. The search page request URL of each data source website has a specific format and contains retrieval parameters, including the query keyword, the result page number, and the result sorting mode. The tasks to be captured are spliced into retrieval result page URLs for the several data source websites and distributed into the corresponding to-be-downloaded task queues according to data source.
In practice, the URL format of each data source website's search service can be summarized by collecting and analyzing concrete access requests to the different websites; the required parameters, including the search keyword, result sorting mode, and result page number, are spliced into that format to obtain the required request, so that the subsequent source code capture module can fetch source code directly from the automatically generated URL.
Specifically, for a data source that does not use an asynchronous update strategy when displaying paper search results (e.g., Wanfang Data, Chaoxing Journals, and the VIP Chinese journal database), the web page behind a URL accessed directly in a browser already contains the required paper data, so the required URL for subsequent access can be obtained by direct parameter substitution (search keyword, result sorting mode, and result page number) into the site's search page URL.
For a data source whose search results are loaded into the foreground page through asynchronous requests (e.g., CNKI), the specific paper data cannot be found in the original request. In that case, all requests issued by the website when the page is opened are captured with the browser's developer tools, and the asynchronous request URL of the required data is obtained by analyzing those requests.
It should be noted that, because URLs are usually URL-encoded for transmission over the network, the search keyword, result sorting mode, and result page number must be URL-encoded when the URL is constructed. As one realizable scheme, when developing in Java, the encode method of java.net.URLEncoder can be used.
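As a minimal illustrative sketch, such a request URL can be assembled in Java as follows; the URL template and parameter names are hypothetical, since each data source website defines its own query format:

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

public class SearchUrlBuilder {
    // Hypothetical template; real data source sites each define their own
    // parameter names for keyword, page number, and sort mode.
    private static final String TEMPLATE =
            "https://paper-site.example.com/search?q=%s&page=%d&sort=%s";

    /** Splices the encoded keyword, page number and sort mode into a request URL. */
    public static String build(String keyword, int page, String sortMode)
            throws UnsupportedEncodingException {
        String encodedKeyword = URLEncoder.encode(keyword, "UTF-8");
        String encodedSort = URLEncoder.encode(sortMode, "UTF-8");
        return String.format(TEMPLATE, encodedKeyword, page, encodedSort);
    }

    public static void main(String[] args) throws UnsupportedEncodingException {
        // e.g. https://paper-site.example.com/search?q=recurrent+neural+network&page=1&sort=relevance
        System.out.println(build("recurrent neural network", 1, "relevance"));
    }
}
```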
Step3, acquiring a task from the task queue to be downloaded by using a webpage source code downloader, and downloading a source code;
Specifically, the access interval for the different data source websites is set to 1 to 1.5 seconds; the response timeout has no special requirement and is generally set to 10 to 20 seconds.
The web page source code downloader consists of several sub-threads; the number of sub-threads equals the number of data sources, and each sub-thread corresponds to one data source.
Although the function used to crawl web page source code is the same for every source, splitting the work into multiple parallel threads has two advantages over a single thread:
1. Higher capture efficiency. Websites often limit the access frequency of a single IP to relieve server load, so an access interval must be observed. With a single thread, the access interval would have to be set to the largest interval among all data sources to keep the program stable, wasting a great deal of time.
After splitting into sub-threads, parameters such as the access interval and response timeout can be set per website; the threads run in parallel without affecting each other, greatly improving source code capture efficiency.
2. Better robustness. A single-threaded program handles emergencies poorly. For example, if one data source website suddenly fails, every capture against it consumes the full timeout while the other tasks in the queue can only wait, severely hurting the program's overall efficiency. With separate sub-threads, a problem at one data source does not affect the capture speed of the other sources' tasks, minimizing the overall impact.
In particular, as one implementation, the web page source code downloader is preferably developed in Java, which has good built-in support for multithreading; multithreading is realized by extending the Thread class. In the program's preparation stage, several sub-threads are started concurrently, each monitoring its own to-be-captured task queue; whenever the queue holds a task, the thread takes it out and processes it in a loop, finally putting the captured source code into a unified pending queue from which the web page source code collection classifier takes it for subsequent processing.
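The following minimal sketch shows one such downloader sub-thread; it assumes one BlockingQueue of URL tasks per data source and a shared result queue, and the class and method names are illustrative rather than taken from the original implementation:

```java
import java.util.concurrent.BlockingQueue;

/** One downloader sub-thread, bound to the task queue of a single data source. */
public class SourceDownloaderThread extends Thread {
    private final BlockingQueue<String> taskQueue;    // URLs for this data source
    private final BlockingQueue<String> resultQueue;  // shared queue of captured source code
    private final long accessIntervalMillis;          // per-site politeness interval

    public SourceDownloaderThread(BlockingQueue<String> taskQueue,
                                  BlockingQueue<String> resultQueue,
                                  long accessIntervalMillis) {
        this.taskQueue = taskQueue;
        this.resultQueue = resultQueue;
        this.accessIntervalMillis = accessIntervalMillis;
    }

    @Override
    public void run() {
        try {
            while (!isInterrupted()) {
                String url = taskQueue.take();        // block until a task arrives
                String html = fetchSource(url);       // HTTP capture, e.g. via HttpClient
                if (html != null) {
                    resultQueue.put(html);            // hand off to the collection classifier
                }
                Thread.sleep(accessIntervalMillis);   // respect this site's access interval
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();       // allow a clean shutdown
        }
    }

    private String fetchSource(String url) {
        // Placeholder; an HttpClient-based capture call is sketched after the
        // HttpClient discussion below.
        return null;
    }
}
```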
Furthermore, the invention adopts HttpClient as the web page source code capture tool. HttpClient is a subproject of Apache Jakarta Commons that provides an efficient, up-to-date, feature-rich client-side programming toolkit supporting the HTTP protocol. HttpClient provides access methods for common request types such as GET and POST, and its response timeout parameters can bound the duration of a single request, preventing the program from hanging because of network fluctuations or server problems.
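A capture call with Apache HttpClient 4.x might look like the sketch below; the timeout values are illustrative choices within the 10 to 20 second range discussed above:

```java
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.CloseableHttpResponse;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.CloseableHttpClient;
import org.apache.http.impl.client.HttpClients;
import org.apache.http.util.EntityUtils;

public class PageFetcher {
    // Bound connect/read time so a slow site cannot stall its thread indefinitely.
    private static final RequestConfig CONFIG = RequestConfig.custom()
            .setConnectTimeout(10_000)   // 10 s to establish the connection
            .setSocketTimeout(20_000)    // 20 s to wait for response data
            .build();

    /** Downloads the page source for one URL; returns null on failure. */
    public static String fetch(String url) {
        try (CloseableHttpClient client = HttpClients.custom()
                .setDefaultRequestConfig(CONFIG).build()) {
            HttpGet get = new HttpGet(url);
            try (CloseableHttpResponse response = client.execute(get)) {
                if (response.getStatusLine().getStatusCode() == 200) {
                    return EntityUtils.toString(response.getEntity(), "UTF-8");
                }
            }
        } catch (Exception e) {
            // Network fluctuation or server problem: report and let the caller retry or skip.
            System.err.println("Fetch failed for " + url + ": " + e.getMessage());
        }
        return null;
    }
}
```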
Step4, the webpage source code collecting classifier takes out the webpage source codes from the completion queue of the source code downloader, and divides the source codes into retrieval result page source codes and thesis detail page source codes according to the format characteristics of the webpage source codes;
the subsequent processing of the retrieval result page is shifted to Step5, and the subsequent processing of the source code of the paper detail page is shifted to Step 7;
step5, analyzing the paper data in the retrieval result page, organizing the paper data into tasks and sending the tasks to a paper detail page task scheduler;
the source code of the search result contains general information of all search results in a list form, such as the Hopkinson Web and the like, but the information only contains rough information of names, authors, sources and the like of papers, and further grabbing and analyzing detailed pages of the papers is needed. Referring to fig. 1, an exemplary diagram of a list information page of a web is shown.
In the step, rough thesis information (author, unit, journal name, publication time, thesis website id and the like) in a thesis retrieval result list is used for splicing a thesis detail page URL, three fields of the thesis name, the thesis detail page URL and a data source are used for organizing a thesis detail page task, and the task is sent to a thesis detail page task scheduler.
Step6, after receiving tasks, the paper detail page task scheduler distributes them evenly to the to-be-downloaded queues of the different data sources using a distribution algorithm, and processing returns to Step3;
Unlike the single-data-source case, retrieval results for the same keyword across multiple data sources contain substantial duplication; if every task were processed and every subsequent paper detail captured, a great deal of work would be repeated. On the other hand, simple deduplication by task name can leave the task volume per data source very uneven. For example, suppose a keyword yields 1000 results on the first data source and 1000 on the second, 800 of which are shared, and the first source's results enter the program first: the 800 shared tasks from the second source are filtered out directly, leaving 1000 tasks in the first source's pending queue but only 200 in the second's. The second source's processing thread finishes its 200 tasks and then idles, even though 800 of the first source's queued tasks could also have been handled by it. This wastes resources, and the advantage of multiple data sources cannot be fully exploited.
In order to solve the above problems, the present step is implemented by the following method:
step61, data deduplication;
In this step, a HashSet is used as the storage data structure, with an entity class designed for basic paper information as the stored element; the entity class contains a source member variable recording the data sources.
Specifically, when a new task enters the program, the current data set is first queried for it; if absent, it is added directly. If an equal element exists, that element is taken out, the new data source is added to its source field, and the element is stored again.
After this deduplication, every paper task appears exactly once in the task set and carries all the data sources from which it can be processed.
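A minimal sketch of this merge-on-duplicate logic follows; the class and field names are illustrative, and equality is keyed on the paper title here purely for simplicity:

```java
import java.util.HashSet;
import java.util.LinkedHashSet;
import java.util.Set;

/** Basic paper task; equal tasks are merged by accumulating their data sources. */
class PaperTask {
    final String title;
    final Set<String> sources = new LinkedHashSet<>(); // every site that can serve this paper

    PaperTask(String title, String source) {
        this.title = title;
        this.sources.add(source);
    }

    @Override public boolean equals(Object o) {
        return o instanceof PaperTask && title.equals(((PaperTask) o).title);
    }
    @Override public int hashCode() { return title.hashCode(); }
}

class TaskDeduplicator {
    private final Set<PaperTask> taskSet = new HashSet<>();

    /** Adds a task; if an equal task already exists, only its source set is extended. */
    public void add(PaperTask incoming) {
        for (PaperTask existing : taskSet) {
            if (existing.equals(incoming)) {
                existing.sources.addAll(incoming.sources); // merge the data sources
                return;
            }
        }
        taskSet.add(incoming);
    }

    public Set<PaperTask> tasks() { return taskSet; }
}
```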
Step62, distributing balanced task quantity for the queue to be downloaded of each data source;
distributing the tasks to be processed into a plurality of queues by a task scheduling algorithm, preferably, ensuring that the queues to be downloaded of each data source have relatively balanced task amount as much as possible, and the steps are as follows:
step621, taking a task out of the set of tasks to be processed;
step622, reading the task's set of available data sources;
step623, checking, one by one, the current to-be-downloaded queue size of each data source in the set;
step624, selecting the data source whose to-be-downloaded queue is smallest and appending the task to the tail of that data source's to-be-downloaded queue.
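A compact sketch of steps step621 to step624 follows, reusing the PaperTask sketch from the deduplication example above; the queue and method names are illustrative:

```java
import java.util.ArrayDeque;
import java.util.HashMap;
import java.util.Map;
import java.util.Queue;

class BalancedDispatcher {
    // One to-be-downloaded queue per data source, e.g. "CNKI", "Wanfang", ...
    private final Map<String, Queue<PaperTask>> downloadQueues = new HashMap<>();

    BalancedDispatcher(Iterable<String> dataSources) {
        for (String source : dataSources) {
            downloadQueues.put(source, new ArrayDeque<PaperTask>());
        }
    }

    /** Assigns a task to the shortest queue among its available data sources. */
    public void dispatch(PaperTask task) {                  // step621: one pending task
        String best = null;
        int bestSize = Integer.MAX_VALUE;
        for (String source : task.sources) {                // step622: its available sources
            Queue<PaperTask> queue = downloadQueues.get(source);
            if (queue != null && queue.size() < bestSize) { // step623: compare queue sizes
                best = source;
                bestSize = queue.size();
            }
        }
        if (best != null) {
            downloadQueues.get(best).add(task);             // step624: append to shortest queue
        }
    }
}
```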
Step7, parsing the paper data out of the paper detail page source code using rules specific to each data source;
Because the paper data fields that can be captured from different data source websites are not fully consistent, the paper data must be stored and displayed uniformly. To solve this, the invention extracts the common and important information from the heterogeneous information of the several data source websites as a unified set of data fields and designs a data table structure accordingly.
The specific rule is to unify the following database fields: id, Chinese title, English title, Chinese abstract, authors, Chinese keywords, author affiliation, Chinese journal name, fund project, publication date, and warehousing time of the paper.
Fig. 2 shows an example of a paper detail page retrieved from Wanfang Data.
Step8, storing the paper data results into a database.
Since one paper may correspond to several keywords, a large number of papers will be captured repeatedly even when a duplicate-free keyword list is used, and the duplication problem grows as the database grows. Deduplication before warehousing is therefore an indispensable step. Although duplicates could be rejected by a unique index on the data table, large amounts of repeated data would still attempt the insert, which significantly slows the system and burdens the database once the data reaches large volumes (e.g., millions of rows).
To solve the above problem, this step uses a Bloom filter to accomplish the deduplication task.
A Bloom filter is a filtering structure built on hashing: effectively a long binary vector together with a series of random mapping functions. It has great advantages on large-scale data sets: the required storage space and the write/query time are constant, its efficiency does not degrade as the data grows, and it does not occupy much additional space, which suits the business scenario of the invention well.
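The patent does not name a particular Bloom filter implementation; as one possible realization, Guava's BloomFilter could be used, as in this illustrative sketch (the capacity and false-positive rate are assumptions, not figures from the original):

```java
import com.google.common.hash.BloomFilter;
import com.google.common.hash.Funnels;
import java.nio.charset.StandardCharsets;

class PaperDeduplicationFilter {
    // Sized for up to ten million titles at a 0.1% false-positive rate;
    // both figures are illustrative.
    private final BloomFilter<CharSequence> seen = BloomFilter.create(
            Funnels.stringFunnel(StandardCharsets.UTF_8), 10_000_000, 0.001);

    /** Returns true the first time a title is seen, false on (probable) repeats. */
    public boolean markIfNew(String paperTitle) {
        if (seen.mightContain(paperTitle)) {
            return false;   // probably already stored; skip the database insert
        }
        seen.put(paperTitle);
        return true;
    }
}
```

A Bloom filter can report false positives but never false negatives, so at worst a small fraction of genuinely new papers would be skipped, while no duplicate insert attempt passes through.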
As a preferred scheme, when the task of step1 is triggered by a user's retrieval behavior, the multi-data-source paper data crawling method further comprises the following step:
and step9, feeding back the result to a foreground display interface.
More preferably, the invention builds the web pages with JSP + Servlet technology and beautifies them with CSS, realizing simple and attractive paper retrieval and information display pages.
The invention also provides a paper data crawling system based on multiple data sources, comprising the following modules: a task organization management module, a web page source code capture module, a paper data extraction module, and a web page retrieval display module. The architecture of the system is shown in fig. 3.
The task organization management module implements step1: acquiring the keywords of the task to be captured and organizing them into the to-be-captured keyword queue used to send tasks to the web page source code capture module.
The web page source code capture module implements step2: substituting the parameters required by the task to be captured into the retrieval result page URLs of each data source to complete construction of the specified page URLs, and distributing the tasks into the corresponding to-be-downloaded task queues according to data source; step3: acquiring tasks from the to-be-downloaded task queue with the web page source code downloader and downloading source code; and step4: taking web page source code out of the source code downloader's completion queue with the web page source code collection classifier and dividing it into retrieval result page source code and paper detail page source code according to its format characteristics.
The paper data extraction module implements step5: parsing the paper data in the retrieval result page, organizing it into tasks and sending them to the paper detail page task scheduler; step7: parsing the paper data out of the paper detail page source code for the different data sources; and step8: storing the paper data results in the database.
The web page retrieval display module implements Step9: feeding the results back to the foreground display interface; it also receives the user retrievals that enter Step1.
By adopting the technical scheme, the invention achieves the following technical effects.
1. The invention overcomes the situation that each website holds different data, making it easier for academic staff to retrieve paper data and providing an efficient batch data acquisition function; a paper database built with this technology can supply precious data resources for work such as data mining and machine learning.
2. Experiments show that the invention provides a more efficient and comprehensive paper crawling function: it responds quickly to user retrieval requests and presents the fused query results of all data sources to the user, so users need not screen and compare the retrieval results of each data source, which greatly facilitates use and saves users' time.
Drawings
FIG. 1 is an exemplary diagram of a CNKI list information page;
FIG. 2 is an exemplary diagram of a paper detail page retrieved from Wanfang Data;
FIG. 3 is an architecture diagram of a thesis data crawling system based on multiple data sources according to the present invention;
FIG. 4 is an exemplary diagram of a paper search web page in accordance with the present invention;
FIG. 5 is an exemplary diagram of an information presentation web page of the present invention;
FIG. 6 shows the task volume of each queue before and after task allocation with the allocation algorithm of the present invention for the keyword "Zhou Zhihua";
FIGS. 7 and 8 compare the amount of data crawled by a single-data-source crawler and by the multi-data-source thesis data crawling method of the present invention;
FIGS. 9 and 10 compare the running time of a single-data-source crawler and of the multi-data-source thesis data crawling method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, the technical solutions are described below clearly and completely with reference to the embodiments and the accompanying drawings. All other embodiments obtained by a person skilled in the art from these embodiments without creative effort fall within the protection scope of the present invention.
Examples
The thesis data crawling method and system based on multiple data sources are introduced below taking four major Chinese literature service websites as the data sources: CNKI, Wanfang Data, VIP (Weipu), and Chaoxing (SuperStar) Journals. Specifically, the method comprises the following steps:
Step1, acquiring the keywords of the task to be captured, organizing them into a task, and sending the task to the to-be-captured keyword queue of the web page source code capture module;
Specifically, the keywords of the task to be crawled may come from a user's search input or be read from a keyword configuration file.
And step2, substituting the parameters required by the task to be captured into the retrieval result page URLs of each data source to complete construction of the specified page URLs, and distributing the tasks into the corresponding to-be-downloaded task queues according to data source.
The parameters include the query keyword, the result page number, and the sorting mode.
Specifically, for the data sources that do not use an asynchronous update strategy when displaying paper search results, namely Wanfang Data, Chaoxing Journals, and the VIP Chinese journal database, the web page behind a URL accessed directly in a browser already contains the required paper data, so the required URL for subsequent access is obtained by direct parameter substitution into the site's search page URL.
For the data source whose search results are loaded into the foreground page through asynchronous requests, namely CNKI, the specific paper data cannot be found in the original request. All requests issued by the website when the page is opened are captured with the browser's developer tools, and the asynchronous request URL of the required data is obtained by analyzing those requests.
It should be noted that URLs are usually URL-encoded for transmission over the network, so the keyword is URL-encoded when the URL is constructed. As one realizable scheme, when developing in Java, the encode method of java.net.URLEncoder can be used.
Step3, acquiring a task from the task queue to be downloaded by using a webpage source code downloader, and downloading a source code;
Specifically, the web page source code downloader consists of several sub-threads; the number of sub-threads equals the number of data sources, and each corresponds to one data source.
In this embodiment, the web page source code downloader comprises four threads: a CNKI source code download thread, a Wanfang source code download thread, a VIP (Weipu) source code download thread, and a Chaoxing (SuperStar) source code download thread, used respectively to download page source code from CNKI, Wanfang, VIP, and Chaoxing.
Preferably, the access interval for the different data source websites is set to 1 to 1.5 seconds; the response timeout has no special requirement and is generally set to 10 to 20 seconds.
In particular, as one implementation, the web page source code downloader is preferably developed in Java, which has good built-in support for multithreading; multithreading is realized by extending the Thread class. In the program's preparation stage, several sub-threads are started concurrently, each monitoring its own to-be-captured task queue; whenever the queue holds a task, the thread takes it out and processes it in a loop, finally putting the captured source code into a unified pending queue from which the web page source code collection classifier takes it for subsequent processing.
Furthermore, the invention adopts HttpClient as the web page source code capture tool.
Step4, the webpage source code collecting classifier takes out the webpage source codes from the completion queue of the source code downloader, and divides the source codes into retrieval result page source codes and thesis detail page source codes according to the format characteristics of the webpage source codes;
the subsequent processing of the retrieval result page is shifted to Step5, and the subsequent processing of the source code of the paper detail page is shifted to Step 7;
step5, analyzing the paper data in the retrieval result page, organizing the paper data into tasks, sending the tasks to a paper detail page task scheduler, and turning to Step 6;
in the step, rough thesis information (author, unit, journal name, publication time, thesis website id and the like) in a thesis retrieval result list is used for splicing a thesis detail page URL, three fields of the thesis name, the thesis detail page URL and a data source are used for organizing a thesis detail page task, and the task is sent to a thesis detail page task scheduler.
Step6, after receiving tasks, the paper detail page task scheduler distributes them evenly to the to-be-downloaded queues of the different data sources using a distribution algorithm;
It comprises the following steps:
Step61, data deduplication;
In this step, a HashSet is used as the storage data structure, with an entity class designed for basic paper information as the stored element; the entity class contains a source member variable recording the data sources.
Specifically, when a new task enters the program, the current data set is first queried for it; if absent, it is added directly. If an equal element exists, that element is taken out, the new data source is added to its source field, and the element is stored again.
After this deduplication, every paper task appears exactly once in the task set and carries all the data sources from which it can be processed.
Step62, distributing balanced task quantity for the queue to be downloaded of each data source;
distributing the tasks to be processed into a plurality of queues by a task scheduling algorithm, preferably, ensuring that the queues to be downloaded of each data source have relatively balanced task amount as much as possible, and the steps are as follows:
step621, taking a task out of the set of tasks to be processed;
step622, reading the task's set of available data sources;
step623, checking, one by one, the current to-be-downloaded queue size of each data source in the set;
step624, selecting the data source whose to-be-downloaded queue is smallest and appending the task to the tail of that data source's to-be-downloaded queue.
Step7, parsing the paper data out of the paper detail page source code using rules specific to each data source;
The common and important information is extracted from the heterogeneous information of the several data source websites as a unified set of data fields, and a data table structure is designed.
The specific rule is to unify the following database fields: id, Chinese title, English title, Chinese abstract, authors, Chinese keywords, author affiliation, Chinese journal name, fund project, publication date, and warehousing time of the paper. See Table 1 for details.
TABLE 1 (unified paper data table structure; reproduced as images in the original publication)
Step8, storing the paper data results into a database.
In this step, a Bloom filter completes the deduplication task, after which the paper data results are stored in the database.
As a preferred scheme, when the task is triggered by a user's retrieval behavior, the multi-data-source paper data crawling method further comprises the following step:
and step9, feeding back the result to a foreground display interface.
More preferably, the invention builds the web pages with JSP + Servlet technology and beautifies them with CSS, realizing simple and attractive paper retrieval and information display pages, as shown in figs. 4 and 5.
The thesis data crawling system based on multiple data sources comprises the following modules: a task organization management module, a web page source code capture module, a paper data extraction module, and a web page retrieval display module. The task organization management module implements step1: acquiring the keywords of the task to be captured and organizing them into the to-be-captured keyword queue used to send tasks to the web page source code capture module.
The web page source code capture module implements step2: substituting the parameters required by the task to be captured into the retrieval result page URLs of each data source to complete construction of the specified page URLs, and distributing the tasks into the corresponding to-be-downloaded task queues according to data source; step3: acquiring tasks from the to-be-downloaded task queue with the web page source code downloader and downloading source code; and step4: taking web page source code out of the source code downloader's completion queue with the web page source code collection classifier and dividing it into retrieval result page source code and paper detail page source code according to its format characteristics.
The paper data extraction module implements step5: parsing the paper data in the retrieval result page, organizing it into tasks and sending them to the paper detail page task scheduler; step7: parsing the paper data out of the paper detail page source code for the different data sources; and step8: storing the paper data results in the database.
The web page retrieval display module implements Step9: feeding the results back to the foreground display interface; it also receives the user retrievals that enter Step1.
Experimental examples
Experiments were performed as described in the detailed description above. The test machine was configured with a seventh-generation Intel Core i7-7700HQ CPU @ 2.80 GHz and 16 GB of memory, running the Windows 10 Professional operating system.
The system was developed in Java; the development platform was Eclipse Luna Service Release 2 (4.4.2), the web server was Apache Tomcat 8.5.31, and the database was MySQL 5.6.36.
Fig. 6 shows the task volume of each queue before and after task allocation with the allocation algorithm of the present invention for the keyword "Zhou Zhihua"; it can be seen that the algorithm effectively balances and reduces the tasks of each queue while removing duplicates.
Under good network conditions, with HttpClient used for web page source code capture and Jsoup for web page parsing, data was crawled with a single-data-source crawler program and with the multi-data-source thesis data crawling method, and the coverage and crawling efficiency of the two were compared. Two keywords, "recurrent neural network" and "Zhou Zhihua", were selected, and five groups of experiments were run for each keyword.
The volumes of the paper data crawled from different data sources are shown in fig. 7 and 8.
On the premise that all retrieval results for the keywords were crawled completely, the running times of the single-data-source crawler and of the multi-data-source thesis data crawling method are compared in figs. 9 and 10.
Compared with a crawler system using only a single data source, the multi-data-source thesis data crawling method obtains the most results for a single keyword and better guarantees the comprehensiveness of the crawled papers; at the same time, because it runs with cooperating threads and a balanced task distribution scheduling strategy, it also achieves superior crawling efficiency and performs well on the target tasks.
The technical solution provided by the present invention is not limited to the above embodiments; all technical solutions formed by transformation and substitution based on the structure and approach of the present invention fall within its protection scope.

Claims (8)

1. A thesis data crawling method based on multiple data sources is characterized by comprising the following steps:
step1. acquiring keywords of a task to be grabbed, organizing the keywords into a task, and sending the task to a keyword queue to be grabbed of a webpage source code grabbing module;
step2, replacing parameters required by the tasks to be captured into the retrieval result page URLs of all the data sources to complete the replacement of the designated page URLs, and distributing the tasks to corresponding task queues to be downloaded according to different data sources;
step3, acquiring a task from the task queue to be downloaded by using a webpage source code downloader, and downloading a source code;
step4, the webpage source code collecting classifier takes out the webpage source codes from the completion queue of the source code downloader, and divides the source codes into thesis detail page source codes and retrieval result page source codes according to the format characteristics of the webpage source codes;
the subsequent processing of the source code of the retrieval result is shifted to Step5, and the subsequent processing of the source code of the detail page of the paper is shifted to Step 7;
step5, analyzing the paper data in the retrieval result page, organizing the paper data into tasks and sending the tasks to a paper detail page task scheduler;
step6, after receiving the task, the paper detail page task scheduler distributes the task to different data source queues to be downloaded in a balanced manner by using a distribution algorithm, and then Step3 is carried out;
step7, analyzing the thesis data from the source code of the thesis detail page aiming at different data sources;
step8, storing the result of the paper data into a database;
wherein, Step6. comprises the following steps,
step61, data deduplication;
using HashSet as a storage data structure, and using an entity class designed aiming at the basic information of a thesis as a storage element, wherein the entity class comprises a source member variable representing a data source;
after a new task enters a program, firstly inquiring whether the new task exists in a current data set or not, and if not, directly adding the new task; if the element exists, the element is taken out, and the new data source is added into the source field of the element and then the element is stored again.
Step62, allocating balanced task amount to the queue to be downloaded of each data source.
2. The multi-data-source-based thesis data crawling method of claim 1, wherein:
step62. comprises the following steps,
step621, taking out the task from the task set to be processed;
step622, reading an available data source set of the task;
step623, checking the current to-be-downloaded queue size of each data source in the set one by one;
step624, selecting the data source whose to-be-downloaded queue is smallest and adding the task to the tail of that data source's to-be-downloaded queue.
3. The multi-data-source-based thesis data crawling method of claim 2, wherein: in step1, the keywords of the task to be grabbed are input from the user's search, or are obtained from a keyword configuration file.
4. The multi-data-source-based thesis data crawling method of claim 3, wherein: the webpage source code downloader is composed of a plurality of sub-threads, the number of the sub-threads is the same as the number of the data sources, and each sub-thread corresponds to one data source.
5. The multi-data-source-based thesis data crawling method of claim 4, wherein: in Step5., a thesis detail page URL is spliced using the thesis information in the retrieval result list, including the author, affiliation, journal name, publication time and website id of the thesis; a thesis detail page task is organized from three fields, namely the thesis name, the thesis detail page URL and the data source, and is sent to the thesis detail page task scheduler.
6. The multi-data-source-based thesis data crawling method of claim 5, wherein:
step7, analyzing the paper data from the source code of the detail page of the paper aiming at different data sources, wherein the rule is that the data fields of id, Chinese title of the paper, English title of the paper, Chinese abstract of the paper, author of the paper, Chinese keyword of the paper, author unit, Chinese name of periodical, fund project, publication date and warehousing time of the paper in the database are unified.
7. A thesis data crawling method based on multiple data sources as claimed in claim 3 or 4, wherein: step8. storing the paper data results into a database also includes a bloom filter deduplication process.
8. A thesis data crawling system based on multiple data sources is characterized in that: it includes four modules: a task organization management module, a webpage source code capturing module, a paper data extracting module and a webpage retrieval display module,
the task organization management module is used for achieving step1. acquiring keywords of a task to be grabbed, organizing the keywords into a task to be grabbed keyword queue, and sending the task to the webpage source code grabbing module;
the webpage source code grabbing module is used for realizing step2. the parameters required by the task to be grabbed are replaced into the retrieval result page URL of each data source to complete the replacement of the specified page URL, and the task is distributed into the corresponding task queue to be downloaded according to different data sources; step3, acquiring a task from the task queue to be downloaded by using a webpage source code downloader, and downloading a source code; step4, the webpage source code collecting classifier takes out the webpage source codes from the completion queue of the source code downloader, and divides the source codes into retrieval result page source codes and thesis detail page source codes according to the format characteristics of the webpage source codes; step6, after receiving the tasks, the paper detail page task scheduler evenly distributes the tasks to different data source queues to be downloaded by using a distribution algorithm, and then Step3 is carried out;
the paper data extraction module is used for realizing step5. analyzing the paper data in the retrieval result page, organizing the paper data into tasks and sending the tasks to the paper detail page task scheduler; step7, analyzing the thesis data from the source code of the thesis detail page aiming at different data sources; step8, storing the result of the paper data into a database;
the webpage retrieval display module is used for realizing Step9. the result is fed back to a foreground display interface, and the user retrieval entering Step1 is undertaken;
wherein, Step6. comprises the following steps,
step61, data deduplication;
using HashSet as a storage data structure, and using an entity class designed aiming at the basic information of a thesis as a storage element, wherein the entity class comprises a source member variable representing a data source;
after a new task enters a program, firstly inquiring whether the new task exists in a current data set or not, and if not, directly adding the new task; if the element exists, the element is taken out, and the new data source is added into the source field of the element and then the element is stored again.
Step62, allocating balanced task amount to the queue to be downloaded of each data source.
CN201910916820.8A 2019-09-26 2019-09-26 Thesis data crawling method and system based on multiple data sources Active CN110704713B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910916820.8A CN110704713B (en) 2019-09-26 2019-09-26 Thesis data crawling method and system based on multiple data sources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910916820.8A CN110704713B (en) 2019-09-26 2019-09-26 Thesis data crawling method and system based on multiple data sources

Publications (2)

Publication Number Publication Date
CN110704713A CN110704713A (en) 2020-01-17
CN110704713B true CN110704713B (en) 2022-02-08

Family

ID=69197526

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910916820.8A Active CN110704713B (en) 2019-09-26 2019-09-26 Thesis data crawling method and system based on multiple data sources

Country Status (1)

Country Link
CN (1) CN110704713B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112650570A * 2020-12-29 2021-04-13 Baiguoyuan Technology (Singapore) Co., Ltd. Dynamically expandable distributed crawler system, data processing method and device
CN113722572B * 2021-10-11 2024-03-29 Shanghai Yilu Software Co., Ltd. Distributed deep crawling method, device and medium
CN113987146B * 2021-10-22 2023-01-31 State Grid Jiangsu Electric Power Co., Ltd. Zhenjiang Power Supply Branch Dedicated intelligent question-answering system for the electric power intranet

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885777A * 2017-10-11 2018-04-06 Beijing Zhihui Xingguang Information Technology Co., Ltd. A control method and system for capturing web data based on cooperative crawlers

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070038608A1 (en) * 2005-08-10 2007-02-15 Anjun Chen Computer search system for improved web page ranking and presentation
CN103049575B * 2013-01-05 2015-08-19 Huazhong University of Science and Technology A topic-adaptive academic conference search system
US9887933B2 * 2014-10-31 2018-02-06 The Nielsen Company (Us), Llc Method and apparatus to throttle media access by web crawlers
CN109213908A * 2018-08-01 2019-01-15 Zhejiang University of Technology An academic conference paper supply system based on data mining
CN109325161A * 2018-09-11 2019-02-12 Wuba Co., Ltd. Public opinion data crawling method, device, equipment and storage medium
CN109543086B * 2018-11-23 2022-11-22 Beijing Information Science and Technology University Network data acquisition and display method oriented to multiple data sources
CN109902182A * 2019-01-30 2019-06-18 Beijing Baidu Netcom Science and Technology Co., Ltd. Knowledge data processing method, device, equipment and storage medium

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107885777A * 2017-10-11 2018-04-06 Beijing Zhihui Xingguang Information Technology Co., Ltd. A control method and system for capturing web data based on cooperative crawlers

Also Published As

Publication number Publication date
CN110704713A (en) 2020-01-17

Similar Documents

Publication Publication Date Title
Khder Web scraping or web crawling: State of art, techniques, approaches and application.
CN110704713B (en) Thesis data crawling method and system based on multiple data sources
CN108875091B (en) Distributed web crawler system with unified management
CN103927370B (en) Network information batch acquisition method of combined text and picture information
US20110246482A1 (en) Augmented and cross-service tagging
CN103970788A (en) Webpage-crawling-based crawler technology
JP2021515950A (en) Systems and methods for cloud computing
US9582572B2 (en) Personalized search library based on continual concept correlation
CN104077402A (en) Data processing method and data processing system
CN106407442B (en) A kind of mass text data processing method and device
CN103399877A (en) Multi-Android-client service sharing method and system
JP2016194921A (en) Removal of old item in curated content
CN113656673A (en) Master-slave distributed content crawling robot for advertisement delivery
Aggarwal et al. Small files’ problem in Hadoop: A systematic literature review
Liang et al. Co-clustering WSDL documents to bootstrap service discovery
CN104765823A (en) Method and device for collecting website data
CN116016702A (en) Application observable data acquisition processing method, device and medium
CN109145233A (en) internet information acquisition system
Xie et al. Design and implementation of the topic-focused crawler based on Scrapy
CN112231093A (en) Data acquisition method and system based on code template and coroutine pool and electronic equipment
GB2572544A (en) System and method of crawling a wide area computer network for retrieving contextual information
CN113407803A (en) Method for acquiring internet data in one step
Ye et al. The research and implementation of a distributed crawler system based on Apache Flink
CN112860844A (en) Case clue processing system, method and device and computer equipment
Xu et al. The application of web crawler in city image research

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant