CN111859073A

CN111859073A - Python-based unstructured data real-time crawling system and using method thereof

Info

Publication number: CN111859073A
Application number: CN202010729806.XA
Authority: CN
Inventors: 官鲁卫; 陈霞
Original assignee: Guangxi Meicube Engineering Consulting Co Ltd
Current assignee: Guangxi Meicube Engineering Consulting Co Ltd
Priority date: 2020-07-27
Filing date: 2020-07-27
Publication date: 2020-10-30

Abstract

The invention discloses a Python-based unstructured data real-time crawling system comprising a crawler cluster, a temporary storage database, a data migration module and a target database. The data migration module sorts the data stored in the temporary storage database in blocks and migrates the sorted data to the target database for storage. Because the data migration module sorts the unstructured data in the temporary storage database and then migrates it to the target database, the same data is not stored redundantly in both databases. In addition, using the temporary storage database as a storage buffer reduces the sorting load on the data migration module and helps preserve the logical integrity of the data in the target database, which in turn supports efficient information retrieval from the target database.

Description

Python-based unstructured data real-time crawling system and using method thereof

Technical Field

The invention relates to the technical field of network big data, and in particular to a Python-based unstructured data real-time crawling system and a method of using it.

Background

With the rapid development of the Internet, it has penetrated every aspect of people's lives, and needs ranging from material goods to intellectual and cultural information can now be met through it.

With the explosive growth of information, hundreds of millions of websites continue to emerge, and the number of web pages indexed by search engines has risen sharply.

The abundance of information on the Internet brings great convenience, since people can obtain all kinds of information quickly and efficiently. However, this explosion of information also causes information overload for users, and how to quickly select the information a user needs from a massive volume of data is an increasingly urgent problem.

Python is a cross-platform computer programming language. It is a high-level scripting language that combines interpreted, compiled, interactive and object-oriented features. It was originally designed for writing automation scripts (shell scripts); as new versions were released and new language features were added, it came to be used more and more for developing standalone, large-scale projects, and today it is widely used for system administration tasks and Web programming.

Disclosure of Invention

The invention aims to remedy the defects of the prior art and provides a Python-based unstructured data real-time crawling system and a method of using it.

A Python-based unstructured data real-time crawling system comprises a crawler cluster, a temporary storage database, a data migration module and a target database;

the crawler cluster comprises a plurality of web crawlers configured for different crawl targets, and each web crawler crawls unstructured data from its corresponding crawl target in real time;

the temporary storage database is connected to the crawler cluster and stores the data crawled by each web crawler in real time;

the data migration module is connected to both the temporary storage database and the target database;

and the data migration module sorts the data stored in the temporary storage database in blocks and migrates the sorted data to the target database for storage.

Preferably, a cache area is preset in the data migration module; the data migration module extracts unstructured data from the temporary storage database according to time-sequence information and stores it in the cache area, performs feature extraction and labeling on the unstructured data in the cache area, and migrates the data in the cache area to the target database for storage according to the labeling result.

Preferably, after the data migration module labels the data in the cache area, it selects the sub-library in the target database that corresponds to the label and stores the labeled data there; when the target database has no sub-library matching the label of a piece of labeled data, the data migration module instructs the target database to create a sub-library for that label according to the labeling result and sends the data to that sub-library for storage.

Preferably, the system further comprises a data retrieval module connected to both the target database and the temporary storage database; the data retrieval module obtains a retrieval instruction and searches the target database and the temporary storage database respectively according to the retrieval instruction to obtain retrieval results.

Preferably, after obtaining the retrieval instruction, the data retrieval module first searches the target database and sorts the results obtained from it by their relevance to the retrieval instruction to form an information queue; the data retrieval module then searches the temporary storage database according to the retrieval instruction and inserts the results obtained into the information queue according to their relevance to the retrieval instruction.

Preferably, at least some of the web crawlers in the crawler cluster are open-source web crawlers.

Preferably, the system adopts a distributed framework structure.

A method of using the Python-based unstructured data real-time crawling system comprises the following steps:

S1, keywords are input, and the data retrieval module generates a retrieval instruction according to the keywords;

S2, the retrieval results of the data retrieval module are obtained, and the expected information is selected from the retrieval results as a response;

S3, the data retrieval module ends the retrieval action according to the response information obtained.

Preferably, in step S2, the expected information is responded to by clicking a link; step S3 specifically comprises: when any retrieval result is clicked and the browsing time exceeds a preset time threshold, the data retrieval module ends the retrieval action.

In the Python-based unstructured data real-time crawling system provided by the invention, the temporary storage database stores the unstructured data crawled by the web crawlers, and the target database stores the ordered data sorted by the data migration module. The target database therefore facilitates subsequent data retrieval and access.

In the invention, the data migration module sorts the unstructured data in the temporary storage database and migrates the sorted data to the target database for storage, so the same data is not stored redundantly in both databases. In addition, using the temporary storage database as a storage buffer reduces the sorting load on the data migration module and helps preserve the logical integrity of the data in the target database, which in turn supports efficient information retrieval from the target database.

In addition, the invention sorts the unstructured data in the temporary storage database in blocks, which helps improve data processing efficiency and reduces interference among data.

The invention also provides a method of using the Python-based unstructured data real-time crawling system. The method judges automatically, from the user's response to the retrieval results, whether the user has obtained the expected information, and ends the retrieval automatically according to that judgment. The working thread is thus terminated intelligently, which helps avoid pointless thread occupation, maintains the working efficiency of the system, and saves energy.

Drawings

FIG. 1 is a schematic diagram of a Python-based unstructured data real-time crawling system according to the present invention;

FIG. 2 is a schematic diagram of another Python-based unstructured data real-time crawling system according to the present invention;

FIG. 3 is a flow chart of a method for using the Python-based unstructured data real-time crawling system according to the present invention.

Detailed Description

Referring to FIG. 1, the Python-based unstructured data real-time crawling system provided by the invention comprises a crawler cluster, a temporary storage database, a data migration module and a target database.

The crawler cluster comprises a plurality of web crawlers configured for different crawl targets, and each web crawler crawls unstructured data from its corresponding crawl target in real time. In this embodiment, the web crawlers are set to collect unstructured data, which avoids restricting the kind of data crawled and helps ensure the breadth of crawling, so that the crawled data is rich and comprehensive.

Specifically, in this embodiment, at least some of the web crawlers in the crawler cluster are open-source web crawlers, so that the crawlers can be conveniently adjusted to the data requirements, improving the applicability and flexibility of the system.

The temporary storage database is connected to the crawler cluster and stores the data crawled by each web crawler in real time.
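As an illustration only, the relationship between the crawler cluster and the temporary storage database might be sketched as follows. The sketch assumes an in-memory SQLite database as a stand-in for the temporary storage database and plain callables in place of real HTTP crawlers; the names `StagingStore`, `run_cluster` and the `staging` table are hypothetical and do not come from the patent.

```python
import sqlite3
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict


class StagingStore:
    """Temporary storage database (assumption: SQLite stands in for it)."""

    def __init__(self) -> None:
        # check_same_thread=False lets crawler threads share one connection;
        # the lock serialises access to it.
        self._conn = sqlite3.connect(":memory:", check_same_thread=False)
        self._lock = threading.Lock()
        self._conn.execute(
            "CREATE TABLE staging ("
            "id INTEGER PRIMARY KEY AUTOINCREMENT, "
            "source TEXT, crawled_at REAL, payload TEXT)"
        )

    def save(self, source: str, payload: str) -> None:
        """Store one crawled item together with its crawl timestamp."""
        with self._lock:
            self._conn.execute(
                "INSERT INTO staging (source, crawled_at, payload) VALUES (?, ?, ?)",
                (source, time.time(), payload),
            )
            self._conn.commit()

    def count(self) -> int:
        with self._lock:
            return self._conn.execute("SELECT COUNT(*) FROM staging").fetchone()[0]


def run_cluster(fetchers: Dict[str, Callable[[], str]], store: StagingStore) -> None:
    """Run one crawler per crawl target and store each result as it arrives."""

    def crawl(source: str, fetch: Callable[[], str]) -> None:
        store.save(source, fetch())

    with ThreadPoolExecutor(max_workers=max(1, len(fetchers))) as pool:
        futures = [pool.submit(crawl, s, f) for s, f in fetchers.items()]
        for future in futures:
            future.result()  # re-raise any crawler error


# Hypothetical crawl targets; real crawlers would issue HTTP requests here.
demo_fetchers = {
    "news_site": lambda: "<html>raw news page</html>",
    "forum": lambda: "free-text forum post",
}
demo_store = StagingStore()
run_cluster(demo_fetchers, demo_store)
```

Each crawler writes its raw result without any schema beyond a source name, a timestamp and a payload, which mirrors the idea that the temporary storage database accepts unstructured data as it is crawled.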

The data migration module is connected to both the temporary storage database and the target database.

Thus the temporary storage database stores the unstructured data crawled by the web crawlers, and the target database stores the ordered data sorted by the data migration module. In this embodiment, the target database therefore facilitates subsequent data retrieval and access.

In this embodiment, the data migration module sorts the unstructured data in the temporary storage database and migrates the sorted data to the target database for storage, so the same data is not stored redundantly in both databases. In addition, using the temporary storage database as a storage buffer reduces the sorting load on the data migration module and helps preserve the logical integrity of the data in the target database, which in turn supports efficient information retrieval from the target database.

In addition, in this embodiment the unstructured data in the temporary storage database is sorted in blocks, which helps improve data processing efficiency and reduces interference among data.
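The text does not define block sorting in detail. One possible reading, offered here only as a sketch, is that the records are split into fixed-size blocks and each block is sorted independently. The `Record` layout, the function names and the choice of the crawl timestamp as sort key are assumptions made for illustration.

```python
from typing import Iterator, List, Sequence, Tuple

Record = Tuple[float, str]  # (crawl timestamp, raw payload)


def iter_blocks(records: Sequence[Record], block_size: int) -> Iterator[List[Record]]:
    """Yield consecutive blocks of at most block_size records."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    for start in range(0, len(records), block_size):
        yield list(records[start:start + block_size])


def sort_in_blocks(records: Sequence[Record], block_size: int) -> List[List[Record]]:
    """Sort each block by timestamp on its own, without touching other blocks."""
    return [
        sorted(block, key=lambda record: record[0])
        for block in iter_blocks(records, block_size)
    ]


raw = [(3.0, "c"), (1.0, "a"), (2.0, "b"), (5.0, "e"), (4.0, "d")]
blocks = sort_in_blocks(raw, block_size=2)
```

Because each block is handled separately, a malformed record only affects its own block, which is one way to interpret the stated reduction of interference among data.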

In this embodiment, a cache area is preset in the data migration module. The data migration module extracts unstructured data from the temporary storage database according to time-sequence information and stores it in the cache area, performs feature extraction and labeling on the unstructured data in the cache area, and migrates the data in the cache area to the target database for storage according to the labeling result. The cache area makes it convenient for the data migration module to sort the data, because the module does not have to access the temporary storage database while sorting, which further improves sorting efficiency.
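The extract, label and migrate sequence described above can be sketched as follows. This is a minimal illustration under stated assumptions: plain Python lists and dictionaries stand in for the two databases, and a hypothetical keyword table (`LABEL_KEYWORDS`) stands in for the feature extraction, which the text does not specify.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class RawRecord:
    crawled_at: float
    payload: str


# Hypothetical keyword rules standing in for real feature extraction.
LABEL_KEYWORDS = {"finance": ("stock", "bank"), "sports": ("match", "league")}


def label_of(payload: str) -> str:
    """Return the first label whose keywords appear in the text, else 'misc'."""
    text = payload.lower()
    for label, keywords in LABEL_KEYWORDS.items():
        if any(word in text for word in keywords):
            return label
    return "misc"


def migrate(staging: List[RawRecord], target: Dict[str, List[str]], batch: int) -> int:
    """Move up to `batch` oldest records from staging to target via a cache."""
    # 1. Extract in time order into the cache and remove from staging,
    #    so the same record is never held by both databases.
    staging.sort(key=lambda record: record.crawled_at)
    cache = staging[:batch]
    del staging[:batch]
    # 2. Label inside the cache, then 3. store under that label in the target.
    for record in cache:
        target.setdefault(label_of(record.payload), []).append(record.payload)
    return len(cache)


staging_db = [
    RawRecord(2.0, "League match postponed"),
    RawRecord(1.0, "Bank raises rates"),
    RawRecord(3.0, "Weather is mild"),
]
target_db: Dict[str, List[str]] = {}
moved = migrate(staging_db, target_db, batch=2)
```

All labeling work happens on the `cache` list, so the staging list is touched only once per batch.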

In this embodiment, after the data migration module labels the data in the cache area, it selects the sub-library in the target database that corresponds to the label and stores the labeled data there. When the target database has no sub-library matching the label of a piece of labeled data, the data migration module instructs the target database to create a sub-library for that label according to the labeling result and sends the data to that sub-library for storage. In this way, the data migration module creates sub-libraries in the target database according to the labels of the sorted data, which ensures that the sorted and labeled data is stored in an orderly manner. When sub-libraries already exist in the target database, the data migration module first matches the label of the data against the labels of the existing sub-libraries, so that no two sub-libraries carry the same label.
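The select-or-create behaviour for sub-libraries might look like the following sketch. The `TargetDatabase` class and `store_labeled` function are hypothetical names, an in-memory dictionary stands in for the target database, and the label normalisation step is an added assumption about how duplicate labels could be avoided.

```python
from typing import Dict, List


class TargetDatabase:
    """Target database holding one sub-library per label (in-memory stand-in)."""

    def __init__(self) -> None:
        self._sub_libraries: Dict[str, List[str]] = {}

    def has_sub_library(self, label: str) -> bool:
        return label in self._sub_libraries

    def create_sub_library(self, label: str) -> None:
        if self.has_sub_library(label):
            raise ValueError(f"sub-library {label!r} already exists")
        self._sub_libraries[label] = []

    def store(self, label: str, item: str) -> None:
        self._sub_libraries[label].append(item)

    def labels(self) -> List[str]:
        return sorted(self._sub_libraries)

    def items(self, label: str) -> List[str]:
        return list(self._sub_libraries[label])


def store_labeled(db: TargetDatabase, label: str, item: str) -> bool:
    """Store item under label; return True if a new sub-library was created."""
    key = label.strip().lower()  # normalise so 'News' and 'news' share a library
    created = False
    if not db.has_sub_library(key):
        db.create_sub_library(key)  # the "notify the target database" step
        created = True
    db.store(key, item)
    return created


db = TargetDatabase()
results = [
    store_labeled(db, "News", "article 1"),
    store_labeled(db, "news", "article 2"),
    store_labeled(db, "Forum", "post 1"),
]
```

Matching against existing labels before creating a sub-library is what prevents two sub-libraries with the same label.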

Referring to FIG. 2, the Python-based unstructured data real-time crawling system in this embodiment further comprises a data retrieval module connected to both the target database and the temporary storage database. The data retrieval module obtains a retrieval instruction and searches the target database and the temporary storage database respectively according to the retrieval instruction to obtain retrieval results.

Specifically, in this embodiment, after obtaining the retrieval instruction, the data retrieval module first searches the target database and sorts the results obtained from it by their relevance to the retrieval instruction to form an information queue. The data retrieval module then searches the temporary storage database according to the retrieval instruction and inserts the results obtained into the information queue according to their relevance to the retrieval instruction.

In this embodiment, searching the target database first improves retrieval efficiency, while searching the temporary storage database supplements the retrieved information and helps ensure that the retrieval results are comprehensive and diverse.
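The two-stage retrieval can be sketched as below. The relevance measure is not specified in the text; the `relevance` function here (fraction of query terms found in a document) is a toy assumption, and the lists of strings stand in for the two databases.

```python
import bisect
from typing import List, Tuple

Entry = Tuple[float, int, str]  # (negated relevance, arrival order, document)


def relevance(query: str, document: str) -> float:
    """Toy relevance score: fraction of query terms present in the document."""
    terms = query.lower().split()
    words = set(document.lower().split())
    return sum(term in words for term in terms) / len(terms) if terms else 0.0


def search(query: str, target: List[str], staging: List[str]) -> List[str]:
    queue: List[Entry] = []
    order = 0
    # Step 1: search the target database and sort the hits by relevance.
    for doc in target:
        score = relevance(query, doc)
        if score > 0:
            queue.append((-score, order, doc))
            order += 1
    queue.sort()
    # Step 2: search the temporary storage database and insert each hit at
    # the position its relevance dictates, keeping the queue sorted.
    for doc in staging:
        score = relevance(query, doc)
        if score > 0:
            bisect.insort(queue, (-score, order, doc))
            order += 1
    return [doc for _, _, doc in queue]


hits = search(
    "python crawler",
    target=["python crawler guide", "cooking recipes", "crawler basics"],
    staging=["python tips", "python crawler news"],
)
```

Storing the negated score lets the standard ascending sort place the most relevant result first, and the arrival counter makes target-database results precede staging results of equal relevance.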

Specifically, in this embodiment, the system adopts a distributed framework structure. The web crawlers are deployed in a distributed manner to form a distributed cluster system, which enables efficient data crawling and improves the real-time availability of the data.
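The text does not say how crawl targets are distributed over the cluster. As one hypothetical scheme, each crawl target could be assigned to a worker node by a stable hash of its name, as sketched below; the node names and target names are placeholders.

```python
import hashlib
from typing import Dict, List


def assign_targets(targets: List[str], workers: List[str]) -> Dict[str, List[str]]:
    """Map each crawl target to one worker using a stable hash of its name."""
    if not workers:
        raise ValueError("at least one worker is required")
    plan: Dict[str, List[str]] = {worker: [] for worker in workers}
    for target in targets:
        digest = hashlib.sha256(target.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(workers)
        plan[workers[index]].append(target)
    return plan


demo_targets = ["news.example", "forum.example", "blog.example", "shop.example"]
plan = assign_targets(demo_targets, ["node-a", "node-b"])
```

A cryptographic hash is used instead of the built-in `hash()` because the latter is randomised per process for strings, so different nodes would not agree on the assignment.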

Referring to FIG. 3, the invention further provides a method of using the Python-based unstructured data real-time crawling system, which comprises the following steps.

S1, keywords are input, and the data retrieval module generates a retrieval instruction according to the keywords. Specifically, in this step, the data retrieval module performs semantic analysis and concept extraction on the keywords and then generates the retrieval instruction from the concept extraction result. This supplements the search terms and helps ensure the completeness and accuracy of the retrieval results.
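Step S1 might be sketched as follows. The semantic analysis and concept extraction are not detailed in the text, so a small hypothetical concept table (`CONCEPTS`) is used here to expand the user's keywords into related terms; `RetrievalInstruction` and `build_instruction` are illustrative names.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Tuple

# Hypothetical concept table; a real system might use a thesaurus or embeddings.
CONCEPTS: Dict[str, FrozenSet[str]] = {
    "crawler": frozenset({"crawler", "spider", "scraper"}),
    "database": frozenset({"database", "db", "datastore"}),
}


@dataclass(frozen=True)
class RetrievalInstruction:
    keywords: Tuple[str, ...]
    expanded_terms: FrozenSet[str]


def build_instruction(raw_input: str) -> RetrievalInstruction:
    """Turn user keywords into an instruction with concept-expanded terms."""
    keywords = tuple(raw_input.lower().split())
    expanded = set(keywords)
    for word in keywords:
        for members in CONCEPTS.values():
            if word in members:
                expanded |= members  # add every term of the matched concept
    return RetrievalInstruction(keywords, frozenset(expanded))


instruction = build_instruction("Python spider")
```

Expanding "spider" to the whole crawler concept is what the text calls supplementing the search terms: documents that only mention "crawler" can still be matched.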

S2, the retrieval results of the data retrieval module are obtained, and the expected information is selected from the retrieval results as a response. Specifically, in this step, when the user clicks a link in the retrieval results, the click is regarded as a response to the retrieval results.

S3, the data retrieval module ends the retrieval action according to the response information obtained. Specifically, when any retrieval result is clicked and the browsing time exceeds a preset time threshold, the data retrieval module ends the retrieval action.

Specifically, in this step, the system judges automatically, from the user's response to the retrieval results, whether the user has obtained the expected information, and ends the retrieval automatically according to that judgment. The working thread is thus terminated intelligently, which helps avoid pointless thread occupation, maintains the working efficiency of the system, and saves energy.
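The termination rule of steps S2 and S3 can be sketched as a small state holder. The class name `RetrievalSession`, the 30-second threshold and the explicit `now` arguments are assumptions for illustration; the text only states that retrieval ends once a clicked result has been browsed longer than a preset threshold.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RetrievalSession:
    """Tracks one retrieval and ends it once a result was read long enough."""

    dwell_threshold_s: float
    active: bool = True
    _clicked_at: Optional[float] = field(default=None, repr=False)

    def on_click(self, now: float) -> None:
        """Record the moment the user opens a result link."""
        if self.active:
            self._clicked_at = now

    def on_tick(self, now: float) -> bool:
        """Check browsing time; return True if the retrieval has ended."""
        if self.active and self._clicked_at is not None:
            if now - self._clicked_at > self.dwell_threshold_s:
                self.active = False  # release the worker handling this search
        return not self.active

    def on_return(self) -> None:
        """User came back to the result list before the threshold elapsed."""
        if self.active:
            self._clicked_at = None


session = RetrievalSession(dwell_threshold_s=30.0)
session.on_click(now=100.0)
ended_early = session.on_tick(now=110.0)  # 10 s of browsing: keep searching
ended_late = session.on_tick(now=131.0)   # 31 s of browsing: end retrieval
```

Timestamps are passed in explicitly rather than read from the clock, which keeps the rule easy to test; a caller would typically supply `time.monotonic()`.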

The above describes only preferred embodiments of the present invention, but the scope of protection of the invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the invention.

Claims

1. A Python-based unstructured data real-time crawling system, characterized by comprising a crawler cluster, a temporary storage database, a data migration module and a target database;

2. The Python-based unstructured data real-time crawling system of claim 1, wherein a cache area is preset in the data migration module, the data migration module is used for extracting unstructured data from the temporary storage database according to time sequence information and storing the unstructured data in the cache area, and the data migration module is used for performing feature extraction and labeling on the unstructured data in the cache area and migrating the data in the cache area to a target database according to a labeling result for storage.

3. The Python-based unstructured data real-time crawling system of claim 2, wherein after the data migration module labels the data in the cache region, a sub-library corresponding to the tag is selected from the target database according to the labeling information to store the labeled data; and when any labeled data does not have a sub-library matched with the label in the target database, the data migration module informs the target database to establish a sub-library corresponding to the label according to the data labeling result and sends the data to the sub-library for storage.

4. The Python-based unstructured data real-time crawling system of claim 1, further comprising a data retrieval module connected to both the target database and the temporary storage database; the data retrieval module obtains a retrieval instruction and searches the target database and the temporary storage database respectively according to the retrieval instruction to obtain retrieval results.

5. The Python-based unstructured data real-time crawling system of claim 4, wherein after the data retrieval module obtains the retrieval instruction, the data retrieval module first retrieves the target database, and sorts the retrieval results obtained from the target database according to the degree of correlation with the retrieval instruction to form an information queue; and then, the data retrieval module retrieves the temporary storage database according to the retrieval instruction and inserts the acquired retrieval result into the information queue according to the correlation degree of the retrieval instruction.

6. The Python-based unstructured data real-time crawling system of claim 1, wherein at least some of the web crawlers in the crawler cluster are open-source web crawlers.

7. The Python-based unstructured data real-time crawling system of claim 1, characterized in that a distributed framework structure is adopted.

8. A method of using the Python-based unstructured data real-time crawling system according to any one of claims 1 to 7, characterized by comprising the following steps:

9. The method of using the Python-based unstructured data real-time crawling system according to claim 8, characterized in that in step S2, the expected information is responded to by clicking a link; step S3 specifically comprises: when any retrieval result is clicked and the browsing time exceeds a preset time threshold, the data retrieval module ends the retrieval action.