CN111859073A - Python-based unstructured data real-time crawling system and using method thereof - Google Patents

Publication number
CN111859073A
Authority
CN
China
Prior art keywords
data
retrieval
target database
database
python
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010729806.XA
Other languages
Chinese (zh)
Inventor
官鲁卫
陈霞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangxi Meicube Engineering Consulting Co Ltd
Original Assignee
Guangxi Meicube Engineering Consulting Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2020-07-27
Filing date
2020-07-27
Publication date
2020-10-30
Application filed by Guangxi Meicube Engineering Consulting Co Ltd
Priority to CN202010729806.XA
Publication of CN111859073A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention discloses a Python-based unstructured data real-time crawling system comprising a crawler cluster, a temporary storage database, a data migration module and a target database. The data migration module sorts the data stored in the temporary storage database in blocks and migrates the sorted data to the target database for storage. Because the data migration module sorts the unstructured data in the temporary storage database before migrating it, the same data is not stored redundantly in both the temporary storage database and the target database. In addition, staging data in the temporary storage database relieves the sorting load on the data migration module and helps preserve the logical integrity of the data in the target database, which in turn supports efficient information retrieval through the target database.

Description

Python-based unstructured data real-time crawling system and using method thereof
Technical Field
The invention relates to the technical field of network big data, in particular to a Python-based unstructured data real-time crawling system and a using method thereof.
Background
With the rapid development of the internet, it has penetrated every aspect of people's lives, and needs ranging from material goods to intellectual and cultural information can now be met online.
With the explosive growth of information, hundreds of millions of websites continue to emerge, and the number of web pages indexed by search engines has increased sharply.
The abundance of information on the internet brings great convenience: people can obtain all kinds of information quickly and efficiently. However, this explosion of information also burdens users with information overload, and how to quickly pick out the information a user needs from a massive volume of data is an increasingly urgent problem.
Python is a cross-platform programming language. It is a high-level scripting language that combines interpreted, compiled, interactive, and object-oriented features. It was originally designed for writing automation scripts (shell scripts); as versions have been updated and new language features added, it has increasingly been used to develop standalone, large-scale projects, and today it is widely used for system administration tasks and Web programming.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art, and provides a Python-based unstructured data real-time crawling system and a using method thereof.
A Python-based unstructured data real-time crawling system comprises: a crawler cluster, a temporary storage database, a data migration module and a target database;
the crawler cluster comprises a plurality of web crawlers which are set aiming at different crawling objects, and each web crawler is used for crawling unstructured data from the corresponding crawling object in real time;
the temporary storage database is connected with the crawler cluster and used for storing data crawled by each web crawler in real time;
the data migration module is respectively connected with the temporary storage database and the target database;
and the data migration module is used for carrying out block sorting on the data stored in the temporary storage database and migrating the sorted data to a target database for storage.
Preferably, a cache area is preset in the data migration module. The data migration module extracts unstructured data from the temporary storage database according to time-sequence information and stores it in the cache area, then performs feature extraction and labeling on the unstructured data in the cache area and migrates the data in the cache area to the target database for storage according to the labeling result.
Preferably, after the data migration module labels the data in the cache area, it selects the sub-library in the target database that corresponds to the label and stores the labeled data there; when no sub-library in the target database matches the label of a piece of labeled data, the data migration module notifies the target database to establish a sub-library corresponding to that label according to the labeling result, and sends the data to the new sub-library for storage.
Preferably, the system also comprises a data retrieval module which is connected to both the target database and the temporary storage database; the data retrieval module acquires a retrieval instruction and retrieves from the target database and the temporary storage database respectively according to the retrieval instruction, so as to obtain retrieval results.
Preferably, after the data retrieval module acquires the retrieval instruction, it first retrieves from the target database and sorts the retrieval results obtained from the target database by their degree of correlation with the retrieval instruction to form an information queue; the data retrieval module then retrieves from the temporary storage database according to the retrieval instruction and inserts the results so obtained into the information queue according to their degree of correlation with the retrieval instruction.
Preferably, the crawler cluster comprises at least some open-source web crawlers.
Preferably, the system adopts a distributed framework structure.
A use method of the Python-based unstructured data real-time crawling system comprises the following steps:
S1, keywords are input, and the data retrieval module generates a retrieval instruction according to the keywords;
S2, the retrieval results of the data retrieval module are acquired, and the expected information is selected from the retrieval results as a response;
S3, the data retrieval module ends the retrieval action according to the acquired response information.
Preferably, in step S2, the response to the expected information takes the form of clicking a link; step S3 specifically comprises: when any retrieval result is clicked and the browsing time exceeds a preset time threshold, the data retrieval module ends the retrieval action.
In the Python-based unstructured data real-time crawling system provided by the invention, the temporary storage database and the target database are respectively used for storing unstructured data crawled by a web crawler and ordered data arranged by the data migration module. Therefore, the target database is convenient for subsequent data retrieval and calling.
In the invention, the data migration module sorts the unstructured data in the temporary storage database and migrates the sorted data to the target database for storage, so the same data is not stored redundantly in both databases. Meanwhile, staging data in the temporary storage database relieves the sorting load on the data migration module and helps preserve the logical integrity of the data in the target database, which in turn supports efficient information retrieval through the target database.
In addition, the invention sorts the unstructured data in the temporary storage database block by block, which helps improve data processing efficiency and reduces interference between data items.
The invention also provides a use method of the Python-based unstructured data real-time crawling system. The method automatically judges, from the user's response to the retrieval results, whether the user has obtained the expected information, and automatically ends the retrieval according to that judgment. This intelligently terminates the worker thread, helps avoid needless thread occupation, and keeps the system efficient and energy-saving.
Drawings
FIG. 1 is a schematic diagram of a Python-based unstructured data real-time crawling system according to the present invention;
FIG. 2 is a schematic diagram of another Python-based unstructured data real-time crawling system according to the present invention;
FIG. 3 is a flow chart of a method for using the Python-based unstructured data real-time crawling system according to the present invention.
Detailed Description
Referring to fig. 1, the Python-based unstructured data real-time crawling system provided by the invention comprises a crawler cluster, a temporary storage database, a data migration module and a target database.
The crawler cluster comprises a plurality of web crawlers, each configured for a different crawling object and used for crawling unstructured data from its corresponding crawling object in real time. In this embodiment, the web crawlers target unstructured data, which avoids restricting what can be crawled and helps ensure broad coverage, so that the crawled data is rich and comprehensive.
Specifically, in this embodiment, the crawler cluster includes at least some open-source web crawlers, so that the crawlers can be conveniently adjusted to the data requirements, which improves the applicability and flexibility of the system.
The temporary storage database is connected with the crawler cluster and is used for storing the data crawled by the web crawlers in real time.
The data migration module is connected to both the temporary storage database and the target database.
The data migration module sorts the data stored in the temporary storage database in blocks and migrates the sorted data to the target database for storage.
Therefore, the temporary storage database and the target database are respectively used for storing unstructured data crawled by a web crawler and ordered data arranged by the data migration module. Thus, in the embodiment, the existence of the target database facilitates subsequent data retrieval and calling.
In this embodiment, the data migration module sorts the unstructured data in the temporary storage database and migrates the sorted data to the target database for storage, so the same data is not stored redundantly in both databases. Meanwhile, staging data in the temporary storage database relieves the sorting load on the data migration module and helps preserve the logical integrity of the data in the target database, which in turn supports efficient information retrieval through the target database.
In addition, in this embodiment, the unstructured data in the temporary storage database is sorted block by block, which improves data processing efficiency and reduces interference between data items.
In this embodiment, a cache area is preset in the data migration module. The data migration module extracts unstructured data from the temporary storage database according to time-sequence information and stores it in the cache area, then performs feature extraction and labeling on the unstructured data in the cache area and migrates the data in the cache area to the target database for storage according to the labeling result. The cache area makes it convenient for the data migration module to sort the data and spares it from repeatedly accessing the temporary storage database while sorting, which further improves sorting efficiency.
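The cache-based extraction and labeling described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the record layout (`text`, `fetched_at`), the `block_size` parameter, and the keyword-based `label` function are all assumptions chosen to keep the example small, and a real system would use an actual feature-extraction step.

```python
from collections import deque


def extract_block(staging, cache, block_size):
    """Move up to `block_size` of the oldest staged records into the cache.

    Records are removed from `staging` as they are moved, so the same
    record is never held in both places.
    """
    staging.sort(key=lambda rec: rec["fetched_at"])  # oldest first (time sequence)
    block = staging[:block_size]
    del staging[:block_size]
    cache.extend(block)
    return len(block)


def label(record):
    """Toy feature extraction (assumption): tag a record by a keyword in its text."""
    text = record["text"].lower()
    if "price" in text:
        return "finance"
    if "match" in text:
        return "sports"
    return "misc"


# Hypothetical staged records standing in for the temporary storage database.
staging = [
    {"text": "Match report: home team wins", "fetched_at": 3.0},
    {"text": "Steel price rises", "fetched_at": 1.0},
    {"text": "Weather notes", "fetched_at": 2.0},
]
cache = deque()
moved = extract_block(staging, cache, block_size=2)
labelled = [(label(rec), rec["text"]) for rec in cache]
```

Working on one block at a time keeps the cache small and means the temporary storage database is read once per block rather than once per record.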
In the embodiment, after the data migration module labels the data in the cache region, selecting a sub-library corresponding to the tag in the target database according to the labeling information to store the labeled data; and when any labeled data does not have a sub-library matched with the label in the target database, the data migration module informs the target database to establish a sub-library corresponding to the label according to the data labeling result and sends the data to the sub-library for storage. In this way, in the embodiment, the data migration module establishes the sub-database corresponding to the tag in the target database according to the label information of the sorted data to store the sorted and labeled data, thereby ensuring the ordered storage of the sorted and labeled data. Meanwhile, when the sub-libraries exist in the target database, the data migration module matches the labels of the existing sub-libraries according to the labeling information of the sorted and labeled data so as to avoid the sub-libraries with repeated labels.
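As a rough illustration of the sub-library routing described above, the sketch below stores each labeled record in the sub-library matching its label and creates that sub-library on first use. The `TargetDatabase` class and its method names are hypothetical in-memory stand-ins, not an API defined by the patent.

```python
class TargetDatabase:
    """In-memory stand-in for the target database: one list per label (assumption)."""

    def __init__(self):
        self.sub_libraries = {}

    def has_sub_library(self, tag):
        return tag in self.sub_libraries

    def create_sub_library(self, tag):
        self.sub_libraries[tag] = []

    def store(self, tag, record):
        self.sub_libraries[tag].append(record)


def migrate(labelled_records, target):
    """Route each (tag, record) pair to its sub-library, creating it when missing."""
    created = []
    for tag, record in labelled_records:
        if not target.has_sub_library(tag):
            # No matching sub-library yet: ask the target database to create one.
            target.create_sub_library(tag)
            created.append(tag)
        target.store(tag, record)
    return created


target = TargetDatabase()
created = migrate(
    [
        ("finance", "Steel price rises"),
        ("sports", "Match report"),
        ("finance", "Bond yields fall"),
    ],
    target,
)
```

Because the existence check runs before every store, a second record with an already-known label reuses the existing sub-library instead of creating a duplicate.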
Referring to fig. 2, the Python-based unstructured data real-time crawling system in this embodiment further includes a data retrieval module, which is respectively connected to the target database and the temporary storage database. And the data retrieval module is used for acquiring a retrieval instruction and respectively retrieving the target database and the temporary storage database according to the retrieval instruction so as to acquire a retrieval result.
Specifically, in this embodiment, after the data retrieval module obtains the retrieval instruction, it first retrieves from the target database and sorts the retrieval results obtained from the target database by their degree of correlation with the retrieval instruction, so as to form an information queue; the data retrieval module then retrieves from the temporary storage database according to the retrieval instruction and inserts the results so obtained into the information queue according to their degree of correlation with the retrieval instruction.
In the embodiment, the target database is searched preferentially, so that the data searching efficiency is improved; the retrieval of the temporary storage database realizes the supplement of retrieval information and is beneficial to ensuring the comprehensiveness and diversity of retrieval results.
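The two-stage retrieval described above might be sketched as follows. The patent does not say how the degree of correlation is computed, so the term-overlap `relevance` function here is purely an assumption, and plain lists of strings stand in for the two databases.

```python
import bisect


def relevance(query_terms, text):
    """Toy correlation score (assumption): number of query terms present in the text."""
    words = set(text.lower().split())
    return sum(1 for term in query_terms if term in words)


def search(query_terms, records):
    """Return (score, text) pairs for records matching at least one query term."""
    scored = ((relevance(query_terms, text), text) for text in records)
    return [(score, text) for score, text in scored if score > 0]


def retrieve(query_terms, target_records, staging_records):
    # Step 1: search the target database and order hits by descending relevance.
    queue = sorted(search(query_terms, target_records), key=lambda hit: -hit[0])
    # Step 2: search the temporary storage database and insert each hit at its
    # ranked position; on ties, target-database hits stay ahead.
    for hit in search(query_terms, staging_records):
        keys = [-score for score, _ in queue]
        queue.insert(bisect.bisect_right(keys, -hit[0]), hit)
    return [text for _, text in queue]


target_records = [
    "python crawler cluster design",
    "database migration notes",
    "crawler basics",
]
staging_records = ["raw page about python crawler", "python tips"]
results = retrieve(["python", "crawler"], target_records, staging_records)
```

Using `bisect_right` on the negated scores keeps the queue in descending order of relevance and places equally relevant results from the temporary storage database after those from the target database.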
Specifically, in this embodiment, the system adopts a distributed framework structure. The web crawlers are deployed in a distributed manner to form a distributed cluster system, which enables efficient data crawling and improves the real-time availability of the data.
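A crawler cluster feeding the temporary storage database can be approximated in a single process as shown below. This is only a sketch under stated assumptions: worker threads stand in for distributed nodes, `StagingStore` is a hypothetical in-memory substitute for the temporary storage database, and the `pages` lists replace real HTTP fetches so the example runs without network access.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
import threading
import time


@dataclass
class RawRecord:
    source: str        # which crawling object produced the record
    payload: str       # unstructured content, stored as-is
    fetched_at: float  # timestamp, usable later for time-ordered extraction


class StagingStore:
    """In-memory stand-in for the temporary storage database (assumption)."""

    def __init__(self):
        self._records = []
        self._lock = threading.Lock()

    def append(self, record):
        with self._lock:  # crawlers write concurrently
            self._records.append(record)

    def all(self):
        with self._lock:
            return list(self._records)


def make_crawler(source, pages):
    """Build a crawler bound to one crawling object; `pages` replaces real fetches."""
    def crawl(store):
        for page in pages:
            store.append(RawRecord(source, page, time.time()))
        return len(pages)
    return crawl


def run_cluster(crawlers, store):
    """Run each crawler in its own worker thread and return the total record count."""
    with ThreadPoolExecutor(max_workers=len(crawlers)) as pool:
        return sum(pool.map(lambda crawler: crawler(store), crawlers))


store = StagingStore()
cluster = [
    make_crawler("news", ["<html>headline A</html>", "<html>headline B</html>"]),
    make_crawler("forum", ["post text 1"]),
]
total = run_cluster(cluster, store)
```

Each crawler only appends raw records; all sorting and labeling is left to the data migration module, which matches the division of work described in the embodiment.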
Referring to fig. 3, the invention further provides a use method of the Python-based unstructured data real-time crawling system, and the use method includes the following steps.
S1: keywords are input, and the data retrieval module generates a retrieval instruction according to the keywords. Specifically, in this step, the data retrieval module performs semantic analysis and concept extraction on the keywords and then generates the retrieval instruction from the concept extraction result. This supplements the retrieval elements and helps ensure that the retrieval results are complete and accurate.
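One very simple way to picture step S1 is a lookup from keywords to concepts followed by expansion to related terms. The sketch below is illustrative only: the `CONCEPTS` table and the shape of the returned instruction are assumptions, and a real implementation of semantic analysis would likely rely on an NLP model or a thesaurus rather than a hand-written dictionary.

```python
# Hypothetical concept table: surface term -> concept.
CONCEPTS = {
    "crawler": "web crawling",
    "spider": "web crawling",
    "scraper": "web crawling",
    "db": "database",
    "database": "database",
}


def build_instruction(keywords):
    """Map raw keywords to concepts, then expand each concept to all known terms."""
    tokens = [word.lower() for word in keywords.split()]
    concepts = sorted({CONCEPTS[token] for token in tokens if token in CONCEPTS})
    expanded = sorted(term for term, concept in CONCEPTS.items() if concept in concepts)
    unknown = [token for token in tokens if token not in CONCEPTS]
    # Keywords without a known concept are kept verbatim as search terms.
    return {"concepts": concepts, "terms": expanded + unknown}


instruction = build_instruction("Python spider")
```

Here the single keyword "spider" is expanded to every term filed under the same concept, which is how concept extraction can supplement the retrieval elements beyond what the user typed.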
S2: the retrieval results of the data retrieval module are acquired, and the expected information is selected from the retrieval results as a response. Specifically, in this step, a user's click on a link in the retrieval results is regarded as a response to those results.
S3: the data retrieval module ends the retrieval action according to the acquired response information. Specifically, when any one of the retrieval results is clicked and the browsing time exceeds a preset time threshold, the data retrieval module ends the retrieval action.
Specifically, in this step, whether the user has obtained the expected information is judged automatically from the user's response to the retrieval results, and the retrieval is ended automatically according to that judgment. This intelligently terminates the worker thread, helps avoid needless thread occupation, and keeps the system efficient and energy-saving.
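The click-and-dwell rule in steps S2 and S3 can be sketched as a small session object. The class name, method names, and the 30-second threshold are assumptions for illustration; timestamps are passed in explicitly so the example is deterministic.

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalSession:
    dwell_threshold: float                         # seconds; the preset time threshold
    active: bool = True                            # False once retrieval has ended
    opened_at: dict = field(default_factory=dict)  # result id -> time the link was clicked

    def on_click(self, result_id, now):
        """Record the moment the user opens a retrieval result."""
        if self.active:
            self.opened_at[result_id] = now

    def on_close(self, result_id):
        """The user left the result before the threshold was checked."""
        self.opened_at.pop(result_id, None)

    def poll(self, now):
        """End retrieval if any opened result has been browsed longer than the threshold."""
        if any(now - opened > self.dwell_threshold for opened in self.opened_at.values()):
            self.active = False  # expected information found: release the worker
        return self.active


session = RetrievalSession(dwell_threshold=30.0)
session.on_click("result-1", now=0.0)
still_active_early = session.poll(now=5.0)   # 5 s of browsing: keep retrieving
session.on_close("result-1")
session.on_click("result-2", now=10.0)
still_active_late = session.poll(now=45.0)   # 35 s on result-2: end retrieval
```

A short visit that the user abandons does not end the session, while a visit that outlasts the threshold is treated as evidence that the expected information was found.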
The above describes only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.

Claims (9)

1. A Python-based unstructured data real-time crawling system, characterized by comprising: a crawler cluster, a temporary storage database, a data migration module and a target database;
the crawler cluster comprises a plurality of web crawlers which are set aiming at different crawling objects, and each web crawler is used for crawling unstructured data from the corresponding crawling object in real time;
the temporary storage database is connected with the crawler cluster and used for storing data crawled by each web crawler in real time;
the data migration module is respectively connected with the temporary storage database and the target database;
and the data migration module is used for carrying out block sorting on the data stored in the temporary storage database and migrating the sorted data to a target database for storage.
2. The Python-based unstructured data real-time crawling system of claim 1, wherein a cache area is preset in the data migration module, the data migration module is used for extracting unstructured data from the temporary storage database according to time sequence information and storing the unstructured data in the cache area, and the data migration module is used for performing feature extraction and labeling on the unstructured data in the cache area and migrating the data in the cache area to a target database according to a labeling result for storage.
3. The Python-based unstructured data real-time crawling system of claim 2, wherein after the data migration module labels the data in the cache region, a sub-library corresponding to the tag is selected from the target database according to the labeling information to store the labeled data; and when any labeled data does not have a sub-library matched with the label in the target database, the data migration module informs the target database to establish a sub-library corresponding to the label according to the data labeling result and sends the data to the sub-library for storage.
4. The Python-based unstructured data real-time crawling system of claim 1, further comprising a data retrieval module which is connected to the target database and the temporary storage database respectively; the data retrieval module is used for acquiring a retrieval instruction and retrieving from the target database and the temporary storage database respectively according to the retrieval instruction, so as to obtain retrieval results.
5. The Python-based unstructured data real-time crawling system of claim 4, wherein after the data retrieval module obtains the retrieval instruction, the data retrieval module first retrieves from the target database and sorts the retrieval results obtained from the target database by their degree of correlation with the retrieval instruction to form an information queue; the data retrieval module then retrieves from the temporary storage database according to the retrieval instruction and inserts the results so obtained into the information queue according to their degree of correlation with the retrieval instruction.
6. The Python-based unstructured data real-time crawling system of claim 1, wherein at least some of the crawlers in the crawler cluster are open-source web crawlers.
7. The Python-based unstructured data real-time crawling system of claim 1, characterized in that a distributed framework structure is adopted.
8. Use of the Python-based unstructured data real-time crawling system according to any one of the claims 1 to 7, characterized by comprising the following steps:
S1, keywords are input, and the data retrieval module generates a retrieval instruction according to the keywords;
S2, the retrieval results of the data retrieval module are acquired, and the expected information is selected from the retrieval results as a response;
S3, the data retrieval module ends the retrieval action according to the acquired response information.
9. The use method of the Python-based unstructured data real-time crawling system according to claim 8, characterized in that in step S2, the mode of responding to the expected information is clicking a link; step S3 specifically includes: and when any one retrieval result is clicked and the browsing time exceeds a preset time threshold, the data retrieval module finishes the retrieval action.
CN202010729806.XA 2020-07-27 2020-07-27 Python-based unstructured data real-time crawling system and using method thereof Pending CN111859073A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010729806.XA CN111859073A (en) 2020-07-27 2020-07-27 Python-based unstructured data real-time crawling system and using method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010729806.XA CN111859073A (en) 2020-07-27 2020-07-27 Python-based unstructured data real-time crawling system and using method thereof

Publications (1)

Publication Number Publication Date
CN111859073A (en) 2020-10-30

Family

ID=72947199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010729806.XA Pending CN111859073A (en) 2020-07-27 2020-07-27 Python-based unstructured data real-time crawling system and using method thereof

Country Status (1)

Country Link
CN (1) CN111859073A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1666198A (en) * 2002-07-08 2005-09-07 松下电器产业株式会社 Data search device
CN104820670A (en) * 2015-03-13 2015-08-05 国家电网公司 Method for acquiring and storing big data of power information
CN105760550A (en) * 2016-03-23 2016-07-13 江苏物联网研究发展中心 Big data storage center-oriented internet data acquisition system and acquisition method
CN106649249A (en) * 2015-07-14 2017-05-10 比亚迪股份有限公司 Retrieval method and retrieval device
CN110889023A (en) * 2019-11-20 2020-03-17 河海大学常州校区 Distributed multifunctional search engine of elastic search
CN111209331A (en) * 2020-01-06 2020-05-29 北京旷视科技有限公司 Target object retrieval method and device and electronic equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20201030
