CN111859073A - Python-based unstructured data real-time crawling system and using method thereof
- Publication number
- CN111859073A (application CN202010729806.XA)
- Authority
- CN
- China
- Prior art keywords
- data
- retrieval
- target database
- database
- python
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention discloses a Python-based unstructured data real-time crawling system comprising a crawler cluster, a temporary storage database, a data migration module and a target database. The data migration module sorts the data stored in the temporary storage database block by block and migrates the sorted data to the target database for storage. Because the unstructured data in the temporary storage database is sorted and then migrated to the target database, the two databases do not store the same data redundantly. In addition, staging the data in the temporary storage database first reduces the sorting load on the data migration module and helps keep the data in the target database logically complete, which in turn supports efficient information retrieval from the target database.
Description
Technical Field
The invention relates to the technical field of network big data, in particular to a Python-based unstructured data real-time crawling system and a using method thereof.
Background
With the rapid development of the internet, it has penetrated every aspect of people's lives, and everything from material needs to intellectual information can be obtained through it.
With the explosive growth of information, hundreds of millions of websites keep emerging, and the number of webpages indexed by search engines has increased sharply.
The abundant information on the internet brings great convenience, and people can obtain all kinds of information through it quickly and efficiently. However, the explosion of information also causes information overload for users, and how to quickly pick out the information a user needs from this mass of information is an increasingly urgent problem.
Python is a cross-platform computer programming language. It is a high-level scripting language that combines interpreted, compiled, interactive, and object-oriented features. It was originally designed for writing automation scripts (shell); with continual version updates and the addition of new language features, it has increasingly been used to develop independent, large-scale projects, and today it is widely used for system administration tasks and Web programming.
Disclosure of Invention
The invention aims to overcome the shortcomings of the prior art and provides a Python-based unstructured data real-time crawling system and a method of using it.
A Python-based unstructured data real-time crawling system comprises a crawler cluster, a temporary storage database, a data migration module and a target database;
the crawler cluster comprises a plurality of web crawlers configured for different crawl targets, each web crawler crawling unstructured data from its corresponding crawl target in real time;
the temporary storage database is connected to the crawler cluster and stores the data crawled by each web crawler in real time;
the data migration module is connected to both the temporary storage database and the target database;
and the data migration module sorts the data stored in the temporary storage database block by block and migrates the sorted data to the target database for storage.
Preferably, a cache area is preset in the data migration module. The data migration module extracts unstructured data from the temporary storage database according to time-sequence information and stores it in the cache area; it then performs feature extraction and labeling on the unstructured data in the cache area and migrates that data to the target database for storage according to the labeling result.
Preferably, after labeling the data in the cache area, the data migration module selects the sub-library in the target database that corresponds to the label and stores the labeled data there; when the target database has no sub-library matching the label of a piece of labeled data, the data migration module instructs the target database to create a sub-library for that label according to the labeling result and sends the data to that sub-library for storage.
Preferably, the system further comprises a data retrieval module connected to both the target database and the temporary storage database; the data retrieval module obtains a retrieval instruction and searches the target database and the temporary storage database separately according to that instruction to obtain retrieval results.
Preferably, after obtaining the retrieval instruction, the data retrieval module first searches the target database and sorts the results obtained from it by relevance to the retrieval instruction to form an information queue; it then searches the temporary storage database according to the retrieval instruction and inserts the results obtained into the information queue according to their relevance to the retrieval instruction.
Preferably, at least some of the web crawlers in the crawler cluster are separately sourced web crawlers.
Preferably, the system adopts a distributed architecture.
A method of using the Python-based unstructured data real-time crawling system comprises the following steps:
S1, keywords are input, and the data retrieval module generates a retrieval instruction according to the keywords;
S2, the retrieval results of the data retrieval module are obtained, and the expected information is selected from the retrieval results as a response;
S3, the data retrieval module ends the retrieval operation according to the response information obtained.
Preferably, in step S2, the response to the expected information is made by clicking a link; step S3 is specifically: when any retrieval result is clicked and the browsing time exceeds a preset time threshold, the data retrieval module ends the retrieval operation.
In the Python-based unstructured data real-time crawling system provided by the invention, the temporary storage database stores the unstructured data crawled by the web crawlers, and the target database stores the ordered data sorted by the data migration module. The target database thus facilitates subsequent data retrieval and access.
In the invention, the data migration module sorts the unstructured data in the temporary storage database and migrates the sorted data to the target database for storage, so the two databases do not store the same data redundantly. In addition, staging the data in the temporary storage database first reduces the sorting load on the data migration module and helps keep the data in the target database logically complete, which in turn supports efficient information retrieval from the target database.
In addition, the invention sorts the unstructured data in the temporary storage database block by block, which helps improve data-processing efficiency and reduce interference between data.
The invention also provides a method of using the Python-based unstructured data real-time crawling system. The method automatically judges from the user's response to the retrieval results whether the user has obtained the expected information and ends the retrieval automatically according to that judgment. This terminates the worker thread intelligently, helps avoid pointless thread occupation, maintains the working efficiency of the system, and saves energy.
Drawings
FIG. 1 is a schematic diagram of a Python-based unstructured data real-time crawling system according to the present invention;
FIG. 2 is a schematic diagram of another Python-based unstructured data real-time crawling system according to the present invention;
FIG. 3 is a flow chart of a method for using the Python-based unstructured data real-time crawling system according to the present invention.
Detailed Description
Referring to FIG. 1, the Python-based unstructured data real-time crawling system provided by the invention comprises a crawler cluster, a temporary storage database, a data migration module and a target database.
The crawler cluster comprises a plurality of web crawlers configured for different crawl targets, each web crawler crawling unstructured data from its corresponding crawl target in real time. In this embodiment, the web crawlers are set to crawl unstructured data, which avoids restricting the data that can be crawled and helps ensure the breadth of the crawl, so that the crawled data is rich and comprehensive.
Specifically, in this embodiment, at least some of the web crawlers in the crawler cluster are separately sourced web crawlers, so that the web crawlers can be conveniently adjusted according to data requirements, improving the applicability and flexibility of the system.
The temporary storage database is connected to the crawler cluster and stores the data crawled by the web crawlers in real time.
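The relationship between the crawler cluster and the temporary storage database can be illustrated with a minimal sketch. It is not the patent's implementation: `StagingStore`, `make_crawler` and `run_cluster` are hypothetical names, the pages are stubbed strings rather than real HTTP responses, and a thread pool merely stands in for a cluster.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from threading import Lock


class StagingStore:
    """Hypothetical in-memory stand-in for the temporary storage database."""

    def __init__(self):
        self.records = []
        self._lock = Lock()

    def put(self, source, payload):
        # Store the raw payload unchanged, tagged with its origin and arrival time.
        with self._lock:
            self.records.append(
                {"source": source, "payload": payload, "crawled_at": time.time()}
            )


def make_crawler(source, pages):
    """Build a crawler bound to one crawl target (pages are stubbed, no network)."""

    def crawl(store):
        for page in pages:
            store.put(source, page)
        return len(pages)

    return crawl


def run_cluster(crawlers, store):
    """Run every crawler concurrently and return the number of stored records."""
    with ThreadPoolExecutor(max_workers=len(crawlers)) as pool:
        return sum(pool.map(lambda crawl: crawl(store), crawlers))


store = StagingStore()
cluster = [
    make_crawler("news-site", ["<html>headline A</html>", "<html>headline B</html>"]),
    make_crawler("forum", ["thread: python crawler tips"]),
]
total = run_cluster(cluster, store)
```

Each crawler is bound to a single crawl target, and every crawler writes to the same store, which mirrors the one-crawler-per-target arrangement described in this embodiment.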
The data migration module is connected to both the temporary storage database and the target database.
The data migration module sorts the data stored in the temporary storage database block by block and migrates the sorted data to the target database for storage.
The temporary storage database therefore stores the unstructured data crawled by the web crawlers, and the target database stores the ordered data sorted by the data migration module. In this embodiment, the target database thus facilitates subsequent data retrieval and access.
In this embodiment, the data migration module sorts the unstructured data in the temporary storage database and migrates the sorted data to the target database for storage, so the two databases do not store the same data redundantly. In addition, staging the data in the temporary storage database first reduces the sorting load on the data migration module and helps keep the data in the target database logically complete, which in turn supports efficient information retrieval from the target database.
In addition, in this embodiment, the unstructured data in the temporary storage database is sorted block by block, which improves data-processing efficiency and reduces interference between data.
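Block-by-block migration might look like the following sketch. It rests on several assumptions not stated in the text: both databases are modelled as plain lists, `sort_block` is a placeholder sorting step (whitespace normalisation plus ordering by crawl time), and migrated records are deleted from the staging list, which is one way to read the requirement that the same data not be stored twice.

```python
def sort_block(block):
    """Placeholder sorting step: normalise whitespace, order by crawl time."""
    cleaned = [{**r, "payload": " ".join(r["payload"].split())} for r in block]
    return sorted(cleaned, key=lambda r: r["crawled_at"])


def migrate(staging, target, block_size=2):
    """Move every record from staging to target, one block at a time."""
    moved = 0
    while staging:
        block = staging[:block_size]
        target.extend(sort_block(block))
        # Drop the migrated block so the data is not held in both stores.
        del staging[:block_size]
        moved += len(block)
    return moved
```

Working on one block at a time keeps each sorting step small and independent of the rest of the staging data.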
In this embodiment, a cache area is preset in the data migration module. The data migration module extracts unstructured data from the temporary storage database according to time-sequence information and stores it in the cache area; it then performs feature extraction and labeling on the unstructured data in the cache area and migrates that data to the target database for storage according to the labeling result. In this embodiment, the cache area makes it convenient for the data migration module to sort the data and spares it from calling on the temporary storage database while sorting, which further improves sorting efficiency.
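The cache-then-label flow can be sketched as follows. The patent does not specify how features are extracted or labels assigned, so everything here is an assumption for illustration: features are simply the set of words in the payload, `LABEL_KEYWORDS` is a hypothetical label vocabulary, and "time-sequence information" is read as "oldest records first".

```python
# Hypothetical label vocabulary; a real system would configure or learn this.
LABEL_KEYWORDS = {
    "sports": {"match", "league", "goal"},
    "finance": {"stock", "market", "bank"},
}


def extract_features(payload):
    """Toy feature extraction: the set of lower-cased words in the payload."""
    return set(payload.lower().split())


def choose_label(features):
    """Pick the label whose keywords overlap the features most, else 'misc'."""
    best, best_hits = "misc", 0
    for name, keywords in LABEL_KEYWORDS.items():
        hits = len(features & keywords)
        if hits > best_hits:
            best, best_hits = name, hits
    return best


def fill_cache(staging, cache_size):
    """Move the oldest cache_size records out of staging into a new cache list."""
    staging.sort(key=lambda r: r["crawled_at"])  # sorts the staging list in place
    cache = staging[:cache_size]
    del staging[:cache_size]
    return cache


def label_cache(cache):
    """Return copies of the cached records with a 'label' field attached."""
    return [{**r, "label": choose_label(extract_features(r["payload"]))} for r in cache]
```

Once the cache is filled, labeling touches only the cached copies, so the staging store is not read again during that step.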
In this embodiment, after labeling the data in the cache area, the data migration module selects the sub-library in the target database that corresponds to the label and stores the labeled data there; when the target database has no sub-library matching the label of a piece of labeled data, the data migration module instructs the target database to create a sub-library for that label according to the labeling result and sends the data to that sub-library for storage. In this way, the data migration module creates sub-libraries in the target database according to the labels of the sorted data and stores the sorted, labeled data in them, ensuring that this data is stored in an orderly manner. When sub-libraries already exist in the target database, the data migration module first matches the labels of the sorted, labeled data against the labels of the existing sub-libraries, so that no two sub-libraries carry the same label.
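The select-or-create rule for sub-libraries is small enough to sketch directly. `TargetDatabase` is a hypothetical in-memory model in which each sub-library is a list keyed by its label; a real deployment would presumably map this onto tables or collections.

```python
class TargetDatabase:
    """Hypothetical target database holding one sub-library per label."""

    def __init__(self):
        self.sub_libraries = {}

    def store(self, record):
        """Store a labeled record; return True if a new sub-library was created."""
        tag = record["label"]
        created = tag not in self.sub_libraries
        if created:
            # No sub-library matches this label yet, so create one.
            self.sub_libraries[tag] = []
        self.sub_libraries[tag].append(record)
        return created
```

Because sub-libraries are keyed by label, a second sub-library with the same label cannot be created.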
Referring to FIG. 2, the Python-based unstructured data real-time crawling system in this embodiment further comprises a data retrieval module connected to both the target database and the temporary storage database. The data retrieval module obtains a retrieval instruction and searches the target database and the temporary storage database separately according to that instruction to obtain retrieval results.
Specifically, in this embodiment, after obtaining the retrieval instruction, the data retrieval module first searches the target database and sorts the results obtained from it by relevance to the retrieval instruction to form an information queue; it then searches the temporary storage database according to the retrieval instruction and inserts the results obtained into the information queue according to their relevance to the retrieval instruction.
In this embodiment, searching the target database first improves retrieval efficiency; searching the temporary storage database supplements the retrieved information and helps ensure that the retrieval results are comprehensive and diverse.
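The two-stage retrieval described above can be sketched as follows. The relevance measure is an assumption, since the text does not define one: here it is the fraction of query terms found in a document. Both databases are modelled as lists of strings, and `relevance` and `search` are hypothetical names.

```python
import bisect


def relevance(query, text):
    """Toy relevance: fraction of query terms that occur in the text."""
    terms = query.lower().split()
    words = set(text.lower().split())
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def search(query, target_docs, staging_docs):
    # Stage 1: hits from the target database, most relevant first.
    scored = [(relevance(query, doc), doc) for doc in target_docs]
    queue = sorted((p for p in scored if p[0] > 0), key=lambda p: -p[0])

    # Stage 2: insert hits from the staging database at the position
    # that keeps the queue ordered by descending relevance.
    for doc in staging_docs:
        score = relevance(query, doc)
        if score <= 0:
            continue
        keys = [-s for s, _ in queue]  # negated so the list is ascending
        queue.insert(bisect.bisect_right(keys, -score), (score, doc))
    return [doc for _, doc in queue]
```

Using `bisect_right` places a staging hit after any target hit with the same score, so already sorted data from the target database keeps priority on ties.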
Specifically, in this embodiment, the system adopts a distributed architecture. The web crawlers are deployed in a distributed manner to form a distributed cluster, which enables efficient data crawling and improves the timeliness with which data is presented.
Referring to FIG. 3, the invention further provides a method of using the Python-based unstructured data real-time crawling system, comprising the following steps.
S1: keywords are input, and the data retrieval module generates a retrieval instruction according to the keywords. Specifically, in this step, the data retrieval module performs semantic analysis and concept extraction on the keywords and then generates the retrieval instruction from the concept-extraction result. This helps supplement the retrieval terms and ensures the completeness and accuracy of the retrieval results.
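Step S1 could be approximated as in the sketch below. The patent does not describe how semantic analysis or concept extraction is performed, so this stands in a simple lookup: `CONCEPTS` is a hypothetical table of related terms, and the "retrieval instruction" is modelled as a dictionary holding the original and expanded terms.

```python
# Hypothetical concept table: each known term maps to related terms.
CONCEPTS = {
    "crawler": {"spider", "scraper"},
    "python": {"py"},
}


def build_instruction(keywords):
    """Turn raw keywords into a retrieval instruction with expanded terms."""
    terms = keywords.lower().split()
    expanded = []
    for term in terms:
        expanded.append(term)
        # Supplement the query with related terms, in a stable order.
        expanded.extend(sorted(CONCEPTS.get(term, ())))
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return {"terms": terms, "expanded_terms": list(dict.fromkeys(expanded))}
```

Expanding the terms before searching is what lets the instruction match documents that use a related word rather than the exact keyword.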
S2: the retrieval results of the data retrieval module are obtained, and the expected information is selected from the retrieval results as a response. Specifically, in this step, when the user clicks a link in the retrieval results, the click is regarded as a response to the retrieval results.
S3: the data retrieval module ends the retrieval operation according to the response information obtained. Specifically, when any retrieval result is clicked and the browsing time exceeds a preset time threshold, the data retrieval module ends the retrieval operation.
Specifically, in this step, the system automatically judges from the user's response to the retrieval results whether the user has obtained the expected information and ends the retrieval automatically according to that judgment. This terminates the worker thread intelligently, helps avoid pointless thread occupation, maintains the working efficiency of the system, and saves energy.
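The click-and-dwell rule of steps S2 and S3 can be sketched as a small state object. `RetrievalSession` and its method names are hypothetical, timestamps are passed in explicitly so the logic is deterministic, and the 30-second threshold used in the usage example is an arbitrary illustrative value, not one given in the text.

```python
class RetrievalSession:
    """Hypothetical session that ends once a clicked result is browsed long enough."""

    def __init__(self, dwell_threshold_s):
        self.dwell_threshold_s = dwell_threshold_s
        self.active = True
        self._open = None  # (result_id, clicked_at) while a result is being browsed

    def click(self, result_id, now):
        """Record that the user opened a result at time `now` (seconds)."""
        self._open = (result_id, now)

    def leave(self):
        """Record that the user returned to the result list."""
        self._open = None

    def poll(self, now):
        """Return True once the session has ended."""
        if self.active and self._open is not None:
            _, clicked_at = self._open
            if now - clicked_at > self.dwell_threshold_s:
                # Long dwell is treated as "expected information found".
                self.active = False
        return not self.active


session = RetrievalSession(dwell_threshold_s=30)
session.click("result-1", now=0)
session.leave()                    # user came back quickly, keep searching
session.click("result-2", now=100)
finished = session.poll(now=131)   # 31 s on result-2 exceeds the threshold
```

A caller that holds a worker thread for the search could release it as soon as `poll` returns True.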
The above is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any equivalent substitution or modification made by a person skilled in the art, within the technical scope disclosed by the present invention and according to its technical solutions and inventive concept, shall fall within the scope of protection of the present invention.
Claims (9)
1. A Python-based unstructured data real-time crawling system, characterized by comprising: a crawler cluster, a temporary storage database, a data migration module and a target database;
the crawler cluster comprises a plurality of web crawlers configured for different crawl targets, each web crawler crawling unstructured data from its corresponding crawl target in real time;
the temporary storage database is connected to the crawler cluster and stores the data crawled by each web crawler in real time;
the data migration module is connected to both the temporary storage database and the target database;
and the data migration module sorts the data stored in the temporary storage database block by block and migrates the sorted data to the target database for storage.
2. The Python-based unstructured data real-time crawling system of claim 1, wherein a cache area is preset in the data migration module, the data migration module is used for extracting unstructured data from the temporary storage database according to time sequence information and storing the unstructured data in the cache area, and the data migration module is used for performing feature extraction and labeling on the unstructured data in the cache area and migrating the data in the cache area to a target database according to a labeling result for storage.
3. The Python-based unstructured data real-time crawling system of claim 2, wherein after the data migration module labels the data in the cache region, a sub-library corresponding to the tag is selected from the target database according to the labeling information to store the labeled data; and when any labeled data does not have a sub-library matched with the label in the target database, the data migration module informs the target database to establish a sub-library corresponding to the label according to the data labeling result and sends the data to the sub-library for storage.
4. The Python-based unstructured data real-time crawling system of claim 1, further comprising a data retrieval module connected to both the target database and the temporary storage database; the data retrieval module obtains a retrieval instruction and searches the target database and the temporary storage database separately according to the retrieval instruction to obtain retrieval results.
5. The Python-based unstructured data real-time crawling system of claim 4, wherein after obtaining the retrieval instruction, the data retrieval module first searches the target database and sorts the retrieval results obtained from the target database by relevance to the retrieval instruction to form an information queue; the data retrieval module then searches the temporary storage database according to the retrieval instruction and inserts the retrieval results obtained into the information queue according to their relevance to the retrieval instruction.
6. The Python-based unstructured data real-time crawling system of claim 1, wherein at least some of the crawlers in the crawler cluster are separately sourced web crawlers.
7. The Python-based unstructured data real-time crawling system of claim 1, characterized in that a distributed framework structure is adopted.
8. Use of the Python-based unstructured data real-time crawling system according to any one of the claims 1 to 7, characterized by comprising the following steps:
S1, keywords are input, and the data retrieval module generates a retrieval instruction according to the keywords;
S2, the retrieval results of the data retrieval module are obtained, and the expected information is selected from the retrieval results as a response;
S3, the data retrieval module ends the retrieval operation according to the response information obtained.
9. The method of using the Python-based unstructured data real-time crawling system according to claim 8, characterized in that in step S2, the response to the expected information is made by clicking a link; step S3 is specifically: when any retrieval result is clicked and the browsing time exceeds a preset time threshold, the data retrieval module ends the retrieval operation.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010729806.XA CN111859073A (en) | 2020-07-27 | 2020-07-27 | Python-based unstructured data real-time crawling system and using method thereof |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111859073A (en) | 2020-10-30 |
Family
ID=72947199
Family Applications (1)
Application Number | Status | Publication | Priority Date | Filing Date | Title |
---|---|---|---|---|---|
CN202010729806.XA | Pending | CN111859073A (en) | 2020-07-27 | 2020-07-27 | Python-based unstructured data real-time crawling system and using method thereof |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111859073A (en) |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1666198A (en) * | 2002-07-08 | 2005-09-07 | 松下电器产业株式会社 | Data search device |
CN104820670A (en) * | 2015-03-13 | 2015-08-05 | 国家电网公司 | Method for acquiring and storing big data of power information |
CN105760550A (en) * | 2016-03-23 | 2016-07-13 | 江苏物联网研究发展中心 | Big data storage center-oriented internet data acquisition system and acquisition method |
CN106649249A (en) * | 2015-07-14 | 2017-05-10 | 比亚迪股份有限公司 | Retrieval method and retrieval device |
CN110889023A (en) * | 2019-11-20 | 2020-03-17 | 河海大学常州校区 | Distributed multifunctional search engine of elastic search |
CN111209331A (en) * | 2020-01-06 | 2020-05-29 | 北京旷视科技有限公司 | Target object retrieval method and device and electronic equipment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20201030 |