CN116975407A - Crawler method, crawler device, terminal equipment and computer readable storage medium - Google Patents

Info

Publication number
CN116975407A
Authority
CN
China
Prior art keywords
text
resource
snapshot
crawler
target webpage
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310866318.7A
Other languages
Chinese (zh)
Inventor
颜庚潇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xunlei Computer Shenzhen Co ltd
Original Assignee
Xunlei Computer Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xunlei Computer Shenzhen Co ltd
Priority to CN202310866318.7A
Publication of CN116975407A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

A crawler method, a crawler device, a terminal device and a computer-readable storage medium, wherein the crawler method comprises the following steps: acquiring a first snapshot of a target webpage and a second snapshot from the history of the target webpage; parsing a first resource included in the target webpage based on the first snapshot, and parsing a second resource included in the history of the target webpage based on the second snapshot; comparing the first resource with the second resource; and if the first resource and the second resource are different, acquiring the first resource in the target webpage through a first crawler. The scheme provided by the application can solve the problems of high cost and low efficiency of conventional crawling approaches.

Description

Crawler method, crawler device, terminal equipment and computer readable storage medium
Technical Field
The application belongs to the technical field of Internet, and particularly relates to a crawler method, a crawler device, terminal equipment and a computer readable storage medium.
Background
The rapid growth of the network has brought people abundant resources, but how to acquire those resources has also become a great challenge. Obtaining the resources in web pages only by manually opening each page falls far short of users' needs.
In the prior art, as shown in FIG. 1, a user may obtain all resources in a target web page through a crawler and, on finding after a period of time that the target web page has been updated, obtain all resources in the target web page again through the crawler. This is costly and inefficient, because the user must check for updates manually and then re-acquire every resource in the updated target web page.
Disclosure of Invention
The application aims to provide a crawler method, a crawler device, a terminal device and a computer-readable storage medium, so as to solve the problems of high cost and low efficiency of conventional crawling approaches.
A first aspect of an embodiment of the present application proposes a crawler method, the method including:
acquiring a first snapshot of a target webpage and a second snapshot in the history of the target webpage;
parsing a first resource included in the target webpage based on the first snapshot, and parsing a second resource included in the history of the target webpage based on the second snapshot;
comparing the first resource with the second resource;
and if the first resource and the second resource are different, acquiring the first resource in the target webpage through a first crawler.
In some embodiments, the comparing the first resource with the second resource comprises at least one of:
comparing the edit distance between a first text included in the first resource and a second text included in the second resource;
comparing a first picture link included in the first resource with a second picture link included in the second resource;
comparing a first uniform resource locator (Uniform Resource Locator, URL) comprised by the first resource with a second uniform resource locator comprised by the second resource.
In some embodiments, the obtaining a first snapshot of the target web page includes:
acquiring the first snapshot of the target webpage in response to an acquisition instruction, wherein the acquisition instruction is an instruction triggered by an external input or by a timed task.
In some embodiments, the first resource comprises a third text, the method further comprising:
determining a text category corresponding to the third text;
and storing the third text based on the text category.
In some embodiments, the determining the text category corresponding to the third text includes:
inputting the third text into a text classification network model to obtain the text category output by the text classification network model;
The text classification network model is used for extracting text features included in the third text, and determining the text category corresponding to the third text from a plurality of text categories based on the text features.
In some embodiments, the text classification network model includes an embedding layer, a transformer layer, and a softmax layer;
the step of inputting the third text into a text classification network model to obtain the text category output by the text classification network model comprises the following steps:
inputting the third text to the embedding layer to obtain a feature matrix of the third text output by the embedding layer;
inputting the feature matrix to the transformer layer to obtain the text features output by the transformer layer;
and inputting the text features to the softmax layer to obtain the text category output by the softmax layer.
A second aspect of an embodiment of the present application proposes a crawler apparatus, including:
the first acquisition module is used for acquiring a first snapshot of a target webpage and a second snapshot in the history of the target webpage;
the parsing module is used for parsing a first resource included in the target webpage based on the first snapshot and parsing a second resource included in the history of the target webpage based on the second snapshot;
the comparison module is used for comparing the first resource with the second resource;
and the second acquisition module is used for acquiring the first resource in the target webpage through the first crawler if the first resource and the second resource are different.
In some embodiments, the comparison module is specifically configured to:
comparing the edit distance between a first text included in the first resource and a second text included in the second resource;
comparing a first picture link included in the first resource with a second picture link included in the second resource;
comparing a first uniform resource locator link included in the first resource with a second uniform resource locator link included in the second resource.
A third aspect of the embodiments of the present application proposes a terminal device comprising a memory, a processor and a computer program stored in said memory and executable on said processor, said processor implementing the steps of the method as described above when said computer program is executed.
A fourth aspect of the embodiments of the present application proposes a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method as described above.
Compared with the prior art, the embodiment of the application has the beneficial effects that:
in the embodiment of the application, a first snapshot of the target webpage and a second snapshot from the history of the target webpage are acquired, the first resource included in the target webpage is parsed based on the first snapshot, the second resource included in the history of the target webpage is parsed based on the second snapshot, and the first resource is compared with the second resource, so that whether the target webpage has been updated is judged automatically. If the first resource and the second resource are different, the first resource is an updated resource in the target webpage, so the first resource is acquired through the first crawler, which realizes automatic acquisition of the updated resource in the target webpage. That is, the embodiment of the application can automatically judge whether the target webpage has been updated and, if so, automatically acquire the updated first resource, without anyone having to discover the update manually and without re-acquiring all the resources in the target webpage. This alleviates the problem of information lag, improves the efficiency of acquiring resources from the target webpage, and reduces cost.
Drawings
FIG. 1 is a flow diagram of a crawler method;
FIG. 2 is a schematic diagram of a crawler system according to an embodiment of the present application;
FIG. 3 is a schematic diagram of a crawler according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a crawler method according to an embodiment of the present application;
FIG. 5 is a flowchart of another crawler method according to an embodiment of the present application;
FIG. 6 is a diagram illustrating a method for comparing web page snapshots according to an embodiment of the present application;
FIG. 7 is a diagram illustrating another method for comparing web page snapshots according to an embodiment of the application;
FIG. 8 is a flowchart of another crawler method according to an embodiment of the present application;
FIG. 9 is a schematic structural diagram of a text classification network model according to an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a Transformer-based text classification network model according to an embodiment of the present application;
FIG. 11 is a schematic diagram of a crawler device according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a terminal device according to an embodiment of the present application.
Detailed Description
To make the technical problems to be solved, the technical schemes and the beneficial effects of the application clearer, the application is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
It will be understood that when an element is referred to as being "mounted" or "disposed" on another element, it can be directly on the other element or be indirectly on the other element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or be indirectly connected to the other element.
It is to be understood that the terms "length," "width," "upper," "lower," "front," "rear," "left," "right," "vertical," "horizontal," "top," "bottom," "inner," "outer," and the like are merely for convenience in describing and simplifying the description based on the orientation or positional relationship shown in the drawings, and do not indicate or imply that the devices or elements referred to must have a particular orientation, be constructed and operated in a particular orientation, and thus are not to be construed as limiting the application.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
Fig. 2 shows a schematic structural diagram of a crawler system according to an embodiment of the present application. As shown in FIG. 2, the crawler system includes one or more crawlers 210 (only 3 crawlers 210 are shown in FIG. 2), a distributed real-time message platform (NSQ) 220, a scheduler 230, a Redis cache 240, an Engine (Engine) 250, a MySQL database 260, and a search Engine (ES) 270.
Crawler 210 (also known as a web crawler, web spider, or web robot) is a program or script that automatically crawls web resources according to preset rules. In some implementations, crawler 210 may start from the uniform resource locators of one or several initial web pages and obtain the uniform resource locators found on those pages. While crawling, it continuously extracts new uniform resource locators from the current page and puts them into a queue until a preset stop condition is met. In some embodiments, crawler 210 may filter out links unrelated to the target subject according to a web page analysis algorithm, keep the useful links and place them in a queue of uniform resource locators waiting to be crawled, select the next uniform resource locator to crawl from the queue according to a preset search policy, and repeat this process until a preset stop condition is reached.
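The queue-driven crawl described above can be sketched as follows. This is a minimal illustration only, not the implementation of crawler 210: the names crawl, fetch_links, is_relevant and max_pages are hypothetical, breadth-first order stands in for the unspecified "preset search policy", and a page budget stands in for the unspecified "preset stop condition".

```python
from collections import deque
from typing import Callable, Iterable, List


def crawl(seeds: Iterable[str],
          fetch_links: Callable[[str], Iterable[str]],
          is_relevant: Callable[[str], bool] = lambda url: True,
          max_pages: int = 100) -> List[str]:
    """Visit URLs breadth-first, enqueueing unseen, relevant links."""
    queue = deque(seeds)          # URLs waiting to be crawled
    seen = set(queue)             # every URL ever enqueued
    visited: List[str] = []
    # Assumed stop condition: empty queue or page budget exhausted.
    while queue and len(visited) < max_pages:
        url = queue.popleft()     # assumed search policy: first in, first out
        visited.append(url)
        for link in fetch_links(url):
            # Drop links that were already queued or are off-topic.
            if link not in seen and is_relevant(link):
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph standing in for real web pages.
GRAPH = {"a": ["b", "c"], "b": ["c", "d"], "c": ["a"], "d": []}
order = crawl(["a"], lambda url: GRAPH.get(url, []))
```

In a real deployment fetch_links would download and parse a page; here it is injected so the sketch runs without network access.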
NSQ 220 is a distributed real-time message platform with a distributed, decentralized topology, which gives it the advantages of no single point of failure and high availability. In embodiments of the present application, NSQ 220 may be used for message reception, delivery, and pushing in the crawler system. In some implementations, publish/subscribe and message push flows can be implemented through NSQ 220. In some implementations, NSQ 220 may include multiple topics and multiple channels. In some embodiments, a topic may first be created in NSQ 220, and instructions requesting the creation of crawler 210 may then be sent to the topic; each channel may receive the information sent by the topic, so the topic forwards the instructions requesting creation of the crawler to the channels. NSQ 220 may therefore enable multiple crawlers 210 to acquire resources in a web page.
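The topic and channel relationship described above can be illustrated with a small in-memory model. This sketch does not use NSQ or any NSQ client library; the Topic class, its methods, and the channel and message names are hypothetical and only show the fan-out idea that every channel of a topic receives the messages sent to that topic.

```python
from collections import deque
from typing import Deque, Dict


class Topic:
    """In-memory stand-in for a message topic with named channels."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.channels: Dict[str, Deque[str]] = {}

    def channel(self, name: str) -> Deque[str]:
        # Create the channel on first use, then return its message queue.
        return self.channels.setdefault(name, deque())

    def publish(self, message: str) -> None:
        # Every channel of the topic receives its own copy of the message.
        for queue in self.channels.values():
            queue.append(message)


topic = Topic("crawler_requests")          # hypothetical topic name
images = topic.channel("image_crawler")    # hypothetical channel names
texts = topic.channel("text_crawler")
topic.publish("create crawler for http://w.123.com/")
```

Each consumer then reads from its own channel queue, which is how several crawlers can be driven from one stream of creation requests.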
Scheduler 230 may be used to schedule crawler 210 such that crawler 210 obtains resources in the web page. In some implementations, scheduler 230 may be configured to send engine 250 a crawler request that includes crawler 210. In some implementations, crawler 210 may transition between the various states shown in FIG. 3 based on the scheduling of scheduler 230.
Redis cache 240 may be used to store information related to crawlers. In some implementations, Redis cache 240 may store the uniform resource locators of web pages that have already been crawled, which mitigates the problem of web pages being crawled repeatedly. For example, when scheduling crawler 210, scheduler 230 may query whether Redis cache 240 includes the uniform resource locator carried by crawler 210. If Redis cache 240 includes the uniform resource locator, it may be determined that the corresponding web page has already been crawled, and crawler 210 is therefore not sent to engine 250. If Redis cache 240 does not include the uniform resource locator, it may be determined that the corresponding web page has not been crawled, and crawler 210 may therefore be sent to engine 250 to crawl the resources in the web page. In some implementations, Redis cache 240 may store state information for crawlers, which may be used for scheduling operations such as restarting and suspending crawlers 210. In some implementations, Redis cache 240 may store crawler task information, such as at least one of a number of crawl successes and a number of crawl failures.
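The duplicate check performed by the scheduler can be sketched as below. To keep the example self-contained, an in-memory set stands in for Redis cache 240; SeenStore, add_if_new, schedule and dispatch are hypothetical names, and nothing here reflects an actual Redis client API.

```python
from typing import Callable, Iterable, List, Set


class SeenStore:
    """In-memory stand-in for the cache of already-crawled URLs."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def add_if_new(self, url: str) -> bool:
        """Record the URL; return True only the first time it is seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        return True


def schedule(urls: Iterable[str], store: SeenStore,
             dispatch: Callable[[str], None]) -> List[str]:
    """Send only not-yet-crawled URLs on to the engine (dispatch)."""
    dispatched = []
    for url in urls:
        if store.add_if_new(url):   # skip pages that were crawled before
            dispatch(url)
            dispatched.append(url)
    return dispatched


store = SeenStore()
sent_to_engine: List[str] = []
first_round = schedule(["u1", "u2", "u1"], store, sent_to_engine.append)
second_round = schedule(["u2", "u3"], store, sent_to_engine.append)
```

With a shared cache in place of the set, several schedulers could apply the same check, provided the "check and record" step is performed atomically.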
Engine 250 may include a downloader 251 and a parser 252.
The downloader 251 may be used to obtain some or all of the resources from the web page.
The parser 252 may be used to parse and structure resources obtained from a web page.
MySQL database 260 may store resources obtained from web pages.
ES 270 is a distributed, highly scalable, highly real-time search and analysis engine that may be used for data search and analysis. The implementation principles of ES 270 may include: storing the data submitted by the user into a database, segmenting the data into words through a word segmenter, storing the segmentation results together with their weights, and, when the user searches the data, returning search results according to the weights and the segmentation results. In an embodiment of the application, ES 270 may be used to store, search, and analyze resources in a web page. In some implementations, parsed and structured resources may be uploaded to ES 270 to facilitate searching and analysis of the resources by ES 270.
Fig. 3 shows a state diagram of a crawler according to an embodiment of the present application. As shown in FIG. 3, while being scheduled, the crawler may be in at least some of the following states: a waiting state, an in-progress state, a failed state, a completed state, a paused state, and an unknown state.
The waiting state may be a state in which the crawler waits for the next operation. In some implementations, a crawler that has just been created, or that is in the paused state, the unknown state, or the completed state, may transition to the waiting state.
The in-progress state may be the state of the crawler during execution. In some implementations, when execution of a crawler in the waiting state is continued, the crawler may transition to the in-progress state.
The failed state may be a state in which the crawler has failed to execute. In some implementations, when a crawler in the in-progress state fails, it may transition to the failed state. The failure may have a cause internal to the crawler system, such as an engine or scheduler failure, or a cause external to the crawler system, such as an external network failure. It will be appreciated that embodiments of the present application do not limit the specific cause of a crawler failure.
The completed state may be the state of the crawler when it has finished acquiring resources from the web page. In some implementations, a crawler in the in-progress state may transition to the completed state when it successfully retrieves the resources from the web page.
The paused state may be a state in which execution of the crawler is suspended. In some implementations, when a crawler in the in-progress state is paused, it may transition to the paused state. In some implementations, the crawler may be paused because it could not obtain a processor or other device resources.
The unknown state may be the state of the crawler when it cannot be determined to be in any of the waiting state, the in-progress state, the failed state, the completed state, or the paused state.
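The transitions described for FIG. 3 can be written down as a small table-driven state machine. This is a sketch under assumptions: only the transitions named in the text are modeled, the text does not say how a crawler enters the unknown state or leaves the failed state, so neither is modeled, and State, ALLOWED and transition are hypothetical names.

```python
from enum import Enum


class State(Enum):
    WAITING = "waiting"
    IN_PROGRESS = "in_progress"
    FAILED = "failed"
    COMPLETED = "completed"
    PAUSED = "paused"
    UNKNOWN = "unknown"


# Transitions named in the description; a new crawler starts in WAITING.
ALLOWED = {
    State.WAITING: {State.IN_PROGRESS},
    State.IN_PROGRESS: {State.FAILED, State.COMPLETED, State.PAUSED},
    State.PAUSED: {State.WAITING},
    State.COMPLETED: {State.WAITING},
    State.UNKNOWN: {State.WAITING},
    State.FAILED: set(),   # assumption: no exit from FAILED is described
}


def transition(current: State, target: State) -> State:
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target


state = State.WAITING                           # just created
state = transition(state, State.IN_PROGRESS)    # scheduler continues it
state = transition(state, State.PAUSED)         # e.g. no processor available
state = transition(state, State.WAITING)        # back to waiting
```

Keeping the table separate from the transition function makes it easy to add further edges if the scheduler supports them.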
Fig. 4 shows a flowchart of a crawler method according to an embodiment of the present application. This method can be used in the crawler system of fig. 2 described above. The method comprises the following steps:
S401, acquiring a first snapshot of the target webpage and a second snapshot in the history of the target webpage.
The snapshot of the web page may refer to storing the resources included in the web page at a certain moment in a specific format such as a picture or a text file, so as to facilitate subsequent viewing or updating. In some embodiments, the search engine may obtain and store a snapshot of at least one web page, and based on an externally (e.g., user) entered keyword, obtain a snapshot of at least a portion of the web page matching the keyword from the snapshot of the at least one web page, and present the obtained snapshot of at least a portion of the web page to the user.
The target web page may be a web page that requires crawler retrieval.
The first snapshot of the target webpage may be a snapshot of the target webpage obtained at the current moment.
In some embodiments, an externally input uniform resource locator for the target web page may be received, and a first snapshot of the target web page may be obtained based on the uniform resource locator corresponding to the target web page. In practical applications, the target web page may be determined in other ways, or the uniform resource locator corresponding to the target web page may be obtained in other ways.
In some implementations, the first snapshot of the target web page may be taken by a second crawler.
In some implementations, the first snapshot of the target web page may be taken through scheduler 230 and engine 250 in fig. 2.
The second snapshot in the history of the target web page is a snapshot of the target web page taken before the current time. In some embodiments, the second snapshot may be the snapshot in the history closest to the current time, that is, the latest snapshot in the history. Using the latest snapshot mitigates two problems that arise when the second snapshot is too old: it becomes difficult to judge accurately whether the target web page has been updated, and updated resources that were already acquired may be acquired again.
In some implementations, a second snapshot of the target web page may be taken from a storage system, such as an object storage service (Object Storage Service, OSS). Of course, it can be understood that, in practical application, the second snapshot may be obtained in other manners, and the manner of obtaining the second snapshot is not limited in the embodiment of the present application.
S402, analyzing first resources included in the target webpage based on the first snapshot, and analyzing second resources included in the target webpage history based on the second snapshot.
In some implementations, the first resource can include at least one of a first text, a first picture link, and a first uniform resource locator.
In some implementations, the first text can include all or part of the text in the target web page at the time corresponding to the first snapshot.
In some embodiments, the first picture link may include a link of all or part of the pictures in the target web page at a time corresponding to the first snapshot. In some implementations, the first picture link can indicate the network address of a picture in the target web page. For example, the first picture link may be "http://w.123.com/456.png", indicating a PNG picture named 456 hosted at w.123.com.
In some implementations, the first uniform resource locator can include all or part of the uniform resource locators in the target web page at the time corresponding to the first snapshot. In some implementations, the first uniform resource locator can indicate the network address of one or more pieces of information. In some implementations, the first uniform resource locator can include the type of the information, the domain name of the host on which the information is stored, and the name of the information.
In some implementations, the second resource can include at least one of a second text, a second picture link, and a second uniform resource locator. In some implementations, the second text can include all or part of the text in the target web page at the time corresponding to the second snapshot. In some embodiments, the second picture link may include a link of all or part of the pictures in the target web page at a time corresponding to the second snapshot. In some embodiments, the second uniform resource locator may include all or part of the uniform resource locator in the target web page at the time corresponding to the second snapshot.
In some implementations, the first text and the second text can be text in hypertext markup language (Hyper Text Markup Language, HTML) format.
In some embodiments, the first snapshot and the second snapshot may be parsed by the parser 252 of fig. 2 described above.
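Parsing a snapshot into text, picture links and uniform resource locators can be sketched with Python's standard html.parser module. This is an assumption-laden illustration rather than the behaviour of parser 252: it assumes the snapshot is stored as HTML, treats img src values as picture links and a href values as uniform resource locators, extracts plain text instead of HTML-format text, and uses the hypothetical names Resources and parse_snapshot.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import List


@dataclass
class Resources:
    texts: List[str] = field(default_factory=list)
    picture_links: List[str] = field(default_factory=list)
    urls: List[str] = field(default_factory=list)


class _SnapshotParser(HTMLParser):
    def __init__(self) -> None:
        super().__init__()
        self.resources = Resources()
        self._skip_depth = 0      # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "img" and attributes.get("src"):
            self.resources.picture_links.append(attributes["src"])
        elif tag == "a" and attributes.get("href"):
            self.resources.urls.append(attributes["href"])

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and not self._skip_depth:   # ignore script/style bodies
            self.resources.texts.append(text)


def parse_snapshot(html: str) -> Resources:
    """Extract visible text, image links and hyperlinks from an HTML snapshot."""
    parser = _SnapshotParser()
    parser.feed(html)
    parser.close()
    return parser.resources


snapshot = ('<p>Hello</p><script>var x = 1;</script>'
            '<img src="http://w.123.com/456.png"><a href="/next">More</a>')
first_resource = parse_snapshot(snapshot)
```

Applying the same function to the first and second snapshots yields two Resources values that can then be compared field by field.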
S403, comparing the first resource with the second resource.
The first snapshot is taken at the current time and the second snapshot is a snapshot in the history of the target web page, so if the two snapshots differ, the target web page may have been updated between the time the second snapshot was taken and the time the first snapshot was taken. The first resource included in the first snapshot is compared with the second resource included in the second snapshot; if they differ, the first resource may be an updated resource in the target web page.
S404, if the first resource and the second resource are different, the first resource in the target webpage is acquired through the first crawler.
If the first resource and the second resource are different, the first resource may be an updated resource in the target web page, so the first resource can be acquired through the first crawler. In this way, when the target web page has been updated, the updated first resource is identified and acquired without acquiring all the resources included in the target web page, which improves the efficiency of acquiring web page resources.
In some implementations, the first text in the target web page may be obtained by a first crawler.
In some embodiments, the picture indicated by the first picture link in the target webpage may be obtained by the first crawler.
In some implementations, a first uniform resource locator in a target web page may be obtained by a first crawler.
It is understood that the number of first crawlers may be one or more, and embodiments of the present application are not limited to the number of first crawlers.
In some implementations, the first resource in the target web page may be obtained by the scheduler 230 and engine 250 in fig. 2 through the first crawler.
In the embodiment of the application, a first snapshot of the target webpage and a second snapshot from the history of the target webpage are acquired, the first resource included in the target webpage is parsed based on the first snapshot, the second resource included in the history of the target webpage is parsed based on the second snapshot, and the first resource is compared with the second resource, so that whether the target webpage has been updated is judged automatically. If the first resource and the second resource are different, the first resource is an updated resource in the target webpage, so the first resource is acquired through the first crawler, which realizes automatic acquisition of the updated resource in the target webpage. That is, the embodiment of the application can automatically judge whether the target webpage has been updated and, if so, automatically acquire the updated first resource, without anyone having to discover the update manually and without re-acquiring all the resources in the target webpage. This alleviates the problem of information lag, improves the efficiency of acquiring resources from the target webpage, and reduces cost.
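Steps S401 to S404 can be tied together in a short sketch. All names here (crawl_updates and its four callable parameters) are hypothetical, snapshots are modeled as plain strings, and resources are modeled as strings compared for exact equality, which is only one of the comparison options the description allows.

```python
from typing import Callable, List


def crawl_updates(url: str,
                  take_snapshot: Callable[[str], str],
                  load_latest_snapshot: Callable[[str], str],
                  parse: Callable[[str], List[str]],
                  fetch: Callable[[str], str]) -> List[str]:
    """Fetch only the resources that are new relative to the stored snapshot."""
    # S401: current snapshot plus the latest snapshot from the history.
    first_snapshot = take_snapshot(url)
    second_snapshot = load_latest_snapshot(url)
    # S402: parse the resources out of each snapshot.
    first_resources = parse(first_snapshot)
    second_resources = set(parse(second_snapshot))
    # S403: keep the resources that do not appear in the older snapshot.
    changed = [r for r in first_resources if r not in second_resources]
    # S404: acquire only the changed resources (the "first crawler").
    return [fetch(r) for r in changed]


# Toy data: snapshots are space-separated resource names.
history = {"http://example.com": "a.png b.png"}
live = {"http://example.com": "a.png b.png c.png"}
fetched = crawl_updates("http://example.com",
                        take_snapshot=live.get,
                        load_latest_snapshot=history.get,
                        parse=str.split,
                        fetch=lambda resource: "fetched " + resource)
```

When nothing has changed the function returns an empty list, so no crawler work is done for an unchanged page.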
Fig. 5 shows a flowchart of a crawler method according to an embodiment of the present application. The method may be a detailed description of the method shown in fig. 4. The method comprises the following steps:
S501, acquiring all resources of the target webpage through a third crawler.
In some implementations, all resources in the target web page may be acquired by the third crawler by the scheduler 230 and engine 250 in fig. 2.
In some embodiments, an externally entered subscription request may be obtained, which may include a uniform resource locator of the target web page.
In some embodiments, the second resource is included in all the resources of the target web page acquired through S501, and the second snapshot of the target web page may be a snapshot of the target web page acquired when S501 is performed.
S502, responding to the acquisition instruction, and acquiring a first snapshot of the target webpage.
The acquisition instruction may be used to trigger acquisition of the first snapshot of the target webpage. Because acquisition of the first snapshot is triggered by an instruction, the timing of the acquisition can be controlled flexibly.
In some embodiments, the acquisition instruction may be an externally input instruction. Such an instruction triggers the subsequent steps of detecting whether the target web page has been updated and acquiring the updated resources through the crawler, which reduces the cost of performing the detection manually and improves the accuracy and efficiency of detecting updates. In some embodiments, externally input instructions may include instructions received through a keyboard, mouse, touch screen, microphone, or other peripheral device. It will be appreciated that in practical applications the peripheral devices may include other devices as well.
For example, a virtual button corresponding to the acquisition instruction may be displayed through the display screen. The user may click or touch the virtual button in the event that it is determined that the target web page is likely to be updated. When the operation of clicking or touching the virtual button by the user is detected based on the display screen, it is determined that the acquisition instruction is received.
In some embodiments, the acquisition instruction may be an instruction triggered by a timed task; that is, the steps of detecting whether the target web page has been updated and acquiring the updated resources through the crawler may be triggered on a schedule, which further improves the degree of automation and intelligence of acquiring web page resources through the crawler. In some embodiments, the timed task may trigger the acquisition instruction at intervals of a preset duration. In some implementations, the timed task may trigger the acquisition instruction at specified times. It will be appreciated that in practical applications the timed task may trigger the acquisition instruction in other ways.
S503, judging whether the target webpage is updated or not by comparing the first snapshot with the second snapshot. If so, S504 is executed; otherwise, the method returns to S502.
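A timed task that fires at intervals of a preset duration can be sketched without any scheduler library. IntervalTrigger and poll are hypothetical names, and the clock value is passed in explicitly so that the example is deterministic; a real deployment might pass time.monotonic() or use a system scheduler instead.

```python
class IntervalTrigger:
    """Reports when a preset interval has elapsed since the last firing."""

    def __init__(self, interval_seconds: float, now: float) -> None:
        self.interval = interval_seconds
        self.next_due = now + interval_seconds

    def poll(self, now: float) -> bool:
        """Return True (and re-arm) if the acquisition instruction is due."""
        if now >= self.next_due:
            self.next_due = now + self.interval
            return True
        return False


trigger = IntervalTrigger(interval_seconds=60, now=0)
# Simulated clock readings in seconds; True marks a triggered acquisition.
firings = [trigger.poll(t) for t in (30, 60, 90, 120)]
```

Each True result would correspond to issuing the acquisition instruction of S502.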
The second snapshot is a snapshot in the history of the target webpage. In some embodiments, if the second snapshot is a snapshot of the target web page obtained when S501 is performed, and no snapshot of the target web page is obtained again after S501 is performed and before S502 is performed, the second snapshot is the latest snapshot in the history.
In some embodiments, as shown in fig. 6, a first resource currently included in the target web page may be parsed from the first snapshot, a second resource included in the history of the target web page may be parsed from the second snapshot, and the first resource may be compared with the second resource. If the first resource is the same as the second resource, it is determined that the second resource is still in the target webpage, i.e. the target webpage is not updated. If the first resource is different from the second resource, it is determined that the first resource has been updated in the target web page.
In some implementations, as shown in fig. 7, the edit distance between the first text included in the first resource and the second text included in the second resource can be computed and compared with a preset edit distance. If the edit distance is greater than or equal to the preset edit distance, the first text is determined to be different from the second text. If the edit distance is smaller than the preset edit distance, the first text and the second text are determined to be the same. It can be understood that the preset edit distance may be set by the user in advance; the embodiment of the present application does not limit the manner of setting the preset edit distance or its value.
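The threshold comparison described above can be sketched as follows. The application does not name a particular edit-distance metric; this sketch assumes the Levenshtein distance, and the function names `edit_distance` and `text_changed` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def text_changed(first_text: str, second_text: str, preset_distance: int) -> bool:
    """Texts count as different when the distance reaches the preset threshold."""
    return edit_distance(first_text, second_text) >= preset_distance
```

Because the comparison uses "greater than or equal to", the preset edit distance should be at least 1 in this sketch; otherwise identical texts would also be reported as different.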
In some embodiments, as shown in fig. 7, a first picture link included in a first resource may be compared with a second picture link included in a second resource, so as to determine whether the first picture link and the second picture link are the same.
In some embodiments, as shown in fig. 7, a first uniform resource locator included in a first resource may be compared to a second uniform resource locator included in a second resource to determine whether the first uniform resource locator and the second uniform resource locator are the same.
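The picture-link and URL comparisons above could be sketched as follows. The helper names, and the choice to treat the links as unordered sets, are assumptions made for illustration; the application only states that the links are compared to determine whether they are the same.

```python
def links_changed(first_links, second_links) -> bool:
    """Return True if the two collections of links differ, ignoring order."""
    return set(first_links) != set(second_links)


def added_links(first_links, second_links) -> list:
    """Return links present in the first snapshot but absent from the second."""
    seen = set(second_links)
    return [link for link in first_links if link not in seen]
```

The same two helpers apply to picture links and to uniform resource locators, since both are plain strings in this sketch.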
S504, acquiring updated first resources in the target webpage through the first crawler.
Note that the manner of performing S504 may be the same as or similar to the manner of performing S404.
It should be noted that the first crawler, the third crawler, and the second crawler may be the same crawler or different crawlers.
In the embodiment of the application, all the resources of the target webpage can be acquired through the third crawler, then the first snapshot of the target webpage is triggered and acquired through the timing task or external input, whether the updated resources exist in the target webpage or not is judged based on the first snapshot and the second snapshot, and under the condition that the updated first resources exist in the target webpage, the updated first resources are acquired through the first crawler, so that the cost for acquiring the resources of the target webpage is reduced, and the problem of waste of the crawler resources is solved.
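The control flow of S503 and S504 can be summarized in a short sketch. The callables `parse` and `crawl` are hypothetical stand-ins for the snapshot parser and the first crawler, and plain equality stands in for the resource comparison; none of these names come from the application.

```python
def crawl_if_updated(first_snapshot, second_snapshot, parse, crawl):
    """Crawl the first resource only when it differs from the historical one."""
    first_resource = parse(first_snapshot)
    second_resource = parse(second_snapshot)
    if first_resource == second_resource:
        return None  # page not updated: skip the crawl and wait for the next trigger
    return crawl(first_resource)
```

Skipping the crawl when nothing has changed is what avoids the wasted crawler resources mentioned above.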
Fig. 8 shows a flowchart of a crawler method according to an embodiment of the present application. In some implementations, the third text can include all or part of the text in the target web page. In some embodiments, the method may be performed after the method shown in fig. 4 or fig. 5 acquires the first resource, and the third text may include all or part of the first text, or the third text may be identical to the first text. The method comprises the following steps:
S801, determining a text category corresponding to the third text.
The text category can be used for explaining the characteristics of the text, and is convenient for storing, analyzing, managing, searching and the like. In some implementations, multiple text categories may be determined in accordance with a preset dimension or dimensions. For example, text categories may be determined to include news, literature, records, etc., based on the specific content of the text; alternatively, the text category is determined to include chinese and english based on the language in which the text is in. It should be noted that, in the embodiment of the present application, the dividing manner of the text category and the specific content included in the text category are not limited.
In some embodiments, the third text may be input to the text classification network model resulting in a text category output by the text classification network model. The text classification network model may be configured to extract text features included in the third text, and determine a text category corresponding to the third text from a plurality of text categories based on the text features. That is, the third text can be classified automatically and intelligently through the text classification network model, so that the cost for classifying the third text is further reduced.
In some embodiments, a text set may be obtained in advance, where the text set includes a plurality of texts, each text is labeled with a corresponding text category, each text is input into an initial text classification network model, a text category output by the initial text classification network model is obtained, and model parameters included in the initial text classification network model are updated based on a difference between the text category output by the initial text classification network model and the text category labeled by the text, so as to train to obtain the text classification network model.
In some implementations, the text classification network model can be a transformer-based neural network. In some implementations, as shown in fig. 9, the text classification network model may include an embedding layer, a transformer layer, and a softmax layer. The third text may be input to the embedding layer to obtain a feature matrix of the third text output by the embedding layer. The feature matrix is input into the transformer layer to obtain the text features output by the transformer layer. The text features are output to the softmax layer to obtain the text category output by the softmax layer.
In some embodiments, as shown in fig. 10, the transformer layer may include a multi-head attention layer, a normalization (add & layer normalization) layer, and a feed-forward layer. The normalization layer can be used to reduce overfitting of the text classification network model and improve the accuracy of text classification.
In some embodiments, based on the attention mechanism, the data of the encoder part is first input into the multi-head attention layer, which is made up of a plurality of self-attention units. Each value input to self-attention is passed through three different layers to form three vectors. The vector output by the multi-head attention layer and the vector initially input then pass through the normalization layer, in which the add operation sums the results of the two layers and the normalization operation performs layer normalization, which can reduce overfitting of the text classification network model. The output of the normalization layer is then input to the feed-forward layer, which may add a nonlinear transformation to the output of the multi-head attention layer. The output of the feed-forward layer and the output of the multi-head attention layer are then passed through the normalization layer again, finally outputting text feature information that can be used in the transformer.
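To make the attention and normalization steps concrete, the following is a simplified sketch in plain Python. It assumes a single attention head with scaled dot-product attention, names the three projected vectors query, key, and value as is conventional, and omits the learned gain and bias of layer normalization; all function names and weight matrices here are illustrative assumptions rather than the model of the application.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    scores = matmul(q, k_t)
    weights = [softmax([s / scale for s in row]) for row in scores]
    return matmul(weights, v)


def add_and_norm(x, sublayer_out, eps=1e-6):
    """Residual addition followed by per-row layer normalization."""
    result = []
    for x_row, s_row in zip(x, sublayer_out):
        summed = [a + b for a, b in zip(x_row, s_row)]
        mean = sum(summed) / len(summed)
        var = sum((s - mean) ** 2 for s in summed) / len(summed)
        result.append([(s - mean) / math.sqrt(var + eps) for s in summed])
    return result
```

A multi-head layer would run several such attention computations with different weight matrices and concatenate their outputs before the add and normalization step.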
In some implementations, the softmax layer may obtain a probability value for a text category of the third text. The softmax function may be

$$\mathrm{softmax}(x_i) = \frac{e^{x_i}}{\sum_{j=1}^{N} e^{x_j}}$$

where $x_i$ represents the output of the output unit; $i$ and $j$ represent text category indexes; and $N$ represents the total number of text categories.
In some embodiments, the loss function of the text classification network model may be a cross-entropy loss function, as follows:

$$L = -\sum_{n=1}^{N} y_n \log \hat{y}_n$$

where $N$ represents the number of text categories; $y$ represents the true one-hot label, an $N$-dimensional vector in which the entry for the true class is 1 and the rest are 0; and $\hat{y}$ represents the predicted $N$-dimensional probability distribution.
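The softmax and cross-entropy computations described above can be sketched in a few lines. This is a minimal illustration using the standard definitions; the function names are hypothetical, and subtracting the maximum before exponentiating is an added numerical-stability step that does not change the result.

```python
import math


def softmax(logits):
    """Convert output-unit values into a probability distribution over categories."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(one_hot, probs):
    """Cross-entropy between a one-hot label and a predicted distribution."""
    return -sum(y * math.log(p) for y, p in zip(one_hot, probs) if y > 0)
```

With a one-hot label, only the term for the true class contributes, so the loss reduces to the negative log of the probability assigned to that class.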
In some embodiments, the third text may be pre-processed prior to being input into the embedding layer, including padding the third text to a preset length through a padding (padding) layer when the text length of the third text is less than the preset length, and then outputting the padded third text to the embedding layer.
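The padding step might look like the following sketch. It assumes the text has already been split into tokens and uses a hypothetical `[PAD]` placeholder token; the application does not specify either detail, nor does it say what happens to texts longer than the preset length, so those are returned unchanged here.

```python
def pad_tokens(tokens, preset_length, pad_token="[PAD]"):
    """Pad a token list on the right until it reaches the preset length."""
    if len(tokens) >= preset_length:
        return list(tokens)  # already long enough: leave as is
    return list(tokens) + [pad_token] * (preset_length - len(tokens))
```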
S802, storing a third text based on the text category.
The third text may be classified and stored by the text category of the third text. In some implementations, the third text can be stored to a storage location corresponding to the text category.
In some implementations, the third text may be stored in the ES 270 and/or the MySQL database 260 shown in fig. 2.
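Storing text by category can be illustrated with an in-memory stand-in. The class `CategoryStore` and its methods are hypothetical and only mimic the idea of one storage location per text category; they are not the ES or MySQL storage referred to above.

```python
from collections import defaultdict


class CategoryStore:
    """Keeps one bucket of texts per text category."""

    def __init__(self):
        self._buckets = defaultdict(list)

    def save(self, text: str, category: str) -> None:
        """Store the text in the bucket for its category."""
        self._buckets[category].append(text)

    def load(self, category: str) -> list:
        """Return a copy of all texts stored under the category."""
        return list(self._buckets.get(category, []))
```

Grouping by category at write time keeps later searching and analysis per category simple, which is the purpose the text category serves in S801.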
According to the embodiment of the application, the text category corresponding to the third text can be determined through the neural network based on the transformer, and the third text is stored based on the text category, namely, the automatic classification and storage of the third text crawled from the target webpage are realized, the classification efficiency and accuracy are improved, and the labor cost is reduced.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and the sequence numbers should not limit the implementation process of the embodiments of the present application.
FIG. 11 illustrates a schematic diagram of a crawler apparatus in accordance with an embodiment of the present application. The crawler apparatus 1100 includes:
the first obtaining module 1101 is configured to obtain a first snapshot of the target web page and a second snapshot in the history of the target web page.
The parsing module 1102 is configured to parse a first resource included in the target web page based on the first snapshot, and parse a second resource included in the target web page history based on the second snapshot.
The comparing module 1103 is configured to compare the first resource with the second resource.
The second obtaining module 1104 is configured to obtain, by the first crawler, the first resource in the target web page if the first resource and the second resource are different.
In some embodiments, the comparing module 1103 is further configured to:
comparing the editing distance between a first text included in the first resource and a second text included in the second resource;
comparing a first picture link included in the first resource with a second picture link included in the second resource;
comparing a first uniform resource locator (URL) included in the first resource with a second uniform resource locator included in the second resource.
In some embodiments, the first acquisition module 1101 is further configured to:
and responding to the acquisition instruction, acquiring a first snapshot of the target webpage, wherein the acquisition instruction is an external input or timing task triggered instruction.
In some implementations, the first resource includes a third text, and the crawler apparatus 1100 is further configured to:
determining a text category corresponding to the third text;
the third text is stored based on the text category.
In some implementations, crawler apparatus 1100 is also to:
inputting the third text into the text classification network model to obtain the text category output by the text classification network model;
the text classification network model is used for extracting text features included in the third text, and determining text categories corresponding to the third text from a plurality of text categories based on the text features.
In some implementations, the text classification network model includes an embedding layer, a transformer layer, and a softmax layer; when inputting the third text into the text classification network model to obtain the text category output by the text classification network model, the crawler apparatus 1100 is further configured to:
inputting the third text to the embedding layer to obtain a feature matrix of the third text output by the embedding layer;
inputting the feature matrix into the transformer layer to obtain text features output by the transformer layer;
and outputting the text features to the softmax layer to obtain the text category output by the softmax layer.
Fig. 12 shows a schematic diagram of a terminal device according to an embodiment of the present application. As shown in fig. 12, the terminal device 12 of this embodiment includes: a processor 1200, a memory 1201, and a computer program 1202 stored in the memory 1201 and executable on the processor 1200, such as a program for retrieving a target web page. The processor 1200, when executing the computer program 1202, implements the steps of the crawler method embodiments described above, such as steps S401 to S404 shown in fig. 4. Alternatively, the processor 1200 may implement the functions of the modules in the embodiments of the crawler apparatus described above when executing the computer program 1202, for example, the functions of the first obtaining module 1101 to the second obtaining module 1104 shown in fig. 11.
By way of example, the computer program 1202 may be partitioned into one or more modules that are stored in the memory 1201 and executed by the processor 1200 to perform the present application. The one or more modules may be a series of computer program instruction segments capable of performing particular functions for describing the execution of the computer program 1202 in the terminal device 12. For example, the computer program 1202 may be partitioned into a synchronization module, a summary module, an acquisition module, a return module (a module in a virtual device).
The terminal device 12 may be a desktop computer, a notebook computer, a palm computer, a cloud server, or the like. The terminal device may include, but is not limited to, a processor 1200, a memory 1201. It will be appreciated by those skilled in the art that fig. 12 is merely an example of terminal device 12 and is not intended to limit terminal device 12, and may include more or fewer components than shown, or may combine certain components, or different components, e.g., the terminal device may further include an input-output device, a network access device, a bus, etc.
The processor 1200 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor or any conventional processor.
The memory 1201 may be an internal storage unit of the terminal device 12, such as a hard disk or a memory of the terminal device 12. The memory 1201 may be an external storage device of the terminal device 12, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card) or the like, which are provided on the terminal device 12. Further, the memory 1201 may also include both an internal storage unit and an external storage device of the terminal device 12. The memory 1201 is used to store the computer program and other programs and data required by the terminal device. The memory 1201 may also be used to temporarily store data that has been output or is to be output.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, i.e. the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-described functions. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts not described or illustrated in detail in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed terminal device and method may be implemented in other manners. For example, the above-described embodiments of the terminal device are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated modules, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the present application may implement all or part of the flow of the methods of the above embodiments by instructing related hardware through a computer program, where the computer program may be stored in a computer readable storage medium and, when executed by a processor, may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth. It should be noted that the content contained in the computer readable medium may be appropriately increased or decreased according to the requirements of legislation and patent practice in a jurisdiction; for example, in certain jurisdictions, in accordance with legislation and patent practice, the computer readable medium does not include electrical carrier signals and telecommunications signals.
The above embodiments are only for illustrating the technical solution of the present application, and not for limiting the same; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the scope of the present application.

Claims (10)

1. A crawler method, comprising:
acquiring a first snapshot of a target webpage and a second snapshot in the history of the target webpage;
resolving first resources included in the target webpage based on the first snapshot, and resolving second resources included in the target webpage history based on the second snapshot;
comparing the first resource with the second resource;
and if the first resource and the second resource are different, acquiring the first resource in the target webpage through a first crawler.
2. The method of claim 1, wherein the comparing the first resource with the second resource comprises at least one of:
Comparing the editing distance between a first text included in the first resource and a second text included in the second resource;
comparing a first picture link included in the first resource with a second picture link included in the second resource;
comparing a first uniform resource locator included in the first resource with a second uniform resource locator included in the second resource.
3. The method of claim 1, wherein the obtaining a first snapshot of the target web page comprises:
and responding to an acquisition instruction, wherein the acquisition instruction is an instruction triggered by an external input or a timing task, and acquiring the first snapshot of the target webpage.
4. A method as recited in any of claims 1-3, wherein the first resource comprises a third text, the method further comprising:
determining a text category corresponding to the third text;
and storing the third text based on the text category.
5. The method of claim 4, wherein the determining the text category to which the third text corresponds comprises:
inputting the third text into a text classification network model to obtain the text category output by the text classification network model;
The text classification network model is used for extracting text features included in the third text, and determining the text category corresponding to the third text from a plurality of text categories based on the text features.
6. The method of claim 5, wherein the text classification network model comprises an embedding layer, a transformer layer, and a softmax layer;
the step of inputting the third text into a text classification network model to obtain the text category output by the text classification network model comprises the following steps:
inputting the third text to the embedding layer to obtain a feature matrix of the third text output by the embedding layer;
inputting the feature matrix to the transformer layer to obtain the text features output by the transformer layer;
and outputting the text features to the softmax layer to obtain the text category output by the softmax layer.
7. A crawler apparatus, comprising:
the first acquisition module is used for acquiring a first snapshot of a target webpage and a second snapshot in the history of the target webpage;
the analysis module is used for analyzing first resources included in the target webpage based on the first snapshot and analyzing second resources included in the target webpage history based on the second snapshot;
The comparison module is used for comparing the first resource with the second resource;
and the second acquisition module is used for acquiring the first resource in the target webpage through the first crawler if the first resource and the second resource are different.
8. The apparatus of claim 7, wherein the comparison module is specifically configured to:
comparing the editing distance between a first text included in the first resource and a second text included in the second resource;
comparing a first picture link included in the first resource with a second picture link included in the second resource;
comparing a first uniform resource locator URL link included in the first resource with a second uniform resource locator link included in the second resource.
9. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 6 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 6.
CN202310866318.7A 2023-07-13 2023-07-13 Crawler method, crawler device, terminal equipment and computer readable storage medium Pending CN116975407A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310866318.7A CN116975407A (en) 2023-07-13 2023-07-13 Crawler method, crawler device, terminal equipment and computer readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310866318.7A CN116975407A (en) 2023-07-13 2023-07-13 Crawler method, crawler device, terminal equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN116975407A true CN116975407A (en) 2023-10-31

Family

ID=88474276

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310866318.7A Pending CN116975407A (en) 2023-07-13 2023-07-13 Crawler method, crawler device, terminal equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN116975407A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination