CN111506787A - Webpage updating method and device, electronic equipment and computer-readable storage medium - Google Patents

Webpage updating method and device, electronic equipment and computer-readable storage medium

Info

Publication number
CN111506787A
Authority
CN
China
Prior art keywords
webpage
cloud
web page
updating
characteristic identification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010153288.1A
Other languages
Chinese (zh)
Other versions
CN111506787B (en
Inventor
刘俊启
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202010153288.1A
Publication of CN111506787A
Application granted
Publication of CN111506787B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation
    • G06F16/955: Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Abstract

The application discloses a method and a device for updating a web page, an electronic device, and a computer-readable storage medium, and relates to the technical field of search engines. The scheme adopted when the web page is updated on the server side is as follows: after a web page is captured, a cloud feature identifier of the web page is generated and associated with the web page; after feedback information sent by a client is received, the cloud feature identifier corresponding to the feedback information is obtained; and when it is determined that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information, the original web page is replaced with the re-captured web page and the cloud feature identifier associated with the web page is updated. The application can improve the timeliness of web page updating and effectively save computing resources on the server side.

Description

Webpage updating method and device, electronic equipment and computer-readable storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to a method and an apparatus for updating a web page, an electronic device, and a computer-readable storage medium in the field of search engine technologies.
Background
With the rapid popularization of intelligent terminals, the mobile internet has become the main way for users to obtain information. Accordingly, mobile search is replacing PC search as the primary way users access search engines. As the internet develops, the number of search users keeps growing, and users demand increasingly timely information. However, current search engines usually update web pages only through the corresponding server side. As the scale of web pages grows, the computing resources the server side needs to update them also grow, and if the server side's existing computing resources are limited, the timeliness of web page updating is greatly reduced.
Disclosure of Invention
To solve the technical problem, the present application provides a method for updating a web page, including: after the server side captures a web page, generating a cloud feature identifier of the web page and associating the cloud feature identifier with the web page; after receiving feedback information sent by a client, obtaining the cloud feature identifier corresponding to the feedback information; and when it is determined that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information, replacing the original web page with the re-captured web page and updating the cloud feature identifier associated with the web page. The application can improve the timeliness of web page updating and effectively save computing resources on the server side.
According to a preferred embodiment of the present application, generating the cloud feature identifier of the web page includes: determining a feature extraction rule corresponding to the web page; and extracting features of the web page according to the feature extraction rule, and generating the cloud feature identifier of the web page from the extracted features. This can improve the accuracy of web page feature extraction.
According to a preferred embodiment of the present application, the method further includes: receiving a rule query request sent by a client; and determining the feature extraction rule corresponding to the rule query request, and sending the determined feature extraction rule to the client. This ensures that the client and the server side use the same feature extraction rule to extract features of the same web page, thereby improving the accuracy of web page updating.
According to a preferred embodiment of the present application, determining that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information includes: calculating the matching degree between the cloud feature identifier and the local feature identifier; and determining whether the calculated matching degree exceeds a preset threshold; if so, the two are determined to match, and otherwise the two are determined not to match.
According to a preferred embodiment of the present application, before replacing the original web page with the re-captured web page and updating the cloud feature identifier associated with the web page, the method further includes: generating an updated feature identifier from the re-captured web page; and determining whether the updated feature identifier is the same as the original cloud feature identifier. If they differ, the operation of replacing the original web page with the re-captured web page and updating the cloud feature identifier associated with the web page continues; if they are the same, attribute information of the client is obtained and the web page is associated with the attribute information. This step enables finer-grained indexing of web pages, avoids erroneous web page updates, and improves the accuracy of web page updating.
According to a preferred embodiment of the present application, updating the cloud feature identifier associated with the web page includes: generating an updated feature identifier from the re-captured web page; and updating the cloud feature identifier associated with the web page to the updated feature identifier. This can improve the update accuracy of the cloud feature identifier.
To solve the technical problem, the present application further provides a device for updating a web page. The device is located on the server side and includes: a processing unit, configured to generate a cloud feature identifier of a web page after the web page is captured, and to associate the cloud feature identifier with the web page; an obtaining unit, configured to obtain the cloud feature identifier corresponding to feedback information after receiving the feedback information sent by a client; and an updating unit, configured to replace the original web page with the re-captured web page and update the cloud feature identifier associated with the web page when it is determined that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information.
According to a preferred embodiment of the present application, when generating the cloud feature identifier of the web page, the processing unit specifically performs: determining a feature extraction rule corresponding to the web page; and extracting features of the web page according to the feature extraction rule, and generating the cloud feature identifier of the web page from the extracted features.
According to a preferred embodiment of the present application, the processing unit further performs: receiving a rule query request sent by a client; and determining the feature extraction rule corresponding to the rule query request, and sending the determined feature extraction rule to the client.
According to a preferred embodiment of the present application, when determining that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information, the updating unit specifically performs: calculating the matching degree between the cloud feature identifier and the local feature identifier; and determining whether the calculated matching degree exceeds a preset threshold; if so, the two are determined to match, and otherwise the two are determined not to match.
According to a preferred embodiment of the present application, before replacing the original web page with the re-captured web page and updating the cloud feature identifier associated with the web page, the updating unit further performs: generating an updated feature identifier from the re-captured web page; and determining whether the updated feature identifier is the same as the original cloud feature identifier. If they differ, the operation of replacing the original web page with the re-captured web page and updating the cloud feature identifier associated with the web page continues; if they are the same, attribute information of the client is obtained and the web page is associated with the attribute information.
According to a preferred embodiment of the present application, when updating the cloud feature identifier associated with the web page, the updating unit specifically performs: generating an updated feature identifier from the re-captured web page; and updating the cloud feature identifier associated with the web page to the updated feature identifier.
One embodiment in the above application has the following advantages or benefits: according to the method and the device, the timeliness of webpage updating can be improved, and the computing resources of the server side are effectively saved. Because the technical means that the server is driven by the client to update the webpage through the interaction between the server and the client is adopted, the technical problem of low timeliness caused by limited computing resources when the server updates the webpage in the prior art is solved, and therefore the technical effects of improving the timeliness of webpage updating and saving the computing resources of the server are achieved.
Other effects of the above-described alternative will be described below with reference to specific embodiments.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
FIG. 1 is a flowchart of a method for updating a web page executed in a server according to a first embodiment of the present application;
FIG. 2 is a block diagram of an apparatus for updating a web page located in a server according to a second embodiment of the present application;
FIG. 3a is an architecture diagram of a prior art search engine for web page crawling;
FIG. 3b is a frame diagram of a search engine for updating a web page according to a third embodiment of the present application;
FIG. 4 is a block diagram of an electronic device for implementing a method for web page update according to an embodiment of the present application.
Detailed Description
The following description of the exemplary embodiments of the present application, taken in conjunction with the accompanying drawings, includes various details of the embodiments of the application for the understanding of the same, which are to be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present application. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
FIG. 1 is a flowchart of a method for updating a web page according to a first embodiment of the present application. As shown in FIG. 1, the method is executed on the server side and includes:
In S101, after a web page is captured, a cloud feature identifier of the web page is generated, and the cloud feature identifier is associated with the web page.
In this step, the server side first captures the web page, generates the cloud feature identifier of the captured web page, and then associates the generated cloud feature identifier with the captured web page. The server side in the application is the server side of the search engine, namely the server side of the search engine analyzes the webpage after capturing the webpage to generate the feature identifier.
It can be understood that the server side in this step may use a web crawler to capture web pages in the network, so as to store the captured web pages for presentation to the search user.
Specifically, when the cloud feature identifier of the webpage is generated in this step, the following modes can be used: determining a feature extraction rule corresponding to the captured webpage; and extracting the characteristics of the webpage according to the determined characteristic extraction rule, and generating a cloud characteristic identification of the webpage by using the extracted characteristics. The method for generating the feature identifier by using the features of the web page is not limited, and for example, after the preset processing such as abstraction, assembly, encryption and decryption, compression and the like is performed on the features of the web page, the processing result is used as the feature identifier of the web page.
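As an illustration of the generation step described above, the following is a minimal sketch, not the method claimed by the application. It assumes that the extracted features are held as a name-to-text mapping, that "assembly" means joining them in a fixed order, and that the digest is SHA-256; the function name make_feature_id and the sample feature names are hypothetical.

```python
import hashlib

def make_feature_id(features):
    """Assemble extracted features in a fixed order and digest the result.

    `features` maps a feature name (e.g. "title") to its extracted text.
    SHA-256 is an assumed choice; the application does not fix the processing.
    """
    # Sort by name so the same features always yield the same identifier.
    assembled = "\n".join(f"{name}={features[name]}" for name in sorted(features))
    return hashlib.sha256(assembled.encode("utf-8")).hexdigest()

# Hypothetical features extracted from a captured web page.
page_features = {"title": "Example news", "body": "Some body text", "sublinks": "a.html,b.html"}
cloud_id = make_feature_id(page_features)
```

Because the features are sorted before being joined, the identifier does not depend on the order in which features were extracted, which helps the server side and the client arrive at the same value for an unchanged page.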
The feature extraction rules determined in this step may be unified rules, that is, different web pages correspond to the same feature extraction rules. For example, the feature extraction rule in this step may be that, for different web pages, the title, the text, and the sub-links of the web page are all extracted as features of the web page.
As the types of web pages are more and more abundant, when the same feature extraction rule is used to extract features of different types of web pages, the accuracy of web page feature extraction is reduced. Therefore, when determining the feature extraction rule corresponding to the webpage, the following method may be adopted: acquiring attribute information of a webpage, wherein the acquired attribute information can comprise a webpage name, a webpage type and the like; and determining a feature extraction rule corresponding to the acquired attribute information. That is to say, different feature extraction rules are preset for different webpages in the step, so that the corresponding feature extraction rules are used for extracting features for different webpages, and the accuracy of webpage feature extraction is improved.
For example, for information-based web pages, the corresponding feature extraction rule may be to extract all the text in the body as web page features, or to extract several text segments in the body as web page features; for the web pages of the portal class, the corresponding feature extraction rule can be used for extracting the sub-links as the web page features; for the shopping web pages, the corresponding extraction rule can be to extract pictures as web page features.
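The per-type rules in the examples above can be sketched as a simple lookup table. This is an illustrative sketch under stated assumptions: the type labels ("information", "portal", "shopping"), the feature names, the ExtractionRule class, and the fallback to a unified rule for unknown types are all hypothetical choices, not details given by the application.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ExtractionRule:
    feature_types: tuple  # names of the features to extract from a page

# Unified rule: title, body text and sub-links for every page.
UNIFIED_RULE = ExtractionRule(("title", "body", "sublinks"))

# Assumed mapping from web page type to its preset rule.
RULES_BY_TYPE = {
    "information": ExtractionRule(("body",)),
    "portal": ExtractionRule(("sublinks",)),
    "shopping": ExtractionRule(("images",)),
}

def rule_for(page_type):
    """Return the preset rule for a page type, or the unified rule if none is set."""
    return RULES_BY_TYPE.get(page_type, UNIFIED_RULE)

def extract(page, rule):
    """Pick the features named by the rule out of an already parsed page."""
    return {name: page.get(name, "") for name in rule.feature_types}
```

The same table could also serve a rule query request from the client: the server side looks up the rule by the attribute information in the request and returns it, so both sides extract the same features.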
In order to improve the accuracy of updating a web page and ensure that the server and the client use the same feature extraction rule for extracting features of the same web page, the method may further include the following steps: receiving a rule query request sent by a client, wherein the rule query request can contain attribute information of a webpage; and determining a feature extraction rule corresponding to the received rule query request, and sending the determined feature extraction rule to the client for the client to extract features and generate a local feature identifier of the webpage. That is to say, the step can also issue the feature extraction rule according to the query request sent by the client, so as to improve the accuracy of the client for generating the local feature identifier.
In addition, the client can also store the feature extraction rule corresponding to the webpage in advance locally, so that the client can obtain the feature extraction rule corresponding to the webpage locally without interacting with the server.
It can be understood that the feature extraction rule issued by the server side to the client, or the feature extraction rule pre-stored locally by the client, may include, in addition to the types of features to be extracted from the web page, constraints on extracting features from web pages opened by the client. The client extracts features from a web page only when these constraints are satisfied. The constraints may specify, for example, that some web pages do not need feature extraction, that some web pages need feature extraction in real time, and how many times per day feature extraction is performed on some web pages.
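A client-side check of such constraints might look like the sketch below. The ExtractionConstraint fields and the function should_extract are hypothetical names introduced for illustration, and modelling "real time" as the absence of a daily cap is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ExtractionConstraint:
    skip: bool = False                 # page never needs feature extraction
    max_per_day: Optional[int] = None  # None is taken to mean "extract on every open"

def should_extract(constraint, extractions_today):
    """Decide whether the client should extract features for this page open."""
    if constraint.skip:
        return False
    if constraint.max_per_day is None:
        return True  # real-time extraction, no daily cap
    return extractions_today < constraint.max_per_day
```

For example, with max_per_day=3 the client would extract features on the first three opens of the page in a day and stay silent afterwards.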
Specifically, in this step, when the cloud feature identifier is associated with the web page, the cloud feature identifier may be associated with the Uniform Resource Locator (URL) of the web page, and the URL of the web page and the associated cloud feature identifier may be stored.
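The association between a web page address and its cloud feature identifier can be sketched as a keyed store. The class FeatureIndex and its in-memory dictionary are assumptions made for illustration; a real search engine back end would presumably use persistent storage.

```python
class FeatureIndex:
    """In-memory association between a page address and its cloud feature identifier."""

    def __init__(self):
        self._by_url = {}

    def associate(self, url, cloud_id):
        # Store (or overwrite) the identifier associated with this address.
        self._by_url[url] = cloud_id

    def lookup(self, url):
        # Return the associated identifier, or None if the page is unknown.
        return self._by_url.get(url)

index = FeatureIndex()
index.associate("https://example.com/news/1", "3f2a...")  # placeholder identifier
```

Keying the store by address is what later allows the server side to find the cloud feature identifier directly from the address carried in a client's feedback information.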
In S102, after receiving the feedback information sent by the client, a cloud feature identifier corresponding to the feedback information is obtained.
In this step, after receiving the feedback information sent by the client, the cloud feature identifier corresponding to the received feedback information is obtained, that is, the cloud feature identifier associated with the webpage currently opened by the client is obtained.
In addition, the feedback information received in this step may further include attribute information of the client, such as geographic information where the client is located, network information used by the client, and the like.
Since different web pages have their own URLs, a unique web page can be determined from the URL in the feedback information. Because that web page is associated with its cloud feature identifier, the cloud feature identifier can be obtained through the determined web page.
That is to say, after a user opens a web page in a search result through a client, the client performs feature extraction according to the feature extraction rule corresponding to the web page to generate a local feature identifier of the web page, then generates feedback information from the generated local feature identifier and the URL of the web page, and finally sends the generated feedback information to the server side, so that the server side, after obtaining the cloud feature identifier, updates the web page by comparing the feature identifiers.
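The client-side flow just described can be sketched as follows. The field names in the feedback dictionary, the function build_feedback, and the use of JSON plus SHA-256 to form the local feature identifier are assumptions for illustration; the application does not define a message format.

```python
import hashlib
import json

def build_feedback(url, page, feature_types, client_attrs=None):
    """Extract the features named by the rule, digest them, and wrap them as feedback."""
    features = {name: page.get(name, "") for name in feature_types}
    # Serialize with sorted keys so identical features give an identical digest.
    assembled = json.dumps(features, sort_keys=True, ensure_ascii=False)
    local_id = hashlib.sha256(assembled.encode("utf-8")).hexdigest()
    feedback = {"url": url, "local_feature_id": local_id}
    if client_attrs:
        # Optional client attribute information, e.g. region or network type.
        feedback["client_attributes"] = client_attrs
    return feedback
```

The client would then send the returned dictionary to the server side over whatever transport the search application already uses.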
In S103, when it is determined that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information, the original web page is replaced with the newly captured web page, and the cloud feature identifier associated with the web page is updated.
In this step, after the cloud feature identifier of the web page is obtained in step S102, it is compared with the local feature identifier generated by the client and carried in the feedback information to determine whether the two match. If they match, the web page has not changed and does not need to be re-captured. If they do not match, the web page has changed, so the corresponding web page is re-captured to replace the original web page, and the cloud feature identifier associated with the web page is updated.
That is to say, in this step the re-capture of the web page on the server side is driven by the client: the client sends feedback information to the server side after opening the web page, which improves the timeliness with which the server side updates the web page.
Specifically, when determining that the cloud feature identifier of the webpage is not matched with the local feature identifier in the feedback information, the following method may be adopted: calculating the matching degree between the cloud characteristic identification and the local characteristic identification; and determining whether the calculated matching degree exceeds a preset threshold value, if so, determining that the two are matched, otherwise, determining that the two are not matched.
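The application does not specify how the matching degree is computed. The sketch below assumes, purely for illustration, that each feature identifier is a tuple of per-feature digests and that the matching degree is the fraction of positions that agree; the function names and the threshold value 0.8 are likewise hypothetical.

```python
def matching_degree(cloud_id, local_id):
    """Fraction of identifier components that are equal, in the range 0.0 to 1.0."""
    if not cloud_id and not local_id:
        return 1.0  # two empty identifiers are treated as identical
    same = sum(1 for c, l in zip(cloud_id, local_id) if c == l)
    return same / max(len(cloud_id), len(local_id))

def is_match(cloud_id, local_id, threshold=0.8):
    """The identifiers match only if the matching degree exceeds the threshold."""
    return matching_degree(cloud_id, local_id) > threshold
```

A threshold below 1.0 lets small differences, such as one changed feature out of many, be tolerated without triggering a re-capture.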
It can be understood that, because the local feature identifier generated by the client corresponds to the changed web page, when the cloud feature identifier associated with the web page is updated in this step, the cloud feature identifier associated with the web page can be directly updated to the local feature identifier sent by the client, that is, the local feature identifier is used to replace the original cloud feature identifier.
In order to further improve the update accuracy of the cloud feature identifier, the following method can be further adopted when the cloud feature identifier associated with the webpage is updated in the step: generating an updating feature identifier according to the re-captured webpage; and updating the cloud characteristic identification associated with the webpage into an updated characteristic identification, namely replacing the original cloud characteristic identification with the updated characteristic identification. Therefore, the problem of inaccurate updating of the cloud characteristic identification caused by rapid change of the webpage can be avoided, and the updating accuracy of the webpage is improved.
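The update path in which the cloud feature identifier is regenerated from the re-captured page can be sketched as below. The store layout, the functions feature_id and handle_feedback, the exact-equality comparison, and the fetch callable standing in for the crawler are all assumptions for illustration.

```python
import hashlib

def feature_id(page_text):
    # Assumed identifier: SHA-256 over the whole page text.
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def handle_feedback(store, feedback, fetch):
    """Re-capture and replace the stored page when the identifiers differ.

    `store` maps an address to {"page": ..., "cloud_id": ...};
    `fetch` is a caller-supplied function that re-captures a page by address.
    Returns True if the stored page was replaced.
    """
    entry = store.get(feedback["url"])
    if entry is None or entry["cloud_id"] == feedback["local_feature_id"]:
        return False  # unknown page, or page unchanged: nothing to do
    fresh = fetch(feedback["url"])
    entry["page"] = fresh
    # Regenerate the identifier from the re-captured page rather than
    # trusting the client's value, in case the page changed again meanwhile.
    entry["cloud_id"] = feature_id(fresh)
    return True
```

Passing the crawler in as a function keeps the sketch independent of any particular HTTP library.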
In some current application scenarios, in order to improve the diversity of web page display, different display modes may exist for different users in the same web page. For example, for the same web page, the presentation at the location a is different from the presentation at the location B, i.e., the web page presentation has region differentiation.
Therefore, to avoid erroneous web page updates caused by display differentiation of the same web page, and to index web pages at a finer granularity, the following may be performed in this step before replacing the original web page with the re-captured web page and updating the cloud feature identifier associated with the web page: generating an updated feature identifier from the re-captured web page; and determining whether the updated feature identifier is the same as the original cloud feature identifier. If they differ, the operation of replacing the original web page with the re-captured web page and updating the cloud feature identifier associated with the web page continues. If they are the same, attribute information of the client is obtained and the web page is associated with the obtained attribute information, so that targeted search results can be output to users with the same attribute information.
For example, if the display form a of a certain web page is associated with the location a, and the display form B of the web page is associated with the location B, if the user at the location a opens the web page, the web page with the display form a is displayed to the user at the location a, and if the user at the location B opens the web page, the web page with the display form B is displayed to the user at the location B.
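The check for display differentiation can be sketched as follows. The entry layout, the "region" attribute key, the "variants" field, and the function names are hypothetical, and recording the client's local feature identifier per region is only one assumed way of associating the web page with client attribute information.

```python
import hashlib

def make_id(page_text):
    # Assumed identifier: SHA-256 over the page text.
    return hashlib.sha256(page_text.encode("utf-8")).hexdigest()

def resolve_mismatch(entry, feedback, fetch):
    """Called after the cloud and local identifiers were found not to match."""
    fresh = fetch(feedback["url"])
    updated_id = make_id(fresh)
    if updated_id != entry["cloud_id"]:
        # The page really changed: replace it and update the identifier.
        entry["page"] = fresh
        entry["cloud_id"] = updated_id
        return "replaced"
    # The server side still sees the same page, so the client's different view
    # is treated as a display variant tied to the client's attributes.
    region = feedback.get("client_attributes", {}).get("region")
    entry.setdefault("variants", {})[region] = feedback["local_feature_id"]
    return "variant_recorded"
```

With such a record, a later search from a client in the same region could be served the result associated with that region's display form.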
According to the webpage updating method and device, the server side is driven by the client side to update the webpage through interaction between the server side and the client side, on one hand, timeliness of webpage updating can be improved, on the other hand, a basic framework of the server side can be reserved, and computing resources of the server side are effectively saved.
Fig. 2 is a block diagram of a device for updating a web page according to a second embodiment of the present application, as shown in fig. 2, the device is located in a server, and includes: a processing unit 201, an acquisition unit 202, and an update unit 203.
The processing unit 201 is configured to, after a webpage is captured, generate a cloud feature identifier of the webpage, and associate the cloud feature identifier with the webpage.
The processing unit 201 first captures a web page, generates a cloud feature identifier of the captured web page, and then associates the generated cloud feature identifier with the captured web page. The server side in the application is the server side of the search engine, namely the server side of the search engine analyzes the webpage after capturing the webpage to generate the feature identifier.
It is understood that the processing unit 201 may use a web crawler to crawl web pages in the network, and thus save the crawled web pages for presentation to the searching user.
Specifically, when the processing unit 201 generates the cloud feature identifier of the web page, the following method may be used: determining a feature extraction rule corresponding to the captured webpage; and extracting the characteristics of the webpage according to the determined characteristic extraction rule, and generating a cloud characteristic identification of the webpage by using the extracted characteristics. The method for generating the feature identifier by using the features of the web page is not limited, and for example, after the preset processing such as abstraction, assembly, encryption and decryption, compression and the like is performed on the features of the web page, the processing result is used as the feature identifier of the web page.
The feature extraction rule determined by the processing unit 201 may be a unified rule, that is, different web pages correspond to the same feature extraction rule.
As the types of web pages are more and more abundant, when the same feature extraction rule is used to extract features of different types of web pages, the accuracy of web page feature extraction is reduced. Therefore, when determining the feature extraction rule corresponding to the web page, the processing unit 201 may further adopt the following manner: acquiring attribute information of a webpage, wherein the acquired attribute information can comprise a webpage name, a webpage type and the like; and determining a feature extraction rule corresponding to the acquired attribute information. That is to say, the processing unit 201 sets different feature extraction rules in advance for different webpages, so that the corresponding feature extraction rules are used for extracting features for different webpages, thereby improving the accuracy of webpage feature extraction.
In order to improve the accuracy of updating a web page and ensure that the server and the client perform feature extraction on the same web page using the same feature extraction rule, the processing unit 201 may further include the following: receiving a rule query request sent by a client, wherein the rule query request can contain attribute information of a webpage; and determining a feature extraction rule corresponding to the received rule query request, and sending the determined feature extraction rule to the client for the client to extract features and generate a local feature identifier of the webpage. That is to say, the processing unit 201 can also issue the feature extraction rule according to the query request sent by the client, so as to improve the accuracy of generating the local feature identifier by the client.
In addition, the client can also store the feature extraction rule corresponding to the webpage in advance locally, so that the client can obtain the feature extraction rule corresponding to the webpage locally without interacting with the server.
It can be understood that the feature extraction rule issued by the server side to the client, or the feature extraction rule pre-stored locally by the client, may include, in addition to the types of features to be extracted from the web page, constraints on extracting features from web pages opened by the client. The client extracts features from a web page only when these constraints are satisfied. The constraints may specify, for example, that some web pages do not need feature extraction, that some web pages need feature extraction in real time, and how many times per day feature extraction is performed on some web pages.
Specifically, when associating the cloud feature identifier with the web page, the processing unit 201 may associate the cloud feature identifier with the Uniform Resource Locator (URL) of the web page, and store the URL of the web page together with the associated cloud feature identifier.
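As a rough sketch of this association, the server could keep a mapping from URL to cloud feature identifier. The text here does not fix how an identifier is computed from the extracted features, so hashing a canonical serialization with SHA-256 is an assumption, as are the names `make_feature_id`, `associate`, and `cloud_ids`.

```python
import hashlib

# In-memory stand-in for the server's URL -> cloud feature identifier store.
cloud_ids = {}


def make_feature_id(features: dict) -> str:
    """Derive an identifier from extracted features (SHA-256 is an assumption)."""
    # Sort keys so the identifier does not depend on extraction order.
    canonical = "\n".join(f"{key}={features[key]}" for key in sorted(features))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def associate(url: str, features: dict) -> str:
    """Generate the cloud feature identifier and store it under the page URL."""
    feature_id = make_feature_id(features)
    cloud_ids[url] = feature_id
    return feature_id
```

An exact hash only supports equal or not-equal comparison; a scheme that needs a graded matching degree would require a similarity-preserving identifier instead.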
The obtaining unit 202 is configured to obtain, after receiving feedback information sent by a client, a cloud feature identifier corresponding to the feedback information.
After receiving the feedback information sent by the client, the obtaining unit 202 obtains the cloud feature identifier corresponding to the received feedback information, that is, obtains the cloud feature identifier associated with the webpage currently opened by the client.
In addition, the feedback information received by the obtaining unit 202 may further include attribute information of the client, such as geographic information where the client is located, network information used by the client, and the like.
Since different web pages have their own URLs, the obtaining unit 202 can determine a unique web page according to the URL in the feedback information. Because that web page is associated with a corresponding cloud feature identifier, the obtaining unit 202 can obtain the cloud feature identifier associated with the web page through the determined web page.
That is to say, after a user opens a web page in a search result through a client, the client performs feature extraction according to the feature extraction rule corresponding to the web page to generate a local feature identifier of the web page, then generates feedback information from the generated local feature identifier and the URL of the web page, and finally sends the generated feedback information to the server, so that the server, after acquiring the cloud feature identifier, updates the web page by comparing the feature identifiers.
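The client-side steps above can be sketched as follows. The JSON field names (`url`, `local_feature_id`, `client_attributes`), the helper names, and the use of SHA-256 are all assumptions made for illustration; the application does not define a wire format.

```python
import hashlib
import json


def local_feature_id(page_features: dict, feature_types: tuple) -> str:
    """Build the local identifier from only the features the rule asks for."""
    canonical = "\n".join(
        f"{name}={page_features.get(name, '')}" for name in feature_types
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def build_feedback(url, page_features, feature_types, client_attrs=None) -> str:
    """Serialize the feedback information the client sends to the server."""
    payload = {
        "url": url,
        "local_feature_id": local_feature_id(page_features, feature_types),
    }
    if client_attrs:
        # Optional client attributes such as geographic or network information.
        payload["client_attributes"] = client_attrs
    return json.dumps(payload, sort_keys=True)
```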
And the updating unit 203 is configured to, when it is determined that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information, replace the original web page with the newly captured web page, and update the cloud feature identifier associated with the web page.
After the obtaining unit 202 obtains the cloud feature identifier of the web page, the updating unit 203 compares the cloud feature identifier of the web page with the local feature identifier generated by the client in the feedback information to determine whether the two are matched, if the two are matched, it indicates that the web page is not changed, the web page does not need to be re-captured, if the two are not matched, it indicates that the web page is changed, the corresponding web page is re-captured to replace the original web page, and the cloud feature identifier associated with the web page is updated.
That is to say, with the updating unit 203, re-crawling of a web page is driven by the client: the client sends the feedback information to the server after opening the web page, which improves the timeliness with which the server updates the web page.
Specifically, when determining that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information, the updating unit 203 may adopt the following manner: calculating the matching degree between the cloud characteristic identification and the local characteristic identification; and determining whether the calculated matching degree exceeds a preset threshold value, if so, determining that the two are matched, otherwise, determining that the two are not matched.
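One way to picture the threshold test is with fixed-width bit fingerprints, where the matching degree is the fraction of bit positions that agree. This representation, the 64-bit width, the 0.9 threshold, and the names `matching_degree` and `is_match` are assumptions of this sketch; the application does not specify how the matching degree is computed.

```python
def matching_degree(cloud_id: int, local_id: int, bits: int = 64) -> float:
    """Fraction of bit positions on which the two fingerprints agree."""
    mask = (1 << bits) - 1
    differing = bin((cloud_id ^ local_id) & mask).count("1")
    return 1.0 - differing / bits


def is_match(cloud_id: int, local_id: int, threshold: float = 0.9, bits: int = 64) -> bool:
    """The identifiers match when the matching degree exceeds the threshold."""
    return matching_degree(cloud_id, local_id, bits) > threshold
```

A threshold below 1.0 tolerates small differences between the two identifiers, at the cost of possibly missing minor page changes.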
It can be understood that, since the local feature identifier generated by the client corresponds to the changed web page, when the cloud feature identifier associated with the web page is updated, the updating unit 203 may directly update the cloud feature identifier associated with the web page to the local feature identifier sent by the client, that is, replace the original cloud feature identifier with the local feature identifier.
In order to further improve the update accuracy of the cloud feature identifier, when the update unit 203 updates the cloud feature identifier associated with the web page, the following method may be further adopted: generating an updating feature identifier according to the re-captured webpage; and updating the cloud characteristic identification associated with the webpage into an updated characteristic identification, namely replacing the original cloud characteristic identification with the updated characteristic identification. Therefore, the updating unit 203 can avoid the problem of inaccurate updating of the cloud feature identifier caused by rapid change of the webpage, and improve the accuracy of webpage updating.
In some current application scenarios, in order to improve the diversity of web page display, different display modes may exist for different users in the same web page.
Therefore, in order to avoid erroneous web page updates caused by differentiated display of the same web page, and to achieve finer-grained indexing of web pages, the updating unit 203 may further perform the following before replacing the original web page with the re-crawled web page and updating the cloud feature identifier associated with the web page: generating an updated feature identifier according to the re-crawled web page; and determining whether the updated feature identifier is the same as the original cloud feature identifier. If they are not the same, the updating unit 203 continues with the operation of replacing the original web page with the re-crawled web page and updating the cloud feature identifier associated with the web page. If they are the same, the updating unit 203 acquires the attribute information of the client and associates the web page with the acquired attribute information, so that targeted search results can be output to users having the same attribute information.
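The server-side decision flow described for the updating unit 203 can be sketched as one function. Everything named here (`PageRecord`, `handle_feedback`, the `recrawl` and `make_id` callables, and the returned status strings) is hypothetical, and the sketch uses exact equality in place of a threshold-based match for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class PageRecord:
    content: str
    cloud_id: str
    # Client attribute sets for which this page is displayed differently.
    audience_attrs: list = field(default_factory=list)


def handle_feedback(store: dict, feedback: dict, recrawl, make_id) -> str:
    """Compare identifiers and update the stored page when it really changed."""
    record = store[feedback["url"]]
    if record.cloud_id == feedback["local_feature_id"]:
        return "unchanged"  # identifiers match, no re-crawl needed

    new_content = recrawl(feedback["url"])
    update_id = make_id(new_content)
    if update_id == record.cloud_id:
        # The server still sees the old page, so the difference is likely
        # a per-user display variant: record the client attributes instead.
        record.audience_attrs.append(feedback.get("client_attributes", {}))
        return "personalized"

    # The page really changed: replace it and store the identifier
    # generated from the re-crawled page.
    record.content = new_content
    record.cloud_id = update_id
    return "updated"
```

Storing the identifier generated from the re-crawled page, rather than the client's local identifier, follows the more accurate of the two update options described above.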
Fig. 3a is an architecture diagram of a search engine performing web page crawling in the prior art, where a seed URL is placed in a queue of URLs to be crawled, a URL to be crawled is taken out of the queue, DNS resolution is performed, the web page corresponding to the URL is downloaded and stored in a downloaded web page library, a URL is extracted from the queue of crawled URLs, and that URL is placed in the queue of URLs to be crawled, so that the next cycle is performed.
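A generic crawl loop of this kind can be sketched as below. It is a simplified illustration rather than the architecture of Fig. 3a itself: the `fetch` and `extract_links` callables are hypothetical stand-ins (DNS resolution and downloading are assumed to happen inside `fetch`), and new URLs are taken from the links found in each downloaded page.

```python
from collections import deque


def crawl(seed_urls, fetch, extract_links, max_pages=100):
    """Breadth-first crawl: returns a dict of URL -> downloaded page."""
    to_crawl = deque(seed_urls)   # queue of URLs to be crawled
    seen = set(seed_urls)         # URLs already queued or crawled
    downloaded = {}               # stand-in for the downloaded page library

    while to_crawl and len(downloaded) < max_pages:
        url = to_crawl.popleft()
        page = fetch(url)
        if page is None:          # download failed, skip this URL
            continue
        downloaded[url] = page
        for link in extract_links(page):
            if link not in seen:  # avoid queuing the same URL twice
                seen.add(link)
                to_crawl.append(link)
    return downloaded
```

Passing the network operations in as callables keeps the loop testable without real downloads.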
Fig. 3b is a frame diagram of a search engine for updating a web page according to a third embodiment of the present application, where a process of capturing a web page is the same as that described in fig. 3a, but after a web page is obtained by downloading, a server analyzes a web page opened by a client by using a processing unit, an obtaining unit, and an updating unit arranged in the server, so as to drive the search engine to update the web page after determining that content of the web page changes. Therefore, when the server side updates the webpage, the basic framework of the search engine is reserved, the search engine does not need to be modified too much, and therefore the development difficulty is reduced.
According to an embodiment of the present application, an electronic device and a computer-readable storage medium are also provided.
Fig. 4 is a block diagram of an electronic device according to an embodiment of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 4, the electronic device includes: one or more processors 401, a memory 402, and interfaces for connecting the various components, including high-speed interfaces and low-speed interfaces. The various components are interconnected using different buses and may be mounted on a common motherboard or in other manners as desired. The processor may process instructions for execution within the electronic device, including instructions stored in or on the memory to display graphical information of a GUI on an external input/output apparatus (such as a display device coupled to the interface). In other embodiments, multiple processors and/or multiple buses may be used, along with multiple memories, as desired. Also, multiple electronic devices may be connected, with each device providing portions of the necessary operations (e.g., as a server array, a group of blade servers, or a multi-processor system). In fig. 4, one processor 401 is taken as an example.
Memory 402 is a non-transitory computer readable storage medium as provided herein. The memory stores instructions executable by at least one processor to cause the at least one processor to perform the method for updating a web page provided by the present application. The non-transitory computer readable storage medium of the present application stores computer instructions for causing a computer to perform the method of web page updating provided herein.
The memory 402, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs, non-transitory computer executable programs, and modules, such as program instructions/modules corresponding to the method for updating a web page in the embodiment of the present application (for example, the processing unit 201, the obtaining unit 202, and the updating unit 203 shown in fig. 2). The processor 401 executes various functional applications of the server and data processing, i.e., a method for updating a web page in the above method embodiment, by running non-transitory software programs, instructions, and modules stored in the memory 402.
The memory 402 may include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the electronic device, and the like. Further, the memory 402 may include high speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory 402 may optionally include memory located remotely from the processor 401, and these remote memories may be connected to the electronic device of the method of web page updating via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The electronic device of the method for updating a web page may further include: an input device 403 and an output device 404. The processor 401, the memory 402, the input device 403 and the output device 404 may be connected by a bus or other means, and fig. 4 illustrates an example of a connection by a bus.
The input device 403 may receive input numeric or character information and generate key signal inputs related to user settings and function control of the electronic device for the method of web page updating. Examples of the input device include a touch screen, keypad, mouse, track pad, touch pad, pointing stick, one or more mouse buttons, track ball, and joystick. The output device 404 may include a display device, auxiliary lighting (e.g., LEDs), a tactile feedback device (e.g., a vibrating motor), and the like.
Various implementations of the systems and techniques described here can be realized in digital electronic circuitry, integrated circuitry, application-specific integrated circuits (ASICs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
As used herein, the terms "machine-readable medium" and "computer-readable medium" refer to any computer program product, apparatus, and/or device (e.g., magnetic disks, optical disks, memory, programmable logic devices (PLDs)) used to provide machine instructions and/or data to a programmable processor, including a machine-readable medium that receives machine instructions as a machine-readable signal.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user, and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
According to the technical scheme of the embodiment of the application, the server side is driven by the client side to update the webpage through interaction between the server side and the client side, on one hand, timeliness of webpage updating can be improved, on the other hand, a basic framework of the server side can be reserved, and computing resources of the server side are effectively saved.
It should be understood that various forms of the flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present application may be executed in parallel, sequentially, or in different orders, as long as the desired results of the technical solutions disclosed in the present application can be achieved, and the present invention is not limited herein.
The above-described embodiments should not be construed as limiting the scope of the present application. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (14)

1. A method for updating a web page, comprising:
after a server side captures a webpage, generating a cloud characteristic identification of the webpage, and associating the cloud characteristic identification with the webpage;
after feedback information sent by a client is received, a cloud characteristic identifier corresponding to the feedback information is obtained;
and when the cloud characteristic identification of the webpage is determined not to be matched with the local characteristic identification in the feedback information, replacing the original webpage with the re-captured webpage, and updating the cloud characteristic identification associated with the webpage.
2. The method of claim 1, wherein the generating the cloud-based feature identifier of the web page comprises:
determining a feature extraction rule corresponding to the webpage;
and extracting the characteristics of the webpage according to the characteristic extraction rule, and generating a cloud characteristic identification of the webpage by using the extracted characteristics.
3. The method of claim 1, further comprising:
receiving a rule query request sent by a client;
and determining a feature extraction rule corresponding to the rule query request, and sending the determined feature extraction rule to a client.
4. The method of claim 1, wherein determining that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information comprises:
calculating the matching degree between the cloud characteristic identification and the local characteristic identification;
and determining whether the calculated matching degree exceeds a preset threshold value, if so, determining that the two are matched, otherwise, determining that the two are not matched.
5. The method of claim 1, wherein before replacing the original web page with the re-crawled web page and updating the cloud feature identifier associated with the web page, the method further comprises:
generating an updating feature identifier according to the re-captured webpage;
determining whether the updated feature identifier is the same as the original cloud end feature identifier;
if not, continuing to execute the operation of replacing the original webpage with the re-captured webpage and updating the cloud characteristic identification associated with the webpage, if so, acquiring the attribute information of the client and associating the webpage with the attribute information.
6. The method of claim 1, wherein updating the cloud-based feature identifier associated with the web page comprises:
generating an updating feature identifier according to the re-captured webpage;
and updating the cloud characteristic identification associated with the webpage into the updated characteristic identification.
7. An apparatus for updating a web page, wherein the apparatus is located at a server side, and comprises:
the processing unit is used for generating a cloud characteristic identifier of the webpage after the webpage is captured, and associating the cloud characteristic identifier with the webpage;
the acquisition unit is used for acquiring a cloud characteristic identifier corresponding to feedback information after receiving the feedback information sent by the client;
and the updating unit is used for replacing the original webpage with the newly-captured webpage and updating the cloud characteristic identification associated with the webpage when the cloud characteristic identification of the webpage is determined not to be matched with the local characteristic identification in the feedback information.
8. The apparatus according to claim 7, wherein the processing unit, when generating the cloud feature identifier of the web page, specifically performs:
determining a feature extraction rule corresponding to the webpage;
and extracting the characteristics of the webpage according to the characteristic extraction rule, and generating a cloud characteristic identification of the webpage by using the extracted characteristics.
9. The apparatus of claim 7, wherein the processing unit further performs:
receiving a rule query request sent by a client;
and determining a feature extraction rule corresponding to the rule query request, and sending the determined feature extraction rule to a client.
10. The apparatus according to claim 7, wherein the updating unit, when determining that the cloud feature identifier of the web page does not match the local feature identifier in the feedback information, specifically performs:
calculating the matching degree between the cloud characteristic identification and the local characteristic identification;
and determining whether the calculated matching degree exceeds a preset threshold value, if so, determining that the two are matched, otherwise, determining that the two are not matched.
11. The apparatus of claim 7, wherein the updating unit, before replacing the original web page with the re-crawled web page and updating the cloud feature identifier associated with the web page, further performs:
generating an updating feature identifier according to the re-captured webpage;
determining whether the updated feature identifier is the same as the original cloud end feature identifier;
if not, continuing to execute the operation of replacing the original webpage with the re-captured webpage and updating the cloud characteristic identification associated with the webpage, if so, acquiring the attribute information of the client and associating the webpage with the attribute information.
12. The apparatus according to claim 7, wherein the updating unit, when updating the cloud feature identifier associated with the web page, specifically performs:
generating an updating feature identifier according to the re-captured webpage;
and updating the cloud characteristic identification associated with the webpage into the updated characteristic identification.
13. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-6.
14. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-6.
CN202010153288.1A 2020-03-06 2020-03-06 Method, device, electronic equipment and computer readable storage medium for web page update Active CN111506787B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010153288.1A CN111506787B (en) 2020-03-06 2020-03-06 Method, device, electronic equipment and computer readable storage medium for web page update

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010153288.1A CN111506787B (en) 2020-03-06 2020-03-06 Method, device, electronic equipment and computer readable storage medium for web page update

Publications (2)

Publication Number Publication Date
CN111506787A true CN111506787A (en) 2020-08-07
CN111506787B CN111506787B (en) 2023-04-25

Family

ID=71863947

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010153288.1A Active CN111506787B (en) 2020-03-06 2020-03-06 Method, device, electronic equipment and computer readable storage medium for web page update

Country Status (1)

Country Link
CN (1) CN111506787B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528005A (en) * 2021-11-29 2022-05-24 深圳市千源互联网科技服务有限公司 Grab tag updating method, device, equipment and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050071766A1 (en) * 2003-09-25 2005-03-31 Brill Eric D. Systems and methods for client-based web crawling
CN102109989A (en) * 2009-12-29 2011-06-29 阿里巴巴集团控股有限公司 Method, device and system for controlling browser cache
US20170085679A1 (en) * 2015-09-22 2017-03-23 Facebook, Inc. Error correction using state information of data
US20190138297A1 (en) * 2016-01-21 2019-05-09 Alibaba Group Holding Limited Method, apparatus, and system for hot-deploying application
US10310699B1 (en) * 2014-12-08 2019-06-04 Amazon Technologies, Inc. Dynamic modification of browser and content presentation
CN110083616A (en) * 2019-04-19 2019-08-02 深圳前海微众银行股份有限公司 Page data processing method, device, equipment and computer readable storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
严悍等: "Web应用开发中的动态页面建模技术", 《计算机工程与应用》 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114528005A (en) * 2021-11-29 2022-05-24 深圳市千源互联网科技服务有限公司 Grab tag updating method, device, equipment and storage medium
CN114528005B (en) * 2021-11-29 2023-06-23 深圳市千源互联网科技服务有限公司 Grabbing label updating method, grabbing label updating device, grabbing label updating equipment and storage medium

Also Published As

Publication number Publication date
CN111506787B (en) 2023-04-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant