CN117520629A - SEO scheme based on server-side webpage rendering technology and related device - Google Patents

Publication number
CN117520629A
Authority
CN
China
Prior art keywords
crawler
webpage
html content
browser
request
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311546572.5A
Other languages
Chinese (zh)
Inventor
张衍炳
戴裕文
张楠
赵志瑞
许丹昊
杨建明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SHENZHEN SECURITIES INFORMATION CO Ltd
Original Assignee
SHENZHEN SECURITIES INFORMATION CO Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SHENZHEN SECURITIES INFORMATION CO Ltd
Priority to CN202311546572.5A
Publication of CN117520629A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9538 Presentation of query results
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9574 Browsing optimisation, e.g. caching or content distillation of access to content, e.g. by caching
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/51 Discovery or management thereof, e.g. service location protocol [SLP] or web services
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/16 Implementation or adaptation of Internet protocol [IP], of transmission control protocol [TCP] or of user datagram protocol [UDP]
    • H04L69/161 Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields
    • H04L69/162 Implementation details of TCP/IP or UDP/IP stack architecture; Specification of modified or new header fields involving adaptations of sockets based mechanisms

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The embodiments of the application disclose an SEO scheme based on a server-side webpage rendering technology, and a related device. The search method comprises the following steps: obtaining a crawler request sent by a search engine crawler; acquiring a pre-created engineering project and establishing a connection with a pre-installed browser, wherein the engineering project is pre-created based on the DevTools Protocol; based on the crawler request sent by the search engine crawler, calling the browser through the engineering project to access and render a webpage, and acquiring the webpage HTML content corresponding to the crawler request; and returning the acquired webpage HTML content to the search engine crawler. Because the server side obtains the webpage HTML content through these steps, the search engine crawler obtains the complete webpage HTML content and crawler requests are answered quickly.

Description

SEO scheme based on server-side webpage rendering technology and related device
Technical Field
The embodiments of the application relate to the technical field of SEO, and in particular to an SEO scheme based on a server-side webpage rendering technology and a related device.
Background
In internet applications, moderate SEO (Search Engine Optimization) is very important: it gets a website's pages and associated keywords indexed by search engines and raises the website's search ranking, so that users can find the website more easily and the product gains exposure.
In the related art, a search engine uses a web crawler to periodically access, extract and store the webpage content at the addresses of the servers it visits, and then ranks and displays the crawled webpages according to factors such as relevance to the search keywords, so as to provide users with accurate, varied and useful search results.
However, in the current technology the web crawler cannot parse JavaScript, so the search engine crawler cannot obtain the complete HTML content of a webpage, and the website responds to crawler requests slowly. Both problems hurt the ranking and display of the webpage in the search engine, so an SEO scheme that solves them is needed.
Disclosure of Invention
The present invention aims to solve the technical problems described in the background above. To this end, the invention provides an SEO scheme based on a server-side webpage rendering technology and a related device, with which the server side can quickly and completely acquire the webpage HTML content corresponding to a crawler request.
In a first aspect, an embodiment of the present application provides an SEO solution based on server-side webpage rendering technology, including:
obtaining a crawler request sent by a search engine crawler;
acquiring a pre-created engineering project, and establishing connection with a pre-installed browser, wherein the engineering project is pre-created based on DevTools Protocol;
based on a crawler request sent by the search engine crawler, calling a browser to access and render a webpage through the engineering project, and acquiring webpage HTML content corresponding to the crawler request;
and returning the acquired webpage HTML content to the search engine crawler.
According to some embodiments of the invention, after the obtaining the crawler request sent by the search engine crawler, the SEO scheme further includes:
acquiring a preset cache service file; the cache service file is used for storing webpage HTML content;
judging whether the cache service file contains webpage HTML content corresponding to the crawler request;
if so, judging whether the webpage HTML content corresponding to the crawler request has expired;
and if it has not expired, reading the webpage HTML content corresponding to the crawler request from the cache service file and returning it to the search engine crawler.
According to some embodiments of the invention, after the obtaining the preset cache service file, the SEO scheme further includes:
and if the webpage HTML content corresponding to the crawler request does not exist in the cache service file, or if the webpage HTML content corresponding to the crawler request in the cache service file has expired, executing the step of calling a browser through the engineering project to access and render a webpage based on the crawler request sent by the search engine crawler, and obtaining the webpage HTML content corresponding to the crawler request.
According to some embodiments of the present invention, after the browser is called through the engineering project to access and render a webpage based on the crawler request sent by the search engine crawler, and the webpage HTML content corresponding to the crawler request is obtained, the SEO scheme further includes:
and inserting the acquired webpage HTML content into the cache service file, and setting the cache time.
According to some embodiments of the invention, the obtaining a crawler request sent by a search engine crawler includes:
acquiring a pre-established website project, wherein the website project is created with a self-defined filter;
receiving an http request through the website project;
judging whether the http request is a crawler request sent by a search engine crawler;
if not, allowing the http request to normally access the website;
if yes, intercepting a crawler request sent by the search engine crawler through the filter, and sending the crawler request to the engineering project.
According to some embodiments of the invention, the engineering project creates a browser Page object pool based on DevTools Protocol;
based on the crawler request sent by the search engine crawler, calling a browser to access and render a webpage through the engineering project, and acquiring webpage HTML content corresponding to the crawler request, wherein the method comprises the following steps:
judging whether an idle Page instance exists in the browser Page object pool;
if yes, calling an idle Page instance in the browser Page object pool, marking that the Page instance is in use, calling a browser to access and render a webpage through the Page instance based on the crawler request sent by the search engine crawler, and obtaining the webpage HTML content corresponding to the crawler request.
According to some embodiments of the invention, after the determining whether there is a free Page instance in the browser Page object pool, the searching method further includes:
if not, waiting for an idle Page instance to appear in the browser Page object pool;
and judging whether the waiting time of the crawler request exceeds a preset time, and if so, returning an error response to the search engine crawler.
According to some embodiments of the invention, the server side includes a website project and an engineering project, wherein:
the website project is used for receiving a crawler request sent by a search engine crawler and feeding back the crawler request to the engineering project;
the engineering project is pre-created based on the DevTools Protocol, and is used for establishing a connection with a pre-installed browser, calling the browser to access and render a webpage based on the crawler request sent by the search engine crawler, acquiring the webpage HTML content corresponding to the crawler request, and returning it to the search engine crawler.
In a second aspect, an embodiment of the present application provides a server apparatus, including a website project and an engineering project, where:
the website project is used for receiving a crawler request sent by a search engine crawler and feeding back the crawler request to the engineering project;
the engineering project is pre-created based on the DevTools Protocol, and is used for establishing a connection with a pre-installed browser, calling the browser to access and render a webpage based on the crawler request sent by the search engine crawler, acquiring the webpage HTML content corresponding to the crawler request, and returning it to the search engine crawler.
In a third aspect, embodiments of the present application provide an electronic device, including a memory and a processor, wherein:
a memory for storing programs and/or instructions executable by the processor;
and a processor configured to execute the program and/or instructions to implement the search method described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium, where a program and/or instructions are stored, where the program and/or instructions implement a search method as described above when executed by a processor.
From the above technical solutions, the embodiments of the present application have the following advantages. After receiving a crawler request from a search engine crawler, the engineering project remotely calls the browser to which it is connected through WebSocket. According to the target URL corresponding to the crawler request, the engineering project remotely controls the browser to load and render the corresponding webpage, simulating real browser behavior and obtaining the complete webpage HTML content. In the prior art, dynamic loading and webpage loading delays often leave the search engine crawler with incomplete content and slow responses; because the server side acquires the webpage HTML content through the above steps, the search engine crawler obtains the complete webpage HTML content and crawler requests are answered quickly.
Drawings
The invention is further described with reference to the accompanying drawings and examples, in which:
FIG. 1 is a schematic flow chart of a searching method in an embodiment of the present application;
FIG. 2 is a schematic diagram of the overall structure of a server in an embodiment of the present application;
FIG. 3 is a schematic workflow diagram of an engineering project in an embodiment of the present application;
FIG. 4 is a schematic workflow diagram of a website project in an embodiment of the present application;
FIG. 5 is a schematic workflow diagram of caching service files according to an embodiment of the present application;
FIG. 6 is a schematic workflow diagram of a search device in an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
The search method provided by the embodiments of the application may be executed by a terminal device, including but not limited to a smart phone, a tablet computer, a notebook computer, and the like.
It may also be executed by a chip or a chip system, and the chip may be embedded in the terminal device.
Alternatively, it may be executed by a server, including but not limited to: an independent physical server; a server cluster or distributed system formed by a plurality of physical servers; and a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, a content delivery network (Content Delivery Network, CDN), and big data and artificial intelligence platforms.
It may also be executed by other devices, which are not limited herein.
Referring to FIG. 1, a schematic flow chart of an SEO scheme based on a server-side webpage rendering technology is provided in an embodiment of the present application. Referring to FIG. 1 and FIG. 2, the search method includes S101 to S104, wherein:
S101, obtaining a crawler request sent by a search engine crawler.
A crawler is a program that automatically extracts webpages; it downloads webpages from the World Wide Web for a search engine and is an important component of the search engine. Specifically, the search engine crawler starts from the addresses (URLs) of one or more initial webpages and obtains the addresses found on those pages. While crawling, it keeps extracting new addresses from the current page and putting them into a queue until a stop condition is met.
In a specific implementation, after the search engine crawler sends a crawler request, the server side receives it. In a specific embodiment, for example, the server side receives the crawler request through the Hypertext Transfer Protocol (HTTP) or the Hypertext Transfer Protocol Secure (HTTPS).
S102, acquiring a pre-created engineering project and establishing connection with a pre-installed browser, wherein the engineering project is pre-created based on DevTools Protocol.
Here, the browser is a browser that supports the DevTools Protocol; it is run in headless mode with a designated remote debugging port.
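As an illustration only, the argument list for starting such a browser could be assembled as below. The `--headless` and `--remote-debugging-port` switches are Chromium command-line switches; the binary name `chromium`, the port 9222 and the helper name `headless_browser_args` are assumptions for this sketch and do not come from the application.

```python
def headless_browser_args(binary: str, port: int) -> list[str]:
    """Build the argument list for a headless browser with remote debugging.

    The binary name and port are supplied by the caller; the values used
    below are assumptions, not values taken from the application.
    """
    return [
        binary,
        "--headless",                       # run without a visible window
        f"--remote-debugging-port={port}",  # expose the DevTools endpoint
    ]


args = headless_browser_args("chromium", 9222)
print(" ".join(args))
```

The resulting list could be handed to `subprocess.Popen` on a machine where the browser is installed; the sketch stops short of launching anything.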
On the server side, the engineering project is created in advance based on the DevTools Protocol, and it can be implemented in any programming language, such as Java, C++ or Python. The DevTools Protocol is a JSON-based protocol for communicating with Chrome or other browsers, so the engineering project can establish a connection with the pre-installed browser and control its behavior.
In a specific implementation, the engineering project uses a WebSocket library to establish a WebSocket connection with the browser's DevTools through the browser's remote debugging port. The engineering project can then control the browser by sending DevTools Protocol commands in JSON format, so as to call the browser to access and render webpages. For example, Target.createTarget is one such DevTools Protocol command; it creates a "target", that is, a page, in the browser.
S103, based on a crawler request sent by the search engine crawler, calling a browser to access and render a webpage through the engineering project, and obtaining webpage HTML content corresponding to the crawler request.
The DevTools Protocol is a protocol through which the engineering project can make the browser open, close and browse webpages, thereby simulating the actions of a real user.
In a specific implementation, each time the engineering project receives a crawler request, it determines the target URL corresponding to the crawler request. The engineering project then calls the connected browser and controls it to load and render the webpage at the target URL, so that the webpage HTML content corresponding to the crawler request is generated by simulating real browser behavior. More precisely, the engineering project remotely controls the browser through DevTools Protocol commands so that the browser actually accesses and renders the webpage and generates the webpage HTML content corresponding to the crawler request.
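The load-and-render step can be sketched as two DevTools Protocol commands: `Page.navigate` to open the target URL, then `Runtime.evaluate` to read the rendered markup. This is one plausible command sequence, not necessarily the one used in the application. The `send` callable, which is assumed to deliver a command and return the `result` object of its reply, and the stand-in `fake_send` are hypothetical so that the sketch runs without a browser.

```python
from typing import Callable

# Hypothetical transport: sends one command, returns the reply's "result" object.
Send = Callable[[str, dict], dict]


def fetch_rendered_html(send: Send, target_url: str) -> str:
    """Navigate to target_url and return the HTML of the rendered document."""
    send("Page.navigate", {"url": target_url})
    # A real client would wait here until the page has finished loading.
    reply = send(
        "Runtime.evaluate",
        {"expression": "document.documentElement.outerHTML", "returnByValue": True},
    )
    return reply["result"]["value"]


sent_methods: list[str] = []


def fake_send(method: str, params: dict) -> dict:
    """Stand-in for a browser connection, used only to exercise the sketch."""
    sent_methods.append(method)
    if method == "Runtime.evaluate":
        return {"result": {"type": "string", "value": "<html><body>rendered</body></html>"}}
    return {}


html = fetch_rendered_html(fake_send, "https://example.com/")
print(html)
```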
S104, returning the acquired webpage HTML content to the search engine crawler.
In a specific implementation, after the webpage HTML content is generated, it is first returned to the engineering project, and the engineering project returns it to the search engine crawler. The search engine crawler stores the acquired webpage HTML content in a corresponding database and indexes it. In the display stage, the search engine ranks and displays the crawled webpages according to factors such as relevance to the search keywords, providing users with accurate, varied and useful search results.
In summary, in steps S101 to S104, after receiving a crawler request from the search engine crawler, the engineering project remotely calls the browser to which it is connected through WebSocket. According to the target URL corresponding to the crawler request, the engineering project remotely controls the browser to load and render the corresponding webpage, simulating real browser behavior and obtaining the complete webpage HTML content. The search engine crawler can then obtain and index the complete webpage HTML content, which improves the indexability and search ranking of the webpage. In the prior art, dynamic loading and webpage loading delays often leave the search engine crawler with incomplete content; because the server side acquires the webpage HTML content through the above steps, the search engine crawler obtains the complete webpage HTML content and crawler requests are answered quickly.
In one possible embodiment, referring to FIG. 2 and FIG. 3, the engineering project can remotely call the browser in a number of ways. For example, the engineering project creates a browser Page object pool based on the DevTools Protocol and uses a Page instance from the pool to call the browser to access and render a webpage.
Specifically, step S103, based on the crawler request sent by the search engine crawler, calls a browser through the engineering project to access and render a webpage, and obtains webpage HTML content corresponding to the crawler request, including:
S301, judging whether an idle Page instance exists in the browser Page object pool.
An idle Page instance is a Page instance that is not in use.
Instead of creating new objects each time, the browser Page object pool provides reusable objects when they are needed. This improves the performance and efficiency of the program, because it avoids frequent object creation and destruction and so reduces server-side overhead. In the present application, the number of objects in the browser Page object pool can be customized. For example, if the pool creates 3 objects, three Page instances are available for calling the browser to load webpages.
S302, if yes, calling an idle Page instance in the browser Page object pool, marking that the Page instance is in use, calling a browser to access and render a webpage through the Page instance based on a crawler request sent by the search engine crawler, and obtaining webpage HTML content corresponding to the crawler request.
In a specific implementation, each time the engineering project receives a crawler request, it judges whether an idle Page instance exists in the browser Page object pool. If one exists, the engineering project takes an idle Page instance, marks it as in use, and calls the browser to access and render the webpage at the target URL corresponding to the crawler request. Once the webpage HTML content corresponding to the crawler request has been obtained, the Page instance is marked as idle again. More specifically, after taking an idle Page instance, the engineering project sends a Page.navigate command so that the Page instance makes the browser access the target URL. It then sends a command that reads the HTML of the document's root element (document.documentElement), so that the Page instance obtains the HTML content of the target URL as rendered by the browser, and finally sets the state of that Page instance back to idle.
It can be understood that with steps S301 to S302, the engineering project reuses idle Page instances in the browser Page object pool to remotely call the browser to access and render webpages. This avoids the overhead of frequently creating and destroying Page instances and improves the efficiency and performance of the server side.
In a further embodiment, after step S301 judges whether an idle Page instance exists in the browser Page object pool, the search method further includes:
S303, if not, waiting for an idle Page instance to appear in the browser Page object pool.
S304, judging whether the waiting time of the crawler request exceeds a preset time, and if so, returning an error response to the search engine crawler.
As a specific illustration, suppose 3 objects are created in the browser Page object pool, which is equivalent to three blank browser pages waiting to be used. If three crawler requests arrive at the engineering project in succession, accessing www.baidu.com, www.taobao.com and www.aiqiyi.com respectively, all three objects in the browser Page object pool are in use. If the engineering project then receives a fourth crawler request, for www.qq.com, while all three objects are still busy, it places the fourth crawler request in a queue to wait for an idle Page instance to appear in the browser Page object pool. When a Page instance in the pool switches back to the idle state, the engineering project takes that Page instance and executes the fourth crawler request.
While the fourth crawler request waits, the engineering project monitors its waiting time. If the waiting time exceeds the preset time, the engineering project ends the crawler request and returns an error response to the search engine crawler, which stops trying to acquire the webpage HTML content for that crawler request.
It can be understood that with steps S303 to S304, when the browser Page object pool has too few objects, a crawler request enters a queue, which ensures that it is processed once an idle Page instance appears. A crawler request that waits too long may degrade the performance of the whole server, so in the present application such a request is ended, which protects the performance of the server.
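The wait-with-timeout behaviour of steps S303 and S304 can be sketched with a blocking queue of idle Page instances. The function name `acquire_or_fail`, the error text and the 0.05 second limit are assumptions chosen so that the example finishes quickly; the application only specifies a preset time and an error response.

```python
import queue


def acquire_or_fail(idle_pages: queue.Queue, max_wait_seconds: float):
    """Wait for an idle page; give up once the preset waiting time is exceeded.

    Returns (page, None) on success or (None, error_text) on timeout.
    """
    try:
        return idle_pages.get(timeout=max_wait_seconds), None
    except queue.Empty:
        return None, "timed out waiting for an idle Page instance"


idle_pages: queue.Queue = queue.Queue()
idle_pages.put("page-1")

# The first request finds an idle page immediately.
page, error = acquire_or_fail(idle_pages, 0.05)
# The second request waits, exceeds the limit, and is answered with an error.
late_page, late_error = acquire_or_fail(idle_pages, 0.05)
print(page, error, late_page, late_error)
```

Putting a page back on the queue after rendering would wake the oldest waiting request.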
In some embodiments, referring to FIG. 2 and FIG. 4, step S101 of obtaining a crawler request sent by a search engine crawler includes:
S110, acquiring a pre-established website project, wherein the website project is created with a custom filter;
S120, receiving an http request through the website project;
S130, judging whether the http request is a crawler request sent by a search engine crawler;
S140, if not, allowing the http request to access the website normally;
S150, if yes, intercepting the crawler request sent by the search engine crawler through the filter, and sending the crawler request to the engineering project.
When obtaining a crawler request sent by a search engine crawler, the website project may also receive requests sent by other web crawlers, or other kinds of requests. All of these are collectively referred to as http requests.
In a specific implementation, after receiving an external http request, the website project judges whether it was sent by a web crawler. If not, the http request is an ordinary request. If so, the website project further judges whether the web crawler belongs to a search engine, that is, whether the http request is a crawler request sent by a search engine crawler, wherein:
If the external http request is not a crawler request sent by a search engine crawler, the filter does not intercept it, and the http request is allowed to access the website normally. The http request is also kept away from the engineering project: the step of calling the browser through the engineering project to access and render the webpage is not performed for it.
If the external http request is a crawler request sent by a search engine crawler, the filter intercepts the crawler request as it passes through the website project. The website project then creates a new crawler request that takes the target URL of the original crawler request as a parameter, and sends it to the engineering project. After receiving the new crawler request, the engineering project executes step S103: based on the crawler request sent by the search engine crawler, it calls the browser to access and render the webpage and obtains the webpage HTML content corresponding to the crawler request. In step S104, the acquired webpage HTML content is returned to the engineering project, the engineering project returns it to the website project, and the website project returns it to the search engine crawler.
It can be understood that with steps S110 to S150, the filter intercepts only crawler requests sent by search engine crawlers and forwards them to the engineering project, and keeps requests from other web crawlers and other kinds of requests away from the engineering project, which helps ensure that users can retrieve more accurate related content. If requests from other web crawlers or other kinds of requests reached the engineering project, they would occupy Page instances, and crawler requests sent by search engine crawlers might not be processed in time, or might wait so long that they are terminated. Having the filter screen out the crawler requests sent by search engine crawlers and send only those to the engineering project effectively avoids these problems.
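The filter decision of steps S130 to S150 might look like the sketch below. The application does not say how a search engine crawler is recognized; matching the User-Agent header against known crawler tokens is an assumption, as are the token list, the `RENDER_SERVICE` address and the function names.

```python
from urllib.parse import urlencode

# Assumed User-Agent tokens of search engine crawlers (illustrative, not exhaustive).
SEARCH_ENGINE_TOKENS = ("googlebot", "bingbot", "baiduspider")
# Hypothetical address at which the engineering project listens.
RENDER_SERVICE = "http://127.0.0.1:9000/render"


def is_search_engine_crawler(user_agent: str) -> bool:
    """Return True if the User-Agent contains a known search engine token."""
    lowered = user_agent.lower()
    return any(token in lowered for token in SEARCH_ENGINE_TOKENS)


def route_request(user_agent: str, requested_url: str) -> tuple[str, str]:
    """Decide whether a request passes through or goes to the engineering project."""
    if is_search_engine_crawler(user_agent):
        # S150: build a new request that carries the target URL as a parameter.
        return "render", f"{RENDER_SERVICE}?{urlencode({'url': requested_url})}"
    # S140: every other request accesses the website normally.
    return "pass", requested_url


crawler_route = route_request("Mozilla/5.0 (compatible; Baiduspider/2.0)", "https://example.com/a?b=1")
visitor_route = route_request("Mozilla/5.0 (Windows NT 10.0) Firefox/120.0", "https://example.com/a?b=1")
print(crawler_route)
print(visitor_route)
```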
In some embodiments, referring to fig. 2 and 5, in order to further address problems such as long webpage rendering time and slow response to crawler requests, after the crawler request sent by the search engine crawler is obtained in step S101, more precisely between step S130 and step S140, the search method further includes:
S111, acquiring a preset cache service file; the cache service file is used for storing webpage HTML content.
The webpage HTML content in the cache service file may be obtained in various ways. For example, the webpage HTML content acquired in step S103 is stored in the cache service file; or the webpage HTML content acquired while the crawler request passes through the website is stored in the cache service file; or the acquired webpage HTML content is stored in the cache service file in some other manner.
Webpage HTML content is typically stored in the cache service file by inserting it under a cache key. Cache keys must be unique, and the URL corresponding to the crawler request can be used as the cache key. Crawler requests and requested URLs are thus in one-to-one correspondence, and the corresponding webpage HTML content can be retrieved by URL.
S112, judging whether the webpage HTML content corresponding to the crawler request exists in the cache service file.
S113, if so, judging whether the webpage HTML content corresponding to the crawler request has expired.
S114, if it has not expired, reading the webpage HTML content corresponding to the crawler request from the cache service file, and returning it to the search engine crawler.
The webpage HTML content is stored in a cache service file, and the cache service file sets cache time for the cached webpage HTML content. If the caching time of the webpage HTML content in the caching service file exceeds the set caching time, the webpage HTML content becomes invalid data and cannot be used normally. If the caching time of the webpage HTML content in the caching service file does not exceed the set caching time, the webpage HTML content is effective data.
In combination with steps S111 to S114, after receiving the crawler request, the website project determines the URL corresponding to the crawler request. Based on this URL, the website project judges whether the webpage HTML content corresponding to the crawler request exists in the cache service file. If so, the website project then judges whether that webpage HTML content has expired. If it has not expired, the website project reads the webpage HTML content corresponding to the crawler request directly from the cache service file and returns it to the search engine crawler.
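The lookup in steps S111 to S114 can be illustrated with a small in-memory cache. This is a sketch under stated assumptions: the class name `HtmlCache`, its methods, and the injectable `clock` parameter are hypothetical, and a dictionary stands in for the cache service file, whose actual storage is not specified here.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class HtmlCache:
    """In-memory stand-in for the cache service file, keyed by URL."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        # URL -> (HTML content, expiry timestamp)
        self._entries: Dict[str, Tuple[str, float]] = {}

    def put(self, url: str, html: str, cache_seconds: float) -> None:
        """Store HTML under the URL and set its cache time."""
        self._entries[url] = (html, self._clock() + cache_seconds)

    def get(self, url: str) -> Optional[str]:
        """Return cached HTML, or None if absent (S112) or expired (S113)."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        html, expires_at = entry
        if self._clock() >= expires_at:
            # Expired content is invalid data and is dropped.
            del self._entries[url]
            return None
        return html
```

A `None` result from `get` corresponds to the two cases in which the request has to be handed to the engineering project instead of being answered from the cache.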
Therefore, by providing the cache service file, a caching function is added to the server side, and the webpage HTML content corresponding to the target URL can be served to the search engine crawler directly from the cache service file, which addresses problems such as long webpage rendering time and slow response to crawler requests.
Further, after the obtaining of the preset cache service file in step S111, more precisely, during step S112 and step S113, the searching method further includes:
and if the webpage HTML content corresponding to the crawler request does not exist in the cache service file, or if the webpage HTML content corresponding to the crawler request in the cache service file is expired, executing the step of acquiring the pre-created engineering project.
It can be understood that the cache service file works together with the engineering project: when the website project receives a crawler request, it first determines whether the cache service file can provide the corresponding webpage HTML content and whether that content has expired. Only when the cache service file does not hold the corresponding webpage HTML content, or the cached content has expired, does the engineering project come into play. The cache service file therefore further improves the data acquisition speed on the basis of the engineering project, and it takes over part of the engineering project's work, which safeguards the overall efficiency of the server side.
In some embodiments, after step S103, in which the engineering project, based on the crawler request sent by the search engine crawler, calls a browser to access and render a webpage and obtains the webpage HTML content corresponding to the crawler request, the search method further includes:
and inserting the acquired webpage HTML content into the cache service file, and setting the cache time.
In a specific implementation, after the webpage HTML content corresponding to the crawler request is obtained in step S103, the webpage HTML content of the target URL is returned to the engineering project. The engineering project returns this content in response to the website project's crawler request, and the website project returns it to the search engine crawler. Meanwhile, the engineering project, either connected directly to the cache service file or acting through the website project, stores the acquired webpage HTML content in the cache service file using the URL of the crawler request as the cache key. After the webpage HTML content is cached, its cache time is set.
It will be appreciated that if the search engine crawler issues the same crawler request again while the corresponding webpage HTML content is still within its cache time, that content is still stored in the cache service file, and the crawler request can be answered directly from the cache service file. Caching the webpage HTML content acquired in step S103 in the cache service file with the target URL as the cache key therefore improves the efficiency and speed of responding to crawler requests.
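The combined behavior, rendering on a cache miss and then storing the result under the target URL, can be sketched as below. This is an illustration only: `respond_to_crawler`, the `render` callback (standing in for the browser rendering of step S103), and the dictionary-based cache are hypothetical names and simplifications, not the implementation of this application.

```python
import time
from typing import Callable, Dict, Tuple

# URL -> (HTML content, expiry timestamp); stands in for the cache service file.
Cache = Dict[str, Tuple[str, float]]


def respond_to_crawler(
    url: str,
    cache: Cache,
    render: Callable[[str], str],
    cache_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Serve from the cache when possible; otherwise render and cache."""
    entry = cache.get(url)
    if entry is not None and clock() < entry[1]:
        return entry[0]  # still within the cache time
    html = render(url)  # stands in for step S103 (browser rendering)
    cache[url] = (html, clock() + cache_seconds)  # URL is the cache key
    return html
```

With this arrangement, a repeated request for the same URL inside the cache time never reaches `render`, so the comparatively expensive browser rendering runs at most once per URL per cache period.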
The application also discloses a server side, referring to fig. 6, including a memory 10 and a processor 20, wherein: the memory 10 is used to store programs and/or instructions that may be executed by the processor 20; the processor 20 is configured to execute the programs and/or instructions to implement the search method described above. The processor 20, when executing the programs and/or instructions, performs the following steps:
obtaining a crawler request sent by a search engine crawler;
acquiring a pre-created engineering project, and establishing connection with a pre-installed browser, wherein the engineering project is pre-created based on DevTools Protocol;
based on a crawler request sent by the search engine crawler, calling a browser to access and render a webpage through the engineering project, and acquiring webpage HTML content corresponding to the crawler request;
and returning the acquired webpage HTML content to the search engine crawler.
The application also discloses a computer readable storage medium, wherein the computer readable storage medium stores a program and/or instructions, and the program and/or instructions implement the search method described above when executed by a processor. The program and/or instructions, when executed by a processor, perform the following steps:
obtaining a crawler request sent by a search engine crawler;
acquiring a pre-created engineering project, and establishing connection with a pre-installed browser, wherein the engineering project is pre-created based on DevTools Protocol;
based on a crawler request sent by the search engine crawler, calling a browser to access and render a webpage through the engineering project, and acquiring webpage HTML content corresponding to the crawler request;
and returning the acquired webpage HTML content to the search engine crawler.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working processes of the server side, the device and the unit described above may refer to corresponding processes in the foregoing method embodiments, which are not described herein again.
In the embodiments provided in the present application, it should be understood that the disclosed server side, apparatus and method may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative: the division of the units is merely a division by logical function, and other divisions are possible in actual implementation; for instance, multiple units or components may be combined or integrated into another server side, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.

Claims (10)

1. An SEO scheme based on server-side webpage rendering technology is characterized by comprising the following steps:
obtaining a crawler request sent by a search engine crawler;
acquiring a pre-created engineering project, and establishing connection with a pre-installed browser, wherein the engineering project is pre-created based on DevTools Protocol;
based on a crawler request sent by the search engine crawler, calling a browser to access and render a webpage through the engineering project, and acquiring webpage HTML content corresponding to the crawler request;
and returning the acquired webpage HTML content to the web crawler.
2. The SEO solution based on server-side web page rendering technology of claim 1, wherein after the obtaining of the crawler request sent by the search engine crawler, the SEO solution further comprises:
acquiring a preset cache service file; the cache service file is used for storing webpage HTML content;
judging whether the cache service file contains webpage HTML content corresponding to the crawler request;
if so, judging whether the webpage HTML content corresponding to the crawler request is out of date;
and if not, reading the webpage HTML content corresponding to the crawler request in the cache service file, and returning to the web crawler.
3. The SEO solution based on server-side webpage rendering technology according to claim 2, wherein after the obtaining the preset cache service file, the SEO solution further includes:
and if the webpage HTML content corresponding to the crawler request does not exist in the cache service file, or if the webpage HTML content corresponding to the crawler request in the cache service file is out of date, executing the crawler request sent by the search engine crawler, calling a browser through the engineering project to access and render a webpage, and obtaining the webpage HTML content corresponding to the crawler request.
4. The SEO solution based on server-side webpage rendering technology according to any one of claims 2 or 3, wherein the SEO solution further includes, after the server-side webpage rendering technology is based on a crawler request sent by the search engine crawler, calling a browser through the engineering project to access and render a webpage, and obtaining webpage HTML content corresponding to the crawler request:
and inserting the acquired webpage HTML content into the cache service file, and setting the cache time.
5. A server-side webpage rendering technology based SEO solution according to any one of claims 2 or 3, wherein the obtaining the crawler request sent by the search engine crawler includes:
acquiring a pre-established website project, wherein the website project is created with a self-defined filter;
receiving an http request through the website item;
judging whether the http request is a crawler request sent by a search engine crawler;
if not, allowing the http request to normally access the website;
if yes, intercepting a crawler request sent by the search engine crawler through the filter, and sending the crawler request to the engineering project.
6. The SEO scheme based on server-side web Page rendering technology of claim 1, wherein the engineering project creates a browser Page object pool based on DevTools Protocol;
based on the crawler request sent by the search engine crawler, calling a browser to access and render a webpage through the engineering project, and acquiring webpage HTML content corresponding to the crawler request, wherein the method comprises the following steps:
judging whether an idle Page instance exists in the browser Page object pool;
if yes, calling an idle Page instance in the browser Page object pool, marking that the browser is in use, calling a browser to access and render a webpage through the Page instance based on a crawler request sent by the search engine crawler, and obtaining webpage HTML content corresponding to the crawler request.
7. The SEO solution based on server-side webpage rendering technology of claim 6, wherein after determining whether there is an idle Page instance in the browser Page object pool, the search method further comprises:
if not, waiting for an idle Page instance to appear in the browser Page object pool;
and judging whether the waiting time of the crawler request exceeds the preset time, and if so, returning a signal for responding to the error to the web crawler.
8. A server device, comprising a website project and an engineering project, wherein:
the website project is used for receiving a crawler request sent by a search engine crawler and feeding back the crawler request to the engineering project;
the engineering project is pre-created based on DevTools Protocol, and is used for establishing connection with a pre-installed browser, accessing and rendering a webpage by calling the browser through the engineering project based on a crawler request sent by the search engine crawler, acquiring webpage HTML content corresponding to the crawler request, and returning to the web crawler.
9. An electronic device comprising a memory and a processor, wherein:
a memory for storing programs and/or instructions executable by the processor;
a processor configured to execute the program and/or instructions to implement the search method of any one of claims 1 to 8.
10. A computer readable storage medium, characterized in that the computer readable storage medium stores a program and/or instructions which, when executed by a processor, implement the search method according to any one of claims 1 to 8.
CN202311546572.5A 2023-11-20 2023-11-20 SEO scheme based on server-side webpage rendering technology and related device Pending CN117520629A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311546572.5A CN117520629A (en) 2023-11-20 2023-11-20 SEO scheme based on server-side webpage rendering technology and related device

Publications (1)

Publication Number Publication Date
CN117520629A true CN117520629A (en) 2024-02-06

Family

ID=89762265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311546572.5A Pending CN117520629A (en) 2023-11-20 2023-11-20 SEO scheme based on server-side webpage rendering technology and related device

Country Status (1)

Country Link
CN (1) CN117520629A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination