WO2021042508A1 - Webpage generation method, apparatus, computer device, and storage medium - Google Patents

Info

Publication number
WO2021042508A1
Authority
WO
WIPO (PCT)
Prior art keywords
webpage
access request
web page
simulated
access
Prior art date
Application number
PCT/CN2019/116545
Other languages
English (en)
French (fr)
Inventor
梅锦振华
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021042508A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/958: Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Definitions

  • This application relates to a webpage generation method, device, computer equipment and storage medium.
  • a method, apparatus, computer device, and storage medium for generating a webpage are provided.
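  • a method for generating web pages, including:
  • receiving a web page access request sent by a terminal, and detecting, according to a blacklist database, whether the web page access request is a crawler access request;
  • when the web page access request is detected to be a crawler access request, obtaining the web page identifier according to the web page access request;
  • obtaining the corresponding simulated web page screenshot according to the web page identifier, and inputting the simulated web page screenshot into the trained web page generation model to obtain an output result; and
  • obtaining the source code of the simulated web page according to the output result, and returning the source code of the simulated web page to the terminal, where the terminal is used to generate the simulated web page according to the source code of the simulated web page.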
  • a webpage generating device includes:
  • the request detection module is used to receive the web page access request sent by the terminal and detect the web page access request
  • the identification acquisition module is used to acquire the webpage identification according to the webpage access request when it is detected that the webpage access request is a crawler access request;
  • the screenshot acquisition module is used to obtain corresponding simulated webpage screenshots according to the webpage identifier, and input the simulated webpage screenshots into the trained webpage generation model to obtain the output result;
  • the web page generation module is used to obtain the source code of the simulated web page according to the output result, and return the source code of the simulated web page to the terminal, and the terminal is used to generate the simulated web page according to the source code of the simulated web page.
  • a computer device including a memory and one or more processors; the memory stores computer-readable instructions, and when the computer-readable instructions are executed by the one or more processors, the one or more processors perform the steps of the web page generation method described above.
  • One or more non-volatile storage media storing computer-readable instructions; when the computer-readable instructions are executed by one or more processors, the one or more processors perform the steps of the web page generation method described above.
  • Fig. 1 is an application scenario diagram of a method for generating a web page according to one or more embodiments.
  • Fig. 2 is a schematic flowchart of a method for generating a webpage according to one or more embodiments.
  • Fig. 3 is a schematic diagram of a flow of detecting a webpage access request according to one or more embodiments.
  • Fig. 4 is a schematic diagram of a process of generating a normal webpage according to one or more embodiments.
  • Fig. 5 is a schematic diagram of a process of alerting a crawler according to one or more embodiments.
  • Fig. 6 is a schematic flow chart of training a webpage generation model according to one or more embodiments.
  • Fig. 7 is a block diagram of an apparatus for generating a web page according to one or more embodiments.
  • Fig. 8 is a block diagram of a computer device according to one or more embodiments.
  • the method for generating a webpage can be applied to the application environment as shown in FIG. 1.
  • the terminal 102 communicates with the server 104 through the network.
  • the server 104 receives the webpage access request sent by the terminal 102 and detects it; when it detects that the webpage access request is a crawler access request, it obtains the webpage identifier according to the webpage access request, obtains the corresponding simulated webpage screenshot according to the webpage identifier, and inputs the simulated webpage screenshot into the trained webpage generation model to obtain the output result; the server 104 then obtains the simulated webpage source code according to the output result and returns it to the terminal 102, which generates the simulated webpage from the simulated webpage source code.
  • the terminal 102 may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server 104 may be implemented by an independent server or a server cluster composed of multiple servers.
  • a method for generating a webpage is provided. Taking the method applied to the server in FIG. 1 as an example for description, the method includes the following steps:
  • S202 Receive a web page access request sent by the terminal, and detect whether the web page access request is a crawler access request according to the blacklist database.
  • the blacklist database is a database of preset crawler access identifiers, which is used to detect whether the access identifier in the web page access request is a crawler access identifier.
  • the server receives the web page access request sent by the terminal, and detects the web page access request according to a preset blacklist database. In one embodiment, it can also detect, according to a preset crawler detection rule, whether the web page access request was sent by a crawler.
  • the preset crawler detection rule may perform crawler detection according to the IP address (Internet Protocol address) of the webpage access request, or according to its User Agent, a special string header of the webpage access request.
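The rule-based check described above can be sketched as follows. This is a minimal illustration, assuming a request carries an IP address and a User-Agent string; the keyword list, the blocked address, and the names `AccessRequest` and `matches_crawler_rule` are hypothetical and do not come from the application.

```python
from dataclasses import dataclass

# Hypothetical rule data; the application does not list concrete values.
SUSPICIOUS_UA_KEYWORDS = ("python-requests", "scrapy", "curl")
BLOCKED_IPS = {"203.0.113.7"}  # documentation-range address, illustrative only


@dataclass
class AccessRequest:
    ip: str
    user_agent: str
    url: str


def matches_crawler_rule(request: AccessRequest) -> bool:
    """Return True if the request matches the IP rule or the User-Agent rule."""
    if request.ip in BLOCKED_IPS:
        return True
    ua = request.user_agent.lower()
    # An empty User-Agent is treated as suspicious in this sketch (an assumption).
    return ua == "" or any(keyword in ua for keyword in SUSPICIOUS_UA_KEYWORDS)
```

Keeping the rule data in plain collections makes it easy to adjust the rules without touching the matching logic, which is one possible design among many.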
  • the webpage identifier is used to uniquely identify the page to be accessed by the webpage access request.
  • the webpage identifier can be the IP address information of the webpage, or the domain name of the webpage, and so on.
  • the webpage identifier is obtained according to the webpage access request. That is, when the webpage access request is a crawler access request, the webpage access request is parsed first to obtain the webpage identifier carried in the webpage access request.
  • S206 Obtain a corresponding simulated webpage screenshot according to the webpage identifier, and input the simulated webpage screenshot into the trained webpage generation model to obtain an output result.
  • a simulated webpage screenshot refers to a screenshot of a fake webpage saved in the server, and the fake webpage screenshot refers to a screenshot of a webpage that is different from the actual webpage to be returned.
  • the trained web page generation model refers to a neural network model trained on existing web page screenshots and their corresponding source code.
  • the neural network can use an LSTM (Long Short-Term Memory network, a kind of recurrent neural network) and a CNN (Convolutional Neural Network).
  • the corresponding relationship between the webpage identifier and the screenshot of the simulated webpage is preset in the server.
  • the webpage identifier is used to obtain the corresponding simulated webpage screenshot according to the corresponding relationship
  • the server inputs the simulated webpage screenshot into the trained webpage generation model to obtain the output result of the webpage generation model.
  • the output result may be a webpage code vector, and the corresponding relationship between the webpage code vector and the specific code is set during the training of the webpage generation model.
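A minimal sketch of how an output result could be mapped back to code, assuming the model emits one score vector per code word and that position i of each vector corresponds to entry i of a vocabulary fixed during training. The vocabulary, the use of `start` and `end` as delimiters, and the function name are illustrative assumptions, not details given in the application.

```python
from typing import Dict, List, Sequence

# Hypothetical vocabulary: vector position -> code word (fixed at training time).
INDEX_TO_TOKEN: Dict[int, str] = {0: "start", 1: "<html>", 2: "</html>", 3: "end"}


def decode_code_vectors(vectors: Sequence[Sequence[float]],
                        index_to_token: Dict[int, str]) -> str:
    """Turn a sequence of code vectors into source text via the stored mapping."""
    tokens: List[str] = []
    for vector in vectors:
        # Pick the position with the highest score (argmax).
        index = max(range(len(vector)), key=lambda i: vector[i])
        token = index_to_token[index]
        if token == "end":  # assumed end-of-code marker
            break
        if token != "start":  # assumed start-of-code marker, not emitted
            tokens.append(token)
    return " ".join(tokens)
```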
  • S208 Obtain the source code of the simulated webpage according to the output result, and return the source code of the simulated webpage to the terminal, and the terminal is used to generate the simulated webpage according to the source code of the simulated webpage.
  • the source code of the simulated webpage refers to the source code of the front end of the non-real webpage.
  • the source code can be in the form of HTML (Hypertext Markup Language), or in the form of XML (Extensible Markup Language), and so on.
  • the server obtains the simulated webpage source code according to the output result, that is, it obtains the simulated webpage source code corresponding to the output result according to the correspondence between webpage code vectors and specific code, and returns the simulated webpage source code to the terminal.
  • the terminal receives the simulated webpage source code returned by the server, generates a simulated web page from it, and displays the generated simulated web page.
  • in the above method, the web page generation model generates the simulated web page source code from the screenshot of the simulated web page, and the simulated web page source code is returned to the corresponding terminal.
  • the terminal generates the simulated web page from that source code, so the data a crawler collects is the fake data of the simulated web page. This keeps the crawler from bypassing rule-based restrictions to obtain the real web page data, and improves the security of the web page data.
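The overall flow of steps S202 to S208 can be sketched as a single request handler. This is only an outline under stated assumptions: the request is a dictionary with hypothetical `access_id` and `page_id` keys, the blacklist is an in-memory set, and `stub_model` is a stand-in for the trained web page generation model, which the application does not specify at code level.

```python
from typing import Callable, Dict, Set


def handle_request(request: Dict[str, str],
                   blacklist: Set[str],
                   simulated_screenshots: Dict[str, bytes],
                   real_sources: Dict[str, str],
                   generate_source: Callable[[bytes], str]) -> str:
    """Return page source, serving simulated source to detected crawlers."""
    page_id = request["page_id"]  # web page identifier parsed from the request
    if request["access_id"] in blacklist:  # S202: blacklist detection
        screenshot = simulated_screenshots[page_id]  # S206: preset correspondence
        return generate_source(screenshot)  # S206/S208: model output -> source
    return real_sources[page_id]  # normal access: real source code


def stub_model(screenshot: bytes) -> str:
    """Placeholder for the trained model; returns fixed decoy markup."""
    return "<html><body>simulated page from %d-byte screenshot</body></html>" % len(screenshot)
```

Passing the model in as a callable keeps the handler testable without a trained network; a real deployment would substitute the actual model inference call.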
  • step S202 which is to receive a web page access request sent by a terminal, and detect whether the web page access request is a crawler access request according to the blacklist database, includes the steps:
  • S302 Parse the webpage access request to obtain the access identifier, and search for the access identifier in the blacklist database.
  • the blacklist database refers to a database that has been set up in advance according to the crawler's access identifier, that is, the access identifier in the historical crawler's webpage access request is stored in the blacklist database.
  • the server when the server receives the web page access request, it parses the web page access request to obtain the access identifier carried in the web page access request, and then searches the blacklist database for the access identifier.
  • when the access identifier is found in the blacklist database, the web page access request is a crawler access request.
  • a crawler access request refers to an access request sent by a crawler to a web page.
  • a crawler refers to a program or script that automatically crawls information on the World Wide Web in accordance with certain rules.
  • when the server finds the access identifier in the blacklist database, it indicates that the webpage access request is a webpage access request sent by a crawler, that is, the webpage access request is a crawler access request.
  • when the server does not find the access identifier in the blacklist database, it can further detect the webpage access request.
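The blacklist lookup in step S302 can be sketched as below. The in-memory set stands in for the blacklist database, and the `access_id` key, the class name, and the returned labels are hypothetical choices made for this illustration.

```python
from typing import Dict, Iterable


class BlacklistDatabase:
    """In-memory stand-in for the database of known crawler access identifiers."""

    def __init__(self, identifiers: Iterable[str] = ()) -> None:
        self._identifiers = set(identifiers)

    def contains(self, access_identifier: str) -> bool:
        return access_identifier in self._identifiers

    def add(self, access_identifier: str) -> None:
        # Used when an administrator confirms a newly detected crawler.
        self._identifiers.add(access_identifier)


def classify_request(request: Dict[str, str], blacklist: BlacklistDatabase) -> str:
    """Parse the access identifier and look it up in the blacklist."""
    access_identifier = request["access_id"]
    if blacklist.contains(access_identifier):
        return "crawler"
    # Not found: the request is passed on for further detection.
    return "needs_further_detection"
```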
  • step S302 that is, after the webpage access request is parsed to obtain the access identifier, and after the access identifier is searched in the blacklist database, the method further includes the following steps:
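  • when the access identifier is not found, the historical access log corresponding to the access identifier is obtained and behavior features are extracted from it; when a behavior feature matches the preset rule, the web page access request is a crawler access request.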
  • the historical access log records the information of historically visited web pages, which can be obtained from the access.log of nginx (a high-performance HTTP and reverse proxy web server).
  • Behavior features describe how the web page is visited. For example, the number of concurrent connections is the number of visits made to the web page by the access identifier within a fixed period of time, and the hidden-information feature records whether information that is not visible on the page was nevertheless accessed.
  • the preset rule refers to the rule of abnormal access behavior set in advance.
  • the access identifier when the access identifier is not found in the blacklist database, it means that the webpage access request requires further detection.
  • the historical access log corresponding to the access identifier is obtained, and the behavior feature is extracted from the historical access log.
  • when the behavior feature matches the preset rule, the web page access request is a crawler access request. For example, if the behavior feature shows 32 concurrent connections in one minute, which matches the preset rule that the number of concurrent connections in one minute exceeds 30, the access request is a crawler access request.
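The concurrent-connection rule from the example above can be sketched as a sliding-window count over the visit timestamps extracted from the historical access log. Log parsing is omitted; the function names and the treatment of the window as half-open are assumptions of this sketch, while the threshold of 30 visits per minute comes from the example in the text.

```python
from datetime import datetime, timedelta
from typing import Iterable

MAX_VISITS_PER_WINDOW = 30  # preset rule from the example: more than 30 per minute
WINDOW = timedelta(minutes=1)


def max_visits_in_window(timestamps: Iterable[datetime],
                         window: timedelta = WINDOW) -> int:
    """Largest number of visits that fall inside any window of the given length."""
    ordered = sorted(timestamps)
    best = 0
    start = 0
    for end, current in enumerate(ordered):
        # Drop visits that are a full window or more older than the current one.
        while current - ordered[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def is_crawler_by_behavior(timestamps: Iterable[datetime]) -> bool:
    """Apply the preset rule to the visit times of one access identifier."""
    return max_visits_in_window(timestamps) > MAX_VISITS_PER_WINDOW
```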
  • when it is detected that the access request corresponding to an access identifier not in the blacklist database is a crawler access request, the server sends the access identifier to the management terminal, and the management terminal receives and displays it. At this time, if the administrator confirms that the access identifier belongs to a crawler access request, the access identifier can be added to the blacklist database. That is, the management terminal receives an instruction to add the access identifier, and writes the access identifier into the blacklist database according to that instruction.
  • when the behavior feature does not match the preset rule, the webpage access request is a normal access request.
  • the real source code of the web page can be obtained and returned to the terminal for web page display.
  • the historical access log is further used to determine whether the web page access request is a crawler access request, which improves the accuracy of detecting the crawler access request.
  • step S202 after receiving the web page access request sent by the terminal, and detecting whether the web page access request is a crawler access request according to the blacklist database, the method further includes the following steps:
  • a normal access request refers to a request for accessing a webpage without a crawler, such as a request for a user to access a webpage normally.
  • when the server does not find the access identifier of the webpage access request in the blacklist database, the webpage access request is not a crawler access request and is therefore a normal access request. The server then parses the normal access request to obtain the corresponding webpage identifier.
  • S404 Find the corresponding webpage source code according to the webpage identifier, and return the webpage source code to the terminal, where the terminal is used to generate a webpage according to the webpage source code.
  • the webpage source code refers to the source code of the real webpage to be returned to the terminal.
  • the server searches for the corresponding webpage source code according to the webpage identifier, returns the webpage source code to the terminal, and the terminal receives the webpage source code sent by the server, parses the webpage source code to generate the corresponding webpage and displays it.
  • in the above embodiment, when the webpage access request is detected to be a normal access request, the webpage identifier is obtained according to the webpage access request, the corresponding webpage source code is looked up according to the webpage identifier, and the webpage source code is returned to the terminal, which generates the webpage from it, so that the webpage is displayed normally.
  • after step S208, that is, after the source code of the simulated webpage is obtained according to the output result and returned to the terminal, where the terminal is used to generate the simulated webpage according to the source code of the simulated webpage, the method further includes the following steps:
  • S502 Receive the webpage behavior data sent by the terminal and generate a crawler identifier, and store the webpage behavior data in association with the crawler identifier.
  • Web page behavior data refers to data information in simulated web pages crawled by crawlers.
  • the crawler ID is used to uniquely identify the crawler, which can be the name of the crawler, the ID of the crawler, and so on.
  • the server receives the web page behavior data sent by the terminal and generates a crawler identifier, and stores the web page behavior data and the crawler identifier in a database in association with each other, so as to facilitate subsequent viewing and management.
  • S504 Obtain the address of the management terminal, and, according to that address, associate the webpage behavior data with the crawler identifier and send them to the management terminal.
  • the management terminal address refers to the address at which the management terminal receives crawler alert information, and the address may be the IP address of the management terminal.
  • the management terminal address is obtained, and the webpage behavior data and the crawler identification are associated and sent to the management terminal according to the management terminal address.
  • the management terminal receives the webpage behavior data and the crawler identification and performs an alarm display.
  • the management mailbox can be obtained, and the web page behavior data can be associated with the crawler identifier and sent to the management mailbox for a crawler alarm prompt.
  • the management mobile phone number can be obtained, and the webpage behavior data associated with the crawler identifier can be sent as a short message to the mobile phone corresponding to that number to raise a crawler alert.
  • in the above embodiment, the webpage behavior data sent by the terminal is received, a crawler identifier is generated, and the webpage behavior data is stored in association with the crawler identifier; the management terminal address is then obtained, and the webpage behavior data and the associated crawler identifier are sent to the management terminal at that address.
  • in this way, crawlers can be managed and alerts can be raised, which makes it easier for administrators to deal with them.
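Steps S502 and S504 can be sketched as follows. The dictionary store, the `send_alert` callable, and the `crawler-` prefix for generated identifiers are hypothetical; the application only states that the data is stored in association with a crawler identifier and sent to the management terminal address.

```python
import uuid
from typing import Any, Callable, Dict


def record_and_alert(behavior_data: Dict[str, Any],
                     store: Dict[str, Dict[str, Any]],
                     send_alert: Callable[[str, Dict[str, Any]], None],
                     management_address: str) -> str:
    """Store behavior data under a new crawler identifier and notify management."""
    # S502: generate a unique crawler identifier and save the association.
    crawler_id = "crawler-" + uuid.uuid4().hex[:8]
    store[crawler_id] = behavior_data
    # S504: send the associated data to the management terminal address.
    send_alert(management_address, {"crawler_id": crawler_id,
                                    "behavior_data": behavior_data})
    return crawler_id
```

Because the transport is injected as a callable, the same function could be wired to a management terminal, a mailbox, or a short-message gateway, matching the alternative alert channels mentioned above.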
  • the steps of generating the trained webpage generation model include:
  • S602 Obtain a screenshot of the page and the corresponding simulation source code, and obtain the corresponding code feature vector according to the simulation source code.
  • the simulated source code refers to the page source code corresponding to the screenshot of the page
  • the code feature vector refers to the code feature vector obtained after vectorization based on the simulated source code corresponding to the page screenshot.
  • the simulation source code is coded using the one-hot encoding method.
  • in one-hot encoding, an N-bit status register is used to encode N states. Each state has its own independent register bit, and at any time only one of the bits is active.
  • One-hot encoding is used to obtain the code feature vector corresponding to each code word in the simulated source code.
  • the server obtains a screenshot of the page and the corresponding simulation source code, and performs one-hot encoding on each code word in the simulation source code to obtain a code feature vector corresponding to each code word.
  • for example, if a piece of the simulated source code is "start hello word end", encoding each code word gives the code feature vector (0,0,0,1) for "start", (0,0,1,0) for "hello", (0,1,0,0) for "word", and (1,0,0,0) for "end".
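The one-hot example above can be reproduced with a short sketch. The bit layout, where the first code word sets the rightmost position, is chosen here only to match the vectors given in the example; the function name is illustrative.

```python
from typing import Dict, Sequence, Tuple


def one_hot_encode(tokens: Sequence[str]) -> Dict[str, Tuple[int, ...]]:
    """Map each distinct code word to an N-bit vector with exactly one bit set."""
    vocabulary = list(dict.fromkeys(tokens))  # distinct words, first-seen order
    size = len(vocabulary)
    encoding = {}
    for index, token in enumerate(vocabulary):
        vector = [0] * size
        # The first word sets the rightmost bit, as in the example in the text.
        vector[size - 1 - index] = 1
        encoding[token] = tuple(vector)
    return encoding


codes = one_hot_encode("start hello word end".split())
```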
  • S604 Use the page screenshot and the starting code feature vector in the code feature vector as the input of the neural network model, and use the code feature vector next to the starting code feature vector in the code feature vector as the label of the neural network model for training. When the training completion condition is met, the trained web page generation model is obtained.
  • the starting code feature vector refers to the code feature vector corresponding to the code word before the code word to be predicted in the simulated source code.
  • the page screenshot and the starting code feature vector in the code feature vector are used as the input of the neural network model, and the code feature vector next to the starting code feature vector in the code feature vector is used as the label of the neural network model for training.
  • the screenshot of the page and the code feature vector corresponding to the code word before the code word to be predicted are used as the input of the neural network model, and the code feature vector corresponding to the code word to be predicted is used as the label of the neural network model for training. This step is repeated until all code words of the simulated source code have been used as labels of the neural network model.
  • when the training reaches the preset number of iterations or the cost function reaches the preset threshold, the training is completed and the trained web page generation model is obtained.
  • for example, the page screenshot and the code feature vector (0,0,0,1) corresponding to "start" are used as the input of the recurrent neural network model, and the code feature vector (0,0,1,0) corresponding to "hello" is used as the label of the recurrent neural network model for training. Training then continues with the code feature vector (0,1,0,0) corresponding to "word" as the label. When all the code feature vectors have been used as labels of the recurrent neural network model and the preset cost function threshold is reached, the training is completed and the trained web page generation model is obtained.
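The pairing of inputs and labels described in step S604 can be sketched as below, together with the stopping condition. No actual network is trained here; the function names, the use of a byte string for the screenshot, and the interpretation of the threshold as "loss at or below the threshold" are assumptions of this sketch.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[int, ...]


def build_training_pairs(screenshot: bytes,
                         token_vectors: Sequence[Vector]
                         ) -> List[Tuple[Tuple[bytes, Vector], Vector]]:
    """Pair (screenshot, previous code word vector) with the next vector as label."""
    return [((screenshot, previous), following)
            for previous, following in zip(token_vectors, token_vectors[1:])]


def training_finished(iteration: int, loss: float,
                      max_iterations: int, loss_threshold: float) -> bool:
    """Stop at the preset iteration count or once the cost reaches the threshold."""
    return iteration >= max_iterations or loss <= loss_threshold


# Vectors for "start hello word end", taken from the example in the text.
vectors = [(0, 0, 0, 1), (0, 0, 1, 0), (0, 1, 0, 0), (1, 0, 0, 0)]
pairs = build_training_pairs(b"screenshot-bytes", vectors)
```

Each code word after the first appears exactly once as a label, which corresponds to repeating the step until all code words have been predicted.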
  • in the above embodiment, the web page generation model is trained in advance and deployed to the server.
  • the trained web page generation model can then be used directly to generate simulated web pages quickly, which improves the efficiency of producing simulated web pages.
  • in one embodiment, the method is applied to a supply chain finance platform. In supply chain finance, a bank centers on a core enterprise and manages the capital flow and logistics of upstream and downstream small and medium-sized enterprises (SMEs), turning the uncontrollable risk of a single enterprise into controllable risk for the supply chain as a whole.
  • by gathering information of all kinds from multiple dimensions, it provides a financial service that keeps risk to a minimum.
  • the user information, data, and amount involved in the supply chain financial platform are all sensitive. If crawled by a web crawler, it will cause serious information leakage and form a major information security problem.
  • when the supply chain finance platform receives the web page access request sent by the terminal, it detects the web page access request.
  • when it detects that the web page access request is a crawler access request, it obtains the web page identifier according to the web page access request, obtains the corresponding simulated web page screenshot according to the web page identifier, and inputs the simulated web page screenshot into the trained web page generation model to get the output result. It then obtains the source code of the simulated web page according to the output result and returns it to the terminal, which generates the simulated web page from the received source code.
  • the data the crawler collects is therefore simulated data, which prevents the real data of the supply chain finance platform from being crawled and ensures its security.
  • a webpage generation apparatus 700 which includes: a request detection module 702, an identification acquisition module 704, a screenshot acquisition module 706, and a webpage generation module 708, wherein:
  • the request detection module 702 is configured to receive a web page access request sent by the terminal, and detect whether the web page access request is a crawler access request according to the blacklist database;
  • the identification obtaining module 704 is configured to obtain a webpage identification according to the webpage access request when it is detected that the webpage access request is a crawler access request;
  • the screenshot obtaining module 706 is used to obtain a corresponding simulated webpage screenshot according to the webpage identifier, and input the simulated webpage screenshot into the trained webpage generation model to obtain an output result;
  • the webpage generation module 708 is configured to obtain the source code of the simulated webpage according to the output result, and return the source code of the simulated webpage to the terminal, and the terminal is used for generating the simulated webpage according to the source code of the simulated webpage.
  • the request detection module 702 is also used to parse the web page access request to obtain the access identifier, and search for the access identifier in the blacklist database; when the access identifier is found, the web page access request is a crawler access request.
  • the request detection module 702 is further configured to obtain the historical access log of the access identifier when the access identifier is not found, and extract the behavior feature from the historical access log; when the behavior feature is consistent with the preset rule, the web page access request is a crawler access request.
  • the request detection module 702 is further configured to: when the webpage access request is detected as a normal access request according to the blacklist database, obtain the webpage identifier according to the webpage access request; search for the corresponding webpage source code according to the webpage identifier, and return the webpage source code to the terminal, where the terminal is used to generate a webpage based on the webpage source code.
  • the webpage generating apparatus 700 further includes:
  • the data storage module is used to receive the webpage behavior data sent by the terminal and generate the crawler identification, and store the webpage behavior data in association with the crawler identification;
  • the data sending module is used to obtain the address of the management terminal and, according to that address, associate the webpage behavior data with the crawler identifier and send them to the management terminal.
  • the webpage generating apparatus 700 further includes:
  • the vector obtaining module is used to obtain a screenshot of the page and the corresponding simulation source code, and obtain the corresponding code feature vector according to the simulation source code;
  • the model training module is used to take the page screenshot and the starting code feature vector in the code feature vector as the input of the neural network model, and use the code feature vector in the code feature vector that is next to the starting code feature vector as the label of the neural network model for training; when the training completion conditions are met, the trained web page generation model is obtained.
  • each module in the above webpage generating device can be implemented in whole or in part by software, hardware, or a combination thereof.
  • the above-mentioned modules may be embedded in, or independent of, the processor of the computer equipment in the form of hardware, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 8.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus.
  • the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a non-volatile storage medium and an internal memory.
  • the non-volatile storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer-readable instructions in the non-volatile storage medium.
  • the database of the computer equipment is used to store access identification data and web page behavior data.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer readable instruction is executed by the processor to realize a web page generation method.
  • FIG. 8 is only a block diagram of a part of the structure related to the solution of the present application, and does not constitute a limitation on the computer device to which the solution of the present application is applied.
  • a specific computer device may include more or fewer parts than shown in the figure, combine some parts, or have a different arrangement of parts.
  • a computer device includes a memory and one or more processors.
  • the memory stores computer-readable instructions.
  • when the computer-readable instructions are executed by the one or more processors, the steps of the webpage generation method provided in any embodiment of the present application are implemented.
  • One or more non-volatile storage media storing computer-readable instructions.
  • when the computer-readable instructions are executed by one or more processors, the one or more processors implement the steps of the webpage generation method provided in any embodiment of the present application.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

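A web page generation method, comprising: receiving a web page access request sent by a terminal, and detecting, according to a blacklist database, whether the web page access request is a crawler access request; when the web page access request is detected to be a crawler access request, obtaining a web page identifier according to the web page access request; obtaining a corresponding simulated web page screenshot according to the web page identifier, and inputting the simulated web page screenshot into a trained web page generation model to obtain an output result; and obtaining simulated web page source code according to the output result and returning the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.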

Description

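Web page generation method, apparatus, computer device, and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 2019108437546, entitled "Web page generation method, apparatus, computer device, and storage medium" and filed with the China Patent Office on September 6, 2019, the entire contents of which are incorporated herein by reference.
Technical field
This application relates to a web page generation method, apparatus, computer device, and storage medium.
Background
With the development of Internet technology, data is commonly obtained from the Internet by using web crawlers. However, malicious crawlers often ignore the generally accepted robots protocol and crawl data without permission, which not only leaks users' private data but also increases the response load on servers. At present, preset rules are usually used to restrict web crawlers. Rule-based restrictions, however, are easily identified by the party doing the crawling, who can then bypass the rules and keep crawling, so data is still leaked and data security problems remain.
Summary
According to various embodiments disclosed in this application, a web page generation method, apparatus, computer device, and storage medium are provided.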
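A web page generation method, comprising:
receiving a web page access request sent by a terminal, and detecting, according to a blacklist database, whether the web page access request is a crawler access request;
when the web page access request is detected to be a crawler access request, obtaining a web page identifier according to the web page access request;
obtaining a corresponding simulated web page screenshot according to the web page identifier, and inputting the simulated web page screenshot into a trained web page generation model to obtain an output result; and
obtaining simulated web page source code according to the output result, and returning the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
A web page generation apparatus, comprising:
a request detection module, configured to receive a web page access request sent by a terminal and detect the web page access request;
an identifier obtaining module, configured to obtain a web page identifier according to the web page access request when the web page access request is detected to be a crawler access request;
a screenshot obtaining module, configured to obtain a corresponding simulated web page screenshot according to the web page identifier, and input the simulated web page screenshot into a trained web page generation model to obtain an output result; and
a web page generation module, configured to obtain simulated web page source code according to the output result and return the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
receiving a web page access request sent by a terminal, and detecting the web page access request;
when the web page access request is detected to be a crawler access request, obtaining a web page identifier according to the web page access request;
obtaining a corresponding simulated web page screenshot according to the web page identifier, and inputting the simulated web page screenshot into a trained web page generation model to obtain an output result; and
obtaining simulated web page source code according to the output result, and returning the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
One or more non-volatile storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
receiving a web page access request sent by a terminal, and detecting the web page access request;
when the web page access request is detected to be a crawler access request, obtaining a web page identifier according to the web page access request;
obtaining a corresponding simulated web page screenshot according to the web page identifier, and inputting the simulated web page screenshot into a trained web page generation model to obtain an output result; and
obtaining simulated web page source code according to the output result, and returning the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
Details of one or more embodiments of this application are set forth in the accompanying drawings and the description below. Other features and advantages of this application will become apparent from the specification, the drawings, and the claims.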
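Brief description of the drawings
To describe the technical solutions in the embodiments of this application more clearly, the drawings required by the embodiments are briefly introduced below. The drawings described below are merely some embodiments of this application, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram of an application scenario of the web page generation method according to one or more embodiments.
FIG. 2 is a schematic flowchart of the web page generation method according to one or more embodiments.
FIG. 3 is a schematic flowchart of detecting a web page access request according to one or more embodiments.
FIG. 4 is a schematic flowchart of generating a normal web page according to one or more embodiments.
FIG. 5 is a schematic flowchart of issuing a crawler alert according to one or more embodiments.
FIG. 6 is a schematic flowchart of training the web page generation model according to one or more embodiments.
FIG. 7 is a block diagram of the web page generation apparatus according to one or more embodiments.
FIG. 8 is a block diagram of the computer device according to one or more embodiments.
Detailed description
To make the technical solutions and advantages of this application clearer, this application is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain this application and do not limit it.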
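The web page generation method provided in this application can be applied in the application environment shown in FIG. 1. A terminal 102 communicates with a server 104 over a network. The server 104 receives a web page access request sent by the terminal 102 and detects the web page access request; when the web page access request is detected to be a crawler access request, it obtains a web page identifier according to the request; it obtains a corresponding simulated web page screenshot according to the web page identifier and inputs the screenshot into a trained web page generation model to obtain an output result; the server 104 then obtains simulated web page source code according to the output result and returns it to the terminal 102, which generates a simulated web page according to the simulated web page source code. The terminal 102 may be, but is not limited to, a personal computer, notebook computer, smartphone, tablet computer, or portable wearable device, and the server 104 may be implemented as a standalone server or as a server cluster composed of multiple servers.
In one embodiment, as shown in FIG. 2, a web page generation method is provided. Taking the method as applied to the server in FIG. 1 as an example, it includes the following steps:
S202: Receive a web page access request sent by a terminal, and detect, according to a blacklist database, whether the web page access request is a crawler access request.
The blacklist database is a preconfigured database of crawler access identifiers, used to check whether the access identifier in a web page access request is a crawler access identifier.
Specifically, the server receives the web page access request sent by the terminal and checks it against the preconfigured blacklist database. In one embodiment, preconfigured crawler detection rules may also be used to detect whether the web page access request was sent by a crawler. Such a rule may perform crawler detection based on the IP address (Internet Protocol address) of the request. In one embodiment, crawler access requests may also be detected from the request's user agent (User Agent, a special header string that lets the server identify the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, browser plug-ins, and so on).
S204: When the web page access request is detected to be a crawler access request, obtain a web page identifier according to the web page access request.
The web page identifier uniquely identifies the page that the web page access request is trying to access. It may be the IP address information of the web page, the domain name of the web page, and so on.
Specifically, when the web page access request is detected to be a crawler access request, the web page identifier is obtained according to the request; that is, the request is first parsed to obtain the web page identifier it carries.
S206: Obtain a corresponding simulated web page screenshot according to the web page identifier, and input the simulated web page screenshot into the trained web page generation model to obtain an output result.
A simulated web page screenshot is a screenshot of a fake web page stored on the server, that is, a screenshot of a page different from the page that would really be returned. The trained web page generation model is produced with a neural network algorithm from existing web page screenshots and their corresponding source code; the neural network algorithm may be an LSTM (Long Short-Term Memory network, a kind of recurrent neural network) and a CNN (Convolutional Neural Network).
Specifically, a correspondence between web page identifiers and simulated web page screenshots is set up on the server in advance. When the web page access request is detected to be a crawler access request, the web page identifier is used to obtain the corresponding simulated web page screenshot according to this correspondence, and the server inputs the screenshot into the trained web page generation model to obtain the model's output result. The output result may be web page code vectors; the correspondence between web page code vectors and concrete code is set when the web page generation model is trained.
S208: Obtain simulated web page source code according to the output result, and return the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
Simulated web page source code is front-end source code that does not belong to the real web page. It may be in HTML (HyperText Markup Language) form, in XML (Extensible Markup Language) form, and so on.
Specifically, the server obtains the simulated web page source code according to the output result, that is, it uses the correspondence between web page code vectors and concrete code to obtain the simulated web page source code corresponding to the output result, and returns the simulated web page source code to the terminal. On receiving it, the terminal generates the simulated web page according to the simulated web page source code and displays the generated simulated web page.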
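The mapping from web page code vectors back to concrete code can be sketched as follows. This is a minimal illustration, not the application's implementation: it assumes the model emits one score vector per code word, that the largest score selects the code word, and that "start" and "end" act as boundary markers (borrowed from the training example in this description). The names `decode_output` and `index_to_token` are hypothetical.

```python
from typing import Mapping, Sequence


def decode_output(vectors: Sequence[Sequence[float]],
                  index_to_token: Mapping[int, str]) -> str:
    """Turn model output vectors into source text via a fixed vector-to-code table."""
    tokens = []
    for vector in vectors:
        # Assumption: the position with the highest score names the code word.
        index = max(range(len(vector)), key=vector.__getitem__)
        token = index_to_token[index]
        if token == "end":       # assumed end-of-sequence marker
            break
        if token != "start":     # assumed start marker, not part of the output
            tokens.append(token)
    return " ".join(tokens)


# Table fixed at training time; positions follow the one-hot example in this description.
index_to_token = {3: "start", 2: "hello", 1: "word", 0: "end"}
output = [(0.1, 0.1, 0.7, 0.1), (0.0, 0.8, 0.1, 0.1), (0.9, 0.05, 0.03, 0.02)]
print(decode_output(output, index_to_token))
```

In a real deployment the table would hold markup tokens rather than these placeholder words.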
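In the above web page generation method, web page access requests are detected; when a request is detected to be a crawler access request, the web page generation model generates simulated web page source code from the simulated web page screenshot, and the simulated web page source code is returned to the corresponding terminal, which generates a simulated web page from it. The data the crawler collects is therefore fake data from the simulated web page, which prevents the crawler from evading rule-based restrictions to obtain real web page data and improves the security of web page data.
In one embodiment, as shown in FIG. 3, step S202, that is, receiving a web page access request sent by a terminal and detecting, according to the blacklist database, whether the web page access request is a crawler access request, includes the steps:
S302: Parse the web page access request to obtain an access identifier, and look up the access identifier in the blacklist database.
The blacklist database is a database set up in advance from crawler access identifiers; that is, the access identifiers from historical crawler requests for web pages are stored in the blacklist database.
Specifically, on receiving the web page access request, the server parses it to obtain the access identifier it carries, and then looks up the access identifier in the blacklist database.
S304: When the access identifier is found, the web page access request is a crawler access request.
A crawler access request is an access request sent to a web page by a crawler, and a crawler is a program or script that automatically fetches information from the World Wide Web according to certain rules.
Specifically, when the server finds the access identifier in the blacklist database, the web page access request was sent by a crawler, that is, it is a crawler access request. When the server does not find the access identifier in the blacklist database, the web page access request can be examined further.
In the above embodiment, performing crawler detection on web page access requests with a blacklist database improves the efficiency of detecting crawler access requests.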
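Steps S302 and S304 amount to a membership test. The sketch below assumes the access identifier is a client IP address and that the blacklist fits in an in-memory set; the `AccessRequest` type and the sample addresses (taken from documentation-only IP ranges) are hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class AccessRequest:
    """Hypothetical, simplified view of a parsed web page access request."""
    access_id: str  # assumed to be the client IP address
    page_id: str    # identifier of the requested page


def is_crawler_request(request: AccessRequest, blacklist: set[str]) -> bool:
    """S302/S304: the request is a crawler request if its identifier is blacklisted."""
    return request.access_id in blacklist


blacklist = {"203.0.113.7", "198.51.100.23"}
print(is_crawler_request(AccessRequest("203.0.113.7", "/orders"), blacklist))
print(is_crawler_request(AccessRequest("192.0.2.10", "/orders"), blacklist))
```

A set gives constant-time lookups on average, which is in line with the efficiency benefit the embodiment claims for the blacklist check.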
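In one embodiment, after step S302, that is, after parsing the web page access request to obtain the access identifier and looking up the access identifier in the blacklist database, the method further includes the step:
When the access identifier is not found, obtain the historical access log of the access identifier and extract behavioral features from it; when the behavioral features match a preset rule, the web page access request is a crawler access request.
The historical access log records information about past web page accesses and can be obtained from the access.log of nginx (a high-performance HTTP and reverse proxy web server). Behavioral features are characteristics of how web pages were accessed. For example, the concurrent connection count feature is the number of times the access identifier accessed web pages within a fixed time period, and the hidden information access feature indicates whether information that is not visible on the page was nevertheless accessed. A preset rule is a preconfigured rule describing abnormal access behavior.
Specifically, when the access identifier is not found in the blacklist database, the web page access request needs further checking. The historical access log corresponding to the access identifier is then obtained and behavioral features are extracted from it; when the behavioral features match a preset rule, the web page access request is a crawler access request. For example, if the concurrent connection count feature is 32 within 1 minute, which matches the preset rule of more than 30 concurrent connections within 1 minute, the access request is a crawler access request.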
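The rule in this example (more than 30 accesses within 1 minute) can be checked with a sliding window over the request timestamps of one access identifier. The sketch assumes the timestamps have already been extracted from the access log; log parsing is omitted, and the function names are hypothetical.

```python
from __future__ import annotations

from datetime import datetime, timedelta


def max_requests_in_window(timestamps: list[datetime], window: timedelta) -> int:
    """Largest number of requests that fall inside any span shorter than `window`."""
    ordered = sorted(timestamps)
    best = 0
    start = 0
    for end, current in enumerate(ordered):
        # Advance the left edge until the span is shorter than the window.
        while current - ordered[start] >= window:
            start += 1
        best = max(best, end - start + 1)
    return best


def matches_crawler_rule(timestamps: list[datetime],
                         window: timedelta = timedelta(minutes=1),
                         limit: int = 30) -> bool:
    """Preset rule from the example: more than `limit` accesses per window."""
    return max_requests_in_window(timestamps, window) > limit


base = datetime(2019, 9, 6, 12, 0, 0)
burst = [base + timedelta(seconds=i) for i in range(32)]       # 32 hits in 31 s
steady = [base + timedelta(seconds=10 * i) for i in range(32)]  # one hit every 10 s
print(matches_crawler_rule(burst), matches_crawler_rule(steady))
```

Other behavioral features, such as access to hidden page elements, would need their own extractors and rules.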
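In one embodiment, when an access request whose access identifier is not in the blacklist database is detected to be a crawler access request, the server sends the access identifier to the management terminal, which receives and displays it. If the administrator then confirms that the access identifier belongs to a crawler, the access identifier can be added to the blacklist database; that is, the management terminal receives an access identifier addition instruction and writes the access identifier into the blacklist database according to that instruction.
In one embodiment, when the behavioral features do not match the preset rule, the web page access request is a normal access request. The real web page source code can then be obtained and returned to the terminal for display.
In the above embodiment, additionally using the historical access log to determine whether a web page access request is a crawler access request improves the accuracy of detecting crawler access requests.
In one embodiment, as shown in FIG. 4, after step S202, that is, after receiving a web page access request sent by a terminal and detecting, according to the blacklist database, whether the web page access request is a crawler access request, the method further includes the steps:
S402: When the web page access request is detected, according to the blacklist database, to be a normal access request, obtain the web page identifier according to the web page access request.
A normal access request is a request to access a web page that is not made through a crawler, such as a user's ordinary request for a web page.
Specifically, when the server does not find the access identifier of the web page access request in the blacklist database, the request is not a crawler access request and is therefore a normal access request. The server then parses the normal access request to obtain the corresponding web page identifier.
S404: Look up the corresponding web page source code according to the web page identifier, and return the web page source code to the terminal, the terminal being configured to generate a web page according to the web page source code.
The web page source code is the source code of the real web page to be returned to the terminal.
Specifically, the server looks up the corresponding web page source code according to the web page identifier and returns it to the terminal. The terminal receives the web page source code sent by the server, parses it to generate the corresponding web page, and displays the page.
In the above embodiment, when the web page access request is detected to be a normal access request, the web page identifier is obtained according to the request, the corresponding web page source code is looked up according to the web page identifier and returned to the terminal, and the terminal generates the web page from it, so that the web page is displayed normally for normal access requests.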
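Taken together, the crawler branch (S204 to S208) and the normal branch (S402 to S404) form a simple dispatch. The sketch below is illustrative only: the trained model is replaced by a stub function, the page tables are plain dictionaries, and every name is hypothetical.

```python
from typing import Callable, Mapping, Set


def handle_request(access_id: str,
                   page_id: str,
                   blacklist: Set[str],
                   real_pages: Mapping[str, str],
                   simulated_screenshots: Mapping[str, bytes],
                   generate_source: Callable[[bytes], str]) -> str:
    """Return simulated source to blacklisted callers and real source otherwise."""
    if access_id in blacklist:
        # Crawler branch: page identifier -> simulated screenshot -> model output.
        screenshot = simulated_screenshots[page_id]
        return generate_source(screenshot)
    # Normal branch: look up and return the real page source.
    return real_pages[page_id]


def stub_model(screenshot: bytes) -> str:
    """Stand-in for the trained web page generation model; ignores the image."""
    return "<html><body>simulated</body></html>"


real_pages = {"/orders": "<html><body>real orders</body></html>"}
screenshots = {"/orders": b"placeholder image bytes"}
blacklist = {"203.0.113.7"}

print(handle_request("203.0.113.7", "/orders", blacklist, real_pages, screenshots, stub_model))
print(handle_request("192.0.2.10", "/orders", blacklist, real_pages, screenshots, stub_model))
```

Passing the model in as a callable keeps the dispatch logic testable without loading a neural network.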
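In one embodiment, as shown in FIG. 5, after step S208, that is, after obtaining the simulated web page source code according to the output result and returning it to the terminal so that the terminal generates the simulated web page, the method further includes the steps:
S502: Receive web page behavior data sent by the terminal and generate a crawler identifier, and save the web page behavior data in association with the crawler identifier.
Web page behavior data is the data information that the crawler crawled from the simulated web page. The crawler identifier uniquely identifies the crawler and may be the crawler's name, the crawler's ID, and so on.
Specifically, the server receives the web page behavior data sent by the terminal, generates a crawler identifier, and saves the web page behavior data in association with the crawler identifier in the database for later review and management.
S504: Obtain the management terminal address, and send the web page behavior data together with the associated crawler identifier to the management terminal according to that address.
The management terminal address is the address at which the management terminal receives crawler alert information; it may be the IP address of the management terminal.
Specifically, the management terminal address is obtained and the web page behavior data and associated crawler identifier are sent to the management terminal according to that address; the management terminal receives them and displays an alert. In one example, an administrator mailbox may be obtained and the web page behavior data and associated crawler identifier sent to that mailbox as a crawler alert. In one embodiment, an administrator mobile phone number may be obtained and the web page behavior data and associated crawler identifier sent as an SMS message to the corresponding phone as a crawler alert.
In the above embodiment, the web page behavior data sent by the terminal is received, a crawler identifier is generated and saved in association with the behavior data, and both are sent to the management terminal at the obtained address. This allows crawlers to be managed and alerts to be raised, making it easier for administrators to deal with crawlers.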
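Steps S502 and S504 can be sketched as a small record store plus an alert payload. This is an assumption-laden illustration: the crawler identifier is a random UUID, the store is an in-memory dictionary rather than a database, and the alert is only built, not sent. `CrawlerStore` and `build_alert` are hypothetical names.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass, field


@dataclass
class CrawlerStore:
    """In-memory stand-in for the database that keeps crawler behavior data."""
    records: dict[str, list[str]] = field(default_factory=dict)

    def save(self, behavior_data: list[str]) -> str:
        """S502: generate a crawler identifier and store the data under it."""
        crawler_id = uuid.uuid4().hex
        self.records[crawler_id] = list(behavior_data)
        return crawler_id


def build_alert(crawler_id: str, behavior_data: list[str], admin_address: str) -> dict:
    """S504: assemble the alert addressed to the management terminal."""
    return {"to": admin_address, "crawler_id": crawler_id, "behavior": list(behavior_data)}


store = CrawlerStore()
crawled = ["/orders", "/orders?page=2"]
crawler_id = store.save(crawled)
alert = build_alert(crawler_id, crawled, "192.0.2.50")
print(alert["to"], len(alert["behavior"]))
```

Delivery by e-mail or SMS, as the embodiment mentions, would replace the dictionary with a call to the relevant messaging service.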
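In one embodiment, as shown in FIG. 6, the trained web page generation model is generated by the following steps:
S602: Obtain a page screenshot and the corresponding simulated source code, and obtain the corresponding code feature vectors from the simulated source code.
The simulated source code is the page source code corresponding to the page screenshot, and a code feature vector is the vector obtained by vectorizing that simulated source code. For example, the simulated source code is encoded with one-hot encoding: an N-bit state register encodes N states, each state has its own register bit, and at any time only one bit is active. One-hot encoding yields a code feature vector for each code word in the simulated source code.
Specifically, the server obtains the page screenshot and the corresponding simulated source code, and one-hot encodes each code word in the simulated source code to obtain the code feature vector of each code word. In a specific embodiment, a piece of the simulated source code is "start hello word end"; encoding each code word gives (0,0,0,1) for "start", (0,0,1,0) for "hello", (0,1,0,0) for "word", and (1,0,0,0) for "end".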
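The one-hot encoding of this example can be reproduced with a few lines. The sketch assumes code words are separated by whitespace and numbers the bits from the right, so that the first code word gets the last position, which is the layout the example uses. `one_hot_vocabulary` is a hypothetical name.

```python
from typing import Dict, Sequence, Tuple


def one_hot_vocabulary(tokens: Sequence[str]) -> Dict[str, Tuple[int, ...]]:
    """Assign each distinct code word an N-bit vector with exactly one bit set."""
    vocabulary = list(dict.fromkeys(tokens))  # distinct words, first-seen order
    size = len(vocabulary)
    vectors = {}
    for position, token in enumerate(vocabulary):
        bits = [0] * size
        # Count from the right so the first word sets the last bit, as in the example.
        bits[size - 1 - position] = 1
        vectors[token] = tuple(bits)
    return vectors


codes = one_hot_vocabulary("start hello word end".split())
for token, vector in codes.items():
    print(token, vector)
```

With a realistic markup vocabulary the vectors grow to one bit per distinct token, so the vocabulary size fixes the width of the model's output layer.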
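S604: Use the page screenshot and the starting code feature vector among the code feature vectors as the input of a neural network model, use the code feature vector immediately following the starting code feature vector as the label of the neural network model for training, and obtain the trained web page generation model when the training completion condition is reached.
The starting code feature vector is the code feature vector of the code word that precedes the code word to be predicted in the simulated source code.
Specifically, the page screenshot and the starting code feature vector among the code feature vectors are used as the input of the neural network model, and the code feature vector immediately following the starting code feature vector is used as the label for training. That is, the page screenshot and the code feature vector of the code word preceding the code word to be predicted are the input, and the code feature vector of the code word to be predicted is the label. This step is repeated until every code word in the simulated source code has been used as a label. Training is complete when it reaches a preset number of iterations or a preset threshold, yielding the trained web page generation model. In a specific embodiment, the page screenshot and the code feature vector (0,0,0,1) for "start" are the input of a recurrent neural network model and the code feature vector (0,0,1,0) for "hello" is the label; then the page screenshot and the code feature vector (0,0,1,0) for "hello" are the input and the code feature vector (0,1,0,0) for "word" is the label, and training continues in this way. When all code feature vectors have been used as labels and the preset cost function threshold is reached, training is complete and the trained web page generation model is obtained.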
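The pairing of inputs and labels described here can be sketched independently of any neural network library. The sketch only builds the training pairs; the model, the loss, and the stopping condition are left out, and `training_pairs` is a hypothetical name.

```python
from typing import List, Sequence, Tuple

Vector = Tuple[int, ...]


def training_pairs(screenshot: bytes,
                   code_vectors: Sequence[Vector]) -> List[Tuple[Tuple[bytes, Vector], Vector]]:
    """Pair (screenshot, current code vector) with the next code vector as label."""
    return [((screenshot, current), following)
            for current, following in zip(code_vectors, code_vectors[1:])]


# Vectors for "start hello word end", as given in the example.
sequence = [(0, 0, 0, 1), (0, 0, 1, 0), (0, 1, 0, 0), (1, 0, 0, 0)]
pairs = training_pairs(b"placeholder image bytes", sequence)
for (image, current), label in pairs:
    print(current, "->", label)
```

A sequence of four code words yields three pairs, because the first word is only ever an input and the last word is only ever a label.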
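In the above embodiment, the web page generation model is trained in advance and then deployed on the server, so that when crawler access is detected the trained model can quickly generate a simulated web page, improving the efficiency of generating simulated web pages.
In a specific embodiment, the method is applied in a supply chain finance platform. Supply chain finance is a financial service in which a bank, centered on a core enterprise, manages the capital flow and logistics of upstream and downstream small and medium-sized enterprises, converts the uncontrollable risk of individual enterprises into the controllable risk of the supply chain as a whole, and keeps risk to a minimum by gathering information from multiple dimensions. The user information, documents, and amounts involved in a supply chain finance platform are sensitive; if they are crawled, serious information leakage and major information security problems result. In this case, when the supply chain finance platform receives a web page access request sent by a terminal, it detects the request; when the request is detected to be a crawler access request, it obtains the web page identifier according to the request, obtains the corresponding simulated web page screenshot according to the web page identifier, inputs the screenshot into the trained web page generation model to obtain an output result, obtains simulated web page source code according to the output result, and returns the simulated web page source code to the terminal. The terminal generates a simulated web page from the received source code, so the data the crawler collects is simulated data. This prevents the platform's real data from being crawled and protects its security.
It should be understood that although the steps in the flowcharts of FIGS. 2-6 are shown in sequence as indicated by the arrows, they are not necessarily executed in that order. Unless explicitly stated herein, there is no strict restriction on the order of execution, and the steps may be executed in other orders. Moreover, at least some of the steps in FIGS. 2-6 may include multiple sub-steps or stages, which are not necessarily completed at the same time but may be executed at different times, and which are not necessarily executed sequentially but may be executed in turn or alternately with other steps or with at least part of the sub-steps or stages of other steps.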
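In one embodiment, as shown in FIG. 7, a web page generation apparatus 700 is provided, including a request detection module 702, an identifier obtaining module 704, a screenshot obtaining module 706, and a web page generation module 708, where:
the request detection module 702 is configured to receive a web page access request sent by a terminal and detect, according to a blacklist database, whether the web page access request is a crawler access request;
the identifier obtaining module 704 is configured to obtain a web page identifier according to the web page access request when the web page access request is detected to be a crawler access request;
the screenshot obtaining module 706 is configured to obtain a corresponding simulated web page screenshot according to the web page identifier, and input the simulated web page screenshot into the trained web page generation model to obtain an output result;
the web page generation module 708 is configured to obtain simulated web page source code according to the output result and return the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
In one embodiment, the request detection module 702 is further configured to parse the web page access request to obtain an access identifier and look up the access identifier in the blacklist database; when the access identifier is found, the web page access request is a crawler access request.
In one embodiment, the request detection module 702 is further configured to, when the access identifier is not found, obtain the historical access log of the access identifier and extract behavioral features from it; when the behavioral features match a preset rule, the web page access request is a crawler access request.
In one embodiment, the request detection module 702 is further configured to: when the web page access request is detected, according to the blacklist database, to be a normal access request, obtain the web page identifier according to the web page access request; and look up the corresponding web page source code according to the web page identifier and return it to the terminal, the terminal being configured to generate a web page according to the web page source code.
In one embodiment, the web page generation apparatus 700 further includes:
a data saving module, configured to receive web page behavior data sent by the terminal, generate a crawler identifier, and save the web page behavior data in association with the crawler identifier;
a data sending module, configured to obtain the management terminal address and send the web page behavior data together with the associated crawler identifier to the management terminal according to that address.
In one embodiment, the web page generation apparatus 700 further includes:
a vector obtaining module, configured to obtain a page screenshot and the corresponding simulated source code, and obtain the corresponding code feature vectors from the simulated source code;
a model training module, configured to use the page screenshot and the starting code feature vector among the code feature vectors as the input of a neural network model, use the code feature vector immediately following the starting code feature vector as the label of the neural network model for training, and obtain the trained web page generation model when the training completion condition is reached.
For the specific limitations of the web page generation apparatus, refer to the limitations of the web page generation method above, which are not repeated here. Each module of the web page generation apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in or independent of the processor of the computer device in hardware form, or stored in the memory of the computer device in software form, so that the processor can call them and perform the operations corresponding to each module.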
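In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 8. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor provides computing and control capabilities. The memory includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions stored in the non-volatile storage medium. The database stores access identifier data, web page behavior data, and the like. The network interface communicates with external terminals over a network connection. The computer-readable instructions, when executed by the processor, implement a web page generation method.
A person skilled in the art will understand that the structure shown in FIG. 8 is only a block diagram of the part of the structure related to the solution of this application and does not limit the computer device to which the solution is applied. A specific computer device may include more or fewer components than shown in the figure, combine some components, or arrange the components differently.
A computer device includes a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the processors, implement the steps of the web page generation method provided in any embodiment of this application.
One or more non-volatile storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the steps of the web page generation method provided in any embodiment of this application.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be completed by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium and, when executed, may include the processes of the method embodiments above. Any reference to memory, storage, a database, or another medium in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations are described; however, as long as a combination of these technical features involves no contradiction, it should be considered within the scope of this specification.
The above embodiments express only several implementations of this application. Their description is relatively specific and detailed, but it should not be understood as limiting the scope of the invention patent. It should be noted that a person of ordinary skill in the art can make several variations and improvements without departing from the concept of this application, all of which fall within its scope of protection. The scope of protection of this patent application is therefore defined by the appended claims.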

Claims (20)

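  1. A web page generation method, comprising:
    receiving a web page access request sent by a terminal, and detecting, according to a blacklist database, whether the web page access request is a crawler access request;
    when the web page access request is detected to be a crawler access request, obtaining a web page identifier according to the web page access request;
    obtaining a corresponding simulated web page screenshot according to the web page identifier, and inputting the simulated web page screenshot into a trained web page generation model to obtain an output result; and
    obtaining simulated web page source code according to the output result, and returning the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
  2. The method according to claim 1, wherein receiving the web page access request sent by the terminal and detecting, according to the blacklist database, whether the web page access request is a crawler access request comprises:
    parsing the web page access request to obtain an access identifier, and looking up the access identifier in the blacklist database; and
    when the access identifier is found, the web page access request is a crawler access request.
  3. The method according to claim 2, further comprising, after parsing the web page access request to obtain the access identifier and looking up the access identifier in the blacklist database:
    when the access identifier is not found, obtaining a historical access log of the access identifier and extracting behavioral features from the historical access log, wherein when the behavioral features match a preset rule, the web page access request is a crawler access request.
  4. The method according to claim 3, further comprising, after the step in which, when the access identifier is not found, the historical access log of the access identifier is obtained, behavioral features are extracted from the historical access log, and the web page access request is a crawler access request when the behavioral features match the preset rule:
    saving the access identifier to the blacklist database, and sending the access identifier to a management terminal so that the management terminal displays the access identifier.
  5. The method according to claim 1, further comprising, after receiving the web page access request sent by the terminal and detecting, according to the blacklist database, whether the web page access request is a crawler access request:
    when the web page access request is detected, according to the blacklist database, to be a normal access request, obtaining a web page identifier according to the web page access request; and
    looking up corresponding web page source code according to the web page identifier, and returning the web page source code to the terminal, the terminal being configured to generate a web page according to the web page source code.
  6. The method according to claim 1, further comprising, after obtaining the simulated web page source code according to the output result and returning the simulated web page source code to the terminal, the terminal being configured to generate the simulated web page according to the simulated web page source code:
    receiving web page behavior data sent by the terminal and generating a crawler identifier, and saving the web page behavior data in association with the crawler identifier; and
    obtaining a management terminal address, and sending the web page behavior data together with the associated crawler identifier to the management terminal according to the management terminal address.
  7. The method according to claim 1, wherein the trained web page generation model is generated by steps comprising:
    obtaining a page screenshot and corresponding simulated source code, and obtaining corresponding code feature vectors from the simulated source code; and
    using the page screenshot and a starting code feature vector among the code feature vectors as the input of a neural network model, using the code feature vector immediately following the starting code feature vector as the label of the neural network model for training, and obtaining the trained web page generation model when a training completion condition is reached.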
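  8. A web page generation apparatus, comprising:
    a request detection module, configured to receive a web page access request sent by a terminal and detect, according to a blacklist database, whether the web page access request is a crawler access request;
    an identifier obtaining module, configured to obtain a web page identifier according to the web page access request when the web page access request is detected to be a crawler access request;
    a screenshot obtaining module, configured to obtain a corresponding simulated web page screenshot according to the web page identifier, and input the simulated web page screenshot into a trained web page generation model to obtain an output result;
    a web page generation module, configured to obtain simulated web page source code according to the output result and return the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
  9. The apparatus according to claim 8, wherein the request detection module is further configured to parse the web page access request to obtain an access identifier and look up the access identifier in the blacklist database; when the access identifier is found, the web page access request is a crawler access request.
  10. The apparatus according to claim 8, wherein the request detection module is further configured to, when the access identifier is not found, obtain a historical access log of the access identifier and extract behavioral features from the historical access log; when the behavioral features match a preset rule, the web page access request is a crawler access request.
  11. The apparatus according to claim 8, wherein the request detection module is further configured to, when the web page access request is detected, according to the blacklist database, to be a normal access request, obtain a web page identifier according to the web page access request; and look up corresponding web page source code according to the web page identifier and return the web page source code to the terminal, the terminal being configured to generate a web page according to the web page source code.
  12. The apparatus according to claim 8, further comprising:
    a vector obtaining module, configured to obtain a page screenshot and corresponding simulated source code, and obtain corresponding code feature vectors from the simulated source code; and
    a model training module, configured to use the page screenshot and a starting code feature vector among the code feature vectors as the input of a neural network model, use the code feature vector immediately following the starting code feature vector as the label of the neural network model for training, and obtain the trained web page generation model when a training completion condition is reached.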
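  13. A computer device, comprising a memory and one or more processors, the memory storing computer-readable instructions which, when executed by the one or more processors, cause the one or more processors to perform the following steps:
    receiving a web page access request sent by a terminal, and detecting, according to a blacklist database, whether the web page access request is a crawler access request;
    when the web page access request is detected to be a crawler access request, obtaining a web page identifier according to the web page access request;
    obtaining a corresponding simulated web page screenshot according to the web page identifier, and inputting the simulated web page screenshot into a trained web page generation model to obtain an output result; and
    obtaining simulated web page source code according to the output result, and returning the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
  14. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    parsing the web page access request to obtain an access identifier, and looking up the access identifier in the blacklist database; and
    when the access identifier is found, the web page access request is a crawler access request.
  15. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following step:
    when the access identifier is not found, obtaining a historical access log of the access identifier and extracting behavioral features from the historical access log, wherein when the behavioral features match a preset rule, the web page access request is a crawler access request.
  16. The computer device according to claim 13, wherein the processor, when executing the computer-readable instructions, further performs the following steps:
    obtaining a page screenshot and corresponding simulated source code, and obtaining corresponding code feature vectors from the simulated source code; and
    using the page screenshot and a starting code feature vector among the code feature vectors as the input of a neural network model, using the code feature vector immediately following the starting code feature vector as the label of the neural network model for training, and obtaining the trained web page generation model when a training completion condition is reached.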
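  17. One or more non-volatile computer-readable storage media storing computer-readable instructions which, when executed by one or more processors, cause the one or more processors to perform the following steps:
    receiving a web page access request sent by a terminal, and detecting, according to a blacklist database, whether the web page access request is a crawler access request;
    when the web page access request is detected to be a crawler access request, obtaining a web page identifier according to the web page access request;
    obtaining a corresponding simulated web page screenshot according to the web page identifier, and inputting the simulated web page screenshot into a trained web page generation model to obtain an output result; and
    obtaining simulated web page source code according to the output result, and returning the simulated web page source code to the terminal, the terminal being configured to generate a simulated web page according to the simulated web page source code.
  18. The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    parsing the web page access request to obtain an access identifier, and looking up the access identifier in the blacklist database; and
    when the access identifier is found, the web page access request is a crawler access request.
  19. The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following step to be performed:
    when the access identifier is not found, obtaining a historical access log of the access identifier and extracting behavioral features from the historical access log, wherein when the behavioral features match a preset rule, the web page access request is a crawler access request.
  20. The storage medium according to claim 17, wherein the computer-readable instructions, when executed by the processor, further cause the following steps to be performed:
    obtaining a page screenshot and corresponding simulated source code, and obtaining corresponding code feature vectors from the simulated source code; and
    using the page screenshot and a starting code feature vector among the code feature vectors as the input of a neural network model, using the code feature vector immediately following the starting code feature vector as the label of the neural network model for training, and obtaining the trained web page generation model when a training completion condition is reached.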
PCT/CN2019/116545 2019-09-06 2019-11-08 网页生成方法、装置、计算机设备和存储介质 WO2021042508A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910843754.6A CN110750750A (zh) 2019-09-06 2019-09-06 网页生成方法、装置、计算机设备和存储介质
CN201910843754.6 2019-09-06

Publications (1)

Publication Number Publication Date
WO2021042508A1 true WO2021042508A1 (zh) 2021-03-11

Family

ID=69276190

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/116545 WO2021042508A1 (zh) 2019-09-06 2019-11-08 网页生成方法、装置、计算机设备和存储介质

Country Status (2)

Country Link
CN (1) CN110750750A (zh)
WO (1) WO2021042508A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112749364B (zh) * 2020-02-28 2023-09-15 腾讯科技(深圳)有限公司 基于人工智能的网页生成方法、装置、设备及存储介质
CN111488546B (zh) * 2020-04-13 2023-09-26 北京小米移动软件有限公司 一种页面生成方法、装置及存储介质
CN113746790B (zh) * 2020-07-22 2023-09-05 北京沃东天骏信息技术有限公司 一种异常流量管理方法、电子设备及存储介质
CN113504906B (zh) * 2021-05-31 2022-06-24 贝壳找房(北京)科技有限公司 代码生成方法、装置、电子设备及可读存储介质
CN113535175A (zh) * 2021-07-23 2021-10-22 工银科技有限公司 应用程序前端代码的生成方法、装置、电子设备及介质

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8126868B1 (en) * 2008-10-22 2012-02-28 Amazon Technologies, Inc. Search rankings with dynamically customized content
CN106789858A (zh) * 2015-11-25 2017-05-31 广州市动景计算机科技有限公司 一种访问控制方法和装置以及服务器
CN109885749A (zh) * 2019-02-28 2019-06-14 安徽腾策网络科技有限公司 一种网页信息数据防抓取系统
CN109948020A (zh) * 2019-01-14 2019-06-28 北京三快在线科技有限公司 数据获取方法、装置、系统及可读存储介质

Also Published As

Publication number Publication date
CN110750750A (zh) 2020-02-04

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19944375

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19944375

Country of ref document: EP

Kind code of ref document: A1