CN114020987A - Sample data acquisition method, device, equipment and storage medium based on webpage - Google Patents

Info

Publication number
CN114020987A
Authority
CN
China
Prior art keywords
content
webpage
source code
sample data
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210007622.1A
Other languages
Chinese (zh)
Inventor
童兆丰
樊兴华
薛锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing ThreatBook Technology Co Ltd
Original Assignee
Beijing ThreatBook Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing ThreatBook Technology Co Ltd
Priority to CN202210007622.1A
Publication of CN114020987A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application provides a webpage-based sample data acquisition method, apparatus, device, and storage medium. The method comprises the following steps: accessing a target webpage based on the URL of the target webpage, and acquiring the webpage source code of the target webpage when the target webpage is successfully accessed; identifying the content of the webpage source code based on a first decoding format; judging whether the content of the webpage source code is garbled; when the content of the webpage source code is garbled, identifying the content of the webpage source code based on a second decoding format until the content of the webpage source code is correct; and obtaining sample data based on the content of the webpage source code. In the process of generating sample data by acquiring webpage content, the method and apparatus can improve the utilization rate of server resources and the execution speed of generating the sample data.

Description

Sample data acquisition method, device, equipment and storage medium based on webpage
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method, an apparatus, a device, and a storage medium for acquiring sample data based on a web page.
Background
Currently, classifying websites requires sample data for website classification. In the prior art, a website is crawled with the Scrapy framework, which consists of five components, namely a scheduler, a downloader, spiders (crawlers), an item pipeline (entity pipeline) and the Scrapy engine, and structured data is extracted from the pages to finally obtain the sample data. The specific implementation process in the prior art is as follows: a starting address is configured for a site; when Scrapy runs, it starts crawling from the starting address, acquires target URLs in the page according to XPath or regular-expression configuration, and accesses them again, repeating this cycle; the results of the accessed URL addresses are processed through the downloader, the content is extracted, and the processed content data is stored in a persistent database.
However, the Scrapy crawler framework works from a starting address: it crawls within the pages of the site, accesses again the crawled URLs that meet the requirements, processes the extracted content, and finally stores it. It is therefore poorly suited to the scenario in which content needs to be extracted quickly and concurrently from URL addresses that are already available.
Disclosure of Invention
An object of the embodiments of the present application is to provide a method, an apparatus, a device, and a storage medium for acquiring sample data based on a web page, so as to at least improve a resource utilization rate of a server and improve an execution speed of generating the sample data in a process of generating the sample data by acquiring content of the web page.
To this end, a first aspect of the present application discloses a sample data obtaining method based on a web page, the method including:
accessing the target webpage based on the URL of the target webpage, and acquiring a webpage source code of the target webpage when the target webpage is successfully accessed;
identifying content of the web page source code based on a first decoding format;
judging whether the content of the webpage source code is garbled;
when the content of the webpage source code is garbled, identifying the content of the webpage source code based on a second decoding format until the content of the webpage source code is correct;
and obtaining sample data based on the content of the webpage source code.
In the first aspect of the present application, as an optional implementation manner, before the identifying the content of the web page source code based on the first decoding format, the method further includes:
when the target webpage fails to be accessed based on the URL of the target webpage, replacing the IP address for accessing the target webpage;
and accessing the target webpage based on the changed IP address and the URL of the target webpage.
In the first aspect of the present application, as an optional implementation manner, the obtaining sample data based on the content of the web page source code includes:
removing a first HTML element tag in the content of the webpage source code, and obtaining a first page processing result;
extracting the text content of a second HTML element tag based on the first page processing result;
and taking the text content of the second HTML element tag as the sample data.
In the first aspect of the present application, as an optional implementation manner, the first HTML element tag includes at least a JS code segment tag and a CSS style tag.
In the first aspect of the present application, as an optional implementation manner, before the extracting, based on the first page processing result, of the text content of the second HTML element tag, the method further includes:
converting the webpage line breaks in the first page processing result into plain-text line breaks;
and merging consecutive whitespace characters and consecutive line breaks in the first page processing result.
In the first aspect of the present application, as an optional implementation manner, after obtaining sample data based on the content of the web page source code, the method further includes:
and storing the sample data into a preset database based on the URL of the target webpage.
In the first aspect of the present application, as an optional implementation manner, before the storing, based on the URL of the target webpage, the sample data in a preset database, the method further includes:
calculating MD5 Hash of the sample data;
and judging whether historical data identical to the MD5 Hash exists in the preset database or not based on the MD5 Hash of the sample data, and if so, deleting the sample data.
A second aspect of the present application discloses a sample data obtaining apparatus based on a web page, the apparatus comprising:
the webpage source code acquisition module is used for accessing the target webpage based on the URL of the target webpage and acquiring the webpage source code of the target webpage when the target webpage is successfully accessed;
the first identification module is used for identifying the content of the webpage source code based on a first decoding format;
the judging module is used for judging whether the content of the webpage source code is garbled;
the second identification module is used for identifying, when the content of the webpage source code is garbled, the content of the webpage source code based on a second decoding format until the content of the webpage source code is correct;
and the sample generation module is used for obtaining sample data based on the content of the webpage source code.
A third aspect of the present application discloses a computer apparatus, the apparatus comprising:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the sample data acquisition method based on the webpage in the first aspect of the application.
A fourth aspect of the present application discloses a storage medium, where the storage medium stores a computer instruction, and the computer instruction is used to execute the sample data acquisition method based on a web page in the first aspect of the present application when being invoked.
Compared with the prior art, the method has the following beneficial technical effects:
By accessing the target webpage directly through its URL, the present application can access target webpages concurrently while acquiring the content of the webpage source code. In the prior art, by contrast, acquiring webpage content with the Scrapy crawler framework requires a starting address, network crawling within the pages of the website, and a second access to the crawled URLs that meet the requirements. The present application needs no such crawling from a starting address, and can therefore avoid the waste of server hardware resources and the slow execution speed that such crawling causes.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments of the present application will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and that those skilled in the art can also obtain other related drawings based on the drawings without inventive efforts.
Fig. 1 is a schematic flowchart of a sample data acquisition method based on a web page disclosed in an embodiment of the present application;
fig. 2 is a schematic structural diagram of a sample data acquisition device based on a web page disclosed in an embodiment of the present application;
fig. 3 is a schematic structural diagram of a computer device disclosed in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.
Example one
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating a sample data obtaining method based on a web page according to an embodiment of the present application. As shown in fig. 1, the method of the embodiment of the present application includes the following steps:
101. accessing a target webpage based on the URL of the target webpage, and acquiring a webpage source code of the target webpage when the target webpage is successfully accessed;
102. identifying content of a webpage source code based on a first decoding format;
103. judging whether the content of the webpage source code is garbled;
104. when the content of the webpage source code is garbled, identifying the content of the webpage source code based on a second decoding format until the content of the webpage source code is correct;
105. and obtaining sample data based on the content of the webpage source code.
In the process of acquiring the content of the webpage source code, on the one hand, the target webpage is accessed directly through its URL, so target webpages can be accessed concurrently. By contrast, acquiring webpage content with the Scrapy crawler framework in the prior art requires a starting address, network crawling within the pages of the website, and a second access to the crawled URLs that meet the requirements.
On the other hand, by performing garbled-text identification on the webpage content, the present application can verify the correctness of the decoded content in real time and change the decoding format in time to obtain the correct webpage content, so that invalid data in the final sample data is avoided.
In this embodiment of the present application, accessing the target webpage based on its URL in step 101 may optionally be executed by a client, and the client obtains URLs in batches and accesses them concurrently in batches.
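The batch concurrent access described for step 101 could look roughly like the sketch below. It is only an illustration under stated assumptions: the thread pool, the function names `fetch_source` and `fetch_batch`, the timeout and the worker count are hypothetical choices and are not given in the application.

```python
from concurrent.futures import ThreadPoolExecutor
from http.client import HTTPException
from typing import Callable, Dict, Iterable, Optional
from urllib.request import urlopen


def fetch_source(url: str, timeout: float = 10.0) -> Optional[bytes]:
    """Return the raw webpage source code, or None when the access fails."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError, HTTPException):
        # URLError and timeouts are OSError subclasses; ValueError covers malformed URLs.
        return None


def fetch_batch(
    urls: Iterable[str],
    fetch: Callable[[str], Optional[bytes]] = fetch_source,
    max_workers: int = 8,  # hypothetical default
) -> Dict[str, Optional[bytes]]:
    """Access every known URL directly and concurrently, without crawling from a start address."""
    url_list = list(urls)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves input order, so results line up with url_list.
        return dict(zip(url_list, pool.map(fetch, url_list)))
```

The `fetch` parameter is injectable so that the same batching logic can be exercised without network access, or combined with a proxy-aware fetcher.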
In this embodiment of the present application, for step 102, the first decoding format may be GB2312, that is, step 102 specifically includes: identifying the content of the webpage source code using the GB2312 decoding format.
Further, in step 103, whether the content of the webpage source code is garbled can be determined based on the binary content corresponding to the content of the webpage source code.
In the embodiment of the present application, as an example, for step 104, when the content of the webpage source code obtained by decoding with the GB2312 decoding format is garbled, the content of the webpage source code is identified using the ISO-8859-1 decoding format; if the content obtained by this decoding is also garbled, the content of the webpage source code is identified using the UTF-8 decoding format, so as to obtain the correct content of the webpage source code.
In the embodiment of the present application, optionally, if the content of the webpage source code is still garbled after three rounds of decoding, step 105 and the steps after step 105 are not executed, and the URL corresponding to the content of the webpage source code is labeled, so that excessive repeated decoding can be avoided and the labeled URLs can be conveniently screened out later.
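Steps 102 to 104 can be sketched as a decoding cascade. The order GB2312, ISO-8859-1, UTF-8 and the limit of three attempts come from the example above; the garbled-text check itself is not specified in the application, so `looks_garbled` below is a hypothetical heuristic (replacement characters or C1 control characters), and the function names are illustrative only.

```python
from typing import Optional, Tuple

# Order taken from the example in the text: GB2312, then ISO-8859-1, then UTF-8.
DECODING_FORMATS = ("gb2312", "iso-8859-1", "utf-8")


def looks_garbled(text: str) -> bool:
    """Hypothetical heuristic: replacement characters or C1 control
    characters (U+0080..U+009F) suggest that the wrong decoding was used."""
    return any(ch == "\ufffd" or "\x80" <= ch <= "\x9f" for ch in text)


def decode_source(raw: bytes) -> Tuple[Optional[str], Optional[str]]:
    """Try each decoding format in turn; return (content, format).

    Returns (None, None) when all three attempts yield garbled content,
    in which case the caller can label the URL and skip step 105.
    """
    for fmt in DECODING_FORMATS:
        try:
            text = raw.decode(fmt)
        except UnicodeDecodeError:
            continue  # bytes are invalid in this format: treat as garbled
        if not looks_garbled(text):
            return text, fmt
    return None, None
```

A heuristic of this kind can misjudge some inputs, for example UTF-8 bytes that happen to form valid GB2312 pairs, so a production check would likely need to be stricter.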
In this application embodiment, as an optional implementation manner, the method in this application embodiment further includes:
when the target webpage is successfully accessed, taking a screenshot of the target webpage to generate a first screenshot of the target webpage;
and adding the first screenshot to the sample data.
In some scenarios, the category of a webpage cannot be identified from the text content of the webpage alone, so the type of the webpage needs to be judged comprehensively by combining the pictures and the text in the webpage.
In the embodiment of the present application, as an optional implementation manner, in step 102: before identifying the content of the webpage source code based on the first decoding format, the embodiment of the application further comprises the following steps:
judging whether the webpage source code of the target webpage can be acquired or not;
when the webpage source code of the target webpage cannot be acquired, taking a screenshot of the target webpage to generate a second screenshot of the target webpage;
and identifying the text content of the second screenshot and taking the text content of the second screenshot as the content of the webpage source code.
With this optional implementation, when the text content cannot be extracted based on the webpage source code of the target webpage, the text content can be extracted from the screenshot of the target webpage.
In the embodiment of the present application, as an optional implementation manner, in step 102: before identifying the content of the webpage source code based on the first decoding format, the method of the embodiment of the application further comprises the following steps:
when the target webpage fails to be accessed based on the URL of the target webpage, replacing the IP address for accessing the target webpage;
and accessing the target webpage based on the changed IP address and the URL of the target webpage.
This optional embodiment can improve the access success rate of the target webpage by changing the IP address. For example, in some scenarios the client accesses the server through a proxy mechanism, and when the target webpage cannot be accessed through one proxy IP address, the client can switch to another IP address to access the target webpage.
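A minimal sketch of the IP replacement, assuming the proxy scenario from the example: a list of proxy addresses is tried in order until one access succeeds. The function names and the proxy list are hypothetical; `ProxyHandler` and `build_opener` are from the Python standard library.

```python
from typing import Callable, Iterable, Optional, Tuple
from urllib.request import ProxyHandler, build_opener


def fetch_via_proxy(url: str, proxy: str, timeout: float = 10.0) -> Optional[bytes]:
    """Fetch url through one proxy address; return None when the access fails."""
    opener = build_opener(ProxyHandler({"http": proxy, "https": proxy}))
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.read()
    except (OSError, ValueError):
        return None


def fetch_with_ip_replacement(
    url: str,
    proxies: Iterable[str],
    fetch_via: Callable[[str, str], Optional[bytes]] = fetch_via_proxy,
) -> Tuple[Optional[bytes], Optional[str]]:
    """Replace the access IP (proxy) after each failure until one succeeds.

    Returns (source, proxy_used), or (None, None) if every proxy fails.
    """
    for proxy in proxies:
        source = fetch_via(url, proxy)
        if source is not None:
            return source, proxy
    return None, None
```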
In the embodiment of the present application, as an optional implementation manner, step 105: obtaining sample data based on the content of the webpage source code, comprising the following substeps:
removing a first HTML element tag in the content of the webpage source code, and obtaining a first page processing result;
extracting the text content of a second HTML element tag based on the first page processing result;
and taking the text content of the second HTML element tag as sample data.
In this optional embodiment, removing the first HTML element tag ensures that the text content of the second HTML element tag can be extracted, wherein the first HTML element tag is a tag whose text content does not need to be extracted.
Further, in the first aspect of the present application, the first HTML element tag includes a JS code snippet tag and a CSS style tag, and may further include other tags whose text content does not need to be extracted.
In this optional embodiment, removing the first HTML element tag from the content of the webpage source code reduces the amount of text from the first HTML element tag that is picked up while extracting the text content of the second HTML element tag.
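As an illustration of these substeps, the sketch below skips the content of `script` and `style` elements (standing in for the first HTML element tags) and collects the text of all remaining elements (standing in for the second HTML element tags). It uses the standard-library `html.parser`; the class and function names are hypothetical, and the application does not say which parser is used.

```python
from html.parser import HTMLParser
from typing import List

# Tags whose text content does not need to be extracted (JS code and CSS styles).
SKIPPED_TAGS = {"script", "style"}


class TextExtractor(HTMLParser):
    """Collect text outside <script> and <style> elements."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html: str) -> List[str]:
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.parts
```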
In the embodiment of the present application, as an optional implementation manner, before the step of extracting the text content of the second HTML element tag based on the first page processing result, the method of the embodiment of the present application further includes the following steps:
converting the webpage line breaks in the first page processing result into plain-text line breaks;
and merging consecutive whitespace characters and consecutive line breaks in the first page processing result.
In this optional embodiment, a webpage line break is a br tag or a p tag.
In this optional embodiment, converting the webpage line breaks in the first page processing result into plain-text line breaks and merging the consecutive whitespace characters and consecutive line breaks makes it convenient to extract the text content of the second HTML element tag.
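The two normalization steps might be sketched with regular expressions as follows. The patterns and the function name `normalize_line_breaks` are assumptions made for illustration; the application only states that br and p tags are converted and that consecutive whitespace and line breaks are merged.

```python
import re

# <br>, <br/>, <br />, and opening or closing <p> tags (but not <pre>, <param>, ...).
_LINE_BREAK_TAGS = re.compile(r"<br\s*/?>|</?p(\s[^>]*)?>", re.IGNORECASE)
_BLANKS = re.compile(r"[ \t\r\f\v]+")
_LINE_BREAK_RUNS = re.compile(r" ?\n[ \n]*")


def normalize_line_breaks(page: str) -> str:
    """Turn webpage line breaks into "\n" and merge repeated blanks and line breaks."""
    text = _LINE_BREAK_TAGS.sub("\n", page)   # webpage line break -> plain-text line break
    text = _BLANKS.sub(" ", text)             # merge consecutive whitespace characters
    text = _LINE_BREAK_RUNS.sub("\n", text)   # merge consecutive line breaks
    return text.strip()
```

For example, `normalize_line_breaks("<p>first   line</p><p>second<br><br/>third</p>")` yields three lines of text separated by single line breaks.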
In this embodiment, as an optional implementation manner, after obtaining sample data based on the content of the web page source code in step 105, the method in this embodiment further includes the following steps:
and storing the sample data into a preset database based on the URL of the target webpage.
In this optional embodiment, storing the sample data into the preset database based on the URL of the target webpage specifically means storing the sample data based on the correspondence between the URL of the target webpage and the sample data, so that in a subsequent query the corresponding sample data can be retrieved based on the URL of the target webpage.
In the embodiment of the present application, as an optional implementation manner, in step 105: before storing sample data into a preset database based on the URL of the target webpage, the method of the embodiment of the application further comprises the following steps:
calculating MD5 Hash of sample data;
and judging whether historical data identical to the MD5 Hash exists in the preset database or not based on the MD5 Hash of the sample data, and if so, deleting the sample data.
In this optional embodiment, since the MD5 hash identifies the sample data (identical sample data always yields the same MD5 hash), comparing the MD5 hash of the sample data with the MD5 hash of each piece of historical data in the preset database determines whether the same data already exists in the preset database; if it does, the sample data is deleted.
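The storage and de-duplication steps can be sketched together. The application does not name the preset database, so SQLite is used here purely as an assumption, and the table name `samples`, its columns, and the function names are hypothetical.

```python
import hashlib
import sqlite3
from typing import Optional


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the (hypothetical) preset database, keyed by URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "url TEXT PRIMARY KEY, md5 TEXT NOT NULL, content TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_samples_md5 ON samples (md5)")
    return conn


def store_sample(conn: sqlite3.Connection, url: str, sample: str) -> bool:
    """Store sample under url unless identical data already exists.

    Returns True when the sample was stored and False when it was dropped
    because historical data with the same MD5 hash is already present.
    """
    digest = hashlib.md5(sample.encode("utf-8")).hexdigest()
    duplicate = conn.execute(
        "SELECT 1 FROM samples WHERE md5 = ? LIMIT 1", (digest,)
    ).fetchone()
    if duplicate:
        return False
    conn.execute(
        "INSERT OR REPLACE INTO samples (url, md5, content) VALUES (?, ?, ?)",
        (url, digest, sample),
    )
    conn.commit()
    return True


def load_sample(conn: sqlite3.Connection, url: str) -> Optional[str]:
    """Query the sample data that corresponds to a URL."""
    row = conn.execute("SELECT content FROM samples WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None
```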
Example two
Referring to fig. 2, fig. 2 is a schematic structural diagram of a sample data acquisition device based on a web page according to an embodiment of the present application. As shown in fig. 2, the apparatus of the embodiment of the present application includes the following functional modules:
a web page source code obtaining module 201, configured to access a target web page based on a URL of the target web page, and obtain a web page source code of the target web page when the target web page is successfully accessed;
a first identification module 202, configured to identify content of a web page source code based on a first decoding format;
a judging module, configured to judge whether the content of the webpage source code is garbled;
a second identification module 203, configured to identify, when the content of the webpage source code is garbled, the content of the webpage source code based on a second decoding format until the content of the webpage source code is correct;
and the sample generation module 204 is configured to obtain sample data based on the content of the web page source code.
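One way to picture how the functional modules listed above fit together is the sketch below, in which each module is a pluggable callable. This is an illustrative arrangement only: the class name, field names and the `run` method are hypothetical and do not come from the application.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SampleAcquisitionDevice:
    acquire_source: Callable[[str], Optional[bytes]]   # webpage source code obtaining module 201
    identify_first: Callable[[bytes], str]             # first identification module 202
    is_garbled: Callable[[str], bool]                  # judging module
    identify_second: Callable[[bytes], Optional[str]]  # second identification module 203
    generate_sample: Callable[[str], str]              # sample generation module 204

    def run(self, url: str) -> Optional[str]:
        """Return sample data for url, or None when access or decoding fails."""
        raw = self.acquire_source(url)
        if raw is None:
            return None
        content: Optional[str] = self.identify_first(raw)
        if content is None or self.is_garbled(content):
            content = self.identify_second(raw)
            if content is None:
                return None
        return self.generate_sample(content)
```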
In the prior art, acquiring webpage content with the Scrapy crawler framework requires network crawling within the pages of the website from a starting address and a second access to the crawled URLs that meet the requirements. The apparatus of the embodiment of the present application does not need to perform network crawling within the pages of the website from a starting address, and can therefore avoid the waste of server hardware resources and the slow execution speed that such crawling causes, so that the apparatus has the advantages of improving the utilization rate of server resources and improving the execution speed of generating sample data.
Example three
Referring to fig. 3, fig. 3 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure. As shown in fig. 3, the apparatus of the embodiment of the present application includes:
a memory 301 storing executable program code;
a processor 302 coupled to the memory 301;
the processor 302 calls the executable program code stored in the memory 301 to execute the sample data obtaining method based on the web page according to the first embodiment of the present application.
In the prior art, acquiring webpage content with the Scrapy crawler framework requires network crawling within the pages of the website from a starting address and a second access to the crawled URLs that meet the requirements. The device of the embodiment of the present application does not need to perform network crawling within the pages of the website from a starting address, and can therefore avoid the waste of server hardware resources and the slow execution speed that such crawling causes, so that the device has the advantages of improving the utilization rate of server resources and improving the execution speed of generating sample data.
Example four
The embodiment of the present application discloses a storage medium storing a computer instruction which, when invoked, is used to execute the sample data acquisition method based on a webpage.
In the prior art, acquiring webpage content with the Scrapy crawler framework requires network crawling within the pages of the website from a starting address and a second access to the crawled URLs that meet the requirements. The storage medium of the embodiment of the present application does not need to perform network crawling within the pages of the website from a starting address, and can therefore avoid the waste of server hardware resources and the slow execution speed that such crawling causes, so that the storage medium has the advantages of improving the utilization rate of server resources and improving the execution speed of generating sample data.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one logical division, and there may be other divisions when actually implemented, and for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection of devices or units through some communication interfaces, and may be in an electrical, mechanical or other form.
In addition, units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
Furthermore, the functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
It should be noted that the functions, if implemented in the form of software functional modules and sold or used as independent products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: various media capable of storing program codes, such as a usb disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk.
In this document, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions.
The above description is only an example of the present application and is not intended to limit the scope of the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A sample data acquisition method based on a webpage is characterized by comprising the following steps:
accessing the target webpage based on the URL of the target webpage, and acquiring a webpage source code of the target webpage when the target webpage is successfully accessed;
identifying content of the web page source code based on a first decoding format;
judging whether the content of the webpage source code is garbled;
when the content of the webpage source code is garbled, identifying the content of the webpage source code based on a second decoding format until the content of the webpage source code is correct;
and obtaining sample data based on the content of the webpage source code.
2. The method of claim 1, wherein prior to said identifying the content of the web page source code based on the first decoding format, the method further comprises:
when the target webpage fails to be accessed based on the URL of the target webpage, replacing the IP address for accessing the target webpage;
and accessing the target webpage based on the changed IP address and the URL of the target webpage.
3. The method of claim 1, wherein obtaining sample data based on the content of the web page source code comprises:
removing a first HTML element tag in the content of the webpage source code, and obtaining a first page processing result;
extracting the text content of a second HTML element tag based on the first page processing result;
and taking the text content of the second HTML element tag as the sample data.
4. The method of claim 3, wherein the first HTML element tag comprises at least a JS code fragment tag and a CSS style tag.
5. The method of claim 3, wherein prior to the extracting of the text content of the second HTML element tag based on the first page processing result, the method further comprises:
converting the webpage line breaks in the first page processing result into plain-text line breaks;
and merging consecutive whitespace characters and consecutive line breaks in the first page processing result.
6. The method of claim 1, wherein after obtaining sample data based on the content of the web page source code, the method further comprises:
and storing the sample data into a preset database based on the URL of the target webpage.
7. The method of claim 6, wherein before storing the sample data in the preset database based on the URL of the target web page, the method further comprises:
calculating an MD5 hash of the sample data;
and judging, based on the MD5 hash of the sample data, whether historical data with the same MD5 hash exists in the preset database, and if so, deleting the sample data.
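Claims 6 and 7 together describe hash-based deduplication before storage. The following sketch uses an SQLite database as a stand-in for the preset database. The table layout, the function names, and the choice of UTF-8 when hashing the text are assumptions, and "deleting the sample data" is interpreted here as discarding the new sample without storing it.

```python
import hashlib
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) a hypothetical sample store keyed by URL."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        "url TEXT PRIMARY KEY, md5 TEXT NOT NULL, content TEXT NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_samples_md5 ON samples(md5)")
    return conn


def store_if_new(conn: sqlite3.Connection, url: str, sample: str) -> bool:
    """Store the sample under its URL unless identical content already exists.

    Returns True if the sample was stored, False if it was discarded as a duplicate.
    """
    digest = hashlib.md5(sample.encode("utf-8")).hexdigest()
    duplicate = conn.execute(
        "SELECT 1 FROM samples WHERE md5 = ?", (digest,)
    ).fetchone()
    if duplicate is not None:
        return False  # same content seen before: drop the new sample
    conn.execute(
        "INSERT OR REPLACE INTO samples (url, md5, content) VALUES (?, ?, ?)",
        (url, digest, sample),
    )
    conn.commit()
    return True
```

MD5 is used here only as a content fingerprint, as in the claim, not as a security measure.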
8. A sample data acquisition device based on a web page, the device comprising:
the webpage source code acquisition module is used for accessing the target webpage based on the URL of the target webpage and acquiring the webpage source code of the target webpage when the target webpage is successfully accessed;
the first identification module is used for identifying the content of the webpage source code based on a first decoding format;
the judging module is used for judging whether the content of the webpage source code is garbled content;
the second identification module is used for identifying, when the content of the webpage source code is garbled content, the content of the webpage source code based on a second decoding format until the content of the webpage source code is correct;
and the sample generation module is used for obtaining sample data based on the content of the webpage source code.
9. A computer device, the device comprising:
a memory storing executable program code;
a processor coupled with the memory;
the processor calls the executable program code stored in the memory to execute the webpage-based sample data acquisition method according to any one of claims 1 to 7.
10. A storage medium storing computer instructions which, when called, execute the webpage-based sample data acquisition method according to any one of claims 1 to 7.
CN202210007622.1A 2022-01-06 2022-01-06 Sample data acquisition method, device, equipment and storage medium based on webpage Pending CN114020987A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210007622.1A CN114020987A (en) 2022-01-06 2022-01-06 Sample data acquisition method, device, equipment and storage medium based on webpage

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210007622.1A CN114020987A (en) 2022-01-06 2022-01-06 Sample data acquisition method, device, equipment and storage medium based on webpage

Publications (1)

Publication Number Publication Date
CN114020987A true CN114020987A (en) 2022-02-08

Family

ID=80069818

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210007622.1A Pending CN114020987A (en) 2022-01-06 2022-01-06 Sample data acquisition method, device, equipment and storage medium based on webpage

Country Status (1)

Country Link
CN (1) CN114020987A (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140281535A1 (en) * 2013-03-15 2014-09-18 Munibonsoftware.com, LLC Apparatus and Method for Preventing Information from Being Extracted from a Webpage
CN108664559A (en) * 2018-03-30 2018-10-16 中山大学 A kind of automatic crawling method of website and webpage source code
CN109948095A (en) * 2017-11-27 2019-06-28 腾讯科技(深圳)有限公司 Show method, apparatus, terminal and the storage medium of web page contents
CN110620657A (en) * 2019-08-23 2019-12-27 上海科技发展有限公司 Webpage word processing method, system and device
CN111352587A (en) * 2020-02-24 2020-06-30 苏州浪潮智能科技有限公司 Data packing method and device

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140281535A1 (en) * 2013-03-15 2014-09-18 Munibonsoftware.com, LLC Apparatus and Method for Preventing Information from Being Extracted from a Webpage
CN109948095A (en) * 2017-11-27 2019-06-28 腾讯科技(深圳)有限公司 Show method, apparatus, terminal and the storage medium of web page contents
CN108664559A (en) * 2018-03-30 2018-10-16 中山大学 A kind of automatic crawling method of website and webpage source code
CN110620657A (en) * 2019-08-23 2019-12-27 上海科技发展有限公司 Webpage word processing method, system and device
CN111352587A (en) * 2020-02-24 2020-06-30 苏州浪潮智能科技有限公司 Data packing method and device

Similar Documents

Publication Publication Date Title
CN110750741B (en) Webpage link skipping processing method, computer device and storage medium
CN108366058B (en) Method, device, equipment and storage medium for preventing traffic hijacking of advertisement operator
US20150033331A1 (en) System and method for webpage analysis
US20170126723A1 (en) Method and device for identifying url legitimacy
US9563611B2 (en) Merging web page style addresses
CN107153716B (en) Webpage content extraction method and device
CN109862021B (en) Method and device for acquiring threat information
CN109871251B (en) Response data processing method and device, storage medium and terminal equipment
CN103383687A (en) Page processing method and device
CN108494728B (en) Method, device, equipment and medium for creating blacklist library for preventing traffic hijacking
CN103593406A (en) Static resource identifier processing method and device
EP3896940A1 (en) Resource description file processing, and page resource obtaining method and device
CN112637361A (en) Page proxy method, device, electronic equipment and storage medium
CN115437877A (en) Online analysis method and system for multi-source log, electronic equipment and storage medium
CN112818200A (en) Data crawling and event analyzing method and system based on static website
CN113656737A (en) Webpage content display method and device, electronic equipment and storage medium
CN111240790B (en) Multi-language adaptation method, device, client and storage medium for application
CN109145220B (en) Data processing method and device and electronic equipment
CN114020987A (en) Sample data acquisition method, device, equipment and storage medium based on webpage
CN114048400A (en) Method, device, system and medium for acquiring abnormal application program
CN111585897B (en) Request route management method, system, computer system and readable storage medium
CN109657178B (en) Page form processing method and device, computer equipment and storage medium
CN111783006A (en) Page generation method and device, electronic equipment and computer readable medium
CN106570044B (en) Method and device for analyzing webpage codes
CN114548079B (en) Text display method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20220208