CN115766184A

CN115766184A - Webpage data processing method and device, electronic equipment and storage medium

Info

Publication number: CN115766184A
Application number: CN202211405960.7A
Authority: CN
Inventors: 杨柳; 任洪伟
Original assignee: Antiy Technology Group Co Ltd
Current assignee: Antiy Technology Group Co Ltd
Priority date: 2022-11-10
Filing date: 2022-11-10
Publication date: 2023-03-07

Abstract

The application provides a webpage data processing method and apparatus, an electronic device, and a storage medium. The method comprises: acquiring hyperlink information and text information in a current webpage, wherein the website of the current page is stored in a target website list, and the page corresponding to each website in the target website list is used for obtaining malicious indicators; extracting malicious indicators from the text information; and, if a malicious indicator is extracted from the text information, adding the website corresponding to the hyperlink information to the target website list. The target website list is thus expanded automatically according to the content of the current webpage, which addresses the problem that obtaining malicious indicators only from a limited range of webpages yields few and incomplete indicators.

Description

Webpage data processing method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of secure data processing, and in particular, to a method and an apparatus for processing web page data, an electronic device, and a storage medium.

Background

Threat intelligence is mainly used to identify and detect the indicators of compromise (IOCs, i.e., malicious indicators) of attack activity. To acquire target IOCs, many users rely on data sources or webpages that they have collected themselves. However, because intelligence data sources are broad, relying only on user-collected data sources or webpages means that intelligence data is obtained from a limited range, so the malicious IOCs obtained tend to be few and incomplete.

Disclosure of Invention

In view of the above, the present application provides a method, an apparatus, an electronic device, and a storage medium for processing web page data, which at least partially solve the problems in the prior art.

In one aspect of the present application, a method for processing web page data is provided, including:

acquiring hyperlink information and text information in a current webpage, wherein the website of the current page is stored in a target website list, and the page corresponding to each website in the target website list is used for obtaining malicious indicators;

extracting malicious indicators from the text information;

and if a malicious indicator is extracted from the text information, adding the website corresponding to the hyperlink information to the target website list.

In an exemplary embodiment of the present application, the target website list includes an accessed list and an unaccessed list;

the adding the website corresponding to the hyperlink information to the target website list comprises:

determining whether the website corresponding to the hyperlink information exists in the accessed list;

and if it does not exist, adding the website corresponding to the hyperlink information to the unaccessed list.

In an exemplary embodiment of the present application, the method further comprises:

if no malicious indicator is obtained from the webpage of a stored website within a set time period, deleting the stored website from the target website list;

and the stored website is any website in the target website list.

In an exemplary embodiment of the present application, the extracting malicious indicators from the text information includes:

acquiring candidate indicators from the text information;

inputting each candidate indicator into a first judgment model and a second judgment model respectively to obtain a first judgment result and a second judgment result; the first judgment model and the second judgment model are both used for judging whether the candidate indicator is a malicious indicator, and the model types of the first judgment model and the second judgment model are different;

and determining whether to mark the candidate indicator as a malicious indicator according to the first judgment result and/or the second judgment result.

In an exemplary embodiment of the present application, the first judgment model is a heuristic classifier, and the second judgment model is a machine learning classifier;

the determining whether to mark the candidate indicator as a malicious indicator according to the first judgment result and/or the second judgment result includes:

determining whether the first judgment result is the same as the second judgment result;

if they are the same, determining whether to mark the candidate indicator as a malicious indicator according to the first judgment result and/or the second judgment result;

if not, acquiring a first probability and a second probability; the first probability is the accuracy of the first judgment result output by the first judgment model, and the second probability is the accuracy of the second judgment result output by the second judgment model;

when the difference between the first probability and the second probability is greater than a preset threshold, if the first probability is greater than the second probability, determining whether to mark the candidate indicator as a malicious indicator according to the first judgment result; otherwise, determining whether to mark the candidate indicator as a malicious indicator according to the second judgment result;

and when the difference between the first probability and the second probability is less than or equal to the preset threshold, determining whether to mark the candidate indicator as a malicious indicator according to the second judgment result.

In an exemplary embodiment of the present application, the obtaining candidate indicators from the text information includes:

performing sentence division processing on the text information to obtain a plurality of character segments;

extracting features of each character segment to obtain feature information corresponding to each character segment; the feature information includes at least one of: text features, data source features, source distribution features, sentence features, content features, and external features;

and determining whether each character segment contains a candidate indicator according to the feature information corresponding to that character segment.

In an exemplary embodiment of the present application, after the sentence division processing is performed on the text information and before the feature extraction is performed on each of the character segments, the method further includes:

determining whether a set character string exists in each character segment, and if so, performing character conversion processing on the set character string in that character segment.

In another aspect of the present application, there is provided a web page data processing apparatus including:

the acquisition module is used for acquiring hyperlink information and text information in a current webpage, wherein the website of the current page is stored in a target website list, and the page corresponding to each website in the target website list is used for obtaining malicious indicators;

the extraction module is used for extracting malicious indicators from the text information;

and the adding module is used for adding the website corresponding to the hyperlink information to the target website list if a malicious indicator is extracted from the text information.

In another aspect of the present application, there is provided an electronic device comprising a processor and a memory;

the processor is configured to perform the steps of any of the above methods by calling a program or instructions stored in the memory.

In another aspect of the application, a non-transitory computer readable storage medium is provided, storing a program or instructions that cause a computer to perform the steps of any of the methods described above.

According to the webpage data processing method provided by the application, when any webpage in the target website list (namely the current webpage) is accessed, the hyperlink information and text information in the current webpage are acquired. If a malicious indicator is extracted from the text information, it can be determined that the text in the current webpage describes or otherwise discusses malicious indicators, in which case the websites corresponding to the hyperlink information in the current webpage are likely pages cited in that discussion. The websites corresponding to the hyperlink information are therefore added to the target website list, so that when malicious indicators are subsequently acquired through the pages corresponding to the websites in the target website list, these websites are also accessed and subjected to malicious indicator extraction. The target website list is thus expanded automatically according to the content of the current webpage, which addresses the problem that acquiring malicious indicators only from a limited range of webpages yields few and incomplete indicators.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments are briefly described below. The drawings in the following description show only some embodiments of the present application, and those skilled in the art can obtain other drawings from them without creative effort.

Fig. 1 is a flowchart of a method for processing web page data according to an embodiment of the present disclosure;

Fig. 2 is a block diagram of a web page data processing apparatus according to an embodiment of the present disclosure.

Detailed Description

The embodiments of the present application will be described in detail below with reference to the accompanying drawings.

It should be noted that, in the case of no conflict, the features in the following embodiments and examples may be combined with each other; moreover, based on the embodiments in the present disclosure, all other embodiments obtained by a person of ordinary skill in the art without making creative efforts shall fall within the protection scope of the present disclosure.

It is noted that various aspects of the embodiments are described below within the scope of the appended claims. It should be apparent that the aspects described herein may be embodied in a wide variety of forms and that any specific structure and/or function described herein is merely illustrative. Based on the disclosure, one skilled in the art should appreciate that one aspect described herein may be implemented independently of any other aspects and that two or more of these aspects may be combined in various ways. For example, an apparatus may be implemented and/or a method practiced using any number of the aspects set forth herein. Additionally, such an apparatus may be implemented and/or such a method may be practiced using other structure and/or functionality in addition to one or more of the aspects set forth herein.

Explanation of terms:

Malicious indicators, i.e., IOCs (the acronym of Indicators of Compromise), are the indicators associated with an attacker infecting or breaching a target device. They generally include artifacts used by the attacker that appear in intelligence data, such as hash values, domain names, URLs, IP addresses, and email addresses.

Referring to fig. 1, in an aspect of the present application, a method for processing web page data is provided, which includes the following steps:

S100, acquiring hyperlink information and text information in the current webpage. The website (that is, the web address) of the current page is stored in a target website list, and the page corresponding to each website in the target website list is used for acquiring malicious indicators. Specifically, when the web crawler accesses the current webpage, it stores the original HTML file, including its HTML, JavaScript and CSS code, processes the HTML file to retrieve the links it contains, and takes the retrieved links as the hyperlink information. The text information is all of the text displayed in the current page.
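As an illustration of step S100, the following is a minimal sketch of retrieving links and displayed text from a stored HTML file. It uses Python's standard `html.parser` module; the class name `LinkAndTextParser` and the helper `parse_page` are hypothetical names chosen for this example, and skipping the contents of `<script>` and `<style>` is an assumption about what counts as displayed text, not something the application specifies.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkAndTextParser(HTMLParser):
    """Collects href targets of <a> tags and the displayed text of a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self.text_parts = []
        self._skip_depth = 0  # greater than 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the address of the current page.
                self.links.append(urljoin(self.base_url, href))

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.text_parts.append(data.strip())


def parse_page(html, base_url):
    """Return (hyperlink information, text information) for one page."""
    parser = LinkAndTextParser(base_url)
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)
```

A real crawler would also need to fetch the page, handle encodings, and filter non-HTTP links; those parts are omitted here.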

S200, extracting malicious indicators from the text information.

S300, if a malicious indicator is extracted from the text information, adding the website corresponding to the hyperlink information to the target website list.
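Steps S200 and S300 can be sketched as follows. This is only an illustration under stated assumptions: `process_page` and `toy_extractor` are hypothetical names, and the toy extractor simply matches IPv4-looking strings with a regular expression, standing in for the feature-based extraction that the application describes later.

```python
import re

# Assumption: a toy stand-in for the real extractor, matching IPv4-like strings.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def toy_extractor(text):
    return IPV4.findall(text)


def process_page(links, text, target_list, extract_indicators):
    """S200: extract indicators from the text; S300: expand the list on success."""
    indicators = extract_indicators(text)
    if indicators:
        # The page discusses malicious indicators, so its links are worth visiting.
        for link in links:
            if link not in target_list:
                target_list.append(link)
    return indicators
```

When the extractor returns nothing, the links of the page are not added, which matches the condition in S300.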

The method provided by the application can be applied to any electronic device, such as a desktop computer, a notebook computer or a server. The electronic device executing the method is configured to access each website in the target website list once every set period and attempt to acquire malicious indicators from the corresponding pages. Malicious indicator extraction may be implemented by performing feature analysis on the text information.

In the method for processing web page data provided by this embodiment, when any webpage in the target website list (i.e., the current webpage) is accessed, the hyperlink information and text information in the current webpage are acquired. If a malicious indicator is extracted from the text information, it can be determined that the text in the current webpage describes or otherwise discusses malicious indicators, in which case the websites corresponding to the hyperlink information in the current webpage are likely pages cited in that discussion. The websites corresponding to the hyperlink information are therefore added to the target website list, so that when malicious indicators are subsequently acquired through the pages corresponding to the websites in the target website list, these websites are also accessed and subjected to malicious indicator extraction. The target website list is thus expanded automatically according to the content of the current webpage, which addresses the problem that acquiring malicious indicators only from a limited range of webpages yields few and incomplete indicators.

In an exemplary embodiment of the present application, the target website list includes an accessed list and an unaccessed list. Specifically, once every set period, the electronic device implementing the method transfers all websites in the accessed list to the unaccessed list, then begins accessing each website in the unaccessed list and extracting malicious indicators. The specific extraction method for malicious indicators is described in detail later.

Adding the website corresponding to the hyperlink information to the target website list includes:

Determining whether the website corresponding to the hyperlink information exists in the accessed list.

If it exists, no subsequent operation is performed on the hyperlink information.

If it does not exist, the website corresponding to the hyperlink information is added to the unaccessed list.

In this embodiment, after a malicious indicator is obtained from the text information of the current webpage, it is determined whether the website corresponding to the hyperlink information exists in the accessed list. If it exists, the website has already been included and does not need to be added. If it does not exist, the website has not been accessed before and has not been recorded, so it is added directly to the unaccessed list. This completes the recording and also allows the website to be accessed and subjected to malicious indicator extraction within the current set period.
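The bookkeeping of the two lists can be sketched as below. The class `TargetWebsiteList` and its method names are hypothetical. One detail goes beyond the text and is an assumption of this sketch: `add_discovered` also skips websites that are already waiting in the unaccessed list, so that the same website is not queued twice.

```python
class TargetWebsiteList:
    """Target website list split into an accessed list and an unaccessed list."""

    def __init__(self, seeds):
        self.accessed = set()
        self.unaccessed = list(seeds)  # kept in access order

    def start_cycle(self):
        """At each set period, move every accessed website back to unaccessed."""
        self.unaccessed.extend(sorted(self.accessed))
        self.accessed.clear()

    def next_website(self):
        """Take the next website to access, or None when the cycle is done."""
        if not self.unaccessed:
            return None
        website = self.unaccessed.pop(0)
        self.accessed.add(website)
        return website

    def add_discovered(self, website):
        """Add a hyperlinked website unless it is already in the accessed list."""
        if website in self.accessed:
            return False  # already included, nothing to do
        if website in self.unaccessed:
            return False  # assumption: avoid queueing the same website twice
        self.unaccessed.append(website)
        return True
```

Because newly discovered websites go into the unaccessed list, they are picked up by `next_website` in the same cycle, which mirrors the behaviour described above.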

In an exemplary embodiment of the present application, the method further includes: if no malicious indicator is obtained from the webpage of a stored website within a set time period, deleting the stored website from the target website list. The set time period refers to the time over which malicious indicators are extracted from the corresponding stored website.

The stored website is any website in the target website list. The length of the set time period may be 5 to 60 days; in this embodiment, it is 30 days.

With the above method, new websites are continuously recorded in the target website list. In this embodiment, to avoid missing websites that may provide malicious indicators, the websites corresponding to the hyperlink information of the current webpage are included whenever a malicious indicator is extracted from the current page. In some cases, however, the webpages corresponding to these websites publish information about malicious indicators only sporadically and publish nothing further afterwards. If such websites are not cleaned up, the electronic device accesses and processes these webpages every time malicious indicators are collected, which adds considerable processing. Therefore, in this embodiment, if no malicious indicator is obtained from the webpage of a stored website within the set time period, the stored website has not provided malicious indicator information for a long time, and it is automatically deleted from the target website list to reduce the processing load.
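A minimal sketch of this cleanup, assuming the device keeps, for every stored website, the time at which a malicious indicator was last obtained from it (or the time it was added, if none has been obtained yet). The function name `prune_stale` and the `last_hit` mapping are hypothetical; the 30-day default follows the embodiment.

```python
from datetime import datetime, timedelta


def prune_stale(last_hit, now, max_idle=timedelta(days=30)):
    """Keep only websites that yielded an indicator within the set time period.

    last_hit maps each stored website to the time an indicator was last
    obtained from it (an assumption of this sketch).
    """
    return {site: ts for site, ts in last_hit.items() if now - ts <= max_idle}
```

For example, with `now` set to 2023-03-01, a website last productive on 2023-02-20 is kept, while one last productive on 2023-01-01 is removed.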

In an exemplary embodiment of the present application, the extracting malicious indicators from the text information includes:

Acquiring candidate indicators from the text information.

Inputting each candidate indicator into a first judgment model and a second judgment model respectively to obtain a first judgment result and a second judgment result. The first judgment model and the second judgment model are both used for judging whether the candidate indicator is a malicious indicator, and the model types of the first judgment model and the second judgment model are different.

A candidate indicator can be understood as a suspected malicious indicator that has not yet been finally determined to be malicious. The specific method for obtaining candidate indicators from the text information is described later and is not repeated here.

In this embodiment, after the candidate indicators are obtained, each candidate indicator (optionally together with its associated information, such as context information) is input in turn into the two pre-trained judgment models, and whether to mark the candidate indicator as a malicious indicator is determined according to the judgment results of the two models. Using two judgment models allows their outputs to be cross-checked, which improves the accuracy of malicious indicator determination.

Specifically, the first judgment model is a heuristic classifier, and the second judgment model is a machine learning classifier.

Determining whether to mark the candidate indicator as a malicious indicator according to the first judgment result and/or the second judgment result includes:

Determining whether the first judgment result is the same as the second judgment result. Each of the two judgment results is either "malicious indicator" or "non-malicious indicator".

If they are the same, determining whether to mark the candidate indicator as a malicious indicator according to the first judgment result and/or the second judgment result.

If not, acquiring the first probability and the second probability. The first probability is the accuracy of the first judgment result output by the first judgment model, and the second probability is the accuracy of the second judgment result output by the second judgment model.

When the difference between the first probability and the second probability is greater than a preset threshold, if the first probability is greater than the second probability, whether to mark the candidate indicator as a malicious indicator is determined according to the first judgment result; otherwise, it is determined according to the second judgment result. The preset threshold may be 10% to 50%; in this embodiment, the preset threshold is 20%.

In this embodiment, if the first judgment result is the same as the second judgment result, either result can be used directly as the basis for determining whether to mark the candidate indicator as a malicious indicator. When the two judgment results differ, the first probability and the second probability corresponding to the two results are obtained. If the difference between the first probability and the second probability is large (namely greater than the preset threshold), the judgment result with the lower probability is considered wrong, and the judgment result with the higher probability is used directly as the basis for determining whether the candidate indicator is a malicious indicator.

A large number of tests show that when the difference between the first probability and the second probability is small (i.e., less than or equal to the preset threshold), the second judgment model is the more accurate of the two. Therefore, in this embodiment, if the difference between the first probability and the second probability is small, the second judgment result is used directly as the basis for determining whether the candidate indicator is a malicious indicator.
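The decision rule for combining the two judgment results can be sketched as follows. The function name `arbitrate` is hypothetical, results are represented as booleans (`True` for "malicious indicator"), and the "difference" between the two probabilities is taken to be the absolute difference, which is an interpretation of the text rather than something it states explicitly. The default threshold of 0.20 follows the embodiment.

```python
def arbitrate(first_result, first_prob, second_result, second_prob, threshold=0.20):
    """Combine a heuristic (first) and a machine learning (second) judgment.

    Returns True if the candidate should be marked as a malicious indicator.
    """
    if first_result == second_result:
        # The models agree, so either result can be used.
        return first_result
    if abs(first_prob - second_prob) > threshold:
        # Large gap: trust whichever model is more confident.
        return first_result if first_prob > second_prob else second_result
    # Small gap: fall back to the machine learning classifier.
    return second_result
```

For instance, `arbitrate(True, 0.95, False, 0.60)` follows the first model because the gap of 0.35 exceeds the threshold, whereas `arbitrate(True, 0.70, False, 0.60)` follows the second model.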

In an exemplary embodiment of the present application, the obtaining a candidate indicator from the text information includes:

Performing sentence division processing on the text information to obtain a plurality of character segments (namely a plurality of sentences). The sentence division processing can be performed using a pre-trained NLP model.

Extracting features of each character segment to obtain the feature information corresponding to each character segment. The feature information includes at least one of: text features, data source features, source distribution features, sentence features, content features, and external features.

Specifically, a candidate indicator may be a URL, a domain name, an IP address, an email address, a file hash, and the like. The text surrounding the candidate indicator, that is, the "context" of the indicator, is extracted and used as associated information of the candidate indicator. Other information related to the candidate indicator, such as DNS records, may also be collected as associated information, so that the associated information can assist in determining whether the candidate indicator is itself a malicious indicator.
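The following sketch shows one simple way to collect candidate indicators together with their sentence context. It is not the feature-based approach described above: the sentence splitter is a naive regular expression rather than a pre-trained NLP model, and candidates are found with a few illustrative patterns. The names `CANDIDATE_PATTERNS`, `split_sentences` and `find_candidates` are hypothetical.

```python
import re

# Assumption: illustrative patterns only; real indicator formats are more varied.
CANDIDATE_PATTERNS = {
    "url": re.compile(r"https?://[^\s\"'<>]+", re.IGNORECASE),
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "sha256": re.compile(r"\b[a-fA-F0-9]{64}\b"),
}


def split_sentences(text):
    """Naive split on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def find_candidates(text):
    """Return candidate indicators with their sentence as associated context."""
    candidates = []
    for sentence in split_sentences(text):
        for kind, pattern in CANDIDATE_PATTERNS.items():
            for match in pattern.finditer(sentence):
                candidates.append(
                    {"type": kind, "value": match.group(0), "context": sentence}
                )
    return candidates
```

Each returned entry keeps the full sentence as context, so a downstream judgment model can use the surrounding words as well as the indicator string itself.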

Because browsers and other Web and email clients often render URLs, IP addresses, email addresses, and domain names as clickable links, such text is often deliberately altered so that it is not clickable, to prevent readers from inadvertently accessing dangerous resources. This process is referred to as harmless treatment (commonly called "defanging").

For example, "hXXPs[:]//malicious.url[.]com/install.exe" is a URL that has undergone harmless treatment. Because many character features of the text are modified by the harmless treatment, the candidate indicator becomes difficult to identify. Therefore, in this embodiment, if a set character string (e.g., "[.]") exists in a character segment, character conversion processing is performed on the set character string to restore the original text, thereby increasing the recognition rate of candidate indicators.
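The character conversion can be sketched as a small restoring function. The set of substitutions below (`hxxp` to `http`, `[:]` to `:`, `[.]` to `.`, `[@]` to `@`) is an assumption chosen for illustration, since the application does not list the set character strings; the function name `restore_text` is likewise hypothetical.

```python
import re


def restore_text(segment):
    """Undo common harmless-treatment substitutions in one character segment."""
    # hxxp / hXXPs and similar spellings back to http / https.
    restored = re.sub(r"\bhxxp(s?)", r"http\1", segment, flags=re.IGNORECASE)
    # Bracketed separators back to their plain characters.
    for marked, plain in (("[:]", ":"), ("[.]", "."), ("[@]", "@")):
        restored = restored.replace(marked, plain)
    return restored
```

Applied to a hypothetical input such as `hXXPs[:]//malicious[.]example[.]com/install.exe`, the function yields `https://malicious.example.com/install.exe`, which ordinary URL patterns can then recognize.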

Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions, etc.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, an electronic device capable of implementing the above method is also provided.

As will be appreciated by one skilled in the art, aspects of the present application may be embodied as a system, method or program product. Accordingly, various aspects of the present application may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.) or an embodiment combining hardware and software aspects that may all generally be referred to herein as a "circuit," "module," or "system."

An electronic device according to this embodiment of the present application is described below. The electronic device is only an example and should not impose any limitation on the function or scope of use of the embodiments of the present application.

The electronic device is in the form of a general purpose computing device. Components of the electronic device may include, but are not limited to: at least one processor, at least one memory, and a bus connecting the various system components (including the memory and the processor).

The memory stores program code executable by the processor to cause the processor to perform steps according to various exemplary embodiments of the present application described in the "exemplary methods" section above.

The memory may include readable media in the form of volatile memory, such as Random Access Memory (RAM) and/or cache memory, and may further include Read Only Memory (ROM).

The storage may also include a program/utility having a set (at least one) of program modules including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

The bus may be any of several types of bus structures including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures.

The electronic device may also communicate with one or more external devices (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device to communicate with one or more other computing devices. Such communication may be through an input/output (I/O) interface. Also, the electronic device may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via a network adapter. The network adapter communicates with other modules of the electronic device over the bus. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.

In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium having stored thereon a program product capable of implementing the above-described method of the present specification. In some possible embodiments, various aspects of the present application may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the present application described in the "exemplary methods" section above of this specification, when the program product is run on the terminal device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In situations involving remote computing devices, the remote computing devices may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to external computing devices (e.g., through the internet using an internet service provider).

Furthermore, the above-described figures are only schematic illustrations of the processes involved in the methods according to exemplary embodiments of the present application and are not intended to be limiting. It will be readily appreciated that the processes illustrated in the above figures are not intended to indicate or limit the temporal order of the processes. In addition, it is also readily understood that these processes may be performed synchronously or asynchronously, e.g., in multiple modules.

It should be noted that although several modules or units of the device for action execution are mentioned in the above detailed description, such a division is not mandatory. Indeed, according to embodiments of the present disclosure, the features and functions of two or more modules or units described above may be embodied in one module or unit. Conversely, the features and functions of one module or unit described above may be further divided so as to be embodied by a plurality of modules or units.

The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present application should be covered within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for processing web page data, comprising:

extracting malicious indicators from the text information;

2. The web page data processing method according to claim 1, wherein the target web address list includes an accessed list and an unaccessed list;

3. The method for processing web page data according to claim 1, further comprising:

and the stored website is any website in the target website list.

4. The method for processing webpage data according to claim 1, wherein the extracting malicious indicators from the text information includes:

acquiring candidate indicators from the text information;

5. The method for processing data on a web page according to claim 4, wherein the first judgment model is a heuristic classifier and the second judgment model is a machine learning classifier;

if so, determining whether to mark the candidate indicator as a malicious indicator according to the first judgment result and/or the second judgment result;

6. The method for processing webpage data according to claim 1, wherein the obtaining candidate indicators from the text information comprises:

7. The method for processing data of a web page according to claim 6, wherein before said extracting features from each of said character segments, after said performing sentence dividing processing on said text information, said method further comprises:

8. A web page data processing apparatus characterized by comprising:

the extraction module is used for extracting malicious indicators from the text information;

9. An electronic device comprising a processor and a memory;

the processor is adapted to perform the steps of the method of any one of claims 1 to 7 by calling a program or instructions stored in the memory.

10. A non-transitory computer readable storage medium storing a program or instructions for causing a computer to perform the steps of the method of any one of claims 1 to 7.