CN114579834B

CN114579834B - Webpage login entity identification method and device, electronic equipment and storage medium

Info

Publication number: CN114579834B
Application number: CN202210242582.9A
Authority: CN
Inventors: 李乾坤; 何召阳; 刘乃海; 靳宇馨; 王欣宇; 袁伟
Original assignee: Beijing Moyun Technology Co ltd
Current assignee: Beijing Moyun Technology Co ltd
Priority date: 2022-03-11
Filing date: 2022-03-11
Publication date: 2023-07-21
Anticipated expiration: 2042-03-11
Also published as: CN114579834A

Abstract

The application discloses a webpage login entity identification method and device, electronic equipment and a storage medium. First, candidate login webpage data are acquired by regular-expression matching. Candidate webpage tags are then extracted from the candidate login webpage data by regular-expression matching, and the entity boundaries of the candidate webpage tags are determined according to the priorities among the tags and the tag attributes. Graph data are constructed based on the keywords of the candidate webpage tags and the distances between the tags. The constructed graph data are input into a trained webpage login entity identification model to obtain a webpage login entity identification type list. Because each node is characterized using both its own information and the information of its neighbor nodes, the method makes fuller use of webpage structure information to determine the login entity category to which each webpage tag belongs, does not need to check a large number of rules one by one, and offers high detection speed, high precision and low cost.

Description

Webpage login entity identification method and device, electronic equipment and storage medium

Technical Field

The present invention relates to the field of data identification, and in particular, to a method and apparatus for identifying a web page login entity, an electronic device, and a storage medium.

Background

Traditional methods for identifying login entities in webpage data are mainly rule-based methods and methods based on traditional machine learning. Targeting the characteristics of different login entities, these methods extract, relatively independently, the key position information on which login entity identification depends from webpage tags of specific types and from webpage tags containing specific keywords. They design various rules and features for the different login entities, judge whether the key features of a given login entity are present in the extracted information, and return an identification result.

However, as login modes are continually updated and new login types keep appearing, the shortcomings of the traditional identification methods have become increasingly prominent. On the one hand, for rule-based methods, login entity identification rules are difficult to maintain and must be updated continually as webpage login modes change. This consumes considerable labor, and rules may be omitted or written incorrectly, so the identification effect gradually declines. On the other hand, traditional modeling methods lack knowledge of the overall information of a login entry and cannot make full use of webpage structure information to correlate the identification of multiple login entities. As network security awareness improves, the effective features of login entities decrease and identification becomes more difficult, so the identification effect of the traditional methods deteriorates further.

Disclosure of Invention

Based on the above, the embodiment of the application provides a webpage login entity identification method, a webpage login entity identification device, electronic equipment and a storage medium, which improve the identification effect of login entity identification compared with the prior art.

In a first aspect, a method for identifying a web page login entity is provided, where the method includes:

acquiring candidate login webpage data by a regular matching method, wherein the candidate login webpage data are webpage data possibly containing login functions;

extracting candidate webpage labels from the candidate login webpage data through a regular matching method, and determining entity boundaries of the candidate webpage labels according to priorities among the candidate webpage labels and candidate webpage label attributes;

constructing graph data based on keywords of candidate webpage labels and distances between the candidate webpage labels, wherein the distances between the candidate webpage labels are determined through entity boundaries of the candidate webpage labels;

and inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.

Optionally, before inputting the constructed graph data into the trained web page login entity identification model, the method further comprises:

and performing data preprocessing on different types of login webpage data, converting the login webpage data into webpage label graph data, inputting the webpage label graph data into a graph neural network model for model training, performing function tuning and parameter tuning, and obtaining a webpage login entity identification model after training is completed.

Optionally, the constructing graph data based on the keywords of the candidate web page tags and the distance between the candidate web page tags includes:

selecting keywords of candidate webpage labels by using a TF-IDF method, and quantifying node characteristics of the candidate webpage labels by using keyword word frequency;

calculating the distance between candidate webpage labels based on the DOM tree of the webpage, and calculating from it the edge weight between webpage label nodes;

and constructing the webpage label graph data based on the candidate webpage label node characteristics and the edge weights between the webpage label nodes.

Optionally, calculating the distance between candidate webpage labels based on the DOM tree of the webpage to obtain the edge weight between webpage label nodes includes calculating through a first formula, where the first formula is specifically:

Similarity = (1 - distance) / max(path length 1, path length 2)

wherein Similarity represents the edge weight between webpage label nodes, distance represents the distance between the candidate webpage labels, and path length 1 and path length 2 are the depths of the two candidate webpage labels in the DOM tree, respectively.

Optionally, extracting the candidate webpage label from the candidate login webpage data through a regular matching method includes:

extracting the webpage labels of the preset type and the webpage labels containing the login keywords through a regular matching method.

Optionally, the preset type of web page tag at least includes: input tags and button tags.

In a second aspect, there is provided a web page login entity identification apparatus, the apparatus comprising:

the acquisition module is used for acquiring candidate login webpage data through a regular matching method, wherein the candidate login webpage data are webpage data possibly comprising login functions;

the extraction module is used for extracting candidate webpage labels from the candidate login webpage data through a regular matching method, and determining entity boundaries of the candidate webpage labels according to priorities among the candidate webpage labels and the attribute of the candidate webpage labels;

the construction module is used for constructing graph data based on the keywords of the candidate webpage labels and the distances between the candidate webpage labels, wherein the distances between the candidate webpage labels are determined through the entity boundaries of the candidate webpage labels;

and the output module is used for inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.

Optionally, the apparatus further comprises:

the training module is used for carrying out data preprocessing on different types of login webpage data, converting the login webpage data into webpage label graph data, inputting the webpage label graph data into the graph neural network model for model training, carrying out function tuning and parameter tuning, and obtaining a webpage login entity identification model after training is completed.

In a third aspect, an electronic device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the method for identifying a web page login entity according to any one of the first aspects when executing the computer program.

In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method for identifying a web page login entity according to any one of the first aspects.

In the technical scheme provided by the embodiment of the application, candidate login webpage data are obtained through a regular matching method; candidate webpage labels are extracted from the candidate login webpage data through a regular matching method, and entity boundaries of the candidate webpage labels are determined according to priorities among the candidate webpage labels and the attributes of the candidate webpage labels; graph data are constructed based on the keywords of the candidate webpage labels and the distances between the candidate webpage labels; and the constructed graph data are input into the trained webpage login entity identification model to obtain a webpage login entity identification type list. The beneficial effects of the technical scheme provided by the embodiment of the application include at least:

1. a large number of rules are not required to be matched, so that the detection efficiency is high;

2. the model can be reused after training, and the maintenance labor cost is low;

3. the detection flexibility is high, and the false alarm is low;

4. the professional level requirement is low;

5. the model has strong portability.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those of ordinary skill in the art that the drawings in the following description are exemplary only and that other implementations can be obtained from the extensions of the drawings provided without inventive effort.

FIG. 1 is a flowchart illustrating steps of a method for identifying a web page login entity according to an embodiment of the present application;

FIG. 2 is a flowchart illustrating a method for identifying a web page login entity according to an embodiment of the present application;

FIG. 3 is a block diagram of a device for identifying a web page login entity according to an embodiment of the present application;

FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present application.

Detailed Description

Other advantages and benefits of the present invention will become apparent to those skilled in the art from the following detailed description. The embodiments described are some, but not all, of the embodiments of the invention. All other embodiments that can be obtained by those skilled in the art based on the embodiments of the invention without inventive effort are intended to be within the scope of the invention.

The present application provides a technical scheme for webpage login entity identification based on deep learning with a GCN (graph convolutional network). Noise data are cleaned using regular expressions, candidate webpage tags are extracted, graph data are then constructed based on the keywords of the webpage tags and the distances between the webpage tags, a graph neural network model performs the identification, and a login entity list corresponding to the target webpage is output. Referring to FIG. 1, a flowchart of a method for identifying a web page login entity according to an embodiment of the present application is shown, where the method may include the following steps:

Step 101, acquiring candidate login webpage data through a regular matching method.

The candidate login webpage data are webpage data possibly including login functions.

In the embodiment of the application, candidate login webpage data are obtained by utilizing a crawler and regular expression technology, namely, webpage data possibly containing login functions are rapidly screened by utilizing a regular matching method.
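The screening in step 101 can be pictured with a minimal Python sketch. The two patterns and the function name `is_candidate_login_page` are illustrative assumptions; the source does not disclose the actual expressions used.

```python
import re

# Hypothetical screening patterns; the source does not list the real expressions.
PASSWORD_INPUT = re.compile(
    r"<input\b[^>]*\btype\s*=\s*[\"']?password\b", re.IGNORECASE
)
LOGIN_KEYWORD = re.compile(r"log\s?in|sign\s?in|登录", re.IGNORECASE)


def is_candidate_login_page(html: str) -> bool:
    """Return True when the page may contain a login function."""
    return bool(PASSWORD_INPUT.search(html) or LOGIN_KEYWORD.search(html))
```

A deliberately loose filter like this only narrows the crawl; the later graph model is what decides which tags are login entities.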

Step 102, extracting candidate webpage labels from the candidate login webpage data through a regular matching method, and determining entity boundaries of the candidate webpage labels according to priorities among the candidate webpage labels and the attribute of the candidate webpage labels.

In the embodiment of the application, a regular matching method is first used to extract webpage tags of specific types, for example input tags and button tags, and webpage tags containing a login keyword. Next, the boundary of each candidate entity is determined according to the priority among the webpage tags and the webpage tag attributes. For example, if the parent tag of an img tag is an a tag, the boundary of the current entity is determined by the a tag.
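As a rough illustration of step 102, the sketch below extracts the preset tag types with a regular expression and applies the single boundary rule given in the text (an img whose parent is an a tag is bounded by the a tag). The names `PRESET_TAG`, `BoundaryParser`, `extract_preset_tags` and `img_boundaries` are hypothetical, and using Python's `html.parser` to find the parent tag is an assumption; the source does not say how the parent is located or what the full priority table is.

```python
import re
from html.parser import HTMLParser

# Preset tag types named in the text (input, button).
PRESET_TAG = re.compile(r"<(input|button)\b[^>]*>", re.IGNORECASE)

# Tags that never have children, so they are not pushed on the ancestor stack.
VOID_TAGS = {"img", "input", "br", "hr", "meta", "link"}


class BoundaryParser(HTMLParser):
    """Records, for each img tag, the tag that bounds its entity."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open ancestor tags
        self.boundaries = []  # (tag, boundary_tag) pairs

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            parent = self.stack[-1] if self.stack else None
            # Rule from the text: an img inside an a tag is bounded by the a tag.
            self.boundaries.append((tag, "a" if parent == "a" else tag))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop back to the matching open tag (tolerates unclosed children).
            while self.stack.pop() != tag:
                pass


def extract_preset_tags(html):
    """Return the raw text of every input and button start tag."""
    return [m.group(0) for m in PRESET_TAG.finditer(html)]


def img_boundaries(html):
    """Return (tag, boundary_tag) for every img tag in the page."""
    parser = BoundaryParser()
    parser.feed(html)
    parser.close()
    return parser.boundaries
```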

Step 103, constructing graph data based on the keywords of the candidate webpage labels and the distance between the candidate webpage labels.

The distance between the candidate webpage labels is determined through the entity boundaries of the candidate webpage labels.

In the embodiment of the application, a TF-IDF method is first used to select the keywords of the candidate webpage tags, and keyword frequency is used to quantify the node features of the candidate webpage tags. TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency. TF-IDF evaluates how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency at which it appears across the corpus.
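The keyword selection and feature quantification just described might look like the following minimal sketch. It assumes each candidate tag has already been tokenized into a list of words, uses the plain `log(N / df)` form of IDF, and ranks keywords by summed TF-IDF score; all three choices, and the function names, are assumptions not stated in the source.

```python
import math
from collections import Counter


def tf_idf(docs):
    """docs: one token list per tag. Returns one {term: weight} dict per tag."""
    n = len(docs)
    # Document frequency: in how many tags each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        tf = Counter(doc)
        scores.append(
            {t: (c / len(doc)) * math.log(n / df[t]) for t, c in tf.items()}
        )
    return scores


def select_keywords(docs, top_k):
    """Pick the top_k terms by total TF-IDF weight over all tags."""
    totals = Counter()
    for per_doc in tf_idf(docs):
        totals.update(per_doc)
    return [term for term, _ in totals.most_common(top_k)]


def node_features(doc, keywords):
    """Quantify one tag as a vector of keyword frequencies."""
    tf = Counter(doc)
    return [tf[k] for k in keywords]
```

For example, with the tags tokenized as `[["user", "name", "input"], ["password", "input"], ["login", "button", "login"]]`, the word "input" is down-weighted because it occurs in two of the three tags.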

Next, the distance between candidate webpage tags is calculated based on the DOM tree of the webpage. DOM is an abbreviation for Document Object Model. The DOM tree is the HTML tree structure, together with its corresponding access methods, that is generated when the HTML page is parsed through the DOM. The distance is used to calculate the edge weight between webpage tag nodes through a first formula, which is specifically:

Similarity = (1 - distance) / max(path length 1, path length 2) (1)
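A minimal sketch of formula (1) follows. The function `edge_weight` applies the formula exactly as printed. The helper `tree_distance`, which counts the edges between two nodes through their lowest common ancestor, is only one plausible definition of distance; the source does not define the distance or say whether it is normalized before being used, and with a raw edge count the printed formula gives a non-positive value.

```python
def tree_distance(path_a, path_b):
    """Edges between two nodes, given their root-to-node paths (assumed definition)."""
    common = 0
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def edge_weight(distance, depth_a, depth_b):
    """Formula (1) as printed: (1 - distance) / max(depth_a, depth_b)."""
    return (1 - distance) / max(depth_a, depth_b)
```

For two input tags at depth 5 that share the same form ancestor, `tree_distance` gives 4; if that distance were scaled to 0.25 before use, formula (1) would give 0.75 / 5 = 0.15.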

Step 104, inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.

In this embodiment of the present application, before step 104, the method further includes: converting the different types of login webpage data into webpage tag graph data by using the second and third data preprocessing methods; inputting the webpage tag graph data into a GCN (graph convolutional network) model; and training the model, with function tuning and parameter tuning, until the webpage login entity identification model with the best effect is obtained. The trained model takes the processed webpage tag graph data as input and outputs a login entity list.
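To make concrete how a GCN lets each node draw on its neighbors, here is a minimal pure-Python sketch of one graph-convolution step. It assumes the commonly used propagation rule ReLU(D^-1/2 (A + I) D^-1/2 X W) and non-negative edge weights; the source does not specify the layer formula, the number of layers, or the training procedure, and the name `gcn_layer` is hypothetical.

```python
import math


def gcn_layer(adj, feats, weight):
    """One graph-convolution step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization by node degree.
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    # Aggregate neighbor features.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n))
            for f in range(len(feats[0]))] for i in range(n)]
    # Linear map followed by ReLU.
    return [[max(0.0, sum(agg[i][f] * weight[f][o] for f in range(len(weight))))
             for o in range(len(weight[0]))] for i in range(n)]
```

With two connected tags whose features are `[1, 0]` and `[0, 1]` and an identity weight matrix, each output row becomes `[0.5, 0.5]`: every node's representation now mixes in its neighbor's features.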

The candidate webpage data are obtained and processed through steps 101 to 103, the processed webpage tag graph data are input into the webpage login entity identification model, and a webpage login entity identification type list is output.

As shown in FIG. 2, another alternative embodiment of the present application is provided, comprising:

1. obtaining candidate login webpage data by utilizing a crawler and regular expression technology;

2. extracting candidate webpage labels by using a regular expression technology, and determining entity boundaries according to the priority of the webpage labels;

3. selecting feature words by using a TF-IDF method, and quantifying features by using a word frequency method;

4. calculating the distance between the webpage labels based on the Dom tree to construct a webpage label graph;

5. constructing a GCN graph neural network learning model, and carrying out entity identification on the webpage label;

6. converting the model output into the corresponding entity types, and outputting the login entity classification.
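The last item in the list above, turning model output into entity types, can be sketched as a simple argmax over per-node class scores. The label set `ENTITY_TYPES` and the function `decode` are purely hypothetical; the source does not enumerate the entity types or describe the output format of the model.

```python
# Hypothetical label set; the source does not enumerate the entity types.
ENTITY_TYPES = ["other", "username", "password", "captcha", "submit"]


def decode(scores_per_node):
    """Pick the highest-scoring class for each tag node and drop non-entities."""
    result = []
    for node_id, scores in enumerate(scores_per_node):
        best = max(range(len(scores)), key=scores.__getitem__)
        if ENTITY_TYPES[best] != "other":
            result.append((node_id, ENTITY_TYPES[best]))
    return result
```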

In summary, the method departs completely from the traditional login webpage entity identification method based on rule matching, and also differs from the traditional modeling approach to login entry identification. When each node is characterized, not only the information of the node itself but also the information of its neighbor nodes is considered, so webpage structure information is used more fully to determine the login entity category to which each webpage tag belongs. A large number of rules therefore need not be checked one by one, and the method offers high detection speed, high precision and low cost.

Referring to FIG. 3, a block diagram of a web page login entity identification apparatus 200 according to an embodiment of the present application is shown. As shown in FIG. 3, the apparatus 200 may include: an acquisition module 201, an extraction module 202, a construction module 203 and an output module 204.

the acquisition module 201 is configured to obtain candidate login web page data by using a regular matching method, where the candidate login web page data is web page data that may include a login function;

the extraction module 202 is configured to extract candidate web page tags from the candidate login web page data by a regular matching method, and determine entity boundaries of the candidate web page tags according to priorities among the candidate web page tags and the candidate web page tag attributes;

a construction module 203, configured to construct graph data based on keywords of candidate web page tags and distances between the candidate web page tags, where the distances between the candidate web page tags are determined by entity boundaries of the candidate web page tags;

and the output module 204 is used for inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.

In an alternative embodiment of the present application, the apparatus further comprises a training module 205:

the training module 205 is configured to perform data preprocessing on different types of login web page data, convert the login web page data into web page tag graph data, input the web page tag graph data into a graph neural network model for model training, perform function tuning and parameter tuning, and obtain a web page login entity identification model until training is completed.

For specific limitations of the web page login entity identification device, reference may be made to the limitations of the web page login entity identification method above, which are not repeated here. The modules in the webpage login entity identification device may be implemented wholly or partly in software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.

In one embodiment, an electronic device is provided, which may be a computer, and the internal structure of which may be as shown in FIG. 4. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the device provides computing and control capabilities. The memory of the device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, a computer program, and a database. The internal memory provides an environment for running the operating system and the computer program stored in the non-volatile storage medium. The database of the computer device is used for storing web page login entity identification data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by the processor, implements a method for identifying a web page login entity.

It will be appreciated by those skilled in the art that the structure shown in FIG. 4 is merely a block diagram of some of the structures associated with the present application and does not constitute a limitation of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.

In one embodiment of the present application, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the above-described web page login entity identification method.

The computer readable storage medium provided in this embodiment has similar principles and technical effects to those of the above method embodiment, and will not be described herein.

Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.

The technical features of the above-described embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.

The above examples represent only a few embodiments of the present application. Although they are described in some detail, they are not to be construed as limiting the scope of the claims. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and these would fall within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the appended claims.

Claims

1. A method for identifying a web page login entity, the method comprising:

2. The method of claim 1, wherein prior to inputting the constructed graph data into the trained web page login entity identification model, the method further comprises:

3. The method of claim 1, wherein constructing graph data based on the keywords of the candidate web page tags and the distance between the candidate web page tags comprises:

4. The method of claim 3, wherein calculating the distance between candidate web page tags based on the DOM tree of the web page to obtain the edge weight between web page tag nodes comprises calculating by a first formula, the first formula specifically comprising:

Similarity = (1 - distance) / max(path length 1, path length 2)

5. The method of claim 1, wherein extracting candidate web page tags from the candidate login web page data by a regular matching method comprises:

6. The method of claim 5, wherein the preset type of web page tag comprises at least: input tags and button tags.

7. A web page login entity identification device, the device comprising:

8. The apparatus of claim 7, wherein the apparatus further comprises:

9. An electronic device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the web page login entity identification method according to any one of claims 1 to 6.

10. A computer readable storage medium, having stored thereon a computer program which when executed by a processor implements the web page login entity identification method of any of claims 1 to 6.