CN114579834B - Webpage login entity identification method and device, electronic equipment and storage medium - Google Patents

Info

Publication number
CN114579834B
Authority
CN
China
Prior art keywords
webpage
candidate
login
labels
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210242582.9A
Other languages
Chinese (zh)
Other versions
CN114579834A (en)
Inventor
李乾坤
何召阳
刘乃海
靳宇馨
王欣宇
袁伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Moyun Technology Co ltd
Original Assignee
Beijing Moyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Moyun Technology Co ltd
Priority to CN202210242582.9A
Publication of CN114579834A
Application granted
Publication of CN114579834B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9535 Search customisation based on user profiles and personalisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G06F16/9577 Optimising the visualization of content, e.g. distillation of HTML documents
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The application discloses a webpage login entity identification method and device, electronic equipment and a storage medium. First, candidate login webpage data are acquired by regular-expression matching. Candidate webpage labels are then extracted from the candidate login webpage data by regular-expression matching, and the entity boundaries of the candidate webpage labels are determined according to the priorities among the candidate webpage labels and their attributes. Graph data are constructed based on the keywords of the candidate webpage labels and the distances between them, and the constructed graph data are input into a trained webpage login entity identification model to obtain a webpage login entity identification type list. When each node is characterized, the method draws on both the node's own information and the information of its neighbor nodes, so the webpage structure information is used more fully to determine the login entity category to which each webpage label belongs. A large number of rules need not be checked one by one, and the method offers high detection speed, high precision and low cost.

Description

Webpage login entity identification method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of data identification, and in particular, to a method and apparatus for identifying a web page login entity, an electronic device, and a storage medium.
Background
Traditional methods for identifying login entities in webpage data are mainly rule-based methods and methods based on traditional machine learning. Targeting the characteristics of different login entities, these methods extract, relatively independently, the key position information on which login entity identification depends from webpage labels of specific types and from webpage labels containing specific keywords. They design rules and features for each kind of login entity, judge whether the key features of a given login entity are present in the extracted information, and return an identification result.
However, as login modes are continuously updated and new login types keep appearing, the defects of the traditional identification methods have become increasingly prominent. On the one hand, the identification rules of rule-based methods are difficult to maintain: they must be updated continuously as webpage login modes change, which consumes considerable manpower, and rules may be omitted or written incorrectly, so the identification effect gradually declines. On the other hand, traditional modeling methods lack knowledge of the overall information of a login entry and cannot make full use of webpage structure information to correlate the identification of multiple login entities. Moreover, as network security awareness improves, the effective features of login entities are reduced and identification becomes more difficult, so the identification effect of the traditional methods grows poorer.
Disclosure of Invention
Based on the above, the embodiment of the application provides a webpage login entity identification method, a webpage login entity identification device, electronic equipment and a storage medium, which improve the identification effect of login entity identification compared with the prior art.
In a first aspect, a method for identifying a web page login entity is provided, where the method includes:
acquiring candidate login webpage data by a regular matching method, wherein the candidate login webpage data are webpage data possibly containing login functions;
extracting candidate webpage labels from the candidate login webpage data through a regular matching method, and determining entity boundaries of the candidate webpage labels according to priorities among the candidate webpage labels and candidate webpage label attributes;
constructing graph data based on keywords of candidate webpage labels and distances between the candidate webpage labels, wherein the distances between the candidate webpage labels are determined through entity boundaries of the candidate webpage labels;
and inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.
Optionally, before inputting the constructed graph data into the trained web page login entity identification model, the method further comprises:
and performing data preprocessing on different types of login webpage data, converting the login webpage data into webpage label graph data, inputting the webpage label graph data into a graph neural network model for model training, performing function tuning and parameter tuning, and obtaining a webpage login entity identification model after training is completed.
Optionally, the constructing graph data based on the keywords of the candidate web page tags and the distance between the candidate web page tags includes:
selecting keywords of candidate webpage labels by using a TF-IDF method, and quantifying node characteristics of the candidate webpage labels by using keyword word frequency;
calculating the distance between candidate webpage labels based on the webpage Dom tree, and calculating to obtain the edge weight between webpage label nodes;
and constructing the webpage label graph data based on the candidate webpage label node characteristics and the edge weights between the webpage label nodes.
Optionally, calculating the distance between candidate webpage labels based on the webpage Dom tree to obtain the edge weight between webpage label nodes includes calculating the edge weight through a first formula:
Similarity = (1 - distance) / max(path length1, path length2)
wherein Similarity represents the edge weight between webpage label nodes, distance represents the distance between candidate webpage labels, and path length1 and path length2 are the depths of the two candidate webpage labels in the Dom tree, respectively.
Optionally, extracting the candidate webpage label from the candidate login webpage data through a regular matching method includes:
extracting the webpage labels of the preset type and the webpage labels containing the login keywords through a regular matching method.
Optionally, the preset type of web page tag at least includes: input tags and button tags.
In a second aspect, there is provided a web page login entity identification apparatus, the apparatus comprising:
the acquisition module is used for acquiring candidate login webpage data through a regular matching method, wherein the candidate login webpage data are webpage data possibly comprising login functions;
the extraction module is used for extracting candidate webpage labels from the candidate login webpage data through a regular matching method, and determining entity boundaries of the candidate webpage labels according to priorities among the candidate webpage labels and the attribute of the candidate webpage labels;
the construction module is used for constructing graph data based on the keywords of the candidate webpage labels and the distances between the candidate webpage labels, wherein the distances between the candidate webpage labels are determined through the entity boundaries of the candidate webpage labels;
and the output module is used for inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.
Optionally, the apparatus further comprises:
the training module is used for carrying out data preprocessing on different types of login webpage data, converting the login webpage data into webpage label graph data, inputting the webpage label graph data into the graph neural network model for model training, carrying out function tuning and parameter tuning, and obtaining a webpage login entity identification model after training is completed.
In a third aspect, an electronic device is provided, including a memory and a processor, where the memory stores a computer program, and the processor implements the method for identifying a web page login entity according to any one of the first aspects when executing the computer program.
In a fourth aspect, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the method for identifying a web page login entity according to any one of the first aspects.
In the technical scheme provided by the embodiment of the application, candidate login webpage data are obtained through a regular matching method; candidate webpage labels are extracted from the candidate login webpage data through a regular matching method, and entity boundaries of the candidate webpage labels are determined according to priorities among the candidate webpage labels and the attributes of the candidate webpage labels; graph data are constructed based on the keywords of the candidate webpage labels and the distances between the candidate webpage labels; and the constructed graph data are input into the trained webpage login entity identification model to obtain a webpage login entity identification type list. The beneficial effects of the technical scheme provided by the embodiment of the application include at least:
1. a large number of rules are not required to be matched, so that the detection efficiency is high;
2. the model can be reused after training, and the maintenance labor cost is low;
3. the detection flexibility is high, and the false alarm is low;
4. the professional level requirement is low;
5. the model has strong portability.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below. It will be apparent to those of ordinary skill in the art that the drawings in the following description are exemplary only and that other implementations can be obtained from the extensions of the drawings provided without inventive effort.
FIG. 1 is a flowchart illustrating steps of a method for identifying a web page login entity according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating a method for identifying a web page login entity according to an embodiment of the present application;
FIG. 3 is a block diagram of a device for identifying a web page login entity according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an electronic device according to an embodiment of the present application.
Detailed Description
Other advantages and effects of the present invention will become apparent to those skilled in the art from the following detailed description. The embodiments described are some, but not all, of the embodiments of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
The application provides a technical scheme for webpage login entity identification based on deep learning with a GCN (graph convolutional network). Noise data are cleaned using regular expressions and candidate webpage labels are extracted; graph data are then constructed based on the keywords of the webpage labels and the distances between the webpage labels; a graph neural network model performs the identification and outputs a login entity list corresponding to the target webpage. Referring to FIG. 1, a flowchart of a method for identifying a web page login entity according to an embodiment of the present application is shown, where the method may include the following steps:
Step 101, acquiring candidate login webpage data through a regular matching method.
The candidate login webpage data are webpage data possibly including login functions.
In the embodiment of the application, candidate login webpage data are obtained by utilizing a crawler and regular expression technology, namely, webpage data possibly containing login functions are rapidly screened by utilizing a regular matching method.
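By way of illustration only, the screening in step 101 might be sketched as follows, assuming the page HTML has already been fetched by a crawler. The pattern list, the function name `is_candidate_login_page` and the sample pages are hypothetical; the application does not disclose the regular expressions it uses.

```python
import re

# Hypothetical screening patterns (assumptions, not taken from the application).
LOGIN_PATTERNS = [
    re.compile(r"<input[^>]*type\s*=\s*[\"']?password", re.IGNORECASE),
    re.compile(r"\b(?:log\s?in|sign\s?in)\b", re.IGNORECASE),
    re.compile(r"登录"),
]


def is_candidate_login_page(html: str) -> bool:
    """Return True if any screening pattern matches the page source."""
    return any(pattern.search(html) for pattern in LOGIN_PATTERNS)


# Stand-in for crawler output: URL -> HTML source.
pages = {
    "https://example.com/login": (
        '<form><input type="password" name="pwd">'
        "<button>Sign in</button></form>"
    ),
    "https://example.com/about": "<p>About our company</p>",
}

# Keep only the pages that may contain a login function.
candidates = [url for url, html in pages.items() if is_candidate_login_page(html)]
```

A screen of this kind is deliberately loose: it only narrows the set of pages handed to the later steps and does not itself decide which labels are login entities.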
Step 102, extracting candidate webpage labels from the candidate login webpage data through a regular matching method, and determining entity boundaries of the candidate webpage labels according to priorities among the candidate webpage labels and the attribute of the candidate webpage labels.
In the embodiment of the application, a regular matching method is first used to extract webpage labels of specific types, for example input labels and button labels, and webpage labels containing a login keyword. Next, the boundary of each candidate entity is determined according to the priorities among the webpage labels and the webpage label attributes; for example, if the parent label of an img label is an a label, the boundary of the current entity is determined by the a label.
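By way of illustration only, step 102 might be sketched as follows. For brevity the sketch uses Python's standard `html.parser` in place of regular expressions, so that the parent of each label is known; the keyword list, the `BOUNDARY_PARENT` rule table and the class name `CandidateExtractor` are hypothetical, and only the img/a rule comes from the text above.

```python
from html.parser import HTMLParser

VOID_TAGS = {"input", "img", "br", "hr", "meta", "link"}  # tags with no close tag
CANDIDATE_TAGS = {"input", "button"}       # preset label types named in the text
LOGIN_KEYWORDS = ("login", "signin", "password")  # hypothetical keyword list
BOUNDARY_PARENT = {"img": "a"}  # child label -> parent label that takes priority


class CandidateExtractor(HTMLParser):
    """Collect candidate labels as (tag, boundary_tag, depth) tuples."""

    def __init__(self):
        super().__init__()
        self.stack = []       # currently open labels, outermost first
        self.candidates = []

    def handle_starttag(self, tag, attrs):
        attr_text = " ".join(value or "" for _, value in attrs).lower()
        has_keyword = any(k in attr_text for k in LOGIN_KEYWORDS)
        if tag in CANDIDATE_TAGS or has_keyword:
            parent = self.stack[-1] if self.stack else None
            # The entity boundary moves to the parent when a priority rule applies.
            if parent is not None and BOUNDARY_PARENT.get(tag) == parent:
                boundary = parent
            else:
                boundary = tag
            self.candidates.append((tag, boundary, len(self.stack)))
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open label, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass


html = (
    '<form><a href="/sso"><img src="qq_login.png"></a>'
    '<input type="password" name="pwd"><button>OK</button></form>'
)
extractor = CandidateExtractor()
extractor.feed(html)
```

Here the img label is a candidate because its attributes contain a login keyword, and its boundary is reported as the enclosing a label, matching the example given in the text.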
Step 103, constructing graph data based on the keywords of the candidate webpage labels and the distance between the candidate webpage labels.
The distance between the candidate webpage labels is determined through the entity boundaries of the candidate webpage labels.
In the embodiment of the application, the TF-IDF method is first used to select keywords of the candidate webpage labels, and keyword word frequency is used to quantify the node characteristics of the candidate webpage labels. TF-IDF (term frequency-inverse document frequency) is a common weighting technique in information retrieval and data mining, where TF is the term frequency and IDF is the inverse document frequency. TF-IDF evaluates how important a word is to one document in a document set or corpus: the importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency with which it appears across the corpus.
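By way of illustration only, keyword selection and node feature quantification might be sketched as follows, assuming each candidate webpage label has already been tokenized into a list of terms. The plain `tf * log(N / df)` weighting, the selection rule (keep every term with a non-zero weight) and the sample token lists are assumptions; the application does not state which TF-IDF variant or threshold it uses.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid dividing by zero on an empty label
        weights.append({
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


def node_features(doc, keywords):
    """Quantify a label as the word frequency of each selected keyword."""
    counts = Counter(doc)
    return [counts[k] for k in keywords]


# Hypothetical token lists, one per candidate webpage label.
docs = [
    ["input", "type", "password", "name", "pwd"],
    ["input", "type", "text", "name", "username"],
    ["button", "type", "submit", "login"],
]
weights = tf_idf(docs)

# Assumed selection rule: keep terms that are informative for at least one label.
keywords = sorted({t for w in weights for t, v in w.items() if v > 0})
features = [node_features(doc, keywords) for doc in docs]
```

In this sample the term `type` occurs in every label, so its inverse document frequency is zero and it is not selected as a keyword.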
Next, the distance between candidate webpage labels is calculated based on the webpage DOM tree. DOM is an abbreviation for Document Object Model; the DOM tree is the HTML tree structure, with its corresponding access methods, that is generated by parsing the HTML page through the DOM. The edge weight between webpage label nodes is then calculated through a first formula:
Similarity = (1 - distance) / max(path length1, path length2)    (1)
wherein Similarity represents the edge weight between webpage label nodes, distance represents the distance between candidate webpage labels, and path length1 and path length2 are the depths of the two candidate webpage labels in the DOM tree, respectively.
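By way of illustration only, the first formula might be evaluated as follows. The application does not define how distance is measured; the sketch assumes it is the number of tree edges between the two labels via their lowest common ancestor, divided by the sum of their depths so that it lies in [0, 1]. Representing each label by its tuple of child indices from the root, and the function names, are likewise assumptions.

```python
def common_prefix_len(path1, path2):
    """Length of the shared prefix of two root-to-node paths."""
    n = 0
    for a, b in zip(path1, path2):
        if a != b:
            break
        n += 1
    return n


def edge_weight(path1, path2):
    """Similarity = (1 - distance) / max(path length1, path length2).

    Each path is a tuple of child indices from the DOM root, so its
    length is the depth of the label in the tree.
    """
    depth1, depth2 = len(path1), len(path2)
    if depth1 == 0 or depth2 == 0:
        raise ValueError("paths must describe labels below the root")
    lca_depth = common_prefix_len(path1, path2)
    hops = (depth1 - lca_depth) + (depth2 - lca_depth)
    distance = hops / (depth1 + depth2)  # assumed normalization to [0, 1]
    return (1 - distance) / max(depth1, depth2)


# Two sibling inputs inside the same form, and a label in another subtree.
username = (0, 1, 0)
password = (0, 1, 1)
banner = (0, 0)

w_close = edge_weight(username, password)
w_far = edge_weight(username, banner)
```

Under these assumptions, labels that share a deeper common ancestor receive a larger edge weight than labels in different subtrees.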
Step 104, inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.
In this embodiment of the present application, before step 104, the method further includes converting different types of login webpage data into webpage label graph data using the data preprocessing methods of the second and third steps, inputting the webpage label graph data into a GCN (graph convolutional neural network) model, training the model, and performing function tuning and parameter tuning until the webpage login entity identification model with the best effect is obtained. The model takes the processed webpage label graph data as input and outputs a login entity list.
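By way of illustration only, the forward pass of a single graph convolution layer might be sketched as follows. The sketch uses the commonly cited propagation rule H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W); the application does not disclose its layer structure, loss function or training procedure, so the rule, the fixed weight matrix and the sample values are assumptions, and training is omitted.

```python
import math


def normalize_adjacency(adj):
    """Add self-loops and apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def gcn_layer(adj_norm, features, weights):
    """One graph convolution: aggregate neighbors, project, apply ReLU."""
    z = matmul(matmul(adj_norm, features), weights)
    return [[max(0.0, v) for v in row] for row in z]


# Hypothetical graph of three webpage labels with symmetric edge weights.
adj = [
    [0.0, 0.22, 0.13],
    [0.22, 0.0, 0.13],
    [0.13, 0.13, 0.0],
]
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # keyword word frequencies
weights = [[0.5, -0.2], [0.1, 0.3]]               # fixed here; learned in practice

hidden = gcn_layer(normalize_adjacency(adj), features, weights)
```

Because each output row mixes a node's own features with those of its weighted neighbors, the representation of a label reflects the surrounding webpage structure as well as the label itself.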
The candidate webpage data are obtained and processed through steps 101 to 103, the processed webpage label graph data are input into the webpage login entity identification model, and a webpage login entity identification type list is output.
As shown in FIG. 2, another optional embodiment of the present application is provided, comprising:
1. obtaining candidate login webpage data by utilizing a crawler and regular expression technology;
2. extracting candidate webpage labels by using a regular expression technology, and determining entity boundaries according to the priority of the webpage labels;
3. selecting feature words by using a TF-IDF method, and quantifying features by using a word frequency method;
4. calculating the distance between the webpage labels based on the Dom tree to construct a webpage label graph;
5. constructing a GCN graph neural network learning model, and carrying out entity identification on the webpage label;
6. converting the model output into the corresponding entity types, and outputting the login entity classification.
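By way of illustration only, the final conversion of the model output into entity types might be sketched as follows, assuming the model emits one row of class scores per webpage label. The entity type names, the label identifiers and the function name `decode_entities` are hypothetical; the application does not enumerate its entity categories.

```python
# Hypothetical entity categories; the index matches the model's output column.
ENTITY_TYPES = [
    "username_input",
    "password_input",
    "captcha_input",
    "submit_button",
    "other",
]


def decode_entities(scores, label_ids):
    """Map each label's highest-scoring class to its entity type name."""
    result = []
    for label_id, row in zip(label_ids, scores):
        best = max(range(len(row)), key=row.__getitem__)
        result.append((label_id, ENTITY_TYPES[best]))
    return result


# Stand-in for model output: one score row per candidate webpage label.
scores = [
    [0.10, 0.80, 0.05, 0.03, 0.02],
    [0.00, 0.10, 0.10, 0.70, 0.10],
]
entity_list = decode_entities(scores, ["input#pwd", "button#go"])
```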
In summary, the method departs entirely from the traditional rule-matching approach to login webpage entity identification and also differs from traditional modeling approaches to login entry identification. When each node is characterized, both the node's own information and the information of its neighbor nodes are taken into account, so the webpage structure information is used more fully to determine the login entity category to which each webpage label belongs. A large number of rules need not be checked one by one, and the method offers high detection speed, high precision and low cost.
Referring to FIG. 3, a block diagram of a web page login entity identification apparatus 200 according to an embodiment of the present application is shown. As shown in FIG. 3, the apparatus 200 may include: an acquisition module 201, an extraction module 202, a construction module 203 and an output module 204.
the acquisition module 201 is configured to obtain candidate login web page data by using a regular matching method, where the candidate login web page data is web page data that may include a login function;
the extraction module 202 is configured to extract candidate web page tags from the candidate login web page data by a regular matching method, and determine entity boundaries of the candidate web page tags according to priorities among the candidate web page tags and the candidate web page tag attributes;
a construction module 203, configured to construct graph data based on keywords of candidate web page tags and distances between the candidate web page tags, where the distances between the candidate web page tags are determined by entity boundaries of the candidate web page tags;
and the output module 204 is used for inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.
In an alternative embodiment of the present application, the apparatus further comprises a training module 205:
the training module 205 is configured to perform data preprocessing on different types of login web page data, convert the login web page data into web page tag graph data, input the web page tag graph data into a graph neural network model for model training, perform function tuning and parameter tuning, and obtain a web page login entity identification model after training is completed.
For specific limitation of the web page login entity identification device, reference may be made to the limitation of the web page login entity identification method hereinabove, and the description thereof will not be repeated here. The modules in the webpage login entity identification device may be implemented wholly or partly in software, hardware or a combination thereof. The above modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory in the computer device in software form, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, an electronic device is provided, which may be a computer, and the internal structure of which may be as shown in FIG. 4. The electronic device includes a processor, a memory, and a network interface connected by a system bus. The processor of the device is configured to provide computing and control capabilities. The memory of the device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage medium. The database of the computer device is used to store web page login entity identification data. The network interface of the computer device is used for communicating with an external terminal through a network connection. The computer program, when executed by a processor, implements a method for identifying a web page login entity.
It will be appreciated by those skilled in the art that the structure shown in FIG. 4 is merely a block diagram of some of the structures associated with the present application and does not constitute a limitation of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In one embodiment of the present application, a computer readable storage medium is provided, on which a computer program is stored, which when executed by a processor implements the steps of the above-described web page login entity identification method.
The computer readable storage medium provided in this embodiment has similar principles and technical effects to those of the above method embodiment, and will not be described herein.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, storage, database, or other medium used in the various embodiments provided herein may include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM), among others.
The technical features of the above-described embodiments may be arbitrarily combined. For brevity of description, not all possible combinations of the technical features in the above-described embodiments are described; however, as long as there is no contradiction between the combinations of the technical features, they should be considered to be within the scope of this description.
The above examples only represent a few embodiments of the present application, which are described in more detail and are not to be construed as limiting the scope of the claims. It should be noted that it would be apparent to those skilled in the art that various modifications and improvements could be made without departing from the spirit of the present application, which would be within the scope of the present application. Accordingly, the scope of protection of the present application is to be determined by the claims appended hereto.

Claims (10)

1. A method for identifying a web page login entity, the method comprising:
acquiring candidate login webpage data by a regular matching method, wherein the candidate login webpage data are webpage data possibly containing login functions;
extracting candidate webpage labels from the candidate login webpage data through a regular matching method, and determining entity boundaries of the candidate webpage labels according to priorities among the candidate webpage labels and candidate webpage label attributes;
constructing graph data based on keywords of candidate webpage labels and distances between the candidate webpage labels, wherein the distances between the candidate webpage labels are determined through entity boundaries of the candidate webpage labels;
and inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.
2. The method of claim 1, wherein prior to inputting the constructed graph data into the trained webpage login entity identification model, the method further comprises:
and performing data preprocessing on different types of login webpage data, converting the login webpage data into webpage label graph data, inputting the webpage label graph data into a graph neural network model for model training, performing function tuning and parameter tuning, and obtaining a webpage login entity identification model after training is completed.
3. The method of claim 1, wherein constructing graph data based on the keywords of the candidate web page tags and the distance between the candidate web page tags comprises:
selecting keywords of candidate webpage labels by using a TF-IDF method, and quantifying node characteristics of the candidate webpage labels by using keyword word frequency;
calculating the distance between candidate webpage labels based on the webpage DOM tree, and obtaining from it the edge weight between webpage label nodes;
and constructing the webpage label graph data based on the candidate webpage label node characteristics and the edge weights between the webpage label nodes.
4. The method of claim 3, wherein calculating the distance between candidate webpage labels based on the webpage DOM tree and obtaining the edge weight between webpage label nodes comprises calculating by a first formula, the first formula being:
Similarity = (1 - distance) / max(path length1, path length2)
wherein Similarity represents the edge weight between webpage label nodes, distance represents the distance between the candidate webpage labels, and path length1 and path length2 are the depths of the two candidate webpage labels in the DOM tree, respectively.
5. The method of claim 1, wherein extracting candidate web page tags from the candidate login web page data by a regular expression matching method comprises:
extracting the webpage labels of the preset type and the webpage labels containing the login keywords by a regular expression matching method.
6. The method of claim 5, wherein the preset type of web page tag comprises at least: input tags and button tags.
7. A web page login entity identification device, the device comprising:
the acquisition module is used for acquiring candidate login webpage data by a regular expression matching method, wherein the candidate login webpage data are webpage data that may contain a login function;
the extraction module is used for extracting candidate webpage labels from the candidate login webpage data by a regular expression matching method, and determining entity boundaries of the candidate webpage labels according to priorities among the candidate webpage labels and the attributes of the candidate webpage labels;
the construction module is used for constructing graph data based on the keywords of the candidate webpage labels and the distances between the candidate webpage labels, wherein the distances between the candidate webpage labels are determined through the entity boundaries of the candidate webpage labels;
and the output module is used for inputting the constructed graph data into the trained webpage login entity identification model to obtain a webpage login entity identification type list.
8. The device of claim 7, wherein the device further comprises:
the training module is used for carrying out data preprocessing on different types of login webpage data, converting the login webpage data into webpage label graph data, inputting the webpage label graph data into the graph neural network model for model training, carrying out function tuning and parameter tuning, and obtaining the webpage login entity identification model after training is completed.
9. An electronic device comprising a memory and a processor, the memory storing a computer program which, when executed by the processor, implements the web page login entity identification method according to any one of claims 1 to 6.
10. A computer readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the web page login entity identification method of any one of claims 1 to 6.
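As an illustration only, the tag extraction of claims 5 and 6 and the edge weight of claim 4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the keyword list, the function names, and the choice to pass DOM depths and pairwise distances in as precomputed inputs are all hypothetical. The claims do not specify how distance is normalized, and the sketch replaces the TF-IDF keyword selection of claim 3 with a fixed keyword list. It also covers only the preset tag types, not tags matched by login keywords.

```python
import re
from itertools import combinations

# Assumption: a small, hypothetical keyword list; the claims do not enumerate one.
LOGIN_KEYWORDS = ("login", "password", "username")

# Matches opening <input ...> and <button ...> tags (the preset types of claim 6).
TAG_RE = re.compile(r"<(input|button)\b[^>]*>", re.IGNORECASE)


def extract_candidate_tags(html):
    """Return the opening input/button tags found by regular expression matching."""
    return [m.group(0) for m in TAG_RE.finditer(html)]


def keyword_features(tag):
    """Node features: how often each login keyword occurs in the tag text.

    Simplification: claim 3 selects keywords with TF-IDF; a fixed list is used here.
    """
    lowered = tag.lower()
    return [lowered.count(keyword) for keyword in LOGIN_KEYWORDS]


def edge_weight(distance, depth1, depth2):
    """First formula of claim 4: (1 - distance) / max(depth1, depth2).

    Assumption: depths are at least 1, so the denominator is never zero.
    """
    return (1 - distance) / max(depth1, depth2)


def build_graph(tags, depths, distances):
    """Build node features and weighted edges for every pair of candidate tags.

    depths[i] is the DOM depth of tag i; distances[(i, j)] is the distance
    between tags i and j with i < j. Both are supplied by the caller.
    """
    nodes = [keyword_features(tag) for tag in tags]
    edges = {
        (i, j): edge_weight(distances[(i, j)], depths[i], depths[j])
        for i, j in combinations(range(len(tags)), 2)
    }
    return nodes, edges
```

The resulting `nodes` and `edges` would still need to be converted into whatever input format the graph neural network of claim 2 expects, which the claims leave open.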

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210242582.9A CN114579834B (en) 2022-03-11 2022-03-11 Webpage login entity identification method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210242582.9A CN114579834B (en) 2022-03-11 2022-03-11 Webpage login entity identification method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN114579834A CN114579834A (en) 2022-06-03
CN114579834B true CN114579834B (en) 2023-07-21

Family

ID=81775296

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210242582.9A Active CN114579834B (en) 2022-03-11 2022-03-11 Webpage login entity identification method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114579834B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116248375B (en) * 2023-02-01 2023-12-15 北京市燃气集团有限责任公司 Webpage login entity identification method, device, equipment and storage medium

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103309961B (en) * 2013-05-30 2015-07-15 北京智海创讯信息技术有限公司 Webpage content extraction method based on Markov random field
CN113037709B (en) * 2021-02-02 2022-03-29 厦门大学 Webpage fingerprint monitoring method for multi-label browsing of anonymous network

Also Published As

Publication number Publication date
CN114579834A (en) 2022-06-03

Similar Documents

Publication Publication Date Title
CN109885692B (en) Knowledge data storage method, apparatus, computer device and storage medium
US20210081376A1 (en) Construction method, device, computing device, and storage medium for constructing patent knowledge database
US10133650B1 (en) Automated API parameter resolution and validation
US9390176B2 (en) System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data
CN110458324B (en) Method and device for calculating risk probability and computer equipment
EP2202645A1 (en) Method of feature extraction from noisy documents
US20220197923A1 (en) Apparatus and method for building big data on unstructured cyber threat information and method for analyzing unstructured cyber threat information
CN105378731A (en) Correlating corpus/corpora value from answered questions
CN111125343A (en) Text analysis method and device suitable for human-sentry matching recommendation system
CN111881398B (en) Page type determining method, device and equipment and computer storage medium
CN110955608B (en) Test data processing method, device, computer equipment and storage medium
CN111984792A (en) Website classification method and device, computer equipment and storage medium
CN111931935A (en) Network security knowledge extraction method and device based on One-shot learning
US20230081737A1 (en) Determining data categorizations based on an ontology and a machine-learning model
CN112287199A (en) Big data center processing system based on cloud server
CN114579834B (en) Webpage login entity identification method and device, electronic equipment and storage medium
CN112579729A (en) Training method and device for document quality evaluation model, electronic equipment and medium
Wang et al. Cyber threat intelligence entity extraction based on deep learning and field knowledge engineering
CN113032548A (en) Information processing apparatus, storage medium, and information processing method
CN112464660B (en) Text classification model construction method and text data processing method
CN112380346B (en) Financial news emotion analysis method and device, computer equipment and storage medium
CN110781310A (en) Target concept graph construction method and device, computer equipment and storage medium
CN116484025A (en) Vulnerability knowledge graph construction method, vulnerability knowledge graph evaluation equipment and storage medium
CN115238645A (en) Asset data identification method and device, electronic equipment and computer storage medium
CN115690821A (en) Intelligent electronic file cataloging method and computer equipment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant