CN110191124B - Web front-end development data-based website identification method and device and storage equipment - Google Patents

Info

Publication number
CN110191124B
Authority
CN
China
Prior art keywords
feature
sub
features
website
development data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910458634.4A
Other languages
Chinese (zh)
Other versions
CN110191124A (en)
Inventor
李宝俊
童志明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Antiy Technology Group Co Ltd
Original Assignee
Antiy Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Antiy Technology Group Co Ltd
Priority to CN201910458634.4A
Publication of CN110191124A
Application granted
Publication of CN110191124B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/20 Network architectures or network communication protocols for network security for managing network security; network security policies in general

Abstract

The embodiment of the invention discloses a website identification method, a website identification device and storage equipment based on web front-end development data. They address two problems of traditional website identification: a blacklist mechanism produces missed reports and therefore leaves potential safety hazards, while a whitelist mechanism can identify only most known websites and cannot accurately identify newly appearing websites, which easily causes false alarms. The method comprises the following steps: collecting front-end development data of a safe website page; extracting the front-end development data features of the page and forming an information base; extracting the front-end development data features of a web page with unknown attributes, and matching these features with the features in the information base; if any one of the features is matched, the website attribute is safe.

Description

Web front-end development data-based website identification method and device and storage equipment
Technical Field
The embodiment of the invention relates to the field of network security, in particular to a website identification method and device based on web front-end development data, and storage equipment.
Background
In security research work, it is often necessary to identify the security and reputation of a website. Conventional safe-website identification works in two ways. The first is a blacklist mechanism, which filters out websites with low-reputation characteristics but can miss reports and therefore leaves potential safety hazards. The second is a whitelist mechanism, which can identify only most known high-reputation websites, handles newly appearing high-reputation websites poorly, is insufficiently accurate, and easily causes false alarms.
Disclosure of Invention
Based on the existing problems, the embodiment of the invention provides a website identification method, a website identification device and storage equipment based on web front-end development data, which address the problems that, in traditional website identification, a blacklist mechanism produces missed reports and leaves potential safety hazards, while a whitelist mechanism can identify only most known websites and cannot accurately identify newly appearing websites, which easily causes false alarms.
The embodiment of the invention discloses a website identification method based on web front-end development data, which comprises the following steps:
collecting front-end development data of a safe website page; extracting the front-end development data characteristics of the page and forming an information base; extracting the front-end development data characteristics of the web pages with unknown attributes, and matching the characteristics with the characteristics in the information base; if any one of the features is matched, the website attribute is safe.
Further, the page front-end development data features include: adaptation technology features, framework technology features, interface design features, appearance design features and webpage content features;
wherein the adaptation technology feature comprises a plurality of sub-features and is an adaptation code; the framework technology feature comprises a plurality of sub-features and is a keyword, and/or the keywords in a framework together with their number; the interface design feature comprises a plurality of sub-features and is that the page HTML contains no code in other languages; the appearance design feature comprises a plurality of sub-features and is a graphic composition feature and/or a color number feature and/or a resolution; the webpage content feature comprises a plurality of sub-features and is a filing information feature and/or a copyright information feature and/or a URL feature and/or a title and annotation feature.
Further, extracting the page front-end development data features of the website with unknown attributes and matching them with the features in the information base specifically comprises: setting a threshold $T_n$ for each feature in the information base, and setting a weight $T_{na_n}$ for each sub-feature of each feature in the information base; extracting any front-end development data feature of the website page with unknown attributes, matching its sub-features with the corresponding sub-features in the database, and calculating the feature matching degree $S_n$; if $S_n \ge T_n$, the feature matches the corresponding feature in the information base, and the website attribute is safe; if $S_n < T_n$, continuing to select other features for matching; wherein $S_n = T_{na_1} + T_{na_2} + \dots + T_{na_n}$; if a sub-feature matches, $T_{na_n}$ takes the weight set for that sub-feature in the information base, and if it does not match, the weight of the corresponding sub-feature is set to zero.
Further, if the matching degrees $S_n$ of all features are less than their thresholds $T_n$, the attribute of the unknown website is identified as follows: setting a statistical value $P = 1 - (1 - S_1/t_1)(1 - S_2/t_2)(1 - S_3/t_3)\cdots(1 - S_n/t_n)$; setting a statistical preset value; if the statistical value $P$ is not less than the statistical preset value, the website attribute is safe; otherwise the website attribute is unsafe.
Further, if the reputation of the website needs to be further evaluated, reputation level values $R_n$ are set for different reputations; a reputation weight $S_{nb_n}$ is set for each feature in the information base, the feature matching degree $S_n$ is calculated for all features, and the reputation value $X$ of the website is calculated, wherein $X = S_1 \cdot S_{1b_1} + S_2 \cdot S_{2b_2} + \dots + S_n \cdot S_{nb_n}$; comparing the reputation value $X$ with the reputation level values $R_n$ evaluates the reputation of the website.
The embodiment of the invention discloses a website identification device based on web front-end development data, which comprises a memory and a processor, wherein the memory is used for storing a plurality of instructions, and the processor is used for loading the instructions stored in the memory to execute:
collecting front-end development data of a safe website page; extracting the front-end development data characteristics of the page and forming an information base; extracting the front-end development data characteristics of the web pages with unknown attributes, and matching the characteristics with the characteristics in the information base; if any one of the features is matched, the website attribute is safe.
Further, the processor is also configured to load instructions stored in the memory to perform:
the page front-end development data features include: adaptation technology features, framework technology features, interface design features, appearance design features and webpage content features;
wherein the adaptation technology feature comprises a plurality of sub-features and is an adaptation code; the framework technology feature comprises a plurality of sub-features and is a keyword, and/or the keywords in a framework together with their number; the interface design feature comprises a plurality of sub-features and is that the page HTML contains no code in other languages; the appearance design feature comprises a plurality of sub-features and is a graphic composition feature and/or a color number feature and/or a resolution; the webpage content feature comprises a plurality of sub-features and is a filing information feature and/or a copyright information feature and/or a URL feature and/or a title and annotation feature.
Further, the processor is also configured to load instructions stored in the memory to perform:
extracting the page front-end development data features of the website with unknown attributes and matching them with the features in the information base, specifically: setting a threshold $T_n$ for each feature in the information base, and setting a weight $T_{na_n}$ for each sub-feature of each feature in the information base; extracting any front-end development data feature of the website page with unknown attributes, matching its sub-features with the corresponding sub-features in the database, and calculating the feature matching degree $S_n$; if $S_n \ge T_n$, the feature matches the corresponding feature in the information base, and the website attribute is safe; if $S_n < T_n$, continuing to select other features for matching; wherein $S_n = T_{na_1} + T_{na_2} + \dots + T_{na_n}$; if a sub-feature matches, $T_{na_n}$ takes the weight set for that sub-feature in the information base, and if it does not match, the weight of the corresponding sub-feature is set to zero.
Further, the processor is also configured to load instructions stored in the memory to perform:
if the matching degrees $S_n$ of all features are less than their thresholds $T_n$, the attribute of the unknown website is identified as follows: setting a statistical value $P = 1 - (1 - S_1/t_1)(1 - S_2/t_2)(1 - S_3/t_3)\cdots(1 - S_n/t_n)$; setting a statistical preset value; if the statistical value $P$ is not less than the statistical preset value, the website attribute is safe; otherwise the website attribute is unsafe.
Further, the processor is also configured to load instructions stored in the memory to perform:
if the reputation of the website needs to be further evaluated, reputation level values $R_n$ are set for different reputations; a reputation weight $S_{nb_n}$ is set for each feature in the information base, the feature matching degree $S_n$ is calculated for all features, and the reputation value $X$ of the website is calculated, wherein $X = S_1 \cdot S_{1b_1} + S_2 \cdot S_{2b_2} + \dots + S_n \cdot S_{nb_n}$; comparing the reputation value $X$ with the reputation level values $R_n$ evaluates the reputation of the website.
The embodiment of the invention also discloses a website identification device based on web front-end development data, which comprises:
a data collection module: collecting front-end development data of a safe website page;
a feature extraction module: extracting the front-end development data characteristics of the page and forming an information base;
a matching module: extracting the front-end development data characteristics of the web pages with unknown attributes, and matching the characteristics with the characteristics in the information base;
an authentication module: if any one feature is matched, the website attribute is safe;
the page front-end development data features include: adaptation technology features, framework technology features, interface design features, appearance design features and webpage content features;
wherein the adaptation technology feature comprises a plurality of sub-features and is an adaptation code; the framework technology feature comprises a plurality of sub-features and is a keyword, and/or the keywords in a framework together with their number; the interface design feature comprises a plurality of sub-features and is that the page HTML contains no code in other languages; the appearance design feature comprises a plurality of sub-features and is a graphic composition feature and/or a color number feature and/or a resolution; the webpage content feature comprises a plurality of sub-features and is a filing information feature and/or a copyright information feature and/or a URL feature and/or a title and annotation feature.
The embodiment of the invention provides a storage device, wherein a plurality of instructions are stored in the storage device, and the instructions are suitable for being loaded by a processor and executing the steps of the website authentication method based on the web front-end development data provided by the embodiment of the invention.
Compared with the prior art, the website identification method, the website identification device and the storage equipment based on the web front-end development data provided by the embodiment of the invention at least realize the following beneficial effects:
collecting front-end development data of a safe website page; extracting the front-end development data features of the page and forming an information base; extracting the front-end development data features of a web page with unknown attributes, and matching these features with the features in the information base; if any one of the features is matched, the website attribute is safe. By summarizing the features of web front-end development practice across multiple dimensions, the embodiment of the invention can identify highly credible websites more accurately, effectively reduce false alarms under a whitelist mechanism, and avoid missed reports under a blacklist mechanism.
Drawings
In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a flowchart of a website authentication method based on web front-end development data according to an embodiment of the present invention;
FIG. 2 is a flowchart of another website authentication method based on web front-end development data according to an embodiment of the present invention;
FIG. 3 is a flowchart of a website reputation evaluation method based on web front-end development data according to an embodiment of the present invention;
FIG. 4 is a block diagram of a website authentication apparatus based on web front-end development data according to an embodiment of the present invention;
FIG. 5 is a structural diagram of another website authentication apparatus based on web front-end development data according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention clearer, a detailed description will be given below of a specific implementation of a web site authentication method based on web front-end development data according to an embodiment of the present invention with reference to the accompanying drawings. It should be understood that the preferred embodiments described below are only for illustrating and explaining the present invention and are not to be used for limiting the present invention. And the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
The embodiment of the invention provides a website identification method based on web front-end development data, whose flow chart is shown in FIG. 1, comprising the following steps:
Step 11, collecting front-end development data of a safe website page;
Step 12, extracting the front-end development data features of the page and forming an information base; the information base can be updated regularly, and the update frequency can be adjusted as required;
Step 13, extracting the front-end development data features of a web page with unknown attributes, and matching these features with the features in the information base;
Step 14, if any one feature is matched, the website attribute is safe.
Wherein the page front-end development data features include: adaptation technology features, framework technology features, interface design features, appearance design features and webpage content features;
adaptation technology automatically adjusts a product to its best presentation so that it suits different operating systems, browsers, devices and the like; the adaptation technology feature comprises a plurality of sub-features including but not limited to platform adaptation features, device adaptation features, interface adaptation features and browser adaptation features; and the adaptation technology feature is an adaptation code, for example, <meta name="viewport" content="width=device-width, initial-scale=1.0, user-scalable=0"> is a relatively common standard adaptation code for mobile interfaces.
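As a rough illustration of how an adaptation code could be extracted as a sub-feature, the following Python sketch looks for a viewport meta tag that tracks the device width. The class and function names (`ViewportMetaFinder`, `parse_viewport`, `has_mobile_adaptation`) are hypothetical and not part of the patent; treating "width=device-width" as the deciding signal is an assumption made for this example only.

```python
from html.parser import HTMLParser


class ViewportMetaFinder(HTMLParser):
    """Collects the content attribute of every <meta name="viewport"> tag."""

    def __init__(self):
        super().__init__()
        self.viewport_contents = []

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attr_map = {name: (value or "") for name, value in attrs}
        if attr_map.get("name", "").strip().lower() == "viewport":
            self.viewport_contents.append(attr_map.get("content", ""))


def parse_viewport(content):
    """Split 'width=device-width, initial-scale=1.0' into a dict."""
    settings = {}
    for part in content.split(","):
        key, sep, value = part.partition("=")
        if sep:
            settings[key.strip().lower()] = value.strip().lower()
    return settings


def has_mobile_adaptation(html):
    """Assumed rule: the page adapts if any viewport tracks the device width."""
    finder = ViewportMetaFinder()
    finder.feed(html)
    return any(
        parse_viewport(content).get("width") == "device-width"
        for content in finder.viewport_contents
    )
```

A real information base would presumably record several such adaptation sub-features (platform, device, interface, browser); this sketch covers only the mobile viewport case quoted in the text.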
The framework technology feature comprises a plurality of sub-features, including but not limited to distinctive markers such as keywords and calling conventions used in frameworks such as CSS frameworks, modular development frameworks and tool frameworks; the framework technology feature is a keyword, and/or the keywords in a framework together with their number; for example, "aria-valuew" is an attribute name unique to a CSS framework, i.e., a CSS framework keyword.
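One possible way to turn "keywords in the framework and the number of the keywords" into data is to count whole-token occurrences of known keywords in the page source. The sketch below is illustrative only: the keyword list is hypothetical (the first entry is the attribute name quoted in the text, the others are placeholders), and the function name `count_framework_keywords` is an assumption.

```python
import re

# Hypothetical keyword list; a real information base would be built from
# the front-end data collected from safe websites.
FRAMEWORK_KEYWORDS = ("aria-valuew", "data-toggle", "ng-app")


def count_framework_keywords(html, keywords=FRAMEWORK_KEYWORDS):
    """Return {keyword: count} for every keyword that occurs at least once."""
    lowered = html.lower()
    counts = {}
    for keyword in keywords:
        # Require that the keyword is not part of a longer name such as
        # "data-toggle-extra" by forbidding word characters or hyphens
        # immediately before and after it.
        pattern = r"(?<![\w-])" + re.escape(keyword.lower()) + r"(?![\w-])"
        hits = len(re.findall(pattern, lowered))
        if hits:
            counts[keyword] = hits
    return counts
```

Both the set of keywords found and their counts could then serve as sub-features when matching against the information base.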
The interface design feature comprises a plurality of sub-features, and the interface design feature is that the page HTML contains no code in other languages; for example, CSS code is referenced separately by URL and is not mixed into the HTML code.
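The CSS example above can be checked mechanically. The following sketch, with hypothetical names `InlineCssDetector` and `css_is_separated`, flags pages that embed CSS through `<style>` blocks or `style` attributes and records stylesheets referenced by URL. It covers only CSS; how the patent treats other embedded languages is not specified, so that is left out.

```python
from html.parser import HTMLParser


class InlineCssDetector(HTMLParser):
    """Counts CSS embedded in the HTML and lists externally linked stylesheets."""

    def __init__(self):
        super().__init__()
        self.inline_css_count = 0
        self.external_stylesheets = []

    def handle_starttag(self, tag, attrs):
        attr_map = {name: (value or "") for name, value in attrs}
        # A <style> element or a style="..." attribute mixes CSS into the HTML.
        if tag == "style" or "style" in attr_map:
            self.inline_css_count += 1
        # <link rel="stylesheet" href="..."> references CSS by URL instead.
        rel_tokens = attr_map.get("rel", "").lower().split()
        if tag == "link" and "stylesheet" in rel_tokens and attr_map.get("href"):
            self.external_stylesheets.append(attr_map["href"])


def css_is_separated(html):
    """True if the page embeds no CSS directly in its HTML."""
    detector = InlineCssDetector()
    detector.feed(html)
    return detector.inline_css_count == 0
```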
The appearance design feature comprises a plurality of sub-features including, but not limited to: page layout, color richness, color matching scheme, and the quantity and quality of materials; and the appearance design feature is a graphic composition feature and/or a color number feature and/or a resolution; through image processing and image recognition technology, the appearance design is abstracted into features such as graphic composition, color number and resolution.
The webpage content feature comprises a plurality of sub-features, and the webpage content feature is a filing information feature and/or a copyright information feature and/or a URL feature and/or a title and annotation feature; for example, the title format is usually concise and intuitive, with its length kept within 20 characters (within 10 words for English), no more than 3 symbols, and no operator symbols, so that users can easily understand the core content of the current webpage.
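The title rule described above can be sketched as a small check. Several details are assumptions made for this example: Chinese length is counted in CJK characters, English length in whitespace-separated words, "symbols" are taken to be Unicode punctuation and symbol characters, and the set of "operation symbols" in `OPERATOR_SYMBOLS` is hypothetical (the hyphen is deliberately excluded because it is common as a title separator).

```python
import unicodedata

# Hypothetical set of "operation symbols"; the source does not enumerate them.
OPERATOR_SYMBOLS = set("+*/=<>^%")


def is_cjk(ch):
    """True for characters in the basic CJK Unified Ideographs block."""
    return "\u4e00" <= ch <= "\u9fff"


def title_looks_concise(title):
    """Apply the length, symbol-count and operator rules to a page title."""
    cjk_chars = sum(1 for ch in title if is_cjk(ch))
    if cjk_chars:
        length_ok = cjk_chars <= 20          # at most 20 Chinese characters
    else:
        length_ok = len(title.split()) <= 10  # at most 10 English words
    # Unicode general categories starting with P (punctuation) or S (symbol).
    symbols = [ch for ch in title if unicodedata.category(ch)[0] in ("P", "S")]
    no_operators = not any(ch in OPERATOR_SYMBOLS for ch in title)
    return length_ok and len(symbols) <= 3 and no_operators
```

In the scheme described here, the result would be one sub-feature of the webpage content feature, alongside filing, copyright and URL information.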
By summarizing the features of web front-end development practice across multiple dimensions, the method provided by the embodiment of the invention can accurately identify highly credible websites, effectively reduce false alarms under a whitelist mechanism, and avoid missed reports under a blacklist mechanism.
The embodiment of the invention provides another website identification method based on web front-end development data, whose flow chart is shown in FIG. 2, comprising the following steps:
Step 201, collecting front-end development data of a safe website page;
Step 202, extracting the front-end development data features of the page and forming an information base;
Step 203, setting a threshold $T_n$ for each feature in the information base, and setting a weight $T_{na_n}$ for each sub-feature of each feature in the information base;
Step 204, extracting any front-end development data feature of the website page with unknown attributes, matching its sub-features with the corresponding sub-features in the database, and calculating the feature matching degree $S_n$;
wherein $S_n = T_{na_1} + T_{na_2} + \dots + T_{na_n}$; if a sub-feature matches, $T_{na_n}$ takes the weight set for that sub-feature in the information base, and if it does not match, the weight of the corresponding sub-feature is set to zero.
Step 205, comparing the feature matching degree $S_n$ with the threshold $T_n$;
if $S_n \ge T_n$, the feature matches the corresponding feature in the information base, and the website attribute is safe; if $S_n < T_n$, other features continue to be selected for matching;
Step 206, if the matching degrees $S_n$ of all features are less than their thresholds $T_n$, setting a statistical value $P$ and a statistical preset value, and comparing the statistical value $P$ with the statistical preset value;
wherein $P = 1 - (1 - S_1/t_1)(1 - S_2/t_2)(1 - S_3/t_3)\cdots(1 - S_n/t_n)$;
the preset value can be set according to user requirements; if the statistical value $P$ is not less than the statistical preset value, the website attribute is safe; otherwise, the website attribute is unsafe; if the website attribute is judged to be unsafe, measures are taken against the unsafe website according to user requirements, including but not limited to: raising an alarm, prohibiting further operation, closing the webpage, and collecting the website information.
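Steps 203 to 206 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `Feature`, `matching_degree`, `statistical_value` and `identify` are hypothetical; the $t_i$ in the formula for P are assumed to be the per-feature thresholds $T_i$; and a site is assumed to be safe when P is at or above the preset value.

```python
import math
from dataclasses import dataclass


@dataclass
class Feature:
    name: str
    threshold: float   # T_n, the per-feature threshold
    sub_weights: dict  # sub-feature name -> weight set in the information base


def matching_degree(feature, matched):
    """S_n: sum of the weights of matched sub-features; unmatched ones count 0."""
    return sum(w for sub, w in feature.sub_weights.items() if sub in matched)


def statistical_value(degrees, thresholds):
    """P = 1 - prod(1 - S_i / t_i), assuming t_i is the threshold T_i."""
    return 1.0 - math.prod(1.0 - s / t for s, t in zip(degrees, thresholds))


def identify(features, observed, preset):
    """observed maps a feature name to the set of sub-features found on the page."""
    degrees = []
    for feature in features:
        s_n = matching_degree(feature, observed.get(feature.name, set()))
        if s_n >= feature.threshold:
            return "safe"  # Step 205: a single matched feature is enough
        degrees.append(s_n)
    # Step 206: no feature reached its threshold, so fall back to P.
    p = statistical_value(degrees, [f.threshold for f in features])
    return "safe" if p >= preset else "unsafe"
```

For instance, with two features whose matching degrees are half of their thresholds, P = 1 - 0.5 * 0.5 = 0.75, so the site would be judged safe for a preset value of 0.7 and unsafe for 0.8. Because each factor shrinks as a feature gets closer to its threshold, several partial matches can jointly push P above the preset value even though no single feature matches.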
If the website attribute is judged to be safe and the reputation of the website needs to be further evaluated, the embodiment of the invention provides a website reputation evaluation method based on web front-end development data, whose flow chart is shown in FIG. 3, comprising the following steps:
Step 301, for a website whose attribute is safe, setting reputation level values $R_n$ for different reputations;
the reputation level values $R_n$ can be set as required.
Step 302, setting a reputation weight $S_{nb_n}$ for each feature in the information base, calculating the feature matching degree $S_n$ for all features, and calculating the reputation value $X$ of the website;
wherein $X = S_1 \cdot S_{1b_1} + S_2 \cdot S_{2b_2} + \dots + S_n \cdot S_{nb_n}$;
Step 303, comparing the reputation value $X$ with the reputation level values $R_n$ to evaluate the reputation of the website.
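The reputation calculation in steps 301 to 303 amounts to a weighted sum followed by a lookup against level values. The sketch below uses hypothetical names (`reputation_value`, `reputation_level`) and assumes the level values $R_n$ act as ascending lower bounds of reputation bands; the text does not specify how X is compared with $R_n$, so this banding is an assumption.

```python
from bisect import bisect_right


def reputation_value(degrees, reputation_weights):
    """X = sum of S_i * (reputation weight of feature i)."""
    return sum(s * b for s, b in zip(degrees, reputation_weights))


def reputation_level(x, level_bounds, labels):
    """Map X to a label; level_bounds must be ascending and labels must have
    exactly one more entry than level_bounds."""
    return labels[bisect_right(level_bounds, x)]
```

With matching degrees 8, 5 and 6 and reputation weights 2, 1 and 3, X = 16 + 5 + 18 = 39, which falls in the middle band when the level values are 20 and 40.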
By summarizing web front-end development practice across multiple dimensions, the method provided by the embodiment of the invention can identify highly credible websites more accurately, and its logic is clear and easy to understand; it can effectively reduce false alarms under a whitelist mechanism and avoid missed reports under a blacklist mechanism; in addition, the reputation of a website can be further evaluated, which is convenient for users.
The embodiment of the present invention further provides a website authentication apparatus based on web front-end development data, as shown in FIG. 4, comprising a memory 410 and a processor 420, wherein the memory 410 is configured to store a plurality of instructions, and the processor 420 is configured to load the instructions stored in the memory 410 to perform:
collecting front-end development data of a safe website page; extracting the front-end development data characteristics of the page and forming an information base; extracting the front-end development data characteristics of the web pages with unknown attributes, and matching the characteristics with the characteristics in the information base; if any one of the features is matched, the website attribute is safe.
The processor 420 is configured to load instructions stored in the memory 410 to perform:
the page front-end development data features include: adaptation technology features, framework technology features, interface design features, appearance design features and webpage content features;
wherein the adaptation technology feature comprises a plurality of sub-features and is an adaptation code; the framework technology feature comprises a plurality of sub-features and is a keyword, and/or the keywords in a framework together with their number; the interface design feature comprises a plurality of sub-features and is that the page HTML contains no code in other languages; the appearance design feature comprises a plurality of sub-features and is a graphic composition feature and/or a color number feature and/or a resolution; the webpage content feature comprises a plurality of sub-features and is a filing information feature and/or a copyright information feature and/or a URL feature and/or a title and annotation feature.
The processor 420 is configured to load instructions stored in the memory 410 to perform:
extracting the page front-end development data features of the website with unknown attributes and matching them with the features in the information base, specifically:
setting a threshold $T_n$ for each feature in the information base;
extracting any front-end development data feature of the website page with unknown attributes, matching its sub-features with the corresponding sub-features in the database, and calculating the feature matching degree $S_n$;
if $S_n \ge T_n$, the feature matches the corresponding feature in the information base, and the website attribute is safe; if $S_n < T_n$, continuing to select other features for matching;
wherein $S_n = T_{na_1} + T_{na_2} + \dots + T_{na_n}$, and $T_{na_n}$ is the weight set for each sub-feature of each feature in the information base; if a sub-feature matches, $T_{na_n}$ takes the weight set for that sub-feature in the information base, and if it does not match, the weight of the corresponding sub-feature is set to zero.
The processor 420 is configured to load instructions stored in the memory 410 to perform:
if the matching degrees $S_n$ of all features are less than their thresholds $T_n$, the attribute of the unknown website is identified as follows:
setting a statistical value $P = 1 - (1 - S_1/t_1)(1 - S_2/t_2)(1 - S_3/t_3)\cdots(1 - S_n/t_n)$;
setting a statistical preset value;
if the statistical value $P$ is not less than the statistical preset value, the website attribute is safe; otherwise the website attribute is unsafe.
The processor 420 is configured to load the instructions stored in the memory 410 to perform:
if the reputation of the website needs to be further evaluated, setting reputation level values $R_n$ for different reputations;
setting a reputation weight $S_{nb_n}$ for each feature in the information base, calculating the feature matching degree $S_n$ for all features, and calculating the reputation value $X$ of the website, wherein $X = S_1 \cdot S_{1b_1} + S_2 \cdot S_{2b_2} + \dots + S_n \cdot S_{nb_n}$;
comparing the reputation value $X$ with the reputation level values $R_n$ to evaluate the reputation of the website.
The embodiment of the present invention also provides another website authentication apparatus based on web front-end development data, as shown in FIG. 5, including:
the data collection module 51: collecting front-end development data of a safe website page;
the feature extraction module 52: extracting the front-end development data characteristics of the page and forming an information base;
the matching module 53: extracting the front-end development data characteristics of the web pages with unknown attributes, and matching the characteristics with the characteristics in the information base;
the authentication module 54: if any one of the features is matched, the website attribute is safe.
The embodiment of the invention also provides a storage device, wherein a plurality of instructions are stored in the storage device, and the instructions are suitable for being loaded by a processor and executing the steps of the website authentication method based on the web front-end development data provided by the embodiment of the invention.
Through the above description of the embodiments, it is clear to those skilled in the art that the embodiments of the present invention may be implemented by hardware, or by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (such as a CD-ROM, a USB flash drive, or a removable hard disk) and which includes several instructions for enabling a computer device (such as a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
Those skilled in the art will appreciate that the drawings are merely schematic representations of one preferred embodiment and that the blocks or flow diagrams in the drawings are not necessarily required to practice the present invention.
Those skilled in the art will appreciate that the modules of the devices in the embodiments may be distributed among those devices as described, or may be relocated, with corresponding changes, to one or more devices different from those of the embodiments. The modules of the above embodiments may be combined into one module, or further split into multiple sub-modules.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A website identification method based on web front-end development data is characterized in that:
collecting front-end development data of a safe website page;
extracting the front-end development data characteristics of the page and forming an information base;
extracting the front-end development data characteristics of the web pages with unknown attributes, and matching the characteristics with the characteristics in the information base;
if any one feature is matched, the website attribute is safe;
the front-end development data features of the page comprise: adaptation technology features, framework technology features, interface design features, appearance design features and webpage content features;
wherein the adaptation technology feature comprises a plurality of sub-features and is an adaptation code;
the framework technology feature comprises a plurality of sub-features and is the keywords in the framework and/or the keywords together with the number of the keywords;
the interface design feature comprises a plurality of sub-features and is that the page HTML contains no code in other languages;
the appearance design feature comprises a plurality of sub-features and is a graphic composition feature and/or a color number feature and/or a resolution;
the webpage content feature comprises a plurality of sub-features and is a filing information feature and/or a copyright information feature and/or a URL feature and/or a title and annotation feature.
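As a rough illustration of how some of the feature categories in claim 1 might be turned into sub-features, the sketch below derives a nested dictionary from raw HTML. Everything specific in it is an assumption for the example: the function name `extract_features`, the keyword list, the checks used for each sub-feature, and the omission of the appearance design feature (which would need image analysis).

```python
import re

FRAMEWORK_KEYWORDS = ("bootstrap", "jquery", "vue")  # illustrative list, not from the patent


def extract_features(html: str) -> dict[str, dict[str, object]]:
    """Return feature -> {sub-feature -> value} for one page (hypothetical layout)."""
    lowered = html.lower()
    title = re.search(r"<title>(.*?)</title>", html, re.IGNORECASE | re.DOTALL)
    return {
        # Adaptation technology: presence of common responsive-layout code.
        "adaptation": {
            "viewport_meta": 'name="viewport"' in lowered,
            "media_query": "@media" in lowered,
        },
        # Framework technology: keywords and how often each occurs.
        "framework": {kw: lowered.count(kw) for kw in FRAMEWORK_KEYWORDS},
        # Interface design: the HTML contains no code in other languages
        # (approximated here by the absence of PHP or ASP/JSP style tags).
        "interface_design": {
            "html_only": re.search(r"<\?php|<%", html) is None,
        },
        # Webpage content: title text and a copyright notice.
        "web_content": {
            "title": title.group(1).strip() if title else "",
            "has_copyright": "copyright" in lowered or "©" in html,
        },
    }
```

Substring checks like these are deliberately crude; a production extractor would parse the DOM and stylesheets rather than search raw text.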
2. The method of claim 1, wherein extracting the front-end development data features of the page of the website with unknown attributes and matching them against the features in the information base specifically comprises:
setting a threshold $T_n$ for each feature in the information base, and setting a weight $T_{n}a_{n}$ for each sub-feature of each feature in the information base;
extracting any front-end development data feature of the page of the website with unknown attributes, matching its sub-features against the corresponding sub-features in the information base, and calculating the feature matching degree $S_n$;
if the matching degree $S_n \ge T_n$, the feature matches the corresponding feature in the information base and the website attribute is set to safe; if the matching degree $S_n < T_n$, other features continue to be selected for matching;
wherein the matching degree $S_n = T_{n}a_{1} + T_{n}a_{2} + \dots + T_{n}a_{n}$; if a sub-feature matches, $T_{n}a_{n}$ takes the weight set for that sub-feature in the information base, and if a sub-feature does not match, the weight of that sub-feature is set to zero.
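The weighted matching in claim 2 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: the `Feature` dataclass, the use of equality to decide whether a sub-feature "matches", and the example weights are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Feature:
    threshold: float              # T_n for this feature
    weights: dict[str, float]     # sub-feature name -> weight
    reference: dict[str, object]  # sub-feature name -> value stored in the information base


def matching_degree(feature: Feature, observed: dict[str, object]) -> float:
    """S_n: sum of the weights of matched sub-features; unmatched ones contribute zero."""
    return sum(
        weight
        for name, weight in feature.weights.items()
        if name in observed
        and name in feature.reference
        and observed[name] == feature.reference[name]  # assumed matching rule
    )


def identify(base: dict[str, Feature], page: dict[str, dict[str, object]]):
    """Try features one by one; the site is safe as soon as some S_n >= T_n.

    Returns (is_safe, degrees), where degrees holds S_n for every feature tried.
    """
    degrees: dict[str, float] = {}
    for name, feature in base.items():
        degrees[name] = matching_degree(feature, page.get(name, {}))
        if degrees[name] >= feature.threshold:
            return True, degrees
    return False, degrees
```

Returning the per-feature degrees alongside the verdict is a design choice for the sketch: the fallback in claim 3 and the reputation score in claim 4 both reuse the $S_n$ values.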
3. The method of claim 2, wherein if the matching degrees $S_n$ of all features are less than their thresholds $T_n$, the attribute of the unknown website is identified as follows:
setting a statistical value $P$, where $P = 1 - (1 - S_1/t_1)(1 - S_2/t_2)(1 - S_3/t_3)\cdots(1 - S_n/t_n)$;
setting a statistical preset value [formula image FDA0003285622480000021];
if the statistical value satisfies [formula image FDA0003285622480000022], the website attribute is safe; otherwise the website attribute is unsafe.
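The statistical value $P$ of claim 3 can be computed as in the sketch below. The function name and argument layout are assumptions. The preset value and the exact comparison are given only as formula images in the source, so the sketch stops at computing $P$ and does not guess the decision rule.

```python
import math


def statistical_value(degrees: list[float], thresholds: list[float]) -> float:
    """P = 1 - prod(1 - S_i / t_i) over all features.

    `degrees` holds the matching degrees S_i and `thresholds` the
    corresponding t_i, in the same order.
    """
    if len(degrees) != len(thresholds):
        raise ValueError("degrees and thresholds must have the same length")
    if any(t == 0 for t in thresholds):
        raise ValueError("thresholds must be non-zero")
    return 1.0 - math.prod(1.0 - s / t for s, t in zip(degrees, thresholds))
```

For example, `statistical_value([0.5, 0.25], [1.0, 0.5])` evaluates to `0.75`: each feature reaches half of its threshold, so the product is 0.5 × 0.5 = 0.25. Because every $S_i < t_i$ in this branch, each factor lies in (0, 1] and $P$ grows as more features come close to their thresholds.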
4. The method of claim 2, wherein if reputation evaluation is further to be performed on the website, reputation level values $R_n$ are set for the different reputation levels;
a reputation weight $S_{n}b_{n}$ is set for each feature in the information base, the feature matching degree $S_n$ is calculated for all features, and the reputation value $X$ of the website is calculated, where $X = S_1 \cdot S_{1}b_{1} + S_2 \cdot S_{2}b_{2} + \dots + S_n \cdot S_{n}b_{n}$; by comparing the reputation value $X$ with the reputation level values $R_n$, the credibility of the website can be evaluated.
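The reputation score of claim 4 is a weighted sum of matching degrees, sketched below. The claim does not say how $X$ is compared with the level values $R_n$, so `reputation_level` assumes one plausible rule (the highest level whose value does not exceed $X$); that rule, the function names, and the level labels are hypothetical.

```python
from typing import Optional


def reputation_value(degrees: list[float], weights: list[float]) -> float:
    """X = sum of S_i times the reputation weight set for feature i."""
    if len(degrees) != len(weights):
        raise ValueError("degrees and weights must have the same length")
    return sum(s * w for s, w in zip(degrees, weights))


def reputation_level(x: float, levels: dict[str, float]) -> Optional[str]:
    """Assumed rule: return the highest level whose value R_n is <= X."""
    best = None
    for name, r in sorted(levels.items(), key=lambda item: item[1]):
        if x >= r:
            best = name
    return best
```

With matching degrees `[0.5, 1.0]` and weights `[0.5, 0.25]`, `reputation_value` gives `0.5`, which `reputation_level` maps to `"medium"` for the illustrative levels `{"low": 0.0, "medium": 0.5, "high": 0.75}`.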
5. An apparatus for web site authentication based on web front-end development data, the apparatus comprising a memory for storing a plurality of instructions and a processor for loading the instructions stored in the memory to perform:
collecting front-end development data of a safe website page;
extracting the front-end development data characteristics of the page and forming an information base;
extracting the front-end development data characteristics of the web pages with unknown attributes, and matching the characteristics with the characteristics in the information base;
if any one feature is matched, the website attribute is safe;
the front-end development data features of the page comprise: adaptation technology features, framework technology features, interface design features, appearance design features and webpage content features;
wherein the adaptation technology feature comprises a plurality of sub-features and is an adaptation code;
the framework technology feature comprises a plurality of sub-features and is the keywords in the framework and/or the keywords together with the number of the keywords;
the interface design feature comprises a plurality of sub-features and is that the page HTML contains no code in other languages;
the appearance design feature comprises a plurality of sub-features and is a graphic composition feature and/or a color number feature and/or a resolution;
the webpage content feature comprises a plurality of sub-features and is a filing information feature and/or a copyright information feature and/or a URL feature and/or a title and annotation feature.
6. The apparatus of claim 5, wherein the processor is further to load instructions stored in the memory to perform:
extracting the front-end development data features of the page of the website with unknown attributes and matching them against the features in the information base, specifically:
setting a threshold $T_n$ for each feature in the information base;
extracting any front-end development data feature of the page of the website with unknown attributes, matching its sub-features against the corresponding sub-features in the information base, and calculating the feature matching degree $S_n$;
if the matching degree $S_n \ge T_n$, the feature matches the corresponding feature in the information base and the website attribute is set to safe; if the matching degree $S_n < T_n$, other features continue to be selected for matching;
wherein the matching degree $S_n = T_{n}a_{1} + T_{n}a_{2} + \dots + T_{n}a_{n}$, $T_{n}a_{n}$ being the weight set for each sub-feature of each feature in the information base; if a sub-feature matches, $T_{n}a_{n}$ takes the weight set for that sub-feature in the information base, and if a sub-feature does not match, the weight of that sub-feature is set to zero.
7. The apparatus of claim 6, wherein the processor is further to load instructions stored in the memory to perform:
if the matching degrees $S_n$ of all features are less than their thresholds $T_n$, the attribute of the unknown website is identified as follows:
setting a statistical value $P$, where $P = 1 - (1 - s_1/t_1)(1 - s_2/t_2)(1 - s_3/t_3)\cdots(1 - s_n/t_n)$;
setting a statistical preset value [formula image FDA0003285622480000031];
if the statistical value satisfies [formula image FDA0003285622480000032], the website attribute is safe; otherwise the website attribute is unsafe.
8. The apparatus of claim 6, wherein the processor is further to load instructions stored in the memory to perform:
if reputation evaluation is further to be performed on the website, reputation level values $R_n$ are set for the different reputation levels;
a reputation weight $S_{n}b_{n}$ is set for each feature in the information base, the feature matching degree $S_n$ is calculated for all features, and the reputation value $X$ of the website is calculated, where $X = S_1 \cdot S_{1}b_{1} + S_2 \cdot S_{2}b_{2} + \dots + S_n \cdot S_{n}b_{n}$;
by comparing the reputation value $X$ with the reputation level values $R_n$, the credibility of the website can be evaluated.
9. A web site authentication apparatus based on web front-end development data, comprising:
a data collection module: collecting front-end development data of a safe website page;
a feature extraction module: extracting the front-end development data characteristics of the page and forming an information base;
a matching module: extracting the front-end development data characteristics of the web pages with unknown attributes, and matching the characteristics with the characteristics in the information base;
an authentication module: if any one feature is matched, the website attribute is safe;
the front-end development data features of the page comprise: adaptation technology features, framework technology features, interface design features, appearance design features and webpage content features;
wherein the adaptation technology feature comprises a plurality of sub-features and is an adaptation code;
the framework technology feature comprises a plurality of sub-features and is the keywords in the framework and/or the keywords together with the number of the keywords;
the interface design feature comprises a plurality of sub-features and is that the page HTML contains no code in other languages;
the appearance design feature comprises a plurality of sub-features and is a graphic composition feature and/or a color number feature and/or a resolution;
the webpage content feature comprises a plurality of sub-features and is a filing information feature and/or a copyright information feature and/or a URL feature and/or a title and annotation feature.
10. A storage device having stored therein a plurality of instructions adapted to be loaded by a processor and to perform the method according to any one of claims 1-4.
CN201910458634.4A 2019-05-29 2019-05-29 Web front-end development data-based website identification method and device and storage equipment Active CN110191124B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910458634.4A CN110191124B (en) 2019-05-29 2019-05-29 Web front-end development data-based website identification method and device and storage equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910458634.4A CN110191124B (en) 2019-05-29 2019-05-29 Web front-end development data-based website identification method and device and storage equipment

Publications (2)

Publication Number Publication Date
CN110191124A CN110191124A (en) 2019-08-30
CN110191124B true CN110191124B (en) 2022-02-22

Family

ID=67718703

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910458634.4A Active CN110191124B (en) 2019-05-29 2019-05-29 Web front-end development data-based website identification method and device and storage equipment

Country Status (1)

Country Link
CN (1) CN110191124B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112785130B (en) * 2021-01-13 2024-04-16 上海派拉软件股份有限公司 Website risk level identification method, device, equipment and storage medium
CN113535458B (en) * 2021-09-17 2021-12-28 上海观安信息技术股份有限公司 Abnormal false alarm processing method and device, storage medium and terminal

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7559085B1 (en) * 2004-08-13 2009-07-07 Sun Microsystems, Inc. Detection for deceptively similar domain names
CN103927480A (en) * 2013-01-14 2014-07-16 腾讯科技(深圳)有限公司 Method, device and system for identifying malicious web page
CN104537303A (en) * 2014-12-30 2015-04-22 中国科学院深圳先进技术研究院 Distinguishing system and method for phishing website
CN104954372A (en) * 2015-06-12 2015-09-30 中国科学院信息工程研究所 Method and system for performing evidence acquisition and verification on phishing website
CN108650250A (en) * 2018-04-27 2018-10-12 北京奇安信科技有限公司 Illegal page detection method, system, computer system and readable storage medium storing program for executing

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102982129B (en) * 2012-11-14 2016-10-19 优视科技有限公司 Content in webpage is marked the method, system and device of prompting
CN109242487A (en) * 2018-09-26 2019-01-18 石帅 A kind of value assessment method of internet block chain environment lower network domain name

Also Published As

Publication number Publication date
CN110191124A (en) 2019-08-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 150010 building 7, innovation and entrepreneurship Plaza, science and technology innovation city, Harbin high tech Industrial Development Zone, Harbin, Heilongjiang Province (No. 838 Shikun Road)

Applicant after: Antiy Technology Group Co., Ltd.

Address before: 150010 building 7, innovation and entrepreneurship Plaza, science and technology innovation city, Harbin high tech Industrial Development Zone, Harbin, Heilongjiang Province (No. 838 Shikun Road)

Applicant before: Harbin Antian Science and Technology Group Co.,Ltd.

GR01 Patent grant