CN113779481B

CN113779481B - Method, device, equipment and storage medium for identifying fraud websites

Info

Publication number: CN113779481B
Application number: CN202111129835.3A
Authority: CN
Inventors: 周宇轩; 傅强; 蔡琳; 阿曼太; 梁彧; 马寒军; 田野; 王杰; 杨满智; 金红; 陈晓光
Original assignee: Eversec Beijing Technology Co Ltd
Current assignee: Eversec Beijing Technology Co Ltd
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2024-04-09
Anticipated expiration: 2041-09-26
Also published as: CN113779481A

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for identifying fraud websites, wherein the method comprises the following steps: acquiring a webpage source code of a website to be identified, and acquiring a target feature vector according to the webpage source code; judging whether a target standard feature vector matched with the target feature vector exists in a preset fraud website feature library; if yes, determining the websites to be identified as fraud websites, and determining fraud website classification of the websites to be identified according to fraud type labels of the standard feature vector sets corresponding to the target standard feature vectors. According to the technical scheme provided by the embodiment of the invention, a new mode for identifying the fraud websites based on the webpage source codes is provided, so that the identification accuracy of the fraud websites is improved, and the personal and property safety of users is effectively protected.

Description

Method, device, equipment and storage medium for identifying fraud websites

Technical Field

The embodiment of the invention relates to the technical field of computers, in particular to a method, a device, equipment and a storage medium for identifying fraud websites.

Background

With the continuous development of phishing techniques, rampant fraud websites seriously endanger people's property. Discovering and blocking fraud websites in a timely manner is therefore of great significance for protecting the personal and property safety of network users.

Currently, existing fraud website identification methods generally detect fraud-related keywords appearing in a website in order to identify it as a fraud website. Specifically, samples of each type of fraud website are collected in advance, and the phrases appearing in the samples are segmented and counted to obtain forward and reverse word frequencies; when the keywords of a website to be identified hit the reverse word frequency, the website to be identified is determined to be a fraud website.

However, when a fraud website does not display keywords in its webpage source code, or writes them in Hypertext Markup Language (HTML) encoded form, for example <title>&#x738B;&#x4E2D;, the prior art cannot accurately identify the fraud website. As a result, the identification accuracy for fraud websites is reduced, and the personal and property safety of users cannot be effectively protected.

Disclosure of Invention

The embodiment of the invention provides a method, a device, equipment and a storage medium for identifying fraud websites, provides a novel method for identifying fraud websites based on webpage source codes, improves the identification accuracy of fraud websites, and effectively protects the personal and property safety of users.

In a first aspect, an embodiment of the present invention provides a method for identifying fraud websites, including:

Acquiring a webpage source code of a website to be identified, and acquiring a target feature vector according to the webpage source code;

judging whether a target standard feature vector matched with the target feature vector exists in a preset fraud website feature library;

the preset fraud website feature library comprises at least one standard feature vector set, and each standard feature vector set comprises at least one standard feature vector;

if so, determining the websites to be identified as fraud websites, and determining fraud website classification of the websites to be identified according to fraud type labels of the standard feature vector sets corresponding to the target standard feature vectors.

In a second aspect, an embodiment of the present invention further provides a device for identifying fraud websites, including:

the target feature vector acquisition module is used for acquiring the webpage source codes of the websites to be identified and acquiring target feature vectors according to the webpage source codes;

the matching judgment module is used for judging whether a target standard feature vector matched with the target feature vector exists in a preset fraud website feature library;

and the classification determining module is used for, if such a target standard feature vector exists, determining the website to be identified as a fraud website, and determining the fraud website classification of the website to be identified according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector.

In a third aspect, an embodiment of the present invention further provides an electronic device, including:

one or more processors;

storage means for storing one or more computer programs;

when the one or more computer programs are executed by the one or more processors, the one or more processors implement the method for identifying fraud websites provided by any embodiment of the present invention.

In a fourth aspect, the embodiment of the present invention further provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the method for identifying fraud websites according to any embodiment of the present invention.

According to the technical scheme provided by the embodiment of the invention, the target feature vector is obtained by obtaining the webpage source code of the website to be identified and according to the webpage source code; when the target standard feature vector matched with the target feature vector exists in the preset fraud website feature library, the websites to be identified are determined to be fraud websites, and the fraud website classification of the websites to be identified is determined according to the fraud type labels of the standard feature vector sets corresponding to the target standard feature vectors, so that the accurate identification of the fraud websites based on the webpage source code is realized, the identification accuracy of the fraud websites is improved, and the personal and property safety of users is effectively protected.

Drawings

FIG. 1 is a flowchart of a method for identifying fraud websites according to an embodiment of the present invention;

FIG. 2 is a flowchart of a method for identifying fraud websites in another embodiment of the present invention;

FIG. 3A is a flowchart of a method for identifying fraud websites in another embodiment of the present invention;

FIG. 3B is a flowchart of a method for identifying fraud websites according to another embodiment of the present invention;

FIG. 4 is a schematic diagram of a fraud site identification apparatus according to another embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present invention.

Detailed Description

Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the invention are shown in the drawings, it is to be understood that the invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; rather, these embodiments are provided for a more thorough and complete understanding of the invention. It should be understood that the drawings and embodiments of the invention are for illustration purposes only and are not intended to limit the scope of the present invention.

It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Furthermore, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.

The term "including" and variations thereof as used herein are intended to be open-ended, i.e., "including, but not limited to". The term "based on" means "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Related definitions of other terms will be given in the description below.

It should be noted that the terms "first," "second," and the like herein are merely used for distinguishing between different devices, modules, or units and not for limiting the order or interdependence of the functions performed by such devices, modules, or units.

It should be noted that the modifiers "one" and "a plurality" in this disclosure are intended to be illustrative rather than limiting, and those skilled in the art will appreciate that they should be construed as "one or more" unless the context clearly indicates otherwise.

The names of messages or information interacted between the devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of such messages or information.

FIG. 1 is a flowchart of a method for identifying fraud websites according to an embodiment of the present invention, which is applicable to the accurate identification of whether the website to be identified is a fraud website by analyzing the webpage source code of the website to be identified; the method may be performed by a fraud site identification means, which may consist of hardware and/or software and may be integrated in an electronic device in general, and in a computer device or a server in typical cases. As shown in fig. 1, the method specifically includes the following steps:

S110, acquiring a webpage source code of a website to be identified, and acquiring a target feature vector according to the webpage source code.

The webpage source code refers to the code written by programmers when designing and building a webpage; typically, when viewing a webpage, a user may right-click the webpage and select the "view page source" option in the pop-up menu to view the source code corresponding to the current webpage. Note that, for a static webpage, the webpage source code obtained in this manner is the complete source code; for a dynamic webpage, the webpage source code obtained in this manner is code in Hypertext Markup Language format.

In this embodiment, after determining the website to be identified, the client may access the website to be identified according to the domain name address of the website to be identified, and perform different operations in the website to be identified, so as to obtain the web page source codes of all the web pages corresponding to the current website to be identified; the method for acquiring the web page source code of the website to be identified is not particularly limited.

The target feature vector refers to vector representation of a plurality of dimension features of the webpage source code; in this embodiment, after the web page source code of the website to be identified is obtained, a corresponding one-dimensional feature zero matrix may be first established according to a preset feature dimension, where the number of columns of the current one-dimensional feature zero matrix is the number of preset feature dimensions, and one element corresponds to one preset feature dimension; further, according to the preset feature dimensions, corresponding feature extraction is carried out in the current webpage source codes so as to obtain feature values under each preset feature dimension; for example, the preset feature dimensions may be a uniform resource locator (Uniform Resource Locator, URL) feature, a text label feature, and a hyperlink feature.

Further, after the feature values corresponding to the preset feature dimensions are obtained, the feature values can be filled into the corresponding positions of the one-dimensional feature zero matrix, so that the feature matrix corresponding to the current webpage source code is obtained, and the target feature vector is then obtained according to the feature values in the feature matrix. It should be noted that, when there are multiple websites to be identified, an N×M feature matrix corresponding to the multiple websites to be identified may be obtained, where N represents the number of websites to be identified and M represents the number of preset feature dimensions; that is, each row of the feature matrix corresponds to one website to be identified.
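The zero-matrix filling described above can be sketched as follows. This is a minimal illustration only: the dimension names in FEATURE_DIMENSIONS and the helper functions are hypothetical and do not come from the embodiment.

```python
from typing import Dict, List, Sequence

# Hypothetical preset feature dimensions; the names are illustrative only.
FEATURE_DIMENSIONS: Sequence[str] = ("url_count", "text_tag_count", "hyperlink_count")


def build_feature_row(values: Dict[str, float]) -> List[float]:
    """Create a 1 x M zero row, then fill in the extracted feature values."""
    row = [0.0] * len(FEATURE_DIMENSIONS)
    for column, name in enumerate(FEATURE_DIMENSIONS):
        if name in values:
            row[column] = float(values[name])
    return row


def build_feature_matrix(sites: Sequence[Dict[str, float]]) -> List[List[float]]:
    """Stack one row per website into an N x M feature matrix."""
    return [build_feature_row(site) for site in sites]


# Two websites (N = 2) and three preset dimensions (M = 3).
matrix = build_feature_matrix([
    {"url_count": 12, "hyperlink_count": 30},
    {"url_count": 3, "text_tag_count": 8, "hyperlink_count": 5},
])
```

A dimension for which no value was extracted simply keeps its initial zero, which is the purpose of starting from a zero matrix.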

In an optional implementation manner of this embodiment, according to the web page source code, obtaining the target feature vector may include: positioning at least one preset target parameter item in the webpage source code, and acquiring a target parameter value of each target parameter item; and forming characteristic values corresponding to at least one target characteristic dimension respectively according to the target parameter values, and generating a target characteristic vector according to the characteristic values.

Wherein the target parameter item may include at least one of a tag type, a tag value, an attribute type, and an attribute value.

Specifically, after the webpage source code is obtained, content analysis can be performed on the webpage source code according to the preset target parameter items so as to obtain the parameter values corresponding to the target parameter items. Taking Hypertext Markup Language as an example, its tag format is <tag name>tag content</tag name>, and the tag content corresponding to each tag name can be obtained by parsing tags in this format. When the target parameter items are the tag type and the tag value, the association between each tag name and its tag content can be extracted from the webpage source code one by one to obtain a plurality of association pairs, thereby obtaining the parameter values of the target parameter items.

It should be noted that the obtained target parameter value may be an HTML-encoded character string. In this embodiment, a mapping relationship between each Chinese character and its corresponding HTML code may be pre-established; when an HTML code is acquired, the corresponding Chinese character can be obtained according to the pre-established mapping relationship. In this way, target parameter values in Chinese-character form can be obtained.
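As a hedged illustration of the decoding step, the sketch below uses Python's standard html.unescape in place of a hand-built mapping table. That substitution is an assumption made for brevity; the embodiment itself describes a pre-established mapping between Chinese characters and HTML codes. The function name decode_parameter_value is hypothetical.

```python
import html


def decode_parameter_value(raw: str) -> str:
    """Convert HTML character references (e.g. &#x738B;) back to characters."""
    return html.unescape(raw)


# The two character references from the Background example.
decoded = decode_parameter_value("&#x738B;&#x4E2D;")
```

Decoding before feature extraction means that an encoded title and its plain-text equivalent yield the same target parameter value.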

In this embodiment, after the parameter values of each target parameter item are obtained, the parameter values may be analyzed according to each target feature dimension to obtain the feature value corresponding to that dimension. For example, when the target feature dimension is the total number of tags, the number of tags may be counted to obtain the feature value under the current target feature dimension. For another example, when the target feature dimension is the number of deduplicated tags, deduplication may be performed on the obtained tag names to obtain tag names without repetition, and the number of tags remaining after deduplication is counted and used as the feature value under the current target feature dimension.

In an alternative implementation of this embodiment, the target feature dimension may include at least one of the total number of target parameter values, the deduplicated number of target parameter values, the total number of tags, the maximum repeated tag count, and the total number of attributes.

The total number of target parameter values refers to the number of all target parameter values, typically the number of entries of all acquired tag types, tag values, attribute types and attribute values. The deduplicated number of target parameter values refers to the number of entries remaining after deduplication; specifically, the target parameter values corresponding to the target parameter items may be deduplicated so that they contain no repeated values, and the number of remaining target parameter values is then counted. The total number of tags refers to the number of all tags acquired in the webpage source code. The maximum repeated tag count refers to the number of occurrences of the most frequently repeated tag among all acquired tags. The total number of attributes refers to the total number of acquired attribute types.
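The five feature dimensions defined above can be sketched with Python's standard html.parser module. This is one possible reading of the definitions, not the implementation of the embodiment; the class ParameterCollector and the function extract_feature_vector are hypothetical names, and counting every attribute occurrence as an "attribute type" entry is an assumption.

```python
from collections import Counter
from html.parser import HTMLParser
from typing import List


class ParameterCollector(HTMLParser):
    """Collects tag names, tag text, attribute names and attribute values."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.tag_names: List[str] = []
        self.tag_values: List[str] = []
        self.attr_names: List[str] = []
        self.attr_values: List[str] = []

    def handle_starttag(self, tag, attrs):
        self.tag_names.append(tag)
        for name, value in attrs:
            self.attr_names.append(name)
            if value is not None:
                self.attr_values.append(value)

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.tag_values.append(text)


def extract_feature_vector(source: str) -> List[int]:
    """Return the five feature values in the order listed in the text."""
    collector = ParameterCollector()
    collector.feed(source)
    collector.close()
    all_values = (collector.tag_names + collector.tag_values
                  + collector.attr_names + collector.attr_values)
    tag_counts = Counter(collector.tag_names)
    return [
        len(all_values),                      # total number of target parameter values
        len(set(all_values)),                 # deduplicated number of parameter values
        len(collector.tag_names),             # total number of tags
        max(tag_counts.values(), default=0),  # maximum repeated tag count
        len(collector.attr_names),            # total number of attributes
    ]


sample = '<html><body><a href="/a">x</a><a href="/b">y</a></body></html>'
vector = extract_feature_vector(sample)
```

For the sample page, the two a tags make the maximum repeated tag count 2, while html and body each occur once.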

S120, judging whether a target standard feature vector matched with the target feature vector exists in a preset fraud website feature library.

The preset fraud website feature library comprises at least one standard feature vector set, and each standard feature vector set comprises at least one standard feature vector.

In this embodiment, the standard feature vector is a feature vector generated according to the webpage source code of the fraud website; the standard feature vector set is a set formed by a plurality of standard feature vectors, and each standard feature vector set can correspond to one type of fraud website and is provided with a corresponding fraud type label; types of fraud websites may include, among others, gaming fraud websites, pornography fraud websites, swiping fraud websites, and distribution fraud websites.

It should be noted that a certain number of domain name addresses of different types of fraud websites can be obtained in advance, and the fraud websites can be accessed through a terminal device according to each domain name address so as to obtain the webpage source codes of the different types of fraud websites; feature extraction is then performed on each webpage source code to obtain the corresponding standard feature vectors. Further, after the standard feature vectors corresponding to the different types of fraud websites are obtained, all standard feature vectors can be grouped according to the type of fraud website: the standard feature vectors belonging to the same type of fraud website are stored in one standard feature vector set, and the corresponding fraud website type is used as the label of that set. Finally, the preset fraud website feature library is generated from the plurality of standard feature vector sets.
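A minimal sketch of such a library, assuming it is held in memory as a dictionary from fraud type label to a list of standard feature vectors. The storage layout, the function build_feature_library, and the sample vectors are illustrative assumptions; the embodiment does not prescribe a data structure.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Vector = List[float]
FeatureLibrary = Dict[str, List[Vector]]


def build_feature_library(samples: Iterable[Tuple[str, Vector]]) -> FeatureLibrary:
    """Group labelled standard feature vectors into one set per fraud type."""
    library = defaultdict(list)
    for fraud_type, vector in samples:
        library[fraud_type].append(vector)
    return dict(library)


# Made-up vectors, labelled with two of the fraud types named in the text.
library = build_feature_library([
    ("gaming", [10, 8, 4, 2, 2]),
    ("gaming", [11, 8, 4, 2, 3]),
    ("swiping", [40, 25, 18, 6, 9]),
])
```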

In this embodiment, after obtaining the target feature vector corresponding to the website to be identified according to the webpage source code of the website to be identified, searching for the matched target standard feature vector in the preset fraud website feature library according to the target feature vector; if the matched target standard feature vector is determined to be found, the fact that the website to be identified hits a certain type of fraud website is indicated; otherwise, the website to be identified is a normal website or belongs to a type of fraud website not included in the preset fraud website feature library.

It should be noted that searching for the matched target standard feature vector in the preset fraud website feature library may include: calculating the similarity between the target feature vector and each standard feature vector, and judging whether the target feature vector matches the standard feature vector according to the similarity calculation result. For example, the Euclidean distance between the target feature vector and a standard feature vector may be calculated; if the Euclidean distance is smaller than a preset distance threshold, the target feature vector is determined to match the standard feature vector, and that standard feature vector is used as the target standard feature vector. For another example, the cosine of the angle between the target feature vector and each standard feature vector may be calculated by a cosine similarity method (typically, an angle of 0 degrees gives a cosine of 1, an angle of 90 degrees gives a cosine of 0, and an angle of 180 degrees gives a cosine of -1), and whether the target feature vector matches the standard feature vector is judged according to that cosine value.
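Both similarity measures can be written in a few lines. The sketch below is illustrative only; the threshold value 2.0 and the function names are assumptions, since the embodiment leaves the preset distance threshold unspecified.

```python
import math
from typing import Sequence


def euclidean_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_match(target: Sequence[float], standard: Sequence[float],
             distance_threshold: float = 2.0) -> bool:
    """Match when the Euclidean distance is below the (assumed) threshold."""
    return euclidean_distance(target, standard) < distance_threshold
```

Euclidean distance is sensitive to the absolute size of the counts, whereas cosine similarity compares only their proportions, so the two criteria can disagree for pages that share a structure but differ greatly in length.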

S130, if so, determining the websites to be identified as fraud websites, and determining fraud website classification of the websites to be identified according to fraud type labels of the standard feature vector sets corresponding to the target standard feature vectors.

In this embodiment, if it is determined that a target standard feature vector matching the target feature vector exists in the preset fraud website feature library, the current website to be identified may be determined to be a fraud website. Further, the standard feature vector set to which the target standard feature vector belongs can be determined through the preset fraud website feature library, the fraud type label of that standard feature vector set is obtained, the fraud website type corresponding to the set is determined, and that fraud website type is used as the fraud website classification of the website to be identified.
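Steps S120 and S130 together can be sketched as a lookup that returns the fraud type label of the matching set, or None when nothing matches. The dictionary layout, the threshold, and the choice to return the label of the nearest matching vector are assumptions of this sketch; the embodiment only requires that some matching standard feature vector exists.

```python
import math
from typing import Dict, List, Optional, Sequence


def identify(target: Sequence[float],
             library: Dict[str, List[Sequence[float]]],
             distance_threshold: float = 2.0) -> Optional[str]:
    """Return the fraud type label of the closest match, or None if no match."""
    best_label: Optional[str] = None
    best_distance = distance_threshold
    for label, vectors in library.items():
        for standard in vectors:
            distance = math.sqrt(sum((x - y) ** 2 for x, y in zip(target, standard)))
            if distance < best_distance:  # strictly below the threshold
                best_label, best_distance = label, distance
    return best_label


# Made-up library with two labelled standard feature vector sets.
example_library = {
    "gaming": [[10, 8, 4, 2, 2], [11, 8, 4, 2, 3]],
    "swiping": [[40, 25, 18, 6, 9]],
}
hit = identify([10, 8, 4, 2, 3], example_library)
miss = identify([100, 90, 80, 70, 60], example_library)
```

A None result corresponds to the case handled in the next embodiment, where the vector is cached as a problem feature vector.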

It should be noted that, after determining that the website to be identified is a fraud website, a prompt message, for example, "fraud website, please exit in time" may be sent to the user, so as to prompt the user that the website currently visited is a fraud website, thereby ensuring personal and property safety of the user.

In the embodiment, the corresponding target feature vector is generated according to the webpage source code of the website to be identified, and the fraud website classification of the website to be identified is determined in a feature vector matching mode, so that the influence of HTML codes in the webpage source code of the fraud website on the fraud website identification is avoided, the accurate identification of any type of fraud website based on the webpage source code is realized, and the scene applicability of the fraud website identification method is improved.

A further embodiment of the present invention provides a method for identifying fraud websites, which is based on the above embodiment, and specifically introduces a method for determining a classification of websites to be identified when it is determined that no target standard feature vector matching the target feature vector exists in a preset fraud website feature library.

FIG. 2 is a flowchart of a method for identifying fraud websites according to another embodiment of the present invention, wherein the method for identifying fraud websites includes:

S210, acquiring a webpage source code of a website to be identified, and acquiring a target feature vector according to the webpage source code.

S220, judging whether a target standard feature vector matched with the target feature vector exists in the preset fraud website feature library.

S230, if not, caching the target feature vector into a problem feature vector set, and calculating the similarity between the problem feature vectors when the number of problem feature vectors in the problem feature vector set is detected to be greater than or equal to a preset number of feature vectors.

The problem feature vector set is a vector set composed of target feature vectors which are not found to match target standard feature vectors in a preset fraud website feature library. It should be noted that, because the matching target standard feature vector is not found, at this time, it cannot be determined whether the target feature vector belongs to a normal website or to a fraud website type not included in the preset fraud website feature library; therefore, the target feature vector can be used as a problem feature vector to be cached to the problem feature vector set, and then the problem feature vector in the problem feature vector set is analyzed to determine the classification of each problem feature vector corresponding to the website to be identified.

In this embodiment, when the number of problem feature vectors reaches a certain number, the problem feature vectors can be identified, so that the problem of low identification accuracy when the number of problem feature vectors is small can be avoided; specifically, the number of problem feature vectors in the problem feature vector set can be detected in real time, and once the number of problem feature vectors reaches or exceeds the preset number of feature vectors, the similarity between the problem feature vectors is calculated respectively.

The similarity calculation method can comprise an Euclidean distance method and a cosine similarity method; it should be noted that, there is generally a large similarity between feature vectors of the same type of website, and all problem feature vectors can be classified by calculating the similarity between problem feature vectors.
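The caching behaviour of S230 can be sketched as a small class that defers the pairwise comparison until enough problem feature vectors have accumulated. The class ProblemVectorCache, its min_count parameter, and the use of cosine similarity are illustrative assumptions.

```python
import math
from itertools import combinations
from typing import Dict, List, Optional, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class ProblemVectorCache:
    """Hypothetical cache for vectors that matched nothing in the library."""

    def __init__(self, min_count: int) -> None:
        self.min_count = min_count
        self.vectors: List[List[float]] = []

    def add(self, vector: Sequence[float]) -> Optional[Dict[Tuple[int, int], float]]:
        """Cache a vector; return pairwise similarities once min_count is reached."""
        self.vectors.append(list(vector))
        if len(self.vectors) < self.min_count:
            return None
        return {
            (i, j): cosine_similarity(self.vectors[i], self.vectors[j])
            for i, j in combinations(range(len(self.vectors)), 2)
        }


cache = ProblemVectorCache(min_count=3)
first = cache.add([1.0, 0.0])    # below the preset count: nothing computed
second = cache.add([2.0, 0.0])   # still below the preset count
scores = cache.add([0.0, 1.0])   # preset count reached: similarities returned
```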

S240, according to the similarity calculation result, a plurality of target problem feature vectors with the similarity larger than or equal to a preset similarity threshold value are obtained.

In this embodiment, when the similarity between the two problem feature vectors is greater than or equal to a preset similarity threshold, it may be determined that the websites to be identified corresponding to the two problem feature vectors belong to the same website type; therefore, the target problem feature vectors of a plurality of websites to be identified belonging to the same website type can be obtained.
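One way to turn pairwise similarities into groups of target problem feature vectors is to merge every pair whose similarity reaches the threshold. The sketch below does this with a small union-find structure, which makes the grouping transitive; that transitivity, the threshold 0.99, and the function names are assumptions of this sketch rather than requirements of the embodiment.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def group_similar(vectors: Sequence[Sequence[float]],
                  threshold: float = 0.99) -> List[List[int]]:
    """Return index groups whose members are linked by similarity >= threshold."""
    parent = list(range(len(vectors)))

    def find(i: int) -> int:
        # Follow parent links to the group root, shortening the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(vectors)), 2):
        if cosine_similarity(vectors[i], vectors[j]) >= threshold:
            parent[find(i)] = find(j)

    groups: Dict[int, List[int]] = {}
    for i in range(len(vectors)):
        groups.setdefault(find(i), []).append(i)
    # Only groups with at least two members indicate a shared website type.
    return [members for members in groups.values() if len(members) > 1]


groups = group_similar([[1, 0], [2, 0.01], [0, 1], [0.01, 3]])
```

In the example, vectors 0 and 1 point in nearly the same direction, as do vectors 2 and 3, so two groups are returned.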

S250, obtaining the common tags of the webpage source codes corresponding to the target problem feature vectors, and determining, according to the common tags, the classification of the website to be identified corresponding to each target problem feature vector.

When the problem feature vectors are cached, the target parameter values corresponding to each problem feature vector may be cached at the same time. Specifically, after a plurality of target problem feature vectors are obtained, the tag types of each may be obtained from its cached target parameter values, and the tag types of all target problem feature vectors may be compared to obtain the tag types that they have in common. Note that there may be one common tag or several.
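Finding the tag types shared by every target problem feature vector amounts to a set intersection. The function name common_tags and the sample tag lists below are hypothetical.

```python
from typing import Iterable, List, Set


def common_tags(tag_lists: Iterable[Iterable[str]]) -> Set[str]:
    """Return the tag types present in every cached tag list."""
    tag_sets: List[Set[str]] = [set(tags) for tags in tag_lists]
    if not tag_sets:
        return set()
    return set.intersection(*tag_sets)


# Made-up tag types cached for three websites in the same similarity group.
shared = common_tags([
    ["title", "casino", "a"],
    ["casino", "div"],
    ["casino", "a", "p"],
])
```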

In this embodiment, a mapping relationship between tag types and website types may be pre-established; after the common tag of the target problem feature vectors is obtained, the classification of the website to be identified corresponding to each target problem feature vector can be determined according to the common tag and the pre-established mapping relationship. The website type can be a normal website or a fraud website. For example, a mapping relationship between tag types such as Australian gambling house, winice and Mo Bo entertainment city and lottery fraud websites is pre-established; when the common tag of the target problem feature vectors is Australian gambling house, the websites to be identified corresponding to the target problem feature vectors can be classified as lottery fraud websites. Alternatively, a mapping relationship between football, swimming and weightlifting and sports websites is pre-established; when the common tag is football, the websites to be identified corresponding to the target problem feature vectors can be classified as sports websites.

Note that, in the mapping relationship between the pre-established tag type and the website type, the tag type may be HTML code corresponding to the tag type of the chinese character; therefore, when the common tag in the HTML format is obtained, the corresponding website type can be directly determined according to the pre-established mapping relation, so that accurate determination of website classification is realized.

When the webpage source codes corresponding to the target problem feature vectors have multiple common tags, if the website type corresponding to any one of the common tags is a fraud website, the websites to be identified corresponding to the target problem feature vectors may be determined to be fraud websites.
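The tag-to-type mapping and the rule for multiple common tags can be sketched as follows. The table contents, the placeholder tag "casino-brand-a", and the function classify_by_common_tags are hypothetical; only the rule that any fraud-mapped common tag makes the whole group fraudulent comes from the text.

```python
from typing import Dict, Iterable, Optional, Set

# Hypothetical pre-established mapping from tag type to website type.
TAG_TO_SITE_TYPE: Dict[str, str] = {
    "casino-brand-a": "lottery fraud website",
    "football": "sports website",
    "swimming": "sports website",
}
# Website types that count as fraud (assumed list).
FRAUD_SITE_TYPES: Set[str] = {"lottery fraud website"}


def classify_by_common_tags(tags: Iterable[str]) -> Optional[str]:
    """Map common tags to a website type; any fraud mapping takes priority."""
    mapped = [TAG_TO_SITE_TYPE[tag] for tag in sorted(tags) if tag in TAG_TO_SITE_TYPE]
    for site_type in mapped:
        if site_type in FRAUD_SITE_TYPES:
            return site_type
    # No fraud type found: fall back to the first mapped type, if any.
    return mapped[0] if mapped else None


sports = classify_by_common_tags({"football"})
mixed = classify_by_common_tags({"football", "casino-brand-a"})
```

Because the text notes that mapping keys may also be stored in HTML-encoded form, the same lookup works on encoded tags as long as the table is keyed consistently.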

In an optional implementation manner of this embodiment, after obtaining a common tag of the web page source code corresponding to each of the target problem feature vectors, and determining, according to the common tag, classification of the web sites to be identified corresponding to each of the target problem feature vectors, the method may further include:

when the classification of the websites to be identified corresponding to each target problem feature vector is determined to be a fraud website, a new standard feature vector set is established, and all target problem feature vectors are respectively added into the new standard feature vector set; and taking the common label of the webpage source codes corresponding to the target problem feature vectors as the label of the new standard feature vector set, and storing the new standard feature vector set into the fraud website feature library.

It should be noted that, after the website type of the website to be identified corresponding to each target problem feature vector is determined, the target problem feature vectors may also be added to the preset fraud website feature library so as to update it. Specifically, when the websites to be identified corresponding to the target problem feature vectors are classified as fraud websites, a new standard feature vector set is established from the target problem feature vectors, and the common tag of the target problem feature vectors is used as the label of the new standard feature vector set.

Note that, since the standard feature vectors stored in the preset fraud website feature library all correspond to fraud websites, before a new standard feature vector set is established from the target problem feature vectors, it is first determined whether the websites to be identified corresponding to the target problem feature vectors are classified as fraud websites. If they are not classified as fraud websites, the target problem feature vectors do not need to be stored and can be discarded directly; if they are classified as fraud websites, a new standard feature vector set can be established from the target problem feature vectors and stored in the preset fraud website feature library.
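The update rule can be sketched as below, again assuming the dictionary-based library layout used for illustration. The function update_library is hypothetical, and extending an existing set when the label is already present is an assumption of this sketch; the text only describes creating a new set.

```python
from typing import Dict, List, Sequence


def update_library(library: Dict[str, List[List[float]]],
                   group: Sequence[Sequence[float]],
                   common_tag: str,
                   is_fraud: bool) -> bool:
    """Store a fraud group under its common tag; discard non-fraud groups."""
    if not is_fraud:
        return False  # vectors of normal websites are not stored
    library.setdefault(common_tag, []).extend(list(vector) for vector in group)
    return True


feature_library = {"gaming": [[10, 8, 4, 2, 2]]}
stored = update_library(feature_library, [[1, 2], [1, 3]], "new-tag", True)
discarded = update_library(feature_library, [[9, 9]], "football", False)
```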

In this embodiment, when the website to be identified corresponding to a target problem feature vector is classified as a fraud website, the preset fraud website feature library is updated according to that target problem feature vector. This achieves dynamic updating of the preset fraud website feature library, improves the matching accuracy for target feature vectors, and further improves the identification accuracy for fraud websites.

According to the technical scheme provided by this embodiment of the invention, the webpage source code of the website to be identified is acquired, and a target feature vector is obtained from the webpage source code; when it is determined that no target standard feature vector matching the target feature vector exists in the preset fraud website feature library, the target feature vector is cached into a problem feature vector set, and when the number of problem feature vectors in the set is detected to be greater than or equal to the preset feature vector number, the similarities between the problem feature vectors are calculated; a plurality of target problem feature vectors whose similarity is greater than or equal to a preset similarity threshold are obtained from the similarity calculation result; the common label of the webpage source codes corresponding to the target problem feature vectors is then acquired, and the classification of the websites to be identified that correspond to the target problem feature vectors is determined according to the common label. In this way, a website to be identified can still be identified accurately when the preset fraud website feature library contains no target standard feature vector matching its target feature vector, which further improves the identification accuracy.
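
As a rough sketch of the caching and similarity step summarized above, the following assumes cosine similarity as the similarity measure and picks arbitrary values for the preset feature vector number and the preset similarity threshold. The function names and constants are illustrative assumptions rather than details taken from the embodiment.

```python
import math
from itertools import combinations

PRESET_VECTOR_COUNT = 3       # assumed value of the preset feature vector number
SIMILARITY_THRESHOLD = 0.99   # assumed value of the preset similarity threshold


def cosine_similarity(a, b):
    """Cosine similarity, used here as one possible similarity measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def select_target_problem_vectors(problem_vectors):
    """Return cached vectors that belong to at least one sufficiently similar pair."""
    if len(problem_vectors) < PRESET_VECTOR_COUNT:
        return []  # the cache is still too small, so nothing is compared yet
    selected = set()
    for i, j in combinations(range(len(problem_vectors)), 2):
        if cosine_similarity(problem_vectors[i], problem_vectors[j]) >= SIMILARITY_THRESHOLD:
            selected.update((i, j))
    return [problem_vectors[i] for i in sorted(selected)]


cache = [(10, 5, 4, 2, 6), (20, 10, 8, 4, 12), (1, 9, 0, 7, 0)]
similar = select_target_problem_vectors(cache)
```

Here the first two cached vectors point in the same direction and are selected, while the third is left in the cache.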

A further embodiment of the present invention provides a method for identifying fraud websites, which builds on the above embodiment and specifically describes establishing the preset fraud website feature library before the websites to be identified are identified.

FIG. 3A is a flowchart of a method for identifying fraud websites according to another embodiment of the present invention, wherein the method for identifying fraud websites includes:

S310, acquiring webpage source codes of a plurality of fraud websites, and extracting content from the webpage source code of each fraud website according to the target parameter items, so as to acquire the target parameter value corresponding to each target parameter item.

In this embodiment, before the webpage source code of the website to be identified is acquired, the domain name addresses of a plurality of fraud websites may be acquired in advance, and each fraud website is accessed through the terminal device according to the acquired domain name addresses, so as to acquire the webpage source code of each fraud website. Further, content extraction is performed on each webpage source code according to each target parameter item, so as to obtain the target parameter value corresponding to each target parameter item.

S320, acquiring the feature value corresponding to each target feature dimension according to the target parameter values, and generating the standard feature vector corresponding to each fraud website according to the feature values.

Specifically, after target parameter values corresponding to all target parameter items are obtained in the webpage source codes, the target parameter values are counted according to all target feature dimensions so as to obtain feature values corresponding to all target feature dimensions; and further generating standard feature vectors corresponding to the fraud websites according to the feature values.
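
Steps S310 and S320 can be illustrated with the following sketch, which uses Python's standard `html.parser` module. It assumes that the target parameter values are the tag types, attribute types and attribute values found in start tags, and it takes the five feature dimensions from the list in claim 6; how each dimension is computed here is one plausible reading, not a definition given in the text. `ParameterCollector` and `feature_vector` are hypothetical names.

```python
from collections import Counter
from html.parser import HTMLParser


class ParameterCollector(HTMLParser):
    """Collects tag types, attribute types and attribute values from start tags."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.attr_names = []
        self.attr_values = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            self.attr_names.append(name)
            if value is not None:
                self.attr_values.append(value)


def feature_vector(source):
    """Map webpage source code to a 5-dimensional feature vector (assumed layout)."""
    collector = ParameterCollector()
    collector.feed(source)
    values = collector.tags + collector.attr_names + collector.attr_values
    tag_counts = Counter(collector.tags)
    return (
        len(values),                          # target parameter value total
        len(set(values)),                     # target parameter value deduplication amount
        len(collector.tags),                  # label (tag) total
        max(tag_counts.values(), default=0),  # count of the most repeated tag
        len(collector.attr_names),            # attribute total
    )


page = '<html><body><form action="/pay"><input name="card"><input name="pin"></form></body></html>'
vector = feature_vector(page)
```

For the sample page the result is `(11, 9, 5, 2, 3)`: eleven collected values, nine of them distinct, five tags, `input` repeated twice, and three attributes.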

S330, clustering the standard feature vectors through a Kmeans algorithm, and acquiring at least one standard feature vector set according to a clustering result.

In this embodiment, after the standard feature vector corresponding to each fraud website is obtained, the standard feature vectors may be clustered by the Kmeans algorithm so that all the standard feature vectors are divided into a plurality of clusters, that is, a plurality of classified standard feature vector sets are obtained. The Kmeans algorithm divides the input data into a specified number of clusters, the number of clusters being supplied as an input; data within the same cluster have high similarity, while data in different clusters have low similarity. Note that the number of clusters in the Kmeans algorithm can be set adaptively according to task requirements.
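
The clustering in S330 can be sketched with a small, dependency-free K-means loop. The seeding with the first `k` vectors, the fixed iteration count and the Euclidean distance are simplifying assumptions made for this illustration; a library implementation such as scikit-learn's `KMeans` would normally be preferred in practice.

```python
import math


def kmeans(vectors, k, iterations=20):
    """Plain Lloyd-style K-means; the first k vectors seed the centroids."""
    centroids = [tuple(map(float, v)) for v in vectors[:k]]
    assignment = [0] * len(vectors)
    for _ in range(iterations):
        # Assignment step: attach every vector to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(v, centroids[c]))
            for v in vectors
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [v for v, a in zip(vectors, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members) for col in zip(*members))
    clusters = [[] for _ in range(k)]
    for v, a in zip(vectors, assignment):
        clusters[a].append(v)
    return clusters


standard_vectors = [
    (11, 9, 5, 2, 3),
    (90, 40, 30, 12, 25),
    (12, 9, 5, 2, 4),
    (88, 41, 29, 12, 26),
]
vector_sets = kmeans(standard_vectors, k=2)
```

With `k=2` the four sample vectors fall into two standard feature vector sets, one containing the two small pages and one containing the two large pages.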

S340, acquiring, for each standard feature vector set, the common label of the webpage source codes corresponding to the standard feature vectors in the set, and taking the common label as the label of the standard feature vector set.

In this embodiment, after a plurality of standard feature vector sets are obtained, in each standard feature vector set, statistics may be performed on tag types of web source codes corresponding to each standard feature vector, so as to obtain a common tag of each standard feature vector as a tag of a corresponding standard feature vector set.

The common tag may be a tag type shared by all standard feature vectors in the standard feature vector set, or may be a tag type shared by more than a predetermined proportion (for example, 80%) of the standard feature vectors in the set.
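
Both variants of the common tag can be expressed with one proportion parameter, as in the sketch below. The function name `common_labels` and the sample tag sets are hypothetical. The comparison is written as "at least the given proportion", which is an assumption; the text could also be read as requiring strictly more than the proportion.

```python
from collections import Counter


def common_labels(label_sets, min_proportion=1.0):
    """Labels present in at least `min_proportion` of the source codes of one set."""
    total = len(label_sets)
    if total == 0:
        return set()
    # Count each label once per webpage source code.
    counts = Counter(label for labels in label_sets for label in set(labels))
    return {label for label, n in counts.items() if n / total >= min_proportion}


pages = [
    {"form", "input", "div"},
    {"form", "input"},
    {"form", "iframe", "div"},
    {"form", "input", "div"},
    {"form", "input", "div"},
]
strict = common_labels(pages)                        # shared by every page
relaxed = common_labels(pages, min_proportion=0.8)   # shared by at least 80% of pages
```

In this sample only `form` appears in every page, while `input` and `div` each appear in four of the five pages and therefore qualify under the relaxed 80% rule.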

S350, generating a preset fraud website feature library according to the standard feature vector set added with the labels.

In this embodiment, after the standard feature vector set with the tag is obtained, the current standard feature vector sets may be formed into a total set to generate a corresponding preset fraud website feature library, thereby implementing efficient construction of the preset fraud website feature library.

In an optional implementation manner of this embodiment, generating the preset fraud website feature library according to the standard feature vector set after the tag is added may include: judging whether the label of the standard feature vector set is a preset normal label or not; if yes, discarding the standard feature vector set; otherwise, generating a preset fraud website feature library according to the standard feature vector set.

Note that normal label types also exist in the webpage source codes of fraud websites, so the common label of the webpage source codes corresponding to the standard feature vectors may be a normal label, that is, the label of the standard feature vector set is a normal label; in that case, the label of the standard feature vector set cannot be used to classify the websites to be identified accurately. Therefore, before a standard feature vector set is stored, it can first be judged whether the label of the standard feature vector set is a preset normal label; if it is a preset normal label, the current standard feature vector set is not stored; otherwise, the current standard feature vector set is added to the preset fraud website feature library.

When a standard feature vector set has multiple labels and some of them are normal labels, the normal labels can be deleted so that only the labels corresponding to fraud websites are retained; if all of the labels are normal labels, the standard feature vector set is not stored.
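
The normal-label filtering described above might look like the following sketch. The contents of `PRESET_NORMAL_LABELS`, the label names and the tuple-based library entries are hypothetical choices made for illustration.

```python
# Hypothetical preset list of normal labels.
PRESET_NORMAL_LABELS = frozenset({"html", "head", "body", "div"})


def filter_labelled_set(labels, vectors):
    """Return (fraud_labels, vectors) for storage, or None if only normal labels remain."""
    fraud_labels = set(labels) - PRESET_NORMAL_LABELS
    if not fraud_labels:
        return None  # every label is normal, so the set is not stored
    return fraud_labels, vectors


def build_library(labelled_sets):
    """Keep only the sets that still carry at least one fraud-related label."""
    kept = (filter_labelled_set(labels, vectors) for labels, vectors in labelled_sets)
    return [entry for entry in kept if entry is not None]


library = build_library([
    ({"div", "form"}, [(11, 9, 5, 2, 3)]),   # mixed labels: "div" is removed
    ({"html", "body"}, [(4, 4, 2, 1, 0)]),   # only normal labels: set is dropped
])
```

The resulting library contains one entry, labelled only with `form`.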

S360, acquiring a webpage source code of a website to be identified, and acquiring a target feature vector according to the webpage source code.

S370, judging whether target standard feature vectors matched with the target feature vectors exist in the preset fraud website feature library.

S380, if so, determining the website to be identified as a fraud website, and determining the fraud website classification of the website to be identified according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector.

In a specific implementation manner of this embodiment, as shown in fig. 3B, a fraud website webpage source code set is first obtained, and the webpage source code is parsed according to a target parameter item, so as to obtain a corresponding target parameter value; and further, counting and calculating the target parameter value according to the selected characteristic dimension to obtain the characteristic value corresponding to the current characteristic dimension, and generating a corresponding standard characteristic vector according to each characteristic value. Secondly, clustering standard feature vectors corresponding to the webpage source code sets to obtain a plurality of standard feature vector sets; and extracting a common label of each standard feature vector from each standard feature vector set, and taking the common label as a label of the corresponding standard feature vector set to establish a preset fraud website feature library. Finally, after the website to be identified is obtained, the corresponding target feature vector is obtained according to the webpage source code of the website to be identified, and the matching search of the target feature vector is carried out through a preset fraud website feature library, so that the fraud website classification of the website to be identified is determined.
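
The final matching search can be sketched as a lookup over the labelled sets. This passage does not state the matching criterion, so exact equality of feature vectors is assumed here purely for illustration; the label names and the `classify` function are likewise hypothetical.

```python
def classify(target_vector, feature_library):
    """Return (is_fraud, labels); exact equality is the assumed matching rule."""
    for labels, vectors in feature_library:
        if tuple(target_vector) in vectors:
            return True, labels
    return False, None


# Each entry pairs the fraud type labels of a set with its standard feature vectors.
feature_library = [
    (frozenset({"fake-bank"}), {(11, 9, 5, 2, 3), (12, 9, 5, 2, 4)}),
    (frozenset({"fake-shop"}), {(90, 40, 30, 12, 25)}),
]

hit = classify((12, 9, 5, 2, 4), feature_library)
miss = classify((1, 1, 1, 1, 1), feature_library)
```

A hit returns the labels of the matching set as the fraud website classification; a miss corresponds to the case in which the target feature vector would be cached as a problem feature vector.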

According to the technical scheme of this embodiment, before the webpage source code of the website to be identified is acquired, the webpage source codes of a plurality of fraud websites are acquired, and content is extracted from the webpage source code of each fraud website according to the target parameter items to acquire the target parameter value corresponding to each target parameter item; the feature value corresponding to each target feature dimension is acquired according to the target parameter values, and the standard feature vector corresponding to each fraud website is generated according to the feature values; the standard feature vectors are clustered through the Kmeans algorithm, and a plurality of standard feature vector sets are acquired according to the clustering result; for each standard feature vector set, the common label of the webpage source codes corresponding to its standard feature vectors is acquired and taken as the label of the set; and the preset fraud website feature library is generated according to the labelled standard feature vector sets, so that the preset fraud website feature library is constructed efficiently.

FIG. 4 is a schematic diagram of a fraud site identification apparatus according to another embodiment of the present invention. As shown in fig. 4, the apparatus includes: a target feature vector acquisition module 410, a match determination module 420, and a classification determination module 430, wherein:

The target feature vector acquisition module 410 is configured to acquire a web page source code of a website to be identified, and acquire a target feature vector according to the web page source code;

the matching judgment module 420 is configured to judge whether a target standard feature vector matching the target feature vector exists in a preset fraud website feature library;

the classification determining module 430 is configured to determine the website to be identified as a fraud website if yes, and determine a fraud website classification of the website to be identified according to the fraud type tag of the standard feature vector set corresponding to the target standard feature vector.

Optionally, based on the above technical solution, the target feature vector obtaining module 410 includes:

the target parameter value acquisition unit is used for positioning at least one preset target parameter item in the webpage source code and acquiring a target parameter value of each target parameter item; the target parameter item comprises at least one of a tag type, a tag value, an attribute type and an attribute value;

and the target feature vector generation unit is used for forming feature values corresponding to at least one target feature dimension respectively according to each target parameter value and generating a target feature vector according to each feature value.

Optionally, on the basis of the above technical solution, the device for identifying fraud websites further includes:

the similarity calculation module is used for caching the target feature vector into a problem feature vector set, and respectively calculating the similarity between the problem feature vectors when the number of the problem feature vectors in the problem feature vector set is detected to be greater than or equal to the preset feature vector number;

the target problem feature vector acquisition module is used for acquiring a plurality of target problem feature vectors with similarity larger than or equal to a preset similarity threshold according to the similarity calculation result;

the website classification determining module is used for acquiring the common labels of the webpage source codes corresponding to the target problem feature vectors, and determining the classification of the websites to be identified that correspond to the target problem feature vectors according to the common labels;

the standard feature vector set establishing module is used for establishing a new standard feature vector set when determining that the classification corresponding to the websites to be identified respectively to each target problem feature vector is a fraud website, and adding all target problem feature vectors to the new standard feature vector set respectively;

and the standard feature vector set storage module is used for taking the common label of the webpage source codes corresponding to each target problem feature vector as the label of the new standard feature vector set, and storing the new standard feature vector set into the fraud website feature library.

Optionally, based on the above technical solution, the target feature vector obtaining module 410 is further configured to obtain the webpage source codes of a plurality of fraud websites, and extract content from the webpage source code of each fraud website according to the target parameter items, so as to obtain the target parameter value corresponding to each target parameter item; and the device for identifying fraud websites further includes:

the standard feature vector generation module is used for acquiring feature values corresponding to each target feature dimension according to the target parameter values and generating standard feature vectors corresponding to each fraud website respectively according to each feature value;

the standard feature vector set acquisition module is used for carrying out clustering processing on each standard feature vector through a Kmeans algorithm and acquiring at least one standard feature vector set according to a clustering result;

the label acquisition module is used for acquiring a standard feature vector set, wherein each standard feature vector corresponds to a common label of the webpage source code, and the common label is used as a label of the standard feature vector set;

the preset fraud website feature library generation module is used for generating a preset fraud website feature library according to the label-added standard feature vector set.

Optionally, on the basis of the above technical solution, the preset fraud website feature library generation module is specifically configured to determine whether the label of the standard feature vector set is a preset normal label; if yes, discard the standard feature vector set; otherwise, generate the preset fraud website feature library according to the standard feature vector set.

The device can execute the fraud website identification method provided by the embodiment of the invention, and has the corresponding functional modules and beneficial effects of executing the method. Technical details which are not described in detail in the embodiments of the present invention can be seen in the identification method of fraud websites provided in the foregoing embodiments of the present invention.

Fig. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present invention. As shown in fig. 5, the electronic device includes a processor 510, a memory 520, an input device 530, and an output device 540; the number of processors 510 in the electronic device may be one or more, one processor 510 being taken as an example in fig. 5; the processor 510, memory 520, input device 530, and output device 540 in the electronic device may be connected by a bus or by other means, connection by a bus being taken as an example in fig. 5. The memory 520, as a computer readable storage medium, may be used to store software programs, computer executable programs, and modules, such as the program instructions/modules corresponding to the fraud website identification method in any embodiment of the present invention (e.g., the target feature vector acquisition module 410, the match judgment module 420, and the classification determination module 430 in the fraud site identification apparatus). The processor 510 executes the various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory 520, that is, implements the fraud website identification method described above. That is, the program, when executed by the processor, implements: acquiring a webpage source code of a website to be identified, and acquiring a target feature vector according to the webpage source code; judging whether a target standard feature vector matching the target feature vector exists in a preset fraud website feature library; and if yes, determining the website to be identified as a fraud website, and determining the fraud website classification of the website to be identified according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector.

The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for a function; the data storage area may store data created according to the use of the terminal, and the like. In addition, the memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 520 may further include memory located remotely from the processor 510, which may be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The input device 530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device, and may include a keyboard, a mouse, and the like. The output device 540 may include a display device such as a display screen.

Optionally, the electronic device may be a server, and the server may be an independent server, or may be a cloud server that provides cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), and basic cloud computing services such as big data and artificial intelligence platforms.

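The embodiments of the present invention also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method according to any of the embodiments of the present invention. Of course, the computer program stored on the computer readable storage medium provided by the embodiments of the present invention may perform the related operations in the method for identifying fraud websites provided by any embodiment of the present invention. That is, the program, when executed by the processor, implements: acquiring a webpage source code of a website to be identified, and acquiring a target feature vector according to the webpage source code; judging whether a target standard feature vector matching the target feature vector exists in a preset fraud website feature library; and if yes, determining the website to be identified as a fraud website, and determining the fraud website classification of the website to be identified according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector.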

From the above description of embodiments, it will be clear to a person skilled in the art that the present invention may be implemented by means of software and necessary general purpose hardware, but of course also by means of hardware, although in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present invention may be embodied essentially or in a part contributing to the prior art in the form of a software product, which may be stored in a computer readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, etc., and include several instructions for causing an electronic device (which may be a personal computer, a server, or a network device, etc.) to execute the method according to the embodiments of the present invention.

It should be noted that, in the embodiment of the fraud website identification apparatus, each unit and module included are only divided according to the functional logic, but not limited to the above division, so long as the corresponding functions can be implemented; in addition, the specific names of the functional units are also only for distinguishing from each other, and are not used to limit the protection scope of the present invention.

Note that the above is only a preferred embodiment of the present invention and the technical principle applied. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, while the invention has been described in connection with the above embodiments, the invention is not limited to the embodiments, but may be embodied in many other equivalent forms without departing from the spirit or scope of the invention, which is set forth in the following claims.

Claims

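1. A method for identifying fraud websites, comprising:

acquiring a webpage source code of a website to be identified, and acquiring a target feature vector according to the webpage source code;

judging whether a target standard feature vector matched with the target feature vector exists in a preset fraud website feature library;

if yes, determining the websites to be identified as fraud websites, and determining fraud website classification of the websites to be identified according to fraud type labels of the standard feature vector sets corresponding to the target standard feature vectors;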

Otherwise, caching the target feature vector into a problem feature vector set, and respectively calculating the similarity between the problem feature vectors when the number of the problem feature vectors in the problem feature vector set is detected to be greater than or equal to the number of the preset feature vectors;

according to the similarity calculation result, a plurality of target problem feature vectors with similarity larger than or equal to a preset similarity threshold value are obtained;

and obtaining a common label of the webpage source codes corresponding to the target problem feature vectors, and determining classification corresponding to the websites to be identified respectively with the target problem feature vectors according to the common label.

2. The method of claim 1, wherein obtaining a target feature vector from the web page source code comprises:

positioning at least one preset target parameter item in the webpage source code, and acquiring a target parameter value of each target parameter item; the target parameter item comprises at least one of a tag type, a tag value, an attribute type and an attribute value;

and forming characteristic values corresponding to at least one target characteristic dimension respectively according to the target parameter values, and generating a target characteristic vector according to the characteristic values.

3. The method of claim 1, further comprising, after obtaining a common label of the webpage source codes corresponding to the target problem feature vectors, and determining classification corresponding to the websites to be identified respectively with the target problem feature vectors according to the common label:

when the classification of the websites to be identified corresponding to each target problem feature vector is determined to be a fraud website, a new standard feature vector set is established, and all target problem feature vectors are respectively added into the new standard feature vector set;

and taking the common label of the webpage source codes corresponding to the target problem feature vectors as the label of the new standard feature vector set, and storing the new standard feature vector set into the fraud website feature library.

4. The method of claim 2, further comprising, prior to obtaining the web page source code for the web site to be identified:

acquiring webpage source codes of a plurality of fraud websites, extracting content of the webpage source codes of each fraud website according to the target parameter items, and acquiring target parameter values corresponding to each target parameter item;

acquiring characteristic values corresponding to each target characteristic dimension according to the target parameter values, and generating standard characteristic vectors corresponding to each fraud website respectively according to each characteristic value;

Clustering the standard feature vectors through a Kmeans algorithm, and acquiring at least one standard feature vector set according to a clustering result;

obtaining a standard feature vector set, wherein each standard feature vector corresponds to a common label of a webpage source code, and taking the common label as a label of the standard feature vector set;

and generating a preset fraud website feature library according to the standard feature vector set added with the label.

5. The method of claim 4, wherein generating a pre-set fraud website feature library from the tagged standard feature vector set comprises:

judging whether the label of the standard feature vector set is a preset normal label or not;

if yes, discarding the standard feature vector set; otherwise, generating a preset fraud website feature library according to the standard feature vector set.

6. The method of claim 2 or 4, wherein the target feature dimension comprises at least one of a target parameter value total, a target parameter value deduplication amount, a label total, a duplicate maximum label number, and an attribute total.

7. A fraud site identification device, comprising:

the classification determining module is used for determining the website to be identified as a fraud website if the website to be identified is the fraud website, and determining the fraud website classification of the website to be identified according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector;

8. An electronic device, comprising:

one or more processors;

storage means for storing one or more computer programs;

when the one or more computer programs are executed by the one or more processors, the one or more processors implement the method for identifying fraud websites of any of claims 1-6.

9. A computer readable storage medium having stored thereon a computer program, wherein the computer program when executed by a processor implements the fraud website identification method as defined in any of claims 1-6.