CN113779481A

CN113779481A - Method, device, equipment and storage medium for identifying fraud websites

Info

Publication number: CN113779481A
Application number: CN202111129835.3A
Authority: CN
Inventors: 周宇轩; 傅强; 蔡琳; 阿曼太; 梁彧; 马寒军; 田野; 王杰; 杨满智; 金红; 陈晓光
Original assignee: Eversec Beijing Technology Co Ltd
Current assignee: Eversec Beijing Technology Co Ltd
Priority date: 2021-09-26
Filing date: 2021-09-26
Publication date: 2021-12-10
Anticipated expiration: 2041-09-26
Also published as: CN113779481B

Abstract

The embodiment of the invention discloses a method, a device, equipment and a storage medium for identifying fraud websites, wherein the method comprises the following steps: acquiring a webpage source code of a website to be identified, and acquiring a target characteristic vector according to the webpage source code; judging whether a target standard feature vector matched with the target feature vector exists in a preset fraud website feature library or not; and if so, determining the websites to be identified as fraud websites, and determining fraud website classification of the websites to be identified according to fraud type labels of the standard feature vector set corresponding to the target standard feature vector. The technical scheme of the embodiment of the invention provides a new method for realizing fraud website identification based on webpage source codes, improves the identification accuracy of fraud websites, and effectively protects the personal and property safety of users.

Description

Method, device, equipment and storage medium for identifying fraud websites

Technical Field

The embodiment of the invention relates to the technical field of computers, in particular to a method, a device, equipment and a storage medium for identifying fraud websites.

Background

With the continuous development of phishing techniques, an endless stream of phishing websites seriously endangers people's property. Discovering and blocking such websites in time is therefore of great significance for improving the personal and property safety of network users.

At present, existing fraud website identification methods usually identify fraud websites by detecting fraud-related keywords that appear in a website. Specifically, samples of each type of fraud website are collected in advance, and the phrases appearing in the samples are segmented and counted to obtain forward and reverse word frequencies; when the keywords of the website to be identified hit the reverse word frequency, the website to be identified is determined to be a fraud website.

However, when a fraud website does not display keywords in the webpage source code, or encodes the source code content as HyperText Markup Language (HTML) character codes, for example <title>&#x738B;&#x4E2D;&#x738B;, the prior art cannot accurately identify the fraud website. The identification accuracy of fraud websites is therefore reduced, and the personal and property safety of users cannot be effectively protected.

Disclosure of Invention

The embodiment of the invention provides a method, a device, equipment and a storage medium for identifying a fraud website, provides a new mode for realizing the identification of the fraud website based on a webpage source code, improves the identification accuracy of the fraud website, and effectively protects the personal and property safety of a user.

In a first aspect, an embodiment of the present invention provides a method for identifying a fraud website, including:

acquiring a webpage source code of a website to be identified, and acquiring a target characteristic vector according to the webpage source code;

judging whether a target standard feature vector matched with the target feature vector exists in a preset fraud website feature library or not;

wherein, the preset fraud website feature library comprises at least one standard feature vector set, and each standard feature vector set comprises at least one standard feature vector;

and if so, determining the websites to be identified as fraud websites, and determining fraud website classifications of the websites to be identified according to fraud type labels of the standard feature vector sets corresponding to the target standard feature vectors.

In a second aspect, an embodiment of the present invention further provides an apparatus for identifying a fraud website, including:

the target characteristic vector acquisition module is used for acquiring a webpage source code of a website to be identified and acquiring a target characteristic vector according to the webpage source code;

the matching judgment module is used for judging whether a target standard feature vector matched with the target feature vector exists in the preset fraud website feature library or not;

and if so, determining the website to be identified as a fraud website, and determining a fraud website classification of the website to be identified according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector.

In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:

one or more processors;

storage means for storing one or more computer programs;

when the one or more computer programs are executed by the one or more processors, the one or more processors implement the method for identifying fraud websites provided by any embodiment of the present invention.

In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the method for identifying a fraud website provided by any embodiment of the present invention is implemented.

According to the technical scheme provided by the embodiment of the invention, the webpage source code of the website to be identified is acquired, and the target feature vector is obtained according to the webpage source code; when a target standard feature vector matching the target feature vector exists in the preset fraud website feature library, the website to be identified is determined to be a fraud website, and its fraud website classification is determined according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector. Accurate identification of fraud websites based on the webpage source code is thereby realized, the identification accuracy of fraud websites is improved, and the personal and property safety of users is effectively protected.

Drawings

FIG. 1 is a flow chart of a method for identifying fraudulent websites according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method of identifying fraudulent websites in another embodiment of the present invention;

FIG. 3A is a flow chart of a method of identifying fraudulent websites in another embodiment of the present invention;

FIG. 3B is a flowchart illustrating a method for identifying fraud websites in another embodiment of the present invention;

FIG. 4 is a schematic structural diagram of an identification device of a fraud website in another embodiment of the present invention;

FIG. 5 is a schematic structural diagram of an electronic device in another embodiment of the invention.

Detailed Description

Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present invention. It should be understood that the drawings and the embodiments of the present invention are illustrative only and are not intended to limit the scope of the present invention.

It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.

The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.

It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence relationship of the functions performed by the devices, modules or units.

It should be noted that the modifiers "a", "an", and "the" in the present invention are illustrative rather than limiting, and those skilled in the art will understand that they should be read as "one or more" unless the context clearly dictates otherwise.

The names of messages or information exchanged between devices in the embodiments of the present invention are for illustrative purposes only, and are not intended to limit the scope of the messages or information.

FIG. 1 is a flowchart of a method for identifying a fraud website according to an embodiment of the present invention. The embodiment is applicable to analyzing the webpage source code of a website to be identified so as to accurately determine whether that website is a fraud website. The method may be performed by an apparatus for identifying fraud websites, which may be implemented in hardware and/or software and may generally be integrated in an electronic device, typically a computer device or a server. As shown in FIG. 1, the method specifically includes the following steps:

S110, acquiring a webpage source code of the website to be identified, and acquiring a target feature vector according to the webpage source code.

The webpage source code is the code written by a programmer when designing and building a webpage. Typically, when viewing a webpage, a user may right-click the page and select the view-page-source option in the pop-up menu to see the source code of the current page. It is worth noting that, for a static webpage, the webpage source code obtained in this way is the complete source code; for a dynamic webpage, the webpage source code obtained in this way is code in HyperText Markup Language format.

In this embodiment, after the website to be identified is determined, a client may access the website according to its domain name address and perform different operations on it, so as to obtain the webpage source code of all webpages of the website. The webpage source code of the website to be identified may also be obtained with packet capture software; this embodiment does not specifically limit how the webpage source code is obtained.

The target feature vector is a vector representation of the features of the webpage source code in multiple dimensions. In this embodiment, after the webpage source code of the website to be identified is acquired, a one-dimensional feature zero matrix may first be established according to the preset feature dimensions, where the number of columns of the matrix equals the number of preset feature dimensions and each element corresponds to one preset feature dimension. Feature extraction is then performed on the webpage source code for each preset feature dimension to obtain the feature value in that dimension. For example, the preset feature dimensions may be a Uniform Resource Locator (URL) feature, a text tag feature, and a hyperlink feature.

Further, after the feature value corresponding to each preset feature dimension is obtained, each feature value may be filled into the corresponding position of the one-dimensional feature zero matrix to obtain the feature matrix corresponding to the current webpage source code, and the target feature vector is then obtained from the feature values in the feature matrix. It should be noted that, when there are multiple websites to be identified, an N×M feature matrix covering all of them may be obtained, where N is the number of websites to be identified and M is the number of preset feature dimensions; that is, each row of the feature matrix corresponds to one website to be identified.
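The zero-matrix filling step described above can be sketched as follows. This is only a minimal illustration: the dimension names in `FEATURE_DIMENSIONS` and the helper names `build_feature_vector` and `build_feature_matrix` are hypothetical, and the feature values are made-up numbers rather than data from the embodiment.

```python
# Hypothetical preset feature dimensions; the embodiment leaves the list open.
FEATURE_DIMENSIONS = ["url_feature", "text_tag_feature", "hyperlink_feature"]


def build_feature_vector(feature_values):
    """Fill a zero vector with the feature values extracted for one website."""
    vector = [0] * len(FEATURE_DIMENSIONS)  # the one-dimensional zero matrix
    for position, name in enumerate(FEATURE_DIMENSIONS):
        # A dimension with no extracted value keeps its initial zero.
        vector[position] = feature_values.get(name, 0)
    return vector


def build_feature_matrix(per_site_values):
    """Stack one row per website, giving an N x M feature matrix."""
    return [build_feature_vector(values) for values in per_site_values]


matrix = build_feature_matrix([
    {"url_feature": 12, "text_tag_feature": 40, "hyperlink_feature": 7},
    {"url_feature": 3, "hyperlink_feature": 1},
])
```

Here N is 2 and M is 3, so each row of `matrix` is the target feature vector of one website to be identified.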

In an optional implementation of this embodiment, obtaining the target feature vector according to the webpage source code may include: locating at least one preset target parameter item in the webpage source code, and acquiring a target parameter value of each target parameter item; and forming, according to the target parameter values, feature values respectively corresponding to at least one target feature dimension, and generating the target feature vector according to the feature values.

Wherein the target parameter item may include at least one of a tag type, a tag value, an attribute type, and an attribute value.

Specifically, after the webpage source code is obtained, its content may be parsed according to the preset target parameter items to obtain the parameter value corresponding to each target parameter item. Taking HyperText Markup Language as an example, a tag has the format <tag name>tag content</tag name>, and parsing tags in this format yields the tag content corresponding to each tag name. When the target parameter items are the tag type and the tag value, the association between tag name and tag content can be extracted one by one from the webpage source code to obtain multiple association pairs, which serve as the parameter values of the target parameter items.
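One possible way to extract these parameter items is sketched below with Python's standard `html.parser` module. The class name `ParameterCollector`, the `VOID_TAGS` set, and the sample page are assumptions made for illustration; the embodiment does not prescribe a particular parser.

```python
from html.parser import HTMLParser

# Tags without a closing tag; they never own text content (illustrative subset).
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class ParameterCollector(HTMLParser):
    """Collect tag types, tag values, attribute types and attribute values."""

    def __init__(self):
        super().__init__()
        self.tag_types = []
        self.tag_values = []        # (tag name, tag content) association pairs
        self.attribute_types = []
        self.attribute_values = []
        self._open_tags = []        # stack of currently open tag names

    def handle_starttag(self, tag, attrs):
        self.tag_types.append(tag)
        if tag not in VOID_TAGS:
            self._open_tags.append(tag)
        for name, value in attrs:
            self.attribute_types.append(name)
            if value is not None:
                self.attribute_values.append(value)

    def handle_endtag(self, tag):
        if tag in self._open_tags:
            # Pop back to the matching start tag, dropping unclosed children.
            while self._open_tags.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text and self._open_tags:
            self.tag_values.append((self._open_tags[-1], text))


def extract_parameters(source):
    collector = ParameterCollector()
    collector.feed(source)
    collector.close()  # flush any buffered trailing text
    return collector


page = ('<html><head><title>Demo</title></head><body>'
        '<a href="/x" class="btn">Go</a><a href="/y">More</a></body></html>')
params = extract_parameters(page)
```

After the call, `params.tag_values` holds the tag name and tag content pairs, and the other three lists hold the remaining parameter items.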

It should be noted that the obtained target parameter value may be an HTML-encoded character string. In this embodiment, a mapping between each Chinese character and its corresponding HTML code may be established in advance; when an HTML code is obtained, the corresponding Chinese character can be looked up in this mapping, so that a target parameter value in Chinese-character form is obtained.
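The lookup can be sketched in two ways. The first function uses a small hand-built table in the spirit of the pre-established mapping; its contents (`CODE_POINT_TO_CHAR`) are hypothetical and cover only two characters. The second swaps in Python's standard `html.unescape`, which is not the mapping described in the embodiment but decodes the same hexadecimal character references.

```python
import html
import re

# Hypothetical excerpt of a pre-established mapping: code point -> character.
CODE_POINT_TO_CHAR = {0x738B: "王", 0x4E2D: "中"}


def decode_with_table(text):
    """Replace hexadecimal character references found in the mapping table."""
    def replace(match):
        code_point = int(match.group(1), 16)
        # Codes missing from the table are left untouched.
        return CODE_POINT_TO_CHAR.get(code_point, match.group(0))
    return re.sub(r"&#x([0-9A-Fa-f]+);", replace, text)


def decode_with_stdlib(text):
    """Decode all HTML character references using the standard library."""
    return html.unescape(text)


encoded = "&#x738B;&#x4E2D;&#x738B;"
decoded = decode_with_table(encoded)
```

A table-based decoder only recognises the characters it was built with, whereas the standard-library call handles any valid reference; which trade-off is preferable depends on how the feature library stores its labels.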

In this embodiment, after the parameter value of each target parameter item is obtained, the parameter values may be analyzed according to each target feature dimension to obtain the feature value corresponding to that dimension. For example, when the target feature dimension is the total number of tags, the tag names may be counted to obtain the feature value in that dimension; as another example, when the target feature dimension is the deduplicated number of tags, the obtained tag names may be deduplicated so that no tag name is repeated, and the number of tag names remaining after deduplication is counted as the feature value in that dimension.

In an optional implementation manner of this embodiment, the target feature dimension may include at least one of a target parameter value total number, a target parameter value deduplication number, a tag total number, a most repeated tag number, and an attribute total number.

The total number of target parameter values is the number of all target parameter values, and may typically be the number of all acquired tag types, tag values, attribute types, and attribute values. The target parameter value deduplication number is the number of target parameter values remaining after deduplication; specifically, the target parameter values of each target parameter item may be deduplicated so that no repeated value remains, and the number of deduplicated target parameter values is then counted. The total number of tags is the number of all tag names obtained from the webpage source code. The most repeated tag number is the number of occurrences of the tag name that appears most often among all acquired tag names. The total number of attributes is the total number of acquired attribute types.
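A minimal sketch of computing these five feature values is given below. The function name `compute_feature_values` and the input lists are illustrative, and deduplicating each parameter item separately before summing is an assumption about how the deduplication number is formed.

```python
from collections import Counter


def compute_feature_values(tag_types, tag_values, attribute_types, attribute_values):
    """Return the five feature values in the order listed in the text."""
    items = [tag_types, tag_values, attribute_types, attribute_values]
    tag_counts = Counter(tag_types)
    return [
        sum(len(item) for item in items),        # total number of parameter values
        sum(len(set(item)) for item in items),   # deduplicated number (per item, assumed)
        len(tag_types),                          # total number of tags
        max(tag_counts.values(), default=0),     # occurrences of the most repeated tag
        len(attribute_types),                    # total number of attributes
    ]


target_vector = compute_feature_values(
    tag_types=["html", "head", "title", "body", "a", "a"],
    tag_values=["Demo", "Go", "More"],
    attribute_types=["href", "class", "href"],
    attribute_values=["/x", "btn", "/y"],
)
```

For this input the vector has one entry per target feature dimension, so it can be compared directly with standard feature vectors built the same way.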

S120, judging whether a target standard feature vector matched with the target feature vector exists in the preset fraud website feature library.

Wherein, the preset fraud website feature library comprises at least one standard feature vector set, and each standard feature vector set comprises at least one standard feature vector.

In this embodiment, the standard feature vector is a feature vector generated according to the webpage source codes of the fraud websites; the standard feature vector set is a set consisting of a plurality of standard feature vectors, each standard feature vector set can correspond to one type of fraud websites and has corresponding fraud type labels; the types of fraud websites can include lottery-type fraud websites, pornographic-type fraud websites, swipe-type fraud websites and distribution-type fraud websites.

It should be noted that a certain number of domain names of different types of fraud websites may be obtained in advance, and a terminal device may access each fraud website according to its domain name address to obtain the webpage source code of the different types of fraud websites; feature extraction is then performed on each webpage source code to obtain the corresponding standard feature vector. After the standard feature vectors corresponding to the different types of fraud websites are obtained, all the standard feature vectors may be grouped according to fraud website type, the standard feature vectors belonging to the same type of fraud website are stored in one standard feature vector set, and the type of the fraud website is used as the label of that set. Finally, the preset fraud website feature library is generated from the multiple standard feature vector sets.
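The grouping step can be sketched as a dictionary keyed by fraud type label. The name `build_feature_library`, the label strings, and the sample vectors are assumptions for illustration, not data from the embodiment.

```python
from collections import defaultdict


def build_feature_library(labelled_samples):
    """Group standard feature vectors into one set per fraud type label."""
    library = defaultdict(list)
    for fraud_type, standard_vector in labelled_samples:
        library[fraud_type].append(list(standard_vector))
    return dict(library)


# Hypothetical (fraud type, standard feature vector) pairs.
samples = [
    ("lottery-type", [15, 13, 6, 2, 3]),
    ("pornographic-type", [80, 60, 30, 9, 20]),
    ("lottery-type", [16, 13, 6, 2, 3]),
]
library = build_feature_library(samples)
```

Each key of `library` plays the role of a fraud type label, and each value is the standard feature vector set carrying that label.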

In this embodiment, after the target feature vector corresponding to the website to be identified is obtained according to the webpage source code of the website to be identified, the matched target standard feature vector can be searched in the preset fraud website feature library according to the target feature vector; if the matched target standard characteristic vector is determined to be found, the website to be identified hits a certain type of fraud website; otherwise, the website to be identified is a normal website, or belongs to a fraud website type not included in the preset fraud website feature library.

It should be noted that searching the preset fraud website feature library for a matching target standard feature vector may include: calculating the similarity between the target feature vector and each standard feature vector, and judging from the similarity whether the target feature vector matches that standard feature vector. For example, the Euclidean distance between the target feature vector and a standard feature vector may be calculated; if the distance is smaller than a preset distance threshold, the target feature vector is determined to match the standard feature vector, and that standard feature vector is taken as the target standard feature vector. As another example, the cosine of the included angle between the target feature vector and each standard feature vector may be calculated with the cosine similarity method (an included angle of 0 degrees gives a cosine of 1, 90 degrees gives 0, and 180 degrees gives -1), and whether the target feature vector matches the standard feature vector is determined from that cosine value.
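Both similarity measures can be sketched in a few lines of plain Python. The function names and the threshold value used in the example are hypothetical; the embodiment does not fix concrete thresholds.

```python
import math


def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_similarity(a, b):
    """Cosine of the included angle; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def matches_by_distance(target, standard, distance_threshold):
    """A match means the distance is below the preset distance threshold."""
    return euclidean_distance(target, standard) < distance_threshold


is_match = matches_by_distance([15, 13, 6, 2, 4], [15, 13, 6, 2, 3], 2.0)
```

Euclidean distance is sensitive to the absolute size of the counts, while cosine similarity compares only their proportions, so the two measures can disagree for pages of very different length.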

S130, if yes, determining the websites to be identified as fraud websites, and determining fraud website classifications of the websites to be identified according to fraud type labels of the standard feature vector sets corresponding to the target standard feature vectors.

In this embodiment, if it is determined that the target standard feature vector matching the target feature vector exists in the preset fraud website feature library, the current website to be identified may be determined as a fraud website; furthermore, a standard feature vector set to which the target standard feature vector belongs can be determined through a preset fraud website feature library, and a fraud type label corresponding to the standard feature vector set is further acquired, so as to determine a fraud website type corresponding to the standard feature vector set, and the fraud website type is used as a fraud website classification of the website to be identified.
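Steps S120 and S130 together amount to a lookup that returns the label of the set containing the matched vector. The sketch below assumes the library is a dictionary from fraud type label to a list of standard feature vectors and uses Euclidean distance; `identify`, the labels, and the threshold are illustrative names and values.

```python
import math


def identify(target_vector, feature_library, distance_threshold):
    """Return the fraud type label of the first matching set, or None."""
    for fraud_type, standard_vectors in feature_library.items():
        for standard_vector in standard_vectors:
            if math.dist(target_vector, standard_vector) < distance_threshold:
                return fraud_type  # label of the set holding the matched vector
    return None  # no match: normal site or a fraud type not yet in the library


# Hypothetical library with one standard feature vector per set.
feature_library = {
    "lottery-type": [[15, 13, 6, 2, 3]],
    "pornographic-type": [[80, 60, 30, 9, 20]],
}
classification = identify([15, 13, 6, 2, 4], feature_library, 2.0)
```

Returning on the first match is a simplification; an implementation could instead keep the closest match across all sets.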

It should be noted that, after the website to be identified is determined to be a fraud website, a prompt message such as "Fraud website, please leave promptly" may be sent to the user to warn that the website currently being visited is a fraud website, so as to protect the personal and property safety of the user.

In the embodiment, the corresponding target feature vectors are generated according to the webpage source codes of the websites to be identified, and the fraud website classification of the websites to be identified is determined in a feature vector matching manner, so that the influence of HTML codes in the webpage source codes of the fraud websites on fraud website identification is avoided, accurate identification of any type of fraud websites based on the webpage source codes is realized, and the scene applicability of the fraud website identification method is improved.

The present embodiment specifically introduces the determination of the classification of websites to be identified when it is determined that there is no target standard feature vector matching the target feature vector in the preset fraud website feature library based on the above embodiments.

FIG. 2 is a flowchart of a method for identifying fraud websites according to still another embodiment of the present invention, which is based on the above technical solution, and the method comprises:

S210, acquiring a webpage source code of the website to be identified, and acquiring a target feature vector according to the webpage source code.

S220, judging whether a target standard feature vector matched with the target feature vector exists in the preset fraud website feature library.

S230, if not, caching the target feature vector into a problem feature vector set, and calculating the similarity between the problem feature vectors when the number of problem feature vectors in the problem feature vector set is detected to be greater than or equal to a preset feature vector number.

The problem feature vector set is a set of target feature vectors for which no matching target standard feature vector was found in the preset fraud website feature library. It should be noted that, because no matching target standard feature vector was found, it cannot yet be determined whether such a target feature vector belongs to a normal website or to a fraud website type not included in the preset fraud website feature library. The target feature vector can therefore be treated as a problem feature vector and cached in the problem feature vector set, and the problem feature vectors in the set are later analyzed to determine the classification of the website to be identified corresponding to each problem feature vector.

In this embodiment, the problem feature vectors can be identified when the problem feature vectors reach a certain number, so that the problem of low identification accuracy when the number of the problem feature vectors is small can be avoided; specifically, the number of the problem feature vectors in the problem feature vector set can be detected in real time, and once the number of the problem feature vectors is determined to reach or exceed the preset number of the feature vectors, the similarity between the problem feature vectors is calculated respectively.

The similarity calculation method may include the Euclidean distance method and the cosine similarity method. It should be noted that feature vectors of websites of the same type are usually highly similar, so all problem feature vectors can be grouped by calculating the similarity between them.

S240, according to the similarity calculation result, a plurality of target problem feature vectors with the similarity greater than or equal to a preset similarity threshold are obtained.

In this embodiment, when the similarity between two problem feature vectors is greater than or equal to the preset similarity threshold, it may be determined that the websites to be identified corresponding to the two problem feature vectors belong to the same website type; in this way, the target problem feature vectors of multiple websites to be identified that belong to the same website type can be obtained.
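Steps S230 and S240 can be sketched as a pairwise comparison over the cached vectors. The constants `PRESET_VECTOR_COUNT` and `SIMILARITY_THRESHOLD`, the function names, and the sample cache are hypothetical. Merging pairs transitively with a small union-find is also an assumption; the embodiment only states that vectors whose similarity reaches the threshold belong to the same type.

```python
import math
from itertools import combinations

PRESET_VECTOR_COUNT = 4       # hypothetical cache size that triggers analysis
SIMILARITY_THRESHOLD = 0.99   # hypothetical preset similarity threshold


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def group_problem_vectors(problem_vectors):
    """Return lists of indices whose vectors are linked by high similarity."""
    if len(problem_vectors) < PRESET_VECTOR_COUNT:
        return []  # too few cached vectors; keep waiting
    parent = list(range(len(problem_vectors)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(len(problem_vectors)), 2):
        similarity = cosine_similarity(problem_vectors[i], problem_vectors[j])
        if similarity >= SIMILARITY_THRESHOLD:
            parent[find(i)] = find(j)  # merge the two groups

    groups = {}
    for i in range(len(problem_vectors)):
        groups.setdefault(find(i), []).append(i)
    # Vectors similar to no other vector are not target problem vectors.
    return [members for members in groups.values() if len(members) > 1]


cache = [[10, 8, 5, 2, 3], [20, 16, 10, 4, 6], [1, 9, 0, 7, 2], [11, 8, 5, 2, 3]]
groups = group_problem_vectors(cache)
```

In this example the first, second, and fourth cached vectors form one group of target problem feature vectors, while the third stays ungrouped.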

S250, acquiring a common label of the webpage source code corresponding to each target problem feature vector, and determining the classification of the websites to be identified corresponding to each target problem feature vector according to the common label.

It should be noted that, when the problem feature vectors are cached, the target parameter values corresponding to the problem feature vectors may be cached at the same time. Specifically, after a plurality of target problem feature vectors are obtained, the corresponding label types may be obtained from the target parameter values corresponding to the target problem feature vectors, and the label types of the target problem feature vectors are counted to obtain the label types that the target problem feature vectors commonly have. Note that the common label may be one or a plurality of labels.

In this embodiment, a mapping between tag types and website types may be established in advance; after the common label of the target problem feature vectors is obtained, the website classification of the website to be identified corresponding to each target problem feature vector may be determined according to the common label and this mapping. The website type may be a normal website or a fraud website. For example, a mapping between tag types such as Macau casinos, Venice men and Wanbo casinos and lottery-type fraud websites is established in advance; when the common label of the target problem feature vectors is Macau casinos, the websites to be identified corresponding to the current target problem feature vectors can be classified as lottery-type fraud websites. Alternatively, a mapping between football, swimming and weightlifting and sports websites is established in advance; when the common label is football, the websites to be identified corresponding to the problem feature vectors can be classified as sports websites.

Note that, in the pre-established mapping between tag types and website types, a tag type may be the HTML code corresponding to a tag type written in Chinese characters; therefore, when a common label in HTML-encoded form is acquired, the corresponding website type can be determined directly from the pre-established mapping, so that the website classification is determined accurately.

It should be noted that, when the webpage source codes corresponding to the target problem feature vectors have multiple common labels, the websites to be identified corresponding to the target problem feature vectors can all be determined to be fraud websites as long as the website type corresponding to any one common label is a fraud website.
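The common-label step can be sketched as a set intersection followed by a lookup. The mapping below reuses the example labels from the text, but the dictionary name `LABEL_TO_SITE_TYPE`, the set `FRAUD_SITE_TYPES`, and the function names are assumptions for illustration.

```python
# Hypothetical pre-established mapping from label to website type.
LABEL_TO_SITE_TYPE = {
    "Macau casinos": "lottery-type fraud website",
    "Venice men": "lottery-type fraud website",
    "Wanbo casinos": "lottery-type fraud website",
    "football": "sports website",
    "swimming": "sports website",
    "weightlifting": "sports website",
}
FRAUD_SITE_TYPES = {"lottery-type fraud website"}


def common_labels(label_lists):
    """Labels shared by every website in the group."""
    label_sets = [set(labels) for labels in label_lists]
    return set.intersection(*label_sets) if label_sets else set()


def classify_group(label_lists):
    """Map the common labels to a website type; any fraud type wins."""
    shared = common_labels(label_lists)
    site_types = {LABEL_TO_SITE_TYPE[label] for label in shared
                  if label in LABEL_TO_SITE_TYPE}
    fraud_types = site_types & FRAUD_SITE_TYPES
    if fraud_types:
        return sorted(fraud_types)[0]
    return sorted(site_types)[0] if site_types else None


group_type = classify_group([
    ["Macau casinos", "football", "login"],
    ["football", "Macau casinos"],
    ["Macau casinos", "football", "news"],
])
```

The group shares two common labels here, and because one of them maps to a fraud website type, the whole group is classified as that fraud type.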

In an optional implementation manner of this embodiment, after obtaining a common tag of a web page source code corresponding to each of the target problem feature vectors, and determining, according to the common tag, a category corresponding to a to-be-identified website respectively corresponding to each of the target problem feature vectors, the method may further include:

when the classification of websites to be identified, which respectively correspond to the target problem feature vectors, as fraud websites is determined, establishing a new standard feature vector set, and respectively adding all the target problem feature vectors to the new standard feature vector set; and taking the common label of the webpage source code corresponding to each target problem feature vector as the label of the new standard feature vector set, and storing the new standard feature vector set into the fraud website feature library.

It should be noted that, after the website type of the website to be identified corresponding to each target problem feature vector is determined, the target problem feature vectors may be added to the preset fraud website feature library to update it. Specifically, when the websites to be identified corresponding to the target problem feature vectors are classified as fraud websites, a new standard feature vector set is established from the target problem feature vectors, and the common label of the target problem feature vectors is used as the label of the new standard feature vector set.

It is worth noting that, since the standard feature vectors stored in the preset fraud website feature library all correspond to fraud websites, before establishing a new standard feature vector set according to each target problem feature vector, first determining whether the classification of each target problem feature vector corresponding to a website to be identified is a fraud website; if the classification of each website to be identified is determined not to be a fraud website, the target problem feature vectors do not need to be stored, and can be directly discarded; if the websites to be identified are classified as fraud websites, a new standard feature vector set can be established based on the target problem feature vectors, and the new standard feature vector set is stored in the preset fraud website feature library.
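The update rule above can be sketched as follows, assuming the feature library is a dictionary from label to a list of standard feature vectors. The function name `update_feature_library` and the sample data are hypothetical, and appending to an existing set when the label is already present is an assumption the text does not address.

```python
def update_feature_library(feature_library, target_problem_vectors,
                           common_label, is_fraud):
    """Store fraud vectors as a new labelled set; discard everything else."""
    if not is_fraud:
        return feature_library  # non-fraud vectors are simply dropped
    new_set = [list(vector) for vector in target_problem_vectors]
    # Assumed behaviour: extend the set if the label already exists.
    feature_library.setdefault(common_label, []).extend(new_set)
    return feature_library


feature_library = {"lottery-type": [[15, 13, 6, 2, 3]]}
update_feature_library(
    feature_library,
    [[40, 31, 12, 5, 9], [41, 31, 12, 5, 9]],
    common_label="Macau casinos",
    is_fraud=True,
)
```

After the call, later target feature vectors that resemble the two new vectors can be matched directly in step S220 without passing through the problem feature vector set again.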

In this embodiment, when the websites to be identified corresponding to the target problem feature vectors are classified as fraud websites, the preset fraud website feature library is updated with the target problem feature vectors. The preset fraud website feature library is thereby updated dynamically, which improves the matching accuracy for target feature vectors and further improves the identification accuracy of fraud websites.

According to the technical scheme provided by the embodiment of the invention, the webpage source code of the website to be identified is acquired, and the target feature vector is obtained according to the webpage source code. When it is determined that no target standard feature vector matching the target feature vector exists in the preset fraud website feature library, the target feature vector is cached into a problem feature vector set, and when the number of problem feature vectors in the problem feature vector set is detected to be greater than or equal to the preset number of feature vectors, the similarity between the problem feature vectors is calculated. According to the similarity calculation result, a plurality of target problem feature vectors whose similarity is greater than or equal to a preset similarity threshold are acquired. The common label of the webpage source codes corresponding to the target problem feature vectors is then acquired, and the classification of the websites to be identified corresponding to the target problem feature vectors is determined according to the common label. In this way, the websites to be identified are accurately identified even when the preset fraud website feature library contains no target standard feature vector matching the target feature vector, which further improves the identification accuracy for the websites to be identified.
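The caching and similarity step summarized above might look like the following sketch. The text does not state which similarity measure is used, so cosine similarity is an assumption here, as is the rule that a vector is selected when it takes part in at least one pair reaching the threshold; the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length numeric vectors (assumed measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_target_problem_vectors(problem_vectors, min_count, threshold):
    """Return the cached vectors that are similar to at least one other vector.

    Nothing is computed until the cache holds at least `min_count` vectors,
    mirroring the preset number of feature vectors in the text.
    """
    if len(problem_vectors) < min_count:
        return []
    selected = set()
    for i, j in combinations(range(len(problem_vectors)), 2):
        if cosine_similarity(problem_vectors[i], problem_vectors[j]) >= threshold:
            selected.update((i, j))
    # Preserve the original cache order in the result.
    return [problem_vectors[i] for i in sorted(selected)]
```

For example, with a cache of three vectors and a threshold of 0.95, two nearly parallel vectors would be selected together while an orthogonal one would stay in the cache.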

This embodiment provides a method for identifying fraud websites. On the basis of the above embodiments, it specifically describes establishing the preset fraud website feature library in advance, before the website to be identified is identified.

FIG. 3A is a flowchart of a method for identifying fraud websites according to still another embodiment of the present invention, which is based on the above technical solution, and the method comprises:

S310, acquiring webpage source codes of a plurality of fraud websites, extracting the content of the webpage source codes of the fraud websites according to the target parameter items, and acquiring target parameter values corresponding to the target parameter items.

In this embodiment, before the webpage source code of the website to be identified is acquired, the domain name addresses of a plurality of fraud websites can be acquired in advance, and the terminal device accesses each fraud website according to the acquired domain name addresses to acquire the webpage source code of each fraud website. Furthermore, content extraction is carried out in each webpage source code according to each target parameter item, so as to obtain the target parameter value corresponding to each target parameter item.

S320, obtaining feature values corresponding to each target feature dimension according to the target parameter values, and generating standard feature vectors respectively corresponding to the fraud websites according to the feature values.

Specifically, after the target parameter values corresponding to the target parameter items are obtained from the webpage source codes, the target parameter values are counted according to the target feature dimensions to obtain the feature value corresponding to each target feature dimension; the standard feature vector corresponding to each fraud website is then generated according to the feature values.
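As an illustration of steps S310 and S320, the sketch below parses a webpage source code with Python's standard html.parser module and counts five dimensions modeled on those listed in claim 7. The exact counting rules are not given in the text, so the interpretation of each dimension, and the names _ParameterCollector and build_feature_vector, are assumptions.

```python
from collections import Counter
from html.parser import HTMLParser


class _ParameterCollector(HTMLParser):
    """Collects label (HTML tag) types, attribute types and attribute values."""

    def __init__(self):
        super().__init__()
        self.tags = []
        self.attr_names = []
        self.attr_values = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)
        for name, value in attrs:
            self.attr_names.append(name)
            if value is not None:
                self.attr_values.append(value)


def build_feature_vector(page_source):
    """Return (value total, deduplicated value count, label total,
    count of the most repeated label, attribute total) for one page."""
    collector = _ParameterCollector()
    collector.feed(page_source)
    collector.close()
    # Assumption: every collected tag, attribute name and attribute value
    # counts as one target parameter value.
    values = collector.tags + collector.attr_names + collector.attr_values
    tag_counts = Counter(collector.tags)
    most_repeated = max(tag_counts.values(), default=0)
    return (
        len(values),
        len(set(values)),
        len(collector.tags),
        most_repeated,
        len(collector.attr_names),
    )
```

Returning a fixed-length tuple keeps vectors from different pages directly comparable, which the later clustering and matching steps rely on.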

S330, clustering each standard feature vector through a Kmeans algorithm, and acquiring at least one standard feature vector set according to a clustering result.

In this embodiment, after the standard feature vector corresponding to each fraud website is obtained, the standard feature vectors may be clustered through the Kmeans algorithm, so as to divide all the standard feature vectors into a plurality of clusters, that is, to obtain a plurality of classified standard feature vector sets. The Kmeans algorithm divides the input data into the specified number of clusters; data similarity is higher within the same cluster and lower between different clusters. It is worth noting that the number of clusters in the Kmeans algorithm can be set adaptively according to the task requirements.
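The clustering step can be illustrated with a small, dependency-free version of Lloyd's iteration for Kmeans. It is a sketch only: the deterministic choice of initial centers (evenly spaced input points) is a simplification made here for reproducibility, not something stated in the text, and a production system would more likely use an established library implementation.

```python
def kmeans(points, k, max_iter=100):
    """Cluster numeric tuples into at most k groups using Lloyd's iteration."""
    n = len(points)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: evenly spaced input points serve as the initial centers.
    centers = [tuple(float(x) for x in points[i * n // k]) for i in range(k)]
    assignment = [-1] * n
    for _ in range(max_iter):
        # Assign every point to its nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            for p in points
        ]
        if new_assignment == assignment:
            break  # Converged: no point changed cluster.
        assignment = new_assignment
        # Move each center to the mean of its members; keep it if it has none.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    clusters = [[] for _ in range(k)]
    for p, a in zip(points, assignment):
        clusters[a].append(p)
    # Drop empty clusters so that every returned set holds at least one vector.
    return [cluster for cluster in clusters if cluster]
```

Each returned cluster plays the role of one standard feature vector set; the number of clusters k is the value that the text says can be set according to the task requirements.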

S340, acquiring, in each standard feature vector set, the common label of the webpage source codes corresponding to the standard feature vectors, and taking the common label as the label of the standard feature vector set.

In this embodiment, after a plurality of standard feature vector sets are obtained, the label types of the webpage source codes corresponding to the standard feature vectors may be counted in each standard feature vector set, so as to obtain the common label of the standard feature vectors as the label of that standard feature vector set.

The common label may be a label type owned by all the standard feature vectors in the standard feature vector set, or may be a label type owned by more than a preset proportion (e.g., 80%) of the standard feature vectors in the standard feature vector set.
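Both definitions of the common label can be expressed in a few lines. In the sketch below, each standard feature vector is represented only by the set of label types of its webpage source code; the function name common_labels, the min_ratio parameter and the sample label strings used with it are hypothetical.

```python
from collections import Counter


def common_labels(label_sets, min_ratio=None):
    """Return the labels shared by the vectors of one standard feature vector set.

    label_sets: one collection of label types per standard feature vector.
    min_ratio:  None requires a label to occur for every vector; otherwise a
                label must occur for more than this proportion of the vectors.
    """
    if not label_sets:
        return set()
    total = len(label_sets)
    # Count each label at most once per vector.
    counts = Counter(label for labels in label_sets for label in set(labels))
    if min_ratio is None:
        return {label for label, count in counts.items() if count == total}
    return {label for label, count in counts.items() if count > min_ratio * total}
```

Relaxing the requirement from "all vectors" to a proportion makes the set label tolerant of a few outlying pages inside a cluster.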

S350, generating a preset fraud website feature library according to the standard feature vector set added with the label.

In this embodiment, after the standard feature vector sets with the tags are obtained, the current standard feature vector sets can be formed into a total set to generate the corresponding preset fraud website feature library, so that the efficient construction of the preset fraud website feature library is realized.

In an optional implementation manner of this embodiment, the generating the preset fraud website feature library according to the tagged standard feature vector set may include: judging whether the label of the standard feature vector set is a preset normal label or not; if so, discarding the standard feature vector set; otherwise, generating a preset fraud website feature library according to the standard feature vector set.

It should be noted that the webpage source codes of fraud websites also contain normal label types, so the common label of the webpage source codes corresponding to the standard feature vectors may be a normal label, that is, the label of the standard feature vector set is a normal label; in this case, the website to be identified cannot be accurately classified according to the label of the standard feature vector set. Therefore, before a standard feature vector set is stored, it can be judged whether its label is a preset normal label; if so, the current standard feature vector set is discarded; otherwise, the current standard feature vector set is added to the preset fraud website feature library.

It should be noted that, when a standard feature vector set has a plurality of labels and only some of them are normal labels, the normal labels can be deleted and only the labels corresponding to fraud websites retained; if all the labels are normal labels, the standard feature vector set is not stored.
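The filtering of normal labels during library construction can be sketched as below. The library is modeled as a list of (labels, vectors) pairs, and the label string "normal" stands in for the preset normal label; both choices, like the function name build_feature_library, are assumptions made for illustration.

```python
# Hypothetical stand-in for the preset normal label(s) mentioned in the text.
NORMAL_LABELS = frozenset({"normal"})


def build_feature_library(labeled_sets, normal_labels=NORMAL_LABELS):
    """Build the fraud website feature library from labeled vector sets.

    labeled_sets: iterable of (labels, vectors) pairs.
    Normal labels are removed from every set; a set whose labels are all
    normal is not stored at all.
    """
    library = []
    for labels, vectors in labeled_sets:
        fraud_labels = set(labels) - set(normal_labels)
        if not fraud_labels:
            continue  # Only normal labels: discard the whole set.
        library.append((fraud_labels, list(vectors)))
    return library
```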

S360, acquiring a webpage source code of the website to be identified, and acquiring a target feature vector according to the webpage source code.

S370, judging whether a target standard feature vector matched with the target feature vector exists in the preset fraud website feature library.

S380, if so, determining the website to be identified as a fraud website, and determining the fraud website classification of the website to be identified according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector.
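Steps S370 and S380 amount to a lookup in the feature library. This part of the text does not spell out the matching criterion, so the sketch below assumes that "matched" means exact equality of all feature values; the library layout of (fraud labels, vectors) pairs and the function name classify_website are likewise hypothetical.

```python
def classify_website(target_vector, library):
    """Look up a target feature vector in the fraud website feature library.

    library: list of (fraud_labels, vectors) pairs.
    Returns (True, labels of the matching set) for a fraud website, or
    (False, empty set) when no standard feature vector matches.
    """
    target = tuple(target_vector)
    for fraud_labels, vectors in library:
        # Assumption: a match is exact equality of every feature value.
        if any(tuple(vector) == target for vector in vectors):
            return True, set(fraud_labels)
    return False, set()
```

A vector that finds no match would, per the earlier embodiment, be cached as a problem feature vector rather than declared safe.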

In a specific implementation manner of this embodiment, as shown in FIG. 3B, a set of phishing website webpage source codes is first obtained, and the webpage source codes are analyzed according to the target parameter items to obtain the corresponding target parameter values; the target parameter values are then counted and calculated according to the selected feature dimensions to obtain the feature values corresponding to those feature dimensions, and the corresponding standard feature vectors are generated according to the feature values. Next, the standard feature vectors corresponding to the webpage source code set are clustered to obtain a plurality of standard feature vector sets; the common label of the standard feature vectors in each standard feature vector set is extracted and used as the label of the corresponding standard feature vector set, so as to establish the preset fraud website feature library. Finally, after the website to be identified is obtained, the corresponding target feature vector is obtained according to the webpage source code of the website to be identified, and the target feature vector is searched for a match in the preset fraud website feature library, so as to determine the fraud website classification of the website to be identified.

According to the technical scheme of the embodiment of the invention, before the webpage source code of the website to be identified is acquired, the webpage source codes of a plurality of fraud websites are acquired, the content of the webpage source code of each fraud website is extracted according to the target parameter items, and the target parameter values corresponding to the target parameter items are acquired. Feature values corresponding to each target feature dimension are obtained according to the target parameter values, and the standard feature vector corresponding to each fraud website is generated according to the feature values. The standard feature vectors are clustered through the Kmeans algorithm, and a plurality of standard feature vector sets are acquired according to the clustering result. In each standard feature vector set, the common label of the webpage source codes corresponding to the standard feature vectors is then acquired and taken as the label of the standard feature vector set. Finally, the preset fraud website feature library is generated according to the labeled standard feature vector sets, thereby realizing efficient construction of the preset fraud website feature library.

FIG. 4 is a schematic structural diagram of an identification device of a fraud website according to another embodiment of the present invention. As shown in FIG. 4, the apparatus includes: a target feature vector obtaining module 410, a matching judgment module 420, and a classification determining module 430, wherein:

a target feature vector obtaining module 410, configured to obtain a webpage source code of a website to be identified, and obtain a target feature vector according to the webpage source code;

a matching judgment module 420, configured to judge whether a target standard feature vector matching the target feature vector exists in the preset fraud website feature library;

and the classification determining module 430 is configured to, if so, determine the website to be identified as a fraud website, and determine the fraud website classification of the website to be identified according to the fraud type label of the standard feature vector set corresponding to the target standard feature vector.

Optionally, on the basis of the foregoing technical solution, the target feature vector obtaining module 410 includes:

a target parameter value obtaining unit, configured to locate at least one preset target parameter item in the web page source code, and obtain a target parameter value of each target parameter item; the target parameter item comprises at least one item of a label type, a label value, an attribute type and an attribute value;

and the target characteristic vector generating unit is used for forming characteristic values respectively corresponding to at least one target characteristic dimension according to each target parameter value and generating a target characteristic vector according to each characteristic value.

Optionally, on the basis of the above technical solution, the identification apparatus for a fraud website further includes:

the similarity calculation module is used for caching the target feature vector into a problem feature vector set if no target standard feature vector matching the target feature vector exists in the preset fraud website feature library, and calculating the similarity between the problem feature vectors when the number of problem feature vectors in the problem feature vector set is detected to be greater than or equal to the preset number of feature vectors;

the target problem feature vector acquisition module is used for acquiring a plurality of target problem feature vectors with the similarity greater than or equal to a preset similarity threshold according to the similarity calculation result;

the website classification determining module is used for acquiring a common label of the webpage source code corresponding to each target problem feature vector, and determining the classification of the website to be identified corresponding to each target problem feature vector according to the common label;

a standard feature vector set establishing module, configured to establish a new standard feature vector set when determining that the websites to be identified respectively corresponding to the target problem feature vectors are fraud websites, and add all the target problem feature vectors to the new standard feature vector set;

and the standard feature vector set storage module is used for taking a common label of the webpage source code corresponding to each target problem feature vector as a label of the new standard feature vector set, and storing the new standard feature vector set into the fraud website feature library.

Optionally, on the basis of the above technical solution, the target feature vector obtaining module 410 is further configured to obtain webpage source codes of a plurality of fraud websites, perform content extraction on the webpage source codes of each fraud website according to the target parameter item, and obtain a target parameter value corresponding to each target parameter item.

the standard characteristic vector generation module is used for acquiring characteristic values corresponding to all target characteristic dimensions according to the target parameter values and generating standard characteristic vectors respectively corresponding to all fraud websites according to all the characteristic values;

the standard feature vector set acquisition module is used for clustering each standard feature vector through a Kmeans algorithm and acquiring at least one standard feature vector set according to a clustering result;

the label acquisition module is used for acquiring, in each standard feature vector set, the common label of the webpage source codes corresponding to the standard feature vectors, and taking the common label as the label of the standard feature vector set;

and the preset fraud website feature library generating module is used for generating a preset fraud website feature library according to the standard feature vector set added with the label.

Optionally, on the basis of the above technical scheme, the preset fraud website feature library generation module is specifically configured to determine whether a tag of the standard feature vector set is a preset normal tag; if so, discarding the standard feature vector set; otherwise, generating a preset fraud website feature library according to the standard feature vector set.

The device can execute the method for identifying the fraud websites provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the method. Technical details that are not described in detail in the embodiments of the present invention can be referred to the identification method of fraud websites provided by the foregoing embodiments of the present invention.

FIG. 5 is a schematic structural diagram of an electronic device according to another embodiment of the present invention. As shown in FIG. 5, the electronic device includes a processor 510, a memory 520, an input device 530, and an output device 540. The number of processors 510 in the electronic device may be one or more, and one processor 510 is taken as an example in FIG. 5. The processor 510, the memory 520, the input device 530, and the output device 540 in the electronic device may be connected by a bus or in other ways, and connection by a bus is taken as an example in FIG. 5. The memory 520, as a computer-readable storage medium, can be used for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the method for identifying fraud websites in any embodiment of the present invention (e.g., the target feature vector obtaining module 410, the matching judgment module 420, and the classification determining module 430 in the identification device of a fraud website). The processor 510 executes the various functional applications and data processing of the electronic device by running the software programs, instructions, and modules stored in the memory 520, that is, it implements the above-mentioned method for identifying fraud websites. That is, the program when executed by the processor implements:

The memory 520 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and an application program required for at least one function, and the data storage area may store data created according to the use of the terminal, and the like. Further, the memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. In some examples, the memory 520 may further include memory located remotely from the processor 510, which may be connected to the electronic device through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The input device 530 may be used to receive input numeric or character information and to generate key signal inputs related to user settings and function control of the electronic device, and may include a keyboard, a mouse, and the like. The output device 540 may include a display device such as a display screen.

Optionally, the electronic device may be a server, and the server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.

Embodiments of the present invention further provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, implements the method according to any of the embodiments of the present invention. Of course, the embodiment of the present invention provides a computer-readable storage medium, which can perform the relevant operations in the method for identifying a fraud website provided by any embodiment of the present invention. That is, the program when executed by the processor implements:

From the above description of the embodiments, it is obvious for those skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly, can also be implemented by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.

It should be noted that, in the embodiment of the fraud website identification apparatus, the units and modules included in the apparatus are divided only according to functional logic, and the division is not limited to the above as long as the corresponding functions can be implemented; in addition, the specific names of the functional units are only for convenience of distinguishing them from each other, and are not used for limiting the protection scope of the present invention.

It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions as will now become apparent to those skilled in the art without departing from the scope of the invention. Therefore, although the present invention has been described in greater detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A method of identifying a fraudulent website, comprising:

2. The method of claim 1, wherein obtaining a target feature vector according to the web page source code comprises:

positioning at least one preset target parameter item in the webpage source code, and acquiring a target parameter value of each target parameter item; the target parameter item comprises at least one item of a label type, a label value, an attribute type and an attribute value;

and forming characteristic values respectively corresponding to at least one target characteristic dimension according to each target parameter value, and generating a target characteristic vector according to each characteristic value.

3. The method according to claim 1, further comprising, after judging whether a target standard feature vector matching the target feature vector exists in the preset fraud website feature library:

otherwise, caching the target feature vectors into a problem feature vector set, and respectively calculating the similarity between the problem feature vectors when the number of the problem feature vectors in the problem feature vector set is detected to be greater than or equal to the number of preset feature vectors;

according to the similarity calculation result, acquiring a plurality of target problem feature vectors with the similarity greater than or equal to a preset similarity threshold;

and acquiring a common label of the webpage source code corresponding to each target problem feature vector, and determining the classification of the website to be identified corresponding to each target problem feature vector according to the common label.

4. The method according to claim 3, wherein after obtaining a common label of a web page source code corresponding to each of the target problem feature vectors and determining a category of a website to be identified corresponding to each of the target problem feature vectors according to the common label, the method further comprises:

when the websites to be identified respectively corresponding to the target problem feature vectors are determined to be classified as fraud websites, establishing a new standard feature vector set, and adding all the target problem feature vectors to the new standard feature vector set;

and taking the common label of the webpage source code corresponding to each target problem feature vector as the label of the new standard feature vector set, and storing the new standard feature vector set into the fraud website feature library.

5. The method of claim 2, further comprising, before obtaining the web page source code of the website to be identified:

acquiring webpage source codes of a plurality of fraud websites, extracting the content of the webpage source codes of the fraud websites according to the target parameter items, and acquiring target parameter values corresponding to the target parameter items;

obtaining characteristic values corresponding to all target characteristic dimensions according to the target parameter values, and generating standard characteristic vectors corresponding to all fraud websites respectively according to all the characteristic values;

clustering each standard feature vector through a Kmeans algorithm, and acquiring at least one standard feature vector set according to a clustering result;

acquiring, in each standard feature vector set, the common label of the webpage source codes corresponding to the standard feature vectors, and taking the common label as the label of the standard feature vector set;

and generating a preset fraud website feature library according to the standard feature vector set added with the label.

6. The method as claimed in claim 5, wherein generating a preset fraud website feature library according to the tagged standard feature vector set comprises:

judging whether the label of the standard feature vector set is a preset normal label or not;

if so, discarding the standard feature vector set; otherwise, generating a preset fraud website feature library according to the standard feature vector set.

7. The method of claim 2 or 5, wherein the target feature dimension comprises at least one of a target parameter value total, a target parameter value deduplication count, a label total, a most repeated label number, and an attribute total.

8. An apparatus for identifying a fraudulent website, comprising:

9. An electronic device, comprising:

one or more processors;

storage means for storing one or more computer programs;

wherein the one or more computer programs, when executed by the one or more processors, cause the one or more processors to implement the method for identifying fraud websites of any one of claims 1-7.

10. A computer-readable storage medium, having stored thereon a computer program, characterized in that the computer program, when being executed by a processor, implements the method of identifying fraud websites of any one of claims 1-7.