CN111324831A

CN111324831A - Method and device for detecting fraudulent website

Info

Publication number: CN111324831A
Application number: CN201811545469.8A
Authority: CN
Inventors: 张锴
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Group Beijing Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Group Beijing Co Ltd
Priority date: 2018-12-17
Filing date: 2018-12-17
Publication date: 2020-06-23

Abstract

The invention discloses a method and a device for detecting a fraudulent website. Text information of a webpage in the website to be detected is acquired; text word segmentation is performed on the text information to obtain at least one word; the at least one word is classified based on a preset model to obtain a classification result; and, if the classification result is determined to be preset identification information, the website to be detected is determined to be a fraudulent website, the preset identification information indicating that a website is a fraudulent website. Because the text information in the webpage is classified by a preset fast text (fastText) model, that is, by a machine learning algorithm, and the website is judged from the classification result, the accuracy and efficiency of detecting fraudulent websites are improved.

Description

Method and device for detecting fraudulent website

Technical Field

The invention relates to the technical field of network security, and in particular to a method and a device for detecting a fraudulent website.

Background

With the rapid development of information technology, the number of Internet users grows year by year, but most users have relatively weak security awareness, and fraudulent websites pose a serious threat to their property. A fraudulent website may closely imitate a genuine website in order to trick users into entering their account passwords, or it may carry fraudulent content such as fake prize notifications, betting and false advertisements that endangers users' property. Detecting fraudulent websites is therefore particularly important for protecting users from being deceived.

At present, fraudulent websites are usually detected manually by inspecting the text of a website: when words that may indicate fraud, such as "transfer", "recharge" and "password", are found, the website is determined to be fraudulent. This approach relies on manual investigation, so both its detection efficiency and its detection accuracy are low.

Disclosure of Invention

The invention aims to provide a method and a device for detecting a fraudulent website, so as to improve the accuracy of detecting fraudulent websites.

The purpose of the invention is realized by the following technical scheme:

in a first aspect, the present invention provides a method for detecting a fraudulent website, including:

acquiring text information of a webpage in a website to be detected;

performing text word segmentation on the text information to obtain at least one word;

classifying the at least one word based on a preset model to obtain a classification result;

and if the classification result is determined to be the preset identification information, determining that the website to be detected is a fraud website, wherein the preset identification information is used for representing that the website is the fraud website.

Optionally, the preset model is obtained by training in the following way:

obtaining sample data, and selecting words to be trained from the sample data;

determining a word vector of each word according to the vector of the corpus of each word and the average vector of each word;

determining similarity between the word vectors according to the word vectors of each word;

and training the similarity between the determined word vectors to obtain the preset model.

Optionally, the selecting a word to be trained from the sample data includes:

determining a probability value of each word according to the frequency of each word to be trained in the corpus and a preset frequency threshold of each word;

determining the probability value of each word in the words according to the probability value of each word;

determining words with probability values smaller than a preset probability value in the probability values of all words in the words;

and selecting the context of the words with the probability value smaller than the preset probability value according to the set window size, and taking the words included in the context as the words to be trained.

Optionally, acquiring text information of a webpage in a website to be detected includes:

and capturing text information of the web page in the website to be detected by using a web crawler, or restoring the text information of the web page in the website to be detected through a web log of the web page.
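The crawler-based acquisition described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the page's HTML has already been downloaded, and the names `TextExtractor` and `extract_text` are hypothetical. It uses only Python's standard `html.parser` module and skips the contents of `<script>` and `<style>` elements.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside <script> or <style>
        self.chunks = []       # pieces of visible text, in document order

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is outside script/style blocks.
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html):
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


page = ("<html><head><style>p{color:red}</style></head><body>"
        "<p>You won a prize</p><script>var x = 1;</script>"
        "<p>Enter password</p></body></html>")
print(extract_text(page))
```

A real crawler would also need to fetch pages, follow links and handle character encodings; those parts are omitted here.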

In a second aspect, the present invention provides a device for detecting a fraudulent website, including:

the acquisition unit is used for acquiring text information of a webpage in a website to be detected;

the processing unit is used for performing text word segmentation on the text information acquired by the acquisition unit to obtain at least one word, and classifying the at least one word based on a preset model to obtain a classification result;

and the determining unit is used for determining that the website to be detected is a fraud website when the classification result is determined to be the preset identification information, and the preset identification information is used for representing that the website is the fraud website.

Optionally, the preset model is obtained by training in the following way:

the acquisition unit is further configured to: obtaining sample data, and selecting words to be trained from the sample data;

the determination unit is further configured to: determining a word vector of each word according to the vector of the corpus of each word and the average vector of each word; determining the similarity between the word vectors according to the word vector of each word;

the processing unit is further to: and training the similarity between the determined word vectors to obtain the preset model.

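Optionally, the acquisition unit is specifically configured to select a word to be trained from the sample data as follows:

determining a probability value of each word according to the frequency of each word to be trained in the corpus and a preset frequency threshold of each word; determining the probability value of each word in the words according to the probability value of each word; determining words with probability values smaller than a preset probability value in the probability values of all words in the words; and selecting the context of the words with the probability value smaller than the preset probability value according to the set window size, and taking the words included in the context as the words to be trained.

Optionally, the acquisition unit is specifically configured to acquire text information of a web page in a website to be detected in the following manner:

capturing text information of the web page in the website to be detected by using a web crawler, or restoring the text information of the web page in the website to be detected through a web log of the web page.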

In a third aspect, the present invention provides a device for detecting a fraudulent website, including:

a memory for storing program instructions;

a processor for calling the program instructions stored in the memory and executing the method of the first aspect according to the obtained program.

In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions which, when run on a computer, cause the computer to perform the method of the first aspect.

The invention provides a method and a device for detecting a fraudulent website. Text information of a webpage in the website to be detected is acquired; text word segmentation is performed on the text information to obtain at least one word; the at least one word is classified based on a preset model to obtain a classification result; and, if the classification result is determined to be preset identification information, the website to be detected is determined to be a fraudulent website, the preset identification information indicating that a website is a fraudulent website. Because the text information in the webpage is classified by a preset fast text (fastText) model, that is, by a machine learning algorithm, and the website is judged from the classification result, the accuracy and efficiency of detecting fraudulent websites are improved.

Drawings

Fig. 1 is a flowchart of a method for detecting a fraudulent website according to an embodiment of the present application;

Fig. 2 is a flowchart of a method for model training according to an embodiment of the present application;

Fig. 3 is a schematic diagram of a word vector generation model according to an embodiment of the present application;

Fig. 4 is a schematic diagram of determining the similarity of word vectors according to an embodiment of the present application;

Fig. 5 is a flowchart of a method for selecting words to be trained according to an embodiment of the present application;

Fig. 6 is a block diagram illustrating the structure of a device for detecting a fraudulent website according to an embodiment of the present application;

Fig. 7 is a schematic diagram of another device for detecting a fraudulent website according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

At present, detecting fraudulent websites (phishing websites) is necessary and important for a healthy network environment. Existing detection methods, however, require extensive manual intervention, which limits the amount of Internet content that can be analyzed and yields low accuracy.

In view of this, embodiments of the present application provide a method and an apparatus for detecting a fraudulent website, where text information in a website is classified by using a pre-trained classification model, and when a result obtained by the classification corresponds to the fraudulent website, the website to be detected is determined to be the fraudulent website, so that manual intervention is not required, and detection efficiency and accuracy of the fraudulent website are improved.

Fig. 1 is a flowchart illustrating a method for detecting a fraudulent website according to an embodiment of the present application, and referring to fig. 1, the method includes:

S101: Acquire text information of a web page in the website to be detected.

S102: Perform text word segmentation on the text information to obtain at least one word.

Specifically, text word segmentation is a step of preprocessing text information, and can be performed through space and punctuation marks.
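A minimal sketch of segmentation by spaces and punctuation, using Python's standard `re` module; the function name `segment_text` is hypothetical. Note that this approach only suits text whose words are already separated by spaces. Chinese text has no such separators, so a dedicated segmenter would be needed there, which is an assumption outside this sketch.

```python
import re


def segment_text(text):
    """Split text into words, treating spaces and punctuation as separators."""
    # \w+ matches runs of letters, digits and underscores; every other
    # character (whitespace, punctuation) ends the current word.
    return re.findall(r"\w+", text)


print(segment_text("Win a prize! Transfer now."))
# ['Win', 'a', 'prize', 'Transfer', 'now']
```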

S103: Classify the at least one word based on a preset model to obtain a classification result.

In the embodiment of the application, text word segmentation can be performed on text information in a webpage, words after word segmentation are input into a preset model, and the words after word segmentation are classified to obtain a classification result.

Specifically, the preset model may be a fast text (fastText) classification model: a sequence of words (for example, a text or a sentence) is input into the model, and the model outputs the probability that the sequence belongs to each category.

It can be understood that classifying the words obtained by word segmentation amounts to classifying the article, that is, the text, in the web page.

S104: If the classification result is determined to be the preset identification information, determine that the website to be detected is a fraudulent website.

In this embodiment, identification information may be set for the classification result of the preset model, for example, the classification result may include 0 and 1, where 0 may identify that the website is not a fraudulent website, and 1 may identify that the website is a fraudulent website.

Therefore, whether the website to be detected is a fraud website can be determined according to the classification result, and if the classification result is 1, the website to be detected is determined to be a fraud website.
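The decision in S104 can be sketched as below. The sketch assumes that the model returns a probability for each category (0 and 1, as in the example above) and that the classification result is the most probable category; the names `FRAUD_LABEL`, `classification_result` and `is_fraud_website` are hypothetical.

```python
FRAUD_LABEL = 1  # preset identification information: 1 = fraudulent, 0 = not


def classification_result(probabilities):
    """Return the category with the highest probability."""
    return max(probabilities, key=probabilities.get)


def is_fraud_website(probabilities):
    """Return True when the classification result equals the preset label."""
    return classification_result(probabilities) == FRAUD_LABEL


# Illustrative model output for one page: 80 % fraudulent, 20 % not.
print(is_fraud_website({0: 0.2, 1: 0.8}))  # True
```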

Specifically, in the embodiment of the present application, a preset model may be obtained by training using a flow chart of a method as shown in fig. 2, with reference to fig. 2, the method includes:

S201: Acquire sample data, and select words to be trained from the sample data.

S202: Determine a word vector of each word according to the vector of the corpus of each word and the average vector of each word.

S203: Determine the similarity between the word vectors according to the word vector of each word.

S204: Train on the determined similarity between the word vectors to obtain the preset model.

The method steps involved in the examples of the present application will be described in detail below:

(1) Generating word vectors:

In the embodiment of the present application, each word may be represented by a Gaussian mixture with K Gaussian components.

Fig. 3 is a schematic diagram of a word vector generation model according to an embodiment of the present application. Part (a) of Fig. 3 shows a Gaussian component and its subword structure. Part (b) shows the probabilistic fastText model with Gaussian density (PFT-G), in which the mean vector of each Gaussian component is a subword vector. Part (c) shows the probabilistic fastText model with Gaussian mixture density (PFT-GM), in which, for each Gaussian mixture distribution, the mean vector of one component is estimated from the subword structure while the other components are dictionary-based vectors.

The thick arrow in (a) represents the final mean vector, estimated by averaging the gray n-gram vectors.

It should be noted that a dictionary-based vector can be understood as a corpus-based vector, and a word vector is a vector to represent a word.

An n-gram, a construct commonly used in language models for large-vocabulary continuous speech recognition, is illustrated by the following example:

assuming that the Chinese word is "happy and remunerative", when the recognition of the cheating website is performed, the word can be regarded as a word and is inseparable, and the 3-gram and the 4-gram of the word are respectively as follows:

3-grams < happy, happy and remunerative, big remuneration object, remuneration object >.

4-grams < happy, happy and rewarded >.

Wherein the start symbol is represented by < and the end symbol by >.
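The construction of character n-grams with boundary symbols can be sketched as follows. The function name `char_ngrams` is hypothetical, and an English word is used so that the output is easy to check; the same slicing applies to a word made of Chinese characters.

```python
def char_ngrams(word, n):
    """Return the character n-grams of a word wrapped in < and >."""
    marked = "<" + word + ">"  # add the start and end symbols
    # Slide a window of n characters over the marked word.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


print(char_ngrams("where", 3))  # ['<wh', 'whe', 'her', 'ere', 're>']
print(char_ngrams("where", 4))  # ['<whe', 'wher', 'here', 'ere>']
```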

It is to be understood that the word "beautiful" or the like in fig. 3 is merely an exemplary illustration.

A word w is associated with a density function f(x), which can be expressed as:

$$f(x) = \sum_{i=1}^{K} p_{w,i}\, \mathcal{N}\left(x;\ \mu_{w,i},\ \Sigma_{w,i}\right)$$

where $\mu_{w,i}$ denotes the mean vector of the i-th component, $\Sigma_{w,i}$ denotes its covariance matrix, $p_{w,i}$ denotes the component probabilities, which sum to 1, and $\mathcal{N}$ denotes a Gaussian density; the distributions belong to a Hilbert space.

In the embodiment of the present application, the mean vector is estimated from the subword structure as follows:

$$\mu_w = \frac{1}{|NG_w| + 1}\left(v_w + \sum_{g \in NG_w} z_g\right)$$

For the word w, $\mu_w$ denotes the mean vector, obtained by averaging the n-gram vectors and the dictionary-level vector; $z_g$ denotes the vector associated with the n-gram g; $v_w$ denotes the dictionary-based vector of the word w; and $NG_w$ denotes the set of n-grams of w.

In the embodiments of the present application, each mean vector may be estimated from the individual letters or Chinese characters of a word. In the Gaussian mixture model, the mean vector of one Gaussian component is represented by the subword structure, and the mean vectors of the other components are derived from a dictionary (corpus). The fastText model in the present application adds dictionary-based mean vectors in order to reduce the constraint that the subword structure imposes during word segmentation, so that the independent meanings of a word are better captured.
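A minimal sketch of the mean-vector estimate, under the assumption that the mean vector is the plain average of the dictionary-based vector v_w and the n-gram vectors z_g, i.e. their sum divided by the number of n-grams plus one. The function name `mean_vector` and the example values are hypothetical.

```python
def mean_vector(dictionary_vector, ngram_vectors):
    """Average the dictionary-based vector v_w with the n-gram vectors z_g."""
    total = list(dictionary_vector)
    for z in ngram_vectors:
        if len(z) != len(total):
            raise ValueError("all vectors must have the same dimension")
        total = [a + b for a, b in zip(total, z)]
    count = len(ngram_vectors) + 1  # |NG_w| n-gram vectors plus v_w
    return [x / count for x in total]


v_w = [1.0, 1.0]                       # dictionary-based vector (made up)
z = [[3.0, 0.0], [2.0, 2.0]]           # two n-gram vectors (made up)
print(mean_vector(v_w, z))             # [2.0, 1.0]
```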

(2) Calculating word similarity:

If words are represented by vectors, the similarity between word vectors is usually measured by a dot product. When words are represented by distribution functions, as in the embodiment of the present application, a generalized dot product in the Hilbert space L2, namely the expected likelihood kernel, can be used instead.

Specifically, the energy E(f, g) between two Gaussian mixtures f and g, which has a closed form, may be defined as:

$$E(f, g) = \log \langle f, g \rangle_{L_2} = \log \int f(x)\, g(x)\, dx$$

where

$$f(x) = \sum_{i=1}^{K} p_{f,i}\, \mathcal{N}\left(x;\ \mu_{f,i},\ \Sigma_{f,i}\right), \qquad g(x) = \sum_{i=1}^{K} p_{g,i}\, \mathcal{N}\left(x;\ \mu_{g,i},\ \Sigma_{g,i}\right).$$

In the above formulas, $\mathcal{N}$ denotes a Gaussian density, the distributions belonging to a Hilbert space; $\mu$ denotes a mean vector; $\Sigma$ denotes a covariance matrix; f and g denote probability density functions; K denotes the number of Gaussian components; and p denotes the component probabilities, which sum to 1.

From the above equations:

$$E(f, g) = \log \sum_{j=1}^{K} \sum_{i=1}^{K} p_{f,i}\, p_{g,j}\, e^{\xi_{i,j}}$$

where $\xi_{i,j}$ denotes the partial energy corresponding to the similarity between component i of the first word f and component j of the second word g.

Specifically:

$$\xi_{i,j} = \log \mathcal{N}\left(0;\ \mu_{f,i} - \mu_{g,j},\ \Sigma_{f,i} + \Sigma_{g,j}\right) = -\frac{1}{2}\log\det\left(\Sigma_{f,i} + \Sigma_{g,j}\right) - \frac{D}{2}\log(2\pi) - \frac{1}{2}\left(\mu_{f,i} - \mu_{g,j}\right)^{\top}\left(\Sigma_{f,i} + \Sigma_{g,j}\right)^{-1}\left(\mu_{f,i} - \mu_{g,j}\right)$$

In the above formula, D denotes the dimension of the vectors; $\mu_{f,i}$ and $\mu_{g,j}$ denote the mean vectors of component i of the word f and of component j of the word g; and $\Sigma_{f,i}$ and $\Sigma_{g,j}$ denote the corresponding covariance matrices.

The similarity between word vectors can be calculated by the formula referred to above.
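The energy computation can be sketched in plain Python as follows. The sketch assumes diagonal covariance matrices (stored as lists of variances), strictly positive component probabilities, and the partial energy defined as the log of a zero-centred Gaussian density evaluated at the difference of the two means with the two covariances added. The names `partial_energy` and `energy` are hypothetical.

```python
import math


def partial_energy(mu_f, var_f, mu_g, var_g):
    """Log of N(0; mu_f - mu_g, Sigma_f + Sigma_g) for diagonal covariances."""
    xi = 0.0
    for mf, vf, mg, vg in zip(mu_f, var_f, mu_g, var_g):
        var = vf + vg        # variances add up dimension by dimension
        diff = mf - mg
        xi += -0.5 * math.log(2.0 * math.pi * var) - 0.5 * diff * diff / var
    return xi


def energy(f, g):
    """Log of the sum over all component pairs of p_i * q_j * exp(xi_ij).

    f and g are lists of (probability, mean, variances) tuples.
    """
    terms = [math.log(p) + math.log(q) + partial_energy(mf, vf, mg, vg)
             for p, mf, vf in f
             for q, mg, vg in g]
    top = max(terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


word_a = [(0.5, [0.0, 0.0], [1.0, 1.0]), (0.5, [1.0, 1.0], [1.0, 1.0])]
word_b = [(1.0, [0.5, 0.5], [1.0, 1.0])]
word_c = [(1.0, [9.0, 9.0], [1.0, 1.0])]
print(energy(word_a, word_b) > energy(word_a, word_c))  # True: b is closer to a
```

A higher energy indicates more overlap between the two distributions, so it plays the role that the dot product plays for ordinary word vectors.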

In the embodiment of the present application, the similarity between words may be represented as in the schematic diagram of Fig. 4. The words "pop", "rock" and the like shown in Fig. 4 are merely examples and should not be construed as limiting.

Since the result of the classification performed by the trained model may have an error from the actual real classification result, in the embodiment of the present application, a loss function may be used to represent the error between the real classification result and the classification result of the trained model.

Specifically, the loss function can be expressed as follows:

L(f，g)＝max[0,m-E(f,g)+E(f，n)]

It can be understood that the loss function is derived from the similarity calculation formula described above.

In the above formula, m denotes a margin, which can be understood as an error distance. The value of m can be set as required and controls the required difference between E(f, g) and E(f, n): a loss is incurred only when E(f, g) exceeds E(f, n) by less than m. In short, the loss function requires the score of the correct class to be higher than that of an incorrect class by at least m; when this is not satisfied, a loss value is calculated.

n denotes a negative context, that is, a word least likely to belong to the context of the word f, and E(f, n) is the corresponding energy.
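The loss formula above translates directly into code. In this minimal sketch the energies are passed in as precomputed numbers, and the name `margin_loss` is hypothetical.

```python
def margin_loss(energy_pos, energy_neg, margin):
    """max(0, m - E(f, g) + E(f, n)) for a true and a negative context."""
    return max(0.0, margin - energy_pos + energy_neg)


# The true context beats the negative one by 1.5 >= m, so there is no loss.
print(margin_loss(energy_pos=2.0, energy_neg=0.5, margin=1.0))   # 0.0
# The true context beats the negative one by only 0.25 < m.
print(margin_loss(energy_pos=0.5, energy_neg=0.25, margin=1.0))  # 0.75
```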

Further, the method shown in fig. 5 may be referred to select a word to be trained from the sample data, and as shown in fig. 5, the method includes:

S301: Determine the probability value of each word according to the frequency of each word to be trained in the corpus and the preset frequency threshold of each word.

S302: Determine the probability value of each word in the words according to the probability value of each word.

S303: Determine the words whose probability values are smaller than a preset probability value among the probability values of all the words.

S304: Select the context of the words whose probability value is smaller than the preset probability value according to the set window size, and take the words included in the context as the words to be trained.

Specifically, at the beginning of model training, sample data sampling is required to be performed first, and a word to be trained can be selected from the sample data in the embodiment of the application.

Words occur in the corpus with different frequencies: some words occur very frequently yet carry little meaning. In the embodiment of the present application, the frequency of a word in the corpus is therefore used to determine its probability value, and training focuses on the words whose probability value is below the preset probability value, so as to reduce the importance of very frequent words such as "the" and "and".

Specifically, for a word w, its probability value can be expressed as:

$$P(w) = 1 - \sqrt{\frac{t}{f(w)}}$$

where f(w) denotes the frequency of the word w in the corpus, and t denotes the frequency threshold.
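A minimal sketch of the per-word probability value, under the assumption that it follows the subsampling rule commonly used for word embeddings, namely one minus the square root of t divided by f(w), clamped at zero. The function name `subsample_probability` and the numeric values are hypothetical.

```python
import math


def subsample_probability(frequency, threshold):
    """Assumed rule: P(w) = max(0, 1 - sqrt(t / f(w)))."""
    if frequency <= 0:
        raise ValueError("frequency must be positive")
    # Words rarer than the threshold get 0; very frequent words approach 1.
    return max(0.0, 1.0 - math.sqrt(threshold / frequency))


print(subsample_probability(frequency=0.01, threshold=1e-4))   # about 0.9
print(subsample_probability(frequency=1e-5, threshold=1e-4))   # 0.0
```

Under this assumption a very frequent word receives a high value, so keeping only the words below a preset value shifts training toward less common words.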

The negative context is sampled from the distribution

$$P_n(w) \propto U(w)^{x/y}$$

where U(w) is the unigram probability of w; the exponent x/y also reduces the importance of frequent words, shifting the focus of training to other, less common words.
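The effect of raising the unigram probability U(w) to a fractional exponent can be sketched as follows. The exponent value 0.75 is an assumption borrowed from common word-embedding practice, since the text does not give the value of x/y; the function name `negative_sampling_distribution` is hypothetical.

```python
from collections import Counter


def negative_sampling_distribution(tokens, exponent=0.75):
    """Return probabilities proportional to U(w) ** exponent."""
    counts = Counter(tokens)
    total = sum(counts.values())
    # U(w) is the relative frequency of w; the exponent flattens it.
    weights = {w: (c / total) ** exponent for w, c in counts.items()}
    norm = sum(weights.values())
    return {w: v / norm for w, v in weights.items()}


corpus = ["the"] * 8 + ["prize"] * 2
dist = negative_sampling_distribution(corpus)
print(dist)  # "the" drops below its raw share of 0.8, "prize" rises above 0.2
```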

Specifically, in the embodiment of the present application, a word, its true context and a negative context may be sampled; the context of the word is selected using a set window size, and the words included in the selected context are used as the words to be trained.

It should be understood that the set window size may be, for example, a window of size 1, or may be other sizes, which is not limited in the present application.
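Steps S303 and S304 can be sketched as below. The sketch assumes that a probability value has already been computed for every word; the function name `select_training_words` and the sample values are hypothetical.

```python
def select_training_words(tokens, probabilities, max_probability, window=1):
    """Return (word, context words) pairs for words below the preset value."""
    pairs = []
    for i, word in enumerate(tokens):
        if probabilities[word] >= max_probability:
            continue  # S303: keep only words below the preset probability
        left = tokens[max(0, i - window):i]
        right = tokens[i + 1:i + 1 + window]
        pairs.append((word, left + right))  # S304: words inside the window
    return pairs


tokens = ["win", "the", "big", "prize"]
probs = {"win": 0.1, "the": 0.9, "big": 0.2, "prize": 0.1}
print(select_training_words(tokens, probs, max_probability=0.5, window=1))
# [('win', ['the']), ('big', ['the', 'prize']), ('prize', ['big'])]
```

In this sketch a frequent word such as "the" is never chosen as a center word, but it can still appear in the context of a neighbouring word.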

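Preferably, acquiring the text information of the web page in the website to be detected may include: capturing the text information of the web page by using a web crawler, or restoring the text information of the web page from a web log of the web page.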

Furthermore, the above similarity formula contains $\xi_{i,j}$, which can be understood as an energy formula. In the embodiment of the present application, the energy formula $\xi_{i,j}$ may be simplified, which reduces the amount of computation, lowers the number of floating-point operations performed by the program, and improves performance.

Since the loss function is derived from the similarity formula and $\xi$ is a variable of the similarity calculation, $\xi$ is also a variable of the loss function; simplifying the energy formula therefore simplifies the loss function as well.

Specifically, since a spherical covariance is substantially equivalent to a diagonal covariance here, $\xi_{i,j}$ can be simplified to:

$$\xi_{i,j} = -\frac{\alpha}{2}\,\lVert \mu_{f,i} - \mu_{g,j} \rVert^{2}$$

where the hyperparameter $\alpha$ is the scale of the inverse covariance term in the above formula for $\xi_{i,j}$, $\mu_{f,i}$ denotes the mean vector of component i of the word f, and $\mu_{g,j}$ denotes the mean vector of component j of the word g.
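A minimal sketch of the simplified partial energy, assuming it takes the form of minus alpha over two times the squared Euclidean distance between the two mean vectors. The function name `simplified_partial_energy` is hypothetical.

```python
def simplified_partial_energy(mu_f, mu_g, alpha=1.0):
    """Assumed form: -(alpha / 2) * squared distance between the means."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(mu_f, mu_g))
    return -0.5 * alpha * squared_distance


print(simplified_partial_energy([0.0, 0.0], [3.0, 4.0], alpha=2.0))  # -25.0
```

Compared with the full expression, this form needs neither a determinant nor a matrix inverse, which is where the saving in floating-point operations comes from.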

Based on the same concept as the detection method embodiment of the cheating website, the embodiment of the invention also provides a detection device of the cheating website. Fig. 6 is a block diagram illustrating a structure of a detection apparatus for a fraud website according to an embodiment of the present application, including: an acquisition unit 101, a processing unit 102, and a determination unit 103.

The acquiring unit 101 is configured to acquire text information of a web page in a website to be detected.

The processing unit 102 is configured to perform text word segmentation on the text information acquired by the acquiring unit 101 to obtain at least one word, and classify the at least one word based on a preset model to obtain a classification result.

The determining unit 103 is configured to determine that the website to be detected is a fraudulent website when the classification result is determined to be the preset identification information.

The preset identification information is used for representing that the website is a fraud website.
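The division of work among the three units can be sketched as a small class. This is only an illustration of the structure: the text fetcher and the classifier are passed in as stand-in callables, and the class name `FraudWebsiteDetector` and its method names are hypothetical.

```python
import re


class FraudWebsiteDetector:
    """Illustrative grouping of the acquisition, processing and determining units."""

    def __init__(self, fetch_text, classify, fraud_label=1):
        self.fetch_text = fetch_text    # callable: URL -> page text
        self.classify = classify        # callable: list of words -> label
        self.fraud_label = fraud_label  # preset identification information

    def acquire(self, url):
        """Acquisition unit 101: obtain the text of a web page."""
        return self.fetch_text(url)

    def process(self, text):
        """Processing unit 102: segment the text and classify the words."""
        words = re.findall(r"\w+", text)
        return self.classify(words)

    def determine(self, result):
        """Determining unit 103: compare the result with the preset label."""
        return result == self.fraud_label

    def detect(self, url):
        return self.determine(self.process(self.acquire(url)))


# Stand-ins for a real crawler and a trained model (assumptions for the demo).
detector = FraudWebsiteDetector(
    fetch_text=lambda url: "You won a prize",
    classify=lambda words: 1 if "prize" in words else 0,
)
print(detector.detect("http://example.test"))  # True
```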

Specifically, the preset model is obtained by training in the following way:

The acquiring unit 101 is further configured to: acquire sample data, and select words to be trained from the sample data.

The determining unit 103 is further configured to: determine a word vector of each word according to the vector of the corpus of each word and the average vector of each word; and determine the similarity between the word vectors according to the word vector of each word.

The processing unit 102 is further configured to: train on the determined similarity between the word vectors to obtain the preset model.

Further, the obtaining unit 101 is specifically configured to select a word to be trained from the sample data as follows:

determining a probability value of each word according to the frequency of each word to be trained in the corpus and a preset frequency threshold of each word; determining the probability value of each word in the words according to the probability value of each word; determining words with probability values smaller than a preset probability value in the probability values of all words in the words; selecting the context of the words with the probability value smaller than the preset probability value according to the set window size, and taking the words included in the context as the words to be trained.

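Further, the acquiring unit 101 is specifically configured to acquire text information of a web page in a website to be detected as follows: capturing the text information of the web page by using a web crawler, or restoring the text information of the web page from a web log of the web page.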

It should be noted that, for the function implementation of each unit in the above-mentioned detection apparatus for a fraud website in the embodiment of the present invention, reference may be further made to the description of the related method embodiment, which is not described herein again.

An embodiment of the present application further provides another apparatus for detecting a fraudulent website, as shown in fig. 7, the apparatus includes:

a memory 202 for storing program instructions.

A transceiver 201 for receiving and transmitting a fraud website detection instruction.

And the processor 200 is configured to call the program instructions stored in the memory, and execute any one of the method flows described by the processing unit (102) and the determining unit (103) shown in fig. 6 according to the obtained program according to the instructions received by the transceiver 201.

Where in fig. 7 the bus architecture may include any number of interconnected buses and bridges, with various circuits of one or more processors, represented by processor 200, and memory, represented by memory 202, being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface.

The transceiver 201 may comprise a plurality of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatuses over a transmission medium.

The processor 200 is responsible for managing the bus architecture and general processing, and the memory 202 may store data used by the processor 200 in performing operations.

The processor 200 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or a Complex Programmable Logic Device (CPLD).

Embodiments of the present application also provide a computer storage medium for storing computer program instructions for any apparatus described in the embodiments of the present application, which includes a program for executing any method provided in the embodiments of the present application.

The computer storage media may be any available media or data storage device that can be accessed by a computer, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.

As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims

1. A method for detecting a fraudulent website, comprising:

acquiring text information of a webpage in a website to be detected;

2. The method of claim 1, wherein the predetermined model is trained by:

obtaining sample data, and selecting words to be trained from the sample data;

3. The method of claim 2, wherein said selecting a word to be trained from said sample data comprises:

4. The method of claim 1, wherein obtaining text information of web pages in the website to be detected comprises:

5. A fraud detection apparatus for a website, comprising:

6. The apparatus of claim 5, wherein the predetermined model is trained by:

7. The apparatus of claim 6, wherein the obtaining unit is specifically configured to select a word to be trained from the sample data as follows:

8. The apparatus according to claim 5, wherein the acquiring unit is specifically configured to acquire the text information of the web page in the website to be detected as follows:

9. A fraud detection apparatus for a website, comprising:

a memory for storing program instructions;

a processor for calling the program instructions stored in the memory and executing the method of any one of claims 1 to 4 according to the obtained program.

10. A computer readable storage medium having stored thereon computer instructions which, when run on a computer, cause the computer to perform the method of any of claims 1-4.