CN111324831A - Method and device for detecting fraudulent website - Google Patents

Info

Publication number
CN111324831A
Authority
CN
China
Prior art keywords
word
website
words
determining
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201811545469.8A
Other languages
Chinese (zh)
Inventor
张锴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Group Beijing Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Group Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, China Mobile Group Beijing Co Ltd filed Critical China Mobile Communications Group Co Ltd
Priority to CN201811545469.8A priority Critical patent/CN111324831A/en
Publication of CN111324831A publication Critical patent/CN111324831A/en
Pending legal-status Critical Current

Abstract

The invention discloses a method and a device for detecting a fraudulent website. Text information of a webpage in the website to be detected is acquired; text word segmentation is performed on the text information to obtain at least one word; the at least one word is classified based on a preset model to obtain a classification result; and if the classification result is determined to be preset identification information, the website to be detected is determined to be a fraudulent website, where the preset identification information indicates that a website is a fraudulent website. Because the text information in the webpage is classified by a preset fast text (Fasttext) model and the classification result is used to decide whether the website is fraudulent, classification is realized by a machine learning algorithm, which improves the accuracy and efficiency of fraudulent-website detection.

Description

Method and device for detecting fraudulent website
Technical Field
The invention relates to the technical field of network security, in particular to a method and a device for detecting a cheating website.
Background
Today, with the rapid development of information technology, the number of netizens is increasing year by year. Most netizens, however, have relatively weak security awareness, and the threat that fraudulent websites pose to their property is a serious problem. A fraudulent website may be a website that closely imitates a real website to trick a user into entering an account and password, or a website that contains fraudulent information such as prize winnings, betting, or false advertisements and endangers people's property. To keep users from being deceived, the detection of fraudulent websites is particularly important.
At present, fraudulent websites are usually detected manually by examining the text in a website: when words that may indicate fraud, such as "transfer", "recharge", or "password", are found, the website is determined to be a fraudulent website. This approach requires manual investigation, so both its detection efficiency and its detection accuracy are low.
Disclosure of Invention
The invention aims to provide a method and a device for detecting a cheating website so as to improve the accuracy of detecting the cheating website.
The purpose of the invention is realized by the following technical scheme:
in a first aspect, the present invention provides a method for detecting a fraudulent website, including:
acquiring text information of a webpage in a website to be detected;
performing text word segmentation on the text information to obtain at least one word;
classifying the at least one word based on a preset model to obtain a classification result;
and if the classification result is determined to be the preset identification information, determining that the website to be detected is a fraud website, wherein the preset identification information is used for representing that the website is the fraud website.
Optionally, the preset model is obtained by training in the following way:
obtaining sample data, and selecting words to be trained from the sample data;
determining a word vector of each word according to the vector of the corpus of each word and the average vector of each word;
determining similarity between the word vectors according to the word vectors of each word;
and training the similarity between the determined word vectors to obtain the preset model.
Optionally, the selecting a word to be trained from the sample data includes:
determining a probability value of each word according to the frequency of each word to be trained in the corpus and a preset frequency threshold of each word;
determining the probability value of each word in the words according to the probability value of each word;
determining words with probability values smaller than a preset probability value in the probability values of all words in the words;
and selecting the context of the words with the probability value smaller than the preset probability value according to the set window size, and taking the words included in the context as the words to be trained.
Optionally, acquiring text information of a webpage in a website to be detected includes:
and capturing text information of the web page in the website to be detected by using a web crawler, or restoring the text information of the web page in the website to be detected through a web log of the web page.
In a second aspect, the present invention provides a device for detecting a fraudulent website, including:
the acquisition unit is used for acquiring text information of a webpage in a website to be detected;
the processing unit is used for performing text word segmentation on the text information acquired by the acquisition unit to obtain at least one word, and classifying the at least one word based on a preset model to obtain a classification result;
and the determining unit is used for determining that the website to be detected is a fraud website when the classification result is determined to be the preset identification information, and the preset identification information is used for representing that the website is the fraud website.
Optionally, the preset model is obtained by training in the following way:
the acquisition unit is further configured to: obtaining sample data, and selecting words to be trained from the sample data;
the determination unit is further configured to: determining a word vector of each word according to the vector of the corpus of each word and the average vector of each word; determining the similarity between the word vectors according to the word vector of each word;
the processing unit is further to: and training the similarity between the determined word vectors to obtain the preset model.
Optionally, the obtaining unit is specifically configured to select a word to be trained from the sample data as follows:
determining a probability value of each word according to the frequency of each word to be trained in the corpus and a preset frequency threshold of each word;
determining the probability value of each word in the words according to the probability value of each word;
determining words with probability values smaller than a preset probability value in the probability values of all words in the words;
and selecting the context of the words with the probability value smaller than the preset probability value according to the set window size, and taking the words included in the context as the words to be trained.
Optionally, the acquiring unit is specifically configured to acquire text information of a web page in a website to be detected in the following manner:
and capturing text information of the web page in the website to be detected by using a web crawler, or restoring the text information of the web page in the website to be detected through a web log of the web page.
In a third aspect, the present invention provides a device for detecting a fraudulent website, including:
a memory for storing program instructions;
a processor for calling the program instructions stored in the memory and executing the method of the first aspect according to the obtained program.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions which, when run on a computer, cause the computer to perform the method of the first aspect.
The invention provides a method and a device for detecting a fraudulent website. Text information of a webpage in the website to be detected is acquired; text word segmentation is performed on the text information to obtain at least one word; the at least one word is classified based on a preset model to obtain a classification result; and if the classification result is determined to be preset identification information, the website to be detected is determined to be a fraudulent website, where the preset identification information indicates that a website is a fraudulent website. Because the text information in the webpage is classified by a preset fast text (Fasttext) model and the classification result is used to decide whether the website is fraudulent, classification is realized by a machine learning algorithm, which improves the accuracy and efficiency of fraudulent-website detection.
Drawings
Fig. 1 is a flowchart of a method for detecting a fraudulent website according to an embodiment of the present application;
FIG. 2 is a flow chart of a method for model training according to an embodiment of the present disclosure;
FIG. 3 is a diagram of a word vector generation model according to an embodiment of the present application;
fig. 4 is a schematic diagram of determining similarity of word vectors according to an embodiment of the present application;
FIG. 5 is a flowchart of a method for selecting words to be trained according to an embodiment of the present application;
fig. 6 is a block diagram illustrating a structure of a detection apparatus for a fraud website according to an embodiment of the present application;
fig. 7 is a schematic diagram of a detection apparatus for a fraud website according to an embodiment of the present disclosure.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
At present, detecting fraudulent websites (phishing websites) is necessary and important in the evolving network environment. Existing detection methods rely on manual intervention by a large number of personnel, so the amount of Internet content that can be analyzed is limited and the accuracy is not high.
In view of this, embodiments of the present application provide a method and an apparatus for detecting a fraudulent website, where text information in a website is classified by using a pre-trained classification model, and when a result obtained by the classification corresponds to the fraudulent website, the website to be detected is determined to be the fraudulent website, so that manual intervention is not required, and detection efficiency and accuracy of the fraudulent website are improved.
Fig. 1 is a flowchart illustrating a method for detecting a fraudulent website according to an embodiment of the present application, and referring to fig. 1, the method includes:
S101: Acquire text information of a web page in the website to be detected.
S102: Perform text word segmentation on the text information to obtain at least one word.
Specifically, text word segmentation is a preprocessing step for the text information, and can be performed by splitting on spaces and punctuation marks.
S103: Classify the at least one word based on a preset model to obtain a classification result.
In the embodiment of the application, text word segmentation can be performed on text information in a webpage, words after word segmentation are input into a preset model, and the words after word segmentation are classified to obtain a classification result.
Specifically, the predetermined model may be a fast text (Fasttext) classification model, a sequence of words (for example, a text or a sentence) is input into the model, and the probability that the sequence of words belongs to different categories may be output.
It can be understood that classifying the words obtained after word segmentation amounts to classifying the article in the web page.
S104: If the classification result is determined to be the preset identification information, determine that the website to be detected is a fraudulent website.
In this embodiment, identification information may be set for the classification result of the preset model. For example, the classification result may be 0 or 1, where 0 indicates that the website is not a fraudulent website and 1 indicates that it is a fraudulent website.
Therefore, whether the website to be detected is a fraud website can be determined according to the classification result, and if the classification result is 1, the website to be detected is determined to be a fraud website.
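Steps S101 to S104 can be sketched in a few lines of Python. This is only an illustrative sketch: the names `segment`, `is_fraud_site`, `FRAUD_LABEL` and `toy_classifier` are hypothetical, and `toy_classifier` is a keyword-counting stand-in for the trained preset model, not the Fasttext model itself. Splitting on spaces and punctuation does not separate Chinese words, so a real deployment would presumably swap in a dedicated segmenter.

```python
import re
from typing import Callable, List

FRAUD_LABEL = 1  # preset identification information: 1 marks a fraudulent website


def segment(text: str) -> List[str]:
    """S102: split text into lowercase words on spaces and punctuation."""
    return [w for w in re.split(r"[\W_]+", text.lower()) if w]


def is_fraud_site(text: str, classify: Callable[[List[str]], int]) -> bool:
    """S103 and S104: classify the words and compare with the preset label."""
    words = segment(text)
    if not words:
        return False  # nothing to classify
    return classify(words) == FRAUD_LABEL


def toy_classifier(words: List[str]) -> int:
    """Hypothetical stand-in for the preset model: flags two or more risky words."""
    risky = {"prize", "winner", "transfer", "password"}
    return 1 if sum(w in risky for w in words) >= 2 else 0
```

Passing the classifier as a parameter keeps the decision logic of S104 independent of how the model is trained.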
Specifically, in the embodiment of the present application, the preset model may be obtained by training with the method shown in the flowchart of fig. 2. Referring to fig. 2, the method includes:
S201: Acquire sample data, and select words to be trained from the sample data.
S202: Determine a word vector of each word according to the vector of the corpus of each word and the average vector of each word.
S203: Determine the similarity between the word vectors according to the word vector of each word.
S204: Train on the determined similarity between the word vectors to obtain the preset model.
The method steps involved in the examples of the present application will be described in detail below:
(1) Generating word vectors:
in the embodiment of the present application, each word may be represented by a gaussian mixture having K gaussian components.
Fig. 3 is a schematic diagram of a word vector generation model according to an embodiment of the present application. (a) in fig. 3 shows a Gaussian component and its subword structure. (b) shows the probabilistic FastText model with Gaussian density (PFT-G): the mean vector of each Gaussian component is a subword vector. (c) shows the probabilistic FastText model with Gaussian mixture density (PFT-GM): for each Gaussian mixture distribution, the mean vector of one component is estimated from the subword structure, while the other components are dictionary-based vectors.
The thick arrow in (a) represents the final mean vector, estimated by averaging the gray n-gram vectors.
It should be noted that a dictionary-based vector can be understood as a corpus-based vector, and a word vector is a vector to represent a word.
An n-gram is a contiguous sequence of n units, and the n-gram model is commonly used in large-vocabulary continuous speech recognition. It is described here by example:
Assume the Chinese word is "happy and remunerative". When identifying fraudulent websites, this word can be treated as a single, indivisible word, and its 3-grams and 4-grams are respectively:
3-grams: < happy, happy and remunerative, big remuneration object, remuneration object >.
4-grams: < happy, happy and rewarded >.
Here the start symbol is represented by < and the end symbol by >.
It is to be understood that the word "beautiful" or the like in fig. 3 is merely an exemplary illustration.
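As a rough illustration of the n-gram decomposition with the start symbol < and the end symbol >, the sketch below extracts character n-grams from a word. The function name `char_ngrams` is hypothetical, and the English word "beautiful" is used only because it appears in fig. 3; the same slicing applies to a string of Chinese characters.

```python
from typing import List


def char_ngrams(word: str, n: int) -> List[str]:
    """Return the character n-grams of a word wrapped in < and > markers."""
    marked = f"<{word}>"
    # Slide a window of length n over the marked word.
    return [marked[i:i + n] for i in range(len(marked) - n + 1)]


trigrams = char_ngrams("beautiful", 3)
fourgrams = char_ngrams("beautiful", 4)
```

Words shorter than n - 2 characters simply yield an empty list in this sketch.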
For a word w, w is associated with a density function f(x), which can be expressed as:
$$f(x) = \sum_{i=1}^{K} p_{w,i}\, \mathcal{N}\left(x;\ \mu_{w,i},\ \Sigma_{w,i}\right)$$
where $\mu_{w,i}$ denotes the mean vector of the i-th component, $\Sigma_{w,i}$ denotes its covariance matrix, $p_{w,i}$ denotes the component probabilities, which sum to 1, and $\mathcal{N}$ denotes a Gaussian distribution whose density belongs to the Hilbert space.
In the embodiment of the present application, the mean vector is estimated by using the subword structure, and the formula is as follows:
$$\mu_w = \frac{1}{|NG_w| + 1}\left(v_w + \sum_{g \in NG_w} z_g\right)$$
For the word w, $\mu_w$ denotes the mean vector, computed as the average of the n-gram vectors and the dictionary-level vector; $z_g$ denotes the vector associated with n-gram g; $v_w$ denotes the dictionary-based vector of the word w; and $NG_w$ denotes the set of n-grams of w.
In the embodiments of the present application, each mean vector may be estimated from the individual letters or Chinese characters of a word. For the Gaussian mixture model, the mean vector of one Gaussian component is represented by the subword structure, and the mean vectors of the other components are derived from a dictionary (corpus). The Fasttext model in the present application adds a dictionary-based mean vector in order to reduce the constraint imposed by the subword structure of a segmented word, thereby preserving the independent meaning of the word.
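The mean-vector estimate described above, an average over the dictionary-based vector of the word and the vectors of its n-grams, can be sketched as follows. The function name `mean_vector` and the toy two-dimensional vectors are assumptions made for illustration; in practice the vectors would be learned parameters.

```python
from typing import List, Sequence


def mean_vector(word_vec: Sequence[float],
                ngram_vecs: List[Sequence[float]]) -> List[float]:
    """Average the dictionary-based vector v_w with the n-gram vectors z_g."""
    count = len(ngram_vecs) + 1  # |NG_w| + 1
    total = list(word_vec)
    for z in ngram_vecs:
        for d, value in enumerate(z):
            total[d] += value
    return [s / count for s in total]


# Toy example: one dictionary vector and two n-gram vectors.
mu_w = mean_vector([1.0, 2.0], [[3.0, 4.0], [5.0, 0.0]])
```

With no n-gram vectors at all, the result reduces to the dictionary-based vector itself.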
(2) Calculating word similarity:
A common measure of word vector similarity is the dot product, which applies when words are represented by vectors. When words are represented by distribution functions, as in the embodiments of the present application, a generalized dot product in the Hilbert space L2, namely the expected likelihood kernel, can be used instead.
In particular, E(f, g) may be defined as the energy between f and g, which has a closed form for Gaussian mixtures:
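$$E(f,g) = \log \langle f, g \rangle_{L_2} = \log \int f(x)\, g(x)\, dx$$
where
$$f(x) = \sum_{i=1}^{K} p_i\, \mathcal{N}\left(x;\ \mu_{f,i},\ \Sigma_{f,i}\right), \qquad g(x) = \sum_{i=1}^{K} q_i\, \mathcal{N}\left(x;\ \mu_{g,i},\ \Sigma_{g,i}\right)$$
In the above formulas: $\mathcal{N}$ denotes a Gaussian distribution whose density belongs to the Hilbert space; $\mu$ denotes a mean vector; $\Sigma$ denotes a covariance matrix; f and g denote probability density functions; K denotes the number of Gaussian components; and $p_i$ and $q_i$ denote the component probabilities of f and g respectively, each set summing to 1.
From the above equations:
$$E(f,g) = \log \sum_{i=1}^{K} \sum_{j=1}^{K} p_i\, q_j\, e^{\xi_{i,j}}$$
where $\xi_{i,j}$ denotes the partial energy corresponding to the similarity between component i of the first word f and component j of the second word g.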
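Specifically:
$$\xi_{i,j} = \log \mathcal{N}\left(0;\ \mu_{f,i} - \mu_{g,j},\ \Sigma_{f,i} + \Sigma_{g,j}\right) = -\frac{1}{2}\log\det\left(\Sigma_{f,i} + \Sigma_{g,j}\right) - \frac{D}{2}\log(2\pi) - \frac{1}{2}\left(\mu_{f,i} - \mu_{g,j}\right)^{\top}\left(\Sigma_{f,i} + \Sigma_{g,j}\right)^{-1}\left(\mu_{f,i} - \mu_{g,j}\right)$$
In the above formula, D denotes the dimension of the vector space; $\mu_{f,i}$ denotes the mean vector of component i of the word f; $\mu_{g,j}$ denotes the mean vector of component j of the word g; $\Sigma_{f,i}$ denotes the covariance matrix of component i of the word f; and $\Sigma_{g,j}$ denotes the covariance matrix of component j of the word g.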
The similarity between word vectors can be calculated by the formula referred to above.
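To make the energy calculation concrete, the sketch below evaluates the log of the expected likelihood kernel between two Gaussian mixtures. It assumes diagonal covariance matrices, stored as lists of variances, so that the determinant and inverse are trivial; it also assumes the standard closed form for the overlap of two Gaussians. The names `Component`, `partial_energy` and `energy` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# (component probability, mean vector, diagonal variances) -- assumed layout
Component = Tuple[float, Sequence[float], Sequence[float]]


def partial_energy(mu_f: Sequence[float], var_f: Sequence[float],
                   mu_g: Sequence[float], var_g: Sequence[float]) -> float:
    """Log overlap of two Gaussians with diagonal covariances."""
    d = len(mu_f)
    log_det = sum(math.log(a + b) for a, b in zip(var_f, var_g))
    maha = sum((x - y) ** 2 / (a + b)
               for x, y, a, b in zip(mu_f, mu_g, var_f, var_g))
    return -0.5 * log_det - 0.5 * d * math.log(2 * math.pi) - 0.5 * maha


def energy(f: List[Component], g: List[Component]) -> float:
    """E(f, g): log of the probability-weighted sum of exp(partial energy)."""
    terms = [math.log(p) + math.log(q) + partial_energy(mf, vf, mg, vg)
             for p, mf, vf in f for q, mg, vg in g]
    top = max(terms)  # log-sum-exp trick for numerical stability
    return top + math.log(sum(math.exp(t - top) for t in terms))


word = [(1.0, [0.0], [0.5])]
near = [(1.0, [0.5], [0.5])]
far = [(1.0, [3.0], [0.5])]
```

In this toy setup `energy(word, near)` is larger than `energy(word, far)`, matching the intuition that closer distributions are more similar.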
In the embodiment of the present application, the similarity between words may be represented by the schematic diagram shown in fig. 4. The words "pop", "rock", etc. shown in fig. 4 are merely exemplary and should not be construed as limiting.
Since the result of the classification performed by the trained model may have an error from the actual real classification result, in the embodiment of the present application, a loss function may be used to represent the error between the real classification result and the classification result of the trained model.
Specifically, the loss function can be expressed as follows:
L(f,g)=max[0,m-E(f,g)+E(f,n)]
It can be understood that the loss function is derived from the similarity formula described above.
In the above formula, m represents a margin, which can be understood as an error distance value. m can be set as needed and controls the required gap between E(f, g) and E(f, n): a loss is incurred only when E(f, g) exceeds E(f, n) by less than m. In short, the loss function requires the score of the correct class to be higher than that of an incorrect class by at least m; if this is not met, a loss value is computed.
n represents the negative context, so E(f, n) is the energy of the word with a context to which it is least likely to belong.
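The max-margin loss L(f,g)=max[0,m-E(f,g)+E(f,n)] is simple enough to sketch directly. The function name `margin_loss` and the default margin of 1.0 are illustrative assumptions; the energies would come from the similarity calculation described earlier.

```python
def margin_loss(e_pos: float, e_neg: float, m: float = 1.0) -> float:
    """Loss is zero once E(f, g) exceeds E(f, n) by at least the margin m."""
    return max(0.0, m - e_pos + e_neg)


# True context scores well above the negative context: no loss.
no_loss = margin_loss(e_pos=5.0, e_neg=1.0, m=1.0)
# Gap of 0.5 is smaller than the margin of 1.0: loss of 0.5.
some_loss = margin_loss(e_pos=1.5, e_neg=1.0, m=1.0)
```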
Further, the method shown in fig. 5 may be used to select the words to be trained from the sample data. As shown in fig. 5, the method includes:
S301: Determine the probability value of each word according to the frequency of each word to be trained in the corpus and the preset frequency threshold of each word.
S302: Determine the probability value of each word in the words according to the probability value of each word.
S303: Determine the words whose probability values are smaller than a preset probability value among the probability values of all the words.
S304: Select the context of the words whose probability value is smaller than the preset probability value according to the set window size, and take the words included in the context as the words to be trained.
Specifically, at the beginning of model training, the sample data must first be sampled; in the embodiment of the application, the words to be trained are selected from the sample data.
Words occur in the corpus with different frequencies: some words occur very often but carry little meaning. In the embodiment of the application, the frequency of a word in the corpus is therefore used to determine its probability value, and training concentrates on words whose probability value is low, which reduces the importance of very frequent words such as "the".
Specifically, for a word w, its probability can be expressed as:
$$P(w) = 1 - \sqrt{\frac{t}{f(w)}}$$
where f(w) denotes the frequency of the word w in the corpus, and t denotes the frequency threshold.
Negative context words are drawn from the distribution
$$P_n(w) \propto U(w)^{3/4}$$
where U(w) is the unigram probability of w. The exponent 3/4 also reduces the importance of frequent words, shifting the focus of training to other, less common words.
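The two frequency-based quantities can be sketched as below. The formulas used here, a probability of 1 - sqrt(t / f(w)) and unigram probabilities raised to the power 3/4, are the conventions popularized by word2vec and are assumed rather than confirmed by the surviving text; the names `discard_probability` and `negative_sampling_weights` are hypothetical.

```python
import math
from typing import Dict


def discard_probability(freq: float, t: float) -> float:
    """Assumed form 1 - sqrt(t / f(w)), clamped at 0 for rare words."""
    return max(0.0, 1.0 - math.sqrt(t / freq))


def negative_sampling_weights(counts: Dict[str, int],
                              power: float = 0.75) -> Dict[str, float]:
    """Normalized unigram probabilities raised to an assumed exponent of 3/4."""
    total = sum(counts.values())
    raw = {w: (c / total) ** power for w, c in counts.items()}
    norm = sum(raw.values())
    return {w: r / norm for w, r in raw.items()}


weights = negative_sampling_weights({"the": 16, "prize": 1})
```

In the toy counts, "the" holds 16/17 of the unigram mass but only 8/9 of the sampling weight, which shows how the exponent damps very frequent words.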
Specifically, in the embodiment of the present application, a word, a real context of the word, and a negative context may be sampled. A set window size may be used to select the context of the word, and the words included in the selected context are used as the words to be trained.
It should be understood that the set window size may be, for example, a window of size 1, or may be other sizes, which is not limited in the present application.
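Selecting a context with a set window size can be illustrated with a short sketch. The function name `context_words` and the sample tokens are hypothetical; the window counts the number of words taken on each side of the center word.

```python
from typing import List


def context_words(tokens: List[str], index: int, window: int = 1) -> List[str]:
    """Return up to `window` words on each side of tokens[index]."""
    start = max(0, index - window)
    return tokens[start:index] + tokens[index + 1:index + 1 + window]


tokens = ["claim", "your", "prize", "now"]
around_prize = context_words(tokens, 2, window=1)
```

At the edges of the text the context is simply shorter, since slicing stops at the list boundaries.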
Preferably, the acquiring text information of the web page in the website to be detected may include:
and capturing text information of the web page in the website to be detected by using a web crawler, or restoring the text information of the web page in the website to be detected through a web log of the web page.
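As a minimal sketch of the crawler route, the code below recovers the visible text from an already downloaded HTML page using only the Python standard library. The class name `TextExtractor` and the function `page_text` are hypothetical; downloading the page itself (for example with `urllib.request.urlopen`) and restoring text from web logs are left out.

```python
from html.parser import HTMLParser
from typing import List


class TextExtractor(HTMLParser):
    """Collect visible text, ignoring the contents of script and style tags."""

    SKIP = {"script", "style"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def page_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

The resulting string is the text information that would then be passed to word segmentation.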
Furthermore, the above similarity formula contains $\xi_{i,j}$, which can be understood as an energy formula. In the embodiment of the present application, the $\xi_{i,j}$ energy formula can be simplified, which reduces the amount of computation in the formula, lowers the number of floating-point operations performed by the program, and improves performance.
Since the loss function is derived from the similarity formula, and $\xi$ is a variable in the similarity calculation, $\xi$ is also a variable in the loss function; simplifying the energy formula therefore simplifies the loss function as well.
Specifically, since using a spherical covariance is substantially equivalent to using a diagonal covariance, $\xi_{i,j}$ can be simplified to:
$$\xi_{i,j} = -\frac{\alpha}{2}\,\left\lVert \mu_{f,i} - \mu_{g,j} \right\rVert^{2}$$
where the hyperparameter $\alpha$ is the scale of the inverse covariance term in the $\xi_{i,j}$ formula above, $\mu_{f,i}$ denotes the mean vector of component i of the word f, and $\mu_{g,j}$ denotes the mean vector of component j of the word g.
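Under the simplification described above, the partial energy depends only on the squared distance between two mean vectors, scaled by the hyperparameter α. The sketch below assumes the form -(α/2)·||μ_f,i - μ_g,j||², which drops the determinant and constant terms of the full expression; the function name `simplified_partial_energy` is hypothetical.

```python
from typing import Sequence


def simplified_partial_energy(mu_f: Sequence[float],
                              mu_g: Sequence[float],
                              alpha: float = 1.0) -> float:
    """Assumed simplified form: -(alpha / 2) * squared Euclidean distance."""
    squared_distance = sum((x - y) ** 2 for x, y in zip(mu_f, mu_g))
    return -0.5 * alpha * squared_distance


xi = simplified_partial_energy([1.0, 2.0], [1.0, 0.0], alpha=0.5)
```

Compared with the full expression, this needs no logarithm, determinant, or matrix inverse, which is where the reduction in floating-point operations comes from.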
Based on the same concept as the detection method embodiment of the cheating website, the embodiment of the invention also provides a detection device of the cheating website. Fig. 6 is a block diagram illustrating a structure of a detection apparatus for a fraud website according to an embodiment of the present application, including: an acquisition unit 101, a processing unit 102, and a determination unit 103.
The acquiring unit 101 is configured to acquire text information of a web page in a website to be detected.
The processing unit 102 is configured to perform text word segmentation on the text information acquired by the acquiring unit 101 to obtain at least one word, and classify the at least one word based on a preset model to obtain a classification result.
The determining unit 103 is configured to determine that the website to be detected is a fraudulent website when the classification result is determined to be the preset identification information.
The preset identification information is used for representing that the website is a fraud website.
Specifically, the preset model is obtained by training in the following way:
the obtaining unit 101 is further configured to: and acquiring sample data, and selecting a word to be trained from the sample data.
The determining unit 103 is further configured to: determine a word vector of each word according to the vector of the corpus of each word and the average vector of each word; and determine the similarity between the word vectors according to the word vector of each word.
The processing unit 102 is further configured to: train on the determined similarity between the word vectors to obtain the preset model.
Further, the obtaining unit 101 is specifically configured to select a word to be trained from the sample data as follows:
determining a probability value of each word according to the frequency of each word to be trained in the corpus and a preset frequency threshold of each word; determining the probability value of each word in the words according to the probability value of each word; determining words with probability values smaller than a preset probability value in the probability values of all words in the words; selecting the context of the words with the probability value smaller than the preset probability value according to the set window size, and taking the words included in the context as the words to be trained.
Further, the obtaining unit 101 is specifically configured to obtain text information of a web page in a website to be detected as follows:
and capturing text information of the web page in the website to be detected by using a web crawler, or restoring the text information of the web page in the website to be detected through a web log of the web page.
It should be noted that, for the function implementation of each unit in the above-mentioned detection apparatus for a fraud website in the embodiment of the present invention, reference may be further made to the description of the related method embodiment, which is not described herein again.
An embodiment of the present application further provides another apparatus for detecting a fraudulent website, as shown in fig. 7, the apparatus includes:
a memory 202 for storing program instructions.
A transceiver 201 for receiving and transmitting a fraud website detection instruction.
And the processor 200 is configured to call the program instructions stored in the memory, and execute any one of the method flows described by the processing unit (102) and the determining unit (103) shown in fig. 6 according to the obtained program according to the instructions received by the transceiver 201.
Where in fig. 7 the bus architecture may include any number of interconnected buses and bridges, with various circuits of one or more processors, represented by processor 200, and memory, represented by memory 202, being linked together. The bus architecture may also link together various other circuits such as peripherals, voltage regulators, power management circuits, and the like, which are well known in the art, and therefore, will not be described any further herein. The bus interface provides an interface.
The transceiver 201 may be a number of elements, including a transmitter and a receiver, providing a means for communicating with various other apparatus over a transmission medium.
The processor 200 is responsible for managing the bus architecture and general processing, and the memory 202 may store data used by the processor 200 in performing operations.
The processor 200 may be a Central Processing Unit (CPU), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or a Complex Programmable Logic Device (CPLD).
Embodiments of the present application also provide a computer storage medium for storing computer program instructions for any apparatus described in the embodiments of the present application, which includes a program for executing any method provided in the embodiments of the present application.
The computer storage media may be any available media or data storage device that can be accessed by a computer, including, but not limited to, magnetic memory (e.g., floppy disks, hard disks, magnetic tape, magneto-optical disks (MOs), etc.), optical memory (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor memory (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND FLASH), Solid State Disks (SSDs)), etc.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present invention without departing from the spirit and scope of the invention. Thus, if such modifications and variations of the present invention fall within the scope of the claims of the present invention and their equivalents, the present invention is also intended to include such modifications and variations.

Claims (10)

1. A method for detecting a fraudulent website, comprising:
acquiring text information of a webpage in a website to be detected;
performing text word segmentation on the text information to obtain at least one word;
classifying the at least one word based on a preset model to obtain a classification result;
and, if the classification result is determined to be preset identification information, determining that the website to be detected is a fraudulent website, wherein the preset identification information indicates that a website is a fraudulent website.
2. The method of claim 1, wherein the preset model is trained by:
obtaining sample data, and selecting words to be trained from the sample data;
determining a word vector of each word according to the vector of the corpus of each word and the average vector of each word;
determining similarity between the word vectors according to the word vectors of each word;
and training on the determined similarity between the word vectors to obtain the preset model.
3. The method of claim 2, wherein said selecting a word to be trained from said sample data comprises:
determining a probability value of each word according to the frequency of each word to be trained in the corpus and a preset frequency threshold of each word;
determining the probability value of each word in the words according to the probability value of each word;
determining words with probability values smaller than a preset probability value in the probability values of all words in the words;
and selecting the context of the words with the probability value smaller than the preset probability value according to the set window size, and taking the words included in the context as the words to be trained.
4. The method of claim 1, wherein acquiring the text information of the webpage in the website to be detected comprises:
capturing the text information of the webpage in the website to be detected by using a web crawler, or restoring the text information of the webpage in the website to be detected from a web log of the webpage.
5. An apparatus for detecting a fraudulent website, comprising:
an acquisition unit, configured to acquire text information of a webpage in a website to be detected;
a processing unit, configured to perform text word segmentation on the text information acquired by the acquisition unit to obtain at least one word, and to classify the at least one word based on a preset model to obtain a classification result;
and a determining unit, configured to determine that the website to be detected is a fraudulent website when the classification result is determined to be preset identification information, wherein the preset identification information indicates that a website is a fraudulent website.
6. The apparatus of claim 5, wherein the preset model is trained as follows:
the acquisition unit is further configured to: obtain sample data, and select words to be trained from the sample data;
the determining unit is further configured to: determine a word vector of each word according to the vector of the corpus of each word and the average vector of each word; and determine the similarity between the word vectors according to the word vector of each word;
the processing unit is further configured to: train on the determined similarity between the word vectors to obtain the preset model.
7. The apparatus of claim 6, wherein the acquisition unit is specifically configured to select the words to be trained from the sample data as follows:
determining a probability value of each word according to the frequency of each word to be trained in the corpus and a preset frequency threshold of each word;
determining the probability value of each word in the words according to the probability value of each word;
determining words with probability values smaller than a preset probability value in the probability values of all words in the words;
and selecting the context of the words with the probability value smaller than the preset probability value according to the set window size, and taking the words included in the context as the words to be trained.
8. The apparatus according to claim 5, wherein the acquisition unit is specifically configured to acquire the text information of the webpage in the website to be detected as follows:
capturing the text information of the webpage in the website to be detected by using a web crawler, or restoring the text information of the webpage in the website to be detected from a web log of the webpage.
9. An apparatus for detecting a fraudulent website, comprising:
a memory for storing program instructions;
and a processor for calling the program instructions stored in the memory and executing the method of any one of claims 1 to 4 according to the obtained program instructions.
10. A computer readable storage medium having stored thereon computer instructions which, when run on a computer, cause the computer to perform the method of any of claims 1-4.
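
As a rough illustration of claims 1 and 3, the following Python sketch shows one way the word-selection and detection steps could fit together. It is not the patented implementation. The whitespace-based `segment` function, the `keyword_classifier` stand-in for the preset model, the label string `__label__fraud`, all parameter names, and the word2vec-style formula 1 - sqrt(t / f) used for the probability value are assumptions made for illustration; the claims do not specify any of them.

```python
import math
from collections import Counter

# Hypothetical value for the "preset identification information" of claim 1.
FRAUD_LABEL = "__label__fraud"


def segment(text):
    """Stand-in for text word segmentation: lowercase, split on whitespace.

    Chinese page text would need a real word segmenter instead.
    """
    return text.lower().split()


def select_training_words(tokens, freq_threshold, max_prob, window):
    """Sketch of claim 3: pick low-probability words and their contexts."""
    counts = Counter(tokens)
    total = len(tokens)
    contexts = []
    for i, word in enumerate(tokens):
        freq = counts[word] / total
        # Assumed formula (word2vec-style subsampling); not given in the claims.
        prob = max(0.0, 1.0 - math.sqrt(freq_threshold / freq))
        if prob < max_prob:
            # Take the words inside the window around position i,
            # including the word itself (an assumption).
            lo = max(0, i - window)
            hi = min(total, i + window + 1)
            contexts.append(tokens[lo:hi])
    return contexts


def keyword_classifier(words):
    """Toy stand-in for the preset model; a trained classifier would go here."""
    suspicious = {"lottery", "prize", "password"}
    return FRAUD_LABEL if suspicious & set(words) else "__label__normal"


def is_fraudulent(page_text, classify):
    """Sketch of claim 1: segment, classify, compare with the preset label."""
    words = segment(page_text)
    if not words:
        # Assumption: a page with no words is not flagged.
        return False
    return classify(words) == FRAUD_LABEL
```

For example, `is_fraudulent("Claim your lottery prize now", keyword_classifier)` returns `True` with this toy classifier. Passing the classifier as a parameter keeps the detection flow independent of how the preset model is trained, so a model trained along the lines of claims 2 and 3 could be substituted without changing `is_fraudulent`.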

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811545469.8A CN111324831A (en) 2018-12-17 2018-12-17 Method and device for detecting fraudulent website

Publications (1)

Publication Number Publication Date
CN111324831A true CN111324831A (en) 2020-06-23

Family

ID=71168542

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811545469.8A Pending CN111324831A (en) 2018-12-17 2018-12-17 Method and device for detecting fraudulent website

Country Status (1)

Country Link
CN (1) CN111324831A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination