CN114048311A - Phishing early warning method, device, equipment and storage medium - Google Patents

Info

Publication number
CN114048311A
Authority
CN
China
Prior art keywords
behavior classification
url
behavior
acquiring
target
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111101612.6A
Other languages
Chinese (zh)
Inventor
杨蓝暄
阿曼太
马寒军
傅强
梁彧
蔡琳
田野
王杰
杨满智
金红
陈晓光
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Eversec Beijing Technology Co Ltd
Original Assignee
Eversec Beijing Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Eversec Beijing Technology Co Ltd
Priority to CN202111101612.6A
Publication of CN114048311A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G06F18/24155 Bayesian classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H04L63/1483 Countermeasures against malicious traffic service impersonation, e.g. phishing, pharming or web spoofing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Hardware Design (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the invention disclose a phishing early warning method, device, equipment and storage medium. The method comprises: acquiring a Uniform Resource Locator (URL) to be analyzed, and judging whether a preset domain name blacklist includes the URL; if so, extracting a set part of the URL, and acquiring a word segmentation list according to the set part; acquiring a target feature matrix according to the word segmentation list; inputting the target feature matrix into a pre-trained behavior classification model, and acquiring a target behavior classification output by the behavior classification model; and determining the degree to which the user has been defrauded according to the target behavior classification, and issuing a fraud early warning according to that degree. In this technical scheme, URLs in network traffic are analyzed by the pre-trained behavior classification model to determine the degree to which the user has been defrauded, and a phishing early warning is issued on that basis, so that the user is warned of phishing promptly, which helps to avoid personal and property losses.

Description

Phishing early warning method, device, equipment and storage medium
Technical Field
The embodiment of the invention relates to the technical field of computers, in particular to an online fraud early warning method, device, equipment and storage medium.
Background
By counting and analyzing the network behaviors of the user, when the network behaviors of the user are found to have risks, an alarm is given in time, and the method has important significance for improving personal and property safety of the user.
At present, existing user behavior analysis methods generally analyze a user's web browsing habits and interest preferences based on information in network traffic such as Uniform Resource Locators (URLs), referrers, cookies and web content, in order to recommend targeted advertisements to the user; however, they cannot judge or predict the degree to which a user has been defrauded on a fraud website, and therefore cannot warn the user in time.
Disclosure of Invention
Embodiments of the present invention provide a phishing early warning method, device, equipment and storage medium, which can provide timely early warning of phishing, reduce the risk of the user being defrauded, and help to avoid personal and property losses of the user.
In a first aspect, an embodiment of the present invention provides a phishing early warning method, including:
acquiring a Uniform Resource Locator (URL) to be analyzed, and judging whether a preset domain name blacklist comprises the URL;
if so, extracting a set part of the URL, and acquiring a word segmentation list according to the set part;
acquiring a target feature matrix according to the word segmentation list; inputting the target feature matrix into a pre-trained behavior classification model, and acquiring a target behavior classification output by the behavior classification model;
and determining the degree to which the user has been defrauded according to the target behavior classification, and issuing a fraud early warning according to that degree.
In a second aspect, an embodiment of the present invention further provides a phishing early warning device, including:
the analysis module is used for obtaining a Uniform Resource Locator (URL) to be analyzed and judging whether a preset domain name blacklist comprises the URL;
the word segmentation list acquisition module is used for, if so, extracting the set part of the URL and acquiring a word segmentation list according to the set part;
the target behavior classification acquisition module is used for acquiring a target feature matrix according to the word segmentation list; inputting the target feature matrix into a pre-trained behavior classification model, and obtaining a target behavior classification output by the behavior classification model;
and the user fraud degree determining module is used for determining the degree to which the user has been defrauded according to the target behavior classification and issuing a fraud early warning according to that degree.
In a third aspect, an embodiment of the present invention further provides an electronic device, where the electronic device includes:
one or more processors;
storage means for storing one or more computer programs;
when the one or more computer programs are executed by the one or more processors, the one or more processors implement the phishing early warning method provided by any embodiment of the invention.
In a fourth aspect, an embodiment of the present invention further provides a computer-readable storage medium, where the storage medium stores a computer program, and the computer program, when executed by a processor, implements the phishing early warning method provided in any embodiment of the present invention.
According to the technical scheme provided by the embodiment of the invention, a Uniform Resource Locator (URL) to be analyzed is obtained; when the preset domain name blacklist includes the URL, a set part of the URL is extracted, and a word segmentation list is obtained according to the set part; a target feature matrix is then acquired according to the word segmentation list; the target feature matrix is input into a pre-trained behavior classification model, and the target behavior classification output by the behavior classification model is acquired; finally, the degree to which the user has been defrauded is determined according to the target behavior classification, and a fraud early warning is issued according to that degree. In this way, online early warning of phishing is realized, the risk of the user being defrauded is reduced, and personal and property losses of the user can be avoided.
Drawings
FIG. 1 is a flowchart illustrating a phishing early warning method in an embodiment of the present invention;
FIG. 2A is a flowchart of a phishing early warning method in another embodiment of the present invention;
FIG. 2B is a flowchart illustrating a phishing early warning method in another embodiment of the present invention;
FIG. 3 is a schematic structural diagram of a phishing early warning apparatus in another embodiment of the present invention;
FIG. 4 is a schematic structural diagram of an electronic device in another embodiment of the present invention.
Detailed Description
Embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While certain embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be construed as limited to the embodiments set forth herein, but rather are provided for a more thorough and complete understanding of the present invention. It should be understood that the drawings and the embodiments of the present invention are illustrative only and are not intended to limit the scope of the present invention.
It should be understood that the various steps recited in the method embodiments of the present invention may be performed in a different order and/or performed in parallel. Moreover, method embodiments may include additional steps and/or omit performing the illustrated steps. The scope of the invention is not limited in this respect.
The term "include" and variations thereof as used herein are open-ended, i.e., "including but not limited to". The term "based on" is "based, at least in part, on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Relevant definitions for other terms will be given in the following description.
It should be noted that the terms "first", "second", and the like in the present invention are only used for distinguishing different devices, modules or units, and are not used for limiting the order or interdependence of the functions performed by the devices, modules or units.
It is noted that the modifiers "a", "an", and "the" in the present invention are intended to be illustrative rather than limiting, and those skilled in the art will understand that they mean "one or more" unless the context clearly dictates otherwise.
The names of messages or information exchanged between devices in the embodiments of the present invention are for illustrative purposes only and are not intended to limit the scope of the messages or information.
FIG. 1 is a flowchart of a phishing early warning method provided by an embodiment of the present invention. The embodiment is applicable to analyzing URLs in network traffic with a pre-trained behavior classification model, determining the degree to which the user has been defrauded, and issuing a phishing early warning based on that degree. The method may be performed by a phishing early warning apparatus, which may be implemented in hardware and/or software and may generally be integrated in an electronic device, typically a computer device or a server. As shown in FIG. 1, the method specifically includes the following steps:
S110, obtaining a Uniform Resource Locator (URL) to be analyzed, and judging whether the URL is included in a preset domain name blacklist or not.
Wherein, Uniform Resource Locator (URL) is the address of the standard Resource on the internet; in the internet, each file corresponds to a unique URL, and the position and the corresponding processing mode of the file can be acquired through information contained in the URL.
In this embodiment, the URL of a website accessed by the user may be captured by packet capture software and used as the URL to be analyzed; alternatively, the website may be accessed through a client and the URL obtained by viewing the page source code may be used as the URL to be analyzed.
The preset domain name blacklist is a list including the URL of at least one fraud website; typically, the fraud websites may be lottery-type fraud websites. In this embodiment, a fraud website may be an illegal website published by a relevant organization or a website reported for illegal or criminal activity, and the embodiment does not specifically limit the manner of acquiring the URL of the fraud website.
In this embodiment, after the URL to be analyzed is obtained, the preset domain name blacklist is searched for a matching stored URL; if a stored URL consistent with the URL to be analyzed is found, the URL to be analyzed can be determined to be the URL of a fraud website, which indicates that the current user has accessed a fraud website, that is, the user is at risk of being defrauded.
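As a minimal sketch of the blacklist check in S110: the embodiment does not say how a URL is matched against the stored entries, so this example assumes matching on the host name. The blacklist entries, the name `DOMAIN_BLACKLIST` and the function `is_blacklisted` are hypothetical and used only for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical blacklist of fraud-site host names (illustrative values only).
DOMAIN_BLACKLIST = {"lottery-scam.example", "fake-bank.example"}


def is_blacklisted(url, blacklist=DOMAIN_BLACKLIST):
    """Return True if the host name of `url` appears in the blacklist.

    Assumption: matching is done on the host name, not on the full URL.
    """
    host = urlsplit(url).hostname  # lower-cased by urlsplit; None if absent
    return host is not None and host in blacklist


print(is_blacklisted("http://lottery-scam.example/user/register"))  # True
print(is_blacklisted("https://www.example.org/index.html"))         # False
```

Only URLs for which this check succeeds would move on to the path extraction step.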
S120, if so, extracting a set part of the URL, and acquiring a word segmentation list according to the set part.
The set part of the URL is a designated component of the URL; typically, the set part may be the path (Path) portion of the URL, which describes the absolute path of the resource within the project or module. It should be noted that the general format of a URL is protocol://hostname[:port]/path/[;parameters][?query]#fragment, where protocol denotes the protocol, for example the commonly used Hypertext Transfer Protocol (HTTP); hostname denotes the host address, which may be a domain name or an Internet Protocol (IP) address; port denotes the host port number, and for HTTP the default port is 80, that is, port 80 is used if this item is empty; path denotes the path of the network resource on the server; parameters is used to configure parameters to be passed to the server; query is used to configure a query string for querying content on the server; fragment is used to jump directly to a specified position after the web page is accessed; and [ ] indicates that the item is optional.
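The URL components listed above can be inspected with Python's standard `urllib.parse` module. The sketch below is illustrative only; the example URLs are made up.

```python
from urllib.parse import urlsplit

# Hypothetical URL used only to show the components.
url = "http://shop.example:8080/user/recharge-pay?amount=10#top"
parts = urlsplit(url)

print(parts.scheme)    # http
print(parts.hostname)  # shop.example
print(parts.port)      # 8080
print(parts.path)      # /user/recharge-pay
print(parts.query)     # amount=10
print(parts.fragment)  # top

# When the port is omitted, urlsplit reports None; it does not fill in
# the HTTP default of 80, so the caller has to apply that default.
no_port = urlsplit("http://shop.example/user/login")
print(no_port.port)    # None
```

The `path` attribute is the part that the following steps segment into words.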
It should be noted that, after the user logs in the fraud website, different operations will jump to different pages; correspondingly, the Path parts in the URLs of different pages are different, so that the Path part content can reflect the behavior of the user. Therefore, by acquiring the Path part in the URL and analyzing the current Path part, the corresponding operation performed by the user can be acquired to realize the classification of the user behavior.
In an optional implementation manner of this embodiment, extracting the set part of the URL and obtaining the word segmentation list according to the set part may include: extracting the set part of the URL, splitting the set part at preset delimiters, and acquiring at least one word segment corresponding to the set part; and acquiring the word segmentation list according to the at least one word segment.
The preset delimiter may be a character such as "/" or "-". Specifically, after the set part of the URL is obtained, it is split at the preset delimiters to obtain a plurality of word segments; for example, if the set part is ABC/DEF and it is split at the preset delimiter "/", the word segments ABC and DEF are obtained. After the word segments corresponding to the URL to be analyzed are obtained, the word segmentation list is generated from them.
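The splitting step can be sketched as follows. The function name `segment_path` and the choice to drop empty segments are assumptions made for illustration; the delimiter set "/" and "-" follows the example in the text.

```python
import re


def segment_path(path, delimiters="/-"):
    """Split a URL path at any of the given delimiter characters.

    Empty strings produced by leading or repeated delimiters are dropped.
    """
    pattern = "[" + re.escape(delimiters) + "]+"
    return [token for token in re.split(pattern, path) if token]


print(segment_path("ABC/DEF"))                        # ['ABC', 'DEF']
print(segment_path("/user/recharge-pay/index.html"))  # ['user', 'recharge', 'pay', 'index.html']
```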
S130, acquiring a target feature matrix according to the word segmentation list; and inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring the target behavior classification output by the behavior classification model.
In this embodiment, behavior feature extraction may be performed on the segmentation list according to a feature extraction algorithm, and a target feature matrix may be determined according to a feature extraction result corresponding to each behavior classification; specifically, the feature extraction quantity corresponding to each behavior classification can be used as an element value of the target feature matrix to obtain a target feature matrix corresponding to the URL to be analyzed; the number of elements of the target feature matrix is equal to the number of behavior classifications, and each element corresponds to one behavior classification.
Further, after the target feature matrix is obtained, it may be input to a pre-trained behavior classification model, so as to obtain the behavior classification that the model outputs for the URL to be analyzed. The behavior classification model may be obtained by training an initial behavior classification model, constructed based on a machine learning algorithm (e.g., a hidden Markov algorithm or a Bayesian algorithm used for classification), with feature matrix training samples that carry behavior classification labels.
In this embodiment, an initial behavior classification model may be first constructed based on a machine learning algorithm, a feature matrix labeled with behavior classification in advance is obtained as a training sample, supervised training is performed on the initial behavior classification model until the recognition result of the training sample by the behavior classification model is consistent with the labeling information, and a trained behavior classification model is obtained.
In this embodiment, the behavior classification may include at least one of registration, login, recharge and withdrawal, access to the personal center, betting and purchasing, record query, and contacting online customer service. For lottery fraud websites, the main behavior classifications of users are first determined, training samples for those behavior classifications are then obtained to train the initial behavior classification model, and a behavior classification model capable of identifying those behavior classifications is obtained.
In an optional implementation manner of this embodiment, obtaining the target feature matrix according to the word segmentation list may include:
comparing the word segmentation list with each preset behavior keyword library to obtain, for each library, the number of its keywords that appear in the word segmentation list; acquiring an initial feature matrix according to the number of preset behavior keyword libraries; and filling the keyword counts for the preset behavior keyword libraries into the initial feature matrix to obtain the target feature matrix; where the initial feature matrix is a 1 × N zero matrix, N represents the number of preset behavior keyword libraries, and the preset behavior keyword libraries correspond to the behavior classifications one to one.
In this embodiment, a corresponding keyword library may be respectively pre-established for each behavior classification; specifically, URLs at the time of executing different behaviors in the fraud website may be acquired, and corresponding keywords may be acquired according to the URLs to establish keyword libraries corresponding to the respective behavior classifications. The number and types of the behavior classifications may be set adaptively according to task requirements, which is not specifically limited in this embodiment.
Secondly, the word segmentation list is compared with the keyword library corresponding to each behavior classification, and the number of keywords from each keyword library that appear in the word segmentation list is recorded; for example, the current word segmentation list shares 5 keywords with the keyword library corresponding to the registration behavior and 4 keywords with the keyword library corresponding to another behavior. The initial feature matrix is obtained according to the number of preset behavior keyword libraries; for example, if there are currently 7 preset behavior keyword libraries, the initial feature matrix may be [0,0,0,0,0,0,0], and each element corresponds to one behavior classification.
Finally, the keyword counts for the preset behavior keyword libraries are filled into the initial feature matrix to obtain the target feature matrix; for example, the initial feature matrix is [0,0,0,0,0,0,0], and the corresponding behavior classifications are, in order, registration, login, recharge and withdrawal, access to the personal center, betting and purchasing, record query, and contacting online customer service; if the numbers of keywords that the word segmentation list shares with the keyword libraries of these behavior classifications are 1, 2, 0, 0, 0, 3 and 4 in order, these values are written into the initial feature matrix in order to obtain the target feature matrix [1,2,0,0,0,3,4].
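The three steps above (compare, initialise a 1 × N zero matrix, fill in the counts) can be sketched as follows. The keyword libraries shown here are invented for illustration and are not taken from the embodiment. The sketch also assumes that every matching word segment is counted, including repeats; the text does not say whether repeated segments count once or several times.

```python
# Behavior classifications in the order used by the feature matrix.
BEHAVIORS = [
    "register", "login", "recharge_withdraw", "personal_center",
    "bet", "query_records", "customer_service",
]

# Hypothetical keyword libraries, one per behavior classification.
KEYWORD_LIBRARIES = {
    "register": {"register", "signup"},
    "login": {"login", "signin"},
    "recharge_withdraw": {"recharge", "withdraw", "pay"},
    "personal_center": {"center", "profile"},
    "bet": {"bet", "buy"},
    "query_records": {"query", "records", "history"},
    "customer_service": {"service", "chat"},
}


def build_feature_matrix(tokens, libraries=KEYWORD_LIBRARIES, behaviors=BEHAVIORS):
    """Return a 1 x N list of per-library keyword counts for `tokens`."""
    features = [0] * len(behaviors)  # initial 1 x N zero matrix
    for index, behavior in enumerate(behaviors):
        library = libraries[behavior]
        features[index] = sum(1 for token in tokens if token.lower() in library)
    return features


print(build_feature_matrix(["user", "login", "signin", "records"]))
# [0, 2, 0, 0, 0, 1, 0]
```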
S140, determining the degree to which the user has been defrauded according to the target behavior classification, and issuing a fraud early warning according to that degree.
The degree to which the user has been defrauded may include mild, moderate and severe fraud. In this embodiment, the correspondence between behavior classifications and fraud degrees may be determined in advance; for example, registration and login correspond to mild fraud; access to the personal center, record query and contacting online customer service correspond to moderate fraud; and betting and purchasing, and recharge and withdrawal, correspond to severe fraud.
It should be noted that after the target behavior classification is obtained, the fraud degree corresponding to it can be determined from the predetermined correspondence between behavior classifications and fraud degrees. Further, the fraud early warning may be issued as soon as the user is detected to be mildly defrauded, that is, when the user is detected registering on or logging in to a fraud website; for example, a voice prompt such as "This behavior carries a risk of fraud, please be careful" is played to the user. Alternatively, the fraud early warning may be issued only when the user's fraud degree is detected to have reached moderate or severe fraud, which is not specifically limited in this embodiment.
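A minimal sketch of the mapping from behavior classification to fraud degree, with a configurable warning threshold, is given below. The mapping follows the example in the text; the identifiers `FRAUD_LEVEL`, `LEVEL_ORDER` and `fraud_warning`, the behavior labels and the wording of the message are hypothetical.

```python
# Example correspondence between behavior classification and fraud degree.
FRAUD_LEVEL = {
    "register": "mild",
    "login": "mild",
    "personal_center": "moderate",
    "query_records": "moderate",
    "customer_service": "moderate",
    "bet": "severe",
    "recharge_withdraw": "severe",
}

LEVEL_ORDER = {"mild": 1, "moderate": 2, "severe": 3}


def fraud_warning(behavior, threshold="mild"):
    """Return a warning message if the fraud degree reaches `threshold`.

    Returns None when the degree is below the threshold.
    """
    level = FRAUD_LEVEL[behavior]
    if LEVEL_ORDER[level] >= LEVEL_ORDER[threshold]:
        return f"Warning ({level}): this behavior carries a risk of fraud, please be careful."
    return None


print(fraud_warning("login"))                        # warns at the mild level
print(fraud_warning("login", threshold="moderate"))  # None
print(fraud_warning("bet", threshold="moderate"))    # warns at the severe level
```

Setting `threshold="mild"` corresponds to warning immediately, while `"moderate"` corresponds to the alternative of warning only once the degree reaches moderate or severe.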
In this embodiment, the URL of the website accessed by the user is analyzed by the behavior classification model and the user's behavior classification is identified, so that warning information can be given in time when the user's current behavior is determined to carry a risk of fraud; this reduces the probability that the user is subsequently defrauded and realizes online early warning of phishing.
According to the technical scheme provided by the embodiment of the invention, a Uniform Resource Locator (URL) to be analyzed is obtained; when the preset domain name blacklist includes the URL, a set part of the URL is extracted, and a word segmentation list is obtained according to the set part; a target feature matrix is then acquired according to the word segmentation list; the target feature matrix is input into a pre-trained behavior classification model, and the target behavior classification output by the behavior classification model is acquired; finally, the degree to which the user has been defrauded is determined according to the target behavior classification, and a fraud early warning is issued according to that degree. In this way, online early warning of phishing is realized, the risk of the user being defrauded is reduced, and personal and property losses of the user can be avoided.
This embodiment of the present invention provides a phishing early warning method which, on the basis of the above embodiment, specifically describes how the behavior classification model is obtained by training before the URL to be analyzed is recognized.
FIG. 2A is a flowchart of a phishing early warning method provided by another embodiment of the present invention. Based on the above technical solution, the method of this embodiment includes:
S210, obtaining a sample URL, comparing the sample URL with each preset behavior keyword library, and obtaining a sample feature matrix according to the comparison result.
The sample URL is a URL used for training the behavior classification model. Note that the behavior classification corresponding to each sample URL is known.
In this embodiment, a certain number of URLs with behavior classification labels may be obtained as sample URLs, and a set portion of the sample URLs is extracted and segmented to obtain a word segmentation list corresponding to the sample URLs; further, the word segmentation list is compared with each preset behavior keyword library to obtain the number of the keywords in each preset behavior keyword library included in the word segmentation list, and a sample characteristic matrix is obtained according to the number and the number of the preset behavior keyword libraries.
In an optional implementation manner of this embodiment, before obtaining the sample URL and comparing the sample URL with each preset behavior keyword library, the method may further include: acquiring the URLs that correspond to performing behaviors of the different classifications on fraud websites, and extracting the set parts of those URLs; splitting the set parts at preset delimiters to obtain initial word segments; and performing de-duplication and invalid-segment filtering on the initial word segments to obtain the preset behavior keyword library corresponding to each behavior classification.
In this embodiment, a defrauded user can be simulated by accessing a certain number of fraud websites, performing seven types of operations (registration, login, recharge and withdrawal, access to the personal center, betting and purchasing, record query, and contacting online customer service), and recording the URL produced when each operation is executed. Further, the Path part of each URL is extracted and split at delimiters such as "/" and "-" to obtain initial word segments of the smallest unit; the initial word segments are then grouped by behavior classification, and duplicates and invalid word segments (e.g., index, html, and the like) are removed from the initial word segments of each behavior classification, so as to obtain the preset behavior keyword library corresponding to each behavior classification.
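The construction of the keyword libraries can be sketched as below. The labelled URLs, the invalid-segment list and the function name `build_keyword_libraries` are hypothetical. Treating "." as an additional delimiter is also an assumption, made so that a segment such as index.html is broken into the invalid segments index and html named in the text.

```python
import re
from urllib.parse import urlsplit

# Hypothetical list of segments that carry no behavioral meaning.
INVALID_TOKENS = {"index", "html"}


def build_keyword_libraries(labeled_urls):
    """Build one keyword set per behavior from (url, behavior) pairs."""
    libraries = {}
    for url, behavior in labeled_urls:
        path = urlsplit(url).path
        library = libraries.setdefault(behavior, set())  # a set de-duplicates
        for token in re.split(r"[/\-.]+", path):
            token = token.lower()
            if token and token not in INVALID_TOKENS:
                library.add(token)
    return libraries


# Made-up recordings of simulated operations on a fraud website.
samples = [
    ("http://scam.example/user/register/index.html", "register"),
    ("http://scam.example/user/sign-up", "register"),
    ("http://scam.example/user/login", "login"),
]
libs = build_keyword_libraries(samples)
print(sorted(libs["register"]))  # ['register', 'sign', 'up', 'user']
print(sorted(libs["login"]))     # ['login', 'user']
```

In this toy data the segment "user" ends up in both libraries; whether such shared segments should also be filtered out is a design choice the text leaves open.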
S220, obtaining training samples according to the sample characteristic matrix and the behavior classification of the sample URL.
In this embodiment, after the sample feature matrix is obtained, the behavior classification of the sample URL is used as the label of the sample feature matrix to generate a training sample.
S230, training the initial behavior classification model through the training samples to obtain a trained behavior classification model.
The initial behavior classification model may be constructed based on the naive Bayes algorithm. The naive Bayes (Naïve Bayes) algorithm is a classification method based on Bayes' theorem and the assumption of conditional independence between features; it combines prior and posterior probabilities, can achieve high-accuracy recognition on large data sets, and is simple to implement. In this embodiment, an initial behavior classification model is established based on the naive Bayes algorithm, and the obtained training samples are used to train the initial behavior classification model to obtain the trained behavior classification model, which improves the classification accuracy of the resulting behavior classification model.
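For reference, the standard naive Bayes decision rule can be written as follows. The symbols are introduced here for illustration and do not appear in the embodiment: c is a behavior classification, x_1, ..., x_N are the elements of the feature matrix, and \hat{c} is the predicted classification.

```latex
\[
P(c \mid x_1, \dots, x_N) \;\propto\; P(c) \prod_{i=1}^{N} P(x_i \mid c),
\qquad
\hat{c} = \arg\max_{c} \; P(c) \prod_{i=1}^{N} P(x_i \mid c)
\]
```

The product form follows from the conditional independence assumption: given the classification c, each feature x_i is treated as independent of the others.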
It should be noted that, when the initial behavior classification model is trained with the training samples, 70% of the training samples may be used to train the model and the remaining 30% used for classification prediction with the trained model, in order to observe its classification effect. It can be understood that, when the classification effect does not meet the preset classification accuracy, the misclassified samples can be selected from the samples used for classification prediction and the trained model trained again on them, until the classification results of the behavior classification model meet the preset classification accuracy.
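The training and 70/30 evaluation described above can be sketched in plain Python. The embodiment names naive Bayes but not a specific variant; this sketch assumes a multinomial model with Laplace smoothing, which suits count-valued features. The toy data uses three features and three classes for brevity and is entirely made up; all function names are hypothetical.

```python
import math
import random


def train_multinomial_nb(X, y, alpha=1.0):
    """Fit log priors and smoothed per-class log feature probabilities."""
    log_prior, log_likelihood = {}, {}
    for c in sorted(set(y)):
        rows = [x for x, label in zip(X, y) if label == c]
        log_prior[c] = math.log(len(rows) / len(X))
        counts = [sum(column) + alpha for column in zip(*rows)]
        total = sum(counts)
        log_likelihood[c] = [math.log(v / total) for v in counts]
    return log_prior, log_likelihood


def predict(model, x):
    """Return the class with the highest log posterior score for `x`."""
    log_prior, log_likelihood = model

    def score(c):
        return log_prior[c] + sum(xi * w for xi, w in zip(x, log_likelihood[c]))

    return max(log_prior, key=score)


def split_70_30(X, y, seed=0):
    """Shuffle the samples and split them 70% / 30%."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    cut = int(len(order) * 0.7)
    train, test = order[:cut], order[cut:]
    return ([X[i] for i in train], [y[i] for i in train],
            [X[i] for i in test], [y[i] for i in test])


# Made-up samples; feature order is [register, login, bet] keyword counts.
X = [[2, 0, 0], [1, 0, 0], [3, 1, 0],
     [0, 2, 0], [0, 1, 0], [1, 3, 0],
     [0, 0, 2], [0, 0, 1], [0, 1, 3]]
y = ["register"] * 3 + ["login"] * 3 + ["bet"] * 3

X_train, y_train, X_test, y_test = split_70_30(X, y)
model = train_multinomial_nb(X_train, y_train)
correct = sum(predict(model, x) == label for x, label in zip(X_test, y_test))
print(f"held-out accuracy: {correct}/{len(y_test)}")
```

With so few samples the held-out accuracy is not meaningful; the point is only the shape of the train, predict and evaluate loop. The retraining on misclassified samples mentioned in the text is not shown.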
S240, acquiring a Uniform Resource Locator (URL) to be analyzed, and judging whether the URL is included in a preset domain name blacklist or not.
S250, if so, extracting a set part of the URL, and acquiring a word segmentation list according to the set part.
S260, acquiring a target feature matrix according to the word segmentation list; and inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring the target behavior classification output by the behavior classification model.
In an optional implementation manner of this embodiment, inputting the target feature matrix into a pre-trained behavior classification model, and obtaining the target behavior classification output by the behavior classification model may include:
judging whether at least one element value in the target feature matrix is non-zero; and if so, inputting the target feature matrix into a pre-trained behavior classification model, and acquiring the target behavior classification output by the behavior classification model.
It should be noted that, when every element value of the target feature matrix is zero, the current URL to be analyzed does not hit any behavior classification. Because the behavior classification model is trained on samples corresponding to the current behavior classifications, it cannot determine the behavior classification of a URL that does not relate to any of them; in this case, it is not necessary to input the target feature matrix into the behavior classification model.
In this embodiment, before performing behavior recognition on the target feature matrix through the behavior classification model, the target feature matrix may be detected first; if all the element values of the target feature matrix are detected to be zero, the current target feature matrix can be directly discarded, and the URL to be analyzed is abandoned. If at least one element value of the target characteristic matrix is detected to be not zero, the URL to be analyzed at least hits a behavior classification; at this time, the target feature matrix may be input to the trained behavior classification model to determine the behavior classification of the target feature matrix through the behavior classification model.
In this embodiment, before performing behavior recognition on the target feature matrix through the trained behavior classification model, each element value of the target feature matrix is detected in advance to determine whether to input the target feature matrix into the behavior classification model, so that the data amount required to be processed by the behavior classification model can be reduced, and the data processing pressure of the system can be reduced.
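The pre-check described above amounts to a guard in front of the model. The following sketch assumes the target feature matrix is held as a flat list of counts and that `predict` is any callable standing in for the trained behavior classification model; both are assumptions for illustration.

```python
from typing import Callable, List, Optional


def classify_if_hit(
    target_vector: List[int], predict: Callable[[List[int]], str]
) -> Optional[str]:
    """Call the model only when at least one element of the vector is non-zero."""
    if not any(target_vector):
        # No behavior keyword library was hit, so the URL is dropped without classification.
        return None
    return predict(target_vector)
```

Returning `None` for an all-zero vector keeps such URLs away from the model, which is where the reduction in processed data comes from.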
S270, determining the cheating degree of the user according to the target behavior classification, and carrying out fraud early warning according to the cheating degree of the user.
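One possible way to turn a target behavior classification into a cheating degree and a warning decision is a simple lookup, sketched below. The numeric degrees, the grouping of behaviors, and the warning threshold are purely hypothetical values chosen for this example; this passage does not define them.

```python
from typing import Dict

# Hypothetical mapping from behavior classification to cheating degree.
# The numeric values are assumptions for illustration only.
DEGREE_BY_BEHAVIOR: Dict[str, int] = {
    "registration": 1,
    "login": 1,
    "visit personal center": 2,
    "query records": 2,
    "contact online customer service": 2,
    "recharge and withdrawal": 3,
    "bet and buy": 3,
}


def cheating_degree(behavior: str) -> int:
    """Return the assumed cheating degree, or 0 for an unknown behavior."""
    return DEGREE_BY_BEHAVIOR.get(behavior, 0)


def early_warning(behavior: str, threshold: int = 2) -> bool:
    """Decide whether a fraud early warning is raised (threshold is assumed)."""
    return cheating_degree(behavior) >= threshold
```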
According to the technical scheme provided by the embodiment of the invention, a sample characteristic matrix is obtained according to a comparison result by obtaining a sample URL and comparing the sample URL with each preset behavior keyword library; obtaining training samples according to the sample characteristic matrix and the behavior classification of the sample URL; training an initial behavior classification model constructed based on a naive Bayesian algorithm through a training sample to obtain a trained behavior classification model, so that the accuracy of the obtained behavior classification model is improved; further, by acquiring a Uniform Resource Locator (URL) to be analyzed, when the preset domain name blacklist comprises the current URL, extracting a set part of the URL, and acquiring a word segmentation list according to the set part; further acquiring a target characteristic matrix according to the word segmentation list; inputting the target characteristic matrix into a pre-trained behavior classification model, and obtaining a target behavior classification output by the behavior classification model; and finally, determining the cheating degree of the user according to the target behavior classification, and carrying out fraud early warning according to the cheating degree of the user, so that the online early warning of the phishing is realized, the risk of the phishing of the user is reduced, and the personal and property losses of the user are avoided.
In a specific implementation manner of this embodiment, as shown in fig. 2B, first, a URL list to be analyzed is traversed to determine whether a domain name blacklist is hit; if the current URL to be analyzed is determined not to hit the domain name blacklist, whether the next URL to be analyzed hits the domain name blacklist or not is continuously judged until traversal of the URL list is completed; and if the current URL to be analyzed is determined to hit the domain name blacklist, extracting the PATH part of the URL, splitting words of the PATH part, and filtering invalid keywords of the split words to obtain a word splitting list. Further, matching the word segmentation list with a keyword library to obtain a feature matrix; finally, the feature matrix is input into a behavior classification model to predict the user behavior and determine the cheating degree of the user.
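The first part of this flow, traversing the URL list, checking the domain name blacklist, and extracting the PATH part, can be sketched with the Python standard library. The function name `blacklisted_paths` and the example domains are assumptions; matching on the exact host name is also an assumption, since the passage does not say how a blacklist hit is determined.

```python
from typing import Iterable, Iterator, Set, Tuple
from urllib.parse import urlsplit


def blacklisted_paths(
    urls: Iterable[str], blacklist: Set[str]
) -> Iterator[Tuple[str, str]]:
    """Yield (url, path) for every URL whose host is in the domain name blacklist."""
    for url in urls:
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        if host not in blacklist:
            # No hit: continue with the next URL to be analyzed.
            continue
        yield url, parts.path


# Hypothetical input data.
url_list = [
    "http://good.example/index",
    "http://bad.example/user/login?x=1",
]
hits = list(blacklisted_paths(url_list, {"bad.example"}))
```

Only the URLs that hit the blacklist reach the later word-splitting and matching steps, so the more expensive processing is limited to traffic toward known fraud domains.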
Fig. 3 is a schematic structural diagram of a phishing early warning device according to another embodiment of the present invention. As shown in fig. 3, the device includes: a to-be-analyzed URL acquisition module 310, a word segmentation list acquisition module 320, a target behavior classification acquisition module 330 and a user cheating degree determination module 340, wherein:
a URL to be analyzed obtaining module 310, configured to obtain a URL to be analyzed, and determine whether a preset domain name blacklist includes the URL;
a word segmentation list obtaining module 320, configured to, if yes, extract a set portion of the URL, and obtain a word segmentation list according to the set portion;
a target behavior classification obtaining module 330, configured to obtain a target feature matrix according to the word segmentation list; inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring a target behavior classification output by the behavior classification model;
and the user cheating degree determining module 340 is configured to determine the cheating degree of the user according to the target behavior classification, and perform fraud early warning according to the cheating degree of the user.
According to the technical scheme provided by the embodiment of the invention, a Uniform Resource Locator (URL) to be analyzed is obtained, when the preset domain name blacklist comprises the current URL, a set part of the URL is extracted, and a word segmentation list is obtained according to the set part; further acquiring a target feature matrix according to the word segmentation list; inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring a target behavior classification output by the behavior classification model; and finally, determining the cheating degree of the user according to the target behavior classification, and carrying out fraud early warning according to the cheating degree of the user, so that the online early warning of the phishing is realized, the risk of the phishing of the user is reduced, and the personal and property losses of the user are avoided.
Optionally, on the basis of the foregoing technical solution, the participle list obtaining module 320 is specifically configured to extract a set portion of the URL, and segment the set portion according to a preset spacer to obtain at least one participle corresponding to the set portion; and acquiring a word segmentation list according to the at least one word segmentation.
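Segmenting the set part according to preset spacers can be sketched with a regular expression split. The particular spacer characters in `SPACERS` and the lower-casing of tokens are assumptions for this example; the preset spacers are not enumerated here.

```python
import re
from typing import List

# Assumed spacer characters: slash, underscore, hyphen, dot, and query delimiters.
SPACERS = r"[/_\-.?=&]+"


def segment(set_part: str, spacers: str = SPACERS) -> List[str]:
    """Split the set part of a URL on the spacers and drop empty pieces."""
    return [token.lower() for token in re.split(spacers, set_part) if token]


word_list = segment("/user/Login_page.html")
```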
Optionally, on the basis of the foregoing technical solution, the target behavior classification obtaining module 330 includes:
the keyword quantity obtaining unit is used for respectively comparing the word segmentation list with each preset behavior keyword library and respectively obtaining the number of the keywords in each preset behavior keyword library included in the word segmentation list;
the initial feature matrix obtaining unit is used for obtaining an initial feature matrix according to the number of the preset behavior keyword libraries;
the target feature matrix obtaining unit is used for filling the number of keywords of each preset behavior keyword library that are included in the word segmentation list into the initial feature matrix to obtain the target feature matrix;
the initial characteristic matrix is a zero matrix of 1 multiplied by N, N represents the number of the preset behavior keyword libraries, and the preset behavior keyword libraries correspond to the behavior classifications one by one.
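The construction of the 1×N target feature matrix can be sketched as follows. The function name, the example keyword libraries, and the choice to count every matching token (rather than every distinct matching keyword) are assumptions for illustration.

```python
from typing import Dict, List, Set


def target_feature_matrix(
    tokens: List[str], libraries: Dict[str, Set[str]]
) -> List[List[int]]:
    """Build a 1 x N matrix with one column per preset behavior keyword library."""
    matrix = [[0] * len(libraries)]  # initial 1 x N zero matrix
    for column, keywords in enumerate(libraries.values()):
        # Number of tokens in the word segmentation list that belong to this library.
        matrix[0][column] = sum(1 for token in tokens if token in keywords)
    return matrix


# Hypothetical keyword libraries, one per behavior classification.
libraries = {
    "login": {"login", "signin"},
    "recharge": {"pay", "recharge"},
    "bet": {"bet"},
}
features = target_feature_matrix(["user", "login", "pay", "pay"], libraries)
```

Since the libraries correspond one-to-one with the behavior classifications, each column can be read directly as the evidence for one behavior.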
Optionally, on the basis of the foregoing technical solution, the user cheating degree determining module 340 includes:
the element value judging unit is used for judging whether at least one element value in the target characteristic matrix is not zero or not; and if so, inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring the target behavior classification output by the behavior classification model.
Optionally, on the basis of the above technical solution, the phishing early warning device further includes:
a set part extraction module for acquiring URLs corresponding to different classification behaviors executed in the fraud website and extracting a set part of the URLs;
the initial segmentation acquisition module is used for segmenting the set part according to a preset spacer to acquire initial segmentation;
and the preset behavior keyword library acquisition module is used for performing duplication removal operation and invalid word segmentation filtering operation on the initial word segmentation to acquire a preset behavior keyword library corresponding to each behavior classification.
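Building the preset behavior keyword libraries from URLs collected on a fraud website can be sketched as below. The stop list `INVALID`, the rule that purely numeric tokens are invalid, and the example URLs are assumptions; the criteria for an invalid word segment are not spelled out in this passage.

```python
import re
from typing import Dict, Iterable, Set
from urllib.parse import urlsplit

# Assumed list of invalid tokens that carry no behavioral meaning.
INVALID = {"html", "php", "index", "www"}


def build_keyword_libraries(
    urls_by_behavior: Dict[str, Iterable[str]]
) -> Dict[str, Set[str]]:
    """Return one de-duplicated keyword set per behavior classification."""
    libraries: Dict[str, Set[str]] = {}
    for behavior, urls in urls_by_behavior.items():
        keywords: Set[str] = set()  # a set removes duplicates
        for url in urls:
            path = urlsplit(url).path
            for token in re.split(r"[/_\-.]+", path):
                token = token.lower()
                if token and token not in INVALID and not token.isdigit():
                    keywords.add(token)
        libraries[behavior] = keywords
    return libraries


# Hypothetical URLs recorded while performing each behavior.
keyword_libraries = build_keyword_libraries({
    "login": [
        "http://bad.example/user/login.html",
        "http://bad.example/user/login/2",
    ],
    "recharge": ["http://bad.example/pay/recharge.php"],
})
```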
Optionally, on the basis of the above technical solution, the phishing early warning device further includes:
the sample feature matrix acquisition module is used for acquiring a sample URL, comparing the sample URL with each preset behavior keyword library, and acquiring a sample feature matrix according to the comparison result;
the training sample acquisition module is used for acquiring training samples according to the sample characteristic matrix and the behavior classification of the sample URL;
the model training module is used for training the initial behavior classification model through the training sample to obtain a trained behavior classification model; the initial behavior classification model is constructed based on a naive Bayes algorithm.
Optionally, on the basis of the above technical solution, the behavior classification includes at least one of registration, login, recharge and withdrawal, visit to a personal center, bet and buy, query records, and contact with online customer service.
The device can execute the phishing early warning method provided by the embodiment of the invention, and has corresponding functional modules and beneficial effects for executing the method. For the technical details not described in detail in the embodiments of the present invention, reference may be made to the phishing warning method provided in the foregoing embodiments of the present invention.
Fig. 4 is a schematic structural diagram of an electronic device according to another embodiment of the present invention. As shown in fig. 4, the electronic device includes a processor 410, a memory 420, an input device 430, and an output device 440; the number of processors 410 in the electronic device may be one or more, and one processor 410 is taken as an example in fig. 4; the processor 410, the memory 420, the input device 430 and the output device 440 in the electronic device may be connected by a bus or by other means, and a bus connection is taken as an example in fig. 4. The memory 420 serves as a computer-readable storage medium for storing software programs, computer-executable programs, and modules, such as the program instructions/modules corresponding to the phishing early warning method in any embodiment of the present invention (e.g., the to-be-analyzed URL acquisition module 310, the word segmentation list acquisition module 320, the target behavior classification acquisition module 330, and the user cheating degree determination module 340 in the phishing early warning device). The processor 410 executes various functional applications and data processing of the electronic device by running the software programs, instructions and modules stored in the memory 420, so as to implement the above-mentioned phishing early warning method. That is, the program, when executed by the processor, implements:
acquiring a Uniform Resource Locator (URL) to be analyzed, and judging whether a preset domain name blacklist comprises the URL;
if so, extracting a set part of the URL, and acquiring a word segmentation list according to the set part;
acquiring a target feature matrix according to the word segmentation list; inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring a target behavior classification output by the behavior classification model;
and determining the cheating degree of the user according to the target behavior classification, and carrying out fraud early warning according to the cheating degree of the user.
The memory 420 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function; the storage data area may store data created according to the use of the terminal, and the like. Further, the memory 420 may include high speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid state storage device. In some examples, memory 420 may further include memory located remotely from processor 410, which may be connected to an electronic device over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof. The input device 430 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function controls of the electronic device, and may include a keyboard, a mouse, and the like. The output device 440 may include a display device such as a display screen.
Optionally, the electronic device may be a server, and the server may be an independent server, or may be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, middleware service, a domain name service, a security service, a Content Delivery Network (CDN), a big data and artificial intelligence platform, and the like.
Embodiments of the present invention further provide a computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the method according to any of the embodiments of the present invention. The computer-readable storage medium provided by the embodiment of the present invention can likewise perform the related operations of the phishing early warning method provided in any embodiment of the present invention. That is, the program, when executed by the processor, implements:
acquiring a Uniform Resource Locator (URL) to be analyzed, and judging whether a preset domain name blacklist comprises the URL;
if so, extracting a set part of the URL, and acquiring a word segmentation list according to the set part;
acquiring a target feature matrix according to the word segmentation list; inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring a target behavior classification output by the behavior classification model;
and determining the cheating degree of the user according to the target behavior classification, and carrying out fraud early warning according to the cheating degree of the user.
From the above description of the embodiments, it is obvious for a person skilled in the art that the present invention can be implemented by software and necessary general hardware, and certainly by hardware, but the former is a better embodiment in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as a floppy disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a FLASH Memory (FLASH), a hard disk or an optical disk of a computer, and includes instructions for enabling an electronic device (which may be a personal computer, a server, or a network device) to execute the methods according to the embodiments of the present invention.
It should be noted that, in the embodiment of the phishing early warning device, the units and modules included in the embodiment are only divided according to the functional logic, but not limited to the above division, as long as the corresponding functions can be implemented; in addition, specific names of the functional units are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present invention.
It is to be noted that the foregoing is only illustrative of the preferred embodiments of the present invention and the technical principles employed. It will be understood by those skilled in the art that the present invention is not limited to the particular embodiments described herein, but is capable of various obvious changes, rearrangements and substitutions without departing from the scope of the invention. Therefore, although the present invention has been described in more detail by the above embodiments, the present invention is not limited to the above embodiments, and may include other equivalent embodiments without departing from the spirit of the present invention, and the scope of the present invention is determined by the scope of the appended claims.

Claims (10)

1. A phishing early warning method, comprising:
acquiring a Uniform Resource Locator (URL) to be analyzed, and judging whether a preset domain name blacklist comprises the URL;
if so, extracting a set part of the URL, and acquiring a word segmentation list according to the set part;
acquiring a target feature matrix according to the word segmentation list; inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring a target behavior classification output by the behavior classification model;
and determining the cheating degree of the user according to the target behavior classification, and carrying out fraud early warning according to the cheating degree of the user.
2. The method according to claim 1, wherein extracting a set portion of the URL and obtaining a word segmentation list according to the set portion comprises:
extracting a set part of the URL, segmenting the set part according to a preset spacer, and acquiring at least one participle corresponding to the set part;
and acquiring a word segmentation list according to the at least one word segmentation.
3. The method of claim 1, wherein obtaining a target feature matrix from the word segmentation list comprises:
comparing the word segmentation list with each preset behavior keyword library respectively to obtain the number of the keywords in each preset behavior keyword library included in the word segmentation list;
acquiring an initial characteristic matrix according to the number of a preset behavior keyword library;
filling the number of keywords of each preset behavior keyword library that are included in the word segmentation list into the initial characteristic matrix to obtain a target characteristic matrix;
the initial characteristic matrix is a zero matrix of 1 multiplied by N, N represents the number of the preset behavior keyword libraries, and the preset behavior keyword libraries correspond to the behavior classifications one by one.
4. The method of claim 1, wherein inputting the target feature matrix into a pre-trained behavior classification model, and obtaining a target behavior classification output by the behavior classification model comprises:
judging whether at least one element value in the target characteristic matrix is non-zero;
and if so, inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring the target behavior classification output by the behavior classification model.
5. The method of claim 1, further comprising:
acquiring URLs corresponding to different classification behaviors executed in the fraud website, and extracting a set part of the URL;
segmenting the set part according to a preset spacer to obtain initial segmentation;
and performing duplication removing operation and invalid word segmentation filtering operation on the initial word segmentation to obtain a preset behavior keyword library corresponding to each behavior classification.
6. The method of claim 5, wherein after performing de-duplication and invalid segmentation filtering operations on the initial segmentation to obtain a preset behavior keyword library corresponding to each behavior classification, the method further comprises:
acquiring a sample URL, comparing the sample URL with each preset behavior keyword library, and acquiring a sample characteristic matrix according to a comparison result;
obtaining a training sample according to the sample characteristic matrix and the behavior classification of the sample URL;
training an initial behavior classification model through the training samples to obtain a trained behavior classification model; the initial behavior classification model is constructed based on a naive Bayes algorithm.
7. The method of any of claims 1-6, wherein the behavior classification includes at least one of registration, login, recharge and withdrawal, visit to a personal center, bet and buy, query records, and contact with online customer service.
8. A phishing early warning device, comprising:
the to-be-analyzed URL acquisition module is used for acquiring a Uniform Resource Locator (URL) to be analyzed and judging whether a preset domain name blacklist comprises the URL;
the word segmentation list acquisition module is used for, if so, extracting a set part of the URL and acquiring a word segmentation list according to the set part;
the target behavior classification acquisition module is used for acquiring a target characteristic matrix according to the word segmentation list; inputting the target characteristic matrix into a pre-trained behavior classification model, and acquiring a target behavior classification output by the behavior classification model;
and the user cheating degree determining module is used for determining the cheating degree of the user according to the target behavior classification and carrying out fraud early warning according to the cheating degree of the user.
9. An electronic device, comprising:
one or more processors;
storage means for storing one or more computer programs;
when the one or more computer programs are executed by the one or more processors, the one or more processors implement the phishing early warning method as recited in any one of claims 1-7.
10. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when being executed by a processor, implements the phishing early warning method as recited in any one of claims 1-7.
CN202111101612.6A 2021-09-18 2021-09-18 Phishing early warning method, device, equipment and storage medium Pending CN114048311A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111101612.6A CN114048311A (en) 2021-09-18 2021-09-18 Phishing early warning method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111101612.6A CN114048311A (en) 2021-09-18 2021-09-18 Phishing early warning method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114048311A true CN114048311A (en) 2022-02-15

Family

ID=80204452

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111101612.6A Pending CN114048311A (en) 2021-09-18 2021-09-18 Phishing early warning method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114048311A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115460059A (en) * 2022-07-28 2022-12-09 浪潮通信信息系统有限公司 Risk early warning method and device
CN115460059B (en) * 2022-07-28 2024-03-08 浪潮通信信息系统有限公司 Risk early warning method and device

Similar Documents

Publication Publication Date Title
CN107204960B (en) Webpage identification method and device and server
CN104156490A (en) Method and device for detecting suspicious fishing webpage based on character recognition
CN107257390B (en) URL address resolution method and system
US20090089279A1 (en) Method and Apparatus for Detecting Spam User Created Content
CN103685308A (en) Detection method and system of phishing web pages, client and server
CN103685307A (en) Method, system, client and server for detecting phishing fraud webpage based on feature library
US20170289082A1 (en) Method and device for identifying spam mail
CN108600172B (en) Method, device and equipment for detecting database collision attack and computer readable storage medium
CN113098887A (en) Phishing website detection method based on website joint characteristics
CN113779481B (en) Method, device, equipment and storage medium for identifying fraud websites
CN112532624B (en) Black chain detection method and device, electronic equipment and readable storage medium
Deshpande et al. Detection of phishing websites using Machine Learning
CN104239582A (en) Method and device for identifying phishing webpage based on feature vector model
CN112131507A (en) Website content processing method, device, server and computer-readable storage medium
CN116015842A (en) Network attack detection method based on user access behaviors
CN106790025B (en) Method and device for detecting link maliciousness
CN114048311A (en) Phishing early warning method, device, equipment and storage medium
CN113965377A (en) Attack behavior detection method and device
CN116318974A (en) Site risk identification method and device, computer readable medium and electronic equipment
CN115801455B (en) Method and device for detecting counterfeit website based on website fingerprint
CN115879110A (en) System for identifying financial risk website based on fingerprint penetration technology
CN112468444B (en) Internet domain name abuse identification method and device, electronic equipment and storage medium
CN114528908A (en) Network request data classification model training method, classification method and storage medium
CN115051859A (en) Information analysis method, information analysis device, electronic apparatus, and medium
CN114363039A (en) Method, device, equipment and storage medium for identifying fraud websites

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination