CN117749496A - Email-based risk prompt information generation method, device and medium - Google Patents

Info

Publication number
CN117749496A
Authority
CN
China
Prior art keywords
prompt
mail
preset
generating
target mail
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311780789.2A
Other languages
Chinese (zh)
Inventor
林延中
左自清
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangdong Yingshi Computer Technology Co ltd
Original Assignee
Guangdong Yingshi Computer Technology Co ltd

Landscapes

  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method, a device and a medium for generating risk prompt information based on mail. The method comprises: acquiring a target mail; performing SPF verification on the target mail and generating a prompt according to the verification result; calculating a probability value from the signaling domain name and generating a prompt by combining the probability value with a lower confidence limit and spelling features; generating a prompt by segmenting the uniform resource locator and converting it to real numbers using a neural network model; performing feature analysis on macro attachments, suspected impersonated names and monetary information, and generating a corresponding prompt for each; and generating risk prompt information according to the prompts. By analyzing and detecting the target mail from six different angles, the invention generates comprehensive risk prompt information, which solves the problem that effective, comprehensive risk prompt information is difficult to generate from mail content and avoids missed reminders.

Description

Email-based risk prompt information generation method, device and medium
Technical Field
The present invention relates to the field of email technology, and in particular, to a method, an apparatus, and a medium for generating risk prompt information based on email.
Background
As a means of communication, email has a simple protocol, easily obtained mailbox addresses and a wide range of use. It serves normal communication, but it is also abused by people with ulterior motives to spread spam, phishing mails and the like. Compared with ordinary spam, a carefully designed phishing mail closely resembles ordinary mail in content, format and so on, and a reader without enough experience can hardly identify the fraudulent parts it contains. On the other hand, compared with the recognition scenarios for fraud such as cash fraud and transfers, mail-based prevention schemes offer few ways to engage the user's own initiative and arouse the user's vigilance so as to reduce the success rate of fraud. The prior art mainly warns the user that a mail is risky by marking unfamiliar mails and by providing an intermediate redirect page for external links.
However, direct marking and intermediate redirect pages are mechanical, so the user easily overlooks the mark and is deceived. Moreover, direct marking flags multiple abnormal points and reminds the user multiple times, so prompts are frequent, the user experience is poor and prompts are easily missed. Although mail clients display the real URL (uniform resource locator) or a SPAM tag, most users lack the knowledge to understand such strings and tags, so the tags are ineffective. Finally, an intermediate redirect page can only judge whether the link target is a domain name owned by the user, so prompting accuracy is low.
Disclosure of Invention
The invention provides a method, a device and a medium for generating risk prompt information based on mail, which solve the problem that effective, comprehensive risk prompt information is difficult to generate from mail content and avoid missed reminders.
In order to solve the above problems, the present invention provides a method for generating risk prompt information based on mail, including:
acquiring a target mail;
performing SPF verification on the target mail, and generating a first prompt according to the verification result;
calculating a probability value by using a state transition probability method according to the spelling of the signaling domain name in the target mail, and generating a second prompt by combining the probability value with a preset lower confidence limit and spelling features in the target mail;
generating a third prompt by using a preset neural network model to perform segmentation and real number conversion on the uniform resource locator in the target mail;
performing feature analysis on the macro attachments, suspected impersonated names and monetary information in the target mail to generate a fourth prompt, a fifth prompt and a sixth prompt respectively;
and generating risk prompt information according to the first prompt, the second prompt, the third prompt, the fourth prompt, the fifth prompt and the sixth prompt.
By applying a different detection method to each of six angles (SPF verification, spelling of the signaling domain name, uniform resource locator, macro attachment, suspected impersonated name and monetary information), the invention obtains a corresponding prompt for each and then generates highly comprehensive risk prompt information, so that risk prompting is carried out quickly and effectively and the safety of the target mail is ensured. The probability value calculated by the state transition probability method reflects how abnormal the mailbox address of the target mail is, and thus provides credibility support for generating the second prompt. Segmenting the uniform resource locator with the neural network model addresses the problem that uniform resource locators are long, makes them easier to analyze, and speeds up the acquisition of the third prompt.
Compared with the prior art, the invention analyzes and detects the target mail from six different angles and generates comprehensive risk prompt information, so that the user receives one effective risk prompt and missed reminders are avoided. In addition, because each prompt is obtained through specific data analysis and calculation, the prompts are more credible, which avoids vague, untargeted risk prompt information and helps the user avoid risks. The invention thereby solves the problem that effective, comprehensive risk prompt information is difficult to generate from mail content.
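As an illustration of the final step, the six prompts might be merged into one message as sketched below. This is a minimal sketch, not the patented implementation: the function name `risk_prompt_information`, the convention that an absent prompt is `None` or an empty string, and the message wording are all assumptions made for the example.

```python
from typing import Iterable, Optional


def risk_prompt_information(prompts: Iterable[Optional[str]]) -> str:
    """Merge the individual prompts into a single numbered message.

    Assumption: a check that found nothing returns None or "".
    """
    active = [p for p in prompts if p]
    if not active:
        return ""
    lines = [f"{i}. {p}" for i, p in enumerate(active, start=1)]
    return "Risk reminders for this mail:\n" + "\n".join(lines)
```

Presenting all findings in one message matches the stated goal of reminding the user once rather than marking each abnormal point separately.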
As a preferred scheme, according to spelling of the signaling domain name in the target mail, calculating a probability value by using a state transition probability method, and generating a second prompt by combining the probability value with a preset confidence lower limit and spelling features in the target mail, wherein the method specifically comprises the following steps:
traversing the signaling domain name in the target mail by using a preset N value through an N-Gram algorithm to generate a character substring;
combining a preset character set, and constructing all possible N-Gram tuples according to the character substrings to obtain a state transition probability matrix;
splitting the signaling domain name in the state transition probability matrix to obtain a plurality of tuples;
calculating the average state transition probability value of the plurality of tuples to obtain the probability value;
and comparing the probability value with the lower limit of the confidence coefficient, and generating a second prompt by combining a comparison result with spelling features in the target mail.
This preferred scheme detects the sender's mailbox address and then generates the second prompt. Starting from the spelling of the signaling domain name, constructing all possible N-Gram tuples from the character substrings amounts to constructing a Markov chain. Building the Markov chain over the spelling and taking the average state transition probability value as the probability value of the mailbox address provides credibility support for generating the second prompt.
As a preferred scheme, comparing the probability value with a preset lower confidence limit, and generating a second prompt by combining the comparison result with spelling features in the target mail, specifically comprises:
comparing the probability value with a preset confidence coefficient lower limit to obtain a comparison result; the confidence lower limit is obtained by calculating probability values of all domain names in a preset trusted domain name list;
performing a normative check on the spelling features of vowels and special symbols in the target mail to obtain a detection result;
and generating the second prompt according to the comparison result, the detection result and a low-frequency communication top-level domain list of a mailbox where the target mail is located.
In this preferred scheme, the lower confidence limit is calculated from the probability values of all domain names in a preset trusted domain name list and therefore represents the minimum credibility of a mail. Comparing the probability value with the lower confidence limit yields a comparison result that reflects, against this official baseline, how abnormal the mailbox address of the current target mail is. Combining this with the spelling feature detection result of the target mail and the low-frequency communication top-level domain list takes into account both the composition of the mailbox address and the top-level domains that the mailbox rarely communicates with. The second prompt therefore results not only from analyzing the target mail from an external angle but also from considering its internal details, which makes it more objective, accurate and effective.
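The way the three inputs could be combined into the second prompt is sketched below. The text does not specify how the vowel and special-symbol check works, so the vowel-ratio and symbol-count thresholds, the function names and the prompt wording are illustrative assumptions only.

```python
VOWELS = set("aeiou")


def spelling_flags(domain, min_vowel_ratio=0.2, max_special=2):
    """Flag unusual spelling in the part of the domain before the last dot.

    The thresholds are hypothetical, not taken from the source.
    """
    label = domain.lower().rsplit(".", 1)[0]
    letters = [c for c in label if c.isalpha()]
    vowel_ratio = sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0
    special = sum(1 for c in label if not c.isalnum() and c != ".")
    flags = []
    if vowel_ratio < min_vowel_ratio:
        flags.append("few vowels")
    if special > max_special:
        flags.append("many special symbols")
    return flags


def second_prompt(domain, prob, lower_limit, low_freq_tlds):
    """Combine the comparison result, spelling check and low-frequency TLD list."""
    reasons = []
    if prob < lower_limit:
        reasons.append("unusual spelling probability")
    reasons += spelling_flags(domain)
    tld = domain.lower().rsplit(".", 1)[-1]
    if tld in low_freq_tlds:
        reasons.append("rarely used top-level domain")
    if not reasons:
        return None
    return f"Sender domain {domain} looks unusual: " + "; ".join(reasons)
```

Here `low_freq_tlds` stands in for the low-frequency communication top-level domain list of the mailbox, and `prob` for the average state transition probability value of the domain.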
As a preferred solution, the lower confidence limit is obtained by calculating probability values of all domain names in a preset trusted domain name list, specifically:
acquiring a world comprehensive ranking list of a website, and constructing a trusted domain name list by using part of list information in the world comprehensive ranking list;
calculating probability values of all domain names in the trusted domain name list to obtain a probability value set;
and taking the minimum value in the probability value set as the lower confidence limit.
The trusted domain name list in this preferred scheme is built on the world comprehensive ranking list and is therefore highly reliable, which ensures that the lower confidence limit is a comprehensive credibility standard against which the probability value can be compared to generate the second prompt.
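The three steps above can be sketched as follows. The scoring function is passed in as a parameter because it stands for the probability value calculation described earlier; the function name and the toy scores in the usage note are hypothetical.

```python
from typing import Callable, Iterable


def confidence_lower_limit(trusted_domains: Iterable[str],
                           score: Callable[[str], float]) -> float:
    """Return the minimum probability value over the trusted domain name list."""
    values = [score(domain) for domain in trusted_domains]
    if not values:
        raise ValueError("trusted domain name list is empty")
    return min(values)
```

For instance, with hypothetical scores `{"example.com": 0.42, "example.org": 0.37, "example.net": 0.51}`, calling `confidence_lower_limit(scores, scores.get)` yields the smallest of the three values, and any sender domain scoring below it would be treated as abnormal.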
As a preferred scheme, a preset neural network model is used, and the third prompt is generated by splitting and real number conversion of the uniform resource locator in the target mail, specifically:
using a preset neural network model to segment the uniform resource locator in the target mail into a first character set;
screening out characters of the first character set which are not in a preset character set, obtaining a second character set, and defining character identifiers of the second character set as unknown characters;
adding a preset beginning character and a preset ending character at the head and the tail of the unknown character, respectively, to obtain a numbering list related to the second character set;
and carrying out transformation processing on the numbering list by using a two-class network structure in the neural network model to obtain the abnormal probability of the uniform resource locator, and generating the third prompt according to the abnormal probability.
This preferred scheme detects the uniform resource locators in the mail and then generates the third prompt. Segmenting and screening the uniform resource locator identifies the unknown characters in the target mail, and adding characters at the head and the tail makes the boundaries of the characters in the numbering list explicit, which speeds up the transformation processing, the acquisition of the abnormal probability of the uniform resource locator, and the generation of the third prompt. Compared with the signaling domain name, a uniform resource locator in a mail is long, so it is not suited to judgment by simple state transition probability and character counts; using the neural network model to obtain the abnormal probability saves considerable analysis time and improves the accuracy of the third prompt.
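A minimal sketch of the character-to-number step is given below. It covers only the encoding that would be fed to the binary classification network, not the network itself. The reserved ids, the character set, the fixed length with padding and the name `encode_url` are assumptions for illustration, and the beginning and ending markers are applied to the whole sequence, which is one possible reading of the text.

```python
import string

# Hypothetical reserved ids: padding, unknown, beginning and ending markers.
PAD, UNK, BOS, EOS = 0, 1, 2, 3

# Assumed preset character set: lowercase letters, digits and URL punctuation.
CHARSET = string.ascii_lowercase + string.digits + "-._~:/?#[]@!$&'()*+,;=%"
CHAR_TO_ID = {ch: i + 4 for i, ch in enumerate(CHARSET)}


def encode_url(url: str, max_len: int = 64) -> list:
    """Split a URL into characters and map them to a fixed-length numbering list."""
    body = [CHAR_TO_ID.get(ch, UNK) for ch in url.lower()][: max_len - 2]
    ids = [BOS] + body + [EOS]
    return ids + [PAD] * (max_len - len(ids))
```

Characters outside the preset character set all share the single unknown id, so the classifier sees a bounded vocabulary regardless of what the URL contains.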
As a preferred scheme, performing feature analysis on the macro attachment in the target mail to generate a fourth prompt, which specifically includes:
acquiring macro attachments in the target mail;
if dangerous files in the important attention list appear in the macro attachment, generating a fourth prompt according to the dangerous files;
the important attention list is established according to the type of the forbidden attachments of the mailbox where the target mail is located and preset important attention files.
This preferred scheme detects the attachments carried in the mail and then generates the fourth prompt. Checking the macro attachment against the important attention list quickly determines whether it is a dangerous file, that is, whether the macro attachment in the target mail poses an operational risk, so the fourth prompt is generated simply and quickly.
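One way such a list lookup might look is sketched below, assuming the important attention list is a set of file extensions. That representation, the function names and the prompt wording are assumptions; the text only states that the list is built from the mailbox's forbidden attachment types and preset files of concern.

```python
from pathlib import PurePath


def build_attention_list(forbidden_types, preset_types):
    """Merge the mailbox's forbidden attachment types with the preset types."""
    return {t.lower().lstrip(".") for t in (*forbidden_types, *preset_types)}


def fourth_prompt(attachment_names, attention_list):
    """Return a prompt naming attachments whose extension is on the list."""
    risky = [
        name for name in attachment_names
        if PurePath(name).suffix.lower().lstrip(".") in attention_list
    ]
    if not risky:
        return None
    return "Attachments that may run macros or code: " + ", ".join(risky)
```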
As a preferred scheme, the feature analysis is performed on the suspected impersonation name in the target mail to generate a fifth prompt, specifically:
acquiring a mail header of the target mail, and matching the mail header with a preset dictionary tree to obtain a first matching result; wherein the dictionary tree is established according to lists of government institutions, banks and public inspection institutions and names of enterprise resident institutions;
obtaining the entity name of the target mail, and matching the entity name with a company list and a personal list in a sequence labeling algorithm to obtain a second matching result;
if the first matching result or the second matching result is that the matching is successful, checking the mail sending address of the target mail, and generating the fifth prompt according to the checking result.
This preferred scheme detects whether the name of the mail's subject entity is impersonated and then generates the fifth prompt. Matching first establishes whether the target mail constitutes a declaration. A declaration here is a formal statement issued by an official body such as a government, enterprise or organization, usually to express an official position and attitude in response to some event or situation, and its sending address is generally official and formal. Therefore, once a declaration is identified, checking the mail sending address directly reveals quickly whether the target mail is impersonated, yielding the fifth prompt.
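The dictionary tree match of the first step can be illustrated with a small trie that reports every listed name occurring in a text. This is a sketch under assumptions: the class name, the example names and the choice to scan from every start position are illustrative, and the sequence labeling step is not covered.

```python
class Trie:
    """Minimal dictionary tree for finding listed names inside a text."""

    def __init__(self):
        self.children = {}
        self.terminal = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.terminal = True

    def find_all(self, text):
        """Return every inserted name that occurs in text, in order of position."""
        found = []
        for start in range(len(text)):
            node = self
            for end in range(start, len(text)):
                node = node.children.get(text[end])
                if node is None:
                    break
                if node.terminal:
                    found.append(text[start:end + 1])
        return found
```

A trie loaded with hypothetical entries such as "tax bureau" and "bank" would report "tax bureau" for the header text "notice from the city tax bureau", which would count as a successful first matching result.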
As a preferred scheme, checking the mail sending address of the target mail, and generating the fifth prompt according to the checking result, specifically:
judging the category of the mail sending address of the target mail to obtain a declaration object of the target mail;
if the declaration object is a government agency or a public inspection agency, checking whether the top-level domain name of the target mail belongs to a preset special top-level domain;
if the declaration object is a bank, an invoice service provider or an enterprise, checking whether the domain name used by the target mail belongs to a pre-collected list;
if the declaration object is an enterprise resident institution, checking whether the sender and the recipient of the target mail are in the same domain or a subdomain;
if the declaration object is a personal name, checking whether the declaration object is in a preset address book;
and generating the fifth prompt according to the checking result.
In this preferred scheme, because the mail domains of government institutions and public inspection institutions in most cases use special top-level domains, checking the top-level domain name directly shows whether the mailbox is abnormal. Because enterprise resident institutions usually belong to the same organization, checking whether the sender and the recipient are in the same domain or a subdomain shows whether the mailbox is abnormal. Using the pre-collected list and the address book, whether the mailbox of a bank or a person is abnormal can be judged quickly and directly. The scheme thus provides a different mail sending address check for each of government institutions, banks, enterprise resident institutions, individuals and so on; the checks are highly targeted and speed up the generation of the fifth prompt.
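The per-category checks could be dispatched as in the sketch below. The category labels, the parameter names and the use of `gov.cn` as an example special top-level domain are assumptions. For a personal name, the sketch checks the sender's address against the address book, which is one possible reading of the text.

```python
def check_sender(category, sender, recipient,
                 special_tlds, collected_domains, address_book):
    """Return True if the sending address is consistent with the declared category."""
    domain = sender.rsplit("@", 1)[-1].lower()
    if category in {"government", "judicial"}:
        # The domain must equal or end with a preset special top-level domain.
        return any(domain == t or domain.endswith("." + t) for t in special_tlds)
    if category in {"bank", "invoice", "enterprise"}:
        return domain in collected_domains
    if category == "internal":
        # Sender and recipient must share a domain or be in a subdomain relationship.
        rdomain = recipient.rsplit("@", 1)[-1].lower()
        return (domain == rdomain
                or domain.endswith("." + rdomain)
                or rdomain.endswith("." + domain))
    if category == "person":
        return sender.lower() in address_book
    return False
```

A `False` result would lead to a fifth prompt warning that the claimed identity and the sending address do not match.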
As a preferred scheme, feature analysis is performed on the monetary information in the target mail, and a sixth prompt is generated, specifically:
acquiring text information of the target mail;
cutting the text information by using a preset classifier to obtain a plurality of character strings;
matching the character strings with a preset money word list, and generating the sixth prompt according to a matching result;
wherein the monetary vocabulary is established according to preset financial fraud information.
In this preferred scheme, the classifier segments the text information, that is, it extracts segments of data from the original mail text, so that the words in the resulting character strings are easy to compare and match against the monetary vocabulary, which speeds up the acquisition of the sixth prompt.
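The matching step might look like the following sketch. A simple regular-expression tokenizer stands in for the preset classifier, and the word list is a hypothetical English example; the actual classifier and the vocabulary derived from financial fraud information are not specified in the text.

```python
import re

# Hypothetical monetary vocabulary, for illustration only.
MONEY_WORDS = {"transfer", "remittance", "invoice", "refund", "payment"}


def sixth_prompt(text, money_words=MONEY_WORDS):
    """Split the mail text into lowercase words and match them against the list."""
    tokens = re.findall(r"[a-z]+", text.lower())  # stand-in for the classifier
    hits = sorted(set(tokens) & set(money_words))
    if not hits:
        return None
    return "This mail mentions money-related terms: " + ", ".join(hits)
```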
The invention also provides a risk prompt message generating device based on the mail, which comprises a mail obtaining module, a first generating module, a second generating module, a third generating module, a fourth generating module and a summarizing module;
the mail acquisition module is used for acquiring a target mail;
the first generation module is used for performing SPF (Sender Policy Framework) verification on the target mail and generating a first prompt according to a verification result;
The second generation module is used for calculating a probability value by using a state transition probability method according to spelling of the signaling domain name in the target mail, and generating a second prompt by combining the probability value with a preset confidence lower limit and spelling features in the target mail;
the third generation module is used for generating a third prompt by using a preset neural network model and carrying out segmentation and real number conversion on the uniform resource locator in the target mail;
the fourth generation module is used for performing feature analysis on the macro attachments, suspected impersonated names and monetary information in the target mail to generate a fourth prompt, a fifth prompt and a sixth prompt respectively;
and the summarization module is used for generating risk prompt information according to the first prompt, the second prompt, the third prompt, the fourth prompt, the fifth prompt and the sixth prompt.
As a preferred scheme, the second generation module comprises a character unit, a matrix unit, a splitting unit, a probability value unit and a comparison unit;
the character unit is used for traversing the signaling domain name in the target mail by using a preset N value through an N-Gram algorithm to generate a character substring;
The matrix unit is used for combining a preset character set, constructing all possible N-Gram tuples according to the character substring, and obtaining a state transition probability matrix;
the splitting unit is used for splitting the signaling domain name in the state transition probability matrix to obtain a plurality of tuples;
the probability value unit is used for calculating the average state transition probability value of the plurality of tuples to obtain the probability value;
and the comparison unit is used for comparing the probability value with the lower confidence limit, and generating a second prompt by combining a comparison result with the spelling features in the target mail.
Preferably, the comparison unit includes a first subunit, a second subunit and a third subunit;
the first subunit is configured to compare the probability value with a preset lower confidence limit to obtain a comparison result; the lower confidence limit is obtained by calculating probability values of all domain names in a preset trusted domain name list;
the second subunit is used for normative detection of character spelling features of vowels and special symbols in the target mail to obtain a detection result;
And the third subunit is configured to generate the second prompt according to the comparison result, the detection result, and a low-frequency communication top domain list of a mailbox where the target mail is located.
As a preferred solution, the lower confidence limit is obtained by calculating probability values of all domain names in a preset trusted domain name list, specifically:
acquiring a world comprehensive ranking list of a website, and constructing a trusted domain name list by using part of list information in the world comprehensive ranking list;
calculating probability values of all domain names in the trusted domain name list to obtain a probability value set;
and taking the minimum value in the probability value set as the lower confidence limit.
As a preferable scheme, the third generating module comprises a first segmentation unit, a screening unit, a reconstruction unit and a transformation unit;
the first segmentation unit is used for segmenting the uniform resource locator in the target mail into a first character set by using a preset neural network model;
the screening unit is used for screening out characters of the first character set which are not in a preset character set to obtain a second character set, and defining character identifiers of the second character set as unknown characters;
The reconstruction unit is used for respectively adding a preset beginning character and a preset ending character at the head and the tail of the unknown character to obtain a numbering list related to the second character set;
the transformation unit is configured to perform transformation processing on the number list by using a two-class network structure in the neural network model, obtain an abnormal probability of the uniform resource locator, and generate the third prompt according to the abnormal probability.
Preferably, the fourth generating module includes a first acquiring unit and a first generating unit;
the first acquiring unit is used for acquiring macro attachments in the target mail;
the first generating unit is configured to generate the fourth prompt according to the dangerous file if a dangerous file in the important attention list appears in the macro attachment;
the important attention list is established according to the type of the forbidden attachments of the mailbox where the target mail is located and preset important attention files.
As a preferable mode, the fourth generating module comprises a second obtaining unit, a first matching unit and an inspection unit;
the second acquiring unit is used for acquiring a mail header of the target mail, and matching the mail header with a preset dictionary tree to obtain a first matching result; wherein the dictionary tree is established according to lists of government institutions, banks and public inspection institutions and names of enterprise resident institutions;
The first matching unit is used for acquiring the entity name of the target mail, and matching the entity name with a company list and a personal list in a sequence labeling algorithm to obtain a second matching result;
and the checking unit is used for checking the mail sending address of the target mail if the first matching result or the second matching result is successful, and generating the fifth prompt according to the checking result.
Preferably, the inspection unit includes a fourth subunit, a fifth subunit, a sixth subunit, a seventh subunit, an eighth subunit, and a ninth subunit;
the fourth subunit is configured to perform category judgment on the mailbox address of the target mail, so as to obtain a declaration object of the target mail;
the fifth subunit is configured to check whether the top domain name of the target mail belongs to a preset special top domain if the declaration object is a government agency or a public inspection agency;
the sixth subunit is configured to check whether the domain name used by the target mail belongs to a pre-collected list if the declaration object is a bank, an invoice service provider or an enterprise;
the seventh subunit is configured to check whether the sender and the recipient of the target mail are in the same domain or a subdomain if the declaration object is an enterprise resident institution;
the eighth subunit is configured to check whether the declaration object is in a preset address book if the declaration object is a name;
and the ninth subunit is used for generating the fifth prompt according to the checking result.
As a preferable scheme, the fourth generating module comprises a third obtaining unit, a second segmentation unit and a second matching unit;
the third acquiring unit is used for acquiring text information of the target mail;
the second segmentation unit is used for segmenting the text information by using a preset classifier to obtain a plurality of character strings;
the second matching unit is used for matching the character strings with a preset money word list, and generating the sixth prompt according to a matching result;
wherein the monetary vocabulary is established according to preset financial fraud information.
The invention also provides a storage medium storing a computer program which, when called and executed by a computer, implements the above mail-based risk prompt information generation method.
Drawings
Fig. 1 is a schematic flow chart of a method for generating risk prompt information based on mail according to an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of a device for generating risk prompt information based on mail according to an embodiment of the present invention.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
In the description of the present application, it should be understood that the terms "first", "second", "third", ..., "tenth" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined by "first", "second", "third", ..., "tenth" may explicitly or implicitly include one or more such features. In the description of the present application, unless otherwise indicated, "a plurality of" means two or more.
The method for generating the risk prompt information based on the mail is mainly applied to risk detection of the received target mail when the electronic mail box is used, so that the prompt information is generated to give a corresponding prompt to a user, and the alertness of the user is improved.
Embodiment one:
referring to fig. 1, an embodiment of the present invention provides a method for generating risk prompt information based on mail, including S1 to S6, which specifically includes the following implementation steps:
S1, acquiring a target mail.
The step S1 of the embodiment of the invention is specifically as follows:
The target mail is acquired from the mailbox system.
S2, SPF verification is carried out on the target mail, and a first prompt is generated according to the verification result.
The step S2 of the embodiment of the invention is specifically as follows:
The target mail is checked by using SPF (Sender Policy Framework), DKIM (DomainKeys Identified Mail) and DMARC (Domain-based Message Authentication, Reporting, and Conformance), and a first prompt is generated according to the check result.
In this embodiment, because SMTP (Simple Mail Transfer Protocol) does not verify the sender's mail address, SPF supplements the security of SMTP for the target mail. However, the SPF protocol has problems in practice: for example, some domain names lack a strictly standardized SPF configuration, so some mails that are not forged may fail SPF verification. A failed SPF verification is therefore not conclusive on its own, and the user is informed of the SPF verification result to help the user make a judgment. DKIM and DMARC can further assist SPF in producing a more complete prompt, so that more comprehensive prompt information is provided to the user.
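A minimal sketch of turning an SPF result into the first prompt is shown below. It assumes the receiving mail server has already recorded the result in a `Received-SPF` header, which is an assumption about the deployment rather than something the text states; the function name and prompt wording are likewise illustrative, and DKIM and DMARC are not covered.

```python
from email import message_from_string


def first_prompt(raw_mail: str):
    """Read the recorded SPF result and return a prompt unless it is 'pass'."""
    msg = message_from_string(raw_mail)
    header = msg.get("Received-SPF", "")
    # The header value starts with the result keyword, e.g. "pass" or "fail".
    result = header.split(None, 1)[0].lower() if header.strip() else "none"
    if result == "pass":
        return None
    return (f"SPF check result for the sender is '{result}'; "
            "the sender address may be forged.")
```

Because some legitimate domains have loose SPF configurations, the prompt reports the result rather than asserting that the mail is forged.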
And S3, calculating a probability value by using a state transition probability method according to spelling of the signaling domain name in the target mail, and generating a second prompt by combining the probability value with a preset confidence lower limit and spelling features in the target mail.
In step S3 of the embodiment of the present invention, S3 includes S3.1 to S3.5, specifically:
S3.1, converting the signaling domain name in the target mail into lower case;
traversing the signaling domain name in the target mail with a preset N value through an N-Gram algorithm to generate character substrings.
For example: assuming that the signaling domain name of the target mail is 123456.cn and N is specified as 3, traversing the signaling domain name through the N-Gram algorithm gives the final list of substrings: "123, 234, 345, 456, 56., 6.c, .cn".
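By way of non-limiting illustration, the traversal of step S3.1 may be sketched as follows in Python; the function name is hypothetical.

```python
def char_ngrams(domain: str, n: int = 3) -> list[str]:
    """Lower-case the domain and slide a window of width n over its characters."""
    text = domain.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


print(char_ngrams("123456.CN", 3))
```

For the domain name of the example above, the function returns the seven substrings listed there.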
S3.2, constructing all possible N-Gram tuples from the character substrings in combination with a preset character set to obtain a state transition probability matrix; the character set is established according to the RFC standards governing the characters that may be used in a signaling domain name, for example RFC 1035, RFC 1123, RFC 2181 and RFC 5892;
splitting the signaling domain name into a plurality of tuples with reference to the state transition probability matrix;
calculating the average state transition probability value of the plurality of tuples to obtain the probability value.
In step S3.2, N=2 may be specified to reduce the size of the matrix: all possible 2-gram tuples, M in number, can then be constructed directly to form an initial state transition matrix of size M×M, after which the state transition probabilities between all 2-gram tuples are calculated to obtain the state transition probability matrix.
A specific example of step S3.2 is as follows:
Assuming that the corpus contains only the sentence "I robot", traversing it with the N-Gram algorithm (in lower case, with the space removed) gives, for N=3, the tuples "iro, rob, obo, bot". From their order, the only transition target of the tuple iro is the tuple rob, and the other tuples behave similarly, so the transition probability values of these tuples are as shown in table 1.
Table 1 Tuple transition probability values (row: current tuple; column: next tuple)
       iro  rob  obo  bot
iro    0    1    0    0
rob    0    0    1    0
obo    0    0    0    1
bot    0    0    0    0
Table 1 lists the transition probability values of the tuples in the above example of step S3.2.
Constructing all possible N-Gram tuples from the character substrings amounts to building a Markov chain over the spelling of the signaling domain name. With this Markov chain, the probability value of the mailbox address occurring can be calculated, which provides probabilistic support for generating the second prompt.
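By way of non-limiting illustration, the construction of the state transition probabilities and the averaging of step S3.2 may be sketched as follows in Python. The function names, the sparse dictionary representation (instead of a dense M×M matrix) and the treatment of unseen transitions as probability 0 are assumptions made for the sketch.

```python
from collections import Counter, defaultdict


def ngrams(text: str, n: int) -> list[str]:
    """Lower-case, drop spaces, and return the overlapping n-character tuples."""
    text = text.lower().replace(" ", "")
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def transition_matrix(corpus: list[str], n: int = 3) -> dict[str, dict[str, float]]:
    """Estimate P(next tuple | current tuple) from consecutive tuples in the corpus."""
    counts: dict[str, Counter] = defaultdict(Counter)
    for entry in corpus:
        grams = ngrams(entry, n)
        for cur, nxt in zip(grams, grams[1:]):
            counts[cur][nxt] += 1
    return {cur: {nxt: c / sum(row.values()) for nxt, c in row.items()}
            for cur, row in counts.items()}


def average_transition_probability(domain: str, matrix, n: int = 3) -> float:
    """Average the transition probabilities along the tuples of the domain name."""
    grams = ngrams(domain, n)
    pairs = list(zip(grams, grams[1:]))
    if not pairs:
        return 0.0
    return sum(matrix.get(a, {}).get(b, 0.0) for a, b in pairs) / len(pairs)


matrix = transition_matrix(["I robot"])
print(matrix)  # rows iro, rob and obo each carry a single transition of probability 1
```

With the single-sentence corpus of table 1, a name spelled exactly like the corpus scores 1, while a name whose later tuples never occur in the corpus scores lower.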
S3.3, comparing the probability value with a preset confidence lower limit to obtain a comparison result; the confidence lower limit is obtained by calculating the probability values of all domain names in a preset trusted domain name list;
The confidence lower limit is constructed as follows:
acquiring the Alexa Rank list (a worldwide comprehensive website ranking) and constructing the trusted domain name list from the top 20,000 entries of that list;
calculating the probability values of all domain names in the trusted domain name list to obtain a probability value set;
taking the minimum value in the probability value set as the confidence lower limit.
Because the trusted domain name list in this embodiment is built on a worldwide comprehensive ranking, it is highly reliable, which makes the confidence lower limit a comprehensive measure of credibility against which the probability value can be compared to generate the second prompt.
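By way of non-limiting illustration, the construction of the confidence lower limit and the comparison of step S3.3 may be sketched as follows in Python. The scoring function is passed in as a parameter; `letter_ratio` is only a toy stand-in for the state transition probability value, and all names are hypothetical.

```python
from typing import Callable, Iterable


def letter_ratio(domain: str) -> float:
    """Toy stand-in score (share of alphabetic characters), used only for the demo."""
    return sum(c.isalpha() for c in domain) / len(domain)


def confidence_lower_limit(trusted_domains: Iterable[str],
                           score: Callable[[str], float]) -> float:
    """Take the minimum score over the trusted domain name list."""
    return min(score(d) for d in trusted_domains)


def below_lower_limit(domain: str, lower_limit: float,
                      score: Callable[[str], float]) -> bool:
    """Comparison result: True when the domain scores below every trusted domain."""
    return score(domain) < lower_limit


limit = confidence_lower_limit(["example.com", "a1.org"], letter_ratio)
print(below_lower_limit("12345.cn", limit, letter_ratio))
```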
S3.4, performing normalization detection on the character spelling features of the target mail, namely vowels and special symbols, to obtain a detection result.
Specifically: counting the number of vowels and the number of special symbols (such as "-", "_", and "") in the target mail, and flagging any domain name whose counts exceed the upper threshold limit calculated from historical statistical data, to obtain the detection result;
the detection result may show the number of rarely used symbols and whether the proportion of vowels is too low.
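By way of non-limiting illustration, the counting of step S3.4 may be sketched as follows in Python. The two threshold values are placeholders, since this embodiment derives the thresholds from historical statistical data; only the two special symbols legible in the text above are counted, and the function name is hypothetical.

```python
VOWELS = set("aeiou")
SPECIAL = set("-_")  # only the two symbols legible in the description


def spelling_features(domain: str, max_special: int = 2,
                      min_vowel_ratio: float = 0.2) -> dict:
    """Count vowels and special symbols and compare them with assumed thresholds."""
    name = domain.lower()
    letters = [c for c in name if c.isalpha()]
    vowels = sum(c in VOWELS for c in letters)
    special = sum(c in SPECIAL for c in name)
    ratio = vowels / len(letters) if letters else 0.0
    return {
        "vowel_count": vowels,
        "special_count": special,
        "vowel_ratio": ratio,
        "too_many_special": special > max_special,
        "too_few_vowels": ratio < min_vowel_ratio,
    }


print(spelling_features("pay-pal-secure-login.xyz"))
```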
S3.5, generating the second prompt according to the comparison result, the detection result and the low-frequency communication top-level domain list of the mailbox where the target mail is located.
The low-frequency communication top-level domain list is constructed as follows:
summarizing pre-collected spam data and security reports published by vendors to obtain a top-level domain name list, the selection criteria for which include low usage frequency, low registration cost and high spam frequency;
analyzing the historical mail data of the user of the mailbox where the target mail is located to compute the list of top-level domains with which the user communicates frequently; if the top-level domain of the sender's mailbox is not in that list, it is regarded as a top-level domain with which the user communicates at low frequency, giving a low-frequency top-level domain list;
establishing the low-frequency communication top-level domain list from the top-level domain name list and the low-frequency top-level domain list.
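By way of non-limiting illustration, the list construction of step S3.5 may be sketched as follows in Python. Taking the top-level domain as the last dot-separated label, the `min_count` threshold, and combining the two lists by union are assumptions made for the sketch; the function names are hypothetical.

```python
from collections import Counter


def tld(address_or_domain: str) -> str:
    """Return the last dot-separated label, for example 'cn' for 'user@a.cn'."""
    return address_or_domain.lower().rsplit(".", 1)[-1]


def frequent_tlds(history_senders: list[str], min_count: int = 3) -> set[str]:
    """Top-level domains that appear at least min_count times in the user's history."""
    counts = Counter(tld(s) for s in history_senders)
    return {t for t, c in counts.items() if c >= min_count}


def is_low_frequency_tld(sender: str, history_senders: list[str],
                         risky_tlds: set[str], min_count: int = 3) -> bool:
    """True if the sender's TLD is on the collected list or rare for this user."""
    t = tld(sender)
    return t in risky_tlds or t not in frequent_tlds(history_senders, min_count)


history = ["a@x.com"] * 3 + ["b@y.cn"]
print(is_low_frequency_tld("c@z.cn", history, {"top", "xyz"}))
```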
Viewed as a whole, step S3 of this embodiment addresses the fact that security experts and recipients alike find it difficult to tell, from their own knowledge alone, whether a sender's mailbox address is problematic; risk detection and reminding for the mailbox address is therefore of considerable importance;
the confidence lower limit is obtained by calculating the probability values of all domain names in a preset trusted domain name list and thus represents the minimum mail reliability; comparing the probability value with the confidence lower limit yields a comparison result that reflects, from an authoritative standpoint, how abnormal the mail address of the current target mail is;
in addition, because the lower threshold used for the state transition probability is the minimum value and is therefore relatively conservative, combining the character spelling feature detection result of the target mail with the low-frequency communication top-level domain list additionally takes into account the composition of the target mail's mailbox address and rarely encountered mailbox addresses. The second prompt is thus not only the result of analyzing the target mail from an external standpoint but also reflects its internal details, which makes it more objective, accurate and effective.
S4, generating a third prompt by using a preset neural network model to segment the uniform resource locator in the target mail and convert it into real numbers.
In step S4 of the embodiment of the present invention, S4 includes S4.1 to S4.4, specifically:
S4.1, using a preset LSTM (Long Short-Term Memory) deep learning network structure to segment a URL (Uniform Resource Locator) in the target mail into a first character set; the input processing of the LSTM deep learning network structure follows the input tokenization approach of large language models, and the network structure defines a large set of characters that support segmentation.
S4.2, screening out the characters of the first character set that are not in the preset character set to obtain a second character set, and defining the character identifier of the second character set as unknown characters;
adding a preset beginning character and a preset end character at the head and the tail of the unknown characters, respectively, to obtain a number list for the second character set.
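By way of non-limiting illustration, steps S4.1 and S4.2 may be sketched as follows in Python as a character-level encoder. The character set, the reserved numbers for the unknown, beginning and end markers, and the function name are assumptions made for the sketch; the preset character set of this embodiment is not specified here.

```python
import string

PAD, UNK, BOS, EOS = 0, 1, 2, 3  # assumed reserved numbers
# Assumed character set: ASCII letters, digits and punctuation.
CHARSET = string.ascii_letters + string.digits + string.punctuation
CHAR_TO_ID = {c: i + 4 for i, c in enumerate(CHARSET)}


def encode_url(url: str) -> list[int]:
    """Split the URL into characters, map unknown ones to UNK, and wrap in BOS/EOS."""
    return [BOS] + [CHAR_TO_ID.get(c, UNK) for c in url] + [EOS]


print(encode_url("a/é"))
```

The resulting number list is what a sequence model such as an LSTM would consume after an embedding lookup.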
S4.3, generating a comprehensive auxiliary prompt according to the IP host position, the short website (shortened URL) and the external download link in the target mail, specifically:
if regular-expression detection shows that the IP address at the host position of the target mail is not an intranet IP, outputting a first auxiliary prompt indicating that the IP address in the target mail is not an intranet IP;
if the end of a URL in the target mail would be handled by a browser as a download link, outputting a second auxiliary prompt indicating that the target mail points to an external download link; for example, for https://www.123.com/abc.png, since "png" is an image, a browser will typically not trigger an automatic download, so the URL is not an external download link; but if the URL is https://www.123.com/abc.exe, most browsers will treat it as a download link and pop up a save window, so it constitutes a download link;
if regular-expression matching of the host position of the target mail against the domain names in a pre-acquired short website service list produces a hit, outputting a third auxiliary prompt indicating that the target mail points to a short website;
generating the comprehensive auxiliary prompt according to the first auxiliary prompt, the second auxiliary prompt and the third auxiliary prompt.
In this embodiment, the comprehensive auxiliary prompt is generated by distinguishing three scenarios: an IP address used as the host, use of a short website, and a link pointing to an external download.
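By way of non-limiting illustration, the three checks of step S4.3 may be sketched as follows in Python. The sketch uses the standard `ipaddress` and `urllib.parse` modules in place of the regular-expression matching described above; the short website list, the extension list, the prompt wording and the function name are illustrative assumptions.

```python
import ipaddress
from urllib.parse import urlsplit

SHORT_URL_HOSTS = {"bit.ly", "t.co", "tinyurl.com"}     # illustrative list
DOWNLOAD_EXTENSIONS = {".exe", ".bat", ".zip", ".msi"}  # illustrative list


def auxiliary_prompts(url: str) -> list[str]:
    """Return the auxiliary prompts triggered by a single URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    prompts = []
    try:
        # First auxiliary prompt: host is an IP address outside private ranges.
        if not ipaddress.ip_address(host).is_private:
            prompts.append("host is a non-intranet IP address")
    except ValueError:
        pass  # host is a domain name, not an IP address
    # Second auxiliary prompt: the path ends in a typically downloaded file type.
    if any(parts.path.lower().endswith(ext) for ext in DOWNLOAD_EXTENSIONS):
        prompts.append("link points to an external download")
    # Third auxiliary prompt: host belongs to a known short website service.
    if host in SHORT_URL_HOSTS:
        prompts.append("link points to a short website")
    return prompts


print(auxiliary_prompts("https://www.123.com/abc.exe"))
```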
S4.4, processing the number list with the binary classification network structure in the LSTM deep learning network structure and applying a softmax conversion to obtain the abnormal probability of the uniform resource locator, and generating an initial prompt according to the abnormal probability;
combining the initial prompt and the comprehensive auxiliary prompt to obtain the third prompt.
It should be noted that, in the embodiment of the present invention, the host position is the combination of the top-level domain and the first-level domain name immediately below it (that is, the first label remaining once the top-level domain is excluded), and for a plurality of URLs at the same host position only the URL that appears first is checked. For example, the host positions of 123.com.cn and 456.123.com.cn are both 123.com.cn, while 123.com and 123.cn are not the same host position. The softmax conversion is a mathematical transformation of the original variables using the softmax function, also called the normalized exponential function, which is commonly used to convert an arbitrary set of real numbers into real numbers representing a probability distribution.
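By way of non-limiting illustration, the softmax conversion of step S4.4 may be sketched as follows in Python. The two-element output of the classification network and the position of the "abnormal" class are assumptions made for the sketch; the function names are hypothetical.

```python
import math


def softmax(logits: list[float]) -> list[float]:
    """Normalized exponential function; subtracting the maximum avoids overflow."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def abnormal_probability(logits: list[float]) -> float:
    """Assumes logits = [normal_score, abnormal_score] from the binary classifier."""
    return softmax(logits)[1]


print(abnormal_probability([0.0, math.log(3)]))
```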
In step S4 of this embodiment, because spam and phishing mails often achieve their purpose by inducing the user to click a URL, prompting specifically on the URLs in the target mail that may be risky gives a more refined and accurate prompting effect than redirecting every external link through an intermediate jump page;
in this embodiment, the URLs in the mail are detected and the third prompt is generated accordingly. Segmenting and screening the uniform resource locator identifies the unknown characters in the target mail; adding characters at the head and the tail makes the character boundaries in the number list explicit, which facilitates faster conversion processing, speeds up obtaining the abnormal probability of the uniform resource locator, and allows the third prompt to be generated quickly;
compared with the signaling domain name, the uniform resource locator content in a mail is longer and therefore unsuited to judgment by simple state transition probability and character counts; using a neural network model to compute the abnormal probability saves a great deal of analysis time and improves the accuracy of the third prompt.
S5, performing feature analysis on macro attachments, suspected impersonated names and money-related information in the target mail to generate a fourth prompt, a fifth prompt and a sixth prompt, respectively.
In step S5 of the embodiment of the present invention, S5 includes S5.1 to S5.3, specifically:
S5.1, acquiring the macro attachments in the target mail;
if a dangerous file on the focused attention list appears among the macro attachments, generating the fourth prompt according to the dangerous file;
the focused attention list is established according to the attachment types that the mailbox of the target mail forbids uploading (for example, the list of attachment types forbidden by Outlook) and preset focused attention files;
the focused attention files include directly executable files such as ".exe", ".bat", ".ps" and ".sh", and macro files such as ".docm" and ".xlsm".
In this embodiment, the attachments carried in the mail are detected and the fourth prompt is generated accordingly. The focused attention files above are easily executed through misoperation or through the inducement of the mail content. For example, an office file carrying a macro is generally safe when opened in read-only mode with macros disabled, even if it carries a macro virus; but if the user enables macros, the user may be infected. Checking against the focused attention list therefore reveals quickly whether a macro attachment in the target mail is a dangerous file carrying operational risk, so that the user is reminded in time and the fourth prompt is generated simply and quickly.
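By way of non-limiting illustration, the attachment check of step S5.1 may be sketched as follows in Python. The extension sets repeat the examples given above; the function name and the optional `blocked_by_mailbox` parameter (standing for the mailbox's own forbidden types) are hypothetical.

```python
import os

EXECUTABLE_EXTENSIONS = {".exe", ".bat", ".ps", ".sh"}
MACRO_EXTENSIONS = {".docm", ".xlsm"}


def dangerous_attachments(filenames: list[str],
                          blocked_by_mailbox: frozenset = frozenset()) -> list[str]:
    """Return the attachments whose extension is on the focused attention list."""
    watch_list = EXECUTABLE_EXTENSIONS | MACRO_EXTENSIONS | set(blocked_by_mailbox)
    return [name for name in filenames
            if os.path.splitext(name.lower())[1] in watch_list]


print(dangerous_attachments(["report.PDF", "Invoice.DOCM", "run.sh"]))
```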
S5.2, acquiring the mail header of the target mail and matching it against a preset dictionary tree to obtain a first matching result; the dictionary tree is established according to lists of government institutions, banks and public inspection institutions (public security, procuratorial and judicial organs) and the names of enterprise permanent institutions;
obtaining the entity names in the target mail and matching them against the B-ORG (company or organization) list and the I-PER (person) list of a sequence labeling algorithm to obtain a second matching result;
if the first matching result or the second matching result is successful, judging the category of the sending address of the target mail to obtain the declaration object of the target mail (the identity that the mail claims);
if the declaration object is a government institution or a public inspection institution, checking whether the top-level domain name of the target mail belongs to a preset special top-level domain (such as ".gov" and ".mit");
if the declaration object is a bank, an invoice service provider or an enterprise, checking whether the domain name used by the target mail belongs to a pre-collected list;
if the declaration object is an enterprise permanent institution, checking whether the sender and the recipient of the target mail are in the same domain or a sub-domain;
if the declaration object is a person's name, checking whether the declaration object is in a preset address book;
generating the fifth prompt according to the checking result.
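By way of non-limiting illustration, the dictionary tree matching and the sending address check of step S5.2 may be sketched as follows in Python. Only the first matching result is covered (the sequence labeling model is omitted); the category names, the sample entries, the prompt wording and all function names are hypothetical.

```python
END = ""  # terminal marker; can never collide with a single-character key


def build_trie(names: dict[str, str]) -> dict:
    """Build a character-level dictionary tree mapping each name to its category."""
    root: dict = {}
    for name, category in names.items():
        node = root
        for ch in name.lower():
            node = node.setdefault(ch, {})
        node[END] = category
    return root


def find_claims(text: str, trie: dict) -> list[str]:
    """Return the categories of all listed names that occur in the text."""
    text = text.lower()
    found = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break
            node = node[ch]
            if END in node:
                found.append(node[END])
    return found


def same_or_sub_domain(a: str, b: str) -> bool:
    return a == b or a.endswith("." + b) or b.endswith("." + a)


def fifth_prompt(header, sender_domain, recipient_domain, trie,
                 special_tlds=(".gov",), collected_domains=frozenset()):
    """Return a warning if the sending address does not fit the claimed identity."""
    for category in find_claims(header, trie):
        if category == "government":
            ok = sender_domain.endswith(tuple(special_tlds))
        elif category == "bank":
            ok = sender_domain in collected_domains
        else:  # enterprise permanent institution
            ok = same_or_sub_domain(sender_domain, recipient_domain)
        if not ok:
            return f"Claimed {category} sender does not match domain {sender_domain}."
    return None
```

The address book check for personal names follows the same pattern and is left out for brevity.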
In this embodiment, whether the entity name in the mail is impersonated is detected and the fifth prompt is generated accordingly. Although protocols such as SPF offer some protection against forged mail addresses, they cannot cope with impersonating another party's name by exploiting the visual presentation of the mail, so identifying the entities and judging them in order to learn whether an entity name is impersonated is of real significance;
first, matching reveals whether the target mail matches successfully, that is, whether a declaration is formed. A declaration here refers to a formal statement issued by an official body such as a government, an enterprise or an organization, usually expressing its official position and attitude in response to some event or condition, and the sending address of such a mail is generally official and formal. Therefore, once a declaration is formed, checking the sending address directly reveals quickly whether the target mail is impersonated, yielding the fifth prompt. In addition, the second matching result is generated using open-source BiLSTM+CRF technology, and corresponding lists can be provided for the objects of focused attention to enable quick matching;
in addition, since the mail domain names of government institutions and public inspection institutions use special top-level domains in most cases, checking the top-level domain name directly reveals whether the mailbox is abnormal; since enterprise permanent institutions usually belong to the same enterprise, checking whether the sender and the recipient are in the same domain or a sub-domain reveals whether the mailbox is abnormal; and with the pre-collected list and the address book, whether the mailbox of a bank or a person is abnormal can be judged quickly and directly. The scheme thus provides different sending address checks for government institutions, banks, enterprise permanent institutions, individuals and so on; the checks are highly targeted and speed up generation of the fifth prompt.
S5.3, acquiring the text information of the target mail;
segmenting the text information, with a TextCNN (convolutional neural network) module as the classifier, to obtain a plurality of string tokens; the physical size and memory footprint of the classifier are controlled through a preset word list;
matching the tokens against a preset money word list and masking (MASK) the tokens that are not in the money word list to obtain a matching result;
generating the sixth prompt according to the matching result;
the money word list is established according to preset financial fraud information.
In this embodiment, ordinary mail classification only determines whether a mail is spam, not which type of spam it is; yet across all kinds of fraud, money is often the most sensitive element, so an additional reminder is necessary for information that involves money and may constitute fraud;
segmenting the text information with the classifier amounts to segmenting and extracting data from the original mail text, so that the words in the resulting strings can readily be compared and matched with the money word list, which speeds up obtaining the sixth prompt.
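By way of non-limiting illustration, the masking of step S5.3 may be sketched as follows in Python. The sketch covers only the tokenization and masking that precede the classifier, not the TextCNN module itself; the sample money word list, the simple regular-expression tokenizer and the function name are assumptions.

```python
import re

# Illustrative money word list; the embodiment builds it from financial fraud information.
MONEY_VOCAB = {"transfer", "payment", "invoice", "refund", "account", "bank"}
MASK = "[MASK]"


def mask_tokens(text: str, vocab: set = MONEY_VOCAB) -> list[str]:
    """Tokenize the text and replace every token outside the word list with MASK."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t if t in vocab else MASK for t in tokens]


print(mask_tokens("Please transfer the payment today"))
```

Because every out-of-list token collapses to one symbol, the classifier only needs embeddings for the word list, which is one way to keep its size and memory footprint bounded.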
S6, generating risk prompt information according to the first prompt, the second prompt, the third prompt, the fourth prompt, the fifth prompt and the sixth prompt.
Step S6 of the embodiment of the invention is specifically as follows:
configuring a unique id as the label of each of the first prompt, the second prompt, the third prompt, the fourth prompt, the fifth prompt and the sixth prompt; then, according to a preset multilingual text dictionary and the placeholders corresponding to the output of each label, integrating the prompts corresponding to the labels and controlling the output language by passing in a language marking parameter, to obtain the risk prompt information;
the multilingual text dictionary is written manually.
In this embodiment, the generated prompts are controlled through formatted output, so multiple languages can be supported; compared with machine translation, the prompt texts of this scheme are written manually, read more naturally, and improve the user experience.
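By way of non-limiting illustration, the formatted multilingual output of step S6 may be sketched as follows in Python. The label ids, the template texts, the placeholder names and the function name are hypothetical; only the mechanism (one manually written template per label and language, filled through placeholders and selected by a language parameter) follows the description above.

```python
# Hypothetical multilingual text dictionary: label id -> language -> template.
TEMPLATES = {
    "spf_failed": {
        "en": "Sender verification failed for {domain}.",
        "zh": "发件域名 {domain} 未通过发信验证。",
    },
    "macro_attachment": {
        "en": "Attachment {name} may contain macros or executable code.",
        "zh": "附件 {name} 可能包含宏或可执行代码。",
    },
}


def render_risk_prompt(findings: list[tuple[str, dict]], lang: str = "en") -> str:
    """Fill each label's template in the requested language and join the lines."""
    lines = [TEMPLATES[label][lang].format(**params) for label, params in findings]
    return "\n".join(lines)


findings = [("spf_failed", {"domain": "a.example"}),
            ("macro_attachment", {"name": "x.docm"})]
print(render_risk_prompt(findings, "en"))
```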
In general, the embodiment of the invention has the following beneficial effects:
By applying a different detection method to each of six aspects, namely SPF verification, the spelling of the signaling domain name, the uniform resource locator, macro attachments, suspected impersonated names and money-related information, the embodiment of the invention obtains a corresponding prompt for each and thereby generates highly comprehensive risk prompt information, so that risk prompting can be carried out quickly and effectively in one pass, missed prompts are avoided, and the information security of the target mail is ensured;
identity declarations that may not match the sender are identified through entity recognition combined with knowledge base matching so as to prompt on possible impersonation; the coverage is broader and the risk of omission is reduced. A scheme based on entity recognition can find the organization names in a given text and thereby discover potential entity impersonation, which simple contact matching cannot do. In addition, compared with a one-size-fits-all scheme that forbids direct access to external links, the embodiment of the invention combines empirical knowledge with a deep learning model to identify the specific suspicious URLs, which avoids users becoming desensitized by overly general prompts and missing the reminders.
Embodiment two:
Referring to fig. 2, an embodiment of the present invention provides a risk prompting information generating device based on mail, which includes a mail obtaining module 10, a first generating module 20, a second generating module 30, a third generating module 40, a fourth generating module 50, and a summarizing module 60;
wherein, the mail obtaining module 10 is used for obtaining the target mail;
the first generating module 20 is configured to perform SPF verification on the target mail, and generate a first prompt according to a verification result;
the second generating module 30 is configured to calculate a probability value according to spelling of the signaling domain name in the target mail by using a state transition probability method, and combine the probability value with a preset confidence lower limit and spelling features in the target mail to generate a second prompt;
a third generating module 40, configured to generate a third prompt by performing segmentation and real number conversion on the uniform resource locator in the target mail using a preset neural network model;
a fourth generation module 50, configured to perform feature analysis on macro attachments, suspected impersonated names and money-related information in the target mail, and generate a fourth prompt, a fifth prompt and a sixth prompt, respectively;
the summarization module 60 is configured to generate risk prompt information according to the first prompt, the second prompt, the third prompt, the fourth prompt, the fifth prompt and the sixth prompt.
In one embodiment, the mail obtaining module 10 is specifically configured for:
acquiring the target mail from the mailbox system.
In one embodiment, the first generating module 20 is specifically configured for:
checking the target mail using SPF (Sender Policy Framework), DKIM (DomainKeys Identified Mail) and DMARC (Domain-based Message Authentication, Reporting, and Conformance), and generating a first prompt according to the check result.
In this embodiment, since SMTP (Simple Mail Transfer Protocol) does not verify the sender's mail address, SPF supplements the security of SMTP for the target mail. SPF nevertheless has many problems in practice; for example, some domain names lack a strictly configured SPF record, so SPF verification may fail for some mails that are not forged. A mail that fails SPF verification entirely still carries a serious risk, so letting the user know the SPF verification result helps the user make a judgment. DKIM and DMARC further assist SPF in producing a more complete prompt, so that more comprehensive prompt information is provided to the user.
In one embodiment, the second generation module 30 includes a character unit, a matrix unit, a splitting unit, a probability value unit, a first subunit, a second subunit, and a third subunit;
the character unit is used for converting the signaling domain name in the target mail into lower case;
the character unit is also used for traversing the signaling domain name in the target mail with a preset N value through an N-Gram algorithm to generate character substrings.
For example: assuming that the signaling domain name of the target mail is 123456.cn and N is specified as 3, traversing the signaling domain name through the N-Gram algorithm gives the final list of substrings: "123, 234, 345, 456, 56., 6.c, .cn".
The matrix unit is used for constructing all possible N-Gram tuples from the character substrings in combination with a preset character set to obtain a state transition probability matrix; the character set is established according to the RFC standards governing the characters that may be used in a signaling domain name, for example RFC 1035, RFC 1123, RFC 2181 and RFC 5892;
the splitting unit is used for splitting the signaling domain name into a plurality of tuples with reference to the state transition probability matrix;
the probability value unit is used for calculating the average state transition probability value of the plurality of tuples to obtain the probability value.
In the matrix unit, N=2 may be specified to reduce the size of the matrix: all possible 2-gram tuples, M in number, can then be constructed directly to form an initial state transition matrix of size M×M, after which the state transition probabilities between all 2-gram tuples are calculated to obtain the state transition probability matrix.
A specific example of the matrix unit, the splitting unit and the probability value unit is as follows:
Assuming that the corpus contains only the sentence "I robot", traversing it with the N-Gram algorithm (in lower case, with the space removed) gives, for N=3, the tuples "iro, rob, obo, bot". From their order, the only transition target of the tuple iro is the tuple rob, and the other tuples behave similarly, so the transition probability values of these tuples are as shown in table 1.
Table 1 Tuple transition probability values (row: current tuple; column: next tuple)
       iro  rob  obo  bot
iro    0    1    0    0
rob    0    0    1    0
obo    0    0    0    1
bot    0    0    0    0
Table 1 lists the transition probability values of the tuples in the above example of the matrix unit, the splitting unit and the probability value unit.
Constructing all possible N-Gram tuples from the character substrings amounts to building a Markov chain over the spelling of the signaling domain name. With this Markov chain, the probability value of the mailbox address occurring can be calculated, which provides probabilistic support for generating the second prompt.
The first subunit is used for comparing the probability value with a preset confidence lower limit to obtain a comparison result; the confidence lower limit is obtained by calculating the probability values of all domain names in a preset trusted domain name list;
The confidence lower limit is constructed as follows:
acquiring the Alexa Rank list (a worldwide comprehensive website ranking) and constructing the trusted domain name list from the top 20,000 entries of that list;
calculating the probability values of all domain names in the trusted domain name list to obtain a probability value set;
taking the minimum value in the probability value set as the confidence lower limit.
Because the trusted domain name list in this embodiment is built on a worldwide comprehensive ranking, it is highly reliable, which makes the confidence lower limit a comprehensive measure of credibility against which the probability value can be compared to generate the second prompt.
The second subunit is used for performing normalization detection on the character spelling features of the target mail, namely vowels and special symbols, to obtain a detection result.
Specifically: counting the number of vowels and the number of special symbols (such as "-", "_", and "") in the target mail, and flagging any domain name whose counts exceed the upper threshold limit calculated from historical statistical data, to obtain the detection result;
the detection result may show the number of rarely used symbols and whether the proportion of vowels is too low.
The third subunit is used for generating the second prompt according to the comparison result, the detection result and the low-frequency communication top-level domain list of the mailbox where the target mail is located.
The low-frequency communication top-level domain list is constructed as follows:
summarizing pre-collected spam data and security reports published by vendors to obtain a top-level domain name list, the selection criteria for which include low usage frequency, low registration cost and high spam frequency;
analyzing the historical mail data of the user of the mailbox where the target mail is located to compute the list of top-level domains with which the user communicates frequently; if the top-level domain of the sender's mailbox is not in that list, it is regarded as a top-level domain with which the user communicates at low frequency, giving a low-frequency top-level domain list;
establishing the low-frequency communication top-level domain list from the top-level domain name list and the low-frequency top-level domain list.
The second generating module 30 in this embodiment addresses the fact that security experts and recipients alike find it difficult to tell, from their own knowledge alone, whether a sender's mailbox address is problematic; risk detection and reminding for the mailbox address is therefore of considerable importance;
the confidence lower limit is obtained by calculating the probability values of all domain names in a preset trusted domain name list and thus represents the minimum mail reliability; comparing the probability value with the confidence lower limit yields a comparison result that reflects, from an authoritative standpoint, how abnormal the mail address of the current target mail is;
in addition, because the lower threshold used for the state transition probability is the minimum value and is therefore relatively conservative, combining the character spelling feature detection result of the target mail with the low-frequency communication top-level domain list additionally takes into account the composition of the target mail's mailbox address and rarely encountered mailbox addresses. The second prompt is thus not only the result of analyzing the target mail from an external standpoint but also reflects its internal details, which makes it more objective, accurate and effective.
In one embodiment, the third generating module 40 includes a first segmentation unit, a screening unit, a reconstruction unit, an auxiliary unit, and a transformation unit;
the first segmentation unit is used for segmenting a URL (Uniform Resource Locator) in the target mail into a first character set by using a preset LSTM (Long Short-Term Memory) deep learning network structure; the input processing of the LSTM deep learning network structure follows the way a large language model processes its input tokens, and the network structure defines a large set of characters that it can segment.
The screening unit is used for screening out characters of the first character set, which are not in the preset character set, so as to obtain a second character set, and defining character identifiers of the second character set as unknown characters;
and a reconstruction unit, configured to add a preset beginning character and an ending character to the beginning and the end of the unknown character, respectively, to obtain a numbering list related to the second character set.
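The segmentation, screening and reconstruction steps above can be sketched as a character-level encoder. This is only an illustration: the preset character set, the reserved ids for the unknown, beginning and ending markers, and the name `encode_url` are assumptions, and the actual network's vocabulary is not given in the source.

```python
import string

# Reserved ids (hypothetical): unknown character, beginning marker, ending marker.
UNK, BOS, EOS = 0, 1, 2

# Hypothetical preset character set; the real one is defined by the network structure.
CHARSET = string.ascii_lowercase + string.digits + ":/.-_?=&%"
CHAR_TO_ID = {ch: i + 3 for i, ch in enumerate(CHARSET)}


def encode_url(url):
    """Split a URL into characters and map them to a numbering list."""
    # Characters outside the preset character set are all mapped to UNK.
    ids = [CHAR_TO_ID.get(ch, UNK) for ch in url.lower()]
    # Add the beginning and ending markers at the head and tail.
    return [BOS] + ids + [EOS]
```

The resulting list of integers is the kind of input a sequence model such as an LSTM would consume, typically after an embedding layer.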
The auxiliary unit is used for generating a comprehensive auxiliary prompt according to the IP host location, the short URL and the external download link in the target mail, specifically:
if regular-expression detection shows that the host location of the target mail is an IP address that is not an intranet IP, outputting a first auxiliary prompt for prompting that the IP address of the target mail is not an intranet IP;
if the end of the URL in the target mail points to a link that a browser will download, outputting a second auxiliary prompt for prompting that the target mail points to an external download link; for example, for https://www.123.com/abc.png, since "png" is an image format, the browser typically does not trigger an automatic download, so this is not an external download link; but if the URL is https://www.123.com/abc.exe, most browsers will treat it as a download and pop up a save window, so it constitutes a download link;
if the host location of the target mail is matched by regular expression against the domain names in a pre-acquired short-URL service list and a hit occurs, outputting a third auxiliary prompt for prompting that the target mail points to a short URL;
and generating a comprehensive auxiliary prompt according to the first auxiliary prompt, the second auxiliary prompt and the third auxiliary prompt.
In this embodiment, the comprehensive auxiliary prompt is generated by distinguishing three scenarios: an IP address used as the host, the use of a short URL, and a link pointing to an external download.
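The three auxiliary checks can be sketched as follows. Note the substitutions: the source describes regular-expression detection, whereas this sketch uses Python's standard `ipaddress` and `urllib.parse` modules for the same purpose. The extension set, the short-URL host set and the prompt strings are hypothetical samples.

```python
import ipaddress
from urllib.parse import urlparse

# Hypothetical samples; a real deployment would maintain fuller lists.
DOWNLOAD_EXTENSIONS = {".exe", ".zip", ".msi", ".apk"}
SHORT_URL_HOSTS = {"bit.ly", "t.cn"}


def auxiliary_prompts(url):
    """Return the auxiliary prompts triggered by one URL (illustrative)."""
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()
    prompts = []
    try:
        # First check: the host is a literal IP address outside private ranges.
        if not ipaddress.ip_address(host).is_private:
            prompts.append("host is a non-intranet IP address")
    except ValueError:
        pass  # The host is a domain name, not an IP address.
    # Second check: the path ends with an extension browsers usually download.
    if any(parsed.path.lower().endswith(ext) for ext in DOWNLOAD_EXTENSIONS):
        prompts.append("points to an external download link")
    # Third check: the host belongs to a known short-URL service.
    if host in SHORT_URL_HOSTS:
        prompts.append("points to a short URL")
    return prompts
```

With these sample lists, `https://www.123.com/abc.png` triggers no prompt, while `https://www.123.com/abc.exe` triggers the download-link prompt, matching the example in the text.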
The transformation unit is used for performing a softmax transformation on the numbering list by using a two-class (binary classification) network structure in the LSTM deep learning network structure to obtain the abnormal probability of the uniform resource locator, and generating an initial prompt according to the abnormal probability;
and the transformation unit is also used for combining the initial prompt and the comprehensive auxiliary prompt to obtain a third prompt.
It should be noted that, in the embodiment of the present invention, the first-level domain name that remains after the top-level domain is set aside, combined with that top-level domain, is regarded as the host location, and for a plurality of URLs at the same host location, only the URL that appears first is checked. For example, the host locations of 123.com.cn and 456.123.com.cn are both 123.com.cn, whereas 123.com and 123.cn are not the same host location. In addition, the softmax transformation is a mathematical transformation of the original variables that applies the softmax function, also called the normalized exponential function, which is commonly used to convert an arbitrary set of real numbers into real numbers representing a probability distribution.
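Both notions in the note above can be illustrated briefly. In this sketch the set of multi-label suffixes is a tiny hypothetical sample (a real system would need a complete suffix list), and all function names are assumptions. The `softmax` function is the standard normalized exponential.

```python
import math

# Hypothetical sample of suffixes that span two labels.
MULTI_LABEL_SUFFIXES = {"com.cn", "net.cn", "org.cn"}


def host_location(hostname):
    """Return the first-level domain name together with its top-level domain."""
    labels = hostname.lower().split(".")
    suffix_len = 2 if ".".join(labels[-2:]) in MULTI_LABEL_SUFFIXES else 1
    return ".".join(labels[-(suffix_len + 1):])


def first_url_per_host(hostnames):
    """Keep only the first hostname seen for each host location."""
    seen, kept = set(), []
    for h in hostnames:
        loc = host_location(h)
        if loc not in seen:
            seen.add(loc)
            kept.append(h)
    return kept


def softmax(values):
    """Map arbitrary real numbers to a probability distribution."""
    m = max(values)  # Subtracting the maximum keeps exp() numerically stable.
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

For a two-class network, `softmax` applied to the two output scores gives a normal probability and an abnormal probability that sum to 1.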
In the third generating module 40 of this embodiment, since spam and phishing mails often achieve their purpose by guiding the user to click a URL, prompting only the URLs in the target mail that may be risky achieves a more fine-grained and accurate prompting effect than using a redirect page to redirect all external links;
in this embodiment, the URLs in the mail are detected, and the third prompt is then generated. Segmenting and screening the uniform resource locator yields the unknown characters in the target mail, and adding characters at the beginning and end makes the character boundaries in the numbering list explicit, which speeds up the transformation processing, accelerates the acquisition of the abnormal probability of the uniform resource locator, and allows the third prompt to be generated quickly;
compared with the signaling domain name, the uniform resource locators in a mail are longer and are therefore not suited to judgment by simple state transition probability and character count; using a neural network model to obtain the abnormal probability saves a great deal of analysis time and improves the accuracy of the third prompt.
In one embodiment, the fourth generating module 50 includes a first acquiring unit, a first generating unit, a second acquiring unit, a first matching unit, a fourth subunit, a fifth subunit, a sixth subunit, a seventh subunit, an eighth subunit, a ninth subunit, a third acquiring unit, a second slicing unit, and a second matching unit;
the first acquisition unit is used for acquiring macro attachments in the target mail;
the first generation unit is used for generating a fourth prompt according to the dangerous files if the dangerous files in the important attention list appear in the macro attachment;
the important attention list is established according to the attachment types that the mailbox where the target mail is located forbids uploading (such as the list of attachment types blocked by Outlook) and preset important attention files;
the important attention files include directly executable files such as ".exe", ".bat", ".ps" and ".sh", and macro-enabled files such as ".docm" and ".xlsm".
In this embodiment, the attachments carried in the mail are detected, and the fourth prompt is then generated. The important attention files above are easily executed through misoperation or through inducement by the mail content. For example, an Office file with macros is generally safe, even if it carries a macro virus, as long as it is opened in read-only mode with macros disabled, but if the user enables macros, the user may be infected. By checking the macro attachment against the important attention list, it can be quickly determined whether the macro attachment in the target mail is a dangerous file that poses an operational risk, so that the user is reminded in time and the fourth prompt is generated simply and quickly.
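A minimal sketch of the attachment check is given below. The preset extensions are those named in the text; the function name `dangerous_attachments` and the `mailbox_blocked` parameter (standing in for the mailbox's own blocked attachment types) are assumptions of this illustration.

```python
from pathlib import PurePath

# Preset important attention file types named in the description.
PRESET_FOCUS = {".exe", ".bat", ".ps", ".sh", ".docm", ".xlsm"}


def dangerous_attachments(filenames, mailbox_blocked=frozenset()):
    """Return the attachment names whose extension is on the attention list."""
    # The attention list merges the preset types with the mailbox's blocked types.
    focus = PRESET_FOCUS | set(mailbox_blocked)
    return [n for n in filenames if PurePath(n).suffix.lower() in focus]
```

A non-empty result would then be used to generate the fourth prompt, for example by naming each flagged file.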
The second acquisition unit is used for acquiring a mail header of the target mail, and matching the mail header with a preset dictionary tree to obtain a first matching result; the dictionary tree is established according to lists of government institutions, banks and public inspection institutions and names of enterprise resident institutions;
the first matching unit is used for acquiring the entity name of the target mail, and matching the entity name with a B-ORG (company or organization) list and an I-PER (personal) list in a sequence labeling algorithm to obtain a second matching result;
a fourth subunit, configured to determine a category of the mailbox address of the target mail if the first matching result or the second matching result is that the matching is successful, so as to obtain a declaration object of the target mail;
a fifth subunit, configured to check whether the top-level domain name of the target mail belongs to a preset special top-level domain (such as ".gov" and ".mit") if the declaration object is a government agency or a public inspection agency;
a sixth subunit, configured to check whether the domain name used by the target mail belongs to a pre-collected list if the declaration object is a bank, an invoice service provider or an enterprise;
a seventh subunit, configured to check whether the sender and the receiver of the target mail form a same domain or a sub-domain if the declaration object is an enterprise resident institution;
an eighth subunit, configured to check whether the declaration object is in a preset address book if the declaration object is a personal name;
and a ninth subunit, configured to generate a fifth prompt according to the inspection result.
In this embodiment, whether an entity name in the mail is impersonated is detected, and the fifth prompt is then generated. Although protocols such as SPF offer some protection against forged mail addresses, they cannot cope with impersonating another party's name by exploiting how a mail appears visually to the reader. It is therefore important to find the entities and to judge them in order to determine whether an entity name is impersonated;
first, matching reveals whether the target mail is successfully matched, that is, whether it constitutes a declaration. A declaration here refers to a formal statement issued by an official body such as a government, enterprise or organization, usually to express an official position and attitude in response to some event or situation, and the sender address of such a mail is generally official and formal. Therefore, once a declaration is identified, directly checking the sender address quickly reveals whether the target mail is counterfeit, yielding the fifth prompt. In addition, the second matching result is generated using open-source BiLSTM+CRF technology, which can provide corresponding lists for objects of particular concern so that they can be matched quickly;
in addition, since the mail domains of government institutions and public inspection institutions in most cases use special top-level domain names, the top-level domain name can be checked directly to determine whether the mailbox is abnormal; since enterprise resident institutions usually have an affiliation relationship with one another, whether the mailbox is abnormal can be determined by checking whether the sender and the receiver form a same domain or a sub-domain; and by using the pre-collected list and the address book, whether the mailbox of a bank or a person is abnormal can be judged quickly and directly. The scheme thus provides different sender address checking methods for government institutions, banks, enterprise resident institutions, individuals and so on; the checks are strongly targeted and accelerate the generation of the fifth prompt.
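The category-specific checks can be sketched as a simple dispatch. This is an illustration only: the category labels, the function names and the configuration parameters (`special_tlds`, `collected_domains`, `address_book`) are hypothetical, and the classification of the declaration object is assumed to have been done beforehand.

```python
def same_or_subdomain(a, b):
    """True if the two domains are equal or one is a subdomain of the other."""
    a, b = a.lower(), b.lower()
    return a == b or a.endswith("." + b) or b.endswith("." + a)


def sender_is_consistent(category, declared_name, sender, recipient, *,
                         special_tlds, collected_domains, address_book):
    """Return True if the sender address fits the declared object (illustrative)."""
    sender_domain = sender.rsplit("@", 1)[-1].lower()
    if category == "government":
        # Government or public inspection bodies: require a special top-level domain.
        return any(("." + sender_domain).endswith(t) for t in special_tlds)
    if category == "bank_or_enterprise":
        # Banks, invoice service providers, enterprises: require a pre-collected domain.
        return sender_domain in collected_domains
    if category == "resident_institution":
        # Enterprise resident institutions: sender and receiver share a domain.
        recipient_domain = recipient.rsplit("@", 1)[-1].lower()
        return same_or_subdomain(sender_domain, recipient_domain)
    if category == "person":
        # Personal names: the declared name must be in the address book.
        return declared_name in address_book
    raise ValueError(f"unknown category: {category}")
```

A `False` result would indicate a mismatch between the declared identity and the sender address, which is the condition under which a fifth prompt warning of possible impersonation would be produced.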
A third obtaining unit for obtaining text information of the target mail;
the second segmentation unit is used for segmenting the text information by using a textCNN (text convolutional neural network) model as a classifier to obtain a plurality of character strings (tokens); the physical size and memory footprint of the classifier are controlled through a preset vocabulary;
the second matching unit is used for matching the plurality of character strings against a preset monetary vocabulary, and masking (MASK) the tokens that are not in the monetary vocabulary, to obtain a matching result;
The second matching unit is also used for generating a sixth prompt according to the matching result;
wherein, the monetary vocabulary is established according to the preset financial fraud information.
In this embodiment, general mail classification only determines whether a mail is junk mail and does not consider what type of junk mail it is; yet in all kinds of fraud, money is often the most sensitive element, so additional reminders are necessary for information that involves money and may constitute fraud;
segmenting the text information with the classifier amounts to splitting and extracting data from the original mail text, so that the words in the resulting character strings can be conveniently compared and matched against the monetary vocabulary, which accelerates the acquisition of the sixth prompt.
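The vocabulary matching and masking step can be sketched as follows. This sketch deliberately replaces the textCNN-based segmentation of the source with a plain regular-expression word split, and the monetary vocabulary, the `[MASK]` marker and the `min_hits` threshold are hypothetical.

```python
import re

# Hypothetical monetary vocabulary; the source builds it from financial fraud data.
MONEY_VOCAB = {"transfer", "invoice", "payment", "account", "refund"}


def mask_non_money_tokens(text):
    """Split text into word tokens and mask those outside the monetary vocabulary."""
    tokens = re.findall(r"\w+", text.lower())
    return [t if t in MONEY_VOCAB else "[MASK]" for t in tokens]


def has_money_signal(text, min_hits=1):
    """Report whether enough unmasked (monetary) tokens remain."""
    masked = mask_non_money_tokens(text)
    return sum(t != "[MASK]" for t in masked) >= min_hits
```

In the described scheme the masked token sequence would feed the classifier; here the count of surviving tokens serves as a stand-in for the matching result.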
In one embodiment, the summarization module 60 is specifically:
configuring a unique id and a unique label for each of the first prompt, the second prompt, the third prompt, the fourth prompt, the fifth prompt and the sixth prompt; and, according to a preset multilingual text dictionary and the placeholders corresponding to the output of each label, integrating the prompts corresponding to the labels and controlling the output by means of an input language marking parameter, to obtain the risk prompt information;
Wherein, the multilingual text dictionary is obtained by manual writing.
In this embodiment, the generated prompts are controlled by formatted output, so that multiple languages can be supported; compared with machine translation, the prompt texts of this scheme are manually written, so the language is more natural and the user experience is improved.
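The label, placeholder and language-parameter mechanism can be sketched with a small template dictionary. The labels, template texts and function name below are hypothetical examples written for this illustration, not the texts used by the scheme.

```python
# Hypothetical multilingual text dictionary: label -> language -> template.
TEMPLATES = {
    "spf_fail": {
        "en": "SPF check failed for {domain}.",
        "zh": "发信域名 {domain} 未通过 SPF 校验。",
    },
    "risky_url": {
        "en": "The link {url} looks suspicious.",
        "zh": "链接 {url} 存在风险。",
    },
}


def render_prompts(findings, lang="en"):
    """Fill each label's template in the requested language and join the lines."""
    lines = []
    for label, params in findings:
        # The language marking parameter selects which hand-written text is used.
        lines.append(TEMPLATES[label][lang].format(**params))
    return "\n".join(lines)
```

Calling `render_prompts([("spf_fail", {"domain": "example.com"})], lang="zh")` selects the Chinese template for the same finding, so supporting another language only requires adding entries to the dictionary.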
In general, the embodiment of the invention has the following beneficial effects:
the device obtains the corresponding prompts by applying different detection methods from six angles, namely SPF verification, spelling of the signaling domain name, uniform resource locators, macro attachments, suspected impersonated names and monetary information, thereby generating highly comprehensive risk prompt information, delivering risk prompts quickly and effectively in one pass, avoiding missed prompts, and safeguarding the information security of the target mail;
potentially inconsistent identity declarations are identified by entity recognition combined with knowledge base matching so as to prompt possible impersonation, which covers a wider scope and reduces the risk of omission; the scheme based on entity recognition can find organization names in a given text and thus discover potential entity impersonation, which simple contact matching cannot do; in addition, compared with a one-size-fits-all scheme that prohibits direct access to external links, the embodiment of the invention uses a method combining empirical knowledge and a deep learning model to identify specific suspicious URLs, which avoids the problem of users becoming desensitized to overly general prompts and overlooking the reminders.
Embodiment III:
the embodiment of the invention provides a computer readable storage medium, which comprises a stored computer program, wherein when the computer program runs, the equipment in which the computer readable storage medium is located is controlled to execute the above mail-based risk prompt information generation method;
wherein the mail-based risk prompt information generation method, if implemented in the form of a software functional unit and used as a stand-alone product, can be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiments by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The foregoing is a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also intended to fall within the scope of protection of the present invention.

Claims (19)

1. The mail-based risk prompt message generation method is characterized by comprising the following steps:
acquiring a target mail;
SPF verification is carried out on the target mail, and a first prompt is generated according to a verification result;
according to spelling of the signaling domain name in the target mail, calculating a probability value by using a state transition probability method, and generating a second prompt by combining the probability value with a preset confidence lower limit and spelling features in the target mail;
generating a third prompt by using a preset neural network model and performing segmentation and real number conversion on the uniform resource locator in the target mail;
performing feature analysis on the macro attachments, suspected impersonated names and monetary information in the target mail to respectively generate a fourth prompt, a fifth prompt and a sixth prompt;
and generating risk prompt information according to the first prompt, the second prompt, the third prompt, the fourth prompt, the fifth prompt and the sixth prompt.
2. The method for generating risk prompt message based on mail as set forth in claim 1, wherein a probability value is calculated by using a state transition probability method according to spelling of a signaling domain name in said target mail, and said probability value is combined with a preset confidence lower limit and spelling features in said target mail to generate a second prompt message, specifically:
traversing the signaling domain name in the target mail by using a preset N value through an N-Gram algorithm to generate a character substring;
combining a preset character set, and constructing all possible N-Gram tuples according to the character substrings to obtain a state transition probability matrix;
splitting the signaling domain name in the state transition probability matrix to obtain a plurality of tuples;
calculating the average state transition probability value of the plurality of tuples to obtain the probability value;
and comparing the probability value with the lower limit of the confidence coefficient, and generating a second prompt by combining a comparison result with spelling features in the target mail.
3. The mail-based risk prompting message generating method according to claim 2, wherein the probability value is compared with a preset confidence lower limit, and a second prompting message is generated by combining a comparison result with spelling features in the target mail, specifically:
Comparing the probability value with a preset confidence coefficient lower limit to obtain a comparison result; the confidence lower limit is obtained by calculating probability values of all domain names in a preset trusted domain name list;
performing normalization detection on the spelling characteristics of vowels and special symbols in the target mail to obtain a detection result;
and generating the second prompt according to the comparison result, the detection result and a low-frequency communication top-level domain list of a mailbox where the target mail is located.
4. The mail-based risk prompting message generating method according to claim 3, wherein the confidence lower limit is obtained by calculating probability values of all domain names in a preset trusted domain name list, specifically:
acquiring a world comprehensive ranking list of a website, and constructing a trusted domain name list by using part of list information in the world comprehensive ranking list;
calculating probability values of all domain names in the trusted domain name list to obtain a probability value set;
and taking the minimum value in the probability value set as the lower confidence limit.
5. The mail-based risk prompting message generating method according to claim 1, wherein a preset neural network model is used to generate a third prompting message by splitting and real number converting a uniform resource locator in the target mail, and the third prompting message is specifically:
Using a preset neural network model to segment the uniform resource locator in the target mail into a first character set;
screening out characters of the first character set which are not in a preset character set, obtaining a second character set, and defining character identifiers of the second character set as unknown characters;
respectively adding a preset beginning character and a preset ending character at the head and the tail of the unknown character to obtain a numbering list related to the second character set;
and carrying out transformation processing on the numbering list by using a two-class network structure in the neural network model to obtain the abnormal probability of the uniform resource locator, and generating the third prompt according to the abnormal probability.
6. The mail-based risk prompt message generating method as claimed in claim 1, wherein the feature analysis is performed on the macro attachment in the target mail to generate a fourth prompt, specifically:
acquiring macro attachments in the target mail;
if dangerous files in the focused attention list appear in the macro attachment, generating a fourth prompt according to the dangerous files;
the important attention list is established according to the type of the forbidden attachments of the mailbox where the target mail is located and preset important attention files.
7. The mail-based risk prompt message generating method as claimed in claim 1, wherein the feature analysis is performed on suspected imposter names in the target mail to generate a fifth prompt, specifically:
acquiring a mail header of the target mail, and matching the mail header with a preset dictionary tree to obtain a first matching result; wherein the dictionary tree is established according to lists of government institutions, banks and public inspection institutions and names of enterprise resident institutions;
obtaining the entity name of the target mail, and matching the entity name with a company list and a personal list in a sequence labeling algorithm to obtain a second matching result;
if the first matching result or the second matching result is that the matching is successful, checking the mail sending address of the target mail, and generating the fifth prompt according to the checking result.
8. The mail-based risk prompting message generating method as in claim 7, wherein checking the originating mailbox address of the target mail, and generating the fifth prompting message according to the checking result, specifically:
judging the category of the mail sending address of the target mail to obtain a declaration object of the target mail;
If the declaration object is a government agency or a public inspection agency, checking whether the top-level domain name of the target mail belongs to a preset special top-level domain;
if the statement object is a bank, an invoice service provider or an enterprise, checking whether the domain name used by the target mail belongs to a pre-collected list;
if the declaration object is an enterprise resident institution, checking whether the sender and the receiver of the target mail form a same domain or a sub-domain;
if the declaration object is a name, checking whether the declaration object is in a preset address book;
and generating the fifth prompt according to the checking result.
9. The mail-based risk prompt message generating method as claimed in claim 1, wherein the feature analysis is performed on the monetary information in the target mail to generate a sixth prompt message, specifically:
acquiring text information of the target mail;
cutting the text information by using a preset classifier to obtain a plurality of character strings;
matching the character strings with a preset money word list, and generating the sixth prompt according to a matching result;
wherein, the monetary vocabulary is established according to preset financial fraud information.
10. The risk prompt information generating device based on the mail is characterized by comprising a mail acquisition module, a first generating module, a second generating module, a third generating module, a fourth generating module and a summarizing module;
the mail acquisition module is used for acquiring a target mail;
the first generation module is used for performing SPF (Sender Policy Framework) verification on the target mail and generating a first prompt according to a verification result;
the second generation module is used for calculating a probability value by using a state transition probability method according to spelling of the signaling domain name in the target mail, and generating a second prompt by combining the probability value with a preset confidence lower limit and spelling features in the target mail;
the third generation module is used for generating a third prompt by using a preset neural network model and carrying out segmentation and real number conversion on the uniform resource locator in the target mail;
the fourth generation module is used for performing feature analysis on the macro attachments, suspected impersonated names and monetary information in the target mail to respectively generate a fourth prompt, a fifth prompt and a sixth prompt;
and the summarization module is used for generating risk prompt information according to the first prompt, the second prompt, the third prompt, the fourth prompt, the fifth prompt and the sixth prompt.
11. The mail-based risk prompt message generating apparatus of claim 10, wherein said second generating module includes a character unit, a matrix unit, a splitting unit, a probability value unit, and a comparing unit;
the character unit is used for traversing the signaling domain name in the target mail by using a preset N value through an N-Gram algorithm to generate a character substring;
the matrix unit is used for combining a preset character set, constructing all possible N-Gram tuples according to the character substring, and obtaining a state transition probability matrix;
the splitting unit is used for splitting the signaling domain name in the state transition probability matrix to obtain a plurality of tuples;
the probability value unit is used for calculating the average state transition probability value of the plurality of tuples to obtain the probability value;
and the comparison unit is used for comparing the probability value with the lower confidence coefficient limit, and generating a second prompt by combining a comparison result with the spelling characteristic in the target mail.
12. The mail-based risk prompt message generating apparatus of claim 11, wherein the comparing unit includes a first subunit, a second subunit, and a third subunit;
The first subunit is configured to compare the probability value with a preset lower confidence coefficient limit to obtain a comparison result; the confidence lower limit is obtained by calculating probability values of all domain names in a preset trusted domain name list;
the second subunit is used for normative detection of character spelling features of vowels and special symbols in the target mail to obtain a detection result;
and the third subunit is configured to generate the second prompt according to the comparison result, the detection result, and a low-frequency communication top domain list of a mailbox where the target mail is located.
13. The mail-based risk prompting message generating device according to claim 12, wherein the confidence lower limit is obtained by calculating probability values of all domain names in a preset trusted domain name list, specifically:
acquiring a world comprehensive ranking list of a website, and constructing a trusted domain name list by using part of list information in the world comprehensive ranking list;
calculating probability values of all domain names in the trusted domain name list to obtain a probability value set;
and taking the minimum value in the probability value set as the lower confidence limit.
14. The mail-based risk prompt message generating apparatus as claimed in claim 10, wherein said third generating module includes a first segmentation unit, a screening unit, a reconstruction unit, and a transformation unit;
the first segmentation unit is used for segmenting the uniform resource locator in the target mail into a first character set by using a preset neural network model;
the screening unit is used for screening out characters of the first character set which are not in a preset character set to obtain a second character set, and defining character identifiers of the second character set as unknown characters;
the reconstruction unit is used for respectively adding a preset beginning character and a preset ending character at the head and the tail of the unknown character to obtain a numbering list related to the second character set;
the transformation unit is configured to perform transformation processing on the number list by using a two-class network structure in the neural network model, obtain an abnormal probability of the uniform resource locator, and generate the third prompt according to the abnormal probability.
15. The mail-based risk prompt message generating apparatus of claim 10, wherein the fourth generation module includes a first acquisition unit and a first generation unit;
The first acquiring unit is used for acquiring macro attachments in the target mail;
the first generating unit is configured to generate the fourth prompt according to the dangerous file if the dangerous file in the focused attention list appears in the macro attachment;
the important attention list is established according to the type of the forbidden attachments of the mailbox where the target mail is located and preset important attention files.
16. The mail-based risk prompt message generating apparatus as claimed in claim 10, wherein said fourth generating module includes a second acquiring unit, a first matching unit, and an inspection unit;
the second acquiring unit is used for acquiring a mail header of the target mail, and matching the mail header with a preset dictionary tree to obtain a first matching result; wherein the dictionary tree is established according to lists of government institutions, banks and public inspection institutions and names of enterprise resident institutions;
the first matching unit is used for acquiring the entity name of the target mail, and matching the entity name with a company list and a personal list in a sequence labeling algorithm to obtain a second matching result;
And the checking unit is used for checking the mail sending address of the target mail if the first matching result or the second matching result is successful, and generating the fifth prompt according to the checking result.
17. The mail-based risk prompt message generating apparatus of claim 16, wherein the check unit includes a fourth subunit, a fifth subunit, a sixth subunit, a seventh subunit, an eighth subunit, and a ninth subunit;
the fourth subunit is configured to perform category judgment on the mailbox address of the target mail, so as to obtain a declaration object of the target mail;
the fifth subunit is configured to check whether the top domain name of the target mail belongs to a preset special top domain if the declaration object is a government agency or a public inspection agency;
the sixth subunit is configured to check whether the domain name used by the target mail belongs to a pre-collected list if the declaration object is a bank, an invoice service provider or an enterprise;
the seventh subunit is configured to check whether the sender and the receiver of the target mail form a same domain or a sub-domain if the declaration object is an enterprise permanent mechanism;
The eighth subunit is configured to check whether the declaration object is in a preset address book if the declaration object is a name;
and the ninth subunit is used for generating the fifth prompt according to the checking result.
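Illustrative sketch (not part of the claims): the per-category checks of claim 17 amount to a dispatch on the declaration object's category. The category labels, parameter names, and example domains below are assumptions chosen for illustration, and the category judgment itself (the fourth subunit) is taken as given.

```python
def domain_of(address):
    """Lowercased domain part of a mail address."""
    return address.rsplit("@", 1)[-1].lower()


def same_or_subdomain(a, b):
    """True if the domains are equal or one is a subdomain of the other."""
    return a == b or a.endswith("." + b) or b.endswith("." + a)


def check_sender(category, sender, recipient, *, special_tlds=(),
                 collected_domains=(), address_book=(), declared_name=None):
    """Return True when the sender passes the check for its category."""
    domain = domain_of(sender)
    if category in ("government", "public_security_judicial"):
        # Domain must end in one of the preset special top-level domains.
        return any(domain == t or domain.endswith("." + t) for t in special_tlds)
    if category in ("bank", "invoice_provider", "enterprise"):
        return domain in collected_domains
    if category == "enterprise_resident_institution":
        return same_or_subdomain(domain, domain_of(recipient))
    if category == "person":
        return declared_name in address_book
    raise ValueError(f"unknown category: {category}")


def fifth_prompt(passed):
    """Turn the check result into a prompt string, or None if it passed."""
    return None if passed else "Sender address does not match the claimed identity."


ok = check_sender("government", "notice@tax.gov.cn", "me@corp.example",
                  special_tlds=["gov.cn"])
print(fifth_prompt(ok))
```

Keyword-only arguments keep the reference data for each branch explicit, so a caller supplies only the list relevant to the category being checked.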
18. The mail-based risk prompt information generation apparatus according to claim 10, wherein the fourth generating module includes a third acquiring unit, a second segmenting unit and a second matching unit;
the third acquiring unit is configured to acquire text information of the target mail;
the second segmenting unit is configured to segment the text information by using a preset classifier to obtain a plurality of character strings;
the second matching unit is configured to match the character strings against a preset monetary vocabulary and to generate the sixth prompt according to the matching result;
wherein the monetary vocabulary is established according to preset financial fraud information.
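Illustrative sketch (not part of the claims): claim 18 segments the mail text into character strings and matches them against a monetary vocabulary. The regular-expression split below is only a stand-in for the preset segmentation component, which the patent does not detail, and the vocabulary entries and prompt wording are made-up examples.

```python
import re


def segment(text):
    """Stand-in segmenter: lowercase alphanumeric runs (an assumption)."""
    return re.findall(r"[a-z0-9']+", text.lower())


def sixth_prompt(text, monetary_vocabulary):
    """Return a prompt listing matched monetary terms, or None if none match."""
    vocab = {w.lower() for w in monetary_vocabulary}
    hits = sorted(set(segment(text)) & vocab)
    if not hits:
        return None
    return "Money-related terms found: " + ", ".join(hits)


print(sixth_prompt("Please wire the payment to this account today.",
                   ["wire", "payment", "refund"]))
```

A set intersection is used so that repeated terms are reported once; a production segmenter for Chinese text would replace `segment` entirely.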
19. A storage medium, wherein a computer program is stored on the storage medium, and the computer program, when called and executed by a computer, implements the mail-based risk prompt information generation method according to any one of claims 1 to 9.
CN202311780789.2A 2023-12-21 2023-12-21 Email-based risk prompt information generation method, device and medium Pending CN117749496A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311780789.2A CN117749496A (en) 2023-12-21 2023-12-21 Email-based risk prompt information generation method, device and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311780789.2A CN117749496A (en) 2023-12-21 2023-12-21 Email-based risk prompt information generation method, device and medium

Publications (1)

Publication Number Publication Date
CN117749496A true CN117749496A (en) 2024-03-22

Family

ID=90279305

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311780789.2A Pending CN117749496A (en) 2023-12-21 2023-12-21 Email-based risk prompt information generation method, device and medium

Country Status (1)

Country Link
CN (1) CN117749496A (en)

Similar Documents

Publication Publication Date Title
EP2803031B1 (en) Machine-learning based classification of user accounts based on email addresses and other account information
CN108259415B (en) Mail detection method and device
US8131742B2 (en) Method and system for processing fraud notifications
US7451487B2 (en) Fraudulent message detection
US20050060643A1 (en) Document similarity detection and classification system
US20170289082A1 (en) Method and device for identifying spam mail
CN108418777A (en) A kind of fishing mail detection method, apparatus and system
US11924245B2 (en) Message phishing detection using machine learning characterization
Joshi et al. Phishing attack detection using feature selection techniques
US10341382B2 (en) System and method for filtering electronic messages
US11978020B2 (en) Email security analysis
CN111753171A (en) Malicious website identification method and device
US12021896B2 (en) Method for detecting webpage spoofing attacks
CN112948725A (en) Phishing website URL detection method and system based on machine learning
CN116886387A (en) Phishing mail detecting system
CN117749496A (en) Email-based risk prompt information generation method, device and medium
CN113746814B (en) Mail processing method, mail processing device, electronic equipment and storage medium
US11936686B2 (en) System, device and method for detecting social engineering attacks in digital communications
Morovati et al. Detection of Phishing Emails with Email Forensic Analysis and Machine Learning Techniques.
US11757816B1 (en) Systems and methods for detecting scam emails
CN113992390A (en) Phishing website detection method and device and storage medium
Saleem The P-Fryer: Using Machine Learning and Classification to Effectively Detect Phishing Emails
US11997138B1 (en) Detecting and analyzing phishing attacks through artificial intelligence
US20220343067A1 (en) Text Analysis System, and Characteristic Evaluation System for Message Exchange Using the Same
Ermakova Spam and phishing detection in various languages

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination