CN104375982A

CN104375982A - Method for determining visual similarity of texts

Info

Publication number: CN104375982A
Application number: CN201410564469.8A
Authority: CN
Inventors: 柳厅文; 张浩亮; 闫旸; 时金桥; 亚静; 季月英
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2014-10-21
Filing date: 2014-10-21
Publication date: 2015-02-25

Abstract

The invention relates to a method for determining the visual similarity of texts. The method comprises: (1) calculating the visual distance between two character strings; (2) calculating the visual similarity of the two character strings from that distance; (3) for specific character strings, namely email addresses, taking the maximum visual similarity between the recipient address of the current email and the addresses in the sender's previous email records as a classifier feature; and (4) training a random forest classifier and using it to detect missent emails. Compared with traditional detection techniques, the method achieves higher accuracy and recall.

Description

A method for determining the visual similarity of texts

Technical field

The present invention relates to a method for determining the visual similarity of texts and belongs to the field of Internet technology.

Background art

With the rapid growth of the Internet, it carries a vast and varied body of information, and its scale keeps increasing quickly. This information includes a large amount of text with very high visual similarity. Textual visual similarity measures, for two given texts, how similar they are from the standpoint of human visual perception. For a legitimate or normal text A, if some text B is visually very similar to it, a reader may be misled into treating text B as text A, which exposes the user to unnecessary risk and trouble. For example, if text A is the URL of a bank website, a criminal may forge that website, embed attack scripts such as malicious Trojans in it, and use a text B that looks very similar to text A as the URL of the forged site. Once a user is deceived and clicks text B believing it to be text A, the user's account may be stolen or funds may even be drained, with serious economic consequences. Likewise, if a user confuses two very similar email addresses and enters the wrong recipient address, an email is missent. If the missent email contains personal information, financial data, or even classified information, serious social and economic problems can follow.

Most current mail clients provide address auto-completion: the user types the first few characters of the recipient's address, and the client recommends addresses from the user's sending history that have the typed characters as a prefix. Auto-completion spares the user from typing the full recipient address, but it also makes it easier to pick the wrong candidate and missend an email. A method for determining textual visual similarity is therefore needed to prevent missent emails caused by visual oversight.

The traditional method for determining string similarity is the Levenshtein similarity calculation, that is, the edit distance between strings, which assigns the same weight to characters at every position of a string. This approach cannot reasonably and accurately reflect how users read and write specific character strings (such as email addresses and URLs). For two character strings, different positions should be given different weights according to how people actually read. Doing so helps prevent URL phishing attacks that exploit misreading, and helps prevent emails from being sent to the wrong recipient because the user mistyped the recipient address, which can leak personal privacy or even state secrets.

A new method for determining string similarity is therefore urgently needed to remedy these shortcomings.

Summary of the invention

Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a method for determining the visual similarity of specific texts, that is, a text similarity method for specific character strings (typically email addresses), which achieves higher accuracy and recall than traditional detection techniques.

Technical solution of the invention: a method for determining textual visual similarity, implemented in the following steps:

(1) Calculate the visual distance between two character strings using the following formula:

VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]}) = θ \times \min(x, y, z) + \begin{cases} 0 & \text{if } α_{l(α)} = β_{l(β)} \\ 1 & \text{if } α_{l(α)} \neq β_{l(β)} \end{cases}

where

\begin{cases} x = VD(α_{[t+1, l(α)-1]}, β_{[t+1, l(β)]}) \\ y = VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)-1]}) \\ z = VD(α_{[t+1, l(α)-1]}, β_{[t+1, l(β)-1]}) \end{cases}

Calculate the maximum possible visual distance between the two character strings:

MVD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]}) = \sum_{k=0}^{\max(l(α), l(β)) - t - 1} θ^{k} = \frac{1 - θ^{\max(l(α), l(β)) - t}}{1 - θ}

where α and β denote the two character strings, l(·) denotes the length of a string, and θ denotes the ratio between the weight factors of adjacent characters.

(2) Calculate the visual similarity of the two character strings from the visual distance obtained in step (1), using the following formula:

VS(A, B) = \begin{cases} 0 & \text{if } α_{[1, t]} \neq β_{[1, t]} \\ 1 - \frac{VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]})}{MVD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]})} & \text{otherwise} \end{cases}

Here VS denotes the resulting visual similarity, VD denotes the visual distance between the character strings calculated above, and MVD denotes the theoretical maximum of the visual distance between the character strings α and β.

(3) The visual similarity between the recipient address of the current email and the email addresses in the contact list drawn from the user's sending history is used as the VESA (Visual Similarity between Email Addresses) feature of the classifier, in order to measure the visual similarity of specific character strings such as those formatted like email addresses;

(4) For each email, the corresponding VESA feature is calculated and used as a classifier feature; a random forest classifier is trained and tested on all sent emails in the Enron data set and is then used to detect missent emails;

(5) For each email, the maximum visual similarity between the recipient address of the current email and the addresses in the sender's past email records is used as the classifier feature, to measure the similarity of specific character strings, namely email addresses.

When the Enron data set is used for training and performance testing in step (4), it is divided into three parts: the first part is used to compile the sender lists of all mail users, the second part is used as the training set to generate the random forest classifier, and the third part is used as the test set to evaluate the classifier's performance. For each email in the training set, the string visual similarity between its recipient address and each historical recipient address of its sender is calculated, and the maximum is taken as the VESA feature of that email. Each email in the test set is then classified with the random forest to determine whether it was missent.
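The three-way division described above can be sketched as follows. The text does not state the proportions of the three parts or how the emails are ordered, so the equal-sized split over an already sorted list, the `(sender, recipient)` tuple layout, and the function names `split_three_ways` and `build_address_books` are all assumptions made for illustration, not the procedure actually used.

```python
from collections import defaultdict


def split_three_ways(emails):
    """Split an ordered list of emails into three consecutive parts.

    ASSUMPTION: equal-sized parts; the source gives no proportions.
    """
    n = len(emails)
    first_cut, second_cut = n // 3, 2 * n // 3
    return emails[:first_cut], emails[first_cut:second_cut], emails[second_cut:]


def build_address_books(history):
    """Map each sender to the set of recipient addresses seen in `history`.

    ASSUMPTION: each email is a (sender, recipient) tuple.
    """
    books = defaultdict(set)
    for sender, recipient in history:
        books[sender].add(recipient)
    return dict(books)


# Hypothetical toy data: (sender, recipient) pairs in sending order.
emails = [("a", "x"), ("a", "y"), ("b", "x"), ("a", "z"),
          ("b", "y"), ("b", "z"), ("a", "w")]
history, train, test = split_three_ways(emails)
books = build_address_books(history)
```

In this sketch the first part only supplies per-sender address lists, so the similarity feature of a training or test email is always computed against addresses collected beforehand.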

Compared with the prior art, the invention has the following advantage: it takes full account of the habit, when reading URLs or email addresses, of concentrating on the prefix of a string and easily overlooking the suffix. Compared with traditional character-based similarity calculations, it measures more accurately how easily two character strings are confused.

Brief description of the drawings

Fig. 1 is a flow chart of the method of the invention;

Fig. 2 is a comparison chart of the experimental results.

Detailed description of the embodiments

As shown in Fig. 1, the method for determining textual visual similarity proposed by the invention is as follows.

Definition: for texts A = α_1 α_2 … α_{l(α)} (abbreviated α_{[1, l(α)]}) and B = β_{[1, l(β)]}, where l(·) denotes the length of a text, the visual similarity of the two is defined as:

VS(A, B) = \begin{cases} 0 & \text{if } α_{[1, t]} \neq β_{[1, t]} \\ 1 - \frac{VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]})}{MVD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]})} & \text{otherwise} \end{cases}

The visual similarity takes values in [0, 1]. If the first t characters of the two texts are not identical, their visual similarity is taken to be 0. In a mail client, the user typically types the first few letters of an address into the address box and then selects one of the suggested candidates by pressing Enter. As noted above, carelessness at this step can lead the user to select the wrong recipient and thus missend the email.

The VD function gives the visual distance between two texts. It is calculated as follows:

VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]}) = θ \times \min(x, y, z) + \begin{cases} 0 & \text{if } α_{l(α)} = β_{l(β)} \\ 1 & \text{if } α_{l(α)} \neq β_{l(β)} \end{cases}

where

\begin{cases} x = VD(α_{[t+1, l(α)-1]}, β_{[t+1, l(β)]}) \\ y = VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)-1]}) \\ z = VD(α_{[t+1, l(α)-1]}, β_{[t+1, l(β)-1]}) \end{cases}

θ expresses the difference in weight between two adjacent characters and takes values in [0, 1]. The smaller its value, the higher the relative weight of later characters. When θ equals 1, the visual distance is the Levenshtein distance.

The Levenshtein distance, also called the edit distance, is the minimum number of edit operations needed to transform one of two given texts into the other. The edit operations are character insertion, character deletion, and character substitution. The larger the Levenshtein distance between two texts, the more they differ and the lower their similarity.

For example, transforming kitten into sitting requires at least the following three operations:

1. Substitution: sitten (k → s)

2. Substitution: sittin (e → i)

3. Insertion: sitting (→ g)

Therefore, the Levenshtein distance between the texts kitten and sitting is 3.
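For reference, the edit distance in the example above can be computed with the standard dynamic-programming recurrence. The following is a minimal sketch; the function name `levenshtein` is chosen here for illustration and does not come from the source.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning a into b."""
    # prev[j] is the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance between a[:i] and the empty string
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))  # 3
```

Every operation costs 1 regardless of where in the string it occurs.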

The Levenshtein distance gives edit operations at different positions the same weight. For example, it treats a substitution at the first character the same as a substitution at a middle character. The edit distance between levenshtein and nevenshtein is 1, and the edit distance between levenshtein and levemshtein is also 1, so the two pairs have the same similarity; yet the latter pair is clearly easier to mistake for one another. This is because a person who must quickly judge whether two texts are identical does not compare them character by character, but pays more attention to whether the characters at particular positions match, for example the first few characters, the last few characters, and the characters near special characters. Many phishing URLs are constructed to exploit this tendency.

The MVD function gives the maximum visual distance between two texts. From the formula for the VD function it follows that:

MVD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]}) = \sum_{k=0}^{\max(l(α), l(β)) - t - 1} θ^{k} = \frac{1 - θ^{\max(l(α), l(β)) - t}}{1 - θ}
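The step from the VD recursion to this sum is not spelled out. One way to see it, assuming a distance of 0 between two empty strings (an assumption, since the text gives no base case), is to consider the worst case in which every last-character comparison differs. Writing D_n for the largest attainable distance when the longer remaining string has n = \max(l(α), l(β)) - t characters:

D_0 = 0, \qquad D_n = θ D_{n-1} + 1

D_n = 1 + θ + θ^{2} + \dots + θ^{n-1} = \sum_{k=0}^{n-1} θ^{k} = \frac{1 - θ^{n}}{1 - θ} \quad (θ \neq 1)

The closed form on the right is the usual geometric-series identity and is undefined at θ = 1, where the sum simply equals n.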

When θ = 1, that is, when the ratio between the weight factors of adjacent characters is 1, MVD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]}) = \max(l(α), l(β)) - t, which is the maximum visual distance between the two character strings for θ = 1.
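The three functions VD, MVD and VS can be sketched in a few lines of Python. This is an illustrative reading of the formulas, not a definitive implementation: the text gives no base cases for the recursion, so the boundary values below (an empty string against a string of length n costs 1 + θ + … + θ^(n-1)) and the value 1.0 returned when nothing remains after the shared prefix are assumptions, as are the function names `vd`, `mvd` and `vs`. The second recursive term is taken with the second string shortened by one character.

```python
def mvd(len_a: int, len_b: int, theta: float) -> float:
    """Maximum visual distance for two remainders of the given lengths."""
    # Summing the series directly also covers theta == 1,
    # where the closed form would divide by zero.
    return sum(theta ** k for k in range(max(len_a, len_b)))


def vd(a: str, b: str, theta: float) -> float:
    """Visual distance: theta * min(x, y, z) + (1 if the last characters differ)."""
    # d[i][j] holds VD(a[:i], b[:j]).
    d = [[0.0] * (len(b) + 1) for _ in range(len(a) + 1)]
    # ASSUMPTION: boundary cells apply the same recursion with one predecessor.
    for i in range(1, len(a) + 1):
        d[i][0] = theta * d[i - 1][0] + 1.0
    for j in range(1, len(b) + 1):
        d[0][j] = theta * d[0][j - 1] + 1.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            mismatch = 0.0 if a[i - 1] == b[j - 1] else 1.0
            d[i][j] = theta * min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1]) + mismatch
    return d[len(a)][len(b)]


def vs(a: str, b: str, t: int, theta: float) -> float:
    """Visual similarity in [0, 1]; 0 if the first t characters differ."""
    if a[:t] != b[:t]:
        return 0.0
    rest_a, rest_b = a[t:], b[t:]
    if not rest_a and not rest_b:
        return 1.0  # ASSUMPTION: nothing left to compare after the prefix
    return 1.0 - vd(rest_a, rest_b, theta) / mvd(len(rest_a), len(rest_b), theta)


print(round(vs("chenxiaojun", "chengxiuyun", t=3, theta=0.5), 2))  # 0.82
```

Under these assumptions the sketch gives about 0.82 for the local parts "chenxiaojun" and "chengxiuyun" with t = 3 and θ = 1/2, the same value as the worked example in this description.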

The method of the invention is evaluated with a missent-email detection experiment. Consider a message M with content c, sender s, and n recipients R = {r_1, r_2, …, r_n}. The message M then actually comprises n emails, M = {E_1, E_2, …, E_n}, where each E_i is a triple E_i = (s, r_i, c). Missent-email detection is in effect a binary classification problem: deciding for each email E_i whether it was missent. For an email E_i, the visual similarity between the current recipient r_i and the addresses in the historical recipient list of the sender s can serve as an important feature, denoted VSEA (Visual Similarity of Email Address):

VSEA(E_{i}) = \max_{u \in EAB(s)} VS(r_{i}, u)

where EAB(s) denotes the set of all recipient addresses to which the address s has sent email, and VS is the visual similarity between two email address strings defined above. The formula gives the maximum visual similarity between the current recipient and the addresses of the sender's historical recipients.
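The feature is a maximum over the sender's address book and can be sketched independently of how VS itself is computed. In the snippet below the similarity function is passed in as a parameter; `stand_in_similarity`, built on Python's `difflib.SequenceMatcher`, is only a placeholder so that the example runs on its own and is not the VS measure defined in this description. The name `vsea` and the value 0.0 for an empty address book are likewise assumptions.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def vsea(recipient: str,
         address_book: Iterable[str],
         similarity: Callable[[str, str], float]) -> float:
    """Largest similarity between `recipient` and any address in `address_book`."""
    # ASSUMPTION: a sender with no history yields 0.0.
    return max((similarity(recipient, u) for u in address_book), default=0.0)


def stand_in_similarity(a: str, b: str) -> float:
    """Placeholder similarity in [0, 1]; NOT the VS function of this document."""
    return SequenceMatcher(None, a, b).ratio()


# Hypothetical address book for one sender.
book = ["chenxiaojun@iie.ac.cn", "someone@example.org"]
feature = vsea("chengxiuyun@iie.ac.cn", book, stand_in_similarity)
```

Passing the similarity as an argument keeps the feature extraction separate from the choice of string measure, so the same code can be evaluated with VS or with a baseline.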

The other 20 features used in missent-email detection are listed in the table below.

Table 1

VSEA is used as the 21st feature. The experimental data set is the Enron email data set, a large collection of real emails from Enron employees that was released by the U.S. Federal Energy Regulatory Commission for public research use. The experiment used ten-fold cross-validation to obtain the final detection results, on the Enron email records from January 2010 to October 2010. The results are shown in Fig. 2.

The missent-email data set was generated artificially. It comprises two subsets: (1) emails with randomly generated recipient addresses; (2) emails addressed to recipients with whom there had been no email contact for more than a certain period. Fig. 2 shows how using different features affects missent-email detection. As the figure shows, a classifier using only the first 10 features has recall and accuracy below 10%, while using the first 20 features raises recall and accuracy above 80%. Adding the VESA feature improves the accuracy and recall of missent-email detection by a further 2% to 5%. In Fig. 2, β denotes the percentage of emails in the missent set for which the sender and recipient have no prior contact record.

Example

Consider two email addresses α = "chenxiaojun@iie.ac.cn" and β = "chengxiuyun@iie.ac.cn".

The textual visual similarity of the two character strings is calculated with the parameters t = 3 and θ = 1/2:

VS(α, β) = 1 - \frac{VD(\text{"nxiaojun"}, \text{"ngxiuyun"})}{MVD(\text{"nxiaojun"}, \text{"ngxiuyun"})} = 1 - \frac{1/512 + 1/64 + 1/32 + 1/16 + 1/4}{(1 - (1/2)^{8}) / (1 - 1/2)} \approx 0.82

VD is the visual distance between the character strings, and MVD is the value the distance would take if all characters of the two strings differed.
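The arithmetic behind the value 0.82 can be checked step by step, taking the numerator terms as given and evaluating MVD from its closed form with \max(l(α), l(β)) - t = 8 remaining characters:

VD = \frac{1}{512} + \frac{1}{64} + \frac{1}{32} + \frac{1}{16} + \frac{1}{4} = \frac{1 + 8 + 16 + 32 + 128}{512} = \frac{185}{512} \approx 0.3613

MVD = \frac{1 - (1/2)^{8}}{1 - 1/2} = \frac{255}{128} \approx 1.9922

VS = 1 - \frac{185/512}{255/128} = 1 - \frac{185}{1020} = 1 - \frac{37}{204} \approx 0.8186 \approx 0.82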

This shows that textual visual similarity yields a finer-grained measurement.

By contrast, the Levenshtein distance between the two character strings is 3; taking the reciprocal of the Levenshtein distance as the string similarity gives 0.33. The method for determining textual visual similarity proposed by the invention clearly reflects people's habits in reading character strings more faithfully.

The above embodiments are provided only to describe the invention and are not intended to limit its scope. The scope of the invention is defined by the following claims. All equivalent substitutions and modifications made without departing from the spirit and principles of the invention fall within its scope.

Claims

1. A method for determining textual visual similarity, characterized by the following steps:

(1) calculating the visual distance between two character strings by the following formula:

VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]}) = θ \times \min(x, y, z) + \begin{cases} 0 & \text{if } α_{l(α)} = β_{l(β)} \\ 1 & \text{if } α_{l(α)} \neq β_{l(β)} \end{cases}

where

\begin{cases} x = VD(α_{[t+1, l(α)-1]}, β_{[t+1, l(β)]}) \\ y = VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)-1]}) \\ z = VD(α_{[t+1, l(α)-1]}, β_{[t+1, l(β)-1]}) \end{cases}

and calculating the maximum possible visual distance between the two character strings:

MVD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]}) = \sum_{k=0}^{\max(l(α), l(β)) - t - 1} θ^{k} = \frac{1 - θ^{\max(l(α), l(β)) - t}}{1 - θ}

where α and β denote the two character strings, l(·) denotes the length of a string, and θ denotes the ratio between the weight factors of adjacent characters;

(2) calculating the visual similarity of the two character strings from the visual distance obtained in step (1), by the following formula:

VS(A, B) = \begin{cases} 0 & \text{if } α_{[1, t]} \neq β_{[1, t]} \\ 1 - \frac{VD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]})}{MVD(α_{[t+1, l(α)]}, β_{[t+1, l(β)]})} & \text{otherwise} \end{cases}

where VD denotes the visual distance between the email address strings calculated above, and MVD denotes the theoretical maximum of the visual distance between the character strings α and β;

(3) using the visual similarity between the recipient address of the current email and the email addresses in the contact list drawn from the user's sending history as the VESA (Visual Similarity between Email Addresses) feature of the classifier, to measure the visual similarity of specific character strings such as those formatted like email addresses;

2. The method for determining textual visual similarity according to claim 1, characterized in that: when the Enron data set is used for training and performance testing in step (4), it is divided into three parts, the first part being used to compile the sender lists of all mail users, the second part being used as the training set to generate the random forest classifier, and the third part being used as the test set to evaluate the classifier's performance; for each email in the training set, the string visual similarity between its recipient address and each historical recipient address of its sender is calculated, and the maximum is taken as the VESA feature of that email; each email in the test set is then classified with the random forest to determine whether it was missent.