CN104375982A - Method for determining visual similarity of texts - Google Patents


Info

Publication number
CN104375982A
Authority
CN
China
Prior art keywords
beta
alpha
mail
similarity
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201410564469.8A
Other languages
Chinese (zh)
Inventor
柳厅文
张浩亮
闫旸
时金桥
亚静
季月英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Information Engineering of CAS
Original Assignee
Institute of Information Engineering of CAS
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Information Engineering of CAS
Priority to CN201410564469.8A
Publication of CN104375982A
Legal status: Pending

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to a method for determining the visual similarity of texts. The method comprises the steps of (1) calculating the visual distance between two character strings; (2) calculating the visual similarity of the two character strings from that distance; (3) detecting the similarity of specific character strings, namely email addresses, by using the maximum similarity between the current email and the sender's previous email records as a classifier feature; and (4) training and testing a random forest classifier on the emails in order to detect missent emails. The method achieves higher accuracy and recall than traditional detection techniques.

Description

A method for determining the visual similarity of texts
Technical field
The present invention relates to a method for determining the visual similarity of texts, and belongs to the field of Internet technology.
Background
With the rapid development of the Internet, it now carries a vast and varied body of information, and its scale is growing quickly. This information includes a large number of texts that are visually very similar to one another. Visual similarity of texts means that, given two texts, their similarity is measured from the standpoint of human visual perception. For a legitimate or normal text A, if some text B is visually very similar to it, a person may misread B and mistakenly use it as the original text A. This exposes the user to unnecessary risk and trouble. For example, if text A is the URL of a bank's website, a criminal may forge that website, embed attack scripts such as malicious Trojans in it, and use a text B that looks very similar to text A as the URL of the forged site. Once the user is deceived and clicks text B as if it were text A, the user's account may be stolen or funds may even be siphoned off, with serious economic consequences. Likewise, if the user confuses two very similar email addresses and enters the wrong recipient address, the email is missent. If the missent email contains sensitive data such as personal information, financial data, or even classified information, serious social and economic problems can result. Almost all current mail clients provide address auto-completion: the user types the first few characters of the recipient's address, and the client, based on the user's sending history, suggests addresses that have the typed characters as a prefix. Auto-completion spares the user from typing the full recipient address, which is convenient, but it also makes missent emails more likely, because people can select the wrong candidate. A method for determining the visual similarity of texts is therefore needed to prevent missent emails caused by visual oversight.
The traditional method for determining string similarity is the Levenshtein similarity calculation, that is, the edit distance between strings, which assigns the same weight to characters at every position of the string. This approach cannot reasonably and accurately reflect how users actually read and write specific kinds of strings (such as email addresses and URLs). For two strings, different weights should be assigned to different positions according to how people actually read, in order to prevent URL phishing attacks that exploit misreading, and to prevent emails from reaching the wrong recipient because the user mistyped the address, which can leak personal privacy or even state secrets.
A new method for determining string similarity is therefore urgently needed to remedy these shortcomings.
Summary of the invention
Technical problem solved by the invention: to overcome the deficiencies of the prior art by providing a method for determining the visual similarity of particular texts, that is, a text similarity method for specific character strings (typically email addresses), which achieves higher accuracy and recall than traditional detection techniques.
Technical solution of the invention: a method for determining the visual similarity of texts, implemented in the following steps:
(1) Calculate the visual distance between two character strings using the following formula:
$$VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)]) = \theta \times \min(x, y, z) + \begin{cases} 0 & \text{if } \alpha_{l(\alpha)} = \beta_{l(\beta)} \\ 1 & \text{if } \alpha_{l(\alpha)} \neq \beta_{l(\beta)} \end{cases}$$
where
$$\begin{aligned} x &= VD(\alpha[t+1, l(\alpha)-1], \beta[t+1, l(\beta)]) \\ y &= VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)-1]) \\ z &= VD(\alpha[t+1, l(\alpha)-1], \beta[t+1, l(\beta)-1]) \end{aligned}$$
Calculate the maximum possible visual distance between the two character strings:
$$MVD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)]) = \sum_{k=0}^{\max(l(\alpha), l(\beta)) - t - 1} \theta^{k} = \frac{1 - \theta^{\max(l(\alpha), l(\beta)) - t}}{1 - \theta}$$
where α and β denote the two character strings, l(·) denotes the length of a string, and θ denotes the ratio of the weight factors of adjacent characters.
(2) Calculate the visual similarity of the two character strings from the visual distance obtained in step (1), using the following formula:
$$VS(A, B) = \begin{cases} 0 & \text{if } \alpha[1, t] \neq \beta[1, t] \\ 1 - \dfrac{VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)])}{MVD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)])} & \text{otherwise} \end{cases}$$
where VD is the visual distance between the strings calculated above, VS is the resulting visual similarity, and MVD is the theoretical maximum of the visual distance between the strings α and β.
(3) The visual similarity between the recipient address of the current email and the email addresses in the contact list derived from the user's sending history is used as the VSEA (Visual Similarity of Email Addresses) feature of the classifier, in order to detect the visual similarity of specific character strings, such as strings in email-address format;
(4) The corresponding VSEA feature is computed for each email and used as a classifier feature; a random forest classifier is trained and tested on all sent emails in the Enron data set and is then used to detect missent emails.
(5) For each email, the maximum similarity between the current email and the sender's past email records is used as a classifier feature to detect the similarity of specific character strings, namely email addresses;
When the Enron data set is used for training and testing in step (4), the data set is divided into three parts: the first part is used to compile the sender lists of all mail users; the second part serves as the training set, used to generate the random forest classifier; and the third part serves as the test set, used to evaluate the classifier's performance. For each email in the training set, the string visual similarity between the email's recipient and each of its sender's historical recipients is computed, and the maximum is taken as the VSEA feature of that email. Each email in the test set is then classified with the random forest to determine whether it was missent.
Compared with the prior art, the advantage of the present invention is that it takes full account of the human habit, when reading a URL or email address, of focusing on the prefix of the string and easily overlooking the suffix; compared with traditional string similarity calculation, it measures more accurately how easily two strings are confused.
Brief description of the drawings
Fig. 1 is a flow chart of the method of the invention;
Fig. 2 compares the experimental results.
Detailed description
As shown in Fig. 1, the method for determining the visual similarity of texts proposed by the invention is as follows.
Definition: for texts A = α_1 … α_{l(α)} (abbreviated α[1, l(α)]) and B = β[1, l(β)], where l(·) denotes the length of a text, the visual similarity of the two is defined as:
$$VS(A, B) = \begin{cases} 0 & \text{if } \alpha[1, t] \neq \beta[1, t] \\ 1 - \dfrac{VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)])}{MVD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)])} & \text{otherwise} \end{cases}$$
The visual similarity takes values in [0, 1]. If the first t characters of the two texts are not identical, the visual similarity of the two texts is taken to be 0. In a mail client, the user types the first few letters of an address into the address box and then selects one of the suggested candidates by pressing Enter. As noted above, in this process a careless user may select the wrong recipient and thereby missend the email.
The VD function gives the visual distance between two texts. It is computed as follows:
$$VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)]) = \theta \times \min(x, y, z) + \begin{cases} 0 & \text{if } \alpha_{l(\alpha)} = \beta_{l(\beta)} \\ 1 & \text{if } \alpha_{l(\alpha)} \neq \beta_{l(\beta)} \end{cases}$$
where
$$\begin{aligned} x &= VD(\alpha[t+1, l(\alpha)-1], \beta[t+1, l(\beta)]) \\ y &= VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)-1]) \\ z &= VD(\alpha[t+1, l(\alpha)-1], \beta[t+1, l(\beta)-1]) \end{aligned}$$
θ represents the weight difference between two adjacent characters, and its value lies in [0, 1]. The smaller the value, the higher the weight of the later characters. When θ equals 1, the visual distance is the Levenshtein distance.
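The recursion above can be sketched in Python as follows. This is only an illustrative reading of the formula, not the patent's implementation: the text does not state the base cases, so the sketch assumes that once one string is exhausted, every remaining character of the other counts as a mismatch. It also takes y to be the case in which the last character of β is dropped. The names `visual_distance` and `vd` are hypothetical.

```python
from functools import lru_cache


def visual_distance(a: str, b: str, theta: float = 0.5) -> float:
    """Position-weighted distance between two suffixes, following the VD recursion."""

    @lru_cache(maxsize=None)
    def vd(i: int, j: int) -> float:
        # i and j are the lengths of the leading parts of a and b still being compared.
        if i == 0 or j == 0:
            # Assumed base case (not given in the text): each leftover character
            # is a mismatch, discounted by one more factor of theta per position.
            return sum(theta ** k for k in range(max(i, j)))
        mismatch = 0.0 if a[i - 1] == b[j - 1] else 1.0
        x = vd(i - 1, j)      # drop the last character of a
        y = vd(i, j - 1)      # drop the last character of b
        z = vd(i - 1, j - 1)  # drop the last character of both
        return theta * min(x, y, z) + mismatch

    return vd(len(a), len(b))


print(visual_distance("abc", "abc"))  # identical strings
print(visual_distance("abc", "abd"))  # mismatch in the last position
print(visual_distance("xbc", "abc"))  # mismatch in the first position
```

Under these assumptions the three calls print 0.0, 1.0 and 0.25: a mismatch in the final position contributes its full cost of 1, while a mismatch two positions earlier is discounted by θ² = 1/4.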
The Levenshtein distance, also called the edit distance, is the minimum number of edit operations needed to change one of two given texts into the other. The edit operations are character insertion, character deletion, and character substitution. The larger the Levenshtein distance between two texts, the more they differ and the lower their similarity.
For example, changing kitten into sitting requires at least the following three operations:
1. Substitution: sitten (k → s)
2. Substitution: sittin (e → i)
3. Insertion: sitting (→ g)
The Levenshtein distance between the texts kitten and sitting is therefore 3.
The Levenshtein distance gives the same weight to edit operations at every position. For example, it treats a substitution at the first character the same as a substitution at a middle character. Thus the edit distance between levenshtein and nevenshtein is 1, and the edit distance between levenshtein and levemshtein is also 1, so both pairs have the same similarity; yet the latter pair is clearly more easily mistaken for each other. This is because when people judge in a short time whether two texts are the same, they do not compare them character by character but pay more attention to whether characters at particular positions match, for example the first few characters, the last few characters, and the characters next to special characters. Many phishing URLs are constructed to exploit this tendency.
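For reference, the classic Levenshtein distance described above can be computed with the standard dynamic-programming scheme. The following is a minimal sketch (the function name `levenshtein` is chosen here for illustration); it shows that the measure cannot distinguish a substitution at the first character from one in the middle.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance: insertions, deletions and substitutions all cost 1."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if the characters match)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("kitten", "sitting"))           # 3
print(levenshtein("levenshtein", "nevenshtein"))  # 1
print(levenshtein("levenshtein", "levemshtein"))  # 1
```

Both single-character substitutions yield the same distance of 1, regardless of where in the string they occur.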
The MVD function gives the maximum visual distance between two texts. From the formula for the VD function, we obtain:
$$MVD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)]) = \sum_{k=0}^{\max(l(\alpha), l(\beta)) - t - 1} \theta^{k} = \frac{1 - \theta^{\max(l(\alpha), l(\beta)) - t}}{1 - \theta}$$
When θ = 1, that is, when the ratio of the weight factors of adjacent characters is 1, MVD(α[t+1, l(α)], β[t+1, l(β)]) is set to max(l(α), l(β)) - t, which is the maximum visual distance between the two strings for θ = 1.
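Putting the pieces together, the following Python sketch combines the prefix check, the VD recursion and the MVD normalization into a single similarity function. It is a hedged illustration rather than the patent's implementation: the base cases of the recursion and the handling of strings no longer than t are assumptions, and the names `max_visual_distance` and `visual_similarity` are hypothetical. Because of these assumptions, its output need not match numerical figures quoted elsewhere in this document.

```python
from functools import lru_cache


def max_visual_distance(len_a: int, len_b: int, t: int, theta: float) -> float:
    """Closed form of sum(theta**k for k in range(n)), with n = max(len_a, len_b) - t."""
    n = max(len_a, len_b) - t
    if theta == 1:
        return float(n)  # limiting case stated in the text
    return (1 - theta ** n) / (1 - theta)


def visual_similarity(a: str, b: str, t: int = 3, theta: float = 0.5) -> float:
    if a[:t] != b[:t]:
        return 0.0  # different t-character prefixes: similarity is defined as 0
    sa, sb = a[t:], b[t:]
    if not sa and not sb:
        return 1.0  # assumed: nothing left to compare, so the strings are identical

    @lru_cache(maxsize=None)
    def vd(i: int, j: int) -> float:
        if i == 0 or j == 0:
            # Assumed base case: leftover characters all count as mismatches.
            return sum(theta ** k for k in range(max(i, j)))
        mismatch = 0.0 if sa[i - 1] == sb[j - 1] else 1.0
        return theta * min(vd(i - 1, j), vd(i, j - 1), vd(i - 1, j - 1)) + mismatch

    return 1 - vd(len(sa), len(sb)) / max_visual_distance(len(a), len(b), t, theta)


print(visual_similarity("alice@example.com", "alice@example.com"))  # identical addresses
print(visual_similarity("bob@example.com", "rob@example.com"))      # prefixes differ
```

Identical addresses score 1.0, and addresses whose first t characters differ score 0.0; everything else falls in between, since VD never exceeds MVD under the assumed base cases.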
The method of the invention is evaluated with a missent-email detection experiment. Consider a message M with content c, sender s, and n recipients R = {r_1, r_2, …, r_n}. The message M then actually comprises n emails, M = {E_1, E_2, …, E_n}, where each E_i is a triple E_i = (s, r_i, c). Missent-email detection is in effect a binary classification problem: deciding whether each email E_i is missent. For an email E_i, the visual similarity between the current recipient r_i and the addresses in the historical recipient list of the sender s can be used as a key feature, denoted VSEA (Visual Similarity of Email Addresses):
$$VSEA(E_i) = \max_{u \in EAB(s)} VS(r_i, u)$$
where EAB(s) is the set of all recipient addresses to which the address s has sent email, and VS is the visual similarity between two email address strings defined above. The formula gives the maximum visual similarity between the recipient of the email and the historical recipient addresses of its sender.
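The VSEA feature is simply a maximum over the sender's address book, which can be sketched as below. The similarity function is passed in as a parameter; `shared_prefix_ratio` is a toy stand-in used only to make the example runnable and is not the VS measure of the patent. The function names, the sample addresses, and the convention that an empty history yields 0.0 are all assumptions made for illustration.

```python
from typing import Callable, Iterable


def vsea(recipient: str, history: Iterable[str],
         similarity: Callable[[str, str], float]) -> float:
    """Maximum similarity between the recipient and the sender's past recipients."""
    # Assumed convention: an empty history yields 0.0 (the text does not cover this case).
    return max((similarity(recipient, u) for u in history), default=0.0)


def shared_prefix_ratio(a: str, b: str) -> float:
    """Toy stand-in for VS: length of the common prefix over the longer length."""
    n = 0
    for ca, cb in zip(a, b):
        if ca != cb:
            break
        n += 1
    return n / max(len(a), len(b), 1)


eab = ["chenxiaojun@iie.ac.cn", "alice@example.org"]  # hypothetical address book of s
print(vsea("chenxiaojun@iie.ac.cn", eab, shared_prefix_ratio))  # exact match in history
print(vsea("bob@example.org", [], shared_prefix_ratio))         # no history
```

Passing the similarity as a callable keeps the feature computation independent of how VS itself is implemented.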
The other 20 features used in missent-email detection are shown in the table below.
Table 1
VSEA is used as the 21st feature. The experimental data set is the Enron email data set, which consists of real, large-scale email messages from Enron employees and was released by the U.S. Federal Energy Regulatory Commission for public research. The experiment used ten-fold cross-validation to obtain the final detection results, and the portion of the Enron data set used covers email records from January 2010 to October 2010. The experimental results are shown in Fig. 2.
The missent-email data were generated artificially. The missent emails in the data set comprise two subsets: (1) emails with randomly generated recipient addresses; and (2) emails to recipient addresses with which there had been no email contact for more than a certain period. Fig. 2 shows how the use of different features affects the missent-email detection results. As the figure shows, a classifier using only the first 10 features achieves recall and accuracy below 10%, whereas with the first 20 features recall and accuracy exceed 80%. After the VSEA feature is added, the accuracy and recall of missent-email detection increase by a further 2%-5%. The β in Fig. 2 denotes the percentage of emails in the missent set for which the sender and recipient have no prior contact record.
Example
Take the two email addresses α = "chenxiaojun@iie.ac.cn" and β = "chengxiuyun@iie.ac.cn".
The similarity of the two strings is computed with the text visual similarity measure.
The parameters are set to t = 3 and θ = 1/2:
$$VS(\alpha, \beta) = 1 - \frac{VD(\text{"nxiaojun"}, \text{"ngxiuyun"})}{MVD(\text{"nxiaojun"}, \text{"ngxiuyun"})} = 1 - \frac{1/512 + 1/64 + 1/32 + 1/16 + 1/4}{\dfrac{1 - (1/2)^{8}}{1 - 1/2}} \approx 0.82$$
VD is the visual distance between the strings, and MVD is the value obtained if all characters of the two strings are assumed to differ.
As can be seen, the text visual similarity yields a finer-grained measurement.
By contrast, the Levenshtein distance between the two strings is 3; taking the reciprocal of the Levenshtein distance as the string similarity gives 0.33. Clearly, the method for determining text visual similarity proposed by the invention better reflects how people actually read strings.
The above embodiments are provided only to describe the invention and are not intended to limit its scope. The scope of the invention is defined by the following claims. All equivalent substitutions and modifications made without departing from the spirit and principle of the invention fall within its scope.

Claims (2)

1. A method for determining the visual similarity of texts, characterized by the following steps:
(1) Calculate the visual distance between two character strings using the following formula:
$$VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)]) = \theta \times \min(x, y, z) + \begin{cases} 0 & \text{if } \alpha_{l(\alpha)} = \beta_{l(\beta)} \\ 1 & \text{if } \alpha_{l(\alpha)} \neq \beta_{l(\beta)} \end{cases}$$
where
$$\begin{aligned} x &= VD(\alpha[t+1, l(\alpha)-1], \beta[t+1, l(\beta)]) \\ y &= VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)-1]) \\ z &= VD(\alpha[t+1, l(\alpha)-1], \beta[t+1, l(\beta)-1]) \end{aligned}$$
Calculate the maximum possible visual distance between the two character strings:
$$MVD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)]) = \sum_{k=0}^{\max(l(\alpha), l(\beta)) - t - 1} \theta^{k} = \frac{1 - \theta^{\max(l(\alpha), l(\beta)) - t}}{1 - \theta}$$
where α and β denote the two character strings, l(·) denotes the length of a string, and θ denotes the ratio of the weight factors of adjacent characters;
(2) Calculate the visual similarity of the two character strings from the visual distance obtained in step (1), using the following formula:
$$VS(A, B) = \begin{cases} 0 & \text{if } \alpha[1, t] \neq \beta[1, t] \\ 1 - \dfrac{VD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)])}{MVD(\alpha[t+1, l(\alpha)], \beta[t+1, l(\beta)])} & \text{otherwise} \end{cases}$$
where VD is the visual distance between the email address strings calculated above, and MVD is the theoretical maximum of the visual distance between the strings α and β;
(3) The visual similarity between the recipient address of the current email and the email addresses in the contact list derived from the user's sending history is used as the VSEA (Visual Similarity of Email Addresses) feature of the classifier, in order to detect the visual similarity of specific character strings, such as strings in email-address format;
(4) The corresponding VSEA feature is computed for each email and used as a classifier feature; a random forest classifier is trained and tested on all sent emails in the Enron data set and is then used to detect missent emails.
2. The method for determining the visual similarity of texts according to claim 1, characterized in that: when the Enron data set is used for training and testing in step (4), the data set is divided into three parts: the first part is used to compile the sender lists of all mail users; the second part serves as the training set, used to generate the random forest classifier; and the third part serves as the test set, used to evaluate the classifier's performance. For each email in the training set, the string visual similarity between the email's recipient and each of its sender's historical recipients is computed, and the maximum is taken as the VSEA feature of that email. Each email in the test set is then classified with the random forest to determine whether it was missent.
CN201410564469.8A 2014-10-21 2014-10-21 Method for determining visual similarity of texts Pending CN104375982A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410564469.8A CN104375982A (en) 2014-10-21 2014-10-21 Method for determining visual similarity of texts

Publications (1)

Publication Number Publication Date
CN104375982A true CN104375982A (en) 2015-02-25

Family

ID=52554905

Country Status (1)

Country Link
CN (1) CN104375982A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101030847A (en) * 2007-03-30 2007-09-05 刘文印 Method and system for discriminating cheat by unified code
US20120323877A1 (en) * 2011-06-17 2012-12-20 Microsoft Corporation Enriched Search Features Based In Part On Discovering People-Centric Search Intent

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TINAWEN LIU ET AL.: ""Towards misdirected email detection for preventing information leakage"", 《2014 IEEE SYMPOSIUM ON COMPUTERS AND COMMUNICATION (ISCC)》 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104794176A (en) * 2015-04-02 2015-07-22 中国科学院信息工程研究所 Multiattribute-based detection method for missent e-mail
CN105913094A (en) * 2016-05-03 2016-08-31 中国科学院信息工程研究所 Minimum distance string calculation searching method
CN105913094B (en) * 2016-05-03 2019-06-21 中国科学院信息工程研究所 A kind of minimum range character string calculating lookup method
CN106127222A (en) * 2016-06-13 2016-11-16 中国科学院信息工程研究所 The similarity of character string computational methods of a kind of view-based access control model and similarity determination methods
WO2018177071A1 (en) * 2017-03-31 2018-10-04 杭州海康威视数字技术股份有限公司 Method and apparatus for matching registration plate number, and method and apparatus for matching character information
US11093782B2 (en) * 2017-03-31 2021-08-17 Hangzhou Hikvision Digital Technology Co., Ltd. Method for matching license plate number, and method and electronic device for matching character information

Similar Documents

Publication Publication Date Title
CN103514174B (en) A kind of file classification method and device
Bratt et al. Perceived age discrimination across age in Europe: From an ageing society to a society for all ages.
CN104375982A (en) Method for determining visual similarity of texts
An et al. Fragmented social media: a look into selective exposure to political news
GB2600028A (en) Detection of phishing campaigns
Egozi et al. Phishing email detection using robust nlp techniques
El-Halees Mining opinions in user-generated contents to improve course evaluation
CN105447505B (en) A kind of multi-level important email detection method
CN103795612A (en) Method for detecting junk and illegal messages in instant messaging
CN103559176A (en) Microblog emotional evolution analysis method and system
US10599998B2 (en) Feature selection using a large deviation principle
Kabbur et al. Content-based methods for predicting web-site demographic attributes
CN103150646A (en) Classified display method and device of electronic mail
Zhang et al. Joint monitoring of post-sales online review processes based on a distribution-free EWMA scheme
Lin et al. Revisiting CO2 emissions convergence in G18 countries
Hosseinpour et al. An ensemble learning approach for sms spam detection
CN103198396A (en) Mail classification method based on social network behavior characteristics
CN102298583A (en) Method and system for evaluating webpage quality of electronic bulletin board
CN104794176A (en) Multiattribute-based detection method for missent e-mail
Anitha et al. Email spam classification using neighbor probability based Naïve Bayes algorithm
CN102760130A (en) Information processing method and device
Liu et al. Towards misdirected email detection for preventing information leakage
Hershkop et al. Identifying spam without peeking at the contents
Liu et al. Rumor Detection of Sina Weibo Based on MCF Algorithm
Mishra et al. An efficient approach for supervised learning algorithms using Different Data Mining Tools for spam categorization

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20150225