CN107690130A - A kind of information identifying method and system - Google Patents

A kind of information identifying method and system

Info

Publication number
CN107690130A
Authority
CN
China
Prior art keywords
electronic address
character
information
information text
user terminal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610628540.3A
Other languages
Chinese (zh)
Inventor
张晓璐
江为强
高家凤
方绍桢
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
China Mobile Suzhou Software Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd and China Mobile Suzhou Software Technology Co Ltd
Priority to CN201610628540.3A
Publication of CN107690130A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W4/00 Services specially adapted for wireless communication networks; Facilities therefor
    • H04W4/12 Messaging; Mailboxes; Announcements
    • H04W4/14 Short messaging services, e.g. short message services [SMS] or unstructured supplementary service data [USSD]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04W WIRELESS COMMUNICATION NETWORKS
    • H04W12/00 Security arrangements; Authentication; Protecting privacy or anonymity
    • H04W12/12 Detection or prevention of fraud

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses an information identification method, the method comprising: extracting the electronic addresses carried in information texts within a set time period, and determining the characteristic information of each electronic address according to the electronic address; and performing cluster analysis on the electronic addresses according to the characteristic information to locate suspected violating electronic addresses. The invention also discloses an information identification system.

Description

An information identification method and system
Technical field
The present invention relates to information security technology in the field of communications, and in particular to an information identification method and system.
Background
With the continuous development of communication technology, mobile phone usage keeps rising, and more and more instant messaging software runs on mobile phones. Offenders fabricate large amounts of false information and set up scams through information texts such as short messages and instant messages, carrying out remote, contactless fraud and inducing victims to remit or transfer money to them. As a result, more and more victims are deceived by the fraud information they receive, sometimes with heavy property losses. Because fraud information keeps changing, operators of short message and instant message services find it difficult to formulate accurate interception policies to control it, and banks and regulators likewise find it difficult to prevent and trace fraud incidents.
At present, the industry generally extracts the short message content and matches it against preset regular expressions; when the content matches a preset regular expression, the short message is determined to be harmful information. However, because this scheme does not preliminarily filter the information being matched, it easily blocks legitimate short messages by mistake.
Summary of the invention
To solve the existing technical problems, embodiments of the present invention aim to provide an information identification method and system that can avoid blocking legitimate short messages by mistake while improving the accuracy of fraud information identification.
The technical solution of the embodiments of the present invention is realized as follows:
According to one aspect of the embodiments of the present invention, an information identification method is provided, the method comprising:
extracting the electronic addresses carried in information texts within a set time period, and determining the characteristic information of each electronic address according to the electronic address;
performing cluster analysis on the electronic addresses according to the characteristic information to locate suspected violating electronic addresses.
In the above scheme, performing cluster analysis on the electronic addresses according to the characteristic information to locate suspected violating electronic addresses comprises:
grouping the electronic addresses that share the same class of feature in the characteristic information into one cluster, and determining the number of electronic addresses in that class of characteristic information;
when the number is detected to exceed a preset threshold, determining the electronic addresses to be suspected violating electronic addresses.
In the above scheme, the characteristic information includes one or more of: the number of times the electronic address is sent/received, the number of sending/receiving IDs, the list of sending/receiving IDs, the number of keyword (or key character) hits, and the number of normal/violating electronic addresses.
In the above scheme, extracting the electronic addresses carried in information texts within the preset time period comprises:
performing, on the content of the information texts within the preset time period, full-width to half-width character conversion, generalized character mapping, special character preprocessing, shortest-distance processing between consecutive character strings, character string effective length judgment, and/or keyword (or key character) extraction;
extracting the electronic addresses carried in the processed information texts.
In the above scheme, the electronic address includes one or more of: a telephone number, a bank card number, a QQ number, a WeChat ID, an email address, and a uniform resource locator (URL).
According to another aspect of the embodiments of the present invention, an information identification system is provided, the system comprising an extraction unit and a cluster analysis unit, wherein:
the extraction unit is configured to extract the electronic addresses carried in the information texts of a user terminal within a set time period, and to determine the characteristic information of each electronic address according to the electronic address;
the cluster analysis unit is configured to perform cluster analysis on the electronic addresses extracted by the extraction unit, according to the characteristic information determined by the extraction unit, to locate suspected violating electronic addresses.
In the above scheme, the cluster analysis unit is specifically configured to group the electronic addresses that share the same class of feature into one cluster according to the characteristic information determined by the extraction unit, and to determine the number of electronic addresses in that class of characteristic information; when the number of electronic addresses is detected to exceed a preset threshold, the electronic addresses are determined to be suspected violating electronic addresses.
In the above scheme, the characteristic information includes one or more of: the number of times the electronic address is sent/received, the number of sending/receiving IDs, the list of sending/receiving IDs, the number of keyword (or key character) hits, and the number of normal/violating electronic addresses.
In the above scheme, the extraction unit is further specifically configured to perform, on the information text content of the user terminal within the preset time period, full-width to half-width character conversion, generalized character mapping, special character preprocessing, shortest-distance processing between consecutive character strings, character string effective length judgment, and/or keyword (or key character) extraction;
and to extract the electronic addresses carried in the processed information texts.
In the above scheme, the electronic addresses extracted by the extraction unit include one or more of: a telephone number, a bank card number, a QQ number, a WeChat ID, an email address, and a URL.
The embodiments of the present invention provide an information identification method and system that extract the electronic addresses carried in information texts within a set time period, determine the characteristic information of each electronic address according to the electronic address, and perform cluster analysis on the electronic addresses according to the characteristic information to locate suspected violating electronic addresses. By analyzing and filtering electronic addresses on the basis of their characteristic information, the interception accuracy for illegal information texts can be improved, as can the recognition rate for fraud information that varies widely in form.
Brief description of the drawings
Fig. 1 is a schematic flowchart of the information identification method in an embodiment of the present invention;
Fig. 2 is a schematic structural diagram of the information identification system in an embodiment of the present invention.
Detailed description of the embodiments
The embodiments of the present invention are described in detail below with reference to the accompanying drawings. It should be understood that the embodiments described here serve only to illustrate and explain the present invention and are not intended to limit it.
Fig. 1 is a schematic flowchart of the information identification method in an embodiment of the present invention. As shown in Fig. 1, the method includes:
Step 101: extract the electronic addresses carried in information texts within a set time period, and determine the characteristic information of each electronic address according to the electronic address;
Here, the electronic address includes one or more of: a telephone number, a bank card number, a QQ number, a WeChat ID, an email address, and a uniform resource locator (URL, Uniform Resource Locator). Fraud information usually needs to include important information such as a telephone number or a bank card number, and because such numbers are costly to obtain and difficult to change, among other reasons, the telephone number or bank card number in fraud information is relatively fixed and is the most distinctive feature of fraud information. Therefore, the server extracts the electronic addresses carried in the information texts of user terminals within the preset time period, counts characteristic information such as the number of sending/receiving IDs and the number of sends/receives for each electronic address, and assembles the counted characteristic information into an electronic address information table. The server can then identify fraud information according to the proportion of each item of characteristic information in the electronic address information table.
Here, the characteristic information includes one or more of: the number of times the electronic address is sent/received, the number of sending/receiving IDs, the list of sending/receiving IDs, the number of keyword (or key character) hits, and the number of normal/violating electronic addresses. The preset time period may be a few minutes, several days, or several months, and is not limited here.
In the embodiments of the present invention, extracting the electronic addresses carried in information texts within the preset time period comprises: performing, on the information text content of the user terminal within the preset time period, full-width to half-width conversion, generalized character mapping, special character preprocessing, shortest-distance processing between consecutive character strings, character string effective length judgment, and keyword (or key character) extraction; and then extracting the electronic addresses carried in the processed information texts.
The following describes in detail, as an example, how the server extracts the telephone numbers carried in the information texts of a user terminal:
Fraud information texts usually contain a telephone number that the user is induced to call back, so the server extracts the telephone numbers carried in the information texts of the user terminal according to the string length characteristics of telephone numbers. The extraction specifically includes:
Full-width to half-width conversion is applied to the information text content of the user terminal. A full-width Chinese character, digit, or special character occupies two character positions, whereas a half-width letter, digit, or special character occupies one. For example, every digit string entered in full-width form in the information text content of the user terminal, such as "１２３４５", is converted to the corresponding half-width digit string "12345".
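The conversion described above can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment; the function name `to_halfwidth` is hypothetical, and the sketch relies only on the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset from their ASCII counterparts.

```python
def to_halfwidth(text: str) -> str:
    """Convert full-width ASCII variants and the ideographic space to half-width."""
    out = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            out.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width forms of '!' .. '~'
            out.append(chr(code - 0xFEE0)) # fixed offset back to ASCII
        else:
            out.append(ch)                 # Chinese characters etc. are left untouched
    return "".join(out)
```

Characters outside the full-width block, including ordinary Chinese text, pass through unchanged, so the step can be applied to a whole information text before any later processing.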
Generalized character mapping is applied to the information text content of the user terminal. For example, the characters "o, O, zero, 0, zero" appearing in the information text content are converted to the digit "0"; the characters "(i), 1., (1), 1., one, one, I, i, l, L" are converted to the digit "1"; the characters "(ii), 2., (2), 2., two, two" are converted to the digit "2"; the characters "(iii), 3., (3), 3., three, three" are converted to the digit "3"; the characters "(iv), 4., (4), 4., four, wantonly" are converted to the digit "4"; the characters "(v), 5., (5), 5., five, 5, s" are converted to the digit "5"; the characters "(vi), 6., (6), 6., six, land, b" are converted to the digit "6"; the characters "(vii), 7., (7), 7., seven, seven" are converted to the digit "7"; the characters "(viii), 8., (8), 8., eight, eight" are converted to the digit "8"; the characters "(ix), 9., (9), 9., nine, nine" are converted to the digit "9"; and so on.
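A mapping of this kind can be sketched with a translation table. The table below is an assumption for illustration: the glyph variants in the list above did not survive translation, so the sketch uses a small, plausible subset (look-alike Latin letters, ordinary and financial-form Chinese numerals, circled digits), and the names `DIGIT_LOOKALIKES` and `map_to_digits` are hypothetical.

```python
# Assumed, illustrative subset of look-alike characters for each digit.
DIGIT_LOOKALIKES = {
    "0": "oO零〇",
    "1": "iIlL一壹①",
    "2": "二贰②",
    "3": "三叁③",
    "4": "四肆④",
    "5": "五伍⑤s",
    "6": "六陆⑥b",
    "7": "七柒⑦",
    "8": "八捌⑧",
    "9": "九玖⑨",
}

# One-character-to-one-digit translation table built from the subset above.
_TABLE = str.maketrans(
    {variant: digit for digit, variants in DIGIT_LOOKALIKES.items() for variant in variants}
)


def map_to_digits(text: str) -> str:
    """Replace every known look-alike character with its digit."""
    return text.translate(_TABLE)
```

Because the mapping is applied to every character, ordinary words containing letters such as "s" or "b" are altered as well; the result is therefore only suitable as an intermediate form for number extraction, not for display.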
Special character preprocessing is applied to the information text content of the user terminal. For example, special characters that appear between digits in the information text content of the user terminal, such as ! " # $ % & ' * + - /, are deleted;
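As a rough sketch of this preprocessing, a regular expression with lookbehind and lookahead can delete the listed special characters only where they sit between two digits. The function name is hypothetical, and the character set is limited to the examples given above.

```python
import re

# One or more of the listed special characters, only when enclosed by digits.
_SPECIALS_BETWEEN_DIGITS = re.compile(r"(?<=\d)[!\"#$%&'*+\-/]+(?=\d)")


def strip_specials_between_digits(text: str) -> str:
    """Delete special characters that separate two digits."""
    return _SPECIALS_BETWEEN_DIGITS.sub("", text)
```

Note that this also joins dates written with separators, such as "2016-08-01"; the later length judgment is what keeps such strings from being output as addresses.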
Shortest-distance processing between consecutive character strings is applied to the information text content of the user terminal. For example, when two character strings in the information text content of the user terminal are separated by a single character that is not one of "year, month, day, hour, minute, second", that character is removed and the two character strings are merged into one.
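The merge rule can be sketched as below, under two assumptions that the text does not state explicitly: the "character strings" are runs of digits, and the excluded units are the Chinese characters 年, 月, 日, 时, 分, 秒. The function name is hypothetical.

```python
import re

DATE_TIME_UNITS = "年月日时分秒"  # assumed originals of year/month/day/hour/minute/second

# A single non-digit, non-unit character with a digit on each side.
_SINGLE_GAP = re.compile(r"(?<=\d)[^\d" + DATE_TIME_UNITS + r"](?=\d)")


def merge_adjacent_digit_runs(text: str) -> str:
    """Join two digit runs that are separated by exactly one non-date character."""
    return _SINGLE_GAP.sub("", text)
```

Gaps of two or more characters are left alone, since neither character in such a gap has a digit on both sides.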
Character string effective length judgment is applied to the information text content of the user terminal. Generally, a telephone number has a string length of 7 or 8; or begins with 400 and has a string length of 10; or begins with 1 and has a string length of 11; or begins with 0 and has a string length of 11 or 12; or begins with 86 and has a string length of 13. The server first judges whether a character string in the information text content of the user terminal meets these telephone number length and format requirements. If it does, the server outputs the character string as a telephone number; otherwise the character string is ignored, to prevent extraction errors.
In the embodiments of the present invention, the server specifically performs the character string length judgment on the candidate telephone numbers obtained after generalized character mapping and after shortest-distance processing between consecutive character strings. When a character string in the information text content of the user terminal meets the telephone number length and format requirements, the server extracts the character strings from each information text twice, deduplicates the two sets of extracted character strings, and outputs the result as telephone numbers.
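The length and prefix rules for telephone numbers translate directly into a predicate. The following sketch assumes the text has already been normalized by the preceding steps; the function names are hypothetical, and deduplication is shown on a single pass rather than on the two extraction passes of the embodiment.

```python
import re


def looks_like_phone_number(s: str) -> bool:
    """Apply the length/prefix rules stated for telephone numbers."""
    if not (s.isascii() and s.isdigit()):
        return False
    n = len(s)
    return (
        n in (7, 8)                               # local number
        or (n == 10 and s.startswith("400"))      # 400 service number
        or (n == 11 and s.startswith("1"))        # mobile number
        or (n in (11, 12) and s.startswith("0"))  # number with area code
        or (n == 13 and s.startswith("86"))       # number with country code
    )


def extract_phone_numbers(text: str) -> list[str]:
    """Return digit runs that pass the rules, deduplicated in first-seen order."""
    runs = re.findall(r"\d+", text)
    return list(dict.fromkeys(r for r in runs if looks_like_phone_number(r)))
```

Any 7-digit or 8-digit run passes the first rule, so amounts and dates can be false positives; the statistical filtering in step 102 is what compensates for this.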
The following describes in detail, as an example, how the server extracts the bank card numbers carried in the information texts of a user terminal:
Fraud information texts usually contain a bank card number to which the user is induced to transfer or remit money, and the offender uses this bank card number to defraud the user of property. Specifically, the server extracts bank card numbers by matching the string length and the leading digits characteristic of bank card numbers issued in China. Bank card numbers are extracted in roughly the same way as telephone numbers; the common parts are not repeated here, and the difference is as follows:
The server applies character string effective length judgment to the information text content of the user terminal. Generally, a bank card number has a string length between 15 and 19, and often begins with the digits 35, 37, 40, 41, 42, 43, 44, 45, 47, 48, 49, 51, 52, 53, 54, 55, 60, 62, 64, 91, or 95. The server first judges whether a character string in the information text content of the user terminal meets these bank card number length and format requirements. If it does, the character string is output as a bank card number; otherwise it is ignored, to prevent extraction errors.
In the embodiments of the present invention, the server specifically performs the character string length judgment on the candidate bank card numbers obtained after generalized character mapping and after shortest-distance processing between consecutive character strings. When a character string in the information text of the user terminal meets the bank card number length and format requirements, the character strings in each information text are extracted twice, and the two sets of extracted character strings are deduplicated and output as bank card numbers.
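The bank card rule can be sketched as a length check combined with a prefix check. The prefix list is taken from the text above; the function name is hypothetical, the range 15 to 19 is assumed to be inclusive, and no checksum (such as a Luhn check) is applied because the text does not mention one.

```python
# Leading two digits listed in the description.
BANK_CARD_PREFIXES = (
    "35", "37", "40", "41", "42", "43", "44", "45", "47", "48", "49",
    "51", "52", "53", "54", "55", "60", "62", "64", "91", "95",
)


def looks_like_bank_card(s: str) -> bool:
    """Digits only, 15 to 19 characters long, starting with a listed prefix."""
    return (
        s.isascii()
        and s.isdigit()
        and 15 <= len(s) <= 19
        and s.startswith(BANK_CARD_PREFIXES)  # str.startswith accepts a tuple
    )
```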
The following describes in detail, as an example, how the server extracts the QQ numbers carried in the information texts of a user terminal:
When extracting the QQ numbers carried in the information texts of a user terminal, the server mainly relies on two key pieces of information: the keywords associated with QQ numbers and the string length. QQ numbers are extracted in roughly the same way as telephone numbers; the common parts are not repeated here, and the differences are as follows:
The server applies keyword (or key character) extraction to the information text content of the user terminal. To prevent the keyword "qq" in QQ mailbox addresses from interfering and causing keyword extraction errors, "@qq.com" is first deleted from the information text of the user terminal. The server then judges whether the information text contains a keyword such as "q", "Q", "Tencent number", "button", or "penguin number". If the information text of the user terminal contains such a keyword, the QQ number is extracted; otherwise the text is ignored, to prevent extraction errors.
The server applies character string effective length judgment to the content of the information text of the user terminal. Generally, a QQ number has a string length between 7 and 10. The server judges whether a character string in the information text of the user terminal meets the QQ number length and format requirements. If it does, the character string is retained as a QQ number; otherwise it is ignored.
In the embodiments of the present invention, the server specifically performs the character string length judgment on the candidate QQ numbers obtained after keyword (or key character) extraction and shortest-distance processing between consecutive character strings. When a character string in the information text of the user terminal is judged to meet the QQ number length and format requirements, the character strings in each information text are extracted twice, and the two sets of extracted character strings are deduplicated and output as QQ numbers.
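The QQ-specific steps (removing "@qq.com", gating on a keyword, then checking length) might be sketched as below. The Chinese keywords are assumed back-translations of the translated keyword list and are illustrative only; the function name is hypothetical.

```python
import re

# "q" and "Q" come from the description; the Chinese entries are assumed originals.
QQ_KEYWORDS = ("q", "Q", "腾讯号", "扣扣", "企鹅号")

# A run of 7 to 10 digits that is not part of a longer digit run.
_QQ_CANDIDATE = re.compile(r"(?<!\d)\d{7,10}(?!\d)")


def extract_qq_numbers(text: str) -> list[str]:
    """Extract 7 to 10 digit runs, but only from texts that mention a QQ keyword."""
    cleaned = text.replace("@qq.com", "")  # avoid the 'qq' of a QQ mailbox address
    if not any(keyword in cleaned for keyword in QQ_KEYWORDS):
        return []
    return list(dict.fromkeys(_QQ_CANDIDATE.findall(cleaned)))
```

The single-letter keyword "q" matches any text containing that letter, so the keyword gate is deliberately permissive; it filters out texts with no QQ context at all rather than proving that a number is a QQ number.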
The following describes in detail, as an example, how the server extracts the email addresses carried in the information texts of a user terminal:
When extracting the email addresses carried in the information texts of a user terminal, the server mainly relies on the key character "@" in the email address. In the embodiments of the present invention, the server extracts email addresses in roughly the same way as telephone numbers; the common parts are not repeated here, and the differences are as follows:
The server applies generalized character mapping to the information text content of the user terminal. For example, the server maps variants of some common English characters in the information text to half-width lowercase English characters: variants of the period, such as the full-width period, the comma, and the character meaning "dot", are converted to the English period ".", and all uppercase letters in the information text are converted to lowercase.
The server applies character string effective length judgment to the information text content of the user terminal. Generally, an email address begins with a digit or a letter; digits, letters, underscores, and hyphens may appear in the middle; the key character "@" is present; the "@" is followed by at least one group of digits or letters plus a period; and the final period is followed by 2 to 9 letters. The server judges whether a character string in the information text of the user terminal meets these email address length and format requirements. If it does, the server extracts the character string as an email address; otherwise it is ignored.
In the embodiments of the present invention, the server specifically extracts as email addresses those character strings that meet the email address string characteristics and format requirements after generalized character mapping has been applied to the information text content of the user terminal.
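The email rule can be sketched as a normalization step followed by a regular expression that mirrors the stated format. The set of period variants is an assumption (the original characters were lost in translation), and the names are hypothetical.

```python
import re

# Assumed period variants: full-width period, full-width comma, the character "点".
_DOT_VARIANTS = str.maketrans({"。": ".", "，": ".", "点": "."})

# digit/letter start, then digits/letters/underscore/hyphen, '@',
# one or more "label." groups, and a 2 to 9 letter ending.
_EMAIL = re.compile(r"[0-9a-z][0-9a-z_\-]*@(?:[0-9a-z]+\.)+[a-z]{2,9}")


def extract_emails(text: str) -> list[str]:
    """Normalize period variants and case, then extract strings in email format."""
    normalized = text.translate(_DOT_VARIANTS).lower()
    return list(dict.fromkeys(_EMAIL.findall(normalized)))
```

The pattern follows the description literally, so local parts containing a period (which the description does not list) are matched only from the character after the last such period.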
The following describes in detail, as an example, how the server extracts the URL addresses carried in the information texts of a user terminal:
Because the string length of URL addresses varies widely and they may contain many kinds of special characters, the server extracts URL addresses mainly on the basis of keywords found in URL addresses, combined with some common URL naming rules. In the embodiments of the present invention, the server extracts URL addresses in roughly the same way as telephone numbers; the common parts are not repeated here, and the differences are as follows:
The server applies generalized character mapping to the information text content of the user terminal: variants of some common English characters in the information text are mapped to half-width lowercase English characters. For example, variants of the period, such as the full-width period, the comma, and the character meaning "dot", are replaced with the English period ".", and all uppercase letters in the information text are converted to lowercase.
The server applies URL address rule judgment to the information text content of the user terminal. Generally, a URL address begins with a digit or a letter; digits, letters, and various English characters may appear in the middle; the English period "." always appears; and the address ends with a digit or a letter. Somewhere in the whole English character string, a URL address contains one of the keywords "http, ftp, ww, wap, bbs, news, file, telnet, ed2k, thunder, co, cn, net, cc, htm, hk, tw, org, edu, gov". The server judges whether a character string in the information text of the user terminal meets these URL address rules. If it does, the character string is output as a URL address; otherwise it is ignored.
In the embodiments of the present invention, the server specifically extracts as URL addresses the continuous English character strings in the information text of the user terminal that meet the URL address rules.
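The URL rule can be sketched as a candidate scan followed by the two stated checks (a period is present, and a keyword is present). The keyword list is taken from the text above; the set of characters allowed in the middle of a URL and the function name are assumptions.

```python
import re

URL_KEYWORDS = (
    "http", "ftp", "ww", "wap", "bbs", "news", "file", "telnet", "ed2k", "thunder",
    "co", "cn", "net", "cc", "htm", "hk", "tw", "org", "edu", "gov",
)

# Starts and ends with a digit/letter; the middle character set is an assumption.
_CANDIDATE = re.compile(r"[0-9a-z][0-9a-z.:/?&=%_\-#]*[0-9a-z]")


def extract_urls(text: str) -> list[str]:
    """Return continuous English strings that contain '.' and at least one URL keyword."""
    hits = []
    for candidate in _CANDIDATE.findall(text.lower()):
        if "." in candidate and any(k in candidate for k in URL_KEYWORDS):
            hits.append(candidate)
    return list(dict.fromkeys(hits))
```

Short keywords such as "co" and "cc" are matched as substrings, which makes the rule permissive by design; tightening it (for example, matching keywords only at label boundaries) would be a departure from the description.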
The following describes in detail, as an example, how the server extracts the WeChat IDs carried in the information texts of a user terminal:
The server extracts WeChat IDs in roughly the same way as telephone numbers; the common parts are not repeated here, and the differences are as follows:
The server applies keyword (or key character) extraction to the information text content of the user terminal. Generally, a WeChat ID is either an ID composed of English characters, a QQ number used as a WeChat ID, or a mobile phone number used as a WeChat ID. In the embodiments of the present invention, the server extracts the WeChat keywords contained in the information text of the user terminal, such as "qq, wechat, weixin, wei xin, Wei_xin", as well as "dimension letter" and "prestige" (literal renderings of Chinese homophones of "WeChat"), and removes them, so that they do not interfere with the subsequent extraction of the WeChat ID and affect the correctness of the result.
The server applies character string effective length judgment to the information text content of the user terminal. Generally, a WeChat ID begins with a letter, may be followed only by digits, letters, underscores, or minus signs, and has a string length between 6 and 20. When a QQ number is used as the WeChat ID, it begins with a non-zero digit and has a string length between 7 and 10. When a mobile phone number is used as the WeChat ID, it begins with the digit 1 and has a string length of 11. The server judges whether a character string in the information text of the user terminal meets these WeChat ID length and format requirements. If it does, the character string is output as a WeChat ID; otherwise it is ignored.
In the embodiments of the present invention, the server specifically extracts as WeChat IDs the character strings obtained after character string effective length processing has been applied to the information text of the user terminal.
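The three naming rules for WeChat IDs can be sketched as a small classifier applied to a candidate string after the keywords have been removed. The function name and the labels it returns are hypothetical.

```python
import re
from typing import Optional

_NATIVE = re.compile(r"[A-Za-z][A-Za-z0-9_\-]{5,19}")  # letter start, 6 to 20 characters
_MOBILE = re.compile(r"1\d{10}")                        # starts with 1, 11 digits
_QQ_STYLE = re.compile(r"[1-9]\d{6,9}")                 # non-zero start, 7 to 10 digits


def classify_wechat_id(candidate: str) -> Optional[str]:
    """Return which WeChat naming rule the candidate satisfies, or None."""
    if _NATIVE.fullmatch(candidate):
        return "native"
    if _MOBILE.fullmatch(candidate):
        return "mobile"
    if _QQ_STYLE.fullmatch(candidate):
        return "qq"
    return None
```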
The server compiles statistics on all the electronic addresses extracted within the preset time period and assembles them into an electronic address information table, which serves as the basis for identifying violating electronic addresses.
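One possible shape for such a table is sketched below. The record layout (send count plus sets of sending and receiving IDs) and the input format, a list of `(sender_id, receiver_id, addresses)` tuples, are assumptions for illustration; the text does not prescribe a data structure.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class AddressStats:
    send_count: int = 0                             # messages that carried the address
    sender_ids: set = field(default_factory=set)    # distinct sending IDs
    receiver_ids: set = field(default_factory=set)  # distinct receiving IDs


def build_address_table(messages):
    """Aggregate per-address statistics from (sender_id, receiver_id, addresses) tuples."""
    table = defaultdict(AddressStats)
    for sender_id, receiver_id, addresses in messages:
        for address in set(addresses):  # count each address once per message
            stats = table[address]
            stats.send_count += 1
            stats.sender_ids.add(sender_id)
            stats.receiver_ids.add(receiver_id)
    return dict(table)
```

The number of sending IDs and the sending ID list named among the characteristic information correspond here to `len(stats.sender_ids)` and `stats.sender_ids` respectively.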
Step 102: perform cluster analysis on the electronic addresses according to the characteristic information to locate suspected violating electronic addresses.
Specifically, according to the characteristic information in the electronic address information table, the server groups the electronic addresses that share the same class of feature into one cluster and determines the number of electronic addresses in that class of characteristic information. When the number of electronic addresses is detected to exceed a preset threshold, the server determines the electronic addresses to be suspected violating electronic addresses and reports all information of the class containing those electronic addresses to the audit center as violating electronic addresses.
For example, suppose the preset threshold is 5 and the server detects the following in the electronic address information table: electronic address A was sent 10 times within two days by the same sending ID to the same user terminal; electronic address B was sent 7 times within one day by the same sending ID to different user terminals; and electronic address C was sent 3 times within two days by different sending IDs to user terminals. The send counts of electronic address A and electronic address B then exceed the preset threshold of 5, so the server locates all information containing electronic address A or electronic address B as suspected violating electronic addresses and reports it to the audit center.
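The thresholding in this example can be sketched in a few lines. The sketch covers only the final comparison against the preset threshold, not the grouping by feature class, and the function name and the dictionary of send counts are hypothetical.

```python
def flag_suspected_addresses(send_counts: dict, threshold: int) -> list:
    """Return the addresses whose send count strictly exceeds the threshold."""
    return sorted(addr for addr, count in send_counts.items() if count > threshold)


# The example above: A was sent 10 times, B 7 times, C 3 times; threshold 5.
suspected = flag_suspected_addresses({"A": 10, "B": 7, "C": 3}, threshold=5)
```

With these inputs `suspected` contains A and B but not C, matching the outcome described above. The comparison is strict because the text speaks of the count exceeding the threshold.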
Here, server end will include after the doubtful violation electronic address reports to audit center, by audit center The selection doubtful violation electronic address in information text it is most long, or the most short information text of information text reversely looked into Ask, when determining really to include in described information text fraud information, then the information text is labeled as doubtful fraud information;Or Information text is most long in person's doubtful violation electronic address as described in attendant's artificial selection of the audit center, or information The most short information text of text carries out Query, when determining really to include in described information text fraud information, then should Information text is labeled as doubtful fraud information.The letter will be sent by the attendant of the audit center or the audit center again The ID of informative text sends ID labeled as doubtful fraud information, and the electronic address in described information text is labeled as into swindle electronically Location, and do not send this information to user terminal.
Fig. 2 is a schematic diagram of the composition of the information identification system in an embodiment of the present invention. As shown in Fig. 2, the system includes an extraction unit 201 and a cluster analysis unit 202, wherein:
The extraction unit 201 is configured to extract the electronic addresses carried in the information text of the user terminal 204 within a preset time period, and to determine the characteristic information of the electronic addresses according to the electronic addresses;
The cluster analysis unit 202 is configured to perform cluster analysis on the electronic addresses extracted by the extraction unit 201, according to the characteristic information determined by the extraction unit 201, and to locate suspected violating electronic addresses.
Here, the electronic address includes one or more of: a telephone number, a bank card number, a QQ number, a WeChat ID, an email address, and a URL. Fraud information usually has to include important information such as a telephone number or a bank card number, and because telephone numbers and bank card numbers are costly and difficult to change, the telephone numbers and bank card numbers in fraud information are relatively fixed and are the most obvious feature of fraud information. Therefore, the extraction unit 201 extracts the electronic addresses carried in the information text of the user terminal 204 within the preset time period, and counts characteristic information such as the number of times each electronic address was sent and received and the number of sending and receiving IDs. The counted characteristic information forms an electronic address information table, so that the information identification system can identify fraud information according to the proportion of each kind of characteristic information in the electronic address information table. Here, the characteristic information includes one or more of: the number of times the electronic address was sent/received, the number of sending/receiving IDs, the sending/receiving ID list, the number of key characters/words hit, and the number of normal/violating electronic addresses. The preset time period may be several minutes, several days, or several months, and is not limited here.
In an embodiment of the present invention, the extraction unit 201 is specifically configured to perform the following on the information text content of the user terminal 204 within the preset time period: full-width to half-width character conversion, generalized character mapping, special character preprocessing, shortest-distance processing between consecutive character strings, character string vector effective length judgment, and key character or keyword extraction; and then to extract the electronic addresses carried in the processed information text.
Below, the extraction unit 201 is described in detail, taking the extraction of a telephone number carried in the information text of the user terminal 204 as an example:
Fraud information text usually contains a telephone number that induces the user to call back. The extraction unit 201 extracts the telephone number carried in the information text of the user terminal 204 according to the string length features of telephone numbers. The extraction specifically includes the following:
The extraction unit 201 performs full-width to half-width character conversion on the information text content of the user terminal 204. A full-width character is a Chinese character, numeric character, or special character that occupies two character positions; a half-width character is a Chinese character, numeric character, or special character that occupies one character position. For example, the extraction unit 201 converts all numeric characters entered in full-width form in the information text content of the user terminal 204, such as "１２３４５", into the corresponding half-width numeric characters "12345".
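The full-width to half-width conversion can be sketched as below. This is an illustrative sketch under one assumption: it relies on the Unicode layout in which the full-width forms U+FF01 to U+FF5E sit at a fixed offset of 0xFEE0 from their ASCII counterparts, with the ideographic space U+3000 handled separately. The function name `to_half_width` is hypothetical.

```python
def to_half_width(text):
    """Convert full-width ASCII variants in text to their half-width forms."""
    converted = []
    for ch in text:
        code = ord(ch)
        if code == 0x3000:                 # ideographic (full-width) space
            converted.append(" ")
        elif 0xFF01 <= code <= 0xFF5E:     # full-width '!' through '~'
            converted.append(chr(code - 0xFEE0))
        else:                              # everything else is left untouched
            converted.append(ch)
    return "".join(converted)

half = to_half_width("１２３４５")
```

Characters outside the full-width range, including Chinese text, pass through unchanged.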
The extraction unit 201 performs generalized character mapping on the information text content of the user terminal 204, converting characters that stand for or resemble a digit into that digit. For example, the extraction unit 201 converts the characters "o", "O" and the Chinese numerals for zero in the information text content of the user terminal 204 into the numeric character "0"; converts enumeration forms of one such as "(i)", "(1)" and "1.", the Chinese numerals for one (一, 壹) and the letters "I", "i", "l", "L" into the numeric character "1"; converts "(ii)", "(2)", "2." and the Chinese numerals for two (二, 贰) into "2"; converts "(iii)", "(3)", "3." and the Chinese numerals for three (三, 叁) into "3"; converts "(iv)", "(4)", "4." and the Chinese numerals for four (四, 肆) into "4"; converts "(v)", "(5)", "5.", the Chinese numerals for five (五, 伍) and the letter "s" into "5"; converts "(vi)", "(6)", "6.", the Chinese numerals for six (六, 陆) and the letter "b" into "6"; converts "(vii)", "(7)", "7." and the Chinese numerals for seven (七, 柒) into "7"; converts "(viii)", "(8)", "8." and the Chinese numerals for eight (八, 捌) into "8"; converts "(ix)", "(9)", "9." and the Chinese numerals for nine (九, 玖) into "9"; and so on.
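A minimal sketch of generalized character mapping for single characters follows. The mapping table is an assumption made for illustration: it covers look-alike letters and the ordinary and formal Chinese numerals, and omits the multi-character enumeration forms. The names `DIGIT_LOOKALIKES` and `map_to_digits` are hypothetical.

```python
# Each digit maps to the single characters that should be read as that digit.
# The table contents are assumed for this sketch, not taken verbatim from the source.
DIGIT_LOOKALIKES = {
    "0": "oO零〇",
    "1": "IilL一壹",
    "2": "二贰",
    "3": "三叁",
    "4": "四肆",
    "5": "s五伍",
    "6": "b六陆",
    "7": "七柒",
    "8": "八捌",
    "9": "九玖",
}

# Build one translation table: look-alike character -> digit.
TRANSLATION = str.maketrans(
    {ch: digit for digit, chars in DIGIT_LOOKALIKES.items() for ch in chars}
)

def map_to_digits(text):
    """Replace every look-alike character in text with its digit."""
    return text.translate(TRANSLATION)

mapped = map_to_digits("l38oo一三八ooo")
```

Applied to a whole message, this mapping would also rewrite ordinary letters such as "s" or "b" inside words, so a practical system would presumably restrict it to candidate number spans; the source does not say how that restriction is made.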
The extraction unit 201 performs special character preprocessing on the information text content of the user terminal 204. For example, the extraction unit 201 deletes the special characters that appear between digits in the information text content of the user terminal 204, such as ! " # $ % & ' * + - / and so on.
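Special character preprocessing can be sketched with a regular expression that removes the listed characters only when they sit between two digits. This is an illustrative sketch; the character set is the one listed above, and the function name `strip_specials_between_digits` is hypothetical.

```python
import re

# One or more listed special characters with a digit on both sides.
SPECIAL_BETWEEN_DIGITS = re.compile(r"""(?<=[0-9])[!"#$%&'*+\-/]+(?=[0-9])""")

def strip_specials_between_digits(text):
    """Delete special characters that separate two digits; keep all others."""
    return SPECIAL_BETWEEN_DIGITS.sub("", text)

cleaned = strip_specials_between_digits("call 138-0013-8000 now!")
```

The trailing "!" is kept because it is not between two digits.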
The extraction unit 201 performs shortest-distance processing between consecutive character strings on the information text content of the user terminal 204. For example, when two character strings in the information text of the user terminal 204 are separated by a single character that is not one of "year, month, day, hour, minute, second" (年, 月, 日, 时, 分, 秒), the extraction unit 201 removes that character and merges the two character strings into one character string.
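The merging step can be sketched as follows, assuming that the "character strings" in question are runs of digits and that the date and time units are the Chinese characters 年, 月, 日, 时, 分, 秒. Both points are assumptions made for this sketch, and the function name `merge_digit_runs` is hypothetical.

```python
import re

# A single non-digit character, other than a date/time unit, between two digits.
SINGLE_SEPARATOR = re.compile(r"(?<=[0-9])[^0-9年月日时分秒](?=[0-9])")

def merge_digit_runs(text):
    """Merge two digit runs that are separated by exactly one character."""
    return SINGLE_SEPARATOR.sub("", text)

merged = merge_digit_runs("1380013x8000")
dated = merge_digit_runs("2016年8月3日")
```

Dates such as "2016年8月3日" are left intact, so year, month and day are not fused into one number.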
The extraction unit 201 performs character string vector effective length judgment on the information text content of the user terminal 204. Generally, a telephone number has a string length of 7 or 8; or starts with 400 and has a string length of 10; or starts with 1 and has a string length of 11; or starts with 0 and has a string length of 11 or 12; or starts with 86 and has a string length of 13. The extraction unit 201 judges whether a character string in the information text content of the user terminal 204 meets these telephone number string length features and format requirements. If it does, the character string is output as a telephone number; otherwise it is ignored, to prevent extraction errors.
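The length and prefix rules above can be sketched as a set of patterns applied to each digit run. This is a minimal sketch of the stated rules only; the names `PHONE_PATTERNS`, `is_phone_number` and `extract_phone_numbers` are hypothetical.

```python
import re

PHONE_PATTERNS = [
    re.compile(r"[0-9]{7,8}"),       # length 7 or 8
    re.compile(r"400[0-9]{7}"),      # starts with 400, length 10
    re.compile(r"1[0-9]{10}"),       # starts with 1, length 11
    re.compile(r"0[0-9]{10,11}"),    # starts with 0, length 11 or 12
    re.compile(r"86[0-9]{11}"),      # starts with 86, length 13
]

def is_phone_number(candidate):
    """True if the whole candidate matches one of the length/prefix rules."""
    return any(pattern.fullmatch(candidate) for pattern in PHONE_PATTERNS)

def extract_phone_numbers(text):
    """Return the digit runs in text that satisfy a telephone number rule."""
    return [run for run in re.findall(r"[0-9]+", text) if is_phone_number(run)]

found = extract_phone_numbers("call 13800138000 or 4008123456, order 123")
```

The short run "123" is ignored because it matches none of the rules.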
In an embodiment of the present invention, the extraction unit 201 specifically performs character string vector length judgment on the candidate telephone numbers obtained after generalized character mapping and after shortest-distance processing between consecutive character strings. When a character string in the information text content of the user terminal 204 is judged to meet the telephone number string length features and format requirements, the character strings in each information text are extracted twice, and the two sets of extracted character strings are deduplicated and then output as telephone numbers.
Below, the extraction unit 201 is described in detail, taking the extraction of a bank card number carried in the information text of the user terminal 204 as an example:
Fraud information text usually contains a bank card number to which the user is induced to transfer or remit money, and offenders use this bank card number to defraud users of their property. Specifically, the extraction unit 201 matches on the string length features of bank card numbers within China and on the features of the first few digits of the card number, and then extracts. The way bank card numbers are extracted is roughly the same as the way telephone numbers are extracted; the common points are not repeated here, and the differences are as follows:
The extraction unit 201 performs character string vector effective length judgment on the information text content of the user terminal 204. Generally, the string length of a bank card number is between 15 and 19, and bank card numbers often start with the digits 35, 37, 40, 41, 42, 43, 44, 45, 47, 48, 49, 51, 52, 53, 54, 55, 60, 62, 64, 91 or 95. The extraction unit 201 judges whether a character string in the information text content of the user terminal 204 meets these bank card number string length features and format requirements. If it does, the character string is output as a bank card number; otherwise it is ignored, to prevent extraction errors.
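The bank card check can be sketched as a length test plus a two-digit prefix test. This sketch applies only the two rules stated above and performs no checksum validation; the names `BANK_CARD_PREFIXES` and `is_bank_card_number` are hypothetical.

```python
# Two-digit prefixes listed in the description.
BANK_CARD_PREFIXES = {
    "35", "37", "40", "41", "42", "43", "44", "45", "47", "48", "49",
    "51", "52", "53", "54", "55", "60", "62", "64", "91", "95",
}

def is_bank_card_number(candidate):
    """True if candidate is 15 to 19 ASCII digits with a listed prefix."""
    return (
        candidate.isascii()
        and candidate.isdigit()
        and 15 <= len(candidate) <= 19
        and candidate[:2] in BANK_CARD_PREFIXES
    )

ok = is_bank_card_number("6222021234567890123")
```

A string with the right length but an unlisted prefix, or a listed prefix but the wrong length, is rejected.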
In an embodiment of the present invention, the extraction unit 201 specifically performs character string vector length judgment on the candidate bank card numbers obtained after generalized character mapping and after shortest-distance processing between consecutive character strings. When a character string in the information text of the user terminal 204 meets the bank card number string length features and format requirements, the character strings in each information text are extracted twice, and the two sets of extracted character strings are deduplicated and then output as bank card numbers.
Below, the extraction unit 201 is described in detail, taking the extraction of a QQ number carried in the information text of the user terminal 204 as an example:
When extracting the QQ number carried in the information text of the user terminal 204, the extraction unit 201 mainly takes two key pieces of information, the QQ number keywords and the string length, as the entry point. The way QQ numbers are extracted is roughly the same as the way telephone numbers are extracted; the common points are not repeated here, and the differences are as follows:
The extraction unit 201 performs key character or keyword extraction on the information text content of the user terminal 204. To prevent the keyword "qq" in QQ mailbox addresses from interfering and causing key character or keyword extraction errors, the extraction unit 201 first deletes "@qq.com" from the information text of the user terminal 204, and then judges whether the information text contains a key character or keyword such as "q", "Q", "Tencent number", "扣" (kou) or "penguin number". If the information text of the user terminal 204 contains such a key character or keyword, the QQ number is extracted; otherwise it is ignored, to prevent extraction errors.
The extraction unit 201 performs character string vector effective length judgment on the content of the information text of the user terminal 204. Generally, the string length of a QQ number is between 7 and 10. The extraction unit 201 judges whether a character string in the information text of the user terminal 204 meets these QQ number string length features and format requirements. If it does, the character string is retained as a QQ number; otherwise it is ignored.
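The QQ number steps (removing "@qq.com", checking for a keyword, then applying the 7 to 10 length rule) can be sketched together. The keyword tuple below is an assumed reconstruction for illustration, and the function name `extract_qq_numbers` is hypothetical.

```python
import re

# Keywords hinting that a QQ number follows (assumed list for this sketch).
QQ_KEYWORDS = ("q", "Q", "扣", "企鹅号", "腾讯号")

def extract_qq_numbers(text):
    """Return digit runs of length 7 to 10 if the text mentions QQ."""
    text = text.replace("@qq.com", "")       # avoid hits from QQ mailbox addresses
    if not any(keyword in text for keyword in QQ_KEYWORDS):
        return []
    return [run for run in re.findall(r"[0-9]+", text) if 7 <= len(run) <= 10]

with_keyword = extract_qq_numbers("add my QQ 123456789")
mailbox_only = extract_qq_numbers("mail 123456789@qq.com")
```

Because a bare "q" is one of the keywords, this check is deliberately loose; any text containing the letter passes the keyword test and only the length rule then filters candidates.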
In an embodiment of the present invention, the extraction unit 201 specifically performs character string vector length judgment on the candidate QQ numbers obtained after key character or keyword extraction and after shortest-distance processing between consecutive character strings. When a character string in the information text of the user terminal 204 is judged to meet the QQ number string length features and format requirements, the character strings in each information text are extracted twice, and the two sets of extracted character strings are deduplicated and then output as QQ numbers.
Below, the extraction unit 201 is described in detail, taking the extraction of an email address carried in the information text of the user terminal 204 as an example:
When extracting the email address carried in the information text of the user terminal 204, the extraction unit 201 mainly takes the key character "@" in email addresses as the entry point. In an embodiment of the present invention, the method by which the extraction unit 201 extracts email addresses is roughly the same as the method of extracting telephone numbers; the common points are not repeated here, and the differences are as follows:
The extraction unit 201 performs generalized character mapping on the information text content of the user terminal 204. For example, the extraction unit 201 maps variants of some common English characters in the information text to half-width lowercase English characters, such as converting "。", "、", "点" (dot), "，" and the like into the English period "."; and converts all uppercase letters in the information text to lowercase.
The extraction unit 201 performs character string vector effective length judgment on the information text content of the user terminal 204. Generally, an email address starts with a digit or a letter; digits, letters, underscores and hyphens may appear in the middle; the key character "@" is present; after the character "@" there is at least one group of digits or letters followed by a period; and after the period the address ends with a combination of 2 to 9 letters. The extraction unit 201 judges whether a character string in the information text of the user terminal 204 meets these email address string length features and format requirements. If it does, the character string is extracted as an email address; otherwise it is ignored.
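The email format rule can be sketched as a single regular expression applied to lowercased text. This is an illustrative reading of the rule as stated, not a general email validator; the names `EMAIL_PATTERN` and `extract_emails` are hypothetical.

```python
import re

EMAIL_PATTERN = re.compile(
    r"[0-9a-z][0-9a-z_-]*"     # starts with a digit or letter, then digits/letters/_/-
    r"@"                       # the key character
    r"(?:[0-9a-z]+\.)+"        # at least one group of digits or letters plus a period
    r"[a-z]{2,9}"              # ends with 2 to 9 letters
)

def extract_emails(text):
    """Lowercase the text and return every substring matching the rule."""
    return EMAIL_PATTERN.findall(text.lower())

emails = extract_emails("Contact Zhang_San@Example.com today")
```

Lowercasing first mirrors the mapping step in which all uppercase letters are converted to lowercase before the format is checked.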
In an embodiment of the present invention, after performing generalized character mapping on the information text content of the user terminal 204, the extraction unit 201 specifically extracts the character strings that meet the email address string features and format requirements as email addresses.
Below, the extraction unit 201 is described in detail, taking the extraction of a URL address carried in the information text of the user terminal 204 as an example:
Because the string length of a URL address varies over a wide range and a URL address may contain many kinds of special characters, the extraction unit 201, when extracting URL addresses, mainly takes the keywords in URL addresses as the entry point and adds some common naming rules of URL addresses. In an embodiment of the present invention, the method by which the extraction unit 201 extracts URL addresses is roughly the same as the method of extracting telephone numbers; the common points are not repeated here, and the differences are as follows:
The extraction unit 201 performs generalized character mapping on the information text content of the user terminal 204. The extraction unit 201 maps variants of some common English characters in the information text of the user terminal 204 to half-width lowercase English characters, such as replacing the characters "。", "、", "点" (dot), "，" and the like with the English period "."; and converts all uppercase letters in the information text to lowercase letters.
The extraction unit 201 performs URL address rule judgment on the information text content of the user terminal 204. Generally, a URL address starts with a digit or a letter; digits, letters and various English characters may appear in the middle; an English period "." always appears; and the address ends with a digit or a letter. Within the whole English character string of a URL address, one of the keywords "http, ftp, ww, wap, bbs, news, file, telnet, ed2k, thunder, co, cn, net, cc, htm, hk, tw, org, edu, gov" appears. The extraction unit 201 judges whether a character string in the information text of the user terminal 204 meets these URL address rule features. If it does, the character string is output as a URL address; otherwise it is ignored.
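The URL rule can be sketched as four checks on each continuous run of printable ASCII characters. Treating a candidate as a run of non-space printable ASCII is an assumption made for this sketch; the keyword list is the one given above, and the names `looks_like_url` and `extract_urls` are hypothetical.

```python
import re

URL_KEYWORDS = (
    "http", "ftp", "ww", "wap", "bbs", "news", "file", "telnet", "ed2k",
    "thunder", "co", "cn", "net", "cc", "htm", "hk", "tw", "org", "edu", "gov",
)

def looks_like_url(candidate):
    """Apply the start, end, period and keyword rules to one candidate."""
    c = candidate.lower()
    return (
        len(c) >= 2
        and c[0].isalnum()                       # starts with a digit or letter
        and c[-1].isalnum()                      # ends with a digit or letter
        and "." in c                             # contains an English period
        and any(keyword in c for keyword in URL_KEYWORDS)
    )

def extract_urls(text):
    """Return the printable-ASCII runs in text that satisfy the URL rule."""
    return [run for run in re.findall(r"[!-~]+", text) if looks_like_url(run)]

urls = extract_urls("visit http://example.cn/pay today")
```

Short keywords such as "co" or "cn" match inside many ordinary strings, so the period and start/end checks carry much of the filtering in this sketch.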
In an embodiment of the present invention, the extraction unit 201 specifically extracts the continuous English character strings in the information text of the user terminal 204 that meet the URL address rules as URL addresses.
Below, the extraction unit 201 is described in detail, taking the extraction of a WeChat ID carried in the information text of the user terminal 204 as an example:
The way the extraction unit 201 extracts WeChat IDs is roughly the same as the way telephone numbers are extracted; the common points are not repeated here, and the differences are as follows:
The extraction unit 201 performs key character or keyword extraction on the information text content of the user terminal 204. Generally, a WeChat ID is one of: a WeChat ID composed of English characters, a QQ number used as the WeChat ID, or a mobile phone number used as the WeChat ID. In the embodiment of the present invention, the extraction unit 201 extracts the WeChat ID keywords contained in the information text of the user terminal 204, such as "qq", "wechat", "weixin", "wei xin", "wei_xin", "维信" and "威信", and removes these keywords, so that they do not interfere with the subsequent extraction of the WeChat ID and affect the correctness of the result.
The extraction unit 201 performs character string vector effective length judgment on the information text content of the user terminal 204. Generally, the naming rule of a WeChat ID is that it starts with a letter, that only digits, letters, underscores or minus signs may follow, and that its string length is between 6 and 20. When a QQ number is used as the WeChat ID, the WeChat ID starts with a non-zero digit and its string length is between 7 and 10. When a mobile phone number is used as the WeChat ID, the WeChat ID starts with the digit 1 and its string length is 11. The extraction unit 201 judges whether a character string in the information text of the user terminal 204 meets these WeChat ID string length features and format requirements. If it does, the character string is output as a WeChat ID; otherwise it is ignored.
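The three WeChat ID forms can be sketched as three patterns tested against a candidate string. This is a minimal sketch of the naming rules as stated above; the names `WECHAT_PATTERNS` and `classify_wechat_id` are hypothetical.

```python
import re

WECHAT_PATTERNS = {
    # Letter first, then digits/letters/underscore/minus, total length 6 to 20.
    "wechat_id": re.compile(r"[A-Za-z][A-Za-z0-9_-]{5,19}"),
    # QQ number used as WeChat ID: non-zero first digit, total length 7 to 10.
    "qq_number": re.compile(r"[1-9][0-9]{6,9}"),
    # Mobile number used as WeChat ID: starts with 1, total length 11.
    "mobile_number": re.compile(r"1[0-9]{10}"),
}

def classify_wechat_id(candidate):
    """Return the names of the WeChat ID forms that the candidate matches."""
    return [
        name for name, pattern in WECHAT_PATTERNS.items()
        if pattern.fullmatch(candidate)
    ]

kinds = classify_wechat_id("zhang_san88")
```

An empty list means the candidate matches none of the three forms and would be ignored.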
In an embodiment of the present invention, the extraction unit 201 specifically extracts, as the WeChat ID, the character strings obtained after character string vector effective length processing of the information text of the user terminal 204.
The extraction unit 201 compiles statistics on all electronic addresses extracted within the preset time period and forms the electronic address information table, which serves as the basis for identifying violating electronic addresses.
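Building the electronic address information table can be sketched as a per-address aggregation. The column names and the message layout `(address, sender_id, receiver_id)` are assumptions made for this sketch, which covers only the send count and the sending and receiving ID lists among the characteristic information; the function name `build_address_table` is hypothetical.

```python
from collections import defaultdict

def build_address_table(messages):
    """Aggregate per-address features from (address, sender_id, receiver_id) tuples."""
    table = defaultdict(
        lambda: {"send_count": 0, "sender_ids": set(), "receiver_ids": set()}
    )
    for address, sender_id, receiver_id in messages:
        row = table[address]
        row["send_count"] += 1                 # number of times the address was sent
        row["sender_ids"].add(sender_id)       # sending ID list
        row["receiver_ids"].add(receiver_id)   # receiving ID list
    return dict(table)

table = build_address_table([
    ("13800138000", "s1", "u1"),
    ("13800138000", "s1", "u2"),
    ("4008123456", "s2", "u1"),
])
```

The number of sending or receiving IDs for an address is then the size of the corresponding set.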
Here, the system further includes a reporting unit 203. The cluster analysis unit 202 performs cluster analysis on all electronic addresses extracted by the extraction unit 201 according to the characteristic information in the electronic address information table. Specifically, the electronic addresses that belong to the same class of feature in the characteristic information are grouped into one class, and the number of electronic addresses in the same class of characteristic information is determined. When the number of electronic addresses is detected to exceed a preset threshold, the electronic addresses are determined to be suspected violating electronic addresses, and the suspected violating electronic addresses so determined are sent to the reporting unit 203, which reports them to the audit center.
For example: the preset threshold is 5. The cluster analysis unit 202 detects in the electronic address information table that electronic address A was sent 10 times within two days by the same sending ID to the same user terminal; electronic address B was sent 7 times within one day by the same sending ID to different user terminals; and electronic address C was sent 3 times within two days by different sending IDs to user terminals. It is therefore determined that the number of times electronic address A and electronic address B were sent exceeds the preset threshold of 5. The cluster analysis unit 202 then marks all information containing electronic address A or electronic address B as suspected violating electronic addresses and sends it to the reporting unit 203, which reports the suspected violating electronic addresses to the audit center. After the audit center receives the suspected violating electronic addresses, the audit center selects, from the information texts containing the suspected violating electronic address, the longest or the shortest information text and performs a reverse query on it. If the content of the information text does contain obvious fraud information, the information text is marked as suspected fraud information and the ID that sent the information text is marked as a suspected fraud information sending ID. Alternatively, maintenance staff at the audit center manually select the longest or the shortest information text containing the suspected violating electronic address and perform the reverse query on it; if the content of the information text does contain obvious fraud information, the information text is marked as suspected fraud information. The electronic address in the information text is marked as a fraud electronic address, and the information is not sent to the user terminal.
The information identification method and system provided by the embodiments of the present invention are not limited by an existing fraud information text library. By locating violating electronic addresses in multiple ways, they can identify fraud information that varies in many ways. By performing a reverse query on the text information that contains violating electronic addresses, they avoid blocking legitimate information text by mistake, while improving the accuracy of fraud information identification.
In practical applications, the extraction unit 201, the cluster analysis unit 202 and the reporting unit 203 may each be implemented by a central processing unit (CPU), a microprocessor (MPU), a digital signal processor (DSP), a field-programmable gate array (FPGA) or the like located in the information identification device.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of a hardware embodiment, a software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage and optical storage) that contain computer-usable program code.
The present invention is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to embodiments of the present invention. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a means for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction means, and the instruction means implements the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is performed on the computer or other programmable device to produce computer-implemented processing, and the instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of the flowchart and/or one or more blocks of the block diagram.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the scope of protection of the present invention.

Claims (10)

1. An information identification method, characterized in that the method comprises:
extracting the electronic addresses carried in information text within a set time period, and determining the characteristic information of the electronic addresses according to the electronic addresses;
performing cluster analysis on the electronic addresses according to the characteristic information, and locating suspected violating electronic addresses.
2. The method according to claim 1, characterized in that performing cluster analysis on the electronic addresses according to the characteristic information and locating suspected violating electronic addresses comprises:
grouping the electronic addresses that belong to the same class of feature in the characteristic information into one class, and determining the number of the electronic addresses in the same class of characteristic information;
when the number of the electronic addresses is detected to exceed a preset threshold, determining that the electronic addresses are suspected violating electronic addresses.
3. The method according to claim 1, characterized in that the characteristic information comprises one or more of: the number of times the electronic address is sent/received, the number of sending/receiving IDs, the sending/receiving ID list, the number of key characters/words hit, and the number of normal/violating electronic addresses.
4. The method according to claim 1, characterized in that extracting the electronic addresses carried in information text within the preset time period comprises:
performing, on the content of the information text within the preset time period, full-width to half-width character conversion, generalized character mapping, special character preprocessing, shortest-distance processing between consecutive character strings, character string vector effective length judgment, and/or key character or keyword extraction;
extracting the electronic addresses carried in the processed information text.
5. The method according to claim 1, characterized in that the electronic address comprises one or more of: a telephone number, a bank card number, a QQ number, a WeChat ID, an email address, and a uniform resource locator (URL).
6. An information identification system, characterized in that the system comprises an extraction unit and a cluster analysis unit, wherein:
the extraction unit is configured to extract the electronic addresses carried in the information text of a user terminal within a set time period, and to determine the characteristic information of the electronic addresses according to the electronic addresses;
the cluster analysis unit is configured to perform cluster analysis on the electronic addresses extracted by the extraction unit, according to the characteristic information determined by the extraction unit, and to locate suspected violating electronic addresses.
7. The system according to claim 6, characterized in that the cluster analysis unit is specifically configured to group the electronic addresses that belong to the same class of feature in the characteristic information into one class, and to determine the number of the electronic addresses in the same class of characteristic information; and, when the number of the electronic addresses is detected to exceed a preset threshold, to determine that the electronic addresses are suspected violating electronic addresses.
8. The system according to claim 6, characterized in that the characteristic information comprises one or more of: the number of times the electronic address is sent/received, the number of sending/receiving IDs, the sending/receiving ID list, the number of key characters/words hit, and the number of normal/violating electronic addresses.
9. The system according to claim 6, characterized in that the extraction unit is specifically further configured to perform, on the information text content of the user terminal within the preset time period, full-width to half-width character conversion, generalized character mapping, special character preprocessing, shortest-distance processing between consecutive character strings, character string vector effective length judgment, and/or key character or keyword extraction;
and to extract the electronic addresses carried in the processed information text.
10. The system according to claim 6, characterized in that the electronic addresses extracted by the extraction unit comprise one or more of: a telephone number, a bank card number, a QQ number, a WeChat ID, an email address, and a URL.
CN201610628540.3A 2016-08-03 2016-08-03 A kind of information identifying method and system Pending CN107690130A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610628540.3A CN107690130A (en) 2016-08-03 2016-08-03 A kind of information identifying method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610628540.3A CN107690130A (en) 2016-08-03 2016-08-03 A kind of information identifying method and system

Publications (1)

Publication Number Publication Date
CN107690130A true CN107690130A (en) 2018-02-13

Family

ID=61151314

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610628540.3A Pending CN107690130A (en) 2016-08-03 2016-08-03 A kind of information identifying method and system

Country Status (1)

Country Link
CN (1) CN107690130A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109063007A (en) * 2018-07-10 2018-12-21 阿里巴巴集团控股有限公司 A kind of exchange medium cleaning method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2008084259A1 (en) * 2007-01-09 2008-07-17 Websense Hosted R&D Limited A method and system for collecting addresses for remotely accessible information sources
CN102572745A (en) * 2010-12-24 2012-07-11 中国移动通信集团上海有限公司 Method and device for determining waste short message
CN103428183A (en) * 2012-05-23 2013-12-04 北京新媒传信科技有限公司 Method and device for identifying malicious website
CN103944810A (en) * 2014-05-06 2014-07-23 厦门大学 Spam e-mail intention recognition system
CN104811424A (en) * 2014-01-26 2015-07-29 腾讯科技(深圳)有限公司 Malicious user identification method and device
WO2016082568A1 (en) * 2014-11-25 2016-06-02 中兴通讯股份有限公司 Short message safe processing method and apparatus

Similar Documents

Publication Publication Date Title
CN106453061B (en) A kind of method and system identifying network fraudulent act
CN105138652B (en) A kind of enterprise's incidence relation recognition methods and system
CN104408093B (en) A kind of media event key element abstracting method and device
CN104982011B (en) Use the document classification of multiple dimensioned text fingerprints
CN106202028B (en) A kind of address information recognition methods and device
CN103415004B (en) A kind of method and device detecting junk short message
CN106713579B (en) Telephone number identification method and device
CN105893615B (en) Owner's characteristic attribute method for digging and its system based on Mobile Phone Forensics data
CN102567534B (en) Interactive product user generated content intercepting system and intercepting method for the same
CN107181745A (en) Malicious messages recognition methods, device, equipment and computer-readable storage medium
CN107517463A (en) A kind of recognition methods of telephone number and device
CN102495942A (en) Assessment method for risks of internal network of organization and system
CN104598595B (en) Cheat page detection method and related device
CN107870988A (en) A kind of information verification method, terminal device and storage medium
CN107958154A (en) A kind of malware detection device and method
CN106598946A (en) Content extracting method and device
CN107135314A (en) Harass detection method, system, mobile terminal and the server of short message
CN101963988A (en) Intelligent engine for normalizing discretion and implementation method thereof
CN108023868A (en) Malice resource address detection method and device
CN112492606A (en) Classification and identification method and device for spam messages, computer equipment and storage medium
CN108694168A (en) A kind of address processing method and processing device, computer installation and readable storage medium storing program for executing
CN104346337B (en) Method and device for intercepting junk information
CN110163013A (en) A kind of method and apparatus detecting sensitive information
CN107690130A (en) A kind of information identifying method and system
CN109284465A (en) A kind of Web page classifying device construction method and its classification method based on URL

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20180213)