CN108734011A - software link detection method and device - Google Patents

Info

Publication number
CN108734011A
Authority
CN
China
Prior art keywords
link
download
download link
text
domain name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710250473.0A
Other languages
Chinese (zh)
Inventor
张峰
胡向东
李林乐
杨子明
梁业裕
付俊
郭智慧
魏琴芳
刘可
林家富
陈国军
白银
刘玥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Mobile Communications Group Co Ltd
Chongqing University of Post and Telecommunications
China Mobile Communications Co Ltd
Original Assignee
China Mobile Communications Group Co Ltd
Chongqing University of Post and Telecommunications
China Mobile Communications Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Mobile Communications Group Co Ltd, Chongqing University of Post and Telecommunications, China Mobile Communications Co Ltd
Priority to CN201710250473.0A
Publication of CN108734011A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50: Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55: Detecting local intrusion or implementing counter-measures
    • G06F21/56: Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562: Static detection

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Virology (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

An embodiment of the invention discloses a software link detection method and device. The method includes: extracting, from predetermined information, a download link for downloading software and the text other than the download link; extracting a link feature of the download link; extracting a text feature of the text; and judging, according to the link feature and the text feature, whether the download link is a malicious link that satisfies a malware download link judgment condition. Because the embodiment extracts both the text and the download link from the predetermined information and derives a link feature and a text feature from them, it judges whether the download link is a malicious link that provides malware from two aspects: the download link itself, and the text that accompanies it in the predetermined information. This gives high judgment accuracy. The method is also simple to implement and fast at detection.

Description

Software link detection method and device
Technical field
The present invention relates to the field of information technology, and in particular to a software link detection method and device.
Background
With the development of information and software technology, malicious software (malware) designed to steal other people's money or information has appeared. Common malware includes viruses, botnets, worms, and Trojan horses. Download links for such malware may be hidden in information such as short messages. If the user taps the information or the link, the terminal may download the malware, which puts the user's property and information security at risk.
The prior art proposes some malware detection methods, but they all detect the executable file of the malware itself. Malware can therefore be detected only after it has been downloaded. Because much malware starts itself automatically and is difficult to remove completely, this way of detecting malware still leaves a high probability that the user's funds or information will be stolen.
Summary of the invention
In view of this, embodiments of the present invention are intended to provide a software link detection method and device that at least partly solve the above problems.
To achieve the above objective, the technical solution of the present invention is realized as follows:
A first aspect of the embodiments of the present invention provides a software link detection method, including:
extracting, from predetermined information, a download link for downloading software and the text other than the download link;
extracting a link feature of the download link;
extracting a text feature of the text;
judging, according to the link feature and the text feature, whether the download link is a malicious link that satisfies a malware download link judgment condition.
Based on the above solution, the method further includes:
obtaining the download domain name of the download link;
matching the download domain name against the domain names in a domain name library;
The extracting of the link feature of the download link includes:
when the download domain name is in the whitelist of the domain name library, extracting the link feature of the download link.
Based on the above solution, the judging, according to the link feature and the text feature, whether the download link is a malicious link that satisfies the malware download link judgment condition includes:
processing a feature vector with a logistic regression model to obtain the probability that the download link is a malicious link;
when the probability is greater than a probability threshold, determining that the download link is a malicious link;
when the probability is not greater than the probability threshold, processing the feature vector with a naive Bayes model to output the probability that the download link is a malicious link and the probability that it is a normal link;
when the probability that the download link is a malicious link is greater than the probability that it is a normal link, determining that the download link is a malicious link.
Based on the above solution, the feature vector includes at least one of the following:
the link length of the download link; the number of path levels of the download link; the number of digits contained in the link path of the download link; the data size of the software installation package corresponding to the download link; the number of domain name levels of the download link; first indication information indicating whether the download link includes a predetermined top-level domain; second indication information indicating whether the time difference between the registration time of the domain name corresponding to the download link and the current time is less than a time threshold; third indication information indicating whether a similarity is greater than a similarity threshold; and fourth indication information indicating whether the download link includes characters of a predefined type, where the similarity is the degree of similarity between the text and sensitive information.
Based on the above solution, the method further includes:
when the download link is determined to be a malicious link, storing the text as sensitive information;
and/or
adding the download link to the blacklist of a link library.
A second aspect of the embodiments of the present invention provides a software link detection device, including:
a first extraction unit, configured to extract, from predetermined information, a download link for downloading software and the text other than the download link;
a second extraction unit, configured to extract a link feature of the download link;
a third extraction unit, configured to extract a text feature of the text;
a judging unit, configured to judge, according to the link feature and the text feature, whether the download link is a malicious link that satisfies a malware download link judgment condition.
Based on the above solution, the device further includes:
an obtaining unit, configured to obtain the download domain name of the download link;
a matching unit, configured to match the download domain name against the domain names in a domain name library;
The second extraction unit is specifically configured to extract the link feature of the download link when the download domain name is in the whitelist of the domain name library;
The third extraction unit is specifically configured to extract the text feature of the text when the download domain name is in the whitelist of the domain name library.
Based on the above solution, the judging unit is specifically configured to: process a feature vector with a logistic regression model to obtain the probability that the download link is a malicious link; when the probability is greater than a probability threshold, determine that the download link is a malicious link;
when the probability is not greater than the probability threshold, process the feature vector with a naive Bayes model to output the probability that the download link is a malicious link and the probability that it is a normal link;
when the probability that the download link is a malicious link is greater than the probability that it is a normal link, determine that the download link is a malicious link.
Based on the above solution, the feature vector includes at least one of the following:
the link length of the download link; the number of path levels of the download link; the number of digits contained in the link path of the download link; the data size of the software installation package corresponding to the download link; the number of domain name levels of the download link; first indication information indicating whether the download link includes a predetermined top-level domain; second indication information indicating whether the time difference between the registration time of the domain name corresponding to the download link and the current time is less than a time threshold; third indication information indicating whether a similarity is greater than a similarity threshold; and fourth indication information indicating whether the download link includes characters of a predefined type, where the similarity is the degree of similarity between the text and sensitive information.
Based on the above solution, the device further includes:
a storage unit, configured to store the text as sensitive information when the download link is determined to be a malicious link; and/or to add the download link to the blacklist of a link library.
In the technical solution provided by the embodiments of the present invention, the text and the download link in the predetermined information are both extracted, and a link feature and a text feature are derived from them. Whether the download link is a malicious link that provides malware is then judged from two aspects: the download link itself, and the text that accompanies it in the predetermined information. This makes it simple to determine which download links are malicious and gives high judgment accuracy. The solution is also simple to implement and fast at detection.
Description of the drawings
Fig. 1 is a schematic flowchart of the first software link detection method provided by an embodiment of the present invention;
Fig. 2 is a schematic flowchart of the second software link detection method provided by an embodiment of the present invention;
Fig. 3 is a schematic structural diagram of the software link detection device provided by an embodiment of the present invention;
Fig. 4 is a schematic flowchart of the third software link detection method provided by an embodiment of the present invention;
Fig. 5 is a schematic flowchart of text feature extraction provided by an embodiment of the present invention.
Detailed description
The technical solution of the present invention is further elaborated below with reference to the drawings and specific embodiments.
As shown in Fig. 1, this embodiment provides a software link detection method, including:
Step S110: extracting, from predetermined information, a download link for downloading software and the text other than the download link;
Step S120: extracting a link feature of the download link;
Step S130: extracting a text feature of the text;
Step S140: judging, according to the link feature and the text feature, whether the download link is a malicious link that satisfies a malware download link judgment condition.
The software link detection method of this embodiment can be applied in various electronic devices, such as terminals or servers. The terminals may include mobile terminals and fixed terminals. A mobile terminal may be a mobile phone, tablet computer, wearable device, notebook computer, e-book reader, or the like. A fixed terminal may be a device such as a PC.
The software in this embodiment may be any of various applications (Application, APP) or operating system software. The applications in this embodiment may include applications for the Android system.
After an electronic device receives a piece of predetermined information, the user may tap it to view it. The electronic device of this embodiment may execute steps S110 to S140 either before or after the user taps to view the information. Optionally, the steps are executed after the user taps to view it; in that case, predetermined information that the user never opens does not need to be detected.
The predetermined information may include short messages, web page information such as microblog posts, and instant messages such as WeChat or QQ messages.
In step S110 of this embodiment, a regular expression can be used to extract the download link from the content of the predetermined information and to separate it from the text. The regular expression used here to extract the download link and the text may be called the first regular expression.
After the download link is extracted, it is processed to obtain the link feature. For example, the download link is processed with a regular expression; the regular expression applied directly to the download link here may be called the second regular expression.
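The extraction in step S110 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the pattern `URL_PATTERN` and the function `split_links_and_text` are hypothetical names, and the pattern itself is an assumed stand-in for the "first regular expression", which the text does not specify.

```python
import re

# Hypothetical "first regular expression": matches http(s) URLs made of
# characters that are legal in a URL. The pattern is an assumption.
URL_PATTERN = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")


def split_links_and_text(message: str) -> tuple[list[str], str]:
    """Return the download links found in a message and the remaining text."""
    links = URL_PATTERN.findall(message)
    text = URL_PATTERN.sub(" ", message)
    # Collapse the whitespace left behind where links were removed.
    text = " ".join(text.split())
    return links, text


links, text = split_links_and_text(
    "Your parcel is waiting, install the app at http://example.com/a/app.apk now"
)
print(links)
print(text)
```

Because the character class also accepts punctuation such as commas and full stops, a link followed directly by sentence punctuation would keep that punctuation; a production pattern would need to handle that case.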
Download links can be divided into standard links and short links. The character string of a short link is usually shorter than that of the corresponding standard link. When a short link is used to download information or open a web page, it must first be converted into the standard link according to the conversion relationship between the two, and the standard link is then used to reach the corresponding address or download the corresponding resource.
In this embodiment, when the extracted download link is a short link, the short link can first be restored to the standard link, and the link feature is then extracted.
In some cases, the link extracted from the predetermined information may not be a software download link at all, but some other kind of link, for example one that merely opens a web page. In this embodiment, before the link feature of the download link is extracted, the method further includes:
obtaining the file type of the requested resource from the Hypertext Transfer Protocol (HTTP) header, and determining from the file type whether the link is a software download link. If it is a software download link, the subsequent steps S120 to S140 are carried out.
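A minimal sketch of the file-type check, assuming the file type is read from the `Content-Type` response header. The set `SOFTWARE_MIME_TYPES` and the function `is_software_download` are hypothetical; which media types count as software is an assumption, since the text does not list them.

```python
# Hypothetical set of media types treated as software packages (assumption).
SOFTWARE_MIME_TYPES = {
    "application/vnd.android.package-archive",  # Android APK
    "application/octet-stream",                 # generic binary download
}


def is_software_download(headers: dict[str, str]) -> bool:
    """Decide from HTTP response headers whether a link serves a software file."""
    # Header names are case-insensitive, so normalise them before lookup.
    normalised = {name.lower(): value for name, value in headers.items()}
    content_type = normalised.get("content-type", "")
    # Drop parameters such as "; charset=utf-8" and compare the bare media type.
    media_type = content_type.split(";")[0].strip().lower()
    return media_type in SOFTWARE_MIME_TYPES


print(is_software_download({"Content-Type": "application/vnd.android.package-archive"}))
print(is_software_download({"Content-Type": "text/html; charset=utf-8"}))
```

The headers could come from a HEAD request, for example one built with `urllib.request.Request(url, method="HEAD")`; the network call is left out so that the sketch stays offline.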
In this embodiment, step S130 extracts the text feature of the text. For example, a bag-of-words method or a similar technique determines whether the text contains sensitive words, or determines information such as the proportion of sensitive words among all the words in the text. The sensitive information here may be information associated with malware downloads, for example pornographic information or phishing code.
In this embodiment, step S130 may specifically include:
performing sensitive information processing on the text to determine the similarity between the text and sensitive information.
There are many ways to calculate the similarity. Several optional ways are given below:
Mode one:
Match the text against the sensitive words in the sensitive information. The matching may be exact or fuzzy, and the similarity is determined by the degree of matching. For example, perform word segmentation on the text to obtain n word groups, then perform multi-pattern matching against the text entries in the text library of sensitive information. If m word groups match, the similarity is A = (m/n) × 100%.
Mode two:
Extract the message subject of the text and judge whether the subject belongs to one of the categories of sensitive information; the similarity is then given by an assignment model.
Mode three:
Extract the message subject of the text, match the subject against the sensitive information, and obtain the similarity from the degree of matching. The similarity can be calculated as in mode one.
In short, this embodiment can determine the similarity between the text and the sensitive information in many ways and is not limited to any one of the above.
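Mode one reduces to the ratio A = (m/n) × 100%. The sketch below illustrates it with exact matching only; the function name `sensitive_similarity` is hypothetical, and the whitespace split merely stands in for the word segmentation step, which the text does not detail.

```python
def sensitive_similarity(tokens: list[str], sensitive_words: set[str]) -> float:
    """Return A = (m / n) * 100, with n tokens in total and m of them matched."""
    if not tokens:
        return 0.0
    matched = sum(1 for token in tokens if token in sensitive_words)
    return matched / len(tokens) * 100.0


# Whitespace splitting is an assumed stand-in for real word segmentation.
tokens = "free prize click now".split()
print(sensitive_similarity(tokens, {"free", "prize"}))  # 2 of 4 tokens match
```

Fuzzy matching or multi-pattern matching against a larger library, as the text mentions, would replace the set membership test.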
To speed up the judgment of whether the download link is a malicious link for malware, in this embodiment, as shown in Fig. 2, the method further includes:
Step S111: obtaining the download domain name of the download link;
Step S112: matching the download domain name against the domain names in a domain name library;
Step S120 may include:
when the download domain name is in the whitelist of the domain name library, extracting the link feature of the download link.
Step S130 may include: when the download domain name is in the whitelist of the domain name library, extracting the text feature of the text.
In this embodiment the download domain name of the download link is obtained first; a third regular expression can be used to extract the part of the download link that corresponds to the domain name.
In this embodiment the download domain name is compared with the domain names in the domain name library. If the domain names in the library are those of a whitelist and the download domain name matches one of them, the download domain name can be regarded as a normal domain name, although this alone does not yet rule out that the download link is a malicious link. If the library includes a blacklist as well as a whitelist and the download domain name matches a malicious domain name in the blacklist, the download domain name can be directly determined to be a malicious domain name. In this embodiment, the domain names in the whitelist are normal domain names and the domain names in the blacklist are malicious domain names.
If the domain name library includes both a whitelist and a blacklist, step S112 may include matching the download domain name against the whitelist and the blacklist in parallel. As soon as the download domain name matches a domain name in one list, matching against the other list stops, which reduces the number of matching operations and improves matching efficiency. Alternatively, step S112 may match the download domain name against the whitelist and then the blacklist, or against the blacklist and then the whitelist.
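The sequential variant of step S112 (blacklist first, then whitelist) can be sketched as below. The list contents and the function `match_domain` are hypothetical, and `urllib.parse.urlsplit` is used here in place of the "third regular expression" to obtain the domain name.

```python
from urllib.parse import urlsplit

# Hypothetical domain name library contents, for illustration only.
WHITELIST = {"www.example.com"}
BLACKLIST = {"malware.example.net"}


def match_domain(download_link: str) -> str:
    """Classify the link's domain as 'blacklist', 'whitelist' or 'unknown'."""
    # hostname is already lower-cased by urlsplit; None if the link has no host.
    domain = urlsplit(download_link).hostname or ""
    if domain in BLACKLIST:
        return "blacklist"  # matched: no need to consult the other list
    if domain in WHITELIST:
        return "whitelist"
    return "unknown"


print(match_domain("http://malware.example.net/app.apk"))
print(match_domain("https://www.example.com/download/app.apk"))
```

Using sets makes each lookup a constant-time exact match; matching parent domains or wildcard entries would need additional logic that the text does not describe.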
Step S140 can be implemented in many ways. For example, various classifiers can process the feature vector and determine by probability calculation whether the corresponding download link is a malicious link.
For example, step S140 may include:
processing the feature vector with a logistic regression model to obtain the probability that the download link is a malicious link;
when the probability is greater than a probability threshold, determining that the download link is a malicious link;
when the probability is not greater than the probability threshold, processing the feature vector with a naive Bayes model to output the probability that the download link is a malicious link and the probability that it is a normal link;
when the probability that the download link is a malicious link is greater than the probability that it is a normal link, determining that the download link is a malicious link.
In this embodiment, a logistic regression model performs the first probability calculation, and comparison with the probability threshold filters out some of the malicious links. For the feature vectors not yet determined to be malicious, a naive Bayes model processes the feature vector again and outputs two probabilities: one that the download link is a normal link, and one that it is a malicious link. If the probability that the download link is a malicious link is greater than the probability that it is a normal link, the download link is finally determined to be a malicious link; otherwise it is a normal link.
To speed up detection and improve its accuracy, the probability threshold in this embodiment needs to be set to a suitable value, for example 0.7, 0.8 or 0.6; specific implementations are not limited to these values.
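The two-stage decision can be sketched as below. The weights, the bias and the `naive_bayes` callable are hypothetical placeholders for trained models, which the text does not specify; only the control flow (logistic regression first, naive Bayes as fallback) follows the description, and 0.7 is one of the example thresholds mentioned.

```python
import math
from typing import Callable, Sequence

Vector = Sequence[float]


def logistic_probability(x: Vector, weights: Vector, bias: float) -> float:
    """Logistic regression score: sigmoid of the weighted feature sum."""
    z = bias + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))


def is_malicious(
    x: Vector,
    weights: Vector,
    bias: float,
    naive_bayes: Callable[[Vector], tuple[float, float]],
    threshold: float = 0.7,
) -> bool:
    """Two-stage decision: logistic regression first, naive Bayes as fallback."""
    if logistic_probability(x, weights, bias) > threshold:
        return True
    # naive_bayes returns (P(malicious | x), P(normal | x)).
    p_malicious, p_normal = naive_bayes(x)
    return p_malicious > p_normal


# Stand-in second-stage model with fixed outputs, for illustration only.
print(is_malicious([3.0], [2.0], 0.0, lambda x: (0.3, 0.7)))
print(is_malicious([0.0], [2.0], 0.0, lambda x: (0.3, 0.7)))
```

Passing the second-stage model as a callable keeps the cascade independent of how the naive Bayes model is trained or which library provides it.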
In some embodiments, the feature vector includes at least one of the following:
the link length of the download link; the number of path levels of the download link; the number of digits contained in the link path of the download link; the data size of the software installation package corresponding to the download link; the number of domain name levels of the download link; first indication information indicating whether the download link includes a predetermined top-level domain; second indication information indicating whether the time difference between the registration time of the domain name corresponding to the download link and the current time is less than a time threshold; third indication information indicating whether the similarity is greater than a similarity threshold; and fourth indication information indicating whether the download link includes characters of a predefined type.
The link length may be the length of the character string of the download link, specifically the number of characters the string contains.
The number of path levels of the download link is generally equal to the number of directory levels, on the device that provides the download resource, of the location to which the download link points.
Normally, a download link is made up of meaningful strings of various characters, so the number of digits it contains can indicate that it may be a malicious link. This embodiment can therefore also extract the number of digits contained in the link path of the download link.
In this embodiment, when the link feature of the download link is extracted, the download link may be used to connect to the target device that provides the download and to check the data size of the software installation package offered. Normally, a data size that is too large or too small may indicate a malicious link.
The number of domain name levels of the download link equals the number of "." characters in the string between "//" and the following "/" in the download link, plus 1. For example, for the link https://www.baidu.com/, the string "www.baidu.com" contains 2 "." characters, so the number of domain name levels of the link is 3. The number of domain name levels of any download link can be determined in this way.
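The counting rules for domain name levels and path digits can be sketched as follows. The function names are hypothetical. The sketch uses `urllib.parse.urlsplit`, whose `hostname` excludes any port or user information, which is a slight assumption relative to "the string between `//` and `/`".

```python
from urllib.parse import urlsplit


def domain_level(download_link: str) -> int:
    """Number of '.' characters in the host part of the link, plus 1."""
    host = urlsplit(download_link).hostname or ""
    return host.count(".") + 1 if host else 0


def path_digit_count(download_link: str) -> int:
    """Number of digit characters in the path part of the link."""
    return sum(ch.isdigit() for ch in urlsplit(download_link).path)


link = "https://www.baidu.com/"
print(len(link))            # link length in characters
print(domain_level(link))   # "www.baidu.com" has 2 dots
print(path_digit_count("http://example.com/a1/b22/app.apk"))
```

Each function returns a plain integer, so the values can be placed directly into the feature vector.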
Some top-level domains have been judged to be illegitimate, malicious top-level domains, so matching the top-level domain directly yields the first indication information, which indicates whether the download link includes a predetermined top-level domain. The first indication information can usually be a single bit whose two states, "0" and "1", indicate whether the download link includes a predetermined top-level domain.
The second indication information indicates whether the time difference between the registration time of the domain name corresponding to the download link and the current time is less than a time threshold. A domain name registered for normal use is generally used for a long time, whereas the domain name behind a malicious link may have been registered only temporarily in order to avoid investigation. This embodiment can therefore take the time difference between the registration time of the domain name and the current time, compare it with the time threshold, and obtain the second indication information. The second indication information can likewise be expressed with one or more bits.
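The second indication information is a single comparison of a time difference with a threshold. A minimal sketch follows; the function name `recently_registered` and the 30-day default threshold are assumptions, and obtaining the registration time (for instance from a registry lookup) is outside the sketch.

```python
from datetime import datetime, timedelta


def recently_registered(
    registered_at: datetime,
    now: datetime,
    threshold: timedelta = timedelta(days=30),  # assumed value, not from the text
) -> int:
    """Return 1 if the domain was registered less than `threshold` ago, else 0."""
    return 1 if now - registered_at < threshold else 0


print(recently_registered(datetime(2017, 4, 1), datetime(2017, 4, 10)))
print(recently_registered(datetime(2010, 1, 1), datetime(2017, 4, 10)))
```

Passing `now` explicitly rather than reading the clock inside the function keeps the indicator reproducible when feature vectors are recomputed later.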
The third indication information indicates whether the similarity is greater than the similarity threshold. Step S130 of this embodiment can calculate the similarity between the text and the sensitive information, and when the feature vector is formed, the similarity can be used directly as one element of the feature vector. In this embodiment the similarity is compared with the similarity threshold, and the third indication information is formed from the result of the comparison. The third indication information can likewise be expressed with one or more bits.
The fourth indication information indicates whether the download link includes characters of a predefined type. Characters of a predefined type here include, for example, characters that are neither letters nor punctuation marks, such as Chinese or Tibetan characters. Normally, a download link that includes characters of a predefined type is more likely to be a malicious link.
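The fourth indication information can be sketched as a one-bit flag. The function name is hypothetical, and treating every non-ASCII character as a "predefined type" character is an assumption made for illustration; the text only gives Chinese and Tibetan characters as examples.

```python
def has_predefined_characters(download_link: str) -> int:
    """Return 1 if the link contains any non-ASCII character, else 0."""
    # Assumption: "predefined type" is approximated here by "non-ASCII".
    return 0 if download_link.isascii() else 1


print(has_predefined_characters("http://example.com/app.apk"))
print(has_predefined_characters("http://example.com/下载.apk"))
```

A stricter check could test for specific Unicode ranges instead, at the cost of maintaining the list of ranges.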
In short, the elements of the feature vector provided by this embodiment may also be information other than the above. To make the feature vector easier to process, this embodiment can convert the above information into logical values or numeric values.
In some embodiments, the method further includes: when the download link is determined to be a malicious link, storing the text as sensitive information; and/or adding the download link to the blacklist of a link library.
For example, if the download link is determined to be a malicious link from its top-level domain or link feature, the text can be added to the text library of sensitive information and used as sensitive information in subsequent judgments of malicious links.
As another example, when the download link is determined to be a malicious link from the similarity of the text, the download link can be added to the blacklist of the link library.
In this embodiment, before step S110 is executed, the method may also include:
judging the information source of the predetermined information. If the information source is a legitimate source, steps S110 to S140 are not executed; if the information source is not a specified legitimate source, the method proceeds to step S110. In this embodiment a legitimate source is a specified information source, for example download short messages from the major public operators or pushed instant messages. If the information source is used to decide whether to execute steps S110 to S140, then when the information source of the predetermined information is judged, the parameters used to identify the information source must also be verified, to avoid misjudgment caused by information that fake base stations and the like send in imitation of a legitimate source.
As shown in Fig. 3, this embodiment provides a software link detection device, including:
a first extraction unit 110, configured to extract, from predetermined information, a download link for downloading software and the text other than the download link;
a second extraction unit 120, configured to extract a link feature of the download link;
a third extraction unit 130, configured to perform sensitive information processing on the text and determine the similarity between the text and sensitive information;
a judging unit 140, configured to judge, according to the link feature and the text feature, whether the download link is a malicious link that satisfies a malware download link judgment condition.
The software link detection device provided by this embodiment can be applied to various terminal devices.
The first extraction unit 110, the second extraction unit 120, the third extraction unit 130 and the judging unit 140 may each correspond to a processor or a processing circuit. The processor may include a central processing unit, microprocessor, digital signal processor, application processor, programmable array, or the like. The processing circuit may include an application-specific integrated circuit. The processor or processing circuit can implement the operations of the above units by executing executable code or programs.
The software link detection device of this embodiment combines the features of the text and of the download link extracted from the predetermined information to judge whether the download link is a download link for malware. It judges malicious links with high accuracy, is simple to implement, and is fast at detection.
In some embodiments, the device further includes:
An acquiring unit, configured to obtain the download domain name of the download link;
A matching unit, configured to match the download domain name against the domain names in a domain name library;
The second extraction unit 120 is specifically configured to extract the link features of the download link when the download domain name is in the whitelist of the domain name library;
The third extraction unit 130 is specifically configured to extract the text feature of the text when the download domain name is in the whitelist of the domain name library.
In this embodiment, the acquiring unit and the matching unit may likewise correspond to a processor or a processing circuit, which implements the above operations by executing code.
In this embodiment, the acquiring unit obtains the download domain name of the download link using a regular expression. Matching the download domain name first filters out a portion of the malicious links, which reduces subsequent processing such as extracting the link features and computing the similarity.
In some embodiments, the judging unit 140 is specifically configured to process the feature vector using a logistic regression model to obtain a probability characterizing that the download link is the malicious link;
When the probability is greater than a probability threshold, determine that the download link is the malicious link;
When the probability is not greater than the probability threshold, process the feature vector using a naive Bayes model and output the probabilities that the download link is a malicious link and a normal link, respectively;
When the probability that the download link is the malicious link is greater than the probability that the download link is a normal link, determine that the download link is the malicious link.
By introducing the logistic regression algorithm and the naive Bayes model, the judging unit 140 in this embodiment uses the elements of the feature vector as calculation parameters to compute the probability that a link is malicious. By comparing this probability with the probability threshold, or by comparing the probabilities of being a normal link and a malicious link, it determines simply and quickly whether the corresponding download link is a malicious link.
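The serial decision made by the judging unit can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: the function name `judge_link`, its parameters, and the default threshold of 0.8 are hypothetical, and the three probabilities are assumed to come from already trained logistic regression and naive Bayes models.

```python
def judge_link(lr_probability, nb_malicious, nb_normal, threshold=0.8):
    """Return True if the download link is judged malicious.

    lr_probability: logistic regression probability that the link is malicious.
    nb_malicious / nb_normal: naive Bayes probabilities for the two classes.
    threshold: assumed probability threshold for the logistic regression stage.
    """
    # Stage 1: logistic regression flags the link outright above the threshold.
    if lr_probability > threshold:
        return True
    # Stage 2: otherwise compare the two naive Bayes class probabilities.
    return nb_malicious > nb_normal
```

With a high threshold, the first stage only flags links it is confident about, and every other link is decided by the second stage.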
In some embodiments, the feature vector includes at least one of the following: the link length of the download link; the number of path levels of the download link; the number of digits contained in the link path of the download link; the data size of the software installation package corresponding to the download link; the number of domain name levels of the download link; first indication information indicating whether the download link includes a predetermined top-level domain; second indication information indicating whether the time difference between the registration time of the domain name corresponding to the download link and the current time is less than a time threshold; third indication information indicating whether the similarity is greater than a similarity threshold; and fourth indication information indicating whether the download link includes characters of a predetermined type.
For a description of each component of the feature vector in this embodiment, refer to the corresponding part of the preceding embodiments; it is not repeated here.
In further embodiments, the device further includes:
A storage unit, configured to, when the download link is determined to be the malicious link, store the text as sensitive information and/or add the download link to the blacklist of a link library.
In this embodiment, the storage unit may correspond to a storage medium used to store the sensitive information or the download link locally. The storage unit may also correspond to a communication interface used to send the text and/or the download link to a network-side database for remote storage.
Several specific examples are provided below in conjunction with the above embodiments:
Example one:
This example takes software download on an Android mobile phone as an example to further describe the software download link detection method provided by the above embodiments. The method may specifically include:
A1. Extract the link and the text contained in information such as a short message, and judge whether the link is a mobile phone software download link; for example, the file type of the requested resource can be obtained from the HTTP header to determine whether the link is an Android mobile phone software download link.
A2. If it is a download link, extract the text from the information such as the short message and pass it to the text sensitive information processing module for processing;
A3. Extract the link features, which together with the text topic similarity obtained by the text sensitive information processing module form the feature vector;
A4. Perform classification training using the logistic regression algorithm and the naive Bayes model;
A5. According to the detection result of the trained classifiers, judge whether the link is a mobile phone malware download link; if so, update the text sensitive information processing module.
Step A1 may include:
Extracting the link and the text by setting different regular expressions;
If the link is a short link, first restoring it to the original link;
If the link is not a software download link, ending the detection; otherwise, performing the subsequent operations.
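Step A1 can be sketched as follows. This is a minimal illustration, not the patented implementation: the pattern `URL_RE` and the helper names are hypothetical, and treating a `Content-Type` of `application/vnd.android.package-archive` (the media type commonly served for APK files) or a `.apk` path suffix as the download test is an assumption. The response headers are passed in as a plain dict, and short-link restoration is omitted because it requires a network request.

```python
import re

# Hypothetical pattern: an http(s) URL made of ASCII URL characters only,
# so it stops at whitespace or at the first Chinese character.
URL_RE = re.compile(r"https?://[A-Za-z0-9\-._~:/?#\[\]@!$&'()*+,;=%]+")

# Media type commonly used for Android packages (assumed to be the test value).
APK_MIME = "application/vnd.android.package-archive"


def split_link_and_text(message):
    """Return (first link or None, message text with all links removed)."""
    links = URL_RE.findall(message)
    text = URL_RE.sub("", message).strip()
    return (links[0] if links else None), text


def is_apk_download(link, headers):
    """Judge from response headers (a dict) or the URL suffix whether the link serves an APK."""
    content_type = headers.get("Content-Type", "").split(";")[0].strip().lower()
    path = link.split("?", 1)[0].lower()
    return content_type == APK_MIME or path.endswith(".apk")
```

In practice the headers would come from an HTTP request to the (restored) link; passing them in keeps the sketch free of network access.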
Before step A2, the method further includes:
Extracting the domain name of the link by setting a regular expression, and comparing the domain name with the stored domain name library;
If the domain name of the link is in the whitelist, judging that the link is a normal software download link;
If the domain name of the link is in the blacklist, judging that the link is a malware download link.
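The domain name pre-filter can be sketched as follows. The pattern `HOST_RE`, the function names, and the representation of the domain name library as two Python sets are assumptions made for illustration; the pattern is deliberately simple and does not handle every URL form (for example, user information before the host).

```python
import re

# Hypothetical pattern capturing the host part of an http(s) URL.
HOST_RE = re.compile(r"^https?://([^/:?#]+)", re.IGNORECASE)


def extract_domain(link):
    """Return the lower-cased host of the link, or None if it cannot be parsed."""
    match = HOST_RE.match(link)
    return match.group(1).lower() if match else None


def prefilter(link, whitelist, blacklist):
    """Return 'normal', 'malicious', or 'unknown' (feature-based detection still needed)."""
    domain = extract_domain(link)
    if domain in whitelist:
        return "normal"
    if domain in blacklist:
        return "malicious"
    return "unknown"
```

Only links that come back as `'unknown'` need the more expensive feature extraction and classification.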
Optionally, the text sensitive information processing module in step A2 corrects wrongly written characters in the text using existing Chinese error correction technology, segments the corrected text into words using existing word segmentation technology, performs multi-pattern matching against the text library in the text sensitive information processing module, and calculates the topic similarity.
Optionally, the feature vector in step A3 includes:
The length of the link, the number of path levels in the link, the number of digits in the link path, the data size of the downloaded software installation package, the number of domain name levels, whether the link contains Chinese characters, whether the link domain name uses an uncommon top-level domain, whether the time difference between the domain name registration time and the current time is less than a certain threshold, and whether the topic similarity is greater than a certain threshold.
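Building this feature vector can be sketched as follows. Everything not named in the text is an assumption: the set of "common" top-level domains, both threshold values, the element order, and the function name are hypothetical, and the package size, domain age, and topic similarity are assumed to be supplied by other components.

```python
import re
from urllib.parse import urlsplit

COMMON_TLDS = {"com", "cn", "net", "org"}  # assumed list of common top-level domains
AGE_THRESHOLD_DAYS = 90                     # assumed registration-age threshold
SIMILARITY_THRESHOLD = 0.5                  # assumed topic-similarity threshold


def build_feature_vector(link, package_bytes, domain_age_days, topic_similarity):
    """Return the nine features as a list (the element order is an assumption)."""
    parts = urlsplit(link)
    host = parts.hostname or ""
    segments = [s for s in parts.path.split("/") if s]
    labels = host.split(".") if host else []
    return [
        len(link),                                            # link length
        len(segments),                                        # number of path levels
        sum(ch.isdigit() for ch in parts.path),               # digits in the link path
        package_bytes,                                        # installation package size
        len(labels),                                          # number of domain name levels
        int(bool(re.search(r"[\u4e00-\u9fff]", link))),       # contains Chinese characters
        int(bool(labels) and labels[-1] not in COMMON_TLDS),  # uncommon top-level domain
        int(domain_age_days < AGE_THRESHOLD_DAYS),            # recently registered domain
        int(topic_similarity > SIMILARITY_THRESHOLD),         # topic similarity above threshold
    ]
```

The four yes/no features are encoded as 0 or 1 so that the whole vector is numeric and can be fed to either classifier.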
Optionally, step A4 includes: performing classification training on the link features using the logistic regression algorithm and the naive Bayes model, respectively. In particular, to reduce the misjudgment rate of links, a relatively high threshold is set for the logistic regression model during training.
Optionally, step A5 includes: extracting the features of the link and the text topic similarity feature to form the feature vector, and detecting with the logistic regression classifier to obtain a prediction result. If the link is judged to be a malicious link, the text is added to the text library in the text sensitive information processing module. If the link is judged to be a normal link, detection is then performed with the naive Bayes classifier to obtain a prediction result: if the link is judged to be a malicious link, a malicious-link result is returned and the text is added to the text library in the text sensitive information processing module; if the link is judged to be a normal link, a normal-link result is returned and the detection ends.
The text in step A2 may specifically be processed as follows:
The text is extracted by a regular expression and processed with existing Chinese error correction technology; the processed text is then segmented with existing word segmentation technology; the resulting groups of words are matched against the text library in the text sensitive information module by multi-pattern matching; and the text topic similarity is obtained by calculation. The text library is built from information texts that carry malware links. The text topic similarity is defined as follows: assuming a text yields n groups of words after word segmentation, of which m groups match the text library, the text topic similarity is $m/n$. The text topic similarity serves as one feature in constructing the feature classifier.
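The similarity calculation can be sketched as follows, assuming the similarity is the fraction of segmented words that match the text library (m matched out of n). The function name is hypothetical, the words are assumed to be already corrected and segmented, and a plain set lookup stands in for a real multi-pattern matching algorithm.

```python
def topic_similarity(words, sensitive_library):
    """Fraction of segmented words found in the sensitive text library (m / n)."""
    if not words:
        return 0.0  # no words, no evidence of a sensitive topic
    matched = sum(1 for word in words if word in sensitive_library)
    return matched / len(words)
```

For example, if three of four segmented words appear in the library, the similarity is 0.75.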
The following describes in further detail how the logistic regression algorithm and the naive Bayes model are each used for classification to determine whether a download link is a malicious link.
The link features in the feature vector include the length of the link, the number of path levels in the link, the number of digits in the link path, the size of the downloaded software installation package, the number of domain name levels, whether the link contains Chinese characters, whether the link domain name uses an uncommon top-level domain, and whether the time difference between the domain name registration time and the current time is less than a certain threshold. In addition, the text feature in the feature vector is whether the text topic similarity is greater than a certain threshold. The link features and the text feature together constitute the feature variables.
The features in the feature vector (corresponding to the vector elements) are described in Table 1.
Table 1
The logistic regression model is introduced in detail as follows. The relationship between the likelihood that the link under test is a mobile phone malware download link and the independent variables (the features $X_1, X_2, \dots, X_9$) can be expressed by formula (1):
$Z = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots + \theta_n X_n$ (1)
To obtain this likelihood, the probability that the link under test is a malicious link is calculated according to formula (2); the probability value lies in the range [0, 1]:
$P = \dfrac{1}{1 + e^{-Z}}$ (2)
In these formulas:
$P$: the probability that the download link under test is a malicious link;
$Z$: the weighted sum of all feature variables;
$\theta_i$ ($i = 0, 1, \dots, n$): the regression coefficients obtained from the training samples;
$n$: the number of independent variables participating in the regression analysis;
$X_i$ ($i = 1, 2, \dots, n$): the independent variables.
When constructing training samples for the logistic regression model, a sample is labeled 1 if it is a malicious link and 0 if it is a normal link. When setting the decision function, a relatively high threshold is used here to keep the misjudgment rate of normal links low. Once the probability value of the sample under test is obtained, the decision function yields the conclusion that the sample is a malicious link or a normal link.
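Formulas (1) and (2) and the thresholded decision function can be sketched as follows. The function names and the default threshold of 0.8 are assumptions, and the coefficients are assumed to have been fitted beforehand. The sigmoid is evaluated in a numerically stable form, which is an implementation choice added here.

```python
import math


def malicious_probability(theta, x):
    """theta = [theta_0, ..., theta_n], x = [X_1, ..., X_n]; returns P from formulas (1)-(2)."""
    z = theta[0] + sum(t * xi for t, xi in zip(theta[1:], x))
    # Equivalent to 1 / (1 + e^-z), written to avoid overflow for large |z|.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logistic_decision(theta, x, threshold=0.8):
    """Return True (malicious) only when P exceeds the assumed, deliberately high threshold."""
    return malicious_probability(theta, x) > threshold
```

Raising the threshold above 0.5 makes this stage flag fewer links, so fewer normal links are misjudged as malicious.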
The naive Bayes model is introduced in detail as follows. As described above, the feature attributes used for classification are $X = \{X_1, X_2, \dots, X_m\}$ (the features $X_1, X_2, \dots, X_9$), and the category set is $C = \{y_1, y_2, \dots, y_n\}$ (here, malicious link $y_1$ and normal link $y_2$). $P(y_1 \mid X), P(y_2 \mid X), \dots, P(y_n \mid X)$ are calculated separately; if $P(y_k \mid X) = \max\{P(y_1 \mid X), P(y_2 \mid X), \dots, P(y_n \mid X)\}$, then $X \in y_k$. $P(y_k \mid X)$ is obtained from Bayes' theorem, shown in formula (3):
$P(y_k \mid X) = \dfrac{P(X \mid y_k)\,P(y_k)}{P(X)}$ (3)
The conditional probability estimate of each feature attribute under each category is obtained from statistics and the Gaussian probability distribution. Because $P(X)$ is constant and the features are conditionally independent, the category of the link can be calculated according to formula (4):
$P(X \mid y_k)\,P(y_k) = P(y_k) \prod_{i=1}^{m} P(X_i \mid y_k)$ (4)
According to formula (4), if $P(y_1) \prod_{i=1}^{m} P(X_i \mid y_1) > P(y_2) \prod_{i=1}^{m} P(X_i \mid y_2)$, the link under test is a malicious link; otherwise it is judged to be a normal link.
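The naive Bayes decision can be sketched as follows, assuming a Gaussian likelihood for every feature and comparing, for the two classes, the product of the class prior and the per-feature likelihoods. The function names and the `(prior, means, variances)` parameter layout are hypothetical, and working with logarithms instead of raw products is a numerical convenience added here.

```python
import math


def gaussian_pdf(x, mean, var):
    """Gaussian density used as the per-feature conditional probability."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def class_score(x, prior, means, variances):
    """Log of P(y) * prod_i P(X_i | y); the log keeps tiny products from underflowing."""
    score = math.log(prior)
    for xi, mean, var in zip(x, means, variances):
        # The small constant guards against log(0) for extremely unlikely values.
        score += math.log(gaussian_pdf(xi, mean, var) + 1e-300)
    return score


def naive_bayes_is_malicious(x, malicious, normal):
    """malicious / normal are (prior, means, variances) tuples estimated from training data."""
    return class_score(x, *malicious) > class_score(x, *normal)
```

Because the logarithm is monotonic, comparing the two log scores gives the same answer as comparing the two products directly.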
The logistic regression model and the naive Bayes model constructed above perform detection one after the other in a serial manner, and each can judge whether a link is a mobile phone malware download link. The logistic regression classifier is placed before the naive Bayes classifier, and setting the decision function of the logistic regression model reduces the misjudgment rate of download links.
In this example, the two models, the logistic regression model and the naive Bayes model, may also each be used to calculate the probability. Only when the probabilities calculated by both models are not greater than the probability threshold is the corresponding download link determined to be a legitimate link; otherwise it is an illegitimate link.
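This alternative rule, in which both models must stay at or below the threshold, can be sketched as follows. The function name and the default threshold of 0.5 are assumptions, and both inputs are assumed to be probabilities that the link is malicious.

```python
def both_models_decision(lr_probability, nb_probability, threshold=0.5):
    """Legitimate only if neither model's malicious probability exceeds the threshold."""
    if max(lr_probability, nb_probability) <= threshold:
        return "legitimate"
    return "illegitimate"
```

Compared with the serial scheme, this rule always needs the output of both models before it can decide.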
Example two:
As shown in Fig. 4, this example provides a download link detection method, including:
Step S1: Obtain a sample to be tested;
Step S2: Extract the link and the text;
Step S3: Judge whether the link is a software download link; if so, go to step S4; otherwise, judge it to be a normal link;
Step S4: Extract the domain name of the link;
Step S5: Judge whether the domain name of the link matches a domain name in the blacklist; if so, judge the link to be a malicious link; otherwise, go to step S6;
Step S6: Judge whether the domain name of the link matches a domain name in the whitelist; if so, judge the link to be a normal link; otherwise, go to step S7;
Step S7: Extract the text feature;
Step S8: Extract the link features;
Step S9: Build the classification model. The classification model processes the feature vector formed jointly by the extracted text feature and link features to obtain a probability for judgment, which is combined with the probability threshold to judge whether the corresponding link is a malicious link. The classification model may include the logistic regression model and the naive Bayes model, but is not limited to these two; it may also be any of various models with a classification function, such as a support vector machine or a neural network.
Step S10: Judge whether the probability output by the logistic regression model is greater than the probability threshold; if so, judge the link to be a malicious link; if not, go to step S11.
Step S11: Judge whether the output of the naive Bayes model satisfies the malicious link judgment condition; if so, judge the link to be a malicious link; otherwise, judge it to be a normal link.
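The control flow of steps S3 to S11 can be sketched as follows. The function name and parameters are hypothetical: the download check and the domain name are assumed to have been computed already, the two lists are plain sets, and `classify` stands for any callable that performs steps S7 to S11 and returns True for a malicious link.

```python
def detect_sample(link, is_download, domain, blacklist, whitelist, classify):
    """Follow the order of steps S3-S11 and return 'normal' or 'malicious'."""
    if not is_download:       # S3: not a software download link
        return "normal"
    if domain in blacklist:   # S5: known malicious domain
        return "malicious"
    if domain in whitelist:   # S6: known trusted domain
        return "normal"
    # S7-S11: feature extraction and the two classifiers, delegated to `classify`.
    return "malicious" if classify(link) else "normal"
```

The cheap checks run first, so the classifiers are only invoked for download links whose domain is in neither list.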
Example three:
As shown in Fig. 5, this example provides a method for extracting the text feature, including:
Extracting the text of the predetermined information;
Performing text error correction on the extracted text;
Performing word segmentation on the corrected text;
Performing multi-pattern matching on the words obtained by segmentation;
According to the matching result, calculating the similarity between the topic of the text and sensitive information that can induce malware downloads.
In the several embodiments provided in this application, it should be understood that the disclosed device and method may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the coupling, direct coupling, or communication connection between the components shown or discussed may be indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical, mechanical, or of other forms.
The units described above as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may all be integrated into one processing module, each unit may serve as a separate unit, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of hardware plus software functional units.
A person of ordinary skill in the art will understand that all or part of the steps of the above method embodiments may be completed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the above method embodiments. The storage medium includes various media that can store program code, such as a removable storage device, read-only memory (ROM), random access memory (RAM), a magnetic disk, or an optical disc.
The above is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the present invention shall fall within the scope of protection of the present invention. Therefore, the scope of protection of the present invention shall be subject to the scope of protection of the claims.

Claims (10)

1. A software link detection method, characterized by including:
Extracting, from predetermined information, a download link for downloading software and the text other than the download link;
Extracting the link features of the download link;
Extracting the text feature of the text;
Judging, according to the link features and the text feature, whether the download link is a malicious link that satisfies the malware download link judgment condition.
2. The method according to claim 1, characterized in that
The method further includes:
Obtaining the download domain name of the download link;
Matching the download domain name against the domain names in a domain name library;
The extracting of the link features of the download link includes:
Extracting the link features of the download link when the download domain name is in the whitelist of the domain name library;
The extracting of the text feature of the text includes:
Extracting the text feature of the text when the download domain name is in the whitelist of the domain name library.
3. The method according to claim 1 or 2, characterized in that
The judging, according to the link features and the text feature, of whether the download link is a malicious link that satisfies the malware download link judgment condition includes:
Processing the feature vector using a logistic regression model to obtain a probability characterizing that the download link is the malicious link;
When the probability is greater than a probability threshold, determining that the download link is the malicious link;
When the probability is not greater than the probability threshold, processing the feature vector using a naive Bayes model and outputting the probabilities that the download link is a malicious link and a normal link, respectively;
When the probability that the download link is the malicious link is greater than the probability that the download link is a normal link, determining that the download link is the malicious link.
4. The method according to claim 1 or 2, characterized in that
The feature vector includes at least one of the following:
The link length of the download link; the number of path levels of the download link; the number of digits contained in the link path of the download link; the data size of the software installation package corresponding to the download link; the number of domain name levels of the download link; first indication information indicating whether the download link includes a predetermined top-level domain; second indication information indicating whether the time difference between the registration time of the domain name corresponding to the download link and the current time is less than a time threshold; third indication information indicating whether the similarity is greater than a similarity threshold; and fourth indication information indicating whether the download link includes characters of a predetermined type, wherein the similarity is the degree of similarity between the text and sensitive information, as described by the text feature.
5. The method according to claim 1 or 2, characterized in that the method further includes:
When the download link is determined to be the malicious link, storing the text as sensitive information;
And/or
Adding the download link to the blacklist of a link library.
6. A software link detection device, characterized by including:
A first extraction unit, configured to extract, from predetermined information, a download link for downloading software and the text other than the download link;
A second extraction unit, configured to extract the link features of the download link;
A third extraction unit, configured to extract the text feature of the text;
A judging unit, configured to judge, according to the link features and the text feature, whether the download link is a malicious link that satisfies the malware download link judgment condition.
7. The device according to claim 6, characterized in that
The device further includes:
An acquiring unit, configured to obtain the download domain name of the download link;
A matching unit, configured to match the download domain name against the domain names in a domain name library;
The second extraction unit is specifically configured to extract the link features of the download link when the download domain name is in the whitelist of the domain name library;
The third extraction unit is specifically configured to extract the text feature of the text when the download domain name is in the whitelist of the domain name library.
8. The device according to claim 6 or 7, characterized in that
The judging unit is specifically configured to process the feature vector using a logistic regression model to obtain a probability characterizing that the download link is the malicious link;
When the probability is greater than a probability threshold, determine that the download link is the malicious link;
When the probability is not greater than the probability threshold, process the feature vector using a naive Bayes model and output the probabilities that the download link is a malicious link and a normal link, respectively;
When the probability that the download link is the malicious link is greater than the probability that the download link is a normal link, determine that the download link is the malicious link.
9. The device according to claim 6 or 7, characterized in that
The feature vector includes at least one of the following:
The link length of the download link; the number of path levels of the download link; the number of digits contained in the link path of the download link; the data size of the software installation package corresponding to the download link; the number of domain name levels of the download link; first indication information indicating whether the download link includes a predetermined top-level domain; second indication information indicating whether the time difference between the registration time of the domain name corresponding to the download link and the current time is less than a time threshold; third indication information indicating whether the similarity is greater than a similarity threshold; and fourth indication information indicating whether the download link includes characters of a predetermined type, wherein the similarity is the degree of similarity between the text and sensitive information, as described by the text feature.
10. The device according to claim 6 or 7, characterized in that the device further includes:
A storage unit, configured to, when the download link is determined to be the malicious link, store the text as sensitive information and/or add the download link to the blacklist of a link library.
CN201710250473.0A 2017-04-17 2017-04-17 software link detection method and device Pending CN108734011A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710250473.0A CN108734011A (en) 2017-04-17 2017-04-17 software link detection method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710250473.0A CN108734011A (en) 2017-04-17 2017-04-17 software link detection method and device

Publications (1)

Publication Number Publication Date
CN108734011A true CN108734011A (en) 2018-11-02

Family

ID=63924199

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710250473.0A Pending CN108734011A (en) 2017-04-17 2017-04-17 software link detection method and device

Country Status (1)

Country Link
CN (1) CN108734011A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113221032A (en) * 2021-04-08 2021-08-06 北京智奇数美科技有限公司 Link risk detection method, device and storage medium
CN113315739A (en) * 2020-02-26 2021-08-27 深信服科技股份有限公司 Malicious domain name detection method and system
CN114553486A (en) * 2022-01-20 2022-05-27 北京百度网讯科技有限公司 Illegal data processing method and device, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102932348A (en) * 2012-10-30 2013-02-13 常州大学 Real-time detection method and system of phishing website
CN104217160A (en) * 2014-09-19 2014-12-17 中国科学院深圳先进技术研究院 Method and system for detecting Chinese phishing website


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
魏琴芳等: "一种安卓系统手机恶意软件链接串行联合检测方法", 《重庆邮电大学学报(自然科学版)》 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181102