CN105095752A - Identification method, apparatus and system of virus packet - Google Patents

Publication number
CN105095752A
Authority
CN
China
Prior art keywords
packet
identified
feature
similarity
viral
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410190765.6A
Other languages
Chinese (zh)
Other versions
CN105095752B (en)
Inventor
吴鹏志
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Tencent Cloud Computing Beijing Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN201410190765.6A
Publication of CN105095752A
Application granted
Publication of CN105095752B
Legal status: Active
Anticipated expiration

Landscapes

  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The present invention discloses an identification method, apparatus and system of a virus packet. The identification method comprises: searching a plurality of to-be-identified packets for a first packet, wherein the first packet is different from the virus packets recorded in a virus sample library; obtaining a first feature of the first packet and a virus probability corresponding to the first feature, wherein the virus probability is the probability that a packet carrying the first feature is a virus; and determining whether the first packet is a virus packet by using the virus probability and the first feature. The method, apparatus and system provided by the present invention solve the prior-art problem of low efficiency in screening viruses from a large number of packets, make it possible to screen out virus packets from a large number of packets quickly and accurately, and improve the efficiency of that screening.

Description

Identification method, apparatus and system of virus packet
Technical field
The present invention relates to the field of data processing, and in particular to an identification method, apparatus and system of a virus packet.
Background
A computer virus (also simply called a virus) is a set of computer instructions or program code that is written or inserted into a computer program, destroys computer functions or data, affects the use of the computer, and can replicate itself. As technology has developed, such viruses have also begun to attack mobile terminals, and antivirus software originally used on computers has therefore appeared on mobile terminals as well.
Current antivirus software is a class of software for eliminating computer threats such as computer viruses, Trojan horses and malware. It monitors the system in real time and scans the disk, comparing the data passing through memory with the signatures in its own virus database to determine whether a virus is present and to remove it, thereby protecting the computer. Virus-scanning software on mobile terminals has similar functions. For example, common virus-scanning software for Android phones (such as Tencent Mobile Manager) also mainly uses virus signatures to identify malware, and these signatures are all derived by analyzing collected malicious samples.
In the prior art, the number of new Android samples appearing every day is very large. For example, daily monitoring by Tencent Mobile Manager shows more than 20,000 new Android samples per day on average. The vast majority of these new samples are safe and only a small fraction are malicious. Quickly screening out this minority of malicious samples has become a key point of the whole virus-scanning chain, yet doing so from a large number of Android samples is very labor-intensive. At present, the industry mainly identifies suspicious software by sensitive operations or permissions such as sending SMS messages, accessing the network and making calls. Normal software usually has these operations and permissions as well, so when malware is screened by sensitive operations the screened samples still contain a large number of normal samples. As a result, even after suspicious samples have been screened out, analysts must still examine a great many samples before finding the few malicious ones. The screening efficiency is low and requires a large amount of manpower and material resources.
No effective solution has yet been proposed for the above problem of low efficiency in screening viruses from a large number of packets.
Summary of the invention
Embodiments of the present invention provide an identification method, apparatus and system of a virus packet, so as to at least solve the technical problem of low efficiency in screening viruses from a large number of packets.
According to one aspect of the embodiments of the present invention, an identification method of a virus packet is provided. The identification method comprises: searching a plurality of to-be-identified packets for a first packet, wherein the first packet is different from the virus packets recorded in a virus sample library; obtaining a first feature of the first packet and a virus probability corresponding to the first feature, wherein the virus probability represents the probability that a packet carrying the first feature is a virus; and determining whether the first packet is a virus packet by using the virus probability and the first feature.
According to another aspect of the embodiments of the present invention, an identification apparatus of a virus packet is also provided. The identification apparatus comprises: a search module, configured to search a plurality of to-be-identified packets for a first packet, wherein the first packet is different from the virus packets recorded in a virus sample library; a data acquisition module, configured to obtain a first feature of the first packet and a virus probability corresponding to the first feature, wherein the virus probability represents the probability that a packet carrying the first feature is a virus; and a first determination module, configured to determine whether the first packet is a virus packet by using the virus probability and the first feature.
According to yet another aspect of the embodiments of the present invention, an identification system of a virus packet is also provided. The identification system comprises the identification apparatus of a virus packet.
With the present invention, after a first packet that is not recorded in the virus sample library is found, whether the first packet is a virus packet is determined from the first feature of the first packet and the corresponding virus probability. In other words, virus screening does not have to be performed on all to-be-identified packets, and because the screening uses the packet feature of the first packet and the virus probability corresponding to that feature, screening for suspected viruses no longer relies on human experience and is also more accurate. This solves the prior-art problem of low efficiency in screening viruses from a large number of packets, makes it possible to screen out virus packets from a large number of packets quickly and accurately, and improves the efficiency of that screening.
Brief description of the drawings
The accompanying drawings described here are provided for a further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their description are used to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of an identification method of a virus packet according to an embodiment of the present invention;
Fig. 2 is a flowchart of an optional identification method of a virus packet according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of an optional method for judging whether the similarity of two to-be-identified packets is greater than a preset threshold according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of an optional method for updating the packet feature similarity matrix according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of cluster heaps obtained by an optional cluster analysis according to an embodiment of the present invention; and
Fig. 6 is a structural diagram of an identification apparatus of a virus packet according to an embodiment of the present invention.
Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings of those embodiments. The described embodiments are obviously only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention without creative work shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
Embodiment 1
According to an embodiment of the present invention, an identification method of a virus packet is provided. It should be noted that the steps shown in the flowcharts of the drawings can be executed in a computer system, for example as a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described can be executed in an order different from the one given here.
As shown in Fig. 1, the identification method may comprise the following steps:
Step S102: search a plurality of to-be-identified packets for a first packet.
The first packet is different from the virus packets recorded in the virus sample library. In this embodiment, the virus sample library records packets that have already been found and confirmed to be virus packets. In this step, the to-be-identified packets can be compared with the packets recorded in the virus sample library (for example, by means of the virus signatures of the packets), so that the to-be-identified packets that are already recorded in the virus sample library are found (or screened out), leaving the first packet, which is not recorded in the virus sample library.
Step S104: obtain a first feature of the first packet and a virus probability corresponding to the first feature.
The virus probability represents the probability that a packet carrying the first feature is a virus. In the embodiments of this application, every to-be-identified packet has a feature corresponding to that packet, and every feature can correspond to a parameter that characterizes its virus probability.
Step S106: determine whether the first packet is a virus packet by using the virus probability and the first feature.
With the present invention, after a first packet that is not recorded in the virus sample library is found, whether the first packet is a virus packet is determined from the first feature of the first packet and the corresponding virus probability. In other words, virus screening does not have to be performed on all to-be-identified packets, and because the screening uses the packet feature of the first packet and the virus probability corresponding to that feature, screening for suspected viruses no longer relies on human experience and is also more accurate. This solves the prior-art problem of low efficiency in screening viruses from a large number of packets, makes it possible to screen out virus packets from a large number of packets quickly and accurately, and improves the efficiency of that screening.
The packet in the above embodiment can be an application package, a software package, or the like. The first feature in the above embodiment can comprise multiple sub-features, each of which characterizes one packet function (class) of the packet.
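Steps S102 to S106 can be sketched roughly as follows. This is only an illustrative sketch: the names `find_unrecorded` and `is_virus_packet`, the lookup of known viruses by feature rather than by signature, and the fixed `cutoff` decision rule are all assumptions made for the example, since the text does not specify how the virus probability is turned into a decision.

```python
from typing import Dict, FrozenSet, List, Set

Feature = FrozenSet[str]  # assumption: a feature is modelled as a set of sub-features


def find_unrecorded(packets: Dict[str, Feature], virus_library: Set[Feature]) -> List[str]:
    """Step S102: keep the packets that are not recorded in the virus sample library."""
    return [name for name, feature in packets.items() if feature not in virus_library]


def is_virus_packet(feature: Feature, virus_probability: Dict[Feature, float],
                    cutoff: float = 0.5) -> bool:
    """Steps S104 and S106: look up the probability and apply a cutoff (assumed rule)."""
    return virus_probability.get(feature, 0.0) >= cutoff


packets = {
    "app_a": frozenset({"f1", "f2"}),
    "app_b": frozenset({"f3"}),
    "app_c": frozenset({"f4", "f5"}),
}
virus_library = {frozenset({"f3"})}  # app_b is already a recorded virus packet
virus_probability = {
    frozenset({"f1", "f2"}): 0.9,
    frozenset({"f4", "f5"}): 0.1,
}

first_packets = find_unrecorded(packets, virus_library)
flagged = [name for name in first_packets
           if is_virus_packet(packets[name], virus_probability)]
```

Only the packets that survive the library lookup are scored, which mirrors the point that not every to-be-identified packet has to go through virus screening.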
In the above embodiment of the present invention, searching the plurality of to-be-identified packets for the first packet may comprise: obtaining the features of the plurality of to-be-identified packets, and performing cluster analysis on the plurality of to-be-identified packets according to the packet features, so as to divide them into multiple sets; and searching for the first packet set by set.
After determining whether the first packet is a virus packet by using the virus probability and the first feature, the identification method may comprise: when the first packet is determined to be a virus packet, determining that the first packet is a variant virus packet of a known virus packet, wherein the known virus packet is located in the set containing the first packet and is recorded in the virus sample library.
Cluster analysis is the process of grouping a collection of physical or abstract objects into multiple classes made up of similar objects. In the above embodiment of the present invention, this means assigning the to-be-identified packets to different clusters (the sets of the above embodiment) according to the similarity between them.
With the above embodiment, the features of the plurality of to-be-identified packets are obtained before the search, the packets are divided into different sets according to those features, and the first packet is then searched for set by set. When the first packet is determined to be a virus packet, the set in which it is located is therefore known, so it can be determined that the first packet is a variant virus packet of a known virus packet in that set. In this embodiment it is thus possible to determine not only whether the first packet is a virus packet, but also of which virus packet (or which class of virus packets) it is a variant. The embodiment not only screens out virus packets quickly and accurately, but also allows all virus packets to be labeled and handled by class, which provides more accurate and complete virus packet data for further virus research or analysis.
According to the above embodiment of the present invention, performing cluster analysis on the plurality of to-be-identified packets according to the packet features, so as to divide them into multiple sets, may comprise: using the packet features to judge whether the similarity of two to-be-identified packets is greater than a preset threshold; when the similarity is greater than the preset threshold, dividing the two to-be-identified packets into the same set; and when the similarity is not greater than the preset threshold, dividing the two to-be-identified packets into different sets.
In the embodiments of the present invention, the similarity of two to-be-identified packets can be used to indicate whether the two packets are similar. A hierarchical clustering method is then applied: samples (to-be-identified packets) whose similarity is greater than the preset threshold θ are gathered into one set (a heap or cluster in clustering terms), and samples whose similarity is not greater than the preset threshold are divided into different sets.
Specifically, in this embodiment the similarity of any two of the plurality of to-be-identified packets can be calculated, and the cluster analysis that yields the multiple sets is then completed by comparing each similarity with the preset threshold.
It should further be noted that the above embodiment can be implemented as follows: calculate the similarity between any two of the plurality of to-be-identified packets and judge whether the similarity is greater than the preset threshold; when the similarity is greater than the preset threshold, divide the two to-be-identified packets into the same set; when the similarity is not greater than the preset threshold, divide the two to-be-identified packets into different sets.
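The threshold rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions: the text speaks of hierarchical clustering without fixing the linkage, so the sketch merges any pair whose similarity exceeds θ using a union-find structure (a single-linkage style choice made here, not taken from the source), and `jaccard` and `cluster` are hypothetical helper names.

```python
from typing import Dict, List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity; two empty feature sets are treated as identical (assumption)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster(packets: Dict[str, Set[str]], theta: float) -> List[Set[str]]:
    """Put two packets in the same set whenever their similarity is greater than theta."""
    parent = {name: name for name in packets}

    def find(x: str) -> str:
        # Follow parent links to the representative, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    names = list(packets)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(packets[a], packets[b]) > theta:
                parent[find(a)] = find(b)

    groups: Dict[str, Set[str]] = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


samples = {
    "a": {"f1", "f2", "f3", "f4"},
    "b": {"f1", "f2", "f3", "f5"},   # shares 3 of 5 features with "a"
    "c": {"f9", "f10"},
}
clusters = cluster(samples, theta=0.5)
```

With this merging rule, two packets can end up in the same set through a chain of similar packets even if their own similarity is below θ; a stricter linkage would be an equally valid reading of the text.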
Specifically, the similarity between two to-be-identified packets can be calculated as follows:
After the features of two to-be-identified packets (denoted the first to-be-identified packet and the second to-be-identified packet) are obtained, where each feature can be represented as a feature vector, the similarity between the feature vector of the first to-be-identified packet and the feature vector of the second to-be-identified packet is calculated. This gives the similarity of the two packets, which indicates whether the two samples (that is, the two to-be-identified packets) are similar. More specifically, the Jaccard formula can be used to calculate the similarity of two to-be-identified packets:
$$J(A, B) = \frac{|\vec{A} \cap \vec{B}|}{|\vec{A} \cup \vec{B}|},$$
where $\vec{A}$ is the feature vector of sample A (the first to-be-identified packet), $\vec{B}$ is the feature vector of sample B (the second to-be-identified packet), $\vec{A} \cap \vec{B}$ represents the class features (sub-features) shared by sample A and sample B, $\vec{A} \cup \vec{B}$ represents all class features of samples A and B, $|\vec{A} \cap \vec{B}|$ is the number of class features shared by sample A and sample B, and $|\vec{A} \cup \vec{B}|$ is the number of all sub-features of samples A and B.
As the formula shows, the similarity of two samples takes values in [0, 1]. When all class features of the two samples are identical, the similarity is 1; when the two samples have no class feature in common, the similarity is 0.
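As a small worked example of the Jaccard formula, the sketch below treats each feature vector as a set of class features. The function name `jaccard` and the convention of returning 1.0 for two empty feature sets are assumptions made for the illustration.

```python
from typing import Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """J(A, B) = |A ∩ B| / |A ∪ B| over sets of class features."""
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty feature sets count as identical
    return len(a & b) / len(union)


sample_a = {"c1", "c2", "c3"}
sample_b = {"c2", "c3", "c4"}

similarity = jaccard(sample_a, sample_b)  # 2 shared features out of 4 in total
```

Here the two samples share `c2` and `c3` out of four distinct class features, so the similarity is 2/4.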
Calculating all pairwise similarities with the above formula has complexity O(N*N). It should further be noted that the method shown in Fig. 2 can reduce this amount of similarity computation and speed up the calculation of sample similarity.
Fig. 2 is a flowchart of a method for calculating the similarity of two to-be-identified packets according to an embodiment of the present invention.
As shown in Fig. 2, two of the plurality of to-be-identified packets are denoted the first to-be-identified packet and the second to-be-identified packet. Using the packet features to judge whether the similarity of the two to-be-identified packets is greater than the preset threshold may comprise:
Step S202: obtain the first packet feature of the first to-be-identified packet and the second packet feature of the second to-be-identified packet.
Each packet feature in the above embodiment can comprise one or more sub-features (class features). The first packet feature can comprise a first quantity of first sub-features, and the second packet feature can comprise a second quantity of second sub-features.
Step S204: judge whether the ratio of the first quantity to the second quantity satisfies a preset ratio.
When the ratio of the first quantity to the second quantity does not satisfy the preset ratio, step S206 is performed: determine that the similarity is not greater than the preset threshold. When the ratio of the first quantity to the second quantity satisfies the preset ratio, the similarity of the first to-be-identified packet and the second to-be-identified packet is calculated, and it is judged whether the similarity is greater than the preset threshold.
Specifically, when the samples (the to-be-identified packets) are clustered, the preset threshold θ can be used, and samples whose similarity is greater than the preset threshold are gathered into one set. Because the intersection of two feature sets is no larger than the smaller set and their union is no smaller than the larger set, it follows from the Jaccard formula above that
when $\frac{|\vec{A}|}{|\vec{B}|} < \theta$ or $\frac{|\vec{B}|}{|\vec{A}|} < \theta$,
the similarity of the first to-be-identified packet and the second to-be-identified packet is not greater than the preset threshold. Specifically, in step S204, judging whether the ratio of the first quantity to the second quantity satisfies the preset ratio means judging whether that ratio is not greater than the preset threshold: when the ratio of the first quantity to the second quantity is not greater than the preset threshold, the ratio is determined not to satisfy the preset ratio; when the ratio is greater than the preset threshold, the ratio is determined to satisfy the preset ratio.
With the above embodiment, when the similarities of multiple to-be-identified packets are calculated, the similarity of sample A and sample B need not be calculated in that case, which avoids wasted computation.
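The size-ratio check of step S204 can be sketched as a cheap pre-filter that runs before any set comparison. This is a minimal sketch: `ratio_satisfied` is a hypothetical name, and the filter relies only on the bound J(A, B) ≤ min(|A|, |B|) / max(|A|, |B|), which holds for any two non-empty sets.

```python
import random
from typing import Set


def ratio_satisfied(count_a: int, count_b: int, theta: float) -> bool:
    """Step S204: False means the similarity cannot be greater than theta."""
    small, large = sorted((count_a, count_b))
    if large == 0:
        return True  # two empty features: leave the decision to the full comparison
    # J(A, B) <= small / large, so a ratio that is not greater than theta rules the pair out.
    return small / large > theta


def jaccard(a: Set[int], b: Set[int]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


# Sanity check on random data: every pair rejected by the filter has J <= theta.
rng = random.Random(0)
theta = 0.6
violations = 0
for _ in range(2000):
    a = set(rng.sample(range(30), rng.randint(1, 20)))
    b = set(rng.sample(range(30), rng.randint(1, 20)))
    if not ratio_satisfied(len(a), len(b), theta) and jaccard(a, b) > theta:
        violations += 1
```

The filter only needs the two sub-feature counts, so it can reject a pair without touching the feature contents at all.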
As shown in Fig. 2, calculating the similarity of the first to-be-identified packet and the second to-be-identified packet and judging whether the similarity is greater than the preset threshold may comprise:
Step S208: record a third quantity, namely the number of first sub-features that differ from the second sub-features.
Step S210: judge whether the third quantity exceeds a predetermined number.
When the third quantity exceeds the predetermined number, step S206 is performed: determine that the similarity is not greater than the preset threshold. When the third quantity does not exceed the predetermined number, step S212 is performed.
Specifically, the predetermined number in the above embodiment can be determined from the preset threshold. For example, the predetermined number can be (1-θ)(|A|+|B|): while the similarity of samples A and B is being calculated, if the number of differing class features among those compared so far is found to exceed (1-θ)(|A|+|B|), the similarity of samples A and B cannot exceed the preset threshold, so the remaining class features need not be compared and the comparison ends immediately.
This embodiment allows the similarity calculation to end as early as possible, which saves computing resources.
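The early-exit comparison of steps S208 to S212 might look like the following sketch. It assumes the sub-features of each packet can be sorted and walked in a merge-like fashion, and it counts every sub-feature found in only one of the two packets as "different"; both choices, and the name `exceeds_threshold`, are assumptions for illustration.

```python
import random
from typing import Set


def exceeds_threshold(a: Set[int], b: Set[int], theta: float) -> bool:
    """Return True if J(a, b) > theta, stopping once too many sub-features differ."""
    budget = (1 - theta) * (len(a) + len(b))  # the predetermined number
    xs, ys = sorted(a), sorted(b)
    i = j = shared = different = 0
    while i < len(xs) and j < len(ys):
        if xs[i] == ys[j]:
            shared += 1
            i += 1
            j += 1
        else:
            different += 1
            if xs[i] < ys[j]:
                i += 1
            else:
                j += 1
            if different > budget:
                return False  # step S206: similarity cannot be greater than theta
    different += (len(xs) - i) + (len(ys) - j)  # leftovers exist in one packet only
    if different > budget:
        return False
    union = shared + different
    return union > 0 and shared / union > theta


# Compare against the plain Jaccard computation on random non-empty sets.
rng = random.Random(1)
mismatches = 0
for _ in range(2000):
    a = set(rng.sample(range(25), rng.randint(1, 15)))
    b = set(rng.sample(range(25), rng.randint(1, 15)))
    theta = rng.choice([0.2, 0.5, 0.8])
    if exceeds_threshold(a, b, theta) != (len(a & b) / len(a | b) > theta):
        mismatches += 1
```

Once the count of differing sub-features passes the budget, the result is already decided, so the remaining sub-features are never examined.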
It should further be noted that a Hive cluster (Hive is a data warehouse tool) can also be used to judge, from the packet features, whether the similarity of two to-be-identified packets is greater than the preset threshold.
Specifically, the number of sub-features contained in each of the plurality of to-be-identified packets can be obtained; the plurality of to-be-identified packets are mapped into different calculation tables according to that number; and the similarity between the to-be-identified packets of the different calculation groups and the other to-be-identified packets is computed from the different calculation tables in the memory of different computing nodes.
In this way, different calculation groups can use the memory of different computing nodes (such as personal terminals or mobile terminals) to calculate similarities, which spreads the similarity computation load of any single computing node, and using different computing nodes at the same time speeds up the overall similarity calculation.
Mapping the plurality of to-be-identified packets into different calculation tables according to the number may comprise: mapping the to-be-identified packets, in descending order of the number, into different work tables and map tables. Each work table is a table with a preset number of rows, and each map table is a table with a preset number of columns; one row of a work table represents the feature of one to-be-identified packet, and one column of a map table represents one to-be-identified packet. Taking left and right as seen by a reader facing this document, in a map table the number of class features of the packet in the column to the left of any column is not less than the number of class features of the packet in that column, and the number of class features of the packet in any row is not less than the number of class features of the packet in the next row.
Specifically, the job tables shown in Fig. 3 are the work tables and the map tables are the map tables described above. When the similarity between each work table and the other to-be-identified packets in the map tables is calculated, the to-be-identified packets in a job table are taken as first to-be-identified packets, the job table is placed in the memory of a computing node, and only those second to-be-identified packets whose number of sub-features satisfies the preset ratio are extracted from the map tables for calculation. As shown in Fig. 3, the plurality of to-be-identified packets can be mapped into (m+1) job tables (job0 to jobm) and (K+1) map tables (map0 to mapK); feature 0 to feature n in the tables are the numbers of the to-be-identified packets.
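The idea of ordering packets by their number of sub-features, so that only packets with compatible counts are paired, can be illustrated without Hive. The sketch below is plain single-process Python standing in for the job and map tables; the function name `candidate_pairs` and the flat dictionary of counts are assumptions for the example and do not reproduce the table layout of Fig. 3.

```python
from typing import Dict, List, Tuple


def candidate_pairs(counts: Dict[str, int], theta: float) -> List[Tuple[str, str]]:
    """Pair each packet only with packets whose sub-feature count ratio exceeds theta."""
    # Descending order of sub-feature count, as in the work and map tables.
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    pairs = []
    for i, (name_a, n_a) in enumerate(ordered):
        for name_b, n_b in ordered[i + 1:]:
            if n_b <= theta * n_a:
                # Every later packet has even fewer sub-features, so stop scanning.
                break
            pairs.append((name_a, name_b))
    return pairs


counts = {"a": 10, "b": 9, "c": 4, "d": 3}
pairs = candidate_pairs(counts, theta=0.5)
```

Because the packets are sorted by count, the inner loop can stop at the first packet that fails the ratio test instead of scanning the rest of the table.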
Step S212: calculate the similarity of the first to-be-identified packet and the second to-be-identified packet. This similarity can be a cosine similarity.
Specifically, the similarity can be calculated with the Jaccard formula of the above embodiment, which is not repeated here.
Step S214: judge whether the similarity is greater than the preset threshold.
When the similarity is greater than the preset threshold, step S216 is performed: determine that the similarity is greater than the preset threshold. When the similarity is not greater than the preset threshold, step S206 is performed: determine that the similarity is not greater than the preset threshold.
Step S218: divide the first to-be-identified packet and the second to-be-identified packet into the same set.
Step S220: divide the first to-be-identified packet and the second to-be-identified packet into different sets.
When the above similarity between two to-be-identified packets is calculated, the class features (sub-features) of each sample (to-be-identified packet) can be sorted and then hashed to obtain a feature hash, the plurality of to-be-identified packets can be grouped by feature hash, and one to-be-identified packet can be extracted from each group and compared for similarity with the packets extracted from the other groups. In this embodiment, provided the hash method is good enough, the samples in each group have the same feature hash and the similarity between samples within a group is 1. After grouping, therefore, only one sample needs to be taken from each group for the similarity calculation: the similarity between this extracted sample and the sample extracted from another group is the similarity between any sample of this group and any sample of that group. This greatly reduces the amount of similarity computation, reduces the load on the processor, and improves system performance.
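Grouping by feature hash can be sketched as follows. The choice of SHA-256, the newline separator, and the helper names `feature_hash` and `group_by_feature_hash` are assumptions made for the illustration; the text only requires a sufficiently good hash over the sorted class features.

```python
import hashlib
from collections import defaultdict
from typing import Dict, Iterable, List


def feature_hash(sub_features: Iterable[str]) -> str:
    """Hash the de-duplicated, sorted class features so that order does not matter."""
    joined = "\n".join(sorted(set(sub_features)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


def group_by_feature_hash(packets: Dict[str, List[str]]) -> Dict[str, List[str]]:
    """Group packet names by feature hash; members of a group have identical features."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, sub_features in packets.items():
        groups[feature_hash(sub_features)].append(name)
    return dict(groups)


packets = {
    "app_1": ["c3", "c1", "c2"],
    "app_2": ["c1", "c2", "c3"],   # same class features as app_1, different order
    "app_3": ["c7", "c8"],
}
groups = group_by_feature_hash(packets)
representatives = [members[0] for members in groups.values()]
```

Only the representatives need to enter the pairwise similarity calculation; every other member of a group inherits the result of its representative.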
According to the above embodiment of the present invention, as shown in Fig. 4, after the packet features have been used to judge whether the similarity of two to-be-identified packets is greater than the preset threshold, the similarities (x*x similarities) and the magnitude relationship between each similarity and the preset threshold are saved as similarity data in a packet feature similarity matrix.
When new to-be-identified packets are obtained, a first similarity matrix of the new to-be-identified packets is calculated; a second similarity matrix between the new to-be-identified packets and the plurality of to-be-identified packets is calculated; and the packet feature similarity matrix is updated using the first similarity matrix and the second similarity matrix.
Specifically, as shown in Fig. 4, X is the sample feature similarity matrix (the packet feature similarity matrix of the above embodiment) of all old samples (the plurality of to-be-identified packets, that is, the old to-be-identified packets; there are x old samples in this embodiment). Y is the first similarity matrix of the newly added samples (the new to-be-identified packets; there are y of them in this embodiment), and comprises the first similarities (y*y similarities) between any two new to-be-identified packets and the magnitude relationship between each first similarity and the preset threshold. Z is the second similarity matrix, and comprises the second similarities (x*y similarities) between each new to-be-identified packet and each old to-be-identified packet and the magnitude relationship between each second similarity and the preset threshold. Here x and y are natural numbers. The x*x, y*y and x*y similarity counts above are theoretical values; with the embodiment shown in Fig. 2 of the present invention, the number of similarities actually calculated is smaller, because some can be represented simply by the magnitude relationship between the similarity and the preset threshold.
Because the feature of each packet is fixed, the similarity between samples whose similarity has already been calculated does not need to be calculated again, so the above incremental update reduces repeated computation.
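The incremental update can be illustrated with a dictionary standing in for the similarity matrix. In this sketch, keyed pairs replace the X, Y and Z blocks of Fig. 4, the similarity value alone is stored (not its relationship to the threshold), and `update_similarity` is a hypothetical helper name; all of these are simplifying assumptions.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def update_similarity(matrix: Dict[FrozenSet[str], float],
                      features: Dict[str, Set[str]],
                      new_features: Dict[str, Set[str]]) -> None:
    """Add only the new-vs-new (Y) and old-vs-new (Z) entries; old-vs-old (X) is reused."""
    for a, b in combinations(new_features, 2):              # Y block
        matrix[frozenset((a, b))] = jaccard(new_features[a], new_features[b])
    for old_name, old_feature in features.items():          # Z block
        for new_name, new_feature in new_features.items():
            matrix[frozenset((old_name, new_name))] = jaccard(old_feature, new_feature)
    features.update(new_features)


matrix: Dict[FrozenSet[str], float] = {}
features: Dict[str, Set[str]] = {}

update_similarity(matrix, features, {"p1": {"a", "b"}, "p2": {"a", "c"}, "p3": {"d"}})
size_after_first_batch = len(matrix)   # 3 pairs among 3 packets

update_similarity(matrix, features, {"p4": {"a", "b"}, "p5": {"d", "e"}})
size_after_second_batch = len(matrix)  # 3 old + 1 new-new + 6 old-new pairs
```

The second call computes 7 similarities instead of recomputing all 10 pairs, which is the saving the incremental update is meant to provide.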
According to the above embodiment of the present invention, obtaining the features of the plurality of to-be-identified packets may comprise: obtaining a system call library, in which the characteristic data of the corresponding packet functions are stored; extracting from the system call library the characteristic data of each packet function of the corresponding to-be-identified packet; de-duplicating and sorting the characteristic data and then hashing it to obtain a sub-feature, with all sub-features together forming the feature of the to-be-identified packet; and representing the feature and sub-features of the to-be-identified packet as a feature vector and saving the feature vector in a sample feature library.
The present invention is described below with the Android system as the application scenario. Specifically, a system call (SystemInvoke) library for Android can be set up, as shown in Table 1. The characteristic data of the packet functions in the system call library can be represented as strings, and the characteristic data can be numbered "1", "2" and so on up to "66246". In this embodiment, the system call library of the application programming interface (API) can be obtained from the Android developer website or extracted from the system directory of an Android phone. The relevant files to extract include /system/framework/framework.jar (the core code of the Android system software development kit, SDK), system/framework/core.jar (the core library) and so on, and dexdump (a decompilation tool) can be used for the extraction.
Table 1:
After the multiple to-be-identified packets are obtained, for each sample (that is, each to-be-identified packet, such as an application package; an application package can contain multiple kinds of application functions, which are the packet functions of the above embodiment, and each packet function is a class), the feature data of all classes is extracted from the system call library. The extracted feature data is de-duplicated, sorted and hashed to obtain a class feature (the sub-feature of the above embodiment). All class features then form the feature of the sample (the to-be-identified packet). This feature can be represented in vector form (called a feature vector), and the feature vector is stored in the sample feature library. For example,
$\overrightarrow{vector} = (f_1, f_2, \ldots, f_n)$, where $\overrightarrow{vector}$ is the feature of the sample, $f_1, f_2, \ldots, f_n$ are the sub-features (class features) of the feature, and n is a natural number.
Here, a class describes the attributes and methods common to the objects created from it. A class is a cohesive package assembled from specific metadata (data describing the attributes of data); it describes the behavior rules of a group of objects, and those objects are called instances of the class.
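The feature construction described above (de-duplicate, sort, hash per class, then collect the class features into a vector) might look like the following sketch. The text does not name a hash function, so MD5 is an assumption here, as are the function names `class_feature` and `package_feature` and the choice to represent each class by a list of system-call numbers.

```python
import hashlib


def class_feature(api_calls):
    """Hash the de-duplicated, sorted system-call numbers of one class."""
    canonical = ",".join(str(n) for n in sorted(set(api_calls)))
    # MD5 is an assumed choice; the source only says the data is hashed.
    return hashlib.md5(canonical.encode("utf-8")).hexdigest()


def package_feature(classes):
    """Build the feature vector (f1, ..., fn) of one package.

    classes: iterable of lists of system-call numbers, one list per class.
    Identical class features are kept once and the vector is sorted so that
    it is independent of class order (an assumption of this sketch).
    """
    return tuple(sorted({class_feature(calls) for calls in classes}))
```

Two classes that invoke the same set of system calls, in any order and with any repetition, map to the same sub-feature under this scheme.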
It should be further noted that, before the first feature of the first packet and the virus probability corresponding to the first feature are obtained, the identification method may also comprise: obtaining the malicious samples and safe samples in a sample library; counting the number of malicious samples carrying a sub-feature and the number of safe samples carrying the sub-feature; and calculating the virus probability of the sub-feature with the following formula and saving the virus probability into a virus probability library:
$$w(feature) = \frac{n(feature, malware\_set)}{n(feature, malware\_set) + n(feature, safe\_set)},$$
where w(feature) is the virus probability of the sub-feature, n(feature, malware_set) is the number of malicious samples carrying the sub-feature, and n(feature, safe_set) is the number of safe samples carrying the sub-feature.
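As an illustration of the formula, the following sketch counts, for every sub-feature, how many malicious and how many safe samples carry it and then forms the ratio. The function name `virus_probabilities` and the representation of each sample as a collection of sub-features are assumptions made for this example.

```python
from collections import Counter


def virus_probabilities(malware_set, safe_set):
    """Compute w(feature) for every sub-feature seen in either sample set.

    malware_set, safe_set: iterables of samples, each sample being an
    iterable of sub-features. A sub-feature is counted once per sample.
    """
    n_mal = Counter(f for sample in malware_set for f in set(sample))
    n_safe = Counter(f for sample in safe_set for f in set(sample))
    return {
        f: n_mal[f] / (n_mal[f] + n_safe[f])
        for f in n_mal.keys() | n_safe.keys()
    }
```

For instance, a sub-feature carried by two malicious samples and one safe sample receives a virus probability of 2/3, while a sub-feature seen only in safe samples receives 0.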
In the above embodiment of the present invention, antivirus software (such as Tencent Mobile Manager) has collected and analyzed a large number of samples and saved them into the sample library. The samples in the sample library are divided into malicious samples and safe samples. A sample is considered malicious if analysis of the collected sample confirms that it really harms the property, privacy, or the like of the user of a terminal (such as a mobile phone or tablet computer); all other samples, in which no dangerous behavior has been found, are safe samples.
In the above embodiment, obtaining the first feature of the first packet and the virus probability corresponding to the first feature may comprise: reading the first feature corresponding to the first packet from the sample database; and reading the virus probability corresponding to each sub-feature of the first feature from the virus probability library.
In the above embodiment of the present invention, after the multiple sets (cluster piles, or clusters) produced by cluster analysis of the multiple to-be-identified packets are obtained, the safety distribution of the to-be-identified packets in each cluster is counted and the suspicious samples (the first packets) are filtered out. Fig. 5 shows 10 clusters, each listed with its cluster ID, sample count (the number of to-be-identified packets in the set), virus count (the number of known virus packets in the cluster), and non-virus count (the number of first packets). In the cluster with cluster ID 251765 in Fig. 5, virus packets (known virus packets, 8852 in this embodiment) and first packets (26 in this embodiment) are mixed together. Combined with the existing virus sample library, the number and proportion of viruses (known virus packets) and non-viruses (first packets) can be counted. Once the first packets have been obtained, whether each first packet is a virus packet is determined. For example, let packet p be one of the 26 first packets above. After packet p is determined to be a virus packet, it is determined that packet p is a variant virus packet of the known virus packets (the 8852 known virus packets in this embodiment); that is, the virus carried by packet p is a variant of the virus carried by those 8852 packets. If the vast majority of a cluster is viruses and only a small number are non-virus samples, those non-virus samples are highly suspicious.
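The per-cluster statistics can be sketched as follows. The text gives no numeric cut-off for "the vast majority is viruses", so the parameter `min_virus_ratio` and its default of 0.9 are purely hypothetical, as is the function name `suspicious_samples`.

```python
def suspicious_samples(clusters, virus_library, min_virus_ratio=0.9):
    """Return the non-virus samples of clusters dominated by known viruses.

    clusters: dict mapping cluster id -> list of sample ids
    virus_library: set of sample ids already recorded as viruses
    min_virus_ratio: hypothetical cut-off for "vast majority are viruses"
    """
    result = {}
    for cluster_id, members in clusters.items():
        known = [s for s in members if s in virus_library]
        unknown = [s for s in members if s not in virus_library]
        if members and unknown and len(known) / len(members) >= min_virus_ratio:
            result[cluster_id] = unknown  # candidates for variant viruses
    return result
```

With the figures quoted for cluster 251765 (8852 known virus packets and 26 unrecorded packets), the virus share is about 99.7 percent, so the 26 unrecorded packets would be returned as suspicious under this assumed cut-off.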
After the first packet is obtained, virus screening is performed one by one using the first feature of the first packet and the virus probability corresponding to each sub-feature of the first feature. Specifically, the sub-features in the first feature can be sorted by virus probability in descending order; the code corresponding to each sub-feature is then located in the first packet in that order, and each located piece of code is checked in turn to see whether it is virus code. When at least one piece of the located code is virus code, the first packet is determined to be a virus packet; when none of the located code is virus code, the first packet is determined not to be a virus packet.
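The screening order can be sketched as below. How code is located for a sub-feature and how a piece of code is judged to be virus code are not specified in a form that can be coded directly, so both are passed in as hypothetical callables (`locate_code`, `is_virus_code`); the function name `is_virus_packet` is likewise an assumption.

```python
def is_virus_packet(first_feature, virus_probability, locate_code, is_virus_code):
    """Screen one unrecorded packet, most suspicious sub-feature first.

    first_feature: iterable of sub-features of the packet
    virus_probability: dict mapping sub-feature -> w(feature)
    locate_code: callable returning the code associated with a sub-feature
    is_virus_code: callable deciding whether that code is virus code
    """
    ranked = sorted(
        first_feature,
        key=lambda f: virus_probability.get(f, 0.0),
        reverse=True,  # highest virus probability first
    )
    for sub_feature in ranked:
        if is_virus_code(locate_code(sub_feature)):
            return True  # at least one located piece of code is virus code
    return False  # none of the located code is virus code
```

Checking the most suspicious sub-features first means that, for a genuine virus packet, the analysis tends to stop after inspecting only a few pieces of code.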
Specifically, the suspicious features can be used to quickly locate the suspicious code, which is then analyzed further. To confirm whether the suspicious code is a virus, the logic of the sample's code is mainly traced to confirm whether the code can trigger dangerous behavior such as stealing private data or covertly sending fee-charging SMS messages. If so, the code of the sample is confirmed to be virus code, and the sample (the first packet) is thereby confirmed to be a virus packet.
With the sample clustering scheme of the present invention described above, virus analysts can quickly locate suspicious samples, and the number of variant viruses discovered increases greatly. In practice, many virus variants have been found with the above technical solution.
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps can be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
From the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or of course by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied as a software product. The computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and comprises instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to perform the methods of the embodiments of the present invention.
Embodiment 2
According to an embodiment of the present invention, an identification apparatus for virus packets is also provided. As shown in Figure 6, the identification apparatus may comprise: a search module 10, a data acquisition module 30, and a first determination module 50.
The search module 10 is configured to search the multiple to-be-identified packets for a first packet.
The first packet is different from the virus packets recorded in the virus sample library. In the embodiments of the present invention, the virus sample library records packets that have been found and confirmed to be viruses. In the above embodiment, the to-be-identified packets can be compared with the packets recorded in the virus sample library (this comparison can be implemented using the virus signatures of the packets), so that the to-be-identified packets already recorded in the virus sample library are found (or screened out) and the first packets, which are not recorded in the virus sample library, are obtained.
The data acquisition module 30 is configured to obtain the first feature of the first packet and the virus probability corresponding to the first feature, wherein the virus probability represents the probability that a packet carrying the first feature is a virus.
In the embodiments of the present application, each to-be-identified packet has a feature corresponding to that packet, and each feature corresponds to a parameter characterizing its virus probability.
The first determination module 50 is configured to determine, using the virus probability and the first feature, whether the first packet is a virus packet.
With the present invention, after a first packet not recorded in the virus sample library is found, whether the first packet is a virus packet is determined from its first feature and the corresponding virus probability. Virus screening therefore need not be performed on all to-be-identified packets, and because the screening uses the packet feature of the first packet and the virus probability corresponding to that feature, screening for suspected viruses no longer relies on manual experience and is more accurate. This solves the prior-art problem of low efficiency in screening viruses from a large number of packets, allows virus packets to be screened out of a large number of packets quickly and accurately, and improves the efficiency of screening viruses from a large number of packets.
The packet in the above embodiment can be an application packet, a software package, or the like. The identification apparatus can be deployed on a terminal or a server.
In the embodiments of the present invention, the terminal can be a mobile terminal (for example, a mobile phone or a tablet computer) or another type of terminal. The operating system running on the terminal can be of various types, for example the currently widely used Android system, the Windows operating system, or the iOS system, but is not limited to these.
The terminal can comprise a storage medium, and the program units stored in the storage medium can be used to perform the methods described in the above embodiments of the present invention. The terminal can also comprise a processor, which can be used to execute those program units. It is conceivable that the method or apparatus described in the present invention can be implemented by program units, by hardware, or by a combination of software and hardware.
In the above embodiment of the present invention, the search module may comprise a feature acquisition module, a cluster analysis module, and a search submodule. The feature acquisition module is configured to obtain the features of the multiple to-be-identified packets; the cluster analysis module is configured to perform cluster analysis on the multiple to-be-identified packets according to the features of the packets, so as to divide them into multiple sets; and the search submodule is configured to search for the first packet set by set. The identification apparatus may also comprise a second determination module, configured to determine, when the first packet is determined to be a virus packet, that the first packet is a variant virus packet of a known virus packet, wherein the known virus packet is located in the set containing the first packet and is recorded in the virus sample library.
After the virus probability and the first feature are used to determine whether the first packet is a virus packet, the identification method may comprise: when the first packet is determined to be a virus packet, determining that the first packet is a variant virus packet of a known virus packet, wherein the known virus packet is located in the set containing the first packet and is recorded in the virus sample library.
Cluster analysis is the process of grouping a set of physical or abstract objects into multiple classes composed of similar objects. In the above embodiment of the present invention specifically, the to-be-identified packets can be assigned to different clusters (the sets of the above embodiment) according to the similarity between them.
With the above embodiment, before packets are searched, the features of the multiple to-be-identified packets are obtained and the packets are divided into different sets according to those features; the first packet is then searched for set by set. When the first packet is determined to be a virus packet, the set in which it is located can be identified, and the first packet can therefore be determined to be a variant virus packet of the known virus packets in that set. This embodiment determines not only whether the first packet is a virus packet, but also which virus packet (or which class of virus packets) it is a variant of. With the embodiments of the present invention, virus packets can be screened out quickly and accurately, and all virus packets can also be classified, labeled and processed, which provides more accurate and complete virus packet data for further virus research or analysis.
According to the above embodiment of the present invention, the cluster analysis module may comprise: a judgment module, configured to use the features of the packets to judge whether the similarity of two to-be-identified packets is greater than a preset threshold; a first set division module, configured to place the two to-be-identified packets in the same set when the similarity is greater than the preset threshold; and a second set division module, configured to place the two to-be-identified packets in different sets when the similarity is not greater than the preset threshold.
In the embodiments of the present invention, the similarity of two to-be-identified packets can be used to indicate whether the two packets are similar. A hierarchical clustering method is then applied: samples (to-be-identified packets) whose similarity is greater than the preset threshold θ are gathered into one set (a pile or cluster in clustering terms), and samples whose similarity is not greater than the preset threshold are placed in different sets.
Specifically, in this embodiment the similarity of any two of the multiple to-be-identified packets can be calculated, and the cluster analysis of the multiple to-be-identified packets is then completed according to how each similarity compares with the preset threshold, yielding the multiple sets.
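One possible reading of this threshold-based clustering is single-linkage grouping: any two samples whose similarity exceeds θ end up in the same set, and sets are merged transitively. The sketch below takes that reading, which is an assumption (the text only names "hierarchical clustering"), and uses a union-find structure with Jaccard similarity over sets of sub-features; the name `cluster_by_threshold` is hypothetical.

```python
from itertools import combinations


def cluster_by_threshold(samples, theta):
    """Group samples so that any pair with similarity > theta shares a set.

    samples: dict mapping sample id -> set of sub-features
    theta: preset similarity threshold
    Returns a list of sets of sample ids (single-linkage assumption).
    """
    parent = {s: s for s in samples}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for p, q in combinations(samples, 2):
        a, b = samples[p], samples[q]
        union = a | b
        similarity = len(a & b) / len(union) if union else 1.0
        if similarity > theta:
            parent[find(p)] = find(q)  # merge the two sets

    clusters = {}
    for s in samples:
        clusters.setdefault(find(s), set()).add(s)
    return list(clusters.values())
```

This naive version compares every pair of samples; the pre-filtering, early termination and hash grouping discussed elsewhere in this document exist precisely to avoid that cost at scale.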
In a preferred embodiment of the present invention, the judgment module may comprise a similarity calculation module and a threshold judgment module: the similarity calculation module is configured to calculate the similarity between any two of the multiple to-be-identified packets, and the threshold judgment module is configured to judge whether the similarity is greater than the preset threshold.
Specifically, the similarity calculation module can be implemented as follows: after the features of two to-be-identified packets (denoted the first to-be-identified packet and the second to-be-identified packet) are obtained, where each feature can be represented as a feature vector, the similarity between the feature vector of the first to-be-identified packet and that of the second to-be-identified packet is calculated to obtain the similarity of the two packets, which indicates whether the two samples (the two to-be-identified packets) are similar. More specifically, the Jaccard formula can be used to calculate the similarity of two to-be-identified packets:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|},$$
where $A \cap B$ denotes the class features (sub-features) shared by samples A and B, and $A \cup B$ denotes all class features of samples A and B; $|A \cap B|$ is the number of class features shared by sample A and sample B, and $|A \cup B|$ is the number of all sub-features of samples A and B.
As the above formula shows, the similarity of two samples lies in the range [0, 1]. When all class features of the two samples are identical, the similarity is 1; when the two samples have no class feature in common, the similarity is 0.
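A minimal sketch of the Jaccard calculation on two collections of class features follows. The function name `jaccard_similarity` is hypothetical, and returning 1.0 for two empty feature sets is a convention chosen for this example rather than something stated in the text.

```python
def jaccard_similarity(a, b):
    """Jaccard similarity of two collections of class features."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # assumed convention: two empty feature sets are identical
    return len(a & b) / len(union)
```

For example, samples with class features {1, 2, 3} and {2, 3, 4} share two of four distinct features, giving a similarity of 0.5.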
According to another preferred embodiment of the present invention, the judgment module may comprise: a packet processing module, configured to denote the two to-be-identified packets as the first to-be-identified packet and the second to-be-identified packet, and to obtain the first packet feature of the first to-be-identified packet and the second packet feature of the second to-be-identified packet, wherein the first packet feature comprises a first quantity of first sub-features and the second packet feature comprises a second quantity of second sub-features; a first judgment submodule, configured to judge whether the ratio of the first quantity to the second quantity meets a preset ratio; a first determination submodule, configured to determine that the similarity is not greater than the preset threshold when the ratio does not meet the preset ratio; and a second determination submodule, configured to calculate the similarity of the first and second to-be-identified packets when the ratio meets the preset ratio, and to judge whether the similarity is greater than the preset threshold.
Specifically, when samples (to-be-identified packets) are clustered, a preset threshold can be used: samples whose similarity is greater than the preset threshold are gathered into one set. According to the Jaccard formula above, judging whether the ratio of the quantities meets the preset ratio means judging whether the ratio of the first quantity to the second quantity is not greater than the preset threshold. When the ratio of the first quantity to the second quantity is not greater than the preset threshold, the ratio is determined not to meet the preset ratio; when it is greater than the preset threshold, the ratio is determined to meet the preset ratio.
With the above embodiment, when the similarities of the multiple to-be-identified packets are calculated, the similarity of a sample A and a sample B whose quantity ratio does not meet the preset ratio need not be calculated, which avoids useless computation.
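The reason the quantity ratio suffices as a pre-check can be made explicit: the shared features number at most min(|A|, |B|) and the union numbers at least max(|A|, |B|), so the Jaccard similarity never exceeds min(|A|, |B|) / max(|A|, |B|). The sketch below assumes the ratio is taken as smaller quantity over larger quantity, which the text does not state; the function name `may_exceed_threshold` is hypothetical.

```python
def may_exceed_threshold(first_quantity, second_quantity, theta):
    """Cheap pre-check using only the sub-feature counts |A| and |B|.

    The Jaccard similarity is bounded above by min(|A|, |B|) / max(|A|, |B|).
    If that bound is not greater than theta, the full comparison is skipped.
    """
    smaller, larger = sorted((first_quantity, second_quantity))
    if larger == 0:
        return True  # both feature sets empty; leave it to the full check
    return smaller / larger > theta
```

With θ = 0.8, a sample with 10 sub-features and a sample with 100 sub-features can be rejected immediately, because their similarity can be at most 0.1.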
It should be further noted that the second determination submodule may comprise: a recording module, configured to record a third quantity, the number of first sub-features that differ from the second sub-features; a second judgment submodule, configured to judge whether the third quantity exceeds a preset number; a third determination submodule, configured to determine that the similarity is not greater than the preset threshold when the third quantity exceeds the preset number; a first calculation module, configured to calculate the similarity of the first and second to-be-identified packets when the third quantity does not exceed the preset number; a fourth determination submodule, configured to determine that the similarity is greater than the preset threshold when the calculated similarity is greater than the preset threshold; and a fifth determination submodule, configured to determine that the similarity is not greater than the preset threshold when the calculated similarity is not greater than the preset threshold.
Specifically, the preset number in the above embodiment can be determined from the preset threshold; for example, the preset number can be (1-θ)(|A|+|B|). When the similarity of samples A and B is being calculated, if the number of differing class features among those already compared is found to exceed (1-θ)(|A|+|B|), the similarity of samples A and B cannot exceed the preset threshold, so the remaining class features need not be compared and the comparison ends immediately.
This embodiment allows the similarity calculation to terminate as early as possible, saving computing resources.
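The early termination can be sketched as a merge-style walk over the two sorted feature lists that counts differing features and stops once the count passes (1-θ)(|A|+|B|). The function name `similarity_exceeds`, the use of sorted lists, and the choice to count differences on both sides are assumptions of this example; the stopping bound itself is the one given in the text.

```python
def similarity_exceeds(a, b, theta):
    """Return True if Jaccard(a, b) > theta, stopping early when possible.

    a, b: collections of hashable, orderable sub-features (e.g. hash strings).
    """
    a, b = sorted(set(a)), sorted(set(b))
    limit = (1 - theta) * (len(a) + len(b))  # preset number from the text
    i = j = shared = different = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            shared += 1
            i += 1
            j += 1
        else:
            different += 1
            if a[i] < b[j]:
                i += 1
            else:
                j += 1
        if different > limit:
            return False  # the threshold can no longer be exceeded
    different += (len(a) - i) + (len(b) - j)  # features left unmatched
    union = shared + different
    return union > 0 and shared / union > theta
```

The early exit is safe: if more than (1-θ)(|A|+|B|) features differ, fewer than θ(|A|+|B|)/2 are shared, which bounds the Jaccard similarity below θ/(2-θ) and therefore below θ.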
It should be further noted that a Hive cluster (Hive is a data warehouse tool) can also be used to judge, from the features of the packets, whether the similarity of two to-be-identified packets is greater than the preset threshold. When a Hive cluster is used in this embodiment, multiple computing terminals can be used, but the implementation is the same as in Embodiment 1 and is not repeated here.
When the similarity between two to-be-identified packets is calculated, the similarity calculation module and/or the first calculation module are also configured to sort the class features (sub-features) of each sample (to-be-identified packet) and hash them to obtain a feature hash, to group the multiple to-be-identified packets by feature hash, and then to take one to-be-identified packet from each group and calculate its similarity with the to-be-identified packets taken from the other groups. In this embodiment, provided the hash method is good enough, the samples in each group have the same feature hash and the similarity between samples within a group is 1. After grouping, therefore, only one sample is taken from each group for similarity calculation: the similarity between that sample and the sample taken from another group equals the similarity between any sample of the first group and any sample of the other group. This greatly reduces the amount of similarity computation, lowers the processor load, and improves system performance.
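The grouping step can be sketched as follows. SHA-1 is an assumed hash function (the text does not name one), sub-features are assumed to be strings, and the names `group_by_feature_hash` and `representatives` are hypothetical.

```python
import hashlib
from collections import defaultdict


def group_by_feature_hash(samples):
    """Group samples whose sorted sub-feature lists hash to the same value.

    samples: dict mapping sample id -> iterable of sub-features (strings)
    Returns a dict mapping feature hash -> list of sample ids.
    """
    groups = defaultdict(list)
    for sample_id, sub_features in samples.items():
        canonical = "|".join(sorted(set(sub_features)))
        # SHA-1 is an assumed choice of hash for this sketch.
        digest = hashlib.sha1(canonical.encode("utf-8")).hexdigest()
        groups[digest].append(sample_id)
    return dict(groups)


def representatives(groups):
    """Pick one sample per group; only these need pairwise comparison."""
    return [members[0] for members in groups.values()]
```

If many samples are repackaged copies with identical class features, the number of pairwise comparisons drops from the square of the sample count to the square of the group count.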
According to the above embodiment of the present invention, the identification apparatus may also comprise: a saving module, configured to save the similarities and the magnitude relationship between each similarity and the preset threshold into a packet feature similarity matrix; and a matrix processing module, configured to, when new to-be-identified packets are obtained, calculate a first similarity matrix of the new to-be-identified packets, calculate a second similarity matrix of the new to-be-identified packets and the multiple to-be-identified packets, and update the packet feature similarity matrix using the first similarity matrix and the second similarity matrix, wherein the first similarity matrix comprises the first similarity between any two new to-be-identified packets and the magnitude relationship between the first similarity and the preset threshold, and the second similarity matrix comprises the second similarity between each new to-be-identified packet and each to-be-identified packet and the magnitude relationship between the second similarity and the preset threshold.
Because the feature of each packet is fixed, the similarity between samples whose similarity has already been calculated need not be calculated again; the incremental update method above therefore reduces repeated computation.
In the above embodiment of the present invention, the feature acquisition module may comprise: a library acquisition module, configured to obtain the system call library, wherein the system call library stores the feature data of packet functions; an extraction module, configured to extract, from the system call library, the feature data of each packet function of a to-be-identified packet; a feature processing module, configured to de-duplicate and sort the feature data and then hash it to obtain a sub-feature, all sub-features together forming the feature of the to-be-identified packet; and a sample feature library generation module, configured to represent the feature and sub-features of the to-be-identified packet as a feature vector and save the feature vector into the sample feature library.
It should be further noted that the identification apparatus also comprises: a sample acquisition module, configured to obtain the malicious samples and safe samples in the sample feature library; a statistics module, configured to count the number of malicious samples carrying a sub-feature and the number of safe samples carrying the sub-feature; and a second calculation module, configured to calculate the virus probability of the sub-feature with the following formula and save the virus probability into the virus probability library:
$$w(feature) = \frac{n(feature, malware\_set)}{n(feature, malware\_set) + n(feature, safe\_set)},$$
where w(feature) is the virus probability of the sub-feature, n(feature, malware_set) is the number of malicious samples carrying the sub-feature, and n(feature, safe_set) is the number of safe samples carrying the sub-feature;
The data acquisition module comprises: a first reading module, configured to read the first feature corresponding to the first packet from the sample database; and a second reading module, configured to read the virus probability corresponding to each sub-feature of the first feature from the virus probability library.
In the above embodiment of the present invention, antivirus software (such as Tencent Mobile Manager) has collected and analyzed a large number of samples and saved them into the sample library. The samples in the sample library are divided into malicious samples and safe samples. A sample is considered malicious if analysis of the collected sample confirms that it really harms the property, privacy, or the like of the user of a terminal (such as a mobile phone or tablet computer); all other samples, in which no dangerous behavior has been found, are safe samples.
According to the above embodiment of the present invention, the first determination module may comprise: a sorting module, configured to sort the sub-features in the first feature by virus probability in descending order; a locating module, configured to locate, in that order, the code corresponding to each sub-feature in the first packet; a sixth determination submodule, configured to determine that the first packet is a virus packet when at least one piece of the code is virus code; and a seventh determination submodule, configured to determine that the first packet is not a virus packet when none of the code is virus code.
Specifically, the suspicious features can be used to quickly locate the suspicious code, which is then analyzed further. To confirm whether the suspicious code is a virus, the logic of the sample's code is mainly traced to confirm whether the code can trigger dangerous behavior such as stealing private data or covertly sending fee-charging SMS messages. If so, the code of the sample is confirmed to be virus code, and the sample (the first packet) is thereby confirmed to be a virus packet.
Each of the above modules corresponds to the implementation of the respective step in the method embodiment. The examples and application scenarios of the modules are the same as those of the corresponding steps, but are not limited to the content disclosed in the above embodiments. The modules can run on a terminal or mobile terminal, or on a server, and can be implemented in software or hardware.
Embodiment 3
According to an embodiment of the present invention, an identification system for virus packets is also provided. For purposes of illustration, the depicted architecture is only one example of a suitable environment and does not suggest any limitation on the scope of use or functionality of the present application. Nor should the computing system be interpreted as having any dependency on, or requirement for, any component of the system or any combination of components.
The identification system for virus packets provided by the present invention can comprise any of the identification apparatus embodiments of Embodiment 2.
With the present invention, after a first packet not recorded in the virus sample library is found, whether the first packet is a virus packet is determined from its first feature and the corresponding virus probability. Virus screening therefore need not be performed on all to-be-identified packets, and because the screening uses the packet feature of the first packet and the virus probability corresponding to that feature, screening for suspected viruses no longer relies on manual experience and is more accurate. This solves the prior-art problem of low efficiency in screening viruses from a large number of packets, allows virus packets to be screened out of a large number of packets quickly and accurately, and improves the efficiency of screening viruses from a large number of packets.
The invention described above embodiment sequence number, just to describing, does not represent the quality of embodiment.
In the above embodiment of the present invention, the description of each embodiment is all emphasized particularly on different fields, in certain embodiment, there is no the part described in detail, can see the associated description of other embodiments.
In several embodiments that the application provides, should be understood that, disclosed terminal, client, the mode by other realizes.Wherein, device embodiment described above is only schematic, the division of such as unit, be only a kind of logic function to divide, actual can have other dividing mode when realizing, such as multiple unit or set part can in conjunction with or another system can be integrated into, or some features can be ignored, or do not perform.Another point, shown or discussed coupling each other or direct-coupling or communication connection can be by some interfaces, and the indirect coupling of unit or module or communication connection can be electrical or other form.
The unit illustrated as separating component or can may not be and physically separates, and the parts as unit display can be or may not be physical location, namely can be positioned at a place, or also can be distributed in multiple network element.Some or all of unit wherein can be selected according to the actual needs to realize the object of the present embodiment scheme.
In addition, each functional unit in each embodiment of the present invention can be integrated in a processing unit, also can be that the independent physics of unit exists, also can two or more unit in a unit integrated.Above-mentioned integrated unit both can adopt the form of hardware to realize, and the form of SFU software functional unit also can be adopted to realize.
If integrated unit using the form of SFU software functional unit realize and as independently production marketing or use time, can be stored in a computer read/write memory medium.Based on such understanding, the part that technical scheme of the present invention contributes to prior art in essence in other words or all or part of of this technical scheme can embody with the form of software product, this computer software product is stored in a storage medium, comprises all or part of step of some instructions in order to make a computer equipment (can be personal computer, server or the network equipment etc.) perform each embodiment method of the present invention.And aforesaid storage medium comprises: USB flash disk, ROM (read-only memory) (ROM, Read-OnlyMemory), random access memory (RAM, RandomAccessMemory), portable hard drive, magnetic disc or CD etc. various can be program code stored medium.
The above are only preferred embodiments of the present invention. It should be noted that those of ordinary skill in the art may make several improvements and refinements without departing from the principles of the present invention, and these improvements and refinements shall also be regarded as falling within the protection scope of the present invention.

Claims (19)

1. A method for identifying a virus packet, characterized by comprising:
Searching a plurality of to-be-identified packets for a first packet, wherein the first packet is different from the virus packets recorded in a virus sample library;
Obtaining a first feature of the first packet and a virus probability corresponding to the first feature, wherein the virus probability represents the probability that a packet carrying the first feature is a virus;
Determining, by using the virus probability and the first feature, whether the first packet is a virus packet.
2. The identification method according to claim 1, characterized in that:
Searching the plurality of to-be-identified packets for the first packet comprises: obtaining features of the plurality of to-be-identified packets, and performing cluster analysis on the plurality of to-be-identified packets according to the features of the packets, so as to divide the plurality of to-be-identified packets into a plurality of sets; and searching for the first packet set by set;
After determining, by using the virus probability and the first feature, whether the first packet is a virus packet, the identification method further comprises: when the first packet is determined to be a virus packet, determining that the first packet is a variant virus packet of a known virus packet,
Wherein the known virus packet is located in the set containing the first packet and is recorded in the virus sample library.
3. The identification method according to claim 2, characterized in that performing cluster analysis on the plurality of to-be-identified packets according to the features of the packets, so as to divide the plurality of to-be-identified packets into a plurality of sets, comprises:
Judging, by using the features of the packets, whether the similarity of two to-be-identified packets is greater than a preset threshold;
When the similarity is greater than the preset threshold, placing the two to-be-identified packets in the same set;
When the similarity is not greater than the preset threshold, placing the two to-be-identified packets in different sets.
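The grouping rule of claim 3 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the patented implementation: the claims do not name a similarity metric, so Jaccard similarity over sub-feature sets is assumed; the threshold value `0.8` and the names `jaccard` and `cluster_packets` are hypothetical; and pairs above the threshold are merged transitively with union-find, which the claim does not prescribe.

```python
from itertools import combinations


def jaccard(a, b):
    """Similarity of two sub-feature sets (assumed metric, not from the claims)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_packets(features, threshold=0.8):
    """Group packet names whose pairwise similarity exceeds the threshold.

    features: dict mapping a packet name to its set of sub-features.
    Returns a list of sets of packet names.
    """
    parent = {name: name for name in features}

    def find(x):
        # Follow parent links to the set representative, halving the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in combinations(features, 2):
        if jaccard(features[a], features[b]) > threshold:
            parent[find(a)] = find(b)  # same set

    groups = {}
    for name in features:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())
```

With this sketch, a packet that lands in the same set as a packet already recorded in the virus sample library is a candidate variant in the sense of claim 2.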
4. The identification method according to claim 3, characterized in that judging, by using the features of the packets, whether the similarity of two to-be-identified packets is greater than the preset threshold comprises:
Denoting the two to-be-identified packets as a first to-be-identified packet and a second to-be-identified packet, and obtaining a first packet feature of the first to-be-identified packet and a second packet feature of the second to-be-identified packet, wherein the first packet feature comprises a first quantity of first sub-features, and the second packet feature comprises a second quantity of second sub-features;
Judging whether the ratio of the first quantity to the second quantity satisfies a preset ratio;
When the ratio of the first quantity to the second quantity does not satisfy the preset ratio, determining that the similarity is not greater than the preset threshold;
When the ratio of the first quantity to the second quantity satisfies the preset ratio, calculating the similarity of the first to-be-identified packet and the second to-be-identified packet, and judging whether the similarity is greater than the preset threshold.
5. The identification method according to claim 4, characterized in that calculating the similarity of the first to-be-identified packet and the second to-be-identified packet, and judging whether the similarity is greater than the preset threshold, comprises:
Recording a third quantity, namely the number of first sub-features that differ from the second sub-features;
Judging whether the third quantity exceeds a preset number;
When the third quantity exceeds the preset number, determining that the similarity is not greater than the preset threshold;
When the third quantity does not exceed the preset number, calculating the similarity of the first to-be-identified packet and the second to-be-identified packet;
When the similarity is greater than the preset threshold, determining that the similarity is greater than the preset threshold;
When the similarity is not greater than the preset threshold, determining that the similarity is not greater than the preset threshold.
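The two cheap pre-checks of claims 4 and 5 let most dissimilar pairs be rejected before the full similarity is computed. A minimal Python sketch follows. The concrete meaning of "satisfies a preset ratio" is not given in the claims, so the sketch assumes it means that the larger count divided by the smaller count does not exceed `ratio_limit`; the Jaccard metric, all default values, and the name `exceeds_threshold` are likewise assumptions.

```python
def exceeds_threshold(first, second, ratio_limit=2.0, max_diff=10, threshold=0.8):
    """Return True if two sub-feature sets are judged similar.

    first, second: sets of sub-features of the two packets being compared.
    """
    n1, n2 = len(first), len(second)
    if n1 == 0 or n2 == 0:
        return False
    # Claim 4: compare the sub-feature counts before anything else.
    if max(n1, n2) / min(n1, n2) > ratio_limit:
        return False
    # Claim 5: count first-packet sub-features absent from the second packet.
    differing = len(first - second)
    if differing > max_diff:
        return False
    # Only now pay for the full similarity (Jaccard is an assumed metric).
    similarity = len(first & second) / len(first | second)
    return similarity > threshold
```

The order of the checks matters only for cost: each early return avoids the set intersection and union needed by the final step.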
6. The identification method according to any one of claims 3 to 5, characterized in that, after judging, by using the features of the packets, whether the similarity of two to-be-identified packets is greater than the preset threshold, the identification method further comprises:
Saving the similarity, and the magnitude relationship between the similarity and the preset threshold, into a packet feature similarity matrix;
When new to-be-identified packets are obtained, calculating a first similarity matrix of the new to-be-identified packets; calculating a second similarity matrix of the new to-be-identified packets and the plurality of to-be-identified packets; and updating the packet feature similarity matrix by using the first similarity matrix and the second similarity matrix,
Wherein the first similarity matrix comprises a first similarity between any two of the new to-be-identified packets and the magnitude relationship between the first similarity and the preset threshold, and the second similarity matrix comprises a second similarity between each new to-be-identified packet and each to-be-identified packet and the magnitude relationship between the second similarity and the preset threshold.
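The incremental update of claim 6 avoids recomputing similarities between packets that were already compared. The Python sketch below illustrates the idea under assumptions: the matrix is stored as a dictionary keyed by unordered name pairs, Jaccard similarity stands in for the unspecified metric, packet names are assumed unique, and the names `similarity`, `pair_entry`, and `update_matrix` are hypothetical.

```python
from itertools import combinations, product


def similarity(a, b):
    """Jaccard similarity of two sub-feature sets (assumed metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def pair_entry(fa, fb, threshold):
    """One matrix cell: the similarity and whether it exceeds the threshold."""
    s = similarity(fa, fb)
    return (s, s > threshold)


def update_matrix(matrix, known, new, threshold=0.8):
    """Extend the similarity matrix with newly obtained packets.

    matrix: dict mapping frozenset({name_a, name_b}) to (similarity, above).
    known:  dict of already-processed packets, name -> sub-feature set.
    new:    dict of newly obtained packets, name -> sub-feature set.
    Both matrix and known are modified in place.
    """
    # First similarity matrix: new packets against each other.
    first = {frozenset((a, b)): pair_entry(new[a], new[b], threshold)
             for a, b in combinations(new, 2)}
    # Second similarity matrix: each new packet against each known packet.
    second = {frozenset((a, b)): pair_entry(new[a], known[b], threshold)
              for a, b in product(new, known)}
    matrix.update(first)
    matrix.update(second)
    known.update(new)
    return matrix
```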
7. The identification method according to claim 2, characterized in that obtaining the features of the plurality of to-be-identified packets comprises:
Obtaining a system call library, wherein the system call library stores characteristic data of the functions of the corresponding packets;
Extracting, from the system call library, the characteristic data of each function of the corresponding to-be-identified packet;
De-duplicating and sorting the characteristic data of the packet and then hashing it to obtain sub-features, wherein all of the sub-features together form the feature of the to-be-identified packet;
Representing the feature of the to-be-identified packet and the sub-features as a feature vector, and saving the feature vector of the packet into a sample feature library.
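The de-duplicate, sort, and hash step of claim 7 can be sketched as follows. The claim does not say which hash function is used or whether each item is hashed on its own, so this Python sketch assumes one SHA-256 digest per distinct piece of characteristic data; the name `extract_feature` and the use of plain strings as characteristic data are also assumptions.

```python
import hashlib


def extract_feature(function_data):
    """Turn a packet's per-function characteristic data into sub-features.

    function_data: iterable of strings, one per function of the packet.
    Returns a list of hex digests, one sub-feature per distinct input string.
    """
    # De-duplicate, then sort so the result does not depend on input order.
    unique_sorted = sorted(set(function_data))
    # Hash each item; SHA-256 is an assumed choice.
    return [hashlib.sha256(item.encode("utf-8")).hexdigest()
            for item in unique_sorted]
```

Sorting before hashing makes two packets with the same functions in a different order produce the same feature vector, which is what makes the vectors comparable across packets.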
8. The identification method according to claim 7, characterized in that:
Before obtaining the first feature of the first packet and the virus probability corresponding to the first feature, the identification method further comprises: obtaining malicious samples and safe samples from a sample database; counting the number of malicious samples carrying a sub-feature and the number of safe samples carrying the sub-feature; and calculating the virus probability of the sub-feature by using the following formula, and saving the virus probability into a virus probability library:
The formula is:
$$w(\mathrm{feature}) = \frac{n(\mathrm{feature}, \mathrm{malware\_set})}{n(\mathrm{feature}, \mathrm{malware\_set}) + n(\mathrm{feature}, \mathrm{safe\_set})},$$
Wherein w(feature) is the virus probability of the sub-feature, n(feature, malware_set) is the number of malicious samples carrying the sub-feature, and n(feature, safe_set) is the number of safe samples carrying the sub-feature;
Obtaining the first feature of the first packet and the virus probability corresponding to the first feature comprises: reading the first feature corresponding to the first packet from the sample database; and reading, from the virus probability library, the virus probability corresponding to each sub-feature in the first feature.
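The formula of claim 8 is a simple frequency ratio and can be computed in one pass over the labelled samples. The Python sketch below is an illustration only: representing each sample as a collection of sub-features and the name `virus_probabilities` are assumptions.

```python
from collections import Counter


def virus_probabilities(malware_set, safe_set):
    """Compute w(feature) for every sub-feature seen in either sample set.

    malware_set, safe_set: iterables of samples, each a collection of sub-features.
    Returns a dict mapping sub-feature -> probability in [0, 1].
    """
    # n(feature, malware_set): malicious samples carrying the sub-feature.
    malware_counts = Counter(f for sample in malware_set for f in set(sample))
    # n(feature, safe_set): safe samples carrying the sub-feature.
    safe_counts = Counter(f for sample in safe_set for f in set(sample))

    probabilities = {}
    for f in malware_counts.keys() | safe_counts.keys():
        m, s = malware_counts[f], safe_counts[f]
        probabilities[f] = m / (m + s)  # denominator >= 1 for any seen feature
    return probabilities
```

For example, a sub-feature carried by 2 malicious samples and no safe samples gets w = 2 / (2 + 0) = 1.0, while one carried by 1 malicious and 1 safe sample gets w = 0.5.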
9. The identification method according to claim 1, characterized in that determining, by using the virus probability and the first feature, whether the first packet is a virus packet comprises:
Sorting the sub-features in the first feature in descending order of virus probability, and locating, in that order, the code corresponding to each sub-feature in the first packet;
When at least one of the codes is virus code, determining that the first packet is a virus packet;
When none of the codes is virus code, determining that the first packet is not a virus packet.
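The probability-ordered inspection of claim 9 can be sketched as follows. How code is located for a sub-feature and how code is judged to be virus code are not specified in the claims, so both are passed in as hypothetical callbacks (`locate_code`, `is_viral_code`); the early return on the first hit is also an assumption about how the ordering is exploited.

```python
def is_virus_packet(sub_features, probabilities, locate_code, is_viral_code):
    """Decide whether a packet is a virus packet by inspecting likely code first.

    sub_features:  sub-features of the first packet.
    probabilities: dict mapping sub-feature -> virus probability.
    locate_code:   callback returning the code for a sub-feature, or None.
    is_viral_code: callback returning True if the given code is virus code.
    """
    # Highest virus probability first, so a hit is usually found early.
    ordered = sorted(sub_features,
                     key=lambda f: probabilities.get(f, 0.0),
                     reverse=True)
    for feature in ordered:
        code = locate_code(feature)
        if code is not None and is_viral_code(code):
            return True   # at least one code is virus code
    return False          # none of the codes is virus code
```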
10. A device for identifying a virus packet, characterized by comprising:
A search module, configured to search a plurality of to-be-identified packets for a first packet, wherein the first packet is different from the virus packets recorded in a virus sample library;
A data acquisition module, configured to obtain a first feature of the first packet and a virus probability corresponding to the first feature, wherein the virus probability represents the probability that a packet carrying the first feature is a virus;
A first determination module, configured to determine, by using the virus probability and the first feature, whether the first packet is a virus packet.
11. The identification device according to claim 10, characterized in that:
The search module comprises a feature acquisition module, a cluster analysis module, and a search submodule, wherein the feature acquisition module is configured to obtain features of the plurality of to-be-identified packets; the cluster analysis module is configured to perform cluster analysis on the plurality of to-be-identified packets according to the features of the packets, so as to divide the plurality of to-be-identified packets into a plurality of sets; and the search submodule is configured to search for the first packet set by set;
The identification device further comprises a second determination module, configured to determine, when the first packet is determined to be a virus packet, that the first packet is a variant virus packet of a known virus packet,
Wherein the known virus packet is located in the set containing the first packet and is recorded in the virus sample library.
12. The identification device according to claim 11, characterized in that the cluster analysis module comprises:
A judging module, configured to judge, by using the features of the packets, whether the similarity of two to-be-identified packets is greater than a preset threshold;
A first set division module, configured to place the two to-be-identified packets in the same set when the similarity is greater than the preset threshold;
A second set division module, configured to place the two to-be-identified packets in different sets when the similarity is not greater than the preset threshold.
13. The identification device according to claim 12, characterized in that the judging module comprises:
A packet processing module, configured to denote the two to-be-identified packets as a first to-be-identified packet and a second to-be-identified packet, and to obtain a first packet feature of the first to-be-identified packet and a second packet feature of the second to-be-identified packet, wherein the first packet feature comprises a first quantity of first sub-features, and the second packet feature comprises a second quantity of second sub-features;
A first judging submodule, configured to judge whether the ratio of the first quantity to the second quantity satisfies a preset ratio;
A first determination submodule, configured to determine that the similarity is not greater than the preset threshold when the ratio of the first quantity to the second quantity does not satisfy the preset ratio;
A second determination submodule, configured to calculate, when the ratio of the first quantity to the second quantity satisfies the preset ratio, the similarity of the first to-be-identified packet and the second to-be-identified packet, and to judge whether the similarity is greater than the preset threshold.
14. The identification device according to claim 13, characterized in that the second determination submodule comprises:
A recording module, configured to record a third quantity, namely the number of first sub-features that differ from the second sub-features;
A second judging submodule, configured to judge whether the third quantity exceeds a preset number;
A third determination submodule, configured to determine that the similarity is not greater than the preset threshold when the third quantity exceeds the preset number;
A first calculation module, configured to calculate the similarity of the first to-be-identified packet and the second to-be-identified packet when the third quantity does not exceed the preset number;
A fourth determination submodule, configured to determine that the similarity is greater than the preset threshold when the similarity is greater than the preset threshold;
A fifth determination submodule, configured to determine that the similarity is not greater than the preset threshold when the similarity is not greater than the preset threshold.
15. The identification device according to any one of claims 12 to 14, characterized in that the identification device further comprises:
A saving module, configured to save the similarity, and the magnitude relationship between the similarity and the preset threshold, into a packet feature similarity matrix;
A matrix processing module, configured to, when new to-be-identified packets are obtained, calculate a first similarity matrix of the new to-be-identified packets; calculate a second similarity matrix of the new to-be-identified packets and the plurality of to-be-identified packets; and update the packet feature similarity matrix by using the first similarity matrix and the second similarity matrix,
Wherein the first similarity matrix comprises a first similarity between any two of the new to-be-identified packets and the magnitude relationship between the first similarity and the preset threshold, and the second similarity matrix comprises a second similarity between each new to-be-identified packet and each to-be-identified packet and the magnitude relationship between the second similarity and the preset threshold.
16. The identification device according to claim 11, characterized in that the feature acquisition module comprises:
A library acquisition module, configured to obtain a system call library, wherein the system call library stores characteristic data of the functions of the corresponding packets;
An extraction module, configured to extract, from the system call library, the characteristic data of each function of the corresponding to-be-identified packet;
A feature processing module, configured to de-duplicate and sort the characteristic data of the packet and then hash it to obtain sub-features, wherein all of the sub-features together form the feature of the to-be-identified packet;
A sample feature library generation module, configured to represent the feature of the to-be-identified packet and the sub-features as a feature vector, and to save the feature vector of the packet into a sample feature library.
17. The identification device according to claim 16, characterized in that:
The identification device further comprises: a sample acquisition module, configured to obtain malicious samples and safe samples from a sample database; a statistics module, configured to count the number of malicious samples carrying a sub-feature and the number of safe samples carrying the sub-feature; and a second calculation module, configured to calculate the virus probability of the sub-feature by using the following formula, and to save the virus probability into a virus probability library:
The formula is:
$$w(\mathrm{feature}) = \frac{n(\mathrm{feature}, \mathrm{malware\_set})}{n(\mathrm{feature}, \mathrm{malware\_set}) + n(\mathrm{feature}, \mathrm{safe\_set})},$$
Wherein w(feature) is the virus probability of the sub-feature, n(feature, malware_set) is the number of malicious samples carrying the sub-feature, and n(feature, safe_set) is the number of safe samples carrying the sub-feature;
The data acquisition module comprises: a first reading module, configured to read the first feature corresponding to the first packet from the sample database; and a second reading module, configured to read, from the virus probability library, the virus probability corresponding to each sub-feature in the first feature.
18. The identification device according to claim 10, characterized in that the first determination module comprises:
A sorting module, configured to sort the sub-features in the first feature in descending order of virus probability;
A locating module, configured to locate, in that order, the code corresponding to each sub-feature in the first packet;
A sixth determination submodule, configured to determine that the first packet is a virus packet when at least one of the codes is virus code;
A seventh determination submodule, configured to determine that the first packet is not a virus packet when none of the codes is virus code.
19. A system for identifying a virus packet, characterized by comprising: the device for identifying a virus packet according to any one of claims 10 to 18.
CN201410190765.6A 2014-05-07 2014-05-07 The recognition methods of viral data packet, apparatus and system Active CN105095752B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410190765.6A CN105095752B (en) 2014-05-07 2014-05-07 The recognition methods of viral data packet, apparatus and system

Publications (2)

Publication Number Publication Date
CN105095752A true CN105095752A (en) 2015-11-25
CN105095752B CN105095752B (en) 2019-01-08

Family

ID=54576160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410190765.6A Active CN105095752B (en) 2014-05-07 2014-05-07 The recognition methods of viral data packet, apparatus and system

Country Status (1)

Country Link
CN (1) CN105095752B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109598128A (en) * 2018-12-11 2019-04-09 郑州云海信息技术有限公司 A kind of method and device of scanography
CN112464235A (en) * 2020-11-26 2021-03-09 西京学院 Computer network safety control system and control method

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1752888A (en) * 2005-11-08 2006-03-29 朱林 Virus characteristics extraction and detection system and method for mobile/intelligent terminal
CN103136477A (en) * 2013-03-06 2013-06-05 北京奇虎科技有限公司 Scanning method and scanning system for file samples
CN103473506A (en) * 2013-08-30 2013-12-25 北京奇虎科技有限公司 Method and device of recognizing malicious APK files
CN103679012A (en) * 2012-09-03 2014-03-26 腾讯科技(深圳)有限公司 Clustering method and device of portable execute (PE) files
CN103679019A (en) * 2012-09-10 2014-03-26 腾讯科技(深圳)有限公司 Malicious file identifying method and device

Also Published As

Publication number Publication date
CN105095752B (en) 2019-01-08

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230712

Address after: 518000 Tencent Building, No. 1 High-tech Zone, Nanshan District, Shenzhen City, Guangdong Province, 35 Floors

Patentee after: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.

Patentee after: TENCENT CLOUD COMPUTING (BEIJING) Co.,Ltd.

Address before: 2, 518000, East 403 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: TENCENT TECHNOLOGY (SHENZHEN) Co.,Ltd.
