CN105956469A - Method and device for identifying file security - Google Patents

Method and device for identifying file security

Info

Publication number
CN105956469A
Authority
CN
China
Prior art keywords
file
apk file
characteristic
characteristic vector
length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610270523.7A
Other languages
Chinese (zh)
Other versions
CN105956469B (en)
Inventor
陈治宇
周吉文
徐超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201610270523.7A
Publication of CN105956469A
Application granted
Publication of CN105956469B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55Detecting local intrusion or implementing counter-measures
    • G06F21/56Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/562Static detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2221/00Indexing scheme relating to security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F2221/03Indexing scheme relating to G06F21/50, monitoring users, programs or devices to maintain the integrity of platforms
    • G06F2221/033Test or assess software

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Hardware Design (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Virology (AREA)
  • Storage Device Security (AREA)

Abstract

The invention discloses a method and a device for identifying file security. A concrete embodiment of the method comprises the following steps of: extracting at least one piece of feature information, of a to-be-identified file, for identifying the file security to obtain feature vectors which respectively correspond to various pieces of feature information, wherein types of the feature vectors corresponding to the feature information are preset; the types of the feature vectors include fixed length feature vectors with a fixed length and variable length feature vectors with variable lengths; respectively inputting the obtained feature vectors as input vectors to machine learning models corresponding to the types of the feature vectors, wherein the fixed length feature vectors correspond to a fixed length input learning model, the variable length feature vectors correspond to a variable length input learning model; and determining the to-be-identified file as a virus file or a secure file through output vectors of the machine learning models. Through the embodiment, the application range for identification of the file security is broadened.

Description

Method and device for identifying file security
Technical field
The present application relates to the field of computer technology, specifically to the field of information security technology, and in particular to a method and device for identifying file security.
Background
A computer virus is a set of computer instructions or program code that its author inserts into a computer program to damage computer functions or data; it can interfere with the use of the computer and can replicate itself. In the prior art, whether an application is a virus file is generally identified by means of manually added file matching rules.
However, because traditional virus identification relies entirely on rules that were added manually in advance, a virus file that manual analysis failed to catch cannot be identified effectively, since the client lacks the corresponding identification rule. There is therefore an urgent need to broaden the application range of virus identification.
Summary of the invention
The purpose of the present application is to propose an improved method and device for identifying file security that solve the technical problem described in the Background section above.
In a first aspect, the present application provides a method for identifying file security. The method comprises: extracting at least one piece of feature information, of a to-be-identified file, for identifying file security, to obtain feature vectors that respectively correspond to the pieces of feature information, wherein the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vectors include fixed-length feature vectors, whose length is constant, and variable-length feature vectors, whose length is variable; inputting each obtained feature vector, as an input vector, to the machine learning model corresponding to the type of that feature vector, wherein fixed-length feature vectors correspond to a fixed-length input learning model and variable-length feature vectors correspond to a variable-length input learning model; and determining, from the output vectors of the machine learning models, whether the to-be-identified file is a virus file or a secure file.
In a second aspect, the present application provides a device for identifying file security. The device comprises: an extraction unit, configured to extract at least one piece of feature information, of a to-be-identified file, for identifying file security, to obtain feature vectors that respectively correspond to the pieces of feature information, wherein the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vectors include fixed-length feature vectors, whose length is constant, and variable-length feature vectors, whose length is variable; an input unit, configured to input each obtained feature vector, as an input vector, to the machine learning model corresponding to the type of that feature vector, wherein fixed-length feature vectors correspond to a fixed-length input learning model and variable-length feature vectors correspond to a variable-length input learning model; and a determination unit, configured to determine, from the output vectors of the machine learning models, whether the to-be-identified file is a virus file or a secure file.
With the method and device provided by the present application, the different types of feature vectors formed from the extracted feature information can each be processed by a corresponding machine learning model to identify the security of the file, which broadens the application range of file security identification.
Brief description of the drawings
Other features, objects, and advantages of the present application will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a diagram of an exemplary system architecture to which the present application can be applied;
Fig. 2 is a flowchart of one embodiment of the file security identification method according to the present application;
Fig. 3 is a schematic diagram of an application scenario of the file security identification method described with reference to Fig. 2;
Fig. 4 is a flowchart of another embodiment of the file security identification method according to the present application;
Fig. 5 is a schematic diagram of an application scenario of the file security identification method described with reference to Fig. 4;
Fig. 6 is a schematic structural diagram of one embodiment of the file security identification device according to the present application;
Fig. 7 is a schematic structural diagram of a computer system suitable for implementing the terminal device or server of the embodiments of the present application.
Detailed description of the embodiments
The present application is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the relevant invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the invention.
It should be noted that, provided they do not conflict, the embodiments of the present application and the features of those embodiments may be combined with one another. The present application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 shows an exemplary system architecture 100 to which embodiments of the file security identification method or the file security identification device of the present application can be applied.
As shown in Fig. 1, the system architecture 100 may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is the medium that provides communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired or wireless communication links or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 over the network 104, for example to receive or send messages. Various client applications, such as security applications, may be installed on the terminal devices 101, 102, 103.
The terminal devices 101, 102, 103 may be various electronic devices, including but not limited to smartphones, tablet computers, e-book readers, MP3 players (Moving Picture Experts Group Audio Layer III), MP4 players (Moving Picture Experts Group Audio Layer IV), laptop computers, and desktop computers.
The server 105 may be a server that provides various services, for example a background server that supports the data displayed on the terminal devices 101, 102, 103. For example, the background server may analyze and otherwise process the received data and feed the processing result (for example, an identification result) back to the terminal device.
It should be noted that the file security identification method provided by the embodiment corresponding to Fig. 2 is generally executed by the terminal devices 101, 102, 103, and some steps may also be executed by the server 105; correspondingly, the file security identification device in the embodiment corresponding to Fig. 4 is generally arranged in the terminal devices 101, 102, 103, and some units may also be arranged in the server 105.
It should be understood that the numbers of terminal devices, networks, and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers, as the implementation requires.
With continued reference to Fig. 2, a flow 200 of one embodiment of the file security identification method according to the present application is shown. The file security identification method comprises the following steps:
Step 201: extract at least one piece of feature information, of the to-be-identified file, for identifying file security, and obtain the feature vectors corresponding to the pieces of feature information.
In this embodiment, the electronic device on which the file security identification method runs (for example, a terminal device shown in Fig. 1) may extract at least one piece of feature information, of the to-be-identified file, for identifying file security, and obtain the feature vector that characterizes each piece of feature information. The feature information may include, but is not limited to, size information, time information, and file name information of the files in an application. A feature vector is a vector that characterizes such features and is formed by quantizing them. The types of feature vectors include fixed-length feature vectors, whose length is fixed, and variable-length feature vectors, whose length is not fixed. For example, some features can be characterized by a feature vector whose length is determined in advance, in which case the corresponding feature vector is a fixed-length feature vector; other features can only be characterized by a feature vector whose length cannot be determined in advance, in which case the corresponding feature vector is a variable-length feature vector. The type of the feature vector corresponding to each kind of feature information may be preset according to the characteristics of that feature information. Optionally, the generated feature vectors consist of 0s and 1s.
It should be noted that the feature vectors obtained for the pieces of feature information may all be fixed-length feature vectors, may all be variable-length feature vectors, or may include both.
In some optional implementations of this embodiment, the to-be-identified file is an Android installation package (APK) file.
In some optional implementations of this embodiment, when the to-be-identified file is an Android installation package (APK) file, the feature vectors include at least one feature vector obtained by extracting any of the following feature information: structural feature information of the APK file; permission information of the APK file; information on the services provided by the APK file; information on the events the APK file listens for; information on the class names, function names, or referenced character strings in the APK file; and feature information on the distribution of the file types of the files in the APK file. The structural features include, but are not limited to, features of packages, classes, member functions, member variables, input parameters, menus, animations, and pictures.
In some optional implementations of this embodiment, the structural feature information includes one or more of the following: the number of packages in the APK file whose name length is less than a threshold; the maximum, minimum, total, mean, and variance of the lengths of the class names in the APK file, and the ratio of the number of classes to the number of all classes in the APK file; the maximum, minimum, mean, and variance of the lengths of the member variable names in the APK file, and the ratio of the number of member variables to the number of all member variables in the APK file; the maximum, minimum, total, mean, and variance of the lengths of the member function names in the APK file, and the proportion of the number of member functions in the number of all member variables in the APK file; the ratio of the types of the member variables and of the return-value types of the member functions in the APK file to the types of all data in the APK file; the distribution of the number of input parameters of the member functions in the APK file, and the number of input parameters whose name length is less than a threshold; whether a preset character string, URL, telephone number, or number exists in the APK file; the number of windows in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the window names, the size of the windows and the determined window; the number of menus in the APK file, and the maximum, minimum, total, mean, and variance of the string lengths of the menu names; the number of animations in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the animation names, and the pixel features of the images in the animations; and the number of pictures in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the picture names, and the pixel features of the images in the pictures.
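Several of the structural features above are summary statistics over name lengths. The following is a minimal sketch of how one such group could be computed, assuming the names have already been extracted from the APK file; the function name `name_length_stats`, the default threshold, and the ordering of the output are illustrative assumptions, not taken from the patent.

```python
from statistics import mean, pvariance

def name_length_stats(names, short_threshold=3):
    """Summarize name lengths as a fixed-length list of six numbers:
    [max, min, total, mean, population variance, count below threshold].
    The threshold value and the field order are assumptions."""
    lengths = [len(n) for n in names]
    if not lengths:
        # No names at all: return an all-zero vector of the same length.
        return [0, 0, 0, 0.0, 0.0, 0]
    return [
        max(lengths),
        min(lengths),
        sum(lengths),
        mean(lengths),
        pvariance(lengths),
        sum(1 for length in lengths if length < short_threshold),
    ]

# Hypothetical class names taken from an APK file.
print(name_length_stats(["a", "b", "MainActivity", "Util"]))
```

Because the output always has six entries regardless of how many names the file contains, a feature built this way fits the fixed-length category.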
Optionally, the fixed-length feature components corresponding to the permission information may be extracted as follows: for each permission selected in advance, a component can be extracted that indicates whether the APK file has that permission. For a preset permission that the to-be-identified APK file has, the value of the corresponding component is 1; for a permission that the to-be-identified APK file does not have, the value of the corresponding component is 0.
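The permission encoding described above can be sketched as follows. This is an illustration only: the particular list of pre-selected permissions and the function name `permission_vector` are assumptions, and the declared permissions are assumed to have been read from the APK manifest already.

```python
# Pre-selected permissions; this particular selection is an assumption.
SELECTED_PERMISSIONS = [
    "android.permission.INTERNET",
    "android.permission.SEND_SMS",
    "android.permission.READ_CONTACTS",
    "android.permission.RECEIVE_BOOT_COMPLETED",
]

def permission_vector(declared, selected=SELECTED_PERMISSIONS):
    """Return one 0/1 component per pre-selected permission."""
    declared_set = set(declared)
    return [1 if p in declared_set else 0 for p in selected]

# Permissions outside the pre-selected list do not affect the vector.
print(permission_vector([
    "android.permission.SEND_SMS",
    "android.permission.INTERNET",
    "com.example.CUSTOM",
]))
```

The vector length equals the number of pre-selected permissions, so it is known before any file is examined.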
Optionally, the fixed-length feature components corresponding to the information on the events the APK file listens for may be extracted as follows: for each listened-for event selected in advance, a component can be extracted that indicates whether the APK file listens for that event. For an event that the APK listens for, the value of the corresponding component is 1; for an event that the APK does not listen for, the value of the corresponding component is 0.
Optionally, the fixed-length feature components corresponding to the feature information on the distribution of the file types of the files in the APK file may be extracted as follows: for a preset number of preset file types, such as APK, dex, jar, so, xml, icon, and png, determine whether the to-be-identified APK file contains a file of that type, and extract the corresponding fixed-length feature vector according to the result.
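As a rough sketch of the file-type check above, the presence of each preset type can be recorded as a 0/1 component. The example assumes that a file's type can be read from its extension, which is a simplification (the "icon" type in the text, for instance, is not an extension and is left out here); `PRESET_TYPES` and `file_type_vector` are hypothetical names.

```python
import os

# Preset file types, identified here by extension (an assumption).
PRESET_TYPES = ["apk", "dex", "jar", "so", "xml", "png"]

def file_type_vector(paths, preset_types=PRESET_TYPES):
    """Return 1 for each preset type that occurs among the entry paths."""
    present = {os.path.splitext(p)[1].lstrip(".").lower() for p in paths}
    return [1 if t in present else 0 for t in preset_types]

# Hypothetical entry names inside an APK archive.
entries = [
    "classes.dex",
    "lib/armeabi/libfoo.so",
    "AndroidManifest.xml",
    "res/drawable/icon.PNG",
]
print(file_type_vector(entries))
```

Since an APK file is a ZIP archive, the entry names could be obtained with `zipfile.ZipFile(path).namelist()` from the Python standard library.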
In addition, the information on the class names, function names, or referenced character strings in the APK file may also be processed into a fixed-length feature vector that represents this information.
In some optional implementations of this embodiment, the feature vectors include at least one variable-length feature vector obtained by extracting any of the following feature information: information on the function call relationships of the APK file; information on the type and name of the controls in each window of the APK file; information on the distribution of the update times of the files in the APK file; and information on the certificate categories contained in the APK file.
Optionally, the variable-length feature vector corresponding to the function call relationship information may be extracted as follows. The call relationships of all functions in the to-be-identified APK file are obtained and a function forest is generated, in which each function corresponds to a node and which contains multiple trees. After the function forest is generated, the trees can be sorted by length and a preset number of top-ranked trees selected, for example the top 100. Within those trees, a depth-first search can then be used to traverse the nodes up to a preset depth, yielding the functions that correspond to each selected tree. In this embodiment, a similarity hash algorithm, for example the SimHash algorithm, can be used to compute the hash value of a selected tree. For example, the function names of those functions, or the instructions of those functions, can be combined and used as the input of the similarity hash algorithm, from which the hash value of the selected tree is computed. The hash value is then used as the variable-length feature vector corresponding to the function call relationship information.
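The procedure above (rank the trees, traverse each to a preset depth, hash the collected function names) can be sketched as follows. Several points are assumptions made for illustration: the forest is given as a dictionary from a function name to the list of functions it calls, each tree is acyclic, the "length" of a tree is taken to be its node count, and the SimHash variant is a basic unweighted one built on MD5. None of the function names below come from the patent.

```python
import hashlib

def tree_size(forest, root):
    """Number of nodes reachable from root (used here as the tree's length)."""
    return 1 + sum(tree_size(forest, c) for c in forest.get(root, []))

def dfs_names(forest, root, max_depth):
    """Depth-first traversal collecting function names down to max_depth."""
    names = [root]
    if max_depth > 0:
        for child in forest.get(root, []):
            names.extend(dfs_names(forest, child, max_depth - 1))
    return names

def simhash(tokens, bits=64):
    """Basic SimHash: each token hash votes +1/-1 per bit; keep positive bits."""
    votes = [0] * bits
    for tok in tokens:
        digest = hashlib.md5(tok.encode("utf-8")).digest()
        h = int.from_bytes(digest[: bits // 8], "big")
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def call_forest_feature(forest, roots, top_k=100, max_depth=3):
    """One SimHash value per selected tree; the result length varies per file."""
    ranked = sorted(roots, key=lambda r: tree_size(forest, r), reverse=True)
    return [simhash(dfs_names(forest, r, max_depth)) for r in ranked[:top_k]]

# Hypothetical call relationships: main calls init and run, run calls send.
forest = {"main": ["init", "run"], "run": ["send"], "helper": []}
feature = call_forest_feature(forest, ["helper", "main"], top_k=2, max_depth=2)
print([hex(v) for v in feature])
```

Because the number of trees differs from file to file, the resulting list has no fixed length, which is consistent with treating it as a variable-length feature vector.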
For the information on the type and name of the controls in each window of the to-be-identified APK file, the character strings corresponding to the control types and names can be converted into a variable-length feature vector. For the information on the distribution of the update times of the files in the APK file, a variable-length feature vector characterizing this information can likewise be extracted.
For the digital certificate feature information of the APK file, a corresponding variable-length feature vector can be extracted, in which each component corresponds to the features of one digital certificate held by the APK. The subcomponents of each component respectively characterize the content of the CN (Common Name), OU (Organizational Unit), O (Organization), L (Locality, the city or region name), ST (State, the state or province name), and C (Country) parts of the digital certificate.
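One possible shape for the certificate feature above is a list with one component per certificate, each holding six subcomponents in the order CN, OU, O, L, ST, C. The sketch below assumes the certificate subject is available as a comma-separated string such as "CN=..., O=..., C=..." and ignores escaped commas; how the field contents are then quantized into numbers is not specified in the text and is left open here. `parse_subject` and `certificate_feature` are hypothetical helpers.

```python
FIELDS = ["CN", "OU", "O", "L", "ST", "C"]

def parse_subject(subject):
    """Split a simple 'KEY=value, KEY=value' subject string into a dict.
    Simplified: escaped commas and multi-valued fields are not handled."""
    parts = {}
    for item in subject.split(","):
        key, sep, value = item.strip().partition("=")
        if sep:
            parts[key.strip().upper()] = value.strip()
    return parts

def certificate_feature(subjects):
    """One component per certificate, six subcomponents per component."""
    return [[parse_subject(s).get(f, "") for f in FIELDS] for s in subjects]

# Hypothetical subject string of a single signing certificate.
print(certificate_feature(["CN=Android Debug, O=Android, C=US"]))
```

The outer list grows with the number of certificates in the APK file, which is why the feature is variable-length.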
Step 202: input each obtained feature vector, as an input vector, to the machine learning model corresponding to the type of that feature vector.
In this embodiment, based on the feature vectors obtained in step 201, the electronic device on which the file security identification method runs may input each obtained feature vector, as an input vector, to the machine learning model corresponding to the type of that feature vector, wherein fixed-length feature vectors correspond to a fixed-length input learning model and variable-length feature vectors correspond to a variable-length input learning model. The fixed-length input learning models and variable-length input learning models may each be created in advance according to the length characteristics of the feature vector corresponding to each piece of feature information.
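The dispatch in step 202 amounts to choosing a model by the preset type of each feature vector. A minimal sketch follows, with the models replaced by stand-in callables; the dictionary layout, the "fixed"/"variable" labels, and the function name `route` are assumptions made for this illustration.

```python
def route(features, fixed_models, variable_models):
    """Send each feature vector to the model registered for its type.
    features maps a feature name to a (kind, vector) pair."""
    outputs = []
    for name, (kind, vector) in features.items():
        if kind not in ("fixed", "variable"):
            raise ValueError(f"unknown feature vector type: {kind}")
        models = fixed_models if kind == "fixed" else variable_models
        outputs.append(models[name](vector))
    return outputs

# Stand-in models: real ones would be trained NN and RNN models.
fixed_models = {"permissions": lambda v: [sum(v) / len(v)]}
variable_models = {"call_graph": lambda v: [float(len(v))]}

features = {
    "permissions": ("fixed", [1, 0, 1]),
    "call_graph": ("variable", [7, 9, 4, 2]),
}
print(route(features, fixed_models, variable_models))
```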
Step 203: determine, from the output vectors of the machine learning models, whether the to-be-identified file is a virus file or a secure file.
In this embodiment, after the input vectors have been input to the machine learning models in step 202, the corresponding output vector can be obtained from each machine learning model. The electronic device may then use various algorithms to judge, from the output vectors of the models, whether the to-be-identified file is a virus file or a secure file. For example, each output vector can be used to determine the judgment value that the corresponding machine learning model assigns to the security of the to-be-identified file based on its feature, and the judgment values can then be aggregated according to a certain rule to obtain the final result. The aggregation may use a voting method, a weighted calculation method, or the like. In practice, the method of determining from the output vectors of the machine learning models whether the to-be-identified file is a virus file or a secure file is not limited to the algorithms listed above.
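The two aggregation rules named above, voting and weighted calculation, can be sketched as follows. The example assumes each model's judgment value is a score in [0, 1] where higher means more likely a virus; the 0.5 threshold, the weights, and the function names are illustrative assumptions.

```python
def majority_vote(scores, threshold=0.5):
    """Each model votes 'virus' if its score reaches the threshold."""
    votes = sum(1 for s in scores if s >= threshold)
    return "virus" if votes * 2 > len(scores) else "safe"

def weighted_score(scores, weights, threshold=0.5):
    """Weighted average of the scores compared against the threshold."""
    total = sum(w * s for w, s in zip(weights, scores)) / sum(weights)
    return "virus" if total >= threshold else "safe"

scores = [0.9, 0.2, 0.1]                     # hypothetical per-model scores
print(majority_vote(scores))                 # one of three models votes virus
print(weighted_score(scores, [5, 1, 1]))     # first model weighted heavily
```

The two rules can disagree on the same scores, as they do here, so the choice of rule and weights is itself a design decision.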
In some optional implementations of this embodiment, the fixed-length input learning model is a neural network (NN) model, and the variable-length input learning model is a recurrent neural network (RNN) model. When a fixed-length feature vector is input to an NN model, its length usually matches the length of the input vector of the corresponding NN model, so the fixed-length feature vector simply needs to be input at the preset length each time. When a variable-length feature vector is input to an RNN model, it can be input in units of time slices: a preset number of components is input in each time slice, and the input is repeated until the whole vector has been fed into the RNN model. Optionally, the recurrent neural network model may be a long short-term memory (LSTM) network model.
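To make the time-slice input concrete, the sketch below splits a variable-length vector into slices of a preset size and feeds them one at a time into a plain Elman-style recurrence, h = tanh(W_in x + W_rec h + b). It uses hand-picked weights in pure Python purely for illustration; a trained RNN or LSTM from a deep learning library would be used in practice, and all names here are assumptions.

```python
import math

def time_slices(vector, slice_len):
    """Split a vector into consecutive slices of slice_len components."""
    return [vector[i:i + slice_len] for i in range(0, len(vector), slice_len)]

def rnn_final_state(vector, slice_len, w_in, w_rec, bias):
    """Feed one slice per time step and return the final hidden state.
    w_in has one row per hidden unit (slice_len weights each);
    w_rec has one row per hidden unit (one weight per hidden unit)."""
    hidden = [0.0] * len(bias)
    for x in time_slices(vector, slice_len):
        hidden = [
            math.tanh(
                sum(w * xi for w, xi in zip(w_in[j], x))
                + sum(w * hi for w, hi in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden

# A 6-component vector is consumed in three time slices of 2 components.
print(time_slices([1, 0, 1, 1, 0, 0], 2))
print(rnn_final_state([1, 0, 1, 1, 0, 0], 2, [[0.5, -0.5]], [[0.8]], [0.0]))
```

The same weights are reused at every time step, which is what lets one model accept vectors of different lengths.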
In some optional implementations of this embodiment, before step 202 inputs each obtained feature vector, as an input vector, to the machine learning model corresponding to its type, the method further includes: if the length of a feature vector does not match the length of the input vector of the machine learning model, the electronic device may truncate or pad the feature vector. Truncation generally applies when the feature vector is too long: only the digits that fit the input length of the machine learning model are kept, and the truncated feature vector is used for subsequent processing. Padding applies when the length of the feature vector does not match the length of the model's input parameter: digits are appended to the feature vector (for example, zeros) so that the two lengths match. For example, for a fixed-length feature vector whose length is less than the length of the input vector of the fixed-length input learning model, padding can make the two match; for a variable-length feature vector whose length is not an integer multiple of the single-input length of the variable-length input learning model, padding can make its length an integer multiple of the model's input parameter length.
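The truncation and padding rules above can be sketched in a few lines. The function names and the choice of 0 as the default padding value are assumptions for this illustration (the text gives zero-padding only as an example).

```python
def fit_fixed(vector, target_len, pad_value=0):
    """Truncate a vector that is too long, or pad one that is too short,
    so that it has exactly target_len components."""
    kept = list(vector[:target_len])
    return kept + [pad_value] * (target_len - len(kept))

def pad_to_multiple(vector, slice_len, pad_value=0):
    """Pad a variable-length vector up to the next multiple of slice_len."""
    remainder = len(vector) % slice_len
    if remainder == 0:
        return list(vector)
    return list(vector) + [pad_value] * (slice_len - remainder)

print(fit_fixed([1, 1, 0, 1, 1], 3))        # truncated to 3 components
print(fit_fixed([1, 1], 4))                 # padded to 4 components
print(pad_to_multiple([1, 0, 1, 1, 1], 4))  # padded from 5 to 8 components
```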
In some optional implementations of this embodiment, the machine learning models may be generated by training through the following steps. First, for each file in a training sample, the various kinds of feature information of the file are extracted to obtain the corresponding feature vectors, wherein the training sample includes at least one file with a security category label, the security category label indicates whether the file is a virus file or a secure file, the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vectors include fixed-length feature vectors and variable-length feature vectors. Then, using each obtained feature vector as an input vector in combination with the security category label of each file, the machine learning models corresponding to the various kinds of feature information are generated by training, wherein the type of machine learning model trained from fixed-length feature vectors is the fixed-length input learning model, and the type of machine learning model trained from variable-length feature vectors is the variable-length input learning model.
With continued reference to Fig. 3, Fig. 3 is a schematic diagram of an application scenario of the file security identification method according to this embodiment. In the application scenario of Fig. 3, the electronic device may extract the pieces of feature information from the to-be-identified APK file to obtain the feature vectors that represent them. These feature vectors may include fixed-length feature vectors 1 ... n, corresponding respectively to feature information such as the structural feature information and the permission information of the APK file, and may also include variable-length feature vectors 1 ... m, corresponding respectively to feature information such as the function call relationship information of the APK file and the information on the certificate categories it contains. The electronic device then inputs the extracted feature vectors to the corresponding machine learning models, wherein the machine learning models corresponding to fixed-length feature vectors 1 ... n are NN models 1 ... n, and the machine learning models corresponding to variable-length feature vectors 1 ... m are RNN models 1 ... m. Finally, the electronic device can collect the output vectors of the machine learning models and perform a calculation in an output layer to obtain an output value, which indicates whether the to-be-identified file is a virus file or a secure file.
The method provided by the above embodiment of the present application extracts feature vectors of fixed or non-fixed length from the features, and can process both types of feature vectors accordingly to obtain an identification result, which broadens the application range of file security identification.
With further reference to Fig. 4, it illustrates another embodiment of file security recognition methods Flow process 400.The flow process 400 of this document safety recognition methods, comprises the following steps:
Step 401, extracts at least one of file to be identified for the spy of file security identification Reference ceases, and obtains and various characteristic information characteristics of correspondence vector.
In the present embodiment, the concrete process of step 401 is referred to Fig. 2 correspondence embodiment Step 201.
Step 402, using each obtained characteristic vector as input vector be separately input into The machine learning model that the type of characteristic vector is corresponding.
In the present embodiment, the concrete process of step 402 is referred to Fig. 2 correspondence embodiment Step 202.
Step 403, inputs the output vector of each machine learning model to presetting fixed length input Machine learning model.
In the present embodiment, based in step 402 each machine learning model generate output to Amount, electronic equipment can be using this output vector as input vector input to presetting fixed length input machine In device learning model.
Step 404, is determined by the output vector of this default fixed length input machine learning model and treats Identify that file is virus document or secure file.
In the present embodiment, based on step 403 is preset the defeated of fixed length input machine learning model Outgoing vector, electronic equipment can based on this output vector determine file to be identified be virus document or Secure file.Generally, this default fixed length input machine learning model can be to be generated by training 's.
With continued reference to Fig. 5, Fig. 5 is a schematic diagram of an application scenario of the file security identification method according to this embodiment. Unlike the application scenario described with reference to Fig. 3, in this scenario the output vectors of the machine learning models, namely NN models 1 ... n and RNN models 1 ... m, are input as input vectors into NN model n+1, and the output vector of NN model n+1 is then used to judge whether the APK file to be identified is a virus file or a safe file.
Compared with the embodiment corresponding to Fig. 2, the method provided by this embodiment of the application inputs the output vectors generated by the machine learning models, as input vectors, into a further fixed-length-input machine learning model, so that the combination of machine learning models can further improve the accuracy of identification.
With further reference to Fig. 6, as an implementation of the methods shown in the above figures, the present application provides an embodiment of a file security identification device. This device embodiment corresponds to the method embodiment shown in Fig. 2, and the device may be applied in various servers.
As shown in Fig. 6, the file security identification device 600 of this embodiment includes an extraction unit 601, an input unit 602, and a determining unit 603. The extraction unit 601 is configured to extract at least one kind of feature information, used for file security identification, from the file to be identified, and to obtain the feature vectors respectively corresponding to the kinds of feature information, where the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vector include a fixed-length feature vector whose length is constant and a variable-length feature vector whose length is variable. The input unit 602 is configured to input each obtained feature vector, as an input vector, into the machine learning model corresponding to the type of the feature vector, where a fixed-length feature vector corresponds to a fixed-length-input learning model and a variable-length feature vector corresponds to a variable-length-input learning model. The determining unit 603 is configured to determine, from the output vectors of the machine learning models, whether the file to be identified is a virus file or a safe file.
In this embodiment, for the specific processing of the extraction unit 601, the input unit 602, and the determining unit 603 of the file security identification device 600, refer to the embodiment corresponding to Fig. 2.
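As a non-limiting illustration, the cooperation of the input unit and the determining unit can be sketched in Python as follows. The class `FeatureVector`, the functions `run_first_stage` and `decide`, and the string tags "fixed" and "variable" are hypothetical names introduced only for this sketch. Two further simplifying assumptions are made: each model returns a single score rather than a general output vector, and the scores are averaged against a 0.5 threshold, which is just one possible way of deriving the result from the outputs.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence

# A model is represented here as any callable from a vector to a score.
Model = Callable[[Sequence[float]], float]


@dataclass
class FeatureVector:
    kind: str                 # kind of feature information, e.g. "permissions"
    vector_type: str          # "fixed" or "variable", preset per kind
    values: Sequence[float]


def run_first_stage(vectors: Sequence[FeatureVector],
                    fixed_models: Dict[str, Model],
                    variable_models: Dict[str, Model]) -> List[float]:
    """Input unit: send each vector to the model matching its preset type."""
    scores: List[float] = []
    for fv in vectors:
        if fv.vector_type == "fixed":
            model = fixed_models[fv.kind]
        elif fv.vector_type == "variable":
            model = variable_models[fv.kind]
        else:
            raise ValueError(f"unknown vector type: {fv.vector_type}")
        scores.append(model(fv.values))
    return scores


def decide(scores: Sequence[float], threshold: float = 0.5) -> str:
    """Determining unit: average the scores (an assumption of this sketch)."""
    if not scores:
        raise ValueError("at least one model output is required")
    return "virus" if sum(scores) / len(scores) >= threshold else "safe"
```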
In some optional implementations of this embodiment, the file to be identified is an Android installation package (APK) file. The feature vectors include at least one fixed-length feature vector obtained by extracting any of the following feature information: structural feature information of the APK file; permission information of the APK file; information on the services provided by the APK file; information on the events monitored by the APK file; information on the class name of each class, the function name of each function, or the character strings referenced in the APK file; and feature information on the distribution of the file types of the files in the APK file.
In some optional implementations of this embodiment, the structural feature information includes one or more of the following: the number of packages in the APK file whose name length is less than a threshold; the maximum, minimum, total, mean, and variance of the lengths of the names of the classes in the APK file, and the ratio of the number of said classes to the number of all classes in the APK file; the maximum, minimum, mean, and variance of the lengths of the names of the member variables in the APK file, and the ratio of the number of said member variables to the number of all member variables in the APK file; the maximum, minimum, total, mean, and variance of the lengths of the names of the member functions in the APK file, and the proportion of the number of said member functions in the number of all member functions in the APK file; the ratio of the types of member variables and the return-value types of member functions in the APK file to the types of all data in the APK file; the distribution of the number of input parameters of the member functions in the APK file, and the number of input parameters whose name length is less than a threshold; whether the APK file contains preset character strings, URLs, telephone numbers, or numerals; the number of windows in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the window names, and the size of the windows and confirmation windows; the number of menus in the APK file, and the maximum, minimum, total, mean, and variance of the string lengths of the menu names; the number of animations in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the animation names, and the pixel features of the images in the animations; the number of pictures in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the picture names, and the pixel features of the images in the pictures.
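As a non-limiting illustration, the name-length statistics listed above can be computed as in the following Python sketch. The function names `name_length_stats` and `count_short_names` are hypothetical, the use of the population variance is an assumption (the text does not say which variance is meant), and obtaining the list of names from an APK file is outside the scope of the sketch.

```python
from statistics import mean, pvariance
from typing import Dict, Sequence


def name_length_stats(names: Sequence[str], total_count: int) -> Dict[str, float]:
    """Summarize name lengths as a fixed set of values.

    names: the names under consideration (for example, class names).
    total_count: the number of all such items in the APK file, used for the ratio.
    The result always has the same six entries, whatever the number of names,
    which is what makes it usable as part of a fixed-length feature vector.
    """
    lengths = [len(n) for n in names]
    if not lengths:
        return {"max": 0.0, "min": 0.0, "total": 0.0,
                "mean": 0.0, "variance": 0.0, "ratio": 0.0}
    return {
        "max": float(max(lengths)),
        "min": float(min(lengths)),
        "total": float(sum(lengths)),
        "mean": float(mean(lengths)),
        "variance": float(pvariance(lengths)),  # population variance (assumed)
        "ratio": len(names) / total_count if total_count else 0.0,
    }


def count_short_names(names: Sequence[str], threshold: int) -> int:
    """Count the names whose length is less than the threshold."""
    return sum(1 for n in names if len(n) < threshold)
```

Very short names are a plausible signal of obfuscated code, which is one reason such counts may be informative, although the application itself does not state this rationale.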
In some optional implementations of this embodiment, the feature vectors include at least one variable-length feature vector obtained by extracting any of the following feature information: information on the function call relationships of the APK file; information on the control types and control names of each window in the APK file; information on the distribution of the update times of the files in the APK file; and information on the certificate categories contained in the APK file.
In some optional implementations of this embodiment, the fixed-length-input learning model is a neural network model, and the variable-length-input learning model is a recurrent neural network model.
In some optional implementations of this embodiment, the determining unit 603 is further configured to: input the output vectors, as input vectors, into a preset fixed-length-input machine learning model; and determine, from the output vector of the preset fixed-length-input machine learning model, whether the file to be identified is a virus file or a safe file.
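To illustrate why a recurrent neural network suits variable-length input, the following Python sketch runs a simple Elman-style recurrent layer over a sequence of any length and returns a final hidden state whose size depends only on the weights. The function name `rnn_encode` and the weight layout are hypothetical, and the sketch is not the network architecture of the application, which is not specified at this level of detail.

```python
import math
from typing import List, Sequence


def rnn_encode(sequence: Sequence[Sequence[float]],
               w_in: Sequence[Sequence[float]],
               w_rec: Sequence[Sequence[float]],
               bias: Sequence[float]) -> List[float]:
    """Run a simple recurrent layer over a variable-length sequence.

    sequence: any number of time steps, each an input vector.
    w_in[j]:  input weights of hidden unit j.
    w_rec[j]: recurrent weights of hidden unit j.
    Returns the final hidden state, which has len(bias) entries
    regardless of how many time steps were consumed.
    """
    hidden = [0.0] * len(bias)
    for step in sequence:
        # The comprehension reads the previous hidden state; the name
        # `hidden` is rebound only after the new state is fully computed.
        hidden = [
            math.tanh(
                sum(w * x for w, x in zip(w_in[j], step))
                + sum(w * h for w, h in zip(w_rec[j], hidden))
                + bias[j]
            )
            for j in range(len(bias))
        ]
    return hidden
```

Because the output size is constant, the result of such a layer can be passed on to a model that requires fixed-length input.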
In some optional implementations of this embodiment, the electronic device further includes a training unit (not shown) for training each machine learning model, which is specifically configured to perform the following steps. First, for each file in the training samples, extract the various kinds of feature information of the file to obtain the corresponding feature vectors. The training samples include at least one file with a security category label, the security category label indicates whether the file is a virus file or a safe file, the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vector include fixed-length feature vectors and variable-length feature vectors. For the feature extraction method, refer to the feature extraction of the file to be identified in Fig. 2. Then, using the obtained feature vectors as input vectors together with the security category label of each file, train and generate the machine learning model corresponding to each kind of feature information, where the type of machine learning model trained on fixed-length feature vectors is a fixed-length-input learning model, and the type trained on variable-length feature vectors is a variable-length-input learning model.
Referring now to Fig. 7, a schematic structural diagram is shown of a computer system 700 suitable for implementing a terminal device or server of the embodiments of the present application.
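As a non-limiting illustration, the data preparation performed by the training unit can be sketched in Python as follows. The function names `build_training_sets`, `model_family`, and `training_plan`, the string tags, and the label convention (1 for a virus file, 0 for a safe file) are hypothetical. The sketch only groups labelled feature vectors per kind of feature information and selects the model family to train, and leaves the training algorithm itself open.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List, Sequence, Tuple

Extractor = Callable[[Any], Sequence[float]]
Dataset = List[Tuple[List[float], int]]


def build_training_sets(samples: Sequence[Tuple[Any, int]],
                        extractors: Dict[str, Extractor]) -> Dict[str, Dataset]:
    """For every labelled file, extract each kind of feature information.

    samples: (file, label) pairs; label 1 = virus, 0 = safe (assumed convention).
    Returns one dataset of (feature vector, label) pairs per feature kind.
    """
    datasets: Dict[str, Dataset] = defaultdict(list)
    for file_obj, label in samples:
        for kind, extract in extractors.items():
            datasets[kind].append((list(extract(file_obj)), label))
    return dict(datasets)


def model_family(vector_type: str) -> str:
    """Select the model family from the preset vector type."""
    if vector_type == "fixed":
        return "fixed-length-input model"
    if vector_type == "variable":
        return "variable-length-input model"
    raise ValueError(f"unknown vector type: {vector_type}")


def training_plan(samples: Sequence[Tuple[Any, int]],
                  extractors: Dict[str, Extractor],
                  vector_types: Dict[str, str]) -> Dict[str, Tuple[str, Dataset]]:
    """Pair each feature kind with its model family and its training data."""
    datasets = build_training_sets(samples, extractors)
    return {kind: (model_family(vector_types[kind]), data)
            for kind, data in datasets.items()}
```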
As shown in Fig. 7, the computer system 700 includes a central processing unit (CPU) 701, which can perform various appropriate actions and processing according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage section 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the system 700. The CPU 701, the ROM 702, and the RAM 703 are connected to one another via a bus 704. An input/output (I/O) interface 705 is also connected to the bus 704.
The following components are connected to the I/O interface 705: an input section 706 including a keyboard, a mouse, and the like; an output section 707 including, for example, a cathode ray tube (CRT) or liquid crystal display (LCD) and a speaker; a storage section 708 including a hard disk and the like; and a communication section 709 including a network interface card such as a LAN card or a modem. The communication section 709 performs communication processing via a network such as the Internet. A drive 710 is also connected to the I/O interface 705 as needed. A removable medium 711, such as a magnetic disk, an optical disc, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 710 as needed, so that a computer program read from it can be installed into the storage section 708 as needed.
In particular, according to embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program containing program code for performing the methods shown in the flowcharts. In such embodiments, the computer program may be downloaded and installed from a network via the communication section 709, and/or installed from the removable medium 711.
The flowcharts and block diagrams in the accompanying drawings illustrate the architecture, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present application. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions marked in the blocks may occur in an order different from that marked in the drawings. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block in the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, may be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units described in the embodiments of the present application may be implemented in software or in hardware. The described units may also be provided in a processor; for example, it may be described as: a processor including an extraction unit, an input unit, and a determining unit. The names of these units do not, in certain cases, constitute a limitation on the units themselves; for example, the extraction unit may also be described as "a unit that extracts at least one kind of feature information, used for file security identification, from the file to be identified, to obtain the feature vectors respectively corresponding to the kinds of feature information".
In another aspect, the present application also provides a non-volatile computer storage medium, which may be the non-volatile computer storage medium included in the device described in the above embodiments, or may be a non-volatile computer storage medium that exists separately and is not assembled into a terminal. The non-volatile computer storage medium stores one or more programs which, when executed by a device, cause the device to: extract at least one kind of feature information, used for file security identification, from the file to be identified, and obtain the feature vectors respectively corresponding to the kinds of feature information, where the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vector include a fixed-length feature vector whose length is constant and a variable-length feature vector whose length is variable; input each obtained feature vector, as an input vector, into the machine learning model corresponding to the type of the feature vector, where a fixed-length feature vector corresponds to a fixed-length-input learning model and a variable-length feature vector corresponds to a variable-length-input learning model; and determine, from the output vectors of the machine learning models, whether the file to be identified is a virus file or a safe file.
The above description is only of the preferred embodiments of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in the present application is not limited to technical solutions formed by the particular combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in the present application.

Claims (15)

1. A file security identification method, characterized in that the method comprises:
extracting at least one kind of feature information, used for file security identification, from a file to be identified, to obtain feature vectors respectively corresponding to the kinds of feature information, wherein the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vector include a fixed-length feature vector whose length is constant and a variable-length feature vector whose length is variable;
inputting each obtained feature vector, as an input vector, into the machine learning model corresponding to the type of the feature vector, wherein a fixed-length feature vector corresponds to a fixed-length-input learning model and a variable-length feature vector corresponds to a variable-length-input learning model;
determining, from the output vectors of the machine learning models, whether the file to be identified is a virus file or a safe file.
2. The method according to claim 1, characterized in that the file to be identified is an Android installation package (APK) file.
3. The method according to claim 2, characterized in that the feature vectors include at least one fixed-length feature vector obtained by extracting any of the following feature information:
structural feature information of the APK file; permission information of the APK file; information on the services provided by the APK file; information on the events monitored by the APK file; information on the class name of each class, the function name of each function, or the character strings referenced in the APK file; feature information on the distribution of the file types of the files in the APK file.
4. The method according to claim 3, characterized in that the structural feature information includes one or more of the following:
the number of packages in the APK file whose name length is less than a threshold;
the maximum, minimum, total, mean, and variance of the lengths of the names of the classes in the APK file, and the ratio of the number of said classes to the number of all classes in the APK file;
the maximum, minimum, mean, and variance of the lengths of the names of the member variables in the APK file, and the ratio of the number of said member variables to the number of all member variables in the APK file;
the maximum, minimum, total, mean, and variance of the lengths of the names of the member functions in the APK file, and the proportion of the number of said member functions in the number of all member functions in the APK file;
the ratio of the types of member variables and the return-value types of member functions in the APK file to the types of all data in the APK file;
the distribution of the number of input parameters of the member functions in the APK file, and the number of input parameters whose name length is less than a threshold;
whether the APK file contains preset character strings, URLs, telephone numbers, or numerals;
the number of windows in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the window names, and the size of the windows and confirmation windows;
the number of menus in the APK file, and the maximum, minimum, total, mean, and variance of the string lengths of the menu names;
the number of animations in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the animation names, and the pixel features of the images in the animations;
the number of pictures in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the picture names, and the pixel features of the images in the pictures.
5. The method according to claim 2, characterized in that the feature vectors include at least one variable-length feature vector obtained by extracting any of the following feature information:
information on the function call relationships of the APK file; information on the control types and control names of each window in the APK file; information on the distribution of the update times of the files in the APK file; information on the certificate categories contained in the APK file.
6. The method according to claim 1, characterized in that the fixed-length-input learning model is a neural network model, and the variable-length-input learning model is a recurrent neural network model.
7. The method according to claim 1, characterized in that determining, from the output vectors of the machine learning models, whether the file to be identified is a virus file or a safe file comprises:
inputting the output vectors, as input vectors, into a preset fixed-length-input machine learning model;
determining, from the output vector of the preset fixed-length-input machine learning model, whether the file to be identified is a virus file or a safe file.
8. The method according to claim 1, characterized in that each machine learning model is generated in advance by training through the following steps:
for each file in the training samples, extracting the various kinds of feature information of the file to obtain the corresponding feature vectors, wherein the training samples include at least one file with a security category label, the security category label indicates whether the file is a virus file or a safe file, the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vector include fixed-length feature vectors and variable-length feature vectors;
using the obtained feature vectors as input vectors together with the security category label of each file, training and generating the machine learning model corresponding to each kind of feature information, wherein the type of machine learning model trained on fixed-length feature vectors is a fixed-length-input learning model, and the type of machine learning model trained on variable-length feature vectors is a variable-length-input learning model.
9. A file security identification device, characterized in that the device comprises:
an extraction unit, configured to extract at least one kind of feature information, used for file security identification, from a file to be identified, to obtain feature vectors respectively corresponding to the kinds of feature information, wherein the type of the feature vector corresponding to each kind of feature information is preset, and the types of feature vector include a fixed-length feature vector whose length is constant and a variable-length feature vector whose length is variable;
an input unit, configured to input each obtained feature vector, as an input vector, into the machine learning model corresponding to the type of the feature vector, wherein a fixed-length feature vector corresponds to a fixed-length-input learning model and a variable-length feature vector corresponds to a variable-length-input learning model;
a determining unit, configured to determine, from the output vectors of the machine learning models, whether the file to be identified is a virus file or a safe file.
10. The device according to claim 9, characterized in that the file to be identified is an Android installation package (APK) file.
11. The device according to claim 10, characterized in that the feature vectors include at least one fixed-length feature vector obtained by extracting any of the following feature information:
structural feature information of the APK file; permission information of the APK file; information on the services provided by the APK file; information on the events monitored by the APK file; information on the class name of each class, the function name of each function, or the character strings referenced in the APK file; feature information on the distribution of the file types of the files in the APK file.
12. The device according to claim 11, characterized in that the structural feature information includes one or more of the following:
the number of packages in the APK file whose name length is less than a threshold;
the maximum, minimum, total, mean, and variance of the lengths of the names of the classes in the APK file, and the ratio of the number of said classes to the number of all classes in the APK file;
the maximum, minimum, mean, and variance of the lengths of the names of the member variables in the APK file, and the ratio of the number of said member variables to the number of all member variables in the APK file;
the maximum, minimum, total, mean, and variance of the lengths of the names of the member functions in the APK file, and the proportion of the number of said member functions in the number of all member functions in the APK file;
the ratio of the types of member variables and the return-value types of member functions in the APK file to the types of all data in the APK file;
the distribution of the number of input parameters of the member functions in the APK file, and the number of input parameters whose name length is less than a threshold;
whether the APK file contains preset character strings, URLs, telephone numbers, or numerals;
the number of windows in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the window names, and the size of the windows and confirmation windows;
the number of menus in the APK file, and the maximum, minimum, total, mean, and variance of the string lengths of the menu names;
the number of animations in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the animation names, and the pixel features of the images in the animations;
the number of pictures in the APK file, the maximum, minimum, total, mean, and variance of the lengths of the picture names, and the pixel features of the images in the pictures.
13. The device according to claim 10, characterized in that the feature vectors include at least one variable-length feature vector obtained by extracting any of the following feature information:
information on the function call relationships of the APK file; information on the control types and control names of each window in the APK file; information on the distribution of the update times of the files in the APK file; information on the certificate categories contained in the APK file.
14. The device according to claim 9, characterized in that the fixed-length-input learning model is a neural network model, and the variable-length-input learning model is a recurrent neural network model.
15. The device according to claim 9, characterized in that the determining unit is further configured to:
input the output vectors, as input vectors, into a preset fixed-length-input machine learning model;
determine, from the output vector of the preset fixed-length-input machine learning model, whether the file to be identified is a virus file or a safe file.
CN201610270523.7A 2016-04-27 2016-04-27 File security recognition methods and device Active CN105956469B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610270523.7A CN105956469B (en) 2016-04-27 2016-04-27 File security recognition methods and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610270523.7A CN105956469B (en) 2016-04-27 2016-04-27 File security recognition methods and device

Publications (2)

Publication Number Publication Date
CN105956469A true CN105956469A (en) 2016-09-21
CN105956469B CN105956469B (en) 2019-04-26

Family

ID=56916916

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610270523.7A Active CN105956469B (en) 2016-04-27 2016-04-27 File security recognition methods and device

Country Status (1)

Country Link
CN (1) CN105956469B (en)

Also Published As

Publication number Publication date
CN105956469B (en) 2019-04-26

Similar Documents

Publication Publication Date Title
CN105956469A (en) Method and device for identifying file security
CN111401558B (en) Data processing model training method, data processing device and electronic equipment
CN112417439B (en) Account detection method, device, server and storage medium
CN109922032B (en) Method, device, equipment and storage medium for determining risk of logging in account
CN111681091B (en) Financial risk prediction method and device based on time domain information and storage medium
CN109214914A (en) Loan information checking method and device based on a communication open platform
CN107809371B (en) Shared resource display method and device
CN110929806B (en) Picture processing method and device based on artificial intelligence and electronic equipment
CN111415336B (en) Image tampering identification method, device, server and storage medium
CN105894028A (en) User identification method and device
CN112668453B (en) Video identification method and related equipment
CN114817346A (en) Service processing method and device, electronic equipment and computer readable medium
CN111325578B (en) Sample determination method and device of prediction model, medium and equipment
CN115983907A (en) Data recommendation method and device, electronic equipment and computer readable medium
CN116958846A (en) Video detection method, device, equipment, medium and product
CN112085469B (en) Data approval method, device, equipment and storage medium based on vector machine model
CN112784990A (en) Training method of member inference model
CN112328779A (en) Training sample construction method and device, terminal equipment and storage medium
KR20200087333A (en) Inspection system and method for right identification of images in website
CN113779635B (en) Medical data verification method, device, equipment and storage medium
CN112150139B (en) Data analysis method and device
CN113572913B (en) Image encryption method, device, medium and electronic equipment
CN117034219B (en) Data processing method, device, equipment and readable storage medium
CN117271819B (en) Image data processing method and device, storage medium and electronic device
CN118277536A (en) Information processing method, information processing device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant