CN108733664A - A kind of file classifying method and device - Google Patents

A kind of file classifying method and device

Info

Publication number
CN108733664A
Authority
CN
China
Prior art keywords
file
characteristic information
sample
storage location
integer
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710240448.4A
Other languages
Chinese (zh)
Other versions
CN108733664B (en)
Inventor
张洁烽
崔精兵
屈亚鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201710240448.4A priority Critical patent/CN108733664B/en
Publication of CN108733664A publication Critical patent/CN108733664A/en
Application granted granted Critical
Publication of CN108733664B publication Critical patent/CN108733664B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Embodiments of the invention disclose a file classification method and device, applied in the technical field of information processing. The file classification device converts the characteristic information of a file to be classified into a first integer, and then determines a corresponding first storage location according to the first integer and a preset position calculation function. If, in the storage space, the value corresponding to the first storage location is a first indicated value, the device determines that the file to be classified is a file of a certain type, where the first indicated value indicates that the storage space stores the characteristic information of the sample file represented by the first storage location. In this way, each storage location in the storage space can represent the characteristic information of one sample file of the type, and the indicated value at each storage location shows whether the characteristic information of the corresponding sample file is stored. This greatly reduces the storage space occupied by the characteristic information of the sample files of the type, and so improves the efficiency of determining the type of the file to be classified.

Description

A kind of file classifying method and device
Technical field
The present invention relates to the technical field of information processing, and in particular to a file classification method and device.
Background art
In the prior art, a virus library is usually stored by extracting and storing certain features of the virus samples. Specifically, the Message Digest Algorithm 5 (MD5) value of each virus sample is stored on the hard disk of the local device. When files on the local device need to be scanned for viruses against the virus library, the virus sample information on the hard disk has to be loaded into memory, and the file to be scanned is then checked for a match against that information.
Under normal circumstances, when loading the virus sample information into memory, the local device can load the MD5 values of the virus library directly. However, the MD5 features of a virus library are generally large, so the virus samples cannot be loaded into memory in one pass and the hard disk of the local device has to be read frequently, which makes virus scanning slow. For example, the MD5 feature of a typical virus sample generally needs 32 bytes of storage, so storing 100 million of them takes about 2.98 GB. Alternatively, the local device can first build the MD5 strings into a dictionary tree (trie) and then load it into memory, but initializing the dictionary tree takes a long time, and if the amount of MD5 data on the hard disk is large, this easily leads to insufficient memory.
Summary of the invention
Embodiments of the present invention provide a file classification method and device that determine whether a file to be classified is a file of a certain type according to the value, in a storage space storing the characteristic information of sample files of that type, at the first storage location corresponding to the characteristic information of the file to be classified.
An embodiment of the present invention provides a file classification method, including:
obtaining the characteristic information of a file to be classified, and converting the characteristic information of the file to be classified into a first integer;
determining a corresponding first storage location according to the first integer and a preset position calculation function;
if, in a storage space storing the characteristic information of sample files of a certain type, the value corresponding to the first storage location is a first indicated value, determining that the file to be classified is a file of the certain type, where the first indicated value indicates that the storage space stores the characteristic information of the sample file represented by the first storage location.
An embodiment of the present invention provides a file classification device, including:
an integer unit, configured to obtain the characteristic information of a file to be classified and convert the characteristic information of the file to be classified into a first integer;
a position determination unit, configured to determine a corresponding first storage location according to the first integer and a preset position calculation function;
a first type determination unit, configured to determine that the file to be classified is a file of a certain type if, in a storage space storing the characteristic information of sample files of the certain type, the value corresponding to the first storage location is a first indicated value, where the first indicated value indicates that the storage space stores the characteristic information of the sample file represented by the first storage location.
As can be seen, in the method of this embodiment, the file classification device obtains the characteristic information of the file to be classified, converts it into a first integer, and then determines a corresponding first storage location according to the first integer and a preset position calculation function. If, in the storage space storing the characteristic information of sample files of a certain type, the value corresponding to the first storage location is the first indicated value, the device determines that the file to be classified is a file of that type, where the first indicated value indicates that the storage space stores the characteristic information of the sample file represented by the first storage location. In this way, each storage location in the storage space can represent the characteristic information of one sample file of the type, and the indicated value at each storage location shows whether the characteristic information of the corresponding sample file is stored. This greatly reduces the storage space occupied by the characteristic information of the sample files of the type, which in turn saves the time and memory needed to load that information while determining the type of the file to be classified, so that the efficiency of determining the type is improved.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a file classification method provided by an embodiment of the present invention;
Fig. 2 is a flowchart of a method for setting the value of each storage location in the storage space in an embodiment of the present invention;
Fig. 3 is a flowchart of a method for storing the MD5 features of virus sample files in an application embodiment of the present invention;
Fig. 4 is a flowchart of a method for determining whether a file to be scanned is a virus file in an application embodiment of the present invention;
Fig. 5 is a schematic diagram of determining whether a file to be scanned is a virus file in an application embodiment of the present invention;
Fig. 6 is a structural schematic diagram of a file classification device provided by an embodiment of the present invention;
Fig. 7 is a structural schematic diagram of another file classification device provided by an embodiment of the present invention;
Fig. 8 is a structural schematic diagram of a terminal device provided by an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
The terms "first", "second", "third", "fourth", and the like (if present) in the description, the claims, and the above drawings are used to distinguish similar objects, not to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described here can, for example, be implemented in an order other than the one illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
An embodiment of the present invention provides a file classification method, mainly used in scenarios where a file of unknown type (the file to be classified) is classified, for example determining whether a file is a virus file. Specifically, the file classification device obtains the characteristic information of the file to be classified, converts it into a first integer, and then determines a corresponding first storage location according to the first integer and a preset position calculation function. If, in the storage space storing the characteristic information of sample files of a certain type, the value corresponding to the first storage location is the first indicated value, the device determines that the file to be classified is a file of that type, where the first indicated value indicates that the storage space stores the characteristic information of the sample file represented by the first storage location. In this way, each storage location in the storage space can represent the characteristic information of one sample file of the type, and the value at each storage location shows whether the characteristic information of the corresponding sample file is stored. This greatly reduces the storage space occupied by the characteristic information of the sample files of the type, which in turn saves the time and memory needed to load that information while determining the type of the file to be classified, so that the efficiency of determining the type is improved.
An embodiment of the present invention provides a file classification method, mainly a method performed by a file classification device. The flowchart is shown in Fig. 1 and includes:
Step 101, obtain the characteristic information of the file to be classified, and convert the characteristic information of the file to be classified into a first integer.
Specifically, when obtaining the characteristic information of the file to be classified, the file classification device can calculate the MD5 value of the file. When converting the characteristic information into the first integer, it mainly takes the hash value obtained by performing a hash calculation on the characteristic information as the first integer. The hash calculation here can use any hash algorithm, for example the Secure Hash Algorithm (SHA).
The MD5 value of the file to be classified helps ensure that information is transmitted completely and consistently, and has properties such as compressibility, ease of computation, resistance to modification, and strong collision resistance. The purpose of MD5 is to let a large amount of information be "compressed" into a secure format before it is signed with a private key by digital signature software; specifically, a byte string of arbitrary length is transformed into a fixed-length string of hexadecimal digits.
A hash algorithm maps a binary value of arbitrary length to a shorter binary value of fixed length; this shorter binary value is called the hash value. A hash value is a unique and extremely compact numerical representation of a piece of data: if a piece of plaintext is hashed and even a single letter of it is then changed, the hash calculation produces a different value. This embodiment uses a hash calculation to ensure that the characteristic information of different files to be classified corresponds to different integers.
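Step 101 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `md5_feature` and `feature_to_integer` are hypothetical, and SHA-1 is chosen only because the text allows any hash algorithm.

```python
import hashlib


def md5_feature(file_bytes):
    """Characteristic information: the MD5 digest as a 32-character hex string."""
    return hashlib.md5(file_bytes).hexdigest()


def feature_to_integer(feature):
    """First integer: hash the characteristic information and read the digest as an integer.

    SHA-1 is an arbitrary choice for this sketch; the text permits any hash algorithm.
    """
    digest = hashlib.sha1(feature.encode("ascii")).digest()
    return int.from_bytes(digest, "big")


feature = md5_feature(b"example file contents")
first_integer = feature_to_integer(feature)
```

The integer produced this way can be very large; how it is brought into the range of the storage space is not specified at this point in the text.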
Step 102, determine a corresponding first storage location according to the first integer and a preset position calculation function.
Here, the preset position calculation function is any function that obtains a storage location from an integer. In this embodiment, the first storage location is obtained from the first integer and the position calculation function, which is set in the file classification device in advance. The first storage location determined in this embodiment indicates a position in the storage space that stores the characteristic information of sample files of a certain type.
The specific forms of the first storage location and the preset position calculation function depend mainly on the form in which the storage space stores the characteristic information of sample files. The first storage location can be first positioning coordinates or the like, and the position calculation function can take the quotient and remainder of an integer divided by n as the positioning coordinates of that integer in the storage space, where n is an integer greater than 1. Specifically, in this embodiment, the file classification device can take the quotient and remainder of the first integer divided by n as the first positioning coordinates of the first integer in the storage space.
For example, the storage space stores the characteristic information of sample files in a 4*16-bit bit array as shown in Table 1 below. The position calculation function then takes the quotient and remainder of an integer divided by 16 (n is 16) as the positioning coordinates of that integer in the storage space. For example, the positioning coordinates of 29 are (1, 13), which denote the position a[1], bit13.
        bit0  bit2  bit3  bit4  ...  bit12  bit13  bit14  bit15
a[0]    0     0     0     0     0    0      0      0      0
a[1]    0     0     0     0     0    0      1      0      0
a[2]    0     1     0     0     0    0      0      0      0
a[3]    0     0     0     0     0    0      0      1      0
Table 1
If an object has only two possible values, it can be represented by one bit. A bit array is an array that stores such objects: internally it is an array of integers, and each bit of each integer represents one object. The bit array shown in Table 1 can represent 4*16 objects, each of which represents the characteristic information of one sample file of a certain type. The two values 0 and 1 of an object indicate, respectively, that the characteristic information of the sample file is not stored in the storage space and that it is stored there. For example, in Table 1 the value (1) of the object at a[2], bit2 indicates that the characteristic information of that sample file is stored in the storage space; according to the corresponding position calculation function, the integer corresponding to the characteristic information of that sample file is 34.
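The quotient-and-remainder position calculation function for the 4*16 array of Table 1 can be sketched as follows. The function name `locate` is hypothetical; the example values are the ones given in the text.

```python
def locate(value, n=16):
    """Position calculation function: (array index, bit index) = (quotient, remainder)."""
    return divmod(value, n)


print(locate(29))  # (1, 13): a[1], bit13
print(locate(34))  # (2, 2):  a[2], bit2
```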
Step 103, judge whether, in the storage space storing the characteristic information of sample files of a certain type, the value corresponding to the first storage location is the first indicated value. If it is the first indicated value, perform step 104; if it is not the first indicated value but the second indicated value, perform step 105.
The first indicated value indicates that the storage space stores the characteristic information of the sample file represented by the first storage location, and can specifically be 1. The second indicated value indicates that the storage space does not store the characteristic information of the sample file represented by the first storage location, and can specifically be 0.
In other specific embodiments, the first indicated value can be 1, and the second indicated value is 0.
Step 104, determine that the file to be classified is a file of the certain type.
Step 105, determine that the file to be classified is not a file of the certain type.
As can be seen, in the method of this embodiment, each storage location in the storage space can represent the characteristic information of one sample file of a certain type, and the value at each storage location shows whether the characteristic information of the corresponding sample file is stored. This greatly reduces the storage space occupied by the characteristic information of the sample files of the type, which in turn saves the time and memory needed to load that information while determining the type of the file to be classified, so that the efficiency of determining the type is improved.
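Steps 103 to 105 amount to testing one bit. The sketch below uses the contents of Table 1 as four 16-bit integers; the function name `is_of_type` is hypothetical, and treating an integer that falls outside the array as "not stored" is an assumption of this sketch that the text does not address.

```python
def is_of_type(rows, value, n=16):
    """Return True if the bit addressed by `value` holds the first indicated value (1)."""
    row, bit = divmod(value, n)  # position calculation function
    if row >= len(rows):         # assumption: out-of-range integers count as "not stored"
        return False
    return bool((rows[row] >> bit) & 1)


# Table 1: a[1] has bit13 set, a[2] has bit2 set, a[3] has bit14 set.
table_1 = [0, 1 << 13, 1 << 2, 1 << 14]

print(is_of_type(table_1, 29))  # True:  a[1], bit13 is 1 (step 104)
print(is_of_type(table_1, 30))  # False: a[1], bit14 is 0 (step 105)
```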
In a specific embodiment, the file classification device can store the sample files of a certain type in the storage space through the following steps 201 to 203. The flowchart is shown in Fig. 2 and includes:
Step 201, obtain the characteristic information of each of multiple sample files of a certain type, and convert the characteristic information of the sample files into multiple sample integers respectively.
Specifically, when obtaining the sample characteristic information, the file classification device can calculate the MD5 value of each of the multiple sample files of known type. When converting the sample characteristic information into sample integers, it mainly takes the hash value obtained by performing a hash calculation on the sample characteristic information as the sample integer. The hash calculation here can use any hash algorithm.
Step 202, determine the sample storage location corresponding to each of the sample integers according to the position calculation function described above.
The position calculation function can take the quotient and remainder of an integer divided by n as the positioning coordinates. In this embodiment, the file classification device can take the quotient and remainder of each sample integer divided by n as the corresponding sample positioning coordinates; the sample storage location includes the sample positioning coordinates.
Step 203, set the value corresponding to each sample storage location in the storage space to the first indicated value, and set the values corresponding to the other locations to the second indicated value.
In this way, each storage location in the storage space can represent the characteristic information of one sample file of a certain type, and the value at each storage location shows whether the characteristic information of the corresponding sample file is stored in the storage space.
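Steps 201 to 203 can be sketched as follows, starting from sample integers that have already been computed. The name `build_bit_array` and the parameters `t` (number of array elements) and `n` (bits per element) are illustrative, and the sketch assumes every sample integer is smaller than t*n.

```python
def build_bit_array(sample_integers, t, n=32):
    """Set the bit for every sample integer to 1; all other bits stay 0."""
    rows = [0] * t  # second indicated value (0) everywhere
    for value in sample_integers:
        row, bit = divmod(value, n)  # sample positioning coordinates
        rows[row] |= 1 << bit        # first indicated value (1)
    return rows


rows = build_bit_array([45, 32], t=10)
print(rows[1])  # 8193, i.e. bit13 and bit0 of a[1] are set
```

Adding a new sample later is the same operation applied to a single integer: compute its coordinates and set that bit to 1.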
Further, the file classification device can also keep updating the characteristic information of the sample files of a certain type stored in the storage space, for example by adding the characteristic information of new sample files of that type. In this case, it first determines, according to the position calculation function described above, the latest storage location corresponding to the characteristic information of the newly added sample file of that type, and then sets the value of that latest storage location in the storage space to the first indicated value. The latest storage location is determined in a way similar to the first storage location; the difference is that the latest storage location is determined for the characteristic information of the newly added sample file, whereas the first storage location is determined for the characteristic information of the file to be classified.
The file classification method of the present invention is illustrated below with a specific embodiment. In this embodiment, the file classification device is a virus scanning device and the file to be classified is a file to be scanned. In this embodiment:
(1) The virus scanning device, or a server in the cloud, stores the characteristic information of virus sample files, specifically their MD5 features. The flowchart is shown in Fig. 3 and includes:
Step 301, for multiple virus sample files, the virus scanning device or server first obtains the MD5 feature of each virus sample file, and then performs a hash calculation on each MD5 feature to obtain the corresponding hash value. The hash value corresponding to each virus sample file is an integer.
Step 302, the virus scanning device or server sets up a bit array and initializes the value of every position in the bit array to 0.
Specifically, assume that the number of integers obtained in step 301 is SUM. To prevent the integers corresponding to the MD5 features of different virus sample files from colliding, so that the integer corresponding to the MD5 feature of one virus sample file corresponds to one position in the bit array, the size of the bit array can be set to m times the number of integers SUM. In this embodiment m can be 8.
In a specific embodiment, the bit array can be set as a t*n-bit array in which each bit can represent the MD5 feature of one virus sample file and the value of each bit can be 1 or 0, where n is 32 and t = (SUM*8)/32 + 1.
For example, a 10*32-bit bit array as shown in Table 2 below can be set up:
        bit0  bit2  bit3  bit4  ...  bit28  bit29  bit30  bit31
a[0]    0     0     0     0     0    0      0      0      0
a[1]    0     0     0     0     0    0      0      0      0
a[2]    0     0     0     0     0    0      0      0      0
a[3]    0     0     0     0     0    0      0      0      0
...     0     0     0     0     0    0      0      0      0
a[9]    0     0     0     0     0    0      0      0      0
Table 2
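The sizing rule t = (SUM*8)/32 + 1 from step 302 can be checked numerically. In this sketch the division is taken as integer division, and the sample count 36 is a hypothetical value chosen only because it yields the 10 rows of Table 2; the text does not state the sample count behind that table.

```python
def rows_needed(sample_count, m=8, n=32):
    """Number of n-bit array elements t for a bit array of about m bits per sample."""
    return (sample_count * m) // n + 1


print(rows_needed(36))           # 10
print(rows_needed(100_000_000))  # 25000001
```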
Step 303, the virus scanning device or server determines, according to the preset position calculation function, the positioning coordinates of the integer (i.e. the hash value) corresponding to the MD5 feature of each virus file obtained in step 301.
Specifically, in this embodiment, the position calculation function can take the quotient and remainder of the integer (NUM) divided by 32 (n is 32) as the positioning coordinates of that integer, and can specifically be the following Formula 1:
a[NUM/32] | (1 << (NUM%32))    (1)
In this embodiment, the positioning coordinates determined by the virus scanning device or server are a[i] and bitj, where i is an integer between 0 and 9 and j is an integer between 0 and 31.
Step 304, the virus scanning device or server sets to 1 the value at the position in the bit array corresponding to each set of positioning coordinates determined in step 303, to indicate that the bit array stores the MD5 feature of the virus sample file corresponding to that position.
For example, the MD5 feature of a certain virus sample file is "25e41a91a6a83f9b400e2ff1fc28a1f9" and the hash value corresponding to this MD5 feature is 45. Through step 303, the positioning coordinates corresponding to 45 are determined to be a[1] and bit13, and the value of bit13 of a[1] is then set to 1.
As another example, for the hash values 32, 95, 126, 31, and 288 corresponding to the MD5 features of other virus sample files, the virus scanning device or server determines the corresponding positioning coordinates to be a[1] and bit0, a[2] and bit31, a[3] and bit30, a[0] and bit31, and a[9] and bit0 respectively, and sets the value of each corresponding position to 1, as shown in Table 3 below:
        bit0  ...  bit12  bit13  bit14  ...  bit29  bit30  bit31
a[0]    0     0    0      0      0      0    0      0      1
a[1]    1     0    0      1      0      0    0      0      0
a[2]    0     0    0      0      0      0    0      0      1
a[3]    0     0    0      0      0      0    0      1      0
...     0     0    0      0      0      0    0      0      0
a[9]    1     0    0      0      0      0    0      0      0
Table 3
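The positions marked in Table 3 can be recomputed from the hash values given in the text with n = 32. The helper name `positions` is hypothetical.

```python
def positions(values, n=32):
    """Map each integer to its (array index, bit index) positioning coordinates."""
    return {value: divmod(value, n) for value in values}


for value, (i, j) in positions([45, 32, 95, 126, 31, 288]).items():
    print(f"{value} -> a[{i}], bit{j}")
# 45 -> a[1], bit13
# 32 -> a[1], bit0
# 95 -> a[2], bit31
# 126 -> a[3], bit30
# 31 -> a[0], bit31
# 288 -> a[9], bit0
```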
(2) Determine whether a local file is a virus file. The flowchart is shown in Fig. 4 and includes:
Step 401, as shown in Fig. 5, the virus scanning device first loads the bit array stored on the local hard disk, or a bit array downloaded from the server, into the memory of the virus scanning device.
It can be understood that the virus scanning device can initiate the flow of this embodiment for local files periodically, according to a preset period, or can initiate it for a specific file in response to a user operation on the virus scanning device.
Step 402, for a file to be scanned, the virus scanning device first obtains the MD5 feature of the file, and then performs a hash calculation on that MD5 feature to obtain the corresponding hash value, which is an integer.
Step 403, the virus scanning device determines, according to the preset position calculation function, the positioning coordinates of the integer (i.e. the hash value) corresponding to the MD5 feature of the file to be scanned obtained in step 402.
Step 404, the virus scanning device judges whether, in the bit array loaded in memory, the value at the position corresponding to the positioning coordinates obtained in step 403 is 1. If it is 1, the file to be scanned is determined to be a virus file; if it is 0, the file to be scanned is determined not to be a virus file.
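Steps 301 to 304 and 402 to 404 can be put together in a short end-to-end sketch. Several details here are assumptions rather than anything the text specifies: SHA-1 as the hash, the helper names, and above all the reduction of the hash value modulo the total number of bits so that it falls inside the array.

```python
import hashlib

N = 32  # bits per array element, as in the embodiment


def md5_to_integer(md5_hex, total_bits):
    """Hash the MD5 feature to an integer; the modulo step is an assumption of this sketch."""
    digest = hashlib.sha1(md5_hex.encode("ascii")).digest()
    return int.from_bytes(digest, "big") % total_bits


def add_sample(rows, sample_bytes):
    """Steps 301 to 304: mark the position of one virus sample with 1."""
    num = md5_to_integer(hashlib.md5(sample_bytes).hexdigest(), len(rows) * N)
    rows[num // N] |= 1 << (num % N)


def is_virus(rows, file_bytes):
    """Steps 402 to 404: test the bit addressed by the scanned file's MD5 feature."""
    num = md5_to_integer(hashlib.md5(file_bytes).hexdigest(), len(rows) * N)
    return bool(rows[num // N] & (1 << (num % N)))


rows = [0] * 10  # 10*32-bit array, all positions initialized to 0
add_sample(rows, b"known virus sample")
print(is_virus(rows, b"known virus sample"))  # True
```

In a sketch like this, two different files can map to the same bit, in which case a file that is not a sample would also be reported as a match; a larger array makes that less likely.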
The method of this embodiment can achieve the following effects:
(1) It reduces the storage space for the characteristic information of virus sample files. Compared with the prior art, which stores the MD5 feature of each virus file directly (a 32-byte string), this embodiment needs only one bit to record one MD5 feature, which amounts to a 256-fold (32*8) saving in storage space.
(2) For virus sample sets with a large amount of data, the characteristic information of the virus sample files can be loaded into memory in one pass. Storing 100 million MD5 features needs 3.2 billion bytes, about 2.98 GB of memory, which many virus scanning devices cannot allocate at once. With the method of this embodiment the memory to be allocated is greatly reduced, so the characteristic information of the virus sample files can be loaded into memory in one pass.
(3) It improves matching efficiency. In the prior art, matching the MD5 feature of a file to be scanned against the MD5 features of the virus sample files has time complexity O(n); with the method of this embodiment, the time complexity is O(1).
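The memory figures can be reproduced with a few lines of arithmetic. The sketch assumes 4-byte (32-bit) array elements and the m = 8 sizing from step 302; the variable names are illustrative.

```python
SAMPLES = 100_000_000

raw_bytes = SAMPLES * 32         # one 32-byte MD5 string per sample
rows = (SAMPLES * 8) // 32 + 1   # t = (SUM*8)/32 + 1
bitmap_bytes = rows * 4          # each array element holds 32 bits = 4 bytes

print(round(raw_bytes / 2**30, 2))     # 2.98 (GiB)
print(round(bitmap_bytes / 2**20, 1))  # 95.4 (MiB)
```

Under the m = 8 sizing, the array reserves about 8 bits per sample rather than 1, so in this calculation the bit array is roughly 32 times smaller than the raw MD5 strings.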
An embodiment of the present invention also provides a file classification device. Its structural schematic diagram is shown in Fig. 6, and it can specifically include:
an integer unit 10, configured to obtain the characteristic information of a file to be classified and convert the characteristic information of the file to be classified into a first integer;
a position determination unit 11, configured to determine a corresponding first storage location according to the first integer obtained by the integer unit 10 and a preset position calculation function;
a first type determination unit 12, configured to determine that the file to be classified is a file of a certain type if, in the storage space storing the characteristic information of sample files of the certain type, the value corresponding to the first storage location determined by the position determination unit 11 is the first indicated value, where the first indicated value indicates that the storage space stores the characteristic information of the sample file represented by the first storage location.
In this embodiment, the integer unit 10 is specifically configured to take the hash value obtained by performing a hash calculation on the characteristic information of the file to be classified as the first integer. The position determination unit 11 is specifically configured to take the quotient and remainder of the first integer divided by n as the first positioning coordinates of the first integer in the storage space; the first storage location is the first positioning coordinates, and n is an integer greater than 1.
The sample files of the certain type are virus sample files, the first indicated value is 1, and the second indicated value is 0.
As can be seen, in the file classification device of this embodiment, the integer unit 10 obtains the characteristic information of the file to be classified and converts it into a first integer, and the position determination unit 11 then determines a corresponding first storage location according to the first integer and a preset position calculation function. If, in the storage space storing the characteristic information of sample files of a certain type, the value corresponding to the first storage location is the first indicated value, the first type determination unit 12 determines that the file to be classified is a file of that type, where the first indicated value indicates that the storage space stores the characteristic information of the sample file represented by the first storage location. In this way, each storage location in the storage space can represent the characteristic information of one sample file of the type, and the value at each storage location shows whether the characteristic information of the corresponding sample file is stored. This greatly reduces the storage space occupied by the characteristic information of the sample files of the type, which in turn saves the time and memory needed to load that information while determining the type of the file to be classified, so that the efficiency of determining the type is improved.
Referring to Fig. 7, in a specific embodiment, the file classification device may include, in addition to the structure shown in Fig. 6, a setting unit 13 and a second type determination unit 14, wherein:
The second type determination unit 14 is configured to determine that the file to be classified is not a file of the certain type if, in the memory space that stores the characteristic information of sample files of the certain type, the value corresponding to the first storage location determined by the position determination unit 11 is the second indicated value. The second indicated value indicates that the memory space does not store the characteristic information of the sample file represented by the first storage location.
The integer unit 10 is further configured to obtain the characteristic information of each of a plurality of sample files of the certain type and to convert the characteristic information of the sample files into a plurality of sample integers. The position determination unit 11 is further configured to determine, according to the position calculation function, the sample storage location corresponding to each of the sample integers obtained by the integer unit 10. The setting unit 13 is then configured to set the values corresponding to the sample storage locations determined by the position determination unit 11 in the memory space to the first indicated value, and to set the values corresponding to all other locations to the second indicated value.
Further, the position determination unit 11 is also configured to determine, according to the position calculation function, the newest storage location corresponding to the characteristic information of a newly added sample file of the certain type. In this case, the setting unit 13 is also configured to set the value of the newest storage location determined by the position determination unit 11 in the memory space to the first indicated value.
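The role of the setting unit, both for the initial set of samples and for a newly added sample, can be sketched as follows. The class `SampleSpace`, its method names, the default table size, and the use of CRC32 as the hash are hypothetical choices made for this illustration and are not taken from the embodiment.

```python
import zlib


class SampleSpace:
    """Hypothetical bit table: a cell holds 1 if a sample maps to it, else 0."""

    def __init__(self, rows: int = 256, n: int = 256):
        self.rows, self.n = rows, n
        # Every cell starts at the second indicated value (0).
        self.bits = bytearray(rows * n // 8)

    def _index(self, feature: bytes) -> int:
        # Assumption: CRC32 reduced to the table size stands in for the hash.
        first_integer = zlib.crc32(feature) % (self.rows * self.n)
        row, col = divmod(first_integer, self.n)  # (quotient, remainder)
        return row * self.n + col                 # flat position of cell (row, col)

    def add(self, feature: bytes) -> None:
        # Set only the cell for this sample to the first indicated value (1).
        i = self._index(feature)
        self.bits[i // 8] |= 1 << (i % 8)

    def contains(self, feature: bytes) -> bool:
        i = self._index(feature)
        return bool((self.bits[i // 8] >> (i % 8)) & 1)

    @classmethod
    def build(cls, features, rows: int = 256, n: int = 256) -> "SampleSpace":
        space = cls(rows, n)
        for feature in features:
            space.add(feature)
        return space


# Initial sample set, then one newly added sample.
space = SampleSpace.build([b"sample-a", b"sample-b"])
space.add(b"sample-c")
```

Adding a sample touches a single cell, so in this sketch the table does not need to be rebuilt when the sample set grows.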
An embodiment of the present invention further provides a terminal device, whose structure is shown schematically in Fig. 8. Terminal devices may differ considerably depending on configuration or performance. The terminal device may include one or more central processing units (CPUs) 20 (for example, one or more processors), a memory 21, and one or more storage media 22 (for example, one or more mass storage devices) that store application programs 221 or data 222. The memory 21 and the storage medium 22 may provide transient or persistent storage. A program stored in the storage medium 22 may include one or more modules (not marked in the figure), and each module may include a series of instruction operations for the terminal device. Further, the central processing unit 20 may be configured to communicate with the storage medium 22 and to execute, on the terminal device, the series of instruction operations in the storage medium 22.
Specifically, the application programs 221 stored in the storage medium 22 include a file classification application, which may include the integer unit 10, the position determination unit 11, the first type determination unit 12, the setting unit 13 and the second type determination unit 14 of the file classification device described above; these are not described again here. Further, the central processing unit 20 may be configured to communicate with the storage medium 22 and to execute, on the terminal device, the series of operations corresponding to the file classification application stored in the storage medium 22.
The terminal device may also include one or more power supplies 23, one or more wired or wireless network interfaces 24, one or more input/output interfaces 25, and/or one or more operating systems 223, such as Windows Server™, Mac OS X™, Unix™, Linux™ or FreeBSD™.
The steps performed by the file classification device in the above method embodiments may be based on the structure of the terminal device shown in Fig. 8.
A person of ordinary skill in the art will understand that all or some of the steps of the methods in the above embodiments may be completed by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, which may include read-only memory (ROM), random access memory (RAM), a magnetic disk, an optical disc, or the like.
The file classification method and device provided by the embodiments of the present invention have been described in detail above. Specific examples are used herein to explain the principles and implementations of the invention, and the description of the above embodiments is intended only to help in understanding the method of the invention and its core idea. A person of ordinary skill in the art may, in accordance with the idea of the invention, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (14)

1. A file classification method, characterized by comprising:
obtaining characteristic information of a file to be classified, and converting the characteristic information of the file to be classified into a first integer;
determining a corresponding first storage location according to the first integer and a preset position calculation function;
if, in a memory space that stores characteristic information of sample files of a certain type, the value corresponding to the first storage location is a first indicated value, determining that the file to be classified is a file of the certain type, wherein the first indicated value indicates that the memory space stores the characteristic information of the sample file represented by the first storage location.
2. The method of claim 1, characterized in that the method further comprises:
obtaining the characteristic information of each of a plurality of sample files of the certain type, and converting the characteristic information of the plurality of sample files into a plurality of sample integers;
determining, according to the position calculation function, the sample storage location corresponding to each of the plurality of sample integers;
setting the values corresponding to the sample storage locations in the memory space to the first indicated value, and setting the values corresponding to all other locations to a second indicated value.
3. The method of claim 2, characterized in that the method further comprises:
determining, according to the position calculation function, the newest storage location corresponding to the characteristic information of a newly added sample file of the certain type;
setting the value of the newest storage location in the memory space to the first indicated value.
4. The method of any one of claims 1 to 3, characterized in that converting the characteristic information into the first integer specifically comprises:
performing a hash calculation on the characteristic information of the file to be classified and using the resulting hash value as the first integer.
5. The method of any one of claims 1 to 3, characterized in that determining the corresponding first storage location according to the first integer and the preset position calculation function specifically comprises:
using the quotient and the remainder obtained by dividing the first integer by n as the first position coordinates of the first integer in the memory space, wherein the first storage location is given by the first position coordinates and n is an integer greater than 1.
6. The method of any one of claims 1 to 3, characterized in that the method further comprises:
if, in the memory space that stores the characteristic information of sample files of the certain type, the value corresponding to the first storage location is a second indicated value, determining that the file to be classified is not a file of the certain type, wherein the second indicated value indicates that the memory space does not store the characteristic information of the sample file represented by the first storage location.
7. The method of claim 6, characterized in that the sample files of the certain type are virus sample files, the first indicated value is 1, and the second indicated value is 0.
8. A file classification device, characterized by comprising:
an integer unit, configured to obtain characteristic information of a file to be classified and to convert the characteristic information of the file to be classified into a first integer;
a position determination unit, configured to determine a corresponding first storage location according to the first integer and a preset position calculation function;
a first type determination unit, configured to determine that the file to be classified is a file of a certain type if, in a memory space that stores characteristic information of sample files of the certain type, the value corresponding to the first storage location is a first indicated value, wherein the first indicated value indicates that the memory space stores the characteristic information of the sample file represented by the first storage location.
9. The device of claim 8, characterized by further comprising a setting unit, wherein:
the integer unit is further configured to obtain the characteristic information of each of a plurality of sample files of the certain type and to convert the characteristic information of the plurality of sample files into a plurality of sample integers;
the position determination unit is further configured to determine, according to the position calculation function, the sample storage location corresponding to each of the plurality of sample integers;
the setting unit is configured to set the values corresponding to the sample storage locations in the memory space to the first indicated value, and to set the values corresponding to all other locations to a second indicated value.
10. The device of claim 9, characterized in that:
the position determination unit is further configured to determine, according to the position calculation function, the newest storage location corresponding to the characteristic information of a newly added sample file of the certain type;
the setting unit is further configured to set the value of the newest storage location in the memory space to the first indicated value.
11. The device of any one of claims 8 to 10, characterized in that:
the integer unit is specifically configured to perform a hash calculation on the characteristic information of the file to be classified and to use the resulting hash value as the first integer.
12. The device of any one of claims 8 to 10, characterized in that:
the position determination unit is specifically configured to use the quotient and the remainder obtained by dividing the first integer by n as the first position coordinates of the first integer in the memory space, wherein the first storage location is given by the first position coordinates and n is an integer greater than 1.
13. The device of any one of claims 8 to 10, characterized by further comprising:
a second type determination unit, configured to determine that the file to be classified is not a file of the certain type if, in the memory space that stores the characteristic information of sample files of the certain type, the value corresponding to the first storage location is a second indicated value, wherein the second indicated value indicates that the memory space does not store the characteristic information of the sample file represented by the first storage location.
14. The device of claim 13, characterized in that the sample files of the certain type are virus sample files, the first indicated value is 1, and the second indicated value is 0.
CN201710240448.4A 2017-04-13 2017-04-13 File classification method and device Active CN108733664B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710240448.4A CN108733664B (en) 2017-04-13 2017-04-13 File classification method and device

Publications (2)

Publication Number Publication Date
CN108733664A (en) 2018-11-02
CN108733664B CN108733664B (en) 2022-05-03

Family

ID=63923800

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710240448.4A Active CN108733664B (en) 2017-04-13 2017-04-13 File classification method and device

Country Status (1)

Country Link
CN (1) CN108733664B (en)

Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090012984A1 (en) * 2007-07-02 2009-01-08 Equivio Ltd. Method for Organizing Large Numbers of Documents
CN101777056A (en) * 2009-12-31 2010-07-14 成都市华为赛门铁克科技有限公司 Data storage method and device
US7809752B1 (en) * 2005-04-14 2010-10-05 AudienceScience Inc. Representing user behavior information
CN103037344A (en) * 2012-12-06 2013-04-10 亚信联创科技(中国)有限公司 Call bill repetition removing method and call bill repetition removing device
CN103067364A (en) * 2012-12-21 2013-04-24 华为技术有限公司 Virus detection method and equipment
CN103164651A (en) * 2011-12-15 2013-06-19 西门子公司 Device and method for extracting virus file feature code and virus detection system
CN103164408A (en) * 2011-12-09 2013-06-19 阿里巴巴集团控股有限公司 Information storage and query method based on vertical search engine and device thereof
WO2014081727A1 (en) * 2012-11-20 2014-05-30 Denninghoff Karl L Search and navigation to specific document content
CN103970722A (en) * 2014-05-07 2014-08-06 江苏金智教育信息技术有限公司 Text content duplicate removal method
CN104090895A (en) * 2013-12-18 2014-10-08 深圳市腾讯计算机系统有限公司 Method, device, server and system for obtaining cardinal number
CN104657451A (en) * 2015-02-05 2015-05-27 百度在线网络技术(北京)有限公司 Processing method and processing device for page
CN104751055A (en) * 2013-12-31 2015-07-01 北京启明星辰信息安全技术有限公司 Method, device and system for detecting distributed malicious codes on basis of textures
CN105069020A (en) * 2015-07-14 2015-11-18 国家信息中心 3D visualization method and system of natural resource data
CN105183855A (en) * 2015-09-08 2015-12-23 浪潮(北京)电子信息产业有限公司 Information classification method and system
CN105306063A (en) * 2015-10-12 2016-02-03 浙江大学 Optimization and recovery methods for record type data storage space
CN106487833A (en) * 2015-08-26 2017-03-08 北京国双科技有限公司 The statistical method of isolated user number and device in network monitor

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
O. ERDOGAN 等: "Hash-AV: fast virus signature scanning by cache-resident filters", 《 GLOBECOM "05. IEEE GLOBAL TELECOMMUNICATIONS CONFERENCE, 2005.》 *
刘鹏程: "一种新型网页篡改检测技术", 《绍兴文理学院学报(自然科学)》 *
樊震 等: "基于PE文件结构异常的未知病毒检测", 《计算机技术与发展》 *

Also Published As

Publication number Publication date
CN108733664B (en) 2022-05-03

Similar Documents

Publication Publication Date Title
US9805099B2 (en) Apparatus and method for efficient identification of code similarity
Chikhi et al. On the representation of de Bruijn graphs
EP2608096B1 (en) Compression of genomic data file
US7834781B2 (en) Method of constructing an approximated dynamic Huffman table for use in data compression
JP4455661B2 (en) Hash function construction from expander graph
US20150302197A1 (en) Apparatus and Method for Identifying Similarity Via Dynamic Decimation of Token Sequence N-Grams
EP3072076B1 (en) A method of generating a reference index data structure and method for finding a position of a data pattern in a reference data structure
Gayoso Martínez et al. State of the art in similarity preserving hashing functions
CN105553937B (en) The system and method for data compression
CN106682506B (en) Virus program detection method and terminal
CN111370064B (en) Rapid classification method and system for gene sequences of SIMD (Single instruction multiple data) -based hash function
Yu et al. Traversing the k-mer landscape of NGS read datasets for quality score sparsification
CN112052413A (en) URL fuzzy matching method, device and system
Italiano et al. Compressed weighted de Bruijn graphs
CN112926647B (en) Model training method, domain name detection method and domain name detection device
WO2018055160A1 (en) System level testing of entropy encoding
CN1691581A (en) Multi-pattern matching algorithm based on characteristic value and hardware implementation
CN107251015B (en) Efficiently detecting user credentials
CN108733664A (en) A kind of file classifying method and device
Ghosh et al. Pakman: Scalable assembly of large genomes on distributed memory machines
CN113987486A (en) Malicious program detection method and device and electronic equipment
CN110287147B (en) Character string sorting method and device
Ho et al. PERG-Rx: a hardware pattern-matching engine supporting limited regular expressions
Hreinsson et al. Storing a compressed function with constant time access
JP2015046683A (en) Traffic scanning device and method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant