CN109816038A

CN109816038A - Internet of Things firmware program classification method and device

Info

Publication number: CN109816038A
Application number: CN201910098931.2A
Authority: CN
Inventors: 吴晓鸰; 于龙海
Original assignee: Guangdong University of Technology
Current assignee: Guangdong University of Technology
Priority date: 2019-01-31
Filing date: 2019-01-31
Publication date: 2019-05-28
Anticipated expiration: 2039-01-31
Also published as: CN109816038B

Abstract

The invention discloses an Internet of Things firmware program classification method and device. The method comprises: extracting the readable character strings in the firmware; constructing a driver tree of the firmware from the readable character strings, in which the root node is the firmware number, the second-layer nodes are program types, the third-layer nodes are the information types of the readable character strings, and the fourth-layer nodes are the contents of the corresponding readable character strings; successively calculating the difference degree numerical value of corresponding nodes between every two driver trees and recording the calculated results, each of which includes the identifiers of the two driver trees, the identifier of the corresponding node and its difference degree numerical value; and screening, according to the calculated results, the N driver trees with the largest ambient density as cluster centres and clustering around them to obtain several firmware classes, on which subsequent firmware repair is based. Because the classification considers the similarity between all readable character strings, its accuracy is high and the workload of subsequent firmware repair is reduced.

Description

Internet of Things firmware program classification method and device

Technical field

The present invention relates to the technical field of firmware repair, and more particularly to an Internet of Things firmware program classification method and device.

Background art

Firmware refers to the device "driver" stored inside a device. Through the firmware, the operating system can drive the device according to a standard interface and make a specific machine perform its operations; optical drives and CD writers, for example, have internal firmware. Firmware is the software that carries out the most basic, lowest-level work of a system. Because some hardware devices contain no software other than firmware, the firmware also determines the function and performance of those devices.

Testing during the production of a hardware product can reveal problems in the hardware, but it can only reveal problems or vulnerabilities that already exist. After testing, the product goes into use. Because hardware devices are networked, an attacker can attack the hardware system over the network, which may give rise to new vulnerabilities in the Internet of Things firmware that the testing process could not find. The conventional remedy for a vulnerable hardware device is to replace it with a new model. However, because hardware devices are used in very large numbers and are costly, most companies still simply reformat the device and continue to use it. In the past, hardware devices were usually used only on a company's intranet, that is, they were not connected to the Internet, which gave physical isolation. With the development of Internet of Things technology, however, every hardware device communicates over the network. In this situation, a vulnerability in the hardware that is discovered by criminals on the network poses a serious threat to production safety. Identifying and repairing problematic firmware is therefore critical.

When firmware is repaired, the faulty program has usually been in use for some time, and tens of thousands of finished hardware products may have been built with it. Inspecting each one to confirm whether a vulnerability exists and then patching it is an enormous workload. Moreover, because of differences in the hardware platform on which the firmware runs, the compiler used and the compiler options selected, even identical firmware programs end up as different assembly code and machine code. One existing way to reduce the workload is therefore to classify the firmware: each firmware program is divided into smaller file sections, the similarity of each file section between two firmware images is compared, and the two are placed in the same class as soon as any one file section is highly similar. Because a firmware image has many file sections, however, each resulting class tends to contain many different pieces of reused code; that is, the similarity between firmware code within a class is low and the classification accuracy is low.

Therefore, how to provide an Internet of Things firmware program classification method and device with high classification accuracy is a problem that those skilled in the art currently need to solve.

Summary of the invention

The object of the present invention is to provide an Internet of Things firmware program classification method and device that organize all readable code sections of the firmware by means of a product structure tree, so that the similarity between all readable character strings is taken into account during classification, which improves the classification accuracy and in turn reduces the workload of subsequent firmware repair.

In order to solve the above technical problem, the present invention provides an Internet of Things firmware program classification method, comprising:

Extracting the readable character strings in each firmware to be classified;

Constructing the driver tree of the firmware from the readable character strings, wherein the root node of the driver tree is the number of the firmware, the second-layer nodes are the program part types to which the readable character strings belong, the third-layer nodes are the information types of the readable character strings, and the fourth-layer nodes are the contents of the corresponding readable character strings;

Successively calculating the difference degree numerical value of corresponding nodes between every two driver trees among all driver trees, and recording the calculated results, wherein each calculated result includes the identifiers of the two driver trees, the identifier of the corresponding node and its difference degree numerical value;

Screening, according to the calculated results, the N driver trees with the largest ambient density as cluster centres and clustering around them to obtain several firmware classes, on which subsequent firmware analysis and repair are based; N is a positive integer.

Preferably, after extracting the readable character strings in each firmware to be classified and before constructing the driver tree of the firmware from the readable character strings, the method further comprises:

Judging whether a readable character string is related to the platform or to a link library; if so, deleting that readable character string; if not, continuing with the next extracted readable character string, until all extracted readable character strings have been judged;

Correspondingly, the driver tree of the firmware is subsequently constructed from the readable character strings that remain after the judgment;

Wherein, the process of judging whether a readable character string is platform-related is:

Judging whether the obtained information quantity of the readable character string is greater than a preset platform dependence threshold; if so, the readable character string is platform-related; otherwise, it is not. The obtained information quantity of the readable character string is given by a relational expression [not reproduced in the source text],

Wherein IG(s) is the obtained information quantity; C_i is the i-th target platform; P(C_i) is the proportion of all binary files that belong to target platform C_i; P(s) is the proportion of all binary files that contain the readable character string s; and P(s, C_i) is the proportion of all binary files that both belong to target platform C_i and contain the readable character string s.
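As a rough illustration of the platform-related filter, the sketch below scores a string with the single-sum form IG(s) = Σ_i P(s, C_i) · log2(P(s, C_i) / (P(s) · P(C_i))). This form, the base-2 logarithm, the function name, the sample corpus and the threshold value are all assumptions made for illustration: the relational expression itself is not reproduced in this text, and only the symbols IG(s), C_i, P(C_i), P(s) and P(s, C_i) come from the description.

```python
import math

def information_gain(files, s):
    """Score string s over a corpus of (platform, strings_in_file) pairs.

    Assumed form: sum over platforms C_i of
    P(s, C_i) * log2(P(s, C_i) / (P(s) * P(C_i))).
    """
    total = len(files)
    p_s = sum(1 for _, strings in files if s in strings) / total
    if p_s == 0:
        return 0.0
    score = 0.0
    for platform in {p for p, _ in files}:
        p_c = sum(1 for p, _ in files if p == platform) / total
        p_sc = sum(1 for p, strings in files
                   if p == platform and s in strings) / total
        if p_sc > 0:  # a zero joint proportion contributes nothing
            score += p_sc * math.log2(p_sc / (p_s * p_c))
    return score

# Hypothetical corpus: two ARM binaries and two MIPS binaries.
corpus = [
    ("arm", {"ld-linux-armhf.so.3", "error: bad checksum"}),
    ("arm", {"ld-linux-armhf.so.3"}),
    ("mips", {"ld.so.1", "error: bad checksum"}),
    ("mips", {"ld.so.1"}),
]
platform_threshold = 0.3  # hypothetical preset threshold

# Keep only strings whose score does not exceed the threshold.
kept = [s for s in ("ld-linux-armhf.so.3", "error: bad checksum")
        if information_gain(corpus, s) <= platform_threshold]
```

In this toy corpus the loader name appears only in ARM binaries, so it scores above the threshold and is dropped, while the error message is spread evenly over both platforms and is kept.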

Preferably, the process of calculating the difference degree numerical value of corresponding nodes between two driver trees is specifically:

According to the node distance relational expression, successively calculating the difference degree numerical value between each pair of corresponding nodes in the first, second and third layers of the two driver trees;

The node distance relational expression is: [not reproduced in the source text]

Wherein the two trees compared are the driver trees formed from the i-th firmware and the j-th firmware; the expression gives the difference degree numerical value of the node v that occupies the same position in both trees, and it refers to the set of all child nodes of node v in each tree.

Preferably, the process of screening, according to the calculated results, the N driver trees with the largest ambient density as cluster centres comprises:

Determining, from the calculated results, all difference degree numerical values between each driver tree and all other driver trees;

Counting, among all the difference degree numerical values corresponding to each driver tree, the number of values that are smaller than a preset distance threshold, and taking this count as the ambient density number of that driver tree;

Ranking all driver trees in descending order of ambient density number and selecting the top N driver trees as cluster centres.
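The three steps above can be sketched as follows. This is a minimal illustration, assuming the pairwise difference degree numerical values are available through a caller-supplied `distance` function; the function name, the parameter names and the toy data (trees placed on a line) are hypothetical.

```python
def select_cluster_centres(tree_ids, distance, distance_threshold, n):
    """Return (top-n tree ids by ambient density, density per tree)."""
    density = {
        t: sum(1 for u in tree_ids
               if u != t and distance(t, u) < distance_threshold)
        for t in tree_ids
    }
    # sorted() is stable, so trees with equal density keep their input order.
    ranked = sorted(tree_ids, key=lambda t: density[t], reverse=True)
    return ranked[:n], density

# Hypothetical stand-in: trees placed on a line, distance = gap between them.
position = {"A": 0, "B": 1, "C": 2, "D": 10, "E": 11}
centres, density = select_cluster_centres(
    list(position), lambda a, b: abs(position[a] - position[b]), 1.5, 2)
```

Here tree B has two neighbours within the threshold and every other tree has one, so B ranks first; the tie for second place is resolved by input order, which is a choice of this sketch rather than something the description specifies.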

Preferably, after successively calculating the difference degree numerical value between each pair of corresponding nodes in the first, second and third layers of two driver trees and recording the calculated results, the method further comprises:

According to the layer distance relational expression, calculating and saving the layer distance of each corresponding layer between every two driver trees;

Wherein the layer distance relational expression is: [not reproduced in the source text]

Wherein the expression gives the layer distance between the l-th layers of the driver trees of the i-th and j-th firmware, and it refers to the set of all nodes in the l-th layer of a tree;

Wherein β_v is the weight corresponding to node v; the weight expression also refers to the set of all child nodes of node v in the tree; and w is the parent node of v.

Preferably, after calculating the layer distance of each corresponding layer between every two driver trees according to the layer distance relational expression, the method further comprises:

According to the tree distance relational expression, calculating the tree distance between every two driver trees; the node distance and the tree distance both serve as the difference degree numerical value. The tree distance relational expression is: [not reproduced in the source text]

Wherein the expression gives the tree distance between the driver trees of the i-th and j-th firmware; γ is the common ratio; H(φ) is the height of a driver tree and takes a value in {1, 2, 3}; and ω_l is the layer distance weight coefficient of the l-th layer, defined by a further expression [not reproduced in the source text].

Preferably, after selecting the top N driver trees as cluster centres, the method further comprises:

Judging whether the tree distance between any two cluster centres is greater than a preset tree distance threshold; if not, taking the current (N+1)-th driver tree as a cluster centre and moving whichever of the two cluster centres under judgment has the smaller ambient density number to the last position of the ranking, and then repeating the above process, until the tree distance between any two cluster centres is greater than the preset tree distance threshold.
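The refinement loop can be sketched as below. It is one possible reading of the paragraph above: the function and parameter names are hypothetical, the tree distance is supplied by the caller, and the iteration cap is a safeguard added for this sketch that the description does not mention.

```python
def refine_centres(ranked, density, distance, tree_threshold, n):
    """ranked: tree ids sorted by ambient density number, largest first."""
    order = list(ranked)
    # The iteration cap is an added safeguard, not part of the description.
    for _ in range(len(order) ** 2):
        centres = order[:n]
        clash = next(
            ((a, b) for i, a in enumerate(centres) for b in centres[i + 1:]
             if distance(a, b) <= tree_threshold),
            None,
        )
        if clash is None:
            return centres
        # Move the lower-density centre to the end of the ranking, so the
        # current (N+1)-th tree moves up into the first N positions.
        loser = min(clash, key=lambda t: density[t])
        order.remove(loser)
        order.append(loser)
    return order[:n]

# Hypothetical stand-in data: trees on a line, distance = gap between them.
position = {"A": 0, "B": 1, "C": 2, "D": 10, "E": 11}
density = {"A": 1, "B": 2, "C": 1, "D": 1, "E": 1}
centres = refine_centres(
    ["B", "A", "C", "D", "E"], density,
    lambda a, b: abs(position[a] - position[b]), 3, 2)
```

With a threshold of 3, centres A and then C are demoted in turn because each lies too close to B, and the loop stops once B and D, which are 9 apart, occupy the first two positions.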

In order to solve the above technical problem, the present invention also provides an Internet of Things firmware program classification device, comprising:

An extraction module, configured to extract the readable character strings in each firmware to be classified;

A tree construction module, configured to construct the driver tree of the firmware from the readable character strings, wherein the root node of the driver tree is the number of the firmware, the second-layer nodes are the program part types to which the readable character strings belong, the third-layer nodes are the information types of the readable character strings, and the fourth-layer nodes are the contents of the corresponding readable character strings;

A distance calculation module, configured to successively calculate the difference degree numerical value of corresponding nodes between every two driver trees among all driver trees and to record the calculated results, wherein each calculated result includes the identifiers of the two driver trees, the identifier of the corresponding node and its difference degree numerical value;

A cluster module, configured to screen, according to the calculated results, the N driver trees with the largest ambient density as cluster centres and to cluster around them to obtain several firmware classes, on which subsequent firmware analysis and repair are based; N is a positive integer.

The present invention provides an Internet of Things firmware program classification method and device. After the readable character strings of the firmware are extracted, a driver tree is constructed for each firmware from them; the second-layer nodes of the driver tree are the program part types to which the readable character strings belong, and the third-layer nodes are the information types of the readable character strings. The difference degree numerical value of corresponding nodes between every two driver trees is then calculated. This value expresses how much the contents of the two corresponding nodes differ: the smaller it is, the more similar the contents, which can also be understood as the two nodes being closer together. Based on the calculated results between each driver tree and the other driver trees, the N driver trees with the largest ambient density are then screened out as cluster centres and clustering is performed, which completes the firmware classification for subsequent firmware analysis and repair. The present invention thus uses a product structure tree to organize all readable character strings, so that during clustering two firmware images are not placed in one class merely because a single section is highly similar; instead, the calculated results of all nodes in the tree are considered together, that is, the similarity between all readable character strings of the firmware is taken into account. After classification, firmware code within one class shares the same reused code as far as possible, so the similarity between code of the same class is as high as possible, which improves the classification accuracy and in turn reduces the workload of subsequent firmware repair.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention more clearly, the drawings needed for the prior art and the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can obtain other drawings from them without creative effort.

Fig. 1 is a structural schematic diagram of a driver tree provided by the invention;

Fig. 2 is a flow chart of an Internet of Things firmware program classification method provided by the invention;

Fig. 3 is a flow chart of another Internet of Things firmware program classification method provided by the invention;

Fig. 4 is a structural schematic diagram of an Internet of Things firmware program classification device provided by the invention.

Detailed description of the embodiments

The core of the present invention is to provide an Internet of Things firmware program classification method and device that organize all readable code sections of the firmware by means of a product structure tree, so that the similarity between all readable character strings is taken into account during classification, which improves the classification accuracy and in turn reduces the workload of subsequent firmware repair.

To make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The present invention provides an Internet of Things firmware program classification method. Referring to Fig. 2, which is a flow chart of the method, the method comprises:

Step s1: extract the readable character strings in each firmware to be classified;

Across different hardware platforms, different compilers and different compiler options, some program content never changes greatly, namely the readable character strings in the binary code, which remain similar under different compilation environments. When classifying firmware programs, the present invention therefore relies on these shared character strings.

Step s2: construct the driver tree of the firmware from the readable character strings. The root node of the driver tree is the number of the firmware, the second-layer nodes are the program part types to which the readable character strings belong, the third-layer nodes are the information types of the readable character strings, and the fourth-layer nodes are the contents of the corresponding readable character strings. That is, the different readable character strings (code sections) of the Internet of Things firmware program are placed in the fourth layer (leaf nodes) of the tree according to the contents of the second-layer and third-layer nodes of the driver tree.

Product structure tree (PST): a tree-shaped hierarchical diagram that describes the material composition of a product and the make-up of each of its parts. It combines the product information held in product data management with the hierarchical relationships between components to form an effective attribute management structure. A product structure tree organizes the components of a product according to the product's hierarchy, so it can clearly describe the relationships between the product's components and parts. Each node of the tree represents a component, part or assembly, and can be associated with attribute information such as drawing number, material, specification and model, as well as with related documents. In a PST the root node represents the product, branch nodes represent components or sub-components, and leaf nodes represent parts. The division into levels must reflect the functional division and composition of the product and must take production and business needs into account. Once the overall design of the product is complete, the product structure tree is used to divide the product by function and to turn it into a physical product. The number of levels depends on the complexity of the product and also varies with an enterprise's management model: some enterprises represent a whole product series with one tree, while others use one tree per product.

The present invention exploits these characteristics of the product structure tree to construct the driver tree, which can also be understood as the product structure tree obtained when the product is a firmware driver. The programs of the firmware are organized layer by layer according to the constraint relationships and semantic relationships between levels, and all readable character strings contained in the firmware are taken into account during subsequent clustering, so that each class of code files presented to online professionals is, as far as possible, generated from as few reused-module source codes as possible. It will be understood that when the firmware code in a class has low similarity, the class contains a large amount of different reused-module source code, that is, the overlapping part differs from one pair of firmware images in the class to another: for example, firmware A and firmware B share reused-module source code 1, while firmware A and firmware C share reused-module source code 2, and so on. This lowers the classification accuracy.

Referring to Fig. 1, which is a structural schematic diagram of a driver tree provided by the invention: the root node is the number of the Internet of Things firmware program file, which distinguishes different Internet of Things firmware files. The second-layer nodes are the parts that make up the Linux embedded system driver and the Windows CE embedded system driver (this constitutes a first level of classification, separating the different program parts of the same firmware program). Each third-layer node represents an information category of readable character strings in the different environments. The fourth-layer nodes consist of the program contents in the firmware, which form different leaf nodes according to the contents of the second-layer and third-layer nodes; the fourth-layer nodes thus hold the actual Internet of Things firmware program content. For a node in the second or third layer, if all of its child nodes are empty, the node itself is deleted. What results is the driver tree of the corresponding firmware program file.
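A minimal sketch of such a four-layer structure is given below, using nested dictionaries. The function name, the dictionary keys and the labels "linux_driver" and "wince_driver" are hypothetical; the single-letter information types follow the legend of Fig. 1 (O, B, E, D, V, S). Because a branch is only created when there is non-empty content to store, second-layer and third-layer nodes with no children never appear, which has the same effect as deleting empty nodes afterwards.

```python
def build_driver_tree(firmware_id, records):
    """records: iterable of (program_part, info_type, content) triples."""
    tree = {"firmware": firmware_id, "parts": {}}   # layer 1: firmware number
    for program_part, info_type, content in records:
        if not content:
            continue  # nothing to store, so no node is created
        (tree["parts"]
            .setdefault(program_part, {})   # layer 2: program part type
            .setdefault(info_type, set())   # layer 3: information type
            .add(content))                  # layer 4: string content (leaf)
    return tree

tree = build_driver_tree("fw-001", [
    ("linux_driver", "E", "error: bad checksum"),
    ("linux_driver", "V", "version 2.0.1"),
    ("wince_driver", "D", ""),  # empty content: this branch is never created
])
```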

Step s3: successively calculate the difference degree numerical value of corresponding nodes between every two driver trees among all driver trees, and record the calculated results; each calculated result includes the identifiers of the two driver trees, the identifier of the corresponding node and its difference degree numerical value;

Here, corresponding nodes are nodes that occupy the same position in the two driver trees. For example, if the node currently being calculated is the second-layer node of the first driver tree that represents the terminal service program, its corresponding node is the second-layer node of the second driver tree that represents the terminal service program. The difference degree numerical value (which may also be called the node distance) expresses how much the contents of the two corresponding nodes differ; the smaller it is, the more similar the contents of the two corresponding nodes;

Step s4: screen, according to the calculated results, the N driver trees with the largest ambient density as cluster centres and cluster around them to obtain several firmware classes, on which subsequent firmware analysis and repair are based; N is a positive integer.

It will be understood that the ambient density of a driver tree is the number of other driver trees whose difference degree numerical value with respect to it is smaller than a certain threshold. The larger this number, the higher the density around the driver tree, that is, the closer the driver tree is to a cluster centre.
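The record-keeping of step s3 can be sketched as follows. The node distance relational expression is not reproduced in this text, so a Jaccard distance between the string sets of two corresponding nodes is swapped in here purely as a stand-in; it is not the patent's expression. The flat tree layout keyed by (program part, information type), the function names and the sample trees are likewise hypothetical.

```python
from itertools import combinations

def jaccard_distance(a, b):
    """Stand-in difference value between two sets of strings."""
    if not a and not b:
        return 0.0
    return 1.0 - len(a & b) / len(a | b)

def pairwise_node_records(trees):
    """trees: {tree_id: {(program_part, info_type): set_of_strings}}.

    Returns (tree_id_i, tree_id_j, node_id, difference) tuples for every
    node position that occurs in either tree of a pair.
    """
    records = []
    for (id_i, tree_i), (id_j, tree_j) in combinations(trees.items(), 2):
        for node_id in sorted(set(tree_i) | set(tree_j)):
            diff = jaccard_distance(tree_i.get(node_id, set()),
                                    tree_j.get(node_id, set()))
            records.append((id_i, id_j, node_id, diff))
    return records

trees = {
    "fw-001": {("linux_driver", "E"): {"bad checksum", "timeout"}},
    "fw-002": {("linux_driver", "E"): {"bad checksum"}},
}
records = pairwise_node_records(trees)
```

Each record carries the three items named in step s3: the identifiers of the two trees, the identifier of the corresponding node, and the difference value. The ambient density of step s4 can then be counted from these records.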

The present invention thus provides an Internet of Things firmware program classification method. After the readable character strings of the firmware are extracted, a driver tree is constructed for each firmware from them; the second-layer nodes are the program part types to which the readable character strings belong, and the third-layer nodes are the information types of the readable character strings. The difference degree numerical value of corresponding nodes between every two driver trees is then calculated; the smaller this value, the more similar the contents of the two corresponding nodes, which can also be understood as the two nodes being closer together. Based on the calculated results between each driver tree and the other driver trees, the N driver trees with the largest ambient density are screened out as cluster centres and clustering is performed, which completes the firmware classification for subsequent firmware analysis and repair. Because the product structure tree organizes all readable character strings, two firmware images are not placed in one class merely because a single section is highly similar; the calculated results of all nodes in the tree are considered together, that is, the similarity between all readable character strings of the firmware is taken into account. After classification, firmware code within one class shares the same reused code as far as possible, so the similarity between code of the same class is as high as possible, which improves the classification accuracy and in turn reduces the workload of subsequent firmware repair.

Wherein, the process of step s1 is specifically:

Extracting readable character strings from the data segment and code segment of the firmware to be classified; the readable character strings include variable names, output information, error messages, debugging information, version information and symbol characters.

It will be understood that these readable character strings include banners (variable names), output messages, error messages, debugging messages, version strings and symbol table strings (for example, an ASCII table). In Fig. 1, O: output message; B: banners; E: error message; D: debugging message; V: version strings; S: symbol table strings. The readable character strings are composed of the characters a~z and A~Z and remain similar under different compilation environments. Most of them are stored in the data segment of the binary code, and a small part is stored in the code segment. Most strings in Internet of Things firmware are encoded in ASCII, and the rest in Unicode. The code extraction stage therefore has four main targets, with the corresponding methods as follows:

ASCII code in the data segment: it can be extracted with the strings utility under Linux (extracting readable character strings longer than 6 characters). Regarding ASCII code: information coding is the conversion of one symbol system that represents information into another symbol system that is convenient for a computer or a person to recognize and process; or, within one system, the change from one form of information representation to another. For example, people express emotions through simple actions such as gestures, facial expressions, glances and speech; in ancient warfare, beating the drum signalled an advance and sounding the gong signalled a retreat; the yellow, green and red of traffic lights mean proceed slowly, go and stop respectively. All of these are simple information codes. In a computer, information is represented in binary, which is hard for people to read. Computers are therefore equipped with input and output devices whose main purpose is to present information in a human-readable form. To ensure correct information exchange between people and devices and between devices and computers, a unified information interchange code was established: the ASCII code table. The computer translates input symbols into binary codes made up of "0" and "1" according to fixed rules, processes the binary codes, and finally converts the results back into symbols that people can recognize and outputs the corresponding information. ASCII is at present the information code generally used inside computers. A standard ASCII code consists of 7 binary bits and represents the 26 English letters in upper and lower case together with some additional characters.

ASCII code in the code segment: the code here generally defines and stores local variables, or handles function calls and address relocation. At this stage the code mainly works through the stack, so it suffices to identify each stack and extract its contents. A stack can be identified by recognizing a series of pop and push instructions. One method of assigning readable character strings to their different roles is as follows. First, identify consecutive push-type (stack-entry) instructions. Then, extract the operands of these consecutive push-type instructions and, for each identified stack structure, assemble the extracted operands into a data stream. Finally, these data streams constitute readable character strings.

Unicode code in the data segment: the corresponding code is extracted according to information, which differs between hardware platforms, that indicates where this code is located; on identical hardware, its location is relatively fixed. Unicode is a single character set in which Chinese, Japanese and Korean characters together occupy the range 0x3000 to 0x9FFF. Unicode currently generally uses the UCS-2 standard, which encodes each character in two bytes; for example, the Chinese character "经" is encoded as 0x7ECF. Character codes are usually written in hexadecimal, prefixed with 0x to distinguish them from decimal; 0x7ECF is 32463 in decimal. Because UCS-2 uses two bytes, that is, 16 binary bits, and 2 to the power of 16 is 65536, UCS-2 can encode at most 65536 characters. Characters 0 to 127 are encoded the same as in ASCII: the Unicode code of the letter "a" is 0x0061, or 97 in decimal, and its ASCII code is 0x61, also 97 in decimal. Because there are so many Chinese characters and UCS-2 can represent at most 65536 characters, Unicode under UCS-2 has to leave out some rarely used Chinese characters so that the remaining common ones can be represented. To represent all Chinese characters, Unicode also has the UCS-4 specification, which encodes each character in 4 bytes; under it, most readable characters from different countries and regions can be represented.

The Unicode code of the code segment: this kind of code is very rare, so the present invention ignores it.

The process by which a program goes from source code to an executable program is as follows:

One, preprocessing (precompilation): this mainly processes the precompile instructions beginning with "#" in the source code file. The processing rules are as follows:

1. Delete all #define directives and expand all macro definitions.

2. Process all conditional precompile instructions, such as "#if", "#endif", "#ifdef", "#elif", and "#else".

3. Process "#include" precompile instructions by substituting the content of the included file at the position of the instruction. This process is recursive, because an included file may itself include other files.

4. Delete all comments, both "//" and "/* */".

5. Retain all #pragma compiler directives, because the compiler needs them; for example, #pragma once prevents a file from being included repeatedly.

6. Add line numbers and file identifiers, so that the compiler can generate line number information for debugging and can display line numbers when a compile error or warning occurs.

Two, compilation: the xxx.i or xxx.ii file generated by preprocessing undergoes lexical analysis, syntactic analysis, semantic analysis, and optimization, after which the corresponding assembly code file is generated. The main steps are as follows:

1. Lexical analysis: using an algorithm similar to a "finite state machine", the source program is fed into a scanner, which divides its character sequence into a series of tokens.

2. Syntactic analysis: the syntax analyzer performs syntactic analysis on the tokens produced by the scanner and generates a syntax tree. The syntax tree output by the syntax analyzer is a tree whose nodes are expressions.

3. Semantic analysis: the syntax analyzer only completes analysis at the level of expression syntax, whereas the semantic analyzer judges whether an expression is meaningful. The semantics it analyzes are static semantics, that is, semantics that can be determined at compile time; the corresponding dynamic semantics can only be determined at run time. Static semantics generally include the matching of declarations and types and the conversion of types, so semantic analysis checks these aspects. For example, if an int is assigned to an int*, the semantic analyzer finds that the types do not match and the compiler reports an error.

4. Optimization: this is optimization at the source code level. The source code optimizer converts the entire syntax tree into intermediate code, a sequential representation of the syntax tree that is very close to the target code. There are many kinds of intermediate code; the most common are "three-address code" and "P-code". The basic form of three-address code is x = y op z, which means that the result of applying operation op to variables y and z is assigned to x; op can be addition, subtraction, multiplication, division, and so on.

5. Target code generation: the code generator converts the intermediate code into target machine code, producing a sequence of code expressed in assembly language.

6. Target code optimization: the target code optimizer optimizes the above target machine code, for example by choosing suitable addressing modes, replacing multiplication with shifts, and deleting redundant instructions.
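To illustrate the three-address form x = y op z mentioned in the optimization step, the following minimal sketch lowers an arithmetic expression into three-address instructions. It borrows Python's own `ast` parser as a stand-in for a compiler front end; the function name and the temporary names t1, t2, ... are illustrative assumptions, and no real compiler is claimed to work exactly this way.

```python
import ast
from typing import List

def to_three_address(expr: str) -> List[str]:
    """Lower an arithmetic expression into three-address instructions."""
    ops = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*", ast.Div: "/"}
    code: List[str] = []

    def lower(node: ast.expr) -> str:
        # Leaves are returned as-is; each binary operation gets a fresh temporary.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in ops:
            left, right = lower(node.left), lower(node.right)
            temp = f"t{len(code) + 1}"
            code.append(f"{temp} = {left} {ops[type(node.op)]} {right}")
            return temp
        raise ValueError("unsupported expression")

    lower(ast.parse(expr, mode="eval").body)
    return code

for line in to_three_address("a + b * c"):
    print(line)
# t1 = b * c
# t2 = a + t1
```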

Three, assembly: the assembly code is converted into instructions that the machine can execute (a machine code file).

Compared with the compiler, the assembly process of the assembler is simpler: there is no complicated syntax, no semantics, and no need for optimization. The assembler simply translates instructions one by one according to a table mapping assembly instructions to machine instructions.

Four, linking: the files in the same project are combined into one complete binary program.

Five, loading: the binary program is combined with the hardware so that it can run on the hardware platform.

Preferably, after step s1 and before step s2, the method further includes:

Judging whether a readable character string is a platform-related readable character string or a link-library-related readable character string; if so, deleting that readable character string; if not, continuing to judge the next extracted readable character string, until all extracted readable character strings have been judged;

Correspondingly, the driver tree of the firmware is subsequently constructed from the readable character strings that remain after this judgment;

Wherein, the process of judging whether a readable character string is a platform-related readable character string is:

Judging whether the information quantity of the readable character string is greater than a preset platform-correlation threshold (the threshold can be adjusted in actual work, and the present invention does not limit its specific value); if so, the readable character string is platform-related; otherwise, it is not platform-related. The information quantity of the readable character string is specifically:

It is understood that, besides the content of the executable program itself, readable character strings also contain features introduced by the hardware platform and by the compiler's own characteristics. These features do not help classification; instead they increase its complexity and reduce its accuracy. It is therefore preferable to filter out this part of the data, which improves classification accuracy and reduces the amount of computation. In addition, all instructions that mark the hardware platform, compiler version, and compiler options can also be filtered out. The above is only a preferred embodiment; what content is filtered, and how, can be set according to actual needs.

Specifically, the buildroot tool may be used here: from all the cross-compiled files, a blacklist is created for each target platform. The blacklist contains the readable character strings related to the target platform, together with the signature strings of kernel-level and system-level libraries (under Linux these libraries are generally located in /lib and /usr/lib). Filtering mainly consists of removing from the extracted code the readable character strings that appear in the blacklist. Buildroot is a framework on the Linux platform for building embedded Linux systems; it consists entirely of Makefile scripts and Kconfig configuration files. As with compiling the Linux kernel, a configuration can be modified through menuconfig and then built, producing a complete Linux system image that can be flashed directly onto a machine and run (including the bootloader, the kernel, the rootfs, and the libraries and applications in the rootfs). Of course, other tools may also be used to filter readable character strings; the present invention does not limit this.
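The filtering step itself reduces to a set-membership test. The sketch below assumes the blacklist has already been produced (for example from cross-compiled files, as described above); the function name and the blacklist entries are hypothetical examples, not output of any real buildroot run.

```python
from typing import Iterable, List, Set

def filter_strings(extracted: Iterable[str], blacklist: Set[str]) -> List[str]:
    """Drop every extracted readable string that appears in the blacklist."""
    return [s for s in extracted if s not in blacklist]

# Hypothetical blacklist entries for one target platform.
blacklist = {"libc.so.6", "/lib/ld-linux.so.3", "GCC: (Buildroot)"}

extracted = ["libc.so.6", "login failed", "GCC: (Buildroot)", "upload_firmware"]
print(filter_strings(extracted, blacklist))  # ['login failed', 'upload_firmware']
```

Using a set keeps each lookup constant-time, which matters when many firmware images are filtered against the same blacklist.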

In a specific embodiment, in step s3, the process of calculating the difference degree value of corresponding nodes between two driver trees is specifically:

Step s31: according to the node distance relation, successively calculate the difference degree value between each pair of corresponding nodes in the first, second, and third layers of the two driver trees, and record the calculation results. The node distance relation is:

It is understood that the corresponding nodes here are the nodes of the first, second, and third layers of the driver tree. The leaf nodes of the fourth layer are the program content actually extracted from the Internet of Things firmware; a third-layer node is the common attribute of its child nodes (that is, of fourth-layer nodes), and a second-layer node is in turn the common attribute of its child nodes. Like a classification standard, the second-layer and third-layer nodes sort the code of different functions and different locations in the firmware into the fourth-layer nodes (leaf nodes). Therefore, corresponding nodes differ not only in the fourth layer, because program content differs, but also in the second and third layers. Because the specific programs in the fourth layer differ, a type of information may have no fourth-layer nodes, in which case the corresponding third-layer node is absent as well. This produces differences between third-layer nodes, so third-layer nodes also have node distances; the same reasoning applies to the second layer. The node distance relation is computed from the information in a node's child nodes. Thus the distance of the first-layer node is calculated from the second-layer nodes, the node distances of the second layer from the third-layer nodes, and the node distances of the third layer from the fourth-layer nodes, which is why distances must be calculated for the first three layers. The node distance relation uses the Jaccard similarity algorithm: the higher the Jaccard similarity, the smaller the distance. By calculating the difference degree values between the nodes of the first, second, and third layers of two driver trees, the final calculation result for the two trees captures as much as possible of the similarity between all of their readable character strings. When clustering is later performed on the calculation results between every two driver trees, the clustering can then maximize the similarity within each class of firmware programs, reducing the workload of staff carrying out firmware repair. In addition, each time a driver tree is compared with another driver tree, multiple groups of calculation results are obtained. Each group includes the node identifier of one pair of corresponding nodes (indicating which position of the driver tree is being compared), the identifiers of the driver trees involved, and the difference degree value. After all calculations are complete, every driver tree therefore has multiple groups of calculation results containing its own identifier.
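As an illustration of a Jaccard-based node distance, the sketch below compares the child sets of two corresponding nodes. The exact node distance relation of the invention is not reproduced in this text, so the form used here, distance = 1 - |A ∩ B| / |A ∪ B|, is an assumption consistent with the statement that higher Jaccard similarity means smaller distance. The function names and the convention for two empty sets are also assumptions.

```python
from typing import Set

def jaccard_similarity(a: Set[str], b: Set[str]) -> float:
    """|A ∩ B| / |A ∪ B|; two empty child sets are treated as identical (assumption)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def node_distance(children_a: Set[str], children_b: Set[str]) -> float:
    """Assumed difference degree value of two corresponding nodes."""
    return 1.0 - jaccard_similarity(children_a, children_b)

# Child sets of the same third-layer node in two driver trees (illustrative strings).
tree1_children = {"login failed", "admin", "upload_firmware"}
tree2_children = {"admin", "upload_firmware", "reboot"}
print(node_distance(tree1_children, tree2_children))  # 0.5
```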

In a preferred embodiment, the calculation results can be recorded using labels. That is, after each pair of corresponding nodes is calculated, a label can be attached to the driver trees containing those nodes. The label structure is <driver tree i, driver tree j, corresponding node, difference degree value>. Since a calculation result is a result between two driver trees, the label can be attached to each of the two driver trees just compared; the labels on the two trees differ only in the order of the driver tree identifiers and are otherwise identical. The core idea of the present invention is to find Internet of Things firmware programs that reuse the same module code and, after clustering, submit them to online professionals for vulnerability analysis and repair. Applied to driver trees, this means asking whether a branch from the root node to a fourth-layer leaf node has a very small distance, or difference degree value (that is, a very high similarity), across different driver trees. The label structure <driver tree 1, driver tree 2, corresponding node, difference degree value> is designed so that the different branches from root node to leaf node can be told apart. The effect of the label is as follows:

If the similarity indicated by the difference degree value in a label exceeds a specific threshold (which can be adjusted according to working conditions and is not limited in this patent), the similar program code between the firmware corresponding to the two driver trees in the label can be located from the node position in the label. The structure of the driver tree distributes program code of different locations and different functions in the firmware to different leaf nodes. Therefore, when the difference degree value of corresponding nodes in two different driver trees is very small (the similarity of the nodes is greater than the threshold), it can be determined that a program module is reused at this node, which facilitates subsequent clustering.
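The label structure <driver tree i, driver tree j, corresponding node, difference degree value> and its mirrored copy can be sketched as below. The class name, field names, the helper functions, and the path-like node identifier are all illustrative assumptions; only the four-field structure comes from the text.

```python
from typing import Dict, List, NamedTuple

class Label(NamedTuple):
    tree_i: str        # identifier of the tree carrying this label
    tree_j: str        # identifier of the tree it was compared with
    node: str          # identifier of the corresponding node
    difference: float  # difference degree value

def attach_labels(labels: Dict[str, List[Label]], tree_a: str, tree_b: str,
                  node: str, difference: float) -> None:
    """Attach the same result to both trees; only the order of identifiers differs."""
    labels.setdefault(tree_a, []).append(Label(tree_a, tree_b, node, difference))
    labels.setdefault(tree_b, []).append(Label(tree_b, tree_a, node, difference))

def reused_nodes(labels: Dict[str, List[Label]], tree: str, threshold: float) -> List[Label]:
    """Labels of one tree whose difference is small enough to suggest module reuse."""
    return [lab for lab in labels.get(tree, []) if lab.difference < threshold]

labels: Dict[str, List[Label]] = {}
attach_labels(labels, "fw1", "fw2", "code segment/ASCII", 0.05)
attach_labels(labels, "fw1", "fw3", "code segment/ASCII", 0.80)
print(reused_nodes(labels, "fw1", threshold=0.1))
```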

Preferably, in step s4, the process of screening out the top N driver trees with the largest surrounding density as cluster centres according to the calculation results includes:

Determining, from the calculation results, all difference degree values between each driver tree and all other driver trees;

Counting, among all the difference degree values corresponding to each driver tree, the number of difference degree values smaller than a preset distance threshold (which can be adjusted according to working conditions and is not limited in the present invention), and taking this number as the surrounding density count of that driver tree;

Sorting all driver trees in descending order of surrounding density count and selecting the top N driver trees as cluster centres.

It is understood that every driver tree has multiple groups of calculation results, or multiple labels. The difference degree value in each group is compared with the preset distance threshold, and the number of difference degree values below the threshold is recorded for the driver tree; this is the number of cases in which the difference degree value between a node of this driver tree and a node of another driver tree is below the preset distance threshold. The higher this number, the more similar, and the closer, the driver tree is to other driver trees. After sorting in descending order of this number, a driver tree nearer the front therefore has a higher density of driver trees around it, that is, more driver trees similar to it, and is preferentially taken as a cluster centre. A large surrounding density for a chosen cluster centre indicates program reuse between the firmware corresponding to the cluster centre and most other firmware; firmware grouped with that cluster centre is therefore highly similar, and classification accuracy is high. As mentioned above, a difference degree value below the preset distance threshold indicates reuse between the code segments corresponding to the two nodes. This approach considers the similarity between all nodes of a firmware and other firmware, so that different Internet of Things firmware reusing the same module can be gathered into one cluster, and the classification results are more accurate. Compared with the existing method, which uses minhash and LSH, the method of the present invention yields more clusters with fewer Internet of Things firmware files in each, which makes subsequent firmware analysis and repair easier and reduces the workload of staff.
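The density count and the top-N selection can be sketched as follows. The sketch assumes each pairwise result is stored once as a (tree i, tree j, node, difference) tuple and counted for both trees; the function names, the alphabetical tie-break, and the sample data are assumptions made for illustration.

```python
from typing import Dict, List, Tuple

Result = Tuple[str, str, str, float]  # (tree_i, tree_j, node, difference degree value)

def surrounding_density(results: List[Result], threshold: float) -> Dict[str, int]:
    """Count, per driver tree, the difference values below the distance threshold."""
    density: Dict[str, int] = {}
    for tree_i, tree_j, _node, diff in results:
        for tree in (tree_i, tree_j):
            density.setdefault(tree, 0)   # trees with no close neighbour still appear
            if diff < threshold:
                density[tree] += 1
    return density

def top_n_centres(density: Dict[str, int], n: int) -> List[str]:
    """Sort by density, largest first (ties broken by name), and keep the first n."""
    return sorted(density, key=lambda t: (-density[t], t))[:n]

results = [
    ("fw1", "fw2", "code segment/ASCII", 0.1),
    ("fw1", "fw3", "code segment/ASCII", 0.2),
    ("fw2", "fw3", "code segment/ASCII", 0.9),
]
density = surrounding_density(results, threshold=0.3)
print(density)                    # {'fw1': 2, 'fw2': 1, 'fw3': 1}
print(top_n_centres(density, 2))  # ['fw1', 'fw2']
```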

In addition to the distance (that is, the degree of similarity) between corresponding nodes in the first, second, and third layers of different trees, the total distance over all nodes in the same layer of different trees, that is, the layer distance, may also be considered. The reasoning is that different driver trees may reuse the same program module in many places: position 1 of a driver tree may share a reused program module with driver tree 1, while position 2 shares a reused program module with driver tree 2. Whether program modules are reused can be judged from the layer distance, which is the degree of similarity between corresponding layers of different driver trees.

Preferably, after successively calculating the difference degree value between each pair of corresponding nodes in the first, second, and third layers of two driver trees and recording the calculation results, the method further includes:

Step s32: according to the layer distance relation, calculate and save the layer distance of corresponding layers between every two driver trees;

Wherein, the layer distance relation is:

Wherein, the quantities in the relation are: the layer distance between the two driver trees at layer l; the set of all nodes in layer l; and β_v, the weight corresponding to node v.

It is understood that the layer distance of driver trees is the sum of the difference degree values (that is, the node distances) of all nodes in each layer. For a node in the first, second, or third layer, the larger the node distance, the lower the Jaccard similarity of its child nodes, and the larger the value it contributes to the corresponding layer distance. Conversely, if a node's distance is very small, the Jaccard similarity of its child nodes is high, and this node distance contributes very little to the layer distance (in the present invention, the influence of such a node distance on the layer distance is ignored). Specifically, this is achieved through the weight coefficient β_v applied to each node distance in the weighted sum. In addition, since the first layer is the root node, which is just the firmware number, it does not by itself represent a difference in the content of two firmware images. However, from the node distance relation, the distance between corresponding nodes of two driver trees is calculated from the sets of their child nodes. The node distance of the first layer is therefore obtained from the information of the second-layer nodes, the distance between corresponding nodes in the second layer is obtained from the third-layer nodes, and the distance between corresponding nodes in the third layer is obtained from the fourth-layer node information. The fourth layer reflects the differences between drivers, but its quantified value appears in the layers above.
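A weighted sum of this kind can be sketched as below. The exact relations for the layer distance and for β_v are not reproduced in this text, so the sketch makes two assumptions: the layer distance is the sum of β_v times the node distance over the nodes of the layer, and β_v is 0 when the node distance falls below a reuse threshold and 1 otherwise. The function name, the threshold value, and the node identifiers are illustrative.

```python
from typing import Dict

def layer_distance(node_distances: Dict[str, float], reuse_threshold: float) -> float:
    """Weighted sum of the node distances of one layer.

    Assumed weight: beta_v = 0 for nodes whose distance is below the reuse
    threshold (their influence is ignored), beta_v = 1 otherwise.
    """
    total = 0.0
    for _node, dist in node_distances.items():
        beta = 0.0 if dist < reuse_threshold else 1.0
        total += beta * dist
    return total

# Node distances of the third layer between two driver trees (illustrative values).
third_layer = {"code segment/ASCII": 0.05, "data segment/Unicode": 0.5, "data segment/ASCII": 1.0}
print(layer_distance(third_layer, reuse_threshold=0.1))  # 1.5
```

With this choice, a layer in which every node shows reuse has layer distance 0, which matches the statement that reuse makes the layer distance small.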

With the above relation for the weight coefficient, when the layer distance between different driver trees is calculated, the influence on the layer distance of nodes that exhibit program reuse is ignored. The threshold here is again set according to actual working conditions. A node distance below the threshold indicates program reuse in the fourth-layer programs of the two driver trees, and the weight coefficient of that node distance is set to 0 when the layer distance is calculated; the layer distance between two driver trees with program reuse is therefore small. In this way, program reuse in driver modules is tied to the layer distance. Without program reuse, the layer distance between different driver trees is large; that is, the more program reuse there is between two driver trees, the smaller their layer distance.

Preferably, after step s32, the method further includes:

Step s33: according to the tree distance relation, calculate the tree distance between every two driver trees; wherein, the tree distance relation is:

In addition, in the above relation for the layer-distance weight coefficient, γ adjusts the weight ω_l; the size of γ can therefore be selected according to actual working conditions, with the following effects:

γ = 0: only the root node contributes to the tree distance, and the contributions of the other layer distances to the tree distance are ignored;

0 < γ < 1: the layer distance of the layer containing a node's parent contributes more than the layer distance of the layer containing the node itself;

γ = 1: the layer distances of all layers (1, 2, 3) contribute equally;

γ > 1: the layer distance of the layer containing a node's children contributes more than the layer distance of the layer containing the node itself.

Here γ is in principle greater than 1, so that the layer distances of lower layers contribute more to the tree distance. From the structure of the tree, every driver tree has a root node (the driver file serial number), so the first layer contributes least to the tree distance. For each node of the second layer, the structure of this layer differs only when all child nodes of a second-layer node are empty; otherwise this layer's contribution to the tree distance still depends on its child nodes (the third-layer nodes). The node distances of the third layer (which actually reflect the information of the fourth-layer nodes) have the decisive influence on the tree distance. The node distances of this layer generally differ between driver trees; if different driver trees have close node distances on this layer, this indicates program reuse in the readable character strings. Of course, the above is only a preferred scheme, and the present invention does not limit the specific value of γ.
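The four cases of γ listed above can be reproduced with a simple weighting. The tree distance relation and the relation for ω_l are not reproduced in this text, so the sketch assumes ω_l = γ^(l-1) and a tree distance equal to the sum of ω_l times the layer distance of layer l; this is one form consistent with the four cases, not necessarily the form used in the invention. The function name and sample values are illustrative.

```python
from typing import Sequence

def tree_distance(layer_distances: Sequence[float], gamma: float) -> float:
    """Assumed tree distance: sum over layers l = 1, 2, 3 of gamma**(l - 1) * d_l."""
    return sum(gamma ** (l - 1) * d for l, d in enumerate(layer_distances, start=1))

layers = [1.0, 2.0, 3.0]           # layer distances of layers 1, 2, 3 (illustrative)
print(tree_distance(layers, 0.0))  # 1.0   only the first layer contributes
print(tree_distance(layers, 0.5))  # 2.75  upper layers weigh more
print(tree_distance(layers, 1.0))  # 6.0   all layers weigh equally
print(tree_distance(layers, 2.0))  # 17.0  lower layers weigh more
```

In Python, `0.0 ** 0` evaluates to 1.0, so the γ = 0 case keeps the first-layer term and drops the rest.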

It is understood that the distance between different driver trees (the tree distance) is calculated from the layer distances of the different layers of the driver trees. This method considers not only the similarity of individual nodes (the readable character strings extracted from programs in different classes of Internet of Things firmware) but also the semantic similarity of the driver tree structure as a whole. Accordingly, the tree distance between two driver trees with program reuse is small, which ties program reuse in driver modules to the tree distance. Without program reuse, the tree distance between different driver trees is large; that is, the more program reuse there is between two driver trees, the smaller their tree distance. Compared with the layer distance, the tree distance better reflects the overall similarity between driver trees. The overall similarity between driver trees can therefore be analyzed from the calculated tree distances and the clustering adjusted accordingly, making the clustering result more accurate (so that different Internet of Things firmware reusing the same module can be gathered into one cluster, which eases analysis by online personnel and greatly reduces their workload). A specific method is given in the following embodiment:

Preferably, in step s4, after the top N driver trees are selected as cluster centres, the following is further included; the adjusted step s4 includes:

Step s41: according to the calculation results, screen out the top N driver trees with the largest surrounding density as cluster centres;

Step s42: judge whether the tree distance between any two cluster centres is greater than a preset tree distance threshold; if not, take the current (N+1)th driver tree as a cluster centre, move the cluster centre with the smaller surrounding density count of the two currently judged to the last position in the ranking, and then repeat the above process, until the tree distance between any two cluster centres is greater than the preset tree distance threshold;

Step s43: cluster according to the N cluster centres obtained, obtaining several firmware classes, so that firmware analysis and repair can subsequently be carried out by firmware class.

It is understood that the foregoing clustering relies on the layer distance as its quantization. However, consider the content of the second-layer nodes of the driver tree: because Internet of Things firmware differs, one or more second-layer nodes may be deleted when the driver tree of a firmware file is generated. To make full use of the semantics of the overall structure of the driver tree (the influence of the structure of nodes at different levels on the whole tree), the present invention, having designed the layer distance as a quantitative relation, further designs the tree distance between different driver trees to quantify the differences between Internet of Things firmware files. Making full use of the structure of every level of the driver tree improves the accuracy of classifying different Internet of Things firmware.

After the cluster centres are initially determined from the node distances, overlap between classes must be avoided, so the cluster centres should be far apart from one another in tree distance. Therefore, when all Internet of Things firmware has been counted, if among the top N driver trees the tree distance between two cluster centres is smaller than the preset tree distance threshold, the two cluster centres are too close and an adjustment is needed. Since cluster centres are preferably driver trees with a large surrounding density count, the cluster centre with the smaller surrounding density count is the one replaced. After each replacement, the above check must be repeated on the resulting N cluster centres, until the tree distance between every pair of the N cluster centres is greater than the preset tree distance threshold. Through these operations, the surrounding density counts of the final N cluster centres are as large as possible while the tree distances between them are as large as possible, which ensures the accuracy of cluster centre selection.
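The replacement loop can be sketched as follows. The sketch assumes the driver trees are already ranked by surrounding density count (largest first) and that a symmetric tree distance function is available; `adjust_centres`, `tree_dist`, and the sample distances are hypothetical. The bounded loop and the final error are safeguards added for the sketch, since the text does not say what happens when no valid set of N centres exists.

```python
from itertools import combinations
from typing import Callable, List

def adjust_centres(ranked: List[str], n: int,
                   tree_dist: Callable[[str, str], float],
                   threshold: float) -> List[str]:
    """Replace centres that are too close until all pairs exceed the threshold."""
    order = list(ranked)                      # ranked by density, largest first
    for _ in range(len(order) - n + 1):       # bound: a demoted tree never re-enters
        centres = order[:n]
        clash = next(((a, b) for a, b in combinations(centres, 2)
                      if tree_dist(a, b) <= threshold), None)
        if clash is None:
            return centres
        loser = clash[1]                      # later in the ranking: lower density
        order.remove(loser)                   # the (N+1)th tree moves into the top N
        order.append(loser)                   # the loser goes to the last position
    raise ValueError("no set of n centres satisfies the tree distance threshold")

# Illustrative distances: A and B are close, every other pair is far apart.
close_pairs = {frozenset(("A", "B")): 0.1}
def tree_dist(a: str, b: str) -> float:
    return close_pairs.get(frozenset((a, b)), 1.0)

print(adjust_centres(["A", "B", "C", "D"], 2, tree_dist, threshold=0.5))  # ['A', 'C']
```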

The preset tree distance threshold represents the tree distance between two fairly different driver trees and is determined according to actual work; it may preferably be taken as the expectation of the tree distances over all driver trees. For practical purposes, the preset tree distance threshold may be reduced appropriately to speed up the selection of cluster centres. Of course, the present invention does not limit how the preset tree distance threshold is set or its value.

It should be noted that, referring to Fig. 3, Fig. 3 is only one specific implementation. Since the initial N cluster centres are calculated from node distances, step s41 only needs to be carried out after step s31; the present invention does not limit the order between step s41 and steps s32 to s33, and the two may also be carried out in parallel. That is, the N cluster centres may be calculated first, followed by steps s32 to s33 and then steps s42 and s43; or steps s31 to s33 may be carried out first, followed by steps s41 to s43; or step s41 and steps s32 to s33 may be carried out side by side, with steps s42 and s43 executed after both have finished. The present invention does not limit which implementation is adopted.

In the final stage, after the cluster centres have been determined as above, the tree distance is used as the distance criterion for clustering: the tree distance between a driver tree and each of the N cluster centres is calculated, and the Internet of Things firmware file is assigned to the class of the cluster centre with the smallest tree distance. The classified files are then handed to professionals for analysis, which reduces their workload.
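This final assignment is a nearest-centre rule and can be sketched as follows. The function `assign_to_centres`, the `tree_dist` helper, and the sample distances are illustrative assumptions; ties are resolved in favour of the centre listed first, which the text does not specify.

```python
from typing import Callable, Dict, List

def assign_to_centres(trees: List[str], centres: List[str],
                      tree_dist: Callable[[str, str], float]) -> Dict[str, List[str]]:
    """Put every driver tree into the class of the centre with the smallest tree distance."""
    clusters: Dict[str, List[str]] = {c: [] for c in centres}
    for tree in trees:
        nearest = min(centres, key=lambda c: tree_dist(tree, c))
        clusters[nearest].append(tree)
    return clusters

# Illustrative tree distances; unlisted pairs are treated as far apart.
pair_dist = {frozenset(("A", "B")): 0.1, frozenset(("C", "D")): 0.2}
def tree_dist(a: str, b: str) -> float:
    return 0.0 if a == b else pair_dist.get(frozenset((a, b)), 1.0)

print(assign_to_centres(["A", "B", "C", "D"], ["A", "C"], tree_dist))
# {'A': ['A', 'B'], 'C': ['C', 'D']}
```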

Of course, in embodiments that do not calculate the tree distance, the layer distance or node distance between driver trees can be used instead to decide which class an Internet of Things firmware file belongs to, thereby completing the clustering operation.

The present invention also provides an Internet of Things firmware program classification device. Referring to Fig. 4, Fig. 4 is a structural schematic diagram of an Internet of Things firmware program classification device provided by the present invention. The device includes:

Extraction module 1, for extracting the readable character strings in each firmware to be classified;

Structure tree construction module 2, for constructing the driver tree of the firmware according to the readable character strings; the root node of the driver tree is the number of the firmware, the second-layer nodes are the program part types to which the readable character strings belong, the third-layer nodes are the information types of the readable character strings, and the fourth-layer nodes are the contents of the corresponding readable character strings;

Distance calculation module 3, for successively calculating the difference degree value of corresponding nodes between every two driver trees among all driver trees, and recording the calculation results; each calculation result includes the identifiers of the two driver trees compared, the identifier of the corresponding nodes compared, and their difference degree value;

Cluster module 4, for screening out, according to the calculation results, the top N driver trees with the largest surrounding density as cluster centres and clustering to obtain several firmware classes, so that firmware analysis and repair can subsequently be carried out by firmware class; N is a positive integer.
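The four-layer driver tree built by the structure tree construction module can be represented with nested dictionaries, as in the minimal sketch below. The function name, the record format, and the sample values ("code segment", "ASCII", and the strings) are illustrative assumptions; only the meaning of the four layers comes from the description.

```python
from typing import Dict, Iterable, Set, Tuple

# firmware number -> program part type -> information type -> string contents
DriverTree = Dict[str, Dict[str, Dict[str, Set[str]]]]

def build_driver_tree(firmware_id: str,
                      records: Iterable[Tuple[str, str, str]]) -> DriverTree:
    """Build the tree from (program part type, information type, string) records."""
    tree: DriverTree = {firmware_id: {}}
    for part, info_type, text in records:
        tree[firmware_id].setdefault(part, {}).setdefault(info_type, set()).add(text)
    return tree

records = [
    ("code segment", "ASCII", "login failed"),
    ("code segment", "ASCII", "admin"),
    ("data segment", "Unicode", "设备"),
]
tree = build_driver_tree("fw001", records)
print(sorted(tree["fw001"]))                        # ['code segment', 'data segment']
print(sorted(tree["fw001"]["code segment"]["ASCII"]))  # ['admin', 'login failed']
```

Storing the fourth layer as a set makes the child collections directly usable in set-based similarity measures such as Jaccard similarity.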

It is apparent to those skilled in the art that, for convenience and brevity of description, the specific working process of the device described above can be understood by reference to the corresponding process in the foregoing method embodiments, and is not repeated here.

The specific embodiments above are only preferred embodiments of the present invention; they may be combined in any way, and the embodiments obtained by combination also fall within the protection scope of the present invention. It should be pointed out that other improvements and variations deduced by those of ordinary skill in the art without departing from the spirit and concept of the present invention shall all be included in the protection scope of the present invention.

It should also be noted that, in this specification, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise", and any variants thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of other identical elements in the process, method, article, or device that includes that element.

Claims

1. An Internet of Things firmware program classification method, characterized by comprising:

extracting the readable character strings in each firmware to be classified;

constructing a driver tree of the firmware according to the readable character strings; wherein the root node of the driver tree is the number of the firmware, the second-layer nodes of the driver tree are the program part types to which the readable character strings belong, the third-layer nodes of the driver tree are the information types of the readable character strings, and the fourth-layer nodes are the contents of the corresponding readable character strings;

sequentially calculating the difference degree values of corresponding nodes between every two driver trees among all the driver trees, and recording the calculated results; wherein each calculated result includes the identifiers of the two driver trees compared, the identifiers of the corresponding nodes compared, and their difference degree value;

selecting, according to the calculated results, the top N driver trees with the largest ambient density as cluster centers and performing clustering to obtain several firmware categories, according to which subsequent firmware analysis and repair are carried out; N is a positive integer.
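For illustration only, the four-layer driver tree of claim 1 can be sketched in Python as nested dictionaries. The function name `build_driver_tree`, the dictionary layout, and the input shape (tuples of program part type, information type, and string content) are assumptions made for this sketch and are not part of the claimed method.

```python
def build_driver_tree(firmware_id, strings):
    """Build a four-layer tree for one firmware image.

    Layer 1: firmware number (root)
    Layer 2: program part type the string belongs to
    Layer 3: information type of the string
    Layer 4: content of the string

    `strings` is an iterable of (part_type, info_type, content) tuples;
    this input shape is a hypothetical choice for the sketch.
    """
    parts = {}
    for part_type, info_type, content in strings:
        # Create the second- and third-layer nodes on demand, then attach the leaf.
        parts.setdefault(part_type, {}).setdefault(info_type, set()).add(content)
    return {"id": firmware_id, "parts": parts}


tree = build_driver_tree(
    "fw-001",
    [
        ("driver", "error_message", "failed to init device"),
        ("driver", "version", "v1.2.0"),
        ("bootloader", "version", "U-Boot 2016.01"),
    ],
)
```

Storing the fourth layer as a set keeps duplicate strings from inflating any later node comparison; a list would also work if multiplicity matters.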

2. The method according to claim 1, characterized in that, after extracting the readable character strings in each firmware to be classified and before constructing the driver tree of the firmware according to the readable character strings, the method further comprises:

judging whether a readable character string is a platform-related readable character string or a link-library-related readable character string; if so, deleting that readable character string; if not, continuing to judge the next extracted readable character string, until all extracted readable character strings have been judged;

judging whether the information gain of the readable character string is greater than a preset platform-dependence threshold; if so, the readable character string is a platform-related readable character string; otherwise, the readable character string is not a platform-related readable character string; wherein the information gain of the readable character string is specifically:

wherein IG(s) is the information gain; C_i is the i-th target platform; P(C_i) is the proportion of all binary files that belong to target platform C_i; P(s) is the proportion of all binary files that contain readable character string s; and P(s, C_i) is the proportion of all binary files that belong to target platform C_i and contain readable character string s.
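The platform-related filter of claim 2 can be sketched as follows. The information gain formula itself is not reproduced in this text, so the sketch assumes the common mutual-information form between the presence of string s and the target platform, built from the same quantities P(C_i), P(s), and P(s, C_i). The function name, the input shape, and the threshold value are hypothetical.

```python
import math


def information_gain(files, s):
    """Mutual information between 'file contains s' and the file's platform.

    `files` is a list of (platform, set_of_strings) pairs, one per binary file.
    ASSUMPTION: this mutual-information form stands in for the claim's formula,
    which is not available in the text.
    """
    total = len(files)
    platforms = {platform for platform, _ in files}
    p_s = sum(1 for _, strs in files if s in strs) / total
    gain = 0.0
    for c in platforms:
        p_c = sum(1 for platform, _ in files if platform == c) / total
        p_s_c = sum(1 for platform, strs in files if platform == c and s in strs) / total
        # Sum over "s present" and "s absent"; zero-probability terms contribute nothing.
        for joint, marginal in ((p_s_c, p_s), (p_c - p_s_c, 1.0 - p_s)):
            if joint > 0:
                gain += joint * math.log2(joint / (marginal * p_c))
    return gain


files = [
    ("arm", {"ld-linux-armhf.so.3", "init ok"}),
    ("arm", {"ld-linux-armhf.so.3", "init ok"}),
    ("mips", {"ld.so.1", "init ok"}),
    ("mips", {"ld.so.1", "init ok"}),
]
PLATFORM_THRESHOLD = 0.5  # hypothetical preset platform-dependence threshold

platform_related = {
    s
    for _, strs in files
    for s in strs
    if information_gain(files, s) > PLATFORM_THRESHOLD
}
```

In this toy data set the two loader names each occur on exactly one platform and are flagged for deletion, while "init ok" occurs everywhere, carries no platform information, and is kept.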

3. The method according to claim 1, characterized in that the process of calculating the difference degree values of corresponding nodes between two driver trees is specifically:

sequentially calculating, according to a node distance relation formula, the difference degree value between each pair of corresponding nodes in the first, second, and third layers of the two driver trees;

wherein the node distance relation formula is:

4. The method according to claim 3, characterized in that the process of selecting, according to the calculated results, the top N driver trees with the largest ambient density as cluster centers comprises:

counting, among all the difference degree values corresponding to each driver tree, the number of difference degree values that are less than a preset distance threshold, and taking that number as the ambient density number of the driver tree;

sorting all driver trees in descending order of ambient density number, and selecting the top N driver trees as cluster centers.
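The two steps of claim 4 can be sketched as a count followed by a sort. The sketch assumes one recorded difference degree value per pair of driver trees; the function name, the dictionary keyed by tree pairs, and the example values are hypothetical.

```python
def select_cluster_centers(tree_ids, distances, distance_threshold, n):
    """Return (top-N tree ids, ambient density numbers).

    `distances` maps a pair (tree_a, tree_b) to one difference degree value;
    this one-value-per-pair shape is an assumption made for the sketch.
    """
    density = {tree_id: 0 for tree_id in tree_ids}
    for (a, b), value in distances.items():
        # A close pair counts toward the ambient density of both trees.
        if value < distance_threshold:
            density[a] += 1
            density[b] += 1
    # Descending order of density; Python's sort is stable, so ties keep input order.
    ranked = sorted(tree_ids, key=lambda t: density[t], reverse=True)
    return ranked[:n], density


tree_ids = ["A", "B", "C", "D"]
distances = {
    ("A", "B"): 0.10, ("A", "C"): 0.20, ("A", "D"): 0.90,
    ("B", "C"): 0.80, ("B", "D"): 0.70, ("C", "D"): 0.95,
}
centers, density = select_cluster_centers(tree_ids, distances, 0.5, 2)
```

With a threshold of 0.5, tree A has two close neighbours, B and C have one each, and D has none, so the top two centers are A and B.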

5. The method according to claim 4, characterized in that, after sequentially calculating the difference degree values between corresponding nodes in the first, second, and third layers of the two driver trees and recording the calculated results, the method further comprises:

calculating, according to a layer distance relation formula, the layer distance between corresponding layers of every two driver trees;

wherein the layer distance relation formula is:

wherein the terms of the formula are the layer distance between layer l of the two driver trees and the set of all nodes in layer l of a driver tree;

wherein β_v is the weight corresponding to node v, the accompanying set is the set of all child nodes of node v in the driver tree, and w is the parent node of v.

6. The method according to claim 5, characterized in that, after calculating, according to the layer distance relation formula, the layer distance between corresponding layers of every two driver trees, the method further comprises:

calculating, according to a tree distance relation formula, the tree distance between every two driver trees; the node distance and the tree distance being the difference degree values; wherein the tree distance relation formula is:

wherein the formula gives the tree distance between the two driver trees; γ is the common ratio; H(φ) is the height of the driver tree, with H(φ) taking values in {1, 2, 3}; and ω_l is the layer distance weight coefficient of layer l; wherein:

7. The method according to claim 6, characterized in that, after selecting the top N driver trees as cluster centers, the method further comprises:

judging whether the tree distance between any two cluster centers is greater than a preset tree distance threshold; if not, taking the current (N+1)-th driver tree as a cluster center, placing the cluster center with the smaller ambient density number of the two cluster centers currently judged at the last position of the ranking, and then repeating the above process, until the tree distance between any two cluster centers is greater than the preset tree distance threshold.
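The refinement loop of claim 7 can be sketched as repeated demotion within the density ranking. The function name, the `frozenset` keys for tree distances, the tie-breaking rule, and the iteration guard are assumptions added for this sketch; the claim itself does not specify how ties or non-terminating cases are handled.

```python
from itertools import combinations


def refine_centers(ranked, density, tree_dist, n, tree_threshold):
    """Replace cluster centers that lie too close together.

    `ranked` lists tree ids in descending ambient density; `tree_dist` maps
    frozenset({a, b}) to the tree distance between a and b (assumed shape).
    """
    ranked = list(ranked)
    # Guard against endless cycling; this bound is not part of the claim.
    for _ in range(len(ranked) + 1):
        centers = ranked[:n]
        close = next(
            ((a, b) for a, b in combinations(centers, 2)
             if tree_dist[frozenset((a, b))] <= tree_threshold),
            None,
        )
        if close is None:
            return centers
        # Demote the center with the smaller density; on a tie, the later-ranked one.
        loser = min(reversed(close), key=lambda t: density[t])
        ranked.remove(loser)
        ranked.append(loser)  # the former (N+1)-th tree now moves into the top N
    raise ValueError("no set of N sufficiently separated cluster centers found")


ranked = ["A", "B", "C", "D"]
density = {"A": 2, "B": 1, "C": 1, "D": 0}
pairs = {
    ("A", "B"): 0.10, ("A", "C"): 0.20, ("A", "D"): 0.90,
    ("B", "C"): 0.80, ("B", "D"): 0.70, ("C", "D"): 0.95,
}
tree_dist = {frozenset(k): v for k, v in pairs.items()}
final_centers = refine_centers(ranked, density, tree_dist, n=2, tree_threshold=0.5)
```

Here B and then C are each too close to A and are moved to the end of the ranking, so D is promoted and the final centers are A and D.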

8. An Internet of Things firmware program classification device, characterized by comprising:

a tree construction module, configured to construct a driver tree of the firmware according to the readable character strings; wherein the root node of the driver tree is the number of the firmware, the second-layer nodes of the driver tree are the program part types to which the readable character strings belong, the third-layer nodes of the driver tree are the information types of the readable character strings, and the fourth-layer nodes are the contents of the corresponding readable character strings;

a distance calculation module, configured to sequentially calculate the difference degree values of corresponding nodes between every two driver trees among all the driver trees, and to record the calculated results; wherein each calculated result includes the identifiers of the two driver trees compared, the identifiers of the corresponding nodes compared, and their difference degree value;

a clustering module, configured to select, according to the calculated results, the top N driver trees with the largest ambient density as cluster centers and to perform clustering, obtaining several firmware categories according to which subsequent firmware analysis and repair are carried out; N is a positive integer.