CN104657665B - A file processing method - Google Patents

A file processing method

Info

Publication number
CN104657665B
CN104657665B CN201510108614.6A CN201510108614A CN104657665B CN 104657665 B CN104657665 B CN 104657665B CN 201510108614 A CN201510108614 A CN 201510108614A CN 104657665 B CN104657665 B CN 104657665B
Authority
CN
China
Prior art keywords
file
similarity
image
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201510108614.6A
Other languages
Chinese (zh)
Other versions
CN104657665A (en)
Inventor
罗阳
陈虹宇
王峻岭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Shenhu Technology Co ltd
Original Assignee
SICHUAN SHENHU TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SICHUAN SHENHU TECHNOLOGY Co Ltd
Priority to CN201510108614.6A
Publication of CN104657665A
Application granted
Publication of CN104657665B


Abstract

The invention provides a file processing method comprising: selecting feature files of predetermined types from an installation file, the feature files of the predetermined types including interface description files, audio files and image files; extracting the features of these feature files in a feature extraction step; and, based on the features, identifying whether the installation file has been disguised by comparing a similarity with a preset threshold. The method identifies disguised applications from the content features of their installation files, effectively resists the interference caused by disguising or maliciously modifying files and directories, and uses the feature extraction process to reduce the scale of the file content features and thereby improve computational efficiency.

Description

A file processing method
Technical field
The present invention relates to file processing, and in particular to a method for the security processing of application installation files.
Background art
In the mobile application field, developers submit applications to an application market, and users download applications from that market. Even official markets, however, cannot avoid malware: their security assurance mechanisms are not yet mature, so the proportion of malware remains high. Embedding known code and disguising application installation files are the main threats. Existing technical solutions obtain an application's behavior sequence with a decompilation tool or a dynamic behavior analysis tool, preprocess the behavior sequence to obtain behavior sequence features, and compare the distance between behavior sequence features to obtain a quantitative measure of whether the application is disguised. This approach can identify changes to application code, but the extraction of behavior sequence features is easily affected by code obfuscation techniques, so it has certain limitations when applied to practical problems.
Therefore, no effective solution has yet been proposed for the above problem in the related art.
Summary of the invention
To solve the problems of the prior art described above, the present invention proposes a file processing method, comprising:
selecting feature files of predetermined types from an installation file, the feature files of the predetermined types including interface description files, audio files and image files; extracting the features of these feature files in a feature extraction step; and, based on the features, identifying whether the installation file has been disguised by comparing a similarity with a preset threshold.
Preferably, the application installation file is described as the set app = {exe; lib; profile; image; audio; etc}, where exe denotes the executable bytecode in the installation file, lib the native code libraries of the program, profile the XML documents used for program data storage and layout description, image the image files of the program, audio the audio files of the program, and etc the other files of the program.
Preferably, in the feature extraction process, when the features of an image file are extracted,
the size of the image in the installation file is first reduced, the color image is converted into a grayscale image, and the average gray level is calculated; the image content feature is then extracted with a similarity hash algorithm, which generates for each image a character string, based on the image's brightness and pattern, that serves as the image's fingerprint; the similarity between images is determined from how closely their fingerprints match;
wherein reducing the image size means shrinking the image to K × K pixels, which eliminates the interference of image resolution with the similarity comparison and removes differences in image size and aspect ratio, retaining only basic information such as structure and brightness, the value of K being set to 128; and wherein the image content similarity comparison includes calculating the Hamming distance between fingerprints.
In the feature extraction process, when the features of an interface description file stored in XML format are extracted,
the XML file similarity comparison covers structural similarity and content similarity: the XML file is converted into a tree structure, the XML structural difference is obtained by comparing the differences between trees, and the XML content difference is obtained by comparing the differences between tree nodes. Interface description files in an application are stored according to predefined rules. According to the interface description file specification, a list of structure names is obtained; structural features are then extracted according to the structure name list, and the structural features and symbol information in the interface description file are filtered out to obtain the content information; finally, hash values are calculated for the structure and the content information to obtain the structural feature value and the content feature. After processing, each interface description file yields a hash array, so that comparing the content similarity of interface description files is converted into comparing the similarity of hash arrays.
In the feature extraction process, when the features of an audio file are extracted,
the file hash value is used as the audio file feature: given an input audio file stream S and a predetermined constant character string M, the MD5 hash value H1 of S is calculated; S is then added to M and the MD5 hash value H2 of the result is calculated; H1 and H2 are summed to obtain the final hash value, which serves as the content feature of the audio file.
Compared with the prior art, the present invention has the following advantages:
The present invention proposes a file processing method that identifies disguised applications by extracting the content features of their installation files, effectively resists the interference caused by disguising or maliciously modifying files and directories, and uses the feature extraction process to reduce the scale of the file content features, improving computational efficiency.
Brief description of the drawings
Fig. 1 is a flowchart of the file processing method according to an embodiment of the present invention.
Detailed description
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but it is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications and equivalents. Numerous specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention can also be practiced according to the claims without some or all of them.
Fig. 1 is a flowchart of the file processing method according to an embodiment of the present invention. The embodiment proposes a method for identifying disguised application installation files. By analyzing the attributes of application installation files, it selects file types, extracts content features with a different content feature extraction algorithm for each file type, and assigns weights to the resulting similarities, thereby improving the accuracy and computational efficiency of application disguise identification.
An application installation file exists as a compressed archive whose contents are organized as a directory tree holding executable bytecode files, certificate files and resource files. The executable bytecode is stored in class files; the certificate file is the application's signature file; the resource files include database files, function library files, XML files, image files and so on.
In one embodiment, the application installation file is described as the set app = {exe; lib; profile; image; audio; etc}, where exe denotes the executable bytecode in the installation file, lib the native code libraries of the program, profile the XML documents used for program data storage and layout description, image the image files of the program, audio the audio files of the program, and etc the other files of the program. As the description of the set app shows, the object of the invention is to identify disguised application installation files from the content features of related files such as exe, lib, profile and image.
To analyze accurately and efficiently from file content whether an installation file is disguised, and to meet practical identification needs, the proposed method aims at three goals: 1) Suitability for big-data computation. Application markets hold large and fast-growing numbers of applications, so a system framework that can process massive data quickly is the basis for big-data computation. 2) Selection of suitable feature files. Installation files contain thousands of file types, and the choice of which files to extract content from directly affects the efficiency and accuracy of disguise identification. 3) Efficient feature extraction and an accurate feature algorithm. The speed of extracting file content features determines system efficiency, while an accurate feature algorithm is the basic guarantee that the system delivers correct decisions.
In extracting file content features and calculating file similarity, the present invention improves efficiency without losing the accuracy of the results.
First, the target that the algorithm addresses must not be overly complex; if it is, the target must be reduced by selecting its key elements for comparison. Second, the algorithm must be efficient. Finally, when the algorithm is designed, its running environment should be optimized as far as possible, its intermediate steps reduced, and anything in it that would consume large amounts of time and space cut.
First, suitable feature files must be selected. An application installation file contains from hundreds to thousands of files; extracting features from the content of all of them would make the target overly complex and the analysis inefficient, and would leave the analysis vulnerable to inserted junk files. The present invention therefore selects a subset of suitable file types as feature files according to the principles of generality, representativeness and measurability, reducing the feature scale, and hence the amount of computation, while ensuring as far as possible that the feature files effectively represent the application installation file.
Next, the features of the selected files are extracted from the installation file. The file interface of the installation file is obtained and the feature files are located by their position offsets within the compressed archive, which skips the decompression of unrelated files and improves efficiency. The feature files in applications are first surveyed statistically; different algorithm implementations are compared in light of the statistics and the algorithm is optimized accordingly, so that the most efficient algorithm is used without sacrificing accuracy. A multithreaded scheme is applied during extraction: functions that do not support multithreading are rewritten to make all operations thread-safe, further improving efficiency.
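The selective extraction described above can be sketched in Python as follows. This is a minimal illustration that assumes the installation file is a ZIP-compatible archive; the function name `read_feature_files` and the extension lists in `FEATURE_EXTENSIONS` are hypothetical and do not come from the patent text.

```python
import io
import zipfile

# Assumed mapping from feature-file class to file extensions; the source
# does not list concrete extensions.
FEATURE_EXTENSIONS = {
    "image": (".png", ".jpg", ".jpeg", ".gif"),
    "audio": (".mp3", ".ogg", ".wav"),
    "profile": (".xml",),
}


def read_feature_files(package_bytes):
    """Return {class: {entry name: raw bytes}} for the feature files only."""
    selected = {kind: {} for kind in FEATURE_EXTENSIONS}
    with zipfile.ZipFile(io.BytesIO(package_bytes)) as archive:
        # The archive's central directory gives each entry's offset, so
        # entries that are not feature files are never decompressed.
        for info in archive.infolist():
            name = info.filename.lower()
            for kind, extensions in FEATURE_EXTENSIONS.items():
                if name.endswith(extensions):
                    selected[kind][info.filename] = archive.read(info)
                    break
    return selected
```

Because each entry is read independently, the per-entry work could also be handed to a thread pool, in the spirit of the multithreaded scheme mentioned in the text.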
Finally, disguise identification is performed on the file content features. In the design of the similarity measurement algorithm, counting is done with a hash table according to the statistical characteristics of applications, trading space for time.
To calculate file similarity from file content features, suitable feature files must first be selected from the complex range of file types. A suitable feature file type has three properties. First, files of the type are present in most installation files; if a file type exists in only a few applications, similarity cannot be compared through the content features of such files. Second, the file content features have a "signature" property: they can represent the application, and the content features extracted from different applications differ. Third, file content has a distance property: the content of similar files is close, whereas the content of different files is far apart. In one embodiment, interface description files, image files and audio files are selected as feature files, which can be described as appfile = {image; audio; profile}. The main idea is to calculate the similarity of file content features and to analyze application similarity on that basis, which can be expressed by the following formula:
Com(app1, app2) = com(appfile1, appfile2).
The present invention represents the features of an installation file by the content features of these three classes of file. The content feature set of each file class contains the features of all files of that class and is expressed by the following formula:
N denotes the number of files of each file type. To calculate the content similarity of the image, audio and interface description files, each kind of feature of the two applications is compared. The file feature similarity formula can be derived as follows; it expresses that the similarity of installation files is equivalent to the similarity of all files of the given types in the two application installation files:
M denotes the number of files of each file type.
Using the content feature similarity of image, audio or interface description files alone to represent the similarity of application installation files gives unsatisfactory results: a threshold set too high causes missed detections, and one set too low causes false alarms. The present invention therefore assigns weights to the image, audio and interface description file content similarities and expresses the similarity of application installation files as the weighted similarity of the three kinds of file content features. The weighted similarity formula is:
Com(app1, app2) = com(appfile1, appfile2) = com_image × α + com_audio × β + com_profile × γ.
This formula expresses that the similarity of applications app1 and app2 is equivalent to the similarity of their internal files, that is, to the weighted value of the image, audio and interface description file similarities of the two installation files. The values of α, β and γ vary dynamically with the differences among com_image, com_audio and com_profile.
The numbers of image, audio and interface description files in installation files differ, and some applications contain no audio files, so fixed values of α, β and γ cannot measure file similarity effectively. Embodiments of the invention therefore use dynamic weights, assigned according to the relative sizes of the three values com_image, com_audio and com_profile. The three most suitable weights, determined by learning, are 0.6, 0.3 and 0.1: the largest of com_image, com_audio and com_profile receives the weight 0.6, the second largest 0.3, and the smallest 0.1.
The above procedure yields the similarity of two files. Comparing this similarity with a threshold T determines whether the two files belong to similar applications, that is, whether one is a disguised file.
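The dynamic weighting and threshold comparison can be illustrated with a short Python sketch. The function names are hypothetical; the weights 0.6, 0.3 and 0.1 are those given in the text, while the value of T and the choice of "greater than or equal" for the comparison are assumptions, since the source does not specify them.

```python
def weighted_similarity(com_image, com_audio, com_profile,
                        weights=(0.6, 0.3, 0.1)):
    """Give the largest similarity the largest weight, and so on."""
    ranked = sorted((com_image, com_audio, com_profile), reverse=True)
    return sum(value * weight for value, weight in zip(ranked, weights))


def is_disguised(similarity, threshold):
    """Flag the pair as similar when the similarity reaches threshold T.

    Using >= rather than > is an assumption of this sketch.
    """
    return similarity >= threshold
```

Under this scheme an application without audio files, whose audio similarity would be the lowest of the three, has that term weighted by only 0.1, which is one plausible reading of why the text prefers dynamic over fixed weights.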
The present invention represents application features by file content features and proposes a specific feature extraction method and similarity algorithm for the characteristics of each kind of file.
Existing image similarity matching algorithms incur large space and time overheads and cannot be applied in large-scale computing environments. Disguised application installation files usually affect images in two ways: 1) modifying the original image; 2) changing the resolution of the original image. The image content feature extraction process therefore needs an algorithm that reduces the interference caused by image modification and eliminates that caused by resolution reduction. To this end, the present invention first reduces the size of the images in the installation file, converts color images into grayscale images and calculates the average gray level; it then extracts the image content features with a similarity hash algorithm, generating for each image a character string, based on the image's brightness and pattern, that serves as the image's "fingerprint". The more similar the fingerprints of two images, the more similar the images. This reduces computational complexity while improving accuracy.
Reducing the image size means shrinking the image to K × K pixels. This step mainly eliminates the interference of image resolution with the similarity comparison and removes differences in image size and aspect ratio, retaining only basic information such as structure and brightness; K is typically set to 128. Images with a resolution of 40 × 40 are the most common in mobile applications. Image content similarity comparison requires calculating the Hamming distance between fingerprints, that is, the number of positions at which two fingerprint strings differ; with K = 40, the string length is K × K/8 = 200. The present invention simplifies this step by testing whether the strings are equal instead of calculating the Hamming distance. The cost is that the similarity result shows only whether two image fingerprints are identical; it cannot tell, as the Hamming distance could, whether they are merely similar.
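The fingerprinting steps (shrink, convert to gray, threshold against the mean, compare) resemble what is commonly called an average hash. The following Python sketch assumes that reading; the nearest-neighbour resize, the luma coefficients and all function names are choices made for this illustration, and images are represented as plain nested lists of (r, g, b) tuples to keep the example self-contained.

```python
def to_gray(pixels):
    """Convert rows of (r, g, b) tuples to rows of integer gray levels."""
    return [[(r * 299 + g * 587 + b * 114) // 1000 for r, g, b in row]
            for row in pixels]


def shrink(gray, k):
    """Nearest-neighbour resize of a gray image to k x k pixels."""
    height, width = len(gray), len(gray[0])
    return [[gray[y * height // k][x * width // k] for x in range(k)]
            for y in range(k)]


def fingerprint(pixels, k=40):
    """Return a k*k-character bit string: '1' where a pixel >= the mean."""
    small = shrink(to_gray(pixels), k)
    flat = [value for row in small for value in row]
    mean = sum(flat) / len(flat)
    return "".join("1" if value >= mean else "0" for value in flat)


def hamming(fp_a, fp_b):
    """Count positions at which two equal-length fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have the same length")
    return sum(a != b for a, b in zip(fp_a, fp_b))
```

With K = 40 the bit string has 1600 bits; packing eight bits per character gives the 200-character string mentioned in the text. The simplified comparison the text describes corresponds to `fingerprint(a) == fingerprint(b)`, whereas `hamming` supports the finer-grained similarity test.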
The interface description files in an installation file are stored in XML format, so extracting the content features of an interface description file is equivalent to extracting the content features of an XML file. XML file similarity comparison covers two aspects, structural similarity and content similarity: the XML file is converted into a tree structure, the XML structural difference is obtained by comparing the differences between trees, and the XML content difference is obtained by comparing the differences between tree nodes.
Interface description files in an application are stored according to predefined rules. Given that the rules are known, the present invention adopts a simple method of extracting structure and content features. First, a list of structure names is obtained according to the interface description file specification. Then, structural features are extracted according to the structure name list, and the structural features and symbol information in the interface description file are filtered out to obtain the content information. Finally, hash values are calculated for the structure and the content information to obtain the structural feature value and the content feature. After processing, each interface description file yields a hash array, so that comparing the content similarity of interface description files is converted into comparing the similarity of hash arrays.
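One possible reading of this structure-and-content extraction is sketched below in Python. It is an assumption-laden illustration, not the patent's exact procedure: element tag paths stand in for the "structure name list", attribute values and text stand in for the "content information", MD5 is assumed as the hash, and the function name `layout_features` is hypothetical.

```python
import hashlib
import xml.etree.ElementTree as ET


def _md5(text):
    """Hex MD5 digest of a text string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def layout_features(xml_text):
    """Return (structure hash, set of content hashes) for one XML file."""
    root = ET.fromstring(xml_text)
    paths, contents = [], set()

    def walk(node, prefix):
        path = prefix + "/" + node.tag
        paths.append(path)  # structure: the tag path of every element
        # Content: attribute values and text, with tags and symbols removed.
        values = list(node.attrib.values())
        if node.text and node.text.strip():
            values.append(node.text.strip())
        for value in values:
            contents.add(_md5(value))
        for child in node:
            walk(child, path)

    walk(root, "")
    return _md5("\n".join(paths)), contents
```

Two layouts with the same element tree but different labels then share a structure hash while their content hash sets differ, so the two aspects can be compared separately.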
Analysis of the audio files in installation files shows that disguised application installation packages do not modify audio files substantially, so the present invention uses the file hash value as the audio file feature. When the hash value of an audio file is calculated for large-scale computation, however, the hash space is relatively small and hash results collide easily. The present invention therefore proposes the following hashing method, which greatly reduces hash collisions while preserving computation speed. Given an input audio file stream S and a predetermined constant character string M, the MD5 hash value H1 of S is calculated; S is then added to M and the MD5 hash value H2 of the result is calculated; H1 and H2 are summed to obtain the final hash value. The secondary hash value of the audio file obtained in this way serves as the content feature of the audio file.
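The secondary hash can be sketched in Python as below. The text does not say precisely how S is "added" to M or how H1 and H2 are "summed"; this sketch assumes byte concatenation in both cases, and the value of `CONSTANT_M` is a hypothetical placeholder.

```python
import hashlib

# Hypothetical constant; the source only says M is a predetermined string.
CONSTANT_M = b"predetermined-constant"


def audio_feature(stream, constant=CONSTANT_M):
    """Return the secondary hash of an audio file's bytes as hex text."""
    h1 = hashlib.md5(stream).hexdigest()             # H1 = MD5(S)
    h2 = hashlib.md5(stream + constant).hexdigest()  # H2 = MD5(S + M)
    return h1 + h2  # assumed reading of "sum": concatenation
```

Under this reading the feature is 64 hexadecimal characters long, so two different files collide only if both H1 and H2 collide.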
The content features of an application installation file comprise image content features, interface description file content features and audio file content features. The image content feature is a set of image "fingerprints". The content feature of one interface description file is a hash set, and the content feature of all interface description files in an installation file is the combination of multiple such hash sets. The audio file content feature is a hash set. All three kinds of content feature set can be treated as sets of strings. The present invention selects content similarity as the criterion, calculated as the ratio of the number of elements in the intersection of sets A and B to the size of the smaller of A and B. This measure effectively gauges the similarity between sets of different sizes. The content similarity is denoted L(A, B):
L(A, B) = |A ∩ B| / min(|A|, |B|).
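The content similarity formula translates directly into Python. The function name is hypothetical, and returning 0.0 when either set is empty is a convention assumed here, since the formula is undefined in that case.

```python
def content_similarity(a, b):
    """L(A, B) = |A ∩ B| / min(|A|, |B|) for two sets of feature strings."""
    a, b = set(a), set(b)
    smaller = min(len(a), len(b))
    if smaller == 0:
        return 0.0  # assumed convention; the source leaves this undefined
    return len(a & b) / smaller
```

For example, if a copy of an application keeps every original feature and adds extra files, the intersection equals the smaller set and the similarity stays at 1.0; with a union-based measure the added files would lower the score.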
Thus, the feature file set similarity formula is derived from the file feature similarity formula and the content similarity formula, and expresses that the similarity of file sets is equivalent to their content similarity. The file similarity is derived from the weighted similarity formula, and expresses that file similarity is equivalent to the weighted value of the file set similarities, that is, the weighted value of the three kinds of feature file content similarity.
Because file similarity is obtained by calculating the content similarity of feature files, it is not disturbed by changes to the file directory structure; and because the similarity calculation is normalized by the size of the smaller of the two sets, it effectively resists the interference of inserted junk files.
According to another aspect of the present invention, an anti-disguise system for mobile applications is also proposed. A message digest algorithm first takes an initial fingerprint sample of each file on the server, and the fingerprints are stored in a remote security database and a local security file. An anti-disguise module is created to process the access requests submitted by clients. Each access request is analyzed and its access path extracted; the fingerprint of the application installation file is compared with the fingerprint in the library, and a response scheme is given accordingly. Requests are traced directly to the web page file, so the approach suits both dynamic and static websites. Disguised page files are recovered by calling local page snapshots and by file verification and comparison.
Further, the system mainly coordinates four activities: access requests, disguise identification, site file updates and event alarms. When the system receives a web access request, it calls the disguise identification module to analyze each HTTP request and to trace the called file and access path. The anti-disguise module mounted in the security component calculates the digital fingerprint of the application installation file and compares it with the original fingerprint held in the safe area to decide whether the installation file has been disguised. If it has not, the web server answers the user's access with a normal HTTP response. Otherwise, the emergency recovery module is enabled immediately and answers the user from the local page snapshot; the recovery module is then enabled to replace the disguised file with the local backup, completing the repair. With snapshot technology enabled, even if a page file is disguised or reset by an attacker, the server will not pass the disguised page on to the visitor, avoiding harmful consequences. The system records a disguise log and notifies the administrator by SMS or e-mail. Once anti-disguise protection is enabled, every server file is locked and cannot be updated without authorization; after unlocking through identity authentication, files can be updated via FTP or SSL. The local fingerprint library, backup files and snapshots are synchronized with the remote library files in good time to keep the data consistent.
The system implements anti-disguise protection through five functions: protecting the integrity of site files, monitoring and processing HTTP access requests, quickly recovering disguised files, raising alarms, and trusted publishing. Accordingly, the system consists of three parts: a server side, a client and a publishing side.
(1) Server side. It uses the database as a hub for communication among multiple clients. For each client it provides file backup, snapshot storage, handling of the initial digital fingerprints of site files, and storage of the various logs and disguise alarm information. During operation the server side opens only the ports used to communicate with clients and the database, to maximize system security; this structure also lays a foundation for porting the system.
(2) Client. It is installed on the protected server without changing the existing network topology, and establishes trusted communication with the server side and the publishing side. The client provides six functions: initialization, monitoring and tracing of accessed files, site directory locking, disguise identification, disguise recovery and local resource backup. It is the core of the whole anti-disguise system.
When first enabled, the client initializes the protected site files on the server, collects the digital fingerprint of each file, stores the fingerprints in the security database on the server side, and keeps a backup copy in a local fingerprint file. To secure the local files, the local digital fingerprints, backup files and snapshots are encrypted with the symmetric cipher AES. On receiving an update command from the publishing side, the client unlocks the protected directory and updates the digital fingerprints of the updated files. When the anti-disguise module processes a client's page access request, it calculates the fingerprint of the accessed page file from its file name and access path and compares it with the fingerprint in the safe area. If the two match, the request is answered; otherwise the disguise recovery and event alarm modules are enabled, post-processing is carried out, and the source IP, source port and destination port of the access, the ID of the disguising process and the modified content of the request are recorded to build an alarm message that notifies the administrator. To answer the client quickly, the recovery module first reads the page snapshot to serve the user's access, then decrypts the local backup file and replaces the disguised file with it. If the local backup file has been destroyed, the backup file is delivered from the server side for recovery, so as to handle catastrophic events.
(3) Publishing side. It mainly handles the publishing of new servers and the updating of original server files. After the publishing side has been authenticated by the client, the client creates a new site or unlocks the requested site as requested, completing the publishing instruction; when publishing ends, the client relocks the site.
System initialization proceeds as follows. After each client submits its server characteristic information, a directory named after that characteristic information is created in a designated area of the server side. To ensure uniqueness, the characteristic information is the hash value of a character string formed from the client machine's IP address, CPU ID and hard disk ID. The server side establishes a unified database that stores the file fingerprints, logs and alarm information of each site. After the client's connection information has been verified by the server side, the client configures its local operating conditions, specifying the sites on the current server that need protection and the file types of each site. A local security directory is created whose name is the ciphertext of the site name and creation time; it stores the backup and snapshot files and the XML documents holding fingerprint data, logs and alarm data.
The file preprocessing component first calls the encryption engine and generates a key pair with the public-key algorithm RSA. The public key is encrypted with AES and stored on the backup server; the encrypted public key copy and the private key are then saved together in the local security directory, and public keys are exchanged with the backup server in good time, providing a working environment for data synchronization and site recovery. Next, after reading the list of sites to be protected and their file types, the component calls the traversal engine, traverses the server files with the specified suffixes, and calculates for each a unique, irreversible digital fingerprint with the MD5 algorithm. The fingerprint results are stored in the database in a defined data structure, and an XML document is generated per site name and stored in the local security file for the anti-disguise module to compare against. Finally, the traversed site data is encrypted with the client's public key and kept in the local security directory until it is synchronized with the backup server.
The entire disguise identification process is implemented as an anti-disguise module developed with the security component. The module analyzes HTTP requests, extracts the access path and file name from the data submitted by the HTTP client, monitors file integrity in real time and verifies the legitimacy of file changes. The anti-disguise module is developed with in-kernel technology and registered in the mapping table through the security component interface; the server loads it into its process space, where it calculates the fingerprint of each accessed web page and compares it with the original fingerprint.
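The traversal-and-fingerprint step of file preprocessing (walking the site directory and computing an MD5 digest for every file with a specified suffix) might look like the following Python sketch. The function name and the default suffix list are hypothetical; key generation, encryption and database storage are outside the scope of this illustration.

```python
import hashlib
import os


def fingerprint_site(root, suffixes=(".html", ".js", ".css")):
    """Map each matching file's relative path to the MD5 of its bytes."""
    fingerprints = {}
    for folder, _dirs, files in os.walk(root):
        for name in files:
            if not name.lower().endswith(suffixes):
                continue  # only files with the specified suffixes
            path = os.path.join(folder, name)
            digest = hashlib.md5()
            with open(path, "rb") as handle:
                # Read in chunks so large files do not fill memory.
                for chunk in iter(lambda: handle.read(65536), b""):
                    digest.update(chunk)
            fingerprints[os.path.relpath(path, root)] = digest.hexdigest()
    return fingerprints
```

The resulting dictionary is the kind of per-site fingerprint table that could then be serialized (for instance to the XML document the text mentions) and consulted when requests arrive.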
When the server receives an HTTP request, it first traces the application installation file requested, then calculates the hash value of that file, and finally calls the fingerprint comparison component. The component reads the original fingerprint of the application installation file from the local safe area, decrypts it and compares it with the newly calculated fingerprint. If they match, the HTTP request is answered; otherwise the recovery and emergency response flow begins. On receiving the comparison failure command, the emergency response component calls the local snapshot file of the same name, decrypts it and generates HTML text to answer the HTTP request, preserving the efficiency and quality of the server's HTTP responses. Once the response has been sent, the original file in the local safe area is decrypted with AES as quickly as possible and used to replace the disguised file, safeguarding the file to the greatest extent. If recovery fails, the current file is disabled and requests are redirected to a specified page. While the file is being restored, the system records a disguise log and sends alarm information to the administrator by SMS or e-mail, providing a basis for later data analysis and management. Calling the snapshot and redirecting the access take only a few milliseconds, so the requester never receives the content of the disguised page. At a set interval, the client calls the low-level file filter driver module to calculate, compare and verify the digital fingerprints of the protected sites and files of the specified types on the current server, ensuring that the digital fingerprints are consistent everywhere.
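The decision flow for a single request (compare fingerprints, serve the file, fall back to the snapshot, or redirect) can be outlined in Python. This is a heavily simplified sketch: the function name, the in-memory dictionaries and the action labels are hypothetical, and decryption, file restoration, logging and alerting are deliberately left out.

```python
import hashlib


def handle_request(path, current_bytes, original_fingerprints, snapshots):
    """Return (action, body) for one requested file."""
    current = hashlib.md5(current_bytes).hexdigest()
    if original_fingerprints.get(path) == current:
        # Fingerprints match: answer the request normally.
        return "serve", current_bytes
    if path in snapshots:
        # Mismatch: answer from the trusted snapshot; the caller would
        # then restore the backup and raise an alert.
        return "serve_snapshot", snapshots[path]
    # No snapshot available: redirect to a specified page.
    return "redirect", b""
```

A caller would act on the returned label, for example by triggering the restore and alarm steps whenever the action is not `"serve"`.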
In summary, the present invention proposes a file processing method that identifies installation files by extracting their content features. It effectively resists the interference caused by camouflaged or maliciously modified files and directories, and the feature extraction process reduces the scale of the file content features, which improves computational efficiency.
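The identification summarized above rests on the overlap measure L(A, B) = |A ∩ B| / min(|A|, |B|) and on a weighted sum of the image, audio, and interface description similarities. The following is a minimal sketch under stated assumptions: the fixed weights and the threshold are placeholder values (the method determines the weights by learning), returning 0.0 for an empty set is an assumption, and treating a score at or above the threshold as a match is also an assumption. All function names are hypothetical.

```python
from typing import Dict, Iterable


def content_similarity(a: Iterable[str], b: Iterable[str]) -> float:
    """L(A, B) = |A ∩ B| / min(|A|, |B|) over sets of feature strings."""
    sa, sb = set(a), set(b)
    if not sa or not sb:
        return 0.0  # assumption: an empty feature set shares nothing
    return len(sa & sb) / min(len(sa), len(sb))


def weighted_similarity(sim_image: float, sim_audio: float, sim_profile: float,
                        alpha: float = 0.4, beta: float = 0.2,
                        gamma: float = 0.4) -> float:
    """com_image * alpha + com_audio * beta + com_profile * gamma.

    The default weights are placeholders, not learned values.
    """
    return sim_image * alpha + sim_audio * beta + sim_profile * gamma


def app_similarity(app1: Dict[str, set], app2: Dict[str, set]) -> float:
    """Weighted similarity of two apps given their three feature sets."""
    return weighted_similarity(
        content_similarity(app1["image"], app2["image"]),
        content_similarity(app1["audio"], app2["audio"]),
        content_similarity(app1["profile"], app2["profile"]),
    )


def matches(app1: Dict[str, set], app2: Dict[str, set],
            threshold: float = 0.8) -> bool:
    # Assumption: a score at or above the threshold counts as a match.
    return app_similarity(app1, app2) >= threshold
```

Because the denominator is the smaller set, an installation file that pads a copied application with extra resources still scores high against the original.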
Obviously, those skilled in the art should understand that the modules and steps of the present invention described above can be implemented with a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored in a storage system and executed by the computing system. The present invention is therefore not restricted to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention serve only to illustrate or explain the principles of the invention and do not limit it. Any modification, equivalent substitution, or improvement made without departing from the spirit and scope of the present invention shall therefore fall within its scope of protection. In addition, the appended claims are intended to cover all changes and modifications that fall within the scope and boundaries of the claims, or within the equivalents of such scope and boundaries.

Claims (3)

  1. A file processing method for identifying camouflaged application installation files, characterized by comprising:
    selecting feature files of predetermined types from an installation file, the feature files of the predetermined types including interface description files, audio files, and image files; extracting the features of the feature files in a feature extraction step; and, based on the features, identifying whether the installation file is camouflaged by comparing a similarity with a predetermined threshold;
    in the feature extraction process, when extracting the features of an interface description file stored in XML format, the XML file similarity comparison includes structural similarity and content similarity: the XML file is converted into a tree structure, the XML structural difference is obtained by comparing the differences between trees, and the XML content difference is obtained by comparing the nodes of the trees; interface description files are stored in the application according to predetermined rules, and a structure file name list is obtained according to the interface description file specification; structural features are then extracted according to the structure file name list, and the structural features and symbol information in the interface description file are filtered out to obtain the content information; finally, hash values are computed over the structure and the content information to obtain the structural feature value and the content feature, so that each interface description file yields a hash array after processing and the content similarity of interface description files is converted into the similarity of hash arrays;
    the content features of the application installation file include image content features, interface description file content features, and audio file content features; the image content feature is a set of image "fingerprints"; the content feature of one interface description file is a hash set, and the content features of all interface description files in the installation file are composed of multiple such hash sets; the audio file content feature is a hash set; all three kinds of file content feature sets are treated as sets of strings;
    content similarity is chosen as the criterion and is computed as the proportion that the intersection of sets A and B occupies in the smaller of the two sets, denoted L(A, B):
    L(A, B) = |A ∩ B| / min(|A|, |B|);
    that is, the feature file set similarity formula is derived from the file feature similarity formula and the content feature similarity formula, meaning that file set similarity is equivalent to the content similarity of the file sets; the file similarity is derived from the weighted similarity formula, meaning that file similarity is equivalent to the weighted file set similarity, i.e., the weighted value of the content similarities of the three kinds of feature files;
    the file content features of the feature files of the predetermined types have a signature property: the file content features extracted from different applications differ from one another, and file content has a distance property; the feature extraction process further comprises obtaining the file interface of the installation file, locating the feature files by their position offsets in the compressed file, collecting statistics on the feature files in the application, comparing different algorithms according to the statistical results and optimizing the algorithm, applying a multithreading scheme in the extraction process, and rewriting the functions that do not support multithreading; after feature extraction, identification is performed based on the file content features, and, according to the statistical characteristics of the application, a hash table is used for counting to carry out the similarity measurement;
    the feature files of the specific types include interface description files, image files, and audio files, and the application installation file is described as appfile = {image; audio; profile}; the content feature set of each file class contains the features of all files of that class, expressed as:
    $$\mathrm{image}_f = \sum_{i=1}^{n} \mathrm{image}_f[i];$$
    $$\mathrm{audio}_f = \sum_{i=1}^{n} \mathrm{audio}_f[i];$$
    $$\mathrm{profile}_f = \sum_{i=1}^{n} \mathrm{profile}_f[i];$$
    where image_f, audio_f, and profile_f denote the features of the image files, audio files, and interface description files respectively, n denotes the number of files of each file type, and the file content feature similarity function com() is expressed as:
    com(app1, app2) = com(appfile1, appfile2)
    each kind of feature of the two applications is compared, and the file feature similarity is computed as follows, meaning that file similarity within installation files is equivalent to the similarity of all files of that type in the two application installation files:
    $$\mathrm{com\_image}(app1, app2) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{com}(\mathrm{image}_{f1}[i], \mathrm{image}_{f2}[j]);$$
    $$\mathrm{com\_audio}(app1, app2) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{com}(\mathrm{audio}_{f1}[i], \mathrm{audio}_{f2}[j]);$$
    $$\mathrm{com\_profile}(app1, app2) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{com}(\mathrm{profile}_{f1}[i], \mathrm{profile}_{f2}[j]);$$
    where m denotes the number of files of each file type, counted by the index j over the files of app2;
    weights are assigned to the three file content similarities, and the application installation file similarity is represented by the weighted similarity of the three kinds of file content features; the weighted similarity formula is:
    com(app1, app2) = com(appfile1, appfile2) = com_image × α + com_audio × β + com_profile × γ;
    where the values of α, β, and γ change dynamically according to the differences in the content of com_image, com_audio, and com_profile, i.e., the weights are assigned according to the magnitudes of the three values com_image, com_audio, and com_profile, and the optimal values of the three weights are determined by learning.
  2. The method according to claim 1, characterized in that, in the feature extraction process, when extracting the features of an image file,
    the images in the installation file are first reduced in size and color images are converted into grayscale images; the average gray level is computed, and the image content features are extracted with a similarity hash algorithm: a character string is generated for each image from its brightness and pattern to serve as the image fingerprint, and the similarity between images is determined by how close their fingerprints are;
    wherein reducing the image size means shrinking the image to K × K pixels, which eliminates the interference of image resolution with the similarity comparison and removes differences in image size and aspect ratio, retaining only structure and brightness information, with K set to 128; and wherein comparing image content similarity includes computing the Hamming distance between fingerprints.
  3. The method according to claim 2, characterized in that, in the feature extraction process, when extracting the features of an audio file,
    the file hash value is used as the audio file feature: an audio file stream S and a predetermined constant string M are input; the MD5 hash value H1 of the input audio file stream S is computed; the input audio file stream S is then added to the predetermined constant string M and the MD5 hash value H2 of the result is computed; H1 and H2 are summed to obtain the final hash value, which serves as the content feature of the audio file.
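The image and audio feature extraction recited in claims 2 and 3 can be sketched as follows. This is an illustrative reading, not the claimed implementation: the input image is assumed to be already converted to a grayscale grid, nearest-neighbour sampling stands in for the unspecified resizing method, "adding" S and M is read as byte concatenation, and "summing" H1 and H2 is read as integer addition modulo 2^128. All function names are hypothetical.

```python
import hashlib
from typing import List, Sequence


def shrink(gray: Sequence[Sequence[int]], k: int) -> List[List[int]]:
    """Nearest-neighbour resize of a grayscale grid to k x k pixels."""
    h, w = len(gray), len(gray[0])
    return [[gray[r * h // k][c * w // k] for c in range(k)]
            for r in range(k)]


def image_fingerprint(gray: Sequence[Sequence[int]], k: int = 128) -> str:
    """One bit per pixel: '1' if the pixel is at least the average gray level."""
    pixels = [p for row in shrink(gray, k) for p in row]
    mean = sum(pixels) / len(pixels)
    return "".join("1" if p >= mean else "0" for p in pixels)


def hamming(f1: str, f2: str) -> int:
    """Number of positions at which two equal-length fingerprints differ."""
    if len(f1) != len(f2):
        raise ValueError("fingerprints must have the same length")
    return sum(x != y for x, y in zip(f1, f2))


def audio_feature(stream: bytes, constant: bytes) -> str:
    """Combine H1 = MD5(S) and H2 = MD5(S + M) into one 128-bit hex string."""
    h1 = int.from_bytes(hashlib.md5(stream).digest(), "big")
    # Assumption: "adding" S and M means appending M to S.
    h2 = int.from_bytes(hashlib.md5(stream + constant).digest(), "big")
    # Assumption: "summing" means integer addition, wrapped to 128 bits.
    return format((h1 + h2) % (1 << 128), "032x")
```

A small Hamming distance between two fingerprints indicates visually similar images regardless of their original resolution, while the audio feature only matches byte-identical streams.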
CN201510108614.6A 2015-03-12 2015-03-12 A kind of document handling method Active CN104657665B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510108614.6A CN104657665B (en) 2015-03-12 2015-03-12 A kind of document handling method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510108614.6A CN104657665B (en) 2015-03-12 2015-03-12 A kind of document handling method

Publications (2)

Publication Number Publication Date
CN104657665A CN104657665A (en) 2015-05-27
CN104657665B true CN104657665B (en) 2017-12-08

Family

ID=53248776

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510108614.6A Active CN104657665B (en) 2015-03-12 2015-03-12 A kind of document handling method

Country Status (1)

Country Link
CN (1) CN104657665B (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224367A (en) * 2015-09-30 2016-01-06 浪潮电子信息产业股份有限公司 A kind of installation method of software and device
CN106815522A (en) * 2015-11-27 2017-06-09 中兴通讯股份有限公司 Mobile terminal software vacation interface identification method and device
CN105653984B (en) * 2015-12-25 2019-04-19 北京奇虎科技有限公司 File fingerprint method of calibration and device
CN109564613B (en) * 2016-07-27 2023-05-30 日本电气株式会社 Signature creation device, signature creation method, recording medium on which signature creation program is recorded, and software determination system
CN107323114B (en) * 2017-06-22 2019-08-16 珠海汇金科技股份有限公司 Intrusion detection method, system and the print control instrument of print control instrument
CN108123934B (en) * 2017-12-06 2021-02-19 深圳先进技术研究院 Mobile-end-oriented data integrity verification method
CN107992599A (en) * 2017-12-13 2018-05-04 厦门市美亚柏科信息股份有限公司 File comparison method and system
CN108491458A (en) * 2018-03-02 2018-09-04 深圳市联软科技股份有限公司 A kind of sensitive document detection method, medium and equipment
CN111160123B (en) * 2019-12-11 2023-06-09 桂林长海发展有限责任公司 Aircraft target identification method, device and storage medium
CN113590144A (en) * 2021-08-16 2021-11-02 北京字节跳动网络技术有限公司 Dependency processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101158967A (en) * 2007-11-16 2008-04-09 北京交通大学 Quick-speed audio advertisement recognition method based on layered matching
CN101369268A (en) * 2007-08-15 2009-02-18 北京书生国际信息技术有限公司 Storage method for document data in document warehouse system
CN102968439A (en) * 2012-10-11 2013-03-13 微梦创科网络科技(中国)有限公司 Method and device for sending microblogs
CN103400076A (en) * 2013-07-30 2013-11-20 腾讯科技(深圳)有限公司 Method, device and system for detecting malicious software on mobile terminal
CN104091152A (en) * 2014-06-30 2014-10-08 南京理工大学 Method for detecting pedestrians in big data environment

Also Published As

Publication number Publication date
CN104657665A (en) 2015-05-27

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20230613

Address after: F13, Building 11, Zone D, New Economic Industrial Park, No. 99, West Section of Hupan Road, Xinglong Street, Tianfu New District, Chengdu, Sichuan, 610000

Patentee after: Sichuan Shenhu Technology Co.,Ltd.

Address before: 610041 No. 5, floor 1, unit 1, building 19, No. 177, middle section of Tianfu Avenue, high tech Zone, Chengdu, Sichuan Province

Patentee before: SICHUAN CINGHOO TECHNOLOGY Co.,Ltd.
