CN104657665B - A file processing method - Google Patents
- Publication number: CN104657665B
- Application number: CN201510108614.6A
- Authority: CN (China)
- Prior art keywords: file, similarity, image, content
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention provides a file processing method. The method comprises: selecting feature files of predefined types from an installation file, the predefined types comprising interface description files, audio files, and image files; extracting the features of these feature files in a feature extraction step; and, based on those features, identifying whether the installation file is disguised by comparing a similarity value with a predetermined threshold. The method identifies applications by the content features of their installation files, resists the interference introduced by disguising or maliciously modifying files and directories, and uses the feature extraction process to reduce the scale of the file content features, which improves computational efficiency.
Description
Technical field
The present invention relates to file processing, and more particularly to the security processing of application installation files.
Background
In the mobile application field, developers submit applications to an application market, and users download applications from that market. Even official markets cannot avoid malware entirely: their security assurance mechanisms are imperfect, so the proportion of malware remains high. Embedding known code and disguising application installation files are the main threats. Existing solutions obtain an application's behaviour sequence with a decompilation tool or a dynamic behaviour analysis tool, preprocess the sequence to obtain behaviour sequence features, and compare the distance between those features to quantify whether the application is disguised. This approach can detect changes in application code, but the extraction of behaviour sequence features is easily affected by code obfuscation, so it has limitations when applied to practical problems.
No effective solution has yet been proposed for the above problems in the related art.
Summary of the invention
To solve the problems of the prior art described above, the present invention proposes a file processing method, comprising:
selecting feature files of predefined types from an installation file, the predefined types comprising interface description files, audio files, and image files; extracting the features of these feature files in a feature extraction step; and, based on those features, identifying whether the installation file is disguised by comparing a similarity value with a predetermined threshold.
Preferably, the application installation file is described as the set app = {exe; lib; profile; image; audio; etc}, where exe denotes the executable bytecode in the installation file, lib the native code libraries in the program, profile the XML documents that describe the program's data storage and layout, image the image files in the program, and etc the other files in the program.
Preferably, in the feature extraction process, when the features of an image file are extracted:
the images in the installation file are first reduced in size and colour images are converted to grey-level images; the average grey level is calculated; image content features are extracted with a similarity hash algorithm, which generates one character string per image from its brightness and pattern to serve as the image's fingerprint; and the similarity between images is determined from how close their fingerprints are;
wherein reducing the image size means shrinking the image to K × K pixels, which eliminates the interference of image resolution with the similarity comparison, removes differences in image size and aspect ratio, and retains only basic information such as structure and brightness, the value of K being set to 128; and
wherein comparing image content similarity includes calculating the Hamming distance between fingerprints.
In the feature extraction process, when the features of an interface description file stored in XML format are extracted:
XML file similarity comparison covers structural similarity and content similarity. The XML file is converted to a tree structure; comparing the differences between trees gives the XML structure difference, and comparing the differences between tree nodes gives the XML content difference. Interface description files in an application are stored according to predefined rules. A list of structure names is obtained from the interface description file specification; structural features are then extracted according to that list, and the structural features and symbol information are filtered out of the interface description file to leave the content information; finally, hash values are calculated for the structure and the content information to give a structure feature value and content features. After this processing each interface description file yields a hash array, so comparing the content similarity of interface description files becomes comparing the similarity of hash arrays.
In the feature extraction process, when the features of an audio file are extracted:
a file hash value is used as the audio file feature. An audio file stream S and a predetermined constant character string M are taken as input. The MD5 hash value H1 of the input audio file stream S is calculated; the input audio file stream S is then added to the predetermined constant character string M and the MD5 hash value H2 of the result is calculated; H1 and H2 are summed to give the final hash value, which serves as the content feature of the audio file.
Compared with the prior art, the present invention has the following advantages:
The invention proposes a file processing method that identifies applications by the content features of their installation files, resists the interference introduced by disguising or maliciously modifying files and directories, and uses the feature extraction process to reduce the scale of the file content features, which improves computational efficiency.
Brief description of the drawings
Fig. 1 is a flow chart of the file processing method according to an embodiment of the invention.
Detailed description
A detailed description of one or more embodiments of the invention is provided below together with the accompanying drawing that illustrates the principles of the invention. The invention is described in connection with such embodiments, but is not limited to any embodiment. The scope of the invention is limited only by the claims, and the invention covers many alternatives, modifications, and equivalents. Many specific details are set forth in the following description to provide a thorough understanding of the invention. These details are provided for the purpose of example, and the invention may be practised according to the claims without some or all of them.
Fig. 1 is a flow chart of the file processing method according to an embodiment of the invention. A method for identifying disguised application installation files is proposed. The method analyses the attributes of the application installation file, selects file types, extracts content features with a different content feature extraction algorithm for each file type, and assigns weights to the resulting similarities, which improves both the accuracy and the efficiency of disguise identification.
An application installation file exists as a compressed archive. Internally it is organised as directories that hold executable bytecode files, certificate files, and resource files. The executable bytecode is stored in class files; the certificate file is the application's signature file; the resource files include database files, function library files, XML files, image files, and so on.
In one embodiment, the application installation file is described as the set app = {exe; lib; profile; image; audio; etc}, where exe denotes the executable bytecode in the installation file, lib the native code libraries in the program, profile the XML documents that describe the program's data storage and layout, image the image files in the program, and etc the other files in the program. As the description of the set app shows, the aim of the invention is to identify disguised application installation files from the content features of the relevant files, such as exe, lib, profile, and image.
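The set description above can be sketched as a small classifier over the entries of a compressed installation file. This is only an illustration: the function name `describe_app` and the suffix-to-category table are assumptions made for the example, not part of the patent text, and a real installation file may use other extensions.

```python
import zipfile
from pathlib import PurePosixPath

# Hypothetical mapping from file suffix to the categories of the set
# app = {exe; lib; profile; image; audio; etc}. The suffixes are assumptions.
CATEGORY_BY_SUFFIX = {
    ".dex": "exe", ".class": "exe",
    ".so": "lib",
    ".xml": "profile",
    ".png": "image", ".jpg": "image", ".gif": "image",
    ".mp3": "audio", ".ogg": "audio", ".wav": "audio",
}


def describe_app(archive):
    """Group the entry names of a zip-format installation file by category."""
    app = {name: [] for name in ("exe", "lib", "profile", "image", "audio", "etc")}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.is_dir():
                continue  # directories carry no content features
            suffix = PurePosixPath(info.filename).suffix.lower()
            app[CATEGORY_BY_SUFFIX.get(suffix, "etc")].append(info.filename)
    return app
```

Anything whose suffix is not in the table falls into `etc`, which mirrors the catch-all member of the set.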
To analyse accurately and efficiently, from file content, whether an installation file is disguised, and to meet practical identification needs, the proposed method aims at three goals: 1) suitability for big-data computation: the number of applications in a market is large and grows quickly, so a system framework that can process massive data quickly is the basis for big-data computation; 2) selection of suitable feature files: an installation file can contain thousands of file types, and the choice of which files to extract content from directly affects the efficiency and accuracy of disguise identification; 3) efficient feature extraction and an accurate feature algorithm: the speed of extracting file content features determines system efficiency, while an accurate feature algorithm is the basic guarantee that the system returns correct decisions.
The invention improves efficiency when extracting file content features and computing file similarity without sacrificing the accuracy of the result.
This requires, first, that the target the algorithm works on not be overly complex; if it is, the target must be reduced by selecting its key elements for comparison. Second, the algorithm itself must be efficient. Finally, when the algorithm's processing flow is built, its runtime environment should be optimised as far as possible, its intermediate steps reduced, and anything in it that would consume large amounts of time and space cut.
Suitable feature files must be selected first. An application installation file contains from hundreds to thousands of files. Extracting features from the content of every file would make the target overly complex and the analysis inefficient, and the result would be easily disturbed by inserted garbage files. The invention therefore selects a subset of suitable file types as feature files according to the principles of generality, representativeness, and measurability. This reduces the feature scale, and hence the amount of computation, while ensuring as far as possible that the feature files still represent the application installation file effectively.
Next, the features of the selected files are extracted from the installation file. The file interface of the installation file is obtained and the feature files are located by their position offsets within the compressed archive, which skips decompressing unrelated files and improves efficiency. The feature files in applications are first surveyed statistically; candidate algorithm implementations are compared against the observed statistics and tuned, and the most efficient algorithm that preserves accuracy is used. Multithreading is applied during extraction, and functions that do not support multithreading are rewritten to guarantee the thread safety of all operations, which further improves efficiency.
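As a rough illustration of locating feature files by offset, Python's standard `zipfile` module reads the archive's central directory and then seeks to each requested member, so members that are never requested are never decompressed. The function name `read_feature_files` and the default suffix tuple are assumptions for this sketch; the multithreading described above is left out.

```python
import zipfile


def read_feature_files(archive, suffixes=(".xml", ".png", ".mp3")):
    """Return {entry name: bytes} for the feature files only.

    Only entries whose names end with one of `suffixes` are decompressed;
    all other entries are skipped using the archive's directory listing.
    """
    features = {}
    with zipfile.ZipFile(archive) as zf:
        for info in zf.infolist():
            if info.filename.lower().endswith(suffixes):
                # info.header_offset is where this member starts in the
                # archive; zf.read() seeks there and decompresses it alone.
                features[info.filename] = zf.read(info)
    return features
```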
Finally, disguise identification is performed on the file content features. The similarity measurement algorithm is designed around the statistical characteristics of applications and uses a hash table for counting, trading space for time.
To compute file similarity from file content features, suitable feature files must first be selected from the many file types. A suitable feature file has three properties. First, most installation files contain files of that type; if a file type exists in only a few applications, similarity cannot be compared through the content features of such files. Second, the file content features act as a "signature" that represents the application, so the features extracted from different applications differ. Third, the file content has a distance property: the content distance between similar files is small, and the content distance between different files is large. In one embodiment, interface description files, image files, and audio files are selected as feature files, which can be described as appfile = {image; audio; profile}. The main idea is to compute the similarity of file content features and use it to analyse application similarity, which can be expressed by the following equation:
com(app1, app2) = com(appfile1, appfile2).
The invention represents the features of an installation file by the content features of these three classes of file. The content feature set of each class contains the features of all files of that class and is expressed by the following formula:
[Formula not reproduced in the source text.]
N denotes the number of files of each file type. To compute the content similarity of image, audio, and interface description files, each kind of feature of the two applications is compared. The file feature similarity formula can then be derived as follows; it states that installation file similarity is equivalent to the similarity of all files of each type in the two application installation files:
[Formula not reproduced in the source text.]
M denotes the number of files of each file type.
Using the content feature similarity of image, audio, or interface description files alone to represent the similarity of application installation files gives unsatisfactory results: a threshold set too high causes missed detections, and a threshold set too low causes false alarms. The invention therefore assigns weights to the image, audio, and interface description file content similarities and expresses the similarity of application installation files as the weighted similarity of the three kinds of file content features. The weighted similarity formula is:
com(app1, app2) = com(appfile1, appfile2) = com_image × α + com_audio × β + com_profile × γ.
The formula states that the similarity of applications app1 and app2 is equivalent to the similarity of the files inside app1 and app2, which in turn is equivalent to the weighted value of the image, audio, and interface description file similarities of the two installation files. The values of α, β, and γ change dynamically according to the values of com_image, com_audio, and com_profile.
The numbers of image, audio, and interface description files differ between installation files, and some applications contain no audio files, so fixed values of α, β, and γ cannot compute file similarity effectively. Embodiments of the invention therefore use dynamic weights: the weights are assigned according to the relative sizes of the three values com_image, com_audio, and com_profile. The three most suitable weights, determined by learning, are 0.6, 0.3, and 0.1: the largest of com_image, com_audio, and com_profile receives the weight 0.6, the second largest 0.3, and the smallest 0.1.
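The dynamic weighting just described amounts to ranking the three similarities and pairing them with 0.6, 0.3, and 0.1 in that order. A minimal sketch follows; the function names are illustrative, and because the text gives no numeric value for the threshold T, the caller must supply one.

```python
def weighted_similarity(com_image, com_audio, com_profile,
                        weights=(0.6, 0.3, 0.1)):
    """Weight the largest similarity by 0.6, the next by 0.3, the smallest by 0.1."""
    ranked = sorted((com_image, com_audio, com_profile), reverse=True)
    return sum(w * s for w, s in zip(weights, ranked))


def is_similar(score, threshold):
    """Compare a similarity score with a caller-supplied threshold T."""
    return score >= threshold
```

For example, an application pair with no audio files (com_audio = 0) still scores highly when its images and interface descriptions match, because the zero ends up with the smallest weight.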
The above procedure yields the similarity of two installation files. Comparing that similarity with a threshold T determines whether the two files belong to similar applications, that is, whether one of them is a disguised file.
The invention uses file content features to represent application features, and proposes a specific feature extraction method and similarity algorithm for the characteristics of each kind of file.
Existing image similarity matching algorithms need considerable space and time and cannot be applied in large-scale computing environments. Disguised application installation files usually alter images in one of two ways: 1) by modifying the original image; 2) by changing the resolution of the original image. Image content feature extraction therefore needs an algorithm that reduces the interference caused by modified images and eliminates the interference caused by resolution changes. For this reason the invention first reduces the size of the images in the installation file, converts colour images to grey-level images, calculates the average grey level, and extracts image content features with a similarity hash algorithm, which generates one character string per image from its brightness and pattern to serve as the image's "fingerprint". The more similar two fingerprints are, the more similar the two images are. This improves accuracy while reducing computational complexity.
Reducing the image size means shrinking the image to K × K pixels. This step mainly eliminates the interference of image resolution with the similarity comparison, removes differences in image size and aspect ratio, and retains only basic information such as structure and brightness. The value of K is typically set to 128. Images with a resolution of 40 × 40 are the most common in mobile applications. Comparing image content similarity requires the Hamming distance between fingerprints, that is, the number of positions at which the two fingerprint strings hold different characters; with K = 40 the string length is K × K / 8 = 200. The invention simplifies this step by replacing the Hamming distance with a test of whether the two strings are equal. The cost is that the similarity result shows only whether two image fingerprints are identical; it cannot show, as the Hamming distance would, whether the fingerprints are merely similar.
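The fingerprint steps above (shrink, convert to grey, threshold against the average grey level, pack into a string) resemble what is commonly called an average hash. The sketch below is written under several assumptions that are not in the patent text: the image is already decoded into rows of (r, g, b) tuples, shrinking uses nearest-neighbour sampling, the grey conversion uses the common 0.299/0.587/0.114 weights, and eight bits are packed into each character so that K = 40 gives 200 characters.

```python
def to_gray(pixels):
    """Convert rows of (r, g, b) tuples to rows of integer grey levels."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in pixels]


def shrink(gray, k):
    """Nearest-neighbour resize of a grey image to k x k pixels."""
    height, width = len(gray), len(gray[0])
    return [[gray[y * height // k][x * width // k] for x in range(k)]
            for y in range(k)]


def fingerprint(pixels, k=40):
    """One bit per pixel (1 if at or above the mean grey level), 8 bits per byte."""
    flat = [v for row in shrink(to_gray(pixels), k) for v in row]
    mean = sum(flat) / len(flat)
    bits = [1 if v >= mean else 0 for v in flat]
    bits += [0] * (-len(bits) % 8)  # pad to a whole number of bytes
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit
        packed.append(byte)
    return bytes(packed)


def hamming(fp_a, fp_b):
    """Number of positions at which two equal-length fingerprints differ."""
    if len(fp_a) != len(fp_b):
        raise ValueError("fingerprints must have equal length")
    return sum(a != b for a, b in zip(fp_a, fp_b))
```

The simplified comparison described in the text is then just `fingerprint(a) == fingerprint(b)`, while `hamming` shows what the fuller comparison would measure.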
The interface description files in an installation file are stored in XML format, so extracting their content features is equivalent to extracting XML file content features. XML file similarity comparison covers two aspects, structural similarity and content similarity. The XML file is converted to a tree structure; comparing the differences between trees gives the XML structure difference, and comparing the differences between tree nodes gives the XML content difference.
Interface description files in an application are stored according to predefined rules. Because the rules are known, the invention adopts a simple method for extracting structure and content features. First, a list of structure names is obtained from the interface description file specification. Then structural features are extracted according to that list, and the structural features and symbol information are filtered out of the interface description file to leave the content information. Finally, hash values are calculated for the structure and the content information to give a structure feature value and content features. After this processing each interface description file yields a hash array, so comparing the content similarity of interface description files becomes comparing the similarity of hash arrays.
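One possible reading of this structure/content split can be sketched with Python's standard XML parser. The choices here are assumptions made for illustration: element tags and attribute names are treated as structure, attribute values and element text as content, and MD5 is used as the hash. The patent text does not fix these details.

```python
import hashlib
import xml.etree.ElementTree as ET


def _md5(text):
    return hashlib.md5(text.encode("utf-8")).hexdigest()


def xml_features(xml_text):
    """Return (structure hash, list of content hashes) for one XML document."""
    root = ET.fromstring(xml_text)
    structure, content = [], []
    for elem in root.iter():  # document-order walk of the tree
        structure.append(elem.tag)
        for name, value in sorted(elem.attrib.items()):
            structure.append("@" + name)   # attribute names are structure
            content.append(value)          # attribute values are content
        if elem.text and elem.text.strip():
            content.append(elem.text.strip())
    return _md5("/".join(structure)), [_md5(item) for item in content]
```

Two layouts that differ only in a button label then share the structure hash and differ in exactly one content hash, which is the kind of hash array comparison the paragraph above describes.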
Analysis of the audio files in installation files shows that disguised installation packages do not modify audio files significantly, so the invention uses a file hash value as the audio file feature. For large-scale computation, however, the hash space of a single audio file hash is small and collisions occur easily. The invention therefore proposes the following hashing method, which greatly reduces hash collisions while preserving speed. An audio file stream S and a predetermined constant character string M are taken as input. The MD5 hash value H1 of the input audio file stream S is calculated; the input audio file stream S is then added to the predetermined constant character string M and the MD5 hash value H2 of the result is calculated; H1 and H2 are summed to give the final hash value. The secondary hash value of the audio file obtained in this way serves as the content feature of the audio file.
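The two-stage hash can be sketched with `hashlib`. Two interpretations are assumed here and are not confirmed by the text: "adding" S and M is taken to mean appending M to the byte stream, and "summing" H1 and H2 is taken to mean integer addition of the two 128-bit digests (concatenating the digests would be another plausible reading). The value of the constant M is made up for the example.

```python
import hashlib

# Hypothetical constant string M; the patent text does not give its value.
M = b"audio-feature-constant"


def audio_feature(stream, constant=M):
    """Return the secondary hash of an audio byte stream as a hex string."""
    h1 = int.from_bytes(hashlib.md5(stream).digest(), "big")             # H1 = MD5(S)
    h2 = int.from_bytes(hashlib.md5(stream + constant).digest(), "big")  # H2 = MD5(S + M)
    return format(h1 + h2, "x")                                          # H1 + H2
```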
The content features of an application installation file comprise image content features, interface description file content features, and audio file content features. The image content features are a set of image "fingerprints". The content feature of one interface description file is a hash set, and the content features of all interface description files in an installation file are the union of multiple hash sets. The audio file content features are a hash set. All three kinds of content feature set can be treated as sets of strings. The invention uses content similarity as the measure, calculated as the proportion of the smaller of the sets A and B that is taken up by the elements in their intersection. This measure compares sets of different sizes effectively. The content similarity L(A, B) is expressed as:
L(A, B) = |A ∩ B| / min(|A|, |B|).
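The formula for L(A, B) translates directly into a set operation. In the sketch below, returning 0.0 when either set is empty is an assumption made to avoid division by zero; the text does not cover that case.

```python
def content_similarity(a, b):
    """L(A, B) = |A ∩ B| / min(|A|, |B|) over two collections of strings."""
    set_a, set_b = set(a), set(b)
    if not set_a or not set_b:
        return 0.0  # assumption: an empty feature set matches nothing
    return len(set_a & set_b) / min(len(set_a), len(set_b))
```

Because the denominator is the size of the smaller set, padding one application with extra files does not lower the score: `content_similarity({"h1", "h2"}, {"h1", "h2", "junk1", "junk2"})` is still 1.0.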
The feature file set similarity formula is therefore derived from the file feature similarity formula and the content similarity formula; it states that the similarity of file sets is equivalent to the content similarity of the file sets. The file similarity is derived from the weighted similarity formula; it states that file similarity is equivalent to the weighted value of the file set similarities, that is, the weighted value of the three kinds of feature file content similarity.
Because file similarity is obtained from the content similarity of feature files, it is not disturbed by changes to the directory structure. Because the chosen similarity measure takes the size of the smaller of the two sets as its reference, it also resists the interference of inserted garbage files.
According to another aspect of the invention, an anti-disguise system for mobile applications is also proposed. A message digest algorithm first takes an initial fingerprint sample of every file on the server, and the fingerprints are stored in a remote security database and in a local security file. An anti-disguise component is created to process the access requests submitted by clients. It analyses each access request, extracts the access path, compares the fingerprint of the application installation file with the fingerprint in the library, and returns a response plan. The check traces directly back to the web page file, so it suits websites with both dynamic and static pages. Disguised page files are recovered by calling a local page snapshot and comparing file checksums.
Further, the system mainly coordinates four actions: access requests, disguise identification, site file updates, and event alarms. When the system receives a web access request, it calls the disguise identification module to analyse each HTTP request and trace the requested file and access path. The anti-disguise component mounted in the security component calculates the digital fingerprint of the application installation file and compares it with the original fingerprint in the secure area to judge whether the installation file has been disguised. If it has not, the Web server responds to the user with a normal HTTP response. Otherwise, the emergency recovery module is enabled immediately and responds to the user with the local page snapshot; the recovery module then replaces the disguised file with the local backup to complete the repair. With snapshots enabled, even if a page file is disguised or replaced by an attacker, the server does not pass the disguised page on to the visitor, which avoids harmful consequences. The system records a disguise log and notifies administrators by SMS or e-mail. Once anti-disguise protection is enabled on a server, every server file is locked and cannot be updated without permission; after authentication unlocks the files, they can be updated over FTP or SSL. The local fingerprint library, backup files, and snapshots are synchronised with the remote library files in good time to keep the data consistent.
The system implements anti-disguise protection through five functions: protecting the integrity of site files, monitoring and processing HTTP access requests, quickly recovering disguised files, alarming, and trusted publishing. Accordingly, the system consists of three parts: a server side, a client, and a publishing side.
(1) Server side. It handles communication among multiple clients, with the database as the hub. For each client it provides file backup, snapshot storage, handling of the digital fingerprints produced by site file initialisation, and storage of the various logs and disguise alarm information. During operation the server side opens only the ports used to communicate with the clients and the database, to maximise system security; this structure also lays a foundation for porting the system.
(2) Client. It is installed on the protected server without changing the existing network topology and establishes trusted communication with the server side and the publishing side. The client provides six functions: initialisation, monitoring and tracing of accessed files, site directory locking, disguise identification, disguise recovery, and local resource backup. It is the core of the whole anti-disguise system.
When the client is first enabled, it initialises the protected site files on the server: it collects the digital fingerprint of every file, stores the fingerprints in the security database on the server side, and keeps a backup in a local fingerprint file. To keep local files safe, the local digital fingerprints, backup files, and snapshots are encrypted with the symmetric cipher AES. When the client receives an update command from the publishing side, it unlocks the protected directory and updates the digital fingerprints of the updated files. When the anti-disguise module handles a client's page access request, it calculates the fingerprint of the requested page file from its name and access path and compares it with the fingerprint in the secure area. If they match, the request is answered. Otherwise the disguise recovery and event alarm modules are enabled and post-processing is performed: the source IP, source port, destination port, disguising process ID, and modified content of the request are recorded, and an alarm message is built to notify administrators. To answer client requests quickly, the recovery module first reads the page snapshot to serve the user's access, then decrypts the local backup file and uses it to replace the disguised file. If the local backup file has been destroyed, the backup file is delivered from the server side for recovery, which covers catastrophic events.
(3) Publishing side. It mainly handles publishing new servers and updating files on existing servers. After the publishing side is authenticated by the client, the client creates a new site or unlocks the requested site according to the request and completes the publish instruction; when publishing finishes, the client locks the site again.
System initialisation proceeds as follows. After each client submits its server feature information, a directory named after that feature information is created in a designated area on the server side. To guarantee uniqueness, the feature information is the hash value of a character string formed from the client machine's IP address, CPU ID, and hard disk ID. The server side creates a unified database that stores each site's file fingerprints, logs, and alarm information. After the client's connection to the server side has been verified, the client configures its local operating conditions: it specifies the sites on the current server that need protection and the file types of each site, and it creates a local security directory, named with the ciphertext of the site name and creation time, to hold backup and snapshot files, the XML document that stores the fingerprint data, and the log and alarm data.
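The feature information used to name each client's directory can be illustrated as a hash over the three identifiers. The text specifies neither the hash function nor how the string is formed, so MD5 and the "|" separator below are assumptions, and `server_feature_id` is a made-up name.

```python
import hashlib


def server_feature_id(ip_address, cpu_id, disk_id):
    """Hash of the string formed from the client's IP address, CPU ID and disk ID."""
    combined = "|".join((ip_address, cpu_id, disk_id))  # separator is an assumption
    return hashlib.md5(combined.encode("utf-8")).hexdigest()
```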
The file preprocessing component first calls the encryption engine to generate a key pair with the public-key algorithm RSA. The public key is encrypted with AES and stored on the backup server; a copy of the encrypted public key is then saved together with the private key in the local security directory, and public keys are exchanged with the backup server in good time, which prepares the working environment for data synchronisation and site recovery. Next, after the list of sites to protect and their file types has been read, the traversal engine is called. It traverses the server files with the specified suffixes and calculates a unique, irreversible digital fingerprint for each with the MD5 algorithm. The fingerprint results are stored in the database in a defined data structure, and an XML document named after the site is generated and stored in the local security file for the anti-disguise component to compare against. Finally, the traversed site data is encrypted with the client's public key and kept in the local security directory until it is synchronised with the backup server.
Throughout disguise identification, the anti-disguise component, developed on top of the security component, analyses HTTP requests, extracts the access path and file name from the data submitted by the HTTP client, monitors file integrity in real time, and verifies the legitimacy of file changes. The anti-disguise component is developed with in-kernel techniques; a mapping relationship is established in the mapping table through the security component interface, and the component is loaded by the server into its process space, where it calculates the fingerprint of each visited web page and compares it with the original fingerprint.
When the server receives an HTTP request, it first traces the request to the application installation file, then computes the hash value of
that installation file, and finally calls the fingerprint comparison component. The component reads the original fingerprint of the
installation file from the local secure area, decrypts it, and compares it with the newly computed fingerprint. If the two match, the HTTP
request is answered; otherwise processing enters the recovery and emergency response flow. After the emergency response component receives
the comparison-failure command, the system decrypts the local snapshot file of the same name and generates HTML-format text to answer the
HTTP request, which preserves the efficiency and quality of the server's HTTP responses. Once the response has been sent, the system
retrieves the original file from the local secure area as quickly as possible, decrypts it with AES, and replaces the camouflaged file with
it, protecting the file to the greatest possible extent. If recovery fails, the current file is disabled and the request is redirected to a
specified page. While the file is being recovered, the system records a camouflage log and sends alert information to administrators by SMS
or e-mail, providing a basis for later data analysis and management. Calling the snapshot and redirecting the access each take only a few
milliseconds, so the requester never receives content from the camouflaged page. At a set interval, the client calls the file-system filter
driver module to compute, compare, and identify the digital fingerprints of the protected websites on the current server and of files of the
specified types, so as to ensure that the digital fingerprints remain consistent everywhere.
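The request-time check described above can be sketched as follows. This is only an illustration under stated assumptions: the text does not name the hash algorithm, so SHA-256 is assumed; the function names `fingerprint` and `handle_request` and the status strings are hypothetical; and the decryption of the stored fingerprint and snapshot, the logging, and the alerting are omitted.

```python
import hashlib
import hmac


def fingerprint(data: bytes) -> str:
    """Hex digest used as the file fingerprint (SHA-256 is an assumption)."""
    return hashlib.sha256(data).hexdigest()


def handle_request(current: bytes, original_fp: str, backup: bytes):
    """Return (status, body) for one requested file.

    "ok":       the current file matches its original fingerprint.
    "restored": the current file was altered, the trusted copy is served.
    "redirect": the trusted copy also fails verification, nothing is served.
    """
    if hmac.compare_digest(fingerprint(current), original_fp):
        return "ok", current
    # Mismatch: fall back to the trusted copy, but verify it first.
    if hmac.compare_digest(fingerprint(backup), original_fp):
        return "restored", backup
    return "redirect", b""
```

A caller would map "redirect" to the specified page and, on "restored", overwrite the altered file with the returned body.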
In summary, the present invention proposes a file processing method that identifies applications by extracting content features from their
installation files. It effectively resists the interference caused by camouflage and malicious modification of files and directories, and
its feature extraction process reduces the scale of the file content features, improving computational efficiency.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented
by a general-purpose computing system. They can be concentrated in a single computing system or distributed over a network formed by
multiple computing systems. Optionally, they can be implemented as program code executable by a computing system, so that they can be stored
in a storage system and executed by the computing system. Thus, the present invention is not limited to any specific combination of hardware and software.
It should be understood that the above embodiments of the present invention are used only to illustrate or explain the principles of the
present invention and do not limit it. Therefore, any modification, equivalent substitution, improvement, and the like made without
departing from the spirit and scope of the present invention shall fall within its scope of protection. Furthermore, the appended claims of
the present invention are intended to cover all changes and modifications that fall within the scope and boundaries of the appended claims,
or within the equivalents of such scope and boundaries.
Claims (3)
- 1. A document processing method for identifying a camouflaged application installation file, characterized by comprising:

  selecting feature files of predetermined types from the installation file, the feature files of the predetermined types including interface description files, audio files and image files; extracting the features of the feature files using a feature extraction step; and, based on the features, identifying whether the installation file is camouflaged by comparing the similarity with a preset threshold;

  in the feature extraction process, when extracting the features of an interface description file stored in XML file format, XML file similarity comparison includes structural similarity and content similarity; the XML file is converted into a tree structure, the XML structural difference is obtained by comparing the differences between trees, and the XML content difference is obtained by comparing the differences between tree nodes; interface description files in the application are stored according to predetermined rules, and a structure file name list is obtained according to the interface description file specification; structural features are then extracted according to the structure file name list, and the structural features and symbol information in the interface description file are filtered out to obtain the content information; finally, hash values are calculated for the structure and content information to obtain structural feature values and content features; after processing, each interface description file yields a hash array, so that the content similarity of interface description files is converted into the similarity of hash arrays;

  the application installation file content features include image content features, interface description file content features and audio file content features; the image content features are a set of image "fingerprints"; the content feature of one interface description file is one hash set, and the content features of all interface description files in the application installation file are composed of multiple hash sets; the audio file content features are a hash set; all three kinds of file content feature sets are regarded as string sets;

  content similarity is selected as the criterion, and is computed as the proportion that the intersection of sets A and B occupies in the smaller of A and B; content similarity is denoted L(A, B):

  $$L(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}$$

  that is, the feature file set similarity formula is derived from the file feature similarity formula and the content feature similarity formula, indicating that file set similarity is equivalent to the content similarity of the file sets; file similarity is derived from the weighted similarity formula, indicating that file similarity is equivalent to the weighted value of the file set similarities, i.e. the weighted value of the three kinds of feature file content similarity;

  the file content features of the feature files of the predetermined types have a signature property, the file content features extracted from different applications are distinct, and the file content has a distance property; and the feature extraction process further comprises: obtaining the file interface of the installation file; locating the feature files according to position offsets within the compressed file; collecting statistics on the feature files in the application; comparing different algorithms according to the statistical results and optimizing the algorithm; applying a multithreading scheme in the extraction process and rewriting those functions that do not support multithreading; and, after feature extraction, performing identification based on the file content features, using a hash table for counting and similarity measurement according to the statistical characteristics of the application;

  the feature files of specific types include interface description files, image files and audio files, and the application installation file is described as appfile = {image; audio; profile}; the content feature set of each class of file contains the features of all files of that class, expressed as:

  $$\mathrm{image}_f = \sum_{i=1}^{n} \mathrm{image}_f[i];$$

  $$\mathrm{audio}_f = \sum_{i=1}^{n} \mathrm{audio}_f[i];$$

  $$\mathrm{profile}_f = \sum_{i=1}^{n} \mathrm{profile}_f[i];$$

  where image_f, audio_f and profile_f denote the features of image files, audio files and interface description files respectively, and n denotes the number of files that each file type includes; the process of the function com() that computes file content feature similarity is expressed as:

  $$\mathrm{com}(app1, app2) = \mathrm{com}(appfile1, appfile2)$$

  each kind of feature of the two applications is compared, and the file feature similarity is computed as follows, indicating that the file similarity in the installation files is equivalent to the similarity of all files of that type in the two application installation files:

  $$\mathrm{com\_image}(app1, app2) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{com}(\mathrm{image}_{f1}[i], \mathrm{image}_{f2}[j]);$$

  $$\mathrm{com\_audio}(app1, app2) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{com}(\mathrm{audio}_{f1}[i], \mathrm{audio}_{f2}[j]);$$

  $$\mathrm{com\_profile}(app1, app2) = \sum_{i=1}^{n} \sum_{j=1}^{m} \mathrm{com}(\mathrm{profile}_{f1}[i], \mathrm{profile}_{f2}[j]);$$

  where m denotes the number of files that each file type includes;

  weights are assigned to the three kinds of file content similarity, and the application installation file similarity is represented by the weighted similarity of the three kinds of file content features; the weighted similarity formula is:

  $$\mathrm{com}(app1, app2) = \mathrm{com}(appfile1, appfile2) = \mathrm{com\_image} \times \alpha + \mathrm{com\_audio} \times \beta + \mathrm{com\_profile} \times \gamma;$$

  where the values of α, β and γ change dynamically with the differences in the contents of com_image, com_audio and com_profile, i.e. weights are assigned according to the magnitudes of the three values com_image, com_audio and com_profile, and the three optimal weights are determined by learning.
- 2. The method according to claim 1, characterized in that, in the feature extraction process, when extracting the features of an image file: the size of the image in the installation file is first reduced and the color image is converted into a grayscale image; the average gray level is calculated; image content features are extracted according to a similarity hash algorithm; a character string is generated for each image from its brightness and pattern to serve as the fingerprint of the image; and the similarity between images is determined by the degree of approximation between fingerprints; wherein reducing the image size means shrinking the image to K × K pixels, which eliminates the interference of image resolution with similarity comparison and removes differences in image size and aspect ratio, retaining only structure and luminance information, with K set to 128; and wherein image content similarity comparison includes calculating the Hamming distance between fingerprints.
- 3. The method according to claim 2, characterized in that, in the feature extraction process, when extracting the features of an audio file: the file hash value is used as the audio file feature; an audio file stream S and a predetermined constant character string M are input; the MD5 hash value H1 of the input audio file stream S is calculated; the input audio file stream S is then added to the predetermined constant character string M and the MD5 hash value H2 of the result is calculated; and H1 and H2 are summed to obtain the final hash value, which serves as the content feature of the audio file.
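The set-based measures recited in the claims above can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation. The function names are hypothetical. "Adding" M to S is read as byte concatenation and "summing" H1 and H2 as integer addition; both readings are assumptions. The weights are fixed here, whereas the claims state that they vary dynamically and are learned.

```python
import hashlib


def overlap(a: set, b: set) -> float:
    """L(A, B) = |A ∩ B| / min(|A|, |B|); 0.0 if either set is empty."""
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))


def audio_feature(stream: bytes, constant: bytes) -> str:
    """H1 = MD5(S), H2 = MD5(S + M); the feature is H1 + H2 in hex."""
    h1 = int.from_bytes(hashlib.md5(stream).digest(), "big")
    # Assumption: "adding" M to S means appending the bytes of M.
    h2 = int.from_bytes(hashlib.md5(stream + constant).digest(), "big")
    # Assumption: plain integer sum, not reduced modulo 2**128.
    return format(h1 + h2, "x")


def hamming(fp1: int, fp2: int) -> int:
    """Number of differing bits between two image fingerprints."""
    return bin(fp1 ^ fp2).count("1")


def file_similarity(app1: dict, app2: dict, weights: dict) -> float:
    """Weighted sum of the per-type content similarities."""
    return sum(w * overlap(app1[k], app2[k]) for k, w in weights.items())
```

For example, with feature sets keyed by "image", "audio" and "profile", `file_similarity(app1, app2, {"image": 0.5, "audio": 0.25, "profile": 0.25})` returns a score that can then be compared with the preset threshold.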
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510108614.6A CN104657665B (en) | 2015-03-12 | 2015-03-12 | A kind of document handling method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104657665A CN104657665A (en) | 2015-05-27 |
CN104657665B true CN104657665B (en) | 2017-12-08 |
Family
ID=53248776
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510108614.6A Active CN104657665B (en) | 2015-03-12 | 2015-03-12 | A kind of document handling method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104657665B (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 2023-06-13. Patentee after: Sichuan Shenhu Technology Co.,Ltd. (address: F13, Building 11, Zone D, New Economic Industrial Park, No. 99, West Section of Hupan Road, Xinglong Street, Tianfu New District, Chengdu, Sichuan, 610000). Patentee before: SICHUAN CINGHOO TECHNOLOGY Co.,Ltd. (address: 610041 No. 5, floor 1, unit 1, building 19, No. 177, middle section of Tianfu Avenue, high tech Zone, Chengdu, Sichuan Province). |