AU640335B1 - - Google Patents
- Publication number
- AU640335B1 AU2840292A
- Authority
- AU
- Australia
- Prior art keywords
- data
- template
- segments
- segment
- value
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Lifetime
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F21/00—Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
- G06F21/50—Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
- G06F21/55—Detecting local intrusion or implementing counter-measures
- G06F21/56—Computer malware detection or handling, e.g. anti-virus arrangements
- G06F21/562—Static detection
- G06F21/564—Static detection by virus signature recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Computer Security & Cryptography (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- Health & Medical Sciences (AREA)
- Virology (AREA)
- Databases & Information Systems (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Executing Machine-Instructions (AREA)
Description
Regulation 3.2
AUSTRALIA
Patents Act 1952 PETTY PATENT SPECIFICATION FOR A PETTY PATENT
(ORIGINAL)
Name of Applicant: CYBEC PTY LTD
Actual Inventor(s): ROGER HAMLINE RIORDAN
Address for Service: DAVIES COLLISON CAVE, Patent Attorneys, 1 Little Collins Street, Melbourne, 3000.
Invention Title: DATA SEARCHING METHOD
Details of Associated Provisional Application No: PK9530/91
The following statement is a full description of this invention, including the best method of performing it known to us:
This invention relates to a data searching method.
In the field of electronic data processing machines, for example digital computers, it is a common requirement to be able to detect the presence or absence of one or more predetermined data strings in a body or file of stored, incoming or outgoing data. This may be accomplished by searching the body or file of data for the presence of one or more of the predetermined data strings, for example by sequentially examining units of the body or file of data to determine whether a portion of the data file matches one or more of the predetermined data strings. If the data file is large and/or there are a large number of data strings to be searched for, the searching procedure may take a significant amount of time, which is generally undesirable.
The present invention provides a data searching method which in most instances may reduce the time required to determine the presence or absence of one or more of a number of data strings in a body of data.
In particular, computer programs known as computer viruses often act to insinuate themselves into a data storage portion of a computer without the knowledge of the computer user. Once installed in the computer memory or storage area, the computer virus may produce effects which are generally undesirable to the computer user. In order to remove a computer virus from an infected computer or computer disk, it is first necessary to detect the presence and location of the virus in the computer or disk data.
Many computer viruses may be characterised by a particular data string, sometimes referred to as a signature or template of the virus. Thus, in order to detect the presence of a particular virus in a body of data, it is possible to search the body of data for the signature or template which is characteristic of that computer virus. If a portion of the body of data is found which matches the signature or template of a known computer virus there is a high likelihood of that portion of the data being a constituent of a virus. Additionally, in order for a template to be characteristic of a particular virus to a substantial degree it must be at least several units of data in length, with each individual template characteristic of a different virus. At present, there are upwards of 1000 known computer viruses, thus yielding a corresponding number of detectable virus templates. Consequently, if a storage device containing several million units of information is to be searched for the presence of one or more of the known computer viruses, a significant amount of time may be required to carry out the requisite comparisons.
In accordance with the present invention there is provided a method for locating a data pattern in a body of digital data, the data pattern consisting of a sequence of segments of at least one pattern template, comprising the steps of:
A. accessing a segment of the body of data and utilising the accessed segment to address one of a plurality of translation entries, each entry comprising a sequence of data sections which each indicate whether a segment of said body of data occurs as a segment of at least one pattern template in a position corresponding to the sequence position of said data section;
B. logically combining the content of the addressed translation entry with an initially predetermined combination value, shifting the result thereof and storing the shifted result as the combination value;
C. repeating steps A and B for sequential segments of the body of data, wherein a predetermined logic value resulting from the logical combination and shifting of step B is indicative of the existence of the data pattern within the body of data.
Preferably said logical combination is effective to set a flag in the combination value each time the accessed segment matches a segment in a first position of any said at least one pattern template and wherein a set flag will remain set following succeeding shifting and logical combination steps if the succeeding accessed segments match segments in succeeding positions of any said at least one pattern template, and wherein a set flag which is shifted out of said combination value is indicative of the preceding sequence of accessed segments matching a sequence of segments from any of said at least one pattern templates constituting the data pattern.
Preferably a plurality of fixed length pattern templates are utilised to form a table of addressable entries comprising said addressable translation entries, and wherein the logical combination and shifting step is carried out in such a way that as each new segment of the body of the data is entered a new flag is entered into the combination value, indicating the possible occurrence of one of the pattern templates, and each flag will remain set as it is shifted through the combination value if successive segments each occur in successive positions of one or more of the pattern templates, and wherein steps A and B are repeated until a flag is still set when it is shifted out of the combination value, indicating that the preceding portion of data could contain one of the pattern templates, or until the end of the data is reached, and including the step of comparing any sequence of segments of the body of data identified in step C with the individual pattern templates to determine if it matches any given template.
Furthermore, the method may also employ negative logic, where a zero bit indicates a true condition, whereby a logical OR operation would be substituted for a logical AND operation.
Preferably the method includes a step of comparing the portion of the body of data identified as matching the data pattern with said at least one template data string sequence to determine whether or not the portion matches a said template data string sequence.
Preferred embodiments of the invention are hereinafter described in detail, by way of example only, with reference to the accompanying drawings, wherein:
Figure 1 is a flow chart showing the steps of a first embodiment of the method of the present invention;
Figure 2 is a flow chart of the steps for creating a translation table for the first embodiment;
Figures 3A and 3B illustrate a flow chart in accordance with a second embodiment of the present invention;
Figures 4A and 4B show an example of a template table and translation table, respectively, in accordance with the first or second embodiments;
Figure 5 illustrates a flow chart for the creation of a translation table in accordance with a third embodiment of the present invention;
Figure 6 illustrates a flow chart of the method of the third embodiment; and
Figure 7 illustrates an example of a translation table in accordance with the third embodiment.
The preferred methods disclosed herein are described in relation to the determination of whether or not a particular computer file is infected with one or more of a number of computer viruses. Throughout the specification the terms data file and body of data are utilised to refer to digital information which is to be searched by the described methods, whether this digital information be a computer program, stored digital data for a computer program, or otherwise. In this case a typical implementation of one of the preferred methods may be as a computer program for use on a personal computer.
As mentioned above, many computer viruses may be characterised by a data string several bytes long, the characterising data strings being referred to herein as templates. Furthermore, standard conventions are used throughout the specification with respect to binary digital logic notation, for example where the left most bit of a binary number is designated the most significant bit (msb) and the right most the least significant bit (lsb). Also, both positive and negative digital logic conventions are referred to and it is understood that the methods of the invention may utilise either or both positive and negative logic. The method described in relation to Figure 1, for example, operates with positive logic, wherein a logical zero indicates "false" and a logical one indicates "true". In addition, where reference is made to the logical combination of two or more digital values this is understood to mean that the digital values are operated upon by a logic operator, such as an AND or OR operator, and the operator used may be dependent upon whether positive or negative logic is used.
Figure 1 shows a flow diagram 2 of a first scanning method beginning at step 4. The flow chart 2 consists essentially of initialisation steps 6 to 14, and an iteration loop including steps 16 to 26. Generally speaking, the method illustrated in flow chart 2 acts to search a data file for the presence of one or more of a plurality of templates.
Initially, the templates are listed and stored in a template table (step 6). The templates are 16 bytes, or 8 words, long and are segmented into sequences of one unit segments at step 8, where a unit segment is one word in length. Such an arrangement of each segment unit comprising a 16 bit word, and each template containing 8 words is particularly advantageous if such a method is implemented on a computer, such as a personal computer, having a 16 bit address space and 8 bit addressable units.
At step 10, an addressable translation table is allocated in digital memory containing one addressable byte for each possible value of the template unit segments. In this case, a unit segment word has 65,536 possible values, so the translation table will contain 65,536 bytes. This is particularly convenient for operation on the microprocessors of personal type computers such as those utilising the Intel 80X86 series of microprocessors, as this is the largest array which can be addressed directly.
Once an addressable translation table array has been allocated having a length corresponding to the address space of a template unit segment, data is placed in the translation table entries in accordance with entries in the template table unit segments created at step 8. Briefly, each bit in the translation table array is set in accordance with the occurrence of the array entry address value in the corresponding sequence position of any one of the segmented templates. This is accomplished by utilising each template segment as an index into the translation table array so as to set the bit in the indexed array entry corresponding to the sequence position of the template segment. The procedure for creating the completed translation table of step 12 in flow chart 2 is described in greater detail hereinafter in relation to flow chart 50 illustrated in Figure 2.
The method utilises a combination unit, which may conveniently comprise an internal register of the computer microprocessor. At step 14 the combination unit is initialised by storing it with a predetermined value. In the present example the combination unit comprises an 8 bit storage register which is pre-stored at step 14 with the binary value 00000001.
The search iteration loop begins at step 16 where a unit segment of the data file is retrieved from the memory buffer containing the data file and temporarily stored in an internal register of the microprocessor. The unit segment of the data file which is retrieved at step 16 is the same length as the unit segments of the segmented template sequences, in this instance being one word or 16 bits. The 16 bit data file segment is then utilised at step 18 as an address or index into the translation table array created at steps 10 and 12. The byte of translation table data addressed by the data file segment at step 18 is then retrieved at step 20, the retrieved byte being referred to as a mask.
The mask retrieved at step 20 is then logically combined with the current value of the combination unit by way of a logical AND operation, with the result thereof being stored in the combination unit (step 22). The combination unit register is thereafter shifted by one bit towards the left (most significant bit), and a logical one is shifted into the least significant bit at the right of the register. By shifting the combination unit by one bit to the left, the value contained in the most significant bit of the combination unit is shifted out of the combination unit register, typically into the carry flag associated with that register.
At step 26 the value of the bit shifted out of the combination unit at step 24 is examined to determine whether it corresponds to a logical one (true) value or logical zero (false) value. If a true value is returned at step 26, the procedure is directed to step 28. A true value shifted out of the combination unit at step 24 indicates that the immediately preceding 8 words of file data matches words in corresponding segment numbers of one or more of the segmented templates.
This is an indication that the preceding 8 words of file data may match one of the 8 word templates, thus indicating the presence of one of the computer viruses characterised thereby. Consequently, step 28 acts to directly compare each of the templates listed in the template table with the relevant 8 words of file data to determine whether the file data does in fact match one of the virus templates.
Any convenient method of comparing the relevant file data and templates may be used at step 28. For example, the templates in the template table may conveniently be sorted in ascending order, and a binary comparison used. This may be used to quickly determine if the suspect string of file data matches any one of the templates, without actually comparing the file data string with each template. For a template table containing 1024 (2^10) entries, for example, this method of comparison will involve a maximum of 10 comparisons to determine whether or not a match exists. Many other suitable comparison methods exist and are known to those skilled in the art, and consequently will not be described in detail herein.
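By way of illustration only, the sorted-table comparison suggested for step 28 might be sketched in Python as follows. The function name and the three 16 byte templates are hypothetical and are not taken from the specification; the sketch merely assumes that the templates are held in ascending order.

```python
from bisect import bisect_left


def matches_any_template(candidate, sorted_templates):
    """Return True if `candidate` equals one entry of `sorted_templates`.

    `sorted_templates` must already be in ascending order, so that a binary
    search can be used instead of comparing against every template in turn.
    """
    i = bisect_left(sorted_templates, candidate)
    return i < len(sorted_templates) and sorted_templates[i] == candidate


# Hypothetical 16 byte templates, invented purely for this illustration.
templates = sorted([
    bytes(range(16)),
    bytes([0xCD, 0x21] * 8),
    bytes([0xFF] * 16),
])

print(matches_any_template(bytes([0xCD, 0x21] * 8), templates))
print(matches_any_template(bytes(16), templates))
```

Because each probe halves the remaining range, the number of probes grows with the logarithm of the table size, which is consistent with the figure of about 10 comparisons quoted for 1024 entries.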
If a logical zero or false value is shifted out of the combination unit register at step 24 and detected at step 26, the procedure continues to step 30, which determines whether the end of the data file has been reached. This step may conveniently be performed by utilising a register initially containing a value indicative of the total length of the data file, and decrementing the register each time data is read from the file. Thus, if the file length register holds a zero value when examined at step 30 this is indicative of the end of the file, and the procedure is terminated at step 32. If the end of the file has not been reached, as determined at step 30, the procedure is directed back to step 16 to read the next file data word for a further iteration of the search loop containing steps 16 to 26.
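The search loop of steps 14 to 30 might be sketched in Python as below, using positive logic. This is an illustrative sketch rather than the implementation contemplated in the specification: the function name is hypothetical, the translation table is assumed to be a dictionary in which missing entries count as an all-false mask, and the eight word template in the usage lines is invented.

```python
def scan_words(words, table, segments=8):
    """Yield the index of each data word that completes a candidate match.

    `words` is the file data as a sequence of 16 bit integers and `table`
    maps a word value to its mask (values not present count as mask 0).
    """
    top = 1 << (segments - 1)        # bit that is shifted out into the carry
    keep = (1 << segments) - 1       # limits the register to `segments` bits
    cu = 1                           # step 14: only the least significant bit set
    for i, word in enumerate(words):
        cu &= table.get(word, 0)     # steps 18 to 22: fetch the mask and AND it
        carry = bool(cu & top)       # step 24: the bit about to leave the register
        cu = ((cu << 1) | 1) & keep  # step 24: shift left and shift a one in
        if carry:                    # step 26: a candidate match ends at word i
            yield i


# Hypothetical template of eight distinct words, invented for illustration.
template = [1, 2, 3, 4, 5, 6, 7, 8]
table = {word: 1 << position for position, word in enumerate(template)}
print(list(scan_words([9] + template + [9], table)))
```

Each index yielded marks only a candidate; as described for step 28, the preceding eight words would still need to be compared directly against the templates.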
Figure 2 illustrates a flow chart 50 of a method for creating a translation table array, corresponding to steps 6 to 12 of the flow chart 2. The procedure begins at step 52, and at steps 54 and 56 the templates are listed and segmented into the template table, for example in random access memory. The segmentation of the templates may be a notional segmentation rather than a physical procedure, since the templates themselves may be merely stored in successive and contiguous memory locations or blocks. At step 58 an amount of memory corresponding to the address space of one unit segment is allocated for the translation table array. In the simplest case, the translation table array memory would comprise contiguous memory locations beginning at address 0000h, however the translation table array memory locations may alternatively be allocated elsewhere with appropriate virtual address mapping as is common in computer memory access.
Step 60 is the beginning of a loop in which successive templates are read from the template table. In the first pass through the loop beginning at step 60 the first template from the table is retrieved, and thereafter successive templates from the table are retrieved each time step 60 is executed. Similarly, step 62 marks the beginning of a loop concerned with individual segments of the template retrieved at step 60. Segments are retrieved from the template of step 60, in order of their occurrence in the template, each time step 62 is executed. The 16 bit segment retrieved at step 62 is utilised in step 64 as an index or address into the translation table array, each different valued template segment corresponding to a unique entry in the translation table. Having identified the relevant translation table entry at step 64, the bit in the array entry which corresponds to the position of the segment in the template segment sequence is set to a true value, the translation table entries having all been initialised to a false value during allocation of the array memory. For example, if the first segment of a particular template contains the value 5555h, then the least significant bit of the entry in the translation table array having the index or address 5555h would be set in accordance with the procedure just described. Similarly, if the segment was the last in a particular template, it would be the most significant bit of the relevant translation table entry which would be set.
Steps 60 and 62 of flow chart 50 may alternatively be reversed such that the first segment of each template is entered in the translation table, followed by the second segment of each template, and so on until the final segment of each template has been processed at steps 64 and 66. This alternative would also require that the procedures of steps 68 and 70 be performed in reverse order. Either of the above mentioned systems for setting the entries in the translation table may be conveniently implemented depending upon the manner in which the templates are actually stored in memory.
Step 68 of the flow chart 50 causes the procedure to repeat steps 62 to 66 for each segment in the template retrieved at step 60, and step 70 returns the procedure to step 60 for the next template in the template table, until the data corresponding to the last template has been entered in the translation table. In this event, the translation table is complete and the flow chart procedure terminates at step 72.
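The table building procedure of flow chart 50 might be sketched in Python as follows. The function name is hypothetical, templates are assumed to be supplied as lists of eight 16 bit integers, and the single template in the usage lines is invented apart from its first segment 5555h, which echoes the example given above.

```python
def build_translation_table(templates):
    """Build a 65,536 byte translation table from templates of 8 words each."""
    table = bytearray(65536)                  # step 58: every entry starts false
    for template in templates:                # outer loop, steps 60 and 70
        for position, segment in enumerate(template):  # inner loop, steps 62 and 68
            table[segment] |= 1 << position   # steps 64 and 66: set the position bit
    return table


# Hypothetical template; only the first value (5555h) comes from the text.
table = build_translation_table(
    [[0x5555, 0x0102, 0x0304, 0x0506, 0x0708, 0x090A, 0x0B0C, 0x1234]]
)
print(f"{table[0x5555]:08b}")   # first segment: least significant bit
print(f"{table[0x1234]:08b}")   # last segment: most significant bit
```

Swapping the two loops, as the alternative ordering of steps 60 and 62 suggests, would produce the same table because each assignment only sets a bit.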
Figures 4A and 4B illustrate an example of a template table and translation table respectively. In Figure 4A, the template table 150 shows rows 154 storing templates, with the templates being separated into one word columns 156 corresponding to unit segments. The template table 150 is shown containing two templates (template numbers 1 and 2), each divided into four single word segments. For example, segment number 1 of template number 1 is indicated at 160, and contains the hexadecimal value 0001. As indicated by the extended broken lines in Figure 4A, the template table 150 is not constrained to a table of two templates having four segments as illustrated, and may more realistically contain of the order of 1000 templates divided into eight single word segments in an application such as the detection of computer viruses as previously discussed.
Furthermore, although the template table is illustrated in Figure 4A as being a table containing rows and columns as is conventionally known, this is a notional construction only, and when the method is implemented on a digital computer the templates may, for example, in fact occupy contiguous blocks of random access memory.
Figure 4B illustrates translation table 152 containing entries derived from the templates shown in template table 150. The translation table array 152 is illustrated in left and right columns with addresses indicated at 158 corresponding to the entry in the left hand column, whereas the entry in the right hand column corresponds to the addresses 158 incremented by one. For example, the bottom entry in the left hand column of the array 152 corresponds to the hexadecimal address 0fffeh whilst the bottom entry in the right hand column corresponds to the address 0ffffh. The entry corresponding to address 0001h is indicated by reference numeral 162, and contains the binary value 00000101. Beginning with the least significant bit of this entry, this corresponds to segment number 1 or column 1 in the template table 150, and contains a true value in accordance with the entry 160. Similarly, bit number 3 of the entry 162 in translation table 152 also contains a true value corresponding to the entry of 0001h in segment 3 of template number 2. In this way, the entries in the translation table array 152 are constructed from the entries in the template table 150.
The following is an example of the procedure described in relation to Figure 1, including sample calculations utilising the data of the template table 150 and translation table 152 illustrated in Figures 4A and 4B.
Sample Calculation
For the purposes of the following example it is assumed that the file data operated upon by the procedure of flow chart 2 (i.e. the data read in at step 16) contains the following sequence of hexadecimal data:
0001 0000 cd21 fffe
For the purposes of simplicity the calculations are shown in relation to a 4 bit combination unit register rather than an 8 bit register as might ordinarily be employed.
initialise combination unit (CU) (step 14): 0001b <1>
retrieve 1st data word (step 16): 0001h <2>
retrieve entry from array 152 corresponding to <2> (steps 18 & 20): 0101b <3>
AND <3> with CU <1> (step 22): 0001b <4>
shift CU <4> left 1 bit (step 24): 0011b <5>
examine carry flag (step 26): c = 0 <6>
retrieve 2nd data word (step 16): 0000h <7>
retrieve entry from array 152 corresponding to <7> (steps 18 & 20): 0010b <8>
AND <8> with CU <5> (step 22): 0010b <9>
shift CU <9> left 1 bit (step 24): 0101b <10>
examine carry flag (step 26): c = 0 <11>
retrieve 3rd data word (step 16): 0cd21h <12>
retrieve entry from array 152 corresponding to <12> (steps 18 & 20): 0101b <13>
AND <13> with CU <10> (step 22): 0101b <14>
shift CU <14> left 1 bit (step 24): 1011b <15>
examine carry flag (step 26): c = 0 <16>
retrieve 4th data word (step 16): 0fffeh <17>
retrieve entry from array 152 corresponding to <17> (steps 18 & 20): 1000b <18>
AND <18> with CU <15> (step 22): 1000b <19>
shift CU <19> left 1 bit (step 24): 0001b <20>
examine carry flag (step 26): c = 1 <21>
At this point it is evident that the shifting operation performed on the combination unit value CU <19> has shifted a "true" (logic 1) value out of the most significant bit of the four bit register into the carry flag. Thus, when the carry flag is examined at calculation <21>, a "true" value is returned indicating a match of the previous 4 words of file data with corresponding segments of the templates in the template table 150. This would direct the procedure 2 to step 28 to compare each template with the file data to determine if one of the templates does in fact match the file data.
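The sample calculation can be replayed with a short Python sketch. Only the four translation table entries quoted in the calculation are used; treating every other entry as zero, and the names `TABLE` and `trace`, are assumptions made for this illustration.

```python
# Translation table entries quoted in the sample calculation. All other
# entries are assumed to be zero (all false) for this illustration.
TABLE = {0x0001: 0b0101, 0x0000: 0b0010, 0xCD21: 0b0101, 0xFFFE: 0b1000}


def trace(words, table, bits=4):
    """Return (value after AND, value after shift, carry) for each data word."""
    cu = 1                                 # step 14: 0001b
    rows = []
    for word in words:
        anded = cu & table.get(word, 0)    # steps 18 to 22
        shifted = (anded << 1) | 1         # step 24: shift left, shift a one in
        carry = shifted >> bits            # bit pushed out of the register
        cu = shifted & ((1 << bits) - 1)   # keep only the register bits
        rows.append((anded, cu, carry))
    return rows


rows = trace([0x0001, 0x0000, 0xCD21, 0xFFFE], TABLE)
for anded, cu, carry in rows:
    print(f"AND={anded:04b}b  shifted={cu:04b}b  c={carry}")
```

The carry becomes 1 only on the fourth word, which is the condition that sends the procedure to the direct comparison of step 28.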
In this instance, the first three words of file data match with the first three words of template 1, whilst the fourth file data word matches with the fourth word of template 2. Thus, neither of the templates precisely matches the file data and the procedure 2 would ordinarily resume the search.
The unit of data storage and manipulation utilised in many digital computers is an 8 bit byte. However, the procedure described thus far retrieves and operates upon two byte words. This creates a difficulty if, for example, the words of file data retrieved each begin with an even numbered byte but the file contains a data string beginning on an odd numbered byte which matches one of the templates. The procedure described in relation to Figure 1 would not detect the matching data string in this situation. The procedure of flow chart 100 illustrated in Figures 3A and 3B overcomes this difficulty by providing two searching loops, one operating on file data words beginning with even numbered bytes, and the other operating on file data words beginning with odd numbered bytes.
The procedure of flow chart 100 begins at step 102, and requires the use of two 8 bit combination units CU1 and CU2 which are initialised at step 104 such that only the least significant bit of each combination unit register holds a true value. The procedure of flow chart 100 is described as utilising negative logic, wherein a logic zero value indicates true and a logic one value indicates false. The negative logic scheme can be advantageous if the method is implemented on a personal computer based upon an Intel 80X86 microprocessor, since peculiarities of the 80X86 microprocessor instruction set enable a program embodying the method to be slightly shorter if negative logic is used rather than positive logic. This also requires that zeros are entered in the translation table array, instead of ones as previously described, and that a logical OR operator rather than a logical AND operator is used to combine results in the combination unit. Thus, at step 104 the combination unit registers CU1 and CU2 are initialised to hold the value 0feh. Also, the procedure 100 utilises a 16 bit index address register for indexing into the translation table array, the index address register comprising high and low index portions containing respectively a high index address byte and a low index address byte. At step 106 the first byte of the data file is retrieved and placed in the low index address. The file data address is incremented by one byte at step 108, and the file length counter is decremented by one byte at step 110.
Step 112 operates to shift the low index address byte to the high index address byte of the 16 bit index address register. At this stage the next byte of the file data is also retrieved and placed in the low index address byte. At step 114 the file data address is again incremented to take into account the data byte accessed in step 112. Step 116 corresponds to steps 18 to 22 of flow chart 2, wherein the index address is utilised to retrieve an entry from the translation table array, and the retrieved entry is combined with the current value of the combination unit by way of a logical OR operator. In this case, combination unit CU2 is operated upon at step 116, and the result of the combination of the translation table entry and the value contained in CU2 is stored in the combination unit register CU2. Step 118 of flow chart 100 corresponds to step 24 of flow chart 2, where the combination unit CU2 is shifted by one bit left (towards the most significant bit), and a true (logical zero) value is shifted into the right most or least significant bit. In performing this shifting operation, the most significant bit resulting from the operation performed at step 116 is shifted out of the register at step 118, typically into a carry flag bit associated with the combination unit CU2 register. The carry flag is examined at step 120 to determine the value of the bit shifted out at step 118, and if a true value is contained in the carry flag the procedure is directed to step 122. Step 122 designates an alert condition which indicates that the previous eight words of the data file correspond to some combination of word segments contained in the template table. In response to this alert condition a comparison procedure may be carried out in accordance with the comparison described in relation to step 28 of flow chart 2, to determine if the suspect data string of the data file matches any particular template in the template table.
If no match is found the procedure may continue.
If a false value is returned at step 120, or a true value is returned but no match of the data string with any particular template is found at step 122, the flow chart continues to step 124 where the next byte of file data is retrieved and placed in the low index address, the previous low index address being shifted to the high index address portion of the index address register. Consequently, the file data address is incremented by one byte at step 126. Thereafter, step 128 operates in a manner similar to step 116, except utilising the 8 bit combination unit register CU1 in place of the register CU2. In this way, combination unit CU2 may be utilised for the comparison of data words beginning on even numbered bytes, whilst combination unit CU1 is utilised for data words beginning on odd numbered bytes. This is achieved by the action of steps 112 and 124 which shift the file data to be examined by one byte between the operation of the combination steps 116 and 128. Steps 130, 132 and 134 operate in a manner similar to steps 118 to 122, except in relation to combination unit CU1 rather than CU2. The file length counter is decremented by two bytes at step 136, and is examined (step 138) to determine whether the end of the data file has been reached. If the end of the data file is found at step 138 the procedure terminates (step 140), otherwise the flow chart returns to step 112 to read the next byte of file data.
The end of file checking operations of steps 136 and 138 operate on the assumption that none of the predetermined templates terminate on the last byte of a data file, since the procedure of flow chart 100 will disregard the final byte of a data file which is not an integral number of words in length. This is a valid assumption if the virus templates are chosen with this characteristic in mind. If, however, a string matching one or more of the templates is likely to terminate at the final byte of a file, it is possible to decrement the file length counter by one byte at step 136 and implement a further file length counter decrement and end of file check operation between steps 120 and 124.
A suitable assembly language routine for an Intel 80X86 microprocessor is shown below, corresponding to the procedure shown in, and described in relation to, Figures 3A and 3B. This assumes that the file has been read into a buffer memory, the translation table has been generated in accordance with the procedure described in relation to flow chart 50, and the microprocessor registers are set up as follows:
i. DS:SI points to the buffer containing the data file to be examined
ii. ES is set to the start of the computer memory segment containing the translation table
iii. CX contains the length of the data file, in words
iv. AL is used as the combination unit for strings starting on even bytes
v. AH is used as the combination unit for strings starting on odd bytes
vi. BX is used as an index into the translation table, and is comprised of a low index portion BL and a high index portion BH

        mov  ax,0fefeh      ; Set bit zero to 'true' in AL and AH
begin:  mov  bl,[si]
        inc  si
        dec  cx
        jmp  first
; Check word on odd byte boundary.
repeat: mov  bh,bl          ; Move low byte to high
        mov  bl,[si]        ; then load the next byte
        inc  si             ; and advance SI to the next byte
        or   ah,es:[bx]     ; Get entry, combine with previous result
        shl  ah,1           ; Shift result one bit left (and 'true' in)
        jnc  alert1         ; Alert if C is not set (bit 7 still zero)
ok1:    [illegible] cx end
; Check word on even byte boundary. Procedure is exactly the same as above.
first:  mov  bh,bl
        mov  bl,[si]
        inc  si
        or   al,es:[bx]
        shl  al,1
        jnc  alert2
ok2:    loop repeat
end:                        ; The procedure will end here if the file is clean.
alert1:
alert2:

The procedure will reach the alert1 or alert2 markers if the preceding eight words occur at the correct positions in any of the template strings.
If this occurs the suspect file data string must be checked against the templates. If there is a match an alarm procedure will be invoked.
Otherwise the procedure must return to either marker ok1 or ok2 as appropriate.
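The register-level routine above can also be followed in a higher-level form. The Python sketch below is illustrative only and is not part of the specification. The function names `build_table` and `scan` are hypothetical, the table layout (one 8 bit entry per 16 bit index, negative logic, bit n cleared for the n-th word of a 16 byte template) is an assumption inferred from the listing, and the final comparison is reduced to a simple equality test against each template.

```python
def build_table(templates):
    """Assumed layout: 65,536 one-byte entries, all bits false (1)."""
    table = bytearray([0xFF]) * 0x10000
    for t in templates:
        for pos in range(8):                       # eight 16 bit segments
            seg = (t[2 * pos] << 8) | t[2 * pos + 1]
            table[seg] &= 0xFF ^ (1 << pos)        # cleared bit = true
    return table


def scan(data, templates):
    """Return offsets at which a 16 byte template occurs in data."""
    table = build_table(templates)
    cu = [0xFE, 0xFE]      # [even-start unit (AL), odd-start unit (AH)]
    hits = []
    for i in range(1, len(data)):
        start = i - 1                              # first byte of this word
        index = (data[start] << 8) | data[i]       # BH = previous, BL = next
        k = start & 1                              # 0 = even, 1 = odd start
        shifted = (cu[k] | table[index]) << 1      # OR, then shift 'true' in
        cu[k] = shifted & 0xFF
        carry_clear = (shifted & 0x100) == 0       # a true flag was shifted out
        pos = start - 14                           # start of the eight words
        if carry_clear and bytes(data[pos:pos + 16]) in templates:
            hits.append(pos)                       # alert confirmed by comparison
    return hits
```

Calling `scan(file_bytes, [template])` with a 16 byte `template` returns the byte offsets of confirmed matches. Keeping two combination units, selected by the parity of the word's starting byte, mirrors the use of AL and AH in the listing.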
Figures 5 and 6 illustrate flow charts 200 and 250 of a procedure corresponding to a modification of that described in relation to flow chart 100 shown in Figures 3A and 3B. The flow chart 200 shown in Figure 5 shows the necessary steps for creating a modified translation table for use by the procedure of flow chart 250 in Figure 6.
If 17 byte data strings are known for each template then the template may be segmented at both odd and even numbered byte boundaries to yield 8 data word segments beginning on odd numbered bytes and 8 data word segments beginning on even numbered bytes for each template. It is then also possible to disregard the least significant bit of every 16 bit template segment, such that the address space of the resulting 15 bit segments corresponds to a translation table of only 32,768 bytes. However, this also means that there is not a unique translation table entry for each different template segment value, but rather that consecutive even and odd segment values share a single translation table entry.
For example, the segment values 0000h and 0001h, when used as an index address into the translation table array, would both result in the address 0000h being utilised, and would both correspond to the first entry in the translation table array. By reducing the resolution of the translation table by a factor of 2 in this way, however, translation tables for both the odd and even numbered segments of the templates may be combined into a single 32,768 word table. In doing this a slightly shorter, and thus slightly faster, operating procedure may be implemented. The resulting translation table contains 32,768 16 bit words, wherein the even numbered bits of each word are set in accordance with the even numbered template segments, whilst the odd numbered bits of each translation table entry correspond to the odd numbered template segments.
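The effect of discarding the least significant bit can be checked with a few lines of arithmetic. The Python sketch below is illustrative; the helper name `modified_index` is hypothetical and simply clears the least significant bit of a segment value as described above.

```python
def modified_index(segment):
    """Clear the least significant bit of a 16 bit segment value."""
    return segment & 0xFFFE


# Consecutive even and odd segment values share one index address.
assert modified_index(0x0000) == modified_index(0x0001) == 0x0000
assert modified_index(0xFFFE) == modified_index(0xFFFF) == 0xFFFE

# Only even index addresses remain, so the table needs 2**15 entries.
entries = len({modified_index(s) for s in range(0x10000)})
print(entries)  # 32768
```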
Procedure 200 begins at step 202, and in steps 204 and 206 the predetermined template data strings are listed in a template table, and notionally segmented into words beginning on both odd and even numbered bytes. At step 208 memory space is allocated for a 16 bit addressable translation table, and all bits in the allocated memory are set to a false value (either logical 1 or logical 0 depending upon whether negative or positive logic is utilised). Each template is successively indexed at step 210, beginning with the first 17 byte template on the first execution of this step. Each of the word segments of the template indexed at step 210 is then retrieved and used to set bits in the translation table.
Template segments beginning on even numbered bytes are retrieved one at a time on each execution of step 212, and the least significant bit of the even segment is discarded by setting the bit to a zero value (step 214). This may conveniently be achieved by performing a logical AND operation on the 16 bit segment with the value 0fffeh. Having discarded the least significant bit of the segment and thus effectively set the segment value to an even numbered address, the modified segment value is used as an index address into the 16 bit translation table. Since only even numbered addresses are used to index the translation table array, each entry in the translation table comprises 16 digits. Once the relevant entry in the translation table has been identified by the index address, the even numbered bit in the 16 bit entry which corresponds to the even numbered segment retrieved at step 212 is set to a true value (step 218). Steps 212 to 218 are then repeated at steps 220 to 226, using odd numbered template segments and setting odd numbered bits in the translation table entries. Continuing then to step 228, the procedure is directed back to step 212 if there remain segments in the present template which have not yet been entered in the translation table, or is directed to step 230 if the previous template segment was the last in the current template. Similarly, if the template previously indexed at step 210 was the last in the template table then step 230 directs the procedure to its termination at step 232, else the procedure repeats beginning at step 210.
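The table construction of flow chart 200 can be sketched in a few lines of Python. This is an illustrative sketch rather than the patented implementation. The name `build_combined_table` is hypothetical; the table is held as a list of 32,768 entries indexed by the modified segment value divided by two (the specification addresses it by even byte address); negative logic is assumed, so a cleared bit means true; and each segment is formed with its first byte as the high-order byte.

```python
def build_combined_table(templates):
    """Return 32,768 16 bit entries; bit k is cleared (true) when a
    segment starting at byte k of some template maps to that entry."""
    table = [0xFFFF] * 0x8000                       # every bit false
    for t in templates:                             # step 210
        for k in range(min(len(t) - 1, 16)):        # segments at bytes 0..15
            seg = (t[k] << 8) | t[k + 1]            # steps 212 / 220
            index = seg & 0xFFFE                    # steps 214 / 222
            table[index >> 1] &= 0xFFFF ^ (1 << k)  # steps 218 / 226
    return table


# Hypothetical 17 byte template 01 02 03 ... 11 (hex).
table = build_combined_table([bytes(range(1, 18))])
print(hex(table[0x0102 >> 1]))   # entry shared by segments 0102h and 0103h
```

Because a segment starting at byte k sets bit k, segments on even byte boundaries occupy the even numbered bits and segments on odd byte boundaries occupy the odd numbered bits, as the description requires.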
Figure 7 shows an example of a 16 bit translation table 300 constructed in accordance with the flow chart procedure 200, utilising the template data string 00 ff fe ff ff 00.
In this instance the following results are yielded:
Word Segment | Original Segment Value | Modified Index Address | Corresponding Reference Numeral in Figure 7 |
---|---|---|---|
0 | 00ffh | 00feh | 302 |
1 | fffeh | fffeh | 304 |
2 | feffh | fefeh | 306 |
3 | ffffh | fffeh | 308 |
4 | ff00h | ff00h | 310 |
Flow chart procedure 250 shown in Figure 6 operates in a manner similar to flow chart procedure 2, only using a 16 bit combination unit to examine for template segment matches beginning on odd and even bytes concurrently.
Beginning at step 252, the 16 bit combination unit register is first initialised by clearing the register and setting the two least significant bits to a true value (step 254). A word of data from the data file is retrieved at step 256, and consequently the data file address is incremented by one word (step 258). The data word retrieved at step 256 is then converted into an even numbered index address by setting the least significant bit to a zero value (step 260), and the index address is used to index and retrieve a 16 bit translation table entry from the translation table array created in accordance with flow chart 200 (step 262). The translation table entry is then combined with the current value of the 16 bit combination unit, in the manner described hereinbefore, by using a logical AND or a logical OR operator depending upon whether negative or positive logic is utilised (step 262). The result of the combination of the translation table entry and the combination unit value is stored in the combination unit register. The combination unit register is thereafter shifted by one bit to the left (step 264), and if a true value is shifted out from the most significant bit of the combination unit (step 266) the procedure is directed to step 268, indicating an alert requiring a comparison of the file data with the templates contained in the template table. If no alert is generated the combination unit is shifted by one more bit towards the most significant bit, and if a true value is shifted out (step 272) an alert requiring comparison is again generated at step 274. Assuming no alerts are generated, or that the comparisons performed by virtue of step 268 or step 274 are negative, the file length counter is decremented by one word at step 276, and if the end of the file is detected (step 278) the procedure terminates at step 280.
If unexamined file data still exists the procedure returns to step 256 where the next word of file data is retrieved, steps 256 to 278 being repeated for each word of data in the data file.
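The scanning loop of flow chart 250 can be sketched as follows. This Python sketch is illustrative only. The names `_table` and `scan_concurrent` are hypothetical; negative logic is assumed (a cleared bit means true, combined with a logical OR); words are formed with the first byte as the high-order byte, following the Figure 7 example; and the comparison at steps 268 and 274 is reduced to an equality test against 17 byte templates.

```python
def _table(templates):
    """Combined table: bit k cleared for a segment at template byte k."""
    table = [0xFFFF] * 0x8000
    for t in templates:
        for k in range(16):
            seg = (t[k] << 8) | t[k + 1]
            table[seg >> 1] &= 0xFFFF ^ (1 << k)   # >> 1 drops the low bit
    return table


def scan_concurrent(data, templates):
    """Return offsets at which a 17 byte template occurs in data."""
    table = _table(templates)
    cu = 0xFFFC                                    # step 254: bits 0, 1 true
    hits = []
    for i in range(0, len(data) - 1, 2):           # one word per pass
        word = (data[i] << 8) | data[i + 1]        # step 256
        cu |= table[word >> 1]                     # steps 260 and 262
        # Bit 15 (odd-aligned match) leaves first, then the old bit 14
        # (even-aligned match); each maps to a candidate start offset.
        for start in (i - 15, i - 14):
            cu <<= 1                               # steps 264 / 270
            alert = (cu & 0x10000) == 0            # true flag shifted out
            cu &= 0xFFFF
            if alert and start >= 0 and bytes(data[start:start + 17]) in templates:
                hits.append(start)                 # steps 268 / 274 confirmed
    return hits
```

The `start >= 0` guard is needed because the initial true value in bit 1 can reach the most significant bit after eight words even though no template could begin before the start of the file.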
Shown below is an assembly language routine for an Intel 80X86 microprocessor which implements the procedure shown in the flow chart 250:

begin:  mov  ax,0fffch      ; Set bits zero and one to start
repeat: mov  bx,[si]        ; Load the next word
        add  si,2           ; Advance SI to the next word
        and  bx,0fffeh      ; Convert to even word address
        or   ax,es:[bx]     ; Combine in the current word
        shl  ax,1           ; Shift even result one bit left
        jnc  alert1         ; Alert if C not set (bit 7 was still true)
ok1:    shl  ax,1           ; Shift odd result one bit left
        jnc  alert2
ok2:    loop repeat
end:                        ; end of procedure
alert1:                     ; alert condition requiring comparison
alert2:                     ; of file data and templates

In some cases 16 byte templates may not be long enough to give a positive identification. In this case one of the above described methods may act as an initial filter, and the comparison procedure can check as many bytes as desired, or can invoke additional checking procedures for positive identification of, for example, a computer virus. Alternatively it may be possible to use shorter templates in the initial comparisons to reduce the length of the translation table.
Claims (3)
1. A method for locating a data pattern in a body of digital data, the data pattern consisting of a sequence of segments of at least one pattern template, comprising the steps of:
A. accessing a segment of the body of data and utilising the accessed segment to address one of a plurality of translation entries, each entry comprising a sequence of data sections which each indicate whether a segment of said body of data occurs as a segment of at least one pattern template in a position corresponding to the sequence position of said data section;
B. logically combining the content of the addressed translation entry with an initially predetermined combination value, shifting the result thereof and storing the shifted result as the combination value;
C. repeating steps A and B for sequential segments of the body of data, wherein a predetermined logic value resulting from the logical combination and shifting of step B is indicative of the existence of the data pattern within the body of data.
2. The method of claim 1, wherein said logical combination is effective to set a flag in the combination value each time the accessed segment matches a segment in a first position of any said at least one pattern template and wherein a set flag will remain set following succeeding shifting and logical combination steps if the succeeding accessed segments match segments in succeeding positions of any said at least one pattern template, and wherein a set flag which is shifted out of said combination value is indicative of the preceding sequence of accessed segments matching a sequence of segments from any of said at least one pattern templates constituting the data pattern.
3. The method of claim 1 in which a plurality of fixed length pattern templates are utilised to form a table of addressable entries comprising said addressable translation entries, and wherein the logical combination and shifting step is carried out in such a way that as each new segment of the body of the data is entered a new flag is entered into the combination value, indicating the possible occurrence of one of the pattern templates, and each flag will remain set as it is shifted through the combination value if successive segments each occur in successive positions of one or more of the pattern templates, and wherein steps A and B are repeated until a flag is still set when it is shifted out of the combination value, indicating that the preceding portion of data could contain one of the pattern templates, or until the end of the data is reached, and including the step of comparing any sequence of segments of the body of data identified in step C with the individual pattern templates to determine if it matches any given template.
DATED this 9th day of June 1993
CYBEC PTY. LTD.
By its Patent Attorneys
DAVIES COLLISON CAVE
ABSTRACT
A data searching method in which a plurality of fixed length template strings can be searched for simultaneously. The template strings are segmented and the segments used as addresses to form a translation table in which sections of the table entries are set in accordance with the positions of corresponding address segments in the template strings. Having formed the table, segments of the data to be searched are then sequentially accessed and used to address the table and retrieve the corresponding entry. For each segment of the data the retrieved entry is combined with a combination value, and the combined value shifted and stored as the new combination value.
A predetermined value being shifted out of the combination value indicates a sequence of segments in the data which matches a sequence of segments from the template strings. When this occurs the matching sequence of data is compared with the template strings to determine whether the sequence matches a particular template string.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AU28402/92A AU640335B3 (en) | 1991-11-15 | 1992-11-16 | Data searching method |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AUPK9530 | 1991-11-15 | ||
AUPK953091 | 1991-11-15 | ||
AU28402/92A AU640335B3 (en) | 1991-11-15 | 1992-11-16 | Data searching method |
Publications (2)
Publication Number | Publication Date |
---|---|
AU640335B1 (en) | 1993-08-19 |
AU640335B3 (en) | 1993-08-19 |
Family
ID=
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0896285A1 (en) * | 1997-07-10 | 1999-02-10 | International Business Machines Corporation | Efficient detection of computer viruses and other data trails |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0896285A1 (en) * | 1997-07-10 | 1999-02-10 | International Business Machines Corporation | Efficient detection of computer viruses and other data trails |
US6016546A (en) * | 1997-07-10 | 2000-01-18 | International Business Machines Corporation | Efficient detection of computer viruses and other data traits |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US6000008A (en) | Method and apparatus for matching data items of variable length in a content addressable memory | |
US5664184A (en) | Method and apparatus for implementing Q-trees | |
US5613145A (en) | Stored string data with element data units and pointer data units in distinct subranges of values | |
CA1287183C (en) | Method and apparatus for data hashing | |
US4677550A (en) | Method of compacting and searching a data index | |
Morrison | PATRICIA—practical algorithm to retrieve information coded in alphanumeric | |
US5692177A (en) | Method and system for data set storage by iteratively searching for perfect hashing functions | |
US4991087A (en) | Method of using signature subsets for indexing a textual database | |
US6353873B1 (en) | Apparatus and method to determine a longest prefix match in a content addressable memory | |
US4145738A (en) | Plural virtual address space processing system | |
US7062499B2 (en) | Enhanced multiway radix tree and related methods | |
US5421007A (en) | Key space analysis method for improved record sorting and file merging | |
EP3292481B1 (en) | Method, system and computer program product for performing numeric searches | |
US5241638A (en) | Dual cache memory | |
JPH08212136A (en) | Method and apparatus for efficient sharing of virtual memoryconversion processing | |
WO2004036589A1 (en) | Virtual content addressable memory with high speed key insertion and deletion and pipelined key search | |
JP3644494B2 (en) | Information retrieval device | |
US7003653B2 (en) | Method for rapid interpretation of results returned by a parallel compare instruction | |
US5519860A (en) | Central processor index sort followed by direct record sort and write by an intelligent control unit | |
AU640335B1 (en) | ||
US5261090A (en) | Search arrangement adapted for data range detection | |
US5898898A (en) | Collating bits from a byte source | |
EP1131713A2 (en) | Method of performing a sliding window search | |
EP0381245A2 (en) | Address translation system | |
WO2003003250A1 (en) | Range content-addressable memory |