CA2183024A1

CA2183024A1 - Method and apparatus for classifying raw data entries according to data patterns

Info

Publication number: CA2183024A1
Application number: CA 2183024
Authority: CA
Inventors: Wesley C. Sampson
Original assignee: Individual
Current assignee: Individual
Priority date: 1995-08-11
Filing date: 1996-08-09
Publication date: 1997-02-12

Abstract

A method and apparatus for indexing data entry records groups in a plurality of raw data entry records, each data entry records group having at least one of the data entry records with at least one key field containing an item, includes a master list of items in the raw data to be indexed. A data entry records group is read and a tally of a key number representing a total number of the key fields in the data entry records group is generated. In some cases, the key fields are sorted in a predetermined sequence, duplicate key fields are removed and the key number is adjusted for the key fields removed. An index record having a predetermined number of key fields equal to the key number and a first pointer related to instructions for processing any other data entry records group having the same combination of key fields is generated. Each item in the key fields of the data entry records group is mapped to generate a corresponding item indicator in the key fields of the index record. If the index record matches an existing index record, the instructions related to the first pointer are executed. If the index record is new, a second pointer is added related to a location of the data entry records group in the plurality of data entry records.

Description

~3~
-METHOD AND APPARATUS FOR CLASSIFYING
RAW DATA ENl~IES ACCORDlNG TO DATA PATTERNS
BA(~K(~R()UND OF T~IF INVFl~I~ON
The present invention relates geneMlly to a system for classifying large volumes of raw data and, in particular, to a method and appaMtus for creating anindex to optimally classify a pluMlity of data entries for efficient review, inspection, stoMge, sorting, e and reporting.
Many enterprises geneMte large volumes of data entries, each such entry having associated with it one or more items Ic~l ~, important i,.r(",.,-,;...~
Typically, larger enterprises are ~: 1 ' and the data entries are inputted into a computer and then transferred to a mass storage device such as a magnetic tapefor later use in various data analysis and report geneMting progMms. For example, 15 a receiving department or a stock room accepts a delivery of parts to be placed into inventory. At least the part number and quantity are items which must be inputted into the computer for later use in inventory"~-, r~..l.,.;~g and accounting programs.
The prior art is v, ' Indexes are described as primary vs.
20 secondary; as dense vs. sparse; as single-level vs. multi-level; and, as static vs.
dynamic. Regardless of the method of index, the target of the index reference isone specific item. That one specific item may be a specific record, the first occurrence of a specific value of a given field, the disk sector on which the data is found, or something else of a singular definition. The target is usually defined 25 on (directed to the index from) a single field of a file. "F~ of Database Systems" by Elasri and Navathe devotes chapter 5 to index structures for files.
The U.S. Patent No. 5,212,639 shows a method and electronic apparatus for the ~ ri. .1,~". of . ' ' data for the and/or tabulation thereof. An apparatus and a method for creating a database wherein tne data entry, 30 such as a journal entry in a l~, is the canonical record can be imr'~ --using a ~ o, . A plurality of data entries to be classified are separate records each comprised of one or more items having associated quantities and an ~ 8~4 .
entry number serving as a pointer to the record. Each item contains ;"r.."..";...~
including at least an item number or label and a quantity. A mapping function isapplied to each data entry to assign item indicators for the item numbers based upon the associated quantities. The item indicators for the data entry are sorted 5 into ascending numerical sequence and an n-dimension sparse matrix is selected where "n" is the number of items in the data entry. If the present ' of item indicators is new, a design record is created for the database based upon the sparse matrix and including the item indicators, the associated quantity sums, the total number of data entries ' in the design record and a pointer (a chain o of entry numbers) to the records of the data entry detail. The quantities for the presen~ data entry are added to the quantity sums and the entry number is storedin the pointer chain. After all the data entries have been processed, a search routine can be utilized to review the various design records as desired.
The U.S. Patent No. 5,390,113 shows a method and electronic apparatus for 15 pclfu~ , bol-kk~~ A method and apparatus for el~ ly p, r ~ ~
kk..~ f upon a plurality of accounting joumal entries having at least one account number and at least one data component associated with each account number. A chart of accounts having account numbers and opening balances associated with the plurality of joumal entries is read electronically. A set ofaccount-section numbers is then created for each account number. The joumal entries are el~L1, lly read and one of the account-section numbers is assigned to each account number. The assigned account-section numbers and associated data., are then sorted in a l)1cd~; ' order. A design for the order is identified and compared with stored design records to see if such a design already exists. If not, the new design is stored. If so, the associated data ~ are added to the - ' ~ total for each account-section number. A tally ICLn~ illg the number of account-section numbers is increased by one and an entry number is added to a list for the particular design record. The process is then repeated for each joumal entry.
The use of a database to summarize joumal entries in a way that can be used for audit planning and ?~ data for matrix-defined financial statements is 2~8~
described in an article entitled "An Innovation in Auditing: Matrix Summaries ofJournal Entries", ABACIJS, Vol. 2g, No. 2, 1992.
SUMMARY OF THE INVENTION
An index classifies material in such a way as to make it easier to find. The present invention relates to methods of creating an index for the storage and retrieval of electronic data. The present invention differs from the prior art indexing systems in that the source and the target is a group of items rather than o an individual item, and that the data pattern of the group defines an individual record of the index. Under these conditions the index records are of variable length (multiple fields) rather than a single field.
The present invention concerns an apparatus and a method for creating a database wherein the data entry is the canonical record. Raw data, typically 15 provided from a mass storage device, is formed of a plurality of data entry records.
Each record can include one or more key fields each containing and item of interest to be indexed. Typically, two or more of these data entry records are related asa group and these groups are to be classified or ' A master list of all of the items in the key fields to be found in the raw data is either included with the 20 raw data or created by the method and apparatus according to the present invention.
The present invention can be used to index new records and to . ,' existing files. Use to index new records includes (I) the ~l~ccifi~fion of data for schedules or quality reports and (2) ", ~ of a "class" of items in object-oriented l~lUL ~ ' ' usage would include use as primary or 25 secondary indexes to (1) a log file and (2) a relational database table. New record usage involves dynamic operation whereby the index becomes part of the process of selecting where to place newly created data.
Another feature of the index according to the present invention is of program branching. A "Pl" pointer located in each index record 30 allows the ~ , to specify program action for given patterns of data using additional i~ liùAs. This feature is expected to simplify some programs.

~183~
BRTF.F DFA~GRTPTION OF TT-TF DRAWINGS
The above, as well as other advantages of the present invention, will become readily apparent to those skilled in the art from the following detailed description s of a preferred ~,I..I)o~' ' when considered in the light of the - . Y;l 6 drawings in which:
Fig. 1 is a schematic perspective view of a three-~ Pnc;~-n~l matrix according to the present invention;
Fig. 2 is a schematic ~ lio-l of the data pattern structure schema according to the present invention;
Fig. 3 is a block diagram of the apparatus according to the present invention; and Fig. 4 is a flow diagram of the method according to the present invention.
DFA~(~RTPTION OF TTTF pRFFFRRFn FMPODTMF,~T
In general, raw data in the form of a plurality of data entry records is only useful if the records can be sorted based upon items of interest in the records. A
significant savings in data processing time can be achieved if the data entry records can be indexed or classified by "key" items and such index stored for use in subsequent searches, summaries and reports. A "group" can be defined as one or more data entry records having a common 1~ 'a" ' ,i. A group will have one or more key fields in which items of interest (key items) are stored. For example, a simple sales transaction can be recorded utilizing a first data entry record for the 2s customer ;,~ . r " 1~ a second data entry record for the product sold, and a third data entry record for the sales price. These three data entry records are related as a group. If the customer ' ~ is the only item of interest, this group has one key feld (in the first record). If the product sold also is of interest, this group has two key fields (in the first and second records).
A group having n-key fields can be indexed according to the present invention by utilizing an n-dimension matrix with up to Mn values stored where "M" is the number of items in a master list of key items for all data entry records - 21$3~
in the raw data. However, not all key item ~u ' - may occur in any given set of raw data. This allows the use of a series of sparse matrices, which require space to be allocated for only those key items which actually exist. For the purposes of i~ ctr~fi--n it is assumed that a plurality of groups of raw data entry s records, each group having three key items of interest, are to be classified.
Potentially then, a three ," ' matrix, such as the one shown in the Fig. 1, could be used to index such raw data.
There is shûwn in the Fig. 2 the data pattern structure schema of index records according to the present invention for each of one-dimension, two-o dimension, three-dimension and four-dimension ~ of key items to be indexed. Each size (number of key fields) group of data entry records requires aseparate matrix. The index records include a variable length record pûrtiûn, having a key field for each key item to be indexed in the associated group, and two pointer fields which are discussed below.
An example of a three-dimension matrix 21 which uniquely classifies grûups of data entry records each having a, ' of three key items from a master list of ten items is shown in the Fig. 1. The matrix 21 can hold up to one thousand values, or different ~ of the three items in the groups, which is the total number of cells in the matrix. The items used in the example of the Fig. 1 are 20 designated by item indicators as the first ten letters of the English alphabet. The first item of the group is classified on a "Y" axis 22, the second item is classified on an "X" axis 23, and the third item is classified on a "Z" axis 24 of the matrix 21. Thus, a group of data entry records having key items "ADE" results in an ;l~f~ at a cell 25 and a group of data entry records having key items "JJA"
25 results in an : at a cell 26.
A dictionary is a type of index wherein only those; b ~dliùl~S of letters of the alphabet which have meaning, words, are of interest. The dictionary indexes those letter ~nC in a p.cd~l ' sequence. Thus, a dictionary is not an efficient index if all words containing ~ el .' letter ~ "~ - such 30 as the letters "a" and "b" for example, are to be located. Clearly, an index can be ~o..~llu~t~ according to the present invention to efficiently r ' ' such search requests.
s ~1~3~
Assuming that a master list of key items exists, the matrix design process for each n-dimension ~ù ' involves the steps of: (a) reading a data entry records group from the source of raw data, (b) ~ the data pattern structure schema of the data entry records group in terms of the number of key s fields to be classified, (c) creating an index record for the key item combination, (d) mapping the key items to generate item indicators in key fields in the indexrecord, and (e) recording a pointer in the index record to enable the data entryrecords group to be located in the raw data. The result is a ~ ' length index record for each unique data pattern in the groups of data entry records. After the raw data has been processed in this manner, each data entry records group can be located. Thus, the cells in the matrix shown in the Fig. 1 represent the of item indicators for the index record key fields ~;UII~_r '- ~ to any group of data entry records containing three key items from the master list of "A' through "J".
When the sequential ' ' 1~ of items within the master list is fixed, the item indicators substitute for the subscripts of matrix notation. The sparse matrix IJIU3, ,, technique requires only those records which exist to be stored in computer memory. Vacant positions or cells in the matrix shown in the Fig. 1 require no computer memory.
The index constructed according to the present invention includes index records which identify the data patterns of the groups of key items drawn from amaster list. All that the process requires is that a master list be in computer memory at the time the index is created. The form of master list file, whether sequential, array, B+ tree, or some other Ul~ .~..;~l~iUII, is not important so long as 25 each master list item is uniquely identified via the mapping function in the index records. rulL~ llul~ the master list can be constructed dynamically. One needs primarily to avoid the problems associated with deleted master records in some file structures. However, the choice of mapping function and file, " ~n of the master list does affect the speed of operation of the process, the comparability of 30 index records over time, and whether the process can be traced in one or both directions. The index always identifies a group data pattern.

3~24 The group can be the sequential items of i as are found in a log of invGices; or, the items reported in a summary, such as treatments recorded inhospital cases. The method according to the present invention requires only thatall items of the group be available at the time of data pattern creation.
s Data pattern is the term applied to the key items and/or the associated item identifiers of all the data entry records of a group. For indexing purposes in business or social science ~I~J~ the order in and/or the number of times which the key items appear in the group of data entry records usually is not important. Thus, in the above described sales transaction example, if we assign the letters "A" and "B" to the key items of interest, it would not matter for the purposes of indexing whether the order in the group was "AB" or "BA" and it would not matter whether the group was "AAB" or ABB" since the same two key items a~e present in each of these . ' Thus, the key items can be sorted and duplicates eliminated before mapping so that synonyms are not separate, i.e., "AB" and "BA" will map to the same index record and be considered the same data pattern. An individual item indicator (the mappcd value of the key field) will be used only once in the c--~hinq~-n which becomes the index record even though therelated item appears more than once in the group of data entry records. In otherwords, duplicate O~,-,UII~ ,~ are combined. The ." of duplicate 0~ UI1~ reduces the size of the index and increases the speed with which the raw data can be processed. If an item being mapped does not match any item of the master list, an error message is generated before the data pattern is included in the index In business or social science, a data pattern is the verified, non-duplicate list of item indicators of the co ' being mapped, typically in ascending sequence. Other disciplines may require the index record to reflect the sequenceof the key fields as they appear in the group. Both types of index records may be required for the same raw data. For example, the ~ of a bolt, washer and nut can be key items of group of raw data entry records. For the purposes ofpurchasing and inventory, the order in which these items appear in the group is not important. For the purposes of assembly, the order "bolt", "washer" and "nut"
would be important.

~83~
In the Fig. 3, there is shown an apparatus for "~.r, " the method of creating the index records. A source of il~rullllatiu,. or raw data entry records 31, such as another computer program or data i from another computer or peripheral data source, has an output connected to an input of a computer 32. Tbe 5 source 31 may include a master list of items in key fields of the i " r,.. ., ~ . . being processed. The computer 32 reads the master list and stores the master list - r " in a master list memory devioe 33 connected to the computer. If the master list is not present in the source 31, the computer must create the master list.
Once the master list has been stored, the computer 32 does not have to read the master list each time additional raw data containing items from the master list is to be processed.
The computer 32 reads the items in the key fields of a group of data entry records from the data source 31. If the order in which the items appear in the key fields of the group is not significamt, these key items are sorted into ascending sequence and duplicates are removed. The computer 32 can identify the unique setof key items associated with a specific co".l. ~ If the set of key items is new,an index record containing the d~lU~)I' ' number of key fields is created. If the set already exists, the computer performs the U~,LiU.~ provided for by a first pointer in the l,UllUi~LJOl~ll;l.g index record. For example, the instruction could be 20 to store the current group in a file thereby - " all groups having the same data pattern in a common file. In either case, the index records are stored in an index record memory 34 connected to the computer 32. Each index record also includes a second pointer which is used to locate the associated groups for subsequent processing of the data entry records. An output device 35, such as a 25 printer or a CRT, also is connected to the computer 32 for generating output l~rulll~ iu.. related to the index.
Data entry records already placed at locations deflned in each index record may be inspected via a search routine. The se~rch may be according to all of thekey items of the l;UI '' " or any portion of the key items. Less than the 30 complete ~u ~ l requires additional queries to find the specific index record.
Once the proper index record is found, the operator may inspect data entry records at the data en~ record locati~ ~ec _rec~.

211 ~2~
The hardware associated with this method of indexing is modest. It may consist of portions of an existing computer program, as a separate computer system, or as a small computer installed in the circuitry of a larger computer.
There is shown in the Fig. 4 the f~ow of the method of .;o.~ u~L..~, an index s record. As each key field of the group of data entry records to be indexed according to the data pattern is read, a mapping function is applied to assign the item identifiers. An item identifier defines an item of the master list. If the present data pattern is new, a new record for that pattern is created in the index.
After each data pattern has been processed to the index, the process can be o interrupted. Thus, usage of the index may be either static or dynamic.
The process of creating and using the index records is:
A. Read the key fields of all data entry records groups of a given ~iu...l,:..aLon, i.e. a transaction (as in business) or a case (as in hospital records), and tally the total number of items in the ~ ' B. Determine whether key fields are to be sorted and duplicates removed. If not, proceed to step E.
C. Sort the key fields read in step A. into ascending sequence.
D. Eliminate amy key fields of duplicate items and reduce the tally by the number of duplicate values eliminated.
20 E. Create a variable lengtb portion of an index record where the length is the number of index record key fields equal to the tally and apply the mapping function to each item in the group key fields to store ~-~rri~crnn~' 1,, item indicators in the index record key fields.
F. Determine whether the item indicators generated in step E. require a new 25 index record. If not, go to step G. If so, complete the index record by hi.~ two pointers (P1) from the new index record to the computer location of additional illol~ liOI~o to be processed when that particular c~ iS
d and (P2) from the new index record to the computer location of the object of the index (location of new data storage, or location of a ~ ccifir~jon of 30 data).
G. Perform user-defined procedures to be used with the specific . ~illa~iO
as specified by the pointer Pl.

3~24 H. Wait for a signal to (1) apply the index to a new group, (2) begin a search using the index, or (3) cease data entry record indexing and go to another program function.
In the flow diagram of the method shown in the Fig. 4, the computer 32 of 5 the Fig. 3 is l.-u~ J to perform a sequence of i~DllU.,liu..~ beginning at a circle 61. The computer program then enters a decision point 62 to determine whether a master list of items to be indexed exists in the source 31. If the master list exists, the program branches at "YES" to an instruction set 63 wherein the master list is read and stored in the memory 33. If the master list does not exist, the program branches at "NO" to an instruction set 64 wherein the master list iscreated and stored in the memory 33. The master list can created by reading and storing the items in the key fields of all of the data entry records groups before continuing with the program. In the alternative, the instruction set 64 can direct the program to create the master list as the data entry records groups are being15 processed as described below.
The program then enters a wait loop 65 from both of the instruction sets 63 and 64. The programs idles in the wait loop 65 until receiving a signal that theraw data is ready to be processed. When an instruction set 66 receives the signal that a group of data entry records from the source of raw data 31 is ready, the 20 computer program enters an instruction set 67 which causes the key fields of the group to be read into computer memory and a tally of the number of key fields tobe created. If duplicate key fields are to be eliminated, a sort switch is set. If the master list is being created, the items in the key fields are stored in the master list memory 33. The program then enters a decision point 68. If the sort switch is set, 25 the program branches at "YES" to an instruction set 69 wherein the key fields are sorted into ascending sequence, duplicate key fields are eliminated, and the tally of key fields eliminated is subtracted from the tally created in instruction set 67.
The program then enters an instruction set 70 to create the variable length portion of the index record (see key fields in the Fig. 2) where the length is the 30 tally of the number of key fields in the specific COIl~ la.~iull. A mapping function is applied to each key field of the group to generate the item indicators in the key fields of the index record. The instruction set 70 also is entered from the "NO"

3~24 branch of the decision point 68 when the sort switch is not set. If an error occurs in the mapping function, a "YES" branch is made from a decision point 71 and an indicator is set in an instruction set 72 for possible ~, The program then returns to the wait loop 65. If no error occurs in the mapping function, the s prograrn branches at "NO" from the decision point 71 to an instruction set 73 wherein the computer 32 searches the existing index records in the memory 34 to determine whether the exact I ' of key items being processed already exists. If the variable length portion of the index record being CU..Di1U~,lCd is new, the prcgram branches from a decision point 74 at "YES" and a new index record is created in an instruction set 75 by (a) extending pointer chains from each master list item of the new record to include the new record, (b) ~b'- ' of a pointer "P2" from the new index record to the group or file which is the object of that record, and (c) the inclusion of a pointer "Pl" to any special i.,,~u.,liu." to be associated with that index record. The program enters an instruction set 76 to 15 execute ill,llu~ilio.., referred to in the "Pl" field of the index rccord and then continues to the wait loop 65. If the pattial index record under ~:oll~LlUU~iUII iS not new, the program branches from the decision point 74 at "NO" to the instruction set 76 and continues to the wait loop 65.
At any time the program is in the wait loop 65 a branch can be made to an 20 instruction set 77 to search the existing index records. After the search is completed, the program returns to the wait loop 65. The program also has an instruction set 78 which can respond to an indication that no more groups of data entry records are to be indexed. The program then exits through a "NO" branch of a decision point 79 to completion at a circle 80 or through a "YES" branch to25 an instruction set 81 if another computer program is to be executed.
The master list controls how the groups of data entry records are indexed.
More than one master list can be utilized with the same raw data entry records to permit specialized searching and processing. The index records created from the groups of data entry records reduce the time required for searching and organizing 30 the raw data.
In ac~u.~' with the provisions of the patent statutes, the present invention has been described in what is considered to represent its preferred ~illlbu~' 2~
However, it should be noted that the invention can be practiced otherwise than as specifically illustrated and described without departing from its spirit or scope.

Claims

1. A method of classifying a plurality of data entry records comprising:
a. reading a data entry records group from a plurality of data entry records, said data entry records group having at least one of said data entry records with at least one key field containing an item;
b. tallying a key number representing a total number of said key fields in said data entry records group;
c. creating an index record having a predetermined number of key fields equal to said key number and a first pointer related to instructions for processing any other data entry records group having a combination of key fields corresponding to said key fields of said data entry records group;
d. mapping each item in said key fields of said data entry records group to generate a corresponding item indicator in said key fields of said index record; and e. adding a second pointer to said index record, said second pointer being related to a location of said data entry records group in the plurality of data entry records.

2. The method according to claim 1 including creating a plurality of different index records by performing the steps a. through e. for each combination of key fields in data entry records groups in the plurality of data entry records.

3. The method according to claim 2 including storing said index records in an index records memory.

4. The method according to claim 1 including comparing said index record with a plurality of stored index records prior to performing the step e.,performing the step e. if said index record does not match one of said stored index records, and, if said index record matches one of said stored index records, eliminating the step e. and executing said instructions related to said first pointer in said one stored index record.

5. The method according to claim 1 including sorting said key fields in a predetermined sequence, removing duplicate ones of said key fields and adjusting said key number for said duplicate key fields removed before performing the stepc.

6. The method according to claim 1 including reading a master list of items to be indexed and performing the step b. by identifying said key fields asthose fields in said data entry records group containing items in said master list.

7. The method according to claim 1 including creating a master list of items to be indexed and performing the step b. by identifying said key fields asthose fields in said data entry records group containing items in said master list.

8. A method of classifying a plurality of data entry records comprising:
a. reading a data entry records group from a plurality of data entry records, said data entry records group having at least one of said data entry records with at least one key field containing an item;
b. tallying a key number representing a total number of said key fields in said data entry records group;
c. sorting said key fields in a predetermined sequence, removing duplicate ones of said key fields and adjusting said key number for said duplicate key fields removed;
d. creating an index record having a predetermined number of key fields equal to said key number and a first pointer related to instructions for processing any other data entry records group having a combination of key fields corresponding to said key fields of said data entry records group;
e. mapping each item in said key fields of said data entry records group to generate a corresponding item indicator in said key fields of said index record; and f. adding a second pointer to said index record, said second pointer being related to a location of said data entry records group in the plurality of data entry records.

9. The method according to claim 8 including creating a plurality of different index records by performing the steps a. through f. for each combination of key fields in data entry records groups in the plurality of data entry records.

10. The method according to claim 9 including storing said index records in an index records memory.

11. The method according to claim 8 including comparing said index record with a plurality of stored index records prior to performing the step f.,performing the step f. if said index record does not match one of said stored index records, and, if said index record matches one of said stored index records, eliminating the step f. and executing said instructions related to said first pointer in said one stored index record.

12. The method according to claim 8 including reading a master list of items to be indexed and performing the step b. by identifying said key fields asthose fields in said data entry records group containing items in said master list.

13. The method according to claim 8 including creating a master list of items to be indexed and performing the step b. by identifying said key fields asthose fields in said data entry records group containing items in said master list.

14. An apparatus for classifying data entry records groups in a plurality of data entry records, comprising:
a computer means having a data input adapted to be connected to a source of data entry records groups in a plurality of raw data entry records, each data entry records group including at least one of the data entry records having at least one key field containing an item;

a master list memory means connected to said computer means for storing a master list of the items to be indexed; and an index records memory connected to said computer means for storing a plurality of index records, each said stored index record representing a unique combination of the items in said master list whereby said computer means reads a data entry records group from the source, generates an index record for said data entry records group, compares said index record with said stored index records, adds a pointer to and stores said index record in said index records memory if no match is found, and executes instructions related to a corresponding one of said stored index records if a match is found.

15. The apparatus according to claim 14 including an output device connected to said computer means for displaying information based upon said dataentry records groups.

16. The apparatus according to claim 14 wherein said computer means generates said master list based upon the raw data entry records groups read from the source.

17. The apparatus according to claim 14 wherein said computer means tallies a key number representing a total number of said key fields in said dataentry records group, sorts said key fields in said data entry records group in apredetermined sequence, removes duplicate ones of said key fields, adjusts said key number for said duplicate key fields removed and creates said index record with a predetermined number of key fields equal to said key number.

18. The apparatus according to claim 14 wherein said computer means generates said index record with a key field for each of the items in the key fields of said data entry records group and maps the items to generate item indicators in corresponding ones of said key fields of said index record.