WO2007148033A1

WO2007148033A1 - Flat file searching

Info

Publication number: WO2007148033A1
Application number: PCT/GB2006/002343
Authority: WO
Inventors: Duncan Gunther Pauly
Original assignee: Coppereye Limited
Priority date: 2006-06-23
Filing date: 2006-06-23
Publication date: 2007-12-27

Abstract

A method of searching a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value, the method comprising: indexing the records to generate a flat file index; receiving a query; referencing the flat file index to identify one or more of the records which satisfies the query; and retrieving the identified record(s) from the flat file database.

Description

FLAT FILE SEARCHING

FIELD OF THE INVENTION

The present invention relates to a method and apparatus for searching a flat file database.

BACKGROUND OF THE INVENTION

Data is often recorded in a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value. Examples include telecommunication network usage event files, web server logs and e-commerce transaction logs. A flat file database can be contrasted with a database such as a relational database in which a structural relationship exists between the records.

The data stored in the flat file database is typically transaction data created by automated systems and self-service environments, which typically generate data in large daily volumes. Such data is typically immutable and does not require the extensive management framework implemented by a relational database. However, conventionally the data is migrated into a relational database to gain SQL query access to it.

Migrating the data involves converting and moving it from the flat file database to the relational database and indexing it.

Until the migration completes, the data is effectively unavailable for query.

Relational databases use conventional indexing such as B-trees to index the migrated data. Such indexing requires extensive key sorting and/or disk activity and this lengthens the delay until the data is available for query.

Some commercial databases offer SQL query access to unstructured storage, but the data remains un-indexed, forcing every SQL query to scan the entire unstructured data set. This makes such SQL query access to large volumes of unstructured data infeasible for responsive selective access.

SUMMARY OF THE INVENTION

A first aspect of the invention provides a method of searching a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value, the method comprising:

indexing the records to generate a flat file index;

receiving a query;

referencing the flat file index to identify one or more of the records which satisfies the query; and

retrieving the identified record(s) from the flat file database.

The invention provides direct access to the data in the flat file database whilst avoiding the delays associated with migration.

The query may be in any desired format including (but not limited to):

• SQL (relational)

• OQL (object)

• XQL/XQuery (XML)

• SPARQL (RDF - semantic web)

Typically the method further comprises autonomously discovering the flat file(s) in the database.

Typically the query is delegated from a relational database query service. This enables the relational database query service to provide access to the flat file database which is transparent to a user, and present results to the query in the required format.

The method of the first aspect of the invention is implemented on hardware loaded with appropriate computer software. A second aspect of the invention provides apparatus comprising:

a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value;

an indexing service configured to index the records to generate an index; and

a flat file query service configured to receive a query directed to the flat file database; reference the index to identify one or more of the records which satisfies the query; and retrieve the identified record(s) from the flat file database.

BRIEF DESCRIPTION OF THE DRAWINGS

Embodiments of the invention will now be described with reference to the accompanying drawings, in which:

Figure 1 is a schematic view of the architecture of a system for searching a flat file database;

Figure 2 shows a method of discovering, parsing and indexing the flat file database; and

Figure 3 shows a method of servicing queries, retrieving results and presenting the results.

DETAILED DESCRIPTION OF EMBODIMENT(S)

Figure 1 shows a flat file database 1 comprising one or more flat files. Figure 2 gives an example of two flat files in the database 1, namely:

• /subs/logs/381.log

• /subs/logs/382.log

Each flat file comprises a sequence of records with no structural relationship between the records. For example, three records in the flat file /subs/logs/381.log are shown in Figure 2. Each record comprises one or more adjacent fields. For example, the first record shown in /subs/logs/381.log comprises five fields:

• a video field (/video/films/9765.mpeg)

• a subscriber field (016791801)

• a datetime field (210105:221007)

• a duration field (012705)

• an ip-address field (165.58.192.11)

Each field of each record contains a scalar value: that is, a single quantitative or identification value.

An indexing service 2 discovers, parses and indexes new flat files as they are added to the database 1, by a process illustrated in Figure 2. The process is managed and controlled by discovery, parse, and index configurations shown in Figure 2.

The discovery configuration defines the path and file naming conventions for the files to be discovered in the database - in this example subs/logs/*.log.
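As an illustration only, the discovery step can be sketched in Python as a pattern match over a directory listing. The function name `discover`, the `already_indexed` set and the leading slash on the pattern are assumptions made for this sketch; the source gives only the pattern and the two example file names.

```python
import fnmatch

# Assumed form of the discovery configuration: the pattern from the text,
# with a leading slash added so that it matches the example file paths.
DISCOVERY_PATTERN = "/subs/logs/*.log"


def discover(paths, already_indexed, pattern=DISCOVERY_PATTERN):
    """Return the paths that match the pattern and are not yet indexed.

    Note that fnmatch's "*" also matches "/" characters, so a stricter
    implementation might compare directory and file name separately.
    """
    return sorted(
        p for p in paths
        if fnmatch.fnmatchcase(p, pattern) and p not in already_indexed
    )


if __name__ == "__main__":
    listing = [
        "/subs/logs/381.log",
        "/subs/logs/382.log",
        "/subs/logs/notes.txt",  # hypothetical non-matching file
    ]
    print(discover(listing, already_indexed={"/subs/logs/381.log"}))
```

Tracking the files already indexed is one simple way to pick up only new files as they are added; the source does not say how the indexing service detects new files.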

The parse configuration defines the expected record and field formats for the files to be parsed - in this example video, subscriber, datetime[ddmmyy:hhnnss], duration, ip-address.
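A minimal Python sketch of parsing one record against such a configuration is given below. The comma delimiter is an assumption (the source does not state how adjacent fields are separated), as is the reading of ddmmyy:hhnnss as day, month, two-digit year, hour, minute, second. The duration field is left as a string because its format is not specified.

```python
from datetime import datetime

# Field order taken from the parse configuration in the text.
FIELD_NAMES = ["video", "subscriber", "datetime", "duration", "ip-address"]

DELIMITER = ","                    # assumption: not stated in the source
DATETIME_FORMAT = "%d%m%y:%H%M%S"  # assumed reading of ddmmyy:hhnnss


def parse_record(line):
    """Split one record into named fields and decode the datetime field."""
    values = line.rstrip("\r\n").split(DELIMITER)
    if len(values) != len(FIELD_NAMES):
        raise ValueError(
            f"expected {len(FIELD_NAMES)} fields, got {len(values)}"
        )
    record = dict(zip(FIELD_NAMES, values))
    record["datetime"] = datetime.strptime(record["datetime"], DATETIME_FORMAT)
    return record


if __name__ == "__main__":
    # Field values of the first record described in the text.
    line = "/video/films/9765.mpeg,016791801,210105:221007,012705,165.58.192.11\n"
    print(parse_record(line))
```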

The index configuration defines the fields to be indexed to support the queries expected - in this example the video field and the subscriber field.

The indexing service 2 autonomously discovers the flat file(s) in the database according to the discovery configuration. Each discovered file is then scanned and parsed according to the parse configuration. The fields are then indexed according to the index configuration to generate index files which are stored in a flat file index 3 shown in Figure 1.

Figure 2 illustrates two index files: a video index file and a subscriber index file. Each index file comprises a set of index records, each index record comprising a key value (for instance /video/films/0671.mpeg); a file pointer identifying one of the files in the database 1 (for instance 382); and a record offset pointer identifying the location of the record within the file (for instance 67).

The indexing methods described in WO0244940 (US2004015478) and/or in WO02069185 (US2004073559) may be used to offer fast and immediate access with minimal latency between data creation and query availability. The disclosures of WO0244940 (US2004015478) and WO02069185 (US2004073559) are incorporated herein by reference in their entirety.
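The shape of an index record (key value, file pointer, record offset pointer) can be illustrated with the following Python sketch. It uses a plain in-memory dictionary and is not the indexing method of WO0244940 or WO02069185. The comma delimiter, the use of a byte offset as the record offset pointer, and the names `new_index` and `index_stream` are all assumptions made for illustration.

```python
import io
from collections import defaultdict

# Fields to index (from the index configuration in the text), mapped to
# their position within a record.
INDEXED_FIELDS = {"video": 0, "subscriber": 1}
DELIMITER = b","  # assumption: the source does not state the delimiter


def new_index():
    """One index per configured field: key value -> [(file, offset), ...]."""
    return {name: defaultdict(list) for name in INDEXED_FIELDS}


def index_stream(file_pointer, stream, index):
    """Scan a binary stream record by record and add index records."""
    while True:
        offset = stream.tell()  # assumed: offset = byte position of record
        line = stream.readline()
        if not line:
            break
        values = line.rstrip(b"\r\n").split(DELIMITER)
        for name, position in INDEXED_FIELDS.items():
            key = values[position].decode("ascii")
            index[name][key].append((file_pointer, offset))
    return index


if __name__ == "__main__":
    # First record uses the field values from the text; the second record
    # is hypothetical sample data.
    sample = io.BytesIO(
        b"/video/films/9765.mpeg,016791801,210105:221007,012705,165.58.192.11\n"
        b"/video/films/0671.mpeg,016791802,210105:230000,010000,165.58.192.12\n"
    )
    index = index_stream(381, sample, new_index())
    print(index["video"]["/video/films/0671.mpeg"])
```

Storing only a file pointer and an offset keeps the records themselves in the flat files, so nothing has to be converted or moved before it can be queried.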

A relational database query service 4 services queries issued against a relational database 5 using a B-tree index 6. That is, the query service 4 receives relational database queries; references the B-tree index 6 to identify one or more records which satisfies the relational database query; and retrieves the identified record(s) from the relational database 5.

Instead of migrating the data in the flat file database 1 into the relational database 5, the system of Figure 1 maintains the data in the flat file database 1, and any queries issued against the flat file database 1 are delegated by the query service 4 to a flat file query service 7. The query service 7 services each delegated query by the method shown in Figure 3.
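The delegation described above can be pictured with a short Python sketch. The class names, the `execute` signature and the `customers` table are hypothetical; the source states only that queries directed to the flat file database are passed on by the relational query service to the flat file query service.

```python
class FlatFileQueryService:
    """Stand-in for query service 7; knows which views it can serve."""

    def __init__(self, views):
        self.views = set(views)

    def execute(self, table, where):
        # A real service would consult the flat file index here.
        return ("flat-file", table, where)


class RelationalQueryService:
    """Stand-in for query service 4; delegates flat-file views."""

    def __init__(self, flat_file_service):
        self.flat_file_service = flat_file_service

    def execute(self, table, where):
        if table in self.flat_file_service.views:
            # Query targets the flat file database: delegate it.
            return self.flat_file_service.execute(table, where)
        # Otherwise service it against the relational database itself.
        return ("relational", table, where)


if __name__ == "__main__":
    service = RelationalQueryService(FlatFileQueryService(["videolog"]))
    print(service.execute("videolog", {"video": "/video/films/0671.mpeg"}))
    print(service.execute("customers", {"id": "42"}))  # hypothetical table
```

Because the caller only ever talks to the relational query service, the routing stays invisible to the user.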

The query process is managed and controlled by the parse and index configurations previously described with reference to Figure 2, and by a view configuration shown in Figure 3. The view configuration defines the relational views to be offered to queries, and the mapping from fields and records in the database 1 to the columns and tables in the relational views.

Thus in the example of Figure 3, a query is phrased against the view configuration. In this case the query is as follows:

select subscriber,datetime from videolog where video = "/video/films/0671.mpeg"

The query service 7 then identifies the index file to be used from the "where" clause of the query (in this case, the "where" clause identifies the video index). The query service 7 then references the identified index file to identify one or more of the records in the database 1 which satisfies the query. In this case the index file identifies the record associated with file pointer 382 and record offset pointer 67. The query service 7 then retrieves the identified record(s) from the database 1, maps the unstructured file formats to a structured relational view in accordance with the view configuration, and presents the query result as rows from a relational table, in this example the subscriber and datetime values of the identified record.
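The servicing steps just described (choose the index named by the "where" clause, look up the key, seek to each record, project the selected columns) might be sketched in Python as follows. This is an illustration under stated assumptions rather than the patented implementation: the query is passed in already parsed, the view configuration layout, the comma delimiter, the byte-offset interpretation of the record offset pointer, and the record contents are all hypothetical.

```python
import io

# Hypothetical view configuration: view name -> column -> field position.
VIEWS = {
    "videolog": {
        "video": 0, "subscriber": 1, "datetime": 2,
        "duration": 3, "ip_address": 4,
    }
}
DELIMITER = b","  # assumption: the source does not state the delimiter


def service_query(view, select, where_column, where_value, indexes, files):
    """Return one tuple per matching record, holding the selected columns."""
    columns = VIEWS[view]
    index = indexes[where_column]  # index chosen from the where clause
    rows = []
    for file_pointer, offset in index.get(where_value, []):
        stream = files[file_pointer]
        stream.seek(offset)        # jump straight to the record
        values = stream.readline().rstrip(b"\r\n").split(DELIMITER)
        rows.append(
            tuple(values[columns[c]].decode("ascii") for c in select)
        )
    return rows


if __name__ == "__main__":
    # Hypothetical file contents; only the key value comes from the text.
    filler = b"/video/films/9765.mpeg,016791801,210105:221007,012705,165.58.192.11\n"
    match = b"/video/films/0671.mpeg,016791802,210105:230000,010000,165.58.192.12\n"
    files = {382: io.BytesIO(filler + match)}
    indexes = {"video": {"/video/films/0671.mpeg": [(382, len(filler))]}}
    print(service_query(
        "videolog", ["subscriber", "datetime"],
        "video", "/video/films/0671.mpeg", indexes, files,
    ))
```

Only the records named by the index are read, so the cost of a selective query depends on the number of matches and not on the size of the flat files.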

Although the invention has been described above with reference to one or more preferred embodiments, it will be appreciated that various changes or modifications may be made without departing from the scope of the invention as defined in the appended claims.

Claims

1. A method of searching a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value, the method comprising:

indexing the records to generate a flat file index;

receiving a query;

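referencing the flat file index to identify one or more of the records which satisfies the query; and

retrieving the identified record(s) from the flat file database.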

2. The method of claim 1 wherein the query is an SQL query.

3. The method of any preceding claim further comprising autonomously discovering the flat file(s) in the database.

4. The method of any preceding claim wherein the query is delegated from a relational database query service.

5. The method of any preceding claim further comprising:

indexing a relational database to generate a relational database index;

receiving a relational database query;

referencing the relational database index to identify one or more records which satisfies the relational database query; and

retrieving the identified record(s) from the relational database.

6. Apparatus configured to perform the method of any preceding claim.

7. Computer software which, when loaded on suitable hardware, causes the hardware to perform the method of any of claims 1 to 5.

8. Apparatus comprising:

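a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value;

an indexing service configured to index the records to generate an index; and

a flat file query service configured to receive a query directed to the flat file database; reference the index to identify one or more of the records which satisfies the query; and retrieve the identified record(s) from the flat file database.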

9. The apparatus of claim 8 further comprising:

a relational database; and

a relational database query service configured to service queries directed to the relational database and delegate queries directed to the flat file database to the flat file query service.