WO2007148033A1 - Flat file searching - Google Patents

Publication number
WO2007148033A1
Authority
WO
WIPO (PCT)
Prior art keywords
flat file
query
database
records
index
Application number
PCT/GB2006/002343
Other languages
French (fr)
Inventor
Duncan Gunther Pauly
Original Assignee
Coppereye Limited
Priority date
2006-06-23
Filing date
2006-06-23
Publication date
2007-12-27
Application filed by Coppereye Limited
Priority to PCT/GB2006/002343
Publication of WO2007148033A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/328 Management therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method of searching a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value, the method comprising: indexing the records to generate a flat file index; receiving a query; referencing the flat file index to identify one or more of the records which satisfies the query; and retrieving the identified record(s) from the flat file database.

Description

FLAT FILE SEARCHING
FIELD OF THE INVENTION
The present invention relates to a method and apparatus for searching a flat file database.
BACKGROUND OF THE INVENTION
Data is often recorded in a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value. Examples include telecommunication network usage event files, web server logs and e-commerce transaction logs. A flat file database can be contrasted with a database such as a relational database in which a structural relationship exists between the records.
The data stored in the flat file database is typically transaction data created by automated systems and self-service environments, which typically generate data in large daily volumes. Such data is typically immutable and does not require the extensive management framework implemented by a relational database. However, conventionally the data is migrated into a relational database to gain SQL query access to it.
Migrating the data involves converting and moving it from the flat file database to the relational database and indexing it.
Until the migration completes, the data is effectively unavailable for query.
Relational databases use conventional indexing such as B-trees to index the migrated data. Such indexing requires extensive key sorting and/or disk activity and this lengthens the delay until the data is available for query.
Some commercial databases offer SQL query access to unstructured storage, but the data remains un-indexed, forcing every SQL query to scan the entire unstructured data set. This makes such SQL query access to large volumes of unstructured data infeasible for responsive selective access.
SUMMARY OF THE INVENTION
A first aspect of the invention provides a method of searching a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value, the method comprising:
indexing the records to generate a flat file index;
receiving a query;
referencing the flat file index to identify one or more of the records which satisfies the query; and
retrieving the identified record(s) from the flat file database.
The invention provides direct access to the data in the flat file database whilst avoiding the delays associated with migration.
The query may be in any desired format including (but not limited to):
• SQL (relational)
• OQL (object)
• XQL/XQuery (XML)
• SPARQL (RDF - semantic web)
Typically the method further comprises autonomously discovering the flat file(s) in the database.
Typically the query is delegated from a relational database query service. This enables the relational database query service to provide access to the flat file database which is transparent to a user, and present results to the query in the required format.
The method of the first aspect of the invention is implemented on hardware loaded with appropriate computer software.
A second aspect of the invention provides apparatus comprising:
a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value;
an indexing service configured to index the records to generate an index; and
a flat file query service configured to receive a query directed to the flat file database; reference the index to identify one or more of the records which satisfies the query; and retrieve the identified record(s) from the flat file database.
BRIEF DESCRIPTION OF THE DRAWINGS
Embodiments of the invention will now be described with reference to the accompanying drawings, in which:
Figure 1 is a schematic view of the architecture of a system for searching a flat file database;
Figure 2 shows a method of discovering, parsing and indexing the flat file database; and
Figure 3 shows a method of servicing queries, retrieving results and presenting the results.
DETAILED DESCRIPTION OF EMBODIMENT(S)
Figure 1 shows a flat file database 1 comprising one or more flat files. Figure 2 gives an example of two flat files in the database 1, namely:
• /subs/logs/381.log
• /subs/logs/382.log
Each flat file comprises a sequence of records with no structural relationship between the records. For example, three records in the flat file /subs/logs/381.log are shown in Figure 2. Each record comprises one or more adjacent fields. For example, the first record shown in /subs/logs/381.log comprises five fields:
• a video field (/video/films/9765.mpeg)
• a subscriber field (016791801)
• a datetime field (210105:221007)
• a duration field (012705)
• an ip-address field (165.58.192.11)
Each field of each record contains a scalar value: that is, a single quantitative or identification value.
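As an illustration of the record layout described above, the following minimal sketch splits one record into its five named scalar fields. The comma delimiter, the function name parse_record and the choice to leave the duration as a plain string are assumptions made for the example; the source does not state how adjacent fields are separated. The datetime layout is read as ddmmyy:hhnnss, matching the parse configuration given for this example.

```python
from datetime import datetime

# Field names taken from the example record; the order is as listed in the text.
FIELD_NAMES = ["video", "subscriber", "datetime", "duration", "ip-address"]


def parse_record(line, delimiter=","):
    """Split one flat file record into named scalar fields.

    The delimiter is an assumption: the source only says the fields are adjacent.
    """
    values = [value.strip() for value in line.rstrip("\r\n").split(delimiter)]
    if len(values) != len(FIELD_NAMES):
        raise ValueError(f"expected {len(FIELD_NAMES)} fields, got {len(values)}")
    record = dict(zip(FIELD_NAMES, values))
    # ddmmyy:hhnnss -> day, month, two-digit year : hour, minute, second
    record["datetime"] = datetime.strptime(record["datetime"], "%d%m%y:%H%M%S")
    return record


record = parse_record(
    "/video/films/9765.mpeg,016791801,210105:221007,012705,165.58.192.11"
)
```

Identifier-like values such as the subscriber number are kept as strings so that leading zeros survive parsing.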
An indexing service 2 discovers, parses and indexes new flat files as they are added to the database 1, by a process illustrated in Figure 2. The process is managed and controlled by discovery, parse, and index configurations shown in Figure 2.
The discovery configuration defines the path and file naming conventions for the files to be discovered in the database - in this example subs/logs/*.log.
The parse configuration defines the expected record and field formats for the files to be parsed - in this example video, subscriber, datetime[ddmmyy:hhnnss], duration, ip-address.
The index configuration defines the fields to be indexed to support the queries expected - in this example the video field and the subscriber field.
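The three configurations could be represented as simple data structures, as in the sketch below. The dictionary keys, the discover function and the idea of tracking already indexed files in a set are hypothetical and are not taken from the source; only the pattern, field names and indexed fields come from the example. Pattern matching here uses Python's fnmatch, whose "*" also matches path separators, so a stricter matcher may be preferable in practice.

```python
from fnmatch import fnmatch

# Values from the example in the text; the dictionary layout is an assumption.
DISCOVERY_CONFIG = {"pattern": "subs/logs/*.log"}
PARSE_CONFIG = {
    "fields": ["video", "subscriber", "datetime", "duration", "ip-address"],
    "formats": {"datetime": "ddmmyy:hhnnss"},
}
INDEX_CONFIG = {"indexed_fields": ["video", "subscriber"]}


def discover(candidate_paths, already_indexed, config=DISCOVERY_CONFIG):
    """Return paths that match the discovery pattern and are not yet indexed."""
    pattern = config["pattern"]
    return [
        path
        for path in candidate_paths
        # The example pattern has no leading slash, so strip it before matching.
        if fnmatch(path.lstrip("/"), pattern) and path not in already_indexed
    ]


new_files = discover(
    ["/subs/logs/381.log", "/subs/logs/382.log", "/subs/logs/notes.txt"],
    already_indexed={"/subs/logs/381.log"},
)
```

In this sketch only /subs/logs/382.log is reported as new, because 381.log is already in the indexed set and notes.txt (a made-up file name) does not match the pattern.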
The indexing service 2 autonomously discovers the flat file(s) in the database according to the discovery configuration. Each discovered file is then scanned and parsed according to the parse configuration. The fields are then indexed according to the indexing engine to generate index files which are stored in a flat file index 3 shown in Figure 1. Figure 2 illustrates two index files: a video index file and a subscriber index file. Each index file comprises a set of index records, each index record comprising a key value (for instance /video/films/0671.mpeg); a file pointer identifying one of the files in the database 1 (for instance 382); and a record offset pointer identifying the location of the record within the file (for instance 67). The indexing methods described in WO0244940 (US2004015478) and/or in WO02069185 (US2004073559) may be used to offer fast and immediate access with minimal latency between data creation and query availability. The disclosures of WO0244940 (US2004015478) and WO02069185 (US2004073559) are incorporated herein by reference in their entirety.
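The shape of an index record (key value, file pointer, record offset pointer) can be sketched as follows. This is not the indexing method of WO0244940 or WO02069185: a plain in-memory dictionary stands in for the index files. The comma delimiter, the interpretation of the record offset as a byte offset, the function name index_file and the second sample record are all assumptions made for illustration.

```python
import io
from collections import defaultdict

FIELD_NAMES = ["video", "subscriber", "datetime", "duration", "ip-address"]
INDEXED_FIELDS = ["video", "subscriber"]  # from the index configuration example


def index_file(file_id, stream, indexes, delimiter=b","):
    """Add (file pointer, byte offset) entries for each indexed field of each record."""
    offset = 0
    for line in stream:
        values = line.rstrip(b"\r\n").split(delimiter)
        record = dict(zip(FIELD_NAMES, values))
        for field in INDEXED_FIELDS:
            key = record[field].decode("ascii")
            indexes[field][key].append((file_id, offset))
        offset += len(line)  # the next record starts where this one ended


# One dictionary per index file: key value -> list of (file pointer, record offset).
indexes = {field: defaultdict(list) for field in INDEXED_FIELDS}

# The first record is from the text; the second is invented sample data.
sample = io.BytesIO(
    b"/video/films/9765.mpeg,016791801,210105:221007,012705,165.58.192.11\n"
    b"/video/films/0671.mpeg,016791802,210105:223000,010000,165.58.192.12\n"
)
index_file(382, sample, indexes)
```

Because the offset is recorded in bytes, a later lookup can seek straight to the record without scanning the file.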
A relational database query service 4 services queries issued against a relational database 5 using a B-tree index 6. That is, the query service 4 receives relational database queries; references the B-tree index 6 to identify one or more records which satisfies the relational database query; and retrieves the identified record(s) from the relational database 5.
Instead of migrating the data in the flat file database 1 into the relational database 5, the system of Figure 1 maintains the data in the flat file database 1, and any queries issued against the flat file database 1 are delegated by the query service 4 to a flat file query service 7. The query service 7 services each delegated query by the method shown in Figure 3.
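The delegation step might look like the sketch below, in which the relational query service inspects the table named in a query and hands it to the flat file query service when that table is a flat file view. The regular expression, the function names and the customers table are hypothetical; the source does not describe how the query service decides which queries to delegate.

```python
import re

# Views backed by the flat file database; "videolog" is the view used in the example.
FLAT_FILE_VIEWS = {"videolog"}


def target_table(query):
    """Extract the table name after FROM (a deliberately naive assumption)."""
    match = re.search(r"\bfrom\s+(\w+)", query, flags=re.IGNORECASE)
    if match is None:
        raise ValueError("no FROM clause found")
    return match.group(1).lower()


def relational_query_service(query, flat_file_service, relational_service):
    """Delegate flat file queries; service everything else locally."""
    if target_table(query) in FLAT_FILE_VIEWS:
        return flat_file_service(query)
    return relational_service(query)


handled_by = relational_query_service(
    'select subscriber,datetime from videolog where video= "/video/films/0671.mpeg"',
    flat_file_service=lambda q: "flat file query service 7",
    relational_service=lambda q: "relational database 5",
)
```

The caller sees a single entry point either way, which is what makes the flat file access transparent to the user.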
The query process is managed and controlled by the parse and index configurations previously described with reference to Figure 2, and by a view configuration shown in Figure 3. The view configuration defines the relational views to be offered to queries, and the mapping from fields and records in the database 1 to the columns and tables in the relational views.
Thus in the example of Figure 3, a query is phrased against the view configuration. In this case the query is as follows:
select subscriber,datetime from videolog where video= "/video/films/0671.mpeg"
The query service 7 then identifies the index file to be used in the "where" clause of the query (in this case, the "where" clause identifies the video index). The query service 7 then references the identified index file to identify one or more of the records in the database 1 which satisfies the query. In this case the index file identifies the record associated with file pointer 382 and record offset pointer 67. The query service 7 then retrieves the identified record(s) from the database 1, maps the unstructured file formats to a structured relational view in accordance with the view configuration, and presents the query result as rows from a relational table, in this example:
[Table image in original: query result rows with subscriber and datetime columns]
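The query servicing steps can be sketched end to end as follows. The view configuration layout, the select function, the in-memory stand-in for file 382 and the second sample record are assumptions for illustration; the actual result values shown in the original figure are not recoverable from this text, and the byte offset in this sketch is computed from the sample data rather than taken from the example value 67.

```python
import io

FIELD_NAMES = ["video", "subscriber", "datetime", "duration", "ip-address"]

# View configuration: relational view -> column name -> flat file field name.
VIEW_CONFIG = {
    "videolog": {
        "video": "video",
        "subscriber": "subscriber",
        "datetime": "datetime",
        "duration": "duration",
        "ip_address": "ip-address",
    }
}

# In-memory stand-in for flat file 382; the second record is invented sample data.
FIRST = b"/video/films/9765.mpeg,016791801,210105:221007,012705,165.58.192.11\n"
SECOND = b"/video/films/0671.mpeg,016791802,210105:223000,010000,165.58.192.12\n"
FILES = {382: FIRST + SECOND}

# Index files: field -> key value -> list of (file pointer, byte offset).
INDEXES = {"video": {"/video/films/0671.mpeg": [(382, len(FIRST))]}}


def fetch_record(file_id, offset):
    """Seek to the record offset and parse the single record found there."""
    stream = io.BytesIO(FILES[file_id])
    stream.seek(offset)
    line = stream.readline().rstrip(b"\r\n").decode("ascii")
    return dict(zip(FIELD_NAMES, line.split(",")))


def select(columns, view, where_column, where_value):
    """Service: select <columns> from <view> where <where_column> = <where_value>."""
    mapping = VIEW_CONFIG[view]
    index = INDEXES[mapping[where_column]]  # index file named by the "where" clause
    rows = []
    for file_id, offset in index.get(where_value, []):
        record = fetch_record(file_id, offset)
        rows.append(tuple(record[mapping[column]] for column in columns))
    return rows


rows = select(["subscriber", "datetime"], "videolog", "video", "/video/films/0671.mpeg")
```

Only the records named by the index are read, so the cost of the query depends on the number of matches rather than on the size of the flat file.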
Although the invention has been described above with reference to one or more preferred embodiments, it will be appreciated that various changes or modifications may be made without departing from the scope of the invention as defined in the appended claims.

Claims

1. A method of searching a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value, the method comprising:
indexing the records to generate a flat file index;
receiving a query;
referencing the flat file index to identify one or more of the records which satisfies the query; and
retrieving the identified record(s) from the flat file database.
2. The method of claim 1 wherein the query is an SQL query.
3. The method of any preceding claim further comprising autonomously discovering the flat file(s) in the database.
4. The method of any preceding claim wherein the query is delegated from a relational database query service.
5. The method of any preceding claim further comprising:
indexing a relational database to generate a relational database index;
receiving a relational database query;
referencing the relational database index to identify one or more records which satisfies the relational database query; and
retrieving the identified record(s) from the relational database.
6. Apparatus configured to perform the method of any preceding claim.
7. Computer software which, when loaded on suitable hardware, causes the hardware to perform the method of any of claims 1 to 5.
8. Apparatus comprising:
a flat file database, the flat file database comprising one or more flat files, each flat file comprising a sequence of records with no structural relationship between the records, each record comprising one or more adjacent fields, each field containing a scalar value;
an indexing service configured to index the records to generate an index; and
a flat file query service configured to receive a query directed to the flat file database; reference the index to identify one or more of the records which satisfies the query; and retrieve the identified record(s) from the flat file database.
9. The apparatus of claim 8 further comprising:
a relational database; and
a relational database query service configured to service queries directed to the relational database and delegate queries directed to the flat file database to the flat file query service.
PCT/GB2006/002343 2006-06-23 2006-06-23 Flat file searching WO2007148033A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/GB2006/002343 WO2007148033A1 (en) 2006-06-23 2006-06-23 Flat file searching

Publications (1)

Publication Number Publication Date
WO2007148033A1 (en) 2007-12-27

Family

ID=36822387

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/GB2006/002343 WO2007148033A1 (en) 2006-06-23 2006-06-23 Flat file searching

Country Status (1)

Country Link
WO (1) WO2007148033A1 (en)

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
BUTLER M: "CopperEye Reducing the Risks and Improving the Return from Database Technology", BUTLER GROUP, September 2002 (2002-09-01), pages 1 - 10, XP002396161, Retrieved from the Internet <URL:http://www.coppereye.com/pdfs/analyst_reports/reducing_the_risks_a4.pdf> [retrieved on 20060824] *
COPPEREYE LTD: "CopperEye Greenwich Architecture", COPPEREYE PRODUCT WHITE PAPER, 15 August 2005 (2005-08-15), pages 1 - 6, XP002396158, Retrieved from the Internet <URL:http://www.coppereye.com/pdfs/greenwich_product_whitepaper_08_15_05_a4.pdf> [retrieved on 20060823] *
COPPEREYE: "Impact of Key Locality on Performance", COPPEREYE TECHNICAL WHITE PAPER, January 2004 (2004-01-01), pages 1 - 7, XP002396163, Retrieved from the Internet <URL:http://www.coppereye.com/pdfs/white_papers/key_locality_whitepaper_a4.pdf> [retrieved on 20060823] *
COPPEREYE: "Profile of CopperEye Indexing Technology", COPPEREYE TECHNICAL WHITE PAPER, September 2004 (2004-09-01), pages 1 - 10, XP002396162, Retrieved from the Internet <URL:http://www.coppereye.com/pdfs/white_papers/coppereye_indexing_whitepaper_a4_june_03.pdf> [retrieved on 20060824] *
HOWARD P: "CopperEye Greenwich & Search", BLOOR RESEARCH, May 2006 (2006-05-01), pages 1 - 12, XP002396160, Retrieved from the Internet <URL:http://www.coppereye.com/pdfs/analyst_reports/Bloor_Report_May_2006_A4.pdf> [retrieved on 20060823] *
THOMPSON M: "CopperEye Greenwich Technology Audit", BUTLER GROUP SUBSCRIPTION SERVICES, January 2006 (2006-01-01), pages 1 - 7, XP002396159, Retrieved from the Internet <URL:http://www.coppereye.com/pdfs/analyst_reports/Greenwich%20TA_rc_A4.pdf> [retrieved on 20060823] *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06755629

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 06755629

Country of ref document: EP

Kind code of ref document: A1