GB2573512A

GB2573512A - Database and associated method

Info

Publication number: GB2573512A
Application number: GB1807232.2A
Authority: GB
Inventors: Clifford Paul; Robinson Mark
Original assignee: Imosphere Ltd
Current assignee: Imosphere Ltd
Priority date: 2018-05-02
Filing date: 2018-05-02
Publication date: 2019-11-13
Also published as: GB201807232D0; GB2573512A8

Abstract

Disclosed is a method of operating a database system, to provide access to a plurality of data sources, ie databases, each data source having a structure based on a respective data model and storing a plurality of data entities having attributes and occurrences within the structure. The method starts by receiving selected attributes associated with the data sources, and then generating a staging database based on the in selected attributes. Next, an output database corresponding to the contents of staging database is generated. The output database is a multi-character expression database populated with unique, multi-character expressions, each multi-character expression having a predetermined hierarchical structure which defines a relationship between each entity, attribute and entity occurrence with every other entity, attribute and entity occurrence in the multi-character expression database. The data structure of the staging database may be independent of the data model relationship between the entities, attributes and entity occurrences in the staging database.

Description

DATABASE AND ASSOCIATED METHOD

TECHNICAL FIELD

The disclosure relates to database systems, and in particular to a system and associated method for transposing data between databases.

BACKGROUND

A database model is a theory or specification describing how a database is structured and used. A data model is not just a way of structuring data: it also defines a set of operations that can be performed on the data such as queries. Several such models have been suggested such as Hierarchical model, Network model, Relational model (the most popular model), Entity-relationship model, Object-relational model, Multivalue model, Object Model and Document model.

As is well known, there is a lack of standardisation of database models and systems such that different organizations use different database models or even different departments within an organization (e.g., the Health Service) use different database models. Each organization or department generally chooses the database model considered most suitable for them or simply accepts the database model recommended to them by their IT department or database manager/administrator. Furthermore it is generally not possible to manage databases having different database models using one database management system.

Well known database models and systems include those provided by Oracle™, Microsoft™, Sybase™, IBM™ and the like. Each differ in specifics and typically require an expertise in maintaining and interrogating data within their defined data structures.

These databases are known and usefully employed to aggregate data in a manner that has become known in the art as big data analytics. This is where large data sets are examined or interrogated to uncover hidden patterns, correlations, or other trends of information categories. A problem that exists with big data is that the actual examination of the data typically requires expertise in computing analysis, tools that are specific to the database type and a knowledge of the data structures within the databases. There are fundamental problems in the world of big data of collating data from multiple source databases into a structure that makes for flexible and straightforward analytical interrogation. Traditional methods depend upon complex mapping processes that are both time-consuming and force upfront decisions as to the relationships between similar data elements.

Another problem that exists with traditional approaches to collating data from multiple source databases pertains to the fact that while the approaches require complex mapping from the source databases, it is often desirable to avoid mapping decisions in the first instance because decisions regarding whether to treat two data elements from different sources as identical may only be sensibly made by undertaking analyses of various types within a collated database. Some traditional methods force mapping judgments in advance of the data being in a single place that supports such analyses.

SUMMARY

According to a first aspect there is provided a method of operating a database system, the method comprising:

providing access to a plurality of data sources, each data source having a structure based on a respective data model and storing a plurality of data entities having attributes and occurrences within the structure;

receiving selected data associated with the plurality of data sources;

generating a staging database based on the selected data; and generating an output database corresponding to the contents of staging database, wherein the output database is a multicharacter expression database populated with a plurality of unique, multi-character expressions, each multi-character expression having a predetermined hierarchical structure which defines a relationship between each entity, attribute and entity occurrence with every other entity, attribute and entity occurrence in the multicharacter expression database.

The selected data may be one or more selected attributes. The selected data may be one or more selected occurrences. The selected data may be one or more selected entities.

The structure of data in the staging database may be independent of a data model relationship between the entities, attributes and entity occurrences in the staging database.

The entity occurrences in the staging database may have the same format as the corresponding entity occurrences in the data sources.

The staging database may be a relational database. The staging database may not be a multicharacter expression database. The staging database may have a different data structure to the multicharacter expression database.

The method may comprise generating a form defining the selected data. The method may comprise mapping, onto the staging database, selected attributes of the entity occurrences stored in the plurality of data sources. The form may define composite attributes based on one or more attributes in the source data.

The form may define mapping rules to convert variants of an attribute in the source data to a transformed attribute in the staging database. The form may define a mapping from one variant of an attribute to another variant of the attribute. The form may retain, in the staging database, divergent variants of attributes from the plurality of data sources. The form may enable a value of an attribute from a first source database, and a divergent value of the attribute from the second source database to both be mapped to valid values in the staging database. The valid values in the staging database may be predefined and/or of a limited number. Divergent variants of an attribute may have a similar semantic meaning. Divergent variants may have an identical semantic meaning to one another in one context and a different meaning to one another in a different context.

The method may comprise updating the form to change the selected data.

The method may comprise generating an updated staging database following a change to the plurality of data sources or a change in the selected data. The method may comprise generating a difference file detailing changes between the staging database and the updated staging database.

The plurality of data sources may comprise a plurality of source databases. The plurality of data sources may provide heterogenous data structures. The plurality of data sources may comprise one or more other data sources, such as spreadsheets. At least two of the plurality of source databases may differ in their database management systems.

The multicharacter expression database may allow a user to access data stored in the plurality of data sources without using the database management system or systems that are specific to the plurality of databases. Said multicharacter expressions may be stored in an expression settable linking each element of each expression with a natural language phrase relating the expression to a hierarchical level and a position in a data model.

The staging database may be associated with the data entities of the source data. The multicharacter expression database may be associated with the data entities of the staging database.

According to a further aspect there is provided a system configured to perform a method described herein. The system may comprise at least one processor and memory sorting computer program, in which the computer program is configured to, when executed by the at least one processor, cause the system to perform a method described herein.

A computer program configured to, when executed by the at least one processor, cause the system to perform a method described herein.

BRIEF DESCRIPTION OF FIGURES

One or more embodiments will now be described by way of example only with reference to the accompanying drawings in which:

Figure 1a illustrates a schematic block diagram of an exemplary data model for a multi-character expression database;

Figure 1 b illustrates an overview of the use of an expression set together with implementing tables;

Figure 2 illustrates a flow chart of a method for operating a database system;

Figure 3 illustrates a schematic block diagram of a series of data structures for use in the method of Figure 2;

Figure 4 illustrates a data structure of a data source;

Figure 5 illustrates an example form that matches the data structure described previously with reference to Figure 4;

Figure 6 illustrates the relationship between a plurality of forms, a staging database and automatically generated scripts;

Figure 7 illustrates a staging table of selected attributes of a plurality of forms; and Figure 8 illustrates a form definition represented as a table.

DESCRIPTION OF EXAMPLES

The present disclosure provides a system and associated method for extracting data from a plurality of potentially disparate source databases and providing a single multicharacter expression database. The extract-transform-load (ETL) relationship between the source databases and the multicharacter expression database is managed using a staging database. The staging database, which may be provided as a monolithic entity, consolidates selected data from the source databases. The staging database may store data in the same format as the source databases. In this way, mapping of entries between the source databases and the staging database may be made relatively straight forward. In addition, validation procedures can be performed on the data that is being, or has been, imported into the staging database to ensure data integrity. Validation and data processing steps can be performed on importing data to the staging database, to improve latency during use. Alternatively, validation and data processing steps can be deferred to the multicharacter expression database in order to take advantage of the improved processing performance of such databases.

In this way, the present disclosure combines the use of a (non-multicharacter expression) staging database, for ease of data management and ETL, which is then transformed into an output (multicharacter expression) database. Multicharacter expression databases allow for improved database performance in some operations.

Figure 1a illustrates a schematic block diagram of an exemplary data model for a multicharacter expression database.

In a multi-character expression database, the physical model, i.e. the storage model which represents the physical structure of the data stored on the computer system is designed to be much closer to a conceptual model of the real world, i.e. the data model of the organization(s) using the database. This closeness is normally difficult to achieve, simply because the requirements of the computer-accessed disks and other storage media are so different from the human view of the organizational structure being represented by the database. A database implementation which can simplify the interface between the physical model and the conceptual model offers huge advantages in terms of the speed of processing when accessing information from the database, and also greatly simplifies the software and hardware interface necessary to achieve this interface.

Every entity, every attribute and every occurrence of every entity in the data model is uniquely specified by a multicharacter expression which is conveniently (for the sake of clarity of explanation) divided into a number of words. In a presently preferred embodiment, the expression comprises three five-byte words, with each byte representing one ASCII character selected from a set of approximately 200.

The expressions do more than simply provide a unique label to each entity, each attribute and each occurrence of each entity, but also implicitly encode the data model by reference to its hierarchical structure and protocol.

This is achieved by use of the strict hierarchical protocol in the assignment of expressions to each entity. This can be achieved automatically by the database management system when the user is initially setting up the database, or preferably is imposed by a higher authority to enable the database structure to conform to wider standards thereby ensuring compatibility with other users of similar database systems.

The way in which the database structure is imposed by the assignment of these expressions is best described with reference to an exemplary data model as shown in Figure 1.

The tree structure in figure 1 represents the known universe of the data model. Each hierarchical level of the data model is shown horizontally across the tree structure, and each one of these hierarchical levels may be represented by an appropriate byte h to h₅of the expression shown vertically on the left hand side of the drawing. At the highest level of the tree h, we have context information defining the organization using the data, for example the National Health Service, Prison Service, Local Authority, Educational Establishment etc.

The significance of byte l₂ will be discussed later, but broadly speaking indicates a data type from a plurality of possible data types which might be used in one implementation of the present invention.

Within each organization (e.g. the Health Service) there may typically be a number of departments or functions or data view types (represented by byte) such as administration, finance / accounts and clinical staff, all of whom have different data requirements. These different data requirements encompass:

a) different data structures or models reflecting different organizational hierarchies within the department;

b) different views of the same entities and occurrences of entities; and

c) the same or different views of standard format data relating to different occurrences of similar or identical entities or attributes.

The significance of this to the present invention will become clear as one progresses downward through the hierarchy.

Each department may wish to segregate activities (e.g. for the purpose of data collection and analysis) to various regional parts of the organization: e.g. a geographically administered area or a sub-department. This can be reflected by expression byte U- Each geographically administered area may further be characterized by a number of individual unit types, such as: (i) hospitals, health centres etc. in the case of an National Health Service (NHS) application, for example; (ii) schools or higher education institutions in the case of an education application; (iii) prisons and remand centres in the case of the prison service application.

Each of the organizations and units above will have different data structure requirements (as in (a) above) reflecting different entities, attributes and entity relationships within the organization and these are provided for by suitable allocation of codes within the l₆ to ho range of expression bytes. In this case, the same alphanumeric codes in bytes le to ho will have different meaning when in a branch of the tree under NHS than when under, eg. the education branch, even though they exist at the same hierarchical level. As an example, the sub-tree structure represented by particular values of bytes l₆ to ho may refer to patient treatment records in the NHS context, whereas those values of codes may refer to pupil academic records in the education context.

However, in the case of (b) above, where the organizational unit requires the same or different views of the same entities, attributes and occurrences of entities as other organizational units, the codes in bytes l₆ to ho of one branch of the tree will represent the same underlying structure and have the same meaning as corresponding byte values under another branch of the tree. An example of this is where both the administration departments and the finance departments require a view of the personal details of the staff in the hospital, both doctors and nurses. Note that the views of the data may be the same or different for each department, because the view specification is inferred from the higher level h to I5 fields. In this case, as will be explained later, for entities, attributes and occurrences of entities which are the same in each subbranch, some or all of the codes h to I5 which identify each entity occurrence will have identical values.

In the case of (c) above, ie. the same or different views of standard format data relating to different occurrences of similar or identical entities and their attributes, it will be understood that a number of predefined bytes require the same specification regardless of the particular organization using them. For example, a sub-tree relating to personnel records, and including a standard format data structure for recording personnel names, addresses,

National Insurance numbers, sex, date of birth, nationality etc. can be replicated for each branch of the tree in which it is required. For example, all of the organizations in the tree will probably require such an employee data sub-tree, and thus by use of standardized codes in bytes Ιθ to ho such organizational sub-trees are effectively copied into different parts of the tree. However, in this case, the context information in fields h to h will indicate that within each organization, we are actually dealing with different occurrences of similar format data.

In preferred embodiments of the present invention, the tree structure defined by the expressions h to I5 is used to define not only all entity types, all entity attribute types and all entity occurrences, but is also used to encode the actual attribute values of each entity occurrence where such values are limited to a discrete number of possible values. For example, in the sub-tree relating to treatments in the NHS hospital context, drug is an entity which has a relation with or is an attribute of, for example: doctors (from the point of view of treatments prescribed); patients (from the point of view of treatments given); administration (from the point of view of maintaining stocks of drugs) and so on. The entire set of drugs used can be provided for with an expression to identify each drug. In a preferred embodiment, the parts of the expression specific to the occurrences of each drug will be located in the In to I15 fields as shown in Figure 1. Thus when used in conjunction with the appropriate fields h to ho. it will be apparent whether the specified drug is in the context of a treatment prescribed by a doctor, a treatment received by a patient, or a stock to be held in the hospital pharmacy.

Further bytes in the expression, lower in the hierarchy can be associated with the drug to describe, for example, quantities or standard prescription types. It will be apparent whether the expression refers to a prescribed quantity or a stock quantity by reference to the context information found higher in the hierarchy. In practice, the number of discrete values allowed for each of these grouped entity values using the five fields In to h₅ is approximately 200⁵ = 3.2 x 10¹¹ As will be described later, the number of permutations allowed can actually be expanded indefinitely, but in practice this has not been found to be necessary.

Thus, in the fifteen character expression h to I15 each character represents a English language expression defining some aspect of the data model, and by travelling downward through the table it is possible to compose a collection of English expressions which represents the complete specification of an entity, an attribute or an entity occurrence.

FIG. 1 b illustrates an overview of the use of an expression set together with implementing tables. Every occurrence of an entity about which information must be stored is recorded in the entity details table 510. Each occurrence of each entity is given a unique identifier 512 that is assigned to that entity occurrence, and information about the entity is stored as a value expression information string 513. Examples of value expressions are the character strings giving names, street addresses, town, county, country etc., ordrug name, manufacturer's product code etc. These details are essentially alphanumeric strings which themselves contain no further useful hierarchical information and are treated solely as character strings.

The unique identifier 512 of each entity occurrence in the entity details table 510 provides a link to an entity history table 520 where entry of, or update to the entity occurrence status is stored. In this table, the event updating the database is given a date and/or time 524, an expression 526, and the unique identifier 522 to which the record pertains, and may include other information such as the user ID 527 of the person making the change.

In the entity history table 520, various details of the event being recorded may not be available, or may have no relevance at that time. For example, a new patient in a designated hospital may be admitted, and some details put on record, but the patient is not assigned to any particular doctor or ward until a later time. Additionally, some information may be recorded which is completely independent of the user view or other context information. Thus the event is logged with only relevant bytes of the expression encoded. Bytes for which the information is not known, or which are irrelevant to the event are non-deterministic and are filled with the wild card character, “#”.

The entity history table 520 may also include an event tag field 528 which can be used in conjunction with a corresponding field in an episode management table to be described hereinafter. It will indicate which coding activity was being carried out when the expression was assigned to the entity. For example, this tag could indicate whether the coding was carried out during an initial assessment, an update, a correction, a re-assessment, etc. This tag also orders entity codes into event groups. For example, in the medical context, when a person enters the system as a patient, they initiate an admission. An episode can have many spells, (such as a period of treatment on ward A, followed by a period on Ward B) and a spell can consist of many events (such as contacts with the attending physician, procedures, tests). What is more, a patient can be involved with more than one episode at a time (for example out-patient episodes with different hospitals pertaining to different illnesses), and under each episode, more than one spell at a time (e.g. involvement with more than one department of each hospital, each dealing with different aspects of each illness). Many organizations need to store this sort of information for costing and auditing purposes. By coding this information into an expression, it will be possible to browse this information.

The entity history table may also include a link field 529 which is designated to link related groups of codes allocated during a particular entity-event-times. For example, in a social services application, a home visit, a visit date, miles travelled and the visitor could all have an expression associated with the visit event. The link field will link these expressions together. Alternatively, the event tag field may also cater for this function.

A memo field 523 may also be included in the entity history table to allow the user to enter a free text memorandum of any length for each code allocated to an entity. In effect, every time a field is filled, a memo can be added.

The expression set of the entire database is recorded in a third table, the expression set table 530. This encodes each expression against its natural language meaning, and effectively records the data model as defined by the hierarchical structure of Figure 1a. There is a natural language meaning for each byte of the expression, each byte representing a node position in the data model tree, and the precise significance of every occurrence of every entity or attribute is provided by concatenating all natural language meanings for each byte of the expression: e.g. and again in the context of the NHS in the United Kingdom, —Presentation Data Type—Administrator's View—Region 1— HospitalNo2—Doctor Record—Name—DoctorlDI. It will be appreciated that the terms within this expression set table may be appropriate for the environment of which the databases are provided. For example in a US healthcare environment, while the terms used may differ from those used in a UK healthcare environment, the expression set table still provides a link in a natural language to which the user is familiar to the underlying data which is provided in a more complex form.

The expressions may include expression extensions which map a sub-tree onto the main tree as are discussed in more detail in aforementioned GB 2293697B and GB 2398143B. For convenience, these extension expressions can be located within the expression set table 530 (the extension entries being identified by the byte h, or could be located in a supplementary table (not shown), in which the pointer fields In to hsofthe main expression are used as the first fields h to h of the extension expression.

The entity history table 520 and the expression set table 530 may each include an extra field holding a version code. In the entity history table, this would indicate a version number of the expression in use at the time the record was created; in the expression set table, expressions may be varied over time according to the version code given. This allows the structure of the hierarchy to change over time without necessarily introducing new expressions. This assists in maintaining backward compatibility of recorded data.

Use of a multicharacter expression database offers significant advantages in the execution of reporting and database querying functions, particularly for multiple users or multiple classes of users.

To answer a given query, the database system defines a query expression comprising fifteen bytes (l· to h₅) which correspond with the expressions as stored in the entity history table 520 and expression set table 530. The query expression will include a number of deterministic bytes and a number of non-deterministic bytes. The non-deterministic bytes are effectively defined as the wild-card character “#”-“matches anything”. The deterministic bytes are defined by the query parameters.

For example, a simple query might be: “How many patients are presently registered at hospital X”. To answer this query, the query expression imposes deterministic characters in fields 11, (=NHS), I4 (=hospital identity), I6 (=patients). Other context information may be imposed by placing deterministic characters in bytes I2 (=presentation information). All other bytes are non-deterministic and are set to “#”. The database scans through the expression set table matching the deterministic characters and ignoring others. It should be noted that in the preferred embodiment, the expression set table is maintained in strict alphanumeric sequence and thus very rapid homing in on the correct portions of the database table is provided where high-order bytes are specified. This will normally be the case, since the hierarchical nature of the expression set will be arranged to reflect the needs of the organization from which the data was retrieved. The database system can then readily identify all the tuples of the expression set table providing a match to the query expression.

A significant advantage of the database structure will now become evident. The answer to the initial query has effectively homed in on one or more discrete portions of the expression set table and counted the number of tuples matching the query expression. Supposing that the user now requires to “progressively query” by stipulating additional conditions: “How many of those patients are being prescribed drug Y” requires only the substitution of the non-deterministic character “#” with the appropriate character in the requisite field l_n of the expression to change the result. Similarly, carrying out statistical analysis of other parameters, such as: “How many patients were treated by doctor Z with drug Y” can rapidly be assessed. It should be understood that progressively narrowing the query will eventually result in all bytes of the query expression becoming deterministic and yielding no match, or yielding a single patient entity match whose details can then be determined by reference to the entity details table 510 (or the appropriate memo field).

It should now be clear that the key to the speed of result of the statistical querying function is the construction of the expression set table. When imposing conditions on various attributes of an entity, i.e. by setting a deterministic character in a byte of the query expression, the relevant data will be found in portions of the table in blocks corresponding to that character. Progressive querying requires only scanning portions of the table already identified by the previously query. Even where a higher level context switch takes place, relevant parts of the expression set table can be accessed rapidly as they appear in blocks which are sequenced by the expression hierarchy.

Scanning the table can be achieved most efficiently by recognizing that only the highest order, deterministic byte of the query expression need be compared with corresponding bytes of each record in the expression set table until a first match is obtained. Thereafter, the next highest order byte must be included, and so on until all deterministic bytes are compared. This results from maintaining a strict alphanumeric ordering to the table.

Another type of querying relates to examining the historical aspects of the database through the use of entity history table 520. For example, the query may be, “In the last year, what drugs and quantities have been prescribed by doctor X”? To answer this query, the query expression is formulated in the same manner as before with regard to the expression set table 530, imposing deterministic bytes in the appropriate places in the query expression. This will include one or more “lowest order” bytes in 111 to 115 which actually identify a doctor, and non-deterministic characters against the drug fields. This time, however, the entity history table 520 is scanned, in a similar manner, seeking only matches of deterministic characters. In a preferred embodiment, the entity history table 520 will be maintained in chronological sequence and thus the search can be limited to a portion of the table where date limitations are known and relevant. Matches of deterministic characters will be found throughout the table where a relevant event relating to prescription of a drug by doctor X is found. Note that the entity history table 520 may include other fields which can be used to impose conditions on the query, such as the user ID of the person entering the record.

A further type of querying relates to analysis of the records pertaining to a single entity value: the entire medical record of patient X. In the preferred embodiment, patient X would be identifiable from the entity details table 510.

The query would initially involve searching for the patient's name to locate the unique identifier (unless that was already known). Once the unique identifier for a patient was known, then the entire entity history table can be scanned very rapidly for any entry including the unique identifier. The strengths of the present invention will then be realized in that the output from this scan will provide a number of entries each of which carries all of the relevant information about that patient incorporated into the extracted expression bytes h to I15. The entire patient's record can then be “progressively queried” without recourse to any further searching operation on the main entity history table 530. Specific details of the patient's treatments, doctors, hospital admissions, prescriptions etc. are all very rapidly available at will be assertion of appropriate deterministic bytes in the expression h to I15.

It is noted that the event history table will include many records where the expression stored in the record contains many non-deterministic bytes. For example, where a doctor X prescribes a patient Y with drug Z, other bytes of the expression may be either not known, or not relevant. For example, the patient may have been assigned to a ward W in the hospital which could be identified by another byte. However, this venue in which the treatment took place might be: a) unknown; b) known but not relevant to the record; or c) automatically inferable from the context of the person making the record entry.

Whether this information is included in the record is stipulated by the users; however, it will be noted that it does not affect the result of the query whether the byte in the entity history table relating to WARD W is deterministic or non-deterministic, because the query expression will set that relevant byte to non-deterministic unless it is stipulated as part of the query.

When the database system has extracted all of the records of the entity history table matching the query expression, it preferably saves these to a results table for further querying, or progressive browsing. For example, the results table can then be analysed to identify which treatments were made at an individual hospital or by an individual doctor by setting additional conditions on particular bytes of the query expression. Memo fields can be extracted to view comments made at the time of treatment. It can be seen that the results table formed in response to the initial query actually contains all of the information relevant to a given patient's treatment, and not just the answer to the initial query “What drugs have been prescribed to patient X”?

In summary, the information of the database is stored in such a manner that data for a query may be extracted far more rapidly than relational database storage schemas such as those used in the output database discussed below with reference to Figures 2 and 3, and with an expression for each extracted record. The presence of this expression in the query result has an important effect. A unique reporting benefit gained is the scope for progressive querying and “interactive reporting”. In this way a set of results that is the result of a first search query represents a data set that provides the searchable data for a second search query. This searchable data set has been qualified or constrained by the matching of the deterministic criteria of the first search query against the interface database.

When a database query is executed to provide information for a report, the answer will be made up of a number of expression records. This subset of expressions inherits all the structural information held in the main expression set.

As a general example: a detailed report on the number of severe hallucination instances in a given geographical area during the past year might return a subset of 12,000 expressions. Because these are full expressions, higher and lower level information is also inherent in this subset. Further investigation of the answer through browsing the returned hierarchy might reveal that 70% of cases were male, or 30% of cases occurred in the prison service, etc. Similarly, a high level report on the number of instances of hallucination in a particular organization might return a subset of 9,000. More detailed information will be inherent in this retrieved subset. By progressive querying of this subset, it may transpire that 90% of mild occurrences were in planning departments or that 5% of severe occurrences were in education departments. The processing time required to browse this information with further, more detailed “sub-queries” is substantially speeded up over prior art systems simply because the expression set readily provides all the lower level information.

Figure 2 illustrates a flow chart of a method 200 for operating a database system with a staging database. Access to a plurality of data sources is provided 202. Each data source may be provided by a separate device, such as a drive, or a logical segment / partition of a device. Each data source has a structure based on a respective data model and stores a plurality of data entities having attributes and occurrences within the respective structure. Selected attributes associated with the plurality of data sources are received 204 from the plurality of data sources. A staging database is generated 206 based on the selected attributes associated with the plurality of data sources. Alternative, other classes of data may be selected, such as one or more entities or one or more occurrences. The structure of data in the staging database may be independent of a data model relationship between the entities, attributes and entity occurrences in the staging database. Further, the entity occurrences in the staging database may have the same format as the corresponding entity occurrences in the data sources. In this way, the process of mapping data between the staging database and the source databases may be simplified.

An output database is generated 208 from the staging database. The content of the output database corresponds to the content of staging database. The output database is a multicharacter expression database populated with a plurality of unique, multi-character expressions, such as those described previously with reference to Figures 1a and 1b. Each multi-character expression has a predetermined hierarchical structure which defines a relationship between each entity, attribute and entity occurrence with every other entity, attribute and entity occurrence in the multicharacter expression database.

Figure 3 illustrates a schematic block diagram of a series of data structures 300 that relate to those described previously with reference to the method of Figure 2. A plurality of data sources 302a-n is provided. Each data source 302a-n has a structure based on a respective data model. In this example, each of the data sources has a structure based on a different data model from the other sources, although that is not necessarily the case. A plurality of data entries having attributes and occurrences is provided by each of the structures in the respective data sources 302a.

At least two of the plurality of source databases may differ in their database management systems. Each database management system may define a set of programs that enable a user to store, modify, and extract information from the respective source databases. The database management systems may be selected from one or more of those provided by Oracle, FoxPro, IBM DB2, Linter, Microsoft Access, Microsoft SQL Server, MySQL, PostgreSQL and SQLite.

One or more attributes of each of the data sources 302a-n is selected, to provide selected attributes 304a-n. A staging database 306 is generated using the selected attributes 304a n that are extracted from the plurality of data sources 302a-n. An output database 308 is generated from the staging database 306, as described previously with reference to figures 2. The output database is a multi-character expression database.

In this way the present teaching may provide efficient methods of collating data from multiple source databases into a structure that makes for flexible and straightforward analytical interrogation, while simplifying the mapping of data from the source databases to the output (multi-character expression) database.

Example implementations of the method of Figure 2 are discussed further below with reference to Figures 4 to 8. Various optional, or preferable, features of the method of Figure 2 are described further below with reference to these specific examples. It will be appreciated that any particular step, or feature, described below with reference to one specific example may be used in combination with one or more other features of that example, another example, or of the method described above with reference to Figure 2.

1.1 Step 1: Identify Data

Figure 4 illustrates a data structure of a data source. The data structure comprises a plurality of attributes (for example Patient ID, Encounter ID, weight, height, mobility, bp, tetanus vaccine date, notes), a plurality of records (for Patient ID: 1,2 and 3 in this example) is provided within the data structure. That is, the data source in this example may be provided by an entirely conventional, relational database. In this example, the data source is an example relating to patient data for use in a medical context.

A data source may include:

• Patient demographic data • Medical events • Clinical coding systems (e.g. ICD, CPT) • Financial data

1.2 Step 2: Design Forms

Forms are structured computer-generated documents or displays used for data entry to computer systems and possibly also for data verification during the data entry process. Forms allow a form provider to acquire structured and standardised data by asking one or more respondents to fill in instances of the form. The data acquired is generally stored in a database so that it can be extracted by appropriate queries and analysed.

Data entry forms are widely used by a number of different stakeholders, including: users who wish to collect data using forms; form creators who create forms for use by an organisation or by multiple organisations and who may wish to share forms with, or licence forms to, other users; forms automators who may include software vendors and IT departments responsible for the electronic automation of forms; and forms analysers who may include data analysts, information specialists and managers who wish to use, aggregate and work with the data captured using forms.

In some embodiments, one or more forms may be used to control the selection and transfer of data from source databases to the staging database.

Figure 5 illustrates an example of a form. Forms may be designed using Atmoforms, for example. The form comprises a number of fields. Each field relates to an attribute in the data structure. The form represents the structure of the data set to import. That form in Figure 5 has been designed to match the data structure described previously with reference to Figure 4, which would result in all fields of Figure 4 being imported to the staging database.

Data is mapped to the most appropriate data type, for example:

Numeric

Enumeration

- Date

- Text

- ‘Added value’ items can be included (e.g. calculated answers, hierarchical enumerations)

Selected attributes of the entity occurrences stored in the plurality of data sources may be mapped onto the staging database using a form, or plurality of forms. The form(s) may define composite attributes based on one or more attributes in the source data. Generation of composite attributes using the form may reduce the time taken to run searches at run-time and therefore enable a trade-off between storage space and runtime efficiency to be achieved. For example, the attributes “date of birth” and “gender” may be used to define a composite attribute with occurrences that can take the states of: “woman”; “girl”; “man”; “boy”. In this way, the data in the staging database corresponds to, but does not necessarily possess identity with, data in the source databases.

Data constraints can also be added. E.g. Max value 120. The form may define mapping rules to convert variants of an attribute in the source data to a transformed attribute in the staging database. For example, the variants of an attribute such as “co-habiting” and “living with partner” in the different source data may all be mapped to “co-habiting” in the staging database. In this way, data can be sanitized automatically on importation into the staging database.

Alternatively, the form may retain, in the staging database, divergent variants of attributes from the plurality of data sources. The retention of divergent variants (such as “cohabiting” and “living with partner” used in respective source databases) in the staging database enables analysis of the variants to be performed subsequently, using the multicharacter expression database, before a decision on conforming the variants is taken. Such retention of divergent variants may be advantageous in circumstances where it is unclear at the outset whether corresponding terms in different source databases have the same meaning for the purposes of the combined data analysis. In this approach, an agnostic position may be taken in some circumstances.

For example, if B and C can be represented as subsidiary concepts to a broader concept

A, then one may use the coding to import a hierarchy:

A

AB

AC

Having done that, the user may view the impact of treating B and C as equivalent by selecting all instances where A is true, and viewing the effects separately by selecting codes AB or AC. Hence the decision on equivalence is deferred until after the ETL, during analysis by a multicharacter expression database, and may even vary according to circumstance or purpose.

Although the forms created constrain the ETL process, there is the capacity to report on a data item in a configuration that is independent of the form in which it has been placed for ETL purposes, for example, by assigning different numeric values to specific answers to a question or placing the item in a different hierarchy to those assigned upon import.

1.3 Auto-generate Staging DB and ETL

Figure 6 illustrates the relationship between a plurality of forms, such as those described with reference to Figure 5, and a staging database and automatically generated scripts.

Fields from a plurality of different forms 607, which may each relate to a respective data source, may be selected in order to determine which data are to be transferred from the source databases to a staging database 608. The translation of information from the source database to the staging database results in an extract-transform-load (ETL) relationship between the source databases and the staging database. The ETL relationship maps the data from the staging database back to the source database. ETL scrips that describe the mapping may be generated, and optionally saved, in order to assist in the maintenance of the staging database. Documentation concerning the selected attributes, and the relationship between the staging database 608 and the source databases may be automatically generated using the ETL relationship data, and the inputted selected attributes.

1.4 Example Staging Table

Figure 7 illustrates a staging table 702 of selected attributes (each row in the ‘Column name’ column) of the forms. The staging table represents an MSSQL server table definition of a generated staging database. Each selected attribute has a type (for example integer, Boolean, or a specific format of data) and a flag to mark whether the selected attribute is nullable. The staging table is generated from the structure of the forms, such as those described previously with reference to Figure 5 and therefore closely matches the source data structure. A table definition 704 constraining the columns in staging table 702 to limited values 706 is also shown in Figure 7. In this example, the mobility values of respective patients from the staging table are mapped to one of three values for storing in the multicharacter expression database (poor: 25.92.45.1; fair: 25.92.45.2; good: 25.92.45.3) using a lookup table, for example. In this way, referential integrity can be used to enforce data accuracy. Other constraints in the transferred data may be enforced, for example, by SQL constraints and triggers.

1.5 Transfer Data to Staging

The staging table may therefore be used to control the extraction of attributes from the source databases and map the selected attributes into the staging database. In this way, the translation of information between the source database and the staging databases may be controlled and modified using the staging table.

Once it has been compiled, the staging database can be used to generate a multicharacter expression database, as described previously with reference to Figures 1 to 3. The provision of a multicharacter expression database allows a user to access data that corresponds to data stored in the plurality of data sources without using the database management system or systems that are specific to the plurality of databases.

Further discussion of aspects of the use of the staging database in the formation of a multicharacter expression database are provided below.

FURTHER OPTIONAL FEATURES

2.1 Representing an Imosphere Form with its associated Atmocodes as a DATABASE TABLE

Atmocodes are multi-character expressions of an Atmolytics database, available from Imosphere Limited. Form definition, comprising shared Atmocodes concepts, is a model that represents data. This model may be derived by analysing customer workflows, systems and requirements. The model represents an abstract view of data, as discussed previously with reference to Figures 1 and 2. Each form may be represented by a different table structure.

Figure 8 illustrates a form definition 804 with an associated database table 802, which represents data from a staging database.

The database table 802 comprises relational data records. Each record comprises a field associated with the attributes PatientID, name, age and sex. None of the fields are Atmcodes (multicharacter expressions). However the form definition 804 enables each field to be mapped to a multicharacter expression. In this way, a multicharacter database can be generated from the staging database 802.

2.2 Using the Imosphere Form’s meta data to constrain values

As part of defining a form, extra metadata may be created on top of the Atmocodes, as shown in the translation table 804 in Figure 8. For example:

• a numeric value for age may be constrained to be between 0 and 100.

• a text answer length of less than 50 characters.

• a drop down list of Male or Female.

This data can be used to setup constraints to only allow valid data to be entered from the source databases into the staging database, thereby excluding bad data. This feature may be of particular use in analytical packages because bad data can seriously skew and affect calculations. Therefore, by removing or flagging bad data on compilation of the staging database, the accuracy of the analytics package may be improved.

2.3 FORMS/DATABASE CHANGING OVER TIME

Over the operational lifetime of a database, many changes can be applied to it to meet various business needs. For example, a form may be changed so that questions are added, removed, options updated, etc. For example, in a medical context, an existing ward management database may need to be changed when the ward starts to undertake new types of procedure, resulting in a change in form requirements. Such changes in structure would normally cause a major source of difficulty for ETL processes. However, by tracking changes between versions, it possible to calculate what has changed. The change, or delta, may then be applied to the staging database. That is, rather than updating the entire staging database, a delta staging database, or difference file, may be provided which comprises details of changes that have been made over a previously generated staging database. The delta staging database may be appended to the previously generated staging database to provide an updated database. By providing a delta staging database, the current data in the staging database is kept, rather than being thrown away and reloaded.

Deltas may relate to changes in a form, such as the addition or removal of a form, or the updating of an existing form, for example by the addition or removal of a field (changing the form structure), or a change in a validation or encoding scheme applied to a field. Validation or encoding type information may relate to possible answers to questions on a form.

For example, where the composition of senior staff on a ward changes, resulting in a change in the types of procedure that are undertaken, a form associated with patient enrolment for that ward may change so that the list of possible procedures associated with the respective patients also changes.

Deltas may also relate to changes in the data in the database, or details of the database package (such as version or coding system). A script, such as SQL script, may be used to determine deltas between one version of a source or staging database and another version. The use of deltas assists in the rectification of human errors due to incorrect data entry because modifications to the database can be tracked.

The application of a delta to the staging database may be used to trigger updating of the corresponding data in the multi-character expression database.

2.4 Software changing over time

As business requirements change over time, the ETL target database will also change. This changing structure of the target database is normally a source of difficulty for ETL processes. However, the use of the staging database allows consistency in mapping between from the source databases to be maintained, while allowing the structure of the output database to also be modified. The provision of the staging database also provides an abstraction layer that allows structural changes to be made to the target database without affecting the loading of data into the staging database from the source databases. Having the ability to support changing versions without needing to correct any breaking changes removes upgrade risk and reduces a major cause of version lag.

2.5 ETL PROCESS

Each application of the staging database may be different, depending on the particular source data structures that need to be transferred to a multi-character expression database. Transfer of data from the source databases to the staging database my be performed using Integration Services functionality of Microsoft™ SQL Server SISS), for example.

A custom process that moves data from the staging database into the multi-character expression database may be provided. This is also generated at the same time as the staging database.

2.6 Staging Database documentation

While generating a staging database, bespoke documentation is also generated at the same time that describes what data is allowed, how to populate the database and other useful information.

2.7 Bespoke artefacts

All of the above features are, by their very nature, may be different for each application or set of forms. As such, each staging database or database delta may be delivered as a unique and bespoke artefact. Each artefact may contain a database (or database delta), documentation and an ETL process to move data from the staging database into the target Atmolytics system, as discussed above.

Claims

1. A method of operating a database system, the method comprising:

providing access to a plurality of data sources, each data source having a structure based on a respective data model and storing a plurality of data entities having attributes and occurrences within the respective structure;

receiving selected data associated with the plurality of data sources;

generating a staging database based on the selected data; and generating an output database corresponding to the contents of the staging database, wherein the output database is a multicharacter expression database populated with a plurality of unique, multi-character expressions, each multi-character expression having a predetermined hierarchical structure which defines a relationship between each entity, attribute and entity occurrence with every other entity, attribute and entity occurrence in the multicharacter expression database.

2. The method of claim 1, in which the structure of data in the staging database is independent of a data model relationship between the entities, attributes and entity occurrences in the staging database.

3. The method of claim 1 or claim 2, in which the entity occurrences in the staging database have the same format as the corresponding entity occurrences in the data sources.

4. The method of any preceding claim, in which the selected data are one or more selected attributes.

5. The method of any preceding claim, in which the staging database is a relational database.

6. The method of any preceding claim, in which the staging database is not a multicharacter expression database.

7. The method of any preceding claim, comprising generating a form defining the selected attributes and mapping, onto the staging database, the selected attributes of the entity occurrences stored in the plurality of data sources.

8. The method of claim 7, in which the form defines composite attributes based on one or more attributes in the source data.

9. The method of claim 7 or claim 8, in which the form defines mapping rules to convert variants of an attribute in the source data to a transformed attribute in the staging database.

10. The method of any of claims 7 to 9, in which the form retains, in the staging database, divergent variants of attributes from the plurality of data sources.

11. The method of any preceding claim, comprising updating the form to change the selected attributes.

12. The method of any preceding claim, comprising:

generating an updated staging database following a change to the plurality of data sources, the form, or the selected attributes.

13. The method of claim 12, wherein the updated staging database comprises the staging database and a difference file detailing changes between the staging database and the updated staging database.

14. The method of any preceding claim, in which the plurality of data sources comprises a plurality of respective source databases.

15. The method of claim 14, in which at least two of the plurality of source databases differ in their database management systems, each database management system defining a set of programs that enable a user to store, modify, and extract information from the respective source databases.

16. The method of claim 15, comprising using the output database to allow a user to access data stored in the plurality of data sources without using the database management system specific to the plurality of databases.

17. The method of any preceding claim, comprising storing said multicharacter expressions in an expression set table linking each element of each expression with a natural language phrase relating the expression to a hierarchical level and a position in a data model.

18. An apparatus comprising one or more processors and memory including computer program, the memory and computer program configured to, with the one or more processors, cause the apparatus to configured to perform the method of any preceding

5 claim.

19. A computer program configured to, when executed, cause a processor to perform the method of any of claims 1 to 17.