GB2327133A - Automatic recognition and expansion of abbreviated medical descriptions

Info

Publication number: GB2327133A
Application number: GB9710017A
Authority: GB
Inventors: John Young Thomson; Markus Simon Bolton
Original assignee: AINTREE HOSPITALS NHS TRUST; SYSTEM C Ltd
Current assignee: AINTREE HOSPITALS NHS TRUST; SYSTEM C Ltd
Priority date: 1997-05-17
Filing date: 1997-05-17
Publication date: 1999-01-13
Anticipated expiration: 2017-05-17
Also published as: GB2327133B; GB9710017D0

Abstract

Short phrases of abbreviations, for example "chro duo ulc", describing diseases are recognised and expanded by reference to a database containing lists of primary and secondary medical terms. Primary terms point only to those secondary terms which can validly occur with that particular primary term in clinical phrases. For example, "duodenitis" may point to "acute" and "chronic", while "ulcer" may point to "acute", "chronic", "duodenal", "gastric" and "venal". Each secondary term may point to other secondary terms. The secondary terms related to a first input term from the phrase may be compared with other terms in the phrase, and the secondary terms for which this comparison is successful may be associated with the first input term. ICD-10 coding may be implemented.

Description

METHOD AND APPARATUS FOR DATA ENTRY AND CODING

The present invention relates to a method and apparatus for use in connection with a large text database, in particular for entering and coding new data, and for efficient searching and retrieval of stored data.

One example of a large database is the records of patients treated by a hospital or other medical facility, including records of clinical diagnosis and treatments applied. It is desired to conform the clinical information to a standard clinical code for use in epidemiological studies, research, audit, management information, performance indicators, and the correct allocation of health-care resources. There are at least three established medical coding systems, including the ICD code (International Classification of Diseases), the OPCS code (Office of Population Censuses and Surveys) and the Read codes. These medical coding systems have to be updated to keep pace with advances in medical science, and the ICD codes are now in their tenth edition, ICD-10.

Hospitals commonly employ skilled coding staff who are supplied with manually-written patient discharge forms including a patient's details, clinical diagnosis and treatment. This clinical information is converted to ICD-10 or OPCS 4 codes and is entered on to a database together with relevant patient details. Skilled coders require a detailed knowledge of the structure and rules of the coding system, and a good understanding of the clinical information presented for coding. A problem arises in that manual coding adds significantly to the bureaucratic overhead costs of a hospital. Further, coding accuracy can be poor, commonly due to lack of detail in the original clinical information. Another problem is that once patient records have been entered on to a database together with coding information, search and retrieval of relevant information is difficult and time consuming.

An object of the present invention is to provide a method and apparatus to assist entry of new data into a database. In the example of a database containing patient records of a hospital, it is desired to match clinical information such as a diagnosis written in a shorthand form, which may be acceptable within the hospital, to a formal longhand clinical diagnosis readily understood by clinical or other personnel anywhere in the world. This is an important first step in producing a database with meaningful information for use beyond a restricted set of skilled users.

A preferred aim of the invention is to assist the process of coding new data according to an established coding system, such as a medical coding system like ICD-10.

Another preferred aim is to provide a method and apparatus for assisting efficient data search and retrieval, in particular in a large database, where interrogation may be performed on one or more axes.

According to a first aspect of the present invention there is provided an apparatus for assisting entry and/or coding of new data in a database, comprising: a primary file means having a plurality of entries, wherein each entry comprises a primary term and a plurality of primary tokens each associated with said primary term; a secondary file means having a plurality of entries, wherein each entry comprises a secondary term, a primary token associated with at least one of said primary terms in said primary file, and one or more secondary tokens; wherein said primary and secondary tokens each point to an entry or group of entries of the secondary file means.

Also according to the present invention there is provided a method for assisting entry and/or coding of new data in a database, said method for use with apparatus defined above, said method comprising the steps of:

A. receiving said new data and selecting one or more keywords;
B. searching said primary terms of said primary file to produce a primary list of candidate primary terms matching said keyword;
C. where a plurality of keywords are selected, for a first of said keywords:
D. for a first candidate in said primary list:
E. retrieving from said primary file said primary tokens associated with said primary term;
F. searching said secondary file using a first of said primary tokens to produce a list of candidate secondary terms;
G. comparing a first of said secondary candidate terms with each of said keywords other than said first keyword, and adding any matched secondary terms to said primary list in association with said first candidate primary term;
H. repeating step G for each token associated with said first candidate primary term;
I. repeating steps F-H for each candidate in said primary list;
J. repeating steps C-I for each of said plurality of keywords;
K. outputting said primary list including primary terms and associated secondary terms as potential matches to said new data.

The method and apparatus allow an iterative search to be performed to match keywords selected from new data, such as a text input string, to a primary term in the primary file and to secondary terms, representing qualifying sub-terms, in the secondary file. In this manner, textual input data is matched to a selected vocabulary held in the primary and secondary files.

One advantage of this technique is that the tokens associated with each term point to further relevant terms, which significantly reduces the volume of data searched and improves searching speed.

This basic method outputs all possible matches to the input text string and the user can select one or more of these for further processing. However, it is desired to select only the most relevant or most likely matches, and, in the context of a hospital patient database, to offer only those matches which are clinically sensible and to disregard those matches which are clinically impossible.

The tokens themselves ensure that potential matches are clinically sensible, by pointing to only those entries or groups of entries which can sensibly be associated with the primary term. In this way, general clinical knowledge is embedded into the database constructed according to this technique.

The two file means are conveniently held as separate files but can equally be treated as one file, or split between several files.

Each token uniquely identifies an entry or group of entries in the secondary file and in the preferred embodiment comprises five characters beginning with a nontextual character such as a dollar sign ($) followed by four alpha-numeric characters. The skilled person can select any suitable form of token and allocation system.

The method preferably includes the additional step of maintaining a status field associated with each candidate primary term in the primary list, where the status field has a plurality of positions corresponding in number to the number of keywords selected from the input text string. In use, each position is initially filled with the character "N" indicating that the corresponding keyword has not been matched. As primary and secondary terms are found matching that keyword, the corresponding entry in the status field is changed to "Y". There is then no need to continue searching against that keyword in the relevant iteration, which again improves response speed. When the entire search process is completed, only those candidates in the primary list having all "Y" in the status field are presented as output, and in most cases this would present only one output option.

Where a candidate secondary term itself has further tokens pointing to other secondary terms, these further secondary terms are themselves searched against the keywords and the status field updated. The method ends when all keywords have been searched. In some cases, tokens will be left unused and not matched to any keywords, because the text input string does not require that level of detail. If further detail is desired beyond the initially selected keywords, the user is presented with reminders of detail that might be desired in the form of picking lists derived from the unused tokens.

The technique also allows for handling synonyms and local shorthand notation or language variants. One example would be the abbreviation "FRAC" or the symbol "#" representing the full word "Fracture". These variants each have their own place in the primary and secondary files, but point to the relevant full term. This feature allows adaptation of the system according to local usage and improves universal usability.

Preferably, the step of receiving new data and selecting one or more keywords comprises the steps of ignoring irrelevant characters in a text input string, such as punctuation, and splitting the text string into words, such as by taking groups of characters between spaces.
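The keyword selection step can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `select_keywords` and the set `IGNORED` are hypothetical, since the description does not list which characters count as irrelevant. The characters "?" and "#" are kept because the description treats them as meaningful input.

```python
# Hypothetical set of characters to ignore; the description does not enumerate them.
IGNORED = set(".,;:!()[]\"'")

def select_keywords(text: str) -> list[str]:
    """Drop ignored punctuation, then take groups of characters between spaces."""
    cleaned = "".join(ch for ch in text if ch not in IGNORED)
    return cleaned.split()

print(select_keywords("chro duo ulc."))  # ['chro', 'duo', 'ulc']
```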

However, common text input will include not only clinical terms but also linking words such as "of", "to", "and", "in", and qualifiers such as "massive", "possible" or "?".

In the preferred embodiment these additional terms are stored in a tertiary file. At the end of the search process any unused keywords are compared with this tertiary file and if found are included in the output.

Any suitable search routine may be employed, but in the preferred embodiment each keyword is assumed to be a right truncation of a whole word and is compared with the first characters of a key field associated with each entry.

For example, the key field of the term "Fracture" may itself read "Fracture" and a keyword such as "Frac" would be considered a likely match.
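A minimal Python sketch of the right-truncation comparison is given below. The function name `is_right_truncation` is hypothetical, and case-insensitive comparison is an assumption, as the description does not say how case is handled.

```python
def is_right_truncation(keyword: str, key_field: str) -> bool:
    """True if the keyword equals the first characters of the key field.

    Case is ignored here, which is an assumption rather than a stated rule.
    """
    return bool(keyword) and key_field.lower().startswith(keyword.lower())

print(is_right_truncation("Frac", "Fracture"))  # True
print(is_right_truncation("cer", "ulcer"))      # False
```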

The output taken from the primary list contains a string of terms generated from the keywords in the input string. It is desired to order the output terms such that they make grammatical sense.

Each entry in the primary and secondary files preferably further comprises a grammar field used to determine the relative position of that term in relation to other terms of the output. In the preferred embodiment the grammar field comprises a priority level such as between one and twenty, and the order of priority determines the order of the terms from left to right within the output.
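The ordering by grammar field can be illustrated with a short Python sketch. The priority values in `GRAMMAR` are hypothetical, as the description gives no actual values, and the sketch assumes that a lower priority number places a term further to the left.

```python
# Hypothetical grammar priorities in the range 1-20; not taken from the description.
GRAMMAR = {"chronic": 3, "duodenal": 8, "ulcer": 15}

def order_terms(terms: list[str]) -> list[str]:
    """Order output terms left to right by grammar priority (lower = further left, assumed)."""
    return sorted(terms, key=lambda term: GRAMMAR[term])

print(" ".join(order_terms(["ulcer", "chronic", "duodenal"])))  # chronic duodenal ulcer
```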

In addition to using the method and apparatus to produce a sensible longhand version of a text input, it is also possible to produce a coded notation. In the context of medical records, this coded notation allows classification according to a recognised system such as the ICD-10 codes.

Each entry in the primary and secondary files further comprises a code field for classifying that term according to a predetermined classification system. The preferred code field uses a total of thirteen characters representing for example process (type of injury or infection), body region, body tissue affected, and further specific details.

The codes are complementary. The codes of the output terms are added together to produce a final code. Thus, the code can represent the diagnosis and treatment in detail. This internal code can then be matched, for example using a lookup table, to a corresponding entry in a recognised coding system such as an ICD-10 code. The internal code is more detailed than the ICD-10 code, which has only four characters with optional fifth and sixth characters. Many internal codes will point to one ICD-10 code.

A further internal coding language is used for searching and retrieval of records in the database. In this retrieval coding, a plurality of search axes are determined according to the needs of the database, and each is given a code field in the database. In the preferred embodiment, separate code fields are provided for process, region, tissue and specific items, each preferably five characters long. By masking parts of these retrieval codes, the records in the database may be efficiently searched to identify only those of interest.

A preferred embodiment of the present invention will now be described with particular reference to a database of patient records of a hospital or other medical facility.

Also, the preferred embodiment is implemented in the form of a computer program but the skilled person will appreciate that the invention could equally be implemented solely by hardware or by a combination of software and hardware.

Turning first to the data entry routine, this has two main parts, namely (a) the relation of text input to a controlled clinical vocabulary, and (b) the selection of the most likely match from among several potential matches.

The data entry routine takes as input a text string such as may be typed at a data entry station by a user.

This string represents a shorthand notation of information such as a patient diagnosis, for example "chro duo ulc".

This shorthand may be unambiguous and well understood amongst a group of skilled clinical and other staff of a particular hospital, but is not universally understood. The first step in recording useful data is to transform this shorthand into a longhand medical description.

Keywords are taken from the input string and in this example there are three: "chro", "duo" and "ulc". In any diagnosis there will be a primary term and one or more qualifying terms. Conveniently, the primary terms are grouped in a primary file and are searched first by comparing the keyword against a search key for each entry.

If no matches are found, the search ends. Any 'hits' are noted as a list of candidate primary terms.

Each primary term may have tokens associated therewith which point to groups of qualifying terms located in a secondary file or files, which are the starting points for searching the secondary files. For example, the primary file may contain (amongst many others) the terms:

term            token 1   token 2   ...   token n
chromatrichia   (none)    (none)
duodenitis      $D001     (none)
ulcer           $U001     $D001

Each of the tokens either points to a term or group of terms in the secondary files which can validly be clinically associated with the relevant primary term, or confirms that no such links exist. For example, the token $D001 may point to a group of terms specifying chronicity such as "chronic" or "acute" which can be associated with the terms "duodenitis" and "ulcer". The token $U001 may point to secondary terms denoting anatomical regions where ulcers may occur. For example, the secondary file contains:

token   secondary term   secondary tokens
$D001   acute            ...
$D001   chronic
$U001   duodenal
$U001   gastric
$U001   venal

The keyword "chro" matches "chromatrichia" in the primary file and is a candidate, but has no linking tokens and so does not require further searching.

"Duo" matches "duodenitis" in the primary file which has the token $D001 and therefore offers access to potential matches for the other keywords. A search of the relevant group of terms in the secondary files matches "chro" with "chronic" but there is no match for "ulc".

"Ulc" matches "ulcer" in the primary file which has two tokens, and in the secondary file "duo" matches "duodenal" and "chro" matches "chronic".

In the simplest embodiment, all of these possible matches are output and the preferred match may be selected by the user. The number of potential matches in the output is inversely proportional to the detail of the initial text entry. It is therefore desired to reduce the output to one or perhaps only a few options to improve usability. A status matrix is created using a status field, with in effect a column for each keyword and a row for each candidate found in the primary file. Each row is initially set to "N" and changed to "Y" when a match is found in the primary and secondary files. After the search described above, the matrix would be:

                "chro"   "duo"   "ulc"
chromatrichia   Y        N       N
duodenitis      Y        Y       N
ulcer           Y        Y       Y

The candidate with all Y's or the greatest number of Y's is selected as the most likely match, in this case "ulcer chronic duodenal". A grammar function re-arranges these terms to "chronic duodenal ulcer" for output to the user for approval and ultimately for writing to the patient record.
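The worked example can be traced with the following Python sketch. It is a simplified illustration, not the patented implementation: the dictionaries `PRIMARY` and `SECONDARY` and the functions `search` and `best_match` are hypothetical names, the token spellings `$D001` and `$U001` are assumed readings of the example, and secondary terms with further tokens are not followed.

```python
# Hypothetical in-memory stand-ins for the primary and secondary files.
PRIMARY = {
    "chromatrichia": [],
    "duodenitis": ["$D001"],
    "ulcer": ["$U001", "$D001"],
}
SECONDARY = {
    "$D001": ["acute", "chronic"],
    "$U001": ["duodenal", "gastric", "venal"],
}

def search(keywords):
    """Return (terms, status) for each candidate primary term."""
    results = []
    for i, keyword in enumerate(keywords):
        for term, tokens in PRIMARY.items():
            if not term.startswith(keyword):   # right-truncation match
                continue
            status = ["N"] * len(keywords)
            status[i] = "Y"
            found = {}                         # keyword index -> secondary term
            for token in tokens:
                for secondary in SECONDARY.get(token, []):
                    for j, other in enumerate(keywords):
                        # Keywords already marked "Y" are not searched again.
                        if status[j] == "N" and secondary.startswith(other):
                            status[j] = "Y"
                            found[j] = secondary
                            break
            # Primary term first, then secondary terms in keyword order.
            terms = [term] + [found[j] for j in sorted(found)]
            results.append((terms, "".join(status)))
    return results

def best_match(keywords):
    """Pick the candidate with the greatest number of 'Y' entries."""
    results = search(keywords)
    if not results:
        return None
    return max(results, key=lambda result: result[1].count("Y"))[0]

for terms, status in search(["chro", "duo", "ulc"]):
    print(terms[0], status)
print(best_match(["chro", "duo", "ulc"]))  # ['ulcer', 'chronic', 'duodenal']
```

Skipping keywords whose status is already "Y" mirrors the speed-up described for the status field; the grammar re-ordering of the selected terms is a separate step and is not shown here.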

The terms in the secondary file may have tokens associated therewith which point to further groups of secondary terms and which are searched in the same way. In the preferred embodiment, each term may have up to ten tokens and up to twenty keywords may be selected from the input string.

There are situations where one keyword matches two or more synonymous terms in the primary file, and the same output string may then be reached by two or more routes. Also, a term may be used both as a primary term and as a secondary term and appear in both files, in which case the tokens, which embody clinical knowledge, ensure that a valid output is reached.

For example, the term "liver" may appear in both files, and the input string may be "liv neop". The term "liver" appears in the primary file with the token $$SAD (where double dollar sign $$ indicates that the term by itself is not a valid output). The token $$SAD points to a group of secondary terms denoting pathologies applicable to the liver, but not including "neoplasm" so giving YN in the status matrix. "Neop" matches the term "neoplasm" in the primary file, with a token $$SDT which points to secondary terms including "liver", giving YY. The output is correctly selected as "liver neoplasm".

Any unsubstituted tokens may be ignored by the user as representing unwanted detail, or may be offered to the user via picking lists to add extra detail to the initial text entry.

A tertiary file contains "exception" words which are a useful part of the input text but are not clinical terms.

These exception terms include linking words such as "of", "the", colloquial descriptive terms such as "mild", "massive", and query terms such as "possible" or "?". Any unused keywords are searched against this exception file and terms found are added to the output.
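The handling of exception words might look like the following Python sketch. The set `EXCEPTIONS` contains only the words quoted in the description, the function name `exception_terms` is hypothetical, and exact (rather than truncated) matching against the exception file is an assumption.

```python
# Exception words quoted in the description; a real tertiary file would hold more.
EXCEPTIONS = {"of", "the", "mild", "massive", "possible", "?"}

def exception_terms(keywords: list[str], status: str) -> list[str]:
    """Return the unused keywords (status 'N') that appear in the exception file."""
    return [
        keyword
        for keyword, flag in zip(keywords, status)
        if flag == "N" and keyword.lower() in EXCEPTIONS
    ]

print(exception_terms(["possible", "chro", "duo", "ulc"], "NYYY"))  # ['possible']
```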

A further aspect is coding the medical diagnosis according to an established system such as ICD-10. An internal code explicitly embodies the concepts, terms, synonyms and rules for the coding system, but also allows for additional detail beyond that expressed in the ICD-10 code, and allows for additional synonyms and local variants. The internal code also allows for context terms which do not appear in the literal wording of the ICD-10 code but which do influence selection of the correct code.

Examples of context terms relate to queries, certainty, pregnancy and post-operative condition.

Each entry in the primary and secondary files includes a code field containing part of a complementary internal code. The internal codes of all the terms in the output string are combined to give a final code relevant to the output term. This final code can then be transformed to a corresponding ICD-10 code using, for example, many-to-one mapping. An example internal code uses a string of up to ten alpha-numeric characters, followed by a modifier of three characters. Predominantly, terms in the primary file define a base portion of five characters, with up to a further five supplemental characters being defined from terms in the secondary file. The three character rule string defines the placing of the secondary characters in the combinatorial code, and also defines the optional fifth character used in the ICD-10 code.

In the preferred embodiment, the translator code consists of the five character base plus zero or more supplemental characters, the number and position of which is determined by the first two characters of the modifier, i.e. flags 1 and 2. The third flag holds any optional fifth character extension to the ICD codes. If flag 1 equals x and flag 2 equals y then the translator is the base plus y of the supplemental characters, starting at position x. For example:

Base = 123**
Supplemental = AB***

Flag 1 = 0, Flag 2 = 0: Translator code = 123**
Flag 1 = 1, Flag 2 = 2: Translator code = 123**AB
Flag 1 = 2, Flag 2 = 3: Translator code = 123**B**

The skilled person can select the internal coding system according to preference to give combinatorial codes which build up a complete code relating to the output terms and to a corresponding universal coding system.
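The flag rule can be checked with a short Python sketch. The function name `translator_code` is hypothetical, and treating flag 1 as a 1-based position is inferred from the three examples rather than stated outright; a zero in either flag is assumed to mean that no supplemental characters are appended.

```python
def translator_code(base: str, supplemental: str, flag1: int, flag2: int) -> str:
    """Base plus flag2 supplemental characters, starting at 1-based position flag1."""
    if flag1 < 1 or flag2 < 1:
        return base  # assumed: no supplemental characters are used
    start = flag1 - 1
    return base + supplemental[start:start + flag2]

print(translator_code("123**", "AB***", 0, 0))  # 123**
print(translator_code("123**", "AB***", 1, 2))  # 123**AB
print(translator_code("123**", "AB***", 2, 3))  # 123**B**
```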

Exception terms in the tertiary file are also associated with parts of the internal code and so may modify the final code. For example, the input "heroin induced amnesia" has a final internal code beginning V125* which corresponds to ICD-10 code F116. However, where the input also includes a query word such as "possible" or "?", the internal code is modified to point to the ICD-10 code F196 indicating that the drug is not known. The text output is still "Possible heroin induced amnesia", and the ICD-10 code reflects "amnesia caused by unknown drug". This is an example of the internal code and longhand diagnosis being more detailed than can be represented in the ICD-10 code.

The internal code copes with the way that the dagger and asterisk symbols are used in the ICD-10 code to synthesise compound subjects from different parts of the basically uni-axial classification. For example, "tuberculosis ureteritis" draws on both the A chapter (infection) and N chapter (disorders of the genito-urinary system). The technique also copes with variation in the mode of use of the fifth character. For example, for fractures the fifth character can represent "open" or "closed", but for musculoskeletal disorders the fifth character is related to body site.

Where patient records are ordered according to standard classifications such as ICD-10 and OPCS-4, interrogation of the database for non-standard criteria such as "open wounds", "fractures", "biopsies" or "injuries to nerves" is difficult because relevant entries are scattered widely across the classification with no common identifying characteristic. This problem arises because the standard classifications are uni-axial, being based, for example, on anatomy. Many common concepts are distributed as narrow terms in separate branches of classification with no linking index or pointer. In response to this problem, each entry in the primary and secondary files further comprises a plurality of retrieval code fields. In the preferred embodiment there are four fields, each comprising five alpha-numeric characters, and representing: 1. Process, 2. Body region, 3. Tissue, and 4. Specific.

The code in each of these four fields is hierarchical to represent sub-groups. For example, the Process field includes:

N****   non-trauma
T****   trauma
TF***   trauma - fracture
TW***   trauma - open wound
TW1**   trauma - open wound - bite

As for the internal code, the codes are combinatorial so that each term relates to part of the final retrieval codes. For example, the input "acute appendicitis" brings in:

term            process   region   tissue   special
appendicitis    NI***     IA***    G51**    *****
acute           *****     *****    *****    A****
combined code   NI***     IA***    G51**    A****

Note that the combined code is the same whether the input string is "acute appendicitis" or "infection of the appendix defined as acute", which deals with the problem of equivalence.
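One way to picture the combination is the Python sketch below, which merges the four retrieval fields character by character. The function names `combine_field` and `combine` are hypothetical, and the rule that an earlier term wins when two terms both supply a character at the same position is an assumption, since the description does not say how such conflicts are resolved.

```python
WILDCARD = "*"

def combine_field(a: str, b: str) -> str:
    """Per position, keep a's character unless it is a wildcard, then take b's."""
    return "".join(x if x != WILDCARD else y for x, y in zip(a, b))

def combine(codes):
    """Combine (process, region, tissue, special) tuples from several terms."""
    result = ("*****",) * 4
    for code in codes:
        result = tuple(combine_field(r, c) for r, c in zip(result, code))
    return result

appendicitis = ("NI***", "IA***", "G51**", "*****")
acute = ("*****", "*****", "*****", "A****")

print(combine([acute, appendicitis]))  # ('NI***', 'IA***', 'G51**', 'A****')
```

In this example the result does not depend on the order in which the terms are combined, because the two terms fill different fields.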

The retrieval codes allow fast access to data in response to user queries. For example, the data input routine is first used in a query mode to retrieve relevant terms from the primary and secondary files along with the relevant retrieval codes. These codes are then used as a mask to select only those patient records from the database matching the enquiry. The searching routine selects the order of the four retrieval codes for efficient searching.
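Masked retrieval can be sketched in Python as follows. The records in `RECORDS` are hypothetical and merely reuse code values quoted in this description; the function names `matches` and `query` are also hypothetical, and only the process and region fields are shown.

```python
def matches(mask: str, code: str) -> bool:
    """A mask matches when every non-'*' mask character equals the code character."""
    return len(mask) == len(code) and all(m in ("*", c) for m, c in zip(mask, code))

# Hypothetical patient records for illustration only.
RECORDS = [
    {"id": 1, "process": "TF***", "region": "3151*"},  # fracture, hand
    {"id": 2, "process": "TW1**", "region": "3151*"},  # open wound (bite), hand
    {"id": 3, "process": "NI***", "region": "IA***"},  # appendicitis
]

def query(process_mask: str, region_mask: str) -> list[int]:
    """Return the ids of records whose codes satisfy both masks."""
    return [
        record["id"]
        for record in RECORDS
        if matches(process_mask, record["process"])
        and matches(region_mask, record["region"])
    ]

print(query("TF***", "31***"))  # fractures of the upper limb -> [1]
print(query("T****", "3151*"))  # injuries of the hand -> [1, 2]
```

Because the codes are hierarchical, a short mask such as T**** selects every trauma sub-group, while a longer mask narrows the selection.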

For example, the input string "report all fractures of the upper limb" produces the following, where the process mask is used first and then the region mask.

term         process   region   tissue   special
fractures    TF***     *****    *****    *****
upper limb   *****     31***    *****    *****

A second example is "report all injuries of the hand", where the region mask is searched first because it is more detailed.

term       process   region   tissue   special
injuries   T****     *****    *****    *****
hand       *****     3151*    *****    *****

Two variants require explanation. One is where conditions have a strong clinical relationship but only a tenuous hierarchical relationship in the indexes used for the retrieval codes. For example, a pneumothorax may be a traumatic process and have the trauma mask T71**, but may also be a spontaneous process with the mask ND41*. There is no direct relationship between the two process masks and two separate searches must be made, although this is transparent to the user.

The second variation is where a condition may be a prime diagnosis or it may be a causative agent of another prime diagnosis. For example, diabetes mellitus may be the diagnosis or it may be the cause of diabetic retinopathy.

Diabetic retinopathy has a process code NOD**, and a specific code NE4**, while diabetes mellitus has a process code of NE4** and a specific code of *****. Even though these two conditions share the same code NE4** determined from the rules of code allocation, each can be uniquely identified because one is a process code and one a specific code. Searching both the process and specific fields for NE4** would cover both terms.

The apparatus and methods described herein have a number of advantages. The controlled vocabulary embodied in the primary and secondary files is highly structured and requires minimal space. It is easily updated and adapted by adding new terms and variants according to local needs, whilst retaining universal readability of the output text.

Further, the direct translation of the input text to a code such as ICD-10 avoids any intermediate manual classification which may be difficult to prove and maintain. Further still, the system is convenient and allows a user-friendly interface.

Claims

1. An apparatus for assisting entry and/or coding of new data in a database, comprising: a primary file means having a plurality of entries, wherein each entry comprises a primary term and a plurality of primary tokens each associated with said primary term; a secondary file means having a plurality of entries, wherein each entry comprises a secondary term, a primary token associated with at least one of said primary terms in said primary file, and one or more secondary tokens; wherein said primary and secondary tokens each point to an entry or group of entries of the secondary file means.

2. An apparatus as claimed in claim 1, wherein said primary and secondary files are held as separate files.

3. An apparatus as claimed in claim 1, wherein said primary and secondary files are held as two parts of one file.

4. An apparatus as claimed in claim 1, wherein said primary and secondary files are split between several files.

5. An apparatus as claimed in claim 1, wherein each said token uniquely identifies an entry or group of entries of the secondary file means.

6. An apparatus as claimed in claim 5, wherein each said token comprises five characters, commencing with a non-textual character followed by four alpha-numeric characters.

7. An apparatus as claimed in claim 1, wherein each entry in the primary and secondary files further comprises a code field for classifying that term according to a predetermined classification system.

8. An apparatus as claimed in claim 7, wherein the code field uses a total of thirteen characters representing process (type of injury or infection), body region, body tissue affected, and further specific details.

9. An apparatus as claimed in claim 8, wherein the codes are complementary, so that the codes of the output terms when added together produce a final code to represent the diagnosis and treatment in detail.

10. An apparatus as claimed in claim 9, wherein a further internal coding language is used for searching and retrieval of records in the database.

11. A method for assisting entry and/or coding of new data in a database, wherein said database includes: a primary file means having a plurality of entries, wherein each entry comprises a primary term and a plurality of primary tokens each associated with said primary term; a secondary file means having a plurality of entries, wherein each entry comprises a secondary term, a primary token associated with at least one of said primary terms in said primary file, and one or more secondary tokens; and wherein said primary and secondary tokens each point to an entry or group of entries of the secondary file means, said method comprising the steps of: A. receiving said new data and selecting one or more keywords; B. searching said primary terms of said primary file to produce a primary list of candidate primary terms matching said keyword; C. where a plurality of keywords are selected, for a first of said keywords: D. for a first candidate in said primary list: E. retrieving from said primary file said primary tokens associated with said primary term; F. searching said secondary file using a first of said primary tokens to produce a list of candidate secondary terms; G. comparing a first of said secondary candidate terms with each of said keywords other than said first keyword, and adding any matched secondary terms to said primary list in association with said first candidate primary term; H. repeating step G for each token associated with said first candidate primary term; I. repeating steps F-H for each candidate in said primary list; J. repeating steps C-I for each of said plurality of keywords; and K. outputting said primary list including primary terms and associated secondary terms as potential matches to said new data.

12. The method of claim 11, including the additional step of maintaining a status field associated with each candidate primary term in the primary list, where the status field has a plurality of positions corresponding in number to the number of keywords selected from the input text string.

13. The method of claim 11, in which each position is initially filled with the character "N" indicating that the corresponding keyword has not been matched, and as primary and secondary terms are found matching that keyword, the corresponding entry in the status field is changed to "Y".

14. The method of claim 11, in which the step of receiving new data and selecting one or more keywords comprises the steps of ignoring irrelevant characters in a text input string, and splitting the text string into words.

15. The method of claim 14, in which additional terms are stored in a tertiary file, and at the end of a search process any unused keywords are compared with this tertiary file and if found are included in the output.

16. The method of claim 11, in which each keyword is assumed to be a right truncation of a whole word and is compared with the first characters of a key field associated with each entry.

17. The method of claim 11, in which each entry in the primary and secondary files further comprises a grammar field used to determine the relative position of that term in relation to other terms of the output.

18. The method of claim 17, in which the grammar field comprises a priority level between one and twenty, and the order of priority determines the order of the terms from left to right within the output.