US20020010714A1 - Method and apparatus for processing free-format data - Google Patents
- Publication number
- US20020010714A1 (application US09/898,948)
- Authority
- US
- United States
- Prior art keywords
- data
- free
- accordance
- format
- text object
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Abandoned
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/14—Tree-structured documents
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99931—Database or file accessing
- Y10S707/99933—Query processing, i.e. searching
- Y10S707/99934—Query formulation, input preparation, or translation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99942—Manipulating data structure, e.g. compression, compaction, compilation
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10—TECHNICAL SUBJECTS COVERED BY FORMER USPC
- Y10S—TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y10S707/00—Data processing: database and file management or data structures
- Y10S707/99941—Database schema or data structure
- Y10S707/99943—Generating database or data structure, e.g. via user interface
Definitions
- the present invention relates generally to the processing, storage and analysis of information in the form of free-format data, and particularly, but not exclusively, to a method and apparatus for interpreting free-format text.
- One of the main purposes of computer systems is to manage information. This management of information is performed internally by data management systems.
- data management systems may be divided into two categories: 1) Database management systems; and 2) Text search and retrieval systems.
- the first type of data management system imports and manipulates data into internal representations so that the data may be located and modified. When required, these systems generate a suitable representation of this data which is read by humans or used by another system.
- This category of data management system includes: hierarchical, network, relational, object-oriented database management systems and knowledge based management systems.
- each of the attribute fields/slots has a format which can be, for example, integer, real number, boolean, character etc. Others are records/objects. Some fields/slots have specific formats (e.g., date, time), but yet others are free-format text.
- the free-format data “12 Pitt Street, NORTH SYDNEY” has a number of “elements”. Each element has an associated “attribute”. An attribute of the element “NORTH” is that it is a “geographical indicator”. An attribute of the element “12” is that it is a “number”. Note that the “low level” elements correspond to the “tokens” of data, i.e., the element “NORTH” is a token of the data. The data also includes higher level elements, however, e.g., “NORTH SYDNEY” is an element which includes two tokens and this element has the attribute of being a “town”. An attribute of the entire data “12 Pitt Street, NORTH SYDNEY”, i.e., the total “element”, is that it is an “address”. An alternative term for element is “component”.
- address data may be stored in a single field labelled “address”.
- This field contains the address in free-format form and it is therefore not possible with present database technology to perform normal database operations on individual elements of the address—those elements cannot be accessed separately (apart from the total combination of elements which makes up the address, which can of course, be accessed as a whole, as “address”).
- Natural language processing systems are known that employ “Semantic Grammars” to encode semantic information into a syntactic grammar. These systems are mainly used to provide a natural language interface to other systems such as a data base management system. The following description comes from a book by Patterson, D. W. “Artificial Intelligence and Expert Systems”. “. . . They use context-free rewrite rules with non-terminal semantic constituents.
- the Constituents are categories or metasymbols such as attribute, object, present (as in display or print), and ship, rather than NP (Noun Phrase), VP (Verb Phrase), N (Noun), V (Verb), and so on. . . .
- Semantic grammars have proven to be successful in limited applications including LIFER, a data base query system distributed by the US Navy . . . and a tutorial system named SOPHIE which is used to teach the debugging of circuit faults. Rewrite rules in these systems essentially take the forms S -> What is <OUTPUT-PROPERTY> of <CIRCUIT-PART>?
- wh-queries such as What is the name of the carrier nearest to New York? Who commands the Kennedy? etc . . .
- the input statement ‘Print the length of the Enterprise’ would fit with the LIFER top grammar rule (LTG) of the form <LTG> -> <PRESENT> the <ATTRIBUTE> of <SHIP> where print matches <PRESENT>, length matches <ATTRIBUTE>, and the Enterprise matches <SHIP>.
- LIFER top grammar rule (LTG)
- Other typical lexicon entries that can match <ATTRIBUTE> include CLASS, COMMANDER, FUEL, BEAM, LENGTH, and so on.”
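The rule matching described in the quoted passage can be sketched in a few lines of Python. This is only an illustration, not LIFER itself: it swaps in a regular expression for a real context-free parser, and the `LEXICON` contents and the `compile_rule` helper are hypothetical names invented for the example.

```python
import re

# Hypothetical lexicon: each semantic category lists phrases that can fill it.
LEXICON = {
    "PRESENT": ["print", "display", "show"],
    "ATTRIBUTE": ["class", "commander", "fuel", "beam", "length"],
    "SHIP": ["the enterprise", "the kennedy"],
}

def compile_rule(template: str) -> re.Pattern:
    """Turn a rule such as '<PRESENT> the <ATTRIBUTE> of <SHIP>' into a regex.

    Each <CATEGORY> becomes a named group listing its lexicon entries;
    everything else is matched literally.
    """
    parts = []
    for token in re.split(r"(<[A-Z]+>)", template):
        if token.startswith("<") and token.endswith(">"):
            name = token[1:-1]
            alternatives = "|".join(re.escape(w) for w in LEXICON[name])
            parts.append(f"(?P<{name}>{alternatives})")
        else:
            parts.append(re.escape(token))
    return re.compile("".join(parts), re.IGNORECASE)

LTG = compile_rule("<PRESENT> the <ATTRIBUTE> of <SHIP>")
match = LTG.fullmatch("Print the length of the Enterprise")
bindings = match.groupdict() if match else {}
print(bindings)
```

A statement that does not fit the rule, such as "Sink the Enterprise", simply fails to match, which reflects the fixed structure these interfaces rely on.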
- Although the interface is flexible, the database they interface to has a fixed structure, and these systems are unable to perform changes on the original (human readable) data.
- KBMS knowledge based management systems
- the text search and retrieval category of data management system does not import the data but builds searchable indices which point to the original data.
- This category includes: document storage & retrieval systems; and Internet search engines.
- text object as used in the following description should not be confused with the terminology “text object” which has been used in systems to describe software techniques which assist in the storage and transfer of pieces of text data between computer systems by encapsulating the text string.
- Techniques which have used the term “text object” range from the “String” object employed within Apple Computer's operating systems (where the object contains a leading two byte “length” value and the text string) to the “Compound String” object employed by the X-Windows operating system (where the object encapsulates multiple encodings, language translations and font styles of one piece of information.)
- the present invention provides a method of processing free-format data stored in a computing system, comprising the steps of examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, producing additional data relating to this information, in the form of a text object which includes pointer means enabling access to the elements of the free-format data, and the additional data being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- the text object of the present invention does not encapsulate the text string, as discussed above.
- the text object in the terms of the present invention provides a “semantic layer” between the actual text data and, for example, an application software system which may need to access and/or manipulate the text data.
- the text object is the additional data, related to the semantic and syntactic information obtained from examination of the data elements, and a pointer means (such as a key) which can lead back to the elements of the free-format data (e.g., back to the text string which forms the free-format data).
- the additional data preferably allows identification of the attributes of the data which have been obtained by the examination of the data. For example, in the “12 Pitt Street, NORTH SYDNEY” example given in the preamble, the various attributes of the data, e.g., “street” equals “12 Pitt Street”; “street number” equals “12”; “town” equals “NORTH SYDNEY”, etc., are identified by the additional data and the pointer means preferably allows access to the elements of the data which are associated with those attributes.
- the additional data effectively provides “virtual data fields”—the data fields do not exist as they do in a normal database which would have a column field head for each attribute.
- the free-format data can be accessed on an attribute by attribute basis using the present invention, as if actual fields for those attributes did exist.
- the preferred embodiment of the invention thus operates to create “virtual data fields” which, preferably, allow all normal database operations on free-format text, without having to create actual database fields for the free-format text.
- the free-format text can remain stored as it is in the same location (usually database).
- each data record can have many attributes, which differ from attributes of other addresses e.g., England has counties, the USA has states. To produce actual conventional database fields for all the attributes for international addresses would be an almost impossible task.
- each record of free-format data can be taken and processed to produce a (small) number of virtual data fields for that particular record in the form of a text object. The text object for each record can then be queried separately by an appropriate query processing means to provide all the normal database operations for that record. The data itself may stay in place.
- the step of examining preferably includes the step of parsing the free-format data.
- a text object preferably enables manipulation of the data to carry out all the normal database operations, such as changing the record, locating an element of the record, retrieving information from the record, etc.
- the information which may be provided by the text object preferably includes information on the elements of the data.
- the information may also include matching information (such as phonetics) to facilitate comparison of one record of data with another record of data, parsing priority information to assist in the processing of ambiguous free-format text, etc.
- the text object preferably includes attribute-type identifiers accessible to enable identification of attributes of the free-format data and pointer means for locating elements of the data having the particular attribute.
- the text object comprises a plurality of parts in the form of “component nodes”.
- a plurality of component nodes may be associated together in a text object in a predetermined hierarchy.
- a plurality of component nodes may be considered to be “nested” together in the form of a “text node tree” which may have a plurality of branches associating various component nodes with each other in a predetermined hierarchy.
- Each component node may comprise:
- an attribute type identifier for the classification of an attribute of the free-format data which is associated with that component node;
- a pointer to the beginning of a sub-string within the text object's text string, i.e., the beginning of the element associated with the component node;
- a matching weight (to indicate the relative importance of this element when performing comparisons with other text objects);
- a parsing priority value (giving a notional “priority” to the elements of the free-format data associated with the component node so that a priority may be allocated and used in the determination of the best interpretation of free-format text when ambiguities exist).
- subordinate component nodes need not be physically nested within their parent component node; instead, each component node may just contain a list of pointers to its subordinate component nodes so that the subordinate component nodes can be “found” from the component node which includes the list.
- Each component node preferably relates to one particular attribute of the free-format data, as identified by the attribute type identifier in the component node.
- Component nodes which are relatively high in hierarchy may contain or point to a plurality of other component nodes, whereas those component nodes which are the lowest in the hierarchy may not contain or point to any other component nodes as the next step down in the hierarchy is the associated element of the free-format data.
- the hierarchy is determined by the parsing of the free-format data.
- one attribute of a record of address data may be a <Street>, e.g. “12 Pitt Street”.
- Sub attributes of the <Street> component are <Street number> “12”, <Street name> “Pitt” and <Street type> “Street”.
- the <Street> component node will therefore list three other sub component nodes, having attribute type identifiers <Street number>, <Street name> and <Street type>.
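The component node hierarchy for the “12 Pitt Street” example can be pictured with a small Python sketch. The field names, the explicit `length` field, the `Town` and `Address` identifiers and the `node` helper are assumptions made for illustration; the source only describes the node contents in general terms.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Tuple

TEXT = "12 Pitt Street, NORTH SYDNEY"  # the free-format data stays as one string

@dataclass
class ComponentNode:
    attribute_type: str            # attribute type identifier, e.g. "Street"
    start: int                     # pointer: offset of the element in TEXT
    length: int                    # assumed field: length of the element
    matching_weight: float = 1.0   # relative importance when comparing
    parsing_priority: int = 0      # used to rank ambiguous interpretations
    children: List["ComponentNode"] = field(default_factory=list)

    def value(self, text: str) -> str:
        """Follow the pointer back to the element in the original text."""
        return text[self.start:self.start + self.length]

def node(attribute_type: str, element: str, children=()) -> ComponentNode:
    """Hypothetical helper: locate the first occurrence of `element` in TEXT."""
    return ComponentNode(attribute_type, TEXT.index(element), len(element),
                         children=list(children))

street = node("Street", "12 Pitt Street", [
    node("Street number", "12"),
    node("Street name", "Pitt"),
    node("Street type", "Street"),
])
address = node("Address", TEXT, [street, node("Town", "NORTH SYDNEY")])

def walk(n: ComponentNode, depth: int = 0) -> Iterator[Tuple[int, str, str]]:
    """Yield (depth, attribute type, element value) for the whole tree."""
    yield depth, n.attribute_type, n.value(TEXT)
    for child in n.children:
        yield from walk(child, depth + 1)

for depth, attribute_type, value in walk(address):
    print("  " * depth + f"<{attribute_type}> {value!r}")
```

Because every node stores only offsets, the text string itself is never copied into the tree.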
- each component node could be considered to be a text object itself.
- This recursive definition allows all the functions of the text object of the present invention to be applied to each attribute.
- the text object may also comprise other data structures which assist in the quick location of specific component nodes.
- An example of such a structure is a lookup table containing all the attribute type identifiers and a pointer to their associated component nodes.
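A lookup table of this kind might be built as follows. This is a sketch under assumptions: the text node tree is represented as nested dictionaries with `type`, `span` and `children` keys, none of which are prescribed by the source.

```python
from collections import defaultdict

TEXT = "12 Pitt Street, NORTH SYDNEY"

# Assumed representation of a text node tree: nested dicts with (start, end) spans.
tree = {"type": "Address", "span": (0, 28), "children": [
    {"type": "Street", "span": (0, 14), "children": [
        {"type": "Street number", "span": (0, 2), "children": []},
        {"type": "Street name", "span": (3, 7), "children": []},
        {"type": "Street type", "span": (8, 14), "children": []},
    ]},
    {"type": "Town", "span": (16, 28), "children": []},
]}

def build_lookup(root: dict) -> dict:
    """Map each attribute type identifier to the component nodes carrying it."""
    table = defaultdict(list)
    stack = [root]
    while stack:
        current = stack.pop()
        table[current["type"]].append(current)
        stack.extend(current["children"])
    return dict(table)

lookup = build_lookup(tree)

def element(attribute_type: str) -> str:
    """Fetch the first element with the given attribute without walking the tree."""
    start, end = lookup[attribute_type][0]["span"]
    return TEXT[start:end]

print(element("Street name"))
```

The table holds references to the same node dictionaries as the tree, so it speeds up location of a node without duplicating any data.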
- the query processing means is preferably a software application engine which is configured to be able to use the text object to answer questions on the data and access the data to manipulate it (e.g., correct it if it is in error).
- the method preferably also includes the further step of preparing an “index” which facilitates comparison of elements of a plurality of records of free-format data.
- the index is preferably in the form of a table (termed by the inventors a “text object index”) including columns, column headings and data, very much in the same way as a conventional database, except that it is prepared from the additional data for each of the plurality of data records.
- the text object index preferably includes a table with a column for the attribute type identifier, a column for representative value keys and a column for user supplied record identifiers.
- the representative value key preferably provides a value representative of a feature of the element associated with the appropriate component type identifier, e.g., a phonetic value for elements which are proper nouns (e.g., Smith) or a numeric identifier for common words (e.g., Street).
- the section on text string matching below contains more details regarding the values of the representative value key.
- the user supplied record identifier will identify to the user which record of free-format data is being compared or accessed i.e., is a pointer which enables access to the record.
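The three-column text object index can be illustrated with a short Python sketch. The `representative_key` function below is a deliberately crude stand-in (it is not a real phonetic algorithm and is not taken from the source), and the `COMMON_WORDS` table and record contents are hypothetical.

```python
# Hypothetical numeric identifiers for common words; variants share one identifier.
COMMON_WORDS = {"STREET": 1, "ST": 1, "ROAD": 2, "RD": 2}

def representative_key(value: str) -> str:
    """Return a key that similar-sounding or equivalent values share.

    Common words map to a numeric identifier; other words get a crude
    consonant skeleton (an assumed placeholder for a phonetic code).
    """
    word = value.upper().strip(".")
    if not word:
        return ""
    if word in COMMON_WORDS:
        return f"#{COMMON_WORDS[word]}"
    key = word[0]
    for ch in word[1:]:
        if ch in "AEIOUHWY " or ch == key[-1]:
            continue  # drop vowel-like letters and collapse repeated letters
        key += ch
    return key

# Elements assumed to have been identified already by parsing each record.
records = {
    "R1": {"Street name": "Pitt", "Street type": "Street"},
    "R2": {"Street name": "Pit", "Street type": "St."},
    "R3": {"Street name": "George", "Street type": "Road"},
}

# Rows of (attribute type identifier, representative value key, record identifier).
index = [(attribute, representative_key(value), record_id)
         for record_id, elements in records.items()
         for attribute, value in elements.items()]

def find(attribute: str, value: str) -> list:
    """Return record identifiers whose element shares the value's key."""
    key = representative_key(value)
    return sorted(rid for attr, k, rid in index if attr == attribute and k == key)

print(find("Street name", "Pitt"))
print(find("Street type", "Street"))
```

Comparing keys rather than raw strings lets spelling variants such as “Pit” and “Pitt”, or “St.” and “Street”, land on the same index rows.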
- Where a text object index is prepared, a text object having a plurality of component nodes containing attribute-type identifiers and other data may not be necessary. All that may be required to access the data and carry out database operations is the query processing engine and the text object index.
- the text object index may be prepared directly from the examination of the data and the text object index includes text objects for a plurality of records (i.e., additional data plus pointer to record).
- the text object as a separate “component node structure” can therefore be dispensed with, or is not needed in the first place as a separate entity. Instead, it is incorporated in the text object index as additional data plus pointers.
- the text object includes “matching” values (or procedures to create these values) for low level matching elements of the free-format text
- a free-format record containing a street name value in Kanji may be compared with a street name element in Arabic by comparing respective matching values.
- the street name for each record could be the same street, but merely being expressed in different languages in the free-format data.
- the matching information provided by this aspect of the present invention therefore enables comparison of elements of free-format text expressed in different written languages.
- Matching values may be generated during processing of text objects, and need not be stored in the text object. That is, they can be generated “on the fly” via procedures designated by the query processing engine. See later on in the description.
- the step of examining the elements of the data to determine the components preferably comprises the step of parsing the free-format data in accordance with grammar rules applied by a domain object.
- the domain object is preferably constructed by a domain construction process which uses as input data: character definition data, regular expression definition data, and grammar data.
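As a rough illustration of how the three inputs might combine, the sketch below builds a toy domain object in Python. Everything here is an assumption: the `Domain` class, the format of the definitions, and the single flat grammar rule are invented for the example and are far simpler than a real domain construction process.

```python
import re

# Assumed input formats (not specified by the source).
CHARACTER_DEFS = {"digit": "0-9", "letter": "A-Za-z"}
REGEX_DEFS = {  # token type -> pattern; {name} refers to a character definition
    "Number": r"[{digit}]+",
    "StreetType": r"(?:Street|St|Road|Rd)\b",
    "Word": r"[{letter}]+",
}
GRAMMAR = {  # attribute -> ordered (sub-attribute, token type) pairs
    "Street": [("Street number", "Number"),
               ("Street name", "Word"),
               ("Street type", "StreetType")],
}

class Domain:
    """Toy domain object: tokenizes text, then applies flat grammar rules."""

    def __init__(self, character_defs, regex_defs, grammar):
        # Expand character definitions into each pattern and join the token
        # types into one alternation; earlier entries take precedence.
        self.token_re = re.compile("|".join(
            f"(?P<{name}>{pattern.format(**character_defs)})"
            for name, pattern in regex_defs.items()))
        self.grammar = grammar

    def parse(self, text):
        """Return a component node (as a dict) for the first rule that fits."""
        tokens = [(m.lastgroup, m.start(), m.end())
                  for m in self.token_re.finditer(text)]
        for attribute, parts in self.grammar.items():
            kinds = [kind for _, kind in parts]
            for i in range(len(tokens) - len(kinds) + 1):
                window = tokens[i:i + len(kinds)]
                if [t[0] for t in window] == kinds:
                    children = [{"type": name, "span": (start, end)}
                                for (name, _), (_, start, end) in zip(parts, window)]
                    return {"type": attribute,
                            "span": (window[0][1], window[-1][2]),
                            "children": children}
        return None

TEXT = "12 Pitt Street, NORTH SYDNEY"
street = Domain(CHARACTER_DEFS, REGEX_DEFS, GRAMMAR).parse(TEXT)
print(street)
```

The order of the sub-attributes in the grammar rule fixes the order of the child nodes, which is one simple way the grammar can determine the hierarchy of the resulting tree.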
- the hierarchy of the component nodes of the text node tree is preferably determined by the grammar rules for the particular domain object.
- An embodiment of the present invention may be implemented by a software application which includes a domain object and a query processing means.
- the domain object is arranged to examine free-format data to produce a text object which can be then used by the query processing means to enable all database operations on the free-format data.
- the free-format data may be stored in any conventional way, such as in a conventional database on a computer system.
- the free-format data may also be stored as a string in the text object.
- the software application comprising the domain object and query processing engine would be used to process the data without affecting its storage in the database.
- the present invention also has great potential for the future structuring and ordering of data. For example, using the present invention it may be possible to greatly reduce the number of fields which are required to store data in a database. Considering the example given above, of international name and address data, at present it is not possible for a database to deal with international address data in a single field, because international address data has many different attributes. With the present invention, however, international addresses may be kept in a single free-format field containing all the international address records.
- Processing by the present invention provides each individual international address record with its own set of virtual data fields allowing comparison with other records via the query processing means, manipulation and access to information of all elements of each data record. Indeed, it is possible to provide a single domain object for all international addresses. Any free-format data could be processed in this way.
- the invention is not limited to address data.
- the present invention provides a method of enabling access to free-format data stored in a computing system, including a plurality of free-format data records, comprising the steps of storing additional data relating to semantic and syntactic information (attributes) about the data for each data record, the additional data being in the form of a text object associated with each data record, the text object including pointer means enabling access to elements of each free-format data record, the additional data being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- the text object preferably includes any or all of the properties of the text object as discussed above in relation to the first aspect of the invention and the text object is preferably produced by an examination including any or all of the features as discussed above.
- the present invention further provides a method of enabling access to free-format data stored in a computing system, including a plurality of free-format data records, comprising the steps of storing additional data relating to semantic and syntactic information (attributes) about the data of each data record, the additional data being in the form of a text object index which includes attribute-type identifiers for elements of each data record and pointers to each data record, the text object index being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- the text object index preferably includes any or all of the properties of the text object index as discussed above in relation to the first aspect of the invention.
- the text object index is preferably produced by process steps as discussed above in relation to the first aspect of the invention.
- the present invention provides a processing system for processing free-format data stored in a computing system, the apparatus including means for examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, means for producing additional data relating to this information, in the form of a text object which includes pointer means enabling access to the elements of the free-format data, and a query processing means which is arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- the examination means and means for producing are arranged to produce a text object with any or all of the features as discussed above in relation to the first aspect of the invention, by applying, preferably, the same methods of examination.
- the present invention further provides a processing system for enabling access to free-format data stored in a computing system, including a plurality of free-format data records, the processing system comprising additional data relating to semantic and syntactic information (attributes) about the data for each data record, stored and accessible by the processing system, the additional data being in the form of a text object associated with each data record, the text object including pointer means enabling access to elements of each free-format data record, and a query processing means arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- the present invention further provides a processing system for enabling access to free-format data stored in a computing system, including a plurality of free-format data records, the processing system comprising the additional data relating to semantic and syntactic information (attributes) about the free-format data for each data record, the additional data being in the form of a text object index which includes attribute type identifiers for elements of each data record and pointers to each data record, and a query processing means arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- the present invention yet further provides an apparatus including a domain object arranged to process free-format data to produce a text object, the text object including any or all of the features of the text object as discussed above in relation to previous aspects of the present invention.
- the step of accessing the text object may comprise querying one or more text objects for attributes and obtaining the value of the element corresponding to the queried attribute.
- the free-format data is name and address data
- a person may query the text object or objects to see if there is a <Street> element, and, if so, obtain the value of the element (e.g., “12 Pitt St”).
- the “address” field merely includes all the <address> in free-format form.
- Other older systems provide search facilities which scan for a particular text string without regard for the semantics of the text being searched. These systems could be used to find all addresses with a street name of “Pitt” by searching for that string. This leads to problems when the string being searched for can be used in different ways.
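The difference between a plain string scan and an attribute-aware query can be shown with a small Python example. The records and their pre-parsed attributes are made up for illustration, and both function names are hypothetical.

```python
# Hypothetical free-format address records.
RECORDS = {
    "R1": "12 Pitt Street, NORTH SYDNEY",
    "R2": "3 George Street, PITT TOWN",
    "R3": "7 Pittwater Road, MANLY",
}

# Attributes assumed to have been produced by parsing each record.
ATTRIBUTES = {
    "R1": {"Street name": "Pitt", "Town": "NORTH SYDNEY"},
    "R2": {"Street name": "George", "Town": "PITT TOWN"},
    "R3": {"Street name": "Pittwater", "Town": "MANLY"},
}

def string_search(needle: str) -> list:
    """Scan the raw text with no regard for what role the string plays."""
    return sorted(rid for rid, text in RECORDS.items()
                  if needle.lower() in text.lower())

def attribute_search(attribute: str, value: str) -> list:
    """Match only elements that carry the requested attribute."""
    return sorted(rid for rid, attrs in ATTRIBUTES.items()
                  if attrs.get(attribute, "").lower() == value.lower())

print(string_search("Pitt"))                    # every record mentions "Pitt" somewhere
print(attribute_search("Street name", "Pitt"))  # only the record on Pitt Street
```

The string scan also returns the record in the town of PITT TOWN and the one on Pittwater Road, which is the kind of false hit the passage describes.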
- the step of accessing the text object may also include comparing two text objects and ascertaining and providing a confidence value that indicates how closely the two text objects match. For example, two street addresses may be compared by comparing their respective text objects, and a confidence value (in percentage points) can be given depending on how closely they match.
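One possible way to turn such a comparison into a percentage is a weighted overlap of matching elements. The formula, the weights and the case-insensitive comparison below are all assumptions chosen for illustration; the source does not specify how the confidence value is calculated.

```python
# Hypothetical matching weights per attribute type identifier.
WEIGHTS = {"Street number": 2, "Street name": 4, "Street type": 1, "Town": 3}

def confidence(a: dict, b: dict, weights: dict) -> float:
    """Return a 0-100 score: weight of agreeing attributes over total weight.

    `a` and `b` map attribute type identifiers to element values, standing in
    for two text objects. Attributes present in only one object count against
    the score. Unknown attributes default to a weight of 1.
    """
    attributes = set(a) | set(b)
    total = sum(weights.get(attr, 1) for attr in attributes)
    if total == 0:
        return 0.0
    matched = sum(weights.get(attr, 1) for attr in attributes
                  if attr in a and attr in b
                  and a[attr].casefold() == b[attr].casefold())
    return 100 * matched / total

first = {"Street number": "12", "Street name": "Pitt",
         "Street type": "Street", "Town": "NORTH SYDNEY"}
second = {"Street number": "12", "Street name": "PITT",
          "Street type": "Road", "Town": "North Sydney"}

print(confidence(first, second, WEIGHTS))
```

Here only the low-weight street type differs, so the two addresses still score highly; a mismatch on the street name would cost far more.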
- the step of accessing may also include the step of changing a value associated with a particular component.
- Common examples include changing a woman's surname after marriage and correcting a street or town name after a mistake has occurred.
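Changing one element in place can be sketched as follows, assuming each component is tracked by a (start, end) offset pair into the text string. The `set_value` function and the flat `spans` dictionary are hypothetical simplifications of the component node structure.

```python
def set_value(text: str, spans: dict, attribute: str, new_value: str):
    """Replace the element for `attribute` and return (new_text, new_spans).

    Offsets of later elements are shifted, and enclosing elements are
    resized, so every pointer still locates its element afterwards.
    """
    start, end = spans[attribute]
    delta = len(new_value) - (end - start)
    new_text = text[:start] + new_value + text[end:]
    new_spans = {}
    for name, (s, e) in spans.items():
        if name == attribute:
            new_spans[name] = (s, s + len(new_value))
        elif s >= end:                   # element lies after the change
            new_spans[name] = (s + delta, e + delta)
        elif s <= start and e >= end:    # element encloses the change
            new_spans[name] = (s, e + delta)
        else:                            # element lies before the change
            new_spans[name] = (s, e)
    return new_text, new_spans

TEXT = "12 Pitt Street, NORTH SYDNEY"
SPANS = {
    "Street": (0, 14),
    "Street number": (0, 2),
    "Street name": (3, 7),
    "Street type": (8, 14),
    "Town": (16, 28),
}

new_text, new_spans = set_value(TEXT, SPANS, "Street name", "George")
print(new_text)
for name, (s, e) in new_spans.items():
    print(f"<{name}> {new_text[s:e]!r}")
```

Only the street name is rewritten; the rest of the free-format string is left exactly as it was stored.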
- Yet a further aspect of the present invention provides a processing system for enabling access to free-format data processed in accordance with the method of any one of claims 1 to 19, the processing system including a query processing means arranged to access the additional data and provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- the apparatus may include means for accessing the text object in accordance with any or all of the method steps given above.
- the present invention yet further provides a processing system for processing free-format data stored in a computing system, comprising means for examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationship of elements to each other, to determine semantic and syntactic information (attributes) about the data, and a query processing means for utilising this information to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data.
- the means for examining may comprise a domain object which examines the elements and produces virtual data (being data relating to the semantic and syntactic information about the data) which is used by the query processing means to access the data and obtain information on attributes of the data.
- the present invention yet further provides a method of processing free-format data stored in a computing system, comprising the steps of examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, and querying the data using this information to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data.
- the present invention provides a method of processing a plurality of records of free-format data stored in a computing system, comprising the steps of, for each record examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, and producing virtual data fields enabling access to this information and the associated elements for each data record, whereby each record is provided with associated virtual data fields enabling access to semantic and syntactic information about that record and also access to the associated elements.
- The term “virtual data fields” is used in the same sense as previously. Unlike prior art conventional databases, where it is necessary to process the information and produce actual data fields, no separate data fields need be produced. The data may remain in place where it is in the database, and instead an associated “virtual field” is produced for attributes of the semantic and syntactic information. The virtual fields can be queried to obtain all the information required of the record, and preferably all normal database operations may be implemented.
- the present invention yet further provides a processing system for processing a plurality of free-format data records stored in a computing system, comprising means for examining elements of the data of each record to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about each record, and means for producing virtual data fields associated with each record enabling access to this information and the associated elements, whereby each record is provided with associated virtual data fields enabling access to semantic and syntactic information about that record and also access to the associated elements.
- FIG. 1 is a diagram illustrating the architecture of a system for enabling the processing of free-format data in accordance with an embodiment of the present invention
- FIG. 2 illustrates sample “address” data
- FIG. 3 is a more detailed structural view of an example text object produced by operation of the embodiment of the invention on free-format data
- FIG. 4 illustrates sample “address” formats
- FIG. 5 is a flow chart illustrating a method for getting a sub-component of a specific type from the text object of the invention
- FIG. 6 illustrates the results of the get sub-component method
- FIG. 7 is a flow chart illustrating a method for modifying a sub-component of a text object of the invention
- FIG. 8 is an illustration of the mechanics of modifying a text object of the invention
- FIGS. 9, 10 and 11 provide an example of modifying a text object of the invention
- FIG. 9 shows a text object before modification
- FIG. 10 shows the replacement text object
- FIG. 11 shows the text object referred to in FIG. 9 after it has been modified
- FIG. 12 is a flow chart illustrating the node matching subroutine used by other methods
- FIG. 13 illustrates examples of text objects in accordance with embodiments of the present invention for illustrating a method of comparison of text objects in accordance with an embodiment of the present invention
- FIG. 14 is a flow chart illustrating the “adjust node” subroutine used by other methods
- FIG. 15 is a diagram illustrating the architecture of the domain object block of FIG. 1
- FIG. 16 is an illustration of the domain construction process of FIG. 1 in more detail
- FIG. 17 provides two examples of standard transliteration tables, one for Japanese Katakana and one for Greek
- FIG. 18 contains tables illustrating Regular Expression Definition data
- FIG. 19 illustrates a demonstration grammar data file
- FIGS. 20 and 21 provide flow charts of the domain object construction process block of FIG. 1
- FIG. 22 illustrates an example session with an implementation of the invention within a SQL relational database system.
- the present invention relates to an entirely new concept and approach for processing computerised information, in particular free-format data.
- the idea is to produce from the free-format data a “text object” which may be stored in a computer and which can be used to obtain information about the free-format data, compare records of free-format data and manipulate the data. This is achieved without it being necessary to construct complex databases having many fields.
- FIG. 1 is a diagram showing the configuration of an entire “virtual data” system in accordance with an embodiment of the present invention. It comprises a user interface 101 , a processor 102 .
- the processor 102 can be a standard computer system and has a general configuration such as a CPU, a computer memory and mass storage device.
- the user interface 101 can be a standard keyboard and VDU, and/or an interface to another computer system. User interfaces like these, along with other equivalent interfaces, are well known.
- the system 104 comprises a domain construction process 106 which is arranged to take a plurality of input data 107 (in this example in the form of data files) and build a domain object 108 which is used to produce text objects 105 .
- Each “domain” will include all the grammar and syntax rules necessary for that particular domain of free-format data.
- one domain may be international names and addresses and will include all the information necessary to analyse free-format international name and address data to produce a text object.
- Another domain may be a commodity description knowledge base, another one may be a transportation industry knowledge base. Domains may be produced to handle any free-format data.
- the domain construction process 106 is essentially an engine which works on the knowledge bases (input files) for the particular domain type to produce the domain object 108 for that type.
- a text object index 109 may be produced by processing a number of text objects 105 , and this will be described later.
- the invention 104 provides a layer between general application software systems 103 and their stored data 110 . Unlike “Knowledge Based Management Systems” described above, this invention allows the free-format data to remain in its original location and legacy application software to operate using the original access paths 111 .
- FIG. 3 is a schematic diagram of the detailed structure of an example text object in accordance with an embodiment of the present invention, in order to assist with illustrating the concept.
- the example free-format data illustrated in FIG. 3 is a street address, “12 Pitt Street, North Sydney” (designated by reference numeral 301 ).
- this information may have been stored in a single “address” field or may have been divided into a number of separate fields corresponding to the various attributes, i.e., street number, street name, street type and town.
- See FIG. 4 for other examples of common Australian address formats.
- the prior art database format requirement for a separate field for each attribute gives rise to much complexity and, where the information is intricate, it is cost prohibitive and even impossible to produce a field for every attribute of the free-format data.
- the text object (illustrated in FIG. 3) comprises a plurality of component nodes 302-312.
- the text object can be represented as a text node tree, having branches (eg 313 ) wherein the component nodes 302 - 312 are positioned in a predetermined hierarchy.
- the “lowest” hierarchy is at the bottom of the text node tree and the “highest” hierarchy is at the top of the text node tree.
- the node 302 at the top of the node tree will be referred to as the “root” node.
- components of the text object can be stored in any convenient manner in a memory of a processing means; for example, they could be nested within each other or refer to each other in some way.
- the text object is able to be represented as a text node tree, but that does not mean that it is stored in memory in this way. As long as the components of the text object can be processed in such a fashion that the components act like component nodes of a text node tree as represented in the figure, then this is sufficient.
- each component node 302-312 could itself be considered a text object. This recursive definition allows all the functions of the present invention to be applied to each component.
- each component node 302 - 312 includes:
- An attribute type identifier (which in this embodiment is an integer) which identifies an attribute type of the free-format data 301 associated with the text object.
- component node 303 includes the attribute type identifier <Street>, indicating that this component node 303 is associated with the element of the free-format data which gives the street, i.e., “12 Pitt Street”.
- Component node 302 is the main component node for the text object illustrated in FIG. 3 and includes the attribute type identifier <Address>. The component node 302 is therefore associated with the entire free-format data record in this case, being “12 Pitt Street, North Sydney”, which is an address.
- component node 302 is “higher” in the hierarchy in the text node tree than component 303; the <Address> component includes within it the <Street> component.
- the hierarchy of the component node 302 - 312 within the text node tree is in fact determined by the attribute type identifier of the component node and by grammatical rules which determine that the attribute should be of a lower or higher hierarchy.
- a pointer to the starting position of the actual element sub-string of the free-format data associated with the component node. The free-format data is stored as a string in memory, and the pointer points to the beginning of the element's character string.
- component node 303 would point to numeral “1” of the address.
- component node 303 would have a length of 14 (including space characters after “12” and “Pitt”) which would in effect point to the last letter “t” of “Street”.
- nodes 306 , 307 , 308 are all directly subordinate in the hierarchy and nodes 311 , 312 indirectly subordinate. This array enables the component nodes to be related to each other in the text node tree construction.
- a boolean variable indicating whether this attribute type identifier is for a “low level” matching element.
- “Regular expression” terms such as <word> and <nbr> are not matched against each other. Matching of these terms is performed at the next level up the hierarchy (e.g. <Street Name> 307).
- a node is flagged as a low level matching component if it either: is a literal which was located in the dictionary (e.g. nodes 308 , 309 ); or contains “Regular expression” terms (e.g. nodes 306 , 307 , 305 ).
- an integer representing the element's match weighting. This indicates the relative importance of each of the elements when performing comparisons between text objects. For example, when comparing “Level 3, 45 Pitt st” with “3rd Floor, 45 Pitt St”, the fact that the elements “Level” and “Floor” are not equal is insignificant.
- the “match weighting” values are specified in the grammar rules used to construct the domain object.
- a boolean value indicating whether this component node is responsible for deleting and moving the piece of text it points to.
- the two conditions when a component is responsible for its text are: 1) When an outside process requests that the text object manage the entire text string, the text object “root” node is flagged as being responsible for the text string. 2) When an implied value is created. See below for details.
- an integer value representing the free space available at the end of the buffer in which the free-format text is held. This value is calculated during the creation of the text object and is usually only applicable to the “root” node of the text object.
- the foot of the hierarchy is a component node dealing with an element for each token of the free-format data, in this case being <number> 311, <word> 312, <street type> 308, <geographic term> 309, <word> 310.
- component nodes for more generic attribute type identifiers. For example these are <street name> 307 for the word “Pitt”, <Street> 303 for the three tokens “12 Pitt Street”, <town> 305 for the tokens “North Sydney” and, at the top of the hierarchy of this particular free-format data record, the attribute type identifier <Address> 302.
- attribute type identifiers can be stored in any form, i.e., they need not be stored as integers but could be stored in any representation.
- a program engine is provided enabling access to the text node tree and this engine has the information necessary to identify the attribute type identifiers as stored.
- each component node contains an integer indicating the “parsing priority” of the element. These values are assigned during construction of the text object and are used to select the best text node tree if more than one exists for a particular ambiguous free-format text. For example, “12 Pitt St Nth Sydney” has two interpretations. Although “12 Pitt St Nth” is a valid street address, it has a lower priority than “Nth Sydney” and is therefore not selected. These “parsing priority” values are specified in the grammar rules used to construct the domain object (see below).
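- The per-node fields described above can be pictured with a small data structure. The following is only a minimal sketch in Python: the class name `ComponentNode`, all field names, and the simplified tree (which omits the token-level nodes of FIG. 3) are assumptions made for illustration, and the attribute type is held as a string although this embodiment uses an integer.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComponentNode:
    """One node of a text node tree (all names here are illustrative)."""
    attr_type: str                  # attribute type identifier (an integer in the embodiment)
    start: int                      # offset of the element's first character in the text
    length: int                     # number of characters the element spans
    children: List["ComponentNode"] = field(default_factory=list)
    low_level_match: bool = False   # flagged as a "low level" matching element
    match_weight: int = 1           # relative importance when comparing text objects
    owns_text: bool = False         # responsible for deleting and moving its text
    free_space: int = 0             # spare room at the end of the buffer (root only)
    parsing_priority: int = 0       # used to choose between ambiguous trees

    def text(self, buffer: str) -> str:
        """Return the sub-string of the free-format data this node refers to."""
        return buffer[self.start:self.start + self.length]


ADDRESS = "12 Pitt Street, North Sydney"

# Simplified tree: <Address> contains <Street> and <Town>.
street = ComponentNode("Street", 0, 14, children=[
    ComponentNode("Street Number", 0, 2, low_level_match=True),
    ComponentNode("Street Name", 3, 4, low_level_match=True),
    ComponentNode("Street Type", 8, 6, low_level_match=True),
])
town = ComponentNode("Town", 16, 12, low_level_match=True)
root = ComponentNode("Address", 0, len(ADDRESS),
                     children=[street, town], owns_text=True)

print(street.text(ADDRESS))  # the <Street> element, start 0 and length 14
```

- Because each node stores only an offset and a length, the free-format text itself is never copied or split into separate fields in this sketch.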
- Another feature of the present invention is the production of extra implied sub fields in a text object, in the form of the creation of extra component nodes for information that is not actually explicit in the original text. For example, “Mr John Smith” has an implied sub field “sex” with a value “male”.
- the text object can be created with an extra component node dealing with this element and having the attribute type identifier “sex”.
- the text object acts as a “virtual interface” enabling access to the free-format data and facilitating all normal database operations on the free-format data.
- the user does not “see” the internals of the text object, but can query the text object via the associated program engine (query processing means) and, by virtue of the structure of the stored text object, the attribute type identifiers and other data being placed in nodes, can perform all the normal database operations on the free-format text record.
- Another embodiment of this invention may speed up the above procedure by performing the above process and creating a lookup table containing every sub-attribute, sorted by the attribute type identifier. This technique is well known to those skilled in the art.
- Modify Sub-component: changes the value of a particular element of a text object to a specific value. For example, change “Pitt” to “King”.
- FIG. 5 illustrates this method. Beginning this recursive procedure with the “root” node of the text object, starting at 501, a determination is made (502) as to whether the attribute type of this node is the same as the required attribute type. If it is, a pointer to this node is appended to the result list at step 503. Continuing with step 504, for each sub-component node referenced by this node, recursively call this procedure (505). Then return to the caller (506).
- FIG. 6 illustrates the node tree for “Mr Fred and Mrs Mary Smith”. Searching the tree for nodes with attribute type <Given Name> will return a list containing pointers to two nodes 601, 602. These nodes point to the sub-strings “Fred”, “Mary” respectively.
- Another version of this operation takes a text string as a parameter. Only nodes containing the same attribute type and same text string (ignoring case) are added to the list. For example: calling this function with an attribute type <Given Name> and text string “FRED” would return a list containing one node.
- Yet another version of this operation takes as parameters a text string and a confidence level. Only nodes that have the same attribute type and a text string which matches the supplied string with a confidence above the supplied level are added to the list.
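- The recursive search of FIG. 5 and its two variants might be sketched as follows. This is an illustrative Python sketch only: the `Node` class, the function name `get_components`, the simplified tree for the FIG. 6 example, and the use of `difflib.SequenceMatcher` as the confidence measure are all assumptions, since the string comparison procedure of the invention is not reproduced here.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Node:
    attr_type: str
    start: int
    length: int
    children: List["Node"] = field(default_factory=list)


def get_components(node: Node, attr_type: str, buffer: str,
                   text: Optional[str] = None,
                   min_confidence: Optional[float] = None,
                   found: Optional[List[Node]] = None) -> List[Node]:
    """Collect nodes of the required attribute type, depth first."""
    if found is None:
        found = []
    if node.attr_type == attr_type:
        value = buffer[node.start:node.start + node.length]
        if text is None:
            keep = True                                   # plain search by type
        elif min_confidence is None:
            keep = value.lower() == text.lower()          # same text, ignoring case
        else:
            # Stand-in similarity measure (an assumption, not the patent's method).
            ratio = SequenceMatcher(None, value.lower(), text.lower()).ratio()
            keep = ratio > min_confidence
        if keep:
            found.append(node)
    for child in node.children:
        get_components(child, attr_type, buffer, text, min_confidence, found)
    return found


TEXT = "Mr Fred and Mrs Mary Smith"
tree = Node("Names", 0, len(TEXT), [
    Node("Person", 0, 7, [Node("Title", 0, 2), Node("Given Name", 3, 4)]),
    Node("Person", 12, 8, [Node("Title", 12, 3), Node("Given Name", 16, 4)]),
    Node("Surname", 21, 5),
])


def values(nodes: List[Node]) -> List[str]:
    return [TEXT[n.start:n.start + n.length] for n in nodes]


by_type = values(get_components(tree, "Given Name", TEXT))
by_text = values(get_components(tree, "Given Name", TEXT, text="FRED"))
by_confidence = values(get_components(tree, "Given Name", TEXT,
                                      text="Marie", min_confidence=0.6))
```

- The result is a list of node references rather than copied strings, so a caller can read or modify the underlying text through the nodes.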
- This operation compares two text objects and returns a confidence level indicating how closely they match. It performs this by:
- This operation searches one text object for a sub-component which matches a second text object. If found, it returns to the caller a confidence level indicating how well they match. This operation is achieved by first calling the “Get Component” function (described above) passing the component type of the second text object. If successful, it calls the “Match Node” subroutine (described below) with the “root” node of the second text object and the node of the result of the “Get Component” function.
- This operation appends an extra component node to the text object so that queries performed on the text object return the correct results. For example: a text object pointing to a record containing “Dr Chris Smith” may need to be modified to indicate that the person is a female. Invoking the Add Component function containing a sex attribute type with a value of “female” will append the respective component node to the text object.
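- As an illustration of the Add Component operation, the sketch below appends an implied “sex” node to a text object for “Dr Chris Smith” without altering the stored record. It is a hedged Python sketch: the `Node` class, the `implied_value` field and the function names are hypothetical, and how an implied value is actually held by a node is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    attr_type: str
    start: int = 0
    length: int = 0
    children: List["Node"] = field(default_factory=list)
    owns_text: bool = False                # node is responsible for its own text
    implied_value: Optional[str] = None    # hypothetical storage for an implied value

    def value(self, buffer: str) -> str:
        if self.implied_value is not None:
            return self.implied_value      # not present in the original record
        return buffer[self.start:self.start + self.length]


def add_component(root: Node, attr_type: str, value: str) -> Node:
    """Append an extra component node carrying a value absent from the text."""
    node = Node(attr_type, owns_text=True, implied_value=value)
    root.children.append(node)
    return node


def find(node: Node, attr_type: str) -> List[Node]:
    hits = [node] if node.attr_type == attr_type else []
    for child in node.children:
        hits.extend(find(child, attr_type))
    return hits


RECORD = "Dr Chris Smith"
person = Node("Name", 0, len(RECORD), [
    Node("Title", 0, 2), Node("Given Name", 3, 5), Node("Surname", 9, 5),
])

add_component(person, "Sex", "female")
sex_values = [n.value(RECORD) for n in find(person, "Sex")]
```

- The added node is flagged as owning its text, in line with the second condition given above for a component being responsible for its text.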
- FIG. 8 illustrates the mechanics of the “Modify” operation.
- the text object to be modified is represented by 801 .
- the actual text data consists of the sub-string to be replaced 805 and the sub-strings before 804 and after 806 .
- the sub-tree 803 represents the sub-string to be replaced 805 .
- the replacement text string 807 is represented by another text object 802 .
- FIG. 7 provides a flow chart of the “Modify” procedure.
- a call to the “Get Component” function (described above) is performed to locate the required component node at step 702 .
- the results of this function call are tested (step 703) to ensure that one and only one component node is returned. If zero or more than one node is returned, an error condition is set 704 and the procedure returns to the caller 714. Otherwise, the procedure continues with step 705 by calculating the difference (Diff) in length between the sub-string to be replaced 805 and the new replacement sub-string 807. If the difference is not zero (i.e. the strings have unequal lengths), invoke the “Adjust Node Variables” subroutine 707 (described below). If the subroutine 707 is unsuccessful, set an error condition 711 and return to caller 714. Continuing the procedure at step 708, copy the new replacement string 807 into the location of the old string 805. Replace the old node sub-tree 803 with the new sub-tree 802 at step 710. For each node in the new sub-tree 712, adjust the node's “text start address” variable by adding the starting position of the new sub-string 713. Then terminate this procedure and return to caller 714.
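- The flow of the “Modify” procedure can be sketched in Python as below. This is a simplified illustration under stated assumptions: the text is held in an ordinary Python string (so buffer space is never exhausted), the adjustment of the other nodes is done inline rather than in a separate subroutine, and the names `Node`, `TextObject`, `walk` and `modify` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    attr_type: str
    start: int
    length: int
    children: List["Node"] = field(default_factory=list)


@dataclass
class TextObject:
    text: str
    root: Node


def walk(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from walk(child)


def modify(obj: TextObject, attr_type: str, new: TextObject) -> None:
    """Replace the single component of the given type with a new text object."""
    hits = [n for n in walk(obj.root) if n.attr_type == attr_type]
    if len(hits) != 1:                       # one and only one node is required
        raise ValueError("expected exactly one matching component")
    old = hits[0]
    old_end = old.start + old.length
    diff = len(new.text) - old.length        # difference in sub-string length
    replaced = {id(n) for n in walk(old)}    # the sub-tree that will be discarded
    if diff != 0:
        for n in walk(obj.root):
            if id(n) in replaced:
                continue
            if n.start >= old_end:           # located after the replaced text
                n.start += diff
            elif n.start <= old.start and n.start + n.length >= old_end:
                n.length += diff             # an enclosing (ancestor) node
    # Copy the replacement string into the place of the old sub-string.
    obj.text = obj.text[:old.start] + new.text + obj.text[old_end:]
    # Re-base the replacement sub-tree onto the position of the old one.
    for n in walk(new.root):
        n.start += old.start
    old.length = new.root.length
    old.children = new.root.children


name = Node("Street Name", 3, 4)
street_type = Node("Street Type", 8, 6)
street = Node("Street", 0, 14, [Node("Street Number", 0, 2), name, street_type])
town = Node("Town", 16, 12)
address = TextObject("12 Pitt Street, North Sydney",
                     Node("Address", 0, 28, [street, town]))

modify(address, "Street Name", TextObject("Elizabeth", Node("Street Name", 0, 9)))
```

- A replacement of a different length than “Pitt” is used here so that the start and length adjustments are actually exercised.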
- FIG. 9 shows a text object before modification.
- FIG. 10 shows the replacement text object and
- FIG. 11 shows the text object referred to in FIG. 9 after it has been modified.
- This procedure compares two elements with the same attribute type and returns a confidence level value indicating how closely they match.
- FIG. 12 shows a flow chart for the “Match Node” operation.
- a determination is made as to whether the nodes being compared are low level matching components at step 1202. If the two nodes are low level matching components, perform the “String Comparison” procedure (described below) at step 1203 and return to caller 1210. Otherwise, if the two nodes contain sub-component nodes, recursively invoke this procedure 1205 with all combinations of sub-component pairs which have the same attribute type (step 1204). Record the best confidence level for each 1206. Multiply each node's confidence level by its respective matching weight value 1207. Sum all the resulting values into one confidence value 1208. Divide that value by the sum of the match weightings 1209 and return to the caller 1210.
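- The weighted calculation performed by “Match Node” can be illustrated with the following Python sketch. Several points are assumptions made only for the example: nodes carry their text directly instead of a pointer and length, the string comparison is reduced to a case-insensitive equality test, a sub-component with no partner of the same attribute type scores zero, and the weights are invented.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    attr_type: str
    text: str = ""
    weight: int = 1            # match weighting (values below are invented)
    low_level: bool = False    # low level matching component
    children: List["Node"] = field(default_factory=list)


def string_confidence(a: str, b: str) -> float:
    """Stand-in for the string comparison procedure: 1.0 if equal ignoring case."""
    return 1.0 if a.lower() == b.lower() else 0.0


def match_node(a: Node, b: Node) -> float:
    if a.low_level and b.low_level:
        return string_confidence(a.text, b.text)
    total = 0.0
    weight_sum = 0
    for child in a.children:
        partners = [c for c in b.children if c.attr_type == child.attr_type]
        # Best confidence over all partners of the same attribute type.
        best = max((match_node(child, p) for p in partners), default=0.0)
        total += child.weight * best
        weight_sum += child.weight
    return total / weight_sum if weight_sum else 0.0


def address(floor: str, number: str, name: str, street_type: str) -> Node:
    return Node("Address", children=[
        Node("Floor Type", floor, weight=1, low_level=True),
        Node("Street Number", number, weight=5, low_level=True),
        Node("Street Name", name, weight=10, low_level=True),
        Node("Street Type", street_type, weight=2, low_level=True),
    ])


first = address("Level", "45", "Pitt", "st")
second = address("Floor", "45", "Pitt", "St")
third = address("Floor", "45", "King", "St")

close_match = match_node(first, second)   # only the low-weight floor type differs
poor_match = match_node(first, third)     # the heavily weighted street name differs
```

- With these invented weights the mismatch between “Level” and “Floor” costs 1 part in 18, whereas a different street name costs 10 parts in 18, which mirrors the role the match weighting is described as playing.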
- FIG. 13 contains an example showing the matching process.
- In the text object's node tree there are three types of component nodes:
- nodes which are contained within the low level matching components and represent simple “regular expression” terms. (Refer to the description of the grammar file for details of the terms.) These nodes are not used in the matching process.
- the nodes 1301 , 1302 , 1313 and 1314 contain sub-component nodes.
- the nodes 1304 , 1305 , 1306 , 1307 , 1308 , 1309 , 1315 , 1316 , 1317 and 1318 are low level matching nodes.
- the nodes 1309 , 1310 , 1311 , 1312 , 1319 , 1320 and 1321 are simple “regular expression” terms.
- the first number within the parentheses is the weighting value for that component.
- the second number is the best result from the node matching procedure for that node.
- the number on top is the node's reference label in FIG. 13.
- Fuzzy logic techniques are well known to those skilled in the art and many suitable reference books are available.
- This subroutine is called from the “Modify Component” procedure described above.
- the purpose of this routine is to adjust the actual free-format text and all corresponding sub-component nodes located after the node being replaced so that the new replacement sub-string and sub-tree fit exactly. If the old sub-string and the new replacement sub-string are the same length, this subroutine is not invoked.
- FIG. 14 shows a flow chart of the steps required.
- a determination is made at step 1402 as to whether there is enough space in the current text buffer to accommodate the change. This is done by referring to the “free space” variable (described above) of the “root” node of the text object. If there is not enough space, the “Relocate Text Data” subroutine is invoked 1403 to create free space in the text object. If this routine is unsuccessful 1404, an error condition is set 1415, and the procedure terminates and returns to the caller 1416. Otherwise, the procedure continues at 1405 and calculates the extra space requirements of the modified text object by subtracting the size of the old sub-tree being replaced from the size of the new replacement sub-tree.
- a zero or negative value indicates that the text object has enough space to accommodate the change. If the text object requires more space 1406, the “Relocate Text Object” subroutine is invoked 1407 to create free space in the text object. If this routine is unsuccessful 1408, an error condition is set 1415, and the procedure terminates and returns to the caller 1416. If the above steps are successful, the procedure continues at step 1409 and shifts the “after” string 806 in FIG. 8 by the difference in length between the old sub-string 805 and the new replacement sub-string 807. For each node which refers to components located after the replacement node 1410, add this difference to the node's start address variable 1411. For each node which has the replaced node as a sub-component 1412, add the difference to the node's length variable 1413. Adjust the text object's “free space” variable by subtracting the difference 1414 and return to caller 1416.
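- The buffer handling in this subroutine might look like the following Python sketch, which uses a fixed-size `bytearray` to stand for the buffer holding the free-format text. It is an illustration only: relocation is not modelled (insufficient space simply raises an error), the space needed by the node tree itself is ignored, and the names `Node`, `walk` and `adjust_node_variables` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    attr_type: str
    start: int
    length: int
    children: List["Node"] = field(default_factory=list)
    free_space: int = 0        # meaningful on the root node only


def walk(node: Node) -> Iterator[Node]:
    yield node
    for child in node.children:
        yield from walk(child)


def adjust_node_variables(buffer: bytearray, root: Node,
                          target: Node, new_length: int) -> None:
    """Make room in the buffer and the tree for a replacement of new_length."""
    diff = new_length - target.length
    if diff > root.free_space:
        raise MemoryError("not enough free space; relocation is not modelled")
    used = len(buffer) - root.free_space
    old_end = target.start + target.length
    # Shift the "after" string by the difference in length.
    tail = bytes(buffer[old_end:used])
    buffer[old_end + diff:used + diff] = tail
    if diff < 0:
        buffer[used + diff:used] = b" " * -diff   # blank the vacated bytes
    replaced = {id(n) for n in walk(target)}
    for n in walk(root):
        if id(n) in replaced:
            continue
        if n.start >= old_end:                    # located after the replaced node
            n.start += diff
        elif n.start <= target.start and n.start + n.length >= old_end:
            n.length += diff                      # contains the replaced node
    root.free_space -= diff


buffer = bytearray(b"12 Pitt Street, North Sydney".ljust(40))
name = Node("Street Name", 3, 4)
town = Node("Town", 16, 12)
street = Node("Street", 0, 14, [Node("Street Number", 0, 2), name,
                                Node("Street Type", 8, 6)])
root = Node("Address", 0, 28, [street, town], free_space=12)

adjust_node_variables(buffer, root, name, len(b"Elizabeth"))
# The caller then copies the replacement text into the gap that was opened.
buffer[name.start:name.start + 9] = b"Elizabeth"
name.length = 9
result = bytes(buffer[:root.length])
```

- In a fixed-length database field the free space is the only room available, which is why the check against the root node's “free space” value comes first.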
- This subroutine is invoked by the “Adjust Node Variables” subroutine to move the current free-format text into a space large enough to accommodate the required modification.
- the ability of this routine to perform this operation depends on where the text data is stored. Typically, free-format data such as “address” information is stored in fixed length database fields and will not be able to be relocated. If this is the case, this routine will set an error condition and return to caller. However, if the text data is stored within moveable storage such as the computer's memory or with an object-oriented database as a non-persistent object, this procedure will relocate the text data and return to the caller with the text data's new address.
- This subroutine is invoked by the “Adjust Node Variables” subroutine to move the current text object into a space large enough to accommodate the required modification.
- the ability of this routine to perform this operation depends on how this invention is implemented. If the text object is stored within moveable storage such as the computer's memory or with an object-oriented database as a non-persistent object, this procedure will relocate the text object and return to the caller with the text object's new address.
- This operation is used exclusively by the “Text Object Index” described below. It provides key information used in updating and querying of the text object index. It recursively searches the text object node tree and returns a list of all the nodes which have been flagged as low level matching components. See above for a definition of a low level matching component. Refer to the description of the Text Object Index below for an example of the output of this function.
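- A minimal Python sketch of this operation is given below. It assumes a hypothetical `Node` class and a simplified tree for “12 Pitt Street, North Sydney” in which the flags follow the rule stated earlier (dictionary literals and nodes containing “regular expression” terms are flagged, the terms themselves are not).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Node:
    attr_type: str
    start: int
    length: int
    low_level: bool = False
    children: List["Node"] = field(default_factory=list)


def low_level_components(node: Node,
                         found: Optional[List[Node]] = None) -> List[Node]:
    """Recursively list every node flagged as a low level matching component."""
    if found is None:
        found = []
    if node.low_level:
        found.append(node)
    for child in node.children:
        low_level_components(child, found)
    return found


TEXT = "12 Pitt Street, North Sydney"
tree = Node("Address", 0, 28, children=[
    Node("Street", 0, 14, children=[
        Node("Street Number", 0, 2, True, [Node("nbr", 0, 2)]),
        Node("Street Name", 3, 4, True, [Node("word", 3, 4)]),
        Node("Street Type", 8, 6, True),
    ]),
    Node("Town", 16, 12, True, [
        Node("Geographic Term", 16, 5, True),
        Node("word", 22, 6),
    ]),
])

keys: List[Tuple[str, str]] = [
    (n.attr_type, TEXT[n.start:n.start + n.length])
    for n in low_level_components(tree)
]
```

- Each (attribute type, text) pair produced this way is the kind of key information that an index over many text objects could store.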
- Free-format text is stored basically as it is, with the associated text object providing all the facilities required to perform all the normal database operations on the free-format data. This essentially enables a computer to handle information in much the same way as a human being does.
- the text object is produced by an examination of the free-format data by applying natural language processing techniques, such as parsing, which is known in the prior art.
- Such language processing techniques have been applied to “clean” or “scrub” databases, and large and complex software systems have been used for this purpose.
- the natural language processing has been applied to analyse the data to enable the creation of new database fields.
- the idea of maintaining the free-format data as it is and creating a text object as described is a totally new concept.
- the processing of each item of free-format text to produce the text object involves, firstly, lexical analysis in which a regular expression analyser reads the free-format text and groups the items of the text into tokens with their associated attribute type identifier (e.g., word, number, comma, etc). Each token is then checked against a dictionary for other applicable attribute type identifiers (e.g., Street type, State, etc).
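- The lexical analysis step can be pictured with the short Python sketch below. It is illustrative only: the token patterns follow the “word” (two or more alphabetic characters) and “nbr” (one or more numeric characters) terms described elsewhere in this specification, while the `comma` pattern, the dictionary entries and the function name `tokenise` are assumptions. Characters that match no pattern are simply skipped in this sketch.

```python
import re
from typing import Dict, List, Tuple

# One named group per "regular expression" term.
TOKEN_RE = re.compile(r"(?P<nbr>\d+)|(?P<word>[A-Za-z]{2,})|(?P<comma>,)")

# Hypothetical dictionary: literal (lower case) -> extra attribute type identifiers.
DICTIONARY: Dict[str, List[str]] = {
    "street": ["street type"],
    "st": ["street type"],
    "north": ["geographic term"],
}


def tokenise(text: str) -> List[Tuple[str, int, List[str]]]:
    """Group the text into tokens, then look each token up in the dictionary."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        types = [match.lastgroup] + DICTIONARY.get(match.group().lower(), [])
        tokens.append((match.group(), match.start(), types))
    return tokens


tokens = tokenise("12 Pitt Street, North Sydney")
for token, offset, types in tokens:
    print(offset, repr(token), types)
```

- A token such as “Street” carries both its lexical type and its dictionary type, which leaves the later parsing stage free to choose whichever interpretation fits the grammar.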
- the main function of the domain object 108 is to create text objects 105 . This function is described in detail below. Other functions the domain object performs relate to maintaining an attribute type table. This table contains the information for all the attribute types defined for its domain.
- FIG. 15 shows the domain object architecture 108 in more detail. It comprises a series of “look up” tables, which include the symbol table 1502 (e.g., <Street name>; NB the term “symbol” is equivalent to the term “attribute type identifier”) and the parse table 1504 (which contains rules for applying the grammar). It also comprises a lexicon 1503, which contains a character definition table 1505, a regular expression analyser 1506 and a dictionary 1507 (e.g., NSW, VIC, SA). All of these parts are used by a modified “Tomita parser” (described below) to process free-format text to produce text objects.
- FIG. 16 gives an overview of the operation of the domain object 108 creating a text object 105 of FIG. 1.
- the domain object 1605 uses the attribute type 1608 to locate the respective parsing rules and then “parses” the free-format data 1607 and produces a text object 1606 .
- Parsing is a known technique for analysing free-format data and a skilled person would be able to arrange appropriate parsing.
- the parser may consist of any non-deterministic parser.
- the common parsing techniques are listed as follows:
- This procedure takes a free-format text string 1607 and an attribute type identifier 1608 and creates a text object 1606 .
- FIG. 16 shows an overview of the domain construction process.
- the input files for the domain construction process 1604 include the following:
- this file contains one record per character, and each record contains:
- This file could also define how character combinations are translated into phonetic representations (e.g. “PH” → “F”). Phonetics is a known technique and a skilled person would be able to arrange appropriate translation tables.
- a word consists of two or more alphabetic characters. These tokens are represented in the grammar by the term “word”.
- a number consists of one or more numeric characters. Represented in the grammar by the term “nbr”.
- the basic premise of the grammar file is to define all possible tree structures for the text objects created in its language domain.
- the grammar file consists of a number of grammar rules in the form “A → B1 B2 B3 ...”.
- Each grammar rule consists of a LHS symbol <A> and zero, one or many RHS symbols <Bn>.
- the LHS symbol <A> is the name of the component type and the RHS symbols <Bn> define its sub-components.
- Each of the RHS symbols <Bn> can be one of the following:
- each attribute type i.e. LHS symbol
- a “match weight adjustment”. This is used to vary the default match weighting. Match weightings are used when comparing text objects to indicate the relative importance of sub-components during the calculation of the matching confidence.
- each grammar rule can be assigned a “parsing priority”. This is used during the construction of text objects to assist in selecting the best structure for the text object when two or more ambiguous structures are available.
- FIGS. 20 and 21 provide flow charts of the domain object construction process.
- the character definition data is loaded into memory at step 2002 , then the regular expression definition loaded at step 2003 .
- Processing continues by reading the grammar definition data and, for each rule in the grammar 2004, processing the grammar rule 2005 as follows: create a new rule in the temporary rule table 2102; use the LHS symbol of the rule to create a new symbol/component type in the Symbol table if it does not exist already; and then, for each symbol on the RHS of the rule (step 2104), if it is a literal 2105, add it to the dictionary 2106; if it is a recognised “regular expression” term such as “word” or “nbr” 2107, do nothing 2108; otherwise it is an attribute/symbol, and it is added as a new symbol/attribute type to the Symbol table if it does not exist already at step 2109.
- processing continues at step 2006 by checking that each symbol/attribute type added to the Symbol table has been defined, i.e. has appeared at least once on the LHS of a grammar rule (step 2007). If any symbols/attribute types are undefined, an error condition is set at step 2011, and the procedure terminates and returns to the caller 2012. Otherwise processing continues at step 2008. Again, for each symbol/attribute type added to the Symbol table, a parse table is created at step 2009, and a reference to this new parse table is recorded in the corresponding Symbol table entry. After all the required parse tables have been created, the procedure terminates and returns to the caller 2012.
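- The grammar-processing part of the construction process can be sketched as follows. This Python sketch rests on several assumptions: rules are supplied as (LHS, RHS list) pairs rather than read from a file, a literal is recognised by being written in double quotes, parse table generation is not modelled, and the sample grammar and the name `build_domain` are invented for illustration.

```python
from typing import Dict, List, Set, Tuple

REGEX_TERMS = {"word", "nbr"}        # recognised "regular expression" terms

Rule = Tuple[str, List[str]]


def build_domain(rules: List[Rule]) -> Tuple[Dict[str, List[List[str]]], Set[str]]:
    """Build a symbol table and dictionary, rejecting undefined symbols."""
    symbols: Dict[str, List[List[str]]] = {}   # LHS symbol -> its rules
    dictionary: Set[str] = set()
    referenced: Set[str] = set()
    for lhs, rhs in rules:
        symbols.setdefault(lhs, []).append(rhs)
        for sym in rhs:
            if sym.startswith('"'):            # a literal: add it to the dictionary
                dictionary.add(sym.strip('"').lower())
            elif sym in REGEX_TERMS:           # a regular expression term: do nothing
                continue
            else:                              # another attribute/symbol
                referenced.add(sym)
    # Every referenced symbol must appear at least once on the LHS of a rule.
    undefined = sorted(referenced - set(symbols))
    if undefined:
        raise ValueError(f"undefined symbols: {undefined}")
    return symbols, dictionary


GRAMMAR: List[Rule] = [
    ("Address", ["Street", "Town"]),
    ("Street", ["nbr", "Street Name", "Street Type"]),
    ("Street Name", ["word"]),
    ("Street Type", ['"street"']),
    ("Street Type", ['"st"']),
    ("Town", ["word"]),
]

symbols, dictionary = build_domain(GRAMMAR)
```

- Dropping the rule that defines `Town` from this sample grammar would leave `Town` referenced but never defined, and the sketch would then raise an error in the same way the described process sets an error condition.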
- domain object 1605 can be saved to memory or loaded to operate on a record of free-format data.
- a “text object index” 109 (FIG. 1) is used as a means to perform normal database operations on the “virtual data” fields of a plurality of text objects and their associated free-format text.
- the text object index differs from this method in two major ways. 1) All constituent parts of the free-format text are classified and used to reference the index. (i.e. not just the nouns). 2) There are no relationship links between objects.
- the main part of the text object index is a three column table with the following fields:
- a query is performed to check if the following address exists in the database.
- a query is performed to find all addresses which contain the Street:
- Key word search techniques applicable to this invention include:
- the interface of the text object index is designed to mirror the standard commands of SQL.
- SQL is the “Structured Query Language” of relational databases and is very well known within the computer industry.
- this operation makes all the required changes to the text object index so that the respective text object reference can be located using any similar free-format text or subcomponent thereof.
- This operation returns all references (normally record identifiers supplied by the system user) to free-format texts which contain the supplied free-format text. For example: to locate all records which contain “Box Rd”.
- This operation takes the user supplied reference key and deletes all records with that reference key.
- This operation updates the entries for a modified text object by first deleting all the previous entries and then reinserting new entries using the “Insert” operation described above.
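The four index operations described above can be illustrated with a small in-memory sketch. It assumes the three-column layout described elsewhere in this specification (attribute type identifier, representative value key, user supplied record identifier). The class name `TextObjectIndex`, the helper `make_key`, and the use of case-folded text as the representative value key are assumptions for illustration; the sketch also assumes the free-format text has already been parsed into (attribute type, element) pairs.

```python
def make_key(text):
    # Assumed stand-in for the representative value key: case-folded text.
    return text.casefold()


class TextObjectIndex:
    """Rows are (attribute_type, representative_key, record_id) tuples."""

    def __init__(self):
        self.rows = set()

    def insert(self, record_id, components):
        """components: iterable of (attribute_type, element_text) pairs."""
        for attribute_type, text in components:
            self.rows.add((attribute_type, make_key(text), record_id))

    def select(self, components):
        """Return the record ids whose entries cover every supplied component."""
        result = None
        for attribute_type, text in components:
            key = make_key(text)
            hits = {r for a, k, r in self.rows if a == attribute_type and k == key}
            result = hits if result is None else result & hits
        return result or set()

    def delete(self, record_id):
        """Remove all rows carrying the supplied reference key."""
        self.rows = {row for row in self.rows if row[2] != record_id}

    def update(self, record_id, components):
        """Delete the previous entries, then reinsert the new ones."""
        self.delete(record_id)
        self.insert(record_id, components)
```

For example, after inserting a record parsed as `[("StreetName", "Box"), ("StreetType", "Rd")]`, a `select` with the same two pairs returns that record's identifier, which corresponds to locating all records that contain “Box Rd”.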
- a typical matching procedure could perform the following steps:
- text string matching is performed on certain low level matching component nodes.
- the values used in steps 1, 2, 4 and 5 of the above procedure may be generated each time the string comparison is done, or alternatively may be generated once when the text object is created and stored within the respective component node. These values could also be used as the “representative value key” in the text object index described above.
- Steps 4 and 5 of the above procedure allow the invention to compare free-format data in foreign language text e.g., Japanese Kanji.
- a phonetic value can be stored for the Kanji symbols, and can be used to compare the Kanji with elements of other free-format data which may not be in Kanji. In other words, this feature facilitates the processing of free-format data in foreign languages. See FIG. 17 and the previous description.
- FIG. 22 gives an example of how this invention could be implemented within a SQL relational database implementation.
- a description of the SQL statements is as follows:
- Any free-format data record may be analysed by applying the present invention and by constructing the appropriate domain using the appropriate domain construction process and appropriately designed input files. All data can be analysed by computer in this way to produce text objects for all free-format descriptions.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
A method and apparatus for processing free-format data (301) to produce a “text object” associated with the free-format data. The text object comprises a plurality of “component nodes” (302-312) containing attribute-type identifiers for elements of the free-format text and other data facilitating access to the text object to obtain information and/or change or add the free-format data. This arrangement obviates the need for the provision of separate database fields for each element of the information. Free-format data can therefore be processed in a similar manner to the way a human being processes free-format data. All elements can be accessed via the constructed text object.
Description
- The present invention relates generally to the processing, storage and analysis of information in the form of free-format data, and particularly, but not exclusively, to a method and apparatus for interpreting free-format text.
- One of the main purposes of computer systems is to manage information. This management of information is performed internally by data management systems. Generally, data management systems may be divided into two categories: 1) Database management systems; and 2) Text search and retrieval systems.
- The first type of data management system imports and manipulates data into internal representations so that the data may be located and modified. When required, these systems generate a suitable representation of this data which is read by humans or used by another system. This category of data management system includes: hierarchical, network, relational, object-oriented database management systems and knowledge based management systems.
- Within hierarchical, network and relational databases, information about an entity (a transaction, a stock item, a person, a company, an address etc.) is usually referred to as a “record” (although sometimes a record may contain information about many entities). Within each record the various “attributes” of the entity are usually classified into “fields”.
- Within object-oriented database management systems and knowledge based management systems these basic units may have other names such as “object” and the information regarding the object may have names such as “slot” or “member”. Each of the attribute fields/slots has a format which can be, for example, integer, real number, boolean, character etc. Others are records/objects. Some fields/slots have specific formats (e.g., date, time), but yet others are free-format text.
- Once the database has been constructed, it may be used to perform the following operations:
- Add a record/object
- Locate and change a record/object
- Locate and delete a record/object
- Retrieve information
- These operations will be referred to as “normal database operations”.
- Storing of information about an entity in fields/slots is suitable for many types of data. There are, however, some types of data which do not have a suitable standard structure. One of the best examples of data which does not have a standard structure is “address” data. As most databases store people's address information in one, two or three free-format fields, performing normal database operations on individual attributes of the address is very difficult. Note that the term “attribute” is used in this specification to refer to a property of an “element” of data.
- For example, the free-format data “12 Pitt Street, NORTH SYDNEY” has a number of “elements”. Each element has an associated “attribute”. An attribute of the element “NORTH” is that it is a “geographical indicator”. An attribute of the element “12” is that it is a “number”. Note that the “low level” elements correspond to the “tokens” of data, i.e., the element “NORTH” is a token of the data. The data also includes higher level elements, however, e.g., “NORTH SYDNEY” is an element which includes two tokens and this element has the attribute of being a “town”. An attribute of the entire data “12 Pitt Street, NORTH SYDNEY”, i.e., the total “element”, is that it is an “address”. An alternative term for element is “component”.
- For each element of this free-format data to be provided with its own field for the associated attribute would increase the size and complexity of the database quite significantly, even for this simple example of addresses. Where the database includes information on people, together with their addresses, for example, in order to avoid complexity, and particularly with older databases, address data may be stored in a single field labelled “address”. This field contains the address in free-format form and it is therefore not possible with present database technology to perform normal database operations on individual elements of the address—those elements cannot be accessed separately (apart from the total combination of elements which makes up the address, which can of course, be accessed as a whole, as “address”).
- This problem is to some extent addressed by the science of database scrubbing/cleansing. This field of commercial endeavour applies parsing processes to free-format text with the objective of creating new database fields for the attributes of the free-format text and entering into those fields completely standardised data. This standardising of data includes converting all spelling variations into one consistent set (e.g. “Street” → “St”). The above example would produce the following:
House Number | Street Name | Street Type | City
12 | Pitt | St | Sydney
- The new database fields are then used to perform normal database operations. An entire industry is devoted to this field, applying large, complex and expensive software packages to take information stored in databases, analyse and process the information to produce new databases including more fields for the attributes of the information records, thus providing more flexibility for operations which can be applied to the records.
- Much has been written about the field of database cleansing/scrubbing (see e.g., “Dealing with Dirty Data” DBMS Magazine, September, 1996). The process is expensive: a complete cleansing operation for a large database can cost millions of dollars, because it is so time consuming and the software packages that have been developed to cleanse databases are very complex. It is also still limited by the fundamental requirement that to perform database operations on an element, the element must have a field to itself.
- This brings us to the second major problem which afflicts the present methods of storing computerised information in commercial databases. Practically all commercial data is stored within hierarchical or relational databases or flat data files which have a structure which is fixed at time of design, but information by its very nature is complex and can have an almost infinite number of different attributes. To create a database containing fields for each and every attribute of all types of information is impractical, if not impossible, and the cost of any attempt to do so for all the types of information available to humanity would be prohibitive.
- Even a fairly trivial (although very important) example illustrates the scale of the problem. Consider international addresses, i.e., addresses the world over. Although four or five free-format fields can contain any address, a database table with a data field for every possible attribute of all international addresses would contain hundreds, if not thousands, of data fields. England has counties, USA and Australia have states, Japan has districts and different orders of addresses, etc.
- The field of database cleansing/scrubbing is therefore a partial solution at best. It still requires the same fundamental database structure of a field for each data attribute. One can build more and more complex databases but this problem can never be completely resolved, and limits the computerised handling of information significantly.
- Natural language processing systems are known that employ “Semantic Grammars” to encode semantic information into a syntactic grammar. These systems are mainly used to provide a natural language interface to other systems such as a database management system. The following description comes from a book by Patterson, D. W., “Artificial Intelligence and Expert Systems”.
“. . . They use context-free rewrite rules with non-terminal semantic constituents. The constituents are categories or metasymbols such as attribute, object, present (as in display or print), and ship, rather than NP (Noun Phrase), VP (Verb Phrase), N (Noun), V (Verb), and so on. . . . Semantic grammars have proven to be successful in limited applications including LIFER, a data base query system distributed by the US Navy . . . and a tutorial system named SOPHIE which is used to teach the debugging of circuit faults. Rewrite rules in these systems essentially take the forms
S −> What is <OUTPUT-PROPERTY> of <CIRCUIT-PART>?
OUTPUT-PROPERTY −> the <OUTPUT-PROP>
OUTPUT-PROPERTY −> <OUTPUT-PROP>
CIRCUIT-PART −> C23
CIRCUIT-PART −> D12
OUTPUT-PROP −> voltage
OUTPUT-PROP −> current
In the LIFER system, there are rules to handle numerous forms of wh-queries such as
What is the name of the carrier nearest to New York?
Who commands the Kennedy?
etc . . . These sentences are analyzed and words matched to metasymbols contained in lexicon entries. For example, the input statement ‘Print the length of the Enterprise’ would fit with the LIFER top grammar rule (LTG) of the form
<LTG> −> <PRESENT> the <ATTRIBUTE> of <SHIP>
where print matches <PRESENT>, length matches <ATTRIBUTE>, and the Enterprise matches <SHIP>. Other typical lexicon entries that can match <ATTRIBUTE> include CLASS, COMMANDER, FUEL, BEAM, LENGTH, and so on.”
- These types of systems receive information in structured or free-format form and convert it to their own representations.
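To make the quoted rule concrete, the following rough sketch matches a sentence against the single top-level rule `<PRESENT> the <ATTRIBUTE> of <SHIP>`. It is not LIFER's actual algorithm: the lexicon contents beyond those named in the quotation, the names `LEXICON`, `RULE` and `match`, and the greedy longest-phrase strategy without backtracking are assumptions for illustration.

```python
# Illustrative lexicon; entries are lower-cased for matching.
LEXICON = {
    "PRESENT": {"print", "display"},
    "ATTRIBUTE": {"class", "commander", "fuel", "beam", "length"},
    "SHIP": {"the enterprise", "the kennedy"},
}

# <LTG> -> <PRESENT> the <ATTRIBUTE> of <SHIP>
RULE = ["<PRESENT>", "the", "<ATTRIBUTE>", "of", "<SHIP>"]


def match(rule, sentence):
    """Return {metasymbol: matched phrase}, or None if the sentence does not fit."""
    words = sentence.lower().split()
    bindings = {}
    pos = 0
    for term in rule:
        if term.startswith("<"):
            name = term[1:-1]
            # Greedy: try the longest remaining phrase first (no backtracking).
            for end in range(len(words), pos, -1):
                phrase = " ".join(words[pos:end])
                if phrase in LEXICON[name]:
                    bindings[name] = phrase
                    pos = end
                    break
            else:
                return None
        else:
            # Literal word in the rule must match the next input word exactly.
            if pos >= len(words) or words[pos] != term:
                return None
            pos += 1
    return bindings if pos == len(words) else None
```

With this lexicon, `match(RULE, "Print the length of the Enterprise")` binds `PRESENT` to "print", `ATTRIBUTE` to "length" and `SHIP` to "the enterprise".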
- Although the interface is flexible, the database these systems interface to has a fixed structure, and the systems are unable to perform changes on the original (human readable) data.
- Indeed there are many prior art systems which provide “Natural Language” interfaces to structured databases. All of these systems provide translation from “Natural Language” into some form of structured data and suffer from the same problems described above.
- Refer to U.S. Pat. No. 4,787,035, Bourne, D. “META-INTERPRETER” and U.S. Pat. No. 5,454,106, Burns, L., Malhotra, A., “Database retrieval system using natural language for presenting understood components of . . . ” for examples of such systems.
- As discussed earlier, one type of database management system is the knowledge based management system (KBMS).
- These systems employ the concept of attribute “slots” on an object. Slots provide or change information regarding the object either directly onto the stored values or indirectly through procedures. A simple example of “slots” will illustrate the concept: a “Square” object has two attribute slots “Length” and “Area”. The “Area” slot does not need to store a value because its value can be calculated by squaring the “Length” value.
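The “Square” example can be expressed in a few lines. This is only an analogy using an ordinary Python class and a computed property, not a knowledge based management system; the class and attribute names are chosen for the illustration.

```python
class Square:
    def __init__(self, length):
        self.length = length      # "Length" slot: a stored value

    @property
    def area(self):
        # "Area" slot: no stored value, computed from "Length" by a procedure.
        return self.length ** 2
```

Changing `length` automatically changes what `area` reports, because nothing is stored for the second slot.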
- Although these types of systems do not require fixed database structures, they do, however, need to transform the original data into internal data representations which must be put through a very processing-intensive “language generation” process to produce information that is understandable by humans. If these types of systems were required to maintain the original data for use by other systems and humans, a small change would require the whole text string to be regenerated.
- The text search and retrieval category of data management system does not import the data but builds searchable indices which point to the original data. This category includes: document storage & retrieval systems; and Internet search engines.
- These types of systems have been very successful because they leave the original information in human readable form. This basic principle means that, unlike the prior art database systems described above, the underlying data can be very easily shared with many systems of this type. Another reason for their success is that improvements in technology can be implemented without requiring conversion of the original data. Data conversion is not only extremely expensive, but it is also a major source of data errors.
- There are, however, major drawbacks in using this type of system to manage data, compared with the database systems described above. The major limitation is that the data cannot be manipulated: it cannot be modified and must remain as it is. Other database functions which are very difficult to perform include:
- Cross checking and validating the data
- Integrating the data with database systems
- Sorting and classifying the text data
- From these limitations, we can see that this category of data management system is suited to unstructured data which does not need to be changed.
- In text search and retrieval systems, it is known to process a document collection to identify specific attributes of each document such as its “subject” topic. The types of documents which have been processed by this type of system include books, newspapers, reports, manuals and e-mail messages.
- Most of these types of systems, however, only look for individual words to match and do not look at words in context. Some others identify words that are nouns but do not classify the type of noun. Both are unsuitable for data such as address data, which contains a large proportion of proper nouns.
- Further, the original data cannot be changed within context.
- For more information regarding this area, refer to the works published by Gerald Salton.
- Note that the term “text object” as used in the following description should not be confused with the terminology “text object” which has been used in systems to describe software techniques which assist in the storage and transfer of pieces of text data between computer systems by encapsulating the text string. Techniques which have used the term “text object” range from the “String” object employed within Apple Computer's operating systems (where the object contains a leading two byte “length” value and the text string) to the “Compound String” object employed by the X-Windows operating system (where the object encapsulates multiple encodings, language translations and font styles of one piece of information.)
- From a first aspect the present invention provides a method of processing free-format data stored in a computing system, comprising the steps of examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, producing additional data relating to this information, in the form of a text object which includes pointer means enabling access to the elements of the free-format data, and the additional data being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- The term “text object” as used in the current specification does not denote an encapsulated text string of the kind discussed above. The text object in the terms of the present invention provides a “semantic layer” between the actual text data and, for example, an application software system which may need to access and/or manipulate the text data.
- In its simplest form, as defined above, the text object is the additional data, related to the semantic and syntactic information obtained from examination of the data elements, and a pointer means (such as a key) which can lead back to the elements of the free-format data (e.g., back to the text string which forms the free-format data).
- The additional data preferably allows identification of the attributes of the data which have been obtained by the examination of the data. For example, in the “12 Pitt Street, NORTH SYDNEY” example given in the preamble, the various attributes of the data, e.g., “street” equals “12 Pitt Street”; “street number” equals “12”; “town” equals “NORTH SYDNEY”, etc., are identified by the additional data and the pointer means preferably allows access to the elements of the data which are associated with those attributes. The additional data effectively provides “virtual data fields”—the data fields do not exist as they do in a normal database which would have a column field head for each attribute. Nevertheless the free-format data can be accessed on an attribute by attribute basis using the present invention, as if actual fields for those attributes did exist. The preferred embodiment of the invention thus operates to create “virtual data fields” which, preferably, allow all normal database operations on free-format text, without having to create actual database fields for the free-format text. The free-format text can remain stored as it is in the same location (usually database).
- The significance of this becomes apparent when one considers the processing of many records of free-format data, for example international address data. As discussed above, although four or five address fields could store all international address data in free-format form, each data record can have many attributes, which differ from attributes of other addresses e.g., England has counties, the USA has states. To produce actual conventional database fields for all the attributes for international addresses would be an almost impossible task. However, with the present invention, each record of free-format data can be taken and processed to produce a (small) number of virtual data fields for that particular record in the form of a text object. The text object for each record can then be queried separately by an appropriate query processing means to provide all the normal database operations for that record. The data itself may stay in place. As a separate text object is created for each record, there is no problem with having different virtual data fields for each record. We do not have to create a large database with many fields, instead we leave the database records as they are and create many text objects, one for each record, to give many virtual fields overall, but few virtual fields for each text object.
- The step of examining preferably includes the step of parsing the free-format data.
- A text object preferably enables manipulation of the data to carry out all the normal database operations, such as changing the record, locating an element of the record, retrieving information from the record, etc. The information which may be provided by the text object preferably includes information on the elements of the data. In a preferred embodiment, the information may also include matching information (such as phonetics) to facilitate comparison of one record of data with another record of data, parsing priority information to assist in the processing of ambiguous free-format text, etc.
- It is believed that this new approach will lead to computers being able to manipulate free-format data in much the same way as human beings do. There is no need to disassemble the data record according to its attributes and place standardised values for each attribute type into an appropriate field in a database (as is conventional practice), once the appropriate column names for the database have been determined. Each text object for each data record provides all the processing and information the computer needs to provide all the normal database operations. The attribute types of, for example, international addresses can be compared, manipulated, etc., without it being necessary to provide a complex database with many fields.
- The text object preferably includes attribute-type identifiers accessible to enable identification of attributes of the free-format data and pointer means for locating elements of the data having the particular attribute.
- In a preferred embodiment, the text object comprises a plurality of parts in the form of “component nodes”. Preferably, a plurality of component nodes may be associated together in a text object in a predetermined hierarchy. For example, a plurality of component nodes may be considered to be “nested” together in the form of a “text node tree” which may have a plurality of branches associating various component nodes with each other in a predetermined hierarchy. Each component node may comprise:
- an attribute type identifier (for the classification of an attribute of the free-format data which is associated with that component node);
- a pointer to the beginning of a sub-string within the text object's text string (i.e. beginning of the element associated with the component node);
- an integer containing the character length of the element sub-string (of the data);
- zero, one or more other component nodes (nested within this component node or otherwise associated with the component nodes so that the other component nodes can be accessed via the component node) preferably stored as an array;
- a matching weight (to indicate the relative importance of this element when performing comparisons with other text objects);
- a boolean variable indicating whether this attribute type identifier is a low level matching element;
- depending on time/space considerations, one or more values to assist in the matching process (see section on “text string operations” below for more details); and
- a parsing priority value (giving a notional “priority” to the elements of the free-format data associated with the component node so that a priority may be allocated and used in the determination of the best interpretation of free-format text when ambiguities exist).
- Other component nodes may not be physically nested within the component node but each component node may just contain a list of pointers to subordinate component nodes so that the subordinate component nodes can be “found” from the component node which includes the list.
- Each component node preferably relates to one particular attribute of the free-format data, as identified by the attribute type identifier in the component node. Component nodes which are relatively high in hierarchy may contain or point to a plurality of other component nodes, whereas those component nodes which are the lowest in the hierarchy may not contain or point to any other component nodes as the next step down in the hierarchy is the associated element of the free-format data.
- The hierarchy is determined by the parsing of the free-format data. E.g., one attribute of a record of address data may be a <Street>, e.g. “12 Pitt Street”. Sub attributes of the <Street> component are <Street number> “12”, <Street name> “Pitt” and <Street type> “Street”. The <Street> component node will therefore list three other sub component nodes, having attribute type identifiers <Street number>, <Street name> and <Street type>.
- Preferably, each component node can itself be considered to be a text object. This recursive definition allows all the functions of the text object of the present invention to be applied to each attribute.
- The text object may also comprise other data structures which assist in the quick location of specific component nodes. An example of such a structure is a lookup table containing all the attribute type identifiers and a pointer to their associated component nodes.
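A minimal sketch of such a component node tree, with a lookup table from attribute type identifiers to nodes, might look as follows. The class name `ComponentNode`, the field names, the default values and the helper `build_lookup` are assumptions for illustration; offsets are character positions within the free-format string, and the tree is written by hand here rather than produced by a parser.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ComponentNode:
    attribute_type: str                 # attribute type identifier
    start: int                          # offset of the element in the text string
    length: int                         # character length of the element
    children: List["ComponentNode"] = field(default_factory=list)
    matching_weight: float = 1.0        # relative importance when comparing
    low_level_match: bool = False       # is this a low level matching element?
    parsing_priority: int = 0           # used to resolve ambiguous parses

    def text(self, source: str) -> str:
        """Return the element of the free-format data this node points to."""
        return source[self.start:self.start + self.length]


def build_lookup(root: ComponentNode) -> Dict[str, List[ComponentNode]]:
    """Map each attribute type identifier to the nodes carrying it."""
    table: Dict[str, List[ComponentNode]] = {}
    stack = [root]
    while stack:
        node = stack.pop()
        table.setdefault(node.attribute_type, []).append(node)
        stack.extend(node.children)
    return table


SOURCE = "12 Pitt Street, NORTH SYDNEY"
street = ComponentNode("Street", 0, 14, children=[
    ComponentNode("Street number", 0, 2),
    ComponentNode("Street name", 3, 4),
    ComponentNode("Street type", 8, 6),
])
town = ComponentNode("Town", 16, 12)
address = ComponentNode("Address", 0, len(SOURCE), children=[street, town])
```

Here `build_lookup(address)["Street name"][0].text(SOURCE)` yields "Pitt" without the address ever having been split into separate database fields; the original string stays as it is and the nodes only point into it.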
- The query processing means is preferably a software application engine which is configured to be able to use the text object to answer questions on the data and access the data to manipulate it (e.g., correct it if it is in error).
- The method preferably also includes the further step of preparing an “index” which facilitates comparison of elements of a plurality of records of free-format data. The index is preferably in the form of a table (termed by the inventors a “text object index”) including columns, column headings and data, very much in the same way as a conventional database, except that it is prepared from the additional data for each of the plurality of data records.
- The text object index preferably includes a table with a column for the attribute type identifier, a column for representative value keys and a column for user supplied record identifiers. The representative value key preferably provides a value representative of a feature of the element associated with the appropriate attribute type identifier, e.g., a phonetic value for elements which are proper nouns (e.g., Smith) or a numeric identifier for common words (e.g., Street). The section on text string matching below contains more details regarding the values of the representative value key. The user supplied record identifier identifies to the user which record of free-format data is being compared or accessed, i.e., it is a pointer which enables access to the record.
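One way to picture a representative value key is sketched below. The specification does not say which phonetic algorithm is used; American Soundex is swapped in here purely as a well-known example. The `COMMON_WORDS` table, its numeric identifiers, the choice to give “Street” and “St” the same identifier, and the function names are all hypothetical.

```python
# Soundex digit for each consonant; vowels, H, W and Y have no code.
SOUNDEX_CODES = {
    letter: digit
    for digit, letters in {
        "1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
        "4": "L", "5": "MN", "6": "R",
    }.items()
    for letter in letters
}

# Hypothetical table of common words and their numeric identifiers.
COMMON_WORDS = {"STREET": 1, "ST": 1, "ROAD": 2, "RD": 2}


def soundex(word):
    """American Soundex: first letter plus three digits, zero-padded."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    digits = []
    prev = SOUNDEX_CODES.get(letters[0], "")
    for c in letters[1:]:
        if c in "HW":
            continue                     # H and W do not separate equal codes
        code = SOUNDEX_CODES.get(c, "")
        if code and code != prev:
            digits.append(code)
        prev = code                      # a vowel resets the previous code
    return (letters[0] + "".join(digits) + "000")[:4]


def representative_key(text):
    """Numeric identifier for common words, digits as-is, phonetic code otherwise."""
    word = text.strip().upper()
    if word in COMMON_WORDS:
        return f"#{COMMON_WORDS[word]}"
    if word.isdigit():
        return word
    return soundex(word)
```

Under these assumptions “Smith” and “Smyth” share the key `S530`, so index entries for either spelling would be found by a query using the other.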
- Where a text object index is prepared, a text object having a plurality of component nodes containing attribute-type identifiers and other data may not be necessary. All that may be required to access the data and carry out database operations is the query processing engine and the text object index. The text object index may be prepared directly from the examination of the data and the text object index includes text objects for a plurality of records (i.e., additional data plus pointer to record). The text object as a separate “component node structure” can therefore be dispensed with or is not needed in the first place as a separate entity, instead it is incorporated in the text object index as additional data plus pointers.
- Where the text object includes “matching” values (or procedures to create these values) for low level matching elements of the free-format text, it is possible, for example, to compare records including elements which are in different written languages. For example, a free-format record containing a street name value in Kanji, may be compared with a street name element in Arabic by comparing respective matching values. The street name for each record could be the same street, but merely being expressed in different languages in the free-format data. The matching information provided by this aspect of the present invention therefore enables comparison of elements of free-format text expressed in different written languages.
- Matching values may be generated during processing of text objects, and need not be stored in the text object. That is, they can be generated “on the fly” via procedures designated by the query processing engine. See later on in the description.
- In the method of the present invention, the step of examining the elements of the data to determine the components preferably comprises the step of parsing the free-format data in accordance with grammar rules applied by a domain object. The domain object is preferably constructed by a domain construction process which uses as input data: character definition data, regular expression definition data, and grammar data.
- The hierarchy of the component nodes of the text node tree is preferably determined by the grammar rules for the particular domain object.
- An embodiment of the present invention may be implemented by a software application which includes a domain object and a query processing means. The domain object is arranged to examine free-format data to produce a text object which can be then used by the query processing means to enable all database operations on the free-format data. The free-format data may be stored in any conventional way, such as in a conventional database on a computer system. The free-format data may also be stored as a string in the text object. The software application comprising the domain object and query processing engine would be used to process the data without affecting its storage in the database. Other software applications could therefore interface with the database as normal, i.e., the database remains totally unaffected as far as its operation is concerned apart from the fact that the domain object and query processing means can be used to enhance the capabilities of the database by providing access to all the elements of the free-format data.
- As well as allowing access to data in free-format data fields which has previously been unavailable without data cleansing and preparation of new databases with more fields, the present invention also has great potential for the future structuring and ordering of data. For example, using the present invention it may be possible to greatly reduce the number of fields which are required to store data in a database. Considering the example given above, of international name and address data, at present it is not possible for a database to deal with international address data in a single field, because international address data has many different attributes. With the present invention, however, international addresses may be kept in a single free-format field containing all the international address records. Processing by the present invention provides each individual international address record with its own set of virtual data fields allowing comparison with other records via the query processing means, manipulation and access to information of all elements of each data record. Indeed, it is possible to provide a single domain object for all international addresses. Any free-format data could be processed in this way. The invention is not limited to address data.
- From yet a further aspect, the present invention provides a method of enabling access to free-format data stored in a computing system, including a plurality of free-format data records, comprising the steps of storing additional data relating to semantic and syntactic information (attributes) about the data for each data record, the additional data being in the form of a text object associated with each data record, the text object including pointer means enabling access to elements of each free-format data record, the additional data being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- The text object preferably includes any or all of the properties of the text object as discussed above in relation to the first aspect of the invention and the text object is preferably produced by an examination including any or all of the features as discussed above.
- The present invention further provides a method of enabling access to free-format data stored in a computing system, including a plurality of free-format data records, comprising the steps of storing additional data relating to semantic and syntactic information (attributes) about the data of each data record, the additional data being in the form of a text object index which includes attribute type identifiers for elements of each data record and pointers to each data record, the text object index being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- The text object index preferably includes any or all of the properties of the text object index as discussed above in relation to the first aspect of the invention. The text object index is preferably produced by process steps as discussed above in relation to the first aspect of the invention.
- From yet a further aspect, the present invention provides a processing system for processing free-format data stored in a computing system, the apparatus including means for examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, means for producing additional data relating to this information, in the form of a text object which includes pointer means enabling access to the elements of the free-format data, and a query processing means which is arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- Preferably, the examination means and the means for producing are arranged to produce a text object with any or all of the features as discussed above in relation to the first aspect of the invention, preferably by applying the same methods of examination.
- The present invention further provides a processing system for enabling access to free-format data stored in a computing system, including a plurality of free-format data records, the processing system comprising additional data relating to semantic and syntactic information (attributes) about the data for each data record, stored and accessible by the processing system, the additional data being in the form of a text object associated with each data record, the text object including pointer means enabling access to elements of each free-format data record, and a query processing means arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- The present invention further provides a processing system for enabling access to free-format data stored in a computing system, including a plurality of free-format data records, the processing system comprising the additional data relating to semantic and syntactic information (attributes) about the free-format data for each data record, the additional data being in the form of a text object index which includes attribute type identifiers for elements of each data record and pointers to each data record, and a query processing means arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- The present invention yet further provides an apparatus including a domain object arranged to process free-format data to produce a text object, the text object including any or all of the features of the text object as discussed above in relation to previous aspects of the present invention.
- In a preferred embodiment, the step of accessing the text object may comprise querying one or more text objects for attributes and obtaining the value of the element corresponding to the queried attribute. For example, where the free-format data is name and address data, a person may query the text object or objects to see if there is a <Street> element, and, if so, obtain the value of the element (e.g., “12 Pitt St”). This is something that cannot be done with present databases where the “address” field merely holds the whole <address> in free-format form. Other older systems provide search facilities which scan for a particular text string without regard for the semantics of the text being searched. These systems could be used to find all addresses with a street name of “Pitt” by searching for that string. This leads to problems when the string being searched for can be used in different ways, as in the following addresses:
- “76 Box Rd Townsville QLD”
- “PO Box 92 Geelong VIC”
- “39 Main St Box Hill NSW”
- Attempting to locate all the addresses with a street name of “Box” by scanning for the string “Box” will generate many errors. The present invention, in the preferred embodiment, will report only those addresses in which the term is used as the street name. So, searching for a street name of “Box” will return records such as:
- “8 Box Ave Devonport TAS”
- “76 Box Rd Townsville QLD”
- “110 Box St Parramatta NSW”
- Consider the address examples in FIG. 2 of the drawings, and a system user who wishes to locate all the addresses on “Box Rd” within this data. If the user searches for “Box Rd”, the system would return record 201, but miss records records record 206 when specifying “Box Rd”.
- Another example where string searching without considering the semantics can lead to erroneous results is when <Street Names> have the same names as <Town Names>. For example: “123 Sydney Ave, Melbourne VIC”. String searching will not allow one to find only records with “Sydney” as their town name.
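- The difference between scanning for a string and querying by attribute can be illustrated with a minimal Python sketch. It assumes the records have already been parsed into attributes; the dictionary keys (`street_name`, `town`) and both function names are hypothetical, and the attribute values are written out by hand rather than produced by a domain object.

```python
# Hypothetical pre-parsed records. In the described system a domain object
# would derive these attributes; here they are simply written out by hand.
records = [
    {"text": "76 Box Rd Townsville QLD", "street_name": "Box", "town": "Townsville"},
    {"text": "PO Box 92 Geelong VIC", "street_name": None, "town": "Geelong"},
    {"text": "39 Main St Box Hill NSW", "street_name": "Main", "town": "Box Hill"},
    {"text": "123 Sydney Ave, Melbourne VIC", "street_name": "Sydney", "town": "Melbourne"},
]


def string_search(term):
    """Plain substring scan, blind to the role the term plays in the address."""
    return [r["text"] for r in records if term.lower() in r["text"].lower()]


def attribute_search(attribute, value):
    """Return a record only when the value fills the requested attribute."""
    return [r["text"] for r in records
            if (r[attribute] or "").lower() == value.lower()]


print(string_search("Box"))                    # every record containing "Box"
print(attribute_search("street_name", "Box"))  # only the "Box Rd" record
print(attribute_search("town", "Sydney"))      # no record has Sydney as its town
```

- In this sketch the substring scan returns the post office box and the “Box Hill” address along with the “Box Rd” address, whereas the attribute query returns only the record whose street name is “Box”.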
- The step of accessing the text object may also include comparing two text objects and ascertaining and providing a confidence value that indicates how closely the two text objects match. For example, two street addresses may be compared by comparing their respective text objects, and a confidence value (in percentage points) can be given depending on how closely they match.
- The step of accessing may also include the step of changing a value associated with a particular component. Common examples include changing a woman's surname after marriage and correcting a street or town name after a mistake has occurred.
- There are also many cases where governments change street names, postcodes (e.g. Australia's Northern Territory changed its postcode range from 5800-5999 to 0800-0899), or even whole city names (e.g. Leningrad to St Petersburg).
- This ability of the present invention to change a value of a particular element of the original piece of text has the benefit that the operations of legacy computer systems which use the data directly (i.e. without using text objects) will not be affected.
- Yet a further aspect of the present invention provides a processing system for enabling access to free-format data processed in accordance with the method of any one of claims 1 to 19, the processing system including a query processing means arranged to access the additional data and provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
- The apparatus may include means for accessing the text object in accordance with any or all of the method steps given above.
- The present invention yet further provides a processing system for processing free-format data stored in a computing system, comprising means for examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationship of elements to each other, to determine semantic and syntactic information (attributes) about the data, and a query processing means for utilising this information to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data.
- The means for examining may comprise a domain object which examines the elements and produces virtual data (being data relating to the semantic and syntactic information about the data) which is used by the query processing means to access the data and obtain information on attributes of the data.
- The present invention yet further provides a method of processing free-format data stored in a computing system, comprising the steps of examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, and querying the data using this information to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data.
- From yet a further aspect the present invention provides a method of processing a plurality of records of free-format data stored in a computing system, comprising the steps of, for each record examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, and producing virtual data fields enabling access to this information and the associated elements for each data record, whereby each record is provided with associated virtual data fields enabling access to semantic and syntactic information about that record and also access to the associated elements.
- The term “virtual data fields” is used in the same sense as previously. Unlike prior art conventional databases, where it is necessary to process the information and produce actual data fields, no separate data fields need be produced. The data may remain in place where it is in the database, and instead an associated “virtual field” is produced for attributes of the semantic and syntactic information, and the virtual fields can be queried to obtain all the information required of the record, and preferably all normal database operations may be implemented.
- The present invention yet further provides a processing system for processing a plurality of free-format data records stored in a computing system, comprising means for examining elements of the data of each record to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about each record, and means for producing virtual data fields associated with each record enabling access to this information and the associated elements, whereby each record is provided with associated virtual data fields enabling access to semantic and syntactic information about that record and also access to the associated elements.
- Features and advantages of the present invention will become apparent from the following description of an embodiment thereof, by way of example only, with reference to the accompanying drawings, in which:
- FIG. 1 is a diagram illustrating the architecture of a system for enabling the processing of free-format data in accordance with an embodiment of the present invention;
- FIG. 2 illustrates sample “address” data;
- FIG. 3 is a more detailed structural view of an example text object produced by operation of the embodiment of the invention on free-format data;
- FIG. 4 illustrates sample “address” formats;
- FIG. 5 is a flow chart illustrating a method for getting a sub-component of a specific type from the text object of the invention;
- FIG. 6 illustrates the results of the get sub-component method;
- FIG. 7 is a flow chart illustrating a method for modifying a sub-component of a text object of the invention;
- FIG. 8 is an illustration of the mechanics of modifying a text object of the invention;
- FIGS. 9, 10 and 11 provide an example of modifying a text object of the invention;
- FIG. 9 shows a text object before modification;
- FIG. 10 shows the replacement text object; and
- FIG. 11 shows the text object referred to in FIG. 9 after it has been modified;
- FIG. 12 is a flow chart illustrating the node matching subroutine used by other methods;
- FIG. 13 illustrates examples of text objects in accordance with embodiments of the present invention for illustrating a method of comparison of text objects in accordance with an embodiment of the present invention;
- FIG. 14 is a flow chart illustrating the “adjust node” subroutine used by other methods;
- FIG. 15 is a diagram illustrating the architecture of the domain object block of FIG. 1;
- FIG. 16 is an illustration of the domain construction process of FIG. 1 in more detail;
- FIG. 17 provides two examples of standard transliteration tables, one for Japanese Katakana and one for Greek;
- FIG. 18 contains tables illustrating Regular Expression Definition data;
- FIG. 19 illustrates a demonstration grammar data file;
- FIGS. 20 and 21 provide flow charts of the domain object construction process block of FIG. 1;
- FIG. 22 illustrates an example session with an implementation of the invention within an SQL relational database system.
- Although the following descriptions use English name and address examples, the invention can be equally applied to any domain of free-format text.
- As discussed in the preamble of this specification, the present invention relates to an entirely new concept and approach for processing computerised information, in particular free-format data. As discussed above, the idea is to produce from the free-format data a “text object” which may be stored in a computer and which can be used to obtain information about the free-format data, compare records of free-format data and manipulate the data. This is achieved without it being necessary to construct complex databases having many fields.
- FIG. 1 is a diagram showing the configuration of an entire “virtual data” system in accordance with an embodiment of the present invention. It comprises a user interface 101 and a processor 102. The processor 102 can be a standard computer system and has a general configuration such as a CPU, a computer memory and a mass storage device. The user interface 101 can be a standard keyboard and VDU, and/or an interface to another computer system. User interfaces like these, along with other equivalent interfaces, are well known.
- For the purposes of the internal storage requirements of the invention, no distinction will be made between the computer memory and the mass storage device, and both will be referred to as memory.
- Loaded into the memory of the processor 102 is standard system software well known to those skilled in the art, such as an operating system and a database system (not shown), one or more application software systems 103 such as an accounting package or word processor, and an embodiment of the present invention 104, for producing text objects 105 from free-format data. The system 104 comprises a domain construction process 106 which is arranged to take a plurality of input data 107 (in this example in the form of data files) and build a domain object 108 which is used to produce text objects 105. Each “domain” will include all the grammar and syntax rules necessary for that particular domain of free-format data. For example, one domain may be international names and addresses and will include all the information necessary to analyse free-format international name and address data to produce a text object. Another domain may be a commodity description knowledge base, and another may be a transportation industry knowledge base. Domains may be produced to handle any free-format data. The domain construction process 106 is essentially an engine which works on the knowledge bases (input files) for the particular domain type to produce the domain object 108 for that type.
- Referring again to FIG. 1, a text object index 109 may be produced by processing a number of text objects 105, and this will be described later.
- It should be noted, as shown in FIG. 1, that the invention 104 provides a layer between general application software systems 103 and their stored data 110. Unlike the “Knowledge Based Management Systems” described above, this invention allows the free-format data to remain in its original location and legacy application software to operate using the original access paths 111.
- Text Object
- Structure
- FIG. 3 is a schematic diagram of the detailed structure of an example text object in accordance with an embodiment of the present invention, in order to assist with illustrating the concept.
- The example free-format data illustrated in FIG. 3 is a street address, “12 Pitt Street, North Sydney” (designated by reference numeral 301). In prior art databases, this information may have been stored in a single “address” field or may have been divided into a number of separate fields corresponding to the various attributes, i.e., street number, street name, street type and town. Refer to FIG. 4 for other examples of common Australian address formats. As discussed in the preamble, the prior art database format requirement for a separate field for each attribute gives rise to much complexity and, where the information is intricate, it is cost prohibitive and even impossible to produce a field for every attribute of the free-format data.
- The text object (illustrated in FIG. 1) comprises a plurality of component nodes 302-312. The text object can be represented as a text node tree, having branches (e.g. 313), wherein the component nodes 302-312 are positioned in a predetermined hierarchy. The “lowest” hierarchy is at the bottom of the text node tree and the “highest” hierarchy is at the top of the text node tree. The node 302 at the top of the node tree will be referred to as the “root” node. It will be appreciated that components of the text object can be stored in any convenient manner in a memory of a processing means; they could, for example, be nested within each other, refer to each other in some way, etc. The text object is able to be represented as a text node tree, but that does not mean that it is stored in memory in this way. As long as the components of the text object can be processed in such a fashion that the components act like component nodes of a text node tree as represented in the figure, then this is sufficient.
- Note that each of the component nodes 302-312 could itself be considered a text object. This recursive definition allows all the functions of the present invention to be applied to each component.
- The architecture of each component node 302-312 includes:
- An attribute type identifier (which in this embodiment is an integer) which identifies an attribute type of the free-format data 301 associated with the text object. For example, component node 303 includes the attribute type identifier <Street>, indicating that this component node 303 is associated with the element of the free-format data which gives the street, i.e., “12 Pitt Street”. Component node 302 is the main component node for the text object illustrated in FIG. 3 and includes the attribute type identifier <Address>. The component node 302 is therefore associated with the entire free-format data record, in this case “12 Pitt Street, North Sydney”, which is an address. Note that component node 302 is “higher” in the hierarchy in the text node tree than component node 303; the <Address> component includes within it the <Street> component. The hierarchy of the component nodes 302-312 within the text node tree is in fact determined by the attribute type identifier of the component node and by grammatical rules which determine whether the attribute should be of a lower or higher hierarchy.
- A pointer to the starting position of the actual element sub-string of the free-format data associated with a component node. The free-format data is stored as a string in memory and the pointer points to the beginning of the element's character string. In the example, component node 303 would point to numeral “1” of the address.
- An integer containing the character length of the element. In the example, component node 303 would have a length of 14 (including space characters after “12” and “Pitt”) which would in effect point to the last letter “t” of “Street”.
- An array of subordinate component nodes. For example, for component node 303, nodes nodes
- A boolean variable indicating whether this attribute type identifier is for a “low level” matching element. “Regular expression” terms such as <word> and <nbr> are not matched against each other. Matching of these terms is performed at the next level up the hierarchy (e.g. <Street Name> 307). A node is flagged as a low level matching component if it either: is a literal which was located in the dictionary (e.g. nodes 308, 309); or contains “Regular expression” terms (e.g. nodes
- An integer representing the element's match weighting. This indicates the relative importance of each of the elements when performing comparisons between text objects. For example: when comparing “Level 3, 45 Pitt st” with “3rd Floor, 45 Pitt St” the fact that the elements “Level” and “Floor” are not equal is insignificant. The “match weighting” values are specified in the grammar rules used to construct the domain object.
- Depending on time/space considerations, other optional data items used to assist the “matching” processes. Refer to the section on “text string operations” below for more details.
- An integer indicating the parsing priority. This will be described later.
- A boolean value indicating whether this component node is responsible for deleting and moving the piece of text it points to. The two conditions when a component is responsible for its text are: 1) When an outside process requests that the text object manage the entire text string, the text object “root” node is flagged as being responsible for the text string. 2) When an implied value is created. See below for details.
- An integer value representing the free space available at the end of the buffer in which the free-format text is held. This value is calculated during the creation of the text object and is usually only applicable to the “root” node of the text object.
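- The fields listed above can be gathered into a single record type. The following Python sketch is one possible arrangement, not the embodiment's actual storage layout: the class name `ComponentNode`, the field names, the use of strings for attribute type identifiers (an integer in the embodiment), and the simplified tree shape are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ComponentNode:
    attr_type: str            # attribute type identifier (an integer in the embodiment)
    start: int                # offset of the element's first character in the text
    length: int               # character length of the element
    children: List["ComponentNode"] = field(default_factory=list)
    low_level: bool = False   # is this a "low level" matching element?
    match_weight: int = 1     # relative importance when comparing text objects
    parse_priority: int = 0   # used to choose between ambiguous parses
    owns_text: bool = False   # responsible for deleting and moving its text?
    free_space: int = 0       # spare room at the end of the text buffer

    def text(self, buffer: str) -> str:
        """Return the sub-string of the free-format data this node refers to."""
        return buffer[self.start:self.start + self.length]


# Simplified, hand-built tree for the example address (shape is illustrative).
buffer = "12 Pitt Street, North Sydney"
street = ComponentNode("Street", 0, 14, children=[
    ComponentNode("number", 0, 2),
    ComponentNode("street name", 3, 4, children=[ComponentNode("word", 3, 4)]),
    ComponentNode("street type", 8, 6, low_level=True),
])
town = ComponentNode("town", 16, 12)
root = ComponentNode("Address", 0, len(buffer), [street, town], owns_text=True)

print(street.text(buffer))  # the <Street> element: start 0, length 14
print(town.text(buffer))    # the <town> element
```

- Because each node stores only an offset and a length, the original string is never copied or split; every element is recovered by slicing the one stored record.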
- In the text node tree the foot of the hierarchy is a component node dealing with an element for each token of the free-format data, in this case being <number> 311, <word> 312, <street type> 308, <geographic term> 309, <word> 310.
- Further up in the hierarchy are component nodes for more generic attribute type identifiers. For example these are <street name> 307 for the word “Pitt”, <Street> 303 for the three tokens “12 Pitt Street”, <town> 305 for the tokens “North Sydney” and, at the top of the hierarchy of this particular free-format data record, the attribute type identifier <Address> 302.
- Attribute Type Identifier
- It will be appreciated that the attribute type identifiers can be stored in any form, i.e., they need not be stored as integers but could be stored in any representation. A program engine is provided enabling access to the text node tree and this engine has the information necessary to identify the attribute type identifiers as stored.
- Parsing Priority
- To assist in the processing of ambiguous free-format data, each component node contains an integer indicating the “parsing priority” of the element. These values are assigned during construction of the text object and are used to select the best text node tree if more than one exists for a particular ambiguous free-format text. For example: “12 Pitt St Nth Sydney” has two interpretations. Although “12 Pitt St Nth” is a valid street address, it has a lower priority than “Nth Sydney” and is therefore not selected. These “parsing priority” values are specified in the grammar rules used to construct the domain object (see below).
- Implied Fields
- Another feature of the present invention is the production of extra implied sub fields in a text object, in the form of the creation of extra component nodes for information that is not actually explicit in the original text. For example, “Mr John Smith” has an implied sub field “sex” with a value “male”. The text object can be created with an extra component node dealing with this element and having the attribute type identifier “sex”.
- Normally these implied fields will be created during the parsing process and are specified in the grammar, but they can be added manually if required. See the description of the “Add Sub-component” function below.
- Interface
- The text object acts as a “virtual interface” enabling access to the free-format data and facilitating all normal database operations on the free-format data. The user does not “see” the internals of the text object, but can query the text object via the associated program engine (query processing means) and, by virtue of the structure of the stored text object, the attribute type identifiers and other data being placed in nodes, can perform all the normal database operations on the free-format text record.
- All the operations below require that the text node tree be searched for specific attribute types. This searching is performed by the engine using recursive procedure calls. This technique is very well known within computer science. Refer to the book “Data Structures and Program Design” by Robert Kruse (Prentice Hall) for a description of recursion.
- Another embodiment of this invention may speed up this search by traversing the tree once to create a lookup table containing every sub-attribute, sorted by the attribute type identifier. This technique is well known to those skilled in the art.
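- The lookup table mentioned above might be built with a single recursive pass over the tree. The sketch below is an assumption about how such a table could look: the `Node` class, the function name `build_lookup`, and the use of a Python dictionary keyed by attribute type are illustrative choices only.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    attr_type: str
    text: str
    children: List["Node"] = field(default_factory=list)


def build_lookup(root: Node) -> Dict[str, List[Node]]:
    """Visit every node once and group the nodes by attribute type."""
    table = defaultdict(list)

    def visit(node: Node) -> None:
        table[node.attr_type].append(node)
        for child in node.children:
            visit(child)

    visit(root)
    # Keep the entries ordered by attribute type identifier.
    return dict(sorted(table.items()))


root = Node("Address", "12 Pitt Street, North Sydney", [
    Node("Street", "12 Pitt Street", [Node("Street Name", "Pitt")]),
    Node("Town", "North Sydney"),
])
lookup = build_lookup(root)
print(list(lookup))                     # attribute types in sorted order
print(lookup["Street Name"][0].text)    # direct access without a tree walk
```

- After the table is built, a query for a given attribute type is a single dictionary access rather than a fresh traversal of the tree.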
- Function Overview
- These operations include:
- “Get Sub-component” Requests the text object to supply (zero, one or many) values for the respective attribute type.
- “Compare Text Objects” Compares two text objects and reports a confidence value that indicates how closely they match.
- “Contains component” Tests if a particular text object contains a specific value for a particular element and returns a confidence, e.g., one could obtain all free-format data records which include Pitt Street as the “street”. This would be one way of finding how many people on a database live in Pitt Street where the database includes free-format data in an address field and without requiring a string search (which can often give rise to error).
- “Modify Sub-component” Changes the value of a particular element of a text object to a specific value. For example, change “Pitt” to “King”.
- “Add Component” Adds extra data to the text object by appending a new sub-component node to the respective node. Future operations will reference this information.
- Get Sub-component
- When the “Text Object” is queried, an attribute type identifier is supplied and zero, one or more “Sub-component Nodes” are returned. These “Sub-component Nodes” point to the text of the required elements. FIG. 5 illustrates this method. This recursive procedure begins with the “root” node of the text object. Starting at 501, a determination is made (502) as to whether the attribute type of this node is the same as the required attribute type. If it is, a pointer to this node is appended to the result list at step 503. Continuing with step 504, for each sub-component node referenced by this node, this procedure is called recursively (505). The procedure then returns to the caller (506). FIG. 6 illustrates the node tree for “Mr Fred and Mrs Mary Smith”. Searching the tree for nodes with attribute type <Given Name> will return a list containing pointers to two nodes.
- Another version of this operation takes a text string as a parameter. Only nodes containing the same attribute type and same text string (ignoring case) are added to the list. For example: calling this function with an attribute type <Given Name> and text string “FRED” would return a list containing one node.
- Yet another version of this operation takes as parameters a text string and a confidence level. Only nodes containing the same attribute type and have a text string which matches the supplied string with a confidence above the supplied level are added to the list.
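- The three versions of the “Get Sub-component” operation can be sketched as one recursive Python function with optional parameters. The `Node` class, the function names, and the flat tree shape are hypothetical (the actual tree of FIG. 6 is not reproduced here), and `difflib.SequenceMatcher` is used only as a stand-in for the string comparison procedure, which this passage does not define.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Node:
    attr_type: str
    text: str
    children: List["Node"] = field(default_factory=list)


def _accept(node: Node, text: Optional[str], min_confidence: Optional[float]) -> bool:
    if text is None:
        return True                                   # version 1: type only
    if min_confidence is None:
        return node.text.lower() == text.lower()      # version 2: same text, ignoring case
    # Version 3: stand-in similarity score between 0.0 and 1.0.
    score = SequenceMatcher(None, node.text.lower(), text.lower()).ratio()
    return score > min_confidence


def get_subcomponents(node, attr_type, text=None, min_confidence=None, result=None):
    """Collect every node of the required attribute type, starting at `node`."""
    if result is None:
        result = []
    if node.attr_type == attr_type and _accept(node, text, min_confidence):
        result.append(node)
    for child in node.children:                       # recurse into sub-components
        get_subcomponents(child, attr_type, text, min_confidence, result)
    return result


root = Node("Name", "Mr Fred and Mrs Mary Smith", [
    Node("Title", "Mr"), Node("Given Name", "Fred"),
    Node("Title", "Mrs"), Node("Given Name", "Mary"),
    Node("Surname", "Smith"),
])
print([n.text for n in get_subcomponents(root, "Given Name")])
print([n.text for n in get_subcomponents(root, "Given Name", "FRED")])
print([n.text for n in get_subcomponents(root, "Given Name", "Fredd", 0.8)])
```

- The first call corresponds to the plain attribute query, the second to the case-insensitive text version, and the third to the version that takes a confidence level.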
- Compare Text Objects
- This operation compares two text objects and returns a confidence level indicating how closely they match. It performs this by:
- 1. Determining if the “root” nodes of the two text objects have the same attribute type. If they do not, return a zero confidence level to the caller.
- 2. Otherwise, call the “Match Node” subroutine (described below) with the “root” nodes of the two text objects and return the result of that operation to the caller.
- For example: passing the two following text objects will return a confidence of 100%.
- <Address> “12/34 PITT ST SYDNEY 2000 NSW”
- <Address> “Unit 12 34 Pitt Street, SYDNEY N.S.W., 2000”
- Contains Sub-Component
- This operation searches one text object for a sub-component which matches a second text object. If found, it returns to the caller a confidence level indicating how well they match. This operation is achieved by first calling the “Get Sub-component” function (described above), passing the component type of the second text object. If successful, it calls the “Match Node” subroutine (described below) with the “root” node of the second text object and the node returned by the “Get Sub-component” function.
- For example: passing the two following text objects will (depending on how the string matching procedures are set up) return a confidence of approximately 80%.
- <Street> “Kathryn Street”
- <Address> “12-14 Catherine St, Dubbo NSW 2830”
- Add Sub-Component
- This operation appends an extra component node to the text object. Although the value of this element is not contained in the original free-format text, queries performed on the text object return the correct results. For example: a text object pointing to a record containing “Dr Chris Smith” may need to be modified to indicate that the person is female. Invoking the “Add Sub-component” function with a sex attribute type and a value of “female” will append the respective component node to the text object.
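- A minimal Python sketch of adding an implied component follows. It assumes that an implied node carries its own value instead of pointing into the original record; the `Node` class, the `implied_value` field, and the helper names `add_subcomponent` and `find` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    attr_type: str
    start: int = 0
    length: int = 0
    children: List["Node"] = field(default_factory=list)
    implied_value: Optional[str] = None   # text held by the node itself
    owns_text: bool = False               # an implied node manages its own text

    def value(self, buffer: str) -> str:
        if self.implied_value is not None:
            return self.implied_value
        return buffer[self.start:self.start + self.length]


def add_subcomponent(parent: Node, attr_type: str, value: str) -> Node:
    """Append a node whose value is not present in the original text."""
    node = Node(attr_type, implied_value=value, owns_text=True)
    parent.children.append(node)
    return node


def find(node: Node, attr_type: str) -> List[Node]:
    hits = [node] if node.attr_type == attr_type else []
    for child in node.children:
        hits += find(child, attr_type)
    return hits


buffer = "Dr Chris Smith"
root = Node("Name", 0, 14, [
    Node("Title", 0, 2), Node("Given Name", 3, 5), Node("Surname", 9, 5),
])
add_subcomponent(root, "sex", "female")

print([n.value(buffer) for n in find(root, "sex")])   # answered from the implied node
print(buffer)                                          # the stored record is unchanged
```

- Queries for the new attribute are answered from the implied node, while applications that read the stored record directly still see the original text.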
- Modify Sub-Component
- FIG. 8 illustrates the mechanics of the “Modify” operation. The text object to be modified is represented by 801. The actual text data consists of the sub-string to be replaced 805 and the sub-strings before 804 and after 806. Within the main text object 801, the sub-tree 803 represents the sub-string to be replaced 805. The replacement text string 807 is represented by another text object 802.
- FIG. 7 provides a flow chart of the “Modify” procedure. Starting at 701, a call to the “Get Sub-component” function (described above) is performed to locate the required component node at step 702. The results of this function call are tested (step 703) to ensure that one and only one component node is returned. If zero or more than one node is returned, an error condition is set 704 and the procedure returns to the caller 714. Otherwise, the procedure continues with step 705 by calculating the difference (Diff) in length between the sub-string to be replaced 805 and the new replacement sub-string 807. If the difference is not zero (i.e. the strings have unequal lengths), the “Adjust Node Variables” subroutine 707 (described below) is invoked. If the subroutine 707 is unsuccessful, an error condition is set 711 and the procedure returns to the caller 714. Continuing the procedure at step 708, the new replacement string 807 is copied into the location of the old string 805. The old node sub-tree 803 is replaced with the new sub-tree 802 at step 710. For each node in the new sub-tree 712, the node's “text start address” variable is adjusted by adding the starting position of the new sub-string 713. The procedure then terminates and returns to the caller 714.
- FIGS. 9, 10 and 11 provide an example of the “Modify” operation. FIG. 9 shows a text object before modification. FIG. 10 shows the replacement text object and FIG. 11 shows the text object referred to in FIG. 9 after it has been modified.
- The extra versions of the “Get Sub-component” operation described above also apply to this operation.
- Subroutines
- The operations described below are invoked from other text object procedures described above.
- Match Node
- This procedure compares two elements with the same attribute type and returns a confidence level value indicating how closely they match.
- FIG. 12 shows a flow chart for the “Match Node” operation. Starting at 1201, a determination is made at step 1202 as to whether the nodes being compared are low level matching components. If the two nodes are low level matching components, the “String Comparison” procedure (described below) is performed at step 1203 and the procedure returns to the caller 1210. Otherwise, if the two nodes contain sub-component nodes, this procedure is invoked recursively 1205 with all combinations of sub-component pairs which have the same attribute type (step 1204). The best confidence level for each is recorded 1206. Each node's confidence level is multiplied by its respective matching weight value 1207. All the resulting values are summed into one confidence value 1208. That value is divided by the sum of the match weightings 1209 and returned to the caller 1210.
- FIG. 13 contains an example showing the matching process. Within the text object's node tree there are three types of component nodes:
- 1) nodes which contain sub-component nodes;
- 2) low level matching components near the foot of the node tree; and
- 3) nodes which are contained within the low level matching components and represent simple “regular expression” terms. (Refer to the description of the grammar file for details of the terms.) These nodes are not used in the matching process.
- In this example text object, the nodes nodes nodes
- In the following calculation, the first number within the parentheses is the weighting value for that component. The second number is the best result from the node matching procedure for that node. The number on top is the node's reference label in FIG. 13.
-
-
- This value indicates the two pieces of text match “quite closely”. Values greater than 90% indicate a match that is “very close”.
- The above procedure may be improved by applying “Fuzzy Logic” techniques. Fuzzy logic techniques are well known to those skilled in the art and many suitable reference books are available.
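- The weighted averaging performed by the “Match Node” procedure can be illustrated with a minimal sketch. The `Node` class, its field names and the use of `difflib.SequenceMatcher` as a stand-in for the “String Comparison” procedure are assumptions made for illustration only, as are the weights in the usage note. The sketch also assumes that a sub-component with no same-typed partner in the other node contributes a confidence of zero.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher


@dataclass
class Node:
    attr_type: str                 # attribute type identifier, e.g. "StreetName"
    text: str = ""                 # text covered by this node
    weight: float = 1.0            # match weighting taken from the grammar
    low_level: bool = False        # flagged as a low level matching component
    children: list = field(default_factory=list)


def string_confidence(a: str, b: str) -> float:
    """Hypothetical stand-in for the "String Comparison" procedure."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def match_node(a: Node, b: Node) -> float:
    """Return a confidence in [0, 1] for two nodes of the same attribute type."""
    if a.low_level and b.low_level:
        return string_confidence(a.text, b.text)
    weighted_sum = 0.0
    weight_total = 0.0
    for child in a.children:
        # Compare against every sub-component of b with the same attribute type
        # and keep the best confidence found.
        partners = [c for c in b.children if c.attr_type == child.attr_type]
        best = max((match_node(child, p) for p in partners), default=0.0)
        weighted_sum += child.weight * best
        weight_total += child.weight
    # Divide the weighted sum by the sum of the match weightings.
    return weighted_sum / weight_total if weight_total else 0.0
```

- With a street name weighted 3 and a street type weighted 1, two streets that agree on the name but not on the type score (3 x 1.0 + 1 x 0.0) / 4 = 0.75 under this sketch.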
- Adjust Node Variables
- This subroutine is called from the “Modify Component” procedure described above. Its purpose is to adjust the actual free-format text, and all corresponding sub-component nodes located after the node being replaced, so that the new replacement sub-string and sub-tree fit exactly. If the old sub-string and the new replacement sub-string are the same length, this subroutine is not invoked.
- FIG. 14 shows a flow chart of the steps required. Starting at 1401, a determination is made at step 1402 as to whether there is enough space in the current text buffer to accommodate the change. This is done by referring to the “free space” variable (described above) of the “root” node of the text object. If there is not enough space, the “Relocate Text Data” subroutine is invoked 1403 to create free space in the text object. If this routine is unsuccessful 1404, an error condition is set 1415, and the procedure terminates and returns to the caller 1416. Otherwise, the procedure continues at 1405 and calculates the extra space requirements of the modified text object by subtracting the size of the old sub-tree being replaced from the size of the new replacement sub-tree. A zero or negative value indicates that the text object has enough space to accommodate the change. If the text object requires more space 1406, the “Relocate Text Object” subroutine is invoked 1407 to create free space in the text object. If this routine is unsuccessful 1408, an error condition is set 1415, and the procedure terminates and returns to the caller 1416. If the above steps are successful, the procedure continues at step 1409 and shifts the “after” string 806 in FIG. 8 by the difference between the old sub-string 805 and the new replacement sub-string 807. For each node which refers to components located after the replacement node 1410, this difference is added to the node's start address variable 1411. For each node which has the replaced node as a sub-component 1412, the difference is added to the node's length variable 1413. The text object's “free space” variable is adjusted by subtracting the difference 1414 and the procedure returns to the caller 1416.
- Relocate Text Data
- This subroutine is invoked by the “Adjust Node Variables” subroutine to move the current free-format text into a space large enough to accommodate the required modification. The ability of this routine to perform this operation depends on where the text data is stored. Typically, free-format data such as “address” information is stored in fixed length database fields and cannot be relocated. If this is the case, this routine will set an error condition and return to the caller. However, if the text data is stored within moveable storage, such as the computer's memory, or within an object-oriented database as a non-persistent object, this procedure will relocate the text data and return to the caller with the text data's new address.
- Relocate Text Object
- This subroutine is invoked by the “Adjust Node Variables” subroutine to move the current text object into a space large enough to accommodate the required modification. The ability of this routine to perform this operation depends on how this invention is implemented. If the text object is stored within moveable storage, such as the computer's memory, or within an object-oriented database as a non-persistent object, this procedure will relocate the text object and return to the caller with the text object's new address.
- For a description of Object-Oriented databases and object persistence, refer to the book “Object-Oriented Databases” by Setrag Khoshafian (Wiley Press).
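- The “Modify” procedure and its “Adjust Node Variables” subroutine can be sketched as follows. This is an illustrative sketch, not the patented implementation: the `Node` and `TextObject` classes and their field names are hypothetical, the text is held in an immutable Python string (so shifting the “after” string is done by slicing), and both relocation subroutines are replaced by a simple error when the free space is insufficient, as would be the case for a fixed length database field.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    attr_type: str
    start: int                     # "text start address" (offset into the text)
    length: int
    children: list = field(default_factory=list)

    def walk(self):
        yield self
        for child in self.children:
            yield from child.walk()


@dataclass(eq=False)
class TextObject:
    text: str
    root: Node
    free_space: int = 0            # spare characters available for growth


def path_to(root, target):
    """Return the nodes from root down to target, or [] if target is absent."""
    if root is target:
        return [root]
    for child in root.children:
        sub = path_to(child, target)
        if sub:
            return [root] + sub
    return []


def modify(obj, target, new_text, new_subtree):
    """Replace the text under `target`; `new_subtree` offsets start at 0."""
    path = path_to(obj.root, target)
    if len(path) < 2:
        raise ValueError("target must be a sub-component of the text object")
    diff = len(new_text) - target.length
    if diff > obj.free_space:
        # Stand-in for an unsuccessful relocation subroutine.
        raise ValueError("not enough free space for the replacement")
    old_start, old_end = target.start, target.start + target.length
    ancestors = {id(n) for n in path[:-1]}
    replaced = {id(n) for n in target.walk()}
    if diff:
        for node in obj.root.walk():
            if id(node) in replaced:
                continue
            if id(node) in ancestors:
                node.length += diff        # node contains the replaced node
            elif node.start >= old_end:
                node.start += diff         # node lies after the replaced text
        obj.free_space -= diff
    # Copy the new string into the location of the old string.
    obj.text = obj.text[:old_start] + new_text + obj.text[old_end:]
    # Rebase the new sub-tree onto its position within the full text.
    for node in new_subtree.walk():
        node.start += old_start
    parent = path[-2]
    parent.children[parent.children.index(target)] = new_subtree
```

- Replacing “Pitt” with “George” in “12 Pitt Street, Sydney” under this sketch moves every later node two characters to the right, lengthens the enclosing nodes by two, and reduces the free space by two.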
- Get Keys
- This operation is used exclusively by the “Text Object Index” described below. It provides key information used in updating and querying of the text object index. It recursively searches the text object node tree and returns a list of all the nodes which have been flagged as low level matching components. See above for a definition of a low level matching component. Refer to the description of the Text Object Index below for an example of the output of this function.
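- A minimal sketch of the recursive search performed by “Get Keys” is given below. The `Node` class and its fields are hypothetical. The sketch assumes that the search stops at each low level matching component, since the “regular expression” nodes beneath it take no part in matching.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    attr_type: str
    value: str = ""
    low_level: bool = False        # flagged as a low level matching component
    children: list = field(default_factory=list)


def get_keys(node):
    """Return (attribute type, value) pairs for all low level matching nodes."""
    if node.low_level:
        return [(node.attr_type, node.value)]
    keys = []
    for child in node.children:
        keys.extend(get_keys(child))
    return keys
```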
- Summary of Text Object Benefits
- Many records of free-format text may be processed in accordance with this embodiment of the present invention, to produce text objects in each case. Different text objects may have different attribute type identifiers, but it is not necessary to produce a complex database structure having a separate field for each attribute type. Free-format text is stored basically as it is, with the associated text object providing all the facility required to provide all the normal database operations on the free-format data. This essentially enables a computer to handle information in much the same way as a human being does.
- Text Object Construction Overview
- The text object is produced by examining the free-format data using natural language processing techniques, such as parsing, which are known in the prior art. Such techniques have been used to “clean” or “scrub” databases, and large and complex software systems have been built for this purpose. In each case in the prior art, however, the natural language processing has been applied to analyse the data to enable the creation of new database fields. The idea of maintaining the free-format data as it is and creating a text object as described is a totally new concept.
- In this embodiment of the present invention, the processing of each item of free-format text to produce the text object involves, firstly, lexical analysis in which a regular expression analyser reads the free-format text and groups the items of the text into tokens with their associated attribute type identifier (e.g., word, number, comma, etc). Each token is then checked against a dictionary for other applicable attribute type identifiers (e.g., Street type, State, etc).
- Syntax analysis is then applied and in the present embodiment, the position of each of the tokens in the free-format data is also analysed to provide attribute type identifiers. For example, in the FIG. 5 example, “Pitt” is a plain word not found in the dictionary and therefore probably a proper noun. By analysing its position in relation to the other elements of the free-format data, however, the embodiment can “imply” that it is a <StreetName>. Therefore, “12 Pitt Street” can be classified as a <Street> from the relative positioning of the tokens.
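- The two stages described above (lexical analysis with a dictionary look-up, followed by positional inference) can be sketched as follows. The regular expression, the dictionary entries and the single hand-written positional rule are assumptions made for illustration. In the embodiment the positional inference comes from the grammar-driven parser, not from a hard-coded rule.

```python
import re

TOKEN_RE = re.compile(r"[A-Za-z]+|[0-9]+|,")

# Hypothetical dictionary entries mapping known words to attribute types.
DICTIONARY = {"STREET": "StreetType", "ST": "StreetType", "NSW": "State"}


def tokenize(text):
    """Group the text into (token, attribute type) pairs."""
    tokens = []
    for match in TOKEN_RE.finditer(text):
        token = match.group()
        if token.isdigit():
            kind = "nbr"
        elif token == ",":
            kind = "comma"
        else:
            # Plain words may be re-typed by the dictionary.
            kind = DICTIONARY.get(token.upper(), "word")
        tokens.append((token, kind))
    return tokens


def imply_street_names(tokens):
    """Re-type a plain word sitting between a number and a street type."""
    out = list(tokens)
    for i in range(1, len(out) - 1):
        if (out[i][1] == "word"
                and out[i - 1][1] == "nbr"
                and out[i + 1][1] == "StreetType"):
            out[i] = (out[i][0], "StreetName")
    return out
```

- Applied to “12 Pitt Street”, the sketch types “Pitt” as a plain word during lexical analysis and then re-types it as a street name from its position.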
- Domain Object
- The main function of the domain object 108 (FIG. 1) is to create text objects 105. This function is described in detail below. Other functions the domain object performs relate to maintaining an attribute type table. This table contains the information for all the attribute types defined for its domain.
- Structure
- FIG. 15 shows the domain object architecture 108 in more detail. It comprises a series of “look up” tables, which include the symbol table 1502 (e.g., <Street name>; note that the term “symbol” is equivalent to the term “attribute type identifier”) and the parse table 1504 (which contains rules for applying the grammar). It also comprises a lexicon 1503, which contains a character definition table 1505, a regular expression analyser 1506 and a dictionary 1507 (e.g., NSW, VIC, SA). All of these parts are used by a modified “Tomita parser” (described below) to process free-format text to produce text objects.
- Text Object Construction
- FIG. 16 gives an overview of the operation of the domain object 108 creating a text object 105 of FIG. 1.
- In operation, the domain object 1605 uses the attribute type 1608 to locate the respective parsing rules and then “parses” the free-format data 1607 and produces a text object 1606.
- Parsing is a known technique for analysing free-format data and a skilled person would be able to arrange appropriate parsing.
- Parser Types
- The parser may consist of any non-deterministic parser. The common parsing techniques are listed as follows:
- Top Down Backtracking Parser
- Bottom Up Backtracking Parser
- Top Down Chart Parser
- Bottom Up Chart Parser
- Augmented Transition Network Parser
- Shift Reduce Parser with Backtracking
- Tomita's Graph stack Shift Reduce parser
- The main reasons for selecting Tomita's Graph-stack Shift-Reduce parser for the best implementation of the invention are:
- A detailed description of the algorithm is readily available.
- The algorithm processes ambiguous text data very well.
- The resulting data structures represent ambiguous text data in a very efficient form.
- The structure and operation of the parsing process is described in the book by Tomita, M. “Efficient Parsing for Natural Language”, Kluwer 1986. A summarised copy of this description is also given in the Appendix to this description.
- Modifications to Tomita's Parser
- In addition to producing the component node tree described by Tomita, a number of enhancements are required for the text object. These enhancements allow the text object to provide the “virtual data” fields.
- Modifications to Tomita's Graph-Stack Shift-Reduce parser for this invention are as follows:
- Assigning parsing priorities to the tokens returned from the lexical analyser and to the rules in the parse table, and summing these priorities to obtain the most suitable component node tree for a given free-format text. All of these priorities are specified in the input grammar file 1603 (FIG. 16).
- Classifying the component nodes of the syntax tree as either visible or invisible. Low level “regular expression” terms such as <word> are classified as invisible.
- Assigning match weightings to all component nodes. These values are specified in the grammar data and are used to determine the relative importance of each of the components when matching two free-format texts.
- Procedure
- FIG. 16 gives an overview of the operation of the domain object 108 creating a text object 105 of FIG. 1.
- This procedure takes a free-format text string 1607 and an attribute type identifier 1608 and creates a text object 1606.
- 1. Using the attribute type identifier 1608, look up the symbol table 1502 (FIG. 15) to get the corresponding parse table.
- 2. Call the parser to create a “shared parse forest” as defined in section 2.4 of Tomita's book. A shared parse forest is used to represent ambiguous parse trees within the one structure. It does this by allowing trees to share common sub-trees.
- 3. Recursively accumulate all the “parsing priorities” of all the sub-component nodes of each node.
- 4. Based on the values in the previous step, select the best parse tree.
- 5. Create a new Text Object with the selected parse tree.
- 6. Recursively search the parse tree to locate and flag specific nodes as “low level matching components”. (see above for definition)
- Refer to FIG. 3 for a simple example of a text object.
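- Steps 3 and 4 of the procedure above (accumulating parsing priorities and selecting the best parse tree) can be sketched as follows. The `ParseNode` class is hypothetical, the candidate trees are enumerated separately rather than held in a shared parse forest, and the sketch assumes that a larger priority total indicates a better parse.

```python
from dataclasses import dataclass, field


@dataclass
class ParseNode:
    symbol: str
    priority: int = 0              # parsing priority from the grammar
    children: list = field(default_factory=list)


def total_priority(node):
    """Recursively accumulate the priorities of a node and its sub-components."""
    return node.priority + sum(total_priority(c) for c in node.children)


def select_best(candidates):
    """Pick the candidate parse tree with the largest accumulated priority."""
    return max(candidates, key=total_priority)
```

- For an ambiguous input, a tree whose rules carry a total priority of 6 would be chosen over an alternative reading with a total of 1.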
- Construction of Domain Object
- FIG. 16 shows an overview of the domain construction process.
- The input files for the domain construction process 1604 include the following:
- Character Definition File 1601
- This defines all the valid characters of the domain and specifies their usage. The range of usage typically includes alphabetic, numeric, punctuation and space. It also specifies which characters are similar for matching purposes, together with all information required to perform the “text string matching” described below.
- In the best embodiment of the invention, this file contains one record per character, and each record contains:
- the character in question
- the character's type (alpha, numeric, etc)
- a base character for case and diacritic matching (e.g. “a”, “À”, “à”, “Å”→“A”)
- a flag indicating the significance of the character. (e.g. vowels are considered insignificant.)
- one or more characters for standard international transliteration. (see FIG. 17 for example tables)
- This file could also define how character combinations are translated into phonetic representations (e.g. “PH”→“F”). Phonetics is a known technique and a skilled person would be able to arrange appropriate translation tables.
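- One record of the character definition data might be modelled as in the sketch below. The `CharDef` class and its field names are hypothetical. The sketch derives the base character from Unicode decomposition using Python's `unicodedata` module, whereas the embodiment reads these values from a hand-authored file; the transliteration field is omitted.

```python
import unicodedata
from dataclasses import dataclass

VOWELS = set("AEIOU")


@dataclass(frozen=True)
class CharDef:
    char: str
    kind: str           # "alpha", "numeric", "space" or "punctuation"
    base: str           # base character for case and diacritic matching
    significant: bool   # vowels are treated as insignificant


def char_def(ch):
    """Build a character definition record for a single character."""
    # Decompose the character and drop combining marks to find its base form.
    decomposed = unicodedata.normalize("NFD", ch)
    base = "".join(c for c in decomposed if not unicodedata.combining(c)).upper()
    if ch.isalpha():
        kind = "alpha"
    elif ch.isdigit():
        kind = "numeric"
    elif ch.isspace():
        kind = "space"
    else:
        kind = "punctuation"
    significant = not (kind == "alpha" and base in VOWELS)
    return CharDef(ch, kind, base, significant)
```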
- Regular Expression Definition 1602
- This defines the structure of the elementary tokens of the system. For example:
- A word consists of two or more alphabetic characters. These tokens are represented in the grammar by the term “word”.
- A number consists of one or more numeric characters. Represented in the grammar by the term “nbr”.
- The structure of the Regular Expression definition is a basic “state transition table”. This technique is well known within computer science. A working sample is shown in FIG. 18.
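- A state transition table of the kind described above can be sketched as follows. The state names, character classes and table layout are assumptions made for illustration and are not taken from FIG. 18. The sketch accepts words and numbers of one or more characters.

```python
def char_class(ch):
    """Map a character to the class used to index the transition table."""
    if ch.isalpha():
        return "alpha"
    if ch.isdigit():
        return "digit"
    return "other"


# TRANSITIONS[state][character class] gives the next state.
# A missing entry means the current token ends before this character.
TRANSITIONS = {
    "start": {"alpha": "word", "digit": "nbr"},
    "word": {"alpha": "word"},
    "nbr": {"digit": "nbr"},
}


def tokenize(text):
    """Return (token type, token text) pairs found by the state machine."""
    tokens = []
    i = 0
    while i < len(text):
        state = "start"
        j = i
        while j < len(text):
            next_state = TRANSITIONS[state].get(char_class(text[j]))
            if next_state is None:
                break
            state = next_state
            j += 1
        if state == "start":
            i += 1                  # skip a character that starts no token
        else:
            tokens.append((state, text[i:j]))
            i = j
    return tokens
```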
- Grammar 1603
- The basic premise of the grammar file is to define all possible tree structures for the text objects created in its language domain.
- The grammar file consists of a number of grammar rules in the form “A→B1 B2 B3 . . . ”. Each grammar rule consists of a LHS symbol <A> and zero, one or many RHS symbols <Bn>. The LHS symbol <A> is the name of the component type and the RHS symbols <Bn> define its sub-components. Each of the RHS symbols <Bn> can be one of the following:
- Another component type name
- A literal (enclosed in quotes)
- A reserved word
- The reserved words represent simple “regular expression” terms as follows:
- “word”—one or more alphabetic characters
- “nbr”—one or more numeric characters
- “A”—one alphabetic character
- “9”—one numeric character
- Additionally, each attribute type (i.e. LHS symbol) can be assigned a “match weight adjustment”. This is used to vary the default match weighting. Match weightings are used when comparing text objects to indicate the relative importance of sub-components during the calculation of the matching confidence.
- Additionally, each grammar rule can be assigned a “parsing priority”. This is used during the construction of text objects to assist in selecting the best structure for the text object when two or more ambiguous structures are available.
- All branches at the lowest levels of the hierarchy of rules and attribute type names defined by the grammar must end with literals or reserved words. A simple example grammar is shown in FIG. 19.
- Procedure
- FIGS. 20 and 21 provide flow charts of the domain object construction process. Starting at 2001, the character definition data is loaded into memory at step 2002, then the regular expression definition is loaded at step 2003. Processing continues by reading the grammar definition data. For each rule in the grammar 2004, the grammar rule is processed 2005 by creating a new rule in the temporary rule table 2102 and using the LHS symbol of the rule to create a new symbol/component type in the Symbol table if it does not already exist. Then, for each symbol on the RHS of the rule (step 2104): if it is a literal 2105, it is added to the dictionary 2106; if it is a recognised “regular expression” term such as “word” or “nbr” 2107, nothing is done 2108; otherwise it is an attribute/symbol and is added as a new symbol/attribute type to the Symbol table, if it does not already exist, at step 2109. After all the grammar rules have been processed, processing continues at step 2006 by checking that each symbol/attribute type added to the Symbol table has been defined, i.e. has appeared at least once on the LHS of a grammar rule (step 2007). If any symbols/attribute types are undefined, an error condition is set at step 2011, and the procedure terminates and returns to the caller 2012. Otherwise processing continues at step 2008. Again, for each symbol/attribute type added to the Symbol table, a parse table is created at step 2009, and a reference to this new parse table is recorded in the corresponding Symbol table entry. After all the required parse tables have been created, the procedure terminates and returns to the caller 2012.
- Building of parse tables is a well known technique within computer science. Parse tables were originally developed for programming languages. The algorithm for construction of the “LR parsing table” can be found in Aho, A. V. and Ullman, J. D. “Principles of Compiler Design” Addison Wesley 1977. Tomita applied these techniques to “Natural Language Processing” by building parse tables which are non-deterministic in that each entry in the tables can have more than one action.
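- The grammar-processing part of the domain construction procedure can be sketched as follows. The rule representation (a list of LHS and RHS pairs, with literals written in double quotes) and the returned dictionary layout are assumptions made for illustration. Construction of the parse tables themselves is omitted.

```python
RESERVED = {"word", "nbr", "A", "9"}    # reserved "regular expression" terms


def build_domain(rules):
    """Process grammar rules given as (lhs, [rhs symbols]) pairs."""
    defined = {}            # symbol -> True once it has appeared on a LHS
    dictionary = set()      # literals collected from rule right-hand sides
    rule_table = []
    for lhs, rhs in rules:
        rule_table.append((lhs, tuple(rhs)))
        defined[lhs] = True
        for sym in rhs:
            if len(sym) >= 2 and sym.startswith('"') and sym.endswith('"'):
                dictionary.add(sym[1:-1])       # literal: add to dictionary
            elif sym in RESERVED:
                continue                        # reserved term: do nothing
            else:
                defined.setdefault(sym, False)  # new symbol, not yet defined
    undefined = sorted(s for s, ok in defined.items() if not ok)
    if undefined:
        raise ValueError("undefined symbols: " + ", ".join(undefined))
    return {
        "rules": rule_table,
        "symbols": sorted(defined),
        "dictionary": dictionary,
    }
```

- A rule set that mentions a symbol on a right-hand side without ever defining it on a left-hand side raises an error in this sketch, mirroring the check at step 2007.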
- Note that the domain object 1605 can be saved to memory or loaded to operate on a record of free-format data.
- Text Object Index
- A “text object index” 109 (FIG. 1) is used as a means to perform normal database operations on the “virtual data” fields of a plurality of text objects and their associated free-format text.
- The basic concept for the text object index is similar to the concepts published in the book “Human Associative Memory” by John R. Anderson (Wiley 1973). This work described how the nouns in a sentence are used to reference a database of named objects, and then to match the “relationship” links between these objects to the implied relationships in the original sentence. These relationships follow the “Actor-Object-Action” model.
- Although similar, the text object index differs from this method in two major ways. 1) All constituent parts of the free-format text are classified and used to reference the index. (i.e. not just the nouns). 2) There are no relationship links between objects.
- Looking at the text object index from a different perspective, one could consider it an array with unlimited dimensions, where each dimension is one of the low level matching attribute types described above. The text object created from a free-format text string provides the low level matching components used to query the text object index, so that all references to other text objects located at the intersection of the supplied components are returned.
- Performance improvements to this basic concept can be provided by applying “fuzzy logic” techniques to the process. Fuzzy logic techniques are well known to those skilled in the art and many suitable reference books are available.
- In the best embodiment of the invention, the main part of the text object index is a three column table with the following fields:
- Attribute Type Identifier
- Representative Value Key
- User Supplied Record Identifier
- This simple structure allows the text object index to be implemented using the database technology available on the respective computer.
- The following example demonstrates how the three column table is used. The basic idea behind the Text Object Index is that all matching free-format texts have the same low level matching attributes. For example, assume the following record has been added to the text object index with a “user reference” of 123.
- “Unit 12 34 Pitt Street, Sydney N.S.W., 2000”
- After obtaining the respective text object's low level matching attributes, the following entries will be added to the index:
<Unit Number>      “12”        123
<Street Number>    “34”        123
<Street Name>      “PITT”      123
<Street Type>      “ST”        123
<Town Name>        “SYDNEY”    123
<State>            “NSW”       123
<Postcode>         “2000”      123
- A query is performed to check if the following address exists in the database.
- “12/34 PITT ST SYDNEY NSW”
- After creating a text object for this input and generating the low level matching attributes:
<Unit Number>      “12”
<Street Number>    “34”
<Street Name>      “PITT”
<Street Type>      “ST”
<Town Name>        “SYDNEY”
<State>            “NSW”
- Performing intersection analysis on all index entries retrieved with the above attribute-type identifiers and values will yield the record specified at the beginning of this section.
- A query is performed to find all addresses which contain the street:
- “PITT ST”
- After creating a text object for this input and generating the index key set:
<Street Name>      “PITT”
<Street Type>      “ST”
- Again, performing intersection analysis on all index entries retrieved with the above attribute-type identifiers and values will yield the correct subset of records, including the record specified at the beginning of this section.
- The above examples have been over simplified to demonstrate the concept. In a practical system, once the low level matching key set has been generated, all the techniques used in “key word searching” can be applied to each attribute type subset. For more detailed information on “key word searching” techniques, refer to the numerous books and journal articles published by Gerald Salton.
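- The three column table and the intersection analysis used in the Pitt Street example can be sketched as follows. The `TextObjectIndex` class and its method names are hypothetical, the table is held in memory rather than in a database, and record 124 in the usage note is invented for illustration.

```python
class TextObjectIndex:
    """In-memory three column table: (attribute type, value key, record id)."""

    def __init__(self):
        self.rows = set()

    def insert(self, record_id, keys):
        """Add one row per low level matching component of a text object."""
        for attr_type, value in keys:
            self.rows.add((attr_type, value, record_id))

    def delete(self, record_id):
        """Remove every row carrying the given record identifier."""
        self.rows = {row for row in self.rows if row[2] != record_id}

    def update(self, record_id, keys):
        """Delete the previous entries, then reinsert the new ones."""
        self.delete(record_id)
        self.insert(record_id, keys)

    def select(self, keys):
        """Return the record ids found at the intersection of all keys."""
        result = None
        for attr_type, value in keys:
            hits = {rid for a, v, rid in self.rows
                    if a == attr_type and v == value}
            result = hits if result is None else result & hits
        return result or set()
```

- If record 123 holds the address above and a second record 124 holds another Pitt St address in a different town, a query on the six attributes of “12/34 PITT ST SYDNEY NSW” returns only 123, while a query on “PITT ST” returns both.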
- “Key word search” techniques applicable to this invention include:
- Storing very common terms in a high speed cache and using this to avoid searching the index with terms that will return too many entries.
- Using one or more Representative Value Keys that allow for common misspellings. Typically this is the original value with vowels and double consonants removed.
- Using one or more Representative Value Keys that encode the original value into one or more phonetic representations.
- Using a Representative Value Key that encodes the original value into an international standard transliteration representation. (See FIG. 17 for examples of Greek and Japanese Katakana transliteration tables.)
- Checking the original value against a dictionary of synonyms to obtain the value which represents the full set of synonyms.
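- A Representative Value Key that tolerates common misspellings might be derived as in the sketch below. The function name is hypothetical, and two details are assumptions not stated in the text: doubled letters are collapsed to a single letter, and the first character is kept even when it is a vowel so that short values do not collapse to an empty key.

```python
import re


def misspelling_key(value):
    """Derive a key by collapsing doubled letters and dropping vowels."""
    s = re.sub(r"[^A-Z0-9]", "", value.upper())
    s = re.sub(r"([A-Z])\1+", r"\1", s)      # collapse doubled letters only
    # Keep the first character, then remove vowels from the remainder.
    return s[:1] + re.sub(r"[AEIOU]", "", s[1:])
```

- Under this sketch “Pitt” and “Pit” share the key “PT”, and “Sydney” and “Sydny” share the key “SYDNY”, so either spelling retrieves the same index entries.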
- Interface/Operations
- The following operations can be provided by the text object index.
- The interface of the text object index is designed to mirror the standard commands of SQL. SQL is the “Structured Query Language” of relational databases and is very well known within the computer industry.
- Insert Text Object
- As shown in the previous examples, this operation makes all the required changes to the text object index so that the respective text object reference can be located using any similar free-format text or sub-component thereof.
- The steps required by this operation are:
- 1. Call the “Get Keys” function of the respective text object to obtain all of its low level matching components.
- 2. For each low level matching component, add an entry in the text object index's three column table.
- 3. Optionally save the respective text object depending on technical considerations of the current computer system.
- Select Text Objects
- This operation returns all references (normally record identifiers supplied by the system user) to free-format texts which contain the supplied free-format text. For example: to locate all records which contain “Box Rd”.
- This operation proceeds with the following steps:
- 1. Build a text object from the query input data.
- 2. Invoke the “Get Keys” function of the text object to obtain a list of all its low level matching components.
- 3. Use the attribute type identifier and representative value of each of the component nodes to retrieve all references with any common low level matching items.
- 4. Perform intersection analysis on the references returned from the previous step to select the free-format texts which contain all the important low level matching elements of the query data.
- 5. Obtain the original text objects.
- 6. Perform a “Text Object” Comparison on each to obtain confidences.
- 7. Sort according to confidences.
- 8. Return the results to the caller.
- Delete Text Object
- This operation takes the user supplied reference key and deletes all records with that reference key.
- Update Text Object
- This operation updates the entries for a modified text object by first deleting all the previous entries and then reinserting new entries using the “Insert” operation described above.
- Text String Operations
- The techniques used to compare two text strings to obtain a matching confidence are well known within the computer industry. This section is provided as a quick overview of what text string matching normally involves.
- A typical matching procedure could perform the following steps:
- 1. Check for exact character match without regard to upper and lower case.
- 2. Check for common spelling mistakes by removing vowels and double consonants, then comparing the results.
- 3. Check for any spelling mistakes by performing comparison functions which allow for character deletion, insertion and transposition.
- 4. Check for similarity after standard international transliteration. See FIG. 17 for example of transliteration tables.
- 5. Check for phonetic similarity after translating the string into a standard phonetic representation.
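- Steps 1 and 3 of the matching procedure above can be sketched as follows. The function names and the way the edit distance is turned into a confidence are assumptions made for illustration. The distance used is the optimal string alignment variant of Damerau-Levenshtein distance, which also counts substitutions in addition to the deletions, insertions and transpositions named in step 3.

```python
def osa_distance(a, b):
    """Edit distance allowing insert, delete, substitute and transposition."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,          # deletion
                d[i][j - 1] + 1,          # insertion
                d[i - 1][j - 1] + cost,   # match or substitution
            )
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[-1][-1]


def match_confidence(a, b):
    """Return 1.0 for a case-insensitive exact match, else a scaled distance."""
    a, b = a.casefold(), b.casefold()
    if a == b:
        return 1.0
    return 1.0 - osa_distance(a, b) / max(len(a), len(b))
```

- Under this sketch “PITT” and “PTIT” differ by a single transposition, giving a confidence of 0.75.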
- In the present invention, text string matching is performed on certain low level matching component nodes. The values used in steps
- Steps 4 and 5 of the above procedure allow the invention to compare free-format data in foreign language text, e.g., Japanese Kanji. A phonetic value can be stored for the Kanji symbols and used to compare the Kanji with elements of other free-format data which may not be in Kanji. In other words, this feature facilitates the processing of free-format data in foreign languages. See FIG. 17 and the previous description.
- FIG. 22 gives an example of how this invention could be implemented within an SQL relational database implementation. A description of the SQL statements is as follows:
- 1. Create a domain object called “US_ADDRESS”
- 2. Initialise it with a Language definition (which contains the character definition and regular expression definition described above) and a Grammar definition.
- 3. Create a text object class called “ADDRESS”
- 4. Set its domain to “US_ADDRESS” and its type to “Address” (the type name must be defined in the grammar).
- 5. Create a database table called “PERSONS” with one of the elements being an “ADDRESS” text object called “Home_Addr”.
- 6. Insert a record into the table.
- 7. Select all records in the “PERSONS” table with a specific address.
- 8. Select all records in the “PERSONS” table whose “Home_Addr” column contains a sub-component “State” with a value matching “California”.
- 9. Select all records in the “PERSONS” table whose “Home_Addr” column contains a sub-component “Street” that matches “Kathie St” with a confidence level greater than 80%.
- Any free-format data record may be analysed by applying the present invention and by constructing the appropriate domain using the appropriate domain construction process and appropriately designed input files. All data can be analysed by computer in this way to produce text objects for all free-format descriptions.
- It will be appreciated that there are a number of processing steps for processing free-format data in accordance with embodiments of the present invention. It will be appreciated that each of these steps can be done once during system initialisation and the results saved, or they can be performed at execution time only when they are needed (e.g., every time a query is performed). A summary of these steps is as follows:
- Construction of the domain object.
- Construction of the text object's text node tree.
- Construction of the text object's extra implied sub-fields.
- In addition to this, there are the other related steps of producing a text object index from a plurality of text objects.
- It will be appreciated by persons skilled in the art that numerous variations and/or modifications may be made to the invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive.
Claims (53)
1. A method of processing free-format data stored in a computing system, comprising the steps of examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, producing additional data relating to this information, in the form of a text object which includes pointer means enabling access to the elements of the free-format data, and the additional data being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
2. A method in accordance with claim 1, wherein the free-format data is stored as a record in a free-format field of a database.
3. A method in accordance with claim 1 or claim 2, wherein the data remains stored in the computing system as it was originally stored, whereby it may be accessed by other applications.
4. A method in accordance with any preceding claim, wherein the text object includes an attribute-type identifier which identifies an attribute type of an element of the data.
5. A method in accordance with any preceding claim, wherein the text object includes a value indicating the character length of an element of the data.
6. A method in accordance with claim 4 or claim 5, wherein the text object includes a value indicating whether an element is low level in a syntactic hierarchy or higher level whereby the value may be used for matching purposes when matching data with other data processed in accordance with the method.
7. A method in accordance with any preceding claim, the text object including a match weighting value for an element of the data, which can be used to determine the significance of the element when matching with other free format data.
8. A method in accordance with any preceding claim, wherein the text object comprises a plurality of component nodes arranged according to the semantic structure of the free-format data, the component nodes being arranged in a hierarchy corresponding to the semantic structure of the free-format data and each component node including additional data relating to the corresponding element of the free-format data.
9. A method in accordance with any preceding claim, comprising the further step of generating matching values for comparing an element of the free-format data with an element of other free-format data processed in accordance with the present method.
10. A method in accordance with claim 9 where the matching value is a phonetic value for phonetically comparing elements of free-format data.
11. A method in accordance with any preceding claim, wherein the text object includes implied data relating to information implied from the free-format data.
12. A method in accordance with any preceding claim, wherein a plurality of free-format data records are processed and a text object associated with each free-format data record is produced.
13. A method in accordance with claim 12, wherein the text object is stored in the computer system whereby it is available for queries on the associated free-format data record via the query processing means.
14. A method in accordance with claim 12 comprising the further step of producing a text object index including attribute type identifiers for elements of each data record and pointers to each data record, whereby the index may be queried by queries relating to semantic and syntactic information about the data and the data may be accessed via the index.
15. A method in accordance with claim 14 wherein each entry in the text object index includes a representative value key, which gives a value representative of a feature of the element associated with the attribute-type identifier.
16. A method in accordance with any preceding claim, comprising the further step of carrying out a domain construction process to construct a domain object from domain definition data files, the domain object being arranged to carry out the examination process by parsing the free-format data in accordance with grammar rules.
17. A method in accordance with claim 16, wherein the domain definition data files include character definition data, regular expression definition data and grammar data.
18. A method in accordance with any preceding claim, wherein the free-format data is postal address data.
19. A method in accordance with any preceding claim wherein the query processing means can carry out normal database operations on the data via the additional data.
20. A processing system for processing free-format data stored in a computing system, the apparatus including means for examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, means for producing additional data relating to this information, in the form of a text object which includes pointer means enabling access to the elements of the free-format data, and a query processing means which is arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
21. A processing system in accordance with claim 20, wherein the free-format data is stored as a record in a free-format field of a database.
22. A processing system in accordance with claim 20 or claim 21, wherein the examining means does not affect the storage of the data.
23. A processing system in accordance with any one of claims 20 to 22, wherein the text object includes an attribute-type identifier which identifies an attribute type of an element of the data.
24. A processing system in accordance with any one of claims 20 to 23, wherein the text object includes a value indicating the character length of an element of the data.
25. A processing system in accordance with claim 23 or claim 24, wherein the text object includes a value, indicating whether an attribute-type of an element is low level in a syntactic hierarchy or high level whereby the value may be used for matching purposes when matching with other free-format data processed in accordance with this system.
26. A processing system in accordance with any one of claims 20 to 25, wherein the text object includes a match weighting value for an element of the data, which can be used to determine the significance of the element when matching with other free-format data.
27. A processing system in accordance with any one of claims 20 to 26, wherein the text object comprises a plurality of component nodes arranged according to the semantic structure of the free-format data, the component nodes being arranged in a hierarchy corresponding to the semantic structure of the free-format data, and each component node including additional data relating to the corresponding element of free-format data.
28. A processing system in accordance with any one of claims 20 to 27, the text object means for generating matching values for comparing an element of the free-format data with an element of other free-format data processed by the processing system.
29. A processing system in accordance with claim 28, wherein the matching value is a phonetic value for phonetically comparing elements of free-format data.
30. A processing system in accordance with any one of claims 20 to 29, wherein the text object includes implied data relating to information implied from the free-format data.
31. A processing system in accordance with any one of claims 20 to 30, wherein the system is arranged to process a plurality of free-format data records and produce a text object associated with each free-format data record.
32. A processing system in accordance with claim 31, wherein the means for producing additional data is arranged to produce a text object index including attribute-type identifiers for elements of each data record and pointers to each data record and wherein the query processing means is arranged to access the text object index to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
33. A processing system in accordance with claim 32, wherein the text object index includes representative value keys for entries, which give a value representative of a feature of the element associated with the attribute-type identifier for the entry for facilitating matching with other free-format data processed in accordance with this system.
34. A processing system in accordance with any one of claims 20 to 33, further comprising a domain object, the domain object being arranged to carry out the examination process by parsing the free-format data in accordance with grammar rules.
35. A processing system in accordance with claim 34, wherein the domain object is produced by a domain construction process from domain definition data files.
36. A processing system in accordance with claim 35, further comprising a domain constructor for carrying out the domain construction process.
37. A processing system in accordance with claim 35 or claim 36, wherein the domain definition data files include character definition data, regular expression definition data and grammar data.
38. A processing system in accordance with any one of claims 20 to 37, wherein the free-format data is postal address data.
39. A processing system in accordance with any one of claims 20 to 38, wherein the query processing means is arranged to carry out normal database operations on the data via the additional data.
40. A method of enabling access to free-format data stored in a computing system, including a plurality of free-format data records, comprising the steps of storing additional data relating to semantic and syntactic information (attributes) about the data for each data record, the additional data being in the form of a text object associated with each data record, the text object including pointer means enabling access to elements of each free-format data record, the additional data being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
41. A processing system for enabling access to free-format data stored in a computing system, including a plurality of free-format data records, the processing system comprising additional data relating to semantic and syntactic information (attributes) about the data for each data record, stored and accessible by the processing system, the additional data being in the form of a text object associated with each data record, the text object including pointer means enabling access to elements of each free-format data record, and a query processing means arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
42. A method of enabling access to free-format data stored in a computing system, including a plurality of free-format data records, comprising the steps of storing additional data relating to semantic and syntactic information (attributes) about the data of each data record, the additional data being in the form of a text object index which includes attribute-type identifiers for elements of each data record and pointers to each data record, the text object index being accessible by a query processing means to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
43. A processing system for enabling access to free-format data stored in a computing system, including a plurality of free-format data records, the processing system comprising the additional data relating to semantic and syntactic information (attributes) about the free-format data for each data record, the additional data being in the form of a text object index which includes attribute type identifiers for elements of each data record and pointers to each data record, and a query processing means arranged to access the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
44. A method of accessing free-format data processed in accordance with the method of any one of claims 1 to 19 comprising the steps of accessing the additional data to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
45. A processing system for enabling access to free-format data processed in accordance with the method of any one of claims 1 to 19, the processing system including a query processing means arranged to access the additional data and provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data to manipulate the data.
46. A processing system for processing free-format data stored in a computing system, comprising means for examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationship of elements to each other, to determine semantic and syntactic information (attributes) about the data, and a query processing means for utilising this information to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data.
47. A processing system in accordance with claim 46, wherein the examining means retains the free-format data as stored in the computer system, without affecting it.
48. A method of processing free-format data stored in a computing system, comprising the steps of examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about the data, and querying the data using this information to provide answers to queries relating to the semantic and syntactic information about the data and/or to access the data.
49. A method of processing free-format data in accordance with claim 48, wherein the free-format data is unaffected by the examining process and remains stored in the computing system as it was originally stored.
50. A computer readable memory storing instructions for controlling a computer to process free-format data stored in a computing system, in accordance with the method of any one of claims 1 to 19.
51. A computer readable memory storing instructions for controlling a computer to process free-format data stored in a computing system, in accordance with the method of claim 48.
52. A method of processing a plurality of records of free-format data stored in a computing system, comprising the steps of, for each record, examining elements of the data to determine attributes of the data, by examining the content of the elements and the contextual relationships of elements to each other, to determine semantic and syntactic information (attributes) about each record, and producing virtual data fields associated with each record enabling access to this information and the associated elements, whereby each record is provided with associated virtual data fields enabling access to semantic and syntactic information about that record and also access to the associated elements.
53. A processing system for processing free-format data records stored in a computing system, comprising means for examining elements of the data of each record to determine attributes of the data, by examining the content of the elements and the contextual relationship of elements to each other, to determine semantic and syntactic information (attributes) about the data, and means for producing virtual data fields associated with each record enabling access to this information and the associated elements, whereby each record is provided with associated virtual data fields enabling access to semantic and syntactic information about that record and also access to the associated elements.
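The following is an illustrative sketch only, not the claimed implementation. It shows one way the elements recited above could fit together for postal address data: a text object whose nodes carry an attribute-type identifier, a pointer (offset) and character length into the unchanged record, a match weighting value and a phonetic matching value, plus a text object index keyed on attribute type and a representative value. The token rules, the weights, the names `Node`, `TextObject`, `classify`, `phonetic_key`, `build_text_object` and `build_index`, and the simplified Soundex-style key are all assumptions made for this example; the claims do not specify them.

```python
import re
from dataclasses import dataclass, field

# Hypothetical street-type vocabulary; a real domain would come from definition files.
STREET_TYPES = {"ST", "STREET", "RD", "ROAD", "AVE", "AVENUE"}

# Consonant groups for a simplified Soundex-style key (an assumed matching value).
CODES = {c: d for d, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                 "4": "L", "5": "MN", "6": "R"}.items()
         for c in letters}

def phonetic_key(word):
    """Return a 4-character phonetic key, or '' if the word has no letters."""
    letters = [c for c in word.upper() if c.isalpha()]
    if not letters:
        return ""
    key, prev = letters[0], CODES.get(letters[0], "")
    for c in letters[1:]:
        code = CODES.get(c, "")
        if code and code != prev:
            key += code
        prev = code
    return (key + "000")[:4]

@dataclass
class Node:
    attr_type: str   # attribute-type identifier
    offset: int      # pointer into the original record
    length: int      # character length of the element
    weight: float    # match weighting value (assumed numbers)
    phonetic: str    # phonetic matching value

@dataclass
class TextObject:
    record_id: int
    nodes: list = field(default_factory=list)

    def element(self, record, node):
        # Follow the pointer back into the untouched free-format record.
        return record[node.offset:node.offset + node.length]

def classify(tokens, i):
    """Assign an attribute type from token content and its neighbours."""
    text = tokens[i].group().upper()
    if text.isdigit():
        # Context decides: leading digits are a street number, later ones a postcode.
        return ("STREET_NUMBER", 0.6) if i == 0 else ("POSTCODE", 0.9)
    if text in STREET_TYPES:
        return ("STREET_TYPE", 0.2)
    if i + 1 < len(tokens) and tokens[i + 1].group().upper() in STREET_TYPES:
        return ("STREET_NAME", 0.8)
    return ("LOCALITY", 0.7)

def build_text_object(record_id, record):
    tokens = list(re.finditer(r"[A-Za-z0-9]+", record))
    obj = TextObject(record_id)
    for i, tok in enumerate(tokens):
        attr, weight = classify(tokens, i)
        obj.nodes.append(Node(attr, tok.start(), len(tok.group()),
                              weight, phonetic_key(tok.group())))
    return obj

def build_index(records):
    """Map (attribute type, representative value key) to record ids."""
    index = {}
    for rid, record in enumerate(records):
        for node in build_text_object(rid, record).nodes:
            index.setdefault((node.attr_type, node.phonetic), set()).add(rid)
    return index

records = ["12 Smith St, Newtown 2042", "7 Smyth Street Redfern 2016"]
index = build_index(records)
matches = sorted(index[("STREET_NAME", phonetic_key("Smith"))])
print(matches)  # [0, 1]
```

In this sketch the stored records are never rewritten; a query such as "records whose street name sounds like Smith" is answered from the index, and the original text of an element is recovered through the node's offset and length.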
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/898,948 US20020010714A1 (en) | 1997-04-22 | 2001-07-03 | Method and apparatus for processing free-format data |
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
AUPP0439 | 1997-04-22 | ||
AUPP043997 | 1997-04-22 | ||
US09/117,776 US6272495B1 (en) | 1997-04-22 | 1998-04-22 | Method and apparatus for processing free-format data |
US09/898,948 US20020010714A1 (en) | 1997-04-22 | 2001-07-03 | Method and apparatus for processing free-format data |
Related Parent Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/117,776 Division US6272495B1 (en) | 1997-04-22 | 1998-04-22 | Method and apparatus for processing free-format data |
Publications (1)
Publication Number | Publication Date |
---|---|
US20020010714A1 true US20020010714A1 (en) | 2002-01-24 |
Family
ID=3804719
Family Applications (2)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/117,776 Expired - Lifetime US6272495B1 (en) | 1997-04-22 | 1998-04-22 | Method and apparatus for processing free-format data |
US09/898,948 Abandoned US20020010714A1 (en) | 1997-04-22 | 2001-07-03 | Method and apparatus for processing free-format data |
Family Applications Before (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US09/117,776 Expired - Lifetime US6272495B1 (en) | 1997-04-22 | 1998-04-22 | Method and apparatus for processing free-format data |
Country Status (5)
Country | Link |
---|---|
US (2) | US6272495B1 (en) |
EP (1) | EP1078323A4 (en) |
CN (1) | CN1204515C (en) |
CA (1) | CA2329345A1 (en) |
WO (1) | WO1998048360A1 (en) |
Cited By (55)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020023123A1 (en) * | 1999-07-26 | 2002-02-21 | Justin P. Madison | Geographic data locator |
US20020111993A1 (en) * | 2001-02-09 | 2002-08-15 | Reed Erik James | System and method for detecting and verifying digitized content over a computer network |
US20020111961A1 (en) * | 2000-12-19 | 2002-08-15 | International Business Machines Corporation | Automatic assignment of field labels |
US20030046399A1 (en) * | 1999-11-10 | 2003-03-06 | Jeffrey Boulter | Online playback system with community bias |
US20030158835A1 (en) * | 2002-02-19 | 2003-08-21 | International Business Machines Corporation | Plug-in parsers for configuring search engine crawler |
US20030216913A1 (en) * | 2002-05-14 | 2003-11-20 | Microsoft Corporation | Natural input recognition tool |
US20030229537A1 (en) * | 2000-05-03 | 2003-12-11 | Dunning Ted E. | Relationship discovery engine |
US6694338B1 (en) | 2000-08-29 | 2004-02-17 | Contivo, Inc. | Virtual aggregate fields |
US20040060006A1 (en) * | 2002-06-13 | 2004-03-25 | Cerisent Corporation | XML-DB transactional update scheme |
US20040103105A1 (en) * | 2002-06-13 | 2004-05-27 | Cerisent Corporation | Subtree-structured XML database |
US20040107205A1 (en) * | 2002-12-03 | 2004-06-03 | Lockheed Martin Corporation | Boolean rule-based system for clustering similar records |
US20040167883A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Methods and systems for providing a service for producing structured data elements from free text sources |
US20040268247A1 (en) * | 2003-02-12 | 2004-12-30 | Lutz Rosenpflanzer | Managing different representations of information |
US20050028046A1 (en) * | 2003-07-31 | 2005-02-03 | International Business Machines Corporation | Alert flags for data cleaning and data analysis |
US20050187968A1 (en) * | 2000-05-03 | 2005-08-25 | Dunning Ted E. | File splitting, scalable coding, and asynchronous transmission in streamed data transfer |
US20050197906A1 (en) * | 2003-09-10 | 2005-09-08 | Kindig Bradley D. | Music purchasing and playing system and method |
US20050216429A1 (en) * | 2004-03-24 | 2005-09-29 | Hertz Michael T | System and method for collaborative systems engineering |
US20050234675A1 (en) * | 2004-04-15 | 2005-10-20 | Tillotson Timothy N | Dynamic runtime modification of scpi grammar |
US20060242193A1 (en) * | 2000-05-03 | 2006-10-26 | Dunning Ted E | Information retrieval engine |
US20070055656A1 (en) * | 2005-08-01 | 2007-03-08 | Semscript Ltd. | Knowledge repository |
US20070083511A1 (en) * | 2005-10-11 | 2007-04-12 | Microsoft Corporation | Finding similarities in data records |
US20070081529A1 (en) * | 2003-12-12 | 2007-04-12 | Nec Corporation | Information processing system, method of processing information, and program for processing information |
US20070136250A1 (en) * | 2002-06-13 | 2007-06-14 | Mark Logic Corporation | XML Database Mixed Structural-Textual Classification System |
US20070226211A1 (en) * | 2006-03-27 | 2007-09-27 | Heinze Daniel T | Auditing the Coding and Abstracting of Documents |
US20080065607A1 (en) * | 2006-09-08 | 2008-03-13 | Dominik Weber | System and Method for Building and Retrieving a Full Text Index |
US20080080505A1 (en) * | 2006-09-29 | 2008-04-03 | Munoz Robert J | Methods and Apparatus for Performing Packet Processing Operations in a Network |
US20080256108A1 (en) * | 2007-04-13 | 2008-10-16 | Heinze Daniel T | Mere-Parsing with Boundary & Semantic Driven Scoping |
US20080256329A1 (en) * | 2007-04-13 | 2008-10-16 | Heinze Daniel T | Multi-Magnitudinal Vectors with Resolution Based on Source Vector Features |
US20080270136A1 (en) * | 2005-11-30 | 2008-10-30 | International Business Machines Corporation | Methods and Apparatus for Use in Speech Recognition Systems for Identifying Unknown Words and for Adding Previously Unknown Words to Vocabularies and Grammars of Speech Recognition Systems |
WO2008156600A1 (en) * | 2007-06-18 | 2008-12-24 | Geographic Services, Inc. | Geographic feature name search system |
US20090070284A1 (en) * | 2000-11-28 | 2009-03-12 | Semscript Ltd. | Knowledge storage and retrieval system and method |
US20090070140A1 (en) * | 2007-08-03 | 2009-03-12 | A-Life Medical, Inc. | Visualizing the Documentation and Coding of Surgical Procedures |
US20090100138A1 (en) * | 2003-07-18 | 2009-04-16 | Harris Scott C | Spam filter |
US20090192968A1 (en) * | 2007-10-04 | 2009-07-30 | True Knowledge Ltd. | Enhanced knowledge repository |
US7681185B2 (en) | 2005-10-12 | 2010-03-16 | Microsoft Corporation | Template-driven approach to extract, transform, and/or load |
US7707221B1 (en) | 2002-04-03 | 2010-04-27 | Yahoo! Inc. | Associating and linking compact disc metadata |
US7711838B1 (en) | 1999-11-10 | 2010-05-04 | Yahoo! Inc. | Internet radio and broadcast method |
US20100115129A1 (en) * | 2008-10-31 | 2010-05-06 | Samsung Electronics Co., Ltd. | Conditional processing method and apparatus |
US7756858B2 (en) | 2002-06-13 | 2010-07-13 | Mark Logic Corporation | Parent-child query indexing for xml databases |
US20100205167A1 (en) * | 2009-02-10 | 2010-08-12 | True Knowledge Ltd. | Local business and product search system and method |
US20110196665A1 (en) * | 2006-03-14 | 2011-08-11 | Heinze Daniel T | Automated Interpretation of Clinical Encounters with Cultural Cues |
US8271333B1 (en) | 2000-11-02 | 2012-09-18 | Yahoo! Inc. | Content-related wallpaper |
US20130151571A1 (en) * | 2011-12-07 | 2013-06-13 | Sap Ag | Interface defined virtual data fields |
US20140278363A1 (en) * | 2013-03-15 | 2014-09-18 | International Business Machines Corporation | Enhanced Answers in DeepQA System According to User Preferences |
US9110882B2 (en) | 2010-05-14 | 2015-08-18 | Amazon Technologies, Inc. | Extracting structured knowledge from unstructured text |
US9298754B2 (en) * | 2012-11-15 | 2016-03-29 | Ecole Polytechnique Federale de Lausanne (EPFL) (027559) | Query management system and engine allowing for efficient query execution on raw details |
WO2016149834A1 (en) * | 2015-03-26 | 2016-09-29 | Caswil Corporation | System and method for querying data sources |
US9535899B2 (en) | 2013-02-20 | 2017-01-03 | International Business Machines Corporation | Automatic semantic rating and abstraction of literature |
US9547650B2 (en) | 2000-01-24 | 2017-01-17 | George Aposporos | System for sharing and rating streaming media playlists |
WO2017040103A1 (en) * | 2015-08-28 | 2017-03-09 | Honeywell International Inc. | Converting data sets in a shared communication environment |
WO2020023719A1 (en) * | 2018-07-25 | 2020-01-30 | Ab Initio Technology Llc | Structured record retrieval |
US11100557B2 (en) | 2014-11-04 | 2021-08-24 | International Business Machines Corporation | Travel itinerary recommendation engine using inferred interests and sentiments |
US11200379B2 (en) | 2013-10-01 | 2021-12-14 | Optum360, Llc | Ontologically driven procedure coding |
US11562813B2 (en) | 2013-09-05 | 2023-01-24 | Optum360, Llc | Automated clinical indicator recognition with natural language processing |
US12124519B2 (en) | 2020-10-20 | 2024-10-22 | Optum360, Llc | Auditing the coding and abstracting of documents |
Families Citing this family (90)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6571243B2 (en) | 1997-11-21 | 2003-05-27 | Amazon.Com, Inc. | Method and apparatus for creating extractors, field information objects and inheritance hierarchies in a framework for retrieving semistructured information |
US6631500B1 (en) * | 1998-12-15 | 2003-10-07 | International Business Machines Corporation | Method, system and computer program product for transferring human language data across system boundaries |
US6654731B1 (en) | 1999-03-01 | 2003-11-25 | Oracle Corporation | Automated integration of terminological information into a knowledge base |
US6889260B1 (en) * | 1999-06-10 | 2005-05-03 | Ec Enabler, Ltd | Method and system for transferring information |
US6697947B1 (en) * | 1999-06-17 | 2004-02-24 | International Business Machines Corporation | Biometric based multi-party authentication |
US6631501B1 (en) * | 1999-06-30 | 2003-10-07 | Microsoft Corporation | Method and system for automatic type and replace of characters in a sequence of characters |
US7143350B2 (en) * | 1999-06-30 | 2006-11-28 | Microsoft Corporation | Method and system for character sequence checking according to a selected language |
US6507846B1 (en) * | 1999-11-09 | 2003-01-14 | Joint Technology Corporation | Indexing databases for efficient relational querying |
WO2001035217A2 (en) * | 1999-11-12 | 2001-05-17 | E-Brain Solutions, Llc | Graphical user interface |
US6947946B2 (en) * | 1999-12-28 | 2005-09-20 | International Business Machines Corporation | Database system including hierarchical link table |
US7251665B1 (en) | 2000-05-03 | 2007-07-31 | Yahoo! Inc. | Determining a known character string equivalent to a query string |
US6560608B1 (en) * | 2000-06-09 | 2003-05-06 | Contivo, Inc. | Method and apparatus for automatically selecting a rule |
US6941513B2 (en) * | 2000-06-15 | 2005-09-06 | Cognisphere, Inc. | System and method for text structuring and text generation |
JP2002007169A (en) * | 2000-06-23 | 2002-01-11 | Nec Corp | System for measuring grammar comprehension rate |
US6920247B1 (en) * | 2000-06-27 | 2005-07-19 | Cardiff Software, Inc. | Method for optical recognition of a multi-language set of letters with diacritics |
US6633868B1 (en) * | 2000-07-28 | 2003-10-14 | Shermann Loyall Min | System and method for context-based document retrieval |
US20020120651A1 (en) * | 2000-09-12 | 2002-08-29 | Lingomotors, Inc. | Natural language search method and system for electronic books |
US20020147578A1 (en) * | 2000-09-29 | 2002-10-10 | Lingomotors, Inc. | Method and system for query reformulation for searching of information |
US7233940B2 (en) * | 2000-11-06 | 2007-06-19 | Answers Corporation | System for processing at least partially structured data |
US7031959B2 (en) * | 2000-11-17 | 2006-04-18 | United States Postal Service | Address matching |
US7370040B1 (en) * | 2000-11-21 | 2008-05-06 | Microsoft Corporation | Searching with adaptively configurable user interface and extensible query language |
US20030028564A1 (en) * | 2000-12-19 | 2003-02-06 | Lingomotors, Inc. | Natural language method and system for matching and ranking documents in terms of semantic relatedness |
US6714939B2 (en) * | 2001-01-08 | 2004-03-30 | Softface, Inc. | Creation of structured data from plain text |
AUPR511301A0 (en) * | 2001-05-18 | 2001-06-14 | Mastersoft Research Pty Limited | Parsing system |
SG103289A1 (en) * | 2001-05-25 | 2004-04-29 | Meng Soon Cheo | System for indexing textual and non-textual files |
US20030114955A1 (en) * | 2001-12-17 | 2003-06-19 | Pitney Bowes Incorporated | Method and system for processing return to sender mailpieces, notifying sender of addressee changes and charging sender for processing of return to sender mailpieces |
PL374305A1 (en) * | 2001-12-28 | 2005-10-03 | Jeffrey James Jonas | Real time data warehousing |
US20080256069A1 (en) * | 2002-09-09 | 2008-10-16 | Jeffrey Scott Eder | Complete Context(tm) Query System |
US20030154208A1 (en) * | 2002-02-14 | 2003-08-14 | Meddak Ltd | Medical data storage system and method |
US7673234B2 (en) * | 2002-03-11 | 2010-03-02 | The Boeing Company | Knowledge management using text classification |
US7900052B2 (en) | 2002-11-06 | 2011-03-01 | International Business Machines Corporation | Confidential data sharing and anonymous entity resolution |
US20040107203A1 (en) * | 2002-12-03 | 2004-06-03 | Lockheed Martin Corporation | Architecture for a data cleansing application |
US7346927B2 (en) | 2002-12-12 | 2008-03-18 | Access Business Group International Llc | System and method for storing and accessing secure data |
US20040122653A1 (en) * | 2002-12-23 | 2004-06-24 | Mau Peter K.L. | Natural language interface semantic object module |
US8620937B2 (en) * | 2002-12-27 | 2013-12-31 | International Business Machines Corporation | Real time data warehousing |
JP2006512864A (en) | 2002-12-31 | 2006-04-13 | インターナショナル・ビジネス・マシーンズ・コーポレーション | Allowed anonymous authentication |
US7603661B2 (en) * | 2003-01-30 | 2009-10-13 | Hamilton Sunstrand | Parse table generation method and system |
US7200602B2 (en) * | 2003-02-07 | 2007-04-03 | International Business Machines Corporation | Data set comparison and net change processing |
US7962757B2 (en) * | 2003-03-24 | 2011-06-14 | International Business Machines Corporation | Secure coordinate identification method, system and program |
EP1609044A4 (en) * | 2003-03-28 | 2008-08-06 | Dun & Bradstreet Inc | System and method for data cleansing |
US20050065960A1 (en) * | 2003-09-19 | 2005-03-24 | Jen-Lin Chao | Method and system of data management |
US7779039B2 (en) * | 2004-04-02 | 2010-08-17 | Salesforce.Com, Inc. | Custom entities and fields in a multi-tenant database system |
US7305404B2 (en) * | 2003-10-21 | 2007-12-04 | United Parcel Service Of America, Inc. | Data structure and management system for a superset of relational databases |
US8037102B2 (en) | 2004-02-09 | 2011-10-11 | Robert T. and Virginia T. Jenkins | Manipulating sets of hierarchical data |
JP2005234915A (en) * | 2004-02-20 | 2005-09-02 | Brother Ind Ltd | Data processor and data processing program |
US20050246353A1 (en) * | 2004-05-03 | 2005-11-03 | Yoav Ezer | Automated transformation of unstructured data |
US20050257135A1 (en) * | 2004-05-14 | 2005-11-17 | Robert Listou | Computer generated report comprised of names, text descriptions, and selected parametric values of designated text data objects listed on data tables |
CN1322418C (en) * | 2004-08-18 | 2007-06-20 | 华为技术有限公司 | System for realizing object continuous service and method thereof |
US7925658B2 (en) * | 2004-09-17 | 2011-04-12 | Actuate Corporation | Methods and apparatus for mapping a hierarchical data structure to a flat data structure for use in generating a report |
US7801923B2 (en) | 2004-10-29 | 2010-09-21 | Robert T. and Virginia T. Jenkins as Trustees of the Jenkins Family Trust | Method and/or system for tagging trees |
US7627591B2 (en) | 2004-10-29 | 2009-12-01 | Skyler Technology, Inc. | Method and/or system for manipulating tree expressions |
US8161020B2 (en) * | 2004-11-15 | 2012-04-17 | Zi Corporation Of Canada, Inc. | Searching for and providing objects using byte-by-byte comparison |
US7630995B2 (en) | 2004-11-30 | 2009-12-08 | Skyler Technology, Inc. | Method and/or system for transmitting and/or receiving data |
US7636727B2 (en) | 2004-12-06 | 2009-12-22 | Skyler Technology, Inc. | Enumeration of trees from finite number of nodes |
US8316059B1 (en) | 2004-12-30 | 2012-11-20 | Robert T. and Virginia T. Jenkins | Enumeration of rooted partial subtrees |
US7979405B2 (en) * | 2005-01-14 | 2011-07-12 | Microsoft Corporation | Method for automatically associating data with a document based on a prescribed type of the document |
US7555486B2 (en) * | 2005-01-20 | 2009-06-30 | Pi Corporation | Data storage and retrieval system with optimized categorization of information items based on category selection |
US7412452B2 (en) * | 2005-01-20 | 2008-08-12 | Pi Corporation | Data storage and retrieval system with intensional category representations to provide dynamic categorization of information items |
US8615530B1 (en) | 2005-01-31 | 2013-12-24 | Robert T. and Virginia T. Jenkins as Trustees for the Jenkins Family Trust | Method and/or system for tree transformation |
US7966286B2 (en) * | 2005-02-14 | 2011-06-21 | Microsoft Corporation | Hierarchical management of object schema and behavior |
US7653653B2 (en) * | 2005-02-14 | 2010-01-26 | Microsoft Corporation | Dynamically configurable lists for including multiple content types |
US7681177B2 (en) | 2005-02-28 | 2010-03-16 | Skyler Technology, Inc. | Method and/or system for transforming between trees and strings |
US7899821B1 (en) | 2005-04-29 | 2011-03-01 | Karl Schiffmann | Manipulation and/or analysis of hierarchical data |
DE102006021543A1 (en) * | 2006-05-08 | 2007-11-15 | Abb Technology Ag | System and method for the automated acceptance and evaluation of the quality of mass data of a technical process or a technical project |
US7627562B2 (en) * | 2006-06-13 | 2009-12-01 | Microsoft Corporation | Obfuscating document stylometry |
US8204831B2 (en) | 2006-11-13 | 2012-06-19 | International Business Machines Corporation | Post-anonymous fuzzy comparisons without the use of pre-anonymization variants |
US20080133443A1 (en) * | 2006-11-30 | 2008-06-05 | Bohannon Philip L | Methods and Apparatus for User-Guided Inference of Regular Expressions for Information Extraction |
US7913764B2 (en) * | 2007-08-02 | 2011-03-29 | Agr Subsea, Inc. | Return line mounted pump for riserless mud return system |
US8311806B2 (en) | 2008-06-06 | 2012-11-13 | Apple Inc. | Data detection in a sequence of tokens using decision tree reductions |
US8738360B2 (en) | 2008-06-06 | 2014-05-27 | Apple Inc. | Data detection of a character sequence having multiple possible data types |
US9349143B2 (en) | 2008-11-24 | 2016-05-24 | Ebay Inc. | System and method for generating an electronic catalog booklet for online computer users |
US8397222B2 (en) | 2008-12-05 | 2013-03-12 | Peter D. Warren | Any-to-any system for doing computing |
US8175388B1 (en) * | 2009-01-30 | 2012-05-08 | Adobe Systems Incorporated | Recognizing text at multiple orientations |
US20120078950A1 (en) * | 2010-09-29 | 2012-03-29 | Nvest Incorporated | Techniques for Extracting Unstructured Data |
US8447782B1 (en) * | 2010-12-16 | 2013-05-21 | Emc Corporation | Data access layer having a mapping module for transformation of data into common information model compliant objects |
US8990232B2 (en) * | 2012-05-15 | 2015-03-24 | Telefonaktiebolaget L M Ericsson (Publ) | Apparatus and method for parallel regular expression matching |
CN103514224B (en) | 2012-06-29 | 2017-08-25 | 国际商业机器公司 | Data processing method, data query method and related device in database |
US9070090B2 (en) * | 2012-08-28 | 2015-06-30 | Oracle International Corporation | Scalable string matching as a component for unsupervised learning in semantic meta-model development |
US9201869B2 (en) | 2012-08-28 | 2015-12-01 | Oracle International Corporation | Contextually blind data conversion using indexed string matching |
US9852188B2 (en) * | 2014-06-23 | 2017-12-26 | Google Llc | Contextual search on multimedia content |
US10216783B2 (en) * | 2014-10-02 | 2019-02-26 | Microsoft Technology Licensing, Llc | Segmenting data with included separators |
US10572589B2 (en) * | 2014-11-10 | 2020-02-25 | International Business Machines Corporation | Cognitive matching of narrative data |
US9535903B2 (en) * | 2015-04-13 | 2017-01-03 | International Business Machines Corporation | Scoring unfielded personal names without prior parsing |
US10365901B2 (en) * | 2015-08-14 | 2019-07-30 | Entit Software Llc | Dynamic lexer object construction |
US10156842B2 (en) | 2015-12-31 | 2018-12-18 | General Electric Company | Device enrollment in a cloud service using an authenticated application |
US10275450B2 (en) * | 2016-02-15 | 2019-04-30 | Tata Consultancy Services Limited | Method and system for managing data quality for Spanish names and addresses in a database |
CN107203542A (en) * | 2016-03-17 | 2017-09-26 | 阿里巴巴集团控股有限公司 | Phrase extracting method and device |
US11495343B2 (en) * | 2017-04-21 | 2022-11-08 | Koninklijke Philips N.V. | Device, system, and method for determining a reading environment by synthesizing downstream needs |
US10482128B2 (en) | 2017-05-15 | 2019-11-19 | Oracle International Corporation | Scalable approach to information-theoretic string similarity using a guaranteed rank threshold |
US10885056B2 (en) | 2017-09-29 | 2021-01-05 | Oracle International Corporation | Data standardization techniques |
Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4974191A (en) * | 1987-07-31 | 1990-11-27 | Syntellect Software Inc. | Adaptive natural language computer interface system |
US5060155A (en) * | 1989-02-01 | 1991-10-22 | Bso/Buro Voor Systeemontwikkeling B.V. | Method and system for the representation of multiple analyses in dependency grammar and parser for generating such representation |
US5146406A (en) * | 1989-08-16 | 1992-09-08 | International Business Machines Corporation | Computer method for identifying predicate-argument structures in natural language text |
US5276616A (en) * | 1989-10-16 | 1994-01-04 | Sharp Kabushiki Kaisha | Apparatus for automatically generating index |
US5406480A (en) * | 1992-01-17 | 1995-04-11 | Matsushita Electric Industrial Co., Ltd. | Building and updating of co-occurrence dictionary and analyzing of co-occurrence and meaning |
US5442780A (en) * | 1991-07-11 | 1995-08-15 | Mitsubishi Denki Kabushiki Kaisha | Natural language database retrieval system using virtual tables to convert parsed input phrases into retrieval keys |
US5454106A (en) * | 1993-05-17 | 1995-09-26 | International Business Machines Corporation | Database retrieval system using natural language for presenting understood components of an ambiguous query on a user interface |
US5515534A (en) * | 1992-09-29 | 1996-05-07 | At&T Corp. | Method of translating free-format data records into a normalized format based on weighted attribute variants |
US5526522A (en) * | 1991-03-08 | 1996-06-11 | Nec Corporation | Automatic program generating system using recursive conversion of a program specification into syntactic tree format and using design knowledge base |
US5740421A (en) * | 1995-04-03 | 1998-04-14 | Dtl Data Technologies Ltd. | Associative search method for heterogeneous databases with an integration mechanism configured to combine schema-free data models such as a hyperbase |
US5826258A (en) * | 1996-10-02 | 1998-10-20 | Junglee Corporation | Method and apparatus for structuring the querying and interpretation of semistructured information |
US5937407A (en) * | 1996-12-12 | 1999-08-10 | Digital Vision Laboratories Corporation | Information retrieval apparatus using a hierarchical structure of schema |
US5960430A (en) * | 1996-08-23 | 1999-09-28 | General Electric Company | Generating rules for matching new customer records to existing customer records in a large database |
US5963894A (en) * | 1994-06-24 | 1999-10-05 | Microsoft Corporation | Method and system for bootstrapping statistical processing into a rule-based natural language parser |
US5966686A (en) * | 1996-06-28 | 1999-10-12 | Microsoft Corporation | Method and system for computing semantic logical forms from syntax trees |
US5978820A (en) * | 1995-03-31 | 1999-11-02 | Hitachi, Ltd. | Text summarizing method and system |
US6052693A (en) * | 1996-07-02 | 2000-04-18 | Harlequin Group Plc | System for assembling large databases through information extracted from text sources |
US6151604A (en) * | 1995-03-28 | 2000-11-21 | Dex Information Systems, Inc. | Method and apparatus for improved information storage and retrieval system |
US6167393A (en) * | 1996-09-20 | 2000-12-26 | Novell, Inc. | Heterogeneous record search apparatus and method |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0280866A3 (en) * | 1987-03-03 | 1992-07-08 | International Business Machines Corporation | Computer method for automatic extraction of commonly specified information from business correspondence |
US5715449A (en) | 1994-06-20 | 1998-02-03 | Oceania, Inc. | Method for generating structured medical text through user selection of displayed text and rules |
US5734883A (en) | 1995-04-27 | 1998-03-31 | Michael Umen & Co., Inc. | Drug document production system |
AU5852896A (en) | 1995-05-05 | 1996-11-21 | Apple Computer, Inc. | Method and apparatus for managing text objects |
- 1998
  - 1998-04-22 WO PCT/AU1998/000288 patent/WO1998048360A1/en active IP Right Grant
  - 1998-04-22 CA CA002329345A patent/CA2329345A1/en not_active Abandoned
  - 1998-04-22 US US09/117,776 patent/US6272495B1/en not_active Expired - Lifetime
  - 1998-04-22 CN CNB988142023A patent/CN1204515C/en not_active Expired - Fee Related
  - 1998-04-22 EP EP98916644A patent/EP1078323A4/en not_active Withdrawn
- 2001
  - 2001-07-03 US US09/898,948 patent/US20020010714A1/en not_active Abandoned
Patent Citations (19)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4974191A (en) * | 1987-07-31 | 1990-11-27 | Syntellect Software Inc. | Adaptive natural language computer interface system |
US5060155A (en) * | 1989-02-01 | 1991-10-22 | Bso/Buro Voor Systeemontwikkeling B.V. | Method and system for the representation of multiple analyses in dependency grammar and parser for generating such representation |
US5146406A (en) * | 1989-08-16 | 1992-09-08 | International Business Machines Corporation | Computer method for identifying predicate-argument structures in natural language text |
US5276616A (en) * | 1989-10-16 | 1994-01-04 | Sharp Kabushiki Kaisha | Apparatus for automatically generating index |
US5526522A (en) * | 1991-03-08 | 1996-06-11 | Nec Corporation | Automatic program generating system using recursive conversion of a program specification into syntactic tree format and using design knowledge base |
US5442780A (en) * | 1991-07-11 | 1995-08-15 | Mitsubishi Denki Kabushiki Kaisha | Natural language database retrieval system using virtual tables to convert parsed input phrases into retrieval keys |
US5406480A (en) * | 1992-01-17 | 1995-04-11 | Matsushita Electric Industrial Co., Ltd. | Building and updating of co-occurrence dictionary and analyzing of co-occurrence and meaning |
US5515534A (en) * | 1992-09-29 | 1996-05-07 | At&T Corp. | Method of translating free-format data records into a normalized format based on weighted attribute variants |
US5454106A (en) * | 1993-05-17 | 1995-09-26 | International Business Machines Corporation | Database retrieval system using natural language for presenting understood components of an ambiguous query on a user interface |
US5963894A (en) * | 1994-06-24 | 1999-10-05 | Microsoft Corporation | Method and system for bootstrapping statistical processing into a rule-based natural language parser |
US6151604A (en) * | 1995-03-28 | 2000-11-21 | Dex Information Systems, Inc. | Method and apparatus for improved information storage and retrieval system |
US5978820A (en) * | 1995-03-31 | 1999-11-02 | Hitachi, Ltd. | Text summarizing method and system |
US5740421A (en) * | 1995-04-03 | 1998-04-14 | Dtl Data Technologies Ltd. | Associative search method for heterogeneous databases with an integration mechanism configured to combine schema-free data models such as a hyperbase |
US5966686A (en) * | 1996-06-28 | 1999-10-12 | Microsoft Corporation | Method and system for computing semantic logical forms from syntax trees |
US6052693A (en) * | 1996-07-02 | 2000-04-18 | Harlequin Group Plc | System for assembling large databases through information extracted from text sources |
US5960430A (en) * | 1996-08-23 | 1999-09-28 | General Electric Company | Generating rules for matching new customer records to existing customer records in a large database |
US6167393A (en) * | 1996-09-20 | 2000-12-26 | Novell, Inc. | Heterogeneous record search apparatus and method |
US5826258A (en) * | 1996-10-02 | 1998-10-20 | Junglee Corporation | Method and apparatus for structuring the querying and interpretation of semistructured information |
US5937407A (en) * | 1996-12-12 | 1999-08-10 | Digital Vision Laboratories Corporation | Information retrieval apparatus using a hierarchical structure of schema |
Cited By (130)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20020023123A1 (en) * | 1999-07-26 | 2002-02-21 | Justin P. Madison | Geographic data locator |
US20030046399A1 (en) * | 1999-11-10 | 2003-03-06 | Jeffrey Boulter | Online playback system with community bias |
US7711838B1 (en) | 1999-11-10 | 2010-05-04 | Yahoo! Inc. | Internet radio and broadcast method |
US9779095B2 (en) | 2000-01-24 | 2017-10-03 | George Aposporos | User input-based play-list generation and playback system |
US9547650B2 (en) | 2000-01-24 | 2017-01-17 | George Aposporos | System for sharing and rating streaming media playlists |
US10318647B2 (en) | 2000-01-24 | 2019-06-11 | Bluebonnet Internet Media Services, Llc | User input-based play-list generation and streaming media playback system |
US20030229537A1 (en) * | 2000-05-03 | 2003-12-11 | Dunning Ted E. | Relationship discovery engine |
US8005724B2 (en) | 2000-05-03 | 2011-08-23 | Yahoo! Inc. | Relationship discovery engine |
US10445809B2 (en) | 2000-05-03 | 2019-10-15 | Excalibur Ip, Llc | Relationship discovery engine |
US8352331B2 (en) | 2000-05-03 | 2013-01-08 | Yahoo! Inc. | Relationship discovery engine |
US7162482B1 (en) | 2000-05-03 | 2007-01-09 | Musicmatch, Inc. | Information retrieval engine |
US7720852B2 (en) | 2000-05-03 | 2010-05-18 | Yahoo! Inc. | Information retrieval engine |
US20050187968A1 (en) * | 2000-05-03 | 2005-08-25 | Dunning Ted E. | File splitting, scalable coding, and asynchronous transmission in streamed data transfer |
US20060242193A1 (en) * | 2000-05-03 | 2006-10-26 | Dunning Ted E | Information retrieval engine |
US6694338B1 (en) | 2000-08-29 | 2004-02-17 | Contivo, Inc. | Virtual aggregate fields |
US8271333B1 (en) | 2000-11-02 | 2012-09-18 | Yahoo! Inc. | Content-related wallpaper |
US8719318B2 (en) | 2000-11-28 | 2014-05-06 | Evi Technologies Limited | Knowledge storage and retrieval system and method |
US20090070284A1 (en) * | 2000-11-28 | 2009-03-12 | Semscript Ltd. | Knowledge storage and retrieval system and method |
US8468122B2 (en) | 2000-11-28 | 2013-06-18 | Evi Technologies Limited | Knowledge storage and retrieval system and method |
US20020111961A1 (en) * | 2000-12-19 | 2002-08-15 | International Business Machines Corporation | Automatic assignment of field labels |
US7694216B2 (en) * | 2000-12-19 | 2010-04-06 | International Business Machines Corporation | Automatic assignment of field labels |
US20020111993A1 (en) * | 2001-02-09 | 2002-08-15 | Reed Erik James | System and method for detecting and verifying digitized content over a computer network |
US20030158835A1 (en) * | 2002-02-19 | 2003-08-21 | International Business Machines Corporation | Plug-in parsers for configuring search engine crawler |
US8527495B2 (en) * | 2002-02-19 | 2013-09-03 | International Business Machines Corporation | Plug-in parsers for configuring search engine crawler |
US7707221B1 (en) | 2002-04-03 | 2010-04-27 | Yahoo! Inc. | Associating and linking compact disc metadata |
US7380203B2 (en) * | 2002-05-14 | 2008-05-27 | Microsoft Corporation | Natural input recognition tool |
US20030216913A1 (en) * | 2002-05-14 | 2003-11-20 | Microsoft Corporation | Natural input recognition tool |
US20040060006A1 (en) * | 2002-06-13 | 2004-03-25 | Cerisent Corporation | XML-DB transactional update scheme |
US7962474B2 (en) | 2002-06-13 | 2011-06-14 | Marklogic Corporation | Parent-child query indexing for XML databases |
US7756858B2 (en) | 2002-06-13 | 2010-07-13 | Mark Logic Corporation | Parent-child query indexing for xml databases |
US20040103105A1 (en) * | 2002-06-13 | 2004-05-27 | Cerisent Corporation | Subtree-structured XML database |
US20070136250A1 (en) * | 2002-06-13 | 2007-06-14 | Mark Logic Corporation | XML Database Mixed Structural-Textual Classification System |
US20040107205A1 (en) * | 2002-12-03 | 2004-06-03 | Lockheed Martin Corporation | Boolean rule-based system for clustering similar records |
US20040167887A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Integration of structured data with relational facts from free text for data mining |
US20040167870A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Systems and methods for providing a mixed data integration service |
US20040167883A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Methods and systems for providing a service for producing structured data elements from free text sources |
US20040167885A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Data products of processes of extracting role related information from free text sources |
US20040167908A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Integration of structured data with free text for data mining |
US20040167910A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Integrated data products of processes of integrating mixed format data |
US20040167886A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Production of role related information from free text sources utilizing thematic caseframes |
US20050108256A1 (en) * | 2002-12-06 | 2005-05-19 | Attensity Corporation | Visualization of integrated structured and unstructured data |
US20040167884A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Methods and products for producing role related information from free text sources |
US20040167911A1 (en) * | 2002-12-06 | 2004-08-26 | Attensity Corporation | Methods and products for integrating mixed format data including the extraction of relational facts from free text |
US20040215634A1 (en) * | 2002-12-06 | 2004-10-28 | Attensity Corporation | Methods and products for merging codes and notes into an integrated relational database |
US20040268247A1 (en) * | 2003-02-12 | 2004-12-30 | Lutz Rosenpflanzer | Managing different representations of information |
US7797626B2 (en) * | 2003-02-12 | 2010-09-14 | Sap Ag | Managing different representations of information |
US20090100138A1 (en) * | 2003-07-18 | 2009-04-16 | Harris Scott C | Spam filter |
US20070203939A1 (en) * | 2003-07-31 | 2007-08-30 | Mcardle James M | Alert Flags for Data Cleaning and Data Analysis |
US20050028046A1 (en) * | 2003-07-31 | 2005-02-03 | International Business Machines Corporation | Alert flags for data cleaning and data analysis |
US20050197906A1 (en) * | 2003-09-10 | 2005-09-08 | Kindig Bradley D. | Music purchasing and playing system and method |
US7672873B2 (en) | 2003-09-10 | 2010-03-02 | Yahoo! Inc. | Music purchasing and playing system and method |
US20090043423A1 (en) * | 2003-12-12 | 2009-02-12 | Nec Corporation | Information processing system, method of processing information, and program for processing information |
US8473099B2 (en) | 2003-12-12 | 2013-06-25 | Nec Corporation | Information processing system, method of processing information, and program for processing information |
EP2267697A3 (en) * | 2003-12-12 | 2011-04-06 | Nec Corporation | Information processing system, method of processing information, and program for processing information |
US8433580B2 (en) | 2003-12-12 | 2013-04-30 | Nec Corporation | Information processing system, which adds information to translation and converts it to voice signal, and method of processing information for the same |
US20070081529A1 (en) * | 2003-12-12 | 2007-04-12 | Nec Corporation | Information processing system, method of processing information, and program for processing information |
US20050216429A1 (en) * | 2004-03-24 | 2005-09-29 | Hertz Michael T | System and method for collaborative systems engineering |
US20050234675A1 (en) * | 2004-04-15 | 2005-10-20 | Tillotson Timothy N | Dynamic runtime modification of scpi grammar |
US6975957B2 (en) * | 2004-04-15 | 2005-12-13 | Agilent Technologies, Inc. | Dynamic runtime modification of SCPI grammar |
US20070055656A1 (en) * | 2005-08-01 | 2007-03-08 | Semscript Ltd. | Knowledge repository |
US8666928B2 (en) | 2005-08-01 | 2014-03-04 | Evi Technologies Limited | Knowledge repository |
US9098492B2 (en) | 2005-08-01 | 2015-08-04 | Amazon Technologies, Inc. | Knowledge repository |
US20070083511A1 (en) * | 2005-10-11 | 2007-04-12 | Microsoft Corporation | Finding similarities in data records |
US7681185B2 (en) | 2005-10-12 | 2010-03-16 | Microsoft Corporation | Template-driven approach to extract, transform, and/or load |
US20080270136A1 (en) * | 2005-11-30 | 2008-10-30 | International Business Machines Corporation | Methods and Apparatus for Use in Speech Recognition Systems for Identifying Unknown Words and for Adding Previously Unknown Words to Vocabularies and Grammars of Speech Recognition Systems |
US9754586B2 (en) * | 2005-11-30 | 2017-09-05 | Nuance Communications, Inc. | Methods and apparatus for use in speech recognition systems for identifying unknown words and for adding previously unknown words to vocabularies and grammars of speech recognition systems |
US8423370B2 (en) | 2006-03-14 | 2013-04-16 | A-Life Medical, Inc. | Automated interpretation of clinical encounters with cultural cues |
US8655668B2 (en) | 2006-03-14 | 2014-02-18 | A-Life Medical, Llc | Automated interpretation and/or translation of clinical encounters with cultural cues |
US20110196665A1 (en) * | 2006-03-14 | 2011-08-11 | Heinze Daniel T | Automated Interpretation of Clinical Encounters with Cultural Cues |
US20070226211A1 (en) * | 2006-03-27 | 2007-09-27 | Heinze Daniel T | Auditing the Coding and Abstracting of Documents |
US8731954B2 (en) | 2006-03-27 | 2014-05-20 | A-Life Medical, Llc | Auditing the coding and abstracting of documents |
US10216901B2 (en) | 2006-03-27 | 2019-02-26 | A-Life Medical, Llc | Auditing the coding and abstracting of documents |
US10832811B2 (en) | 2006-03-27 | 2020-11-10 | Optum360, Llc | Auditing the coding and abstracting of documents |
WO2008031062A3 (en) * | 2006-09-08 | 2008-06-12 | Guidance Software Inc | System and method for building and retriving a full text index |
US20080065607A1 (en) * | 2006-09-08 | 2008-03-13 | Dominik Weber | System and Method for Building and Retrieving a Full Text Index |
WO2008031062A2 (en) * | 2006-09-08 | 2008-03-13 | Guidance Software, Inc. | System and method for building and retriving a full text index |
US7752193B2 (en) | 2006-09-08 | 2010-07-06 | Guidance Software, Inc. | System and method for building and retrieving a full text index |
US20080080505A1 (en) * | 2006-09-29 | 2008-04-03 | Munoz Robert J | Methods and Apparatus for Performing Packet Processing Operations in a Network |
US11966695B2 (en) | 2007-04-13 | 2024-04-23 | Optum360, Llc | Mere-parsing with boundary and semantic driven scoping |
US20110167074A1 (en) * | 2007-04-13 | 2011-07-07 | Heinze Daniel T | Mere-parsing with boundary and semantic drive scoping |
US8682823B2 (en) | 2007-04-13 | 2014-03-25 | A-Life Medical, Llc | Multi-magnitudinal vectors with resolution based on source vector features |
US20080256329A1 (en) * | 2007-04-13 | 2008-10-16 | Heinze Daniel T | Multi-Magnitudinal Vectors with Resolution Based on Source Vector Features |
US10019261B2 (en) | 2007-04-13 | 2018-07-10 | A-Life Medical, Llc | Multi-magnitudinal vectors with resolution based on source vector features |
US20080256108A1 (en) * | 2007-04-13 | 2008-10-16 | Heinze Daniel T | Mere-Parsing with Boundary & Semantic Driven Scoping |
US11237830B2 (en) | 2007-04-13 | 2022-02-01 | Optum360, Llc | Multi-magnitudinal vectors with resolution based on source vector features |
US10839152B2 (en) | 2007-04-13 | 2020-11-17 | Optum360, Llc | Mere-parsing with boundary and semantic driven scoping |
US10354005B2 (en) | 2007-04-13 | 2019-07-16 | Optum360, Llc | Mere-parsing with boundary and semantic driven scoping |
US9063924B2 (en) | 2007-04-13 | 2015-06-23 | A-Life Medical, Llc | Mere-parsing with boundary and semantic driven scoping |
US7908552B2 (en) * | 2007-04-13 | 2011-03-15 | A-Life Medical Inc. | Mere-parsing with boundary and semantic driven scoping |
US10061764B2 (en) | 2007-04-13 | 2018-08-28 | A-Life Medical, Llc | Mere-parsing with boundary and semantic driven scoping |
WO2008156600A1 (en) * | 2007-06-18 | 2008-12-24 | Geographic Services, Inc. | Geographic feature name search system |
US8015196B2 (en) | 2007-06-18 | 2011-09-06 | Geographic Services, Inc. | Geographic feature name search system |
US20080319990A1 (en) * | 2007-06-18 | 2008-12-25 | Geographic Services, Inc. | Geographic feature name search system |
US9946846B2 (en) | 2007-08-03 | 2018-04-17 | A-Life Medical, Llc | Visualizing the documentation and coding of surgical procedures |
US11581068B2 (en) | 2007-08-03 | 2023-02-14 | Optum360, Llc | Visualizing the documentation and coding of surgical procedures |
US20090070140A1 (en) * | 2007-08-03 | 2009-03-12 | A-Life Medical, Inc. | Visualizing the Documentation and Coding of Surgical Procedures |
US20090192968A1 (en) * | 2007-10-04 | 2009-07-30 | True Knowledge Ltd. | Enhanced knowledge repository |
US8838659B2 (en) | 2007-10-04 | 2014-09-16 | Amazon Technologies, Inc. | Enhanced knowledge repository |
US9519681B2 (en) | 2007-10-04 | 2016-12-13 | Amazon Technologies, Inc. | Enhanced knowledge repository |
US9058181B2 (en) * | 2008-10-31 | 2015-06-16 | Samsung Electronics Co., Ltd | Conditional processing method and apparatus |
US9298601B2 (en) | 2008-10-31 | 2016-03-29 | Samsung Electronics Co., Ltd | Conditional processing method and apparatus |
US20100115129A1 (en) * | 2008-10-31 | 2010-05-06 | Samsung Electronics Co., Ltd. | Conditional processing method and apparatus |
US11182381B2 (en) | 2009-02-10 | 2021-11-23 | Amazon Technologies, Inc. | Local business and product search system and method |
US9805089B2 (en) | 2009-02-10 | 2017-10-31 | Amazon Technologies, Inc. | Local business and product search system and method |
US20100205167A1 (en) * | 2009-02-10 | 2010-08-12 | True Knowledge Ltd. | Local business and product search system and method |
US9110882B2 (en) | 2010-05-14 | 2015-08-18 | Amazon Technologies, Inc. | Extracting structured knowledge from unstructured text |
US11132610B2 (en) | 2010-05-14 | 2021-09-28 | Amazon Technologies, Inc. | Extracting structured knowledge from unstructured text |
US20130151571A1 (en) * | 2011-12-07 | 2013-06-13 | Sap Ag | Interface defined virtual data fields |
US9298754B2 (en) * | 2012-11-15 | 2016-03-29 | Ecole Polytechnique Federale de Lausanne (EPFL) (027559) | Query management system and engine allowing for efficient query execution on raw details |
US9535899B2 (en) | 2013-02-20 | 2017-01-03 | International Business Machines Corporation | Automatic semantic rating and abstraction of literature |
US9244911B2 (en) * | 2013-03-15 | 2016-01-26 | International Business Machines Corporation | Enhanced answers in DeepQA system according to user preferences |
US20150006158A1 (en) * | 2013-03-15 | 2015-01-01 | International Business Machines Corporation | Enhanced Answers in DeepQA System According to User Preferences |
US9311294B2 (en) * | 2013-03-15 | 2016-04-12 | International Business Machines Corporation | Enhanced answers in DeepQA system according to user preferences |
US20140278363A1 (en) * | 2013-03-15 | 2014-09-18 | International Business Machines Corporation | Enhanced Answers in DeepQA System According to User Preferences |
US11562813B2 (en) | 2013-09-05 | 2023-01-24 | Optum360, Llc | Automated clinical indicator recognition with natural language processing |
US12045575B2 (en) | 2013-10-01 | 2024-07-23 | Optum360, Llc | Ontologically driven procedure coding |
US11200379B2 (en) | 2013-10-01 | 2021-12-14 | Optum360, Llc | Ontologically driven procedure coding |
US11288455B2 (en) | 2013-10-01 | 2022-03-29 | Optum360, Llc | Ontologically driven procedure coding |
US11100557B2 (en) | 2014-11-04 | 2021-08-24 | International Business Machines Corporation | Travel itinerary recommendation engine using inferred interests and sentiments |
US10346397B2 (en) | 2015-03-26 | 2019-07-09 | Caswil Corporation | System and method for querying data sources |
WO2016149834A1 (en) * | 2015-03-26 | 2016-09-29 | Caswil Corporation | System and method for querying data sources |
US10216742B2 (en) | 2015-08-28 | 2019-02-26 | Honeywell International Inc. | Converting data sets in a shared communication environment |
WO2017040103A1 (en) * | 2015-08-28 | 2017-03-09 | Honeywell International Inc. | Converting data sets in a shared communication environment |
JP2022503456A (en) * | 2018-07-25 | 2022-01-12 | アビニシオ テクノロジー エルエルシー | Get structured records |
JP7105982B2 (en) | 2018-07-25 | 2022-07-25 | アビニシオ テクノロジー エルエルシー | Structured record retrieval |
AU2019309856B2 (en) * | 2018-07-25 | 2022-05-26 | Ab Initio Technology Llc | Structured record retrieval |
US11294874B2 (en) | 2018-07-25 | 2022-04-05 | Ab Initio Technology Llc | Structured record retrieval |
WO2020023719A1 (en) * | 2018-07-25 | 2020-01-30 | Ab Initio Technology Llc | Structured record retrieval |
CN112513836A (en) * | 2018-07-25 | 2021-03-16 | 起元技术有限责任公司 | Structured record retrieval |
US12124519B2 (en) | 2020-10-20 | 2024-10-22 | Optum360, Llc | Auditing the coding and abstracting of documents |
Also Published As
Publication number | Publication date |
---|---|
CA2329345A1 (en) | 1998-10-29 |
US6272495B1 (en) | 2001-08-07 |
CN1315020A (en) | 2001-09-26 |
EP1078323A4 (en) | 2007-04-25 |
WO1998048360A1 (en) | 1998-10-29 |
EP1078323A1 (en) | 2001-02-28 |
CN1204515C (en) | 2005-06-01 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US6272495B1 (en) | Method and apparatus for processing free-format data | |
US10496722B2 (en) | Knowledge correlation search engine | |
US7398201B2 (en) | Method and system for enhanced data searching | |
US6055528A (en) | Method for cross-linguistic document retrieval | |
US7512575B2 (en) | Automated integration of terminological information into a knowledge base | |
JP3266246B2 (en) | Natural language analysis apparatus and method, and knowledge base construction method for natural language analysis | |
US8417513B2 (en) | Representation of objects and relationships in databases, directories, web services, and applications as sentences as a method to represent context in structured data | |
US20010007987A1 (en) | Structured-document search apparatus and method, recording medium storing structured-document searching program, and method of creating indexes for searching structured documents | |
US20050203900A1 (en) | Associative retrieval system and associative retrieval method | |
JP2000315216A (en) | Method and device for retrieving natural language | |
US8000957B2 (en) | English-language translation of exact interpretations of keyword queries | |
JP2008033931A (en) | Method for enrichment of text, method for acquiring text in response to query, and system | |
US20070179932A1 (en) | Method for finding data, research engine and microprocessor therefor | |
US7409381B1 (en) | Index to a semi-structured database | |
Al-Safadi | Natural language processing for conceptual modeling | |
CN110119404B (en) | Intelligent access system and method based on natural language understanding | |
Bais et al. | An Arabic natural language interface for querying relational databases based on natural language processing and graph theory methods | |
AU774729B2 (en) | Method and apparatus for processing free-format data | |
JP2997469B2 (en) | Natural language understanding method and information retrieval device | |
WO2001024053A9 (en) | System and method for automatic context creation for electronic documents | |
Kumaran et al. | Lexequal: Supporting multiscript matching in database systems | |
Papakitsos et al. | Modelling a Morpheme‐based Lexicon for Modern Greek | |
Narayanan et al. | Finite-state abstractions on Arabic morphology | |
JPH0258166A (en) | Knowledge retrieving method | |
Kumaran | Multilingual information processing on relational database architectures |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| | STCB | Information on status: application discontinuation | Free format text: ABANDONED -- FAILURE TO RESPOND TO AN OFFICE ACTION |