WO2010038017A1 - Partial parsing method based on the evaluation of string membership in a fuzzy grammar fragment
- Publication number
- WO2010038017A1 (application PCT/GB2009/002328)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- sequence
- document
- store
- sequences
- grammar
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Definitions
- the present invention relates to text analysis. More specifically, aspects of the present invention relate to computer-implemented methods and apparatus for analysing text in a document, and to computer-implemented methods and apparatus for updating a store of sequences of textual units being electronically stored for use in methods such as the above.
- Text mining differs from information retrieval, in that the aim of an information retrieval system is to find documents relevant to a query (or set of terms). It is assumed that the "answer" to a query is contained in one or more documents; the information retrieval process prioritises the documents according to an estimate of document relevance.
- the aim of text mining is to discover new, useful and previously unknown information automatically from a set of text documents or records. This may involve trends and patterns across documents, new connections between documents, or selected and/or changed content from the text sources.
- a natural first step in a text mining process is to tag individual words and short sections where there is a degree of local structure, prior to an intelligent processing phase. This step can involve tagging the whole document in a variety of ways - for example:
- PoS Tagging: This is based on identification of each word's role in the sentence (noun, verb, adjective etc.) according to its context and a dictionary of known words and grammatically allowed combinations. Clearly this depends on a sophisticated general grammar and a good dictionary, and labels the whole of every sentence rather than simply identifying word sequences that may be of interest. See for example Brill [2].
- A Full Parse: This is based on a more sophisticated language model that recognises natural language grammar as well as domain-specific structures. This is currently beyond the range of automatic systems and requires human input, although work involving machine learning is ongoing - see reference [3] below.
- HMM: Hidden Markov Models
- Stochastic context-free grammars, as discussed in references [4] or [5] (p542)
- the aim of the segmentation process is generally to label sub-sequences of symbols with different tags (i.e. attributes) of a specified schema.
- attributes correspond to the fragment identifiers or tags.
- the segmentation and tagging process involves finding examples of numbers, street names etc. that are sufficiently close to each other in the document to be recognisable as an address.
- catalogue entries might have product name, manufacturer, product code, price and a short description. It is convenient to define a schema and to label sequences of symbols with XML tags, although this is clearly only one of many possible representations.
- Standard methods of measuring the difference between two sequences include the "Levenshtein distance", which counts the number of insertions, deletions and substitutions necessary to make two sequences the same - for example, the sequence (S a t u r d a y) can be converted to (S u n d a y) by 3 operations: deleting a, deleting t and substituting n for r.
- An extension, the "Damerau edit distance” also includes transposition of adjacent elements as a fundamental operation. Note that the term “fuzzy string matching” is sometimes used to mean “approximate string matching", and does not necessarily have any connection to formal fuzzy set theory.
- parsing is the process of determining whether a sequence of symbols conforms to a particular pattern specified by a grammar.
- a formalised grammar such as a programming language
- parsing is not a matter of degree - a sequence of symbols is either a valid program or it is not - but in natural language and free text there is the possibility that a sequence of symbols may 'nearly' satisfy a grammar.
- Crisp parsing has been extensively studied and algorithms are summarised in standard texts (e.g. Aho and Ullman, 1972: “Theory of Parsing, Translation and Compiling", Prentice Hall).
- Fuzzy parsing can be seen as an extension of sequence matching, in which (a) we allow one sequence to be defined by a grammar, so that the matching process is far more complex, and (b) we allow the standard edit operations (insert, delete, substitute) on the string in order to make it conform to the grammar.
- probabilistic parsing which annotates each grammar rule with a probability and computes an overall probability for the (exact) parse of a given sequence of symbols.
- Such probabilities can be used to choose between multiple possible parses of a sequence of symbols; each of these parses is exact. This is useful in (for example) speech recognition where it is necessary to identify the most likely parse of a sentence.
- sequences that may not (formally) parse, but which are close to sequences that do parse. The degree of closeness is interpreted as a fuzzy membership.
- Koppler [9] described a system that he called a "fuzzy parser" which only worked on a subset of the strings in a language, rather than on the whole language. This is used to search for sub-sequences (e.g. all public data or all class definitions in a program). This is not related to the present field, however.
- CYK (also written CKY): Cocke-Younger-Kasami
- the algorithm uses a 3-dimensional matrix P[i, j, k] where 1 ≤ i, j ≤ n and 1 ≤ k ≤ r, and the matrix element P[i, j, k] is set to true if the substring of length j starting from position i can be generated from the k-th grammar rule.
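As a concrete illustration of the CYK recurrence just described, here is a minimal recogniser for a grammar in Chomsky normal form; the dictionary encoding of the rules is our own assumption, chosen for brevity (the table here holds sets of non-terminals rather than a boolean matrix, which is equivalent):

```python
def cyk(tokens, terminal_rules, binary_rules, start):
    """CYK recognition for a grammar in Chomsky normal form.

    terminal_rules: {terminal: {non-terminals producing it}}
    binary_rules:   {(B, C): {A for each rule A -> B C}}
    """
    n = len(tokens)
    # table[i][l] = set of non-terminals deriving tokens[i:i+l+1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, tok in enumerate(tokens):
        table[i][0] = set(terminal_rules.get(tok, ()))
    for length in range(2, n + 1):             # span length
        for i in range(n - length + 1):        # span start
            for split in range(1, length):     # split point within the span
                for B in table[i][split - 1]:
                    for C in table[i + split][length - split - 1]:
                        table[i][length - 1] |= binary_rules.get((B, C), set())
    return start in table[0][n - 1]
```

For example, with rules S -> A B, A -> 'a', B -> 'b', the string "ab" is accepted and "ba" is rejected.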
- European patent application EP 1197885 (QAS Ltd) relates to the retrieval of data representing postal addresses from a database of postal addresses, and is concerned with retrieving, in respect of an input having one or more input data terms that do not correspond exactly to any entry in an address dictionary, the "correct" postal address from a database and outputting this address in a correctly-formatted manner, where "correct” means the closest matching address according to a pre-defined measure of closeness.
- an input data term does not correspond exactly to any entry in the address dictionary, it is suggested to use the Levenshtein distance in order to determine the quality of correspondence between the input data term and entries from the address dictionary. It will be noted that the output provided is a correctly-chosen, correctly-formatted postal address.
- An article by Davis & Fonseca [10] relating to address matching (referred to as "geocoding") applications discusses applications which receive addresses as their input, obtained from alphanumeric attributes corresponding to data which has been recorded in a database. These attributes are descriptors of locations, and are usually in the form of postal addresses. The geocoding application then tries to find a match for each of these descriptors in a reference database, which supposedly contains the locations for every relevant address, and, if successful, returns coordinates corresponding to its location. The output provided thus comprises coordinates corresponding to a location.
- GCI Global Coding Certainty Indicator
- the contracted dictionary of proper names comprises two dictionaries: a first dictionary used to store single word names, each word name having an identification number (ID number), and a second dictionary used to store multi-word names encoded with ID numbers.
- An approximate lookup for a multi-word name is conducted for each word of the multi-word name using an approximate matching technique such as phonetic proximity or edit distance. Accordingly, suggestions are determined for each word of the multi-word name under consideration.
- multi-word candidates are assembled in ID notation.
- an approximate search for each assembled candidate is performed based on an edit distance or on approximate string matching. Edit distances and N-grams are used to measure how similar two strings are. The result is a set of multi-word suggestions in an ID notation. This ID notation is encoded back to the original form using the first dictionary.
- a computer- implemented method of analysing text in a document comprising a plurality of textual units, said method comprising: receiving said document; partitioning the text in said document into sequences of textual units from the text in said document, each sequence having a plurality of textual units; for each of a plurality of sequences of textual units from said document:
- a computer- implemented method of updating a sequence store comprising a plurality of pre-determined sequences of textual units being electronically stored for use in a computer-implemented process of analysing text in a document, said text analysis process comprising: comparing at least one sequence of textual units from said document with at least one of a plurality of pre-determined sequences stored in said sequence store; determining a similarity measure dependent on differences between said sequence from said document and said sequence from said sequence store, and in the event that said similarity measure is indicative of a degree of similarity above a pre-determined threshold between said sequence from said document and said sequence from said sequence store, updating a results store with an identifier identifying the sequence from said document and with a tag indicative of said sequence from said sequence store; said updating method comprising: receiving an indication of an existing sequence in said sequence store, said existing sequence comprising a plurality of textual units; receiving an indication of a candidate sequence, said candidate sequence comprising a plurality of
- preferred embodiments of the invention may be thought of as allowing for comparison of the ranges of two approximate grammars (i.e. the sets of strings that can be regarded as "fuzzily" conforming to each grammar) without further explicit parsing of any strings. This may enable a determination as to whether grammar definitions overlap significantly, or whether one grammar definition completely subsumes another by parsing a superset of strings.
- Figure 1 is a schematic diagram illustrating the tagging process
- Figure 2 is a flow-chart illustrating fuzzy parsing: how tags may be added and how the text analytics database may be updated;
- Figure 3 shows a detail of the tagging process for a string
- Figure 4 shows how the grammar-grammar conversion cost may be calculated
- Figure 5 illustrates how a grammar-grammar cost table may be created
- Figure 6 categorises four possible types of outcome that may result from performing an updating method according to the second aspect.
- the first aspect, which will be described with reference mainly to figures 1, 2 and 3, relates in general to enabling efficient identification of segments of natural language text which approximately conform to one or more patterns specified by fuzzy grammar fragments. Typically these may be used to find small portions of text such as addresses, names of products, named entities such as organisations, people or locations, and phrases indicating relations between these items.
- the second aspect which will be described with reference mainly to figures 4, 5 and 6, relates in general to enabling comparisons to be made of the effectiveness or usefulness of two different sets of fuzzy grammar fragments that are candidates for use in this process.
- This aspect may be of use in assessing the effect of changes proposed by a human expert without there being any need to re-process an entire set of documents and examine the tagged outputs for differences.
- a tokeniser 5 is available to extract the symbols from the document.
- the tokeniser may be of a known type, and is merely intended to split the document into a sequence of symbols recognisable to the fuzzy grammars.
- a string S could be
- BNF Backus-Naur Form
- T - a finite set of terminal symbols, e.g. { 1, 2, 3, ..., High, Main, London, ..., Street, St, Road, Rd, ..., January, Jan, February, ... }
- F - a finite set of fragment tags, e.g. { <address>, <date>, <number>, <streetEnding>, <placeName>, <postcode>, <dayNumber>, <dayName>, <monthName>, ... }
- TS = [TS1 ... TSn] - fuzzy subsets of T, such that TSi ≠ TSj unless i = j. There is a partial order on TS, defined by the usual fuzzy set inclusion operator.
- These sets may be defined by intension (i.e. a procedure or rule that returns the membership of the element in the set) or by extension (i.e. explicitly listing the elements and their memberships); where they are defined by intension, an associated set of "typical" elements is used to determine subsethood etc.
- the typical elements of a set TSi should include elements that are in TSi but not in other sets TSj, where j ≠ i
- Gi, 1 ≤ i ≤ n, is a grammar element, i.e. either Gi ∈ F or Gi ∈ TS or Gi ∈ T
- each Gi may be enclosed in brackets [Gi] to denote that it is optional in a grammar fragment.
- Fuzzy parsing is a matching process between a string and a grammar fragment in which we also allow the standard edit operations (insert, delete, substitute) on the string in order to make it conform more closely to the grammar fragment.
- the number of edit operations is represented by three numbers (I D S), where I, D and S are respectively the approximate number of Insertions, Deletions and Substitutions needed to make the string conform to the grammar fragment.
- the total cost is the sum of the three numbers.
- the set of edit operations may include other types of operation such as Transpositions (i.e. exchanging the respective positions of two or more adjacent or near- neighbouring symbols in the string) instead of or as well as one or more of the above edit operations, with the number T of Transpositions (or the number of any other such operation) being included in the sum in order to find the total cost.
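The cost bookkeeping described above can be sketched as follows; the tuple layout and names are our own assumptions, with Transpositions included as the optional fourth operation count:

```python
from collections import namedtuple

# Edit-operation counts: Insertions, Deletions, Substitutions, Transpositions
EditCost = namedtuple("EditCost", ["I", "D", "S", "T"])

def total_cost(cost):
    """The total cost is simply the sum of the per-operation counts."""
    return cost.I + cost.D + cost.S + cost.T
```

For instance, 1 insertion, 2 deletions and 1 transposition give a total cost of 4.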
- Figure 1 shows a schematic of the components involved in the tagging process, which will assist in providing a brief overview of the process to be described with reference to Figure 2.
- a fuzzy parser 10 is arranged to receive text from one or more documents 12 as its input. With reference to a store 14 containing a set of possible approximate grammar fragments, the fuzzy parser 10 uses the process to be described later in order to produce, as its output, one or more partially tagged documents 16 and a text analytics database 18 which contains the positions of tagged text, the types of tagged entities, and possible relations between tagged entities.
- Figure 2 shows the fuzzy parsing process, as follows:
- the fuzzy parser receives the document text as a file of symbols (i.e. textual units), which may include letters and/or numerals and/or other types of characters such as punctuation symbols, for example. It also receives an indication of the "window size" (i.e. the number of symbols that are to be processed at a time as a "string", i.e. a sequence of textual units).
- the window size is generally related to n, the maximum length of the grammar fragment, and is preferably 2n - although it is possible to create contrived cases in which the highest membership (equivalently, lowest cost) occurs outside a string of length 2n, such cases require large numbers of changes to the string and can generally be ignored.
- the fuzzy parser is also in communication with the store of possible grammar fragments. Assuming that the end of the file has not yet been reached (step 22), a string of symbols S is extracted from the input file according to the window size (step 23). Assuming the current string S has not already been considered in relation to all grammar fragments (step 24), a cost table is created for string S and the next grammar fragment (step 25), and the tag membership of string S is calculated (step 26). If the tag membership is above or equal to a specified threshold (step 27), an appropriate tag is added to an output file, and the tag and its position in the file are added to the analytics database (step 28), then the process returns to step 24 and continues with the next grammar fragment.
- If the tag membership is below the threshold (step 27), the process simply returns to step 24 without updating the output file and continues with the next grammar fragment. Once the current string S has been considered in relation to all grammar fragments (step 24), the process returns to step 22 and extracts the next string (step 23), unless the end of the file has been reached, in which case the process is completed at step 29 by providing, as outputs, the tagged file and the analytics database as updated.
- the step of extracting the next string of symbols from the input file may involve extracting a string starting with the textual unit immediately after the last textual unit of the previous string.
- some overlap may be arranged between the next string and its predecessor. While this may lead to slightly decreased speed (which may be compensated for by use of increased computing power, for example), it will enable further comparisons to be made between the textual units in the input document and those of the grammar fragments. If the maximum length of a grammar fragment is n and the window size is 2n, extracting strings such that the size of the overlap between successive strings is approximately equal to the maximum length of a grammar fragment (i.e. n) has been found to provide a good compromise between these two considerations.
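The window-extraction strategy described above (window size 2n, overlap of n between successive strings) can be sketched as follows; the function name and generator form are our own choices, not the patent's:

```python
def extract_strings(symbols, n):
    """Yield strings (symbol sequences) with a window size of 2n, stepping
    by n so that successive strings overlap by n symbols, where n is the
    maximum grammar fragment length."""
    window, step = 2 * n, n
    for start in range(0, len(symbols), step):
        yield symbols[start:start + window]
```

For example, with n = 2 the token stream a b c d e f g h yields the overlapping strings (a b c d), (c d e f), (e f g h), (g h).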
- the tag membership is above zero (or some other pre-determined threshold)
- the appropriate tag is added to the output file and appropriate information (i.e. the tag, the tag membership, the scope within the file) is stored in the analytics database.
- the output file and analytics database are updated in respect of each of the possibilities. For example, the string "tony blair witch project" may result in:
- tags being allocated in three different ways: (1) in respect of the pair of words "tony blair" as a person (leaving the words "witch" and "project" yet-to-be tagged or left un-tagged); (2) in respect of the words "blair witch project" as a movie (leaving the word "tony" yet-to-be tagged or left un-tagged); and (3) in respect of the whole string, which is also a movie. All three of these are therefore kept as possibilities for the string.
- the tagger receives the grammar element and the string to be tagged (or, more conveniently in practice, a reference to the string; this allows previously calculated cells of the cost table to be re-used). If a cost table has already been found for this combination of string and grammar fragment (step 32), it can be retrieved and returned immediately as an output (step 38). If not, a check is made (step 33) on whether the grammar fragment is a compound definition (rule) or atomic. If yes, repeated use of this process is made to calculate cost tables for each element in the rule body (step 36) and these are combined using the operations described in the section below headed "String Membership in a Compound Grammar Element" (step 37). If no, storage is allocated for a cost table and the cell values are calculated (step 34) as described in the section below headed "String Membership in an Atomic Grammar Element".
- In step 35, the cost table is stored for possible re-use, and the table is returned as the result of this process (step 38).
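Steps 32, 35 and 38 amount to memoising cost tables keyed by the (string, grammar fragment) pair. A minimal sketch of that caching pattern (the cache structure and `build` callback are our assumptions, not the patent's data structures):

```python
_cost_tables = {}

def get_cost_table(string, fragment, build):
    """Return the cached cost table for (string, fragment), computing and
    storing it via the `build` callback on a cache miss."""
    key = (string, fragment)
    if key not in _cost_tables:               # step 32: no stored table yet
        _cost_tables[key] = build(string, fragment)  # steps 33-35: compute and store
    return _cost_tables[key]                  # step 38: return the table
```

A second request for the same combination returns the stored table without recomputation, which is what makes the recursive compound-rule evaluation affordable.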
- In Example 2, the table columns are labelled sequentially with s1, s2, ... and the same is done for the rows.
- Each cell i, j represents the cost of converting the substring si ... sj to the grammar fragment.
- the cost table required for the tagging process is a portion of the full table as shown after cost table 1.
- Cost table 4 illustrates reuse of a stored cost table.
- the incremental cost may be calculated as follows.
- δ(s, t) measures the cost of replacing the symbol t by s.
- Such an intermediate matching operation must also be applied consistently to the costs of matching a symbol against another symbol and also when checking set membership. (For simplicity we assume here that only identical symbols match.)
- <numberSuffix> = { a/1, b/1, c/1, d/1, e/0.8, f/0.6, g/0.2 }
- cost tables 1 and 2 can be combined to give cost table 4 below:
- Cost Table 5
- grammar fragments may be altered or extended by changing atomic and compound definitions. This process could be automatic (as part of a machine learning system) or manual.
- Preferred embodiments according to a second aspect now to be described relate to enabling an approximate comparison of the sets of strings tagged by two similar but not identical grammar fragments, based on estimating the number of edit operations needed to change an arbitrary string parsed by a first (source) grammar fragment into a string that would be parsed by a second (target) grammar fragment already existing in a store of grammar fragments.
- a cost is defined to be a 5-tuple (I D S Rs Rt), where I, D and S are respectively the approximate number of insertions, deletions and substitutions needed to match the grammar fragments, i.e. to convert a string parsed by the source grammar fragment into one that would satisfy the target grammar fragment. Because the source and target grammar fragments may be different lengths, Rs and Rt represent sequences of grammar elements remaining (respectively) in the source and target grammar fragments after the match; at least one of Rs and Rt is null in every valid cost.
- a total order is defined on costs by
- the substitution cost is calculated from the membership of the element and the cardinality of the fuzzy set alpha cut at that membership. For example:
- Cost GG ({ house/1, cottage/1, villa/0.9, palace/... }, ..., null, null)
- TS s = { a/1, c/1, d/0.8 }
- TS s ∩ TS t = { a/1, c/0.4 }, and the degree of overlap is { 0.5/1, 0.666/0.4 }
- E(Cost) is the expected value of the corresponding least prejudiced distribution
- MA: { 1/2, 2/3 } : 0.4, { 1/2 } : 0.6
- LPD expected value: 0.5333
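The expected value 0.5333 can be reproduced from the mass assignment above: the least prejudiced distribution shares each focal element's mass equally among the members of that set. A sketch, using exact fractions (the list-of-pairs encoding is our own):

```python
from fractions import Fraction

def lpd_expected_value(mass_assignment):
    """Expected value of the least prejudiced distribution: each focal set's
    mass is split equally among its members, and the weighted values are
    summed."""
    total = Fraction(0)
    for focal_set, mass in mass_assignment:
        total += Fraction(mass) * sum(focal_set, Fraction(0)) / len(focal_set)
    return total

# Mass assignment { 1/2, 2/3 } : 0.4, { 1/2 } : 0.6 from the example above
ma = [([Fraction(1, 2), Fraction(2, 3)], Fraction(2, 5)),
      ([Fraction(1, 2)], Fraction(3, 5))]
```

Here the result is 0.4 × (1/2 + 2/3)/2 + 0.6 × 1/2 = 8/15 ≈ 0.5333, matching the figure quoted above.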
- a grammar fragment is a sequence of grammar elements, e.g. all or part of the body of a grammar rule. Calculating the cost of converting a source fragment GS[1..n] to satisfy the target grammar fragment GT[1..m] proceeds as follows:
- lookupDict(SPre, GS, TPre, GT) returns the relevant table if it has been stored, or null if not
- NB SPre + GS and TPre + GT are paths from root to a node in the expanded grammar tree.
- a cost table is an n+1 by m+1 array of costs, reflecting the transformation process from a source grammar fragment GS[1..n] to a target grammar fragment GT[1..m]
- bottomRight, rightCol, bottomRow operate on a cost table, returning respectively the bottom right cell, rightmost column and bottom row of the table.
- MatchAtoms (Src, Targ) is specified by the first six lines of Cost GG above.
- this shows the steps that are preferably taken in calculating the cost of transforming a source grammar fragment GS[1..n] to a target grammar fragment GT[1..m].
- In steps 51 and 52, these are recorded in a table. Each row of the table corresponds to an element of the source grammar fragment, and each column to an element of the target grammar fragment.
- Each cell within the table is specified by a row and column and represents the cost of transforming the source grammar fragment up to that row into the target grammar fragment up to that column. This is calculated in an incremental fashion, by examining the cost of transforming up to the immediately preceding row and column, finding the additional cost and minimising the total.
- the initial costs are easy to find - if there is a preceding grammar element (i.e. a non-null value for either SPre or TPre or both) then the cost is read from a previously calculated table. In the case of null values for either SPre or TPre or both, the cost is simply insertion of the target grammar elements (for row 0) or deletion of the source grammar elements (for column zero).
- The table is initialised first; the procedure is itemised in steps 1-9 of the "CreateTable" procedure discussed below.
- the cost to be stored in each remaining cell of the table can be calculated from its immediate neighbours on the left, above and diagonally above left.
- the cost of moving from the left cell to the current cell is simply the cost stored in the left cell plus the cost of inserting the target element corresponding to the current column.
- the cost of moving from the cell above to the current cell is the cost stored in the cell above plus the cost of deleting the source element corresponding to the current row.
- the cost of moving from the cell diagonally above left is the cost in that cell plus the cost of matching the source and target elements corresponding to the current row and column respectively. This may require creation of an additional table, but such a step is merely a repeat of the process described here.
- the cost stored in the current cell is then the minimum of these three candidates.
- Calculating the cost in this fashion can proceed by considering any cell in which the required three neighbours have been calculated.
- a simple approach is to start at the top left and move across each row cell by cell.
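The left/above/diagonal minimisation just described is the standard edit-distance recurrence, filled top-left to bottom-right. As an illustrative simplification (crisp symbol sets in place of fuzzy grammar elements, unit costs, no recursive sub-tables), a table fill might look like:

```python
def fragment_match_cost(string, fragment):
    """Cost of editing `string` so that each symbol matches the corresponding
    grammar element, where each element is modelled as a crisp set of
    acceptable symbols. Follows the left/above/diagonal minimisation
    described above; a simplification of the fuzzy case."""
    m, n = len(string), len(fragment)
    cost = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        cost[i][0] = i                      # column 0: delete surplus string symbols
    for j in range(n + 1):
        cost[0][j] = j                      # row 0: insert symbols for unmatched elements
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            match = 0 if string[i - 1] in fragment[j - 1] else 1
            cost[i][j] = min(cost[i][j - 1] + 1,          # from left: insert
                             cost[i - 1][j] + 1,          # from above: delete
                             cost[i - 1][j - 1] + match)  # diagonal: match/substitute
    return cost[m][n]
```

For a fragment <number> <streetName> <streetEnding> modelled as the sets {1, 2}, {High, Main}, {Street, St}, the string "2 High St" costs 0 and "High St" costs 1 (one insertion of the missing number).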
- Steps 53 to 58 of Figure 5 correspond to steps 10-33 of the "CreateTable" procedure below.
- In Step 54 it is ascertained whether both the source and target elements are atomic. If so, the cost of this cell is calculated (in Step 55) to be the minimum of the following:
- If not, the cost of this cell may be calculated (in Step 56) using a method such as that shown in Figure 4, using the following as the inputs:
- Target context taken from the cell above
- If it is ascertained in Step 57 that the bottom right cell has not yet been calculated, another cell is chosen according to Step 53 and the procedure of Steps 54 then 55 or 56 is repeated in respect of this cell. If it is ascertained in Step 57 that the bottom right cell has now been calculated, the procedure ends at Step 58 by providing as an output the cost of the grammar-grammar match, which will be the content of the bottom right cell.
- initTrgCost = bottomRow(lookupDict(SPre, TPre + GT))
- initSrcCost = rightCol(lookupDict(SPre + GS, TPre))
- leftCost = C[i, j-1] + (0, maxLen(Rs[i, j-1]), 0, null, Rt[i, j-1] + GT[j])
- the same algorithm can be used to find the membership of a specific string in a grammar fragment : given a string S of length m and a grammar rule with body length n we follow the algorithms above, using a table with n+1 columns labelled by the elements of the grammar fragment and m+1 rows, labelled by the symbols in the string.
- Figure 4 shows a schematic of the grammar-grammar matching process, given a source (GS) and target (GT) grammar fragment plus their respective contexts SPre and TPre. If either grammar fragment is null (this occurs when the end of a grammar definition is reached) then determination of the cost is straightforward (steps 42-45). If the table comparing GS to GT with contexts SPre and TPre has already been calculated (step 46) then it can be retrieved and the cost obtained from the bottom right cell (step 47). If not, a new table must be allocated, filled (as shown in Figure 5) and stored for possible re-use. The cost of this grammar-grammar match is the bottom right cell of the filled table (step 48). Finally the cost is returned as the result of this process (step 49).
- GS source
- GT target
- the set T of terminal elements is { a, b, c, d, e }
- Source grammar fragment is g4, target grammar fragment is g3.
- Step 2 (recursively) calculate a-g1 match (and cache the table for future use)
- Step 5 to complete the b-g1 cell we need to re-use the table from step 2 (bottom line) to give the top line of the b-g1 table
- the content of the bottom right cell shows that any string tagged (i.e. parsed) by g4 will also be tagged by g3.
- the overlap of g4 with g3 is 1.
- In a first scenario, illustrated by Fig.6(a), it may be determined that an attempt to parse an arbitrary document using the source grammar fragment GS instead of the target grammar fragment GT would result in all of the tagging of sequences that would happen if the arbitrary document were parsed using the target grammar fragment GT, and some further tagging of sequences. In this case it may be deemed appropriate to update the store of possible grammar fragments by replacing the target grammar fragment GT with the source grammar fragment GS in its original form.
- In a second scenario, illustrated by Fig.6(b), it may be determined that an attempt to parse an arbitrary document using the source grammar fragment GS instead of the target grammar fragment GT would result in no tagging of sequences other than the tagging that would happen if the arbitrary document were parsed using the target grammar fragment GT. In this case it may be deemed appropriate not to replace the target grammar fragment GT at all. (This will also happen, of course, where the source grammar fragment GS and the target grammar fragment GT are identical, or so similar as to result in exactly the same tagging as each other.)
- In a third scenario, illustrated by Fig.6(c), it may be determined that attempts to parse an arbitrary document using the source grammar fragment GS and the target grammar fragment GT would result in different sets of sequences being tagged, but with some overlap between the respective tagging.
- In a fourth scenario, illustrated by Fig.6(d), it may be determined that there is no overlap (or insignificant overlap) between the tagging that would happen respectively using the source grammar fragment GS and the target grammar fragment GT, in which case it may be deemed appropriate to update the store of possible grammar fragments by adding the source grammar fragment GS without removing the target grammar fragment GT.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention concerns methods, and corresponding apparatus, for analysing text in a document comprising a plurality of textual units, the method comprising the steps of: receiving the document; partitioning the text into sequences of textual units; comparing the sequences from the document with predetermined sequences from a sequence store; determining similarity measures dependent on the differences between the sequences from the document and the sequences from the sequence store, the similarity measures depending on the number of unit operations required to make the sequences from the document identical to the sequences from the sequence store; updating a results store in respect of sequences whose similarity measures indicate degrees of similarity above a predetermined threshold; and producing an output document bearing tags indicative of these similarities.
Applications Claiming Priority (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP08253189 | 2008-09-30 | ||
EP08253188.0 | 2008-09-30 | ||
EP08253188A EP2169562A1 (fr) | 2008-09-30 | 2008-09-30 | Shallow parsing based on approximate string comparison
EP08253189.8 | 2008-09-30 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2010038017A1 (fr) | 2010-04-08 |
Family
ID=41327313
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/GB2009/002328 WO2010038017A1 (fr) | 2008-09-30 | 2009-09-30 | Procédé d'analyse grammaticale partielle reposant sur l'évaluation d'appartenance de chaîne dans un fragment de grammaire floue |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2010038017A1 (fr) |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2000057291A1 (fr) * | 1999-03-24 | 2000-09-28 | Justsystem Corporation | Spelling correction method using an improved minimum edit distance algorithm |
EP1197885A2 (fr) * | 2000-10-12 | 2002-04-17 | QAS Limited | Method and apparatus for retrieving postal data from a database of postal addresses |
WO2008043582A1 (fr) * | 2006-10-13 | 2008-04-17 | International Business Machines Corporation | Systems and methods for building an electronic dictionary of multi-word compound names and for performing fuzzy searches in said dictionary |
- 2009-09-30: WO application PCT/GB2009/002328 filed (published as WO2010038017A1, fr); status: active, Application Filing
Non-Patent Citations (3)
Title |
---|
CHAN S.: "An Edit distance Approach to Shallow Semantic Labeling", IDEAL 2007, LNCS 4881, 2007, pages 57 - 66, XP002513997, Retrieved from the Internet <URL:http://www.springerlink.com/content/p69320xj77v6255u/fulltext.pdf> [retrieved on 20090203] * |
DAVIS C.A., FONSECA F.T.: "Assessing the Certainty of Locations Produced by an Address Geocoding System", GEOINFORMATICA, vol. 11, no. 1, 2007, pages 103 - 129, XP002513995, Retrieved from the Internet <URL:http://www.personal.psu.edu/faculty/f/u/fuf1/publications/Davis_Fonseca_Geoinformatica_2007.pdf> [retrieved on 20090203] * |
VILARES M., RIBADAS F., VILARES J.: "Phrase similarity through Edit distance", DEXA 2004, LNCS 3180, 2004, pages 306 - 317, XP002513996, Retrieved from the Internet <URL:http://www.springerlink.com/content/eqwh688nvh52qdfr/> [retrieved on 20090203] * |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5794177A (en) | Method and apparatus for morphological analysis and generation of natural language text | |
US9672205B2 (en) | Methods and systems related to information extraction | |
Torisawa | Exploiting Wikipedia as external knowledge for named entity recognition | |
Schwartz et al. | A simple algorithm for identifying abbreviation definitions in biomedical text | |
US7035789B2 (en) | Supervised automatic text generation based on word classes for language modeling | |
US6816830B1 (en) | Finite state data structures with paths representing paired strings of tags and tag combinations | |
US7676358B2 (en) | System and method for the recognition of organic chemical names in text documents | |
EP2169562A1 (fr) | Shallow parsing based on approximate string comparison | |
Kaur et al. | Spell checker for Punjabi language using deep neural network | |
Gholami-Dastgerdi et al. | Part of speech tagging using part of speech sequence graph | |
CN113392189B (zh) | News text processing method based on automatic word segmentation | |
Liang | Spell checkers and correctors: A unified treatment | |
Marcińczuk et al. | Statistical proper name recognition in Polish economic texts | |
Seresangtakul et al. | Thai-Isarn dialect parallel corpus construction for machine translation | |
KS et al. | Automatic error detection and correction in Malayalam | |
KR100745367B1 (ko) | Template-based method for indexing and retrieving record information, and question-answering system using the same | |
Shamsfard et al. | A Hybrid Morphology-Based POS Tagger for Persian. | |
WO2010038017A1 (fr) | Partial parsing method based on evaluating string membership in a fuzzy grammar fragment | |
Wen | Text mining using HMM and PMM | |
Novák | A model of computational morphology and its application to Uralic languages | |
CN116226362B (zh) | Word segmentation method for improving the accuracy of hospital name searches | |
JP3027553B2 (ja) | Syntactic analysis apparatus | |
CN115114447B (zh) | Method for constructing an evolution graph of technical knowledge in intelligence | |
KR100303171B1 (ko) | Morphological and syntactic analysis method using a morpheme connection graph | |
JP3043625B2 (ja) | Word classification processing method, word classification processing apparatus, and speech recognition apparatus |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 09785178; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 122 | Ep: PCT application non-entry in European phase | Ref document number: 09785178; Country of ref document: EP; Kind code of ref document: A1 |