WO2005111824A2 - Method and system for processing textual content - Google Patents

Method and system for processing textual content

Info

Publication number
WO2005111824A2
WO2005111824A2 PCT/IL2005/000521
Authority
WO
WIPO (PCT)
Prior art keywords
processing
directive
text
schema
semantic
Prior art date
Application number
PCT/IL2005/000521
Other languages
English (en)
Other versions
WO2005111824A3 (fr)
Inventor
Eyal Maor
Gideon Kaempfer
Eliezer Gur
Original Assignee
Silverkite Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Silverkite Inc. filed Critical Silverkite Inc.
Publication of WO2005111824A2 publication Critical patent/WO2005111824A2/fr
Publication of WO2005111824A3 publication Critical patent/WO2005111824A3/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/81Indexing, e.g. XML tags; Data structures therefor; Storage structures

Definitions

  • the present invention relates to systems and methods for processing of text files encoded in dialects of data representation languages.
  • Data representation languages such as SGML, HTTP headers, XML, and EDI are well known in the art, and are useful both for representing data and for exchanging data in a generic format between applications.
  • XML eXtensible Markup Language
  • HTML HyperText Markup Language
  • XML allows the definition of new elements and data structures and the exchange of such definitions between devices. In addition, it remains readable to humans and appropriate for data representation.
  • any data element in XML can be defined by a developer of a document and understood by any device that receives it.
  • XML enables better, open, and standardized business-to-business transactions.
  • XML is expected to become the dominant format for electronic data interchange.
  • Web Services are emerging to facilitate standard XML based message exchange in which different services are described.
  • XML processing, including XML queries, Web Services, and XML dialect transformation, comprises load-intensive tasks. This load-intensive nature is at least partly due to the fact that XML is readable by humans, so information is carried in a very heavy format. Failures or delays may cause major problems for any distributed, heterogeneous real-time system that must work in a highly reliable data distribution and integration framework, one that can deliver information to end clients with low latency in the presence of both node and link failures.
  • XML processing is memory and CPU intensive. Since XML is a markup language, it is by nature more resource intensive when processed on a general-purpose server using standard methods such as DOM and SAX as defined by the W3C, or other proprietary software programs. Servers can easily be overloaded, and memory consumption can reach maximum capacity when processing XML data. These methods require large amounts of memory because the XML being processed is ultimately text, not byte code as in a software compiler. CPU capacity therefore becomes a limiting factor as more processes are added in parallel.
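The streaming W3C model referenced above can be illustrated with a minimal SAX handler from the Python standard library, which fires events per element instead of materializing a DOM tree (the element names here are illustrative only):

```python
import xml.sax

class ElementCounter(xml.sax.ContentHandler):
    """Minimal SAX handler: counts start tags without building a DOM tree."""
    def __init__(self):
        super().__init__()
        self.counts = {}

    def startElement(self, name, attrs):
        self.counts[name] = self.counts.get(name, 0) + 1

handler = ElementCounter()
xml.sax.parseString(b"<employees><employee/><employee/></employees>", handler)
print(handler.counts)  # {'employees': 1, 'employee': 2}
```

Even this event-driven style still tokenizes and dispatches in general-purpose software, which is the per-document overhead the lookup-table approach below aims to remove.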
  • XML parsers are not adapted for or specifically customized in accordance with the characteristics of the file or group of files to be parsed.
  • One potential solution is to employ hand-coded, dialect-specific tools rather than generic XML tools such as DOM or SAX parsers. These dialect-specific tools are created in accordance with a specific corpus of one or more files with common features associated with the specific dialect. Unfortunately, while hand-optimization techniques can lead to more efficient tools for accelerated processing of content, the cost of creating these tools can be prohibitive.
  • a method of text file processing including providing a schema associated with a dialect of a data representation language and processing the schema to generate at least one look-up table encoding a plurality of directives for text processing in accordance with the schema.
  • the provided schema is a schema which defines the dialect of the data representation language.
  • the schema is provided electronically, and the stage of generating includes only electronic processing of the electronically provided schema.
  • the generating of one or more look-up table includes effecting a compiling of the schema.
  • the compiling includes an electronic identification of at least one production rule of a grammar of the schema. According to some embodiments, the compiling includes an electronic identification of at least one semantic directive associated with a said production rule of said grammar.
  • the compiling includes a compiling of identified semantic directives into one or more lookup tables.
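As a rough illustration of compiling a schema's production rules and their semantic directives into a look-up table, consider the following sketch. All rule names and directive strings are invented for illustration; the patent does not specify this encoding:

```python
# Hypothetical production rules for a schema fragment; element names and
# directive strings are invented for illustration.
SCHEMA_RULES = [
    {"element": "employee",
     "directives": ["classify:record", "validate:required-children"]},
    {"element": "zip",
     "directives": ["validate:range", "transform:pad-left"]},
]

def compile_schema(rules):
    """Flatten the production rules into a look-up table keyed by element
    name, so the processing stage can fetch directives in O(1) per element."""
    return {rule["element"]: tuple(rule["directives"]) for rule in rules}

LOOKUP = compile_schema(SCHEMA_RULES)
```

The point of the compilation step is that all schema interpretation happens once, up front; the per-document processing stage then reduces to table lookups.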
  • Exemplary semantic directives include but are not limited to classification semantic directives, such as a directive to semantically analyze text and classify analyzed text according to its semantic structure, and an action semantic directive whereby an action with semantic meaning is performed to processed text.
  • Exemplary semantic directives include but are not limited to validation directives and transformation directives.
  • the compiling includes a compiling of semantic meaning associated with the schema.
  • the compiling includes a compiling of semantic classification directives associated with the schema. According to some embodiments, the compiling includes a compiling of semantic analysis directives associated with the schema.
  • the compiling includes a compiling of validation directives of the schema into one or more of the lookup tables.
  • the compiling includes a compiling of transformation directives of the schema into one or more of the lookup tables.
  • files encoded in the data representation language are representable as a tree structure.
  • the data representation language is a tag delimited language. According to some embodiments, the data representation language is selected from the group consisting of XML and EDI.
  • the dialect is an XML dialect selected from the group consisting of FIX, OFX, SwiftML, SOAP, WSDL, HL7, EDI, and AccordXML
  • the schema is provided in a format selected from the group consisting of a DTD format and an XSD format.
  • the schema is a transformation schema.
  • Exemplary formats in which a transformation schema may be provided include but are not limited to XSL.
  • the generating of at least one lookup table includes at least partially implementing at least one lookup table in hardware.
  • the hardware includes at least one re-programmable logic component.
  • Appropriate logic components include but are not limited to FPGA components, ASIC components, and gated components.
  • a plurality of lookup tables are generated and the processing includes determining a first subset of the plurality of lookup tables to be implemented in software and a second subset of the plurality lookup tables to be implemented in hardware.
  • the first subset of the plurality of lookup tables includes at least one lookup table that encodes at least one directive to perform a relatively simple operation.
  • Appropriate relatively simple operations include but are not limited to type checking operations, range checking operations, and fixed increment operations.
  • a second subset of the plurality of lookup tables includes at least one lookup table that encodes at least one directive to perform a relatively complex operation.
  • Appropriate relatively complex operations include but are not limited to sorting operations, transactional operations, transactional operations requiring communication with a remote entity, and cryptographic operations.
  • At least one lookup table is a semantic lookup table encoding at least one semantic directive including but not limited to a directive to effect a semantic analysis, a directive to effect a semantic classification, and a directive to carry out a semantic action such as a transformation and a validation.
  • a directive to effect a semantic analysis includes an operation to be performed upon identification of at least one token, at least one production rule, or a combination thereof.
  • Exemplary operations to be performed upon identification of a token include but are not limited to a storing of a position of the token in an input stream, a storing of a length of the token, a storing of a prefix of the token, a calculation and storing of a numerical value corresponding to the token, a storing of a token type of the token, a discarding of the token, a counting of the token, and a storing of a pointer to a semantic rule.
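The token-identification operations just listed (storing a token's position, length, prefix, and type rather than copying its text) can be sketched as follows; the regular expression and record layout are illustrative assumptions, not the patent's actual representation:

```python
import re
from dataclasses import dataclass

@dataclass
class TokenRecord:
    position: int  # offset of the token in the input stream
    length: int    # length stored instead of copying the token text
    prefix: str    # short character prefix for quick dispatch
    ttype: str     # token type label

# Illustrative tokenizer: markup tags vs. character data.
TOKEN_RE = re.compile(r"<[^>]+>|[^<]+")

def record_tokens(stream: str):
    records = []
    for m in TOKEN_RE.finditer(stream):
        text = m.group()
        ttype = "tag" if text.startswith("<") else "text"
        records.append(TokenRecord(m.start(), len(text), text[:3], ttype))
    return records

recs = record_tokens("<name>Bob</name>")
```

Keeping only offsets and lengths lets later stages (and hardware tables) address the original stream without duplicating it in memory.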
  • Exemplary operations to be performed upon identification of a production rule include but are not limited to a storing of a position of a token associated with the production rule, a storing of a number of tokens associated with the production rule, a storing of a character prefix associated with the production rule, a calculation and a storing of a numerical value associated with a token of the production rule, a calculation and storing of an abbreviated representation of the production rule, a calculation and storing of an abbreviated representation of a hierarchy of rules including the production rule, a storing of a rule type of the production rule, a discarding of the production rule, a counting of the production rule, a storing of an index of the production rule in a rule type table, a storing of an index of at least one sub-element of the production rule, a storing of a value of a counter associated with at least one sub-element of the production rule, an inheriting of
  • Exemplary semantic rules include but are not limited to validation rules and transformation rules.
  • Exemplary operations to be performed upon identification of a combination of a token and a production rule include but are not limited to a storing of a pointer to a specific semantic rule and a counting of the production rule.
  • Exemplary semantic rules include but are not limited to validation rules and transformation rules.
  • at least one semantic lookup table is a stateless lookup table.
  • At least one semantic directive is selected from the group consisting of a validation directive and a transformation directive.
  • Exemplary validation directives include but are not limited to value comparison directives, simple value comparison directives, multiple value comparison directives, range checking directives, type checking directives, integer range checking directives, fraction range checking directives, value length directives, and value length checking directives.
  • At least one validation directive is selected from the group consisting of a path context validation and a context specific range checking.
  • at least one validation directive is a syntax directive selected from the group consisting of tracking how many elements of a certain type are permitted, effecting of a complex choice of possible parameter or attribute combinations, and enforcement of a specific element sequence constraint.
  • At least one validation directive is selected from the group consisting of a path context validation and a context specific range checking.
  • transformation directives include but are not limited to directives to mark a structure for transformation of a given type such as a type specified by a transformation identification code, directives to remove a token, directives to remove a complex structure, directives to truncate a token, directives to truncate a complex structure, directives to replace a first token with a second token, numerical conversion directives, and directives to update a given field in an output table.
  • the updating of the given field allows altering of one or more pointers by some predefined value.
  • the altering is for a namespace change in XML.
  • At least one transformation directive is conditional upon at least one validation result.
  • Appropriate range checking directives include but are not limited to directives for context specific range checking.
  • an array of at least one transformation directive encodes a transformation between the dialect and a different dialect of the data representation language.
  • an array of at least one transformation directive encodes a transformation between the data representation language and a different data representation language.
  • an array of at least one transformation directive encodes a transformation between the data representation language and a language other than a data representation language.
  • at least one semantic directive is a path validation directive.
  • a method of text processing including receiving a plurality of text files encoded in a data representation language, for at least one text file determining a dialect of the data representation language, generating from a definition of the dialect at least one look-up table encoding directives for text processing in accordance with the determined dialect, and using at least one look-up table, processing the respective text file in accordance with the determined dialect.
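The determine-dialect-then-dispatch flow described above might be prototyped roughly as follows; the root-tag-to-dialect heuristic and the directive strings are invented for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical per-dialect directive tables, keyed by element name.
DIALECT_TABLES = {
    "hr":  {"employees": ["classify:record-set"], "employee": ["validate:children"]},
    "fix": {"FIXML": ["validate:fix-fields"]},
}

def detect_dialect(text: str) -> str:
    # Illustrative heuristic: the root element name identifies the dialect
    # (the patent also mentions file name, file type, and file source).
    return "fix" if ET.fromstring(text).tag == "FIXML" else "hr"

def process(text: str):
    """Look up the dialect's directive table, then annotate each element
    with the directives that apply to it."""
    table = DIALECT_TABLES[detect_dialect(text)]
    return {el.tag: table.get(el.tag, []) for el in ET.fromstring(text).iter()}
```

This mirrors the claimed two-stage structure: a cheap first pass identifies the dialect, and the dialect-specific table then drives the real processing pass.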
  • At least a part of one said look-up table is implemented at least in part in hardware.
  • a plurality of said text files are subjected to a processing including determining a set of common or heavy operations and determining a set of uncommon or light operations; for a subset of the plurality of text files, performing the set of common or heavy operations, and performing on the subset of the text files the set of uncommon or light operations.
  • the determining is effected at least in part using a first hardware module and the processing is effected at least in part using a second hardware module.
  • the first and second hardware modules are configured to effect a pipeline processing.
  • data associated with at least one look-up table is cached, and the stage of processing includes retrieving at least some of cached data.
  • the caching includes caching at a plurality of locations within a network.
  • the definition of the dialect includes a schema, and the generating of at least one look-up table includes electronic processing of the schema.
  • processing of said schema includes effecting a compiling of the schema.
  • a hardware update of a said lookup table is performed before commencing processing of at least one text file.
  • the determining of the dialect includes identifying a string selected from the group consisting of a file name of a text file and a file type of a text file.
  • the determining of the dialect includes identifying a file source of one or more of the text files. According to some embodiments, the determining of the dialect includes parsing at least some of one or more of the text files.
  • the determining of the dialect includes effecting a first pass over a respective text file, and the processing using at least one look-up table includes effecting a second pass over the respective text file. According to some embodiments, the determining, generating and processing is performed iteratively more than once on a said text file.
  • more than one text file is subjected to processing in parallel.
  • At least one look-up table encodes a production rule of a grammar.
  • the processing of the text file includes effecting, in accordance with at least one look-up table, at least one grammatical analysis of the text file.
  • exemplary grammatical analyses include but are not limited to syntactic analyses and semantic analyses.
  • a syntactic analysis includes recording at least one value selected from the group consisting of a production rule identifier, a parsing record index, a beginning of a record in an input stream, a length of a record in an index stream, a beginning of a specified sub-element of a production rule, a length of a specified sub-element of a sub-element, a context value, a value stored earlier in a parsing process, a record prefix, a prefix of a specified token in a production rule, a value associated with a specific token in a production rule, a hash value associated with at least one token associated with a production rule, a number of combined hash values, a number of counters incremented
  • the grammatical analysis includes determining a validity of a production rule.
  • the determining of the validity of the production rule includes determining a validity of at least one parent production rule, wherein invalidity of the parent production rule is indicative of invalidity of the production rule.
  • the semantic analysis includes recording at least one value selected from the group consisting of a validation rule to be applied to a parsing record, a result of a comparison between two calculated values, and a transformation rule to be applied to a parsing record.
  • Exemplary reference nodes include but are not limited to a root node of the tree and a node that is a fixed distance from a root node of the tree.
  • the determining of the path and the storing of the abbreviated representation is carried out for at least one node element which is a descendant of the reference node.
  • the determining of the path and the storing of the abbreviated representation is carried out only for the node elements which are descendants of said reference node.
  • determining and storing is effected for all nodes having one or more predefined depths within the tree.
  • the abbreviated representation of the path is mapped to a representation of data associated with the node.
  • the method includes mapping an abbreviated representation of a path between an ancestor of the node element to a representation of data associated with the node.
  • a system for accelerating text file processing including a schema input for receiving a schema defining a dialect structure of a data representation language and look-up table generator for processing the schema to generate at least one look-up table encoding directives for text processing in accordance with the schema.
  • the look-up data generator includes a schema compiler for effecting a compiling of said schema.
  • the look-up data generator is operative to implement at least one said look up table in hardware such as re-programmable logic.
  • the implementing includes configuring re-programmable logic.
  • a system for text processing including an input for receiving at least one text file encoded in a data representation language and a text processing engine including at least one look-up table encoding a plurality of text-processing directives in accordance with a schema of the data representation language, the text processing engine for processing at least one received text file.
  • the system further comprises a dialect determiner for determining a said dialect of a said text file.
  • at least one look-up table is implemented at least in part in hardware.
  • At least one look-up table encodes at least one semantic directive such as a directive to effect a semantic analysis, a directive to effect a semantic classification, and a directive to effect a semantic operation.
  • exemplary semantic operations include but are not limited to validation operations and transformation operations.
  • the text processing engine is distributed within a computer network.
  • the presently disclosed system further comprises an exception handling module for handling exceptions generated by the processing of at least one text file.
  • the text processing engine is operative to reconfigure at least one said look-up table while effecting processing at a received text file.
  • the text processing engine includes a hardware character normalization engine
  • the processing of at least one received text file includes generating a character stream from the byte stream using the hardware character normalization engine.
  • the text processing engine includes a hardware token identification engine
  • the processing of the received text file includes identifying tokens within a character stream representative of said text file using said hardware token identification engine.
  • the text processing engine includes a hardware parsing engine
  • the processing of one or more received text files includes receiving a stream of tokens representative of the text file and producing a stream of parsing records using the hardware parsing engine.
  • the hardware parsing engine uses a look-up table encoding at least one syntactic text-processing directive of the dialect.
  • the text processing engine further includes at least one hardware semantic analysis engine.
  • the hardware semantic analysis engine is selected from the group consisting of a hardware validation engine and a hardware transformation rule engine.
  • a system for generating data useful for fast text processing including an input for receiving a text file representable as a tree having a plurality of node elements, a path determiner for determining a path within said tree between said respective node element and a reference element, a path representor for deriving an abbreviated representation of said determined path and a storage for said abbreviated representation.
  • the reference element is a root element of said tree.
  • the path representor is operative to derive a hash value as an abbreviated representation of a determined path corresponding to a node.
  • the storage is operative to store a map between a representation of a said node and an abbreviated representation of a said path corresponding to said represented node.
  • It is now disclosed for the first time a method of processing a corpus of at least one text file encoded in a data representation language. The presently disclosed method includes processing at least one text file of the corpus using directives associated with the data representation language, from results of the processing determining a schema of a dialect associated with the processed files, modifying a set of at least one lookup table encoding a plurality of directives for processing of text files of the corpus, and processing at least one text file in accordance with the modified lookup tables.
  • the presently disclosed method includes processing at least one text file of the corpus using directives associated with the data representation language, from results of the processing, determining schema data of a dialect associated with the processed files, using the determined schema data, modifying at least one lookup table encoding at least one directive for text file processing, and processing at least one text file in accordance with a modified lookup table.
  • the determining, modifying, and processing with the modified lookup tables are repeated at least once.
  • the modifying of at least one lookup table includes at least one of generating at least one lookup table and updating at least one lookup table.
  • the modifying of at least one lookup table includes updating hardware associated with the lookup table.
  • At least one lookup table encodes at least one semantic directive.
  • a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for providing a schema associated with a dialect of a data representation language and processing the schema to generate at least one look-up table encoding a plurality of directives for text processing in accordance with the schema.
  • a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for receiving a plurality of text files encoded in a data representation language, for at least one said text file, determining a dialect of the data representation language, generating from a definition of the dialect at least one look-up table encoding directives for text processing in accordance with the determined dialect, and using at least one look-up table, processing the respective text file in accordance with the determined dialect.
  • a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for providing a text file representable as a tree having a plurality of node elements, for at least one node element, determining a path within the tree between the respective node element and a selected reference element, and storing an abbreviated representation of the path.
  • a method of accelerating processing of HTTP headers includes receiving a plurality of HTTP headers; determining common patterns among said plurality of the HTTP headers and generating from the determined common patterns at least one look-up table encoding directives for text processing in accordance with the determined common patterns.
  • At least one HTTP header is processed using one generated lookup table.
  • At least one look-up table is implemented at least in part in hardware. It is now disclosed for the first time a computer readable storage medium having computer readable code embodied in said computer readable storage medium, said computer readable code comprising instructions for receiving a plurality of HTTP headers; determining common patterns among said plurality of the HTTP headers; and generating from the determined common patterns at least one look-up table encoding directives for text processing in accordance with the determined common patterns.
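The common-pattern extraction for HTTP headers might be sketched along these lines; the sample headers and the occurrence threshold are illustrative assumptions:

```python
from collections import Counter

# Hypothetical sample; in practice headers would be captured off the wire.
HEADERS = [
    "Host: example.com\r\nUser-Agent: curl\r\nAccept: */*",
    "Host: example.org\r\nUser-Agent: wget\r\nAccept: */*",
]

def common_pattern_table(headers, min_count=2):
    """Count header-name occurrences across messages and keep the names
    seen at least min_count times as the 'common pattern' look-up table."""
    counts = Counter()
    for raw in headers:
        for line in raw.split("\r\n"):
            name = line.split(":", 1)[0].strip()
            counts[name] += 1
    return {name for name, n in counts.items() if n >= min_count}
```

Header names that clear the threshold would be the candidates for fast-path (e.g. hardware) matching, with rare names falling back to generic processing.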
  • FIG. 1 provides a listing for an exemplary DTD file.
  • FIG. 2 provides a listing for an exemplary XSL file.
  • FIGS. 3A-B provide a listing for an exemplary XML file.
  • FIG. 4 provides a flow chart describing generation of lookup tables in accordance with exemplary embodiments of the present invention.
  • FIG. 5 provides an exemplary lookup table encoding semantic directives.
  • FIG. 6 provides a flow chart describing content processing in accordance with exemplary embodiments of the present invention.
  • FIG. 7 provides a flow chart describing pre-processing of data representation language content in accordance with exemplary embodiments of the present invention.
  • FIG. 8 provides a flow chart describing exemplary processing of data representation language content in accordance with some embodiments of the present invention.
  • FIG. 9 provides a flow chart describing a process wherein different aspects of a dialect associated with a data representation language are iteratively compiled and used for content processing.
  • FIG. 10 provides a flow chart describing a process wherein hash values of certain paths in a tree representation of a text file are computed and stored.
  • TABLE 1 includes information gathered for the production rule related to XML elements for the example input files of FIGS. 1-3.
  • TABLE 2 includes information gathered for the production rule related to XML attributes for the example input files of FIGS. 1-3.
  • TABLE 3 includes flags and counters generated in the processing stage for the example input files of FIGS. 1-3.
  • FIG. 11 provides a listing of an HTML file that is the product of the XML file of FIG. 3 transformed in accordance with the XSL directives provided in FIG. 2.
  • Embodiments of the present invention provide methods, apparatus and computer readable code for efficient processing of content in accordance with data representation languages.
  • Exemplary text processing in accordance with some embodiments of the present invention includes but is not limited to parsing, validation, searching, extracting data and transformation to other formats.
  • the presently disclosed methods and systems can be implemented in software, hardware or any combination thereof. Not wishing to be bound by any particular theory, it is noted that certain hardware implementations are useful for accelerating the processing of text files in accordance with presently disclosed techniques. Furthermore, it is noted that the data processing of the present invention can be applied to a variety of different types of data in various computer systems and appliances. In accordance with certain embodiments of the present invention, it has been discovered that processing of text files encoded in data representation languages can be accelerated by generating a series of lookup tables in accordance with the data representation language and/or a specific dialect of the data representation language.
  • abbreviated representations such as hash values of paths between certain nodes within the tree is useful for accelerating a subsequent process where it is necessary to quickly locate paths between nodes such as a path between a given node and a root node of the tree.
  • abbreviated representations of paths of hash values are used in search expressions, such as for example XPath and XQuery for XML files, as well as for transformation expressions, such as for example those encoded by XSL.
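A minimal sketch of the path-hashing idea follows. MD5 truncated to 8 hex digits is an arbitrary choice here; the patent does not specify a hash function or width:

```python
import hashlib
import xml.etree.ElementTree as ET

def path_hashes(xml_text: str):
    """Map each element path (root to node) to a short hash, so later
    XPath-style lookups can compare fixed-size keys instead of full strings."""
    root = ET.fromstring(xml_text)
    table = {}

    def walk(node, path):
        path = path + "/" + node.tag
        table[path] = hashlib.md5(path.encode()).hexdigest()[:8]  # abbreviated form
        for child in node:
            walk(child, path)

    walk(root, "")
    return table
```

Because every path collapses to a fixed-size key, matching a search or transformation expression against a node becomes a constant-time table probe rather than a string comparison along the tree.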
  • semantic directives include but are not limited to validation directives and transformation directives.
  • the file “emp.dtd” in Figure 1 defines that the input text files should be XML files that have a root element “employees". This element should contain one or more sub-elements, each called “employee”. Each “employee” element should have an attribute “serial” of type “ID”, and the following sub-elements (in this order): "name”, "position”, “address 1", “address2" (optional), "city”, “state”, “zip”, “phone” (optional), and “email” (optional).
  • the "name” element should have the following attributes: “age”, “sex”, “race” (optional, when absent a default value is implied), and "m_status”.
  • Each of the other elements is then defined to be a string (character data).
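The emp.dtd structure just described could be captured in look-up-table form roughly as follows. This is a sketch: the "?" and "+" markers and the dictionary layout are invented notation, not the patent's actual table encoding:

```python
# Sketch of the emp.dtd structure described above, in look-up-table form.
# "?" marks an optional item, "+" a one-or-more repetition; "race" is
# optional with an implied default, per the description.
EMP_SCHEMA = {
    "employees": {"children": ["employee+"], "attributes": {}},
    "employee": {
        "children": ["name", "position", "address1", "address2?",
                     "city", "state", "zip", "phone?", "email?"],
        "attributes": {"serial": "ID"},
    },
    "name": {
        "children": [],  # character data per the description
        "attributes": {"age": "required", "sex": "required",
                       "race": "optional-default", "m_status": "required"},
    },
}
```

A validation pass could then answer questions like "is address2 permitted here?" or "is serial present?" by table lookup instead of re-reading the DTD.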
  • the above XSL file of FIG. 2 has conversion and presentation instructions. When applied to the input XML file, the expected result of the transformation is in HTML format. The resulting HTML file depends on the content of the actual input XML file, and is presented below for the specific input XML file of FIG. 3 (see FIG. 11). It is noted that although “emp.xml” includes no explicit reference to a schema file for validation, the “emp.xml” file conforms to the “emp.dtd” file described above, and that the transformation schema “emp.xsl” may be applied to “emp.xml.”
  • the XML file explicitly references an XML style sheet file "emp.xsl” that includes transformation rules and formatting instructions. Having the "emp.xsl” file preprocessed is a prerequisite for transforming the input "emp.xml” file. It will be appreciated that no specific characteristics of the aforementioned input files are intended as a limitation of the scope of the present invention. Thus, although these files are associated with the specific data representation language XML any data representation language currently known in the art or defined in the future is appropriate for the present invention. Similarly, "Document Type Definition” is just one appropriate format for schema defining specific dialects, and any other validation schema format currently known or to be defined in the future is appropriate.
  • XML Style-sheet is merely one exemplary format for a transformation schema, and any other transformation schema format currently known or to be defined in the future, including both transformation schema associated with XML and transformation schema associated with other data representation languages, is appropriate.
  • drawings which are used to illustrate the inventive concepts, are not mutually exclusive. Rather, each one has been tailored to illustrate a specific concept discussed. In some cases, the elements or steps shown in a particular drawing co-exist with others shown in a different drawing, but only certain elements or steps are shown for clarity.
• a "data representation language" or a "markup language" such as, for example, XML or EDI is a language with a well defined syntax and semantics for providing agreed upon standards for message or data exchange and/or for data storage. This is in contrast with programming languages or computer code languages, such as, for example, C, C++, Java, C#, and Fortran, containing specific computer instructions compilable into machine executable code.
  • the data represented in the data representation language is a structured message.
  • the data representation language and/or the dialect has strict syntax, and rigid singular semantics.
  • a data representation language and/or a dialect or sub-language defines a message protocol or messaging standard.
  • transformation schema including but not limited to XSL schema defining transformation rules may be associated with a schema of a dialect of a data representation language.
  • One exemplary message protocol is the Financial Information eXchange (FIX) protocol developed specifically for the real-time electronic exchange of securities transactions.
  • One example of a data representation language that functions as a message protocol is the hypertext transport protocol (HTTP), used for conveying messages over a network between various electronic devices or appliances.
  • FIG. 4 provides a flow chart illustrating generation of one or more lookup tables according to certain exemplary embodiments of the present invention.
  • FIG. 4 illustrates a process for the generation of five different types of lookup tables, namely one or more Lexical lookup tables
• the definition of the dialect and/or of the schema associated with the dialect is expressed through at least one of:
• a set of tokens, each defined, for example, as a regular expression.
• a context free grammar defined by a set of production rules based on the above set of tokens.
• a set of semantic operations to be performed for any given token or production rule. There is no specific limitation on the format in which a schema associated with or defining a dialect of a data representation language may be expressed. Exemplary formats for schema defining a dialect include but are not limited to DTD and XSD formats. Exemplary formats for transformation schema associated with specific data representation languages and/or schemas include but are not limited to XSL. It is noted that the specific schema formats for defining a dialect and/or transformation schema mentioned herein are merely appropriate examples, and it is understood that other schema formats including those not disclosed to date are within the spirit and scope of the present invention.
  • the schema is not provided independently of text files to be subsequently processed, but is derived from processing a number of text files, and determining patterns within the text files. In some embodiments, these patterns are determined using statistical methods known in the art.
• the allowed lexical and syntactical rules including tokens and grammar rules are well defined. Additional tokens and grammar rules may be derived from the validation and transformation schemas. The tokens and the grammar are then used for generating the state-machines required for fast processing of files encoded in the data representation language. This information may be saved as DFA (Deterministic Finite Automata) state-machines, which in turn may be translated into lookup tables (or LUTs) that may be used by hardware and software for the actual processing of the data representation language content.
  • a token may be defined as a regular expression (see, for example, the definition of GNU regex library) over any standard character set (e.g. ASCII, UTF-8 or UTF-16). There is no inherent limitation on the token definitions.
• the tokens are compiled, by means of a Regular Expression Compiler such as GNU Flex or other appropriate compilers, into a deterministic state machine 106 (also termed Deterministic Finite Automata - DFA) such as a software encoded DFA.
  • This state machine is then converted into one or more memory mapped Lexical lookup table(s) 108 to be loaded into the hardware memory device used for lexical analysis.
  • this conversion process is performed by a dedicated software converter, though it will be appreciated that this is not a specific limitation.
  • Each lexical lookup table 108 is in essence a depiction of the state machine where each memory record corresponds to a state including the current output of the state machine (e.g. if a token was recognized) and the method of calculating the next state according to the next character to be read.
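As a software sketch (not the hardware implementation), such a memory-mapped lexical lookup table can be modeled as a per-state record holding the token recognized so far and the transition rule for the next character. The token set and table layout below are invented for illustration only:

```python
# Hypothetical sketch of a lexical lookup table: each state's record holds
# (token_id_if_accepting, transition map from character class to next state).
# The token set (NAME, NUMBER, LT, GT) and layout are illustrative.

def char_class(c):
    if c.isalpha():
        return "alpha"
    if c.isdigit():
        return "digit"
    return c  # literal characters act as their own class

# State 0 is the initial tokenization state.
LEXICAL_LUT = {
    0: (None, {"alpha": 1, "digit": 2, "<": 3, ">": 4}),
    1: ("NAME", {"alpha": 1}),    # accepting; loops on letters
    2: ("NUMBER", {"digit": 2}),  # accepting; loops on digits
    3: ("LT", {}),
    4: ("GT", {}),
}

def tokenize(text):
    tokens, i = [], 0
    while i < len(text):
        state, start, last_accept = 0, i, None
        j = i
        while j < len(text):
            nxt = LEXICAL_LUT[state][1].get(char_class(text[j]))
            if nxt is None:
                break
            state = nxt
            j += 1
            if LEXICAL_LUT[state][0] is not None:      # token recognized
                last_accept = (LEXICAL_LUT[state][0], j)
        if last_accept is None:
            i += 1                                      # skip unrecognized char
            continue
        token_id, end = last_accept
        tokens.append((token_id, text[start:end]))
        i = end
    return tokens

print(tokenize("<abc42>"))
# [('LT', '<'), ('NAME', 'abc'), ('NUMBER', '42'), ('GT', '>')]
```

Each outer iteration restarts at the initial tokenization state, mirroring the re-initialization after every recognized token described above.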
  • a context free grammar is defined through a set of production rules beginning from a single root variable for the specific data representation language dialect (different dialects may have different root variables).
  • the format of describing the grammar is similar to the grammar description allowed by GNU Bison or Yacc.
  • the grammar is compiled by means of a Grammar Compiler (SWGC), similar in essence to GNU Bison or Yacc, into a software encoded deterministic state machine. Traversal from one state to the other is defined by the current token received from the input (coming from a lexical analyzer), the current state and the top of a stack that accompanies the state machine during processing.
• the construction of such a state machine with stacks is well defined and published in classic computer science literature (see for instance "Compilers: Principles, Techniques and Tools" by A. Aho, R. Sethi and J. Ullman, Addison-Wesley, 1986).
  • the state machine is converted into a memory mapped Syntax Lookup Table 114 to be loaded into the hardware memory device used for parsing (syntax analysis). In some embodiments, this is performed by a dedicated software converter (SWDFA2SYLUT) though this is not a specific limitation.
  • the Syntax Lookup Table 114 is in essence a depiction of the state machine where each memory record corresponds to the state including the current output of the state machine (i.e. action to be performed, production rule completed, statement identified or syntax error) and the method of calculating the next state according to the next token to be read and the content of the top of the stack.
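The stack-accompanied traversal described above can be sketched as follows. The trivial grammar (nested elements delimited by LT/GT tokens) and the table encoding are illustrative assumptions, not the patent's actual table format:

```python
# Minimal sketch of a stack-accompanied syntax table, in the spirit of the
# Syntax Lookup Table described above: each record, indexed by the current
# token and the top of the parsing stack, gives an action and optionally a
# completed production rule.

END = "$"  # end-of-stack identifier

# (token, top_of_stack) -> (action, output)
SYNTAX_LUT = {
    ("LT", END):  ("push", None),
    ("LT", "LT"): ("push", None),
    ("GT", "LT"): ("pop",  "element"),  # production rule completed
}

def parse(tokens):
    stack, records = [END], []
    for tok in tokens:
        action, out = SYNTAX_LUT.get((tok, stack[-1]), ("error", None))
        if action == "error":
            return None                  # syntax error
        if action == "push":
            stack.append(tok)
        elif action == "pop":
            stack.pop()
            records.append(out)          # emit a parsing record
    return records if stack == [END] else None

print(parse(["LT", "LT", "GT", "GT"]))  # ['element', 'element']
print(parse(["LT", "GT", "GT"]))        # None (syntax error)
```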
  • lookup tables such as the Token Lexical Lookup Tables 108 and the Syntax Lookup Tables 114 encode operations or directives having semantic meaning.
  • Exemplary operations for tokens in the Lexical Lookup table include but are not limited to one or more operations enabled by the hardware implementation as follows:
  • token value e.g. integer value of a string of digits.
  • hash function of the token value e.g. a 64-bit value calculated as a function of the string.
  • Exemplary operations in the Syntax Lookup Table include but are not limited to one or more operations enabled by the hardware implementation as follows:
  • index e.g. a pointer to a specific validation or transformation rule
  • the processing mechanisms implementing the lexical and syntax analysis according to the defined tokens and grammar inherently validate that the processed file is well formed in the sense that it is constructed out of a set of well defined "words" (tokens) that are put together into “meaningful sentences” (grammar).
  • the tokens and grammar production rules may be augmented with a set of semantic operation identifiers to be performed when a given token or production rule is identified (followed).
• semantic operation identifiers may also be associated with combinations of production rules and specific token values.
  • a semantic lookup table encodes directives for semantic processing of a text file encoded in a data representation language.
• semantic directives include semantic classification or semantic analysis directives which are directives to determine or classify text based upon semantic characteristics, and semantic action directives for performing an action other than semantic classification in accordance with semantic structures. It is noted that one particular example of semantic classification or semantic analysis is validation, wherein text is analyzed and semantic properties of the analyzed text are determined.
  • semantic action directives are transformation directives wherein output text is generated in accordance with semantic properties of input text. It is noted that these directives allow for accelerated processing of the text file.
  • Exemplary semantic processing directives include but are not limited to transformation directives and validation directives.
  • Exemplary transformation directives include but are not limited to a directive to transform a particular document encoded in a dialect of data representation language into a document encoded in the same dialect of the same data representation language, into a document encoded in a different well-defined dialect of the same data representation language, and/or into a document encoded in a different data representation language.
  • the operations for combinations of production rules and token values are stored in a dedicated Semantic Lookup Table 120. These operations include but are not limited to: 1. Store an index (e.g. a pointer to a specific validation or transformation rule).
• VALIDATION LOOKUP TABLE(S) 126: In addition to the definition of the data representation language dialect to be processed, specific validation rules may be defined that allow the system to verify that the messages being processed meet given criteria.
• the validation rules for XML files may be written in the Document Type Definition (DTD) language (which is part of the XML definition; for a formal definition see http://www.w3.org/TR/xml11/).
• the validation rules may be written in standard XML Schema Definition (XSD) language (for a formal definition see http://www.w3.org/TR/xmlschema-0/, http://www.w3.org/TR/xmlschema-1/, http://www.w3.org/TR/xmlschema-2/, http://www.w3.org/TR/xmlschema-formal/) or in a similar type of validation language.
  • XSD allows verifying various aspects of XML files, but may be generalized to other data formats as well, such as higher level XML dialects (e.g. FIX, OFX, HL7, SwiftML, SOAP, WSDL) and other structured message formats (e.g. EDI, HTTP headers, CSV).
• validation schemas are compiled by a Validation Schema Compiler (SWVSC) into two sets of validation rules: rules to be verified by the Hardware Validation Engine (HWVE) and rules to be verified by a Software Validation Module (SWVM).
  • XSD is supported by the SWVM.
  • Type checking (e.g. integer, string).
  • Fraction range checking e.g. number of digits before and after decimal point.
  • Value length (e.g. the number of characters in the corresponding value field).
• Pattern matching (patterns will be defined as validation tokens, matched during lexical analysis and included in the appropriate grammar).
  • Syntax directive e.g. how many elements of a certain type are allowed, complex choices of possible parameter/attribute combinations, specific element sequence constraints
• Path context validation together with simple value comparisons (e.g. XPATH predicates).
• Context specific range checking.
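By way of illustration only, directives of these kinds might be applied in software roughly as follows; the rule encoding (`type`, `range`, `max_length`, `pattern`) is a hypothetical format, since the patent leaves the rule representation open:

```python
# Hedged sketch of per-field validation directives of the kinds listed above:
# type checking, range checking, value length and pattern matching.
import re

def validate(value, rules):
    """Return a list of violated rule names (empty list means valid)."""
    errors = []
    if rules.get("type") == "integer":
        if not re.fullmatch(r"-?\d+", value):
            errors.append("type")
            return errors                 # later checks need an integer
    if "range" in rules:
        lo, hi = rules["range"]
        if not lo <= int(value) <= hi:
            errors.append("range")
    if "max_length" in rules and len(value) > rules["max_length"]:
        errors.append("max_length")
    if "pattern" in rules and not re.fullmatch(rules["pattern"], value):
        errors.append("pattern")
    return errors

# Hypothetical rules for a "zip" element of the earlier employee example.
zip_rules = {"type": "integer", "max_length": 5, "range": (0, 99999)}
print(validate("90210", zip_rules))   # []
print(validate("123456", zip_rules))  # ['range', 'max_length']
```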
  • XSL and specifically XSL Transformations (XSLT) (for a formal definition see http://www.w3.org/TR/xslt), is a standard way to define required transformations of XML files, but may be generalized to other data formats as well, such as higher level XML dialects (e.g. FIX, OFX, HL7, SwiftML, SOAP, WSDL) and other structured message formats (e.g. EDI, HTTP headers, CSV).
• transformation schemas are compiled by a Transformation Schema Compiler (SWTSC) into transformation rules for the Hardware Transformation Engine (HWTRE) and the Software Transformation Module (SWTM).
  • the transformation type may be a general type (e.g. "convert string to uppercase") or specified by a Transformation Identification Code (TIC) for a specific operation the SWTM will recognize.
• the transformation rules may appear in the context of tokens as well as grammar rules. In addition, they may optionally be conditional on the validation results.
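A minimal sketch of TIC-style dispatch, assuming a hypothetical code registry (the codes 0x01/0x02 and the registry mechanism are invented for illustration):

```python
# Illustrative dispatch of transformation rules by an identification code,
# in the manner of the Transformation Identification Code (TIC) above.

TIC_REGISTRY = {}

def tic(code):
    def register(fn):
        TIC_REGISTRY[code] = fn
        return fn
    return register

@tic(0x01)
def to_uppercase(text):          # a "general type" transformation
    return text.upper()

@tic(0x02)
def strip_whitespace(text):
    return text.strip()

def transform(text, codes):
    for code in codes:           # apply the rules in sequence
        text = TIC_REGISTRY[code](text)
    return text

print(transform("  hello  ", [0x02, 0x01]))  # 'HELLO'
```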
  • lookup tables described in the previous section are useful in the framework of a general text file processing method illustrated in FIG. 6 and described herein for the first time by the present inventor.
  • actual text files are processed in accordance with the results of the preprocessing 204A.
  • the text files are subjected to a post processing 206 described below.
  • some or all required software and hardware elements for maximal processing performance are prepared prior to the reception of some or all of the data representation language content to be processed.
• these preparations are performed by software elements, using information that is known about the structure of the actual text files that are to be processed.
  • the resultant tables and state-machines encoded in hardware and/or in software can be used in the processing stage 204A. In some embodiments, this processing is performed by a combination of tightly coupled hardware and software elements.
• the text files are subjected to a post processing 206, described below, wherein information regarding the processed file(s) is accumulated and information is cached for possible future processing performance enhancements.
• the pre-processing stage is a one-time operation that is associated with configuration of the system (for example, when the input content is expected to conform to information known at configuration time).
• the pre-processing may be done when new data is encountered before the relevant information is ready for its analysis (for example, new files encoded in a data representation language arrive with indications of additional schemas to which the data representation language files should conform).
  • the new content is not necessarily parsed immediately, but waits until the appropriate information is prepared. Note that this enables support for future data formats and for schema updates.
• pre-processing 202A may be used for any incoming data that conforms to the appropriate schema or template. It is noted that the processing of 202A prepares the system for the arrival of data representation language files in such a way that they are processed with minimal effort and time. Nevertheless, some of the steps of pre-processing 202A may be performed well in advance of the arrival of the data representation language content to be processed, while others may be performed only after the arrival of some or all of one or more data representation language files to be processed. From an algorithmic perspective, the time of arrival of the data representation language file has no impact whatsoever. However, in some embodiments it is desired to reduce the pre-processing required after the arrival of the target data representation language file to a minimum in order to reduce file-handling latency.
  • the preprocessing stage prepares helpful information for optimizing the processing of the content.
• this information is based on what is known about the data, such as its validation schema and its transformation template.
  • FIG. 7 provides a flow chart of an exemplary pre-processing in accordance with a schema associated with a dialect of the data representation language. It is understood that in some embodiments, certain steps are performed without others.
  • one or more state machines are prepared 310.
• a validation schema compilation is effected 312 of schema-specific validation rules identified in accordance with the identified schema tokens and/or grammar.
  • a transformation schema compilation is effected 314 of schema-specific transformation rules identified in accordance with the identified schema tokens and/or grammar.
  • the compilation of schema to generate lookup tables 304 may be quite extensive.
• the preparations include compilation of several complex directives, including regular expressions, grammars, automata, validation schemas, transformation schemas and additional rules (e.g. rules defined by IT or business policies).
  • a large portion of the preprocessing is expected to be common to most of the files handled at a given location in a network. Therefore, the results of the above mentioned operations may be stored temporarily or permanently in a local cache. The stored information may then be retrieved at a later time instead of performing the actual processing operations repetitively.
  • the cache may be managed using any caching strategy, such as most recently used, most frequently used or other well-known techniques.
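The caching strategy discussion above can be illustrated with a least-recently-used cache for compiled pre-processing results; the key scheme (schema file name) and capacity are assumptions:

```python
# Sketch of a least-recently-used cache for compiled pre-processing results
# (lookup tables, compiled schemas). A real system might key on a schema
# digest; keying on the schema name here is an assumption.
from collections import OrderedDict

class PreprocessingCache:
    def __init__(self, capacity=128):
        self.capacity = capacity
        self._store = OrderedDict()

    def get(self, schema_key, compile_fn):
        if schema_key in self._store:
            self._store.move_to_end(schema_key)   # mark as recently used
            return self._store[schema_key]
        result = compile_fn()                     # expensive compilation
        self._store[schema_key] = result
        if len(self._store) > self.capacity:
            self._store.popitem(last=False)       # evict least recently used
        return result

calls = []
def compile_emp():
    calls.append(1)
    return "tables-for-emp"

cache = PreprocessingCache(capacity=2)
cache.get("emp.dtd", compile_emp)
cache.get("emp.dtd", compile_emp)
print(len(calls))  # 1 -- the second lookup was served from the cache
```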
• one or more hardware engines are updated 308 with the results of earlier pre-processing steps before processing of a text file. In some embodiments, this is carried out by configuring re-programmable logic components. This is only required for results that were not previously loaded into the hardware tables.
  • the integrated circuit includes memory, parsing circuitry and an interface.
• the parsing circuitry is configured to parse the building blocks that comprise the data. For example, for XML data, the element tags and attributes are the building blocks. Once these building blocks are parsed, further processing may be performed. Using a special binary format and a processing methodology, content and data processing may be significantly improved, while reducing the memory and CPU requirements, as well as the time required for the processing.
  • One feature provided by many embodiments of the present invention is the ability to implement some parts of it (mostly the processing stage 204A) in hardware components such as reprogrammable logic components while maintaining the flexibility required for supporting a broad range of dialects and schemas of the data representation language or markup language.
  • reprogrammable logic components include but are not limited to FPGA and ASIC components
  • the actual processing of files encoded in the data representation language is performed by a combination of tightly coupled hardware and software elements, and is carried out after some or all of the requisite pre-processing operations have been at least partially completed and some or all of the software and/or hardware engines have been updated accordingly.
  • FIG. 8 provides a flow chart of an exemplary processing of text files in accordance with previously generated lookup table 304. It is noted that not every step depicted in FIG. 8 is required in every embodiment, and that while some embodiments provide that the processing is carried out according to the order illustrated in FIG. 8, the specific order provided is not a limitation of the present invention.
  • the optional first step of processing prerequisites verification 502 is useful for identification of the prerequisites for processing the file and verification that these have been completed.
  • a character set normalization 504 is carried out, including a transformation of the original file character set into a single common character set.
  • a lexical analysis 506 including identification of tokens within the character stream is carried out.
• following identification of these tokens, it is possible to identify specific grammatical structures within the token stream in the context of a syntactical analysis 508.
  • This, in turn, is followed by a semantic analysis 510 which includes identification of certain semantic meanings of the parsed data.
  • the initial validation 512 includes execution of limited hardware based validation.
  • the initial transformation 514 includes execution of limited hardware based transformation.
  • the subsequent final validation 516 includes the completion of the validation process in software, while the final transformation 518 includes the completion of the transformation process in software.
  • processing prerequisites include but are not limited to:
• the source of the file. This may be defined by the Layer 2 or Layer 3 addresses, system interface IDs or other addressing information available (e.g. HTTP headers).
• the file type. This may be defined by the file name (e.g. a file name suffix) or by an identification tag in the beginning of the file (e.g. a magic number).
• the file content. In some cases, the processing prerequisites will be identified within the file itself in a well-defined manner (e.g. XML files may mention the XML schemas to be used for their own validation).
• the file may be processed twice. Initially, it will be processed as a generic file according to its type as identified by one of the first two methods. During this processing phase, only the required prerequisite information is extracted. Then the full pre-processing may be completed and only then is a second, full processing phase performed. Such a recursive procedure may take place more than once if, during the second processing phase, additional prerequisites are identified. This recursive procedure is described in the flow chart of FIG. 9. First, one or more text files are processed 204B generically using directives appropriate for the data representation language. Based on this processing, a set of previously unknown dialect-specific characteristics of previously processed text files is determined 202B.
  • appropriate hardware and/or look-up tables are updated 210.
  • the text files are once more reprocessed 204B using the updated hardware and/or look-up tables.
• additional dialect specific characteristics are detected 202, and thus the stages of updating hardware and/or look-up tables 210 and processing text files using updated hardware and/or look-up tables 204 are repeated.
• the Hardware Character Normalization Engine (HWCNE) is a simple engine that transforms a byte stream into a character stream.
  • the characters are identified according to the appropriate expected character set definition.
  • a character may be defined by a single byte or by a small number of consecutive bytes.
  • each character is transformed into a single 16-bit word constituting its UTF-16 representation.
• for UTF-16 encoded files no transformation is required and the HWCNE operation becomes transparent.
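In software terms, the HWCNE behavior described above amounts to the following sketch (the function name and the use of Python's codec machinery are illustrative; a hardware engine would operate on the byte stream directly):

```python
# Rough software model of the HWCNE step: bytes arriving in some expected
# character set are normalized to a stream of 16-bit UTF-16 code units.
# Multi-byte character sets (e.g. UTF-8) may consume several consecutive
# bytes per character; characters outside the BMP yield two units.

def normalize_to_utf16_units(data: bytes, charset: str = "utf-8"):
    text = data.decode(charset)                # bytes -> characters
    units = []
    for ch in text:
        encoded = ch.encode("utf-16-be")       # each char -> 1 or 2 units
        for k in range(0, len(encoded), 2):
            units.append(int.from_bytes(encoded[k:k+2], "big"))
    return units

print([hex(u) for u in normalize_to_utf16_units("é!".encode("utf-8"))])
# ['0xe9', '0x21']
```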
  • the output of the HWCNE includes at least one of:
• the Hardware Token Identification Engine (HWTIE) receives a stream of characters and transforms it into a stream of tokens and accompanying semantic information.
• the HWTIE uses the Lexical lookup table 108 in order to identify the tokens. It is initialized at an initial tokenization state before the first character and after each token is identified. For each character and current state, the HWTIE calculates the memory address to be accessed in the Lexical lookup table 108 and reads the corresponding record. This record may contain information regarding the tokenization result (e.g. if a token was recognized) and the method of calculating the next state according to the next character to be read.
  • the HWTIE calculates at least one of the following values during the tokenization process:
  • the token identifier (as defined by the TKLUT when the token is identified).
  • a prefix of the token e.g. the first N characters of the string representing the token.
  • the hash of the token value (e.g. an N-bit value representing the token value).
  • the hash function is computed on the normalized character representation (and not on the original bytes stream representing the character).
  • the numeric value of the token (e.g. the integer represented by the token string). This may only be applicable for a subset of the tokens.
• HWTIE may process one character every clock cycle, which in some embodiments is beyond 1 Gigabit per second of throughput (dependent on the character encoding) while requiring in some embodiments two external memory devices at the most (and possibly a combination of off chip memory device and one on chip memory block).
• The number of states that can be supported by the Lexical lookup table 108, if implemented in standard SDRAM technology, may reach tens of thousands of states and beyond. This is expected to be more than sufficient for any typical structured file format and certainly more than required for most common programming languages, including complex languages such as C or Java.
• the Hardware Parsing Engine (HWPE) receives a stream of tokens and produces a stream of parsing records.
  • a parsing record may be produced for every input token or for a stream of input tokens. In some cases, more than one parsing record may be produced per token (e.g. in the case that a single token has a complex meaning that may be expressed as multiple tokens or if a token requires a specific transformation action but would otherwise not be represented by a parsing record by itself).
  • the HWPE uses the Syntax Lookup Table 114 and an auxiliary Parsing Stack (PSTK) in order to identify the production rules of a given grammar used to produce the token stream.
  • the HWPE is initialized at an initial parsing state before the first token of the file is received.
• the PSTK is initialized to contain an end of stack identifier. For each token, current state and head of stack content, the HWPE calculates the address in the Syntax Lookup Table SYLUT 114 to be accessed and reads the corresponding record. This record may contain information regarding the parsing results (e.g. if a production rule has been identified or a syntax error has occurred), the action to be taken, and the method of calculating the next state according to the next token to be read and the content of the top of the stack.
  • tokens may be accumulated in a FIFO memory prior to being processed by the HWPE to allow for variable processing delays.
  • the HWPE may store several temporary values as part of the general context or within the PSTK. These values are required for various calculations performed by the HWPE.
• the HWPE may calculate at least one of the following values to be included in the parsing records as output:
  • Production rule identifier e.g. the type of statement being parsed.
• Parsing record index (which may be allocated before the record is sent to the output and hence may be lower than the index of records sent out earlier). May be based on one of several stored counter values (e.g. to enable separate indexing for different rule types).
  • Beginning of record in the input stream e.g. the beginning of a token within a production rule - not necessarily the first token.
• Length of the record in the input stream (e.g. the sum of lengths of a specified subset of tokens [or grammar variables] constituting a production rule).
  • Context value(s) e.g. a value stored earlier in the parsing process, such as a namespace identifier for XML files.
  • prefix e.g. the prefix of a specified token in the production rule
  • Value e.g. the value of a specified token included in the production rule.
• a number of hash values (the hash value of one or more of the tokens constituting the rule).
• a number of combined hash values are calculated as the combined hash value of the record with one of a number of hash values stored in the record of the parent production rule (residing on the top of the PSTK). If the hash values of the parent production rule are invalid, the result of the combination will be invalid. 12. A number of counters incremented based on the counter values of previously processed production rules (i.e. rule children).
• a validation rule to be applied to the parsing record.
• the result of a comparison between two specified calculated values (e.g. two hash values).
  • HWPE may, in some embodiments, process up to one token every clock cycle.
  • a typical token of some embodiments may constitute between one and tens of characters or beyond.
• this may support beyond 4 Gigabits per second of throughput (dependent on the character encoding) while requiring two external memory devices at the most (and possibly a combination of off chip memory device and one on chip memory block).
  • the throughput of the HWPE may be much higher than that of the HWTIE. This allows traversing multiple production rules without reading additional tokens and without exhausting a possible FIFO memory preceding the HWPE.
• The number of states that can be supported by the Syntax Lookup Table 114, if implemented in standard SDRAM technology, may reach tens of thousands of states and beyond. This is expected to be more than sufficient for any typical structured file format and certainly more than required for most common programming languages, including complex languages such as C or Java. This ensures that many grammars may be stored together in the Syntax Lookup Table 114 and activated according to the processing requirements of any given input file.
  • each row represents a single parsing record for the "emp.xml" input example that was introduced earlier.
  • these tables are generated as a single stream of parsing records with separate running record indexes for XML elements and XML attributes, effectively partitioning the records into two tables.
• these tables are presented separately as tables 1A-1B and 2A-2B. In general an arbitrary number of tables may be generated in this fashion. Note also that additional records may be generated internally but discarded later on. It is stressed that the fields and elements in the tables provided are merely provided as an illustrative example, and it will be appreciated that tables that include extra fields or rows or that lack certain fields or rows are still within the scope of the invention.
  • Each instance in the element table includes information for an element in the XML file.
  • This information includes a unique identifier for the element, length of the element, length of the value of the element, hash values (detailed below), index to the first attribute (i.e. its position in the attributes table below), index to the parent element, index to last child of the element, index to the previous sibling, and sibling index.
• the HWPE outputs the records one at a time, adding indices to mark the type of record (for example, whether it is an element or an attribute) and the parsing record index (PRI, the record index in the table). Note that when using random access memory (RAM), the records may be put directly in order even when the output is not in the same order, by using the PRI.
  • RAM random access memory
  • the hash values are for the tag of the element, the path from the root element to the element, the path from the parent element to the element, the path from the grandparent to the element and the path from the great-grandparent to the element (obviously, this can be further optimized to save more [or fewer] paths so that the compromise between space and performance is best for the applications at hand).
  • the following pointers to the XML file, i.e. the offset from the start of the file to the first character of the relevant object:
  • start of the element tag; start of the value of the element.
  • Table 1 includes the information gathered for the production rule related to XML elements.
  • PRI is the parsing record index attached to the element
  • type is the type of production rule that generated the record (this table includes all records that were generated by the "element” production rule)
  • V type is the type of the value of the element
  • V(8) is the 8 byte prefix of the value of the element
  • V(N) is the numeric value of the element (where a value may be computed)
  • 1st is the index of the first attribute of the element in the attributes table
  • # is the number of attributes the element has
  • P is the index of the processing record of the parent element (which is also within this table)
  • C# is the number of the children elements of the element
  • LC is the last child element of the element
  • PS is the previous sibling element of the element
  • S# is the sibling index for the element within the parent element
  • h(V) is the hash of the value of the element
  • h(T) is the hash of the tag of the element
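To make the table layout above concrete, here is a minimal software sketch of how such element records might be built for a small XML input. This is an illustration only: the field names follow the abbreviations above, and the 32-bit hash is a stand-in for the (unspecified) hardware hash function.

```python
import hashlib
import xml.etree.ElementTree as ET
from dataclasses import dataclass

def h(s: str) -> int:
    """Toy 32-bit hash standing in for the hardware hash function."""
    return int.from_bytes(hashlib.sha1(s.encode()).digest()[:4], "big")

@dataclass
class ElementRecord:
    pri: int               # PRI: parsing record index
    tag: str
    h_tag: int             # h(T): hash of the element tag
    h_value: int           # h(V): hash of the element value
    parent: int            # P: record index of the parent element (-1 for the root)
    sibling_idx: int       # S#: sibling index within the parent element
    num_children: int = 0  # C#: number of child elements
    last_child: int = -1   # LC: record index of the last child

def build_element_table(xml_text: str) -> list:
    records = []

    def walk(elem, parent_idx):
        pri = len(records)
        sib = records[parent_idx].num_children if parent_idx >= 0 else 0
        records.append(ElementRecord(
            pri=pri, tag=elem.tag,
            h_tag=h(elem.tag), h_value=h((elem.text or "").strip()),
            parent=parent_idx, sibling_idx=sib,
        ))
        if parent_idx >= 0:
            # update the parent's child count and last-child index
            records[parent_idx].num_children += 1
            records[parent_idx].last_child = pri
        for child in elem:
            walk(child, pri)

    walk(ET.fromstring(xml_text), -1)
    return records

table = build_element_table("<emp><name>Dana</name><dept>Sales</dept></emp>")
# table[0] is the root <emp>; its two children point back to it via `parent`
```

A real implementation would of course emit these records as a hardware stream rather than via a recursive walk; the point here is only the shape of the per-element information.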
  • Each instance in the attribute table includes information for an attribute in the XML file. This information includes a unique identifier for the attribute, an index to its element, index to the next attribute (i.e. its position in the attributes table below), length of the attribute, length of the value of the attribute, and hash values (detailed below).
  • the hash values are for the attribute, the path from the root element to the element and to the attribute, the path from the parent element to the element and to the attribute, the path from the grandparent to the element and to the attribute and the path from the great-grandparent to the element and to the attribute (obviously, this can be further optimized to save more [or fewer] paths so that the compromise between space and performance is best for the applications at hand). Additional hash values may be calculated in order to accelerate the processing of specific dialects of data representation languages.
  • at least one of the following pointers to the XML file, i.e. the offset from the start of the file to the first character of the relevant object:
  • start of the attribute; start of the value of the attribute.
  • VR is the validation rule to be applied for the attribute
  • FSME is the list of fields that are relevant for semantic analysis.
  • HWSME Semantic Analysis Engine
  • the HWSME may be a stateless engine (in contrast to the HWTIE and HWPE) and may perform the semantic analysis of the parsing records one record at a time regardless of the history of records previously processed by it. Using a subset of the record fields (such as the type and the hash value of the record, as indicated by the HWPE or HWTIE) the HWSME reads an entry from the Semantic Lookup Table 120. This entry specifies which validation and transformation rules may need to be applied to the parsing record. These rules will be applied later by the hardware and software validation and transformation engines (see below).
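As a rough software analogue of this stateless step, the HWSME lookup might be sketched as a keyed table access: each record is augmented independently of all others. The key fields (record type and tag hash) follow the description above; the table contents below are invented for illustration.

```python
# Hypothetical semantic lookup table: (record type, tag hash) -> rule identifiers
# to be applied by the downstream validation and transformation engines.
SEMANTIC_LOOKUP = {
    ("element", 0x1A2B): {"validation_rules": [3], "transformation_rules": [7]},
}

def semantic_analysis(record: dict) -> dict:
    """Augment one parsing record, independently of any previously processed record."""
    entry = SEMANTIC_LOOKUP.get((record["type"], record["h_tag"]), {})
    return {**record, **entry}

rec = semantic_analysis({"type": "element", "h_tag": 0x1A2B, "pri": 0})
# rec now carries the rule identifiers alongside the original record fields
```

Because the function depends only on its single input record, a stream of records can trivially be processed in parallel, which is exactly the property the text attributes to the HWSME.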
  • when hash values are used, two approaches may be used.
  • the first approach assumes that the probability of a hash collision is negligible (for example, 10^-48) and thus the hash value identifies the compared object uniquely.
  • a second approach is to verify that the value of the compared object is identical to the one expected when the hash value is identical to the expected hash value.
  • the latter may be slower in implementation, as it requires value comparison (for example, strings) in each case (though the hash value used may be shorter in this case), but it is accurate in all cases.
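The two approaches can be contrasted in a short sketch (hypothetical; SHA-1 stands in for whichever 160-bit hash function an implementation might choose):

```python
import hashlib

def h160(s: str) -> bytes:
    """160-bit hash, as in the first approach (collision probability negligible)."""
    return hashlib.sha1(s.encode()).digest()

# Approach 1: trust the hash alone -- fast, with a tiny risk of an
# undetected collision going unnoticed.
def match_fast(candidate: str, expected_hash: bytes) -> bool:
    return h160(candidate) == expected_hash

# Approach 2: on a hash match, also verify the full value -- slower
# (a string comparison per match) but accurate in all cases.
def match_exact(candidate: str, expected: str, expected_hash: bytes) -> bool:
    return h160(candidate) == expected_hash and candidate == expected

expected = "employee"
eh = h160(expected)
assert match_fast("employee", eh)
assert match_exact("employee", expected, eh)
assert not match_fast("department", eh)
```

Note that approach 2 permits a shorter hash, since the final string comparison catches any synonym; approach 1 must make the hash long enough that synonyms are statistically implausible.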
  • the output of the HWSME includes the original parsing record received from the HWPE with possible additional information including but not limited to:
  • a set of transformation rule identifiers to be applied to the record. The HWSME may also store statistics related to certain semantic rules, such as the number of instances of a certain record type with a given value.
  • one or two memory access cycles may need to be completed.
  • multiple record streams may be analyzed in parallel.
  • a single record stream may be broken into multiple streams (since the HWSME may be stateless, this decomposition is relatively simple).
  • the HWSME may analyze up to one parsing record every clock cycle.
  • the HWSME throughput is limited by the HWPE throughput and does not require any input buffering. Only one or two external memory devices are required by the HWSME (and possibly a combination of one off chip memory device and one on chip memory block).
  • Semantic Lookup Table 120: The number of semantic rules that can be supported by the Semantic Lookup Table 120, if implemented in standard SDRAM technology, may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format.
  • a stream of semantically augmented parsing records is formed on which validation and transformation rules may be applied.
  • the HWVE performs the initial step of validation resulting in a validated stream of parsing records. Typically, for every parsing record received one validated parsing record is produced.
  • the HWVE uses the Validation Lookup Table 126 to deduce which fields of the parsing records require validation and which validation actions need to be performed on them.
  • the HWVE may be a stateless engine (in contrast to the HWTIE and HWPE) and may perform the validation actions of the parsing records one record at a time regardless of the history of records previously processed by it. Using the validation rule identifiers supplied by the previous processing steps, the HWVE reads the validation parameters (such as range boundaries and validation functions) from the Validation Lookup Table 126. For more information on the supported validation actions see section "Validation schema compilation" above.
  • the output of the HWVE includes the original parsing record received from the HWSME with optional additional information including at least one of:
  • a validation error indicator (e.g. range exception, type mismatch).
  • multiple record streams may be validated in parallel.
  • a single record stream may be broken into multiple streams (since the HWVE is stateless, this decomposition is relatively simple).
  • the HWVE may validate up to one parsing record every clock cycle.
  • the HWVE throughput is limited by the HWSME throughput and does not require any input buffering. Only one or two external memory devices are required by the HWVE (and possibly a combination of one off chip memory device and one on chip memory block).
  • the number of validation records that can be supported by the Validation Lookup Table 126, if implemented in standard SDRAM technology may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format.
  • the records described earlier in Tables 1-2 are scanned in order to look for the relevant fields (like name and position) so that the validation information is available; the values are then tested and verified against the rules defined by the pre-processing stage.
  • the "sex" attribute is verified to be either "male” or "female”.
  • the HWVE sets appropriate flags, which will later on trigger alerts and notifications for the control application. Note that the example file “emp.xml” conforms to the sample validation file “emp.dtd”, so its validation is successful.
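A software analogue of this check might look as follows; the rule table here is a hand-written stand-in for the enumeration that the pre-processing stage would compile from "emp.dtd".

```python
# Validation rules as compiled in the pre-processing stage:
# attribute name -> set of allowed values (an enumeration constraint).
VALIDATION_RULES = {
    "sex": {"male", "female"},
}

def validate_attribute(name: str, value: str, flags: list) -> bool:
    """Check one attribute record; on failure, set a flag that can later
    trigger alerts and notifications for the control application."""
    allowed = VALIDATION_RULES.get(name)
    if allowed is None:
        return True  # no rule defined for this attribute
    if value in allowed:
        return True
    flags.append(f"validation error: attribute '{name}' has illegal value '{value}'")
    return False

flags = []
assert validate_attribute("sex", "male", flags)       # conforms, no flag raised
assert not validate_attribute("sex", "unknown", flags)
assert len(flags) == 1
```

Since "emp.xml" conforms to "emp.dtd", a run over that file would leave the flag list empty, matching the successful validation described above.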
  • Table 3 shows the flags that were generated during the processing stage as described above. Note that the flags may also refer to information searched within the data description language file. For example, it is possible to denote if the term "confidential" is found within the data description language file (or even in specific location, for example in the root element of an XML file).
  • Table 3 includes the results of the processing and the list of errors that were encountered. Table 3 holds variables that may be used by the application to tell if the XML is well formed, passes validation etc. Note that this is only a sample of the information that may be saved for the processed XML.
  • i) INITIAL TRANSFORMATION 506: Following the initial validation, a stream of validated parsing records is produced on which initial transformation rules may be applied.
  • the HWTRE performs the initial step of transformation resulting in a partially transformed stream of parsing records. Typically, for every parsing record received one transformed or untouched parsing record is produced.
  • the HWTRE uses the Transformation Lookup Table 128 to deduce which fields of the parsing records require transformation and which transformation actions need to be performed on them.
  • the HWTRE may be a stateless engine (similar to the HWVE) and may perform the transformation actions of the parsing records one record at a time regardless of the history of records previously processed by it. Using the transformation rule identifiers supplied by the previous processing steps, the HWTRE reads the transformation parameters (such as transformation functions) from the transformation lookup table TRLUT 128. For more information on the supported transformation actions see section "TRANSFORMATION SCHEMA COMPILATION TO GENERATE ONE OR MORE TRANSFORMATION LOOKUP TABLE(S) 128" above.
  • the output of the HWTRE includes the original parsing record received from the HWVE with possible additional information including at least one of: 1. An indication whether the parsing record requires transformation. 2. The required transformation type.
  • one or two memory access cycles may need to be completed.
  • multiple record streams may be transformed in parallel.
  • a single record stream may be broken into multiple streams (since the HWTRE is stateless, this decomposition is relatively simple).
  • the HWTRE in some embodiments may transform up to one parsing record every clock cycle.
  • the HWTRE throughput is limited by the HWVE throughput and does not require any input buffering. Only one or two external memory devices are required by the HWTRE (and possibly a combination of one off chip memory device and one on chip memory block).
  • Transformation Lookup Table 128: The number of transformation records that can be supported by the Transformation Lookup Table 128, if implemented in standard SDRAM technology, may reach many tens of thousands and beyond. This is expected to be more than sufficient for any typical structured file format.
  • the transformed and validated parsing records produced by the HWTRE are sent back to the software processing modules for completion of the process.
  • a transformation instructions table is created.
  • This table includes a list of operations that are required in order to generate the required HTML output.
  • the processing that remains is generating the actual output by following these instructions, which essentially specify an offset and a length for copying strings from the input file, as well as strings prepared in the pre-processing stage.
  • a typical implementation may have the relevant strings (that is, the ones that may be required for the output) from the pre-processing stage stored in a smaller file and then reference this file.
  • the records described earlier in Tables 2-3 are scanned in order to look for the relevant fields (like name and position) in order to have the transformation information available. For example, the "name" element is recognized and then appropriate entries are prepared for building the appropriate output.
  • the result of the processing for the "emp.xml” file is in Table 4 (for simplicity, the offset value in this example refers to strings in the original "emp.xsl” file).
  • Each line in the table has an instruction which builds an additional part of the required output file.
  • the references to the "emp.xsl” file are actually constant strings known at the preprocessing stage, while the references to the "emp.xml” file are actually pointers to strings within the input file.
  • "TRI" is the transformation record index
  • "OP" is the operation required
  • the first software module to receive the transformed and validated parsing records is the SWVM.
  • the SWVM is responsible for completing all the required validation directives that the hardware engines did not perform (either because they were not capable of performing them or because they were not instructed to perform them).
  • the SWVM uses indications received from the hardware engines pointing at the records that require further validation.
  • the actual validation is performed by executing the required validation directives as defined in the pre-processing phase. This is done using standard software validation techniques.
  • One important result of the final validation is a decision whether the input data representation language file is valid or not. In case the file fails validation, the output records will in this example also indicate which validation rules were violated.
  • the SWTM is responsible for completing all the required transformation directives that the hardware engines did not perform (either because they were not capable of performing them or because they were not instructed to perform them).
  • the SWTM uses indications received from the hardware engines pointing at the records that require further transformation.
  • the actual transformation is performed by executing the required transformation directives as defined in the pre-processing phase. This is done using standard software transformation techniques.
  • exceptions may be the result of malformed data representation language files or implementation restrictions.
  • the file may be partially processed (e.g. if a simple lexical error is found) or it may be totally incomprehensible.
  • processing is delegated to standard software tools for functional completeness at the price of reduced performance.
  • tables may be created "on the fly”. Note also that creating the tables requires only a simple stack that depends on the depth of the XML structure and not on the amount of the information in the XML file.
  • This implementation is specifically optimized for hardware accelerations, as the sequential access to the file and low memory requirements allow for one-path processing that would generate the required tables and allow further information to be easily deduced from the tables.
  • Another optimization could be elimination of tables that are not required for a specific application. For example, in cases where the application is interested only in elements but not in the attributes, the list of attributes may be removed.
  • tables may be unified into a single table in case this simplifies the implementation, in case for application reasons it would result in better performance, or for any other reason.
  • a feature provided by certain embodiments of the present invention is the ability to use hashes and compare hashes instead of complete strings.
  • Abbreviated representations such as hash values are calculated for one or more data items of the data representation language (e.g. XML elements, attributes, values) and the abbreviated representation (e.g. hash value) rather than the full representation is used in comparisons. Therefore it can efficiently be decided whether the data conforms to a schema or whether a transformation should occur for some part of the data.
  • Hash values are simple and fast to process (by hardware as well as by software) and thus it is possible to perform very fast and efficient processing of data representation language content.
  • the hash values may be used for search expressions (such as those used by XPath and XQuery for XML files) as well as for transformation expressions (such as those used by XSL for XML files) and actions.
  • hash values are not unique, and thus it is possible to have two different expressions that have the same hash value (this is called a "hash collision", and such expressions are called "hash synonyms").
  • the algorithm described may take different approaches to address this issue: The first would be to use a hash function that generates large values (for example, 160 bit values) and assume that the probability of a collision is negligible (for example, 10^-48). This approach results in very efficient processing with a risk of a collision going undetected. The other approach would be to verify the expression (for example, by comparing the strings) when the hash value for it signals that it is a synonym of the expected expression. Obviously, the latter approach requires more processing and thus would yield slower throughput for the data processing.
  • One particular usage of abbreviated representations or hash values, particular to markup languages or data representation languages providing tree representations, is described in FIG. 10.
  • a tree representation of part or all of the text file is provided or derived 402.
  • paths are calculated between a selected reference node 404 and one or more selected target nodes 406.
  • a representation of the path, e.g. a hash value, is stored.
  • this process of deriving a path 410 and storing the representation 412 is also carried out for additional nodes with fixed relation to a selected target node, such as an ancestor of the target node, such as a parent or grandparent of the target node.
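As an illustration of the FIG. 10 process, the stored path representations for one target node might be computed as follows (the path-to-string encoding and the 64-bit truncated hash are assumptions of this sketch, not the patent's exact scheme):

```python
import hashlib

def path_hash(tags: list) -> int:
    """Hash of a path expressed as a sequence of tag names (toy encoding)."""
    return int.from_bytes(hashlib.sha1("/".join(tags).encode()).digest()[:8], "big")

# Ancestor chain from the root down to a hypothetical target node:
# /emp/contact/address/city
chain = ["emp", "contact", "address", "city"]

stored = {
    "root_path":        path_hash(chain),       # path from the root to the node
    "parent_path":      path_hash(chain[-2:]),  # path from the parent to the node
    "grandparent_path": path_hash(chain[-3:]),  # path from the grandparent to the node
}

# A relative query such as the XPath "address/city" now matches the target
# node with a single hash comparison instead of a string-by-string path walk.
assert stored["parent_path"] == path_hash(["address", "city"])
```

Storing several ancestor-relative hashes per node is exactly the space/performance trade-off mentioned earlier: more stored paths means more queries answerable by one comparison.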
  • each of the verbs, "comprise” “include” and “have”, and conjugates thereof, are used to indicate that the object or objects of the verb are not necessarily a complete listing of members, components, elements or parts of the subject or subjects of the verb.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Document Processing Apparatus (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Methods, apparatus and computer-readable code are provided for processing text files in accordance with data representation languages and specific dialects of data representation languages. In some embodiments, one or more lookup tables are derived from processing or compiling a schema of a data representation language dialect. Exemplary lookup tables include, among others, lexical lookup tables, syntax lookup tables, semantic lookup tables, transformation lookup tables and validation lookup tables. One or more lookup tables are optionally implemented at least in part in hardware using, for example, reprogrammable logic components. In some embodiments, hash values or other abbreviated representations of specific data items are used to accelerate the processing of the content. In one particular embodiment, hash values representing paths between specific nodes of a tree representing a text file are stored.
PCT/IL2005/000521 2004-05-19 2005-05-19 Procede et systeme pour traiter un contenu textuel WO2005111824A2 (fr)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US57212304P 2004-05-19 2004-05-19
US60/572,123 2004-05-19

Publications (2)

Publication Number Publication Date
WO2005111824A2 true WO2005111824A2 (fr) 2005-11-24
WO2005111824A3 WO2005111824A3 (fr) 2007-03-08

Family

ID=35394798

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IL2005/000521 WO2005111824A2 (fr) 2004-05-19 2005-05-19 Procede et systeme pour traiter un contenu textuel

Country Status (1)

Country Link
WO (1) WO2005111824A2 (fr)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6466971B1 (en) * 1998-05-07 2002-10-15 Samsung Electronics Co., Ltd. Method and system for device to device command and control in a network
US20040205549A1 (en) * 2001-06-28 2004-10-14 Philips Electronics North America Corp. Method and system for transforming an xml document to at least one xml document structured according to a subset of a set of xml grammar rules
US20050014494A1 (en) * 2001-11-23 2005-01-20 Research In Motion Limited System and method for processing extensible markup language (XML) documents
US20050240392A1 (en) * 2004-04-23 2005-10-27 Munro W B Jr Method and system to display and search in a language independent manner


Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105659274B (zh) * 2013-10-01 2020-04-14 艾尼克斯股份有限公司 用于在可重构平台中解码数据流的方法和设备
CN105659274A (zh) * 2013-10-01 2016-06-08 艾尼克斯股份有限公司 用于在可重构平台中解码数据流的方法和设备
EP2858323A1 (fr) * 2013-10-01 2015-04-08 Enyx SA Procédé et dispositif permettant de décoder des flux de données dans des plates-formes reconfigurables
AU2014331143B2 (en) * 2013-10-01 2019-02-21 Enyx Sa A method and a device for decoding data streams in reconfigurable platforms
WO2015049305A1 (fr) * 2013-10-01 2015-04-09 Enyx Sa Procédé et dispositif pour décoder des flux de données dans des plates-formes reconfigurables
US10229426B2 (en) 2013-10-01 2019-03-12 Enyx Sa Method and a device for decoding data streams in reconfigurable platforms
CN109871529B (zh) * 2017-12-04 2023-10-31 三星电子株式会社 语言处理方法和设备
CN109871529A (zh) * 2017-12-04 2019-06-11 三星电子株式会社 语言处理方法和设备
CN109614593A (zh) * 2018-11-09 2019-04-12 深圳市鼎阳科技有限公司 人机交互设备及其多语种实现方法、装置及存储介质
CN109614593B (zh) * 2018-11-09 2023-06-30 深圳市鼎阳科技股份有限公司 人机交互设备及其多语种实现方法、装置及存储介质
CN109961495A (zh) * 2019-04-11 2019-07-02 深圳迪乐普智能科技有限公司 一种vr编辑器的实现方法及vr编辑器
CN111045661B (zh) * 2019-12-04 2023-07-04 鼎蓝惠民信息技术(西安)有限公司 基于语义和特征码的XML Schema生成方法
CN111045661A (zh) * 2019-12-04 2020-04-21 西安鼎蓝通信技术有限公司 基于语义和特征码的XML Schema生成方法
CN111444254A (zh) * 2020-03-30 2020-07-24 北京东方金信科技有限公司 一种skl系统文件格式转换方法和系统
CN113949438A (zh) * 2021-09-24 2022-01-18 成都飞机工业(集团)有限责任公司 一种无人机通讯方法、装置、设备及存储介质
CN116821271A (zh) * 2023-08-30 2023-09-29 安徽商信政通信息技术股份有限公司 一种基于音形码的地址识别和规范化方法及系统
CN116821271B (zh) * 2023-08-30 2023-11-24 安徽商信政通信息技术股份有限公司 一种基于音形码的地址识别和规范化方法及系统

Also Published As

Publication number Publication date
WO2005111824A3 (fr) 2007-03-08

Similar Documents

Publication Publication Date Title
US7458022B2 (en) Hardware/software partition for high performance structured data transformation
US7437666B2 (en) Expression grouping and evaluation
US7555709B2 (en) Method and apparatus for stream based markup language post-processing
US8250062B2 (en) Optimized streaming evaluation of XML queries
Green et al. Processing XML streams with deterministic automata and stream indexes
US7328403B2 (en) Device for structured data transformation
US7590644B2 (en) Method and apparatus of streaming data transformation using code generator and translator
US7287217B2 (en) Method and apparatus for processing markup language information
KR101093271B1 (ko) 데이터 센터에서 사용하기 위해 데이터 포맷을 변환하기위한 시스템
US7146352B2 (en) Query optimizer system and method
Barbosa et al. Efficient incremental validation of XML documents
US20060212859A1 (en) System and method for generating XML-based language parser and writer
US7853936B2 (en) Compilation of nested regular expressions
WO2005111824A2 (fr) Procede et systeme pour traiter un contenu textuel
Chiu et al. A compiler-based approach to schema-specific XML parsing
Dai et al. A 1 cycle-per-byte XML parsing accelerator
Borsotti et al. General parsing with regular expression matching
Møller Document Structure Description 2.0
US20090177765A1 (en) Systems and Methods of Packet Object Database Management
US20080092037A1 (en) Validation of XML content in a streaming fashion
Gao et al. A high performance schema-specific xml parser
Bettentrupp et al. A Prototype for Translating XSLT into XQuery.
Michel Representation of XML Schema Components
WO2011091472A1 (fr) Traitement de requête

Legal Events

Date Code Title Description
AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BR BW BY BZ CA CH CN CO CR CU CZ DE DK DM DZ EC EE EG ES FI GB GD GE GH GM HR HU ID IL IN IS JP KE KG KM KP KR KZ LC LK LR LS LT LU LV MA MD MG MK MN MW MX MZ NA NG NI NO NZ OM PG PH PL PT RO RU SC SD SE SG SK SL SM SY TJ TM TN TR TT TZ UA UG US UZ VC VN YU ZA ZM ZW

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SL SZ TZ UG ZM ZW AM AZ BY KG KZ MD RU TJ TM AT BE BG CH CY CZ DE DK EE ES FI FR GB GR HU IE IS IT LT LU MC NL PL PT RO SE SI SK TR BF BJ CF CG CI CM GA GN GQ GW ML MR NE SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase in:

Ref country code: DE

WWW Wipo information: withdrawn in national office

Country of ref document: DE

122 Ep: pct application non-entry in european phase
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 69(1) EPC (EPOFORM 1205A DATED 17.07.07)

122 Ep: pct application non-entry in european phase

Ref document number: 05741999

Country of ref document: EP

Kind code of ref document: A2

WWW Wipo information: withdrawn in national office

Ref document number: 5741999

Country of ref document: EP