CN102708155B - JSAX (joint simple API (application program interface) for XML (extensible markup language)) parser and parsing method based on syntactic analysis of backtracking automaton - Google Patents

Publication number
CN102708155B
Authority
CN
China
Prior art keywords
state
mark
stack
automata
traceable
Prior art date
Legal status
Active
Application number
CN201210118808.0A
Other languages
Chinese (zh)
Other versions
CN102708155A (en
Inventor
段振华
张柯柯
王小兵
田聪
Current Assignee
Xidian University
Original Assignee
Xidian University
Application filed by Xidian University
Priority to CN201210118808.0A
Publication of CN102708155A
Application granted
Publication of CN102708155B

Landscapes

  • Document Processing Apparatus (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a JSAX parser (an XML parser built on SAX, the Simple API for XML) and a parsing method based on syntax analysis with a backtracking automaton. The transition rule δ of the backtracking automaton is redefined and the improved automaton is applied to the syntax analyzer, which simplifies the design and implementation of the syntax analyzer and effectively improves the efficiency of the XML parser. During syntax analysis, the backtracking automaton takes the token stream supplied by the lexical analyzer as input. When the token read is a start tag, the automaton pushes the current state onto the top of the stack; when the token read is an end tag, the automaton pops a state from the top of the stack and uses it as its next state; for all other tokens the automaton performs no stack operation. During syntax analysis, the information from XML documents that conform to the syntax specification is returned to the user through standard callback functions. The JSAX parser and parsing method address the complex structure and low performance of the syntax analyzers in XML document parsers, are easy to implement and efficient, and can be applied to the parsing of XML documents.

Description

JSAX parser and parsing method based on backtracking-automaton syntax analysis
Technical field
The invention belongs to the field of Web technology. It relates mainly to techniques for parsing documents written in the Extensible Markup Language (XML), and in particular to XML document parsing based on the Simple API for XML (SAX). Specifically, it is a JSAX parser and parsing method based on backtracking-automaton syntax analysis, applicable to the parsing of XML documents.
Background art
In recent years XML, being simple to apply and flexible to use, has been widely adopted for data transmission and exchange, data integration, document storage and similar purposes in the Web environment. The most typical example is Web services, in which both the SOAP protocol and WSDL are based on XML. XML is also widely applied in fields such as mathematics, chemistry and physics; in chemistry, for example, the Chemical Markup Language (CML) is used to describe molecular information. XML document parsers play a critical role in exchanging and processing general-purpose data in current Web applications. As XML-based applications become more widespread, industry and scientific research place ever higher demands on parser performance, and a high-performance XML parser is essential for improving XML parsing speed and system throughput.
To meet different needs, parsing interface standards such as the Document Object Model (DOM) and SAX have appeared in succession.
The SAX interface is an event-based parsing API (application programming interface). A SAX parser uses an event-based model: while parsing an XML document it triggers a series of events for the user to handle. Common event types are: startDocument, the start of the document; endDocument, the end of the document; startElement, the start of an element; endElement, the end of an element; characters, a text-content event; and ignoreWhitespace, a whitespace event. An event that has been handled can be removed from memory and the resources it occupied released. Because of its performance advantage and ease of use, SAX is widely used by developers and users.
When a SAX parser parses an XML document, it must perform lexical analysis and syntax analysis on the document. The most common model for lexical analysis is the finite automaton (FA). According to the XML specification, the tokens that make up XML can be described by a regular grammar, and an FA can recognize the tokens so described. Because FAs are easy to construct and efficient, they are widely used in the design of lexical analyzers. For syntax analysis there are several options. Van Engelen used recursive-descent parsers to analyze XML documents, but a recursive implementation must maintain the system stack, which consumes considerable space, and recursion also incurs a large number of function calls and thus extra time overhead, so recursive-descent parsing is not efficient. Another common tool for syntax analysis is the pushdown automaton (PDA). A PDA has greater recognition power than an FA but is more complicated to construct: at every step of the analysis the move must be determined from the current state, the current input and the contents of the pushdown stack, and the stack and the remaining input are then changed to reach the next configuration, which makes the analysis inefficient.
It is therefore imperative to improve the efficiency of the syntax analyzer in an XML parser. A high-performance XML parser considerably speeds up the parsing of XML documents and effectively improves system response time and throughput.
A search by the applicant of domestic and foreign patent documents and published journal articles found no report or document closely related to or identical with the present invention.
Summary of the invention
The present invention mainly addresses the low efficiency and difficulty of implementation of the syntax analyzer in XML parsers. By improving the backtracking automaton, it provides a new, efficient and easily implemented JSAX parser and parsing method, based on backtracking-automaton syntax analysis, that can recognize languages of matched tag strings with nested structure. The invention is applicable to the parsing of XML documents.
The present invention is described in detail below.
The present invention is an XML parser and parsing method based on the SAX interface. It does not currently support parsing XML documents that use namespaces. It is a standard XML parser.
The present invention is a JSAX parser based on backtracking-automaton syntax analysis, comprising a lexical analyzer, a syntax analyzer and an event handler. The lexical analyzer reads the content of the XML document and outputs the tokens it reads to the syntax analyzer. The syntax analyzer recognizes the language structures in the input token stream according to the XML specification and passes the corresponding event information to the event handler. The event handler receives all events reported by the parser and processes the data found, thereby parsing the XML document. The syntax analyzer is built on an automaton. The backtracking automaton is a five-tuple M = (S, Σ, δ, q0, F), together with a state stack that stores part of the run history. The parser is characterized in that the syntax analyzer is implemented with a backtracking automaton, and that this automaton is an improved backtracking automaton whose transition rule δ is redefined. The redefinition is a rule-based definition comprising:
(1) If δ(q, a) = p: in state q, on reading token a, the current state q is pushed onto the top of the stack and control moves to state p, where a denotes a token that requires a push;
(2) If δ(q, b) = trace: in state q, on reading token b, and provided the state stack is not empty, the state p on top of the state stack is popped and control moves to state p, where b denotes a token that requires backtracking;
(3) If δ(q, c) = p: in state q, on reading token c, no stack operation is performed, where c denotes a token that requires no stack operation;
(4) For δ(q, d), where d is the blank symbol (the blank symbol does not belong to the input alphabet and marks the end of the string): the automaton halts and accepts the input string if q ∈ F, and rejects it if q ∉ F;
(5) If δ(q, e) is undefined, the automaton halts and rejects the input string.
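The five rules above can be illustrated with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the function name `run_backtracking_automaton`, the dictionary encoding of δ, the `push_tokens` set and the toy alphabet {a, b, c} are all hypothetical choices made for this example.

```python
TRACE = "trace"  # hypothetical marker for rule (2): pop the state stack


def run_backtracking_automaton(delta, push_tokens, start, finals, tokens):
    """Return True if the token sequence is accepted.

    delta maps (state, token) to a next state or to TRACE.
    push_tokens is the set of tokens that trigger rule (1).
    """
    state = start
    stack = []
    for tok in tokens:
        target = delta.get((state, tok))
        if target is None:
            return False            # rule (5): undefined transition, reject
        if target == TRACE:
            if not stack:
                return False        # rule (2) requires a non-empty stack
            state = stack.pop()     # rule (2): resume the saved state
        elif tok in push_tokens:
            stack.append(state)     # rule (1): save the current state
            state = target
        else:
            state = target          # rule (3): no stack operation
    return state in finals          # rule (4): accept only in a final state


# Toy language: "a" opens a level, "b" closes it, "c" is plain content.
toy_delta = {
    ("q0", "a"): "q1",
    ("q1", "a"): "q1",
    ("q1", "c"): "q1",
    ("q1", "b"): TRACE,
}
accepted = run_backtracking_automaton(toy_delta, {"a"}, "q0", {"q0"}, list("aacbb"))
```

In this toy automaton the only way back to the final state q0 is to pop it from the stack, so a string is accepted exactly when every "a" has been closed by a matching "b".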
The syntax analyzer of the present invention is implemented with a backtracking automaton. The backtracking automaton is improved, and the improved automaton is applied to the syntax analyzer of the XML parser, which simplifies the design and implementation of the syntax analyzer and effectively improves the efficiency of the JSAX parser.
The invention is further realized in that it gives the grammar form equivalent to the improved backtracking automaton:
A→aβ
where a ∈ T (a is a terminal symbol) and β ∈ N^0 ∪ N^1 ∪ N^2 (β is a string of zero, one or two nonterminal symbols from N). When β contains two nonterminals, the production has the form A → aCA; that is, the second nonterminal on the right-hand side must be the same as the nonterminal on the left-hand side, where A and C are nonterminal symbols.
The descriptive power of these grammars is greater than that of regular grammars (RG) but less than that of context-free grammars (CFG); they form a subset of the CFGs lying between RG and CFG.
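The constraint on productions of the form A → aβ can be checked mechanically. The following Python sketch is only an illustration; the function `is_valid_production` and the example symbol sets are hypothetical and do not come from the patent.

```python
def is_valid_production(lhs, rhs, terminals, nonterminals):
    """Check that lhs -> rhs has the form A -> a beta described above.

    rhs is a list of symbols. It must start with one terminal, followed by
    zero, one or two nonterminals; with two nonterminals, the second must
    equal the left-hand side (the shape A -> a C A).
    """
    if lhs not in nonterminals or not rhs:
        return False
    head, tail = rhs[0], rhs[1:]
    if head not in terminals:
        return False
    if len(tail) > 2 or any(sym not in nonterminals for sym in tail):
        return False
    if len(tail) == 2 and tail[1] != lhs:
        return False
    return True


# Hypothetical symbol sets for the example.
T = {"STag", "ETag", "CharData"}
N = {"A", "B", "C"}
ok = is_valid_production("A", ["STag", "B", "A"], T, N)   # second nonterminal equals A
bad = is_valid_production("A", ["STag", "B", "C"], T, N)  # second nonterminal differs
```

The restriction on the second nonterminal is what separates this grammar class from general context-free grammars: after the nested part C has been derived, derivation always resumes with the same nonterminal A.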
The present invention not only redefines the transition function of the backtracking automaton but also gives the grammars equivalent to the improved automaton; these grammars can describe languages of matched tag strings with nested structure.
The invention is further realized in that the XML syntax definition is described with grammars equivalent to the improved backtracking automaton, yielding syntax rules that describe XML documents. The improved backtracking automaton is constructed from these rules; it recognizes the language structures in the token stream of the XML document, determines whether they conform to the syntax specification, and thereby completes the syntax analysis.
The invention is further realized in that the rules describing the XML syntax definition are built with grammars equivalent to the improved backtracking automaton, that is, with productions of the form "A → aβ". These syntax rules comprise:
document::=prolog element Misc*
element::=EmptyElemTag|A
A::=STag B A
Content_item::=CharData|Reference|CDSect|PI|Comment|EmptyElemTag
B::=Content_item B
B::=STag B B
B::=ETag
A::=Miscs
Miscs::=ε|Misc Miscs
Here document denotes the XML document; prolog describes the declaration information and the document type declaration doctypedecl; element denotes an element appearing in the XML document and describes a nested, hierarchically structured string of matched tags; STag denotes a start tag; CharData denotes character data; Reference denotes a reference; CDSect denotes a CDATA section; PI denotes a processing instruction; Comment denotes a comment; EmptyElemTag denotes an empty-element tag; Misc denotes whitespace, processing instructions and comments in the XML document. Start tags and end tags appearing in an element must be correctly nested and matched. The redefined syntax rules describing the XML syntax definition can be expressed in the grammar equivalent to the backtracking automaton and belong to a subset of the context-free grammars.
The invention is further realized in that, according to the syntax rules describing the XML syntax definition, an improved backtracking automaton is constructed that reads the language structures in the token stream output by the lexical analyzer for the XML document and completes the syntax analysis. The improved backtracking automaton TA so constructed is:
M = (S, Σ, δ, q0, F), where:
M is the constructed backtracking automaton.
State set S: {S0, S1, S2, S3, trace}, where S0 is the initial parsing state; S1 is the state after XMLDecl has been parsed; S2 is the state reached after doctypedecl has been parsed; S3 is the state in which content is parsed after the STag of the root element has been parsed; and trace indicates that an ETag has been parsed and backtracking is required.
Input symbol set Σ:
{XMLDecl,Misc,doctypedecl,EmptyElemTag,STag,Reference,CDSect,CharData,PI,CDSection,Comment,ETag};
Initial state q0: S0;
Final state set F: {S1, S2};
State stack: {S1, S2, S3};
Transition function δ: S × Σ → S ∪ {trace}, which is the set of the following transitions:
(1) δ(S0, XMLDecl) = S1: in the initial state, if the token read is the XML declaration XMLDecl, move to state S1;
(2) δ(S1, Misc) = S1: in state S1, if the token read is Misc (whitespace, a comment or a processing instruction), keep reading Misc tokens in a loop;
(3) δ(S1, STag) = S3: in state S1, if the token read is a start tag STag, push the current state S1 onto the state stack, push the start-tag name onto the name stack, and move to state S3;
(4) δ(S1, doctypedecl) = S2: in state S1, if the token read is doctypedecl, move to state S2;
(5) δ(S2, Misc) = S2: in state S2, if the token read is Misc (whitespace, a comment or a processing instruction), keep parsing Misc tokens in a loop; the state stack does not change;
(6) δ(S2, STag) = S3: in state S2, if the token read is a start tag STag, push the start-tag name onto the name stack, push the current state S2 onto the state stack, and move to state S3;
(7) δ(S3, Content_item) = S3: keep reading tokens that require no push; the state stack does not change;
(8) δ(S3, STag) = S3: in state S3, if the token read is a start tag STag, push the current state S3 onto the state stack and push the start-tag name onto the name stack;
(9) δ(S3, ETag) = trace: in state S3, if the token read is an end tag ETag, pop the state p on top of the state stack and move to state p; at the same time pop the tag name on top of the name stack and compare it with the name of the ETag. If the names are identical, the tags match correctly; otherwise an error is reported.
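The nine transitions above can be written as a small table-driven recognizer. The Python sketch below is an illustration only: the token representation as `(kind, name)` pairs, the names `DELTA`, `FINALS` and `parse`, and the use of `ValueError` for error reporting are assumptions made for this example. The sketch follows the listed transitions literally and adds no checks beyond them.

```python
TRACE = "trace"
CONTENT_ITEMS = {"CharData", "Reference", "CDSect", "PI", "Comment", "EmptyElemTag"}

# Transitions (1) to (9); transition (7) is expanded over every Content_item kind.
DELTA = {
    ("S0", "XMLDecl"): "S1",
    ("S1", "Misc"): "S1",
    ("S1", "STag"): "S3",
    ("S1", "doctypedecl"): "S2",
    ("S2", "Misc"): "S2",
    ("S2", "STag"): "S3",
    ("S3", "STag"): "S3",
    ("S3", "ETag"): TRACE,
}
DELTA.update({("S3", kind): "S3" for kind in CONTENT_ITEMS})
FINALS = {"S1", "S2"}


def parse(tokens):
    """Run the automaton over (kind, name) tokens; name is used for tags only."""
    state, state_stack, name_stack = "S0", [], []
    for kind, name in tokens:
        target = DELTA.get((state, kind))
        if target is None:
            raise ValueError(f"unexpected {kind} in state {state}")
        if target == TRACE:
            if not state_stack:
                raise ValueError("end tag without an open start tag")
            opened = name_stack.pop()
            if opened != name:
                raise ValueError(f"end tag {name} does not match start tag {opened}")
            state = state_stack.pop()   # backtrack to the saved state
        elif kind == "STag":
            state_stack.append(state)   # save the state to return to
            name_stack.append(name)     # remember the tag name for matching
            state = target
        else:
            state = target              # no stack operation
    return state in FINALS


well_formed = parse([
    ("XMLDecl", None), ("Misc", None),
    ("STag", "a"), ("CharData", None),
    ("STag", "b"), ("ETag", "b"),
    ("ETag", "a"), ("Misc", None),
])
```

When the end tag of the root element is read, the state popped from the stack is S1 or S2, both of which are final, so the trailing Misc tokens are consumed by transitions (2) or (5).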
The present invention redefines the transition function of the backtracking automaton, gives the grammars equivalent to it, uses these grammars to re-describe the XML syntax and give syntax rules, and constructs the improved backtracking automaton from those rules. This automaton reads the language structures in the token stream and performs syntax analysis efficiently.
The present invention is also a JSAX parsing method based on backtracking-automaton syntax analysis. Using the JSAX parser described above, XML documents are parsed in the Eclipse environment. The parsing steps are:
Step 1. The lexical analyzer first reads the XMLDecl token at the start of the XML document and parses it to determine whether it conforms to the XMLDecl specification. If it does, parsing continues; if not, an error is reported immediately;
Step 2. After a conforming XMLDecl, determine whether the next tokens are Miscs; if so, parse them in a loop;
Step 3. After the Miscs have been parsed, determine whether the next token is doctypedecl. If so, go to step 4; otherwise go to step 5;
Step 4. Parse doctypedecl. After it has been parsed, determine whether the next tokens are Miscs; if so, parse them in a loop; otherwise go to step 6;
Step 5. After doctypedecl has been handled, determine whether the next token is an empty-element tag. If it is not, go to step 6; if it is, parse the empty-element tag and go to step 10;
Step 6. Determine whether the next token is a start tag. If so, parse the start tag, push the current state onto the state stack and push the start-tag name onto the name stack; otherwise report an error;
Step 7. If the next token is one that requires no push, such as CharData, CDSection, Comment, Reference, PI, EmptyElemTag or S, parse that token and continue with the next step. If the next token is a start tag STag, go to step 6;
Step 8. Determine whether the next token is an end tag. If so, parse the end tag, pop the top state of the state stack as the next state of the automaton, and at the same time pop the top of the name stack and check whether that name is identical to the name of the current end tag. If it is, go on to the next step; otherwise the start tag and the end tag do not match, and an error is reported;
Step 9. After an end tag has been parsed, check whether the state stack is empty. If it is, go to step 10; otherwise go to step 7;
Step 10. Determine whether the document has further tokens. If not, the end of the XML document has been reached and parsing ends. If so, determine whether the next token is Miscs; if it is not, report an error; if it is, parse the Miscs tokens until the end of the XML document is reached, which completes the parsing of the whole document.
Compared with prior art, the present invention has the following advantages:
(1) The present invention improves the backtracking automaton so that it can recognize languages of matched tag strings with nested structure, such as XML. The invention also gives the rules of the improved backtracking automaton and the grammars equivalent to it.
(2) Because the improved backtracking automaton is applied to the syntax analyzer of the JSAX parser, the design and implementation of the syntax analyzer are considerably simpler than those of a syntax analyzer built on a pushdown automaton.
(3) Because the improved backtracking automaton is used for syntax analysis, efficiency is significantly higher than that of syntax analyzers built on recursive-descent subroutines or on pushdown automata.
(4) The JSAX parser provided by the invention meets the requirements of the SAX interface specification, so users can conveniently parse XML documents through the SAX interface.
Brief description of the drawings
Fig. 1 is the state transition diagram of the backtracking automaton corresponding to the syntax analyzer of the JSAX parser of the invention;
Fig. 2 is a schematic diagram of the architecture of the JSAX parser of the invention;
Fig. 3 is a simple XML document;
Fig. 4 is the state transition diagram with which the lexical analyzer of the JSAX parser of the invention recognizes the tokens in XML;
Fig. 5 is a flowchart of the parsing method of the JSAX parser based on backtracking-automaton syntax analysis;
Fig. 6 is a schematic diagram of the result of parsing the XML document of Fig. 3 with the invention;
Fig. 7 is a performance comparison chart of the JSAX parser of the invention and the Xerces parser.
Detailed description of the embodiments
The present invention is an XML document parser and parsing method based on the SAX interface. It belongs to the field of Web technology and relates mainly to XML document parsing based on the SAX interface. XML, the Extensible Markup Language, is a general-purpose data interchange language for computers and the Internet. As computers and the Internet come into wide use in industry and daily life, applications of XML reach into every field and XML plays an increasingly important role. Being simple to apply and flexible to use, XML is widely adopted for data transmission and exchange, data integration, document storage and similar purposes in the Web environment. The most typical example is Web services, in which both the SOAP protocol and WSDL are based on XML. XML is used not only throughout computer networking but also in mechanical engineering, physics, chemistry, mathematics and other fields, and its use on the Internet is growing rapidly. XML parsers play a critical role in exchanging and processing general-purpose data in current Web applications. The present invention is such an XML parser, applied to the parsing of XML documents.
The JSAX parser of the invention uses an event-based model: while parsing an XML document it triggers a series of events. Common event types are: startDocument, the document-start event; endDocument, the document-end event; startElement, the element-start event; endElement, the element-end event; characters, the text-content event; and ignoreWhitespace, the whitespace event. When the parser finds a given token, it reports an event to the event handler. The event handler invokes a callback method and informs it that the given tag has been found; through this method the application can access the content of that tag. An event that has been handled can be removed from memory and the resources it occupied released, so a SAX parser uses very few system resources.
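The event-and-callback model described above can be seen in any SAX implementation. The short Python sketch below uses the standard-library `xml.sax` module, not the JSAX parser of this patent, purely to show how a handler receives the events; the class name `RecordingHandler` and the sample document are made up for the example.

```python
import xml.sax


class RecordingHandler(xml.sax.ContentHandler):
    """Collect the SAX events reported by the parser, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # A parser may deliver one text run in several chunks.
        self.events.append(("characters", content))


handler = RecordingHandler()
xml.sax.parseString(b"<a><b>hi</b></a>", handler)

# Separate the structural events from the text chunks.
structure = [e for e in handler.events if e[0] != "characters"]
text = "".join(e[1] for e in handler.events if e[0] == "characters")
```

The handler never sees the whole document at once: each callback receives only the item just recognized, which is why the memory footprint of this model stays small.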
When a SAX parser parses an XML document, it must perform lexical analysis and syntax analysis on the document. A syntax analyzer implemented by recursive descent is difficult to construct and consumes considerable space. The pushdown automaton, the most common tool for syntax analysis, is powerful but complicated to construct, and its analysis efficiency is also low.
To address this problem, the present invention applies the improved backtracking automaton to the syntax analyzer of the XML parser, which simplifies the design and implementation of the syntax analyzer and thus effectively improves the efficiency of the JSAX parser.
The invention is described below with reference to the accompanying drawings.
Embodiment 1
The present invention is a JSAX parser based on backtracking-automaton syntax analysis. Referring to Fig. 2, it comprises a lexical analyzer, a syntax analyzer and an event handler. The lexical analyzer reads the content of the XML document and outputs the tokens it reads to the syntax analyzer. The syntax analyzer recognizes the language structures in the input token stream according to the XML specification and passes the corresponding event information to the event handler. The event handler receives all events reported by the parser and processes the data found, thereby parsing the XML document. The syntax analyzer is built on an automaton. The backtracking automaton is a five-tuple M = (S, Σ, δ, q0, F), together with a state stack that stores part of the run history. The syntax analyzer of the invention is implemented with this backtracking automaton.
Lexical analyzer: the lexical analyzer reads the content of the XML document, delivers the characters or character strings it reads to the syntax analysis stage, determines whether the tags and tokens that make up the XML document conform to the XML specification, and supplies the tokens as input to the backtracking automaton used for syntax analysis. Because the tokens in XML are described by a regular grammar, the invention performs the lexical analysis of XML documents by constructing a finite automaton.
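As a rough illustration of this lexical stage, the following Python sketch classifies the pieces of an XML text into token kinds. It substitutes regular expressions for the hand-built finite automaton of Fig. 4, and the patterns are deliberately simplified (for example, attribute values containing ">" and DOCTYPE declarations with an internal subset are not handled). The names `TOKEN_PATTERNS` and `tokenize` are hypothetical.

```python
import re

# Order matters: more specific patterns come before more general ones.
TOKEN_PATTERNS = [
    ("XMLDecl", re.compile(r"<\?xml\s.*?\?>", re.S)),
    ("PI", re.compile(r"<\?.*?\?>", re.S)),
    ("Comment", re.compile(r"<!--.*?-->", re.S)),
    ("CDSect", re.compile(r"<!\[CDATA\[.*?\]\]>", re.S)),
    ("doctypedecl", re.compile(r"<!DOCTYPE[^>]*>")),
    ("ETag", re.compile(r"</([^\s>]+)\s*>")),
    ("EmptyElemTag", re.compile(r"<([^\s/>]+)[^>]*/>")),
    ("STag", re.compile(r"<([^\s/>]+)[^>]*>")),
    ("Reference", re.compile(r"&[^;\s]+;")),
    ("CharData", re.compile(r"[^<&]+")),
]


def tokenize(text):
    """Return a list of (kind, name) pairs; name is set for tags only."""
    pos, tokens = 0, []
    while pos < len(text):
        for kind, pattern in TOKEN_PATTERNS:
            match = pattern.match(text, pos)
            if match:
                name = match.group(1) if pattern.groups else None
                tokens.append((kind, name))
                pos = match.end()
                break
        else:
            raise ValueError(f"cannot tokenize input at offset {pos}")
    return tokens


sample_tokens = tokenize('<?xml version="1.0"?><a><b/>hi &amp;</a>')
```

Regular expressions and finite automata recognize the same class of languages, so this sketch stays within the regular-grammar description of tokens that the text relies on.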
Syntax analyzer: the syntax analyzer recognizes the language structures in the token stream supplied by the lexical analyzer according to the syntax rules, and passes the corresponding event information to the event handler. For XML documents that are not well-formed, the JSAX parser reports the errors in the document. To carry out the syntax analysis of XML documents, the invention improves the backtracking automaton so that it can recognize languages of matched tag strings with a nested hierarchical structure, such as XML. During syntax analysis, when the token read by the improved backtracking automaton is a start tag, the current state is pushed onto the top of the state stack. When the token read is an end tag and the state stack is not empty, the top state p is popped and control moves to state p; if the stack is empty, an error is reported. When the token read is any other token, no stack operation is performed. To match tags, a name stack is also needed to store start-tag names: when the token read is a start tag, its name is pushed onto the name stack; when the token read is an end tag, the top of the name stack is popped and compared with the name of the end tag, and an error is reported if the names differ.
Event handler: the event handler receives all events reported by the parser, processes the data found, and returns the document information to the user for processing. Specifically, during syntax analysis the information from XML documents that conform to the syntax specification is returned to the user through standard callback functions.
The backtracking automaton of the present invention is an improved backtracking automaton whose transition rule δ is redefined. The redefinition is a rule-based definition comprising:
(1) If δ(q, a) = p: in state q, on reading token a, the current state q is pushed onto the top of the stack and control moves to state p, where a denotes a token that requires a push. In the present invention, when the token read in state q is a start tag STag, the current state q is pushed, i.e. δ(q, STag) = p.
(2) If δ(q, b) = trace: in state q, on reading token b, and provided the state stack is not empty, the state p on top of the state stack is popped and control moves to state p, where b denotes a token that requires backtracking. In the present invention, when the token read in state q is an end tag ETag, the automaton backtracks: it pops the top state p and moves to state p, i.e. δ(q, ETag) = trace.
(3) If δ(q, c) = p: in state q, on reading token c, no stack operation is performed, where c denotes a token that requires no stack operation. In the present invention, when the token read in state q is CDSect, PI, EmptyElemTag, Reference, Comment or CharData, the state does not change and no stack operation is performed, i.e. δ(q, Token) = q.
(4) If d is the blank symbol (the blank symbol does not belong to the input alphabet and marks the end of the string), the automaton halts and accepts the input string if q ∈ F (q is a final state), and rejects it if q ∉ F. In the present invention, if the token read in state q is the blank token, the automaton halts and accepts the string when q is a final state; that is, for δ(q, ε) with q ∈ F, it halts and accepts the input string.
(5) If δ(q, e) is undefined, the automaton halts and rejects the input string.
The grammar form equivalent to the improved backtracking automaton is:
A→aβ
where a ∈ T (a is a terminal symbol) and β ∈ N^0 ∪ N^1 ∪ N^2 (β is a string of zero, one or two nonterminal symbols from N). When β contains two nonterminals, the production has the form A → aCA; that is, the second nonterminal on the right-hand side must be the same as the nonterminal on the left-hand side, where A and C are nonterminal symbols.
The descriptive power of these grammars is greater than that of regular grammars (RG) but less than that of context-free grammars (CFG); they form a subset of the CFGs lying between RG and CFG.
The XML syntax definition is described with grammars equivalent to the improved backtracking automaton, yielding syntax rules that describe XML documents. The improved backtracking automaton is constructed from these rules; it recognizes the language structures in the token stream of the XML document, determines whether they conform to the syntax specification, and thereby completes the syntax analysis.
The rules describing the XML syntax definition are built with grammars equivalent to the improved backtracking automaton, that is, with productions of the form "A → aβ". Specifically, they comprise:
document::=prolog element Misc*
element::=EmptyElemTag|A
A::=STag B A
Content_item::=CharData|Reference|CDSect|PI|Comment|EmptyElemTag
B::=Content_item B
B::=STag B B
B::=ETag
A::=Miscs
Miscs::=ε|Misc Miscs
Here document denotes the XML document; prolog describes the declaration information and the document type declaration doctypedecl; element describes a nested, hierarchically structured string of matched tags, and the tags appearing in an element must be correctly nested and matched; STag denotes a start tag; CharData denotes character data; Reference denotes a reference; CDSect denotes a CDATA section; PI denotes a processing instruction; Comment denotes a comment; EmptyElemTag denotes an empty-element tag; Misc* denotes whitespace, processing instructions and comments in the XML document; B is a nonterminal that can be replaced by the end tag ETag or by STag B B; A is a nonterminal that can be replaced by Miscs or by STag B A.
According to the rules describing the XML syntax definition, the improved backtracking automaton is constructed. It reads the language structures in the token stream output by the lexical analyzer for the XML document and completes the syntax analysis. Referring to Fig. 1, the improved backtracking automaton TA so constructed is:
M = (S, Σ, δ, q0, F), where:
M is the constructed backtracking automaton;
State set S: {S0, S1, S2, S3, trace}, where S0 is the initial parsing state; S1 is the state after XMLDecl has been parsed; S2 is the state reached after doctypedecl has been parsed; S3 is the state in which content is parsed after the STag of the root element has been parsed; and trace indicates that an ETag has been parsed and backtracking is required.
Input symbol set Σ:
{XMLDecl,Misc,doctypedecl,EmptyElemTag,STag,Reference,CDSect,CharData,PI,CDSection,Comment,ETag};
Initial state q0: S0;
Final state set F: {S1, S2};
State stack: {S1, S2, S3, Z} (where Z denotes the bottom of the stack);
Transition function δ: S × Σ → S ∪ {trace}. Referring to Fig. 1, the transition function is the set of the following transitions:
(1) δ(S0, XMLDecl) = S1: in the initial state, the token read is the XML declaration XMLDecl, and the automaton moves to state S1;
(2) δ(S1, Misc) = S1: in state S1, the token read is Misc (white space, a comment, or a processing instruction), and the automaton loops, reading Misc tokens;
(3) δ(S1, STag) = S3: in state S1, the token read is a start tag STag; the current state S1 is pushed onto the state stack, the start-tag name is pushed onto the name stack, and the automaton moves to state S3;
(4) δ(S1, doctypedecl) = S2: in state S1, the token read is doctypedecl, and the automaton moves to state S2;
(5) δ(S2, Misc) = S2: in state S2, the token read is Misc (white space, a comment, or a processing instruction); the automaton loops, parsing Misc, and the state stack is unchanged;
(6) δ(S2, STag) = S3: in state S2, the token read is a start tag STag; the start-tag name is pushed onto the name stack, the current state S2 is pushed onto the state stack, and the automaton moves to state S3;
(7) δ(S3, Content_item) = S3: the automaton loops, reading tokens that require no push, and the state stack is unchanged;
(8) δ(S3, STag) = S3: in state S3, the token read is a start tag STag; the current state S3 is pushed onto the state stack, and the start-tag name is pushed onto the name stack;
(9) δ(S3, ETag) = trace: in state S3, the token read is an end tag ETag; the automaton moves to the state p on top of the state stack and pops p. At the same time, the name on top of the name stack is popped and compared with the name of the ETag; if they are identical, the tags match correctly; otherwise an error is reported.
The present invention improves the traceable automaton and gives a grammar equivalent to the improved traceable automaton. The XML syntax rules are redefined with this grammar to give the grammar rules that describe an XML document, the improved traceable automaton is constructed from these rules, and the constructed automaton recognizes the language structures in the input stream. To complete the syntax analysis of an XML document: when the token read is a start tag STag, the automaton moves to the next state and pushes the current state onto the state stack; when the token read is an end tag ETag, the automaton moves to the state p on top of the state stack and pops p; when the token read is any other token, such as Comment, CDSect, PI, or CharData, no stack operation is needed; when the token read in a final state is the empty token, the document is accepted, which shows that it conforms to the XML syntax rules. Introducing the traceable automaton effectively simplifies the design and implementation of the syntax analyzer and improves the efficiency of the parser.
The present invention is also a JSAX parsing method based on traceable-automaton syntax analysis. Parsing is carried out on the JSAX parser based on traceable-automaton syntax analysis. In parsing an XML document (see Fig. 5), the concrete parsing steps comprise:
Step 1. The lexical analyzer first reads the XMLDecl token at the start of the XML document and parses it, judging whether it conforms to the XMLDecl specification; parsing continues for a conforming XMLDecl, and an error is reported directly for a non-conforming one;
Step 2. For a conforming XMLDecl, judge whether the next token is Miscs; if so, parse Miscs in a loop;
Step 3. After Miscs has been parsed, judge whether the next token is doctypedecl; if so, go to step 4; otherwise go to step 5;
Step 4. Parse doctypedecl; after it has been parsed, judge whether the next token is Miscs; if so, parse Miscs in a loop; otherwise go to step 6;
Step 5. After doctypedecl has been parsed, judge whether the next token is an empty-element tag; if it is not, go to step 6; if it is, parse the empty-element tag and then go to step 10;
Step 6. Judge whether the next token is a start tag; if so, parse the start tag, push the current state onto the state stack, and push the start-tag name onto the name stack; otherwise report an error.
Step 7. If the next token requires no push, such as CharData, CDSection, Comment, Reference, PI, EmptyElemTag, or S, parse the corresponding token and continue with the next step; if the next token is a start tag STag, go to step 6;
Step 8. Judge whether the next token is an end tag; if so, parse the end tag, pop the state on top of the state stack as the next state of the automaton, and at the same time pop the name on top of the name stack and judge whether it is identical to the name of the current end tag; if identical, continue with the next step; otherwise the start tag and the end tag do not match, and an error is reported;
Step 9. After an end tag has been parsed, check whether the state stack is empty; if it is empty, go to step 10; otherwise go to step 7;
Step 10. Judge whether the document has further tokens; if not, the end of the XML document has been reached and parsing ends; if it has, judge whether the next token is Miscs; if not, report an error; if so, parse Miscs tokens until the end of the XML document is reached, which completes the parsing of the whole XML document.
The present invention gives not only the JSAX parser based on traceable-automaton syntax analysis but also the concrete process and steps for parsing an XML document. Compared with a syntax analyzer implemented with a pushdown automaton, the parsing steps of the present invention are greatly simplified. Because the present invention performs syntax analysis with the improved traceable automaton, its efficiency is significantly higher than that of syntax analyzers implemented with recursive-descent subroutines or pushdown automata.
Embodiment 2
The JSAX parser and parsing method based on traceable-automaton syntax analysis are the same as in embodiment 1. The present invention is now described in detail from the viewpoint of the parser's structure.
The JSAX parser based on traceable-automaton syntax analysis of the present invention mainly comprises a lexical analyzer, a syntax analyzer, and an event handler.
Design and implementation of the lexical analyzer of the JSAX parser based on traceable-automaton syntax analysis:
Because a finite automaton (FA) is easy to construct and efficient to run, FAs are widely used in the design of lexical analyzers. The JSAX parser of the present invention is an XML document parser implemented in Java and based on the SAX interface, and it likewise performs lexical analysis by constructing FAs.
With reference to Fig. 4, the lexical analyzer is responsible for reading the content of the XML document. Implementing it requires constructing finite-state machines, which read the characters or character strings in the XML document and pass them to the syntax analyzer as a token stream.
State transition diagrams are drawn from the productions of the XML specification that relate to lexical analysis, and the code is written from the resulting diagrams.
To recognize each kind of token, a finite-state machine that recognizes that token is constructed; whether a token conforms to the XML specification is judged by whether the finite-state machine accepts it. The lexical analyzer reads the current character as ch, and the finite-state machines of the lexical analyzer are constructed as follows:
A. Constructing the finite-state machine that reads a single character:
The production describing a single character in the XML specification is:
[2]Char::=#x9|#xA|#xD|[#x20-#xD7FF]|[#xE000-#xFFFD]|[#x10000-#x10FFFF]
See Fig. 4(a), the state transition diagram of the finite-state machine that reads a single character. The automaton is constructed from production [2] as follows: a character ch is read in the initial state; if it is a character allowed by production [2] of the XML specification, the automaton moves to the final state and accepts the character; otherwise an error is reported.
B. Constructing the finite-state machine that reads name tokens:
The productions describing name tokens in the XML specification are [4], [4a], and [5]:
[4]NameStart::=[A-Z]|″_″|[a-z]|Extender
[4a]NameChar::=NameStart|″:″|″-″|″.″|[0-9]|CombiningChar
[5]Name::=NameStart(NameChar)*
See Fig. 4(b), the state transition diagram of the finite-state machine that reads name tokens. In the JSAX parser, the lexical analyzer reads a legal name token as follows:
Step B1. In the initial state S0, read a character and judge whether it is a legal name start character NameStart; if so, go to step B2; otherwise go to step B3;
Step B2. In state S1, read NameChar characters in a loop until the character read is not a NameChar; then enter the end state and return successfully;
Step B3. Report an error: an illegal name token was read; return.
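Steps B1 to B3 can be sketched in Java as follows. This is a minimal illustration rather than the JSAX implementation: the class and method names are hypothetical, and NameStart is restricted to ASCII letters and '_' (the Extender and combining-character classes of the productions are left out as a simplifying assumption).

```java
/** Hypothetical sketch of the name-token finite-state machine (ASCII subset only). */
final class NameScanner {
    /** NameStart, restricted here to ASCII letters and '_' (an assumption). */
    static boolean isNameStart(char ch) {
        return (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') || ch == '_';
    }

    /** NameChar: NameStart plus ':', '-', '.', and the digits 0-9. */
    static boolean isNameChar(char ch) {
        return isNameStart(ch) || ch == ':' || ch == '-' || ch == '.'
                || (ch >= '0' && ch <= '9');
    }

    /**
     * Scans a name that begins at offset pos.
     * Returns the offset just past the name, or -1 if no legal name starts at pos.
     */
    static int scanName(String input, int pos) {
        // Step B1: in state S0 the first character must be a NameStart.
        if (pos >= input.length() || !isNameStart(input.charAt(pos))) {
            return -1; // Step B3: illegal name token
        }
        int i = pos + 1;
        // Step B2: in state S1, loop while the character is a NameChar.
        while (i < input.length() && isNameChar(input.charAt(i))) {
            i++;
        }
        return i;
    }
}
```

Returning an offset instead of a boolean lets a caller continue scanning the rest of a tag from where the name ended.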
C. Constructing the finite-state machine that reads the start-tag token STag:
The productions describing the start-tag token in the XML specification are:
[40]STag::=′<′Name(S Attribute)*S?′>′
[41]Attribute::=Name Eq AttValue
See Fig. 4(c), the state transition diagram of the finite-state machine that reads the start-tag token. The concrete steps for reading an STag are:
Step C1. In the initial state S0, read a character; if it is '<', go to step C2;
Step C2. In state S1, read the tag name;
Step C3. In state S2, read a character; if it is '>', move to state S3, accept, and return; if it is white space, go to step C4;
Step C4. In state S4, read a character; if it is '>', enter the final state, accept the token, and return; otherwise, if the next token is a Name token, go to step C5;
Step C5. In state S6, read a character; if it is '=', go to step C6;
Step C6. In state S7, read the next token; if it is an attribute value AttValue token, return to step C3;
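The loop formed by steps C1 to C6 can be sketched as a small recognizer. This is an illustrative sketch under stated assumptions, not the parser's own code: the class and helper names are hypothetical, names use a simplified character set, Eq is taken to be a bare '=' without surrounding white space, and an attribute value is any run of characters between matching quotes.

```java
/** Hypothetical sketch of the start-tag recognizer: '<' Name (S Name '=' AttValue)* S? '>'. */
final class STagRecognizer {
    private static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_';
    }

    private static boolean isNameChar(char c) {
        return isNameStart(c) || Character.isDigit(c) || c == ':' || c == '-' || c == '.';
    }

    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\n' || c == '\r';
    }

    /** Returns the offset just past a name starting at i, or -1 if none starts there. */
    private static int name(String s, int i) {
        if (i >= s.length() || !isNameStart(s.charAt(i))) return -1;
        i++;
        while (i < s.length() && isNameChar(s.charAt(i))) i++;
        return i;
    }

    /** Returns true if s is exactly one well-formed start tag. */
    static boolean accepts(String s) {
        // Step C1: read '<'.
        if (s.isEmpty() || s.charAt(0) != '<') return false;
        // Step C2: read the tag name.
        int i = name(s, 1);
        if (i < 0) return false;
        while (i < s.length()) {
            // Step C3: '>' right after a name or attribute value ends the tag.
            if (s.charAt(i) == '>') return i == s.length() - 1;
            if (!isSpace(s.charAt(i))) return false;
            while (i < s.length() && isSpace(s.charAt(i))) i++;
            // Step C4: after white space, expect '>' or an attribute name.
            if (i < s.length() && s.charAt(i) == '>') return i == s.length() - 1;
            i = name(s, i);
            if (i < 0) return false;
            // Step C5: read '='.
            if (i >= s.length() || s.charAt(i) != '=') return false;
            i++;
            // Step C6: read a quoted attribute value, then loop back to step C3.
            if (i >= s.length()) return false;
            char quote = s.charAt(i);
            if (quote != '"' && quote != '\'') return false;
            int close = s.indexOf(quote, i + 1);
            if (close < 0) return false;
            i = close + 1;
        }
        return false; // input ended before '>'
    }
}
```

For example, `STagRecognizer.accepts("<book a=\"z\">")` follows C1, C2, C3, C4, C5, C6, and then C3 again before accepting.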
D. Constructing the finite-state machine that reads the end-tag token ETag:
The productions describing the end tag in the XML specification are:
[5]Name::=NameStart(NameChar)*
[42]ETag::=′</′Name S?′>′
See Fig. 4(d), the state transition diagram of the finite-state machine that reads the end-tag token. The concrete steps for reading an ETag are as follows:
Step D1. Read '<';
Step D2. Read '/';
Step D3. Read the tag name, beginning with the name start character NameStart;
Step D4. Read a character; if it is '>', enter the end state, which shows that the ETag token has been read successfully; otherwise go to step D5;
Step D5. If the next character is white space, keep reading characters until the character read is not white space; if that character is '>', enter the end state, which shows that the ETag has been read successfully.
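Steps D1 to D5 can be sketched as a recognizer that also returns the tag name, since the syntax analyzer later compares that name with the matching start tag. The class and method names are hypothetical, and the name character set is a simplified assumption.

```java
/** Hypothetical sketch of the end-tag recognizer for '</' Name S? '>'. */
final class ETagRecognizer {
    private static boolean isNameStart(char c) {
        return Character.isLetter(c) || c == '_';
    }

    private static boolean isNameChar(char c) {
        return isNameStart(c) || Character.isDigit(c) || c == ':' || c == '-' || c == '.';
    }

    /** Returns the tag name if s is exactly one well-formed end tag, otherwise null. */
    static String parseName(String s) {
        int n = s.length();
        // Steps D1 and D2: read '<' and then '/'.
        if (n < 2 || s.charAt(0) != '<' || s.charAt(1) != '/') return null;
        // Step D3: read the name, which must begin with a NameStart character.
        int i = 2;
        if (i >= n || !isNameStart(s.charAt(i))) return null;
        i++;
        while (i < n && isNameChar(s.charAt(i))) i++;
        String name = s.substring(2, i);
        // Step D5: skip optional white space after the name.
        while (i < n && Character.isWhitespace(s.charAt(i))) i++;
        // Step D4: the tag must end with '>' and nothing may follow it.
        return (i == n - 1 && s.charAt(i) == '>') ? name : null;
    }
}
```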
In the lexical analyzer, every kind of token is recognized by constructing a finite-state machine. This description gives, from the productions of the XML specification, the construction of the finite-state machines that recognize the start-tag token STag and the end-tag token ETag. The other tokens are recognized in a similar way, by constructing finite-state machines from the productions of the XML specification, and are not listed one by one here.
Embodiment 3
The structure and grammar rules of the JSAX parser based on traceable-automaton syntax analysis are the same as in embodiments 1-2, and the JSAX parsing method based on traceable-automaton syntax analysis is the same as in embodiments 1-2.
The concrete improvement to the traceable automaton is described with reference to the drawings.
The syntax analyzer of the present invention is based on the traceable automaton, which is defined as follows: a deterministic traceable automaton DTA is a five-tuple M=(S, ∑, δ, q0, F), wherein,
M represents the constructed traceable automaton;
S={S0, S1, ..., Sn} is the nonempty set of states;
∑ is the input character set;
q0 ∈ S is the initial state;
F ⊆ S is the nonempty set of final states;
δ is a mapping from S × ∑ to S ∪ {trace}.
The traceable automaton consists of an input tape, a state stack, and a finite control. Initially, the read head points to the leftmost symbol of the input tape, the state stack is empty, and the finite control is in state q0. At each step, the finite control determines the transfer action from the current state q and the token a under the read head according to the state transfer function δ, which has the following cases:
(3.1.1) If δ(q, a) = p, state q is pushed onto the stack, control moves to p, and the read head moves right one cell (the push rule);
(3.1.2) If δ(q, a) = trace and the stack is not empty, control moves to the stack-top state p, p is popped, and the read head moves right one cell (the backtracking rule); if the stack is empty, the automaton halts and rejects;
(3.1.3) If a is the blank symbol (which does not belong to the input character set and marks the end of the string), the automaton halts, accepting the input string when q ∈ F and rejecting it when q ∉ F;
(3.1.4) If δ(q, a) is undefined, the automaton halts and rejects the input string.
The traceable automaton above can recognize the language of paired-bracket strings. In XML, however, besides tags that behave like brackets there is much other character data that does not need to be matched. The automaton above does not classify the token a under the read head (a stands for any token under the read head), so it cannot recognize a language such as XML and must be improved. The present invention replaces the 4 rules of the traceable automaton with the following 5 rules and divides the tokens under the read head into five classes a, b, c, d, and e:
1) If δ(q, a) = p, then on reading token a the current state q is pushed onto the stack and the automaton moves to p; a is called a token that requires a push;
2) If δ(q, b) = trace, then on reading token b, provided the state stack is not empty, the stack-top state p is popped and the automaton moves to state p; b is called a token that requires backtracking;
3) If δ(q, c) = p, then on reading token c no stack operation is performed; c is called a token that requires no stack operation. With this rule, the improved traceable automaton can recognize tag-matching string languages with a nested hierarchical structure such as XML, which increases the recognition power of the traceable automaton;
4) If d is the blank symbol (which does not belong to the input character set and marks the end of the string), the automaton halts, accepting the input string when q ∈ F and rejecting it when q ∉ F;
5) If δ(q, e) is undefined, the automaton halts and rejects the input string.
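The five rules can be sketched as a small table-driven automaton. The sketch below is illustrative only: the class, method, and state names are hypothetical, symbols are single characters, and the sample machine returned by brackets() (an assumption for demonstration) recognizes balanced parentheses interleaved with the unmatched symbol 'x', which plays the role of class c.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical table-driven sketch of the five rules; all names are illustrative. */
final class ImprovedTraceableAutomaton {
    private enum Kind { PUSH, TRACE, NO_STACK }

    private static final class Move {
        final Kind kind;
        final int target;
        Move(Kind kind, int target) { this.kind = kind; this.target = target; }
    }

    private final Map<String, Move> delta = new HashMap<>();
    private final Set<Integer> finals = new HashSet<>();
    private final int initial;

    ImprovedTraceableAutomaton(int initial, int... finalStates) {
        this.initial = initial;
        for (int f : finalStates) finals.add(f);
    }

    private static String key(int state, char symbol) { return state + ":" + symbol; }

    void addPush(int q, char a, int p) { delta.put(key(q, a), new Move(Kind.PUSH, p)); }
    void addTrace(int q, char b) { delta.put(key(q, b), new Move(Kind.TRACE, -1)); }
    void addNoStack(int q, char c, int p) { delta.put(key(q, c), new Move(Kind.NO_STACK, p)); }

    boolean accepts(String input) {
        Deque<Integer> stack = new ArrayDeque<>();
        int q = initial;
        for (char symbol : input.toCharArray()) {
            Move m = delta.get(key(q, symbol));
            if (m == null) return false;               // rule 5: undefined, reject
            switch (m.kind) {
                case PUSH:                             // rule 1: push q, move to p
                    stack.push(q);
                    q = m.target;
                    break;
                case TRACE:                            // rule 2: pop p, move to p
                    if (stack.isEmpty()) return false;
                    q = stack.pop();
                    break;
                default:                               // rule 3: move to p, no stack operation
                    q = m.target;
                    break;
            }
        }
        return finals.contains(q);                     // rule 4: end of input
    }

    /** Sample machine: balanced '(' and ')' with any number of 'x' symbols in between. */
    static ImprovedTraceableAutomaton brackets() {
        ImprovedTraceableAutomaton m = new ImprovedTraceableAutomaton(0, 0);
        m.addPush(0, '(', 1);
        m.addPush(1, '(', 1);
        m.addTrace(1, ')');
        m.addNoStack(0, 'x', 0);
        m.addNoStack(1, 'x', 1);
        return m;
    }
}
```

Without rule 3 the sample machine could only accept strings made purely of brackets; the no-stack transitions are what allow unmatched content between them.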
The present invention not only improves the traceable automaton and strengthens its recognition power but also gives a grammar equivalent to the improved traceable automaton, so that the traceable automaton has both transfer rules and a grammar, which effectively widens its range of application.
Embodiment 4
The JSAX parser and parsing method based on traceable-automaton syntax analysis are the same as in embodiments 1-3.
Specific design and implementation of the syntax analyzer of the JSAX parser based on traceable-automaton syntax analysis:
The JSAX parser is an XML parser based on the SAX interface; it performs the syntax analysis of XML documents by introducing the improved traceable automaton.
(1) Constructing the grammar that describes an XML document:
In the XML specification, the production that defines a document is:
[1]document::=prolog element Misc*
The start symbol is document. The symbols that must be derived further are those beginning with a lowercase letter, namely prolog and element. According to the XML grammar definitions, these symbols are replaced using the productions of prolog and element.
First, the production of prolog is transformed.
The original productions describing an XML document are as follows:
[1]document::=prolog element Misc*
[22]prolog::=XMLDecl?Misc*(doctypedecl Misc*)?
[28]doctypedecl::=′<!DOCTYPE′S Name(S ExternalID)?S?(′[′intSubset′]′S?)?′>′
[28b]intSubset::=(markupdecl|DeclSep)*
[29]markupdecl::=elementdecl|AttlistDecl|EntityDecl|NotationDecl|PI|Comment
[45]elementdecl::=′<!ELEMENT′S Name S contentspec S?′>′
[46]contentspec::=′EMPTY′|′ANY′|Mixed |children
[47]children::=(choice|seq)(′?′|′*′|′+′)?
[48]cp::=(Name|choice|seq)(′?′|′*′|′+′)?
[49]choice::=′(′S?cp(S?′|′S?cp)+S?′)′
[50]seq::=′(′S?cp(S?′,′S?cp)*S?′)′
The production doctypedecl can be transformed into:
doctypedecl::=′<!DOCTYPE′S Name(S ExternalID)?S?(′[′(elementdecl|AttlistDecl|EntityDecl|NotationDecl|PI|Comment|DeclSep)*′]′S?)?′>′
According to the XML specification, the symbols that begin with a capital letter in the specification's productions are all regular languages. Among the items in doctypedecl, only elementdecl is non-regular: within elementdecl, the expressions of choice and seq reference cp, and the expression of cp references choice and seq, which forms a recursive definition. Consequently, a finite set of states cannot determine, when parsing begins, the symbol ' | ' that follows cp. However, contentspec describes the constraints on element structure that concern "validity". The present invention is a standard XML parser and does not need to perform validity checking; that is, it need not be concerned with the particular content of contentspec. contentspec is therefore handled as a simple character string, and its production can be changed into:
contentspec::=[^>]*
The only requirement is that the content of contentspec not contain the tag terminator ' > '. contentspec thus becomes a regular grammar.
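The effect of this simplification can be illustrated with a regular expression. The sketch below is an assumption-laden illustration rather than the parser's own lexer: the class name is hypothetical, Name is limited to ASCII characters, and the optional white space before '>' is absorbed by the [^>]* part.

```java
import java.util.regex.Pattern;

/** Hypothetical sketch: elementdecl with contentspec relaxed to [^>]*. */
final class ElementDeclPattern {
    // '<!ELEMENT' S Name S contentspec S? '>' where contentspec ::= [^>]*
    private static final Pattern ELEMENT_DECL =
            Pattern.compile("<!ELEMENT\\s+[A-Za-z_][A-Za-z0-9_:.\\-]*\\s+[^>]*>");

    /** Returns true if s is exactly one element declaration under the relaxed rule. */
    static boolean matches(String s) {
        return ELEMENT_DECL.matcher(s).matches();
    }
}
```

A declaration with nested groups, such as `<!ELEMENT book (title, (author|editor)+, price?)>`, is accepted without any analysis of the recursive choice, seq, and cp structure.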
Replacing the "*" operator with a right-recursive definition and the "?" operator with an alternation with the empty string ε, prolog is transformed equivalently:
[22]prolog::=XMLDecl?Misc*(doctypedecl Misc*)?
is equivalently transformed to:
prolog::=(XMLDecl|ε)Miscs(doctypedecl Miscs|ε)
Miscs::=Misc Miscs|ε
In this way, the expression of prolog also becomes a regular grammar.
The element production is transformed:
[39]element::=EmptyElemTag|STag content ETag
[43]content::=CharData?((element|Reference|CDSect|PI|Comment)CharData?)*
Using the identity CharData?((other)CharData?)*=(CharData|other)*, content is transformed to:
content::=(element|Reference|CDSect|PI|Comment|CharData)content|ε
Then element::=EmptyElemTag|STag content ETag is substituted into content, giving:
content::=(STag content ETag|EmptyElemTag|Reference|CDSect|PI|Comment|CharData)content|ε
For convenience of expression below, the present invention introduces the nonterminal Content_item. Let:
Content_item::=EmptyElemTag|Reference|CDSect|PI|Comment|CharData
The element and content productions are then transformed to:
element::=EmptyElemTag|STag content ETag
content::=(STag content ETag|Content_item)content|ε
Wherein, element represents an element appearing in the XML document. An element is either an EmptyElemTag (empty-element tag) or a non-empty element, which is the string formed by an STag, an ETag, and the content that appears between the start tag and the end tag. content is a sequence of non-empty elements or Content_items; the sequence described by content contains the same number of start tags and end tags, and these start tags and end tags must be correctly nested and matched.
The present invention re-describes the syntax rules of element with a grammar equivalent to the improved traceable automaton:
document::=prolog element Miscs
element::=EmptyElemTag|A
A::=STag B A
Content_item::=CharData|Reference|CDSect|PI|Comment|EmptyElemTag
B::=Content_item B
B::=STag B B
B::=ETag
A::=Miscs
Wherein, document represents the XML document; prolog describes the declaration information and the document type declaration doctypedecl; element describes a nested, hierarchically structured string of matched tags, and the tags appearing in element must be correctly nested and matched; STag represents a start tag; ETag represents an end tag; CharData represents character data; Reference represents a reference; CDSect represents a CDATA section; PI represents a processing instruction; Comment represents a comment; EmptyElemTag represents an empty-element tag; Misc* represents the white space, processing instructions, and comments in the XML document; B is a nonterminal that can be replaced with the end tag ETag, with Content_item B, or with STag B B; A is a nonterminal that can be replaced with Miscs or STag B A.
Finally, consider the production of Misc:
Misc::=Comment|PI|S
It follows that Misc can be described with a regular expression.
In summary, for an XML document document::=prolog element Misc*: prolog is a regular grammar; element can be described with a grammar equivalent to the improved traceable automaton; Misc* is a regular grammar. Regular grammars are a subset of the grammars equivalent to the improved traceable automaton. According to automata theory, an improved traceable automaton can therefore be constructed to recognize the language structures in the token stream of an XML document and complete its syntax analysis.
(2) Constructing the traceable automaton (in the present invention, the improved traceable automaton) and performing syntax analysis:
The grammar describing an XML document in (1) can be expressed with a grammar equivalent to the improved traceable automaton, so an improved traceable automaton is constructed to perform the syntax analysis of XML documents.
To check whether start tags and end tags match, a name stack is also needed. When a start tag is encountered, the name of the STag is pushed onto the name stack; when an end tag ETag is encountered, the name on top of the name stack is popped and compared with the name of the ETag; if the two differ, a tag-mismatch error is reported. Other tags and tokens require no stack operation.
With reference to Fig. 1, the traceable automaton TA constructed for XML syntax analysis is:
M=(S, ∑, δ, q0, F), wherein,
State set S: {S0, S1, S2, S3, trace};
State    Meaning
S0       initial parsing state
S1       state after XMLDecl has been parsed (also a final state)
S2       state reached after doctypedecl has been parsed (also a final state)
S3       state in which content is parsed after the root element STag has been parsed
trace    an ETag has been parsed; enter the backtracking state
Input symbol set ∑:
{XMLDecl,Misc,doctypedecl,EmptyElemTag,STag,Reference,PI,CDSection,Comment,CharData,ETag};
Initial state q0: S0;
Final state set F: {S1, S2};
Transfer function δ: S × ∑ → S ∪ {trace}; with reference to Fig. 1, it is the set of the following transitions:
(1) δ(S0, XMLDecl) = S1: parsing starts and the document declaration XMLDecl is parsed; move to state S1;
(2) δ(S1, Misc) = S1: parse Misc in a loop.
(3) δ(S1, STag) = S3: a start tag STag is parsed; push the start-tag name onto the name stack, push the current state S1 onto the state stack, and move to state S3.
(4) δ(S1, doctypedecl) = S2: doctypedecl is encountered; move to state S2.
(5) δ(S2, Misc) = S2: parse Misc in a loop.
(6) δ(S2, STag) = S3: a start tag STag is encountered; push the start-tag name onto the name stack, push the current state S2 onto the state stack, and move to state S3.
(7) δ(S3, Content_item) = S3: parse, in a loop, tokens that require no push.
(8) δ(S3, STag) = S3: parse start tags STag in a loop; push the start-tag name onto the name stack and push the current state S3 onto the state stack.
(9) δ(S3, ETag) = trace: an end tag ETag is parsed; enter the backtracking state. Pop the name on top of the name stack and compare it with the name of the ETag; pop the state p on top of the state stack, and the automaton moves to state p.
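The nine transitions, together with the name stack, can be sketched as follows. This is a minimal sketch that follows the transitions literally; the class, enum, and helper names are hypothetical, and the tokens are assumed to have been classified already by a lexical analyzer (Content_item stands for any token that needs no stack operation).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical sketch of the automaton TA with a state stack and a name stack. */
final class XmlTraceableAutomaton {
    enum State { S0, S1, S2, S3 }
    enum Type { XML_DECL, MISC, DOCTYPE_DECL, STAG, ETAG, CONTENT_ITEM }

    static final class Token {
        final Type type;
        final String name; // tag name for STAG and ETAG, otherwise null
        Token(Type type, String name) { this.type = type; this.name = name; }
    }

    static Token t(Type type, String name) { return new Token(type, name); }

    /** Returns true if the token sequence is accepted, that is, ends in S1 or S2. */
    static boolean accepts(Token... tokens) {
        Deque<State> stateStack = new ArrayDeque<>();
        Deque<String> nameStack = new ArrayDeque<>();
        State q = State.S0;
        for (Token tok : tokens) {
            switch (tok.type) {
                case XML_DECL:                                   // transition (1)
                    if (q != State.S0) return false;
                    q = State.S1;
                    break;
                case MISC:                                       // transitions (2), (5)
                    if (q != State.S1 && q != State.S2) return false;
                    break;
                case DOCTYPE_DECL:                               // transition (4)
                    if (q != State.S1) return false;
                    q = State.S2;
                    break;
                case STAG:                                       // transitions (3), (6), (8)
                    if (q == State.S0 || tok.name == null) return false;
                    stateStack.push(q);
                    nameStack.push(tok.name);
                    q = State.S3;
                    break;
                case CONTENT_ITEM:                               // transition (7)
                    if (q != State.S3) return false;
                    break;
                case ETAG:                                       // transition (9)
                    if (q != State.S3) return false;
                    // In S3 both stacks are non-empty, so the pops cannot fail.
                    if (!nameStack.pop().equals(tok.name)) return false;
                    q = stateStack.pop();
                    break;
            }
        }
        return q == State.S1 || q == State.S2;
    }
}
```

Because the state pushed by the root start tag is S1 or S2, popping it at the root end tag returns the automaton directly to a final state, with no separate "after root" state needed.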
(3) Event handler
According to the SAX specification, the present invention produces a large number of parsing events while parsing an XML file, and these events trigger the callback methods of the registered event handler. When JSAX starts parsing an XML file, it first calls the startDocument method, indicating that parsing has started; on white-space strings (such as spaces, tabs, and line feeds) or character data, it calls the characters method; on a start tag, it calls the startElement method; on an end tag, it calls the endElement method; on a PI, it calls the processingInstruction method; if an error occurs during parsing, it calls the corresponding error-handling method to process and report the error; when the XML document has been fully parsed, it calls the endDocument method, indicating that parsing is complete.
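The callback sequence described above can be observed with any SAX-compliant parser. Because JSAX itself is not available here, the sketch below uses the SAX parser built into the JDK (javax.xml.parsers) as a stand-in; the handler class name and the event-string format are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records each SAX callback as a string. */
final class SaxEventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override
    public void startDocument() { events.add("startDocument"); }

    @Override
    public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        events.add("startElement:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several chunks.
        events.add("characters:" + new String(ch, start, length));
    }

    @Override
    public void processingInstruction(String target, String data) {
        events.add("processingInstruction:" + target);
    }

    /** Parses the XML text with the JDK's built-in SAX parser and returns the recorded events. */
    static List<String> parse(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            SaxEventLogger handler = new SaxEventLogger();
            parser.parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }
}
```

A handler written against these callbacks would work unchanged with any parser that meets the SAX interface specification, which is the compatibility the description claims for JSAX.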
The JSAX parser given by the present invention meets the requirements of the SAX interface specification, so users can easily parse XML documents through the SAX interface. It also solves the problem that the syntax analyzers of XML document parsers are structurally complex and perform poorly; it is easy to implement and efficient, and can be applied to the parsing of XML documents.
Embodiment 5
The JSAX parser and parsing method based on traceable-automaton syntax analysis are the same as in embodiments 1-4.
Fig. 3 shows an XML document that stores information about books; each book element contains the title, author, and price of the book. When applied to parse this document, the present invention proceeds as follows:
First when starting to resolve this XML document, pass to event handler startDocument event information, lexical analyzer reads the character in XML document, and exports XMLDecl mark, and with reference to accompanying drawing 1, Traceable automata is at S 0state reads in XMLDecl mark, turns to S 1state; Read next mark, next mark is PI mark, and Traceable automata turns to S 1state, and pass to event handler processingInstruction event information, show to have found PI mark, while grammatical analysis described above, by the call back function proeessingInstruction () of standard by the XML document information of grammaticalness specification, namely processingInstruction returns to user.At S 1state reads next mark, and next mark is beginning label <books>, current state S 1press-in state stack stack top, and the tag name " books " starting label <books> is pressed into namespace stack stack top, control to turn to S 3state, pass to event handler startElement event information, simultaneously by call back function startElement (String uri, String localName, String qName, Attributes attributes) information of beginning label is returned to user; At S 3--a book-->, this mark is comment, does not need to carry out stack operation, and NextState is still S 3; Read next mark, next mark is <book a=" z " >, is a beginning label STag, by current state S 3press-in state stack stack top, and token name " book " is pressed into namespace stack stack top, control to turn to S 3pass to event handler startElement event information, simultaneously by call back function startElement (String uri, String localName, String qName, Attributes attributes) information of beginning label is returned to user; [CDATA [<Tom> & <Lucy>one<two]] >, it is a CDSect mark, do not need to carry out stack operation, NextState is still S 3; Read next mark, next mark is <title>, is a STag, by current state S 3press-in state stack stack top, and the name " title " of this STag is pressed into 
namespace stack top, control transfers to state S₃, the startElement event information is passed to the event handler, and the callback function startElement(String uri, String localName, String qName, Attributes attributes) returns the information of the start tag to the user.
The next token is read. It is "The Romance of the Three Kingdoms", a CharData token, so no stack operation is needed; the CharData event information is passed to the event handler, and the callback function characters(char[] ch, int start, int length) returns the character data to the user.
The next token is read. It is </title>, an ETag. The state S₃ on top of the state stack is popped, and the name "title" on top of the namespace stack is popped; this name is identical to the name of the current end tag </title>, which shows that the tags match correctly. Control transfers to S₃, the endElement event information is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the information of the end tag to the user.
The next token is read. It is <author>, an STag. The current state S₃ is pushed onto the state stack, the name "author" of this STag is pushed onto the namespace stack, and control transfers to S₃. The startElement event information is passed to the event handler, and the callback function startElement(String uri, String localName, String qName, Attributes attributes) returns the information of the start tag to the user.
The next token is read. It is "Luo Guanzhong", a CharData token, so no stack operation is needed; the CharData event information is passed to the event handler, and the callback function characters(char[] ch, int start, int length) returns the character data to the user. Control transfers to state S₃.
The next token is read. It is the end tag </author>, an ETag. The state S₃ on top of the state stack is popped, and the name "author" on top of the namespace stack is popped; this name is identical to the name of the end tag </author>, which shows that the tags match correctly. The endElement event information is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the information of the end tag to the user.
The next token is read. It is <price>, an STag. The current state S₃ is pushed onto the state stack, the name "price" of this start tag is pushed onto the namespace stack, and control transfers to S₃. The startElement event information is passed to the event handler, and the callback function startElement(String uri, String localName, String qName, Attributes attributes) returns the information of the start tag to the user.
The next token is read. It is "42.2" (see Fig. 3), a CharData token, so no stack operation is needed; the CharData event information is passed to the event handler, and the callback function characters(char[] ch, int start, int length) returns the character data to the user.
The next token is read. It is </price>, an end tag. The state S₃ on top of the state stack is popped and control transfers to S₃; the name "price" on top of the namespace stack is popped, and it is identical to the name of the current end tag, which shows that the tags match correctly. The endElement event information is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the information of the end tag to the user.
The next token is read. It is </book>, an end tag. The state S₃ on top of the state stack is popped and control transfers to S₃; the name "book" on top of the namespace stack is popped, and it is identical to the name of the current end tag, which shows that the tags match correctly. The endElement event information is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the information of the end tag to the user.
The next token is read. It is </books>, an end tag. The state S₁ on top of the state stack is popped and control transfers to S₁; the name "books" on top of the namespace stack is popped, and it is identical to the name of the current end tag, which shows that the tags match correctly. The endElement event information is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the information of the end tag to the user.
The next token is read. It is <!--end of xml file-->, a Comment, so no stack operation is needed; the characters event information is passed to the event handler.
The end of the XML document has now been reached and the current state is S₁, which is a final state. The endDocument event information is passed to the event handler, indicating that parsing of the document has finished, and the parser returns successfully. During parsing, the corresponding event information must be passed to the event handler. The event handler accepts all event information passed by the parser, finds the required data in it (for example, "Luo Guanzhong" above), and returns these data to the user through the callback functions. The concrete parsing result is shown in Fig. 6.
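The callback sequence described in this walkthrough can be observed with any SAX parser. The following is only an illustrative sketch: it uses the SAX parser bundled with the JDK rather than the JSAX parser of the invention, the class name SaxCallbackDemo is hypothetical, and the sample document is an assumed reconstruction of the books example from the tokens named above.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical demo class; it records the standard SAX callbacks in order.
class SaxCallbackDemo {

    /** Parses the XML text with the JDK's SAX parser and returns the events seen. */
    static List<String> traceEvents(String xml) {
        List<String> events = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName,
                                     String qName, Attributes attributes) {
                events.add("startElement " + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // A SAX parser may deliver one text node in several chunks.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    events.add("characters " + text);
                }
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                events.add("endElement " + qName);
            }

            @Override
            public void endDocument() {
                events.add("endDocument");
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return events;
    }

    public static void main(String[] args) {
        // Assumed reconstruction of the sample document used in the walkthrough.
        String xml = "<books><book>"
                + "<title>The Romance of the Three Kingdoms</title>"
                + "<author>Luo Guanzhong</author>"
                + "<price>42.2</price>"
                + "</book></books><!--end of xml file-->";
        for (String event : traceEvents(xml)) {
            System.out.println(event);
        }
    }
}
```

Note that DefaultHandler does not receive comments, so the trailing comment produces no callback in this sketch.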
Embodiment 6
The JSAX parser based on traceable-automaton syntactic analysis and its parsing method are the same as in Embodiments 1-4.
The present invention is an XML document parser. To test its performance, the present invention was run in the same environment as the Xerces parser, which is generally acknowledged to have good parsing performance, and the performance of the two was compared.
(1) Test environment
Hardware: Intel Pentium (Dual-Core) D CPU, 1.73 GHz; memory: 2.00 GB
Operating system: Windows 7
Java VM: J2SE 1.6.0_02
Test software: Eclipse SDK, version 3.5.2
(2) Analysis of performance test data
As shown in Fig. 7, Xerces and JSAX were each used to parse XML documents containing 10, 100, 1000, 10000, and 100000 elements, with 6 test runs per document. The time t₁ was recorded when parsing of a document started and the time t₂ when parsing finished, giving the time used by each of the 6 runs as t = t₂ - t₁ (unit: milliseconds), from which the average parsing time was computed. The reported result for each document is this average. The test results are shown in Tables 1 to 5.
Table 1: Results for the test document containing 10 elements
Table 2: Results for the test document containing 100 elements
Table 3: Results for the test document containing 1000 elements
Table 4: Results for the test document containing 10000 elements
Table 5: Results for the test document containing 100000 elements
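The measurement procedure of this embodiment (record t₁ before parsing and t₂ after, repeat 6 times, average t = t₂ - t₁) can be sketched as follows. This is not the test program used for the tables: the class and method names are hypothetical, the document layout (one root with n child elements) is an assumption, and the JDK's bundled SAX parser stands in for the parsers under test.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical timing harness following the t = t2 - t1 procedure.
class ParseTimingSketch {

    /** Builds a document with one root and the given number of child elements (assumed layout). */
    static String makeDocument(int elements) {
        StringBuilder sb = new StringBuilder("<items>");
        for (int i = 0; i < elements; i++) {
            sb.append("<item>").append(i).append("</item>");
        }
        return sb.append("</items>").toString();
    }

    /** Runs the task the given number of times and returns the mean of t2 - t1 in milliseconds. */
    static double averageMillis(Runnable parseOnce, int runs) {
        long total = 0;
        for (int i = 0; i < runs; i++) {
            long t1 = System.currentTimeMillis(); // time when parsing starts
            parseOnce.run();
            long t2 = System.currentTimeMillis(); // time when parsing has finished
            total += t2 - t1;
        }
        return (double) total / runs;
    }

    /** Parses the text once with the JDK's SAX parser and a handler that ignores all events. */
    static void parseWithJdkSax(String xml) {
        try {
            SAXParser parser = SAXParserFactory.newInstance().newSAXParser();
            parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
    }

    public static void main(String[] args) {
        for (int n : new int[] {10, 100, 1000, 10000, 100000}) {
            String xml = makeDocument(n);
            double avg = averageMillis(() -> parseWithJdkSax(xml), 6);
            System.out.println(n + " elements: " + avg + " ms on average");
        }
    }
}
```

Millisecond timers are coarse for the smallest documents, so a real comparison would likely need more runs or a finer clock; the sketch keeps to the procedure as described.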
The test results show that the present invention improves performance by at least 2.8% compared with the Xerces parser. The test data also show that the times the present invention needs to parse the same XML document are relatively close to one another, which indicates that its parsing performance is more stable. When parsing the XML document containing 100000 elements, the average parsing time of the present invention is less than that of the existing Xerces parser. Because the present invention applies the improved traceable automaton to the syntax analyzer of the JSAX parser, it effectively simplifies the design and implementation of the syntax analyzer compared with a syntax analyzer implemented with a pushdown automaton, and it improves parsing efficiency. In particular, when massive XML documents need to be parsed, the parsing efficiency of the present invention is high, which gives it high practical value.
In summary, the present invention is an XML document parser in the SAX interface mode based on traceable-automaton syntactic analysis. The present invention redefines the action transfer rule δ of the traceable automaton and applies the improved traceable automaton to its syntax analyzer, which simplifies the design and implementation of the syntax analyzer and effectively improves the efficiency of the XML parser. During syntactic analysis, the traceable automaton takes the token stream provided by the lexical analyzer as input. When the token read is a start tag, the current state is pushed onto the top of the stack; when the token read is an end tag, the automaton pops a state from the top of the stack and takes it as its next state; for all other tokens no stack operation is performed. While syntactic analysis is carried out, the XML document information that conforms to the syntax specification is returned to the user through the standard callback functions. The present invention solves the problem that the syntax analyzer of an XML document parser is complex in structure and low in performance; it is easy to implement and highly efficient, and can be applied to the parsing of XML documents.

Claims (1)

1. A JSAX parser based on traceable-automaton syntactic analysis, which parses XML documents in the Eclipse environment, the JSAX parser comprising a lexical analyzer, a syntax analyzer, and an event handler; the lexical analyzer is responsible for reading the content of the XML document and outputs the tokens it reads to the syntax analyzer; the syntax analyzer recognizes the language structures in the input token stream according to the XML specification and passes the corresponding event information to the event handler; the event handler accepts and processes all event information passed by the syntax analyzer of the parser, finds the required data in it, realizes the parsing of the XML document, and provides the parsing result; wherein the syntax analyzer is constructed on the basis of an automaton, this automaton is a traceable automaton whose structure is the five-tuple M = (S, Σ, δ, q₀, F), and the traceable automaton further includes a state stack used to save part of the run history; characterized in that: the syntax analyzer is implemented on the basis of a traceable automaton, the traceable automaton being an improved traceable automaton, the improvement consisting specifically in defining the transfer function δ of the traceable automaton as follows, where δ: S × Σ → S ∪ {trace}; this definition is a rule-based definition comprising:
1) if δ(q, a) = p, then in state q, when token a is read, the current state q is pushed onto the top of the state stack, where a denotes a token that requires the current state q to be pushed;
2) if δ(q, b) = trace, then in state q, when token b is read and the state stack is not empty, the state p on top of the state stack is popped and control transfers to state p, where b denotes a token that requires a backtracking action;
3) if δ(q, c) = p, then in state q, when token c is read, no stack operation is needed, where c denotes a token that requires no stack operation;
4) if δ(q, d), where d is the blank symbol, the automaton halts and accepts the input string when q ∈ F and rejects it when q ∉ F, where the blank symbol does not belong to the input alphabet and denotes the end of the string;
5) if δ(q, e) is undefined, the automaton halts and rejects the input string;
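Rules 1) to 5) can be illustrated with a small generic sketch. The code below is not taken from the patent: the class name, the string-based state names, and the balanced-parentheses example automaton are all assumptions made for illustration, and rejecting a trace action on an empty stack is likewise an assumption, since rule 2) only covers the non-empty case.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch of a traceable automaton driven by rules 1) to 5).
class TraceableAutomatonSketch {
    enum Kind { PUSH, TRACE, MOVE }

    private static final class Rule {
        final Kind kind;
        final String target; // null for TRACE: the next state comes from the stack
        Rule(Kind kind, String target) { this.kind = kind; this.target = target; }
    }

    private final Map<String, Rule> delta = new HashMap<>();
    private final String start;
    private final Set<String> finals;

    TraceableAutomatonSketch(String start, Set<String> finals) {
        this.start = start;
        this.finals = finals;
    }

    private static String key(String q, char symbol) { return q + "|" + symbol; }

    void push(String q, char a, String p) { delta.put(key(q, a), new Rule(Kind.PUSH, p)); }
    void trace(String q, char b) { delta.put(key(q, b), new Rule(Kind.TRACE, null)); }
    void move(String q, char c, String p) { delta.put(key(q, c), new Rule(Kind.MOVE, p)); }

    boolean accepts(String input) {
        String q = start;
        Deque<String> stack = new ArrayDeque<>();
        for (char symbol : input.toCharArray()) {
            Rule rule = delta.get(key(q, symbol));
            if (rule == null) return false;           // rule 5: undefined, reject
            switch (rule.kind) {
                case PUSH:                            // rule 1: save q, go to p
                    stack.push(q);
                    q = rule.target;
                    break;
                case TRACE:                           // rule 2: resume the popped state
                    if (stack.isEmpty()) return false; // assumption: reject on empty stack
                    q = stack.pop();
                    break;
                case MOVE:                            // rule 3: no stack operation
                    q = rule.target;
                    break;
            }
        }
        return finals.contains(q);                    // rule 4: end of input
    }

    /** Assumed example: nested parentheses with optional 'x' content inside them. */
    static TraceableAutomatonSketch balancedParens() {
        TraceableAutomatonSketch m = new TraceableAutomatonSketch("q0", Set.of("q0"));
        m.push("q0", '(', "q1");
        m.push("q1", '(', "q1");
        m.move("q1", 'x', "q1");
        m.trace("q1", ')');
        return m;
    }
}
```

In the example automaton, q0 plays the role of the state outside any open bracket and q1 the state inside one, so an unclosed bracket leaves the automaton in the non-final state q1 and an extra closing bracket hits an undefined transition in q0.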
The grammar form equivalent to the improved traceable automaton is:
A → aβ
where a ∈ T, that is, a belongs to the set of terminal symbols T; β ∈ N⁰ ∪ N¹ ∪ N², that is, β is a string of zero, one, or two nonterminal symbols from N; and when β contains two nonterminal symbols, the production has the structure A → aCA, which requires the second nonterminal symbol on the right-hand side of the production to be identical to the nonterminal symbol on its left-hand side, where A and C are nonterminal symbols;
The descriptive power of this grammar is stronger than that of the regular grammar RG but weaker than that of the context-free grammar CFG; it is a subset of CFG and lies between RG and CFG;
The XML syntax definition is described with the grammar equivalent to the improved traceable automaton, yielding the grammar rules that describe XML documents; the improved traceable automaton is constructed according to these grammar rules; the improved traceable automaton recognizes the language structures in the token stream of the XML document and judges whether they conform to the syntax specification, completing the syntactic analysis, while the corresponding event information is passed to the event handler;
The grammar rules describing the XML syntax definition, built in the grammar form equivalent to the improved traceable automaton, comprise:
document::=prolog element Miscs
element::=EmptyElemTag|A
A::=STag B A
Content_item::=CharData|Reference|CDSect|PI|Comment|EmptyElemTag
B::=Content_item B
B::=STag B B
B::=ETag
A::=Miscs
Miscs::=ε|Misc Miscs
where document denotes the XML document; prolog describes the XML declaration and the document type declaration doctypedecl; element describes a nested, matched tag string with hierarchical structure, and the tags appearing in element must be correctly nested and matched; STag denotes a start tag; CharData denotes character data; Reference denotes a reference; CDSect denotes a CDATA section; PI denotes a processing instruction; Comment denotes a comment; EmptyElemTag denotes an empty-element tag; Misc denotes white space, comments, and processing instructions in the XML document; B is a nonterminal symbol that is replaced by the end tag ETag or by STag B B; A is a nonterminal symbol that is replaced by Miscs or by STag B A;
According to the grammar rules describing the XML syntax definition, the improved traceable automaton is constructed; the improved traceable automaton reads the language structures in the token stream of the XML document output by the lexical analyzer and completes the syntactic analysis; the improved traceable automaton TA so constructed is:
M = (S, Σ, δ, q₀, F), where:
State set S: {S₀, S₁, S₂, S₃, trace}, where S₀ denotes the initial parsing state, S₁ denotes the state after XMLDecl has been parsed, S₂ denotes the state reached after doctypedecl has been parsed, S₃ denotes the state in which parsing of content starts after the STag of the root element has been parsed, and trace denotes that an ETag has been parsed and the backtracking state must be entered;
Input symbol set Σ:
{XMLDecl, Misc, doctypedecl, EmptyElemTag, STag, Reference, CDSect, CharData, PI, CDSection, Comment, ETag};
Initial state q₀: S₀;
Final state set F: {S₁, S₂};
State stack symbols: {S₁, S₂, S₃, Z}, where Z denotes the bottom of the stack;
Transfer function δ: S × Σ → S ∪ {trace}, which is the set of the following transitions:
(1) δ(S₀, XMLDecl) = S₁: in the initial state, if the token read is the XML declaration XMLDecl, transfer to state S₁;
(2) δ(S₁, Misc) = S₁: in state S₁, if the token read is Misc, that is, white space, a comment, or a processing instruction in the XML document, Misc tokens are read in a loop;
(3) δ(S₁, STag) = S₃: in state S₁, if the token read is a start tag STag, the current state S₁ is pushed onto the top of the state stack, the start tag name is pushed onto the top of the namespace stack, and control transfers to state S₃;
(4) δ(S₁, doctypedecl) = S₂: in state S₁, if the token read is doctypedecl, transfer to state S₂;
(5) δ(S₂, Misc) = S₂: in state S₂, if the token read is Misc, that is, white space, a comment, or a processing instruction in the XML document, Misc tokens are parsed in a loop and the state stack does not change;
(6) δ(S₂, STag) = S₃: in state S₂, if the token read is a start tag STag, the start tag name is pushed onto the top of the namespace stack, the current state S₂ is pushed onto the top of the state stack, and control transfers to state S₃;
(7) δ(S₃, Content_item) = S₃: tokens that require no push are read in a loop, and the state stack does not change;
(8) δ(S₃, STag) = S₃: in state S₃, if the token read is a start tag STag, the current state S₃ is pushed onto the top of the state stack and the start tag name is pushed onto the top of the namespace stack;
(9) δ(S₃, ETag) = trace: in state S₃, if the token read is an end tag ETag, control transfers to the state p on top of the state stack and p is popped from the state stack; at the same time, the name on top of the namespace stack is popped and compared with the name of the ETag; if they are identical, the tags match correctly; otherwise an error is reported.
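Transitions (1) to (9) translate almost directly into code. The following sketch is an illustration under stated assumptions, not the patented implementation: the class, the Token type, and the helper methods are hypothetical, the lexical analyzer is replaced by a hand-built token list, a comment after the root element is assumed to arrive as a Misc token, and only the nine listed transitions are implemented (so any other state and token combination is rejected).

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.EnumSet;
import java.util.List;

// Hypothetical sketch of the automaton defined by transitions (1) to (9).
class XmlTraceableAutomaton {
    enum State { S0, S1, S2, S3 }

    enum Type { XMLDecl, Misc, doctypedecl, EmptyElemTag, STag,
                Reference, CDSect, CharData, PI, Comment, ETag }

    static final class Token {
        final Type type;
        final String name; // tag name for STag and ETag, otherwise null
        Token(Type type, String name) { this.type = type; this.name = name; }
    }

    static Token t(Type type) { return new Token(type, null); }
    static Token t(Type type, String name) { return new Token(type, name); }

    // Content_item tokens: read in state S3 without any stack operation.
    private static final EnumSet<Type> CONTENT_ITEM = EnumSet.of(
            Type.CharData, Type.Reference, Type.CDSect,
            Type.PI, Type.Comment, Type.EmptyElemTag);

    static boolean accepts(List<Token> tokens) {
        State q = State.S0;
        Deque<State> stateStack = new ArrayDeque<>();
        Deque<String> nameStack = new ArrayDeque<>();
        for (Token tok : tokens) {
            switch (q) {
                case S0:                                        // transition (1)
                    if (tok.type != Type.XMLDecl) return false;
                    q = State.S1;
                    break;
                case S1:
                case S2:
                    if (tok.type == Type.Misc) {
                        // transitions (2) and (5): stay in the same state
                    } else if (tok.type == Type.STag) {         // transitions (3) and (6)
                        stateStack.push(q);
                        nameStack.push(tok.name);
                        q = State.S3;
                    } else if (q == State.S1 && tok.type == Type.doctypedecl) {
                        q = State.S2;                           // transition (4)
                    } else {
                        return false;                           // undefined transition
                    }
                    break;
                case S3:
                    if (CONTENT_ITEM.contains(tok.type)) {
                        // transition (7): no stack operation
                    } else if (tok.type == Type.STag) {         // transition (8)
                        stateStack.push(q);
                        nameStack.push(tok.name);
                    } else if (tok.type == Type.ETag) {         // transition (9)
                        if (stateStack.isEmpty()) return false;
                        if (!nameStack.pop().equals(tok.name)) return false; // mismatch
                        q = stateStack.pop();                   // backtrack to saved state
                    } else {
                        return false;
                    }
                    break;
            }
        }
        return q == State.S1 || q == State.S2;                  // final states F
    }

    /** Assumed token stream for the books example of the description. */
    static List<Token> booksExample() {
        return Arrays.asList(
                t(Type.XMLDecl), t(Type.STag, "books"), t(Type.STag, "book"),
                t(Type.STag, "title"), t(Type.CharData), t(Type.ETag, "title"),
                t(Type.STag, "author"), t(Type.CharData), t(Type.ETag, "author"),
                t(Type.STag, "price"), t(Type.CharData), t(Type.ETag, "price"),
                t(Type.ETag, "book"), t(Type.ETag, "books"), t(Type.Misc));
    }
}
```

Calling XmlTraceableAutomaton.accepts(XmlTraceableAutomaton.booksExample()) walks through the same pushes and pops as the embodiment: S₁ is saved at the root start tag, S₃ at each nested start tag, and every end tag restores the saved state after its name has been checked. A full parser would also emit the SAX events at each step, which the sketch leaves out.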
CN201210118808.0A 2012-04-20 2012-04-20 JSAX (joint simple API (application program interface) for XML (extensible markup language)) parser and parsing method based on syntactic analysis of backtracking automaton Active CN102708155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210118808.0A CN102708155B (en) 2012-04-20 2012-04-20 JSAX (joint simple API (application program interface) for XML (extensible markup language)) parser and parsing method based on syntactic analysis of backtracking automaton

Publications (2)

Publication Number Publication Date
CN102708155A CN102708155A (en) 2012-10-03
CN102708155B true CN102708155B (en) 2015-02-18

Family

ID=46900922

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210118808.0A Active CN102708155B (en) 2012-04-20 2012-04-20 JSAX (joint simple API (application program interface) for XML (extensible markup language)) parser and parsing method based on syntactic analysis of backtracking automaton

Country Status (1)

Country Link
CN (1) CN102708155B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106657075B (en) * 2016-12-26 2019-11-15 东软集团股份有限公司 Multi-layer protocol analytic method, device and data matching method and device
CN107426211B (en) * 2017-07-25 2020-08-14 北京长亭未来科技有限公司 Network attack detection method and device, terminal equipment and computer storage medium
CN111176640B (en) * 2018-11-13 2022-05-13 武汉斗鱼网络科技有限公司 Layout level display method, storage medium, device and system in Android engineering
CN109947835B (en) * 2019-03-12 2023-05-23 东华大学 Printing and dyeing quotation structured demand data extraction method based on finite state automaton
CN115118793B (en) * 2022-06-14 2023-07-07 北京经纬恒润科技股份有限公司 BLF file analysis fault tolerance method and device and computer equipment
CN114781400B (en) * 2022-06-17 2022-09-09 之江实验室 Cross-media knowledge semantic expression method and device

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1991837A (en) * 2005-12-27 2007-07-04 国际商业机器公司 Structured document processing apparatus and method
CN101329665A (en) * 2007-06-18 2008-12-24 国际商业机器公司 Method for analyzing marking language document and analyzer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
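On Traceable Automata; Hao Kegang et al.; Chinese Journal of Computers; 19900531 (No. 5); 340-348 *
Design and Implementation of a High-Performance Java SAX Parser; Wang Jianchao; China Master's Theses Full-text Database; 20050815 (No. 4); 19-22, 34-40 *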

Also Published As

Publication number Publication date
CN102708155A (en) 2012-10-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant