CN102708155A - JSAX (joint simple API (application program interface) for XML (extensible markup language)) parser and parsing method based on syntactic analysis of backtracking automaton - Google Patents


Publication number
CN102708155A
Authority
CN
China
Prior art keywords
mark
state
stack
automat
xml
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012101188080A
Other languages
Chinese (zh)
Other versions
CN102708155B (en)
Inventor
段振华
张柯柯
王小兵
田聪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xidian University
Original Assignee
Xidian University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xidian University
Priority to CN201210118808.0A
Publication of CN102708155A
Application granted
Publication of CN102708155B
Legal status: Active

Abstract

The invention relates to a JSAX (joint simple API (application program interface) for XML (extensible markup language)) parser and a parsing method based on syntax analysis with a backtracking automaton. The action transition rule δ of the backtracking automaton is redefined, and the improved backtracking automaton is applied to the syntax analyzer, which simplifies the design and implementation of the syntax analyzer and effectively improves the efficiency of the XML parser. During syntax analysis, the backtracking automaton takes the token stream supplied by the lexical analyzer as input. When the token read is a start tag, the automaton pushes the current state onto the top of the stack; when the token read is an end tag, the automaton pops a state from the top of the stack and uses it as its next state; for any other token, the automaton performs no stack operation. During syntax analysis, information from an XML document that conforms to the syntax specification is returned to the user through standard callback functions. The JSAX parser and parsing method address the complex structure and low performance of the syntax analyzers in XML document parsers, are easy to implement and efficient, and can be applied to parsing XML documents.

Description

JSAX parser and parsing method based on backtracking-automaton syntax analysis
Technical field
The invention belongs to the field of Web technology. It mainly concerns parsing techniques for XML (eXtensible Markup Language) documents, and in particular XML document parsing based on the SAX (Simple API for XML) interface. Specifically, it is a JSAX parser and parsing method based on backtracking-automaton syntax analysis, applicable to parsing XML documents.
Background technology
In recent years, XML, being simple to use and flexible to apply, has been widely used for data transmission and exchange, data integration, and document storage in the Web environment. The most typical example is the Web Service: both the SOAP protocol and WSDL in Web Services are based on XML. XML also has many applications in fields such as mathematics, chemistry, and physics, for example the Chemical Markup Language (CML), which describes molecular information in chemistry. At the present stage of Web application technology, the XML document parser plays a critical role in general data exchange and processing. As XML-based applications become more widespread, industry and research place increasingly high demands on the performance of XML document parsers, and a high-performance XML parser is essential for improving the speed of parsing XML documents and the throughput of the system.
To meet these different needs, parsing interface standards such as the Document Object Model (DOM) and SAX have emerged.
The SAX interface is an event-based parsing API (application programming interface). A SAX parser adopts an event-based model: while parsing an XML document it fires a series of events for the user to handle. Commonly used event types are: startDocument, the start of the document; endDocument, the end of the document; startElement, the start of a tag; endElement, the end of a tag; characters, a text-content event; and IgnoreWhitespace, a whitespace event. Events that have been handled can be deleted from memory and the resources they occupy released. Because of its performance advantage and ease of use, SAX is widely adopted by developers and users.
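The event model described above can be illustrated with a short sketch. It uses the SAX implementation in the Python standard library (`xml.sax`), not the JSAX parser of this patent; the class name `EventLogger` and the helper `collect_events` are hypothetical names chosen for the illustration.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records SAX events in the order the parser reports them."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startDocument(self):
        self.events.append(("startDocument",))

    def endDocument(self):
        self.events.append(("endDocument",))

    def startElement(self, name, attrs):
        self.events.append(("startElement", name))

    def endElement(self, name):
        self.events.append(("endElement", name))

    def characters(self, content):
        # Text may arrive in several chunks; whitespace-only chunks are skipped here.
        if content.strip():
            self.events.append(("characters", content))


def collect_events(xml_text):
    """Parse xml_text and return the list of events reported to the handler."""
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events
```

Calling `collect_events("<a><b>hi</b></a>")` yields a list that starts with a `startDocument` event and ends with an `endDocument` event, with the element and text events in document order in between. The application never holds the whole document in memory, which is the property the text attributes to SAX.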
When a SAX parser parses an XML document, it must perform lexical analysis and syntax analysis on the document. The most common model for lexical analysis is the finite automaton (FA). According to the XML specification, the tokens that make up XML are described by a regular grammar, and an FA can recognize tokens described by a regular grammar. Because an FA is easy to construct and efficient to run, it is widely used in the design of lexical analyzers. For syntax analysis there are several options. Van Engelen used recursive descent parsers for the syntax analysis of XML documents. However, a recursive implementation must maintain the system stack, which consumes more space, and recursion brings a large number of function calls, which adds time overhead, so recursive descent parsing is not efficient. Another common tool for syntax analysis is the pushdown automaton (PDA). A PDA has greater recognition power than a finite automaton, but it is more complex to construct: at every step of syntax analysis, the move that changes the configuration must be determined from the current state, the current input, and the contents of the pushdown stack, after which the stack and the remaining input change and the automaton enters the next configuration. This makes the analysis inefficient.
Improving the efficiency of the syntax analyzer in an XML parser is therefore imperative. A high-performance XML parser greatly increases the speed of parsing XML documents and can effectively improve the response speed and throughput of the system.
A search of domestic and foreign patent documents and published journal articles found no reports or documents closely related or identical to the present invention.
Summary of the invention
The present invention mainly addresses the low efficiency and implementation difficulty of the syntax analyzer in XML parsers. By improving the backtracking automaton, it provides a new JSAX parser and parsing method based on backtracking-automaton syntax analysis that is efficient, easy to implement, and able to recognize languages of matched tag strings with nested structure. The invention can be applied to parsing XML documents.
The present invention is described in detail below.
The present invention is an XML parser and parsing method based on the SAX interface. It does not currently support parsing XML documents that use namespaces. The invention is a standard XML parser.
The present invention is a JSAX parser based on backtracking-automaton syntax analysis, comprising a lexical analyzer, a syntax analyzer, and an event handler. The lexical analyzer reads the content of the XML document and outputs the tokens it reads to the syntax analyzer. The syntax analyzer recognizes the language structure in the input token stream according to the requirements of the XML specification and passes the corresponding event information to the event handler. The event handler receives all events reported by the parser and processes the data found, thereby parsing the XML document. The syntax analyzer is built on an automaton: the backtracking automaton is a five-tuple M = (S, Σ, δ, q0, F), together with a state stack that saves part of the run history. The parser is characterized in that the syntax analyzer is implemented on a backtracking automaton, and that this automaton is an improved backtracking automaton whose action transition rule δ is redefined by the following rules:
(1) If δ(q, a) = p, then in state q, on reading token a, the current state q is pushed onto the top of the stack and control moves to state p; a denotes a token that requires a push action;
(2) If δ(q, b) = trace, then in state q, on reading token b, if the state stack is not empty, the top state p is popped from the state stack and control moves to state p; b denotes a token that requires a backtracking action;
(3) If δ(q, c) = p, then in state q, on reading token c, control moves to state p without any stack operation; c denotes a token that requires no stack operation;
(4) For δ(q, d), if d is the blank symbol (the blank symbol does not belong to the input alphabet and marks the end of the input string), the automaton halts; it accepts the input string when q ∈ F and rejects it when q ∉ F;
(5) If δ(q, e) is undefined, the automaton halts and rejects the input string.
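Rules (1) to (5) can be read as a small interpreter loop. The following is a minimal sketch under stated assumptions, not the patent's implementation: the function `accepts`, the constant `TRACE`, and the toy transition table `PAREN_DELTA` are hypothetical names, the end of the token sequence plays the role of the blank symbol, and a backtracking move on an empty stack is treated as a rejection.

```python
TRACE = "trace"  # special target: backtrack to the state on top of the stack


def accepts(delta, push_tokens, start, finals, tokens):
    """Run the automaton and return True if the token sequence is accepted.

    delta       -- dict mapping (state, token) to a next state or TRACE
    push_tokens -- tokens of kind 'a' in rule (1): the current state is pushed
    """
    state, stack = start, []
    for tok in tokens:
        target = delta.get((state, tok))
        if target is None:            # rule (5): undefined transition, reject
            return False
        if target == TRACE:           # rule (2): pop the stack and move there
            if not stack:
                return False          # assumption: empty stack means an error
            state = stack.pop()
        else:
            if tok in push_tokens:    # rule (1): push the current state first
                stack.append(state)
            state = target            # rules (1) and (3): move to p
    return state in finals            # rule (4): end of input, accept iff q in F


# Toy language: nested parentheses around 'x' characters.
PAREN_DELTA = {
    ("out", "("): "in",
    ("in", "("): "in",
    ("in", "x"): "in",
    ("in", ")"): TRACE,
}
```

With this table, `accepts(PAREN_DELTA, {"("}, "out", {"out"}, "(x(x))")` follows every closing parenthesis back to the state that was active when the matching opening parenthesis was read, so only balanced inputs end in the final state `"out"`.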
The syntax analyzer of the present invention is implemented on a backtracking automaton. The backtracking automaton is improved, and the improved automaton is applied to the design and implementation of the XML parser's syntax analyzer in order to simplify the syntax analyzer, which effectively improves the efficiency of the JSAX parser.
The invention is further realized in that a grammar form equivalent to the improved backtracking automaton is given:
A→aβ
where a ∈ T (a is a terminal symbol) and β ∈ N^0 ∪ N^1 ∪ N^2 (β is a string of zero, one, or two nonterminal symbols). When β contains two nonterminal symbols, the production has the structure A → aCA: the second nonterminal on the right-hand side of the production must be identical to the nonterminal on its left-hand side, where A and C are nonterminal symbols.
The descriptive power of this grammar is stronger than that of a regular grammar (RG) but weaker than that of a context-free grammar (CFG); it is a subclass of CFG, lying between RG and CFG.
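The restriction on productions can be checked mechanically. The sketch below is only an illustration of the form A → aβ; the function name `is_valid_production` and its parameters are hypothetical and do not come from the patent.

```python
def is_valid_production(lhs, terminal, tail, terminals, nonterminals):
    """Check that the production lhs -> terminal tail has the form A -> a beta.

    tail is the list of nonterminals that follows the leading terminal.
    """
    if lhs not in nonterminals or terminal not in terminals:
        return False
    # beta may hold at most two symbols, and all of them must be nonterminals.
    if len(tail) > 2 or any(sym not in nonterminals for sym in tail):
        return False
    # With two nonterminals the production must look like A -> a C A.
    return len(tail) < 2 or tail[1] == lhs
```

For example, with terminals `{"STag", "ETag"}` and nonterminals `{"A", "B"}`, the production A → STag B A passes the check, while A → STag A B fails because its second nonterminal differs from the left-hand side.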
The present invention not only redefines the action transition function of the backtracking automaton but also gives a grammar equivalent to the improved automaton; this grammar can describe languages of matched tag strings with nested hierarchical structure.
The invention is further realized as follows: the XML syntax definition is described using the grammar equivalent to the improved backtracking automaton, yielding syntax rules that describe XML documents; the improved backtracking automaton is constructed from these syntax rules; and the automaton recognizes the language structure in the token stream of the XML document, determines whether the document conforms to the syntax specification, and completes the syntax analysis.
The invention is further realized as follows: the rules describing the XML syntax definition are built in the grammar form "A → aβ", which is equivalent to the improved backtracking automaton. These syntax rules are:
document ::= prolog element Misc*
element ::= EmptyElemTag | A
A ::= STag B A
Content_item ::= CharData | Reference | CDSect | PI | Comment | EmptyElemTag
B ::= Content_item B
B ::= STag B B
B ::= ETag
A ::= Miscs
Miscs ::= ε | Misc Miscs
Here, document denotes the XML document; prolog describes the declaration information and the document type declaration doctypedecl; element denotes an element appearing in the XML document and describes a nested string of matched tags with hierarchical structure; STag denotes a start tag; CharData denotes character data; Reference denotes a reference; CDSect denotes a CDATA section; PI denotes a processing instruction; Comment denotes a comment; EmptyElemTag denotes an empty-element tag; Misc denotes the whitespace, processing instructions, and comments in the XML document. The start tags and end tags appearing in element must be correctly nested and matched. The redefined syntax rules describing the XML syntax definition can be expressed in the grammar equivalent to the backtracking automaton and belong to a subclass of context-free grammars.
The invention is further realized as follows: according to the syntax rules describing the XML syntax definition, an improved backtracking automaton is constructed that reads the language structure in the token stream output by the lexical analyzer for the XML document and completes the syntax analysis. The constructed improved backtracking automaton TA is:
M = (S, Σ, δ, q0, F), where:
M is the constructed backtracking automaton;
State set S: {S0, S1, S2, S3, trace}, where S0 is the initial parsing state; S1 is the state after XMLDecl has been parsed; S2 is the state reached after doctypedecl has been parsed; S3 is the state in which content is parsed after the STag of the root element has been parsed; and trace indicates that an ETag has been parsed and the automaton must enter the backtracking state.
Input symbol set Σ:
{XMLDecl,Misc,doctypedecl,EmptyElemTag,STag,Reference,CDSect,CharData,PI,CDSection,Comment,ETag};
Initial state q0: S0;
Final state set F: {S1, S2};
State stack: {S1, S2, S3};
Transition function δ: S × Σ → S ∪ {trace}, consisting of the following transitions:
(1) δ(S0, XMLDecl) = S1: in the initial state, if the token read is the XML declaration XMLDecl, move to state S1;
(2) δ(S1, Misc) = S1: in state S1, if the token read is Misc (whitespace, a comment, or a processing instruction), keep reading Misc tokens in a loop;
(3) δ(S1, STag) = S3: in state S1, if the token read is a start tag STag, push the current state S1 onto the top of the state stack, push the start-tag name onto the top of the name stack, and move to state S3;
(4) δ(S1, doctypedecl) = S2: in state S1, if the token read is doctypedecl, move to state S2;
(5) δ(S2, Misc) = S2: in state S2, if the token read is Misc (whitespace, a comment, or a processing instruction), keep parsing Misc tokens in a loop; the state stack does not change;
(6) δ(S2, STag) = S3: in state S2, if the token read is a start tag STag, push the start-tag name onto the top of the name stack, push the current state S2 onto the top of the state stack, and move to state S3;
(7) δ(S3, Content_item) = S3: tokens that require no push are read in a loop; the state stack does not change;
(8) δ(S3, STag) = S3: in state S3, if the token read is a start tag STag, push the current state S3 onto the top of the state stack and push the start-tag name onto the top of the name stack;
(9) δ(S3, ETag) = trace: in state S3, if the token read is an end tag ETag, pop the top state p from the state stack and move to state p; at the same time, pop the top name from the name stack and compare it with the name of the ETag. If the names are identical, the tags match correctly; otherwise an error is reported.
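Transitions (1) to (9) translate directly into a lookup table. The following is a minimal sketch under stated assumptions, not the patented implementation: tokens are assumed to arrive already classified as `(kind, name)` pairs, the names `DELTA`, `CONTENT`, `ParseError`, and `parse` are hypothetical, and only the nine transitions listed above are encoded.

```python
TRACE = "trace"
# Token kinds covered by Content_item in transition (7).
CONTENT = {"CharData", "Reference", "CDSect", "PI", "Comment", "EmptyElemTag"}

DELTA = {
    ("S0", "XMLDecl"): "S1",      # (1)
    ("S1", "Misc"): "S1",         # (2)
    ("S1", "STag"): "S3",         # (3)
    ("S1", "doctypedecl"): "S2",  # (4)
    ("S2", "Misc"): "S2",         # (5)
    ("S2", "STag"): "S3",         # (6)
    ("S3", "STag"): "S3",         # (8)
    ("S3", "ETag"): TRACE,        # (9)
}
DELTA.update({("S3", kind): "S3" for kind in CONTENT})  # (7)
FINAL = {"S1", "S2"}


class ParseError(Exception):
    pass


def parse(tokens):
    """tokens: iterable of (kind, name); name matters only for STag and ETag."""
    state, state_stack, name_stack = "S0", [], []
    for kind, name in tokens:
        target = DELTA.get((state, kind))
        if target is None:
            raise ParseError(f"unexpected {kind} in state {state}")
        if kind == "STag":
            state_stack.append(state)   # remember where to return to
            name_stack.append(name)     # remember which tag was opened
            state = target
        elif target == TRACE:
            if not state_stack:
                raise ParseError("end tag without a matching start tag")
            state = state_stack.pop()
            opened = name_stack.pop()
            if opened != name:
                raise ParseError(f"</{name}> does not match <{opened}>")
        else:
            state = target
    if state not in FINAL:
        raise ParseError("input ended before the document was complete")
    return True
```

Because the state stack records S1 or S2 under the root element, closing the root element returns the automaton to a final state, where only trailing Misc tokens are consumed by the listed transitions.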
The present invention redefines the action transition function of the backtracking automaton, gives the grammar equivalent to it, re-describes the XML syntax with this grammar to obtain syntax rules, and constructs the improved backtracking automaton from these rules. The automaton reads the language structure in the token stream and performs syntax analysis efficiently.
The present invention is also a JSAX parsing method based on backtracking-automaton syntax analysis. Using the JSAX parser described above, XML documents are parsed in the Eclipse environment. The parsing steps are:
Step 1. The lexical analyzer first reads the XMLDecl token at the start of the XML document. XMLDecl is parsed and checked against the XMLDecl specification; parsing continues if XMLDecl conforms, and otherwise an error is reported directly;
Step 2. For a conforming XMLDecl, determine whether the next token is Miscs; if so, parse Miscs in a loop;
Step 3. After Miscs has been parsed, determine whether the next token is doctypedecl; if so, go to step 4, otherwise go to step 5;
Step 4. Parse doctypedecl. After doctypedecl has been parsed, determine whether the next token is Miscs; if so, parse Miscs in a loop; otherwise go to step 6;
Step 5. After doctypedecl has been handled, determine whether the next token is an empty-element tag; if it is not, go to step 6; if it is, parse the empty-element tag and then go to step 10;
Step 6. Determine whether the next token is a start tag; if so, parse the start tag, push the current state onto the top of the state stack, and push the start-tag name onto the top of the name stack; otherwise report an error;
Step 7. If the next token requires no push operation, such as CharData, CDSection, Comment, Reference, PI, EmptyElemTag, or S, parse the corresponding token and continue with the next step; if the next token is a start tag STag, go to step 6;
Step 8. Determine whether the next token is an end tag; if so, parse the end tag, pop the top state of the state stack as the next state of the automaton, and at the same time pop the top name of the name stack and check whether it is identical to the name of the current end tag. If it is identical, go to the next step; otherwise the start tag and end tag do not match, and an error is reported;
Step 9. After an end tag has been parsed, check whether the state stack is empty; if it is empty, go to step 10, otherwise go to step 7;
Step 10. Determine whether the document contains further tokens. If not, the end of the XML document has been reached and parsing is complete. If so, determine whether the next token is Miscs: if it is not, report an error; if it is, parse the Miscs tokens until the end of the XML document is reached, which completes the parsing of the whole XML document.
Compared with the prior art, the present invention has the following advantages:
(1) The present invention improves the backtracking automaton so that the improved automaton can recognize languages of matched tag strings with nested structure, such as XML. It also gives the rules of the improved backtracking automaton and the grammar equivalent to it.
(2) Because the improved backtracking automaton is applied to the syntax analyzer of the JSAX parser, the design and implementation of the syntax analyzer are effectively simplified compared with a syntax analyzer implemented with a pushdown automaton.
(3) Because the improved backtracking automaton is used for syntax analysis, efficiency is significantly improved compared with syntax analyzers implemented with recursive descent subroutines or pushdown automata.
(4) The JSAX parser provided by the present invention meets the requirements of the SAX interface specification, and users can conveniently parse XML documents through the SAX interface.
Description of drawings
Fig. 1 is the state transition diagram of the backtracking automaton corresponding to the syntax analyzer of the JSAX parser of the present invention;
Fig. 2 is a schematic diagram of the architecture of the JSAX parser of the present invention;
Fig. 3 is a simple XML document;
Fig. 4 is the state transition diagram for the tokens in XML that the lexical analyzer of the JSAX parser of the present invention must recognize;
Fig. 5 is a flow diagram of the parsing method in the JSAX parser based on backtracking-automaton syntax analysis;
Fig. 6 is a diagram of the result after the present invention parses the XML document shown in Fig. 3;
Fig. 7 is a performance comparison chart of the JSAX parser of the present invention and the Xerces parser.
Embodiment
The present invention is an XML document parser and parsing method based on the SAX interface. It belongs to the field of Web technology and mainly concerns SAX-based parsing of XML documents. XML, the Extensible Markup Language, is a general data exchange language for computers and the Internet. With the widespread use of computers and the Internet in industrial production and daily life, XML is spreading into every field and playing an increasingly important role. Being simple to use and flexible to apply, XML is widely used for data transmission and exchange, data integration, and document storage in the Web environment; the most typical example is the Web Service, in which both the SOAP protocol and WSDL are based on XML. XML is used not only throughout computing and networking but also in fields such as machinery, physics, chemistry, and mathematics. Its use is growing steadily and developing rapidly on the Internet, and the XML parser plays a critical role in general data exchange and processing at the present stage of Web application technology. The present invention is exactly such an XML parser, applied to parsing XML documents.
The JSAX parser of the present invention adopts an event-based model and fires a series of events while parsing an XML document. Commonly used event types are: startDocument, the document start event; endDocument, the document end event; startElement, the tag start event; endElement, the tag end event; characters, the text-content event; and IgnoreWhitespace, the whitespace event. When the parser finds a specified tag, it generates an event report for the event handler. The event handler activates a callback method and tells it that the specified tag has been found, and the application can access the content of that tag through this method. Events that have been handled can be deleted from memory and the resources they occupy released, so a SAX parser uses very few system resources.
When a SAX parser parses an XML document, it must perform lexical and syntax analysis on the document. A syntax analyzer implemented by recursive descent is hard to construct and consumes more space. The pushdown automaton, the most common tool for syntax analysis, is powerful, but it is complex to construct and its analysis is not efficient either.
To address this problem, the present invention applies the improved backtracking automaton to the design and implementation of the XML parser's syntax analyzer in order to simplify the syntax analyzer, thereby effectively improving the efficiency of the JSAX parser.
The present invention is described below with reference to the accompanying drawings.
Embodiment 1
The JSAX parser of the present invention based on backtracking-automaton syntax analysis, referring to Fig. 2, comprises a lexical analyzer, a syntax analyzer, and an event handler. The lexical analyzer reads the content of the XML document and outputs the tokens it reads to the syntax analyzer. The syntax analyzer recognizes the language structure in the input token stream according to the requirements of the XML specification and passes the corresponding event information to the event handler. The event handler receives all events reported by the parser and processes the data found, thereby parsing the XML document. The syntax analyzer is built on an automaton: the backtracking automaton is a five-tuple M = (S, Σ, δ, q0, F), together with a state stack that saves part of the run history. The syntax analyzer of the present invention is implemented on a backtracking automaton.
Lexical analyzer: the lexical analyzer reads the content of the XML document, reads characters or strings and passes them to the syntax-analysis part, determines whether the tags and tokens that make up the XML document conform to the XML specification, and supplies the tokens as input to the backtracking automaton used for syntax analysis. Because the tokens in XML are described by a regular grammar, the present invention performs lexical analysis of XML documents by constructing a finite automaton.
Syntax analyzer: the syntax analyzer recognizes the language structure in the token stream supplied by the lexical analyzer according to the syntax rules and passes the corresponding event information to the event handler. For XML documents that are not well-formed, the JSAX parser reports the errors in the document. To perform syntax analysis of XML documents, the present invention improves the backtracking automaton so that the improved automaton can recognize languages of matched tag strings with nested hierarchical structure, such as XML. During syntax analysis, when the token read by the improved backtracking automaton is a start tag, the current state is pushed onto the top of the state stack. When the token read is an end tag, then if the state stack is not empty, the top state p is popped and control moves to state p; otherwise an error is reported. When the token read is any other token, no stack operation is performed. To match tags, a name stack is also needed to store start-tag names: when the token read is a start tag, its name is pushed onto the name stack; when the token read is an end tag, the top of the name stack is popped and compared with the end-tag name, and an error is reported if the names differ.
Event handler: the event handler receives all events reported by the parser, processes the data found, and returns the document information to the user for handling. Specifically, during syntax analysis, information from the XML document that conforms to the syntax specification is returned to the user through standard callback functions.
The backtracking automaton of the present invention is an improved backtracking automaton: its action transition rule δ is redefined by the following rules:
(1) If δ(q, a) = p, then in state q, on reading token a, the current state q is pushed onto the top of the stack and control moves to state p; a denotes a token that requires a push action. In the present invention, when the token read in state q is a start tag STag, the current state q is pushed, i.e. δ(q, STag) = p.
(2) If δ(q, b) = trace, then in state q, on reading token b, if the state stack is not empty, the top state p is popped from the state stack and control moves to state p; b denotes a token that requires a backtracking action. In the present invention, when the token read in state q is an end tag ETag, the automaton backtracks: it pops the top state p and moves to state p, i.e. δ(q, ETag) = trace.
(3) If δ(q, c) = p, then in state q, on reading token c, control moves to state p without any stack operation; c denotes a token that requires no stack operation. In the present invention, when the token Token read in state q is CDSect, PI, EmptyElemTag, Reference, Comment, or CharData, the state does not change and no stack operation is performed, i.e. δ(q, Token) = q.
(4) If d is the blank symbol (the blank symbol does not belong to the input alphabet and marks the end of the input string), the automaton halts; it accepts the input string when q ∈ F (q is a final state) and rejects it when q ∉ F. In the present invention, if the token read in state q is the blank token, the automaton halts and accepts the string when q is a final state, i.e. for δ(q, ε) with q ∈ F, the automaton halts and accepts the input string.
(5) If δ(q, e) is undefined, the automaton halts and rejects the input string.
The grammar form equivalent to the improved backtracking automaton is:
A→aβ
where a ∈ T (a is a terminal symbol) and β ∈ N^0 ∪ N^1 ∪ N^2 (β is a string of zero, one, or two nonterminal symbols). When β contains two nonterminal symbols, the production has the structure A → aCA: the second nonterminal on the right-hand side of the production must be identical to the nonterminal on its left-hand side, where A and C are nonterminal symbols.
The descriptive power of this grammar is stronger than that of a regular grammar (RG) but weaker than that of a context-free grammar (CFG); it is a subclass of CFG, lying between RG and CFG.
The XML syntax definition is described using the grammar equivalent to the improved backtracking automaton, yielding syntax rules that describe XML documents. The improved backtracking automaton is constructed from these syntax rules; it recognizes the language structure in the token stream of the XML document, determines whether the document conforms to the syntax specification, and completes the syntax analysis.
The rules describing the XML syntax definition are built in the grammar form "A → aβ", which is equivalent to the improved backtracking automaton, and specifically comprise:
document ::= prolog element Misc*
element ::= EmptyElemTag | A
A ::= STag B A
Content_item ::= CharData | Reference | CDSect | PI | Comment | EmptyElemTag
B ::= Content_item B
B ::= STag B B
B ::= ETag
A ::= Miscs
Miscs ::= ε | Misc Miscs
Here, document denotes the XML document; prolog describes the declaration information and the DTD (doctypedecl); element describes a nested, hierarchically structured string of matching tags, and the tags that appear in element must be correctly nested and matched; STag denotes a start tag; ETag denotes an end tag; CharData denotes character data; Reference denotes a reference; CDSect denotes a CDATA section; PI denotes a processing instruction; Comment denotes a comment; EmptyElemTag denotes an empty-element tag; Misc* denotes the whitespace, processing instructions, and comments in the XML document; B is a nonterminal that can be replaced by the end tag ETag or by STag B B; A is a nonterminal that can be replaced by Miscs or by STag B A.
Based on the rules describing the XML syntax definition, the improved backtracking automaton is constructed. It reads the language structures in the token stream that the lexical analyzer outputs from the XML document and completes syntax analysis. Referring to Fig. 1, the constructed improved backtracking automaton TA is:
M = (S, Σ, δ, q₀, F), where:
M denotes the constructed backtracking automaton;
State set S: {S₀, S₁, S₂, S₃, trace}, where S₀ is the initial parsing state, S₁ is the state after XMLDecl has been parsed, S₂ is the state reached after doctypedecl has been parsed, S₃ is the state in which content is parsed after the root element's STag has been parsed, and trace indicates that an ETag has been parsed and the automaton must enter the backtracking state.
Input symbol set Σ:
{XMLDecl, Misc, doctypedecl, EmptyElemTag, STag, Reference, CDSect, CharData, PI, CDSection, Comment, ETag};
Initial state q₀: S₀
Final state set F: {S₁, S₂};
State stack: {S₁, S₂, S₃, Z} (where Z denotes the bottom of the stack);
Transition function δ: S × Σ → S ∪ {trace}. Referring to Fig. 1, the transition function is the following set of transitions:
(1) δ(S₀, XMLDecl) = S₁: in the initial state, if the token read is the XML declaration XMLDecl, move to state S₁;
(2) δ(S₁, Misc) = S₁: in state S₁, if the token read is Misc (whitespace, a comment, or a processing instruction), keep reading Misc tokens in a loop;
(3) δ(S₁, STag) = S₃: in state S₁, if the token read is a start tag STag, push the current state S₁ onto the state stack, push the start-tag name onto the name stack, and move to state S₃;
(4) δ(S₁, doctypedecl) = S₂: in state S₁, if the token read is doctypedecl, move to state S₂;
(5) δ(S₂, Misc) = S₂: in state S₂, if the token read is Misc (whitespace, a comment, or a processing instruction), keep parsing Misc tokens in a loop; the state stack is unchanged;
(6) δ(S₂, STag) = S₃: in state S₂, if the token read is a start tag STag, push the start-tag name onto the name stack, push the current state S₂ onto the state stack, and move to state S₃;
(7) δ(S₃, Content_item) = S₃: keep reading tokens that need no push operation; the state stack is unchanged;
(8) δ(S₃, STag) = S₃: in state S₃, if the token read is a start tag STag, push the current state S₃ onto the state stack and push the start-tag name onto the name stack;
(9) δ(S₃, ETag) = trace: in state S₃, if the token read is an end tag ETag, move to the state p on top of the state stack and pop p; at the same time pop the top of the name stack and compare it with the name of the ETag. If they are identical, the tags match correctly; otherwise report an error.
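The nine transitions above can be sketched in Java as follows. This is a minimal illustration, not the patented implementation: the class, enum, and method names are hypothetical, the lexer is assumed to have already classified every token, and all tokens that need no stack operation inside an element are collapsed into a single CONTENT_ITEM kind.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Hypothetical sketch of the automaton TA running over a pre-classified token stream. */
final class BacktrackingXmlChecker {
    enum State { S0, S1, S2, S3 }
    enum Kind { XML_DECL, MISC, DOCTYPE, STAG, ETAG, CONTENT_ITEM }

    /** A lexer token; name is only meaningful for STAG and ETAG. */
    static final class Token {
        final Kind kind;
        final String name;
        Token(Kind kind, String name) { this.kind = kind; this.name = name; }
    }

    /** Returns true if the token stream is accepted by transitions (1)-(9). */
    static boolean accepts(List<Token> tokens) {
        Deque<State> stateStack = new ArrayDeque<>();
        Deque<String> nameStack = new ArrayDeque<>();
        State q = State.S0;
        for (Token t : tokens) {
            switch (t.kind) {
                case XML_DECL:                       // transition (1)
                    if (q != State.S0) return false;
                    q = State.S1;
                    break;
                case DOCTYPE:                        // transition (4)
                    if (q != State.S1) return false;
                    q = State.S2;
                    break;
                case MISC:                           // transitions (2) and (5)
                    if (q != State.S1 && q != State.S2) return false;
                    break;
                case CONTENT_ITEM:                   // transition (7): no stack operation
                    if (q != State.S3) return false;
                    break;
                case STAG:                           // transitions (3), (6), (8): push
                    if (q == State.S0) return false;
                    stateStack.push(q);
                    nameStack.push(t.name);
                    q = State.S3;
                    break;
                case ETAG:                           // transition (9): backtrack
                    if (q != State.S3) return false;
                    if (!nameStack.pop().equals(t.name)) return false; // tag mismatch
                    q = stateStack.pop();
                    break;
            }
        }
        return q == State.S1 || q == State.S2;       // end of input in a final state
    }
}
```

Because a state is pushed only on a start tag and popped only on an end tag, the automaton can return to S₁ or S₂ only after every open element has been closed, which is why the final-state test alone is enough at the end of input.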
The present invention improves the backtracking automaton, provides a grammar equivalent to the improved backtracking automaton, and uses that grammar to redefine the syntax rules of XML. From the resulting syntax rules for XML documents, the improved backtracking automaton is constructed and used to recognize the language structures in the input stream. To complete syntax analysis of an XML document, when the automaton reads a start tag STag it moves to the next state and pushes the current state onto the state stack; when it reads an end tag ETag it moves to the state p on top of the state stack and pops p; when it reads any other token, such as Comment, CDSect, PI, or CharData, it performs no stack operation; and when it reads the empty (end-of-input) token in a final state it accepts the document, which shows that the document conforms to the XML syntax rules. Introducing the backtracking automaton effectively simplifies the design and implementation of the syntax analyzer and improves the efficiency of the parser.
The present invention is also a JSAX parsing method based on backtracking-automaton syntax analysis, carried out on the JSAX parser described above. Referring to Fig. 5, parsing an XML document comprises the following steps:
Step 1. At the start, the lexical analyzer reads the XMLDecl token in the XML document and parses it to judge whether it conforms to the XMLDecl specification. If it conforms, parsing continues; otherwise an error is reported directly;
Step 2. For a conforming XMLDecl, determine whether the next token is Miscs; if so, parse the Miscs in a loop;
Step 3. After the Miscs have been parsed, determine whether the next token is doctypedecl; if so, go to step 4, otherwise go to step 5;
Step 4. Parse doctypedecl. Afterwards, determine whether the next token is Miscs; if so, parse the Miscs in a loop; otherwise go to step 6;
Step 5. After doctypedecl has been handled, determine whether the next token is an empty-element tag. If it is not, go to step 6; if it is, parse the empty-element tag and then go to step 10;
Step 6. Determine whether the next token is a start tag. If so, parse the start tag, push the current state onto the state stack, and push the start-tag name onto the name stack; otherwise report an error.
Step 7. If the next token needs no push operation, such as CharData, CDSection, Comment, Reference, PI, EmptyElemTag, or S, parse that token and continue with the next step; if the next token is a start tag STag, go to step 6;
Step 8. Determine whether the next token is an end tag. If so, parse the end tag, pop the top state of the state stack as the automaton's next state, and at the same time pop the top of the name stack and check whether it is identical to the name of the current end tag. If it is identical, go to the next step; otherwise the start tag and end tag do not match, and an error is reported;
Step 9. After an end tag has been parsed, check whether the state stack is empty; if it is empty, go to step 10, otherwise go to step 7;
Step 10. Determine whether the document contains further tokens. If not, the end of the XML document has been reached and parsing finishes. If so, determine whether the next token is Miscs: if not, report an error; if so, parse the Miscs tokens until the end of the XML document is reached, which completes the parsing of the whole XML document.
The present invention provides not only a JSAX parser based on backtracking-automaton syntax analysis but also the grammar of the backtracking automaton and the concrete procedure and steps for parsing an XML document. Compared with a syntax analyzer implemented with a pushdown automaton, the parsing steps of the invention are greatly simplified. Because the improved backtracking automaton is used for syntax analysis, efficiency is markedly higher than that of syntax analyzers implemented with recursive-descent subroutines or pushdown automata.
Embodiment 2
The JSAX parser and parsing method based on backtracking-automaton syntax analysis are the same as in Embodiment 1; the invention is further described in detail from the perspective of the parser's composition.
The JSAX parser based on backtracking-automaton syntax analysis mainly comprises a lexical analyzer, a syntax analyzer, and an event handler.
Design and implementation of the lexical analyzer of the JSAX parser based on backtracking-automaton syntax analysis:
Because finite automata (FA) are easy to construct and analyze input efficiently, they are widely used in the design of lexical analyzers. The JSAX parser of the invention is an XML document parser implemented in Java on the basis of the SAX interface, and it likewise performs lexical analysis by constructing FAs.
Referring to Fig. 4, the lexical analyzer is responsible for reading the content of the XML document. Its implementation requires constructing finite automata, which read characters or character strings in the XML document and pass them to the syntax analyzer as a token stream.
State transition diagrams are drawn from the productions in the XML specification that are relevant to lexical analysis, and the code is written from the resulting diagrams.
To recognize each kind of token, a finite automaton that recognizes that token must be constructed; whether the automaton accepts the token then determines whether the token conforms to the XML specification. Let ch be the current character read by the lexical analyzer. The construction of the lexical analyzer's finite automata comprises:
A. Constructing the finite automaton that reads a single character:
The production in the XML specification that describes a single character is:
[2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
Referring to Fig. 4(a), the state transition diagram of the finite automaton that reads a single character, the automaton is constructed from production [2] as follows: read a character ch in the initial state; if ch is a character allowed by production [2] of the XML specification, move to the final state and accept it; otherwise report an error.
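Production [2] translates directly into a range test on a code point. The following is a minimal sketch; the class and method names are hypothetical.

```java
/** Hypothetical helper: the single-character automaton as a range check. */
final class XmlChars {
    /** Returns true if the code point is allowed by production [2] Char. */
    static boolean isXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }
}
```

Working on `int` code points rather than `char` values lets the last range, which lies outside the Basic Multilingual Plane, be tested without handling surrogate pairs by hand.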
B. Constructing the finite automaton that reads name tokens:
The productions in the XML specification that describe name tokens are [4], [4a], and [5]:
[4] NameStart ::= [A-Z] | "_" | [a-z] | Extender
[4a] NameChar ::= NameStart | ":" | "-" | "." | [0-9] | CombiningChar
[5] Name ::= NameStart (NameChar)*
Referring to Fig. 4(b), the state transition diagram of the finite automaton that reads name tokens, the lexical analyzer of the JSAX parser reads a legal name token as follows:
Step B1. In the initial state S₀, read a character and determine whether it is a legal NameStart, that is, a legal first character of a name; if so, go to step B2, otherwise go to step B3;
Step B2. In state S₁, read NameChar characters in a loop until the character read is not a NameChar, then enter the end state and return successfully;
Step B3. Report an error: an illegal name token was read; return.
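Steps B1 to B3 can be sketched as a two-state scanner. This is an illustration only: the names are hypothetical, and the Extender and CombiningChar alternatives of productions [4] and [4a] are deliberately left out, so only the ASCII part of the character classes is covered.

```java
/** Hypothetical sketch of the name automaton of steps B1-B3 (ASCII subset only). */
final class NameScanner {
    static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }

    static boolean isNameChar(char c) {
        return isNameStart(c) || c == ':' || c == '-' || c == '.' || (c >= '0' && c <= '9');
    }

    /**
     * Scans a Name starting at pos.
     * Returns the index just past the name, or -1 if no legal name starts at pos.
     */
    static int scanName(String in, int pos) {
        // Step B1: the first character must be a NameStart; otherwise step B3 (error).
        if (pos >= in.length() || !isNameStart(in.charAt(pos))) return -1;
        int i = pos + 1;
        // Step B2: loop while the next character is a NameChar.
        while (i < in.length() && isNameChar(in.charAt(i))) i++;
        return i;
    }
}
```

Returning the end index rather than the name itself lets a caller such as a start-tag scanner continue from where the name stopped.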
C. Constructing the finite automaton that reads the start-tag (STag) token:
The productions in the XML specification that describe the start tag are:
[40] STag ::= '<' Name (S Attribute)* S? '>'
[41] Attribute ::= Name Eq AttValue
Referring to Fig. 4(c), the state transition diagram of the finite automaton that reads the start-tag token, reading an STag comprises the following steps:
Step C1. In the initial state S₀, read a character; if it is '<', go to step C2;
Step C2. In state S₁, read the tag name;
Step C3. In state S₂, read a character; if it is '>', move to state S₃, accept, and return; if it is whitespace (Space), go to step C4;
Step C4. In state S₄, read a character; if it is '>', enter the final state, accept the token, and return; otherwise, if the next token is a Name, go to step C5;
Step C5. In state S₆, read a character; if it is '=', go to step C6;
Step C6. In state S₇, read the next token; if it is an attribute value AttValue, return to state S₂ (step C3).
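Steps C1 to C6 can be sketched as a loop over attributes. This is a simplified illustration with hypothetical names: names are limited to ASCII, Eq is taken to be a bare '=' as in step C5, and an attribute value is any quoted string that contains no '<'.

```java
/** Hypothetical sketch of the start-tag automaton of steps C1-C6. */
final class STagScanner {
    private static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }
    private static boolean isNameChar(char c) {
        return isNameStart(c) || c == ':' || c == '-' || c == '.' || (c >= '0' && c <= '9');
    }
    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }
    /** Returns the index just past a Name starting at pos, or -1. */
    private static int scanName(String in, int pos) {
        if (pos >= in.length() || !isNameStart(in.charAt(pos))) return -1;
        int i = pos + 1;
        while (i < in.length() && isNameChar(in.charAt(i))) i++;
        return i;
    }
    /** Returns the index just past a quoted value starting at pos, or -1. */
    private static int scanAttValue(String in, int pos) {
        if (pos >= in.length()) return -1;
        char quote = in.charAt(pos);
        if (quote != '"' && quote != '\'') return -1;
        int close = in.indexOf(quote, pos + 1);
        if (close < 0 || in.substring(pos + 1, close).indexOf('<') >= 0) return -1;
        return close + 1;
    }

    /** Returns true if the whole input is one start tag. */
    static boolean isSTag(String in) {
        int n = in.length();
        if (n == 0 || in.charAt(0) != '<') return false;      // step C1
        int i = scanName(in, 1);                              // step C2
        if (i < 0) return false;
        while (true) {
            if (i >= n) return false;
            if (in.charAt(i) == '>') return i == n - 1;       // step C3: accept
            if (!isSpace(in.charAt(i))) return false;
            while (i < n && isSpace(in.charAt(i))) i++;       // step C4: skip S
            if (i >= n) return false;
            if (in.charAt(i) == '>') return i == n - 1;       // step C4: accept
            i = scanName(in, i);                              // attribute name
            if (i < 0 || i >= n || in.charAt(i) != '=') return false; // step C5
            i = scanAttValue(in, i + 1);                      // step C6, then back to C3
            if (i < 0) return false;
        }
    }
}
```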
D. Constructing the finite automaton that reads the end tag ETag:
The productions in the XML specification that describe the end tag are:
[5] Name ::= NameStartChar (NameChar)*
[42] ETag ::= '</' Name S? '>'
Referring to Fig. 4(d), the state transition diagram of the finite automaton that reads the end-tag token, reading an ETag comprises the following steps:
Step D1. Read '<';
Step D2. Read '/';
Step D3. Read the name, beginning with a NameStart character;
Step D4. Read a character; if it is '>', enter the end state: the ETag token has been read successfully; otherwise go to step D5;
Step D5. If the next character is whitespace, keep reading characters until one is not whitespace; if that character is '>', enter the end state: the ETag has been read successfully.
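Steps D1 to D5 can be sketched as a short linear scan. The names below are hypothetical, and names are limited to ASCII letters, digits, and the punctuation listed in production [4a].

```java
/** Hypothetical sketch of the end-tag automaton of steps D1-D5. */
final class ETagScanner {
    private static boolean isNameStart(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') || c == '_';
    }
    private static boolean isNameChar(char c) {
        return isNameStart(c) || c == ':' || c == '-' || c == '.' || (c >= '0' && c <= '9');
    }
    private static boolean isSpace(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n';
    }

    /** Returns the tag name if the whole input is an end tag, otherwise null. */
    static String scanETag(String in) {
        int n = in.length();
        // Steps D1 and D2: the tag must open with "</".
        if (n < 2 || in.charAt(0) != '<' || in.charAt(1) != '/') return null;
        int i = 2;
        // Step D3: read the name.
        if (i >= n || !isNameStart(in.charAt(i))) return null;
        i++;
        while (i < n && isNameChar(in.charAt(i))) i++;
        String name = in.substring(2, i);
        // Step D5: skip optional whitespace.
        while (i < n && isSpace(in.charAt(i))) i++;
        // Step D4: the tag must close with '>' as its last character.
        return (i == n - 1 && in.charAt(i) == '>') ? name : null;
    }
}
```

Returning the name gives the syntax analyzer the value it needs to compare against the top of the name stack.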
In the lexical analyzer every kind of token is recognized by constructing a finite automaton. The construction of the automata that recognize the start-tag (STag) and end-tag (ETag) tokens has been given here from the productions of the XML specification. Other tokens are recognized in the same way, by constructing finite automata from the productions of the XML specification, and are not listed one by one.
Embodiment 3
The composition and syntax rules of the JSAX parser based on backtracking-automaton syntax analysis are the same as in Embodiments 1-2, as is the JSAX parsing method.
The specific improvements to the backtracking automaton are described with reference to the drawings.
The syntax analyzer of the invention is based on the backtracking automaton, which is defined as follows: a deterministic backtracking automaton DTA is a five-tuple M = (S, Σ, δ, q₀, F), where
M denotes the constructed backtracking automaton;
S = {S₀, S₁, ..., Sₙ} is a non-empty set of states;
Σ is the input character set;
q₀ ∈ S is the initial state;
F ⊆ S is the non-empty set of final states;
δ is a mapping from S × Σ to S ∪ {trace}.
The backtracking automaton consists of an input tape, a state stack, and a finite control. Initially the read head points to the leftmost symbol on the input tape, the state stack is empty, and the finite control is in state q₀. At each step of operation the finite control determines the transition action from the current state q and the token a under the read head, according to the state transition function δ. The following cases arise:
(3.1.1) if δ(q, a) = p, push state q onto the stack, move control to p, and move the read head one position to the right (the push rule);
(3.1.2) if δ(q, a) = trace and the stack is not empty, move control to the state p on top of the stack, pop p, and move the read head one position to the right (the backtracking rule); if the stack is empty, halt and reject;
(3.1.3) if a is the blank symbol (which does not belong to the input character set and marks the end of the string), halt, accepting the input string when q ∈ F and rejecting it when q ∉ F;
(3.1.4) if δ(q, a) is undefined, halt and reject the input string.
The backtracking automaton above can recognize the language of paired-bracket strings. In addition to bracket-like tags, however, the XML language contains much character data that does not need to be paired. The automaton above does not classify the token a under the read head (a refers to any such token), so it cannot recognize a language such as XML and must be improved. The invention replaces the 4 rules of the backtracking automaton with the following 5 rules and divides the tokens under the read head into five classes a, b, c, d, and e:
1) if δ(q, a) = p: when token a is read, the current state q is pushed onto the stack and the automaton moves to state p; a is called a token that requires a push operation;
2) if δ(q, b) = trace: when token b is read and the state stack is not empty, the top state p is popped and the automaton moves to state p; b is called a token that requires backtracking;
3) if δ(q, c) = p: when token c is read, no stack operation is performed; c is called a token that requires no stack operation. By introducing this rule, the improved backtracking automaton can recognize tag-matching string languages with a nested hierarchical structure, such as XML, which increases the recognition power of the backtracking automaton;
4) for δ(q, d): if d is the blank symbol (which does not belong to the input character set and marks the end of the string), halt, accepting the input string when q ∈ F and rejecting it when q ∉ F;
5) if δ(q, e) is undefined, halt and reject the input string.
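The five rules can be sketched as a small table-driven automaton in Java. This is an illustrative reading of the rules rather than the patented code: the class, its methods, and the convention of registering a class (push, backtrack, or plain) per token are all hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical table-driven sketch of the improved backtracking automaton. */
final class ImprovedBacktrackingAutomaton {
    enum TokenClass { PUSH, BACKTRACK, PLAIN }
    static final String TRACE = "trace";

    private final Map<String, Map<String, String>> delta = new HashMap<>();
    private final Map<String, TokenClass> classes = new HashMap<>();
    private final Set<String> finalStates = new HashSet<>();
    private final String start;

    ImprovedBacktrackingAutomaton(String start, String... finals) {
        this.start = start;
        finalStates.addAll(Arrays.asList(finals));
    }

    /** Registers delta(q, token) = target and the class of the token. */
    void rule(String q, String token, TokenClass cls, String target) {
        delta.computeIfAbsent(q, k -> new HashMap<>()).put(token, target);
        classes.put(token, cls);
    }

    boolean accepts(List<String> input) {
        Deque<String> stack = new ArrayDeque<>();
        String q = start;
        for (String tok : input) {
            String target = delta.getOrDefault(q, Collections.emptyMap()).get(tok);
            if (target == null) return false;          // rule 5: delta undefined
            switch (classes.get(tok)) {
                case PUSH:                             // rule 1: push q, go to p
                    stack.push(q);
                    q = target;
                    break;
                case BACKTRACK:                        // rule 2: pop p, go to p
                    if (stack.isEmpty()) return false;
                    q = stack.pop();
                    break;
                default:                               // rule 3: no stack operation
                    q = target;
            }
        }
        return finalStates.contains(q);                // rule 4: end of input
    }
}
```

For example, with start and final state q0, the rules `rule("q0", "(", PUSH, "q1")`, `rule("q1", "(", PUSH, "q1")`, `rule("q1", "x", PLAIN, "q1")`, and `rule("q1", ")", BACKTRACK, TRACE)` describe bracket strings that may contain unpaired `x` tokens, which is the kind of language the four original rules could not handle.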
The invention not only improves the backtracking automaton and increases its recognition power, but also provides a grammar equivalent to the improved backtracking automaton, so that the backtracking automaton has both rules and a grammar, which effectively extends its range of application.
Embodiment 4
The JSAX parser and parsing method based on backtracking-automaton syntax analysis are the same as in Embodiments 1-3.
Design and implementation of the syntax analyzer of the JSAX parser based on backtracking-automaton syntax analysis:
The JSAX parser is an XML parser based on the SAX interface; it completes syntax analysis of XML documents by introducing the improved backtracking automaton.
(1) Constructing the grammar that describes an XML document:
The production in the XML specification that defines a document is:
[1] document ::= prolog element Misc*
The start symbol is document, and the symbols that need to be derived are prolog and element, which begin with lowercase letters. According to the XML syntax definition, these symbols are replaced by the productions of prolog and element.
First, the production of prolog is transformed.
The original productions describing an XML document are as follows:
[1] document ::= prolog element Misc*
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
[28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' intSubset ']' S?)? '>'
[28b] intSubset ::= (markupdecl | DeclSep)*
[29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment
[45] elementdecl ::= '<!ELEMENT' S Name S contentspec S? '>'
[46] contentspec ::= 'EMPTY' | 'ANY' | Mixed | children
[47] children ::= (choice | seq) ('?' | '*' | '+')?
[48] cp ::= (Name | choice | seq) ('?' | '*' | '+')?
[49] choice ::= '(' S? cp (S? '|' S? cp)+ S? ')'
[50] seq ::= '(' S? cp (S? ',' S? cp)* S? ')'
The doctypedecl production can be transformed into:
doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' (elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment | DeclSep)* ']' S?)? '>'
According to the XML specification, every symbol in the specification's productions that begins with a capital letter denotes a regular language. Among the items in doctypedecl only elementdecl is non-regular: in elementdecl, the expressions for choice and seq refer to cp, while the expression for cp refers to choice and seq, which constitutes a recursive definition. A finite number of states therefore cannot determine, at the start of parsing, the symbol (such as '|') that follows a cp. However, contentspec serves to describe "validity" constraints on element structure. The invention is a standard XML parser and need not perform validity checking; that is, it need not be concerned with the specific content of contentspec. The contentspec production can therefore be changed so that contentspec is handled as a simple character string:
contentspec ::= [^>]*
It is only required that the content of contentspec not contain the tag terminator '>'. contentspec thereby becomes a regular grammar.
The "*" operator is replaced by a right-recursive definition and the "?" operator by an alternative with the empty string ε, which transforms prolog equivalently:
[22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
is equivalently transformed into:
prolog ::= (XMLDecl | ε) Miscs (doctypedecl Miscs | ε)
Miscs ::= Misc Miscs | ε
In this way the expression for prolog also becomes a regular grammar.
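Since the transformed prolog is regular, its shape can be checked with an ordinary regular expression over token codes. The sketch below is only an illustration of that point: the one-letter encoding (X for XMLDecl, M for Misc, D for doctypedecl) and the class name are assumptions introduced here.

```java
import java.util.regex.Pattern;

/** Hypothetical check of the prolog shape over one-letter token codes. */
final class PrologShape {
    // (XMLDecl | ε) Miscs (doctypedecl Miscs | ε) with X = XMLDecl, M = Misc, D = doctypedecl
    private static final Pattern PROLOG = Pattern.compile("X?M*(DM*)?");

    /** Returns true if the sequence of token codes forms a prolog. */
    static boolean isProlog(String tokenCodes) {
        return PROLOG.matcher(tokenCodes).matches();
    }
}
```

Under this encoding `isProlog("XMMDM")` holds, while `isProlog("XDD")` does not, because at most one doctypedecl may appear.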
The element production is then transformed:
[39] element ::= EmptyElemTag | STag content ETag
[43] content ::= CharData? ((element | Reference | CDSect | PI | Comment) CharData?)*
Using the identity CharData? ((other) CharData?)* ::= (CharData | other)*, content is transformed into:
content ::= (element | Reference | CDSect | PI | Comment | CharData) content | ε
Substituting element ::= EmptyElemTag | STag content ETag into content then gives:
content ::= (STag content ETag | EmptyElemTag | Reference | CDSect | PI | Comment | CharData) content | ε
For convenience of later expression, the invention introduces the nonterminal Content_item, defined as:
Content_item ::= EmptyElemTag | Reference | CDSect | PI | Comment | CharData
The element and content productions are therefore transformed into:
element ::= EmptyElemTag | STag content ETag
content ::= (STag content ETag | Content_item) content | ε
Here element describes the elements that appear in an XML document. An element is either an EmptyElemTag (empty-element tag) or a non-empty element, which consists of an STag, an ETag, and the content string that appears between the start tag and the end tag. content is a sequence of non-empty elements or Content_items; the sequence described by content contains the same number of start tags and end tags, and these start and end tags must be correctly nested and matched.
The invention uses the grammar equivalent to the improved backtracking automaton to describe the syntax rules of element again:
document ::= prolog element Miscs
element ::= EmptyElemTag | A
A ::= STag B A
Content_item ::= CharData | Reference | CDSect | PI | Comment | EmptyElemTag
B ::= Content_item B
B ::= STag B B
B ::= ETag
A ::= Miscs
Here, document denotes the XML document; prolog describes the declaration information and the DTD (doctypedecl); element describes a nested, hierarchically structured string of matching tags, and the tags that appear in element must be correctly nested and matched; STag denotes a start tag; ETag denotes an end tag; CharData denotes character data; Reference denotes a reference; CDSect denotes a CDATA section; PI denotes a processing instruction; Comment denotes a comment; EmptyElemTag denotes an empty-element tag; Misc* denotes the whitespace, processing instructions, and comments in the XML document; B is a nonterminal that can be replaced by the end tag ETag or by STag B B; A is a nonterminal that can be replaced by Miscs or by STag B A.
Finally, from the production of Misc:
Misc ::= Comment | PI | S
it can be seen that Misc can be described by a regular expression.
In summary, for an XML document, document ::= prolog element Misc*, prolog is a regular grammar, element can be described by the grammar equivalent to the improved backtracking automaton, and Misc* is a regular grammar. Regular grammars are a subclass of the grammars equivalent to the improved backtracking automaton. By automata theory, an improved backtracking automaton can therefore be constructed to recognize the language structures in the token stream of an XML document and complete its syntax analysis.
(2) Constructing the backtracking automaton (in the invention, the improved backtracking automaton) to perform syntax analysis:
The grammar describing an XML document in (1) can be expressed with the grammar equivalent to the improved backtracking automaton, so the improved backtracking automaton is constructed to complete syntax analysis of XML documents.
To check whether start tags and end tags match, a name stack is also needed. When a start tag is encountered, the name of the STag is pushed onto the name stack; when an end tag ETag is encountered, the top symbol of the name stack is popped and compared with the name of the ETag, and if the two differ a tag-mismatch error is reported. No stack operation is needed for other tags and tokens.
Referring to Fig. 1, the backtracking automaton TA constructed for XML syntax analysis is:
M = (S, Σ, δ, q₀, F), where
State set S: {S₀, S₁, S₂, S₃, trace};
State    Meaning
S₀       Initial parsing state
S₁       State after XMLDecl has been parsed (also a final state)
S₂       State reached after doctypedecl has been parsed (also a final state)
S₃       State in which content is parsed after the root element's STag has been parsed
trace    An ETag has been parsed; enter the backtracking state
Input symbol set Σ:
{XMLDecl, Misc, doctypedecl, EmptyElemTag, STag, Reference, PI, CDSection, Comment, CharData, ETag};
Initial state q₀: S₀
Final state set F: {S₁, S₂};
Transition function δ: S × Σ → S ∪ {trace}. Referring to Fig. 1, it is the following set of transitions:
(1) δ(S₀, XMLDecl) = S₁: parsing begins and the document declaration XMLDecl is parsed; move to state S₁.
(2) δ(S₁, Misc) = S₁: parse Misc in a loop.
(3) δ(S₁, STag) = S₃: a start tag STag is parsed; push the start-tag name onto the name stack, push the current state S₁ onto the state stack, and move to state S₃.
(4) δ(S₁, doctypedecl) = S₂: doctypedecl is encountered; move to state S₂.
(5) δ(S₂, Misc) = S₂: parse Misc in a loop.
(6) δ(S₂, STag) = S₃: a start tag STag is encountered; push the start-tag name onto the name stack, push the current state S₂ onto the state stack, and move to state S₃.
(7) δ(S₃, Content_item) = S₃: parse, in a loop, tokens that need no push operation.
(8) δ(S₃, STag) = S₃: parse start tags STag in a loop; push the start-tag name onto the name stack and push the current state S₃ onto the state stack.
(9) δ(S₃, ETag) = trace: an end tag ETag is parsed; move to the backtracking state. Pop the top symbol of the name stack and compare it with the name of the ETag, pop the top state p of the state stack, and move the automaton to state p.
(3) Event handler
In accordance with the SAX specification, the invention generates a large number of parsing events while parsing an XML file, and these events trigger the callback methods of the registered event handler. When JSAX starts to parse an XML file it first calls the startDocument method, indicating the start of parsing. On encountering a whitespace string (spaces, tabs, line feeds, and so on) or character data it calls the characters method; on a start tag it calls the startElement method; on an end tag it calls the endElement method; on a PI it calls the processingInstruction method. If an error occurs during parsing, the corresponding error-handling method is called to handle and report it. When the XML document has been parsed completely, the endDocument method is called, indicating that parsing has finished.
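The callbacks named above are the standard SAX handler methods, so their order can be observed with any SAX parser. The sketch below uses the SAX parser bundled with the JDK (javax.xml.parsers), not the JSAX parser itself, whose entry points are not given in the text; the class name EventLog and its parse helper are hypothetical.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler that records the SAX events it receives. */
final class EventLog extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    @Override public void startDocument() { events.add("startDocument"); }
    @Override public void endDocument() { events.add("endDocument"); }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes) {
        events.add("startElement:" + qName);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + qName);
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        // A parser may deliver one run of text in several chunks.
        events.add("characters:" + new String(ch, start, length));
    }

    @Override
    public void processingInstruction(String target, String data) {
        events.add("processingInstruction:" + target);
    }

    /** Parses the XML text with the JDK's SAX parser and returns the recorded events. */
    static List<String> parse(String xml) {
        EventLog log = new EventLog();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), log);
        } catch (Exception e) {
            throw new IllegalStateException("parsing failed", e);
        }
        return log.events;
    }
}
```

Calling `EventLog.parse("<books><title>T</title></books>")` returns a list that begins with startDocument and ends with endDocument, with the element events nested in document order between them.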
The JSAX parser provided by the invention meets the SAX interface specification, so users can easily parse XML documents through the SAX interface. It also solves the problems of complex structure and poor performance in the syntax analyzers of XML document parsers; it is easy to implement and efficient, and can be applied to parsing XML documents.
Embodiment 5
The JSAX parser and parsing method based on backtracking-automaton syntax analysis are the same as in Embodiments 1-4.
Fig. 3 shows an XML document that stores information about books; each book element contains the title, author, and price of the book. When the invention parses this document, it proceeds as follows:
Parsing begins by passing a startDocument event to the event handler. The lexical analyzer reads the characters of the XML document and outputs an XMLDecl token. Referring to Fig. 1, the backtracking automaton reads the XMLDecl token in state S₀ and moves to state S₁.
The next token is a PI token; the automaton stays in state S₁ and passes a processingInstruction event to the event handler, indicating that a PI token has been found. As described above, during syntax analysis the standard callback function processingInstruction() returns the XML document information that conforms to the syntax specification, here the processing instruction, to the user.
In state S₁ the next token is the start tag <books>. The current state S₁ is pushed onto the state stack, the tag name "books" of the start tag <books> is pushed onto the name stack, and control moves to state S₃. A startElement event is passed to the event handler, and the callback function startElement(String uri, String localName, String qName, Attributes attributes) returns the start-tag information to the user.
In state S₃ the next token is <!--a book-->, which is a Comment; no stack operation is needed and the next state is still S₃.
The next token, <book a="z">, is a start tag STag. The current state S₃ is pushed onto the state stack, the tag name "book" is pushed onto the name stack, and control moves to S₃. A startElement event is passed to the event handler, and the callback function startElement(String uri, String localName, String qName, Attributes attributes) returns the start-tag information to the user.
The next token, <![CDATA[<tom>&<lucy>One<two]]>, is a CDSect token; no stack operation is needed and the next state is still S₃.
The next token, <title>, is an STag. The current state S₃ is pushed onto the state stack, the name "title" of this STag is pushed onto the name stack, and control moves to state S₃. A startElement event is passed to the event handler, and the callback function startElement(String uri, String localName, String qName, Attributes attributes) returns the start-tag information to the user.
The next token, "The Romance of the Three Kingdoms", is a CharData token; no stack operation is needed. A CharData event is passed to the event handler, and the callback function characters(char ch[], int start, int length) returns the character data to the user.
The next token, </title>, is an ETag. The top state S₃ of the state stack is popped and the top token "title" of the name stack is popped; it is identical to the name of the current end tag </title>, so the tags match correctly and control moves to S₃. An endElement event is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the end-tag information to the user.
The next token, <author>, is an STag. The current state S₃ is pushed onto the state stack, the name "author" of this STag is pushed onto the name stack, and control moves to S₃. A startElement event is passed to the event handler, and the callback function startElement(String uri, String localName, String qName, Attributes attributes) returns the start-tag information to the user.
The next token, "Luo Guanzhong", is a CharData token; no stack operation is needed. A CharData event is passed to the event handler, and the callback function characters(char ch[], int start, int length) returns the character data to the user. Control moves to state S₃.
The next token is the end tag </author>, an ETag. The top state S₃ of the state stack is popped and the top token "author" of the name stack is popped; it is identical to the name of the end tag </author>, so the tags match correctly. An endElement event is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the end-tag information to the user.
The next token, <price>, is an STag. The current state S₃ is pushed onto the state stack, the name "price" of this start tag is pushed onto the name stack, and control moves to S₃. A startElement event is passed to the event handler, and the callback function startElement(String uri, String localName, String qName, Attributes attributes) returns the start-tag information to the user.
The next token is "42.2" (see Fig. 3), a CharData token; no stack operation is needed. A CharData event is passed to the event handler, and the callback function characters(char ch[], int start, int length) returns the character data to the user.
The next token, </price>, is an end tag. The top state S₃ of the state stack is popped and control moves to S₃; the top token "price" of the name stack is popped, and it is identical to the name of the current end tag, so the tags match correctly. An endElement event is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the end-tag information to the user.
The next token, </book>, is an end tag. The top state S₃ of the state stack is popped and control moves to S₃; the top token "book" of the name stack is popped, and it is identical to the name of the current end tag, so the tags match correctly. An endElement event is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the end-tag information to the user.
The next token, </books>, is an end tag. The top state S₁ of the state stack is popped and control moves to S₁; the top token "books" of the name stack is popped, and it is identical to the name of the current end tag, so the tags match correctly. An endElement event is passed to the event handler, and the callback function endElement(String uri, String localName, String qName) returns the end-tag information to the user.
The next token, <!--end of xml file-->, is a Comment; no stack operation is needed, and a characters event is passed to the event handler. The end of the XML document has now been reached and the current state is S₁, which is a final state; an endDocument event is passed to the event handler, showing that 
parse documents finishes, and successfully returns.In concrete resolving; Need events corresponding information be passed to event handler, event handler is accepted all event informations that resolver transmits, and therefrom finds desired data; Be desired data such as above-mentioned " Luo Guanzhong " etc., these data returned to the user through call back function.Concrete analysis result is with reference to accompanying drawing 6.
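The callback sequence described in this walkthrough can be observed with any SAX parser. The following is a minimal sketch that uses the SAX parser bundled with the JDK, not the JSAX parser of the invention, so it only illustrates the callback interface. The sample document is reconstructed from the tokens named above; the processing instruction `<?demo hypothetical?>` and the class name `EventLogger` are assumptions made for illustration, since the text does not give the content of the PI. Comments are not delivered through the standard content callbacks, so they do not appear in the log.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Records the standard SAX callbacks fired while parsing a document. */
final class EventLogger extends DefaultHandler {
    final List<String> events = new ArrayList<>();

    /** Sample document rebuilt from the walkthrough; the PI is a hypothetical placeholder. */
    static final String SAMPLE =
        "<?xml version=\"1.0\"?><?demo hypothetical?><books><!--a book--><book a=\"z\">"
        + "<![CDATA[<tom>&<lucy>One<two]]><title>The Romance of the Three Kingdoms</title>"
        + "<author>Luo Guanzhong</author><price>42.2</price></book></books>"
        + "<!--end of xml file-->";

    @Override public void startDocument() { events.add("startDocument"); }

    @Override public void endDocument() { events.add("endDocument"); }

    @Override public void processingInstruction(String target, String data) {
        events.add("processingInstruction:" + target);
    }

    @Override public void startElement(String uri, String localName,
                                       String qName, Attributes attributes) {
        events.add("startElement:" + qName);
    }

    @Override public void endElement(String uri, String localName, String qName) {
        events.add("endElement:" + qName);
    }

    @Override public void characters(char[] ch, int start, int length) {
        // A parser may split one run of text into several chunks; log the run once.
        if (events.isEmpty() || !events.get(events.size() - 1).equals("characters")) {
            events.add("characters");
        }
    }

    /** Parses the given XML text and returns the ordered list of callbacks. */
    static List<String> parse(String xml) {
        try {
            EventLogger handler = new EventLogger();
            SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xml)), handler);
            return handler.events;
        } catch (Exception e) {
            throw new IllegalStateException("parse failed", e);
        }
    }
}
```

Calling `EventLogger.parse(EventLogger.SAMPLE)` returns the events in document order, which can be compared step by step with the transitions listed in the walkthrough.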
Embodiment 6
The JSAX parser and parsing method based on backtracking-automaton syntactic analysis are the same as in Embodiments 1-4.
The present invention is an XML document parser. To test its performance, the invention and the Xerces parser, which is generally acknowledged to have good parsing performance, were run in the same environment and their performance was compared.
(1) Test environment
Hardware: Intel Pentium (Dual-Core) D CPU 1.73 GHz; memory: 2.00 GB
Operating system: Windows 7
Java VM: J2SE 1.6.0_02
Test software: Eclipse SDK, Version 3.5.2
(2) Analysis of performance test data:
As shown in Fig. 7, XML documents containing 10, 100, 1000, 10000 and 100000 elements were each parsed 6 times with Xerces and with JSAX. The time t1 was recorded when parsing of a document began and the time t2 when parsing of that document finished, giving the time t = t2 - t1 (unit: milliseconds) used by each of the 6 parses of each document, from which the average parsing time was computed. The test result reported for each document is this average. The results are shown in Tables 1 to 5.
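The measurement procedure (record t1, parse, record t2, average t = t2 - t1 over 6 runs) can be sketched as follows. This is an illustrative harness only: it times the SAX parser bundled with the JDK rather than JSAX or a standalone Xerces build, it generates synthetic documents because the original test documents are not given, and the names `ParseTimer`, `makeDocument` and `averageMillis` are hypothetical. `System.nanoTime()` is used for resolution and converted to milliseconds.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParser;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Times repeated SAX parses of one document and reports the mean duration. */
final class ParseTimer {

    /** Builds a synthetic document with the given number of elements (root included). */
    static String makeDocument(int elements) {
        StringBuilder sb = new StringBuilder("<?xml version=\"1.0\"?><root>");
        for (int i = 1; i < elements; i++) {
            sb.append("<item>").append(i).append("</item>");
        }
        return sb.append("</root>").toString();
    }

    /** Parses xml the given number of times and returns the mean of t = t2 - t1 in ms. */
    static double averageMillis(String xml, int runs) {
        try {
            SAXParserFactory factory = SAXParserFactory.newInstance();
            long totalNanos = 0;
            for (int r = 0; r < runs; r++) {
                // Parser creation is kept outside the timed region.
                SAXParser parser = factory.newSAXParser();
                long t1 = System.nanoTime();
                parser.parse(new InputSource(new StringReader(xml)), new DefaultHandler());
                long t2 = System.nanoTime();
                totalNanos += t2 - t1;
            }
            return totalNanos / 1_000_000.0 / runs;
        } catch (Exception e) {
            throw new IllegalStateException("timing run failed", e);
        }
    }
}
```

For the document sizes used in the test, one would call `ParseTimer.averageMillis(ParseTimer.makeDocument(n), 6)` for n = 10, 100, 1000, 10000 and 100000. The first runs on a fresh JVM include class-loading and JIT warm-up, so absolute numbers from such a sketch are not comparable with the tables.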
Table 1. Results for the test document containing 10 elements
Figure BSA00000704918100241
Table 2. Results for the test document containing 100 elements
Figure BSA00000704918100242
Table 3. Results for the test document containing 1000 elements
Figure BSA00000704918100251
Table 4. Results for the test document containing 10000 elements
Table 5. Results for the test document containing 100000 elements
Figure BSA00000704918100253
The test results show that the present invention improves on the Xerces parser by at least 2.8% in performance. The test data also show that the times the invention takes to parse the same XML document are relatively close to one another, which indicates that its parsing performance is relatively stable. When parsing the XML document containing 100000 elements, the average parsing time of the invention is shorter than that of the existing Xerces parser. Because the improved backtracking automaton is applied to the syntax analyzer of the JSAX parser, the design and implementation of that syntax analyzer are effectively simplified compared with a syntax analyzer implemented with a pushdown automaton, and parsing efficiency is improved. In particular, when massive XML documents need to be parsed, the parsing efficiency of the invention is high, which gives it high practical value.
In summary, the present invention is an XML document parser under the SAX interface model, based on backtracking-automaton syntactic analysis. By redefining the action transition rule δ of the backtracking automaton and applying the improved backtracking automaton to the syntax analyzer of the invention, the design and implementation of the syntax analyzer are simplified and the efficiency of the XML parser is effectively improved. During syntactic analysis, the backtracking automaton takes the token stream provided by the lexical analyzer as input: when the token read is a start tag, the current state is pushed onto the top of the stack; when the token read is an end tag, the automaton pops a state from the top of the stack and takes it as its next state; for other tokens no stack operation is performed. During syntactic analysis, the XML document information that conforms to the syntax specification is returned to the user through standard callback functions. The invention solves the problems that the syntax analyzer of an XML document parser is structurally complex and performs poorly; it is easy to implement and efficient, and can be applied to the parsing of XML documents.

Claims (6)

1. A JSAX parser based on backtracking-automaton syntactic analysis, which parses XML documents under the Eclipse environment and comprises a lexical analyzer, a syntax analyzer and an event handler, wherein the lexical analyzer is responsible for reading the content of the XML document and outputting the tokens it reads to the syntax analyzer; the syntax analyzer identifies the language constructs in the input token stream according to the XML specification and passes the corresponding event information to the event handler; the event handler accepts and processes all event information passed by the parser and finds the required data in it, thereby parsing the XML document and providing the parsing result; the syntax analyzer is built on an automaton, the backtracking automaton being a five-tuple M = (S, Σ, δ, q0, F) that also includes a state stack for preserving part of the operation history; characterized in that: the syntax analyzer is implemented on the basis of a backtracking automaton, the backtracking automaton being an improved backtracking automaton in which the action transition rule δ is redefined, the redefinition being a rule-based definition comprising:
1) if δ(q, a) = p, then in state q, on reading token a, the current state q is pushed onto the top of the stack, where a denotes a token that requires a push operation;
2) if δ(q, b) = trace, then in state q, on reading token b, and when the state stack is not empty, the top state p of the state stack is popped and control moves to state p, where b denotes a token that requires a backtracking action;
3) if δ(q, c) = p, then in state q, on reading token c, no stack operation is needed, where c denotes a token that requires no stack operation;
4) for δ(q, d), if d is the blank symbol (the blank symbol does not belong to the input character set and indicates the end of the string), the automaton halts and accepts the input string when q ∈ F, and rejects it when q ∉ F;
5) if δ(q, e) is undefined, the automaton halts and rejects the input string.
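The five rules can be read as a small interpreter loop over a transition table and a state stack. The sketch below is one possible rendering under stated assumptions: the class `BacktrackingAutomaton`, the `Kind` enum and the string encoding of states and tokens are hypothetical, and the behaviour when a backtracking token arrives with an empty stack (rejection) is an assumption, since rule 2 only covers the non-empty case.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Table-driven automaton with a state stack, following rules 1) to 5). */
final class BacktrackingAutomaton {
    /** The three kinds of defined transition: push (rule 1), trace (rule 2), plain (rule 3). */
    enum Kind { PUSH, TRACE, PLAIN }

    private static final class Move {
        final Kind kind;
        final String target;
        Move(Kind kind, String target) { this.kind = kind; this.target = target; }
    }

    private final Map<String, Move> delta = new HashMap<>();
    private final String initial;
    private final Set<String> finals;

    BacktrackingAutomaton(String initial, Set<String> finals) {
        this.initial = initial;
        this.finals = finals;
    }

    /** Defines δ(state, token); target is ignored for TRACE transitions. */
    BacktrackingAutomaton on(String state, String token, Kind kind, String target) {
        delta.put(state + '\u0000' + token, new Move(kind, target));
        return this;
    }

    /** Runs the automaton; the end of the list plays the role of the blank symbol. */
    boolean accepts(List<String> tokens) {
        Deque<String> stack = new ArrayDeque<>();
        String q = initial;
        for (String token : tokens) {
            Move move = delta.get(q + '\u0000' + token);
            if (move == null) {
                return false;                      // rule 5: δ undefined, reject
            }
            switch (move.kind) {
                case PUSH:                         // rule 1: push q, go to p
                    stack.push(q);
                    q = move.target;
                    break;
                case TRACE:                        // rule 2: pop p, go to p
                    if (stack.isEmpty()) {
                        return false;              // assumption: empty stack rejects
                    }
                    q = stack.pop();
                    break;
                default:                           // rule 3: no stack operation
                    q = move.target;
            }
        }
        return finals.contains(q);                 // rule 4: accept iff q is final
    }
}
```

With states A (initial and final) and B, a push on "(" and a trace on ")", the automaton accepts balanced sequences such as ( x ( ) ) and rejects ( x, because the run ends in the non-final state B.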
2. The JSAX parser based on backtracking-automaton syntactic analysis according to claim 1, characterized in that the grammar form equivalent to the improved backtracking automaton is:
A → aβ
where a ∈ T (a is a terminal symbol in T) and β ∈ N⁰ ∪ N¹ ∪ N² (β is a string of zero, one or two nonterminal symbols from N); when β contains two nonterminal symbols, the production has the form A → aCA, which requires the second nonterminal on the right-hand side of the production to be identical to the nonterminal on its left-hand side, where A and C are nonterminal symbols;
The descriptive power of this grammar is stronger than that of a regular grammar (RG) but weaker than that of a context-free grammar (CFG); it is a subclass of CFG, lying between RG and CFG.
3. The JSAX parser based on backtracking-automaton syntactic analysis according to claim 2, characterized in that: the XML syntax definition is described with the grammar equivalent to the improved backtracking automaton, yielding the syntax rules that describe XML documents; the improved backtracking automaton is constructed from these syntax rules; and the improved backtracking automaton identifies the language constructs in the token stream of the XML document and judges whether they conform to the syntax specification, completing syntactic analysis while passing the corresponding event information to the event handler.
4. The JSAX parser based on backtracking-automaton syntactic analysis according to claim 3, characterized in that the syntax rules for describing the XML syntax definition, built with the grammar form of claim 2, comprise:
document ::= prolog element Misc*
element ::= EmptyElemTag | A
A ::= STag B A
Content_item ::= CharData | Reference | CDSect | PI | Comment | EmptyElemTag
B ::= Content_item B
B ::= STag B B
B ::= ETag
A ::= Miscs
Miscs ::= ε | Misc Miscs
where document denotes the XML document; prolog describes the declaration information and the document type declaration doctypedecl; element describes a nested, hierarchically structured string of matching tags, and the tags appearing in element must be correctly nested and matched; STag denotes a start tag; CharData denotes character data; Reference denotes a reference; CDSect denotes a CDATA section; PI denotes a processing instruction; Comment denotes a comment; EmptyElemTag denotes an empty-element tag; Misc* denotes the whitespace, processing instructions and comments in the XML document; B is a nonterminal that can be replaced by the end tag ETag or by STag B B; A is a nonterminal that can be replaced by Miscs or by STag B A.
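As a worked illustration of how these productions generate a correctly matched element, the token string STag CharData ETag (for example a title element containing only text) can be derived from A as shown below. The choice of this particular token string is only an example; each step applies exactly one of the productions listed above.

```latex
\begin{align*}
A &\Rightarrow \mathit{STag}\; B\; A
    && (A ::= \mathit{STag}\; B\; A) \\
  &\Rightarrow \mathit{STag}\; \mathit{Content\_item}\; B\; A
    && (B ::= \mathit{Content\_item}\; B) \\
  &\Rightarrow \mathit{STag}\; \mathit{CharData}\; B\; A
    && (\mathit{Content\_item} ::= \mathit{CharData}) \\
  &\Rightarrow \mathit{STag}\; \mathit{CharData}\; \mathit{ETag}\; A
    && (B ::= \mathit{ETag}) \\
  &\Rightarrow \mathit{STag}\; \mathit{CharData}\; \mathit{ETag}\; \mathit{Miscs}
    && (A ::= \mathit{Miscs}) \\
  &\Rightarrow \mathit{STag}\; \mathit{CharData}\; \mathit{ETag}
    && (\mathit{Miscs} ::= \varepsilon)
\end{align*}
```

Every production used has the form A → aβ with at most two nonterminals in β, except for A ::= Miscs and the Miscs rule, which handle the trailing Misc tokens after the root element.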
5. The JSAX parser based on backtracking-automaton syntactic analysis according to claim 4, characterized in that: the improved backtracking automaton is constructed according to the syntax rules describing the XML syntax definition, and the improved backtracking automaton reads the language constructs in the token stream of the XML document output by the lexical analyzer and completes syntactic analysis; the constructed improved backtracking automaton TA is:
M = (S, Σ, δ, q0, F), where:
State set S: {S0, S1, S2, S3, trace}, where S0 denotes the initial parsing state, S1 denotes the state after XMLDecl has been parsed, S2 denotes the state reached after doctypedecl has been parsed, S3 denotes the state in which content parsing begins after the STag of the root element has been parsed, and trace denotes that an ETag has been parsed and the backtracking state must be entered.
Input symbol set Σ:
{XMLDecl, Misc, doctypedecl, EmptyElemTag, STag, Reference, CDSect, CharData, PI, CDSection, Comment, ETag};
Initial state q0: S0;
Final state set F: {S1, S2};
State stack: {S1, S2, S3, Z} (where Z denotes the bottom of the stack);
Transition function δ, δ: S × Σ → S ∪ {trace}, is the set of the following transitions:
(1) δ(S0, XMLDecl) = S1: in the initial state the token read is the XML declaration XMLDecl, so the automaton moves to state S1;
(2) δ(S1, Misc) = S1: in state S1 the token read is Misc (whitespace, comment or processing instruction), so Misc tokens are read in a loop;
(3) δ(S1, STag) = S3: in state S1 the token read is the start tag STag, so the current state S1 is pushed onto the top of the state stack, the start-tag name is pushed onto the top of the namespace stack, and the automaton moves to state S3;
(4) δ(S1, doctypedecl) = S2: in state S1 the token read is doctypedecl, so the automaton moves to state S2;
(5) δ(S2, Misc) = S2: in state S2 the token read is Misc (whitespace, comment or processing instruction), so Misc tokens are parsed in a loop and the state stack does not change;
(6) δ(S2, STag) = S3: in state S2 the token read is the start tag STag, so the start-tag name is pushed onto the top of the namespace stack, the current state S2 is pushed onto the top of the state stack, and the automaton moves to state S3;
(7) δ(S3, Content_item) = S3: tokens that need no push are read in a loop and the state stack does not change;
(8) δ(S3, STag) = S3: in state S3 the token read is the start tag STag, so the current state S3 is pushed onto the top of the state stack and the start-tag name is pushed onto the top of the namespace stack;
(9) δ(S3, ETag) = trace: in state S3 the token read is the end tag ETag, so the automaton moves to the state p on top of the state stack and pops p; at the same time the name on top of the namespace stack is popped and compared with the name of the ETag; if they are identical the tags match correctly, otherwise an error is reported.
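Transitions (1) to (9), together with the two stacks, can be sketched as a single loop over a token list. This is an illustrative reading of the claim, not the patented implementation: the class `XmlTraceAutomaton`, the `Token` type and the helper methods are hypothetical, and all tokens that need no stack operation (CharData, Reference, CDSect, PI, Comment, EmptyElemTag) are collapsed into one `Content_item` token type, as in transition (7).

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

/** Runs transitions (1)-(9) over a token stream using a state stack and a name stack. */
final class XmlTraceAutomaton {
    enum Type { XMLDecl, Misc, doctypedecl, STag, ETag, Content_item }

    static final class Token {
        final Type type;
        final String name;   // tag name for STag/ETag, otherwise null
        Token(Type type, String name) { this.type = type; this.name = name; }
    }

    static Token tok(Type type) { return new Token(type, null); }
    static Token tag(Type type, String name) { return new Token(type, name); }

    /** Returns true when the stream ends in a final state (S1 or S2) without error. */
    static boolean accepts(List<Token> tokens) {
        Deque<String> states = new ArrayDeque<>();  // state stack
        Deque<String> names = new ArrayDeque<>();   // the "namespace stack" of tag names
        String q = "S0";
        for (Token t : tokens) {
            switch (q + "/" + t.type) {
                case "S0/XMLDecl":                  // (1)
                    q = "S1";
                    break;
                case "S1/Misc":                     // (2)
                case "S2/Misc":                     // (5)
                case "S3/Content_item":             // (7)
                    break;                          // stay in q, no stack operation
                case "S1/doctypedecl":              // (4)
                    q = "S2";
                    break;
                case "S1/STag":                     // (3)
                case "S2/STag":                     // (6)
                case "S3/STag":                     // (8)
                    states.push(q);
                    names.push(t.name);
                    q = "S3";
                    break;
                case "S3/ETag":                     // (9) trace: pop and compare names
                    if (states.isEmpty() || !names.pop().equals(t.name)) {
                        return false;               // mismatched tags: report error
                    }
                    q = states.pop();
                    break;
                default:
                    return false;                   // δ undefined for this pair
            }
        }
        return q.equals("S1") || q.equals("S2");
    }
}
```

Feeding the token sequence of the books example (XMLDecl, a PI as Misc, the nested start tags, content items and end tags, and the trailing comment as Misc) returns true, while a stream whose end-tag name differs from the name on top of the stack returns false.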
6. A JSAX parsing method based on backtracking-automaton syntactic analysis, which uses the JSAX parser based on backtracking-automaton syntactic analysis of claims 1-5 to parse XML documents under the Eclipse environment, characterized in that, in the process of parsing an XML document, the concrete parsing steps comprise:
Step 1. At the start, the lexical analyzer reads the XMLDecl token of the XML document; XMLDecl is parsed and checked against the XMLDecl specification; parsing continues for a conforming XMLDecl, and an error is reported directly for a non-conforming one;
Step 2. For a conforming XMLDecl, judge whether the next token is Miscs; if so, parse the Miscs in a loop;
Step 3. After the Miscs have been parsed, judge whether the next token is doctypedecl; if so, go to step 4, otherwise go to step 5;
Step 4. Parse doctypedecl; after doctypedecl has been parsed, judge whether the next token is Miscs, and if so parse the Miscs in a loop; otherwise go to step 6;
Step 5. After doctypedecl has been parsed, judge whether the next token is an empty-element tag; if it is not an empty-element tag, go to step 6; if it is, parse the empty-element tag and then go to step 10;
Step 6. Judge whether the next token is a start tag; if so, parse the start tag, push the current state onto the top of the state stack and push the start-tag name onto the top of the namespace stack; otherwise report an error;
Step 7. If the next token is one that needs no push operation, such as CharData, CDSection, Comment, Reference, PI, EmptyElemTag or S, parse the respective token and continue with the next step; if the next token is a start tag STag, go to step 6;
Step 8. Judge whether the next token is an end tag; if so, parse the end tag, pop the top state of the state stack as the next state of the automaton, and at the same time pop the name on top of the namespace stack and judge whether it is identical to the name of the current end tag; if identical, go to the next step; otherwise the start tag and the end tag do not match and an error is reported;
Step 9. After an end tag has been parsed, check whether the state stack is empty; if it is empty go to step 10, otherwise go to step 7;
Step 10. Judge whether the document still has tokens; if not, the end of the XML document has been reached and parsing finishes; if the document still has tokens, judge whether the next token is Miscs; if not, report an error; if so, parse the Miscs tokens until the end of the XML document is read, which completes the parsing of the whole XML document, and parsing finishes.
CN201210118808.0A 2012-04-20 2012-04-20 JSAX (joint simple API (application program interface) for XML (extensible markup language)) parser and parsing method based on syntactic analysis of backtracking automaton Active CN102708155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210118808.0A CN102708155B (en) 2012-04-20 2012-04-20 JSAX (joint simple API (application program interface) for XML (extensible markup language)) parser and parsing method based on syntactic analysis of backtracking automaton

Publications (2)

Publication Number Publication Date
CN102708155A true CN102708155A (en) 2012-10-03
CN102708155B CN102708155B (en) 2015-02-18

Family

ID=46900922

Country Status (1)

Country Link
CN (1) CN102708155B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1991837A (en) * 2005-12-27 2007-07-04 国际商业机器公司 Structured document processing apparatus and method
CN101329665A (en) * 2007-06-18 2008-12-24 国际商业机器公司 Method for analyzing marking language document and analyzer

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
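汪剑超 (Wang Jianchao): "高性能JavaSAX解析器的设计与实现" [Design and Implementation of a High-Performance Java SAX Parser], 《中国优秀硕士学位论文全文数据库》 [China Master's Theses Full-text Database] *
郝克刚等 (Hao Kegang et al.): "论回溯自动机" [On Backtracking Automata], 《计算机学报》 [Chinese Journal of Computers] *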

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106657075A (en) * 2016-12-26 2017-05-10 东软集团股份有限公司 Multilayer protocol analysis method and device as well as data matching method and device
CN106657075B (en) * 2016-12-26 2019-11-15 东软集团股份有限公司 Multi-layer protocol analytic method, device and data matching method and device
CN107426211A (en) * 2017-07-25 2017-12-01 北京长亭科技有限公司 Detection method and device, terminal device and the computer-readable storage medium of network attack
CN107426211B (en) * 2017-07-25 2020-08-14 北京长亭未来科技有限公司 Network attack detection method and device, terminal equipment and computer storage medium
CN111176640A (en) * 2018-11-13 2020-05-19 武汉斗鱼网络科技有限公司 Layout level display method, storage medium, device and system in Android project
CN111176640B (en) * 2018-11-13 2022-05-13 武汉斗鱼网络科技有限公司 Layout level display method, storage medium, device and system in Android engineering
CN109947835A (en) * 2019-03-12 2019-06-28 东华大学 Printing and dyeing quotation mode demand data extracting method based on finite-state automata
CN109947835B (en) * 2019-03-12 2023-05-23 东华大学 Printing and dyeing quotation structured demand data extraction method based on finite state automaton
CN115118793A (en) * 2022-06-14 2022-09-27 北京经纬恒润科技股份有限公司 BLF file parsing fault-tolerant method and device and computer equipment
CN115118793B (en) * 2022-06-14 2023-07-07 北京经纬恒润科技股份有限公司 BLF file analysis fault tolerance method and device and computer equipment
CN114781400A (en) * 2022-06-17 2022-07-22 之江实验室 Cross-media knowledge semantic expression method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant