CN100489840C - Storage means and device for marking language document, and output method and device - Google Patents

Publication number
CN100489840C
Authority
CN
China
Prior art keywords
mark
sign
language document
marking language
data block
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB2007101871423A
Other languages
Chinese (zh)
Other versions
CN101158939A (en)
Inventor
王长桥
贾爱霞
汤帜
刘志云
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
LIDE TECHNOLOGY DEVELOPMENT CO LTD
Peking University
Peking University Founder Group Co Ltd
Original Assignee
Peking University
Peking University Founder Group Co Ltd
Beijing Founder Apabi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University, Peking University Founder Group Co Ltd, and Beijing Founder Apabi Technology Co Ltd
Priority to CNB2007101871423A
Publication of CN101158939A
Application granted
Publication of CN100489840C
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention discloses a storage method and device and an output method and device for markup language documents, which address the slow response of information devices with limited resources when they process large markup language documents. During storage, the markup language document is divided into at least two data blocks; a blocking sign is added between each two adjacent data blocks; the global tags, the unclosed local tags, and their tag attributes are saved into the blocking sign; and the data blocks and blocking signs are stored. During output, at least two data blocks are selected in sequence; the tags and tag attributes stored in the blocking signs are read; each blocking sign is parsed together with its adjacent data block, so that each data block forms a markup language document tree structure; and two or more such tree structures are merged into one tree structure and output. The scheme is suitable for XML-type documents and streaming structured documents, and can be applied to resource-limited mobile terminals that require fast response.

Description

Storage method and device, and output method and device, for markup language documents
Technical field
The present invention relates to the field of digital information processing, and in particular to a method and device for processing markup language documents.
Background art
As information technology is applied ever more widely, so are markup language documents; most web pages and documents on the Internet today are markup language documents. Common examples include XML (Extensible Markup Language) documents for expressing technical information (such as patent data), SVG (Scalable Vector Graphics) documents for expressing two-dimensional page layouts, MathML (Mathematical Markup Language) documents for expressing scientific and technical information, HTML (HyperText Markup Language) documents for expressing book content, SGML (Standard Generalized Markup Language) documents, XHTML (eXtensible HyperText Markup Language) documents, and user-defined markup language documents that can package video and audio information.
A markup language document is a document that expresses its content information and its appearance through tags. Taking HTML as an example, HTML tags give the browser information about the document's characteristics, such as its version, encoding, introductory information, title, appearance, and logical structure. Compared with plain text, an HTML document contains not only the substantive content of the document but also the tags that express its appearance attributes and logical structure.
As markup language documents grow more capable, individual documents also grow longer. XML-type and HTML documents of several megabytes or even tens of megabytes are common. An ordinary desktop computer handles such documents easily, but mobile terminal devices such as mobile readers and mobile phones have limited memory and processing power. They lack the space and speed to buffer an entire markup language document before outputting or displaying it, so the user must wait a long time before seeing the document displayed.
Summary of the invention
The invention provides a storage method for markup language documents that enables a mobile device with limited resources to store large markup language documents.
To achieve the above object, the present invention adopts the following technical solution:
A storage method for a markup language document, where the markup language document comprises content information, tags, and tag attributes; the tags and tag attributes express the logical structure, appearance attributes, and encoding format of the content information; a tag delimits the range of content information over which its tag attributes apply; and the tags comprise global tags and local tags.
The method comprises the following steps:
dividing the markup language document into at least two data blocks;
adding a blocking sign between two adjacent data blocks;
saving the global tags, the unclosed local tags, and their tag attributes into the blocking sign, in the order in which they appear in the markup language document;
storing the data blocks and the blocking signs.
So that a mobile device with limited resources can store a large markup language document, the invention divides the document into several data blocks and saves the global tags and local tags that affect the logical structure, appearance attributes, and encoding format of the content information. During storage, the data blocks are stored sequentially, one block at a time. This block-wise storage mechanism lays the foundation for merging the data blocks at output.
The present invention also provides a storage device for markup language documents that enables a mobile device with limited resources to store large markup language documents.
To achieve the above object, the present invention adopts the following technical solution:
A storage device for a markup language document, where the markup language document comprises content information, tags, and tag attributes; the tags and tag attributes express the logical structure, appearance attributes, and encoding format of the content information; a tag delimits the range of content information over which its tag attributes apply; and the tags comprise global tags and local tags.
The device comprises:
a blocking unit, configured to divide the markup language document into at least two data blocks;
a blocking sign adding unit, configured to add a blocking sign between two adjacent data blocks;
a tag saving unit, configured to save the global tags, the unclosed local tags, and their tag attributes into the blocking sign, in the order in which they appear in the markup language document;
a storage unit, configured to store the content information of each data block and the blocking signs.
So that a mobile device with limited resources can store a large markup language document, the blocking unit divides the original document into several data blocks, and the tag saving unit saves the global tags and local tags that affect the logical structure, appearance attributes, and encoding format of the content information. The storage unit stores the data blocks sequentially, one block at a time. This block-wise storage mechanism lays the foundation for merging the data blocks at output.
The present invention also provides an output method for markup language documents that enables a mobile device with limited resources to output large markup language documents.
To achieve the above object, the present invention adopts the following technical solution:
An output method for a markup language document, comprising the following steps:
selecting at least two data blocks in sequence;
reading the tags and tag attributes stored in the blocking sign corresponding to each data block;
parsing each blocking sign together with the data block immediately following it, so that each data block forms a markup language document tree structure;
merging two or more markup language document tree structures into one tree structure;
continuously outputting the merged tree structure.
When a mobile device with limited resources outputs a large markup language document, it does not need to read the whole document at once. It only needs to select at least two data blocks in sequence according to the progress of output, parse the selected data blocks separately, merge the results, and output the merged content. A large markup language document can thus be parsed and output in several passes. The amount of data handled in each pass is very small compared with the whole document, so the resources of the mobile device can meet the user's speed requirements while the logical structure and appearance of the original document are preserved exactly.
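The merging step can be pictured with a small sketch. It is not the patent's implementation: the `Node` class, the `merge` function, and the `reopened` parameter are hypothetical names introduced here, and the sketch assumes that the number of tags re-opened from the blocking sign is known and that any repeated global tags have already been removed from the second tree.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    tag: str
    children: list = field(default_factory=list)  # child Node objects or text strings


def merge(left, right, reopened):
    """Append the forest `right` to the forest `left` in place.

    `reopened` is how many nested tags at the start of `right` merely
    continue elements left unclosed at the end of `left` (assumption:
    this count is taken from the blocking sign).
    """
    if reopened > 0 and left and right:
        last, first = left[-1], right[0]
        if isinstance(last, Node) and isinstance(first, Node) and last.tag == first.tag:
            # The first element on the right continues the last one on the left.
            merge(last.children, first.children, reopened - 1)
            right = right[1:]
    left.extend(right)
    return left


# A paragraph split inside <p><strong>: two levels are re-opened in the second block.
first = [Node("p", [Node("strong", ["abc"])])]
second = [Node("p", [Node("strong", ["def"])]), Node("p", ["next"])]
merged = merge(first, second, reopened=2)
```

After the call, the split paragraph is a single `p` node again, followed by the next paragraph as its sibling.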
Corresponding to the output method, the present invention also provides an output device for markup language documents that enables a mobile device with limited resources to output large markup language documents.
To achieve the above object, the present invention adopts the following technical solution:
An output device for a markup language document, comprising:
a data block selection unit, configured to select at least two data blocks in sequence;
a reading unit, configured to read the tags and tag attributes stored in the blocking sign;
a parsing unit, configured to parse each blocking sign together with the data block immediately following it, so that each data block forms a markup language document tree structure;
a merging unit, configured to merge two or more markup language document tree structures into one tree structure;
an output unit, configured to continuously output the merged tree structure.
When a mobile device with limited resources outputs a large markup language document, the data block selection unit selects at least two data blocks in sequence according to the progress of output, the parsing unit parses the selected data blocks separately, and the merging unit merges the results. The output unit outputs the merged content. A large markup language document can thus be parsed and output in several passes. The amount of data handled in each pass is very small compared with the whole document, so the resources of the mobile device can meet the user's speed requirements while the logical structure and appearance of the original document are preserved exactly.
Description of the drawings
Fig. 1 is a schematic diagram of the tree structure of a complete markup language document;
Fig. 2 is a schematic diagram of a partial tree structure of a markup language document;
Fig. 3 is a schematic diagram of the tree structure of the preceding data block when the tree is split inside a node;
Fig. 4 is a schematic diagram of the tree structure of the following data block when the tree is split inside a node;
Fig. 5 is a schematic diagram of the tree structure of the preceding data block when the tree is split between nodes;
Fig. 6 is a schematic diagram of the tree structure of the following data block when the tree is split between nodes.
Embodiments
To solve the problem that mobile devices with limited resources lack the performance to process (display, print) large markup language documents, the invention provides a storage method and an output method for markup language documents. Corresponding to the methods, the invention also provides a storage device and an output device. The storage method breaks the whole document into parts for storage or transmission; the output method reassembles the parts into the original document for printing or display. Each is introduced below:
A storage method for a markup language document, where the markup language document comprises content information, tags, and tag attributes; the tags and tag attributes express the logical structure, appearance attributes, and encoding format of the content information; a tag delimits the range of content information over which its tag attributes apply; and the tags comprise global tags and local tags.
The method is characterized in that it comprises the following steps:
dividing the markup language document into at least two data blocks;
adding a blocking sign between two adjacent data blocks;
saving the global tags, the unclosed local tags, and their tag attributes into the blocking sign in order;
storing the data blocks and the blocking signs.
The storage device for markup language documents corresponding to this method comprises:
a blocking unit, configured to divide the markup language document into at least two data blocks;
a blocking sign adding unit, configured to add a blocking sign between two adjacent data blocks;
a tag saving unit, configured to save the global tags, the unclosed local tags, and their tag attributes into the blocking sign, in the order in which they appear in the markup language document;
a storage unit, configured to store the content information of each data block and the blocking signs.
So that a mobile device with limited resources can store a large markup language document, the invention divides the document into several data blocks and saves the global tags and local tags that affect the logical structure, appearance attributes, and encoding format of the content information. During storage, the data blocks are stored sequentially, one block at a time. This block-wise storage mechanism lays the foundation for merging the data blocks at output.
An output method for a markup language document, comprising the following steps:
selecting at least two data blocks in sequence;
reading the tags and tag attributes stored in the blocking sign corresponding to each data block;
parsing each blocking sign together with the data block immediately following it, so that each data block forms a markup language document tree structure;
merging two or more markup language document tree structures into one tree structure;
continuously outputting the merged tree structure.
The output device for markup language documents corresponding to this method comprises:
a data block selection unit, configured to select at least two data blocks in sequence;
a reading unit, configured to read the tags and tag attributes stored in the blocking sign;
a parsing unit, configured to parse each blocking sign together with the data block immediately following it, so that each data block forms a markup language document tree structure;
a merging unit, configured to merge two or more markup language document tree structures into one tree structure;
an output unit, configured to continuously output the merged tree structure.
When a mobile device with limited resources outputs a large markup language document, it does not need to read the whole document at once. It only needs to select at least two data blocks in sequence according to the progress of output, parse the selected data blocks separately, merge the results, and output the merged content. A large markup language document can thus be parsed and output in several passes. The amount of data handled in each pass is very small compared with the whole document, so the resources of the mobile device can meet the user's speed requirements while the logical structure and appearance of the original document are preserved exactly.
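The reading step of the output method can be sketched as follows. This is only an illustration under stated assumptions: the stored stream is taken to be one string in which each blocking sign has the `<block/><!--NEWBLK ... -->` form, and the function name `blocks_with_context` is hypothetical. Each data block is paired with the tags saved in the blocking sign that precedes it, so that the two can be parsed together.

```python
import re

# Assumed textual form of a blocking sign: <block/><!--NEWBLK saved-tags -->
SIGN = re.compile(r"<block/><!--NEWBLK\s*(.*?)-->", re.DOTALL)


def blocks_with_context(stored):
    """Return (saved_tags, block_text) pairs in document order.

    The first block has no preceding blocking sign, so its saved tags are empty.
    """
    # With one capture group, split() yields [block0, tags1, block1, tags2, block2, ...].
    parts = SIGN.split(stored)
    pairs = [("", parts[0])]
    for i in range(1, len(parts), 2):
        pairs.append((parts[i], parts[i + 1]))
    return pairs


stored = "<p><strong>abc<block/><!--NEWBLK <p><strong>-->def</strong></p>"
pairs = blocks_with_context(stored)
# Parsing saved_tags + block_text gives each block a well-formed context.
parse_inputs = [tags + text for tags, text in pairs]
```

Here the second parse input is `<p><strong>def</strong></p>`, which can be parsed on its own without reading the first block again.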
The invention is described in further detail below, taking an HTML document as an example.
The following is an HTML document (part of its content is omitted):
<html>
<head>
<meta http-equiv="Content-type" content="text/html; charset=gb2312">
<title>The secret behind the Google search engine: the balancing art of search algorithms</title>
</head>
<body>
<div id="wrap">
<div id="texttitle">
<h1>The secret behind the Google search engine: the balancing art of search algorithms</h1>
</div>
<div id="textbody">
<p><font face="楷体_GB2312">Lead: On June 4, Beijing time, foreign media recently published an analysis saying that, thanks to its powerful search engine, Google is far ahead of rivals such as Yahoo and Microsoft in the web search market. So what is the secret behind the Google search engine?</font></p>
<p><strong>Search is the top priority</strong></p>
<p>As the company keeps growing, Google has begun moving into many fields, including online maps, digital libraries, video sharing, and desktop software. But Google's top priority remains the search engine. With the Google search engine, users can find the content they need in a vast sea of information. It is precisely because of its outstanding search engine that Google has become the most visited, most profitable, and even most powerful Internet company.</p>
<p>However, the search engine also draws the most complaints against Google. Every day millions of users are disappointed after using the Google search engine because they cannot find the hotel, prescription, or personal background they need. Google can often help users find what they want, but not always. For this reason, Amit Singhal and hundreds of other engineers have been working to improve the Google search engine, hoping to narrow the gap between "often" and "always".</p>
<p><strong>The Internet's source of life</strong></p>
<p>Singhal is the master of the Google search engine's ranking algorithm. This algorithm determines which web pages best answer a user's question, and it is an important part of Google's core team, the search quality department. For a long time this department has been a mystery, and Google seldom lets its members appear in public. Google thinks very highly of Singhal and his team and regards them as the company's most fundamental competitive advantage. Google believes that to fend off the advances of Yahoo and Microsoft, it must reduce how often users are disappointed, and in this process the search quality department plays an irreplaceable role.</p>
<p>Federated Media chief executive John Battelle said: "The core value Google has built is its ranking algorithm." Data show that one quarter to one half of visitors to online stores come from search engines; many users bypass the pages of online media sites and go directly to the specific pages they need through Google. He said: "From these facts it is clear that Google has become the 'source of life' of the Internet, and no one can do without it."</p>
<p>Users cannot see the algorithms and the art behind the search engine, but in fact Google's search quality team makes several improvements to the search engine algorithms every week. Thanks to their efforts, the Google search engine understands users' true intentions more effectively. For example, some people search for "apple" to find information about the fruit, while others are researching Apple's Mac or iPod. Although the search terms are identical, the users' intentions are completely different. Singhal said: "Over the past few years, search has changed from 'give me what I typed' to 'give me what I want'."</p>
<p><strong>lost inside story</strong></p>
<p>Singhal, 39 this year, is Indian and joined Google in 2000. He is currently a researcher at Google, a position Google created specifically for elite engineers. Not long ago, a New York Times reporter was allowed to interview Singhal and other members of the search quality team. Although Google guards many matters closely, the reporter still obtained many previously unknown inside stories.</p>
<p>In continually updating its search engine, the greatest challenge Google faces is its ever-growing scale. Google has become the most visited website in the world; it supports 112 languages, indexes tens of billions of web pages, and handles hundreds of millions of search requests every day. Worse, many web pages are created purely to attract attention and are full of advertisements, pornography, and financial scams. Users therefore hope that Google will exclude these useless pages from search results and help them find the most relevant information.</p>
<p>Udi Manber, head of Google's search quality team, said: "Users' expectations are very high. When we first launched the search service, users felt it was a miracle if they could find what they wanted through the search engine. The situation now is completely different: if users cannot find what they want in the first three results on the page, they think the search engine has a problem."</p>
<p>Google's search business fully reflects its unconventional management style. Google has hundreds of engineers, including top search experts from academic institutions, who are usually loosely organized and work on projects that interest them. In search, however, Google carefully and rigorously reviews each engineer's individual work to ensure that a new search algorithm brings more benefit than harm. In most cases, improvement and quality control both involve a balancing act. Manber said: "Improvements always bring both positive and negative effects, and we must weigh which is greater. An improvement with only positive effects and no negative ones does not exist."</p>
<p><strong>Secrets of the search team revealed</strong></p>
<p>Google's search quality team works in Building 43 of Google's office campus. Because company co-founder Larry Page is fascinated by space travel, a full-size replica of SpaceShipOne occupies the lounge of Building 43. The replica also reminds visitors at all times that Google is rising as fast as a rocket. The offices of Singhal and three other top engineers are on the top floor of Building 43, and the blackboard near his desk is filled everywhere with charts, problems, and mathematical formulas, and of course with the varied opinions users have raised about Google's engine.</p>
<p>All Google employees ... ... thereby solving this problem.</p>
......
......
......
<p><strong>Talent is the foundation of success</strong></p>
</div>
</div>
</body>
</html>
The skeleton structure of the HTML document is as follows:
An HTML document has two parts: the head (HEAD tag) and the document body (BODY tag), that is, the <HEAD></HEAD> and <BODY></BODY> tags seen above.
Inside the head part <HEAD></HEAD> there is another pair of tags, <TITLE></TITLE>.
The content information (text) inside <TITLE></TITLE> is the title of the HTML document.
The <HTML></HTML> pair of tags indicates that this is an HTML file. They are usually located at the very top and very bottom of the HTML document, enclosing all the source code.
A markup language document comprises content information, tags, and tag attributes. The tags and tag attributes express the logical structure, appearance attributes, and encoding format of the content information; a tag delimits the range of content information over which its tag attributes apply. In a markup language document, tags generally appear in pairs (except empty tags), and the content information between the two tags of a pair is the range over which the tag attributes take effect.
Tags are divided into global tags and local tags.
A global tag is one whose tag and attributes express the logical structure, appearance attributes, and encoding format of the entire content information of the markup language document, such as the <head> tag mentioned above. The <head> tag and its attributes express the logical structure, appearance attributes, and encoding format of the content information of the whole document. In other words, the attributes of the <head> tag also apply to content information outside the pair of <head> tags.
A local tag is one whose tag and attributes express the logical structure, appearance attributes, and encoding format of part of the content information of the markup language document. An unclosed local tag is a paired local tag whose start tag has appeared but whose end tag has not yet appeared.
On the basis of the existing syntax, the invention additionally defines a blocking sign, which is described in detail below:
A blocking sign consists of two parts: the end tag of the previous data block and the start tag of the next data block.
The end tag of the previous data block is written as: <block/>;
The start tag of the next data block comprises: [new-data-block start tag head, tags and attributes of logical structure and appearance, new-data-block start tag tail].
The new-data-block start tag head is written as: <!--NEWBLK;
The new-data-block start tag tail is written as: -->;
The blocking sign is therefore written as: <block/><!--NEWBLK [tags and attributes of logical structure and appearance]-->. As can be seen, the syntax of the blocking sign is consistent with that of ordinary HTML tags, so during parsing the blocking sign can be parsed in the same way as HTML tags.
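As an illustration of this syntax, a blocking sign can be assembled from the list of saved tags. This is a minimal sketch: the function name `make_blocking_sign` and the single space placed after `NEWBLK` are assumptions made here, not details given in the text.

```python
BLOCK_END = "<block/>"       # end tag of the previous data block
NEWBLK_HEAD = "<!--NEWBLK"   # head of the start tag of the next data block
NEWBLK_TAIL = "-->"          # tail of the start tag of the next data block


def make_blocking_sign(saved_tags):
    """Join the saved start tags, in document order, into one blocking sign.

    The separator after NEWBLK is an assumption; any whitespace would do.
    """
    return BLOCK_END + NEWBLK_HEAD + " " + "".join(saved_tags) + NEWBLK_TAIL


sign = make_blocking_sign(['<div id="textbody">', "<p>", "<strong>"])
```

Because the saved tags sit inside an HTML comment, a parser that does not know about blocking signs would simply skip them.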
The tags and attributes of logical structure and appearance store the tags and tag attributes that affect how the content information of a data block is rendered, chiefly the global tags and the unclosed local tags.
The process of dividing the above HTML document into several data blocks is described below:
1. Determine the data block size in bytes according to the resources of the device.
HTML documents are mainly displayed on the screens of computers and mobile terminals. Common devices (handheld devices in particular) can process 4 KB of data at a time, and 4 KB of data can express 1000 Chinese characters, so the predetermined data block size is about 4 KB. The block size can of course be chosen to suit the specific environment.
2. Count through the markup language document starting from the document head.
Starting from the <html> tag at the top of the HTML document, every character is counted; the characters of tags, tag attributes, and content information are all counted.
3. When the count reaches the predetermined value (4 KB), split at the position of the current byte.
The count reaches the predetermined 4 KB at the second byte of the character translated as "people" in the heading "lost inside story" (page 10, line 4 of the specification), so the split falls in the middle of that heading. A blocking sign is added at this position, forming the first data block.
4. Save the tags and tag attributes that affect how the content information of the data block is rendered into the blocking sign.
So that the logical relationships and appearance attributes of the content information remain unchanged when data blocks are merged, the global tags, the unclosed local tags, and their tag attributes that affect the content information of the data block are saved into the blocking sign.
For example, the <head> tag is a global tag, and its attribute is: meta http-equiv="Content-type" content="text/HTML; charset=gb2312". The attribute charset=gb2312 sets the encoding format of the full text, so it also applies to content information outside the pair of <head> tags.
The <p> tag is an unclosed local tag whose tag attribute is <strong>; this tag attribute is saved as well.
The blocking sign that closes the first data block is: <block/><!--NEWBLK
<HTML><head><meta http-equiv="Content-type" content="text/HTML; charset=gb2312"><title>The secret behind the Google search engine: the balancing art of search algorithms</title></head><body><div id="wrap"><div id="textbody"><p><strong>-->.
This completes the first data block.
The first data block is:
<HTML>
<head>
<meta http-equiv="Content-type" content="text/HTML; charset=gb2312">
<title>The secret behind the Google search engine: the balancing art of search algorithms</title>
</head>
<body>
<div id="wrap">
<div id="texttitle">
<h1>The secret behind the Google search engine: the balancing art of search algorithms</h1>
</div>
<div id="textbody">
<p><font face="楷体_GB2312">Lead: On June 4, Beijing time, foreign media recently published an analysis saying that, thanks to its powerful search engine, Google is far ahead of rivals such as Yahoo and Microsoft in the web search market. So what is the secret behind the Google search engine?</font></p>
<p><strong>be not the people
Dividing the second data block
Counting starts from the first character of "knowing". When the count reaches the first byte of " " in "being coated with has everywhere expired chart" (description page 11, line 6), the predetermined 4 KB is reached.
Because " " is a double-byte character, its first and second bytes together form one Chinese character, so the document cannot be split at this point; it must be split after the second byte of " ". The second data block therefore ends immediately after the second byte of " ".
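The boundary adjustment for double-byte characters can be illustrated with a short Python sketch. It assumes the data is valid GB2312 (EUC-CN), where a byte of 0x80 or above starts a two-byte character; the function name `safe_split_offset` is hypothetical.

```python
def safe_split_offset(data: bytes, limit: int) -> int:
    """Return the first character boundary at or after `limit` bytes.

    The scan starts at offset 0 because a trailing byte of a GB2312
    character also has its high bit set, so a boundary cannot be
    recognised by looking at one byte in isolation.
    """
    pos = 0
    while pos < len(data) and pos < limit:
        # Bytes >= 0x80 begin a two-byte character; ASCII bytes stand alone.
        pos += 2 if data[pos] >= 0x80 else 1
    return min(pos, len(data))
```

If the limit falls on the first byte of a double-byte character, the returned offset lies just after its second byte, which matches the rule described above: the block may exceed the predetermined size by one byte rather than cut a character in half.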
The global marks and the unclosed local marks that affect the content information of the data block are saved in the blocking sign together with their mark attributes.
For example, the head mark and its attribute (<meta http-equiv="Content-type" content="text/HTML; charset=gb2312">) also apply to the content information of the second data block, so they are also saved in the blocking sign of the second data block.
The blocking sign formed for the second data block is:
<block/><!--NEWBLK <html><head><meta http-equiv="Content-type" content="text/html; charset=gb2312"><title>The secret behind the Google search engine: the balancing art of search algorithms</title></head><body><div id="wrap"><div id="textbody"><p>-->
The blocking sign of the second data block is added after the second byte of " ", forming the second data block.
The second data block is:
Figure C200710187142D00211
The inside story of knowing</strong></p>
......
<p><strong>The search team reveals its secrets</strong></p>
<p>Google's search quality team works in Building 43 of the Google office campus. Because company co-founder Larry Page is fascinated by space travel, a full-size replica of SpaceShipOne occupies the lounge of Building 43. The replica also reminds visitors that Google is rising as fast as a rocket. The offices of Singhal and three other top engineers are on the top floor of Building 43, and the blackboard near his desk is covered everywhere with
Figure C200710187142D00212
Dividing the third data block
Counting starts from the first character of "figure". The end of the document is reached before the count reaches 4 KB, so no further split is needed. The document is thus divided into three data blocks.
During blocking, when a blocking sign would fall inside an indivisible unit, either the last character of the indivisible unit is located and the blocking sign is added after it, or the first character of the indivisible unit is located and the blocking sign is added before it. An indivisible unit is a group of characters that should not be stored or processed separately, such as a mark, a multi-byte character, a mathematical formula, a chemical formula, or an image.
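For the case where the indivisible unit is a mark, the two placement options (after the last character or before the first character) can be sketched as follows. This is an illustrative assumption rather than the patent's procedure: `adjust_split_outside_mark` is a hypothetical name, and the sketch ignores `>` characters inside attribute values or comments.

```python
def adjust_split_outside_mark(text: str, index: int, move_forward: bool = True) -> int:
    """Move a proposed split position out of a <...> mark.

    If `index` lies inside a mark, return the position just after the
    mark's closing '>' (move_forward=True) or just before its opening '<'
    (move_forward=False). Otherwise return `index` unchanged.
    """
    last_open = text.rfind("<", 0, index)
    last_close = text.rfind(">", 0, index)
    if last_open > last_close:          # an opened mark has not been closed yet
        if not move_forward:
            return last_open
        end = text.find(">", index)
        return len(text) if end == -1 else end + 1
    return index
```

Other indivisible units named above, such as formulas or images, would need their own boundary detection; only the mark case is shown here.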
An HTML document has a typical tree structure; its marks and its content information are all regarded as nodes.
The HTML mark is the root node of the HTML document tree, at position A in Fig. 1.
<head>
<meta http-equiv="Content-type" content="text/html; charset=gb2312">
<title>The secret behind the Google search engine: the balancing art of search algorithms</title>
</head> (hereinafter the head node) is a child node of the root node, at position B in Fig. 1.
<body> is the document body mark, at position C in Fig. 1. The other marks are descendant nodes of <body>, as shown in Fig. 1.
When a marking language document is divided into blocks, there are two cases: the tree is split inside a node, or the tree is split between nodes.
Merging a marking language document correspondingly has two cases: merging when the tree was split inside a node, and merging when the tree was split between nodes.
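The tree view of an HTML document described above can be made concrete with a small parser sketch built on Python's standard `html.parser` module. The `Node` and `TreeBuilder` classes and the `parse_tree` helper are hypothetical names introduced for illustration; content information is stored as `#text` leaf nodes, which is one possible modelling choice.

```python
from dataclasses import dataclass, field
from html.parser import HTMLParser


@dataclass
class Node:
    name: str                                   # mark name, or "#text" for content
    text: str = ""
    children: list = field(default_factory=list)


class TreeBuilder(HTMLParser):
    VOID = {"meta", "link", "br", "hr", "img", "input"}   # marks with no end tag

    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self._stack = [self.root]               # path from the root to the open node

    def handle_starttag(self, tag, attrs):
        node = Node(tag)
        self._stack[-1].children.append(node)
        if tag not in self.VOID:
            self._stack.append(node)

    def handle_endtag(self, tag):
        if len(self._stack) > 1 and self._stack[-1].name == tag:
            self._stack.pop()

    def handle_data(self, data):
        if data.strip():                        # skip whitespace-only text
            self._stack[-1].children.append(Node("#text", data))


def parse_tree(html: str) -> Node:
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root
```

Parsing a complete document yields the `html` node as the single child of the synthetic root, with `head` and `body` as its children, corresponding to positions A, B and C in Fig. 1.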
Fig. 2 shows a partial node model of an HTML document; this part of the document is divided into two blocks.
When the tree is split inside a node, the data blocks are merged as follows:
(1) Read the content information of the two data blocks and parse each into its own tree structure, as shown in Figs. 3 and 4.
(2) The presence of the blocking sign indicates that nodes S1 and S2 are the nodes to be merged.
(3) Find all ancestor nodes of S1 and S2: B1 and A1 for S1, and B2 and A2 for S2.
(4) Because S1 and S2 are leaf nodes with identical marks (global and local marks) and identical attributes, they are merged directly into one leaf node S. During the merge the mark and its attributes stay unchanged; only the content information is concatenated in order, and the content information length count is updated.
(5) Merge the parent nodes of S1 and S2 level by level from the bottom up: B1 is used as B, the content information length count of B2 is added to that of B1, and B2 is deleted; all child nodes of A2 are moved under A1 as child nodes of A1, the content information length count of A1 is updated, and A2 is deleted.
(6) After the merge is complete, the resulting marking language document tree is the same as in Fig. 2.
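Steps (4) and (5) above can be sketched as a recursive merge along the path that the blocking sign cut through. This is a minimal sketch under stated assumptions: the `Node` class and `merge_split_path` function are hypothetical, both trees are assumed to be cut along the same path down to a leaf, and the recursion reaches the leaf first, which gives the bottom-up order described in the steps.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    mark: str                                     # mark name, e.g. "p"
    attrs: dict = field(default_factory=dict)
    content: str = ""                             # content information (leaf nodes)
    children: list = field(default_factory=list)
    length: int = 0                               # content length count of the subtree


def merge_split_path(first: Node, second: Node) -> Node:
    """Merge `second` into `first`; both are halves of one original node."""
    if (first.mark, first.attrs) != (second.mark, second.attrs):
        raise ValueError("nodes on the split path must share mark and attributes")
    if not first.children and not second.children:
        # Step (4): S1 and S2 are halves of one leaf; append content in order.
        first.content += second.content
    else:
        # Step (5): merge the cut child pair, then move the remaining
        # children of the later half under the earlier half.
        moved = list(second.children)
        merge_split_path(first.children[-1], moved.pop(0))
        first.children.extend(moved)
    first.length += second.length                 # update the length count
    return first


# Usage: A1 > B1 > S1 from the earlier block, A2 > B2 > (S2, T2) from the later one.
s1 = Node("p", content="Hello, ", length=7)
a1 = Node("body", children=[Node("div", {"id": "textbody"}, children=[s1], length=7)], length=7)
s2 = Node("p", content="world", length=5)
t2 = Node("p", content="next", length=4)
a2 = Node("body", children=[Node("div", {"id": "textbody"}, children=[s2, t2], length=9)], length=9)
merged = merge_split_path(a1, a2)
```

After the call, `a1` plays the role of node A, and the nodes from the later block that were merged (A2, B2, S2) can be discarded.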
When the tree is split between nodes, the data blocks are merged as follows:
Here the split made during blocking falls between the second child node B and the third child node C of node A.
Fig. 5 shows the tree structure of the earlier data block, and Fig. 6 shows the tree structure of the later data block.
(1) Read the content information of the two data blocks and parse each into its own tree structure, as shown in Figs. 5 and 6.
(2) The presence of the blocking sign indicates that nodes B and C are the nodes that need merge processing.
(3) Find all ancestor nodes of B and C: A1 for B, and A2 for C.
(4) Because B and C both have child nodes, or because the marks or attributes of B and C are not identical, they cannot be merged directly into one node; how they are merged is decided by their ancestor nodes.
(5) Merge the parent nodes of B and C level by level from the bottom up: A1 is used as A, all child nodes of A2 are moved under A1 as child nodes of A1, the content information length count of A is updated, and A2 is deleted.
(6) After the merge is complete, the resulting marking language document tree is the same as in Fig. 2.
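When the split falls between nodes, only the shared parent is merged and the children stay whole. A minimal sketch of step (5), using a hypothetical `Node` class and `merge_between_nodes` function:

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    mark: str
    attrs: dict = field(default_factory=dict)
    children: list = field(default_factory=list)
    length: int = 0                       # content length count of the subtree


def merge_between_nodes(a1: Node, a2: Node) -> Node:
    """Use A1 as A: move all children of A2 under A1, then empty A2."""
    if (a1.mark, a1.attrs) != (a2.mark, a2.attrs):
        raise ValueError("A1 and A2 must be halves of the same node")
    a1.children.extend(a2.children)       # C follows B as a sibling
    a1.length += a2.length                # update the length count of A
    a2.children = []                      # A2 is no longer needed
    a2.length = 0
    return a1
```

Unlike the split-inside-a-node case, no child nodes are fused here: B and C keep their own marks, attributes and content, and simply become adjacent siblings under A.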
The merging of data blocks is described in further detail below using a concrete HTML document:
1. Select at least two data blocks in order
2. Read the marks and mark attributes stored in the blocking sign of each data block
This mainly obtains the global marks and the unclosed local marks;
The blocking sign of the first data block is: <block/><!--NEWBLK <HTML><head><meta http-equiv="Content-type" content="text/HTML; charset=gb2312"><title>The secret behind the Google search engine: the balancing art of search algorithms</title></head><body><div id="wrap"><div id="textbody"><p><strong>-->.
The blocking sign of the second data block is: <block/><!--NEWBLK <html><head><meta http-equiv="Content-type" content="text/html; charset=gb2312"><title>The secret behind the Google search engine: the balancing art of search algorithms</title></head><body><div id="wrap"><div id="textbody"><p>-->.
3. Parse each blocking sign together with the data block that immediately follows it, so that each data block forms a marking language document tree, and merge the two or more trees into one tree structure;
When each data block is parsed, the two tree nodes immediately before and after the blocking sign are recorded as S1 and S2, and it is determined whether the blocking sign lies inside a leaf node or between two nodes;
If the blocking sign lies inside a leaf node, S1 and S2 are merged into one node S; all ancestor nodes of S1 and S2 are then merged level by level from the bottom up (the case where the tree is split inside a node).
If the blocking sign lies between two nodes S1 and S2, S2 is made a sibling node of S1 and placed after S1; all ancestor nodes of S1 and S2 are then merged level by level from the bottom up (the case where the tree is split between nodes).
4. Output the merged tree structure continuously
The whole marking language document is output continuously for display or printing.
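Step 3 above relies on turning a blocking sign plus the data block that follows it into something a parser can read on its own. A minimal sketch, assuming the `<block/><!--NEWBLK ...-->` form shown in the examples; the names `BLOCK_SIGN` and `document_for_block` are hypothetical.

```python
import re

# Matches one blocking sign and captures the saved marks inside the comment.
BLOCK_SIGN = re.compile(r"<block/><!--NEWBLK\s*(.*?)-->", re.DOTALL)


def document_for_block(blocking_sign: str, following_block: str) -> str:
    """Prefix a data block with the marks saved in the preceding blocking sign."""
    match = BLOCK_SIGN.fullmatch(blocking_sign.strip())
    if match is None:
        raise ValueError("not a blocking sign")
    return match.group(1) + following_block
```

The returned string reopens the global marks and the unclosed local marks before the block's own content, so the block can be parsed into a tree without access to the earlier blocks.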
In practice, blocking signs are handled according to the resources of the information device:
(1) If the information device has abundant resources (for example a desktop PC or notebook computer with a fast CPU, memory of 200 MB or more that far exceeds the document size, and a large display), all blocking signs can be ignored and the whole document processed together.
(2) If the information device has relatively sufficient resources (for example a well-equipped mobile phone or mobile device with a CPU frequency of 200 MHz or more and memory of 64 MB or more that exceeds the document size), several adjacent blocks can be combined into one block group and processed together: the blocking signs between these blocks are ignored, and only the blocking sign before the first block is processed. In this way the processing and display device balances efficiency against resource use.
(3) If the information device has limited resources (for example a modestly equipped mobile phone or mobile device with a CPU frequency below 200 MHz, memory below 16 MB and a small display), every blocking sign must be recognized and processed: as processing progresses, two adjacent blocks are parsed separately and then merged. The document can still be output continuously with its original logic and appearance preserved, the response speed of the device remains acceptable to users, and the logic of the processing and display device does not become overly complex.
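The three cases above amount to a simple selection rule, sketched below. The thresholds are taken from the examples in the text, but the exact comparison logic is an assumption: the text leaves the range between 16 MB and 64 MB unspecified (treated here as the constrained case) and does not quantify "far exceeds". The names `Strategy` and `choose_strategy` are hypothetical.

```python
from enum import Enum


class Strategy(Enum):
    IGNORE_ALL_SIGNS = 1      # case (1): process the whole document at once
    GROUP_BLOCKS = 2          # case (2): process several adjacent blocks as a group
    BLOCK_BY_BLOCK = 3        # case (3): handle every blocking sign


def choose_strategy(memory_mb: float, cpu_mhz: float, doc_size_mb: float) -> Strategy:
    """Pick how blocking signs are handled from rough device resources."""
    if memory_mb >= 200 and memory_mb > doc_size_mb:
        return Strategy.IGNORE_ALL_SIGNS
    if memory_mb >= 64 and cpu_mhz >= 200 and memory_mb > doc_size_mb:
        return Strategy.GROUP_BLOCKS
    return Strategy.BLOCK_BY_BLOCK
```

A real device would likely also weigh display size and current load; those inputs are omitted here to keep the rule readable.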
A person of ordinary skill in the art will understand that all or part of the steps of the methods in the above embodiments can be carried out by hardware under the control of program instructions. The software corresponding to the embodiments can be stored in a computer-readable storage medium.

Claims (11)

1. A storage method for a marking language document, the marking language document comprising content information, marks and mark attributes, wherein the marks and mark attributes express the logical structure, appearance attributes and encoding format of the content information; a mark expresses the scope of content information on which a mark attribute acts; and the marks comprise global marks and local marks;
characterized in that the method comprises the following steps:
a step of dividing the marking language document into at least two data blocks;
a step of adding a blocking sign between two adjacent data blocks;
a step of saving the global marks and the unclosed local marks, together with their mark attributes, in the blocking sign in the order in which they appear in the marking language document;
a step of storing the data blocks and the blocking signs.
2. The storage method for a marking language document according to claim 1, characterized in that: a global mark is a mark that acts on the logical structure, appearance attributes and encoding format of the entire content information of the marking language document; a local mark is a mark that acts on the logical structure, appearance attributes and encoding format of part of the content information of the marking language document;
an unclosed local mark is a local mark, among marks that occur in pairs, whose start mark has appeared but whose end mark has not yet appeared.
3. The storage method for a marking language document according to claim 2, characterized in that: the content information of the marking language document comprises indivisible units; when a blocking sign would fall inside an indivisible unit, the last byte of the indivisible unit is located and the blocking sign is added after that last byte; or the first byte of the indivisible unit is located and the blocking sign is added before that first byte.
4. The storage method for a marking language document according to claim 3, characterized in that the indivisible unit is a group of characters that should not be stored or processed separately, such as a mark, a multi-byte character, a mathematical formula, a chemical formula, or an image.
5. The storage method for a marking language document according to claim 3, characterized in that the syntactic structure of the blocking sign is the same as the syntactic structure of the marking language document itself.
6. An output method for a marking language document, characterized by comprising the following steps:
a step of selecting at least two data blocks in order;
a step of reading the marks and mark attributes stored in the blocking sign corresponding to each data block;
a step of parsing each blocking sign together with the immediately following data block, so that each data block forms a marking language document tree;
a step of merging two or more marking language document trees into one tree structure;
a step of outputting the merged tree structure continuously.
7. The output method for a marking language document according to claim 6, characterized in that, after the step of outputting the merged tree structure continuously, the method further comprises a step of printing or displaying the output marking language document.
8. The output method for a marking language document according to claim 6, characterized in that the step of parsing each blocking sign together with the immediately following data block, so that each data block forms a marking language document tree, comprises:
forming a marking language document from the global and local marks extracted from the blocking sign together with the data block;
parsing that marking language document to form a marking language document tree;
in the marking language document tree, recording the two tree nodes before and after the blocking sign between two data blocks as the pre-sign tree node and the post-sign tree node, respectively.
9. The output method for a marking language document according to claim 6, characterized in that the step of merging two or more marking language document trees into one tree structure comprises: determining whether, before blocking, the pre-sign tree node and the post-sign tree node belonged to the same leaf node, meaning that the blocking sign lies inside a leaf node, or belonged to two different nodes, meaning that the blocking sign lies between two nodes;
if the blocking sign lies inside a leaf node, merging the pre-sign tree node and the post-sign tree node into one node;
then merging all ancestor nodes of the pre-sign tree node and the post-sign tree node level by level from the bottom up;
if the blocking sign lies between two nodes, making the post-sign tree node a sibling node of the pre-sign tree node and placing it after the pre-sign tree node;
then merging all ancestor nodes of the pre-sign tree node and the post-sign tree node level by level from the bottom up.
10. A storage device for a marking language document, the marking language document comprising content information, marks and mark attributes, wherein the marks and mark attributes express the logical structure, appearance attributes and encoding format of the content information; a mark expresses the scope of content information on which a mark attribute acts; and the marks comprise global marks and local marks;
characterized in that the device comprises:
a blocking unit, configured to divide the marking language document into at least two data blocks;
a blocking sign adding unit, configured to add a blocking sign between two adjacent data blocks;
a mark saving unit, configured to save the global marks and the unclosed local marks, together with their mark attributes, in the blocking sign in the order in which they appear in the marking language document;
a storage unit, configured to store the content information of each data block and the blocking signs.
11. An output device for a marking language document, characterized by comprising:
a data block selecting unit, configured to select at least two data blocks in order;
a reading unit, configured to read the marks and mark attributes stored in the blocking signs;
a parsing unit, configured to parse each blocking sign together with the immediately following data block, so that each data block forms a marking language document tree;
a merging unit, configured to merge two or more marking language document trees into one tree structure;
an output unit, configured to output the merged tree structure continuously.
CNB2007101871423A 2007-11-16 2007-11-16 Storage means and device for marking language document, and output method and device Expired - Fee Related CN100489840C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB2007101871423A CN100489840C (en) 2007-11-16 2007-11-16 Storage means and device for marking language document, and output method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB2007101871423A CN100489840C (en) 2007-11-16 2007-11-16 Storage means and device for marking language document, and output method and device

Publications (2)

Publication Number Publication Date
CN101158939A CN101158939A (en) 2008-04-09
CN100489840C true CN100489840C (en) 2009-05-20

Family

ID=39307042

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB2007101871423A Expired - Fee Related CN100489840C (en) 2007-11-16 2007-11-16 Storage means and device for marking language document, and output method and device

Country Status (1)

Country Link
CN (1) CN100489840C (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102355474B (en) * 2011-06-28 2014-12-31 无锡永中软件有限公司 Document sending method, document receiving method and document transmitting method
US9626368B2 (en) * 2012-01-27 2017-04-18 International Business Machines Corporation Document merge based on knowledge of document schema
CN110377884B (en) * 2019-06-13 2023-03-24 北京百度网讯科技有限公司 Document analysis method and device, computer equipment and storage medium

Also Published As

Publication number Publication date
CN101158939A (en) 2008-04-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20090821

Address after: No. 5, the Summer Palace Road, Beijing, Haidian District: 100871

Co-patentee after: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee after: Peking University

Co-patentee after: Beijing Founder Foread Media Technology Co.,Ltd.

Address before: No. 5, the Summer Palace Road, Beijing, Haidian District: 100871

Co-patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Co-patentee before: FOUNDER APABI TECHNOLOGY Ltd.

ASS Succession or assignment of patent right

Owner name: BEIDA FANGZHENG GROUP CO. LTD. LIDE TECHNOLOGY DEV

Free format text: FORMER OWNER: BEIDA FANGZHENG GROUP CO. LTD. BEIJING FOUNDER FEIYUE MEDIA TECHNOLOGY CO., LTD.

Effective date: 20120214

C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20120214

Address after: 100871 Beijing the Summer Palace Road, Haidian District, No. 5

Co-patentee after: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee after: Peking University

Co-patentee after: Lide Technology Development Co.,Ltd.

Address before: 100871 Beijing the Summer Palace Road, Haidian District, No. 5

Co-patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Peking University

Co-patentee before: Beijing Founder Foread Media Technology Co.,Ltd.

CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20090520
