Embodiment
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
Fig. 2 shows a flowchart of a content annotation method according to an embodiment of the invention, comprising:
Step S10: obtaining a content fragment;
Step S20: matching the rules [R_f, R_t] against the content data in [S_f, S_t], and labeling each matched data item with the metadata tag of the rule it matched, to obtain a mapping relation list M, where S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of a rule template, R_t is the last rule of the rule template, and the rule template comprises a group of linearly ordered rules from R_f to R_t.
In the prior art, content is annotated manually, which is very tedious. In this embodiment, rules are constructed in advance and used to match the content fragment, so that each data item is matched automatically and the metadata tag created in advance in the rule is automatically assigned to each matched data item. Because the rules are created beforehand, these operations can be carried out by a computer, which improves the efficiency of content annotation.
In addition, in this embodiment the rules [R_f, R_t] are a group of linearly ordered rules. Such a rule template has a simple structure, so a user can easily create one for content documents of various business types, and the computer matches the linearly ordered rules one by one; the algorithm is simple to implement and efficient.
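As a rough illustration of the structure described above, a rule template can be modeled as a named, ordered list of rules, each carrying the metadata tag it assigns. The class and field names below (Rule, RuleTemplate, metadata_tag, min_occurs, max_occurs) are assumptions made for this sketch and are not taken from the embodiment. The six rule names follow the dictionary-entry example used later in this description, with the names of rules 3 and 5 inferred from the metadata example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Rule:
    """One rule of a template (hypothetical field names)."""
    name: str
    metadata_tag: str      # tag assigned to every data item the rule matches
    min_occurs: int = 1    # minimum occurrence count
    max_occurs: int = 1    # maximum occurrence count (>= min_occurs)


@dataclass
class RuleTemplate:
    """A named group of linearly ordered rules [R_f, R_t]."""
    name: str
    rules: List[Rule] = field(default_factory=list)

    @property
    def first(self) -> Rule:   # R_f
        return self.rules[0]

    @property
    def last(self) -> Rule:    # R_t
        return self.rules[-1]


# Illustrative template with six ordered rules; each rule tags its
# matches with its own name.
rule_names = ["headword", "traditional form", "pinyin",
              "strokes", "radical", "definition"]
entry_template = RuleTemplate(
    name="entry",
    rules=[Rule(name=n, metadata_tag=n) for n in rule_names],
)
```

Keeping the rules in a plain list preserves the linear order that the matching procedure relies on, and the template name gives other templates something to reference.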
Fig. 3 shows a flowchart of matching the rules [R_f, R_t] against the content data in [S_f, S_t] according to a preferred embodiment of the invention, comprising:
1. Set the current rule R_c to R_f;
2. Perform the matching of R_c starting from S_f, to obtain the data items matched by R_c, a success flag, and an end position S_r; label the data items matched by R_c with the metadata tag in R_c, to obtain a mapping relation list M_r;
3. Judge whether the success flag is valid;
4. If so, add M_r to M; otherwise, end the process;
5. Judge whether R_c is R_t; if so, end the process;
6. Otherwise, judge whether S_r is S_t; if so, end the process;
7. Otherwise, set S_f to S_r, set R_f to the rule following R_c, and return to step 1.
Because [R_f, R_t] is a group of linearly ordered rules, this preferred embodiment uses a simple loop that automatically applies all of the rules in [R_f, R_t], in order, to the content data of the content fragment [S_f, S_t]. The process is simple and easy to implement on a computer.
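The loop of Fig. 3 can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the embodiment's implementation: a rule is modeled as a hypothetical (metadata_tag, matcher) pair, positions are character offsets into a string, and the helper take_while is invented here purely to have something to match with.

```python
from typing import Callable, List, Tuple

# Hypothetical rule shape: (metadata_tag, matcher). The matcher takes the
# content and a start position and returns (success, end_position, items).
Matcher = Callable[[str, int], Tuple[bool, int, List[str]]]
Rule = Tuple[str, Matcher]


def match_rules(content: str, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Apply the ordered rules [R_f, R_t] to the content [S_f, S_t]; return M."""
    mapping: List[Tuple[str, str]] = []              # M
    start, end = 0, len(content)                     # S_f, S_t
    for tag, matcher in rules:                       # R_c runs from R_f to R_t
        ok, stop, items = matcher(content, start)    # step 2
        if not ok:                                   # steps 3-4: flag invalid
            break
        mapping.extend((tag, item) for item in items)  # add M_r to M
        if stop >= end:                              # step 6: S_r reached S_t
            break
        start = stop                                 # step 7: S_f := S_r
    return mapping                                   # step 5: loop ends after R_t


def take_while(pred: Callable[[str], bool]) -> Matcher:
    """Illustrative matcher: longest run of characters satisfying pred."""
    def run(content: str, start: int) -> Tuple[bool, int, List[str]]:
        stop = start
        while stop < len(content) and pred(content[stop]):
            stop += 1
        found = stop > start
        return found, stop, [content[start:stop]] if found else []
    return run


rules = [("letters", take_while(str.isalpha)),
         ("digits", take_while(str.isdigit))]
result = match_rules("abc123", rules)
```

Here the `for` loop plays the role of steps 1, 5, and 7 (advancing R_c), while the two `break` statements correspond to the two early exits in steps 4 and 6.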
Preferably, R_c comprises a data matching condition, and performing the matching of R_c comprises: using the data matching condition to match each data item in the content data in [S_f, S_t], and setting the success flag accordingly. This preferred embodiment provides a condition matching rule, which identifies the data items in a content fragment by evaluating a condition.
Preferably, R_c further comprises a final-position flag. When the final-position flag is invalid, it indicates that the data matching condition is an interval condition; when the final-position flag is valid, it indicates that the data matching condition is a position condition. An interval condition specifies a format rule that holds for the data over a continuous interval; the corresponding data item is the data in the continuous range that begins at the end position of the previous data item and satisfies the format rule. A position condition specifies a format rule that holds for the data at an end position; the corresponding data item is the data between a start point, which is the end position of the previous data item, and an end point, which is the position that satisfies the format rule. Here, a format rule describes a regular feature exhibited by the data. This preferred embodiment provides interval conditions and position conditions for the condition matching rule: when the user can determine the features of the business content over a continuous interval, an interval condition can be used for matching; when the user can determine the features of the business content at a certain position, a position condition can be used. The preferred embodiment can therefore meet the content annotation needs of many different types of business content.
Preferably, the format rule comprises at least one of the following: a content format rule, a presentation format rule, a tag format rule, and an Any rule. The content format rule specifies a regular feature that the data exhibits in the document content; the presentation format rule specifies a regular feature that the data exhibits in the rendered layout; the tag format rule specifies a regular feature that the data exhibits in the application logic; the Any rule indicates that any data satisfies the matching condition. Building on the preceding preferred embodiment, this preferred embodiment specifies several kinds of format rules, and can thus better meet the content annotation needs of different types of business content.
Preferably, R_c comprises a repeated-rule count, which specifies how many of the rules in [R_f, R_t] are to be applied repeatedly. This preferred embodiment provides a repeat matching rule. For example, a dictionary entry usually contains only one headword, so the repeat matching rule is clearly unnecessary for identifying the headword. An entry may, however, contain several senses, so the repeat matching rule of this preferred embodiment is well suited to identifying them.
Preferably, a rule comprises: a minimum occurrence count with value N, indicating that the data item must be matched at least N times, where N is a non-negative integer; and a maximum occurrence count with value M, indicating that the data item may be matched at most M times, where M is a positive integer and M >= N.
Preferably, the content annotation method further comprises: traversing each mapping relation in M and recording each metadata tag together with its corresponding data item, to build metadata items; assembling the metadata items into a metadata item table; and attaching the metadata item table to the content fragment.
Preferably, the content annotation method further comprises: traversing each mapping relation in M and recording each metadata tag together with its corresponding data item, to build metadata items; and attaching each metadata item to the content fragment at the continuous interval or end position corresponding to the matched data item.
The two preferred embodiments above provide two simple schemes for saving the mapping relation list that has been built.
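The two ways of saving the mapping relation list might look roughly like this. The tuple layout of a mapping relation (tag, data item, start offset, end offset) and the function names are assumptions made for illustration only; the description does not prescribe a storage format.

```python
from typing import Dict, List, Tuple

# Hypothetical shape of one mapping relation in M:
# (metadata_tag, data_item, start_offset, end_offset)
Mapping = Tuple[str, str, int, int]


def build_item_table(mappings: List[Mapping]) -> List[Dict[str, str]]:
    """Scheme 1: collect all metadata items into one table for the fragment."""
    return [{"tag": tag, "value": item} for tag, item, _, _ in mappings]


def attach_by_position(mappings: List[Mapping]) -> Dict[Tuple[int, int], str]:
    """Scheme 2: key each metadata tag by the interval of its data item."""
    return {(start, end): tag for tag, _, start, end in mappings}


fragment = "open(kai)"
M = [("headword", "open", 0, 4), ("pinyin", "kai", 5, 8)]

# Scheme 1: the table travels with the fragment as a whole.
annotated = {"content": fragment, "metadata": build_item_table(M)}

# Scheme 2: each tag is tied to the interval it was matched on.
by_position = attach_by_position(M)
```

The first scheme keeps the content untouched and is easy to serialize; the second keeps enough positional information to mark the tags inside the content later.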
Preferably, the metadata tag conforms to XML, where an empty metadata tag indicates that the data item carrying it is to be ignored when metadata items are attached. XML is a widely used markup language in the industry, so specifying metadata tags in XML improves the generality of the method. In addition, the empty tag makes it possible to handle data in the content fragment that is not identified, which improves compatibility with content documents.
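A small sketch of how empty tags could be honored when metadata items are written out as XML, using Python's standard xml.etree.ElementTree module. The function name to_xml, the root element name, and the sample data are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def to_xml(mappings: List[Tuple[str, str]], root_tag: str = "metadata") -> str:
    """Serialize (metadata_tag, data_item) pairs, skipping empty tags."""
    root = ET.Element(root_tag)
    for tag, item in mappings:
        if not tag:          # empty tag: the item was matched but is ignored
            continue
        ET.SubElement(root, tag).text = item
    return ET.tostring(root, encoding="unicode")


# The "(" is consumed by a rule with an empty tag, so it advances the
# match position without producing a metadata item.
M = [("headword", "open"), ("", "("), ("strokes", "4")]
xml_text = to_xml(M)
```

With an empty tag, a rule can still consume separators or other filler so that later rules start at the right position, without that filler showing up in the metadata.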
Preferably, the content annotation method further comprises: analyzing the regularities exhibited by each content fragment in a content document; and creating a rule template according to those regularities, the rule template comprising the rules [R_f, R_t]. With a rule template created in advance, a single rule template can be shared when processing multiple electronic documents with similar presentation, so the rules [R_f, R_t] need not be re-created each time, which improves the reusability of the content annotation work.
Preferably, R_c comprises a referenced template name, which indicates that the rule template with that name is to be referenced. This preferred embodiment provides a template reference rule, which reduces the amount of development work.
A rule template comprises a group of linearly ordered rules and can be given a name, so that: 1) the rule template can be stored and applied to other similar content fragments, much like a pattern; and 2) other rule templates can reference an already defined rule template by this name.
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the invention; this rule template, named "entry", comprises 6 linearly ordered matching rules. Fig. 5 shows a schematic diagram of the entry metadata information obtained by applying the entry rule template of Fig. 4 to the dictionary entry of Fig. 1.
In this preferred embodiment, rules fall into three classes: condition matching rules, repeat matching rules, and template reference rules. Every rule comprises the following attributes:
Any rule can specify a minimum occurrence count and a maximum occurrence count; the match is considered successful when minimum occurrence count <= number of match items <= maximum occurrence count.
For example, in some exercise books the answer to a multiple-choice question may be displayed as text in the following format:
Answer: AC
In this case a continuous-condition rule ({capital letter, answer choice, 1..*}) can identify each selected answer ("answer choice" = "A", "answer choice" = "C").
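The answer-choice example can be sketched as follows: a continuous condition (here, "is a capital letter") is applied character by character, and the occurrence counts decide whether the match succeeds. The function name and parameters are hypothetical, and max_occurs=None is used here to stand for the unbounded "*".

```python
from typing import Callable, List, Optional, Tuple


def match_continuous(text: str, start: int, tag: str,
                     pred: Callable[[str], bool],
                     min_occurs: int = 1,
                     max_occurs: Optional[int] = None
                     ) -> Tuple[bool, int, List[Tuple[str, str]]]:
    """Match consecutive characters satisfying pred, one data item each.

    Stops consuming once max_occurs items are found (None means no limit)
    and succeeds when at least min_occurs items were matched.
    """
    pos = start
    items: List[Tuple[str, str]] = []
    while pos < len(text) and pred(text[pos]):
        if max_occurs is not None and len(items) == max_occurs:
            break
        items.append((tag, text[pos]))
        pos += 1
    return len(items) >= min_occurs, pos, items


line = "Answer: AC"
start = len("Answer: ")
# Rule {capital letter, answer choice, 1..*}
ok, end, items = match_continuous(line, start, "answer choice",
                                  str.isupper, 1, None)
```

Each capital letter becomes its own data item carrying the same metadata tag, which is why the rule yields two "answer choice" items for "AC".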
Condition rules can be further subdivided into two types: continuous-condition rules and end-condition rules.
Continuous-condition rules comprise the following attributes:
Attribute | Description
Format rule (condition) | Specifies the format rule that a match item must satisfy (over a continuous interval).
End-condition rules comprise the following attributes:
Here, the "include final position" flag is not what distinguishes continuous-condition rules from end-condition rules; it indicates whether the range of the match item includes the data at the final position.
Take the strokes rule in the example:
When "include final position" is TRUE ({text: "strokes,", TRUE, strokes, 1}), the identified match item is "4 strokes,", and the end position is after ","; the match item includes "strokes,";
When "include final position" is FALSE ({text: "strokes,", FALSE, strokes, 1}), the identified match item is "4", and the end position is after "4"; the match item does not include "strokes,".
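The effect of the "include final position" flag can be illustrated with a short sketch. The function name match_until and the sample text are assumptions. Because the English rendering puts a space between the digit and the word, the terminator used here includes that leading space so that the FALSE case yields exactly "4".

```python
from typing import Tuple


def match_until(text: str, start: int, terminator: str,
                include_final: bool) -> Tuple[bool, int, str]:
    """End-condition match: from start up to the first terminator.

    Returns (success, end_position, match_item). With include_final=True
    the terminator is part of the item; otherwise the item stops before it.
    """
    idx = text.find(terminator, start)
    if idx < 0:
        return False, start, ""
    end = idx + len(terminator) if include_final else idx
    return True, end, text[start:end]


text = "4 strokes, radical one"
with_final = match_until(text, 0, " strokes,", include_final=True)
without_final = match_until(text, 0, " strokes,", include_final=False)
```

Note that the end position differs between the two cases, so the flag also changes where the next rule starts matching.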
Repeat rules comprise the following attributes:
Attribute | Description
Repeated-rule count | Specifies how many matching rules are applied repeatedly.
Template reference rules comprise the following attributes:
Attribute | Description
Referenced template name | Specifies the already defined rule template to use.
Here, a template reference rule specifies that the rule interval [R_(f-template), R_(t-template)] identified by the referenced template name is to be used; this is a form of nested application.
For example, take the rule template "entry" in the example: if its last rule, the definition rule, is changed into a template reference rule ({template reference: "definition", definition, 1}), then when the definition rule is applied, the rule interval of the corresponding rule template "definition" (shown in Fig. 9) is found automatically and used for matching.
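Nested application through a template reference might be sketched like this. The rule encoding (tuples starting with "match" or "ref"), the helper literal_until, and the sample text are all assumptions chosen to keep the example short; they are not taken from the figures.

```python
from typing import Callable, Dict, List, Tuple

Matcher = Callable[[str, int], Tuple[bool, int, str]]
# Hypothetical rule encoding:
#   ("match", metadata_tag, matcher)  ordinary matching rule
#   ("ref", template_name)            template reference rule


def literal_until(stop_char: str) -> Matcher:
    """Illustrative matcher: text up to and including stop_char."""
    def run(text: str, start: int) -> Tuple[bool, int, str]:
        idx = text.find(stop_char, start)
        if idx < 0:
            return False, start, ""
        return True, idx + 1, text[start:idx + 1]
    return run


def apply_template(name: str, templates: Dict[str, List[tuple]],
                   text: str, start: int = 0
                   ) -> Tuple[bool, int, List[Tuple[str, str]]]:
    """Apply the named template's rules in order, resolving references."""
    mapping: List[Tuple[str, str]] = []
    pos = start
    for rule in templates[name]:
        if rule[0] == "ref":
            # Look up the referenced template by name and match its
            # rule interval at the current position.
            ok, pos, sub = apply_template(rule[1], templates, text, pos)
            mapping.extend(sub)
        else:
            _, tag, matcher = rule
            ok, pos, item = matcher(text, pos)
            if ok:
                mapping.append((tag, item))
        if not ok:
            return False, pos, mapping
    return True, pos, mapping


templates = {
    "definition": [("match", "sense", literal_until(".")),
                   ("match", "sense", literal_until("."))],
    "entry": [("match", "headword", literal_until(":")),
              ("ref", "definition")],
}
ok, end, mapping = apply_template(
    "entry", templates, "open:first sense.second sense.")
```

Resolving the reference by name at match time means the "definition" template can be edited or reused by other templates without touching "entry".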
The resulting metadata is as follows:
<metadata>
  <headword>open</headword>
  <traditional_form>open</traditional_form>
  <pinyin>kāi</pinyin>
  <strokes>4</strokes>
  <radical>one</radical>
  <definition>
    <sense>to open: ~ door | ~ curtain | public ~ | net ~ simultaneously.</sense>
    <sense>to get through; to open up: ~ road | ~ ore deposit | ~ pick | ~ open up.</sense>
  </definition>
</metadata>
The specific matching procedure is as follows:
(1) Observe and analyze the data of the content fragments, find the regularities they exhibit, and create the rule template "entry" shown in Fig. 4; this rule template comprises a group of ordered matching rules;
(2) Select the content fragment of the entry "open";
(3) Analyze the content fragment and, according to the rule template "entry", identify the matched data items and map them to the associated metadata tags;
(4) Build the metadata information shown in Fig. 5 from the identified matched data items and their associated metadata tags. This metadata information can be attached as a whole to the content fragment of the entry "open".
Step (3) above can be further refined; rule template identification and matching comprises the following steps:
(3.1) Set the start position S_f to the start of the content fragment (the character "open" at the beginning of the paragraph), and set the end position S_t to the end of the content fragment (the full stop at the end of the paragraph);
(3.2) Set the start rule R_f to the first rule of the rule template (rule 1, "headword"), and set the end rule R_t to the last rule of the rule template (rule 6, "definition");
(3.3) On the content data in the interval [S_f, S_t], perform rule matching over the interval [R_f, R_t] to obtain the mapping relation list M.
Step (3.3) above can be further refined into the following steps:
(3.3.1) Set the current rule R_c to the start rule R_f (rule 1, "headword");
(3.3.2) Starting from the start position S_f (the character "open" at the beginning of the paragraph), perform the matching of rule R_c on the content data in the interval [S_f, S_t], and obtain the success flag of the match (valid), the end position S_r of the match (the "(" after the character "open"), and the matched mapping relation list M_r ("headword" = "open");
(3.3.3) If the success flag obtained in step (3.3.2) is valid, record the matched mapping relation list M_r in the mapping relation list M;
(3.3.4) Judge whether matching needs to continue (here it does); if so, go to step (3.3.5); otherwise, terminate;
(3.3.5) Set the start position S_f to the end position S_r obtained in step (3.3.2) (the "(" after the character "open"); set the start rule R_f to the rule following the current rule R_c (rule 2, "traditional form"); go to step (3.3.1), and perform rule matching over the interval [R_f, R_t] on the content data in the interval [S_f, S_t].
Step (3.3.2) above can be further refined; matching a single rule comprises the following steps:
(3.3.2.1) If the current rule R_c is a condition matching rule, go to step (3.3.2.2); if the current rule R_c is a repeat matching rule, go to step (3.3.2.4); otherwise the current rule R_c is a template reference rule, so go to step (3.3.2.5);
(3.3.2.2) According to the condition, the "include final position" flag, and the occurrence counts in the current condition rule R_c, obtain the list of matched data items that satisfy the rule, together with the success flag;
(3.3.2.3) For each matched data item in the list obtained in step (3.3.2.2), establish a mapping relation with the metadata tag in the current condition rule R_c and add it to the mapping relation list M_r; terminate.
(3.3.2.4) According to the repeated-rule count and the occurrence counts in the current repeat rule R_c, repeatedly perform rule matching over the interval [R_(c-k), R_(c-1)], where k is the repeated-rule count, on the content data in the interval [S_f, S_t], to obtain the mapping relation list M_r; terminate.
(3.3.2.5) According to the referenced template name and the occurrence counts in the current template reference rule R_c, perform rule matching over the interval [R_(f-template), R_(t-template)] identified by the referenced template name on the content data in the interval [S_f, S_t], to obtain the mapping relation list M_r; terminate.
Take rule 1, "headword", as an example. This rule is a continuous-condition rule (that is, it uses an interval condition). All content starting from "open" whose font size is No. 3 satisfies the condition, so its matched data item is the "open" at the beginning of the paragraph.
Take rule 4, "strokes", as an example. This rule is an end-condition rule (that is, it uses a position condition). All content that starts after the pinyin and ends with the string "strokes," satisfies the condition, so its matched data item is "4 strokes,".
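The dispatch on rule type and the repeat rule can be sketched as below. This is a simplified sketch under several assumptions: rules are encoded as hypothetical tuples, a repeat rule re-applies the k condition rules immediately before it, an incomplete repetition is discarded, and template reference rules are left out for brevity.

```python
from typing import Callable, List, Tuple

Matcher = Callable[[str, int], Tuple[bool, int, str]]


def take_while(pred: Callable[[str], bool]) -> Matcher:
    """Illustrative interval-condition matcher."""
    def run(text: str, start: int) -> Tuple[bool, int, str]:
        end = start
        while end < len(text) and pred(text[end]):
            end += 1
        return end > start, end, text[start:end]
    return run


def run_rules(rules: List[tuple], text: str) -> List[Tuple[str, str]]:
    """Rules are ("cond", tag, matcher) or ("repeat", k, max_times)."""
    mapping: List[Tuple[str, str]] = []
    pos = 0
    for i, rule in enumerate(rules):
        if rule[0] == "cond":                  # condition matching rule
            _, tag, matcher = rule
            ok, pos, item = matcher(text, pos)
            if not ok:
                break
            mapping.append((tag, item))
        else:                                  # repeat matching rule
            _, k, max_times = rule
            group = rules[i - k:i]             # [R_(c-k), R_(c-1)], assumes k <= i
            for _ in range(max_times):
                trial, trial_pos, ok = [], pos, True
                for _kind, tag, matcher in group:
                    ok, trial_pos, item = matcher(text, trial_pos)
                    if not ok:
                        break
                    trial.append((tag, item))
                if not ok:                     # incomplete repetition: discard
                    break
                mapping.extend(trial)
                pos = trial_pos
    return mapping


rules = [
    ("cond", "number", take_while(str.isdigit)),
    ("cond", "sep", take_while(lambda ch: ch == ";")),
    ("repeat", 2, 10),                         # repeat the two rules above
]
mapping = run_rules(rules, "1;22;333;")
```

Matching each repetition into a temporary list first means a partially matched repetition leaves neither stray items in the mapping nor an advanced position behind.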
The rule template established in the above preferred embodiment is also applicable to other content fragments that follow the same regularities, for example the entry "californium". Its content fragment and metadata information are shown in Fig. 6 and Fig. 7: Fig. 6 shows another entry of the dictionary, and Fig. 7 shows a schematic diagram of the entry metadata information obtained by applying the entry rule template of Fig. 4 to the entry of Fig. 6.
Preferably, the content annotation method further comprises: taking a matched data item as a content fragment, and again performing the step of matching the rules [R_f, R_t] against the content data in [S_f, S_t]. This preferred embodiment provides a nesting mechanism that can handle more complex content structures, and can thus meet the annotation needs of content documents of various business types.
The metadata information built in step (4) above can also be attached to the data intervals of the content fragment in embedded form, as shown in Fig. 8, in which every matched metadata tag is marked inside the content fragment of the entry "open". Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the invention, and Fig. 10 shows a schematic diagram of the entry metadata information obtained by applying the rule template of Fig. 9 to the entry of Fig. 8.
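Embedding the tags directly into the content could look roughly like this. The span format (start, end, tag), the function name embed_tags, and the sample text are assumptions for illustration; escape comes from Python's standard xml.sax.saxutils module.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape


def embed_tags(text: str, spans: List[Tuple[int, int, str]]) -> str:
    """Wrap each (start, end, tag) span of text in <tag>...</tag>.

    Spans are assumed not to overlap; text outside any span is kept as is
    (after XML escaping).
    """
    out: List[str] = []
    pos = 0
    for start, end, tag in sorted(spans):
        out.append(escape(text[pos:start]))
        out.append(f"<{tag}>{escape(text[start:end])}</{tag}>")
        pos = end
    out.append(escape(text[pos:]))
    return "".join(out)


text = "open (kai) 4 strokes"
spans = [(0, 4, "headword"), (6, 9, "pinyin"), (11, 12, "strokes")]
embedded = embed_tags(text, spans)
```

Once the tags are embedded, each tagged span is itself a smaller content fragment, which is what makes further, nested annotation possible.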
It can be seen that the embedded metadata tags further divide the content of the entry "open" in Fig. 8 into finer content fragments. The method of the invention can in turn be applied to annotate these subordinate content fragments. Taking the content fragment "definition of open" as an example, its rule template and metadata information are shown in Fig. 9 and Fig. 10.
By refining layer by layer and applying the content annotation method progressively, the user can annotate a content document or fragment as fully as possible and at the finest granularity, to achieve a satisfactory structuring effect.
Fig. 11 shows a schematic diagram of a content annotation device according to an embodiment of the invention, comprising:
An acquisition module 10, configured to obtain a content fragment;
A matching module 20, configured to match the rules [R_f, R_t] against the content data in [S_f, S_t], and to label each matched data item with the metadata tag of the rule it matched, to obtain a mapping relation list M, where S_f is the start of the content fragment, S_t is the end of the content fragment, R_f is the first rule of a rule template, R_t is the last rule of the rule template, and the rule template comprises a group of linearly ordered rules from R_f to R_t.
This content annotation device improves the efficiency of content annotation.
Preferably, the matching module 20 comprises:
A current-setting module, configured to set the current rule R_c to R_f;
A current-matching module, configured to perform the matching of R_c starting from S_f, to obtain the data items matched by R_c, a success flag, and an end position S_r, and to label the data items matched by R_c with the metadata tag in R_c, to obtain a mapping relation list M_r;
An adding module, configured to add M_r to M if the success flag is valid;
A judging module, configured to judge whether the success flag is valid, whether R_c is not R_t, and whether S_r is not S_t;
A loop module, configured to, if all of the above judgments are affirmative, set S_f to S_r, set R_f to the rule following R_c, and then continue performing the above steps; and otherwise to terminate.
This content annotation device has a simple structure and is easy to implement on a computer.
Preferably, the content annotation device further comprises: a metadata item module, configured to traverse each mapping relation in M and record each metadata tag together with its corresponding data item, to build metadata items; a metadata item table module, configured to assemble the metadata items into a metadata item table; and an attaching module, configured to attach the metadata item table to the content fragment. This preferred embodiment provides a simple scheme for saving the mapping relation list that has been built.
Preferably, the content annotation device further comprises: an analysis module, configured to analyze the regularities exhibited by each content fragment in a content document; and a creation module, configured to create a rule template according to those regularities, the rule template comprising the rules [R_f, R_t]. This preferred embodiment improves the reusability of the content annotation work.
The embodiments of the invention can be combined with batch processing or macro commands, so that metadata information can be quickly identified, matched, and attached in batches for content fragments that follow a particular regularity.
As can be seen from the above description, the embodiments of the invention allow the user to attach metadata information to regular content fragments easily, flexibly, efficiently, and accurately. Combined with a metadata tag system, the above embodiments are applicable to various application fields, such as articles, papers, examination questions, and character or word dictionaries, and meet the different business needs of users.
Obviously, those skilled in the art should understand that the modules or steps of the invention described above can be implemented with a general-purpose computing device; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the invention is not restricted to any specific combination of hardware and software.
The above are merely preferred embodiments of the invention and are not intended to limit the invention; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principle of the invention shall fall within the protection scope of the invention.