Content annotation method and device for content documents
Technical field
The present invention relates to the field of digital typesetting, and in particular to a content annotation method and device.
Background technology
Computer software applications help users create all kinds of content documents. In recent years, structured data formats, including markup languages (such as XML) and markup standards required by other standards committees, have been used to annotate these content documents or content fragments and to describe the application structure of the content. Managing, processing, and reusing content on the basis of this application structure has become an urgent need for users.
Content documents in some business domains contain large numbers of regular content fragments, for example collections of papers, collections of examination questions, and dictionaries of characters or words. Fig. 1 shows one entry of a dictionary. A dictionary may contain a large number of similar entries, and their regularity lies in the fact that each entry includes a headword, a phonetic notation, definitions, and so on.
To convert the dictionary of Fig. 1 into structured data, the headword, phonetic notation, definitions, and so on of each entry must be marked as metadata; that is, metadata information must be attached to the regular content fragments of the e-book. The prior art performs content annotation manually, so the operation is very tedious.
Summary of the invention
The present invention aims to provide a content annotation method and device for content documents, to solve the problem that manual content annotation is tedious.
In an embodiment of the present invention, a content annotation method is provided, comprising: obtaining a content fragment of a content document; creating a rule template, the rule template comprising a group of linearly ordered rules [R_f, R_t] from R_f to R_t; and matching the rules [R_f, R_t] on the content data of [S_f, S_t], identifying and obtaining the matched data items, and marking each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, the mapping list M being structured content data; wherein S_f is the beginning of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, and R_t is the last rule of the rule template; wherein the rules comprise condition matching rules, repeat matching rules, and template reference rules; and each rule comprises the following attributes: a metadata tag, a minimum number of occurrences, and a maximum number of occurrences.
In an embodiment of the present invention, a content annotation device for content documents is provided, comprising: an acquisition module, configured to obtain a content fragment of a content document; a creation module, configured to create a rule template, the rule template comprising a group of linearly ordered rules [R_f, R_t] from R_f to R_t; and a matching module, configured to match the rules [R_f, R_t] on the content data of [S_f, S_t], identify and obtain the matched data items, and mark each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, the mapping list M being structured content data; wherein S_f is the beginning of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, and R_t is the last rule of the rule template; wherein the rules comprise condition matching rules, repeat matching rules, and template reference rules; and each rule comprises the following attributes: a metadata tag, a minimum number of occurrences, and a maximum number of occurrences.
Because the content annotation method and device of the embodiments of the present invention match content fragments automatically using rules, they overcome the tedium of manual content annotation and improve the efficiency of content annotation.
Brief description of the drawings
The drawings described here provide a further understanding of the present invention and form a part of this application. The exemplary embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 shows an entry of a dictionary;
Fig. 2 shows a flow chart of a content annotation method according to an embodiment of the present invention;
Fig. 3 shows a flow chart of matching the rules [R_f, R_t] on the content data of [S_f, S_t] according to a preferred embodiment of the present invention;
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention;
Fig. 5 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 4 against the entry of Fig. 1;
Fig. 6 shows another entry of the dictionary;
Fig. 7 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 4 against the entry of Fig. 6;
Fig. 8 shows a schematic diagram of an entry after content annotation according to a preferred embodiment of the present invention;
Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention;
Fig. 10 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 9 against the entry of Fig. 8;
Fig. 11 shows a schematic diagram of a content annotation device according to an embodiment of the present invention.
Detailed description of the invention
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 2 shows a flow chart of a content annotation method according to an embodiment of the present invention, comprising:
Step S10, obtaining a content fragment;
Step S20, matching the rules [R_f, R_t] on the content data of [S_f, S_t], and marking each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, wherein S_f is the beginning of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, R_t is the last rule of the rule template, and the rule template comprises a group of linearly ordered rules from R_f to R_t.
The prior art performs content annotation manually, so the operation is very tedious. In the present embodiment, rules are constructed in advance and used to match the content fragment, so that each data item is obtained by automatic matching and the metadata tag created in advance in the rule is automatically assigned to each data item. Because the rules are created beforehand, these operations can be carried out by a computer, which improves the efficiency of content annotation.
In addition, in the present embodiment, the rules [R_f, R_t] form a linearly ordered group. Such a rule template has a simple structure, and the user can easily create one for content documents of various business types. The computer matches these linearly ordered rules one by one, so the algorithm is simple to implement and efficient.
Fig. 3 shows a flow chart of matching the rules [R_f, R_t] on the content data of [S_f, S_t] according to a preferred embodiment of the present invention, comprising:
1. Set the current rule R_c to R_f;
2. Perform the R_c match with S_f as the starting point, to obtain the data items matched by R_c, a success flag, and the end position S_r of the match; mark the data items matched by R_c with the metadata tag in R_c, to obtain a mapping list M_r;
3. Judge whether the success flag is valid;
4. If so, add M_r to M; otherwise, end the process;
5. Judge whether R_c is R_t; if so, end the process;
6. Otherwise, judge whether S_r is S_t; if so, end the process;
7. Otherwise, set S_f to S_r, set R_f to the rule following R_c, and return to step 1.
Because [R_f, R_t] is a linearly ordered group of rules, this preferred embodiment uses a simple loop that automatically applies all the rules of [R_f, R_t], in order, to the content data of the content fragment [S_f, S_t]. The process is simple and easy to implement on a computer.
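The loop of Fig. 3 can be sketched as follows. This is only an illustrative sketch, not the patented implementation: the `Rule` class, the `match_template` function, and the two toy matchers are hypothetical names, and each rule is assumed to produce exactly one data item.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical matcher signature: given the text and a start position,
# return the end position S_r of the match, or None if the match fails
# (i.e. the success flag is invalid).
Matcher = Callable[[str, int], Optional[int]]

@dataclass
class Rule:
    tag: str        # metadata tag assigned to the matched data item
    match: Matcher  # stand-in for the rule's matching condition

def match_template(text: str, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Apply the rules R_f..R_t in order over [S_f, S_t]; return the mapping list M."""
    mapping: List[Tuple[str, str]] = []
    s_f, s_t = 0, len(text)
    for rule in rules:                              # R_c runs from R_f to R_t
        s_r = rule.match(text, s_f)
        if s_r is None:                             # success flag invalid: end
            break
        mapping.append((rule.tag, text[s_f:s_r]))   # add M_r to M
        if s_r >= s_t:                              # S_r is S_t: end
            break
        s_f = s_r                                   # next rule starts at S_r
    return mapping

def digits(text: str, start: int) -> Optional[int]:
    end = start
    while end < len(text) and text[end].isdigit():
        end += 1
    return end if end > start else None

def letters(text: str, start: int) -> Optional[int]:
    end = start
    while end < len(text) and text[end].isalpha():
        end += 1
    return end if end > start else None

template = [Rule("number", digits), Rule("word", letters)]
print(match_template("42abc", template))
# [('number', '42'), ('word', 'abc')]
```

The explicit reassignment of S_f and R_f in steps 1 to 7 becomes an ordinary `for` loop here, because the rules are linearly ordered and each one starts where the previous one ended.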
Preferably, R_c comprises a data matching condition, and performing the R_c match comprises: using the data matching condition to match each data item on the content data of [S_f, S_t], and setting the success flag accordingly. This preferred embodiment provides a condition matching rule, which identifies the data items in a content fragment by testing a condition.
Preferably, R_c further comprises an end-position flag. When the end-position flag is invalid, it indicates that the data matching condition is an interval condition; when the end-position flag is valid, it indicates that the data matching condition is a position condition. An interval condition specifies a format rule that the data must satisfy over a continuous interval; the corresponding data item is the data in the continuous range that starts at the end position of the previous data item and satisfies the format rule. A position condition specifies a format rule that the data at an end position must satisfy; the corresponding data item is the data between the end position of the previous data item, taken as the starting point, and the position that satisfies the format rule, taken as the end point. Here, a format rule indicates a regular feature exhibited by the data. This preferred embodiment thus provides interval conditions and position conditions for condition matching rules: when the user can identify a feature of the business content over a continuous interval, an interval condition can be used for matching; when the user can identify a feature of the business content at a particular position, a position condition can be used. This preferred embodiment can meet the content annotation needs of many different types of business content.
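The difference between the two kinds of condition can be illustrated with a short sketch. The function names and the character-level predicate are assumptions made for illustration; a real format rule could equally test font size, layout, or existing tags.

```python
from typing import Callable, Optional, Tuple

def match_interval(text: str, start: int,
                   accepts: Callable[[str], bool]) -> Optional[Tuple[str, int]]:
    """Interval condition: the data item is the longest run beginning at
    `start` in which every character satisfies the format rule `accepts`."""
    end = start
    while end < len(text) and accepts(text[end]):
        end += 1
    return (text[start:end], end) if end > start else None

def match_position(text: str, start: int,
                   terminator: str) -> Optional[Tuple[str, int]]:
    """Position condition: the data item runs from `start` up to the first
    position at which the format rule (here: a literal string) is met."""
    pos = text.find(terminator, start)
    return (text[start:pos], pos) if pos != -1 else None

sample = "2024-05 report"
print(match_interval(sample, 0, str.isdigit))  # ('2024', 4)
print(match_position(sample, 0, " "))          # ('2024-05', 7)
```

The interval condition stops as soon as the data stops looking right, whereas the position condition accepts anything until a recognizable end marker appears.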
Preferably, the format rule comprises at least one of the following: a content format rule, a presentation format rule, a tag format rule, and an any rule. A content format rule indicates a regular feature that the data exhibits in the document content; a presentation format rule indicates a regular feature that the data exhibits in the page layout; a tag format rule indicates a regular feature that the data exhibits in the application logic; and an any rule indicates that any data satisfies the matching condition. Building on the preceding preferred embodiment, this preferred embodiment specifies several kinds of format rule and can therefore better meet the content annotation needs of different types of business content.
Preferably, R_c comprises a repeat rule count, which indicates how many of the rules preceding the repeat rule in [R_f, R_t] are to be applied repeatedly. This preferred embodiment provides a repeat matching rule. Take a dictionary entry as an example: an entry usually contains only one headword, so identifying the headword clearly does not require a repeat matching rule. An entry may, however, contain several senses, so the repeat matching rule of this preferred embodiment is well suited to identifying them.
Preferably, a rule comprises: a minimum number of occurrences, with value N, indicating that data items must be matched at least N times, where N is a non-negative integer; and a maximum number of occurrences, with value P, indicating that data items may be matched at most P times, where P is a positive integer, with P > N when N is 0 and P >= N when N is a positive integer.
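As a small sketch of these constraints (the function names are hypothetical), the bounds of a rule and the number of matches actually found could be checked as follows:

```python
def valid_bounds(n: int, p: int) -> bool:
    """N must be a non-negative integer and P a positive integer,
    with P > N when N is 0 and P >= N when N is positive."""
    if n < 0 or p < 1:
        return False
    return p > n if n == 0 else p >= n

def occurrences_ok(count: int, n: int, p: int) -> bool:
    """A match succeeds when N <= number of matched items <= P."""
    return n <= count <= p

print(valid_bounds(0, 1), valid_bounds(2, 1))          # True False
print(occurrences_ok(2, 1, 3), occurrences_ok(0, 1, 3))  # True False
```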
Preferably, the content annotation method further comprises: traversing each mapping in M and recording each metadata tag together with its corresponding data item, to build metadata items; building a metadata item table from the metadata items; and attaching the metadata item table to the content fragment.
Preferably, the content annotation method further comprises: traversing each mapping in M and recording each metadata tag together with its corresponding data item, to build metadata items; and attaching each metadata item to the content fragment at the continuous interval or end position corresponding to the matched data item.
The two preferred embodiments above give two simple schemes for saving the mapping list that has been built.
Preferably, the metadata tag conforms to XML, wherein an empty metadata tag indicates that the data item carrying it is ignored when metadata items are attached. XML is a widely used language in the industry, so specifying metadata tags in XML improves the generality of the method. In addition, the empty tag makes it possible to skip over data content in the content fragment that cannot be identified, which improves compatibility with content documents.
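The first scheme, together with the empty-tag convention, might look like the sketch below. It assumes the mapping list M is a list of (tag, data item) pairs and uses Python's standard `xml.etree.ElementTree`; the root element name `metadata` and the sample values are illustrative.

```python
import xml.etree.ElementTree as ET

def build_metadata(mapping):
    """Turn the mapping list M into one XML metadata element.

    Items whose tag is empty were matched only so that the scan could move
    past them; they are skipped when the metadata is built.
    """
    root = ET.Element("metadata")
    for tag, item in mapping:
        if not tag:          # empty tag: ignore this data item
            continue
        ET.SubElement(root, tag).text = item
    return root

M = [("headword", "kai"), ("", "("), ("strokes", "4")]
xml_text = ET.tostring(build_metadata(M), encoding="unicode")
print(xml_text)
# <metadata><headword>kai</headword><strokes>4</strokes></metadata>
```

The resulting element can then be stored alongside the content fragment as a whole, which corresponds to attaching the metadata item table.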
Preferably, the content annotation method further comprises: analyzing the regular pattern exhibited by each content fragment in the content document; and creating a rule template according to that pattern, the rule template comprising the rules [R_f, R_t]. Because the rule template is created in advance, several electronic documents with similar presentation can share one rule template when they are processed, which avoids having to re-create the rules [R_f, R_t] each time and thereby improves the reusability of the content annotation work.
Preferably, R_c comprises a referenced template name, which indicates that the rule template with this name is referenced. This preferred embodiment establishes a template reference rule, which reduces the amount of development work.
A rule template comprises a group of linearly ordered rules and can be given a name, so that: 1) the rule template can be stored and used on other similar content fragments, in a way similar to a schema; and 2) other rule templates can reference the defined rule template by this name.
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention; this rule template, named "entry", comprises six linearly ordered matching rules. Fig. 5 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 4 against the entry of Fig. 1.
In this preferred embodiment, rules fall into three classes: condition matching rules, repeat matching rules, and template reference rules. Every rule has the following attributes: a metadata tag, a minimum number of occurrences, and a maximum number of occurrences.
Any rule can specify a minimum and a maximum number of occurrences; a match is considered successful when minimum number of occurrences <= number of matched items <= maximum number of occurrences.
For example, in some workbooks the answer to a multiple-choice question may be shown as text in the following format:
Answer: AC
With the continuous-condition rule ({capital letter, answer choice, 1..*}), each selected answer can be identified ("answer choice" = "A", "answer choice" = "C").
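A minimal sketch of this example is given below. The function name and its return shape (success flag, end position, mapping list) are assumptions chosen to mirror the outputs of a rule match described earlier; `max_occurs=None` stands for the unbounded upper limit `*`.

```python
def match_capital_letters(text, start, tag, min_occurs=1, max_occurs=None):
    """Continuous-condition rule producing one data item per capital letter.

    Returns (success_flag, end_position, mapping_list).
    """
    items = []
    pos = start
    while pos < len(text) and text[pos].isupper():
        if max_occurs is not None and len(items) == max_occurs:
            break
        items.append((tag, text[pos]))
        pos += 1
    if len(items) < min_occurs:   # too few occurrences: the match fails
        return False, start, []
    return True, pos, items

line = "Answer: AC"
ok, end, mapping = match_capital_letters(line, len("Answer: "), "answer choice")
print(ok, end, mapping)
# True 10 [('answer choice', 'A'), ('answer choice', 'C')]
```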
Condition rules can be further subdivided into two classes: continuous-condition rules and end-condition rules.
A continuous-condition rule has the following attribute:
Attribute | Description
Format rule (condition) | Specifies the format rule that a matched item must satisfy (over a continuous interval).
An end-condition rule has the following attributes:
Here, the "include end position" flag is not what distinguishes a continuous-condition rule from an end-condition rule; it indicates whether the range of the matched item includes the data at the end position.
Take the strokes rule in the example:
When "include end position" is TRUE ({text: "画，", TRUE, strokes, 1}), the matched item is "4画，", the end position is after "，", and the match includes "画，";
When "include end position" is FALSE ({text: "画，", FALSE, strokes, 1}), the matched item is "4", the end position is after "4", and the match does not include "画，".
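The effect of the flag can be sketched as follows. The function name is hypothetical, and the terminator is assumed to be the literal text "画，" (the character for "stroke" followed by a full-width comma) of the strokes rule.

```python
def match_until(text, start, terminator, include_end):
    """End-condition rule: match from `start` up to the terminator text.

    With include_end=True the terminator belongs to the matched item and the
    end position lies after it; with include_end=False the matched item and
    the end position both stop just before the terminator.
    """
    pos = text.find(terminator, start)
    if pos == -1:
        return None                      # terminator not found: match fails
    end = pos + len(terminator) if include_end else pos
    return text[start:end], end

fragment = "4画，"                        # "4 strokes," in the dictionary entry
print(match_until(fragment, 0, "画，", True))   # ('4画，', 3)
print(match_until(fragment, 0, "画，", False))  # ('4', 1)
```

Because the end position differs between the two cases, the flag also decides where the next rule starts scanning.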
A repeat rule has the following attribute:
Attribute | Description
Repeat rule count | Specifies how many of the preceding matching rules are applied repeatedly.
A template reference rule has the following attribute:
Attribute | Description
Referenced template name | Specifies the defined rule template to apply.
A template reference rule applies the rule interval [R_f-template, R_t-template] identified by the referenced template name; it is a form of nested application.
For example, in the rule template "entry" of the example, the last rule, the definition rule, can be changed into a template reference rule ({template reference: "definition", definition, 1}). When the definition rule is applied, the rule interval corresponding to the rule template "definition" (as shown in Fig. 9) is located automatically and matched.
The resulting metadata is as follows:
<metadata>
<headword>开</headword>
<traditional_form>開</traditional_form>
<pinyin>kāi</pinyin>
<strokes>4</strokes>
<radical>一</radical>
<definition>
<sense>To open: ~ door | ~ curtain | public ~ | net ~ simultaneously.</sense>
<sense>To get through; to open up: ~ road | ~ ore deposit | ~ pick | ~ open up.</sense>
</definition>
</metadata>
The concrete matching process is as follows:
(1) Observe and analyze the data of the content fragments, find the regular pattern they exhibit, and create the rule template "entry" shown in Fig. 4; this rule template comprises a group of ordered matching rules;
(2) Select the content fragment of the entry "开" (kāi);
(3) Analyze the content fragment and, according to the rule template "entry", identify and obtain the matched data items and map them to the associated metadata tags;
(4) From the identified matched data items and their associated metadata tags, build the metadata information shown in Fig. 5. This metadata information can be attached as a whole to the content fragment of the entry "开".
Step (3) above can be refined further. Identification and matching with a rule template comprises the following steps:
(3.1) Set the start position S_f to the beginning of the content fragment (the character "开" at the start of the paragraph), and set the end position S_t to the end of the content fragment (the full stop at the end of the paragraph);
(3.2) Set the start rule R_f to the first rule of the rule template (rule 1, "headword"), and set the end rule R_t to the last rule of the rule template (rule 6, "definition");
(3.3) On the content data of the interval [S_f, S_t], perform rule matching over the interval [R_f, R_t], to obtain the mapping list M.
Step (3.3) above can be refined further and comprises the following steps:
(3.3.1) Set the current rule R_c to the start rule R_f (rule 1, "headword");
(3.3.2) With the start position S_f (the character "开" at the start of the paragraph) as the starting point, perform the match of rule R_c on the content data of the interval [S_f, S_t]; obtain the success flag of the R_c match (valid), the end position S_r of the match (the "(" following "开"), and the matched mapping list M_r ("headword" = "开");
(3.3.3) If the success flag of the R_c match obtained in step (3.3.2) is valid, record the matched mapping list M_r in the mapping list M;
(3.3.4) Judge whether matching needs to continue (here it does); if so, go to step (3.3.5); otherwise, terminate;
(3.3.5) Set the start position S_f to the end position S_r of the R_c match obtained in step (3.3.2) (the "(" following "开"); set the start rule R_f to the rule following the current rule R_c (rule 2, "traditional form"); and return to step (3.3.1) to perform rule matching over the interval [R_f, R_t] on the content data of the interval [S_f, S_t].
Step (3.3.2) above can be refined further. Matching a single rule comprises the following steps:
(3.3.2.1) If the current rule R_c is a condition matching rule, go to step (3.3.2.2); if the current rule R_c is a repeat matching rule, go to step (3.3.2.4); otherwise the current rule R_c is a template reference rule, so go to step (3.3.2.5);
(3.3.2.2) According to the condition, the "include end position" flag, and the number of occurrences in the current condition rule R_c, obtain the list of matched data items that satisfy the rule, together with the success flag;
(3.3.2.3) For each matched data item in the list obtained in step (3.3.2.2), establish a mapping to the metadata tag in the current condition rule R_c and add it to the mapping list M_r; terminate.
(3.3.2.4) According to the repeat rule count and the number of occurrences in the current repeat rule R_c, repeatedly perform rule matching over the interval [R_c - repeat count, R_c - 1] on the content data of the interval [S_f, S_t], to obtain the mapping list M_r; terminate.
(3.3.2.5) According to the referenced template name and the number of occurrences in the current template reference rule R_c, perform rule matching over the interval [R_f-template, R_t-template] identified by the referenced template name on the content data of the interval [S_f, S_t], to obtain the mapping list M_r; terminate.
Rule 1, "headword", is a continuous-condition rule (that is, it uses an interval condition). All content starting from "开" whose font size is size three satisfies this condition, so its matched data item is the "开" at the start of the paragraph.
Rule 4, "strokes", is an end-condition rule (that is, it uses a position condition). All content after the pinyin that ends with the string "画，" satisfies this condition, so its matched data item is "4画，".
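The dispatch among the three rule classes can be sketched as below. This is a simplified illustration under several assumptions: all class and function names are hypothetical, condition rules are character-level interval conditions that yield one data item each, a repeat rule may match zero additional times, and the rules repeated by a repeat rule are plain condition rules.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple, Union

Mapping = List[Tuple[str, str]]

@dataclass
class ConditionRule:
    tag: str
    accepts: Callable[[str], bool]   # interval condition on single characters

@dataclass
class RepeatRule:
    count: int        # how many preceding rules are re-applied as a block
    max_occurs: int   # upper bound on the number of extra repetitions

@dataclass
class TemplateRef:
    name: str         # name of a previously defined rule template

Rule = Union[ConditionRule, RepeatRule, TemplateRef]
TEMPLATES: Dict[str, List[Rule]] = {}

def run(rules: List[Rule], text: str, start: int) -> Tuple[Mapping, int]:
    """Match `rules` in order from `start`; return (mapping list, end position)."""
    mapping: Mapping = []
    pos = start
    for i, rule in enumerate(rules):
        if isinstance(rule, ConditionRule):
            end = pos
            while end < len(text) and rule.accepts(text[end]):
                end += 1
            if end == pos:                      # rule failed: stop matching
                break
            mapping.append((rule.tag, text[pos:end]))
            pos = end
        elif isinstance(rule, RepeatRule):
            block = rules[i - rule.count:i]     # [R_c - count, R_c - 1]
            for _ in range(rule.max_occurs):
                sub, new_pos = run(block, text, pos)
                if len(sub) < len(block):       # block did not match in full
                    break
                mapping.extend(sub)
                pos = new_pos
        else:                                   # template reference: nested run
            sub, pos = run(TEMPLATES[rule.name], text, pos)
            mapping.extend(sub)
    return mapping, pos

TEMPLATES["senses"] = [
    ConditionRule("number", str.isdigit),
    ConditionRule("text", str.isalpha),
    RepeatRule(count=2, max_occurs=10),
]
entry = [ConditionRule("headword", str.isupper), TemplateRef("senses")]
result, end_pos = run(entry, "K1a2b", 0)
print(result)
# [('headword', 'K'), ('number', '1'), ('text', 'a'), ('number', '2'), ('text', 'b')]
```

Keeping templates in a dictionary keyed by name is one simple way to let a template reference rule find the nested rule interval; the minimum-occurrence check is left out to keep the sketch short.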
The rule template established in the preferred embodiment above also applies to other content fragments that follow the same pattern, for example the entry "锎" (californium). Its content fragment and metadata information are shown in Fig. 6 and Fig. 7: Fig. 6 shows another entry of the dictionary, and Fig. 7 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 4 against the entry of Fig. 6.
Preferably, the content annotation method further comprises: taking a matched data item as a content fragment in its own right, and again performing the step of matching the rules [R_f, R_t] on the content data of [S_f, S_t]. This preferred embodiment provides a nesting mechanism that can handle more complex content structures and can therefore meet the annotation needs of content documents of various business types.
The metadata information built in step (4) above can also be attached to the data intervals of the content fragment in embedded form. As shown in Fig. 8, the matched metadata tags are all marked inside the content fragment of the entry "开". Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention, and Fig. 10 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 9 against the entry of Fig. 8.
As can be seen, the embedded metadata tags divide the content of the entry "开" in Fig. 8 into finer content fragments. The method of the present invention can then be applied again to annotate these subordinate content fragments. For the content fragment "definition of 开", the rule template and metadata information are shown in Fig. 9 and Fig. 10.
The user can refine layer by layer, applying the content annotation method progressively at each level, so as to annotate the content document or fragment as fully and at as fine a granularity as possible and reach a satisfactory degree of structure.
Fig. 11 shows a schematic diagram of a content annotation device according to an embodiment of the present invention, comprising:
An acquisition module 10, configured to obtain a content fragment;
A matching module 20, configured to match the rules [R_f, R_t] on the content data of [S_f, S_t], and to mark each matched data item with the metadata tag of the rule it matched, to obtain a mapping list M, wherein S_f is the beginning of the content fragment, S_t is the end of the content fragment, R_f is the first rule of the rule template, R_t is the last rule of the rule template, and the rule template comprises a group of linearly ordered rules from R_f to R_t.
This content annotation device improves the efficiency of content annotation.
Preferably, the matching module 20 comprises:
A current-rule setting module, configured to set the current rule R_c to R_f;
A current matching module, configured to perform the R_c match with S_f as the starting point, to obtain the data items matched by R_c, a success flag, and the end position S_r of the match, and to mark the data items matched by R_c with the metadata tag in R_c, to obtain a mapping list M_r;
An adding module, configured to add M_r to M if the success flag is valid;
A judging module, configured to judge whether the success flag is valid, whether R_c is not R_t, and whether S_r is not S_t;
A loop module, configured to, if all of the above judgments are affirmative, set S_f to S_r, set R_f to the rule following R_c, and then continue with the above steps; and otherwise to terminate.
This content annotation device has a simple structure and is easy to implement on a computer.
Preferably, the content annotation device further comprises: a metadata item module, configured to traverse each mapping in M and record each metadata tag together with its corresponding data item, to build metadata items; a metadata item table module, configured to build a metadata item table from the metadata items; and an attaching module, configured to attach the metadata item table to the content fragment. This preferred embodiment gives a simple scheme for saving the mapping list that has been built.
Preferably, the content annotation device further comprises: an analysis module, configured to analyze the regular pattern exhibited by each content fragment in the content document; and a creation module, configured to create a rule template according to that pattern, the rule template comprising the rules [R_f, R_t]. This preferred embodiment improves the reusability of the content annotation work.
Each embodiment of the present invention can be combined with batch processing or macro commands, so that batches of content fragments with a particular regular pattern can be identified, matched, and given metadata information quickly.
As can be seen from the above description, the embodiments of the present invention allow the user to attach metadata information to regular content fragments easily, flexibly, efficiently, and accurately. Combined with a metadata tagging scheme, the above embodiments of the present invention are applicable to a wide range of content, such as chapters, papers, examination questions, and dictionaries of characters or words, and meet the differing business needs of users.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by several computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit it. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.