CN102486767B - Method and device for labeling content - Google Patents

Publication number
CN102486767B
Authority
CN
China
Prior art keywords
rule
data
content
metadata
template
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201010578057.1A
Other languages
Chinese (zh)
Other versions
CN102486767A (en)
Inventor
杨燕菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BEIDA FOUNDER ELECTRONICS Co Ltd
New Founder Holdings Development Co ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Application filed by Peking University Founder Group Co Ltd, Beijing Founder Electronics Co Ltd
Priority to CN201010578057.1A
Publication of CN102486767A
Application granted
Publication of CN102486767B

Abstract

The invention provides a method for labeling content, which comprises the following steps: acquiring a content segment of a content file; creating a rule template that includes a group of linearly ordered rules [Rf, Rt] from Rf to Rt; matching the rules [Rf, Rt] against the content data of [Sf, St]; identifying the matched data items and labeling each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, which is structured content data, wherein Sf is the start of the content segment, St is the end of the content segment, Rf is the first rule of the rule template, and Rt is the last rule of the rule template. The invention also provides a device for labeling content. The method and the device improve the efficiency of labeling content.

Description

Method and device for labeling the content of a content document
Technical field
The present invention relates to the field of digital typesetting, and in particular to a content labeling method and device.
Background art
Computer software applications help users create all kinds of content documents. In recent years, structured data formats, including markup languages (such as XML) and labeling standards required by other standards committees, have been used to label these content documents or content fragments and so describe the application structure of the content. Users urgently need to further manage, process, and reuse content on the basis of this application structure.
Content documents in some business domains contain large numbers of regular content fragments, for example collections of papers, collections of examination questions, and character or word dictionaries. Fig. 1 shows one entry of a dictionary. A dictionary may contain a large number of similar entries. Their regularity lies in the fact that each entry includes a headword, a pronunciation, definitions, and so on.
To convert the dictionary of Fig. 1 into structured data, the headword, pronunciation, definitions, and so on of each entry must be labeled as metadata; that is, metadata information is attached to the regular content fragments of the e-book. The prior art labels content manually, so the operation is very tedious.
Summary of the invention
The present invention aims to provide a content labeling method and device for content documents, to solve the problem that labeling content manually is tedious.
In an embodiment of the present invention, a content labeling method is provided, comprising: obtaining a content fragment of a content document; creating a rule template, the rule template comprising a group of linearly ordered rules [Rf, Rt] from Rf to Rt; matching the rules [Rf, Rt] against the content data of [Sf, St], identifying and obtaining the matched data items, and labeling each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, the mapping relation list M being structured content data, wherein Sf is the beginning of the content fragment, St is the end of the content fragment, Rf is the first rule of the rule template, and Rt is the last rule of the rule template; wherein the rules comprise condition matching rules, repeat matching rules, and template reference rules; and each rule comprises the following attributes: a metadata marker, a minimum number of occurrences, and a maximum number of occurrences.
In an embodiment of the present invention, a content labeling device for a content document is provided, comprising: an acquisition module, configured to obtain a content fragment of the content document; a creation module, configured to create a rule template, the rule template comprising a group of linearly ordered rules [Rf, Rt] from Rf to Rt; and a matching module, configured to match the rules [Rf, Rt] against the content data of [Sf, St], identify and obtain the matched data items, and label each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, the mapping relation list M being structured content data, wherein Sf is the beginning of the content fragment, St is the end of the content fragment, Rf is the first rule of the rule template, and Rt is the last rule of the rule template; wherein the rules comprise condition matching rules, repeat matching rules, and template reference rules; and each rule comprises the following attributes: a metadata marker, a minimum number of occurrences, and a maximum number of occurrences.
Because the content labeling method and device of the embodiments of the present invention use rules to match content fragments automatically, they overcome the tedium of manual content labeling and improve the efficiency of content labeling.
Brief description of the drawings
The drawings described herein provide a further understanding of the present invention and form a part of this application. The illustrative embodiments of the present invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows an entry of a dictionary;
Fig. 2 shows a flow chart of a content labeling method according to an embodiment of the present invention;
Fig. 3 shows a flow chart of matching the rules [Rf, Rt] against the content data of [Sf, St] according to a preferred embodiment of the present invention;
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention;
Fig. 5 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 4 against the entry of Fig. 1;
Fig. 6 shows another entry of the dictionary;
Fig. 7 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 4 against the entry of Fig. 6;
Fig. 8 shows a schematic diagram of an entry after content labeling according to a preferred embodiment of the present invention;
Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention;
Fig. 10 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 9 against the entry of Fig. 8;
Fig. 11 shows a schematic diagram of a content labeling device according to an embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 2 shows a flow chart of a content labeling method according to an embodiment of the present invention, comprising:
Step S10: obtain a content fragment;
Step S20: match the rules [Rf, Rt] against the content data of [Sf, St], and label each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, wherein Sf is the beginning of the content fragment, St is the end of the content fragment, Rf is the first rule of a rule template, Rt is the last rule of the rule template, and the rule template comprises a group of linearly ordered rules from Rf to Rt.
The prior art labels content manually, so the work is very tedious. In this embodiment, rules are constructed in advance and used to match the content fragment, so that each data item is obtained by automatic matching and is automatically labeled with the metadata marker created beforehand in the rule. Because the rules are created in advance, these operations can be carried out by a computer, which improves the efficiency of content labeling.
In addition, in this embodiment the rules [Rf, Rt] are a group of linearly ordered rules. Such a rule template has a simple structure, so users can easily create rule templates for content documents of various business types; and because the computer matches these linearly ordered rules one by one, the algorithm is simple to implement and efficient.
Fig. 3 shows a flow chart of matching the rules [Rf, Rt] against the content data of [Sf, St] according to a preferred embodiment of the present invention, comprising:
1. Set the current rule Rc to Rf;
2. Perform the matching of Rc with Sf as the starting point, to obtain the data items matched by Rc, a success flag, and the end position Sr of the match; label the data items matched by Rc with the metadata marker in Rc, to obtain a mapping relation list Mr;
3. Determine whether the success flag is valid;
4. If so, add Mr to M; otherwise, end the process;
5. Determine whether Rc is Rt; if so, end the process;
6. Otherwise, determine whether Sr is St; if so, end the process;
7. Otherwise, set Sf to Sr, set Rf to the rule following Rc, and return to step 1.
Because [Rf, Rt] is a group of linearly ordered rules, this preferred embodiment uses this loop to apply all the rules of [Rf, Rt], in order, to the content data of the content fragment [Sf, St]. The process is simple and easy to implement on a computer.
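The loop of Fig. 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the content fragment is modeled as a plain string, and the names `Rule`, `take_while`, and `match_rules` are hypothetical. Each rule's matcher returns the end offset Sr, and `None` stands for an invalid success flag.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Mapping = Tuple[str, str]  # (metadata marker, matched data item)

@dataclass
class Rule:
    marker: str
    # Hypothetical matcher: returns the end offset of the match that starts
    # at `start`, or None when the rule does not match there.
    match: Callable[[str, int], Optional[int]]

def take_while(pred: Callable[[str], bool]) -> Callable[[str, int], Optional[int]]:
    """Build a matcher that consumes consecutive characters satisfying pred."""
    def match(text: str, start: int) -> Optional[int]:
        end = start
        while end < len(text) and pred(text[end]):
            end += 1
        return end if end > start else None
    return match

def match_rules(text: str, rules: List[Rule]) -> List[Mapping]:
    """Apply the ordered rules [Rf, Rt] to the content data of [Sf, St]."""
    m: List[Mapping] = []
    s_f, s_t = 0, len(text)
    for r_c in rules:                # steps 1, 5 and 7: Rc walks from Rf to Rt
        s_r = r_c.match(text, s_f)   # step 2: match Rc starting at Sf
        if s_r is None:              # steps 3 and 4: success flag invalid
            break
        m.append((r_c.marker, text[s_f:s_r]))  # step 4: add Mr to M
        if s_r == s_t:               # step 6: end of the fragment reached
            break
        s_f = s_r                    # step 7: the next rule starts at Sr
    return m

rules = [
    Rule("upper", take_while(str.isupper)),
    Rule("digits", take_while(str.isdigit)),
    Rule("lower", take_while(str.islower)),
]
print(match_rules("AB12cd", rules))
```

With the sample rules, the fragment "AB12cd" is split into three labeled items; a fragment such as "AB-12" stops after the first rule because the second rule fails to match.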
Preferably, Rc comprises a data matching condition, and performing the matching of Rc comprises: using the data matching condition to match each data item on the content data of [Sf, St], and setting the success flag accordingly. This preferred embodiment provides a condition matching rule, which identifies the data items in a content fragment by evaluating a condition.
Preferably, Rc further comprises an end position flag. When the end position flag is invalid, it indicates that the data matching condition is an interval condition; when the end position flag is valid, it indicates that the data matching condition is a position condition. An interval condition specifies a format rule that the data satisfies over a continuous interval; the corresponding data item is the data within the continuous range that starts at the end position of the previous data item and satisfies the format rule. A position condition specifies a format rule that the data at the end position satisfies; the corresponding data item is the data between a start point, which is the end position of the previous data item, and an end point, which is the position that satisfies the format rule. The format rule indicates the regular feature that the data exhibits. This preferred embodiment provides interval conditions and position conditions for condition matching rules: when the user can determine the feature of the business content over a continuous interval, an interval condition can be used for matching; when the user can determine the feature of the business content at a certain position, a position condition can be used. This preferred embodiment can therefore meet the content labeling needs of many different types of business content.
Preferably, the format rule comprises at least one of the following: a content format rule, a presentation format rule, a tag format rule, and an Any rule. The content format rule indicates the regular feature that the data exhibits in the document content; the presentation format rule indicates the regular feature that the data exhibits in the page layout; the tag format rule indicates the regular feature that the data exhibits in the application logic; the Any rule indicates that any data satisfies the matching condition. Building on the preceding preferred embodiment, this preferred embodiment specifies several format rules and so can better meet the content labeling needs of different types of business content.
Preferably, Rc comprises a repeat rule count, which indicates how many of the rules preceding the repeat rule in [Rf, Rt] are to be applied repeatedly. This preferred embodiment provides a repeat matching rule. Take a dictionary entry as an example: an entry usually contains only one headword, so identifying the headword clearly does not call for a repeat matching rule. An entry may, however, contain several senses, so the repeat matching rule of this preferred embodiment is well suited to identifying them.
Preferably, a rule comprises: a minimum number of occurrences, whose value N indicates that a data item is matched at least N times, N being a non-negative integer; and a maximum number of occurrences, whose value P indicates that a data item is matched at most P times, P being a positive integer, where P > N when N is 0 and P >= N when N is a positive integer.
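The constraint on N and P, together with the success test on the number of matched items, can be written as two small checks. The function names `valid_bounds` and `occurrences_ok` are illustrative assumptions.

```python
def valid_bounds(n: int, p: int) -> bool:
    """Check the constraint on minimum (N) and maximum (P) occurrences."""
    if n < 0 or p < 1:
        return False  # N must be non-negative and P must be positive
    return p > n if n == 0 else p >= n

def occurrences_ok(count: int, n: int, p: int) -> bool:
    """A rule counts as matched when N <= number of matched items <= P."""
    return n <= count <= p
```

For instance, the pair (N=0, P=1) describes an optional item, while (N=2, P=2) requires exactly two occurrences.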
Preferably, the content labeling method further comprises: traversing each mapping relation in M and recording each metadata marker together with its corresponding data item, so as to build metadata items; building a metadata item table from the metadata items; and attaching the metadata item table to the content fragment.
Preferably, the content labeling method further comprises: traversing each mapping relation in M and recording each metadata marker together with its corresponding data item, so as to build metadata items; and attaching each metadata item to the content fragment at the continuous interval or end position corresponding to its matched data item.
The two preferred embodiments above provide two simple schemes for saving the mapping relation list that has been built.
Preferably, the metadata marker conforms to XML, wherein, when the metadata marker is an empty marker, it indicates that the data item whose metadata marker is empty is ignored when metadata items are attached. XML is a computer language in common use in the industry, so specifying metadata markers in XML improves the generality of the method. In addition, empty markers make it possible to handle data content in the content fragment that is not to be identified, which improves compatibility with content documents.
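As a rough illustration of building XML metadata from the mapping relation list M while ignoring empty markers, the sketch below uses Python's standard `xml.etree.ElementTree`. The element names (`metadata`, `headword`, `pinyin`) and the sample mappings are assumptions chosen for the example, not names fixed by the text.

```python
import xml.etree.ElementTree as ET

def build_metadata(mappings):
    """Turn (marker, data item) pairs into an XML metadata element."""
    root = ET.Element("metadata")
    for marker, item in mappings:
        if not marker:  # empty marker: the data item is ignored
            continue
        ET.SubElement(root, marker).text = item
    return root

# Sample mapping list; the "(" item carries an empty marker and is skipped.
m = [("headword", "开"), ("", "("), ("pinyin", "kāi")]
xml_text = ET.tostring(build_metadata(m), encoding="unicode")
print(xml_text)
```

The unlabeled "(" is consumed during matching but leaves no trace in the output, which is the role the text assigns to empty markers.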
Preferably, the content labeling method further comprises: analyzing the regularities exhibited by the content fragments in the content document; and creating a rule template according to those regularities, the rule template comprising the rules [Rf, Rt]. Because the rule template is created in advance, several electronic documents with similar presentation can share one rule template, which avoids rebuilding the rules [Rf, Rt] each time and so improves the reusability of content labeling work.
Preferably, Rc comprises a referenced template name, which indicates that the rule template with this name is referenced. This preferred embodiment provides a template reference rule, which reduces development effort.
A rule template comprises a group of linearly ordered rules and can be given a name, so that: 1) the rule template can be stored and used on other similar content fragments, which is comparable to a pattern; 2) other rule templates can reference the defined rule template by this name.
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention; this rule template, named "entry", comprises 6 linearly ordered matching rules. Fig. 5 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 4 against the entry of Fig. 1.
In this preferred embodiment, rules fall into three classes: condition matching rules, repeat matching rules, and template reference rules. Every rule comprises the following attributes: a metadata marker, a minimum number of occurrences, and a maximum number of occurrences.
Any rule can specify a minimum and a maximum number of occurrences; a match is considered successful when minimum number of occurrences <= number of matched items <= maximum number of occurrences.
For example, in some workbooks the answer to a multiple-choice question may appear as text in the following format:
Answer: AC
A continuous condition rule ({uppercase letter, answer choice, 1..*}) can then identify each selected answer ("answer choice" = "A", "answer choice" = "C").
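The answer-choice example can be sketched as a continuous condition rule over a string. This is only an illustration: the function name `match_continuous`, the treatment of each character as one data item, and the assumption that the answer letters start right after "Answer: " are all hypothetical. An unbounded maximum ("*") is modeled as `None`.

```python
def match_continuous(text, start, condition, min_count, max_count=None):
    """Collect consecutive single-character items that satisfy `condition`."""
    items = []
    pos = start
    while pos < len(text) and condition(text[pos]):
        if max_count is not None and len(items) == max_count:
            break
        items.append(text[pos])
        pos += 1
    success = len(items) >= min_count
    return items, success, pos

line = "Answer: AC"
start = line.index(":") + 2  # skip the "Answer: " prefix (assumed layout)
items, ok, end = match_continuous(line, start, str.isupper, min_count=1)
mappings = [("answer choice", item) for item in items]
print(mappings)
```

Each uppercase letter becomes its own data item under the same metadata marker, which mirrors the 1..* occurrence range of the rule.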
Condition rules can be further divided into two classes: continuous condition rules and end condition rules.
Continuous condition rules comprise the following attributes:
Attribute | Explanation
Format rule (condition) | Specifies the format rule that the matched item satisfies (over a continuous interval).
End condition rules comprise the following attributes:
Here, the "include end position" flag is not what distinguishes a continuous condition rule from an end condition rule; it indicates whether the range of the matched item includes the data at the end position.
The stroke rule in the example:
When "include end position" is TRUE ({text: "strokes,", TRUE, strokes, 1}), the matched item is "4 strokes,"; the end position is after ",", so "strokes," is included;
When "include end position" is FALSE ({text: "strokes,", FALSE, strokes, 1}), the matched item is "4"; the end position is after "4", so "strokes," is excluded.
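The effect of the "include end position" flag can be sketched as follows. The function name `match_end_condition` is hypothetical, and the sample string is an English stand-in for the dictionary text; the terminator is written with a leading space so that the FALSE case yields exactly "4".

```python
def match_end_condition(text, start, terminator, include_end):
    """Match from `start` up to the first `terminator` (an end condition rule)."""
    hit = text.find(terminator, start)
    if hit < 0:
        return None, False, start  # terminator not found: match fails
    end = hit + len(terminator) if include_end else hit
    return text[start:end], True, end

sample = "4 strokes, radical"
with_end = match_end_condition(sample, 0, " strokes,", True)
without_end = match_end_condition(sample, 0, " strokes,", False)
print(with_end, without_end)
```

Note that the flag also changes where the next rule starts: with TRUE the following rule begins after the terminator, with FALSE it begins at the terminator itself.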
Repeat rules comprise the following attributes:
Attribute | Explanation
Repeat rule count | Specifies how many of the preceding matching rules are applied repeatedly.
Template reference rules comprise the following attributes:
Attribute | Explanation
Referenced template name | Specifies the defined rule template to apply.
A template reference rule specifies that the rule interval [Rf-template, Rt-template] identified by the referenced template name is applied; it is a form of nested application.
For example, in the rule template "entry" of the example, the last rule, the definition rule, can be changed into a template reference rule ({template reference: "definition", definition, 1}); when the definition rule is applied, the rule interval corresponding to the rule template "definition" (shown in Fig. 9) is found automatically and matched.
The resulting metadata is as follows:
<metadata>
<headword>开</headword>
<traditional-form>開</traditional-form>
<pinyin>kāi</pinyin>
<strokes>4</strokes>
<radical>一</radical>
<definition>
<sense>opens: ~ door | ~ curtain | public ~ | net ~ simultaneously.</sense>
<sense>gets through; open up: ~ road | ~ ore deposit | ~ pick | ~ open up.</sense>
</definition>
</metadata>
The concrete matching process is as follows:
(1) Observe and analyze the data of the content fragments, find their regularities, and create the rule template "entry" shown in Fig. 4; this rule template comprises a group of ordered matching rules;
(2) Select the content fragment of the entry "开" (kāi, "open");
(3) Analyze the content fragment, identify the matched data items according to the rule template "entry", and map them to the associated metadata markers;
(4) From the identified matched data items and their associated metadata markers, build the metadata information shown in Fig. 5. This metadata information can be attached as a whole to the content fragment of the entry "开".
Step (3) above can be refined further. Identification and matching with the rule template comprise the following steps:
(3.1) Set the start position Sf to the beginning of the content fragment (the character "开" at the start of the paragraph), and set the end position St to the end of the content fragment (the full stop at the end of the paragraph);
(3.2) Set the start rule Rf to the first rule of the rule template (rule 1, "headword"), and set the end rule Rt to the last rule of the rule template (rule 6, "definition");
(3.3) On the content data of the interval [Sf, St], perform the rule matching of the interval [Rf, Rt] to obtain the mapping relation list M.
Step (3.3) above can be refined further and comprises the following steps:
(3.3.1) Set the current rule Rc to the start rule Rf (rule 1, "headword");
(3.3.2) With the start position Sf (the character "开" at the start of the paragraph) as the starting point, perform the matching of rule Rc on the content data of the interval [Sf, St]; obtain the success flag of the match of rule Rc (valid), the end position Sr of the match (the "(" after the character "开"), and the matched mapping relation list Mr ("headword" = "开");
(3.3.3) If the success flag of the match of rule Rc obtained in step (3.3.2) is valid, record the matched mapping relation list Mr in the mapping relation list M;
(3.3.4) Determine whether matching needs to continue (here it does); if so, go to step (3.3.5); otherwise, terminate;
(3.3.5) Set the start position Sf to the end position Sr of the match of rule Rc obtained in step (3.3.2) (the "(" after the character "开"); set the start rule Rf to the rule following the current rule Rc (rule 2, "traditional form"); go to step (3.3.1) and perform the rule matching of the interval [Rf, Rt] on the content data of the interval [Sf, St].
Step (3.3.2) above can be refined further. The matching of a single rule comprises the following steps:
(3.3.2.1) If the current rule Rc is a condition matching rule, go to step (3.3.2.2); if the current rule Rc is a repeat matching rule, go to step (3.3.2.4); otherwise, the current rule Rc is a template reference rule, so go to step (3.3.2.5);
(3.3.2.2) According to the condition, the "include end position" flag, and the number of occurrences in the current condition rule Rc, obtain the list of matched data items that satisfy the rule and the success flag;
(3.3.2.3) For each matched data item in the list obtained in step (3.3.2.2), establish a mapping relation with the metadata marker in the current condition rule Rc and add it to the mapping relation list Mr; terminate.
(3.3.2.4) According to the repeat rule count and the number of occurrences in the current repeat rule Rc, repeatedly perform the rule matching of the interval [Rc-repeat count, Rc-1] on the content data of the interval [Sf, St], to obtain the mapping relation list Mr; terminate.
(3.3.2.5) According to the referenced template name and the number of occurrences in the current template reference rule Rc, perform the rule matching of the interval [Rf-template, Rt-template] identified by the referenced template name on the content data of the interval [Sf, St], to obtain the mapping relation list Mr; terminate.
Take rule 1, "headword", as an example. This rule is a continuous condition rule (that is, it uses an interval condition). All content starting from "开" whose font size is No. 3 satisfies the condition, so its matched data item is the "开" at the start of the paragraph.
Take rule 4, "strokes", as another example. This rule is an end condition rule (that is, it uses a position condition). All content that starts after the pinyin and ends with the string "strokes," satisfies the condition, so its matched data item is "4 strokes,".
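The single-rule dispatch of steps (3.3.2.1) to (3.3.2.5) can be sketched as one recursive function over the three rule classes. This is an illustrative simplification under stated assumptions: the content is a plain string, `Condition`, `Repeat`, `TemplateRef`, `TEMPLATES`, and `run` are hypothetical names, and the sample templates and English text are stand-ins for the dictionary entry of the figures.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

@dataclass
class Condition:              # condition matching rule
    marker: str
    match: Callable[[str, int], Optional[int]]  # end offset, or None on failure

@dataclass
class Repeat:                 # repeat matching rule
    count: int                # how many preceding rules to re-apply
    max_times: int = 100      # assumed upper bound on repetitions

@dataclass
class TemplateRef:            # template reference rule
    marker: str
    name: str                 # referenced template name

TEMPLATES: Dict[str, List[object]] = {}

def take_while(pred):
    def match(text, start):
        end = start
        while end < len(text) and pred(text[end]):
            end += 1
        return end if end > start else None
    return match

def upto(stop):
    def match(text, start):
        hit = text.find(stop, start)
        return None if hit < 0 else hit + len(stop)
    return match

def run(rules, text, pos):
    """Return (mappings, end position, success flag) for an ordered rule list."""
    m = []
    for i, rule in enumerate(rules):
        if isinstance(rule, Condition):
            end = rule.match(text, pos)
            if end is None:
                return m, pos, False
            if rule.marker:                    # empty marker: consume, do not label
                m.append((rule.marker, text[pos:end]))
            pos = end
        elif isinstance(rule, Repeat):
            window = rules[max(0, i - rule.count):i]
            for _ in range(rule.max_times):
                sub, end, ok = run(window, text, pos)
                if not ok or end == pos:       # stop when nothing more matches
                    break
                m += sub
                pos = end
        else:                                  # TemplateRef: nested application
            sub, end, ok = run(TEMPLATES[rule.name], text, pos)
            if not ok:
                return m, pos, False
            m.append((rule.marker, text[pos:end]))
            m += sub
            pos = end
    return m, pos, True

TEMPLATES["definition"] = [Condition("sense", upto(".")), Repeat(1)]
TEMPLATES["entry"] = [
    Condition("headword", take_while(str.isalpha)),
    Condition("", take_while(str.isspace)),
    TemplateRef("definition", "definition"),
]
result = run(TEMPLATES["entry"], "open to unfasten.to begin.", 0)
print(result)
```

In this sketch the repeat rule discards the mappings of a failed repetition, so a partially matching trailing fragment does not pollute the result; whether the described method does the same is not stated in the text.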
The rule template built in the preferred embodiment above also applies to other content fragments that follow the same regularities. Take the entry "californium" as an example: its content fragment and metadata information are shown in Fig. 6 and Fig. 7. Fig. 6 shows another entry of the dictionary, and Fig. 7 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 4 against the entry of Fig. 6.
Preferably, the content labeling method further comprises: taking a matched data item as a content fragment and again performing the step of matching the rules [Rf, Rt] against the content data of [Sf, St]. This preferred embodiment provides a nesting mechanism that can handle more complex content structures and so can meet the labeling needs of content documents of various business types.
The metadata information built in step (4) above can also be attached, in embedded form, to the data intervals of the content fragment. As shown in Fig. 8, the matched metadata markers are all labeled inside the content fragment of the entry "开". Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention, and Fig. 10 shows a schematic diagram of the entry metadata information obtained by matching the entry rule template of Fig. 9 against the entry of Fig. 8.
As can be seen, the embedded metadata markers divide the content of the entry "开" in Fig. 8 into finer content fragments. The method of the invention can be applied again to label these subordinate content fragments. For example, the rule template and metadata information for the content fragment "definition of 开" are shown in Fig. 9 and Fig. 10.
Users can refine layer by layer, applying the content labeling method progressively to label a content document or fragment as fully and at as fine a granularity as needed, until the structuring is satisfactory.
Fig. 11 shows a schematic diagram of a content labeling device according to an embodiment of the present invention, comprising:
an acquisition module 10, configured to obtain a content fragment;
a matching module 20, configured to match the rules [Rf, Rt] against the content data of [Sf, St], and to label each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, wherein Sf is the beginning of the content fragment, St is the end of the content fragment, Rf is the first rule of a rule template, Rt is the last rule of the rule template, and the rule template comprises a group of linearly ordered rules from Rf to Rt.
This content labeling device improves the efficiency of content labeling.
Preferably, the matching module 20 comprises:
a current-rule setting module, configured to set the current rule Rc to Rf;
a current-rule matching module, configured to perform the matching of Rc with Sf as the starting point, to obtain the data items matched by Rc, a success flag, and the end position Sr of the match, and to label the data items matched by Rc with the metadata marker in Rc, to obtain a mapping relation list Mr;
an adding module, configured to add Mr to M if the success flag is valid;
a judging module, configured to determine whether the success flag is valid, whether Rc is not Rt, and whether Sr is not St;
a loop module, configured to, if all of the above determinations are yes, set Sf to Sr, set Rf to the rule following Rc, and continue performing the above steps; and otherwise to terminate.
This content labeling device has a simple structure and is easy to implement on a computer.
Preferably, the content labeling device further comprises: a metadata item module, configured to traverse each mapping relation in M and record each metadata marker together with its corresponding data item, so as to build metadata items; a metadata item table module, configured to build a metadata item table from the metadata items; and an attaching module, configured to attach the metadata item table to the content fragment. This preferred embodiment provides a simple scheme for saving the mapping relation list that has been built.
Preferably, the content labeling device further comprises: an analysis module, configured to analyze the regularities exhibited by the content fragments in the content document; and a creation module, configured to create a rule template according to those regularities, the rule template comprising the rules [Rf, Rt]. This preferred embodiment improves the reusability of content labeling work.
Each embodiment of the present invention can be combined with batch processing or macro commands, so that batches of content fragments with a specific regularity can be identified, matched, and given metadata information quickly.
As can be seen from the above description, with the embodiments of the present invention a user can attach metadata information to regular content fragments conveniently, flexibly, efficiently, and accurately. Combined with a metadata marker system, the above embodiments of the present invention are suitable for various applications, such as chapters, papers, examination questions, and character or word dictionaries, and meet the different business needs of users.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by several computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into a separate integrated circuit module; or several of the modules or steps can be made into a single integrated circuit module. The present invention is thus not restricted to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of the present invention and does not limit the present invention. For those skilled in the art, the present invention can have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (17)

1. A content labeling method for a content document, characterized by comprising:
obtaining a content fragment of the content document;
creating a rule template, the rule template comprising a group of linearly ordered rules [Rf, Rt] from Rf to Rt;
matching the rules [Rf, Rt] against the content data of [Sf, St], identifying and obtaining the matched data items, and labeling each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, the mapping relation list M being structured content data, wherein Sf is the beginning of the content fragment, St is the end of the content fragment, Rf is the first rule of the rule template, and Rt is the last rule of the rule template;
wherein the rules comprise: condition matching rules, repeat matching rules, and template reference rules;
each rule comprises the following attributes: a metadata marker, a minimum number of occurrences, and a maximum number of occurrences.
2. The method according to claim 1, characterized in that matching the rules [Rf, Rt] against the content data of [Sf, St] comprises:
setting the current rule Rc to Rf;
performing the matching of Rc with Sf as the starting point, to obtain the data items matched by Rc, a success flag, and an end position Sr, and labeling the data items matched by Rc with the metadata marker in Rc, to obtain a mapping relation list Mr;
if the success flag is valid, adding Mr to M;
judging whether the success flag is valid, whether Rc is not Rt, and whether Sr is not St;
if all of the above judgments are yes, setting Sf to Sr, setting Rf to the rule following Rc, and then repeating all steps from the step of setting the current rule Rc to Rf through the step of judging whether the success flag is valid; otherwise, terminating.
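The loop of claim 2 can be sketched as follows. This is a minimal sketch under stated assumptions, not the patented implementation: each rule is reduced to a hypothetical `(metadata marker, regular expression)` pair, positions are string offsets, and `match_rule` and `match_template` are illustrative names.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]       # (metadata marker, regex pattern); assumed shape
Mapping = Tuple[str, str]    # (metadata marker, matched data item)


def match_rule(rule: Rule, text: str, start: int) -> Tuple[List[Mapping], bool, int]:
    """Match one rule at `start`; return (Mr, success flag, end position Sr)."""
    marker, pattern = rule
    m = re.compile(pattern).match(text, start)
    if m is None:
        return [], False, start
    return [(marker, m.group(0))], True, m.end()


def match_template(rules: List[Rule], text: str) -> List[Mapping]:
    """Apply the ordered rules Rf..Rt over the segment [Sf, St]."""
    M: List[Mapping] = []
    if not rules:
        return M
    s_f, s_t = 0, len(text)
    index = 0                                # index of Rf within the template
    while True:
        r_c = rules[index]                   # set the current rule Rc to Rf
        m_r, ok, s_r = match_rule(r_c, text, s_f)
        if ok:
            M.extend(m_r)                    # add Mr to M
        # Continue only if the match succeeded, Rc is not Rt, and Sr is not St.
        if ok and index < len(rules) - 1 and s_r != s_t:
            s_f, index = s_r, index + 1      # Sf := Sr, Rf := rule after Rc
        else:
            break
    return M


rules = [("title", r"[^\n]+\n"), ("author", r"By [^\n]+\n"), ("body", r"[\s\S]+")]
result = match_template(rules, "Labeling Content\nBy Yang\nText of the article.")
```

Here `result` pairs each matched piece of text with the marker of the rule that matched it, which corresponds to the mapping relation list M.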
3. The method according to claim 2, characterized in that Rc comprises a data matching condition, and performing the matching of Rc comprises: using the data matching condition to match each data item in the content data of [Sf, St], and setting the success flag accordingly.
4. The method according to claim 3, characterized in that Rc further comprises an end position flag; when the end position flag is invalid, it indicates that the data matching condition is an interval condition; when the end position flag is valid, it indicates that the data matching condition is a position condition,
the interval condition indicates a format rule set on the data of a continuous interval, wherein the corresponding data item is the data in the continuous range that starts from the end position of the previous data item and satisfies the format rule;
the position condition indicates a format rule set on the data at the end position, wherein the corresponding data item is the data between a starting point, which is the end position of the previous data item, and an end point, which is the position that satisfies the format rule,
wherein the format rule indicates a regular feature exhibited by the data.
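The difference between the two condition types of claim 4 can be illustrated with a short sketch. It assumes, purely for illustration, that a format rule is a regular expression and that a position condition ends the data item where the expression starts to match; the function name `match_condition` is hypothetical.

```python
import re
from typing import Optional, Tuple


def match_condition(pattern: str, text: str, prev_end: int,
                    end_position_flag: bool) -> Optional[Tuple[str, int]]:
    """Return (data item, new end position), or None when nothing matches."""
    regex = re.compile(pattern)
    if not end_position_flag:
        # Interval condition: the data item is the continuous run, starting at
        # the previous end position, that itself satisfies the format rule.
        m = regex.match(text, prev_end)
        return (m.group(0), m.end()) if m else None
    # Position condition: the format rule locates the end point; the data item
    # is everything between the previous end position and that point.
    m = regex.search(text, prev_end)
    return (text[prev_end:m.start()], m.start()) if m else None


text = "2010-12-02 Method and device; remainder"
# Interval condition: the date itself satisfies the format rule.
date_item = match_condition(r"\d{4}-\d{2}-\d{2}", text, 0, end_position_flag=False)
# Position condition: the semicolon only marks where the title ends.
title_item = match_condition(r";", text, 11, end_position_flag=True)
```

In the first call the matched text is the data item; in the second call the matched text is only a delimiter, and the data item is what precedes it.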
5. The method according to claim 4, characterized in that the format rule comprises at least one of the following: a content format rule, a presentation format rule, a tag format rule, and an Any rule,
the content format rule indicates a regular feature that the data exhibits in the document content;
the presentation format rule indicates a regular feature that the data exhibits in the page layout presentation;
the tag format rule indicates a regular feature that the data exhibits in the application logic;
the Any rule indicates that any data satisfies the matching condition.
6. The method according to claim 2, characterized in that Rc comprises a repeat rule count, the repeat rule count indicating that the number of rules in [Rf, Rt] given by the repeat rule count are to be applied repeatedly.
7. The method according to claim 1, characterized in that the rule comprises:
a minimum occurrence count with value N, indicating that a data item is matched at least N times, N being a non-negative integer;
a maximum occurrence count with value P, indicating that a data item is matched at most P times, P being a positive integer, wherein P > N when N is 0, and P ≥ N when N is a positive integer.
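The constraints on N and P in claim 7, and their effect on matching, can be sketched as below. The regular-expression format rule and the names `validate_occurrence` and `match_repeated` are assumptions made for this example.

```python
import re
from typing import List, Optional


def validate_occurrence(n: int, p: int) -> bool:
    """Check that N is a non-negative integer, P is a positive integer,
    P > N when N is 0, and P >= N when N is positive."""
    if n < 0 or p < 1:
        return False
    return p > n if n == 0 else p >= n


def match_repeated(pattern: str, text: str, start: int,
                   n: int, p: int) -> Optional[List[str]]:
    """Match `pattern` between N and P times in a row, starting at `start`.

    Returns the matched data items, or None when fewer than N were found.
    """
    if not validate_occurrence(n, p):
        raise ValueError("invalid occurrence counts")
    regex = re.compile(pattern)
    items: List[str] = []
    pos = start
    while len(items) < p:
        m = regex.match(text, pos)
        if m is None or m.end() == pos:   # stop on failure or an empty match
            break
        items.append(m.group(0))
        pos = m.end()
    return items if len(items) >= n else None


at_most_two = match_repeated(r"\w+,? ?", "a, b, c", 0, 1, 2)
optional_digit = match_repeated(r"\d", "abc", 0, 0, 3)
required_digit = match_repeated(r"\d", "abc", 0, 1, 3)
```

With N set to 0 a rule becomes optional, so an empty list counts as success, whereas N of 1 or more turns the same situation into a failed match.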
8. The method according to claim 1, characterized by further comprising:
traversing each mapping relation in M, and recording each metadata marker and the corresponding data item, respectively, to construct metadata items;
building the metadata items into a metadata item table;
attaching the metadata item table to the content segment.
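One possible reading of claim 8, shown only as a sketch: M is taken to be a list of `(metadata marker, data item)` pairs, and "attaching" is modeled by returning the segment together with its table. The dictionary keys and function names are hypothetical.

```python
from typing import Dict, List, Tuple

Mapping = Tuple[str, str]    # (metadata marker, matched data item); assumed shape


def build_metadata_table(M: List[Mapping]) -> List[Dict[str, str]]:
    """Traverse M and record each marker with its data item as a metadata item."""
    return [{"marker": marker, "data": item} for marker, item in M]


def attach_table(segment: str, table: List[Dict[str, str]]) -> Dict[str, object]:
    """Attach the metadata item table to the content segment."""
    return {"content": segment, "metadata": table}


segment = "Labeling Content\nBy Yang\n"
M = [("title", "Labeling Content"), ("author", "Yang")]
labeled = attach_table(segment, build_metadata_table(M))
```

The table is kept separate from the text here, so the original segment stays unchanged and the metadata can be stored or exchanged alongside it.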
9. The method according to claim 4, characterized by further comprising:
traversing each mapping relation in M, and recording each metadata marker and the corresponding data item, respectively, to construct metadata items;
attaching the metadata items to the content segment at the continuous interval or end position corresponding to the matched data items.
10. The method according to claim 8 or 9, characterized in that the metadata marker conforms to XML, wherein a metadata marker that is an empty marker indicates that the data item bearing the empty marker is to be ignored when the metadata items are attached.
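A brief sketch of how XML-conforming markers and the empty marker of claim 10 might behave. It assumes that an empty marker is represented by an empty string and that each marker is a valid XML element name; the function name `to_xml` and the root tag `segment` are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

Mapping = Tuple[str, str]    # (metadata marker, matched data item); assumed shape


def to_xml(M: List[Mapping], root_tag: str = "segment") -> str:
    """Serialize the mapping list as XML, skipping items with an empty marker."""
    root = ET.Element(root_tag)
    for marker, item in M:
        if not marker:                    # empty marker: ignore this data item
            continue
        ET.SubElement(root, marker).text = item
    return ET.tostring(root, encoding="unicode")


xml_text = to_xml([("title", "Labeling"), ("", " - "), ("author", "Yang")])
# xml_text == '<segment><title>Labeling</title><author>Yang</author></segment>'
```

The empty marker lets a rule consume separators or other filler text so that matching can proceed, without that text appearing in the attached metadata.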
11. The method according to claim 1, characterized by further comprising:
analyzing the presentation pattern of each content segment in the content file;
creating the rule template according to the presentation pattern, the rule template comprising the rules [Rf, Rt].
12. The method according to claim 11, characterized in that Rc comprises a referenced template name, which indicates a reference to the rule template having that template name.
13. The method according to claim 1, characterized by further comprising: taking a matched data item as a content segment, and continuing to perform the step of matching the rules [Rf, Rt] against the content data of [Sf, St].
14. A content labeling device for a content file, characterized by comprising:
an acquisition module, configured to obtain a content segment of the content file;
a creation module, configured to create a rule template, the rule template comprising a group of linearly ordered rules [Rf, Rt] from Rf to Rt;
a matching module, configured to match the rules [Rf, Rt] against the content data of [Sf, St], identify the matched data items, and label each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, the mapping relation list M being structured content data, wherein Sf is the start of the content segment, St is the end of the content segment, Rf is the first rule of the rule template, and Rt is the last rule of the rule template;
wherein the rules comprise: a condition matching rule, a repeat matching rule, and a template reference rule;
the rules comprise the following attributes: a metadata marker, a minimum occurrence count, and a maximum occurrence count.
15. The device according to claim 14, characterized in that the matching module comprises:
a current setting module, configured to set the current rule Rc to Rf;
a current matching module, configured to perform the matching of Rc with Sf as the starting point, to obtain the data items matched by Rc, a success flag, and an end position Sr, and to label the data items matched by Rc with the metadata marker in Rc, to obtain a mapping relation list Mr;
an adding module, configured to add Mr to M if the success flag is valid;
a judging module, configured to judge whether the success flag is valid, whether Rc is not Rt, and whether Sr is not St;
a loop module, configured to, if all of the above judgments are yes, set Sf to Sr, set Rf to the rule following Rc, and then continue to execute in sequence the steps of the current setting module, the current matching module, the adding module, and the judging module; and otherwise to terminate.
16. The device according to claim 14, characterized by further comprising:
a metadata item module, configured to traverse each mapping relation in M and record each metadata marker and the corresponding data item, respectively, to construct metadata items;
a metadata item table module, configured to build the metadata items into a metadata item table;
an attaching module, configured to attach the metadata item table to the content segment.
17. The device according to claim 14, characterized by further comprising:
an analysis module, configured to analyze the presentation pattern of each content segment in the content file;
wherein the creation module creates the rule template according to the presentation pattern, the rule template comprising the rules [Rf, Rt].
CN201010578057.1A 2010-12-02 2010-12-02 Method and device for labeling content Active CN102486767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010578057.1A CN102486767B (en) 2010-12-02 2010-12-02 Method and device for labeling content

Publications (2)

Publication Number Publication Date
CN102486767A CN102486767A (en) 2012-06-06
CN102486767B true CN102486767B (en) 2015-03-25

Family

ID=46152261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010578057.1A Active CN102486767B (en) 2010-12-02 2010-12-02 Method and device for labeling content

Country Status (1)

Country Link
CN (1) CN102486767B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140037535A (en) * 2012-09-19 2014-03-27 삼성전자주식회사 Method and apparatus for creating e-book including user effects
EP3020000B1 (en) * 2013-07-09 2022-04-27 3M Innovative Properties Company Systems and methods for note content extraction and management using segmented notes

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101123532A (en) * 2006-08-07 2008-02-13 华为技术有限公司 A system and method for generating description information of communication user
CN101158953A (en) * 2007-10-08 2008-04-09 上海聆众商务咨询有限公司 Network document information processing method and device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101571859B (en) * 2008-04-28 2013-01-02 国际商业机器公司 Method and apparatus for labelling document

Also Published As

Publication number Publication date
CN102486767A (en) 2012-06-06

Similar Documents

Publication Publication Date Title
US6606625B1 (en) Wrapper induction by hierarchical data analysis
Embley et al. Table-processing paradigms: a research survey
CN101751476B (en) Method and device for marking electronic bookmarks
Muslea et al. Hierarchical wrapper induction for semistructured information sources
CN103823838B (en) A kind of method of multi-format document typing and comparison
CN100444591C (en) Method for acquiring front-page keyword and its application system
CN102270206A (en) Method and device for capturing valid web page contents
CN106055667B (en) It is a kind of based on text-label densities web page core content extracting method
JP2007128523A (en) IMAGE SUMMARIZING METHOD, IMAGE DISPLAY DEVICE, k-TREE DISPLAY SYSTEM, k-TREE DISPLAY PROGRAM AND k-TREE DISPLAY METHOD
CA2529040A1 (en) Improving accuracy in searching digital ink
CN105426529A (en) Image retrieval method and system based on user search intention positioning
CN104899340B (en) A kind of IETM technical information fragment retrieval device and its search method based on fragment of most compacting
Merlino et al. 25 An Empirical Study of the Optimal Presentation of Multimedia Summaries of Broadcast News
CN101620738A (en) Method for generating multi-media concept map
CN108959204B (en) Internet financial project information extraction method and system
CN112084342A (en) Test question generation method and device, computer equipment and storage medium
CN109740124A (en) Difference output method, device, storage medium and the electronic equipment of document comparison
CN106953913A (en) A kind of information-pushing method and mobile terminal
CN101763424B (en) Method for determining characteristic words and searching according to file content
JP2009098763A (en) Handwritten annotation management apparatus and interface
CN106372232B (en) Information mining method and device based on artificial intelligence
CN105260396A (en) Word retrieval method and apparatus
CN102486767B (en) Method and device for labeling content
WO2011074942A1 (en) System and method of converting data from a multiple table structure into an edoc format
CA2422490C (en) Method and apparatus for extracting structured data from html pages

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220621

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Beijing Beida Founder Electronics Co., Ltd.

Address before: 100871, Beijing, Haidian District Cheng Fu Road 298, founder building, 5 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Beijing Beida Founder Electronics Co., Ltd.