CN102486767A - Method and device for labeling content - Google Patents

Publication number
CN102486767A
CN102486767A CN2010105780571A CN201010578057A CN102486767A CN 102486767 A CN102486767 A CN 102486767A CN 2010105780571 A CN2010105780571 A CN 2010105780571A CN 201010578057 A CN201010578057 A CN 201010578057A CN 102486767 A CN102486767 A CN 102486767A
Authority
CN
China
Prior art keywords
rule
data
item
content
metadata
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105780571A
Other languages
Chinese (zh)
Other versions
CN102486767B (en)
Inventor
杨燕菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING BEIDA FOUNDER ELECTRONICS Co Ltd
New Founder Holdings Development Co ltd
Original Assignee
Peking University Founder Group Co Ltd
Beijing Founder Electronics Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Peking University Founder Group Co Ltd and Beijing Founder Electronics Co Ltd
Priority to CN201010578057.1A
Publication of CN102486767A
Application granted
Publication of CN102486767B
Legal status: Active
Anticipated expiration

Abstract

The invention provides a method for labeling content, which comprises the following steps: acquiring a content segment; configuring matching rules [Rf, Rt] for content data of [Sf, St]; labeling metadata markers in the matching rules for all matched data items so as to obtain a mapping relation list M, wherein Sf is the start of the content segment, St is the end of the content segment, Rf is the first rule of a rule template, Rt is the last rule of the rule template, and the rule template comprises a set of linearly ordered rules from Rf to Rt. The invention also provides a device for labeling the content. By adopting the method and the device for labeling the content, the efficiency of labeling the content is improved.

Description

Method and device for labeling content
Technical field
The present invention relates to the field of digital typesetting, and in particular to a method and device for labeling content.
Background
Computer software applications help users create a wide variety of content documents. In recent years, structured data formats, including markup languages (such as XML) and labeling standards required by other standards committees, have been used to label these content documents or content segments and to describe the application structure of the content. Managing, processing, and reusing content on the basis of this application structure has become a pressing need for users.
Content documents in some business domains contain large numbers of regular content segments, for example collections of papers, collections of examination questions, and dictionaries of characters or words. Fig. 1 shows one entry of a dictionary. A dictionary may contain a large number of similar entries, and their regularity lies in the fact that each entry includes a headword, a phonetic notation, definitions, and so on.
To convert the dictionary of Fig. 1 into structured data, the headword, phonetic notation, definitions, and so on of each entry must be labeled as metadata; that is, metadata information must be attached to the regular content segments of the e-book. The prior art performs content labeling manually, which is very tedious.
Summary of the invention
The present invention aims to provide a content labeling method and device that solve the problem that manual content labeling is tedious.
An embodiment of the present invention provides a content labeling method, comprising: acquiring a content segment; and matching the rules [Rf, Rt] against the content data of [Sf, St] and labeling each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, wherein Sf is the start of the content segment, St is the end of the content segment, Rf is the first rule of a rule template, Rt is the last rule of the rule template, and the rule template comprises a set of linearly ordered rules from Rf to Rt.
An embodiment of the present invention provides a content labeling device, comprising: an acquisition module configured to acquire a content segment; and a matching module configured to match the rules [Rf, Rt] against the content data of [Sf, St] and to label each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, wherein Sf is the start of the content segment, St is the end of the content segment, Rf is the first rule of a rule template, Rt is the last rule of the rule template, and the rule template comprises a set of linearly ordered rules from Rf to Rt.
Because the content labeling method and device of the embodiments match content segments against rules automatically, they overcome the tedium of manual content labeling and improve labeling efficiency.
Description of drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their descriptions serve to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 shows an entry of a dictionary;
Fig. 2 shows a flowchart of a content labeling method according to an embodiment of the present invention;
Fig. 3 shows a flowchart of matching the rules [Rf, Rt] against the content data of [Sf, St] according to a preferred embodiment of the present invention;
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention;
Fig. 5 shows a schematic diagram of the entry metadata information obtained by applying the entry rule template of Fig. 4 to the entry of Fig. 1;
Fig. 6 shows another entry of a dictionary;
Fig. 7 shows a schematic diagram of the entry metadata information obtained by applying the entry rule template of Fig. 4 to the entry of Fig. 6;
Fig. 8 shows a schematic diagram of an entry after content labeling according to a preferred embodiment of the present invention;
Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention;
Fig. 10 shows a schematic diagram of the entry metadata information obtained by applying the entry rule template of Fig. 9 to the entry of Fig. 8;
Fig. 11 shows a schematic diagram of a content labeling device according to an embodiment of the present invention.
Embodiments
The present invention is described in detail below with reference to the accompanying drawings and in combination with the embodiments.
Fig. 2 shows a flowchart of a content labeling method according to an embodiment of the present invention, comprising:
Step S10: acquire a content segment;
Step S20: match the rules [Rf, Rt] against the content data of [Sf, St], and label each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, wherein Sf is the start of the content segment, St is the end of the content segment, Rf is the first rule of a rule template, Rt is the last rule of the rule template, and the rule template comprises a set of linearly ordered rules from Rf to Rt.
The prior art labels content manually, which is very tedious. In this embodiment, rules are built in advance and used to match the content segment, so that each data item is obtained by automatic matching and the metadata marker created in advance in the rule is automatically assigned to each data item. Because the rules are established beforehand, these operations can be carried out by a computer, which improves labeling efficiency.
In addition, in this embodiment the rules [Rf, Rt] are a set of linearly ordered rules. Such a rule template has a simple structure, so users can easily create one for content documents of various business types. Because the computer matches the linearly ordered rules one by one, the algorithm is simple to implement and efficient.
Fig. 3 shows a flowchart of matching the rules [Rf, Rt] against the content data of [Sf, St] according to a preferred embodiment of the present invention, comprising:
1. Set the current rule Rc to Rf.
2. Starting from Sf, perform the matching of Rc to obtain the data items matched by Rc, a success flag, and an end position Sr; label the data items matched by Rc with the metadata marker in Rc to obtain a mapping relation list Mr.
3. Determine whether the success flag is valid.
4. If so, add Mr to M; otherwise, end processing.
5. Determine whether Rc is Rt; if so, end processing.
6. Otherwise, determine whether Sr is St; if so, end processing.
7. Otherwise, set Sf to Sr, set Rf to the rule following Rc, and return to step 1.
Because [Rf, Rt] is a set of linearly ordered rules, this preferred embodiment uses a loop that automatically applies all the rules of [Rf, Rt], in order, to the content data of the content segment [Sf, St]. The process is simple and easy to implement on a computer.
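The loop of Fig. 3 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the `Rule` class, the `take_while` helper, the convention that a matcher returns `None` when the success flag would be invalid, and the sample text are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Assumed matcher signature: (text, start) -> (matched items, end position Sr),
# or None when the success flag would be invalid.
Matcher = Callable[[str, int], Optional[Tuple[List[str], int]]]


@dataclass
class Rule:
    marker: str      # metadata marker attached to each matched data item
    match: Matcher   # hypothetical matching function for this rule


def take_while(pred: Callable[[str], bool]) -> Matcher:
    """Build a matcher that consumes the run of characters satisfying pred."""
    def match(text: str, start: int) -> Optional[Tuple[List[str], int]]:
        end = start
        while end < len(text) and pred(text[end]):
            end += 1
        return ([text[start:end]], end) if end > start else None
    return match


def label_segment(text: str, rules: List[Rule]) -> List[Tuple[str, str]]:
    """Apply the ordered rules [Rf, Rt] to the content data of [Sf, St]."""
    mapping: List[Tuple[str, str]] = []   # mapping relation list M
    s_f, s_t = 0, len(text)
    for rule in rules:                    # Rc advances from Rf to Rt
        result = rule.match(text, s_f)
        if result is None:                # success flag invalid: stop
            break
        items, s_r = result
        mapping.extend((rule.marker, item) for item in items)  # add Mr to M
        if s_r >= s_t:                    # Sr has reached St: stop
            break
        s_f = s_r                         # the next rule starts at Sr
    return mapping


rules = [
    Rule("headword", take_while(str.isupper)),
    Rule("pinyin", take_while(str.islower)),
    Rule("strokes", take_while(str.isdigit)),
]
print(label_segment("OPENkai4;", rules))
```

The `for` loop plays the role of steps 1 and 7 (advancing Rc and moving Sf to Sr), while the two `break` statements correspond to the exits in steps 4 and 6; running out of rules corresponds to the exit in step 5.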
Preferably, Rc includes a data matching condition, and performing the matching of Rc comprises: using the data matching condition to match each data item in the content data of [Sf, St], and setting the success flag accordingly. This preferred embodiment provides a condition matching rule, which identifies data items in the content segment by evaluating a condition.
Preferably, Rc further includes an end-position flag. When the end-position flag is invalid, it indicates that the data matching condition is an interval condition; when the end-position flag is valid, it indicates that the data matching condition is a position condition. An interval condition specifies a format rule that applies to data over a continuous interval; the corresponding data item is the data in the continuous range that starts at the end position of the previous data item and satisfies the format rule. A position condition specifies a format rule that applies to the data at the end position; the corresponding data item is the data between a start point, which is the end position of the previous data item, and an end point, which is the position that satisfies the format rule. The format rule indicates the regular features exhibited by the data. This preferred embodiment thus provides interval conditions and position conditions for condition matching rules: when the user can identify a feature of the business content over a continuous interval, an interval condition can be used for matching; when the user can identify a feature of the business content at a particular position, a position condition can be used. This preferred embodiment can satisfy the labeling needs of many different types of business content.
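As a rough illustration of the difference between the two kinds of condition, the sketch below implements each as a small Python function. The function names, the use of a per-character predicate as the format rule, and the use of a literal terminator string are assumptions made for this example only.

```python
from typing import Callable, Optional, Tuple


def match_interval(text: str, start: int,
                   satisfies: Callable[[str], bool]) -> Optional[Tuple[str, int]]:
    """Interval condition: take the continuous run, beginning at start,
    in which every character satisfies the format rule."""
    end = start
    while end < len(text) and satisfies(text[end]):
        end += 1
    if end == start:
        return None                     # nothing satisfied the rule
    return text[start:end], end


def match_position(text: str, start: int,
                   terminator: str) -> Optional[Tuple[str, int]]:
    """Position condition: take everything from start up to the position
    where the terminator (the format rule at the end position) occurs."""
    pos = text.find(terminator, start)
    if pos < 0:
        return None                     # no position satisfies the rule
    return text[start:pos], pos


sample = "OPEN(kai)"
print(match_interval(sample, 0, str.isupper))   # run of uppercase letters
print(match_position(sample, 5, ")"))           # text up to the closing bracket
```

An interval condition needs a property that holds for every character of the item, whereas a position condition only needs a recognizable feature where the item ends, which is why the latter suits items whose own content is irregular.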
Preferably, the format rule includes at least one of the following: a content format rule, a presentation format rule, a markup format rule, and an Any rule. The content format rule indicates regular features that the data exhibits in the document content; the presentation format rule indicates regular features that the data exhibits in the page layout; the markup format rule indicates regular features that the data exhibits in the application logic; and the Any rule indicates that any data satisfies the matching condition. Building on the preferred embodiment above, this preferred embodiment specifies several kinds of format rule and can therefore better satisfy the labeling needs of different types of business content.
Preferably, Rc includes a repeat rule count, which indicates how many rules in [Rf, Rt] are to be applied repeatedly. This preferred embodiment provides a repeat matching rule. An entry, for example, usually contains only one headword, so identifying the headword clearly does not require a repeat matching rule. An entry may, however, contain multiple senses, and the repeat matching rule of this preferred embodiment is better suited to identifying them.
Preferably, a rule includes: a minimum occurrence count with value N, indicating that a data item is matched at least N times, where N is a non-negative integer; and a maximum occurrence count with value M, indicating that a data item is matched at most M times, where M is a positive integer and M >= N.
Preferably, the content labeling method further comprises: traversing each mapping relation in M and recording each metadata marker and its corresponding data item to build metadata items; building a metadata item table from the metadata items; and attaching the metadata item table to the content segment.
Preferably, the content labeling method further comprises: traversing each mapping relation in M and recording each metadata marker and its corresponding data item to build metadata items; and attaching the metadata items to the content segment according to the continuous interval or end position corresponding to each matched data item.
The two preferred embodiments above provide two simple schemes for saving the mapping relation list that has been built.
Preferably, the metadata marker conforms to XML, wherein an empty metadata marker indicates that the corresponding data item is to be ignored when metadata items are attached. XML is a computer language in common use in the industry, so specifying metadata markers in XML improves the generality of the method. In addition, providing the empty marker makes it possible to handle data content in the content segment that cannot be identified, which improves compatibility with content documents.
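The two storage schemes and the empty-marker convention can be sketched as follows. This is only an illustration: representing each mapping relation as a `(marker, start, end)` span, the function names `to_table` and `embed`, and the sample text are assumptions, not details given in the source.

```python
from typing import List, Tuple
from xml.sax.saxutils import escape

# Assumed representation of one mapping relation: (marker, start, end).
Span = Tuple[str, int, int]


def to_table(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Scheme 1: build a metadata item table that can be attached as a whole.
    Items with an empty marker are ignored."""
    return [(marker, text[start:end]) for marker, start, end in spans if marker]


def embed(text: str, spans: List[Span]) -> str:
    """Scheme 2: attach each metadata item in place, at its matched interval.
    Items with an empty marker are left as plain, unlabeled text."""
    out, cursor = [], 0
    for marker, start, end in sorted(spans, key=lambda s: s[1]):
        out.append(escape(text[cursor:start]))   # text between matched items
        piece = escape(text[start:end])
        out.append(f"<{marker}>{piece}</{marker}>" if marker else piece)
        cursor = end
    out.append(escape(text[cursor:]))
    return "".join(out)


text = "OPEN(kai)"
spans = [("headword", 0, 4), ("", 4, 5), ("pinyin", 5, 8), ("", 8, 9)]
print(to_table(text, spans))
print(embed(text, spans))
```

The table form keeps the original content untouched, while the embedded form produces labeled content that can itself be split into finer segments later.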
Preferably, the content labeling method further comprises: analyzing the regularities exhibited by each content segment in the content document; and creating a rule template according to these regularities, the rule template comprising the rules [Rf, Rt]. With a rule template created in advance, multiple electronic documents of similar appearance can share one rule template, which avoids rebuilding the rules [Rf, Rt] each time and thereby improves the reusability of the labeling work.
Preferably, Rc includes a referenced template name, which indicates that the rule template with this name is to be referenced. This preferred embodiment establishes a template reference rule, which reduces the amount of development work.
A rule template comprises a set of linearly ordered rules and can be given a name, so that: 1) the rule template can be stored and applied to other similar content segments, in a manner similar to a schema; and 2) other rule templates can reference an already defined rule template by this name.
Fig. 4 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention; this rule template, "entry", comprises six linearly ordered matching rules. Fig. 5 shows a schematic diagram of the entry metadata information obtained by applying the entry rule template of Fig. 4 to the entry of Fig. 1.
In this preferred embodiment, rules fall into three classes: condition matching rules, repeat matching rules, and template reference rules. Every rule includes the following attributes:
Figure BSA00000378002700081
Any rule may specify a minimum occurrence count and a maximum occurrence count; a match is considered successful when minimum occurrence count <= number of matched items <= maximum occurrence count.
For example, in some workbooks the answer to a multiple-choice question may appear as text in the following format:
Answer: AC
Here a continuation condition rule ({uppercase letter, answer choice, 1..*}) can identify each selected answer ("answer choice" = "A", "answer choice" = "C").
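The answer-choice example and the occurrence-count check can be sketched as follows. The function name `match_repeated`, the treatment of each character as one matched item, and the use of `None` for an unbounded maximum (the `*` in `1..*`) are assumptions of this illustration.

```python
from typing import Callable, List, Optional, Tuple


def match_repeated(text: str, start: int, satisfies: Callable[[str], bool],
                   min_count: int,
                   max_count: Optional[int] = None) -> Optional[Tuple[List[str], int]]:
    """Collect consecutive characters that satisfy the format rule, one item
    per character, and succeed only if min_count <= items <= max_count."""
    items: List[str] = []
    pos = start
    while (pos < len(text) and satisfies(text[pos])
           and (max_count is None or len(items) < max_count)):
        items.append(text[pos])
        pos += 1
    if len(items) < min_count:
        return None                      # too few occurrences: match fails
    return items, pos


line = "Answer: AC"
start = len("Answer: ")
result = match_repeated(line, start, str.isupper, min_count=1)
if result is not None:
    choices, end = result
    print([("answer choice", c) for c in choices])
```

Stopping the loop once `max_count` items have been collected keeps the upper bound satisfied by construction, so only the lower bound needs an explicit check.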
Condition rules can be further subdivided into two types: continuation condition rules and termination condition rules.
Every continuation condition rule includes the following attribute:
Attribute | Explanation
Format rule (condition) | Specifies the format rule that a matched item must satisfy (over a continuous interval).
Every termination condition rule includes the following attributes:
Figure BSA00000378002700091
Note that the include-end-position flag is not what distinguishes a continuation condition rule from a termination condition rule; it indicates whether the range of the matched item includes the data at the end position.
Take the stroke-count rule of the example:
When "include end position" is TRUE ({text: "strokes,", TRUE, stroke count, 1}), the matched item is "4 strokes,", and the end position is after the ",", so "strokes," is included;
When "include end position" is FALSE ({text: "strokes,", FALSE, stroke count, 1}), the matched item is "4", and the end position is after the "4", so "strokes," is not included.
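The effect of the flag on the matched item and on the end position can be sketched as follows. The function name `match_until` is hypothetical, and the terminator is written with a leading space (`" strokes,"`) only because this example uses an English rendering of the dictionary text; both are assumptions of the illustration.

```python
from typing import Optional, Tuple


def match_until(text: str, start: int, terminator: str,
                include_end: bool) -> Optional[Tuple[str, int]]:
    """Termination condition: the item runs from start to the terminator.
    include_end decides whether the terminator belongs to the item and
    therefore where the end position Sr falls."""
    pos = text.find(terminator, start)
    if pos < 0:
        return None                               # terminator not found
    end = pos + len(terminator) if include_end else pos
    return text[start:end], end


text = "4 strokes, radical"
print(match_until(text, 0, " strokes,", include_end=True))
print(match_until(text, 0, " strokes,", include_end=False))
```

With the flag set to FALSE, the next rule begins at the terminator itself, so a following rule (or an empty-marker rule) has to account for that text.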
Every repeat rule includes the following attribute:
Attribute | Explanation
Repeat rule count | Specifies how many matching rules are to be applied repeatedly.
Every template reference rule includes the following attribute:
Attribute | Explanation
Referenced template name | Specifies the already defined rule template to apply.
A template reference rule specifies that the rule interval [Rf-template, Rt-template] identified by the referenced template name is to be applied; this is a form of nested application.
For example, take the "entry" rule template of the example and change its last rule, the definition rule, into a template reference rule ({template reference: "definition", definition, 1}). When the definition rule is applied, the rule interval of the corresponding rule template "definition" (shown in Fig. 9) is found automatically and used for matching.
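A template reference rule can be sketched as a lookup in a registry of named templates followed by a recursive application. Everything below is an assumption made for illustration: the dictionary-based rule format, the use of regular expressions as conditions, the template contents, and the sample text. Occurrence counts are not modeled.

```python
import re
from typing import Dict, List, Tuple

# Hypothetical registry of named rule templates. A rule is either a condition
# rule (regex pattern + marker) or a reference to another template by name.
templates: Dict[str, List[dict]] = {
    "definition": [
        {"kind": "cond", "pattern": r"[a-z ]+\.", "marker": "sense"},
    ],
    "entry": [
        {"kind": "cond", "pattern": r"[A-Z]+", "marker": "headword"},
        {"kind": "ref", "template": "definition", "marker": "definition"},
    ],
}


def apply_template(templates: Dict[str, List[dict]], name: str,
                   text: str, pos: int = 0) -> Tuple[list, int]:
    """Apply the ordered rules of the named template starting at pos."""
    mapping: list = []
    for rule in templates[name]:
        if rule["kind"] == "ref":
            # Nested application: run the referenced template's rule interval.
            sub, end = apply_template(templates, rule["template"], text, pos)
            if not sub:
                break
            mapping.append((rule["marker"], sub))
        else:
            m = re.compile(rule["pattern"]).match(text, pos)
            if m is None:
                break
            mapping.append((rule["marker"], m.group()))
            end = m.end()
        pos = end
    return mapping, pos


print(apply_template(templates, "entry", "OPENto unfold."))
```

Returning the nested result as a sub-list under the referencing rule's marker mirrors the way the senses sit inside the definition element in the metadata of the example.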
The resulting metadata is as follows:
<metadata>
<headword>Open</headword>
<traditional-form>Open</traditional-form>
<pinyin>Kāi</pinyin>
<stroke-count>4</stroke-count>
<radical>One</radical>
<definition>
<sense>Opened: ~door | ~curtain | public~ | net~ simultaneously.</sense>
<sense>Got through; open up: ~road | ~ore deposit | ~pick | ~open up.</sense>
</definition>
</metadata>
The specific matching process is as follows:
(1) Observe and analyze the data of the content segments, find the regularities they exhibit, and create the rule template "entry" shown in Fig. 4; this rule template comprises a set of ordered matching rules;
(2) Select the content segment of the entry "open";
(3) Analyze the content segment and, according to the rule template "entry", identify the matched data items and map them to the associated metadata markers.
(4) Build the metadata information shown in Fig. 5 from the identified matched data items and their associated metadata markers. This metadata information can be attached as a whole to the content segment of the entry "open".
Step (3) above can be further refined; matching with the rule template comprises the following steps:
(3.1) Set the start position Sf to the start of the content segment (the character "open" at the beginning of the paragraph), and set the end position St to the end of the content segment (the full stop at the end of the paragraph);
(3.2) Set the start rule Rf to the first rule of the rule template (rule 1, "headword"), and set the end rule Rt to the last rule of the rule template (rule 6, "definition").
(3.3) On the content data of the interval [Sf, St], perform the rule matching of the interval [Rf, Rt] to obtain the mapping relation list M.
Step (3.3) above can be further refined into the following steps:
(3.3.1) Set the current rule Rc to the start rule Rf (rule 1, "headword");
(3.3.2) Starting from the start position Sf (the character "open" at the beginning of the paragraph), perform the matching of rule Rc on the content data of the interval [Sf, St], obtaining the success flag of the match (valid), the end position Sr of the match (the "(" after the character "open"), and the matched mapping relation list Mr ("headword" = "open");
(3.3.3) If the success flag obtained in step (3.3.2) is valid, record the matched mapping relation list Mr in the mapping relation list M;
(3.3.4) Determine whether matching needs to continue (here it does); if so, go to step (3.3.5); otherwise, end;
(3.3.5) Set the start position Sf to the end position Sr obtained in step (3.3.2) (the "(" after the character "open"); set the start rule Rf to the rule following the current rule Rc (rule 2, "traditional form"); then go to step (3.3.1) and perform the rule matching of the interval [Rf, Rt] on the content data of the interval [Sf, St].
Step (3.3.2) above can be further refined; matching a single rule comprises the following steps:
(3.3.2.1) If the current rule Rc is a condition matching rule, go to step (3.3.2.2); if Rc is a repeat matching rule, go to step (3.3.2.4); otherwise Rc is a template reference rule, so go to step (3.3.2.5);
(3.3.2.2) According to the condition, the include-end-position flag, and the occurrence counts in the current condition rule Rc, obtain the list of matched data items that satisfy the rule and the success flag;
(3.3.2.3) For each matched data item in the list obtained in step (3.3.2.2), establish a mapping relation with the metadata marker in the current condition rule Rc and add it to the mapping relation list Mr; end.
(3.3.2.4) According to the repeat rule count and the occurrence counts in the current repeat rule Rc, repeatedly perform, on the content data of the interval [Sf, St], the rule matching of the interval [R(c-k), R(c-1)], where k is the repeat rule count, to obtain the mapping relation list Mr; end.
(3.3.2.5) According to the referenced template name and the occurrence counts in the current template reference rule Rc, perform, on the content data of the interval [Sf, St], the rule matching of the interval [Rf-template, Rt-template] identified by the referenced template name, to obtain the mapping relation list Mr; end.
Take rule 1, "headword", as an example. This rule is a continuation condition rule (that is, it uses an interval condition). All content starting from "open" whose font size is size 3 satisfies the condition, so its matched data item is the "open" at the beginning of the paragraph.
Take rule 4, "stroke count", as an example. This rule is a termination condition rule (that is, it uses a position condition). All content that starts after the pinyin and ends with the string "strokes," satisfies the condition, so its matched data item is "4 strokes,".
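The repeat matching of step (3.3.2.4) can be sketched as follows, on the reading that a repeat rule at index c reapplies the rules immediately before it for as long as they keep matching. The representation of a rule as a `(pattern, marker)` pair, the function names, and the sample text are assumptions of this illustration.

```python
import re
from typing import List, Optional, Tuple

Mapping = List[Tuple[str, str]]
CondRule = Tuple[str, str]   # assumed form: (regex pattern, metadata marker)


def match_block(block: List[CondRule], text: str,
                pos: int) -> Optional[Tuple[Mapping, int]]:
    """Apply every rule of the block once, in order; fail if any rule fails."""
    found: Mapping = []
    for pattern, marker in block:
        m = re.compile(pattern).match(text, pos)
        if m is None or m.end() == pos:   # reject empty matches to avoid loops
            return None
        found.append((marker, m.group()))
        pos = m.end()
    return found, pos


def match_repeat(rules: List[CondRule], c: int, count: int, text: str,
                 pos: int, max_times: Optional[int] = None) -> Tuple[Mapping, int]:
    """Repeat rule at index c: reapply rules[c-count:c] while they match."""
    block = rules[c - count:c]
    mapping: Mapping = []
    times = 0
    while max_times is None or times < max_times:
        result = match_block(block, text, pos)
        if result is None:
            break
        found, pos = result
        mapping.extend(found)
        times += 1
    return mapping, pos


rules = [(r"\d+\.", "number"), (r"[a-z ]+;", "sense")]
text = "1.open a door;2.start a car;3.dig a mine;"
first, pos = match_block(rules, text, 0)          # rules applied once in sequence
more, pos = match_repeat(rules, 2, 2, text, pos)  # then repeated by the repeat rule
print(first + more)
```

A minimum occurrence count could be enforced by comparing `times` against it after the loop; that check is omitted here for brevity.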
The rule template established in the preferred embodiment above also applies to other content segments that follow the same regularity, for example the entry "californium", whose content segment and metadata information are shown in Fig. 6 and Fig. 7. Fig. 6 shows another entry of a dictionary, and Fig. 7 shows a schematic diagram of the entry metadata information obtained by applying the entry rule template of Fig. 4 to the entry of Fig. 6.
Preferably, the content labeling method further comprises: taking a matched data item as a content segment and continuing to perform the step of matching the rules [Rf, Rt] against the content data of [Sf, St]. This preferred embodiment provides a nesting mechanism that can handle more complex content structures and can therefore satisfy the labeling needs of content documents of various business types.
The metadata information built in step (4) above can also be attached in embedded form to the data intervals of the content segment. As shown in Fig. 8, the matched metadata markers are all labeled within the content segment of the entry "open". Fig. 9 shows a schematic diagram of an entry rule template according to a preferred embodiment of the present invention, and Fig. 10 shows a schematic diagram of the entry metadata information obtained by applying the entry rule template of Fig. 9 to the entry of Fig. 8.
As can be seen, the metadata markers embedded in the content of the entry "open" in Fig. 8 divide it further into finer content segments. The method of the invention can then be applied to these subordinate content segments. Taking the content segment "definition of open" as an example, its rule template and metadata information are shown in Fig. 9 and Fig. 10.
By refining layer by layer and applying the content labeling method progressively, the user can label a content document or segment as fully and at as fine a granularity as desired, achieving a satisfactory degree of structuring.
Fig. 11 shows a schematic diagram of a content labeling device according to an embodiment of the present invention, comprising:
an acquisition module 10, configured to acquire a content segment;
a matching module 20, configured to match the rules [Rf, Rt] against the content data of [Sf, St] and to label each matched data item with the metadata marker of the rule it matched, so as to obtain a mapping relation list M, wherein Sf is the start of the content segment, St is the end of the content segment, Rf is the first rule of a rule template, Rt is the last rule of the rule template, and the rule template comprises a set of linearly ordered rules from Rf to Rt.
This content labeling device improves labeling efficiency.
Preferably, the matching module 20 comprises:
a current-setting module, configured to set the current rule Rc to Rf;
a current-matching module, configured to perform the matching of Rc starting from Sf to obtain the data items matched by Rc, a success flag, and an end position Sr, and to label the data items matched by Rc with the metadata marker in Rc to obtain a mapping relation list Mr;
an adding module, configured to add Mr to M if the success flag is valid;
a judging module, configured to determine whether the success flag is valid, whether Rc is not Rt, and whether Sr is not St;
a loop module, configured to, if all of the above determinations are affirmative, set Sf to Sr, set Rf to the rule following Rc, and then continue performing the above steps; otherwise, end.
This content labeling device has a simple structure and is easy to implement on a computer.
Preferably, the content labeling device further comprises: a metadata item module, configured to traverse each mapping relation in M and record each metadata marker and its corresponding data item to build metadata items; a metadata item table module, configured to build a metadata item table from the metadata items; and an attaching module, configured to attach the metadata item table to the content segment. This preferred embodiment provides a simple scheme for saving the mapping relation list that has been built.
Preferably, the content labeling device further comprises: an analysis module, configured to analyze the regularities exhibited by each content segment in the content document; and a creation module, configured to create a rule template according to these regularities, the rule template comprising the rules [Rf, Rt]. This preferred embodiment improves the reusability of the labeling work.
The embodiments of the present invention can be combined with batch processing or macro command operations, so that content segments with a specific regularity can be identified, matched, and given metadata information quickly and in batches.
As can be seen from the above description, with the embodiments of the present invention users can attach metadata information to regular content segments conveniently, flexibly, efficiently, and accurately. Combined with a metadata marker system, the embodiments are applicable to various applications, such as chapters, papers, examination questions, and dictionaries of characters or words, and satisfy users' different business needs.
Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; or they can each be made into an individual integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above are merely preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (17)

1. a content mask method is characterized in that, comprising:
Obtain contents fragment;
At [S f, S t] content-data on matched rule [R f, R t], to the metadata token in each data item mark institute matched rule that matches, to obtain mapping relations tabulation M, wherein, S fBe the beginning of said contents fragment, S tBe the end of said contents fragment, R fBe the first rule of rule template, R tBe an end rule of said rule template, said rule template comprises from R fTo R tOne group of linear orderly rule.
2. method according to claim 1 is characterized in that, at [S f, S t] content-data on matched rule [R f, R t] comprising:
Current regular R is set cBe R f
With S fFor starting point is carried out R cCoupling is to obtain R cThe data matching item, successfully the sign, end position S r, to R cData matching item mark R cIn metadata token, obtain mapping relations tabulations M r
If successfully be masked as effectively, then with M rJoin among the M;
Judge whether said successfully sign is effectively, and R cWhether not R t, and S rWhether not S t
Be that S then is set if above judgement is fBe S r, R is set fBe R cNext rule, continue to carry out above-mentioned steps then; Otherwise termination.
3. method according to claim 2 is characterized in that R cComprise the Data Matching condition, carry out R cCoupling comprises: use said Data Matching condition at [S f, S t] content-data on match each data item, and correspondingly be provided with said successfully the sign.
4. The method according to claim 3, characterized in that R_c further comprises an end position flag; when the end position flag is invalid, it indicates that the data matching condition is an interval condition; when the end position flag is valid, it indicates that the data matching condition is a position condition,
the interval condition indicates a format rule that applies to data over a continuous interval, wherein the corresponding data item is the data in the continuous range that begins at the end position of the previous data item and satisfies the format rule;
the position condition indicates a format rule that applies to the data at an end position, wherein the corresponding data item is the data between a starting point, which is the end position of the previous data item, and an end point, which is the position that satisfies the format rule,
wherein the format rule indicates a regular characteristic exhibited by the data.
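To make the difference between the two conditions in claim 4 concrete, here is a minimal sketch. It assumes, purely for illustration, that a format rule is expressed as a regular expression and that the data at the end position (for example a delimiter) is excluded from the data item; the function names are hypothetical.

```python
import re
from typing import Optional, Tuple


def match_interval(content: str, start: int, end: int, format_rule: str) -> Optional[Tuple[int, int]]:
    """Interval condition: the item is the continuous run beginning at `start`
    in which every part satisfies the format rule."""
    found = re.compile(format_rule).match(content, start, end)
    if found is None or found.end() == start:
        return None  # nothing at `start` satisfies the rule
    return start, found.end()


def match_position(content: str, start: int, end: int, format_rule: str) -> Optional[Tuple[int, int]]:
    """Position condition: the item runs from `start` up to the first position
    at which the format rule is satisfied (that position is not included here)."""
    found = re.compile(format_rule).search(content, start, end)
    if found is None:
        return None  # no end position found in the segment
    return start, found.start()


if __name__ == "__main__":
    text = "12345: Alice Smith, Beijing"
    print(match_interval(text, 0, len(text), r"\d+"))  # span of the leading digits
    print(match_position(text, 7, len(text), r","))    # span up to the comma
```

With an interval condition the rule describes the data item itself; with a position condition the rule describes only where the data item stops.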
5. The method according to claim 4, characterized in that the format rule comprises at least one of: a content format rule, a presentation format rule, a tag format rule, and an arbitrary rule,
the content format rule indicates a regular characteristic that the data exhibits in the document content;
the presentation format rule indicates a regular characteristic that the data exhibits in the page layout presentation;
the tag format rule indicates a regular characteristic that the data exhibits in the application logic;
the arbitrary rule indicates that any data satisfies the matching condition.
6. The method according to claim 2, characterized in that R_c comprises a repeated-rule count, the repeated-rule count indicating that this number of rules in [R_f, R_t] are applied repeatedly.
7. The method according to claim 1, characterized in that the rule comprises:
a minimum occurrence count, whose value is N, indicating that a data item is matched at least N times, N being a non-negative integer;
a maximum occurrence count, whose value is M, indicating that a data item is matched at most M times, M being a positive integer, and M >= N.
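The occurrence bounds in claim 7 behave much like a bounded quantifier. The sketch below is one possible interpretation, assuming a regular expression stands in for the rule's matching condition; `match_repeated` and its parameters are hypothetical names.

```python
import re
from typing import List, Optional, Tuple


def match_repeated(
    content: str, start: int, pattern: str, min_count: int, max_count: int
) -> Optional[Tuple[List[str], int]]:
    """Match `pattern` between min_count (N) and max_count (M) times from `start`.

    Returns the matched data items and the end position, or None when fewer
    than N items were matched.
    """
    if min_count < 0 or max_count < 1 or max_count < min_count:
        raise ValueError("require N >= 0, M >= 1 and M >= N")
    compiled = re.compile(pattern)
    items: List[str] = []
    position = start
    while len(items) < max_count:
        found = compiled.match(content, position)
        if found is None or found.end() == position:
            break  # no further (non-empty) occurrence
        items.append(found.group())
        position = found.end()
    if len(items) < min_count:
        return None  # too few occurrences: the rule fails
    return items, position


if __name__ == "__main__":
    print(match_repeated("a1 b2 c3 rest", 0, r"[a-z]\d\s*", 1, 2))
```

Setting N to 0 makes the rule optional, since zero occurrences still count as a successful match.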
8. The method according to claim 1, characterized by further comprising:
traversing each mapping relation in M, and recording each metadata marker and its corresponding data item, to construct metadata items;
building a metadata item table from the metadata items;
appending the metadata item table to the content segment.
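As a rough illustration of claim 8, the mapping list M can be turned into a table of metadata items and stored alongside the segment. The `ContentSegment` class and the dictionary layout of a metadata item are assumptions made for this sketch; the patent does not prescribe a data structure.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class ContentSegment:
    text: str
    # Metadata item table appended to the segment (hypothetical representation).
    metadata_table: List[Dict[str, str]] = field(default_factory=list)


def attach_metadata_table(segment: ContentSegment, mapping: List[Tuple[str, str]]) -> ContentSegment:
    """Traverse M, build one metadata item per mapping relation, and append the table."""
    table = [{"marker": marker, "data": data} for marker, data in mapping]
    segment.metadata_table = table
    return segment


if __name__ == "__main__":
    segment = ContentSegment("2010-12-02 Method for labeling content")
    mapping = [("date", "2010-12-02"), ("title", "Method for labeling content")]
    print(attach_metadata_table(segment, mapping).metadata_table)
```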
9. The method according to claim 4, characterized by further comprising:
traversing each mapping relation in M, and recording each metadata marker and its corresponding data item, to construct metadata items;
appending the metadata items to the content segment according to the continuous interval or end position corresponding to the matched data items.
10. The method according to claim 8 or 9, characterized in that the metadata marker conforms to XML, wherein a metadata marker that is an empty marker indicates that the data item labeled with the empty marker is ignored when the metadata items are appended.
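One way to picture claim 10 is to serialize the mapping list as XML and skip every data item whose marker is empty. The sketch below uses Python's standard `xml.etree.ElementTree`; treating each marker as an element name and wrapping the items in a `metadata` root element are assumptions of this example, not something the claim specifies.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple


def metadata_to_xml(mapping: List[Tuple[str, str]], root_tag: str = "metadata") -> str:
    """Serialize (marker, data item) pairs as XML, ignoring items with an empty marker."""
    root = ET.Element(root_tag)
    for marker, data in mapping:
        if not marker:
            continue  # empty marker: the data item is ignored when appending
        child = ET.SubElement(root, marker)
        child.text = data
    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    mapping = [("date", "2010-12-02"), ("", " "), ("title", "Method for labeling content")]
    print(metadata_to_xml(mapping))
```

An empty marker is a convenient way to consume separators or filler text so that the next rule starts at the right position without producing a metadata item.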
11. The method according to claim 1, characterized by further comprising:
analyzing the regularities exhibited by each content segment in a content document;
creating a rule template according to the regularities, the rule template comprising the rules [R_f, R_t].
12. The method according to claim 11, characterized in that R_c comprises a referenced template name, indicating that the rule template with that template name is referenced.
13. The method according to claim 1, characterized by further comprising: taking a matched data item as a content segment, and continuing to perform the step of matching the rules [R_f, R_t] against the content data of [S_f, S_t].
14. A content labeling device, characterized by comprising:
an acquisition module, configured to acquire a content segment;
a matching module, configured to match the rules [R_f, R_t] against the content data of [S_f, S_t], and to label each matched data item with the metadata marker of the rule it matched, to obtain a mapping relation list M, wherein S_f is the start of the content segment, S_t is the end of the content segment, R_f is the first rule of a rule template, R_t is the last rule of the rule template, and the rule template comprises a group of linearly ordered rules from R_f to R_t.
15. The device according to claim 14, characterized in that the matching module comprises:
a current-rule setting module, configured to set the current rule R_c to R_f;
a current matching module, configured to perform R_c matching with S_f as the starting point to obtain the data items matched by R_c, a success flag, and an end position S_r, and to label the data items matched by R_c with the metadata marker in R_c to obtain a mapping relation list M_r;
an adding module, configured to add M_r to M if the success flag is valid;
a judging module, configured to judge whether the success flag is valid, whether R_c is not R_t, and whether S_r is not S_t;
a loop module, configured to, if all of the above judgments are yes, set S_f to S_r and set R_f to the rule following R_c, then continue with the above steps; otherwise terminate.
16. The device according to claim 14, characterized by further comprising:
a metadata item module, configured to traverse each mapping relation in M and record each metadata marker and its corresponding data item, to construct metadata items;
a metadata item table module, configured to build a metadata item table from the metadata items;
an appending module, configured to append the metadata item table to the content segment.
17. The device according to claim 14, characterized by further comprising:
an analysis module, configured to analyze the regularities exhibited by each content segment in a content document;
a creation module, configured to create a rule template according to the regularities, the rule template comprising the rules [R_f, R_t].
CN201010578057.1A 2010-12-02 2010-12-02 Method and device for labeling content Active CN102486767B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201010578057.1A CN102486767B (en) 2010-12-02 2010-12-02 Method and device for labeling content

Publications (2)

Publication Number Publication Date
CN102486767A true CN102486767A (en) 2012-06-06
CN102486767B CN102486767B (en) 2015-03-25

Family

ID=46152261

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010578057.1A Active CN102486767B (en) 2010-12-02 2010-12-02 Method and device for labeling content

Country Status (1)

Country Link
CN (1) CN102486767B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101123532A (en) * 2006-08-07 2008-02-13 Huawei Technologies Co., Ltd. A system and method for generating description information of communication user
CN101158953A (en) * 2007-10-08 2008-04-09 上海聆众商务咨询有限公司 Network document information processing method and device
US20090271353A1 (en) * 2008-04-28 2009-10-29 Ben Fei Method and device for tagging a document

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103677803A (en) * 2012-09-19 2014-03-26 Samsung Electronics Co., Ltd. System and method for creating e-book including user effects
TWI623888B (en) * 2013-07-09 2018-05-11 3M Innovative Properties Company Systems and methods for note content extraction and management by segmenting notes
TWI646457B (en) * 2013-07-09 2019-01-01 3M Innovative Properties Company Method of extracting note content, note recognition system and non-transitory computer-readable storage device

Also Published As

Publication number Publication date
CN102486767B (en) 2015-03-25

Similar Documents

Publication Publication Date Title
CN110909548B (en) Chinese named entity recognition method, device and computer readable storage medium
CN101751476B (en) Method and device for marking electronic bookmarks
US20060277159A1 (en) Accuracy in searching digital ink
CN103823838B (en) A kind of method of multi-format document typing and comparison
WO2020000717A1 (en) Web page classification method and device, and computer-readable storage medium
CN102402604A (en) Effective Forward Ordering Of Search Engine
US20100198827A1 (en) Method for finding text reading order in a document
CN102567421B (en) Document retrieval method and device
CN107590291A (en) A kind of searching method of picture, terminal device and storage medium
CN102122280A (en) Method and system for intelligently extracting content object
Delaye et al. A flexible framework for online document segmentation by pairwise stroke distance learning
CN107741972A (en) A kind of searching method of picture, terminal device and storage medium
CN109857912A (en) A kind of font recognition methods, electronic equipment and storage medium
CN112084342A (en) Test question generation method and device, computer equipment and storage medium
CN101281449A (en) Hand-written character recognizing method and system
CN101763424B (en) Method for determining characteristic words and searching according to file content
CN108959204B (en) Internet financial project information extraction method and system
WO2011074942A1 (en) System and method of converting data from a multiple table structure into an edoc format
CN110059253A (en) A kind of sort method and system and equipment based on natural language analysis
CN102486767A (en) Method and device for labeling content
US20140177951A1 (en) Method, apparatus, and storage medium having computer executable instructions for processing of an electronic document
CN109635075B (en) Method and device for marking word-dividing marks on text contents
Souza et al. ARCTIC: metadata extraction from scientific papers in pdf using two-layer CRF
WO2021042527A1 (en) Character recognition method and apparatus, and computer-readable storage medium
CN102567420B (en) Document retrieval method and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220621

Address after: 3007, Hengqin international financial center building, No. 58, Huajin street, Hengqin new area, Zhuhai, Guangdong 519031

Patentee after: New founder holdings development Co.,Ltd.

Patentee after: Beijing Beida Founder Electronics Co., Ltd.

Address before: 100871, Beijing, Haidian District Cheng Fu Road 298, founder building, 5 floor

Patentee before: PEKING UNIVERSITY FOUNDER GROUP Co.,Ltd.

Patentee before: Beijing Beida Founder Electronics Co., Ltd.
