CN101667203A - Digital knowledge discovery method

Info

Publication number: CN101667203A
Application number: CN200910169826A
Authority: CN
Inventors: 蒋贤春; 郑珑; 蓝德康; 谢术清; 朱人杰
Original assignee: BEIJING CHINA-E CHINA-S ELECTRONICS Co Ltd
Current assignee: BEIJING CHINA-E CHINA-S ELECTRONICS Co Ltd
Priority date: 2009-09-04
Filing date: 2009-09-04
Publication date: 2010-03-10

Abstract

The invention provides a digital knowledge discovery method. With almost no added cost in the digitization process, it obtains knowledge including the classification of words, identified characters, and the meanings of characters, words and sentences; for example, all personal names, place names, event names and the like in a document, as well as all entry words, phrases, example sentences, book titles and the like, can be discovered. The invention is characterized in that digital processing units use the meaning marking method, the emphasis-word marking method and the identified-character marking method provided by the invention to define the knowledge points to be discovered before a document is processed; after processing is finished, a knowledge generation module first generates knowledge in a standard format, and a knowledge acquisition module then discovers the classified knowledge in the documents.

Description

A method of digital knowledge discovery

Technical field

The invention belongs to the IT field. It provides a method of digital knowledge discovery that obtains knowledge with almost no added cost in the digitization process, including the classification of words, identified characters, and the meanings of characters, words and sentences; for example, it mines all personal names, place names, event names and the like in a document, as well as all entry words, phrases, example sentences, titles and the like.

Using the meaning labeling method, the emphasis-word labeling method and the identified-character labeling method provided by the invention, digital processing units define the knowledge points to be mined before a document is processed. After processing is finished, the knowledge generation module first generates knowledge in a standard format, and the knowledge acquisition module then mines the classified knowledge in the document.

Background technology

With the rapid development of modern computer, communication and network technology, a second information revolution centered on content is in full swing around the world, and digital information resources have become essential to the modern information society. Readers can obtain a large amount of knowledge over the network, but the volume of data is too large: although retrieval can narrow the scope, the remaining data are still massive. The main reason is that the data are not classified, that is, no knowledge points have been set up to cluster the data for the people browsing them.

Knowledge mining of data content is now valued by more and more experts and readers, but the cost of knowledge mining with existing technology is very high: after digitization, an expert or dedicated staff member must read the original document word by word and then mark knowledge points at the relevant positions.

Summary of the invention

The present invention is a method of digital knowledge discovery, comprising a meaning mining method, an emphasis-word mining method and an identified-character mining method.

One, knowledge labeling method

1. Meaning labeling method

(1) A signification (meaning) attribute is added to the font;

(2) The meaning attribute value can be: document name, title, image title, table title, body text, index title, index 1, index 2, index 3, table text, catalogue 1, catalogue 2, catalogue 3, catalogue 4, catalogue 5, catalogue 6, annotation subject, annotation, footer, header, center seam, reference, example sentence, sentence, phrase, proverb, Chinese idiom, idiom, word, character, two-line small characters, three-line small characters, four-line small characters, inserted characters, other.

(3) The meaning attribute of some knowledge points needs to be described by a meaning explanation.

2. Emphasis-word labeling method

(1) An emphasis-word marker is created, for example by defining the emphasis-mark tag in XML as decoration;

(2) The emphasis word carries a signification (meaning) attribute;

(3) The meaning attribute value can be: personal name, place name, event name, title, attached annotation, etc.

3. Identified-character labeling method

(1) An identified-character attribute is created on the text; the position of the identified character in the text marks the character at that position as identified.

Two, knowledge generation module

The raw data produced by different digital processing units are scattered, usually held in databases and data files, and are not in a uniform format. The knowledge generation module provided by the invention formats the raw data to facilitate knowledge acquisition.

1. Font meaning generation module: generates standard-format data from the font meaning raw data.

2. Emphasis-word meaning generation module: generates standard-format data from the emphasis-word meaning raw data.

3. Identified-character generation module: generates standard-format data from the identified-character raw data.

Three, knowledge acquisition module

1. Font meaning acquisition module: obtains knowledge through the font meaning attribute and its value.

2. Emphasis-word meaning acquisition module: obtains classified knowledge through the emphasis-word attribute and its value.

3. Identified-character acquisition module: obtains identified characters through the position of the identified character in the text.

Description of drawings

Fig. 1: knowledge labeling method.

Fig. 2: knowledge generation module.

Fig. 3: knowledge acquisition module.

Fig. 4: English-Chinese dictionary entry word example.

Fig. 5: emphasis word example.

Fig. 6: identified character example.

Embodiment

The application of the invention is illustrated below by examples:

1. Meaning mining method

For items such as the entry-word font, phrase font and example-sentence font in an English-Chinese dictionary, the data operator on the production line does not need to consider knowledge mining during normal input; it is enough to key the font as usual. See Fig. 4.

Here "adamantine" is the entry word, "～chains" is an example sentence, and "～spar" is a phrase. The data operator keys them as "+1adamantine+", "+2～chains+" and "+3～spar+" respectively, where "+" is the font instruction character and the digit after it indicates which font is used: font 1 is the entry-word font, font 2 is the example-sentence font, and font 3 is the phrase font. When the program generates the XML file and encounters the "+" instruction character, the font meaning generation module in the knowledge generation module converts the instruction into the corresponding tag. The conversion result is as follows:

<font size="48" weight="bold" signification="entry word">

<text>adamantine</text>

</font>

<font size="40" style="italic" signification="example sentence">

<text>～chains</text>

</font>

<font size="40" weight="bold" signification="phrase">

<text>～spar</text>

</font>
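The conversion from keyed font instructions to tagged XML can be sketched as follows. This is only a minimal illustration, not the patent's implementation: the `FONT_TABLE` mapping, the function names, and the assumption that each instruction has the form "+", one digit, the text, "+" are all hypothetical and are inferred from the three examples above.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical font table (an assumption): font number -> presentation
# attributes plus the meaning ("signification") of that font.
FONT_TABLE = {
    "1": {"size": "48", "weight": "bold", "signification": "entry word"},
    "2": {"size": "40", "style": "italic", "signification": "example sentence"},
    "3": {"size": "40", "weight": "bold", "signification": "phrase"},
}

# Assumed instruction shape: "+", one digit (font number), the text, "+".
FONT_INSTRUCTION = re.compile(r"\+(\d)([^+]*)\+")


def font_to_xml(match):
    """Turn one matched font instruction into a <font> element."""
    attributes = FONT_TABLE[match.group(1)]
    attribute_text = " ".join(
        f"{name}={quoteattr(value)}" for name, value in attributes.items()
    )
    return f"<font {attribute_text}><text>{escape(match.group(2))}</text></font>"


def generate_font_meaning(keyed_text):
    """Replace every font instruction in the keyed text with tagged XML."""
    return FONT_INSTRUCTION.sub(font_to_xml, keyed_text)


if __name__ == "__main__":
    print(generate_font_meaning("+1adamantine+ +2～chains+ +3～spar+"))
```

Keeping the meaning in the font table means the operator only chooses a font, and the meaning attribute follows from that choice without extra keying.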

After the digitization of a document is finished, the font meaning acquisition module in the knowledge acquisition module parses the XML to obtain all the knowledge, such as all entry words, example sentences and phrases in the dictionary.
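One possible shape for this acquisition step is sketched below with Python's standard `xml.etree.ElementTree` parser. The wrapping `<entry>` element, the function name and the sample document are assumptions made for illustration; the source only specifies the `font` element and its `signification` attribute.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def acquire_font_meanings(xml_text):
    """Group the text of every <font> element by its signification value."""
    root = ET.fromstring(xml_text)
    knowledge = defaultdict(list)
    for font in root.iter("font"):
        meaning = font.get("signification")
        if meaning is None:
            continue  # a font without a meaning attribute is not a knowledge point
        knowledge[meaning].append("".join(font.itertext()).strip())
    return dict(knowledge)


# Hypothetical sample: the <entry> wrapper is an assumption, added only so
# that the fragment is a well-formed XML document.
SAMPLE = """<entry>
  <font size="48" weight="bold" signification="entry word"><text>adamantine</text></font>
  <font size="40" style="italic" signification="example sentence"><text>～chains</text></font>
  <font size="40" weight="bold" signification="phrase"><text>～spar</text></font>
</entry>"""

if __name__ == "__main__":
    for meaning, items in acquire_font_meanings(SAMPLE).items():
        print(meaning, items)
```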

2. Emphasis-word mining method

An emphasis word usually carries a grammatical attribute that is shown typographically: for example, a personal name is enclosed in a frame, a place name is marked with a line at its side, and an event name is marked with a wavy line at its side.

During digitization, when the data-entry operator encounters the text in Fig. 5, the operator keys "have garrison troops open up wasteland and grow food grain" enclosed in the emphasis-word instruction character. When the XML file is generated, the emphasis-word meaning generation module in the knowledge generation module converts this input to:

<decoration signification="place name">have garrison troops open up wasteland and grow food grain</decoration>

After the digitization of a document is finished, the emphasis-word meaning acquisition module in the knowledge acquisition module parses the XML to obtain all the classified knowledge, such as all place names, personal names and event names in the document.
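A minimal sketch of this classification step follows, assuming the `decoration` elements appear inline within running text. The `<para>` wrapper, the function name and the event-name sample are hypothetical; only the `decoration` tag, its `signification` attribute and the place-name example come from the text above.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict


def acquire_emphasis_words(xml_text):
    """Group the text of every <decoration> element by its signification value."""
    root = ET.fromstring(xml_text)
    classified = defaultdict(list)
    for mark in root.iter("decoration"):
        category = mark.get("signification")
        if category is None:
            continue  # unclassified emphasis marks are skipped in this sketch
        classified[category].append("".join(mark.itertext()).strip())
    return dict(classified)


# Hypothetical sample: the <para> wrapper and the event name are made up
# for illustration; the place-name element repeats the example above.
SAMPLE = (
    "<para>"
    '<decoration signification="event name">Lantern Festival fair</decoration>'
    " was held at "
    '<decoration signification="place name">'
    "have garrison troops open up wasteland and grow food grain"
    "</decoration>."
    "</para>"
)

if __name__ == "__main__":
    for category, words in acquire_emphasis_words(SAMPLE).items():
        print(category, words)
```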

3. Identified-character mining method

During digitization, when the data-entry operator encounters the text in Fig. 6, the operator keys "⊥ Swam on Lantern Festival", where "⊥" is the identified-character instruction character. When the XML file is generated, the identified-character generation module in the knowledge generation module converts "⊥ Swam on Lantern Festival" to:

<text reverse="reverse" variant="3">Yuan Xiao Swam</text>

After the digitization of a document is finished, the identified-character acquisition module in the knowledge acquisition module parses the XML to obtain all identified characters and, from the encoding of each identified character, obtains all variant forms of the corresponding standard Chinese character.

Once the number of digitized documents reaches a certain scale, the identified-character acquisition module can provide a variant-character table for all standard Chinese characters.
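The acquisition of identified characters and the build-up of a variant-character table might look roughly like the sketch below. It rests on several assumptions that the source does not confirm: that the `variant` attribute holds the 1-based position of the identified character within the element text, that a lookup from variant form to standard character is available (the `VARIANT_TO_STANDARD` table here is a hypothetical stand-in), and that the sample three-character string is representative. The `reverse` attribute is ignored.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical lookup from a variant form to its standard character.
# A real system would presumably derive this from character encodings.
VARIANT_TO_STANDARD = {"遊": "游"}


def acquire_identified_characters(xml_text):
    """Return the character found at the marked position of each <text> element."""
    root = ET.fromstring(xml_text)
    found = []
    for node in root.iter("text"):
        position = node.get("variant")
        if position is None:
            continue  # no identified character marked in this element
        content = node.text or ""
        index = int(position) - 1  # assumption: positions are 1-based
        if 0 <= index < len(content):
            found.append(content[index])
    return found


def build_variant_table(characters):
    """Group identified characters under their standard character."""
    table = defaultdict(set)
    for character in characters:
        standard = VARIANT_TO_STANDARD.get(character)
        if standard is not None:
            table[standard].add(character)
    return {standard: sorted(variants) for standard, variants in table.items()}


# Hypothetical sample document; the <doc> wrapper and the strings are assumptions.
SAMPLE = (
    "<doc>"
    '<text reverse="reverse" variant="3">元宵遊</text>'
    "<text>元宵</text>"
    "</doc>"
)

if __name__ == "__main__":
    identified = acquire_identified_characters(SAMPLE)
    print(identified)
    print(build_variant_table(identified))
```

Because each document contributes only the variants it actually contains, the table becomes more complete as more documents are processed, which matches the remark about reaching a certain scale.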

Claims

1. A method of digital knowledge discovery, comprising a knowledge labeling method, a knowledge generation module and a knowledge acquisition module.

2. The knowledge labeling method as claimed in claim 1, comprising a meaning labeling method, an emphasis-word labeling method and an identified-character labeling method.

3. The meaning labeling method as claimed in claim 2, comprising:

Adding a meaning attribute to the font: a font meaning attribute is created, a meaning is created through the meaning attribute value, and a knowledge point is established, so that content in the digitized document is given the specified meaning;

Adding a meaning explanation to the font: a supplementary means of explaining meaning is provided, so that a specified meaning that might be ambiguous can be explained.

4. The emphasis-word labeling method as claimed in claim 2, comprising:

Emphasis-word marking: an emphasis-word marker is created, which not only marks the type of the emphasis word but also contains the content of the word itself;

Emphasis-word meaning attribute: an emphasis-word meaning attribute is created, and knowledge points are established by word type, so that entries in the digitized document are given the specified meaning.

5. The identified-character labeling method as claimed in claim 2, comprising:

Identified-character attribute: an identified-character attribute is created on the text; the position of the identified character in the text marks the character at that position as identified.

6. The knowledge generation module as claimed in claim 1, comprising:

Font meaning generation module: generates standard-format data from the font meaning raw data.

Emphasis-word meaning generation module: generates standard-format data from the emphasis-word meaning raw data.

Identified-character generation module: generates standard-format data from the identified-character raw data.

7. The knowledge acquisition module as claimed in claim 1, comprising:

Font meaning acquisition module: obtains knowledge through the font meaning attribute and its value.

Emphasis-word meaning acquisition module: obtains classified knowledge through the emphasis-word attribute and its value.

Identified-character acquisition module: obtains identified characters through the position of the identified character in the text.