CN101667203A - Digital knowledge discovery method - Google Patents

Info

Publication number
CN101667203A
Authority
CN
China
Prior art keywords
meaning
knowledge
speech
word
font
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN200910169826A
Other languages
Chinese (zh)
Inventor
蒋贤春
郑珑
蓝德康
谢术清
朱人杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING CHINA-E CHINA-S ELECTRONICS Co Ltd
Original Assignee
BEIJING CHINA-E CHINA-S ELECTRONICS Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BEIJING CHINA-E CHINA-S ELECTRONICS Co Ltd
Priority to CN200910169826A
Publication of CN101667203A
Legal status: Pending

Landscapes

  • Document Processing Apparatus (AREA)

Abstract

The invention provides a digital knowledge discovery method. Knowledge, including the classification of words, approval characters, and the meanings of characters, words, and sentences, can be obtained during digitization at almost no added cost; all personal names, place names, event names, and the like in documents, and all headwords, phrases, sentences, book titles, and the like, can be discovered. The invention is characterized in that the digitization processing units determine the knowledge points to be discovered before a document is processed, using the meaning marking method, the emphasis-word marking method, and the approval-character marking method provided by the invention; after processing is finished, a knowledge generation module first generates knowledge in a specified format, and a knowledge acquisition module then discovers the classified knowledge in the documents.

Description

A method of digital knowledge discovery
Technical field
The invention belongs to the IT field. It provides a method of digital knowledge discovery that obtains knowledge during digitization at almost no additional cost. That knowledge includes the classification of words, approval characters, and the meanings of characters, words, and sentences; for example, all personal names, place names, and event names in a document, and all headwords, phrases, example sentences, titles, and so on, can be mined.
Using the meaning labeling method, the emphasis-word labeling method, and the approval-character labeling method provided by the invention, the digitization processing unit defines the knowledge points to be mined before a document is processed. After processing is complete, the knowledge generation module first generates knowledge in a standard format, and the knowledge acquisition module then mines the classified knowledge in the document.
Background technology
With the rapid development of modern computing, communication, and network technology, a second information revolution centered on content is under way worldwide, and digital information resources have become essential to the modern information society. Readers can obtain a large amount of knowledge over the network, but the data volume is too large. Retrieval can narrow the scope, yet the results remain massive. The main reason is that the data are not classified; that is, no knowledge points have been established by which visitors can cluster the data.
Knowledge mining of data content is valued by a growing number of experts and readers, but with existing techniques its cost is very high: after digitization, an expert or dedicated staff member must read the original document word by word and then mark the knowledge points at the relevant positions.
Summary of the invention
The present invention is a method of digital knowledge discovery comprising a meaning mining method, an emphasis-word mining method, and an approval-character mining method.
One, knowledge labeling method
1. Meaning labeling method
(1) A signification (meaning) attribute is added to the font;
(2) The meaning attribute value can be: document name, title, image title, table title, body text, index title, index 1, index 2, index 3, table text, contents 1, contents 2, contents 3, contents 4, contents 5, contents 6, annotation subject, annotation, footer, header, centre joint, reference, example sentence, sentence, phrase, proverb, Chinese idiom, idiom, word, character, double-line small characters, triple-line small characters, four-line small characters, inserted characters, other.
(3) Some knowledge-point meaning attributes need to be described by a meaning explanation.
2. Emphasis-word labeling method
(1) An emphasis-word marker is created, for example by defining the emphasis-mark tag in XML as decoration;
(2) The emphasis word carries a signification (meaning) attribute;
(3) The meaning attribute value can be: personal name, place name, event name, title, attached annotation, etc.
3. Approval-character labeling method
(1) An approval-character attribute is created on the text; the position of the approval character within the text marks the character at that position as approved.
Two, knowledge generation module
The raw data produced by different digitization processing units are scattered, usually across databases and data files, and their formats are not uniform. The knowledge generation module provided by the invention formats the raw data so that knowledge can be acquired easily.
1. Font meaning generation module: generates standard-format data from font meaning raw data.
2. Emphasis-word meaning generation module: generates standard-format data from emphasis-word meaning raw data.
3. Approval-character generation module: generates standard-format data from approval-character raw data.
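The normalization step performed by the knowledge generation module can be pictured with a minimal Python sketch. The source does not specify any raw-data layout or standard format, so everything here is an assumption made for illustration: the KnowledgeRecord type, the tuple layout of a database row, and the tab-separated layout of a data-file line are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KnowledgeRecord:
    """Hypothetical standard-format record shared by all three generation modules."""
    kind: str           # "font", "emphasis" or "approval" (assumed labels)
    signification: str  # meaning attribute value, e.g. "prefix" or "place name"
    text: str           # the marked content itself


def from_database_row(row):
    """Assumed database layout: a (kind, signification, text) tuple."""
    kind, signification, text = row
    return KnowledgeRecord(kind.strip().lower(), signification.strip(), text.strip())


def from_file_line(line):
    """Assumed data-file layout: text<TAB>kind<TAB>signification."""
    text, kind, signification = line.rstrip("\n").split("\t")
    return KnowledgeRecord(kind.strip().lower(), signification.strip(), text.strip())


def normalize(db_rows, file_lines):
    """Bring scattered raw data from both assumed sources into one uniform list."""
    records = [from_database_row(row) for row in db_rows]
    records.extend(from_file_line(line) for line in file_lines)
    return records
```

Whatever the real layouts are, the point of the design is the same: once every source is mapped onto one record shape, the acquisition modules only need to understand that single shape.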
Three, knowledge acquisition module
1. Font meaning acquisition module: obtains knowledge by font meaning attribute and attribute value.
2. Emphasis-word meaning acquisition module: obtains classified knowledge by emphasis-word attribute and attribute value.
3. Approval-character acquisition module: obtains approval characters by their position in the text.
Description of drawings
Fig. 1: knowledge labeling method.
Fig. 2: knowledge generation module.
Fig. 3: knowledge acquisition module.
Fig. 4: English-Chinese dictionary entry word example.
Fig. 5: emphasis-word example.
Fig. 6: approval-character example.
Embodiment
The application of the invention is illustrated below by example:
1. Meaning mining method
Take the headword font, phrase font, and example-sentence font in an English-Chinese dictionary. During normal keying, the data operator on the production line does not need to consider knowledge mining and only needs to confirm the font as usual. See Fig. 4.
Here "adamantine" is a headword, "~chains" is an example sentence, and "~spar" is a phrase. The data operator keys them as "+1adamantine+", "+2~chains+", and "+3~spar+" respectively, where "+" is the font instruction character and the number after it indicates which font is meant: font 1 is the headword font, font 2 the example-sentence font, and font 3 the phrase font. When the program builds the XML file and encounters a "+" instruction character, the font meaning generation module in the knowledge generation module converts the instruction character to the corresponding tag. The conversion result is as follows (the headword carries the signification value "prefix"):
<font size="48" weight="bold" signification="prefix">
<text>1ad·a·man·tine</text>
</font>
<font size="40" style="italic" signification="example sentence">
<text>~chains</text>
</font>
<font size="40" weight="bold" signification="phrase">
<text>~spar</text>
</font>
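The conversion from keyed font instructions to tags can be sketched in Python as follows. This is only an illustration under stated assumptions: the FONT_TABLE mapping and the function name are hypothetical, the font number is assumed to be a single digit, and the sketch emits the keyed content unchanged rather than reproducing the syllable dots shown in the result above.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical font table: font number -> presentation attributes plus signification.
# Numbers 1 to 3 follow the dictionary example; the table itself is an assumption.
FONT_TABLE = {
    "1": {"size": "48", "weight": "bold", "signification": "prefix"},
    "2": {"size": "40", "style": "italic", "signification": "example sentence"},
    "3": {"size": "40", "weight": "bold", "signification": "phrase"},
}

# Matches "+<font number><content>+", for example "+1adamantine+".
FONT_INSTRUCTION = re.compile(r"\+(\d)([^+]*)\+")


def convert_font_instructions(keyed_text):
    """Turn every "+N...+" run in the keyed text into a serialized <font> element."""
    elements = []
    for number, content in FONT_INSTRUCTION.findall(keyed_text):
        font = ET.Element("font", dict(FONT_TABLE[number]))
        ET.SubElement(font, "text").text = content
        elements.append(ET.tostring(font, encoding="unicode"))
    return elements


if __name__ == "__main__":
    for xml_fragment in convert_font_instructions("+1adamantine+ +2~chains+ +3~spar+"):
        print(xml_fragment)
```

Because the operator only keys the font number, the signification is attached by table lookup at conversion time, which is how the method avoids extra annotation work during keying.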
After the digitization of a document is complete, the font meaning acquisition module in the knowledge acquisition module interprets the XML and obtains all the knowledge, such as all headwords, example sentences, and phrases in the dictionary.
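Acquisition by font meaning attribute can be sketched in Python as a grouping pass over the XML. The wrapping entry element, the sample content, and the function name are assumptions for illustration; the source only shows the individual font elements.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: the <entry> wrapper is assumed so the fragment has one root.
SAMPLE = """<entry>
  <font size="48" weight="bold" signification="prefix">
    <text>adamantine</text>
  </font>
  <font size="40" style="italic" signification="example sentence">
    <text>~chains</text>
  </font>
  <font size="40" weight="bold" signification="phrase">
    <text>~spar</text>
  </font>
</entry>"""


def collect_by_signification(xml_source):
    """Group the text of every <font> element under its signification value."""
    root = ET.fromstring(xml_source)
    knowledge = defaultdict(list)
    for font in root.iter("font"):
        meaning = font.get("signification", "").strip()
        if not meaning:
            continue  # a font without a meaning attribute carries no knowledge point
        knowledge[meaning].append("".join(font.itertext()).strip())
    return dict(knowledge)


if __name__ == "__main__":
    for meaning, items in collect_by_signification(SAMPLE).items():
        print(meaning, items)
```

Attribute values are stripped before grouping so that values written with stray padding still land in the same category.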
2. Emphasis-word mining method
An emphasis mark usually reflects a grammatical attribute of a word: for example, a personal name is boxed, a place name is marked with a line at its side, and an event name is marked with a wavy line at its side.
During digitization, when keying staff encounter text like Fig. 5, they input "[instruction character]have garrison troops open up wasteland and grow food grain[instruction character]", where [instruction character] stands for the emphasis-word instruction character, which appears only as inline images in the source (Figures A20091016982600051 to A20091016982600055). When the XML file is built, the emphasis-word meaning generation module in the knowledge generation module converts this input to:
<decoration signification="place name">have garrison troops open up wasteland and grow food grain</decoration>
After the digitization of a document is complete, the emphasis-word meaning acquisition module in the knowledge acquisition module interprets the XML and obtains all classified knowledge, such as all place names, personal names, and event names in the document.
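Retrieving one category of emphasis words can be sketched in Python with an attribute query on decoration elements. The document wrapper, the second sample entry, and the function name are hypothetical; only the decoration tag, its signification attribute, and the place-name content come from the example above.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document; "Zhang San" is an invented placeholder name.
SAMPLE = """<document>
  <p>
    <decoration signification="place name">have garrison troops open up wasteland and grow food grain</decoration>
    <decoration signification="name">Zhang San</decoration>
  </p>
</document>"""


def emphasis_words(xml_source, category):
    """Return the content of every emphasis word whose signification equals category.

    The category is interpolated into a limited XPath expression, so this sketch
    assumes it contains no single quote.
    """
    root = ET.fromstring(xml_source)
    query = ".//decoration[@signification='{}']".format(category)
    return [(node.text or "").strip() for node in root.iterfind(query)]


if __name__ == "__main__":
    print(emphasis_words(SAMPLE, "place name"))
    print(emphasis_words(SAMPLE, "name"))
```

Querying by attribute value keeps each category list independent, so a list of all place names can be produced without touching personal names or event names.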
3. Approval-character mining method
During digitization, when keying staff encounter text like Fig. 6, they input "⊥ Swam on Lantern Festival", where "⊥" is the approval-character instruction character. When the XML file is built, the approval-character generation module in the knowledge generation module converts "⊥ Swam on Lantern Festival" to:
<text reverse="reverse" variant="3">Yuan Xiao Swam</text>
After the digitization of a document is complete, the approval-character acquisition module in the knowledge acquisition module interprets the XML and obtains all approval characters, then uses the code of each approval character to obtain all variant forms of the standard Chinese character with that code.
Once the number of digitized documents reaches a certain scale, the approval-character acquisition module can provide a table of variant forms for all standard characters.
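Position-based acquisition of approval characters can be sketched in Python as below. Several points are assumptions rather than facts from the source: the variant attribute is read as a 1-based character position, the doc wrapper and the Latin placeholder strings are invented, and the later lookup from an approval character's code to its standard character is not shown because the source does not describe that mapping.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: Latin letters stand in for the original Chinese characters.
SAMPLE = (
    "<doc>"
    '<text reverse="reverse" variant="3">ABC</text>'
    "<text>plain</text>"
    '<text variant="9">XY</text>'
    "</doc>"
)


def approval_characters(xml_source):
    """Return (text, approval character) pairs located via the variant position."""
    root = ET.fromstring(xml_source)
    found = []
    for node in root.iter("text"):
        position = node.get("variant")
        if position is None:
            continue  # no approval character marked in this text
        content = node.text or ""
        index = int(position.strip()) - 1  # assumption: positions count from 1
        if 0 <= index < len(content):
            found.append((content, content[index]))
    return found


if __name__ == "__main__":
    print(approval_characters(SAMPLE))
```

Collecting these pairs across many documents is the raw material from which a variant table could later be assembled, given a code mapping that this sketch leaves out.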

Claims (7)

1. A method of digital knowledge discovery, comprising a knowledge labeling method, a knowledge generation module, and a knowledge acquisition module.
2. The knowledge labeling method as claimed in claim 1, comprising a meaning labeling method, an emphasis-word labeling method, and an approval-character labeling method.
3. The meaning labeling method as claimed in claim 2, comprising:
Adding a meaning attribute to the font: a font meaning attribute is created, the meaning is assigned through the meaning attribute value, and a knowledge point is set up, so that content in the digitized document is given the specified meaning;
Adding a meaning explanation to the font: a means of supplementary explanation is provided to explain a specified meaning that might be ambiguous.
4. The emphasis-word labeling method as claimed in claim 2, comprising:
Emphasis-word marking: an emphasis-word marker is created that not only marks the emphasis-word type but also contains the content of the word itself;
Emphasis-word meaning attribute: an emphasis-word meaning attribute is created and a knowledge point is set up by word type, so that entries in the digitized document are given the specified meaning.
5. The approval-character labeling method as claimed in claim 2, comprising:
Approval-character attribute: an approval-character attribute is created on the text; the position of the approval character within the text marks the character at that position as approved.
6. The knowledge generation module as claimed in claim 1, comprising:
Font meaning generation module: generates standard-format data from font meaning raw data.
Emphasis-word meaning generation module: generates standard-format data from emphasis-word meaning raw data.
Approval-character generation module: generates standard-format data from approval-character raw data.
7. The knowledge acquisition module as claimed in claim 1, comprising:
Font meaning acquisition module: obtains knowledge by font meaning attribute and attribute value.
Emphasis-word meaning acquisition module: obtains classified knowledge by emphasis-word attribute and attribute value.
Approval-character acquisition module: obtains approval characters by their position in the text.
CN200910169826A 2009-09-04 2009-09-04 Digital knowledge discovery method Pending CN101667203A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN200910169826A CN101667203A (en) 2009-09-04 2009-09-04 Digital knowledge discovery method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN200910169826A CN101667203A (en) 2009-09-04 2009-09-04 Digital knowledge discovery method

Publications (1)

Publication Number Publication Date
CN101667203A true CN101667203A (en) 2010-03-10

Family

ID=41803819

Family Applications (1)

Application Number Title Priority Date Filing Date
CN200910169826A Pending CN101667203A (en) 2009-09-04 2009-09-04 Digital knowledge discovery method

Country Status (1)

Country Link
CN (1) CN101667203A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102622346A (en) * 2011-01-26 2012-08-01 中国科学院上海生命科学研究院 Method, device and system for protein knowledge mining and discovery in Chinese bibliographic database
CN102622346B (en) * 2011-01-26 2014-04-09 中国科学院上海生命科学研究院 Method, device and system for protein knowledge mining and discovery in Chinese bibliographic database
CN107688600A (en) * 2017-07-12 2018-02-13 百度在线网络技术(北京)有限公司 Knowledge point method for digging and device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20100310