CN101667203A - Digital knowledge discovery method - Google Patents
- Publication number
- CN101667203A
- Authority
- CN
- China
- Prior art keywords
- meaning
- knowledge
- speech
- word
- font
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Document Processing Apparatus (AREA)
Abstract
The invention provides a digital knowledge discovery method. With almost no added cost during digitization, it obtains knowledge that includes the classification of words, identified characters, and the meanings of characters, words and sentences; it can discover all personal names, place names, event names and the like in a document, as well as all headwords, phrases, sentences, book titles and the like. The invention is characterized in that the knowledge points to be discovered are defined before a document is processed, using the meaning marking method, the emphasis word marking method and the identified character marking method provided by the invention. After the digital processing units have finished processing, a knowledge generation module first generates knowledge in a specified format, and a knowledge acquisition module then discovers the classified knowledge in the documents.
Description
Technical field
The invention belongs to the IT field. It provides a digital knowledge discovery method that obtains knowledge during digitization at almost no additional cost, including the classification of words, identified characters, and the meanings of characters, words and sentences. For example, it can mine all personal names, place names, event names and the like in a document, as well as all headwords, phrases, example sentences, titles and the like.
Using the meaning labeling method, the emphasis word labeling method and the identified character labeling method provided by the invention, digital processing units define the knowledge points to be mined before a document is processed. After processing is complete, the knowledge generation module first generates knowledge in a standard format, and the knowledge acquisition module then mines the classified knowledge in the document.
Background art
With the rapid development of modern computer, communication and network technology, a second information revolution centered on content is under way around the world, and digital information resources have become a necessity of the modern information society. Readers can obtain a great deal of knowledge over the network, but the volume of data is too large: although retrieval narrows the scope, the results are still massive. The main reason is that the data are not classified, that is, no knowledge points have been established that would let readers browse the data by cluster.
Knowledge mining of data content is attracting attention from more and more experts and readers, but existing knowledge mining techniques are very costly: after digitization, an expert or dedicated staff member must read the original document word by word and then mark the knowledge points at the relevant positions.
Summary of the invention
The present invention is a digital knowledge discovery method comprising a meaning mining method, an emphasis word mining method and an identified character mining method.
One, knowledge labeling method
1. Meaning labeling method
(1) Add a signification (meaning) attribute to the font;
(2) The signification attribute value can be: document name, title, image title, table title, body text, index title, index 1, index 2, index 3, table text, catalogue 1, catalogue 2, catalogue 3, catalogue 4, catalogue 5, catalogue 6, annotation subject, annotation, footer, header, center seam, list of references, example sentence, sentence, phrase, proverb, Chinese idiom, idiom, word, character, two-line small characters, three-line small characters, four-line small characters, inserted characters, other.
(3) The signification attribute of some knowledge points needs to be supplemented by a meaning explanation.
2. Emphasis word labeling method
(1) Create an emphasis word marker; in XML, the emphasis word tag is defined as decoration;
(2) The emphasis word carries a signification (meaning) attribute;
(3) The signification attribute value can be: personal name, place name, event name, title, attached annotation, etc.
3. Identified character labeling method
(1) Create an identified character attribute on the text; the attribute records the position of the identified character within the text and thereby marks the character at that position.
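The three labeling methods above can be pictured as three kinds of XML markup. The following is a minimal Python sketch, assuming the element and attribute names that appear in the embodiment (`font`, `decoration`, `text`, `signification`, `variant`). The function names, the abbreviated value sets, and the reading of `variant` as a 1-based character position are assumptions made for illustration, not part of the patent text.

```python
import xml.etree.ElementTree as ET

# Abbreviated value sets taken from the lists above; a full schema would hold them all.
FONT_VALUES = {"title", "example sentence", "phrase", "proverb", "other"}
EMPHASIS_VALUES = {"place name", "event name", "title"}


def label_font(content, signification):
    """Meaning labeling: wrap content in a <font> that carries its signification."""
    if signification not in FONT_VALUES:
        raise ValueError(f"unknown font signification: {signification!r}")
    font = ET.Element("font", {"signification": signification})
    ET.SubElement(font, "text").text = content
    return font


def label_emphasis(content, signification):
    """Emphasis word labeling: a <decoration> element that holds the word itself."""
    if signification not in EMPHASIS_VALUES:
        raise ValueError(f"unknown emphasis signification: {signification!r}")
    decoration = ET.Element("decoration", {"signification": signification})
    decoration.text = content
    return decoration


def label_identified(content, position):
    """Identified character labeling: record the character's position in the text.

    Treating the attribute as a 1-based position named `variant` is an assumption.
    """
    if not 1 <= position <= len(content):
        raise ValueError("position lies outside the text")
    text = ET.Element("text", {"variant": str(position)})
    text.text = content
    return text


if __name__ == "__main__":
    for element in (
        label_font("~spar", "phrase"),
        label_emphasis("Place A", "place name"),
        label_identified("abc", 3),
    ):
        print(ET.tostring(element, encoding="unicode"))
```

Validating the signification value at labeling time is one possible design choice: it keeps the set of knowledge points fixed before processing starts, which is the property the method relies on.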
Two, knowledge generation module
The raw data produced by different digital processing units is scattered, usually stored in databases and data files, and is not uniform in format. The knowledge generation module provided by the invention formats the raw data so that knowledge can be acquired easily.
1. Font meaning generation module: generates standard-format data from the raw font meaning data.
2. Emphasis word meaning generation module: generates standard-format data from the raw emphasis word meaning data.
3. Identified character generation module: generates standard-format data from the raw identified character data.
Three, knowledge acquisition module
1. Font meaning acquisition module: obtains knowledge through the font signification attribute and its value.
2. Emphasis word meaning acquisition module: obtains classified knowledge through the emphasis word attribute and its value.
3. Identified character acquisition module: obtains identified characters through their positions in the text.
Description of drawings
Fig. 1: knowledge labeling method.
Fig. 2: knowledge generation module.
Fig. 3: knowledge acquisition module.
Fig. 4: English-Chinese dictionary entry word example.
Fig. 5: emphasis word example.
Fig. 6: identified character example.
Embodiment
The application of the invention is illustrated below by example:
1. Meaning mining method
Take the headword font, phrase font, example sentence font and so on in an English-Chinese dictionary. During normal input, data operators on the production line do not need to consider knowledge mining; they only need to identify the font as usual. See Fig. 4.
Here "adamantine" is the headword, "~chains" is an example sentence, and "~spar" is a phrase. The data operator enters them as "+1adamantine+", "+2~chains+" and "+3~spar+" respectively, where "+" is the font instruction character and the digit after it indicates which font is meant: font 1 is the headword font (signification value "prefix" in the markup), font 2 is the example sentence font, and font 3 is the phrase font. When the program builds the XML file and encounters a "+" instruction character, the font meaning generation module in the knowledge generation module converts the instruction into the corresponding tag. The conversion result is as follows:
<font size="48" weight="bold" signification="prefix">
<text>1ad·a·man·tine</text>
</font>
<font size="40" style="italic" signification="example sentence">
<text>~chains</text>
</font>
<font size="40" weight="bold" signification="phrase">
<text>~spar</text>
</font>
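The conversion described above can be sketched in Python. This is a minimal illustration rather than the patented implementation: the `FONT_TABLE` mapping, the function name and the regular expression are assumptions, and the sketch treats the digit purely as a font selector, so it does not reproduce the leading "1" or the syllable dots shown in the headword example, which the description does not explain.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical font table: font number -> XML attributes, following the example above.
FONT_TABLE = {
    "1": {"size": "48", "weight": "bold", "signification": "prefix"},
    "2": {"size": "40", "style": "italic", "signification": "example sentence"},
    "3": {"size": "40", "weight": "bold", "signification": "phrase"},
}

# "+<digit><content>+": "+" is the font instruction character, the digit selects the font.
FONT_INSTRUCTION = re.compile(r"\+(\d)([^+]*)\+")


def convert_font_instructions(raw):
    """Convert every "+N...+" run in raw operator input into a <font> element."""
    elements = []
    for number, content in FONT_INSTRUCTION.findall(raw):
        if number not in FONT_TABLE:
            raise ValueError(f"no font defined for number {number}")
        # Copy the attribute dict so the shared table is never modified.
        font = ET.Element("font", dict(FONT_TABLE[number]))
        ET.SubElement(font, "text").text = content
        elements.append(font)
    return elements


if __name__ == "__main__":
    for font in convert_font_instructions("+2~chains+ +3~spar+"):
        print(ET.tostring(font, encoding="unicode"))
```

Because the operator only types a font number, the signification is attached by a table lookup at conversion time, which is how the knowledge point can be recorded without extra effort during input.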
After a document has been digitized, the font meaning acquisition module in the knowledge acquisition module interprets the XML to obtain all the knowledge, such as all headwords, example sentences and phrases in the dictionary.
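The acquisition step can be illustrated with a short Python sketch that groups the text of every `font` element by its `signification` value. The `<entry>` wrapper element, the sample data and the function name are assumptions introduced only to make the example self-contained.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample; the <entry> wrapper is added so the fragment has a single root.
SAMPLE = """<entry>
  <font size="48" weight="bold" signification="prefix"><text>adamantine</text></font>
  <font size="40" style="italic" signification="example sentence"><text>~chains</text></font>
  <font size="40" weight="bold" signification="phrase"><text>~spar</text></font>
</entry>"""


def collect_by_signification(xml_source, tag="font"):
    """Group the text of every <tag> element by its signification value."""
    root = ET.fromstring(xml_source)
    knowledge = defaultdict(list)
    for element in root.iter(tag):
        signification = element.get("signification")
        if signification is None:
            continue  # an element without a signification is not a knowledge point
        content = "".join(element.itertext()).strip()
        knowledge[signification.strip()].append(content)
    return dict(knowledge)


if __name__ == "__main__":
    for signification, items in collect_by_signification(SAMPLE).items():
        print(signification, items)
```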
2. Emphasis word mining method
An emphasis mark usually reflects a grammatical attribute of the word: for example, a personal name is enclosed in a frame, a place name carries a side underline, and an event name carries a side wavy line.
During digitization, when the data-entry operator encounters text such as that in Fig. 5, the operator enters "have garrison troops open up wasteland and grow food grain" marked with the emphasis word instruction character. When the XML file is built, the emphasis word meaning generation module in the knowledge generation module converts this input to:
<decoration signification="place name">have garrison troops open up wasteland and grow food grain</decoration>
After a document has been digitized, the emphasis word meaning acquisition module in the knowledge acquisition module interprets the XML to obtain all the classified knowledge, such as all place names, personal names and event names in the document.
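Because emphasis words sit inline within running text, acquisition amounts to walking the document and building an index per category. The Python sketch below counts each emphasis word under its `signification` value. The `<paragraph>` wrapper, the placeholder names and the function name are hypothetical, and the use of occurrence counts is an illustrative choice that the description does not prescribe.

```python
import xml.etree.ElementTree as ET
from collections import Counter, defaultdict

# Hypothetical sample; the <paragraph> wrapper and the placeholder names are illustrative.
SAMPLE = (
    "<paragraph>"
    '<decoration signification="name">Person A</decoration> travelled from '
    '<decoration signification="place name">Place B</decoration> to '
    '<decoration signification="place name">Place C</decoration> and back to '
    '<decoration signification="place name">Place B</decoration>.'
    "</paragraph>"
)


def index_emphasis_words(xml_source):
    """Count emphasis words per category, for example every place name in a document."""
    index = defaultdict(Counter)
    for decoration in ET.fromstring(xml_source).iter("decoration"):
        # strip() tolerates attribute values padded with spaces.
        category = (decoration.get("signification") or "").strip()
        word = "".join(decoration.itertext()).strip()
        if category and word:
            index[category][word] += 1
    return {category: dict(counts) for category, counts in index.items()}


if __name__ == "__main__":
    for category, counts in index_emphasis_words(SAMPLE).items():
        print(category, counts)
```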
3. Identified character mining method
During digitization, when the data-entry operator encounters text such as that in Fig. 6, the operator enters "⊥ Swam on Lantern Festival", where "⊥" is the identified character instruction character. When the XML file is built, the identified character generation module in the knowledge generation module converts "⊥ Swam on Lantern Festival" to:
<text reverse="reverse" variant="3">Yuan Xiao Swam</text>
After a document has been digitized, the identified character acquisition module in the knowledge acquisition module interprets the XML to obtain all identified characters, and uses the code of each identified character to obtain all variant forms of the standard character with that code.
Once the number of digitized documents reaches a certain scale, the identified character acquisition module can provide a table of variant forms for every standard character.
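How such a variant table might be accumulated is sketched below in Python. Several points are assumptions, since the description does not spell them out: the `variant` attribute is read as the 1-based position of the identified character, the `<document>` wrapper and sample strings are invented, and the `STANDARD_FORM` lookup stands in for whatever character coding links a variant to its standard form.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical lookup from a variant character to its standard form.
STANDARD_FORM = {"羣": "群", "峯": "峰"}

# Hypothetical sample; only elements with a variant attribute contain identified characters.
SAMPLE = """<document>
  <text variant="2">人羣</text>
  <text variant="2">高峯</text>
  <text variant="1">羣山</text>
  <text>群山</text>
</document>"""


def build_variant_table(xml_source):
    """Map each standard character to the sorted variant forms found in the document."""
    table = defaultdict(set)
    for element in ET.fromstring(xml_source).iter("text"):
        position = element.get("variant")
        content = element.text or ""
        if position is None:
            continue
        index = int(position) - 1  # assumes a 1-based character position
        if not 0 <= index < len(content):
            continue  # ignore positions that fall outside the text
        variant = content[index]
        standard = STANDARD_FORM.get(variant)
        if standard is not None:
            table[standard].add(variant)
    return {standard: sorted(variants) for standard, variants in table.items()}


if __name__ == "__main__":
    print(build_variant_table(SAMPLE))
```

Running the same function over many documents and merging the results would give the kind of cumulative variant table that the paragraph above describes.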
Claims (7)
1. the method for a digital knowledge discovery comprises knowledge labeling method, knowledge generation module, knowledge acquisition module.
2. knowledge labeling method as claimed in claim 1 comprises meaning labeling method, speech labeling method, approval word mark method.
3. meaning labeling method as claimed in claim 2 comprises:
Font is added the meaning attribute: create font meaning attribute, create meaning by the meaning property value, set up the knowledge point, thereby make the content in the digitizing document be endowed the meaning of appointment;
Font is added the meaning explanation: meaning supplementary notes means are provided, the meaning that might produce ambiguous appointment is furnished an explanation.
4. The emphasis word labeling method as claimed in claim 2, comprising:
Emphasis word marking: an emphasis word marker is created, which not only marks the type of the emphasis word but also contains the content of the word itself;
Emphasis word signification attribute: an emphasis word signification attribute is created and a knowledge point is established by word category, so that entries in the digitized document are given the specified meaning;
5. The identified character labeling method as claimed in claim 2, comprising:
Identified character attribute: an identified character attribute is created on the text; the attribute records the position of the identified character within the text and thereby marks the character at that position.
6. The knowledge generation module as claimed in claim 1, comprising:
Font meaning generation module: generates standard-format data from the raw font meaning data.
Emphasis word meaning generation module: generates standard-format data from the raw emphasis word meaning data.
Identified character generation module: generates standard-format data from the raw identified character data.
7. The knowledge acquisition module as claimed in claim 1, comprising:
Font meaning acquisition module: obtains knowledge through the font signification attribute and its value.
Emphasis word meaning acquisition module: obtains classified knowledge through the emphasis word attribute and its value.
Identified character acquisition module: obtains identified characters through their positions in the text.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910169826A CN101667203A (en) | 2009-09-04 | 2009-09-04 | Digital knowledge discovery method |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910169826A CN101667203A (en) | 2009-09-04 | 2009-09-04 | Digital knowledge discovery method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN101667203A (en) | 2010-03-10 |
Family
ID=41803819
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN200910169826A Pending CN101667203A (en) | 2009-09-04 | 2009-09-04 | Digital knowledge discovery method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101667203A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102622346A (en) * | 2011-01-26 | 2012-08-01 | 中国科学院上海生命科学研究院 | Method, device and system for protein knowledge mining and discovery in Chinese bibliographic database |
CN102622346B (en) * | 2011-01-26 | 2014-04-09 | 中国科学院上海生命科学研究院 | Method, device and system for protein knowledge mining and discovery in Chinese bibliographic database |
CN107688600A (en) * | 2017-07-12 | 2018-02-13 | 百度在线网络技术(北京)有限公司 | Knowledge point method for digging and device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111737969B (en) | Resume parsing method and system based on deep learning | |
CN108228676B (en) | Information extraction method and system | |
CN103077164B (en) | Text analyzing method and text analyzer | |
CN106066866A (en) | A kind of automatic abstracting method of english literature key phrase and system | |
CN101079025B (en) | File correlation computing system and method | |
CN102693222B (en) | Carapace bone script explanation machine translation method based on example | |
CN101216819B (en) | Name card information Chinese to English automatic translation method based on domain ontology | |
CN106528536A (en) | Multilingual word segmentation method based on dictionaries and grammar analysis | |
CN101196881A (en) | Words symbolization processing method and system for number and special symbol string in text | |
CN103996055A (en) | Identification method based on classifiers in image document electronic material identification system | |
CN112765999A (en) | Machine translation bilingual comparison method and system | |
CN113761202A (en) | Optimization system for mapping unstructured financial Excel table to database | |
Gillis-Webber et al. | The shortcomings of language tags for linked data when modeling lesser-known languages | |
CN101777043A (en) | Word conversion method and device | |
CN102110108B (en) | Method and device for processing galley proof file | |
CN101667203A (en) | Digital knowledge discovery method | |
CN112506488A (en) | Method for generating programming language class based on sql creating statement | |
CN103164395A (en) | Chinese-Kirgiz language electronic dictionary and automatic translating Chinese-Kirgiz language method thereof | |
CN105843802A (en) | Corpus intervention module and method in translation | |
CN103164396A (en) | Chinese-Uygur language-Kazakh-Kirgiz language electronic dictionary and automatic translating Chinese-Uygur language-Kazakh-Kirgiz language method thereof | |
CN106649219B (en) | A kind of telecommunication satellite design document automatic generation method | |
Petran et al. | ReM: A reference corpus of Middle High German--corpus compilation, annotation, and access | |
Dousset et al. | Developing a database for Australian Indigenous kinship terminology: The AustKin project | |
Wong et al. | Updating the ice annotation system: tagging, parsing and validation | |
CN103605755B (en) | A kind of construction method of proverb literary composition database and proverb literary composition database retrieval system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Application publication date: 20100310 |