CN114791955A - Traditional Chinese medicine literature corpus and knowledge base integrated system - Google Patents

Publication number
CN114791955A
Authority
CN
China
Prior art keywords
labeling
module
semantic
document
corpus
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210413257.4A
Other languages
Chinese (zh)
Inventor
刘丽红
朱彦
李海燕
贾李蓉
杨硕
姚克宇
高博
陈超
聂莹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Information On Traditional Chinese Medicine Cacms
Original Assignee
Institute Of Information On Traditional Chinese Medicine Cacms
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Information On Traditional Chinese Medicine Cacms filed Critical Institute Of Information On Traditional Chinese Medicine Cacms
Priority to CN202210413257.4A priority Critical patent/CN114791955A/en
Publication of CN114791955A publication Critical patent/CN114791955A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/80Information retrieval; Database structures therefor; File system structures therefor of semi-structured data, e.g. markup language structured data such as SGML, XML or HTML
    • G06F16/84Mapping; Conversion
    • G06F16/86Mapping to a database
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16HHEALTHCARE INFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR THE HANDLING OR PROCESSING OF MEDICAL OR HEALTHCARE DATA
    • G16H70/00ICT specially adapted for the handling or processing of medical references
    • G16H70/40ICT specially adapted for the handling or processing of medical references relating to drugs, e.g. their side effects or intended usage

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Chemical & Material Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Medicinal Chemistry (AREA)
  • Pharmacology & Pharmacy (AREA)
  • Toxicology (AREA)
  • Epidemiology (AREA)
  • Medical Informatics (AREA)
  • Primary Health Care (AREA)
  • Public Health (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The application provides an integrated system of a traditional Chinese medicine document corpus and a knowledge base, which comprises a metadata module, a corpus, a document labeling module, a query module and a semantic knowledge base. The metadata module is used for setting and maintaining entity classes, dictionaries and semantic relations; the corpus is used for forming semi-structured documents from the imported documents; the document labeling module labels the semi-structured documents by taking the dictionaries as the labeling basis; the query module is used for querying the metadata to obtain query results for entity classes, dictionaries and semantic relations. The system can label, query and semantically retrieve documents, and integrates corpus collection, document labeling, knowledge processing, analysis and knowledge base retrieval. It can not only retrieve the basic information related to a semantic item on its own, but also associate that item with documents and trace the associated documents and similar knowledge, so retrieval efficiency is high.

Description

Traditional Chinese medicine literature corpus and knowledge base integrated system
Technical Field
The application belongs to the technical field of traditional Chinese medicine informatization, and particularly relates to a traditional Chinese medicine document corpus and knowledge base integrated system.
Background
The text literature of traditional Chinese medicine is a carrier of traditional Chinese medicine science. It records, in pictures and text, the rich theoretical knowledge and clinical experience accumulated in traditional Chinese medicine and western medicine over thousands of years, and has long played an extremely important role in the development of traditional Chinese medicine and pharmacology. Textual documents are made up of various types of terms, which are the collections of designations used in a given scientific field to represent concepts.
In the course of research and development, the inventors of the application found that text documents, term labeling and semantic knowledge bases remain information islands, because neither an integrated method covering corpus collection, document labeling, knowledge processing, analysis and knowledge base retrieval nor a systematic theory of concept analysis has been formed so far. How to use computer technology to label the terms in text documents automatically and to use them correctly and appropriately, so that the terms fully play their role in the dissemination of science and technology, is a problem that urgently needs to be solved.
Disclosure of Invention
In order to overcome, at least to some extent, the problems in the related art, the application provides an integrated system of a traditional Chinese medicine document corpus and a knowledge base.
According to an embodiment of the application, the application provides a traditional Chinese medicine document corpus and knowledge base integrated system, which comprises a metadata module, a corpus, a document labeling module, a query module and a semantic knowledge base;
the metadata module is used for setting entity classes, dictionaries and semantic relations and maintaining the entity classes, the dictionaries and the semantic relations;
the corpus is used for forming a semi-structured document according to the imported documents;
the document labeling module labels the semi-structured document by taking the dictionary as a labeling basis;
the query module is used for querying the metadata to obtain query results of entity classes, dictionaries and semantic relations;
the semantic knowledge base is used for retrieving semantic information, semantic provenance and original text conditions.
In the system integrating the traditional Chinese medicine literature corpus and the knowledge base, entity classes and semantic relations are arranged in the metadata module, and each entity class comprises at least one dictionary; and the semantic relation defines the relation among the entity classes according to the attributes of the entity classes.
In the system integrating the traditional Chinese medicine literature corpus and the knowledge base, the corpus comprises at least one topic, and each topic covers a plurality of documents; the documents are presented in a tree structure.
In the system integrating the traditional Chinese medicine document corpus and the knowledge base, the document labeling module comprises an online labeling module and a corpus labeling module; the online labeling module is used for acquiring a labeled text input by a user and performing online labeling on the labeled text; and the corpus labeling module is used for labeling data in the semi-structured document.
Furthermore, the labeling modes of the document labeling module comprise manual labeling and machine labeling; and the online labeling module labels in the manual labeling mode.
Furthermore, the specific process of labeling by the online labeling module in the manual labeling mode is as follows:
manually selecting a certain document or a certain phrase in the document, and labeling the selected data with the 'entity code';
the online labeling module matches the content of the document against the entities, dictionaries and semantic relations in the metadata, automatically labels the document by machine after matching, and displays the labeled content in the labeling color of the entity;
labeling semantic relations among all terms in the literature;
and manually checking the terms marked by the machine, and finally completing the marking of the document.
Further, the specific process of the machine labeling is as follows:
obtaining a training data set based on a result set of manual labeling;
continuously performing machine learning on a training data set by means of a word segmentation algorithm to establish a semantic model;
inputting a training data set into a semantic model, and performing iteration and parameter adjustment;
and automatically labeling by using a labeling rule and the trained semantic model.
In the system integrating the traditional Chinese medicine document corpus and the knowledge base, the query module is used for querying entity classes, dictionaries and semantic relations;
when the query module queries the entity class, performing accurate or fuzzy retrieval according to the attribute field; when the query module queries the dictionary, accurate or fuzzy retrieval is carried out according to the labeling condition of the dictionary; and when the query module queries the semantic relation, acquiring a query result of the related semantic relation from the retrieval results of the entity class and the dictionary.
In the system integrating the traditional Chinese medicine literature corpus and the knowledge base, the retrieval result of the semantic knowledge base comprises a semantic retrieval result and a full-text retrieval result; displaying the 'semantics' and 'synonym' of the search in the semantic search result; in the full-text search result, through the searched keywords, the contents of the keywords in the structured data and the unstructured data are displayed.
The system integrating the traditional Chinese medicine document corpus and the knowledge base further comprises a system management module, wherein the system management module comprises an organization management module, a user management module, a permission management module, a role management module, a dictionary management module and a log management module.
According to the above embodiments of the present application, at least the following advantages are obtained: the system integrating the traditional Chinese medicine document corpus and the knowledge base comprises a metadata module, a corpus, a document labeling module, a query module and a semantic knowledge base; it can label, query and semantically retrieve documents, and integrates corpus collection, document labeling, knowledge processing, analysis and knowledge base retrieval; it can not only retrieve the basic information related to a semantic item on its own, but also associate that item with documents and trace the associated documents and similar knowledge, so retrieval efficiency is high.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the scope of the invention, as claimed.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of the specification of the application, illustrate embodiments of the application and together with the description, serve to explain the principles of the application.
Fig. 1 is a block diagram of a system integrating a traditional Chinese medicine literature corpus and a knowledge base according to an embodiment of the present disclosure.
Description of the reference numerals:
1. a metadata module; 2. a corpus; 3. a document labeling module; 4. a query module; 5. a semantic knowledge base.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the application is described below with reference to the drawings and the detailed description. Having understood the embodiments, those skilled in the art may make changes and modifications based on the techniques taught herein without departing from the spirit and scope of the present application.
The exemplary embodiments and descriptions of the present application are provided to explain the present application and should not be taken as limiting the present application. Additionally, the same or similar numbered elements/components used in the drawings and the embodiments are used to represent the same or similar parts.
As used herein, "first," "second," …, etc., are not specifically intended to mean in a sequential or chronological order, nor are they intended to limit the application, but merely to distinguish between elements or operations described in the same technical language.
As used herein, the terms "comprising," "including," "having," "containing," and the like are open-ended terms that mean including, but not limited to.
As used herein, "and/or" includes any and all combinations of the described items.
References to "plurality" herein include "two" and "more than two"; reference to "a plurality of groups" herein includes "two groups" and "more than two groups".
Certain words used to describe the present application are discussed below or elsewhere in this specification to provide additional guidance to those skilled in the art in describing the present application.
The traditional semantic retrieval system can only retrieve related basic information of the semantics independently, cannot form association with documents, and cannot track documents and similar knowledge related to the semantics. In addition, the traditional semantic retrieval system can achieve the effect of correlation of semantic relations, documents and original texts only through repeated retrieval. The most important defect of the traditional semantic retrieval system is low efficiency.
As shown in fig. 1, the present application provides an integrated system of a traditional Chinese medicine document corpus and a knowledge base, which includes a metadata module 1, a corpus 2, a document labeling module 3, a query module 4 and a semantic knowledge base 5.
The metadata module 1 is used for setting entity classes, dictionaries and semantic relations and maintaining the entity classes, dictionaries and semantic relations.
The corpus 2 is used to form semi-structured documents from the imported documents.
The document labeling module 3 labels the processed semi-structured document by taking the dictionary as a labeling basis.
The query module 4 is used for performing query and statistical analysis on the metadata to obtain query results of entity classes, dictionaries and semantic relations.
The semantic knowledge base 5 is used for searching semantic information, semantic provenance and original text conditions.
In this embodiment, entity classes and semantic relationships are set in the metadata module 1, and each entity class includes at least one dictionary. Dictionary data can be uploaded in batches as Excel files, and the dictionaries can also be maintained online. Semantic relationships are relationships between entity classes, defined according to the attributes of the entity classes. Each entity class has corresponding entity relationships, each of which points to one or more fixed entity classes.
For example, in "disease - related - syndrome", the relationship between the two entity classes, disease and syndrome, is "related".
An entity is something that is distinguishable and exists independently, and similar entities are grouped into the same entity class. For example, the entity classes of traditional Chinese medicine include diseases, syndromes, symptoms, prescriptions and Chinese medicines, and the entity classes of Western medicine include SNOMED CT and LOINC.
The management of entity classes mainly covers the maintenance of entity-related attributes such as entity name, entity code (used for labeling), labeling color and basic entity information. Entity classes can be maintained by adding them online or by importing them from Excel and other formats.
For a given class of entities, knowledge is expressed as a set of "entity-relationship-entity" triples.
For example, in "diabetes - related - lung heat and fluid consumption syndrome", the relationship between the two entities is "related".
The initial dictionary data of the system is built on the concepts of existing term sets: in the field of traditional Chinese medicine and pharmacy, the Traditional Chinese Medicine Clinical Term System (TCMCTS) and the Traditional Chinese Medicine Language System (TCMLS); in the field of Western medicine, the SNOMED CT clinical term system and LOINC, the Logical Observation Identifiers Names and Codes system (mainly a term set for laboratory indices); and others. Semantic types and semantic relations are set, and term resources are collected and integrated according to the relevant authoritative standards, word lists and dictionaries to form the "metadata".
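The metadata described above (entity classes with a code, a labeling color and a dictionary, plus class-level relations and entity-level triples) can be pictured with a small in-memory model. This is only a sketch under stated assumptions: the class names `EntityClass`, `Triple` and `Metadata`, the codes `DIS` and `SYN`, and the color values are hypothetical and are not taken from the application.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class EntityClass:
    """An entity class with the attributes the metadata module maintains."""
    name: str
    code: str    # entity code used when labeling (hypothetical values below)
    color: str   # labeling color shown for labeled content
    dictionary: set = field(default_factory=set)  # terms belonging to the class


@dataclass(frozen=True)
class Triple:
    """One "entity - relationship - entity" statement."""
    head: str
    relation: str
    tail: str


class Metadata:
    def __init__(self) -> None:
        self.classes = {}    # class name -> EntityClass
        self.relations = {}  # (head class, relation) -> allowed tail classes
        self.triples = set()

    def add_class(self, cls: EntityClass) -> None:
        self.classes[cls.name] = cls

    def add_relation(self, head_cls: str, relation: str, tail_cls: str) -> None:
        self.relations.setdefault((head_cls, relation), set()).add(tail_cls)

    def class_of(self, term: str) -> Optional[str]:
        """Return the name of the entity class whose dictionary holds the term."""
        for cls in self.classes.values():
            if term in cls.dictionary:
                return cls.name
        return None

    def add_triple(self, head: str, relation: str, tail: str) -> Triple:
        """Store a triple only if its classes are linked by that relation."""
        head_cls, tail_cls = self.class_of(head), self.class_of(tail)
        if tail_cls not in self.relations.get((head_cls, relation), set()):
            raise ValueError(f"{head_cls} -{relation}-> {tail_cls} is not defined")
        triple = Triple(head, relation, tail)
        self.triples.add(triple)
        return triple


meta = Metadata()
meta.add_class(EntityClass("disease", "DIS", "#d9534f", {"diabetes"}))
meta.add_class(EntityClass("syndrome", "SYN", "#5bc0de",
                           {"lung heat and fluid consumption syndrome"}))
meta.add_relation("disease", "related", "syndrome")
triple = meta.add_triple("diabetes", "related",
                         "lung heat and fluid consumption syndrome")
```

Checking each triple against the class-level relation is one possible way to reflect the statement that a relation points to fixed entity classes; the application itself does not say how this is enforced.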
In this embodiment, the corpus 2 includes at least one topic, and each topic covers a plurality of documents. The documents are imported in batches as Excel files, each of which contains the hierarchy and the content of the documents. The imported documents are presented in a tree structure.
The corpus 2 forms semi-structured documents for use in computer processing, so that semi-automated collection of document data is possible.
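As an illustration of how flat imported rows could become the tree structure mentioned above, the sketch below assumes each row carries a hierarchy level, a title and a content cell. That column layout, the `Node` class and the sample titles are assumptions made for the example; the application only states that the Excel file contains the hierarchy and content of the documents.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    title: str
    content: str = ""
    children: list = field(default_factory=list)


def build_tree(topic, rows):
    """Turn flat (level, title, content) rows into a tree rooted at the topic.

    Level 1 rows hang directly under the topic; a row at level n becomes a
    child of the most recent row at level n - 1.
    """
    root = Node(topic)
    stack = [root]  # stack[i] is the most recent node at depth i
    for level, title, content in rows:
        if level < 1 or level > len(stack):
            raise ValueError(f"row {title!r} skips a hierarchy level")
        node = Node(title, content)
        del stack[level:]  # discard nodes at the same or a deeper level
        stack[-1].children.append(node)
        stack.append(node)
    return root


def render(node, depth=0):
    """Return the tree as indented lines, one per node."""
    lines = ["  " * depth + node.title]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


rows = [
    (1, "Book A", ""),
    (2, "Chapter 1", "text of chapter 1"),
    (2, "Chapter 2", "text of chapter 2"),
    (1, "Book B", ""),
]
tree = build_tree("Sample topic", rows)
outline = render(tree)
```

Here `outline` holds one indented line per node, which is enough to display the document hierarchy under a topic.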
In this embodiment, the document labeling module 3 includes an online labeling module and a corpus labeling module. The online labeling module is used for acquiring a labeled text input by a user and performing online labeling on the labeled text. And the corpus labeling module is used for labeling the data in the semi-structured document.
The document labeling module 3 labels the semantic types and semantic relationships of the text document content, relying on the term set of the corpus 2. During the labeling process, the term set of the corpus 2 can be continuously refined.
The labeling mode of the document labeling module 3 comprises manual labeling and machine labeling.
The online labeling module mainly adopts a manual labeling mode, and the specific labeling process is as follows:
A certain document, or a certain phrase in a document, is selected manually, and the selected data is labeled with the entity code.
The online labeling module exactly matches the content of the document against the entities, dictionaries and semantic relations in the metadata, automatically labels the document after matching, and displays the labeled content in the labeling color of the entity.
The semantic relationships between the individual terms in the document are labeled.
The terms labeled by the machine are checked manually, which completes the labeling of the document.
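The dictionary-matching step of this process can be sketched as forward maximum matching: at each position the longest dictionary term is preferred, and every hit is recorded with its entity code. The application does not name a matching algorithm, so the choice of forward maximum matching, the function name `label_text` and the entity codes are assumptions for illustration.

```python
def label_text(text, dictionaries):
    """Label dictionary terms in text using forward maximum matching.

    dictionaries maps an entity code to the set of terms of that entity class.
    Returns a list of (start, end, term, entity_code) spans.
    """
    term_to_code = {term: code
                    for code, terms in dictionaries.items()
                    for term in terms}
    max_len = max((len(term) for term in term_to_code), default=0)
    spans = []
    i = 0
    while i < len(text):
        # Try the longest candidate first so that longer terms win.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in term_to_code:
                spans.append((i, i + size, candidate, term_to_code[candidate]))
                i += size
                break
        else:
            i += 1  # no term starts here; move one character forward
    return spans


dictionaries = {
    "DIS": {"diabetes"},
    "SYN": {"lung heat", "lung heat and fluid consumption syndrome"},
}
text = "diabetes related lung heat and fluid consumption syndrome"
spans = label_text(text, dictionaries)
```

Each span keeps its character offsets, so a front end could color the labeled range with the labeling color of the matching entity class and a reviewer could then confirm or correct it.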
Along with the popularization of natural language processing, in order to improve the efficiency and the accuracy of labeling, the traditional Chinese medicine document corpus and knowledge base integrated system provided by the application incorporates a machine labeling function based on natural language processing. The specific process of machine labeling is as follows:
A training data set is obtained from the result set of manual labeling.
Machine learning is performed continuously on the training data set by means of a word segmentation algorithm to establish a semantic model. Specifically, the open source pkuseg Chinese word segmentation toolkit can be used for medical-domain word segmentation and model training. HanLP (Han Language Processing) can be used to set a user-defined dictionary and to perform part-of-speech tagging.
The training data set is input into the semantic model, which is iterated continuously while its parameters are adjusted to improve the accuracy and recall of the labeling results.
Automatic labeling is performed using the labeling rules and the trained semantic model.
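To judge whether an iteration or parameter change actually improved the labeling results, the machine labels can be compared with the manually labeled result set. The sketch below assumes span-level scoring, in which a machine label counts as correct only if its offsets and entity code both match a manual label, and it reads the "accuracy" mentioned above as precision; both are assumptions, as the application does not define its metrics.

```python
def precision_recall(predicted, gold):
    """Compare machine-labeled spans with manually labeled spans.

    Both arguments are iterables of (start, end, entity_code) tuples.
    """
    predicted, gold = set(predicted), set(gold)
    true_positive = len(predicted & gold)
    precision = true_positive / len(predicted) if predicted else 0.0
    recall = true_positive / len(gold) if gold else 0.0
    return precision, recall


# Made-up spans for illustration only.
gold = {(0, 8, "DIS"), (17, 57, "SYN"), (60, 65, "HERB")}
predicted = {(0, 8, "DIS"), (17, 26, "SYN")}
precision, recall = precision_recall(predicted, gold)
```

In this example one of the two machine labels is correct and one of the three manual labels is found, so precision is 0.5 and recall is one third.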
For example, word segmentation is performed with a custom dictionary in the open source pkuseg Chinese word segmentation toolkit, where the custom dictionary is taken from the dictionaries of the metadata in the system.
Code:
import pkuseg
lexicon = ['北京大学', '北京天安门']  # 'Peking University', 'Beijing Tiananmen': words in the user dictionary are kept whole during segmentation
segDefault = pkuseg.pkuseg()  # load the model with the default configuration
seg = pkuseg.pkuseg(user_dict=lexicon)  # load the model with the given user dictionary
textDefault = segDefault.cut('我爱北京天安门')  # segment 'I love Beijing Tiananmen' without the user dictionary
text = seg.cut('我爱北京天安门')  # segment the same sentence with the user dictionary
print(textDefault)
print(text)
As a result:
loading model
finish
loading model
finish
['我', '爱', '北京', '天安门']  (I / love / Beijing / Tiananmen)
['我', '爱', '北京天安门']  (I / love / Beijing Tiananmen)
[Finished in 40.2s]
The specific process of loading a trained model and segmenting text with it is as follows:
code:
import pkuseg
seg = pkuseg.pkuseg(model_name='ctb8')  # load the model by setting model_name, assuming the user has downloaded the ctb8 model and placed it in the 'ctb8' directory
text = seg.cut('我爱北京天安门')  # segment 'I love Beijing Tiananmen'
print(text)
As a result:
loading model
finish
['我', '爱', '北京', '天安门']  (I / love / Beijing / Tiananmen)
[Finished in 24.6s]
The specific process of retraining a word segmentation model comprises the following steps:
import pkuseg
# The training file is 'msr_training.utf8', the test file is 'msr_test_gold.utf8', the model is saved in the './models' directory, and 20 processes are started to train the model
pkuseg.train('msr_training.utf8', 'msr_test_gold.utf8', './models', nthread=20)
The parameters are explained as follows:
pkuseg.pkuseg(model_name='msra', user_dict='safe_lexicon')
    model_name   Model path. The default, 'msra', denotes the pre-trained model (only for users who installed via pip). The user may instead fill in the path of a downloaded or trained model, such as model_name='./models'.
    user_dict    Sets the user dictionary. The default, 'safe_lexicon', denotes the Chinese dictionary provided with the package (pip only). The user may pass an iterator containing several custom words.
pkuseg.test(readFile, outputFile, model_name='msra', user_dict='safe_lexicon', nthread=10)
    readFile     Input file path
    outputFile   Output file path
    model_name   Same as in pkuseg.pkuseg
    user_dict    Same as in pkuseg.pkuseg
    nthread      Number of processes started during testing
pkuseg.train(trainFile, testFile, savedir, nthread=10)
    trainFile    Training file path
    testFile     Test file path
    savedir      Path where the trained model is saved
    nthread      Number of processes started during training
A user-defined dictionary is set by adopting HanLP for part-of-speech tagging, and the specific process is as follows:
code:
[Figures BDA0003604583550000091 and BDA0003604583550000101: the HanLP code listing, reproduced as images in the original publication]
Operating results:
[ Fung/ng, da/a, leprosy/nhd, r/k,/w, b/v, wind/nr, long/a,/w, start/nr, time/qt, hair/v, in/p, hand/n,/w, according/p, skin/n, e.g./v, septum/v, one/m, paper/n,/w, spill/vi, down/n, not/d, kernel/ag,/w, or/c, meet/v, rain/n, or/c, to/p, night/n,/w, then/d, muscle/n, in/f, e.g./v, spill/v, ran/rz, or/c, pain/a, or/c, itch/a,/w, gradual/d, to/p, flesh/n, hardness/ng, stubborn/ag,/w, enucleation/v, excision/v, agnostic/v,/w, body/n, virtual/a, swelling/vi. /w, this/rzs, symptom/ng, most/d, easy/ad, putrid/v,/w, hand/foot/n, muscular/vg, contracture/n,/w, stink/a, nausea/a, disuse/vn,/w, due to/p, longitudinal/ng, intention/ng, delusions/nz,/w, not/d, aversion to wind/v, chills/n, dampness/a,/w, six/m, desire/d, seven emotions/nz,/w, make/v, honor/ag, qi deficiency/n,/w, defensive qi excess/nr,/w, pathogen/ag, income/v, in/p, muscle/n,/w, qi/n, hysteresis/vg, and/cc, nor/d, on/v, also/d,. /w, this/rzs, symptom/ng, in/p, marijuana/n, Tang and/nr, Lengdan/prescription,/w, Shenxian/nnd, Regu/nz, Dan/b,/w, Zhu/ng, Yunssan/nz,/w, Lengdan/nz, Ren/d, Zhengdan/nr, etc./udeng, medicine/n, treatment/v,/uzhi. (iii)/w, again/d, cloud/vg, strong wind/n, person/k,/w, initial/nr, body/ng, gas/n, fumigation/v, heat/a,/w, gas/n, from/p, chest/s, up/down/f,/w, no/d, pain/a,/w, limb/n, heavy/a, happiness/v, lying/vi,/w, good/ag, girdling/g, girdling/e, sour/a,/w, body surface/a, edema/vi,/w, round/vn, not/d, time/qt,/w, long/a, and/cc, brain/n, bloating/a, v,/w, meat/n, cracking/v,/w, eye/ng, line/v, pain/a,/w, nausea/a, smell/v, voice/n,/w, danger/ag, /y! [ w ]
Found out [ prescription ]: a pill for taking a life.
In this embodiment, the query module 4 is configured to query the entity class, the dictionary, and the semantic relationship. When the query module 4 queries the entity class, the query module performs accurate or fuzzy retrieval according to the attribute field, and performs statistical analysis and display on the labeling condition.
When the query module 4 queries the dictionary, accurate or fuzzy retrieval is performed according to the labeling condition of the dictionary, so that the labeled result is visually analyzed and displayed.
When the query module 4 queries the semantic relations, the user can view the query results of the related semantic relations from the search results of the entity classes and the dictionaries.
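The exact and fuzzy retrieval described for the query module can be sketched over in-memory records. This example treats fuzzy retrieval as a case-insensitive substring match, similar to a SQL LIKE query, and treats exact retrieval as a case-insensitive equality test; both choices, along with the function names and the sample records, are assumptions, since the application does not specify the matching rule or the storage layer.

```python
def search(records, field, keyword, fuzzy=False):
    """Return records whose field equals (exact) or contains (fuzzy) the keyword."""
    keyword = keyword.lower()
    results = []
    for record in records:
        value = str(record.get(field, "")).lower()
        matched = (keyword in value) if fuzzy else (keyword == value)
        if matched:
            results.append(record)
    return results


def relations_for(records, triples):
    """Collect the triples that mention any of the retrieved names."""
    names = {record["name"] for record in records}
    return [t for t in triples if t[0] in names or t[2] in names]


entity_classes = [
    {"name": "disease", "code": "DIS"},
    {"name": "syndrome", "code": "SYN"},
    {"name": "symptom", "code": "SYM"},
]
triples = [("disease", "related", "syndrome")]

exact = search(entity_classes, "name", "syndrome")
fuzzy = search(entity_classes, "name", "sy", fuzzy=True)
related = relations_for(exact, triples)
```

The last call mirrors the idea that semantic relations are viewed starting from the retrieval results for entity classes and dictionaries rather than through a separate query.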
In the present embodiment, the semantic knowledge base 5 has a full-text search function, and can specifically perform dictionary search, semantic search, corpus search, and topical browsing.
For example, when searching for "radix astragali", the semantic knowledge base 5 searches for the synonym "radix astragali" at the same time, and obtains the search result of the synonym "radix astragali".
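Synonym-aware retrieval of this kind can be sketched by expanding the query term into its synonym group before scanning the documents. The synonym group below pairs "radix astragali" with its pinyin name "huangqi" purely for illustration, and the document texts, identifiers and function names are made up for the example.

```python
def expand_query(term, synonym_groups):
    """Return the term together with every synonym in its group."""
    for group in synonym_groups:
        if term in group:
            return set(group)
    return {term}


def full_text_search(term, documents, synonym_groups):
    """Map each matching document id to the query terms found in its text."""
    terms = expand_query(term, synonym_groups)
    hits = {}
    for doc_id, text in documents.items():
        matched = sorted(t for t in terms if t in text)
        if matched:
            hits[doc_id] = matched
    return hits


synonym_groups = [{"radix astragali", "huangqi"}]
documents = {
    "doc-1": "a passage that mentions radix astragali",
    "doc-2": "a passage that mentions huangqi",
    "doc-3": "a passage about rehmannia",
}
hits = full_text_search("radix astragali", documents, synonym_groups)
```

Both `doc-1` and `doc-2` are returned for the single query, each with the term that matched, so results found through a synonym can be marked as such in the display.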
The semantic knowledge base 5 can also perform semantic search on documents, for example, when the corpus 2 is labeled with relevant semantic relations, a search result of "six ingredients with rehmannia decoction (treatment) for diseases" can be obtained in the search.
The search results of the semantic knowledge base 5 include semantic search results and full-text search results.
The semantic name, the belonged entity, the description, the forward attribute, the reverse attribute, the entity class label and the original text are displayed in the semantic retrieval result.
In the full-text retrieval result, the contents of the keywords in the structured data and the unstructured data are displayed through the retrieved keywords, and special marks are made. The "unstructured data" includes pdf documents, word documents, and the like.
A topic management page is established in the semantic knowledge base 5, and related semantics can be queried in the topics by means of the topics in the corpus 2.
In the embodiments, the system for integrating the traditional Chinese medicine document corpus and the knowledge base further comprises a system management module, wherein the system management module mainly comprises an organization management module, a user management module, a permission management module, a role management module, a dictionary management module, a log management module and the like, and support is provided for application through construction of the basic modules.
The embodiments of the present application described above may be implemented in various hardware, software code, or a combination of both. For example, the embodiments of the present application may also be program code for executing the above-described method in a digital signal processor. The present application may also relate to various functions performed by a computer processor, digital signal processor, microprocessor, or field programmable gate array. The above-described processors may be configured in accordance with the present application to perform certain tasks by executing machine-readable software code or firmware code that defines certain methods disclosed herein. Software code or firmware code may be developed in different programming languages and in different formats or forms. Software code may also be compiled for different target platforms. However, different code styles, types, and languages of software code and other types of configuration code to perform tasks according to the present application do not depart from the spirit and scope of the present application.
The foregoing is merely an illustrative embodiment of the present application, and any equivalent changes and modifications made by those skilled in the art without departing from the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (10)

1. A traditional Chinese medicine document corpus and knowledge base integrated system is characterized by comprising a metadata module, a corpus, a document labeling module, a query module and a semantic knowledge base;
the metadata module is used for setting entity classes, dictionaries and semantic relations and maintaining the entity classes, the dictionaries and the semantic relations;
the corpus is used for forming a semi-structured document according to the imported documents;
the document labeling module labels the semi-structured document by taking the dictionary as a labeling basis;
the query module is used for querying the metadata to obtain a query result of an entity class, a dictionary and a semantic relation;
the semantic knowledge base is used for retrieving semantic information, semantic provenance and original text conditions.
2. The system for integrating a Chinese medicine literature corpus and a knowledge base according to claim 1, wherein entity classes and semantic relations are set in the metadata module, and each entity class comprises at least one dictionary; and the semantic relation defines the relation among the entity classes according to the attributes of the entity classes.
3. The system of claim 1, wherein the corpus comprises at least one topic, and each topic contains a plurality of documents; the documents are presented in a tree structure.
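As a non-limiting illustration, the topic and tree organization of claim 3 might be represented as follows. This is a minimal sketch in Python; `DocumentNode`, `Topic`, the node levels (book, chapter, paragraph), and the sample titles are assumptions made for the example, since the application does not specify the node types of the tree.

```python
from dataclasses import dataclass, field


@dataclass
class DocumentNode:
    """One node of a document tree, e.g. a book, chapter, or paragraph."""
    title: str
    text: str = ""
    children: list["DocumentNode"] = field(default_factory=list)

    def add(self, child: "DocumentNode") -> "DocumentNode":
        """Attach a child node and return it so calls can be chained."""
        self.children.append(child)
        return child

    def outline(self, depth: int = 0) -> list[str]:
        """Return the indented titles of this node and all its descendants."""
        lines = ["  " * depth + self.title]
        for child in self.children:
            lines.extend(child.outline(depth + 1))
        return lines


@dataclass
class Topic:
    """A corpus topic that groups several documents."""
    name: str
    documents: list[DocumentNode] = field(default_factory=list)


# Hypothetical sample data, for illustration only.
book = DocumentNode("Sample Classic")
chapter = book.add(DocumentNode("Chapter 1"))
chapter.add(DocumentNode("Paragraph 1", text="Ginseng relieves fatigue."))
topic = Topic("Sample topic", [book])

print("\n".join(book.outline()))
```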
4. The system of claim 1, wherein the document labeling module comprises an online labeling module and a corpus labeling module; the online labeling module is used for acquiring text entered by a user for labeling and labeling that text online; and the corpus labeling module is used for labeling data in the semi-structured documents.
5. The system of claim 4, wherein the labeling modes of the document labeling module comprise manual labeling and machine labeling; and the online labeling module labels in the manual labeling mode.
6. The system of claim 5, wherein the online labeling module performs labeling in a manual labeling manner by:
manually selecting a document, or a phrase within a document, and labeling the selected data with an 'entity code';
the online labeling module matches the document content against the entity classes, dictionaries, and semantic relations in the metadata, automatically labels the document after matching, and displays the labeled content in the labeling color of the corresponding entity;
labeling the semantic relations between terms in the document;
and manually checking the machine-labeled terms, thereby completing the labeling of the document.
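As a non-limiting illustration, the dictionary matching step of claim 6 might be sketched as follows. The application does not specify the matching algorithm; this example assumes greedy longest-match substring matching, and the names `Annotation` and `annotate`, the `checked` flag, and the sample dictionary and colors are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    start: int             # index of the first character of the match
    end: int               # index one past the last character
    term: str
    entity_class: str
    color: str             # labeling color of the entity class
    checked: bool = False  # would be set to True after manual review


def annotate(text, dictionary, colors):
    """Label dictionary terms in text, preferring the longest term at each position.

    dictionary maps term -> entity class; colors maps entity class -> color.
    Matching is on raw substrings, so a term can also match inside a longer word.
    """
    # Longest terms first; empty terms are skipped so the scan always advances.
    terms = sorted((t for t in dictionary if t), key=len, reverse=True)
    annotations = []
    i = 0
    while i < len(text):
        for term in terms:
            if text.startswith(term, i):
                entity_class = dictionary[term]
                annotations.append(
                    Annotation(i, i + len(term), term, entity_class, colors[entity_class])
                )
                i += len(term)
                break
        else:
            i += 1  # no term starts here; move one character forward
    return annotations


# Hypothetical sample data, for illustration only.
dictionary = {"ginseng": "Herb", "five viscera": "Anatomy", "viscera": "Anatomy"}
colors = {"Herb": "green", "Anatomy": "blue"}
text = "ginseng tonifies the five viscera"
annotations = annotate(text, dictionary, colors)

for a in annotations:
    print(a.start, a.end, a.term, a.entity_class, a.color)
```

Because longer terms are tried first, "five viscera" is labeled as one term rather than as the shorter "viscera". Every annotation starts with `checked=False`, reflecting the final manual review step of the claim.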
7. The system of claim 6, wherein the specific process of machine labeling is as follows:
obtaining a training data set from the result set of manual labeling;
iteratively applying machine learning to the training data set, by means of a word segmentation algorithm, to establish a semantic model;
inputting the training data set into the semantic model, and performing iteration and parameter adjustment;
and automatically labeling using the labeling rules and the trained semantic model.
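As a non-limiting illustration, the flow of claim 7 (training data from manual labels, word segmentation, then automatic labeling) might be sketched as follows. The application names neither the segmentation algorithm nor the model type, so this example substitutes deliberately simple stand-ins: forward maximum matching for segmentation and a per-term majority vote in place of a trained semantic model. It omits the iteration and parameter adjustment step. The function names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def train(labeled_examples):
    """Build a term -> label model from manually labeled (term, label) pairs.

    For each term, the label assigned most often by the annotators wins.
    """
    counts = defaultdict(Counter)
    for term, label in labeled_examples:
        counts[term][label] += 1
    return {term: c.most_common(1)[0][0] for term, c in counts.items()}


def segment(text, vocabulary):
    """Forward maximum matching: take the longest known term at each position,
    otherwise emit a single character."""
    max_len = max((len(t) for t in vocabulary), default=1)
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted, so the scan always advances.
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                i += size
                break
    return tokens


def auto_label(text, model):
    """Segment the text and label every token the model knows."""
    return [(tok, model[tok]) for tok in segment(text, model) if tok in model]


# Hypothetical manual labeling results, for illustration only.
labeled = [
    ("人参", "Herb"),
    ("人参", "Herb"),
    ("人参", "Formula"),
    ("乏力", "Symptom"),
]
model = train(labeled)
print(auto_label("人参治乏力", model))
```

A production system would more plausibly use a statistical or neural sequence labeler; the sketch only shows how manually labeled results can feed an automatic labeler.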
8. The system of claim 2, wherein the query module is used for querying entity classes, dictionaries, and semantic relations;
when the query module queries an entity class, exact or fuzzy retrieval is performed on the attribute fields; when the query module queries a dictionary, exact or fuzzy retrieval is performed according to the labeling status of the dictionary; and when the query module queries a semantic relation, the query results for the related semantic relations are obtained from the retrieval results for the entity classes and dictionaries.
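As a non-limiting illustration, the exact and fuzzy retrieval of claim 8 might be sketched as follows. The application does not define "fuzzy"; this example assumes it means a case-insensitive substring match. The function names, record layout, and sample data are hypothetical.

```python
def search(records, field, query, fuzzy=False):
    """Return the records whose `field` equals the query (exact retrieval)
    or contains it, ignoring case (fuzzy retrieval, as assumed here)."""
    needle = query.casefold()
    results = []
    for record in records:
        value = str(record.get(field, ""))
        matched = needle in value.casefold() if fuzzy else value == query
        if matched:
            results.append(record)
    return results


def related_relations(relations, entity_names):
    """Return the semantic relations that touch any of the given entity classes."""
    names = set(entity_names)
    return [r for r in relations if r["source"] in names or r["target"] in names]


# Hypothetical metadata records, for illustration only.
entity_classes = [
    {"name": "Herb", "code": "E01"},
    {"name": "Herbal formula", "code": "E02"},
    {"name": "Symptom", "code": "E03"},
]
relations = [
    {"name": "treats", "source": "Herb", "target": "Symptom"},
    {"name": "contains", "source": "Herbal formula", "target": "Herb"},
]

exact = search(entity_classes, "name", "Herb")
fuzzy = search(entity_classes, "name", "herb", fuzzy=True)

# Relations are looked up from the entity-class retrieval results.
linked = related_relations(relations, [r["name"] for r in exact])
print([r["name"] for r in linked])
```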
9. The system of claim 1, wherein the search results of the semantic knowledge base include semantic search results and full-text search results; the semantic search results display the 'semantics' and 'synonyms' of the search term; and the full-text search results show the content in the structured data and the unstructured data that contains the searched keywords.
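As a non-limiting illustration, the two kinds of search result in claim 9 might be produced as follows. This is a minimal sketch; the synonym-group layout, the snippet width, the function names, and the sample data are assumptions and are not taken from the application.

```python
def semantic_search(query, synonym_groups):
    """Return the preferred term ("semantics") and its synonyms for a query,
    or None when the query is unknown.

    synonym_groups maps a preferred term to the set of its synonyms.
    """
    for preferred, synonyms in synonym_groups.items():
        if query == preferred or query in synonyms:
            return {"semantics": preferred, "synonyms": sorted(synonyms)}
    return None


def full_text_search(keyword, structured, unstructured, context=10):
    """Find a keyword in structured rows and in unstructured document text.

    Unstructured hits carry a short snippet around the first occurrence.
    """
    hits = {"structured": [], "unstructured": []}
    for row in structured:
        if any(keyword in str(value) for value in row.values()):
            hits["structured"].append(row)
    for doc_id, text in unstructured.items():
        pos = text.find(keyword)
        if pos != -1:
            snippet = text[max(0, pos - context):pos + len(keyword) + context]
            hits["unstructured"].append((doc_id, snippet))
    return hits


# Hypothetical knowledge base content, for illustration only.
synonym_groups = {"ginseng": {"renshen", "Panax ginseng"}}
structured = [{"term": "ginseng", "class": "Herb"}, {"term": "fatigue", "class": "Symptom"}]
unstructured = {
    "doc1": "In this passage, ginseng relieves fatigue.",
    "doc2": "This passage mentions no herb.",
}

semantic = semantic_search("renshen", synonym_groups)
full_text = full_text_search("ginseng", structured, unstructured)
print(semantic)
print(full_text)
```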
10. The system of claim 1, further comprising a system management module, wherein the system management module comprises an organization management module, a user management module, a permission management module, a role management module, a dictionary management module, and a log management module.
CN202210413257.4A 2022-04-20 2022-04-20 Traditional Chinese medicine literature corpus and knowledge base integrated system Pending CN114791955A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210413257.4A CN114791955A (en) 2022-04-20 2022-04-20 Traditional Chinese medicine literature corpus and knowledge base integrated system

Publications (1)

Publication Number Publication Date
CN114791955A true CN114791955A (en) 2022-07-26

Family

ID=82461720

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210413257.4A Pending CN114791955A (en) 2022-04-20 2022-04-20 Traditional Chinese medicine literature corpus and knowledge base integrated system

Country Status (1)

Country Link
CN (1) CN114791955A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118210960A (en) * 2023-12-13 2024-06-18 西湖大学 Construction and use method of natural medicinal material special domain knowledge base
CN117493641A (en) * 2024-01-02 2024-02-02 中国电子科技集团公司第二十八研究所 Secondary fuzzy search method based on semantic metadata
CN117493641B (en) * 2024-01-02 2024-03-22 中国电子科技集团公司第二十八研究所 Secondary fuzzy search method based on semantic metadata

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination