CN102024027B - Method for establishing medical database - Google Patents

Info

Publication number
CN102024027B
Authority
CN
China
Prior art keywords
index
medical
search
feature words
full
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 201010547722
Other languages
Chinese (zh)
Other versions
CN102024027A (en)
Inventor
成飞
史彤毅
高瞻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Health Online Network Technology Co Ltd
Original Assignee
Beijing Health Online Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Health Online Network Technology Co Ltd
Priority to CN 201010547722
Publication of CN102024027A
Application granted
Publication of CN102024027B
Expired - Fee Related
Anticipated expiration

Abstract

The invention provides a method for establishing a medical database, comprising the following steps: (1) transforming a source document into full-text data that can be text-searched, and establishing the following retrieval indexes: a. a full text retrieval index, b. a comments index, c. a literature external characteristic index, and d. a difficulty level grading index; and (2) selecting some of the retrieval indexes and assigning weights to them, so as to build a weighted scoring and ranking program based on the frequency with which the search terms occur. A medical database established by this method can retrieve medical literature that is both highly relevant and suited to the user's level of knowledge. The method can greatly improve retrieval efficiency and is convenient for users with different levels of medical knowledge.

Description

Method for establishing a medical database
Technical field
The present invention relates to a method for establishing a medical database (that is, a medical literature retrieval system).
Background art
Existing medical databases are all built on the most conventional retrieval methods, most of which retrieve by keyword or use Boolean operations to perform simple combined searches. Such searches, however, return a very large number of results and offer no means of further screening for the most relevant literature; screening the results manually is time-consuming and laborious. The number of results can be reduced by adding more search terms, but highly relevant documents are then easily missed.
In addition, users differ in how well they understand medical knowledge. A more relevant document is not necessarily more useful to a given user; its usefulness also depends on that user's level of knowledge, and existing medical databases take no measures to address this. The professional level of medical personnel in China varies widely. If a medical database could serve grassroots medical personnel, letting them easily retrieve medical literature suited to their own level of knowledge, it would certainly help raise their standard of work.
Summary of the invention
The object of the present invention is to provide a method for establishing a medical database that can accurately retrieve medical literature suited to the user's level of knowledge.
The method for establishing a medical database provided by the invention comprises the following steps: (1) converting source documents into full-text data on which text searches can be performed, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation (comments) index, c. a literature external-feature index, d. a difficulty score index; (2) selecting some of the retrieval indexes and assigning weights to them, so as to build a weighted scoring and ranking program based on the frequency of occurrence of the search terms.
Preferably, the b. annotation index comprises a subject-indexing index and a feature word index. The subject indexing consists of two parts, descriptors and subheadings: a descriptor is an equivalent or explanatory term for certain vocabulary in the full-text content, and a subheading is a word denoting the medical field, including the medical discipline, to which the descriptor belongs. A feature word is a word characterizing the patient category.
The external features of the c. literature external-feature index may include the title, the author's name, the publisher and the identification number.
Preferably, the retrieval indexes to which weights are assigned are the index terms of the subject-indexing index, the feature words of the feature word index, the content in the full-text retrieval index, and the overview. The weights of the index terms, feature words, content and overview may be 2, 1.4, 1.1 and 1.3, respectively.
Preferably, the scoring formula of the weighted scoring and ranking program based on the frequency of occurrence of the search terms is:
score(q, d) = sum( tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q) ), where
score(q, d): the score value;
tf(t in d): a score factor based on the number of times a search term or phrase occurs in the document;
idf(t): a score factor for a single search term with respect to a particular index;
getBoost(t.field in d): the boost factor of the field in which the search term occurs;
lengthNorm(t.field in d): for a given field, a normalization value based on the total number of terms the field contains;
coord(q, d): a score factor based on how many of the query's search terms the document contains;
queryNorm(q): for a given query, a normalization value based on the sum of the weights of all the query's search terms.
A medical database established by the method of the invention can retrieve medical literature that is highly relevant and suited to the user, according to the user's level of knowledge. It can greatly improve retrieval efficiency and is convenient for users with different levels of medical knowledge.
Embodiment
To illustrate the invention more clearly, one embodiment is described below, together with the way in which a medical database established according to this embodiment is used.
1. Index construction
1.1. Indexed fields of a record
1.1.1. Full-text retrieval index
To build a text retrieval system, the source documents must first be converted into a full-text database on which text searches can be performed. This pre-processing includes segmenting the full text and normalizing its format. Once pre-processing is complete, index construction can begin: typesetting symbols, format control characters and the like are first filtered out of the source documents, and then the position of every occurrence of each character, word and phrase in the source documents is recorded in the index database.
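The pre-processing and indexing steps just described can be sketched as a small positional inverted index. This is only an illustrative sketch, not the implementation used in the embodiment: the function name `build_full_text_index`, the tag-stripping rule and the whitespace-style tokenizer are all assumptions, and Chinese source text would need a proper word segmenter.

```python
import re
from collections import defaultdict

def build_full_text_index(documents):
    """Map each term to {doc_id: [positions]} after stripping markup-like debris."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, raw_text in documents.items():
        # Pre-processing: drop tag-like formatting codes (an assumed stand-in
        # for the typesetting symbols and control characters in the source).
        text = re.sub(r"<[^>]+>", " ", raw_text)
        # Tokenize on word characters and record every occurrence position.
        for position, token in enumerate(re.findall(r"\w+", text.lower())):
            index[token][doc_id].append(position)
    return index

docs = {
    "d1": "<b>Hypertension</b> is defined by systolic pressure.",
    "d2": "Treatment of hypertension; hypertension in the elderly.",
}
full_text_index = build_full_text_index(docs)
print(dict(full_text_index["hypertension"]))
```

Storing positions rather than only document ids is what later allows a result page to jump to the place where a keyword occurs.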
1.1.2. Subject-indexing index and feature word index
The index terms of full-text retrieval all come from the content of the literature. When the keyword a user enters is a synonym or alias of a word in the literature, the full-text retrieval index cannot lead the user to the content needed, so full-text retrieval alone cannot meet all of the user's needs. For this reason, the full-text documents are split into the smallest paragraphs that have independent medical meaning, and each such paragraph is indexed: its descriptors, subheadings and feature words are extracted, the subject-indexing index is built from the descriptors and subheadings, and the feature word index is built from the feature words.
The subject indexing consists of two parts: descriptors and subheadings.
Descriptor: a word or phrase that has been normalized and refined from natural language and can reflect a biomedical concept. A descriptor denotes a specific concept and can express a medical concept on its own.
Subheading: also called a qualifier; it restricts a descriptor, that is, it emphasizes a particular aspect of the concept the descriptor represents. A subheading denotes a general concept, carries little information, and cannot be used alone; it must be used in combination with a descriptor. Examples: pathology, pharmacology, therapy, diagnosis, drug therapy, rehabilitation, complications.
Feature word: a word or phrase of special significance that clinicians, biomedical researchers and medical educators are interested in and frequently encounter. Examples: male, female, infant, child, elderly, pregnancy.
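To make the relationship between descriptors, subheadings and feature words concrete, the following sketch indexes paragraphs under (descriptor, subheading) pairs and resolves a user's synonym to its descriptor before lookup. The synonym table, the function names and the sample entries are hypothetical and serve only as an illustration of the idea.

```python
from collections import defaultdict

# Hypothetical synonym table: each variant maps to one normalized descriptor.
SYNONYM_TO_DESCRIPTOR = {
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
    "heart attack": "myocardial infarction",
}

def index_paragraph(subject_index, feature_index, para_id,
                    descriptor, subheadings, feature_words):
    """Record one paragraph under its descriptor/subheading pairs and feature words."""
    for subheading in subheadings:
        subject_index[(descriptor, subheading)].add(para_id)
    for word in feature_words:
        feature_index[word].add(para_id)

def lookup(subject_index, query, subheading):
    """Resolve a user term to its descriptor, then fetch paragraphs for the pair."""
    descriptor = SYNONYM_TO_DESCRIPTOR.get(query.lower())
    if descriptor is None:
        return set()
    return set(subject_index.get((descriptor, subheading), set()))

subject_index = defaultdict(set)
feature_index = defaultdict(set)
index_paragraph(subject_index, feature_index, "p1",
                "hypertension", ["diagnosis"], ["elderly"])
index_paragraph(subject_index, feature_index, "p2",
                "hypertension", ["drug therapy"], ["pregnancy"])
print(lookup(subject_index, "High blood pressure", "diagnosis"))
```

Because the lookup goes through the descriptor, a query phrased with a synonym still reaches paragraphs whose text never contains that synonym.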
1.1.3. Literature external-feature index
External features of the literature form a kind of document retrieval language. It mainly refers to retrieval by a document's title, author name, publisher, report number, patent number and similar items. Documents are arranged in alphabetical order of title or author name, or in numerical order of report number or patent number, forming a retrieval language whose access points are title, author and number, so as to meet users' needs.
To help users find the content they need more quickly and accurately, the database provides an advanced search function that lets users query by external features of the literature such as title and publisher. These external features are indexed in two ways at once, tokenized and untokenized, which safeguards both the recall and the precision of user searches.
1.1.4. Difficulty score index
Medical editors score each item according to the difficulty of its content; after approval by medical experts, the scores are indexed. After a user registers, medical editors assign the user an approximate knowledge-level rating based on the registration information. The user's rating and the difficulty score index together help the user easily find medical knowledge that matches his or her level of knowledge.
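One simple way to combine a user's rating with the difficulty score index is to keep only documents whose score lies close to the user's level. The numeric scale, the `tolerance` parameter and the function name below are assumptions made for illustration; the source does not specify how the two values are compared.

```python
def filter_by_level(difficulty_index, user_level, tolerance=1):
    """Keep documents whose difficulty score is within `tolerance` of the user's level."""
    return sorted(
        doc_id
        for doc_id, difficulty in difficulty_index.items()
        if abs(difficulty - user_level) <= tolerance
    )

# Hypothetical scores on an assumed 1 (basic) to 5 (specialist) scale.
difficulty_index = {"d1": 1, "d2": 3, "d3": 5}
suitable = filter_by_level(difficulty_index, user_level=2)
print(suitable)
```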
1.2. Weight assignment
1.2.1. Weighted objects
There are two kinds of weighted object: Field and Document. Fields include: subject indexing, feature words, specific content, title, publisher, and subject.
1.2.2. Weight settings
Based on a survey of user needs and an analysis of user psychology, different weight settings were tested in order to help users find the content they need most. The results are as follows:
Index term boost = 2
Feature word boost = 1.4
Content boost = 1.1
Document (specifically the "overview") boost = 1.3
2. Search
2.1. Scoring
The full-text content is first segmented into words, and a score is computed from the frequency with which the words occur in the text, according to the following formula:
score(q, d) = sum( tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q) )
where:
tf(t in d): a score factor based on the number of times a search term or phrase occurs in the document.
idf(t): a score factor for a single search term with respect to a particular index.
getBoost(t.field in d): the boost factor of the field in which the search term occurs.
lengthNorm(t.field in d): for a given field, a normalization value based on the total number of terms the field contains. These values are stored in the index together with the field boosts, and the search code multiplies them into the score of each field of each search result.
A match in a longer field is less precise, so this implementation usually returns a smaller value when numTokens is large and a larger value when numTokens is small.
coord(q, d): a score factor based on how many of the query's search terms the document contains.
A document in which more of the query's search terms occur is a better match for the query, so this implementation usually returns a larger value when the ratio of these parameters is large and a smaller value when the ratio is small.
queryNorm(q): for a given query, a normalization value based on the sum of the weights of all the query's search terms. This value is multiplied with each query search term.
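The formula names its factors but does not spell out how each is computed. The sketch below fills them in with common TF-IDF style choices purely as assumptions: raw counts for tf, a smoothed logarithm for idf, an inverse square root of field length for lengthNorm, the matched fraction of query terms for coord, and an inverse root of summed squared weights for queryNorm. The function name `score` and its parameters are likewise illustrative.

```python
import math

def score(query_terms, doc_fields, doc_freq, num_docs, boosts):
    """Sum tf * idf * boost * lengthNorm * coord * queryNorm over query terms and fields."""
    all_tokens = [tok for tokens in doc_fields.values() for tok in tokens]
    matched = [t for t in query_terms if t in all_tokens]
    if not matched:
        return 0.0
    # coord: fraction of query terms that occur in the document (assumed form).
    coord = len(matched) / len(query_terms)
    # idf: rarer terms weigh more (assumed smoothed logarithmic form).
    idf = {t: 1.0 + math.log(num_docs / (doc_freq.get(t, 0) + 1)) for t in query_terms}
    # queryNorm: normalizes by the query's summed squared term weights (assumed form).
    query_norm = 1.0 / math.sqrt(sum(w * w for w in idf.values()))
    total = 0.0
    for term in matched:
        for field, tokens in doc_fields.items():
            tf = tokens.count(term)  # raw occurrence count in this field
            if tf == 0:
                continue
            length_norm = 1.0 / math.sqrt(len(tokens))  # longer fields score lower
            total += tf * idf[term] * boosts.get(field, 1.0) * length_norm * coord * query_norm
    return total

doc = {"content": ["hypertension", "is", "common"], "overview": ["hypertension"]}
s = score(["hypertension"], doc, {"hypertension": 1}, 10,
          {"content": 1.1, "overview": 1.3})
print(round(s, 4))
```

With these assumed definitions, a hit in the one-word overview field contributes more than a hit in the longer content field, which matches the stated intent of lengthNorm and of the field boosts.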
Example: "According to the standard proposed by China's hypertension prevention and control guideline in October 1999, for a normal adult who is not taking antihypertensive drugs and whose blood pressure is measured on two or more occasions at different times, a systolic pressure ≥ 140 mmHg and/or a diastolic pressure ≥ 90 mmHg is defined as hypertension (Table 1)."
Searching for "hypertension": the term occurs 3 times in the text, and the default boost is 1.0.
score(q, d) = sum( tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q) )
score(q, d) = 3 * 1.8351948 * 1.0 * 0.036002904 * 3 * 0.066072345
≈ 0.03929
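The arithmetic of the worked example can be reproduced by multiplying the six factor values in the order in which the formula lists them. The variable names are illustrative; the numbers are copied from the example line.

```python
import math

# Factor values copied from the worked example, in the order
# tf, idf, getBoost, lengthNorm, coord, queryNorm.
factors = [3, 1.8351948, 1.0, 0.036002904, 3, 0.066072345]
example_score = math.prod(factors)
print(f"score(q, d) = {example_score:.5f}")
```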
2.2. Weighting (boost) formula
score_d = score(q, d) * boost
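Applying the weighting formula with the four weights reported in section 1.2.2 might look like the following sketch. The dictionary keys and the default of 1.0 for fields without a listed weight are assumptions.

```python
# Weights from section 1.2.2; the key names are illustrative.
FIELD_BOOSTS = {
    "index_term": 2.0,
    "feature_word": 1.4,
    "content": 1.1,
    "overview": 1.3,
}

def weighted_score(base_score, field):
    """Apply score_d = score(q, d) * boost; unlisted fields default to 1.0."""
    return base_score * FIELD_BOOSTS.get(field, 1.0)

boosted = weighted_score(0.5, "index_term")
print(boosted)
```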
2.3. Retrieval process
2.3.1. Extracting records
After the user enters a keyword, the system looks the keyword up in the index database, scores the records related to the keyword according to the scoring and weighting rules, sorts them by score from high to low, and extracts the top one hundred records that match the user's level of knowledge. Experimental analysis showed that records ranked below the first 100 have poor relevance to the keyword, so only the top one hundred are extracted in order to improve retrieval efficiency.
2.3.2. Filtering duplicate records
The extracted records are filtered by document name, content category and document ID, and duplicate records are removed.
2.3.3. Classifying search results
The search results are classified by subject and presented to the user. Results in the subject most relevant to the keyword are shown first.
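Steps 2.3.1 to 2.3.3 can be read as one small pipeline: rank, keep the top records suited to the user, drop duplicates, then group by subject. The sketch below follows that reading under several assumptions: records are plain dictionaries, a record suits the user when its level equals the user's level, a duplicate is a record whose title, category and document ID all repeat, and subjects are ordered by their best-scoring record.

```python
from collections import defaultdict

def retrieve(scored_records, user_level, limit=100):
    """Rank, keep records matching the user's level, deduplicate, and group by subject."""
    ranked = sorted(scored_records, key=lambda r: r["score"], reverse=True)
    suitable = [r for r in ranked if r["level"] == user_level][:limit]
    seen, unique = set(), []
    for record in suitable:
        key = (record["title"], record["category"], record["doc_id"])
        if key not in seen:
            seen.add(key)
            unique.append(record)
    grouped = defaultdict(list)
    for record in unique:
        grouped[record["subject"]].append(record)
    # Subjects are ordered by their best-scoring record (first in each group).
    return sorted(grouped.items(), key=lambda item: item[1][0]["score"], reverse=True)

records = [
    {"doc_id": "d1", "title": "Hypertension", "category": "guideline",
     "subject": "cardiology", "level": 2, "score": 0.9},
    {"doc_id": "d1", "title": "Hypertension", "category": "guideline",
     "subject": "cardiology", "level": 2, "score": 0.9},
    {"doc_id": "d2", "title": "Kidney and blood pressure", "category": "review",
     "subject": "nephrology", "level": 2, "score": 0.4},
    {"doc_id": "d3", "title": "Advanced haemodynamics", "category": "review",
     "subject": "cardiology", "level": 5, "score": 0.95},
]
results = retrieve(records, user_level=2)
print([(subject, [r["doc_id"] for r in group]) for subject, group in results])
```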
2.3.4. Presenting medical knowledge
When the user finds the needed content in the list of search results and clicks to view the details, the page jumps directly to the position in the text where the search keyword occurs.

Claims (4)

1. A method for establishing a medical database, characterized by comprising the following steps: (1) converting source documents into full-text data on which text searches can be performed, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation (comments) index, c. a literature external-feature index, d. a difficulty score index; (2) selecting some of the retrieval indexes and assigning weights to them, so as to build a weighted scoring and ranking program based on the frequency of occurrence of the search terms,
wherein the b. annotation index comprises a subject-indexing index and a feature word index; the subject indexing consists of two parts, descriptors and subheadings; a descriptor is an equivalent or explanatory term for certain vocabulary in the full-text content; a subheading is a word denoting the medical field, including the medical discipline, to which the descriptor belongs; and a feature word is a word characterizing the patient category.
2. The method for establishing a medical database according to claim 1, characterized in that the external features of the c. literature external-feature index comprise the title, the author's name, the publisher and the identification number.
3. The method for establishing a medical database according to claim 1, characterized in that the retrieval indexes to which weights are assigned are the index terms of the subject-indexing index, the feature words of the feature word index, the content in the full-text retrieval index, and the overview.
4. The method for establishing a medical database according to claim 3, characterized in that the weights of the index terms, feature words, content and overview are 2, 1.4, 1.1 and 1.3, respectively.
CN 201010547722 2010-11-17 2010-11-17 Method for establishing medical database Expired - Fee Related CN102024027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010547722 CN102024027B (en) 2010-11-17 2010-11-17 Method for establishing medical database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010547722 CN102024027B (en) 2010-11-17 2010-11-17 Method for establishing medical database

Publications (2)

Publication Number Publication Date
CN102024027A CN102024027A (en) 2011-04-20
CN102024027B true CN102024027B (en) 2013-03-20

Family

ID=43865324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010547722 Expired - Fee Related CN102024027B (en) 2010-11-17 2010-11-17 Method for establishing medical database

Country Status (1)

Country Link
CN (1) CN102024027B (en)

Families Citing this family (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102760141A (en) * 2011-04-28 2012-10-31 冯雪莲 Literature inquiry service system
CN102651015A (en) * 2012-03-30 2012-08-29 梁宗强 Method and module for distributing weight for searched drugs
CN103268326A (en) * 2013-05-02 2013-08-28 百度在线网络技术(北京)有限公司 Personalized cross-language retrieval method and device
JP6101563B2 (en) * 2013-05-20 2017-03-22 株式会社日立製作所 Information structuring system
WO2015023686A1 (en) 2013-08-12 2015-02-19 Ironwood Medical Information Technologies, LLC Medical data system and method
CN109804437A (en) * 2016-10-11 2019-05-24 皇家飞利浦有限公司 Clinical knowledge centered on patient finds system
CN106599547A (en) * 2016-11-23 2017-04-26 中山健康医疗信息技术有限公司 Intelligent medical knowledge base management system based on tags
CN110678858B (en) * 2017-06-01 2021-07-09 互动解决方案公司 Data information storage device for search
CN112732946B (en) * 2019-10-12 2023-04-18 四川医枢科技有限责任公司 Modular data analysis and database establishment method for medical literature
CN110990376B (en) * 2019-11-20 2023-05-09 中国农业科学院农业信息研究所 Subject classification automatic indexing method based on multi-factor mixed ordering mechanism
CN112507179A (en) * 2020-12-11 2021-03-16 杭州依图医疗技术有限公司 Medical data processing method and retrieval method, device and storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101055580A (en) * 2006-04-13 2007-10-17 Lg电子株式会社 System, method and user interface for retrieving documents
CN101178707A (en) * 2006-11-08 2008-05-14 许丰 Multidimensional searching method and software
CN101233513A (en) * 2005-07-29 2008-07-30 雅虎公司 System and method for reordering a result set
CN101246492A (en) * 2008-02-26 2008-08-20 华中科技大学 Full text retrieval system based on natural language
CN101599078A (en) * 2009-07-10 2009-12-09 腾讯科技(深圳)有限公司 A kind of method of text retrieval and device
CN101727535A (en) * 2008-10-30 2010-06-09 北大方正集团有限公司 Cross indexing method for patients crossing system and system thereof

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TW200719174A (en) * 2005-11-11 2007-05-16 Inventec Appliances Corp Translation system and method
US20080275877A1 (en) * 2007-05-04 2008-11-06 International Business Machines Corporation Method and system for variable keyword processing based on content dates on a web page

Also Published As

Publication number Publication date
CN102024027A (en) 2011-04-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20110420

Assignee: Health online education training Limited

Assignor: Beijing Health Online Network Technology Co., Ltd.

Contract record no.: 2013990000184

Denomination of invention: Method for establishing medical database

Granted publication date: 20130320

License type: Exclusive License

Record date: 20130428

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130320

Termination date: 20181117