CN102024027A - Method for establishing medical database - Google Patents

Publication number
CN102024027A
Authority
CN
China
Prior art keywords
index
search
medical
field
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010105477220A
Other languages
Chinese (zh)
Other versions
CN102024027B (en)
Inventor
成飞
史彤毅
高瞻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Health Online Network Technology Co Ltd
Original Assignee
Beijing Health Online Network Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Health Online Network Technology Co Ltd
Priority to CN 201010547722
Publication of CN102024027A
Application granted
Publication of CN102024027B
Legal status: Expired - Fee Related

Abstract

The invention provides a method for establishing a medical database, comprising the following steps: (1) converting source documents into full-text data that supports text search, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-characteristic index, and d. a difficulty-score index; and (2) selecting some of the retrieval indexes and assigning weights to them, so as to build a weighted scoring and ranking program based on the occurrence frequency of search terms. A medical database established by this method can retrieve medical literature that is highly relevant and suited to the user, according to the user's level of knowledge. The method can greatly improve retrieval efficiency and is convenient for users with different levels of medical knowledge.

Description

Method for establishing a medical database
Technical field
The present invention relates to a method for establishing a medical database (that is, a medical literature retrieval system).
Background art
Existing medical databases are all built on the most conventional retrieval methods, most of which retrieve by keyword or use Boolean operations for simple combined retrieval. Such retrieval, however, returns a very large number of results and offers no way to narrow them down to the most relevant literature; screening them manually is time-consuming and laborious. Adding more search terms reduces the number of results, but highly relevant documents are then easily missed.
In addition, users differ in how well they understand medical knowledge. A more relevant document is not necessarily more useful to a given user; usefulness also depends on the user's level of knowledge, and existing medical databases take no measures to address this. The professional level of medical personnel in China varies widely. If a medical database could serve primary-care medical personnel, allowing them to easily retrieve medical literature suited to their own level of knowledge, it would certainly help raise the standard of their work.
Summary of the invention
The object of the present invention is to provide a method for establishing a medical database that can accurately retrieve medical literature suited to the user's level of knowledge.
The method for establishing a medical database provided by the invention comprises the following steps: (1) converting source documents into full-text data that supports text search, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-characteristic index, d. a difficulty-score index; (2) selecting some of the retrieval indexes and assigning weights to them, and building a weighted scoring and ranking program based on the occurrence frequency of search terms.
Preferably, the b. annotation index comprises an indexing-term index and a feature-word index. An indexing term consists of two parts, a descriptor and a subheading: the descriptor is an equivalent explanatory term for certain vocabulary in the full-text content, and the subheading is a word denoting the medical field, including the medical discipline, to which the descriptor belongs. A feature word is a word that characterizes a category of patient.
The external characteristics covered by the c. document external-characteristic index may include title, author name, publisher and identification number.
Preferably, the retrieval indexes to which weights are assigned are the indexing terms of the indexing-term index, the feature words of the feature-word index, and the content and the overview in the full-text retrieval index. The weights of the indexing term, feature word, content and overview may be 2, 1.4, 1.1 and 1.3 respectively.
Preferably, the scoring formula of the weighted scoring and ranking program based on the occurrence frequency of search terms is:
score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q)), where
score(q, d): the score value;
tf(t in d): a score factor based on the number of times the search term or phrase occurs in the document;
idf(t): a score factor for a single search term with respect to a particular index;
getBoost(t.field in d): the gain (boost) factor of the field containing the search term;
lengthNorm(t.field in d): for a given field, the normalization value of the total number of terms it contains;
coord(q, d): a score factor based on the fraction of all query search terms that the document contains;
queryNorm(q): for a given query, the normalization value of the sum of the weights of all query search terms.
A medical database established by the method provided by the invention can retrieve, according to the user's level of knowledge, medical literature that is highly relevant and suited to that user. It can greatly improve retrieval efficiency and is convenient for users with different levels of medical knowledge.
Embodiment
To illustrate the invention more clearly, one embodiment is described below, together with how a medical database established according to this embodiment is used.
1. Building the indexes
1.1. Indexed fields of a record
1.1.1. Full-text retrieval index
To build a text retrieval system, the source documents are first converted into a full-text database that supports text search. This preprocessing includes segmenting the full text and normalizing its format. Once preprocessing is complete, indexing can begin: typesetting symbols, format control characters and the like are first filtered out of the source documents, and then information on every occurrence of each character, word and phrase in the source documents is recorded in the index database.
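The filtering and recording steps above can be sketched as a small inverted index. This is a minimal illustration, not the embodiment's implementation: the control-character pattern, the regex tokenizer (a stand-in for a real word segmenter, which Chinese text would require) and the choice to record token positions are all assumptions.

```python
import re
from collections import defaultdict

# Hypothetical pattern for "typesetting symbols and format control characters";
# the real set depends on the source documents.
_CONTROL = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]|</?[A-Za-z][^>]*>")
# Stand-in tokenizer; a real system would use a word segmenter here.
_WORD = re.compile(r"\w+")


def build_full_text_index(docs):
    """Map each lower-cased word to a list of (doc_id, token_position) pairs."""
    index = defaultdict(list)
    for doc_id, raw in docs.items():
        cleaned = _CONTROL.sub(" ", raw)  # strip markup and control characters
        for pos, match in enumerate(_WORD.finditer(cleaned)):
            index[match.group().lower()].append((doc_id, pos))
    return dict(index)


docs = {"d1": "<b>Hypertension</b> is common.\x0cHypertension guidelines"}
index = build_full_text_index(docs)
print(index["hypertension"])
```

Storing positions rather than bare document ids is what later makes it possible to jump to the keyword inside a document.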
1.1.2. Indexing-term index and feature-word index
The index terms of full-text retrieval all come from the literature content. When the keyword a user enters is a synonym or alias of a word in the content, however, the full-text index cannot lead the user to the content needed, so full-text retrieval alone cannot meet all user needs. On top of it, therefore, each full-text document is split into the smallest paragraphs that have independent medical meaning, and each such paragraph is indexed: its feature words, descriptors and subheadings are extracted; the descriptors and subheadings are used to build the indexing-term index, and the feature words to build the feature-word index.
An indexing term consists of two parts, a descriptor and a subheading.
Descriptor: a word or phrase drawn from natural language that has been normalized and refined and that reflects a biomedical concept. A descriptor denotes a specific concept and can express a medical concept on its own.
Subheading: also called a qualifier; it restricts a descriptor, that is, it emphasizes a particular aspect of the concept the descriptor represents. A subheading denotes a general concept, carries little information, and cannot be used alone; it must be combined with a descriptor. Examples: pathology, pharmacology, treatment, diagnosis, drug therapy, rehabilitation, complications.
Feature word: a word or phrase that clinicians, biomedical researchers and medical teaching staff are interested in, encounter frequently, and that carries special meaning. Examples: male, female, infant, child, elderly, pregnancy.
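One way to picture the two annotation indexes is as lookup tables keyed by (descriptor, subheading) pairs and by feature words. The sketch below is illustrative only: the `Paragraph` class, its field names and the sample annotations are hypothetical, and registering each descriptor under a `None` subheading as well is a design assumption that lets a descriptor be searched on its own.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Paragraph:
    """Smallest passage with independent medical meaning, plus its annotations."""
    para_id: str
    text: str
    # (descriptor, subheading) pairs; a subheading never appears without a descriptor
    index_terms: list = field(default_factory=list)
    feature_words: list = field(default_factory=list)


def build_annotation_indexes(paragraphs):
    """Return (indexing-term index, feature-word index), each mapping key -> paragraph ids."""
    term_index = defaultdict(set)
    feature_index = defaultdict(set)
    for p in paragraphs:
        for descriptor, subheading in p.index_terms:
            term_index[(descriptor, subheading)].add(p.para_id)
            term_index[(descriptor, None)].add(p.para_id)  # descriptor alone is searchable
        for word in p.feature_words:
            feature_index[word].add(p.para_id)
    return term_index, feature_index


paragraphs = [
    Paragraph("p1", "Diagnosing raised blood pressure in older adults.",
              [("hypertension", "diagnosis")], ["elderly"]),
    Paragraph("p2", "Drug treatment of raised blood pressure in pregnancy.",
              [("hypertension", "drug therapy")], ["pregnancy"]),
]
term_index, feature_index = build_annotation_indexes(paragraphs)
print(sorted(term_index[("hypertension", None)]))
```

Neither sample paragraph contains the word "hypertension", yet both are found through the descriptor, which is the synonym problem the annotation index is meant to solve.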
1.1.3. Document external-characteristic index
External characteristics of a document form a kind of literature retrieval language. This retrieval language mainly covers searching by a document's title, author name, publisher, report number, patent number and similar data. Documents are arranged alphabetically by title or author name, or numerically by report number or patent number, which yields a retrieval language with title, author and number access points that meets user needs.
To help users find the content they need more quickly and accurately, the database includes an advanced search function with which users can query by external characteristics such as title and publisher. Each of these characteristics is indexed in two ways at once, tokenized and untokenized, which safeguards both recall and precision of user searches.
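Indexing one characteristic both ways might look like the following sketch. The record layout, the function name `index_field` and the whitespace tokenizer are assumptions made for illustration.

```python
from collections import defaultdict


def index_field(records, field_name):
    """Index one external characteristic both untokenized (exact) and tokenized (by word)."""
    exact = defaultdict(set)
    tokens = defaultdict(set)
    for rec_id, rec in records.items():
        value = rec[field_name].strip().lower()
        exact[value].add(rec_id)        # whole value as a single key
        for tok in value.split():       # each word as its own key
            tokens[tok].add(rec_id)
    return exact, tokens


records = {
    "r1": {"title": "Hypertension Prevention Guide"},
    "r2": {"title": "Guide to Diabetes Care"},
}
exact, tokens = index_field(records, "title")
print(sorted(tokens["guide"]))                   # word-level lookup favours recall
print(sorted(exact["guide to diabetes care"]))   # whole-value lookup favours precision
```

A single word reaches every title that contains it through the tokenized index, while a full title entered verbatim matches only that title through the untokenized index.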
1.1.4. Difficulty-score index
Medical editors score each item of content by difficulty and, once a medical expert has approved the scores, the scores are indexed. After a user registers, a medical editor assigns the user an approximate knowledge-level rating based on the registration information. The user's rating and the difficulty-score index together help the user easily find medical knowledge that matches his or her level.
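How a user rating is compared with a difficulty score is not specified in the text. A minimal sketch, assuming both are integers on the same hypothetical 1 to 5 scale and that a document suits a user when the two differ by at most a tolerance:

```python
def matches_level(doc_difficulty, user_level, tolerance=1):
    """True when a document's editor-assigned difficulty is close to the user's level."""
    return abs(doc_difficulty - user_level) <= tolerance


# doc id -> difficulty score (hypothetical 1-5 scale)
difficulty_index = {"d1": 1, "d2": 3, "d3": 5}
user_level = 2
suitable = [d for d, s in difficulty_index.items() if matches_level(s, user_level)]
print(suitable)
```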
1.2. Weight assignment
1.2.1. Weighted objects
Weighted objects are of two kinds, Field and Document. Field comprises: indexing term, feature word, specific content, title, publisher, subject.
1.2.2. Weight settings
Based on surveys of user needs and analysis of user psychology, different weight settings were tested with the aim of helping users find the content they need most. The results are as follows:
Indexing term boost = 2
Feature word boost = 1.4
Content boost = 1.1
Document (specifically the "overview") boost = 1.3
2. Search
2.1. Scoring
The full-text content is first segmented into words, and the frequency with which a word occurs in the text is then scored according to the following formula:
score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q))
where:
tf(t in d): a score factor based on the number of times the search term or phrase occurs in the document.
idf(t): a score factor for a single search term with respect to a particular index.
getBoost(t.field in d): the gain (boost) factor of the field containing the search term.
lengthNorm(t.field in d): for a given field, the normalization value of the total number of terms it contains. These values are stored in the index together with the field boost, and the search code multiplies them into the score of each field of each search result.
Matches in longer fields are less precise, so this implementation usually returns a smaller value when numTokens is large and a larger value when numTokens is small.
coord(q, d): a score factor based on the fraction of all query search terms that the document contains.
A document in which more of the query search terms occur is a better match for the query, so this implementation usually returns a larger value when the ratio of these parameters is large and a smaller value when the ratio is small.
queryNorm(q): for a given query, the normalization value of the sum of the weights of all query search terms. This value is multiplied into each query search term.
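The text describes how each factor behaves but not how it is computed. The factor names coincide with those of Apache Lucene's classic TF-IDF similarity, so the sketch below assumes that library's documented default forms; whether the embodiment used them is not stated, and the function names here are illustrative.

```python
import math


def tf(freq):
    """Term-frequency factor: grows with the number of occurrences."""
    return math.sqrt(freq)


def idf(doc_freq, num_docs):
    """Inverse document frequency: rarer terms score higher."""
    return 1.0 + math.log(num_docs / (doc_freq + 1))


def length_norm(num_tokens):
    """Smaller for long fields, larger for short ones."""
    return 1.0 / math.sqrt(num_tokens)


def coord(overlap, max_overlap):
    """Fraction of the query's terms that the document contains."""
    return overlap / max_overlap


def query_norm(sum_of_squared_weights):
    """Scales scores so that different queries are comparable."""
    return 1.0 / math.sqrt(sum_of_squared_weights)


print(length_norm(16), length_norm(400))   # short field vs long field
print(coord(3, 3), coord(1, 3))            # all terms matched vs one of three
```

Under these assumed forms the short field receives the larger normalization value and the document matching all query terms receives the larger coordination factor, which agrees with the behaviour described for numTokens and for the ratio of matched terms.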
Example: "According to the standard proposed in the Chinese guidelines for the prevention and treatment of hypertension of October 1999, a normal adult who is not taking antihypertensive drugs and whose blood pressure, measured on two or more occasions at different times, shows a systolic pressure ≥ 140 mmHg and/or a diastolic pressure ≥ 90 mmHg is defined as having hypertension (Table 1)."
Searching for "hypertension": the term occurs 3 times in the text, and the default boost for "hypertension" is 1.0.
score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q))
score(q, d) = 3 * 1.8351948 * 1.0 * 0.036002904 * 3 * 0.066072345
= 0.03929
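The product in the example can be checked mechanically. In the sketch below the six numbers are assigned to the factors in the order the formula lists them, which is an assumption; because the query has a single term, the sum has only one summand. `math.prod` requires Python 3.8 or later.

```python
import math

# Factor values copied from the worked example; the names are assigned
# in formula order, which is an assumption.
factors = {
    "tf": 3,
    "idf": 1.8351948,
    "boost": 1.0,
    "lengthNorm": 0.036002904,
    "coord": 3,
    "queryNorm": 0.066072345,
}
score = math.prod(factors.values())
print(round(score, 5))
```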
2.2. Weighting (boost) formula
score_d = score(q, d) * boost
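Applied to the weight settings given in section 1.2.2, the weighting formula amounts to a table lookup and one multiplication. The field keys and the neutral default of 1.0 for fields without a listed weight are assumptions of this sketch.

```python
# Boost values from the weight settings; the dictionary keys are illustrative names.
BOOSTS = {"index_term": 2.0, "feature_word": 1.4, "content": 1.1, "overview": 1.3}


def weighted_score(base_score, matched_field):
    """score_d = score(q, d) * boost, with a neutral boost of 1.0 for unlisted fields."""
    return base_score * BOOSTS.get(matched_field, 1.0)


base = 0.5  # illustrative base score, not taken from the text
for name in ("index_term", "overview", "title"):
    print(name, weighted_score(base, name))
```

With equal base scores, a match on an indexing term therefore outranks a match in the overview, which in turn outranks a match in the plain content.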
2.3. Retrieval process
2.3.1. Extracting records
After the user enters a keyword, the system looks the keyword up in the index database, scores the records related to the keyword according to the scoring and weighting rules, sorts them by score from high to low, and extracts the top one hundred records that match the user's knowledge level. Experimental analysis showed that records ranked after the 100th are only weakly related to the keyword, so only the top one hundred records are extracted in order to improve retrieval efficiency.
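The extraction step can be sketched as a level filter followed by a top-N selection. The tuple layout and the tolerance-based level test are assumptions; only the limit of one hundred comes from the text.

```python
import heapq


def top_records(scored, user_level, limit=100, tolerance=1):
    """scored: iterable of (record_id, score, difficulty).

    Returns the ids of the `limit` best-scoring records whose difficulty
    is within `tolerance` of the user's level, best first.
    """
    suitable = [r for r in scored if abs(r[2] - user_level) <= tolerance]
    return [r[0] for r in heapq.nlargest(limit, suitable, key=lambda r: r[1])]


scored = [("a", 0.9, 5), ("b", 0.7, 2), ("c", 0.8, 3), ("d", 0.1, 2)]
print(top_records(scored, user_level=2, limit=2))
```

Filtering by level before taking the top N gives the same result as sorting everything first and then keeping the first N level-appropriate records, while `heapq.nlargest` avoids a full sort of a large result set.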
2.3.2. Filtering duplicate records
The extracted records are filtered by document title, content category and document ID, and duplicate records are removed.
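A minimal sketch of the duplicate filter, assuming that two records count as duplicates when title, content category and document ID all coincide (the text does not say whether the three are combined or checked separately) and that records arrive already ranked:

```python
def drop_duplicates(records):
    """Keep the first (highest-ranked) record for each (title, category, doc_id) key."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["title"], rec["category"], rec["doc_id"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


ranked = [
    {"title": "Hypertension", "category": "overview", "doc_id": 7, "score": 0.9},
    {"title": "Hypertension", "category": "overview", "doc_id": 7, "score": 0.6},
    {"title": "Hypertension", "category": "treatment", "doc_id": 7, "score": 0.5},
]
deduped = drop_duplicates(ranked)
print([r["score"] for r in deduped])
```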
2.3.3. Classifying search results
The search results are classified by subject and presented to the user. Results in the subject most relevant to the keyword are shown first.
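The classification step can be sketched as grouping by subject and ordering the groups. How a subject's relevance to the keyword is measured is not stated; the sketch assumes it is the best score among that subject's records.

```python
from collections import defaultdict


def group_by_subject(records):
    """records: list of (record_id, subject, score).

    Returns [(subject, [record ids, best first])], subjects ordered by best score.
    """
    groups = defaultdict(list)
    for rec_id, subject, score in records:
        groups[subject].append((score, rec_id))
    ordered = sorted(groups.items(),
                     key=lambda kv: max(s for s, _ in kv[1]),
                     reverse=True)
    return [(subject, [rid for _, rid in sorted(items, reverse=True)])
            for subject, items in ordered]


results = [("r1", "cardiology", 0.9), ("r2", "nephrology", 0.6), ("r3", "cardiology", 0.4)]
print(group_by_subject(results))
```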
2.3.4. Presenting medical knowledge
When the user finds the needed content in the search result list and clicks to view the details, the page jumps directly to the position of the search keyword in the text.
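Jumping to the keyword requires its offsets within the displayed text. The helper below is a hypothetical sketch using plain case-insensitive substring search; a real page would more likely reuse occurrence information already held in the index. The assignment expression requires Python 3.8 or later.

```python
def keyword_offsets(text, keyword):
    """Return the character offsets of every case-insensitive occurrence of keyword."""
    offsets, start = [], 0
    haystack, needle = text.lower(), keyword.lower()
    while needle and (pos := haystack.find(needle, start)) != -1:
        offsets.append(pos)
        start = pos + len(needle)   # continue after this match
    return offsets


offsets = keyword_offsets("Hypertension: treating hypertension", "hypertension")
print(offsets)
```

The first offset is the natural scroll target, and the remaining ones can drive highlighting of later occurrences.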

Claims (6)

1. A method for establishing a medical database, characterized in that it comprises the following steps: (1) converting source documents into full-text data that supports text search, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-characteristic index, d. a difficulty-score index; (2) selecting some of the retrieval indexes and assigning weights to them, and building a weighted scoring and ranking program based on the occurrence frequency of search terms.
2. The method for establishing a medical database according to claim 1, characterized in that the b. annotation index comprises an indexing-term index and a feature-word index; an indexing term consists of two parts, a descriptor and a subheading; the descriptor is an equivalent explanatory term for certain vocabulary in the full-text content, and the subheading is a word denoting the medical field, including the medical discipline, to which the descriptor belongs; a feature word is a word that characterizes a category of patient.
3. The method for establishing a medical database according to claim 1, characterized in that the external characteristics of the c. document external-characteristic index comprise title, author name, publisher and identification number.
4. The method for establishing a medical database according to claim 1, characterized in that the retrieval indexes to which weights are assigned are the indexing terms of the indexing-term index, the feature words of the feature-word index, and the content and the overview in the full-text retrieval index.
5. The method for establishing a medical database according to claim 4, characterized in that the weights of the indexing term, feature word, content and overview are 2, 1.4, 1.1 and 1.3 respectively.
6. The method for establishing a medical database according to claim 5, characterized in that the scoring formula of the weighted scoring and ranking program based on the occurrence frequency of search terms is:
score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q)), where
score(q, d): the score value;
tf(t in d): a score factor based on the number of times the search term or phrase occurs in the document;
idf(t): a score factor for a single search term with respect to a particular index;
getBoost(t.field in d): the gain (boost) factor of the field containing the search term;
lengthNorm(t.field in d): for a given field, the normalization value of the total number of terms it contains;
coord(q, d): a score factor based on the fraction of all query search terms that the document contains;
queryNorm(q): for a given query, the normalization value of the sum of the weights of all query search terms.
CN 201010547722 2010-11-17 2010-11-17 Method for establishing medical database Expired - Fee Related CN102024027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010547722 CN102024027B (en) 2010-11-17 2010-11-17 Method for establishing medical database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201010547722 CN102024027B (en) 2010-11-17 2010-11-17 Method for establishing medical database

Publications (2)

Publication Number Publication Date
CN102024027A (en) 2011-04-20
CN102024027B (en) 2013-03-20

Family

ID=43865324

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201010547722 Expired - Fee Related CN102024027B (en) 2010-11-17 2010-11-17 Method for establishing medical database

Country Status (1)

Country Link
CN (1) CN102024027B (en)

Also Published As

Publication number Publication date
CN102024027B (en) 2013-03-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
EE01 Entry into force of recordation of patent licensing contract

Application publication date: 20110420

Assignee: Health online education training Limited

Assignor: Beijing Health Online Network Technology Co., Ltd.

Contract record no.: 2013990000184

Denomination of invention: Method for establishing medical database

Granted publication date: 20130320

License type: Exclusive License

Record date: 20130428

LICC Enforcement, change and cancellation of record of contracts on the licence for exploitation of a patent or utility model
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130320

Termination date: 20181117
