CN102024027A

CN102024027A - Method for establishing medical database

Info

Publication number: CN102024027A
Application number: CN2010105477220A
Authority: CN
Inventors: 成飞; 史彤毅; 高瞻
Original assignee: Beijing Health Online Network Technology Co Ltd
Current assignee: Beijing Health Online Network Technology Co Ltd
Priority date: 2010-11-17
Filing date: 2010-11-17
Publication date: 2011-04-20
Anticipated expiration: 2030-11-17
Also published as: CN102024027B

Abstract

The invention provides a method for establishing a medical database, comprising the following steps: (1) converting source documents into full-text data that supports text searching, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-feature index, and d. a difficulty scoring index; and (2) selecting some of the retrieval indexes and assigning weights to them, so as to build a weighted scoring and ranking program based on the frequency of occurrence of the search terms. A medical database established with this method can retrieve medical literature that is both highly relevant and suited to the user's level of knowledge. The method greatly improves retrieval efficiency and is convenient for users with different levels of medical knowledge.

Description

A method for establishing a medical database

Technical field

The present invention relates to a method for establishing a medical database (that is, a medical literature retrieval system).

Background technology

Existing medical databases are all built on the most conventional retrieval methods, most of which rely on keyword retrieval or on simple combined retrieval using Boolean operations. Such retrieval, however, returns a very large number of results and offers no means of further screening for the most relevant documents, so manual screening is time-consuming and laborious. Although the number of results can be reduced by adding more search terms, highly relevant documents are then easily missed.

In addition, different users understand medical knowledge to different degrees. A more relevant document is not necessarily more useful to a given user; its usefulness also depends on the user's level of knowledge, and existing medical databases take no corresponding measures. The professional levels of medical personnel in China differ widely. If a medical database could serve grassroots medical personnel and let them easily retrieve medical literature suited to their own level of knowledge, it would help raise the standard of their work.

Summary of the invention

The object of the present invention is to provide a method for establishing a medical database that can accurately retrieve medical literature suited to the user's level of knowledge.

The method for establishing a medical database provided by the invention comprises the following steps: (1) converting source documents into full-text data that supports text searching, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-feature index, and d. a difficulty scoring index; (2) selecting some of these retrieval indexes and assigning weights to them, and building a weighted scoring and ranking program based on the frequency of occurrence of the search terms.

Preferably, the annotation index (b) comprises an indexing-term index and a feature-word index. An index term consists of two parts, a descriptor and a subheading: the descriptor is a term equivalent in meaning to certain vocabulary in the full-text content, and the subheading is a word denoting the medical field, including the medical discipline, to which the descriptor belongs. A feature word is a word that characterizes the patient category.

The external features covered by the document external-feature index (c) may include the title, the author's name, the publisher, and an identification number.

Preferably, the retrieval indexes to which weights are assigned are the index terms of the indexing-term index, the feature words of the feature-word index, and the content and the overview in the full-text retrieval. The weights of the index term, feature word, content, and overview may be 2, 1.4, 1.1, and 1.3, respectively.

Preferably, the scoring formula of the weighted scoring and ranking program based on the frequency of occurrence of the search terms is:

score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q)), where:

score(q, d): the score;

tf(t in d): the score factor based on the number of times a search term or phrase occurs in the document;

idf(t): the score factor of a single search term with respect to a particular index;

getBoost(t.field in d): the boost factor for the field of the search term;

lengthNorm(t.field in d): for a given field, the normalization value for the total number of terms the field contains;

coord(q, d): the score factor based on the fraction of all query terms that the document contains;

queryNorm(q): for a given query, the normalization value for the sum of the weights of all query terms.

A medical database established with the method provided by the invention can retrieve, according to the user's level of knowledge, medical literature that is highly relevant and suited to that user. It can greatly improve retrieval efficiency and is convenient for users with different levels of medical knowledge.

Embodiment

To illustrate the invention more clearly, one embodiment is described below, together with the way a medical database built according to this embodiment is used.

1. Building the indexes

1.1. Index fields of a record

1.1.1. Full-text retrieval index

To build a text retrieval system, the source documents must first be converted into a full-text database that supports text searching. This pre-processing includes segmenting the full text and normalizing its format. Once pre-processing is complete, indexing can begin: typesetting symbols, formatting control characters, and the like are first filtered out of the source document, and the position at which each character, word, and phrase occurs in the source document is then recorded in the index database.
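The indexing step described above can be sketched as a small positional inverted index. This is only an illustration, not the embodiment's actual code: the function names `strip_markup` and `build_positional_index`, the regular expressions, and the simple word tokenization are all assumptions (Chinese source text would need a real word segmenter).

```python
import re
from collections import defaultdict


def strip_markup(source: str) -> str:
    """Drop tag-like typesetting symbols and control characters (assumed formats)."""
    no_tags = re.sub(r"<[^>]+>", " ", source)
    return re.sub(r"[\x00-\x08\x0b\x0c\x0e-\x1f]", " ", no_tags)


def build_positional_index(docs: dict) -> dict:
    """Map each term to {doc_id: [token positions where the term occurs]}."""
    index = defaultdict(dict)
    for doc_id, raw in docs.items():
        tokens = re.findall(r"\w+", strip_markup(raw).lower())
        for pos, tok in enumerate(tokens):
            index[tok].setdefault(doc_id, []).append(pos)
    return dict(index)


# Hypothetical two-document collection.
index = build_positional_index({
    "d1": "<p>Hypertension is diagnosed by blood pressure.</p>",
    "d2": "Blood pressure above 140 mmHg suggests hypertension.",
})
```

Storing positions rather than only document IDs is what later allows a result page to jump to the place where the keyword occurs.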

1.1.2. Indexing-term index and feature-word index

The index entries of the full-text retrieval index all come from the content of the documents. When the keyword a user enters is a synonym or an alternative name of a word in the document content, the full-text index cannot lead the user to the needed content, so full-text retrieval alone cannot meet all user needs. For this reason, each full-text document is split into the smallest paragraphs that have independent medical meaning, and each such paragraph is indexed: its feature words, descriptors, and subheadings are extracted, the descriptors and subheadings are used to build the indexing-term index, and the feature words are used to build the feature-word index.

An index term consists of two parts: a descriptor and a subheading.

Descriptor: a word or phrase derived from natural language through normalization and optimization that can reflect a biomedical concept. A descriptor denotes a specific concept and can express a medical concept on its own.

Subheading: also called a qualifier. It restricts a descriptor, that is, it emphasizes a particular aspect of the concept the descriptor represents. A subheading denotes a general concept, carries little information, and cannot be used alone; it must be combined with a descriptor. Examples: pathology, pharmacology, therapy, diagnosis, drug therapy, rehabilitation, and complications.

Feature word: a word or phrase with a special meaning that clinicians, biomedical researchers, and medical educators are interested in and frequently encounter. Examples: male, female, infant, child, elderly, and pregnancy.
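The relationship between descriptors, subheadings, and feature words can be sketched as follows. The vocabulary, the synonym table, and the function names are hypothetical; the text does not say how terms are stored, only that paragraphs are indexed under descriptor and subheading pairs and under feature words so that a synonym entered by the user still reaches the right content.

```python
from collections import defaultdict

# Hypothetical controlled vocabulary: variant wording -> descriptor.
SYNONYM_TO_DESCRIPTOR = {
    "high blood pressure": "hypertension",
    "hypertension": "hypertension",
}
SUBHEADINGS = {"diagnosis", "therapy", "complications"}  # assumed subset
FEATURE_WORDS = {"male", "female", "infant", "child", "elderly", "pregnancy"}


def new_index():
    return {"index": defaultdict(list), "feature": defaultdict(list)}


def add_paragraph(idx, par_id, index_terms, feature_words):
    """Record a paragraph under its (descriptor, subheading) pairs and feature words."""
    for descriptor, subheading in index_terms:
        if subheading not in SUBHEADINGS:
            raise ValueError(f"unknown subheading: {subheading}")
        idx["index"][(descriptor, subheading)].append(par_id)
    for word in feature_words:
        if word not in FEATURE_WORDS:
            raise ValueError(f"unknown feature word: {word}")
        idx["feature"][word].append(par_id)


def find_by_descriptor(idx, query):
    """Resolve the query to a descriptor, then collect paragraphs under any subheading."""
    descriptor = SYNONYM_TO_DESCRIPTOR.get(query.strip().lower())
    if descriptor is None:
        return []
    hits = [p for (d, _sub), pars in idx["index"].items() if d == descriptor for p in pars]
    return sorted(set(hits))


idx = new_index()
add_paragraph(idx, "p1", [("hypertension", "diagnosis")], ["elderly"])
add_paragraph(idx, "p2", [("hypertension", "therapy")], ["pregnancy"])
```

Here a query for "high blood pressure" reaches both paragraphs even though neither needs to contain that exact wording.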

1.1.3. Document external-feature index

External document features form a kind of literature retrieval language. This retrieval language mainly supports searching a document by its title, author name, publisher, report number, patent number, and similar attributes. Documents are arranged in alphabetical order of title or author name, or in numerical order of report number or patent number, which yields a retrieval language that offers title, author, and number as search paths to meet user needs.

In order to help the content that the user finds more fast and accurately to be needed, database has added the Advanced Search function, and the user can inquire about by the surface of documents such as title, publishing house.These surfaces are done participle and two kinds of index of participle not simultaneously, guaranteed the recall ratio and the precision ratio of user search.

1.1.4. Difficulty scoring index

Medical editors score each item according to the difficulty of its content; after approval by medical experts, the scores are indexed. After a user registers, a medical editor roughly grades the user's level of knowledge on the basis of the registration information. The user's grade and the scoring index together help the user easily find medical knowledge that matches his or her level of knowledge.
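One possible way to match content difficulty to a user's grade is sketched below. The text does not specify the scale or the matching rule, so the integer 1 to 5 scale, the tolerance of one level, and the names `suitable_for_user` and `docs_for_level` are all assumptions.

```python
def suitable_for_user(doc_difficulty: int, user_level: int, tolerance: int = 1) -> bool:
    """Assumed rule: a document fits if its difficulty is within `tolerance` of the user's level."""
    return abs(doc_difficulty - user_level) <= tolerance


def docs_for_level(difficulty_index: dict, user_level: int, tolerance: int = 1) -> list:
    """Return ids of documents whose editor-assigned difficulty suits the user."""
    return sorted(
        doc_id
        for doc_id, difficulty in difficulty_index.items()
        if suitable_for_user(difficulty, user_level, tolerance)
    )


# Hypothetical editor-assigned difficulty scores.
difficulty_index = {"d1": 1, "d2": 3, "d3": 5}
```

Other rules, such as admitting every document at or below the user's level, would fit the description equally well.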

1.2. Weight allocation

1.2.1. Weighted objects

There are two kinds of weighted object: Field and Document. Fields include: index term, feature word, specific content, title, publishing house, and subject.

1.2.2. Weight settings

Based on a survey of user needs and an analysis of user psychology, different weight settings were tested with the aim of helping users find the content they need most. The results are as follows:

Index term: boots = 2

Feature word: boots = 1.4

Content: boots = 1.1

Document (specifically the "overview"): boots = 1.3

2. Search

2.1. Scoring

The full-text content is first segmented into words, and each word is then scored by its frequency of occurrence in the text according to the following formula:

score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q))

where:

tf(t in d): the score factor based on the number of times a search term or phrase occurs in the document.

idf(t): the score factor of a single search term with respect to a particular index.

getBoost(t.field in d): the boost factor for the field of the search term.

lengthNorm(t.field in d): for a given field, the normalization value for the total number of terms the field contains. This value is stored in the index together with the field boost, and the search code multiplies it into the score of each field of each search result.

Matches in longer fields are less precise, so this implementation usually returns smaller values when numTokens is large and larger values when numTokens is small.

coord(q, d): the score factor based on the fraction of all query terms that the document contains.

The presence of a large share of the query terms indicates a better match with the query, so this implementation usually returns larger values when the ratio of these parameters is large and smaller values when the ratio is small.

queryNorm(q): for a given query, the normalization value for the sum of the weights of all query terms. This value is multiplied into the weight of each query term.
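The qualitative behaviour described for lengthNorm, coord, and queryNorm can be illustrated with concrete functions. The patent gives no closed forms; the forms below are an assumption borrowed from the classic TF-IDF similarity of the Lucene search library, whose terminology the definitions resemble, and the Python function names are hypothetical.

```python
import math


def length_norm(num_tokens: int) -> float:
    """Assumed form: shrinks as the field gets longer."""
    return 1.0 / math.sqrt(num_tokens)


def coord(overlap: int, max_overlap: int) -> float:
    """Assumed form: fraction of the query terms that the document contains."""
    return overlap / max_overlap


def query_norm(sum_of_squared_weights: float) -> float:
    """Assumed form: makes scores from different queries roughly comparable."""
    return 1.0 / math.sqrt(sum_of_squared_weights)
```

With these forms a 100-token field receives a smaller lengthNorm than a 10-token field, and a document containing all three terms of a query receives a larger coord than one containing a single term, which agrees with the behaviour stated in the text.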

Example: "According to the standard proposed by the Chinese Guidelines for the Prevention and Treatment of Hypertension in October 1999, a normal adult who is not taking antihypertensive drugs and whose blood pressure, measured on two or more separate occasions, shows a systolic pressure ≥ 140 mmHg and/or a diastolic pressure ≥ 90 mmHg is diagnosed with hypertension (Table 1)."

Searching for "hypertension": the term occurs 3 times in the text, and the default Boost is 1.0.

score(q,d)=3*1.8351948*1.0*0.036002904*3*0.066072345

=0.3929
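The arithmetic of the worked example can be checked directly. The labels attached to the six factors below are an assumption: they simply follow the order of the factors in the scoring formula. Multiplying the factors exactly as listed gives approximately 0.0393, so a reader comparing against a result printed as 0.3929 should suspect a shifted decimal point or a mistranscribed factor.

```python
import math

# Factor labels assigned by position in the formula (an assumption).
factors = {
    "tf": 3,
    "idf": 1.8351948,
    "getBoost": 1.0,
    "lengthNorm": 0.036002904,
    "coord": 3,
    "queryNorm": 0.066072345,
}

score = math.prod(factors.values())
print(round(score, 4))  # prints 0.0393
```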

2.2. Weighting (boots) formula

score_d = score(q,d) * boots
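Applying the weighting formula with the weights from section 1.2.2 can be sketched as follows. The dictionary keys and the function name `weighted_score` are hypothetical; only the four weight values and the multiplication itself come from the text.

```python
# Weights from section 1.2.2; the key names are illustrative.
BOOTS = {
    "index_term": 2.0,
    "feature_word": 1.4,
    "content": 1.1,
    "overview": 1.3,
}


def weighted_score(base_score: float, field: str) -> float:
    """score_d = score(q, d) * boots, with boots chosen by the matching field."""
    return base_score * BOOTS[field]
```

With equal base scores, a match on an index term therefore ranks above a match in the overview, which in turn ranks above a match in the body content.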

2.3. Retrieval process

2.3.1. Extracting records

After the user enters a keyword, the system looks it up in the index database, scores the records related to the keyword according to the scoring and weighting rules, sorts them by score from high to low, and extracts the top 100 records that match the user's level of knowledge. Experimental analysis showed that records ranked after the 100th are only weakly related to the keyword, so only the top 100 records are extracted in order to improve retrieval efficiency.
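The extraction step can be sketched as a filter, a sort, and a cut-off. The record layout, the function name `extract_top_records`, and the level-matching rule (difficulty within one level of the user's grade) are assumptions; the limit of 100 comes from the text.

```python
def extract_top_records(hits, user_level, limit=100, tolerance=1):
    """Keep hits that suit the user's level, sort by score (high to low), cut at `limit`."""
    suitable = [h for h in hits if abs(h["difficulty"] - user_level) <= tolerance]
    suitable.sort(key=lambda h: h["score"], reverse=True)
    return suitable[:limit]


# Hypothetical scored hits: 500 records with synthetic scores and difficulties.
hits = [{"id": i, "score": i / 1000, "difficulty": 1 + i % 5} for i in range(500)]
top = extract_top_records(hits, user_level=2)
```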

2.3.2. Filtering duplicate records

The extracted records are filtered by document name, content category, and document ID, and duplicate records are removed.
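A minimal de-duplication sketch follows. It assumes that two records are duplicates when document name, content category, and document ID all coincide; the text names these three attributes but does not state how they are combined, and the field names are hypothetical.

```python
def remove_duplicates(records):
    """Keep the first record for each (title, category, doc_id) key, preserving rank order."""
    seen = set()
    unique = []
    for rec in records:
        key = (rec["title"], rec["category"], rec["doc_id"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique


# Hypothetical ranked records; the third repeats the first.
ranked = [
    {"title": "Hypertension", "category": "guideline", "doc_id": 7, "score": 0.9},
    {"title": "Hypertension", "category": "review", "doc_id": 8, "score": 0.8},
    {"title": "Hypertension", "category": "guideline", "doc_id": 7, "score": 0.5},
]
deduped = remove_duplicates(ranked)
```

Because the input is already sorted by score, keeping the first occurrence keeps the highest-scoring copy.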

2.3.3. Classifying search results

The search results are classified by subject and presented to the user. Results in the subject most relevant to the keyword are shown first.
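Grouping results by subject and ordering the groups can be sketched as below. How the relevance of a whole subject is measured is not stated, so this sketch assumes a subject is as relevant as its best-scoring record; the record layout and the name `group_by_subject` are likewise hypothetical.

```python
from collections import defaultdict


def group_by_subject(records):
    """Return [(subject, records)] with the most relevant subject first."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["subject"]].append(rec)
    for recs in groups.values():
        recs.sort(key=lambda r: r["score"], reverse=True)
    # Assumption: rank each subject by the score of its best record.
    return sorted(groups.items(), key=lambda kv: kv[1][0]["score"], reverse=True)


# Hypothetical scored records.
results = [
    {"id": "a", "subject": "cardiology", "score": 0.9},
    {"id": "b", "subject": "nephrology", "score": 0.4},
    {"id": "c", "subject": "cardiology", "score": 0.2},
    {"id": "d", "subject": "nephrology", "score": 0.7},
]
grouped = group_by_subject(results)
```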

2.3.4. Presenting medical knowledge

When the user finds the needed content in the search result list and clicks to view the details, the page jumps directly to the position of the search keyword in the text.
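Locating the keyword so that a page can scroll to it can be sketched with a plain substring search. The function name `keyword_positions` is hypothetical, and a real system might read the positions from the index instead of rescanning the text.

```python
def keyword_positions(text: str, keyword: str) -> list:
    """Return the character offsets of every occurrence of `keyword` in `text`."""
    positions = []
    start = text.find(keyword)
    while start != -1:
        positions.append(start)
        start = text.find(keyword, start + 1)
    return positions


offsets = keyword_positions("hypertension and hypertension", "hypertension")
```

The first offset is the natural scroll target; the remaining offsets could be used to highlight further occurrences.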

Claims

1. A method for establishing a medical database, characterized in that it comprises the following steps: (1) converting source documents into full-text data that supports text searching, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-feature index, and d. a difficulty scoring index; (2) selecting some of these retrieval indexes and assigning weights to them, and building a weighted scoring and ranking program based on the frequency of occurrence of the search terms.

2. The method for establishing a medical database according to claim 1, characterized in that the annotation index (b) comprises an indexing-term index and a feature-word index; an index term consists of two parts, a descriptor and a subheading; the descriptor is a term equivalent in meaning to certain vocabulary in the full-text content; the subheading is a word denoting the medical field, including the medical discipline, to which the descriptor belongs; and a feature word is a word that characterizes the patient category.

3. The method for establishing a medical database according to claim 1, characterized in that the external features covered by the document external-feature index (c) comprise the title, the author's name, the publisher, and an identification number.

4. The method for establishing a medical database according to claim 1, characterized in that the retrieval indexes to which weights are assigned are the index terms of the indexing-term index, the feature words of the feature-word index, and the content and the overview in the full-text retrieval.

5. The method for establishing a medical database according to claim 4, characterized in that the weights of the index term, feature word, content, and overview are 2, 1.4, 1.1, and 1.3, respectively.

6. The method for establishing a medical database according to claim 5, characterized in that the scoring formula of the weighted scoring and ranking program based on the frequency of occurrence of the search terms is:

score(q, d): the score;

getBoost(t.field in d): the boost factor for the field of the search term;