CN102024027B

CN102024027B - Method for establishing medical database

Info

Publication number: CN102024027B
Application number: CN 201010547722
Authority: CN
Inventors: 成飞; 史彤毅; 高瞻
Original assignee: Beijing Health Online Network Technology Co Ltd
Current assignee: Beijing Health Online Network Technology Co Ltd
Priority date: 2010-11-17
Filing date: 2010-11-17
Publication date: 2013-03-20
Anticipated expiration: 2030-11-17
Also published as: CN102024027A

Abstract

The invention provides a method for establishing a medical database, comprising the following steps: (1) converting source documents into full-text data that can be text-searched, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-feature index, and d. a difficulty-rating index; and (2) selecting some of the retrieval indexes and assigning weights to them, so as to build a weighted scoring and ranking program based on the frequency with which search terms occur. A medical database established by this method can retrieve medical literature that is highly relevant and suited to the user's level of knowledge. The method can greatly improve retrieval efficiency and is convenient for users with different levels of medical knowledge.

Description

Method for establishing a medical database

Technical field

The present invention relates to a method for establishing a medical database (that is, a medical literature retrieval system).

Background technology

Existing medical databases are all built on the most conventional retrieval methods, most of which retrieve by keyword or use Boolean operations for simple combined retrieval. Such retrieval, however, returns a very large number of results and offers no way to screen further for the most relevant literature; screening manually is time-consuming and laborious. Although the number of results can be reduced by adding more search terms, highly relevant documents are then easily missed.

In addition, different users understand medical knowledge to different degrees. A more relevant document is not necessarily more useful to a user; usefulness also depends on the user's level of knowledge, and existing medical databases take no measures to address this. The professional level of medical personnel in China varies widely. If a medical database could serve primary-level medical personnel, letting them easily retrieve medical literature matched to their own level of knowledge, it would help raise their standard of work.

Summary of the invention

The object of the present invention is to provide a method for establishing a medical database that can accurately retrieve medical literature suited to the user's level of knowledge.

The method for establishing a medical database provided by the invention comprises the following steps: (1) converting source documents into full-text data that can be searched by character string, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-feature index, d. a difficulty-rating index; (2) selecting some of the retrieval indexes and assigning weights to them, to build a weighted scoring and ranking program based on search-term occurrence frequency.

Preferably, the b. annotation index comprises an indexing-term index and a feature-word index. An indexing term consists of two parts, a descriptor and a subheading: the descriptor is an equivalent explanatory term for certain vocabulary in the full-text content, and the subheading is a word denoting the medical field, including the medical subject, to which the descriptor belongs. A feature word is a word that characterizes a patient category.

The external features covered by the c. document external-feature index may include title, author name, publisher, and identification number.

Preferably, the retrieval indexes to which weights are assigned are the index terms of the indexing-term index, the feature words of the feature-word index, and the content and the overview in the full-text retrieval index. The weights of the index terms, feature words, content, and overview may be 2, 1.4, 1.1, and 1.3, respectively.

Preferably, the scoring formula of the weighted scoring and ranking program based on search-term occurrence frequency is:

score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q)), where

score(q, d): the score of document d for query q;

tf(t in d): a score factor based on the number of times the search term or phrase occurs in the document;

idf(t): a score factor for a single search term with respect to a particular index;

getBoost(t.field in d): the boost factor of the field in which the search term occurs;

lengthNorm(t.field in d): a normalization value for a given field, based on the total number of terms it contains;

coord(q, d): a score factor based on the fraction of all query terms that the document contains;

queryNorm(q): a normalization value for a given query, based on the sum of the weights of all its query terms.

A medical database established by the method of the invention can retrieve, according to the user's level of knowledge, medical literature that is highly relevant and suited to that user. It can greatly improve retrieval efficiency and is convenient for users with different levels of medical knowledge.

Embodiment

To illustrate the invention more clearly, one embodiment is described below, together with the way a medical database established according to this embodiment is used.

1. Building the indexes

1.1. Index fields of a record

1.1.1. Full-text retrieval index

To build a full-text retrieval system, the source documents must first be converted into a full-text database that can be searched by character string. This preprocessing includes segmenting the full text and normalizing its format. Once preprocessing is complete, indexing can begin: typesetting symbols, format control characters, and the like are first filtered out of the source documents, and then the position at which every character, word, and phrase occurs in the source documents is recorded in the index database.
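The indexing step described above can be sketched as a positional inverted index. This is only a minimal Python illustration and not the implementation of the embodiment: the function name `build_fulltext_index`, the pattern used to strip typesetting symbols, and the simple word tokenization are all assumptions (Chinese text would need a proper word segmenter).

```python
import re
from collections import defaultdict

# Assumed pattern for typesetting symbols and format control characters (hypothetical).
MARKUP = re.compile(r"<[^>]+>|[\x00-\x08\x0b\x0c\x0e-\x1f]")

def build_fulltext_index(docs):
    """Map each word to {doc_id: [positions]} after stripping markup."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        clean = MARKUP.sub(" ", text)
        # Record the position of every occurrence of every word.
        for pos, word in enumerate(re.findall(r"\w+", clean.lower())):
            index[word][doc_id].append(pos)
    return index

index = build_fulltext_index({"d1": "<b>Hypertension</b> treatment of hypertension"})
print(dict(index["hypertension"]))  # {'d1': [0, 3]}
```

Storing positions rather than only document identifiers is what later allows a result page to jump to the place where the keyword occurs.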

1.1.2. Indexing-term index and feature-word index

The index terms of the full-text retrieval index all come from the document content. When the keyword a user enters is a synonym or alias of a word in the content, the full-text index cannot lead the user to the content needed, so full-text retrieval alone cannot satisfy every user requirement. For this reason, each full-text document is split into the smallest paragraphs that carry independent medical meaning, and each such paragraph is indexed: its descriptors and subheadings are assigned to build the indexing-term index, and its feature words are extracted to build the feature-word index.

An indexing term consists of two parts, a descriptor and a subheading.

Descriptor: a word that has been standardized and refined from natural language and that reflects a biomedical concept. A descriptor denotes a specific concept and can express a medical concept on its own.

Subheading: also called a qualifier. It restricts a descriptor, that is, it emphasizes a particular aspect of the concept the descriptor represents. A subheading denotes a general concept, carries little information, and cannot be used alone; it must be combined with a descriptor. Examples: pathology, pharmacology, therapy, diagnosis, drug therapy, rehabilitation, complications.

Feature word: a word or phrase with special significance that clinicians, biomedical researchers, and medical educators are interested in and often encounter. Examples: male, female, infant, child, elderly, pregnancy.
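As an illustration of how descriptors, subheadings, and feature words might be attached to a paragraph and used to resolve synonyms, consider the following minimal Python sketch. The `Paragraph` class, the `SYNONYMS` table, and the example terms are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass, field

@dataclass
class Paragraph:
    """A minimal medically meaningful paragraph with its annotations."""
    para_id: str
    text: str
    descriptors: dict = field(default_factory=dict)   # descriptor -> list of subheadings
    feature_words: set = field(default_factory=set)

# Hypothetical synonym table mapping user vocabulary to a descriptor.
SYNONYMS = {"high blood pressure": "hypertension", "htn": "hypertension"}

def search_annotations(paragraphs, query, subheading=None):
    """Return ids of paragraphs indexed under the descriptor matching the query."""
    descriptor = SYNONYMS.get(query.lower(), query.lower())
    hits = []
    for p in paragraphs:
        subs = p.descriptors.get(descriptor)
        # A subheading, when given, narrows the match to one aspect of the descriptor.
        if subs is not None and (subheading is None or subheading in subs):
            hits.append(p.para_id)
    return hits

paras = [Paragraph("p1", "Systolic pressure of 140 mmHg or more ...",
                   {"hypertension": ["diagnosis"]}, {"elderly"})]
print(search_annotations(paras, "high blood pressure", "diagnosis"))  # ['p1']
```

The query "high blood pressure" never appears in the paragraph text, yet the paragraph is found through its descriptor, which is the gap in full-text retrieval that this index is meant to close.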

1.1.3. Document external-feature index

The external features of a document form a kind of document retrieval language. Retrieval by external features mainly means retrieving documents by title, author name, publisher, report number, patent number, and similar items. Documents are arranged by the alphabetical order of titles or author names, or by the numerical order of report numbers or patent numbers, giving a retrieval language that meets user needs through title, author, and number access points.

To help users find what they need more quickly and accurately, the database provides an advanced search function with which users can query by external features such as title and publisher. Each of these features is indexed in two ways at once, tokenized and untokenized, which safeguards both the recall and the precision of user searches.
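The idea of indexing an external feature both tokenized and untokenized can be sketched as follows. This is an assumption-laden Python illustration: the function name `index_field`, the record layout, and whitespace tokenization are hypothetical.

```python
from collections import defaultdict

def index_field(records, field_name):
    """Build an exact-match (untokenized) and a word-level (tokenized) index for one field."""
    exact, tokens = defaultdict(set), defaultdict(set)
    for rec_id, rec in records.items():
        value = rec[field_name].lower()
        exact[value].add(rec_id)        # whole value as one key: favors precision
        for word in value.split():
            tokens[word].add(rec_id)    # each word as a key: favors recall
    return exact, tokens

records = {1: {"title": "Hypertension Guidelines"}, 2: {"title": "Guidelines for Diabetes"}}
exact, tokens = index_field(records, "title")
print(sorted(tokens["guidelines"]))              # [1, 2]
print(sorted(exact["hypertension guidelines"]))  # [1]
```

A query can consult the exact index first and fall back to the tokenized index when no exact match exists.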

1.1.4. Difficulty-rating index

Medical editors score each item according to the difficulty of its content; after approval by medical experts, the scores are indexed. After a user registers, medical editors assign a rough rating of the user's level of knowledge based on the registration information. The user's rating and the difficulty-rating index together help the user easily find medical knowledge that matches his or her level.
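Matching a user's rating against the difficulty scores might look like the sketch below. The source does not state the rating scale or the matching rule, so the integer scale, the `tolerance` parameter, and the function name `matches_level` are all assumptions made for illustration.

```python
def matches_level(doc_difficulty, user_level, tolerance=0):
    """True when the document's difficulty score suits the user's rated level."""
    return abs(doc_difficulty - user_level) <= tolerance

# Hypothetical (title, difficulty score) pairs on an assumed 1-3 scale.
docs = [("basic care", 1), ("case review", 2), ("research trial", 3)]
print([title for title, score in docs if matches_level(score, user_level=2)])
# ['case review']
```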

1.2. Weight assignment

1.2.1. Weighted objects

There are two kinds of weighted object: Field and Document. Fields include: index term, feature word, specific content, title, publisher, and subject.

1.2.2. Weight settings

Based on surveys of user needs and analysis of user psychology, different weight settings were tested with the aim of helping users find the content they need most. The results are as follows:

Index term boost = 2

Feature word boost = 1.4

Content boost = 1.1

Document (specifically the "overview") boost = 1.3

2. Search

2.1. Scoring

The full-text content is first segmented into words, and the frequency with which each word occurs in the text is scored by the following formula:

score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q))

where:

tf(t in d): a score factor based on the number of times the search term or phrase occurs in the document.

idf(t): a score factor for a single search term with respect to a particular index.

getBoost(t.field in d): the boost factor of the field in which the search term occurs.

lengthNorm(t.field in d): a normalization value for a given field, based on the total number of terms it contains. These values are stored in the index together with the field boost, and the search code multiplies them into the score of each field of each search result.

Matches in longer fields are less precise, so an implementation usually returns smaller values when numTokens is large and larger values when numTokens is small.

coord(q, d): a score factor based on the fraction of all query terms that the document contains.

The presence of a large portion of the query terms indicates a better match with the query, so an implementation usually returns larger values when the proportion of query terms contained in the document is large and smaller values when it is small.

queryNorm(q): a normalization value for a given query, based on the sum of the weights of all its query terms. This value is multiplied into the weight of each query term.
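The source does not give formulas for the individual factors. As one plausible choice, the sketch below uses the definitions of Apache Lucene's classic TF-IDF similarity for lengthNorm and coord; this is an assumption, and the embodiment may compute them differently. The parameter names `num_tokens`, `overlap`, and `max_overlap` are illustrative.

```python
import math

def length_norm(num_tokens):
    """Smaller value for longer fields: 1 / sqrt(numTokens)."""
    return 1.0 / math.sqrt(num_tokens)

def coord(overlap, max_overlap):
    """Fraction of the query's terms that the document contains."""
    return overlap / max_overlap

print(length_norm(4), length_norm(100))  # 0.5 0.1
print(coord(2, 4))                       # 0.5
```

Both functions show the behavior described above: the length normalization shrinks as the field grows, and the coordination factor grows with the share of query terms matched.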

Example: "According to the standard proposed in the Chinese Guidelines for the Prevention and Treatment of Hypertension of October 1999, a normal adult who is not taking antihypertensive drugs and whose blood pressure, measured on two or more separate occasions, shows a systolic pressure ≥ 140 mmHg and/or a diastolic pressure ≥ 90 mmHg is diagnosed with hypertension (Table 1)."

A search for "hypertension" finds that the word occurs 3 times in the text; the default boost is 1.0.

score(q, d) = 3 * 1.8351948 * 1.0 * 0.036002904 * 3 * 0.066072345

≈ 0.0393
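The arithmetic of this worked example can be checked with a few lines of Python. The function name `score_term` is hypothetical, and the assignment of each number to a factor assumes the values are listed in the same order as the factors in the scoring formula. Multiplying the six listed factors gives approximately 0.0393.

```python
def score_term(tf, idf, boost, length_norm, coord, query_norm):
    """Multiply the six factors of the scoring formula for a single query term."""
    return tf * idf * boost * length_norm * coord * query_norm

# Factor values from the worked example, assumed to be in formula order.
s = score_term(3, 1.8351948, 1.0, 0.036002904, 3, 0.066072345)
print(round(s, 4))  # 0.0393
```

Because the query contains a single term, the sum in the formula has only one summand.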

2.2. Weighting (boost) formula

score_d = score(q, d) * boost
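Applying the weighting formula with the boost values reported in the weight settings can be sketched as follows. The field keys and the function name `weighted_score` are hypothetical, and treating unlisted fields as having a neutral boost of 1.0 is an assumption.

```python
# Boost values reported in the weight settings of this embodiment.
BOOSTS = {"index_term": 2.0, "feature_word": 1.4, "content": 1.1, "overview": 1.3}

def weighted_score(base_score, field_name):
    """score_d = score(q, d) * boost; unlisted fields get an assumed neutral boost of 1.0."""
    return base_score * BOOSTS.get(field_name, 1.0)

print(weighted_score(0.5, "index_term"))  # 1.0
```

With these settings a match on an index term counts almost twice as much as the same match in the body content.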

2.3. Retrieval process

2.3.1. Extracting records

After the user enters a keyword, the system looks it up in the index database, scores the records related to the keyword according to the scoring and weighting rules, sorts them by score from high to low, and extracts the top 100 records that match the user's level of knowledge. Experimental analysis showed that records beyond the first 100 are poorly related to the keyword, so only the top 100 records are extracted in order to improve retrieval efficiency.
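The extraction step can be sketched in Python as a filter followed by a top-N selection. The record layout, the function name `top_records`, and the rule that a record matches when its difficulty equals the user's level are assumptions; the source fixes only the limit of 100.

```python
import heapq

def top_records(scored, user_level, limit=100):
    """Keep records matching the user's level and return the highest-scoring ones."""
    eligible = [r for r in scored if r["difficulty"] == user_level]
    # nlargest returns the records sorted by score from high to low.
    return heapq.nlargest(limit, eligible, key=lambda r: r["score"])

scored = [
    {"id": "a", "score": 0.9, "difficulty": 2},
    {"id": "b", "score": 0.4, "difficulty": 1},
    {"id": "c", "score": 0.7, "difficulty": 1},
]
print([r["id"] for r in top_records(scored, user_level=1)])  # ['c', 'b']
```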

2.3.2. Filtering duplicate records

The extracted records are filtered by document title, content category, and document ID, and duplicate records are removed.
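Duplicate filtering can be sketched as below. The source does not say whether the three attributes are compared jointly or separately; this Python sketch assumes that two records are duplicates only when title, category, and document ID all coincide. The key names and the function name `drop_duplicates` are hypothetical.

```python
def drop_duplicates(records):
    """Keep the first record for each (title, category, doc_id) key, preserving order."""
    seen, unique = set(), []
    for rec in records:
        key = (rec["title"], rec["category"], rec["doc_id"])
        if key not in seen:
            seen.add(key)
            unique.append(rec)
    return unique

records = [
    {"title": "Hypertension", "category": "cardiology", "doc_id": 7},
    {"title": "Hypertension", "category": "cardiology", "doc_id": 7},
    {"title": "Hypertension", "category": "cardiology", "doc_id": 8},
]
print(len(drop_duplicates(records)))  # 2
```

Because the records arrive sorted by score, keeping the first occurrence also keeps the ranking intact.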

2.3.3. Classifying search results

Search results are classified by subject and presented to the user. Results in the subject most relevant to the keyword are shown first.
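Grouping results by subject might be sketched as follows. The source does not define how the relevance of a subject is measured, so this Python sketch assumes it is the best score among the subject's records; the record layout and the function name `group_by_subject` are hypothetical.

```python
from collections import defaultdict

def group_by_subject(records):
    """Group records by subject; list subjects by their best score, highest first."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["subject"]].append(rec)
    return sorted(groups.items(),
                  key=lambda kv: max(r["score"] for r in kv[1]),
                  reverse=True)

records = [
    {"subject": "nephrology", "score": 0.5},
    {"subject": "cardiology", "score": 0.9},
    {"subject": "cardiology", "score": 0.2},
]
print([subject for subject, _ in group_by_subject(records)])  # ['cardiology', 'nephrology']
```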

2.3.4. Presenting medical knowledge

When the user finds the needed content in the result list and clicks to view the details, the page jumps directly to the position of the search keyword in the text.

Claims

1. A method for establishing a medical database, characterized by comprising the following steps: (1) converting source documents into full-text data that can be searched by character string, and establishing the following retrieval indexes: a. a full-text retrieval index, b. an annotation index, c. a document external-feature index, d. a difficulty-rating index; (2) selecting some of the retrieval indexes and assigning weights to them, to build a weighted scoring and ranking program based on search-term occurrence frequency,

wherein the b. annotation index comprises an indexing-term index and a feature-word index; an indexing term consists of two parts, a descriptor and a subheading; the descriptor is an equivalent explanatory term for certain vocabulary in the full-text content, and the subheading is a word denoting the medical field, including the medical subject, to which the descriptor belongs; and a feature word is a word that characterizes a patient category.

2. The method for establishing a medical database according to claim 1, characterized in that the external features covered by the c. document external-feature index comprise title, author name, publisher, and identification number.

3. The method for establishing a medical database according to claim 1, characterized in that the retrieval indexes to which weights are assigned are the index terms of the indexing-term index, the feature words of the feature-word index, and the content and the overview in the full-text retrieval index.

4. The method for establishing a medical database according to claim 3, characterized in that the weights of the index terms, feature words, content, and overview are 2, 1.4, 1.1, and 1.3, respectively.