Summary of the invention
The purpose of this invention is to provide a method for building a medical database that can accurately retrieve medical literature suited to the user's level of knowledge.
The method for building a medical database provided by the invention comprises the following steps: (1) converting source documents into full-text data that supports string searching, and building the following search indexes: a. a full-text search index, b. an annotative index, c. a document surface index, d. a complexity scoring index; (2) selecting some of the search indexes and assigning weights to them, and building a weighted scoring and ranking program based on term occurrence frequency.
Preferably, the b. annotative index comprises an indexing-term index and a feature-word index. Each indexing term consists of two parts, a descriptor and a subheading: the descriptor is an equivalent explanatory word for certain vocabulary in the full-text content, and the subheading is a word denoting a medical field, covering the medical disciplines under the descriptor. A feature word is a word that characterizes a patient category.
The document surface features covered by the c. document surface index may include the title, author name, publisher, and identification number.
Preferably, the search indexes that are assigned weights are the index terms of the indexing-term index, the feature words of the feature-word index, the content in the full-text search, and the general introduction. The weights of the index terms, feature words, content, and general introduction may be 2, 1.4, 1.1, and 1.3, respectively.
Preferably, the scoring formula of the weighted scoring and ranking program based on term occurrence frequency is:
score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q)), where
score(q, d): the score of document d for query q;
tf(t in d): a score factor based on the number of times the search term or phrase occurs in the document;
idf(t): a score factor for a single search term with respect to a particular index;
getBoost(t.field in d): the boost (gain) factor of the field in which the search term occurs;
lengthNorm(t.field in d): for a given field, the normalization value computed from the total number of terms the field contains;
coord(q, d): a score factor based on the fraction of all query terms that the document contains;
queryNorm(q): for a given query, the normalization value computed from the sum of the weights of all query terms.
A medical database built with the method provided by the invention can retrieve, according to the user's knowledge level, medical literature that is highly relevant and suitable for that user. This greatly improves retrieval efficiency and makes the database convenient for users with different levels of medical knowledge.
Embodiment
To illustrate the invention more clearly, an embodiment is described below, together with the way the medical database built according to this embodiment is used.
1. building the indexes
1.1. index fields of a record
1.1.1. full-text search index
To set up a full-text retrieval system, the source documents must first be converted into a full-text database that supports string searching. This preprocessing includes segmenting the full text and normalizing its format. Once preprocessing is complete, indexing can begin: typesetting symbols, format control characters, and the like are first filtered out of the source documents, and then the position of every occurrence of each character, word, and phrase in the source documents is recorded in the index database.
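The indexing step described above can be sketched as a positional inverted index. This is a minimal illustration, not the embodiment's implementation: the control-character pattern, the lowercase word tokenizer, and the names `CONTROL_CHARS` and `build_positional_index` are assumptions introduced here, and a real system for Chinese text would need a proper word segmenter.

```python
import re
from collections import defaultdict

# Hypothetical set of format-control characters to strip before indexing.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f]")

def build_positional_index(documents):
    """Map each token to {doc_id: [token positions]} across all documents."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in documents.items():
        cleaned = CONTROL_CHARS.sub(" ", text)
        # Simplifying assumption: a token is a lowercase run of word characters.
        for position, match in enumerate(re.finditer(r"\w+", cleaned.lower())):
            index[match.group()][doc_id].append(position)
    return index

docs = {
    "d1": "Hypertension is diagnosed when systolic pressure is high.",
    "d2": "Treatment of hypertension\x0c depends on the patient.",
}
index = build_positional_index(docs)
print(dict(index["hypertension"]))  # {'d1': [0], 'd2': [2]}
```

Storing positions rather than only document identifiers is what later allows a result page to jump to the place where a keyword occurs.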
1.1.2. indexing-term index and feature-word index
The index terms of full-text search all come from the literature content itself. When the keyword a user enters is a synonym or alias of a word in the content, the full-text index cannot lead the user to the content needed, so full-text search alone cannot satisfy all user needs. For this reason, each full-text document is split into the smallest paragraphs that have independent medical meaning, and each such paragraph is indexed: its feature words, descriptors, and subheadings are extracted. The descriptors and subheadings are used to build the indexing-term index, and the feature words are used to build the feature-word index.
An indexing term consists of two parts: a descriptor and a subheading.
Descriptor: a word or phrase derived from natural language through normalization and optimization that reflects a biomedical concept. A descriptor denotes a specific concept and can express a medical concept on its own.
Subheading: also called a qualifier. It restricts a descriptor, that is, it emphasizes a specific aspect of the concept the descriptor represents. A subheading denotes a general concept, carries little information, and cannot be used alone; it must be combined with a descriptor. Examples: pathology, pharmacology, therapy, diagnosis, drug therapy, rehabilitation, complications.
Feature words: words or phrases with special meaning that clinicians, biomedical researchers, and medical teaching staff are interested in and frequently encounter. Examples: male, female, infant, child, elderly, pregnancy.
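The relationship between descriptors, subheadings, and feature words can be sketched as follows. The `Paragraph` record, the `SYNONYMS` table, and `find_paragraphs` are hypothetical names introduced for illustration; the text does not say how synonyms are mapped to descriptors, so a plain lookup table is assumed.

```python
from dataclasses import dataclass, field

@dataclass
class Paragraph:
    para_id: str
    # (descriptor, subheading) pairs assigned when the paragraph was indexed
    index_terms: list = field(default_factory=list)
    feature_words: list = field(default_factory=list)

# Hypothetical synonym table mapping user wording to a descriptor.
SYNONYMS = {"high blood pressure": "hypertension", "htn": "hypertension"}

def find_paragraphs(paragraphs, keyword, subheading=None, feature_word=None):
    """Return ids of paragraphs whose indexing terms match the keyword."""
    descriptor = SYNONYMS.get(keyword.lower(), keyword.lower())
    hits = []
    for p in paragraphs:
        term_match = any(
            d == descriptor and (subheading is None or s == subheading)
            for d, s in p.index_terms
        )
        feature_match = feature_word is None or feature_word in p.feature_words
        if term_match and feature_match:
            hits.append(p.para_id)
    return hits

paragraphs = [
    Paragraph("p1", [("hypertension", "diagnosis")], ["elderly"]),
    Paragraph("p2", [("hypertension", "drug therapy")], ["pregnancy"]),
    Paragraph("p3", [("diabetes", "diagnosis")], ["child"]),
]
print(find_paragraphs(paragraphs, "high blood pressure"))           # ['p1', 'p2']
print(find_paragraphs(paragraphs, "HTN", subheading="drug therapy"))  # ['p2']
```

The first query succeeds even though "high blood pressure" never appears in the indexed terms, which is the gap in plain full-text search that the indexing-term index is meant to close.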
1.1.3. document surface index
Document surface features form a kind of document retrieval language. Retrieval by surface features mainly means searching on items such as a document's title, author name, publisher, report number, and patent number. Documents are arranged in alphabetical order by title or author name, or in numerical order by report number or patent number, which yields a retrieval language with title, author, and number access points that meets user needs.
To help users find the content they need more quickly and accurately, the database provides an advanced search function that lets users query by document surface features such as title and publisher. These surface features are indexed in two ways at the same time, with word segmentation and without word segmentation, which ensures both recall and precision for user searches.
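A minimal sketch of indexing one surface field in both ways is shown below. The function name `index_surface_field` and the whitespace tokenizer are assumptions for illustration only.

```python
from collections import defaultdict

def index_surface_field(records, field_name):
    """Index one surface field both as a whole string and as separate words."""
    exact = defaultdict(set)      # unsegmented: whole field value -> record ids
    tokenized = defaultdict(set)  # segmented: each word -> record ids
    for rec_id, rec in records.items():
        value = rec[field_name].lower()
        exact[value].add(rec_id)
        for word in value.split():
            tokenized[word].add(rec_id)
    return exact, tokenized

records = {
    1: {"title": "Internal Medicine"},
    2: {"title": "Emergency Medicine Handbook"},
}
exact, tokenized = index_surface_field(records, "title")
print(sorted(tokenized["medicine"]))       # [1, 2]: broad match, favors recall
print(sorted(exact["internal medicine"]))  # [1]: whole-title match, favors precision
```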
1.1.4. complexity scoring index
Medical editors score each item according to the complexity of its content; after approval by medical experts, the scores are indexed. After a user registers, medical editors assign a rough grade to the user's knowledge level based on the registration information. The user's grade and the complexity scoring index together help the user easily find medical knowledge that matches his or her knowledge level.
1.2. weight allocation
1.2.1. weighted objects
Weighted objects are of two kinds: Field and Document. Fields include: indexing terms, feature words, specific content, title, publisher, and subject.
1.2.2. weight settings
Based on a survey of user needs and an analysis of user psychology, different weight settings were tested in order to find the ones that best help users locate the content they need most. The results are as follows:
Index term boots = 2
Feature word boots = 1.4
Content boots = 1.1
Document (specifically the "general introduction") boots = 1.3
2. search
2.1. scoring
The full-text content is first segmented into words, and each word is then scored by its frequency of occurrence in the text according to the following formula:
score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q))
where:
tf(t in d): a score factor based on the number of times the search term or phrase occurs in the document.
idf(t): a score factor for a single search term with respect to a particular index.
getBoost(t.field in d): the boost (gain) factor of the field in which the search term occurs.
lengthNorm(t.field in d): for a given field, the normalization value computed from the total number of terms the field contains. These values are stored in the index together with the field boosts, and the search code multiplies them into the score of each field for each search hit.
Matches in longer fields are less precise, so this method usually returns a smaller value when numTokens is large and a larger value when numTokens is small.
coord(q, d): a score factor based on the fraction of all query terms that the document contains.
A document that contains most of the query terms is a better match for the query, so this method usually returns a larger value when this fraction is large and a smaller value when it is small.
queryNorm(q): for a given query, the normalization value computed from the sum of the weights of all query terms. This value is multiplied into the weight of each query term.
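The formula can be sketched in code as below. The text names the factors but does not give their definitions, so every concrete definition here is an assumption chosen for illustration: a raw count for tf, `1 + ln(N / (df + 1))` for idf, `1 / sqrt(numTokens)` for lengthNorm, the matched fraction for coord, and the inverse root of the summed squared idf weights for queryNorm.

```python
import math

def score(query_terms, doc_tokens, doc_freq, num_docs, field_boost=1.0):
    """Sum the per-term product from the formula for one document field."""
    matched = [t for t in query_terms if t in doc_tokens]
    if not matched:
        return 0.0
    # coord: fraction of query terms present in the field (assumed definition)
    coord = len(matched) / len(query_terms)
    # lengthNorm: shorter fields score higher (assumed 1/sqrt(numTokens))
    length_norm = 1.0 / math.sqrt(len(doc_tokens))
    # idf: rarer terms score higher (assumed 1 + ln(N / (df + 1)))
    idf = {t: 1.0 + math.log(num_docs / (doc_freq.get(t, 0) + 1)) for t in query_terms}
    # queryNorm: normalizes by the query's total term weight (assumed form)
    query_norm = 1.0 / math.sqrt(sum(w * w for w in idf.values()))
    total = 0.0
    for t in matched:
        tf = doc_tokens.count(t)  # raw occurrence count (assumed)
        total += tf * idf[t] * field_boost * length_norm * coord * query_norm
    return total

doc_freq = {"hypertension": 10}
short_field = ["hypertension", "diagnosis"]
long_field = ["hypertension"] + ["filler"] * 99
s_short = score(["hypertension"], short_field, doc_freq, num_docs=1000)
s_long = score(["hypertension"], long_field, doc_freq, num_docs=1000)
print(s_short > s_long)  # True: the same match in a shorter field ranks higher
```

The comparison at the end shows the lengthNorm behavior described above: one occurrence in a two-token field outranks one occurrence in a hundred-token field.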
Example: "According to the standard proposed in the Chinese guidelines for hypertension prevention and treatment of October 1999: in a normal adult not taking antihypertensive drugs, if blood pressure measured on two or more occasions at different times shows systolic pressure ≥ 140 mmHg and/or diastolic pressure ≥ 90 mmHg, hypertension is diagnosed (Table 1)."
Searching for "hypertension": the term "hypertension" appears 3 times in the text, and the default Boost is 1.0.
score(q, d) = sum(tf(t in d) * idf(t) * getBoost(t.field in d) * lengthNorm(t.field in d) * coord(q, d) * queryNorm(q))
score(q, d) = 3 * 1.8351948 * 1.0 * 0.036002904 * 3 * 0.066072345
≈ 0.0393
2.2. weighting (boots) formula
score_d = score(q, d) * boots
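As a minimal sketch, the weighting step can be applied as follows. The weight values come from the settings in section 1.2.2; the dictionary keys, the `weighted_score` helper, and the choice to reuse the six factors of the hypertension example as the base score are assumptions made for illustration.

```python
import math

# Weights from section 1.2.2; the key names are illustrative, not from the text.
BOOTS = {
    "index_term": 2.0,
    "feature_word": 1.4,
    "content": 1.1,
    "general_introduction": 1.3,
}

def weighted_score(base_score, matched_in):
    """Apply score_d = score(q, d) * boots for the object where the match occurred."""
    return base_score * BOOTS[matched_in]

# Factors of the worked example: tf, idf, boost, lengthNorm, coord, queryNorm.
factors = [3, 1.8351948, 1.0, 0.036002904, 3, 0.066072345]
base = math.prod(factors)

for name in BOOTS:
    print(name, round(weighted_score(base, name), 4))
```

With these settings, a match on an index term always outranks an equally scored match in the general introduction, a feature word, or the body content, in that order.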
2.3. retrieval process
2.3.1. extracting records
After the user enters a keyword, the system looks up the keyword in the index database, scores the records related to the keyword according to the weighted scoring rules, sorts them by score from high to low, and extracts the top 100 records that match the user's knowledge level. Experimental analysis showed that records beyond the 100th have poor relevance to the keyword, so only the top 100 records are extracted in order to improve retrieval efficiency.
2.3.2. filtering duplicate records
The extracted records are filtered by document title, content category, and document ID, and duplicate records are removed.
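The extraction and de-duplication steps of sections 2.3.1 and 2.3.2 can be sketched together. Two points are assumptions, since the text does not specify them: a record is taken to suit the user when its complexity score does not exceed the user's grade, and two records are taken to be duplicates when title, category, and document ID all coincide. The record keys and the `retrieve` function are hypothetical.

```python
def retrieve(scored_records, user_level, limit=100):
    """Rank records, keep those suited to the user, cap at `limit`, drop duplicates."""
    ranked = sorted(scored_records, key=lambda r: r["score"], reverse=True)
    # Assumption: suitable means complexity score <= user's knowledge grade.
    suitable = [r for r in ranked if r["complexity"] <= user_level][:limit]
    seen, unique = set(), []
    for r in suitable:
        # Assumption: the three attributes together identify a duplicate.
        key = (r["title"], r["category"], r["doc_id"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

records = [
    {"doc_id": 1, "title": "Hypertension", "category": "cardiology", "complexity": 2, "score": 0.9},
    {"doc_id": 1, "title": "Hypertension", "category": "cardiology", "complexity": 2, "score": 0.9},
    {"doc_id": 2, "title": "Renal hypertension", "category": "nephrology", "complexity": 5, "score": 0.8},
    {"doc_id": 3, "title": "Blood pressure basics", "category": "general", "complexity": 1, "score": 0.4},
]
print([r["doc_id"] for r in retrieve(records, user_level=3)])  # [1, 3]
```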
2.3.3. classifying search results
Search results are classified by subject and presented to the user. Results in the subject most relevant to the keyword are shown first.
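One possible way to order the subject groups is sketched below. The text does not define how a subject's relevance to the keyword is measured, so the best score among the subject's records is assumed here; `group_by_subject` and the record keys are hypothetical.

```python
from collections import defaultdict

def group_by_subject(records):
    """Group records by subject and order subjects by their best record score."""
    groups = defaultdict(list)
    for r in records:
        groups[r["subject"]].append(r)
    # Assumption: a subject's relevance is the highest score among its records.
    return sorted(
        groups.items(),
        key=lambda item: max(r["score"] for r in item[1]),
        reverse=True,
    )

results = [
    {"subject": "cardiology", "title": "A", "score": 0.9},
    {"subject": "nephrology", "title": "B", "score": 0.5},
    {"subject": "cardiology", "title": "C", "score": 0.3},
]
ordered = group_by_subject(results)
print([subject for subject, _ in ordered])  # ['cardiology', 'nephrology']
```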
2.3.4. displaying medical knowledge
When the user finds the needed content in the search result list and clicks to view the details, the page jumps directly to the position in the text where the search keyword occurs.