CN100535893C - Computerized indexing and searching method

Info

Publication number: CN100535893C
Application number: CNB2004100009360A
Authority: CN
Inventors: 刘千祥; 季晓燕; 周群; 苏华; 赵静
Original assignee: CHINA COMPUTER WORLD PUBLICATION SERVICE Co
Current assignee: CHINA COMPUTER WORLD PUBLICATION SERVICE Co
Priority date: 2004-01-17
Filing date: 2004-01-17
Publication date: 2009-09-02
Anticipated expiration: 2024-01-17
Also published as: CN1641638A

Abstract

The invention provides a method of automatic indexing and retrieval by computer. Indexing and retrieval are carried out by a computer system comprising a content analysis subsystem, a common indexing subsystem, an implicit concept indexing subsystem, and a retrieval subsystem. Adding implicit concept indexing makes the retrieved documents more comprehensive and more accurate.

Description

A method of computer indexing and retrieval

Technical field

The present invention relates to computerized information processing technology, and in particular to a method of using a computer to index and retrieve documents.

Background art

With the rapid development of computer technology, and of Internet technology in particular, the volume of information is growing explosively. Obtaining the information one needs quickly and accurately has become a pressing demand, and automatic indexing and retrieval by computer system provide strong support for meeting it. Term-based retrieval is currently the most common retrieval technique: in a retrieval system built this way, the user only needs to enter a query composed of words, phrases, or sentences to retrieve the relevant documents.

A computer system for automatic indexing and retrieval currently generally comprises a content analysis subsystem, an indexing subsystem, and a retrieval subsystem (see the patent document with publication number CN1435776A). Indexing mainly comprises the following steps: perform text analysis and automatic word segmentation; extract keywords according to information such as term frequency; calculate the relevance of each keyword to the document; and build for the documents an inverted index keyed by term. Retrieval mainly comprises the following steps: segment the query string into words to obtain the search keywords; run the search keywords through an information retrieval model to obtain the relevance of each document to the query string; then sort the documents by relevance and output them.

Most automatic indexing at present is literal indexing: the keywords that appear in a text are assigned as that text's index terms. Some systems additionally perform synonym or hypernym indexing. For example, if the term "computing machine" appears in a text, literal indexing uses only "computing machine" as an index term, whereas synonym indexing also uses its synonym "computer". If the term "Windows 2000" appears in a text, literal indexing uses only "Windows 2000" as an index term, whereas hypernym indexing also uses its hypernym "operating system". None of these methods indexes deeper implicit concepts, so they cannot reveal the concepts that a text implies without stating.
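The difference between literal, synonym, and hypernym indexing can be sketched as follows. This is only an illustration: the lookup tables `SYNONYMS` and `HYPERNYMS` and the function `index_terms` are hypothetical names, and the two entries are the examples from the paragraph above rather than a real thesaurus.

```python
# Hypothetical lookup tables holding only the two examples from the text.
SYNONYMS = {"computing machine": ["computer"]}
HYPERNYMS = {"Windows 2000": ["operating system"]}


def index_terms(keywords, synonyms=None, hypernyms=None):
    """Return index terms for the keywords found in a text.

    With both tables omitted this is literal indexing; passing a table
    adds the synonyms or hypernyms of each keyword as extra index terms.
    """
    synonyms = synonyms or {}
    hypernyms = hypernyms or {}
    terms = []
    for keyword in keywords:
        candidates = [keyword, *synonyms.get(keyword, []), *hypernyms.get(keyword, [])]
        for term in candidates:
            if term not in terms:  # keep first occurrence, preserve order
                terms.append(term)
    return terms


if __name__ == "__main__":
    found = ["computing machine", "Windows 2000"]
    print(index_terms(found))                       # literal indexing only
    print(index_terms(found, SYNONYMS, HYPERNYMS))  # with expansion
```

Neither table can add a term such as an implied topic that has no synonym or hypernym link to a keyword, which is the gap the implicit concept indexing is meant to fill.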

Information retrieval models in common use at present include the Boolean retrieval model, the vector space model, and the probabilistic retrieval model. In the vector space model, every document in the retrieval system and every query is represented as a vector, for example Di = (T1, T2, T3, ..., Tm) and Q = (T1, T2, T3, ..., Tn), where Di is the i-th document in the document collection, Q is the query, and Tk is the k-th component of the document or query vector, that is, the k-th index term of the document representation or the k-th search term of the query. The vectors for Di and Q, written Dvi and Qv, can therefore be expressed as follows:

Dvi = (DWi1, DWi2, DWi3, ..., DWim)

Qv = (QW1, QW2, QW3, ..., QWn)

DWij and QWj are the weights of the individual components (terms) of the document and of the query respectively, that is, the term weights after weighting for the specific document or query. In the vector space model, DWij and QWj take values in the interval [0, 1]. This defines a vector space in which matching documents against a query becomes the problem of computing the similarity between a document vector and the query vector. The relevance of a document to a query is measured by the similarity of this pair of vectors.

The simplest way to compute the similarity of a vector pair is the dot product function, which defines the similarity of the document vector and the query vector as:

Sim(Dv_i, Qv) = \sum_{j=1}^{n} DW_{ij} \cdot QW_j \qquad (1)

In formula (1), Sim(Dvi, Qv) denotes the similarity of the document vector Dvi and the query vector Qv.

A commonly used way to compute the similarity of a vector pair is the cosine function, which defines the similarity as:

Sim(Dv_i, Qv) = \frac{\sum_{j=1}^{n} DW_{ij} \cdot QW_j}{\sqrt{\left(\sum_{j=1}^{n} DW_{ij}^{2}\right)\left(\sum_{j=1}^{n} QW_j^{2}\right)}} \qquad (2)

In formula (2), Sim(Dvi, Qv) denotes the similarity of the document vector Dvi and the query vector Qv. In essence, this method computes the cosine of the angle between the document vector and the query vector in a multidimensional space. When the two vectors are identical they coincide in this space, that is, the angle is 0 and the function (the similarity) reaches its maximum. The denominator of the formula acts as a normalization factor. If the angle between the vectors is small and the vectors are normalized, the angle is approximately equal to the distance between the endpoints of the two vectors.

Once the similarity between every document vector and a given query vector has been computed, the system outputs, in descending order of similarity, the documents whose similarity exceeds a specified threshold (or a predetermined number of top-ranked documents).
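The output rule described above can be sketched as follows. This is a minimal illustration, assuming documents are already available as weight vectors of the same length as the query vector; the function names `cosine` and `retrieve`, the parameters `threshold` and `top_k`, and the sample vectors are all hypothetical.

```python
import math


def cosine(doc, query):
    """Cosine of the angle between two weight vectors (0.0 for a zero vector)."""
    dot = sum(d * q for d, q in zip(doc, query))
    norm = math.sqrt(sum(d * d for d in doc) * sum(q * q for q in query))
    return dot / norm if norm else 0.0


def retrieve(docs, query, threshold=None, top_k=None):
    """Score every document, sort by descending similarity, then cut.

    docs maps a document id to its weight vector. Pass threshold to keep
    only documents scoring above it, top_k to keep a fixed number.
    """
    scored = sorted(
        ((doc_id, cosine(vec, query)) for doc_id, vec in docs.items()),
        key=lambda pair: pair[1],
        reverse=True,
    )
    if threshold is not None:
        scored = [pair for pair in scored if pair[1] > threshold]
    return scored if top_k is None else scored[:top_k]


if __name__ == "__main__":
    docs = {"a": [0.9, 0.1], "b": [0.6, 0.4], "c": [0.0, 0.0]}
    print(retrieve(docs, [1.0, 1.0], threshold=0.5))
    print(retrieve(docs, [1.0, 1.0], top_k=1))
```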

Computing vector similarity with formula (1) has an obvious limitation: formula (1) merely sums the products of the weights and ignores the angle between the vectors, so the computed similarity can differ considerably from the actual one. For example, documents to which formula (1) assigns the same similarity may form different angles with the query vector; their actual similarity then differs, and the document with the smaller angle is the more similar. Formula (2) measures only the angle between the two vectors and ignores their length, so its result can also differ considerably from the actual situation. For example, formula (2) assigns the same similarity to documents whose vectors form the same angle with the query but differ in length, whereas the document with the longer vector is actually the more similar. Both methods therefore have limitations.

Another limitation of current retrieval techniques is that results are ranked only by document relevance, without regard to time. Ranking by relevance alone often fails to meet users' actual needs, because users generally want documents that are both relevant and recent.

Summary of the invention

To address the shortcomings of the indexing methods described above, the present invention proposes a method of automatic indexing by computer system, and of retrieval based on that indexing, which makes the retrieved documents more comprehensive. As a further improvement, to address the shortcomings of the retrieval methods described above, the invention proposes a new technical solution that raises the relevance of the retrieved documents to the query keywords and so ensures the accuracy of retrieval.

The object of the invention is achieved through the following technical solution:

A method of computer indexing and retrieval uses a computer system comprising a content analysis subsystem, a common indexing subsystem, and a retrieval subsystem to perform automatic indexing and retrieval. The computer system further includes an implicit concept indexing subsystem, which stores an implicit concept rule base made up of implicit concept rule tables. Each rule table contains the implicit concepts, the terms corresponding to each implicit concept, and the weights of those terms. The method comprises the following steps:

Extract the keywords of a document; according to the pre-established implicit concept rule base and the weight information of the document keywords, calculate the relevance of each implicit concept to the document; and add the implicit concepts whose relevance to the document reaches a predetermined threshold to the document's index terms;

When a retrieval request is received from the user, determine the query keywords from the retrieval information the user entered;

Determine the relevance of each document to the query from the weights of the document's index terms, including its implicit concept index terms, and the weights of the query keywords, and output the retrieval results in order of relevance.

The relevance of an implicit concept to a document is calculated as:

Sim(Dv_i, Cv_i) = \frac{\sum_{j=1}^{n} DW_{ij} \cdot CW_{ij}}{\sum_{j=1}^{n} DW_{ij}^{2} + \sum_{j=1}^{n} CW_{ij}^{2} - \sum_{j=1}^{n} DW_{ij} \cdot CW_{ij}}

In the formula, Sim(Dvi, Cvi) is the relevance of the implicit concept to the document, DWij is the weight of each keyword in the document, and CWij is the weight of the corresponding term for each implicit concept in the implicit concept rule table.

The method retrieves using the vector space model, and the relevance of a document to a query is calculated as:

Sim(Dv_i, Qv) = \frac{\sum_{j=1}^{n} DW_{ij} \cdot QW_j}{\sum_{j=1}^{n} DW_{ij}^{2} + \sum_{j=1}^{n} QW_j^{2} - \sum_{j=1}^{n} DW_{ij} \cdot QW_j}

In the formula, Sim(Dvi, Qv) denotes the relevance of the document to the query, DWij is the weight of each index term in the document, and QWj is the weight of each keyword in the query.

As a further improvement of the invention, the retrieval results are output in order of the documents' combined relevance. The combined relevance of a document is obtained by weighting the time of the document together with the relevance of the document to the query, according to the formula:

SimT(Dvi, Qv) = Sim(Dvi, Qv) + k * Si

In the formula, SimT(Dvi, Qv) denotes the combined relevance of the document, Sim(Dvi, Qv) is the relevance of the document to the query, k is the time weighting coefficient, and Si is the time weight.

The time weight Si is calculated as follows:

Time is divided into periods according to the difference from the current time, and each period is assigned its own weight; a document receives the time weight of the period to which its date belongs.

Compared with the prior art, the present invention has the following advantages:

1. Most automatic indexing at present is literal indexing; it does not index deeper implicit concepts and cannot reveal the concepts a text implies. The invention adds implicit concept indexing on top of literal indexing, so that documents are indexed more accurately.

2. The algorithms currently used to compute relevance in the vector space model (the dot product function and the cosine function) cannot account for both the angle and the length of the vectors, which limits the accuracy of the result. The new algorithm the invention adopts for computing the relevance of a document to a query accounts for both factors at once and greatly improves the accuracy of the result.

3. Existing retrieval methods either rank results by relevance alone, ignoring time, so that very old documents appear at the top of the results, or rank by time alone, ignoring relevance, so that barely relevant documents appear at the top. The invention combines relevance and time when ranking the output, so that the top results are both relevant and recent and meet users' actual needs.

Description of drawings

Fig. 1 shows the basic framework of the computer indexing and retrieval system of the invention.

Fig. 2 shows the workflow of implicit concept indexing in the invention.

Fig. 3 shows the workflow of the computer retrieval method of the invention.

Detailed description

The invention is further described below with reference to the drawings and a specific embodiment.

The method of the invention is suitable for automatic indexing and retrieval, by computer system, of an information base made up of text documents. Fig. 1 shows the basic framework of the computer system used by the method of this embodiment. The method uses a computer system comprising a content analysis subsystem, an indexing subsystem, an implicit concept indexing subsystem, and a retrieval subsystem. The content analysis subsystem analyzes the content of each document in the text document database to be searched: through text analysis and automatic word segmentation it obtains information such as the document's term frequencies and extracts keywords. The indexing subsystem performs the common kinds of indexing, such as literal indexing and synonym indexing, according to information such as term frequency. The implicit concept indexing subsystem stores the implicit concept rule database, made up of implicit concept rule tables, and is responsible for indexing implicit concepts. The retrieval subsystem responds to users' retrieval requests, carries out retrieval, and outputs the results. In terms of hardware, the subsystems can all run on a single server or run separately on multiple servers.

The computer indexing method of this embodiment comprises the following steps:

1. Obtain information such as the document's term frequencies through text analysis and automatic word segmentation, and extract the document's keywords.

2. According to the weight information of the document keywords, add implicit concept indexing on top of literal indexing and synonym indexing.

3. Using the resulting document index terms and their weights, build the inverted index (D, T, W), where D is the document, T is the term, and W is the weight of the term with respect to the document.
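The inverted index of (D, T, W) triples from step 3 can be sketched as a mapping from each term to its postings. This is one plausible in-memory layout, not the storage format of the patent; the function name `build_inverted_index` and the sample weights are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(doc_terms):
    """Turn per-document term weights into a term -> postings mapping.

    doc_terms maps a document id D to {term T: weight W}. The result maps
    each term T to a list of (D, W) pairs, highest weight first.
    """
    index = defaultdict(list)
    for doc_id, terms in doc_terms.items():
        for term, weight in terms.items():
            index[term].append((doc_id, weight))
    for postings in index.values():
        postings.sort(key=lambda posting: posting[1], reverse=True)
    return dict(index)


if __name__ == "__main__":
    doc_terms = {
        "d1": {"website": 1.0, "launch": 0.4},
        "d2": {"website": 0.5},
    }
    for term, postings in build_inverted_index(doc_terms).items():
        print(term, postings)
```

Index terms added by implicit concept indexing would enter `doc_terms` like any other term, so the retrieval side does not need to distinguish them.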

Fig. 2 shows the workflow of the implicit concept indexing in step 2 above. Implicit concept indexing comprises the following steps:

21. Establish the implicit concept rule base and store it in the implicit concept indexing subsystem. The rule base is made up of implicit concept rule tables. A rule table contains the implicit concepts, the terms corresponding to each implicit concept, and the weights of those terms, and can be expressed as (Ci, Tij, CWij), where Ci is the implicit concept, Tij is a term, and CWij is the term's weight. For example, for the implicit concept "new website", the rule table could contain: new website, website, 1.0; new website, launch, 0.9; new website, release, 0.9; new website, publish, 0.9. Here "new website" is the implicit concept; "website", "launch", "release", and "publish" are its related terms; and 1.0, 0.9, 0.9, and 0.9 are the weights of those terms. The weighting factors behind these weights can include term frequency, inverse document frequency, a normalization factor, and so on.
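A rule table of (Ci, Tij, CWij) rows can be held in memory as a nested mapping. The sketch below is illustrative: the English terms are one rendering of the "new website" example, the names `RULE_ROWS` and `load_rule_base` are hypothetical, and the range check on weights is an assumption borrowed from the [0, 1] convention used for document and query weights.

```python
from collections import defaultdict

# (implicit concept Ci, term Tij, term weight CWij) rows for one concept.
RULE_ROWS = [
    ("new website", "website", 1.0),
    ("new website", "launch", 0.9),
    ("new website", "release", 0.9),
    ("new website", "publish", 0.9),
]


def load_rule_base(rows):
    """Group rule rows into {concept: {term: weight}}."""
    rule_base = defaultdict(dict)
    for concept, term, weight in rows:
        # Assumed constraint: rule weights share the [0, 1] range of DWij.
        if not 0.0 <= weight <= 1.0:
            raise ValueError(f"weight out of range for {concept!r}/{term!r}: {weight}")
        rule_base[concept][term] = weight
    return dict(rule_base)


if __name__ == "__main__":
    print(load_rule_base(RULE_ROWS))
```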

22. Normalize the term frequency information obtained through text analysis and automatic word segmentation to obtain normalized keyword weights for the document, which can be expressed as (Di, Tij, DWij), where Di is the i-th document in the document collection, Tij is the j-th keyword of the i-th document, and DWij is the weight of the j-th keyword in the i-th document. The normalization can use the term frequency and inverse document frequency formula in common use (the TF-IDF formula) to calculate the term weights; the weighting factors include term frequency, inverse document frequency, a normalization factor, and so on.
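The text names TF-IDF with a normalization factor but does not fix a variant, so the sketch below makes two assumptions that should be read as illustrative only: the inverse document frequency is taken as log(1 + N/df), and each document's weights are divided by its largest raw weight so that DWij falls in [0, 1]. The function name `tfidf_weights` and the sample documents are hypothetical.

```python
import math
from collections import Counter


def tfidf_weights(docs):
    """Compute normalized keyword weights (Di, Tij, DWij).

    docs maps a document id to its list of segmented terms. Returns
    {doc id: {term: weight}} with every weight in [0, 1].
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for terms in docs.values() for term in set(terms))
    weights = {}
    for doc_id, terms in docs.items():
        term_freq = Counter(terms)
        raw = {
            term: count * math.log(1 + n_docs / doc_freq[term])  # assumed IDF variant
            for term, count in term_freq.items()
        }
        peak = max(raw.values(), default=0.0)
        # Assumed normalization: scale by the largest weight in the document.
        weights[doc_id] = {t: w / peak for t, w in raw.items()} if peak else {}
    return weights


if __name__ == "__main__":
    docs = {"d1": ["website", "launch", "website"], "d2": ["launch"]}
    print(tfidf_weights(docs))
```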

23. Using the established implicit concept rule base and the normalized document keyword weights, apply the vector space model to calculate the relevance of each implicit concept to the document. The formula is:

Sim(Dv_i, Cv_i) = \frac{\sum_{j=1}^{n} DW_{ij} \cdot CW_{ij}}{\sum_{j=1}^{n} DW_{ij}^{2} + \sum_{j=1}^{n} CW_{ij}^{2} - \sum_{j=1}^{n} DW_{ij} \cdot CW_{ij}} \qquad (3)

In formula (3), Sim(Dvi, Cvi) is the relevance of the implicit concept to the document, DWij is the weight of each keyword in the document, and CWij is the weight of that keyword for each implicit concept in the implicit concept rule table.

24. Index as index terms the implicit concepts whose relevance to the document reaches a given threshold.
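Steps 23 and 24 can be sketched together. Several details below are assumptions, since the text leaves them open: the sums in formula (3) are taken over the union of document keywords and rule terms with missing weights treated as 0, the threshold value 0.5 is arbitrary, and the weight stored for an added concept is its relevance score. The names `concept_relevance` and `add_implicit_concepts` are hypothetical.

```python
def concept_relevance(doc_weights, concept_weights):
    """Formula (3): relevance of one implicit concept to one document.

    Both arguments map term -> weight; a term missing from one side
    contributes weight 0 there (assumed reading of the summation).
    """
    shared = set(doc_weights) & set(concept_weights)
    dot = sum(doc_weights[t] * concept_weights[t] for t in shared)
    denom = (
        sum(w * w for w in doc_weights.values())
        + sum(w * w for w in concept_weights.values())
        - dot
    )
    return dot / denom if denom else 0.0


def add_implicit_concepts(doc_weights, rule_base, threshold=0.5):
    """Return the document's index terms plus qualifying implicit concepts."""
    indexed = dict(doc_weights)
    for concept, concept_weights in rule_base.items():
        score = concept_relevance(doc_weights, concept_weights)
        if score >= threshold:
            # Assumption: the concept's index weight is its relevance score.
            indexed[concept] = max(indexed.get(concept, 0.0), score)
    return indexed


if __name__ == "__main__":
    rule_base = {"new website": {"website": 1.0, "launch": 0.9, "release": 0.9, "publish": 0.9}}
    doc = {"website": 0.9, "launch": 0.8, "portal": 0.3}
    print(add_implicit_concepts(doc, rule_base))
```

A document that never contains the phrase "new website" can still receive it as an index term when enough of the concept's related terms carry weight in the document.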

Fig. 3 shows the workflow of the computer retrieval method of this embodiment. The method produces retrieval results from a document index that includes implicit concept indexing, and comprises the following steps:

1. Receive the retrieval request from the user. The retrieval request is the character string the user enters when searching, that is, the query; it can be a word, a phrase, or a sentence.

2. Segment the query string into words to obtain the search keywords.

3. Retrieve using the vector space model to obtain the relevance of each document to the query. The formula is:

Sim(Dv_i, Qv) = \frac{\sum_{j=1}^{n} DW_{ij} \cdot QW_j}{\sum_{j=1}^{n} DW_{ij}^{2} + \sum_{j=1}^{n} QW_j^{2} - \sum_{j=1}^{n} DW_{ij} \cdot QW_j} \qquad (4)

In formula (4), Sim(Dvi, Qv) denotes the similarity of the document vector Dvi and the query vector Qv, DWij is the weight of each index term in the document, and QWj is the weight of each keyword in the query; DWij and QWj take values in the interval [0, 1].

Formula (4) accounts for both the angle and the length of the vectors. When the document and the query have nothing in common (their keyword sets do not intersect), Sim is 0. When they are identical, Sim is 1. When they are similar but not identical, Sim lies between 0 and 1. The value of Sim therefore serves as the measure of similarity between document and query.
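The three cases above follow from a short rearrangement of formula (4). In the derivation below, D and Q are shorthand introduced here for the document and query weight vectors; the rearranged form is offered as a reading aid, not as notation from the patent.

```latex
\begin{aligned}
\|D - Q\|^{2} &= \sum_{j} DW_{ij}^{2} + \sum_{j} QW_{j}^{2} - 2\sum_{j} DW_{ij}\,QW_{j} \\[4pt]
\mathrm{Sim}(D, Q) &= \frac{D \cdot Q}{\|D\|^{2} + \|Q\|^{2} - D \cdot Q}
  = \frac{D \cdot Q}{D \cdot Q + \|D - Q\|^{2}} \\[4pt]
D = Q \neq 0 &\;\Rightarrow\; \mathrm{Sim} = 1, \qquad
D \cdot Q = 0,\ (D, Q) \neq (0, 0) \;\Rightarrow\; \mathrm{Sim} = 0
\end{aligned}
```

Because all weights are non-negative, both D·Q and ‖D − Q‖² are non-negative, so Sim stays within [0, 1]. The ‖D − Q‖² term grows when the two vectors differ in direction or in length, which is how the formula reflects both factors. This expression is the same as the one commonly called the Tanimoto (extended Jaccard) coefficient.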

For example, take a query Q(T1, T2) with vector representation Qv = (QW1, QW2) = (1, 1); in general, each component weight of a query vector can be taken to be 1.

Suppose two documents in the document database, Dv1 and Dv2, are related to the query: Dv1 = (0.9, 0.1), where 0.9 is the weight of T1 in Dv1 and 0.1 is the weight of T2 in Dv1; and Dv2 = (0.6, 0.4), where 0.6 is the weight of T1 in Dv2 and 0.4 is the weight of T2 in Dv2.

The dot product function, formula (1), gives Sim(Dv1, Qv) = 1.0 and Sim(Dv2, Qv) = 1.0.

The cosine function, formula (2), gives Sim(Dv1, Qv) = 0.78 and Sim(Dv2, Qv) = 0.98.

Formula (4) gives Sim(Dv1, Qv) = 0.55 and Sim(Dv2, Qv) = 0.66.

By the dot product formula, the two documents Dv1 and Dv2 are equally important for the query Q(T1, T2), whereas in fact Dv2 is more relevant to it; in other words, the results of formulas (2) and (4) are more accurate.

Now suppose instead that the two related documents in the document database are Dv1 = (0.3, 0.3), where 0.3 is the weight of T1 in Dv1 and 0.3 is the weight of T2 in Dv1, and Dv2 = (0.7, 0.7), where 0.7 is the weight of T1 in Dv2 and 0.7 is the weight of T2 in Dv2.

The dot product function, formula (1), gives Sim(Dv1, Qv) = 0.6 and Sim(Dv2, Qv) = 1.4.

The cosine function, formula (2), gives Sim(Dv1, Qv) = 1.0 and Sim(Dv2, Qv) = 1.0.

Formula (4) gives Sim(Dv1, Qv) = 0.38 and Sim(Dv2, Qv) = 0.89.

By the cosine formula, the two documents Dv1 and Dv2 are equally important for the query Q(T1, T2), whereas in fact Dv2 is more relevant to it; in other words, the results of formulas (1) and (4) are more accurate.

Comparing the calculated results with the actual situation shows that formula (4) accounts for both the angle and the length of the vectors, and reflects the relevance of a document to a query more accurately than formulas (1) and (2).
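The two comparisons can be recomputed with a few lines of code. The sketch below evaluates formulas (1), (2), and (4) on the document vectors from the examples above against Qv = (1, 1); the function and variable names are hypothetical, and values are rounded to two decimals for display.

```python
import math


def dot_sim(d, q):
    """Formula (1): dot product of the two weight vectors."""
    return sum(x * y for x, y in zip(d, q))


def cosine_sim(d, q):
    """Formula (2): dot product normalized by both vector lengths."""
    return dot_sim(d, q) / math.sqrt(dot_sim(d, d) * dot_sim(q, q))


def formula4_sim(d, q):
    """Formula (4): dot product over (|d|^2 + |q|^2 - dot product)."""
    dq = dot_sim(d, q)
    return dq / (dot_sim(d, d) + dot_sim(q, q) - dq)


QUERY = (1.0, 1.0)
PAIRS = {
    "different angle": [(0.9, 0.1), (0.6, 0.4)],
    "different length": [(0.3, 0.3), (0.7, 0.7)],
}
MEASURES = (("dot", dot_sim), ("cosine", cosine_sim), ("formula4", formula4_sim))


def compare(pairs=PAIRS, query=QUERY):
    """Return {case: {measure: [score for Dv1, score for Dv2]}}."""
    return {
        label: {name: [round(fn(d, query), 2) for d in docs] for name, fn in MEASURES}
        for label, docs in pairs.items()
    }


if __name__ == "__main__":
    for label, rows in compare().items():
        print(label, rows)
```

In both cases formula (4) is the only one of the three measures that separates the two documents, scoring Dv2 higher each time.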

4. Weight the time of the document together with its relevance to obtain the document's combined relevance.

The time weight of a document is denoted Si. The combined relevance of the document is calculated from the time weight Si and the relevance Sim(Dvi, Qv) of the document to the query obtained in step 3, as follows:

SimT(Dvi, Qv) = Sim(Dvi, Qv) + k * Si

In the formula, SimT(Dvi, Qv) denotes the combined relevance of the document and k is the time weighting coefficient.

In this embodiment, the time weight Si is calculated as follows:

Time is divided into periods according to the difference from the current time, and each period is assigned its own weight; a document receives the time weight of the period to which its date belongs. For example, with periods of one year, every document from the current year has a time weight of 1; for each year by which a document predates the current year its time weight is reduced by 0.05; and documents 20 or more years older than the current year all have a time weight of 0. That is, in 2003, documents from 2003 are current-year documents with a time weight of 1, documents from 2002 have a time weight of 0.95, and documents from 2001, 2000, 1999, and so on are weighted by analogy. In 2004, documents from 2004 are the current-year documents with a time weight of 1, documents from 2003 have a time weight of 0.95, and documents from 2002, 2001, 2000, 1999, and so on are weighted by analogy.
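The yearly scheme in this example reduces to a small function. The sketch below uses the figures from the example (a step of 0.05 per year and a cutoff at 20 years) as defaults; the function name `time_weight`, the parameter names, and the treatment of documents dated after the current year are assumptions.

```python
def time_weight(doc_year, current_year, step=0.05, cutoff_years=20):
    """Time weight Si of a document under the one-year-period scheme.

    Current-year documents get 1.0, each earlier year loses `step`,
    and documents `cutoff_years` or more older get 0.0.
    """
    # Assumption: a document dated in the future counts as current.
    age = max(0, current_year - doc_year)
    if age >= cutoff_years:
        return 0.0
    # Rounding hides floating-point noise such as 0.9500000000000001.
    return round(max(0.0, 1.0 - step * age), 2)


if __name__ == "__main__":
    for year in (2004, 2003, 2002, 1985, 1984):
        print(year, time_weight(year, current_year=2004))
```

Other period lengths, such as months, would need a date difference in place of the year subtraction; the patent text only requires that each period carry its own weight.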

5. Sort and output according to combined relevance.

Sort the documents by the combined relevance calculated in step 4, and output to the retrieval results those whose combined relevance exceeds a given threshold.
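Steps 4 and 5 can be sketched as a single ranking pass. The coefficient k = 0.2 and the threshold 0.5 below are arbitrary illustrative values, since the text does not give concrete ones; the names `combined_relevance` and `rank_results` and the sample candidates are hypothetical.

```python
def combined_relevance(sim, time_w, k):
    """SimT = Sim + k * Si."""
    return sim + k * time_w


def rank_results(candidates, k=0.2, threshold=0.5):
    """Rank candidate documents by combined relevance.

    candidates is a list of (doc id, Sim, Si) tuples. Documents whose
    combined relevance does not exceed the threshold are dropped; the
    rest are returned as (doc id, SimT) pairs, best first.
    """
    scored = [
        (doc_id, combined_relevance(sim, time_w, k))
        for doc_id, sim, time_w in candidates
    ]
    kept = [pair for pair in scored if pair[1] > threshold]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        ("older, closer match", 0.66, 0.0),
        ("recent, looser match", 0.55, 1.0),
        ("recent, weak match", 0.10, 0.5),
    ]
    for doc_id, score in rank_results(candidates):
        print(f"{score:.2f}  {doc_id}")
```

With these values the recent document overtakes the older one despite its lower Sim, while the weak match stays below the threshold. The choice of k controls how much recency can outweigh relevance.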

With the indexing and retrieval method described above, the retrieval results are both relevant to the query string and recent, which meets readers' actual needs.

Claims

1. A method of computer indexing and retrieval, which uses a computer system comprising a content analysis subsystem, a common indexing subsystem, and a retrieval subsystem to perform automatic indexing and retrieval, characterized in that the computer system further includes an implicit concept indexing subsystem storing an implicit concept rule base made up of implicit concept rule tables, each rule table containing the implicit concepts, the terms corresponding to each implicit concept, and the weights of those terms; the method comprising the following steps:

2. The method of computer indexing and retrieval according to claim 1, characterized in that the relevance of an implicit concept to a document is calculated as:

Sim(Dv_i, Cv_i) = \frac{\sum_{j=1}^{n} DW_{ij} \cdot CW_{ij}}{\sum_{j=1}^{n} DW_{ij}^{2} + \sum_{j=1}^{n} CW_{ij}^{2} - \sum_{j=1}^{n} DW_{ij} \cdot CW_{ij}}

3. The method of computer indexing and retrieval according to claim 1 or 2, characterized in that the method retrieves using the vector space model, and the relevance of a document to a query is calculated as:

Sim(Dv_i, Qv) = \frac{\sum_{j=1}^{n} DW_{ij} \cdot QW_j}{\sum_{j=1}^{n} DW_{ij}^{2} + \sum_{j=1}^{n} QW_j^{2} - \sum_{j=1}^{n} DW_{ij} \cdot QW_j}

4. The method of computer indexing and retrieval according to claim 3, characterized in that the retrieval results are output in order of the documents' combined relevance, the combined relevance of a document being obtained by weighting the time of the document together with the relevance of the document to the query, according to the formula:

SimT(Dvi, Qv) = Sim(Dvi, Qv) + k * Si

5. The method of computer indexing and retrieval according to claim 4, characterized in that the time weight Si is calculated as follows:

6. The method of computer indexing and retrieval according to claim 1 or 2, characterized in that the retrieval results are output in order of the documents' combined relevance, the combined relevance of a document being obtained by weighting the time of the document together with the relevance of the document to the query, according to the formula:

SimT(Dvi, Qv) = Sim(Dvi, Qv) + k * Si

7. The method of computer indexing and retrieval according to claim 6, characterized in that the time weight Si is calculated as follows: