CN103294741A - Similar document retrieval auxiliary device and similar document retrieval auxiliary method - Google Patents

Publication number
CN103294741A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2012105391303A
Other languages
Chinese (zh)
Other versions
CN103294741B (en)
Inventor
间赖久雄
藤稿航平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd
Publication of CN103294741A
Application granted
Publication of CN103294741B
Legal status: Expired - Fee Related
Anticipated expiration

Abstract

The invention provides a similar document retrieval auxiliary device and a similar document retrieval auxiliary method. The user is shown how strongly each factor that affects the precision of similar document retrieval influences that precision, together with information on countermeasures for improving it, so that the user's retrieval work cycle runs efficiently and the efficiency and quality of retrieval work improve. Each factor is analyzed on the basis of pairs of past input documents and their correct answer documents, and value ranges of the factor are associated with retrieval precision and stored in a table. By computer processing, the same factor analysis is applied to a newly input document, and the retrieval precision corresponding to the value range that contains the factor value of the new input document is determined by referring to the table. The retrieval precision and/or its deviation from the average retrieval precision over all past input documents is then presented to the user by computer processing. Preferably, information on countermeasures for improving retrieval precision is also presented to the user.

Description

Similar document retrieval auxiliary device and similar document retrieval auxiliary method
Technical field
The present invention relates to a document retrieval device and a document retrieval method for retrieving desired documents from a large document collection. In particular, it relates to a similar document retrieval auxiliary device and a similar document retrieval auxiliary method in which text or a document specified by the user serves as the search condition, documents whose content is similar or related to it are retrieved from the document collection to be searched, and the results are output in descending order of similarity or relatedness.
Background art
With the spread of hardware such as communication networks (for example the Internet), PCs, and mobile phones, faster and cheaper CPUs, larger and cheaper memory and hard disks, and more capable, higher-performance software such as retrieval systems and document editors, ordinary people can now easily access large amounts of document information. On the other hand, it has become difficult to retrieve and obtain a desired document from a large document collection quickly, accurately, and with little effort.
Keyword search is the usual way to retrieve desired documents from a large document collection. In keyword search, the user composes a keyword logical expression from one or more keywords related to the desired document and logical operators (AND, OR, NOT, and so on) that express the logical relationships among those keywords. The document retrieval device receives the logical expression from the user, retrieves from the target document collection only the documents for which the expression is true, and presents them to the user.
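The keyword search described above can be sketched as follows. This is a minimal illustration, not the mechanism of any particular retrieval device: the nested-tuple query form and the names `matches` and `keyword_search` are assumptions made for this example.

```python
def matches(doc_terms, query):
    """Evaluate a keyword logical expression against a document's term set.

    query is either a keyword string or a tuple (hypothetical format):
    ("AND", q1, q2, ...), ("OR", q1, q2, ...), or ("NOT", q).
    """
    if isinstance(query, str):
        return query in doc_terms
    op, *args = query
    if op == "AND":
        return all(matches(doc_terms, a) for a in args)
    if op == "OR":
        return any(matches(doc_terms, a) for a in args)
    if op == "NOT":
        return not matches(doc_terms, args[0])
    raise ValueError(f"unknown operator: {op}")


def keyword_search(docs, query):
    """Return the IDs of documents for which the logical expression is true."""
    return [doc_id for doc_id, terms in docs.items() if matches(terms, query)]


# Illustrative data only.
docs = {
    "d1": {"retrieval", "vector"},
    "d2": {"retrieval", "keyword"},
    "d3": {"disk"},
}
hits = keyword_search(docs, ("AND", "retrieval", ("NOT", "keyword")))
```

Note that the result is an unranked set of matching documents, which is why narrowing it to a readable number depends entirely on how well the user composes the expression.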
In keyword search, however, users often cannot work out what keyword logical expression will narrow the result documents down to a number they can actually read. It is also difficult to output preferentially, and with good precision, the result documents that reflect the user's search intent.
Recently, in this field, a technique has become widespread in which arbitrary text entered by the user, or an arbitrary document specified by the user, serves as the search condition; documents whose content is similar or related to it are retrieved from the target document collection and output in descending order of similarity or relatedness. This technique is called similar document retrieval. It is also known as concept search, natural language search, natural sentence search, fuzzy search, or associative search.
Similar document retrieval is realized by the following processing. First, feature words that express the characteristics of the content are extracted from each target document in the collection to be searched. A weight corresponding to its importance is then calculated for and assigned to each feature word, producing a feature word vector made up of one or more weighted feature words, which is stored in a catalog in advance. By the same method, weighted feature words are extracted from the text entered or the document specified by the user (hereinafter collectively the "input document") and a feature word vector is generated. The feature vector generated from the input document is then compared with the feature vector of each target document, and the similarity between the two is calculated. The inner product of the feature vectors, or the cosine of the angle between them, is commonly used as the similarity. Finally, the documents are sorted in descending order of similarity, and the top-ranked documents are output as documents similar to the input document.
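The comparison and ranking steps above can be sketched as follows, assuming feature word vectors are held as dictionaries that map each feature word to its weight. The names `cosine`, `rank_similar`, and `catalog`, and the sample weights, are illustrative assumptions; feature word extraction and weighting are taken as already done.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two sparse feature word vectors."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_similar(input_vec, catalog):
    """Score every target document and sort in descending order of similarity."""
    scored = [(doc_id, cosine(input_vec, vec)) for doc_id, vec in catalog.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative catalog of precomputed feature word vectors.
catalog = {
    "A": {"retrieval": 1.0, "vector": 1.0},
    "B": {"disk": 1.0},
    "C": {"retrieval": 1.0, "disk": 1.0},
}
ranked = rank_similar({"retrieval": 1.0, "vector": 1.0}, catalog)
```

Because the score is a single number aggregated over many weighted feature words, the ranking by itself gives no indication of why a document ranked where it did.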
Prior art documents
Patent documents
Patent Document 1: Japanese Unexamined Patent Application Publication No. 2002-230032
Patent Document 2: Japanese Unexamined Patent Application Publication No. 1995-192020
Patent Document 3: Japanese Unexamined Patent Application Publication No. 2000-311173
Problem to be solved by the invention
In similar document retrieval, any text the user has in mind or any document at hand can be specified directly as the search condition, so the user has the advantage of not needing to compose a keyword logical expression. In addition, results are output in ranked order starting from the documents most similar in content to the input document, so the user can find the desired document quickly.
In similar document retrieval, however, similarity between the input document and each target document is judged by comparing feature word vectors whose elements are a large number of weighted feature words. The drawback is that the user has difficulty understanding the basis of the retrieval, that is, why a given document was output as a similar document. More specifically, similar document retrieval has the four problems below.
Problem (1): The user cannot tell how much each feature word in the input document contributed to the output of the similar document retrieval results.
Problem (2): The user cannot tell how well the similar document retrieval went.
Problem (3): When similar document retrieval goes poorly, the user cannot tell why.
Problem (4): When similar document retrieval goes poorly, the user cannot tell what to do next to obtain better search results.
Technical documents related to problem (1) include Patent Document 1 and Patent Document 2. The inventions described in them present search results as a table or graph whose axes are the search results and the items used in the retrieval.
In Patent Document 1, a document fitness value is calculated for each of several judgment criteria, and an overall document fitness value is calculated by aggregating them. When search results are output, a table is output whose two axes are the result documents and the judgment criteria, and whose values are the overall fitness value of each result document and its fitness value under each criterion. From this table the user can see how much each judgment criterion contributed to the output of each result document.
In Patent Document 2, the input document is analyzed and divided into several different viewpoints, each viewpoint is converted into a search command, the similarity between the input document and each target document is calculated separately per viewpoint, and the results are combined and output. When search results are output, the similarity between the search commands and the result documents is displayed in two or three dimensions with the specified viewpoints as axes. From this display the user can see which viewpoint caused each result document to be output.
The inventions of Patent Documents 1 and 2 thus address problem (1) by presenting search results as a table or graph whose axes are the search results and the items used in the retrieval (viewpoints or judgment criteria). They do not, however, describe any mechanism for solving problems (2), (3), and (4).
Regarding problem (2), for example, letting the user see whether similar document retrieval went well requires analyzing the similarity between the input document and the target documents in terms of various factors, and presenting it so that the user can evaluate the quality of the retrieval factor by factor.
Technical documents related to problem (2) include Patent Document 3. Patent Document 3 describes the following technique. First, from past search results, the retrieval precision corresponding to each value range of the similarity of the retrieved similar documents is calculated in advance for each category assigned to the result documents. Next, for each result document retrieved for a new input document, the retrieval precision corresponding to its similarity within its category is determined from the document's similarity and category. The similarity of the result document is then replaced with that retrieval precision, which is used as an accuracy value, and the search results are re-sorted and displayed in descending order of accuracy, thereby improving retrieval precision.
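The re-sorting idea attributed to Patent Document 3 can be sketched roughly as follows. The table layout (per category, a list of similarity ranges with a precomputed precision), the fallback of 0.0 for an unmatched range, and all sample values are assumptions for illustration; the patent document itself is not reproduced here.

```python
def precision_for(table, category, similarity):
    """Look up the precomputed retrieval precision for a category and similarity.

    table maps a category to a list of (low, high, precision) ranges,
    where low <= similarity < high (hypothetical layout).
    """
    for low, high, precision in table.get(category, []):
        if low <= similarity < high:
            return precision
    return 0.0  # assumed fallback when no range matches


def rerank(results, table):
    """Replace each similarity with its accuracy value and re-sort descending."""
    scored = [
        (doc_id, precision_for(table, category, similarity))
        for doc_id, category, similarity in results
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Illustrative values only.
table = {
    "G06F": [(0, 50, 0.2), (50, 101, 0.8)],
    "H04L": [(0, 50, 0.1), (50, 101, 0.4)],
}
results = [("d1", "H04L", 90), ("d2", "G06F", 60)]
reranked = rerank(results, table)
```

Here `d2` overtakes `d1` despite its lower similarity, which shows that the technique changes only the display order and exposes nothing about underlying causes.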
However, the technique of Patent Document 3 merely replaces similarity with retrieval precision, based on the correspondence between the two, and corrects (re-sorts) the display order of the result documents. With the mechanism described in Patent Document 3, therefore, the user cannot learn which factors caused the retrieval to go poorly, or what to do next in light of those factors.
Similar document retrieval often requires that the retrieval work cycle of "specify search condition → execute retrieval → grasp the tendencies or factors of the search results → revise search condition → retrieve again" run efficiently, that is, that retrieval work be made efficient. This requires a mechanism that shows the user not only the search results but also information on their basis, causes, and countermeasures, and thereby helps the user approach the next retrieval efficiently and revise the search condition accurately.
The technique of Patent Document 3, however, is limited to re-sorting result documents based on the correspondence between similarity and retrieval precision. It does not disclose a mechanism for efficiently running the cycle of grasping the tendencies or factors of the search results, revising the search condition, and retrieving again. Consequently, it cannot solve problems (3) and (4).
Moreover, Patent Document 3 looks only at the similarity value itself and the category to which each result document belongs. Yet similarity, which quantifies how alike two documents are, is generally a value calculated under the influence of many finer-grained factors. Examples of such factors include: the quality and quantity of the feature words of the input document used for retrieval; unevenness in the content, structure, or length of the target documents; the quality or non-specificity of the feature words used in the target documents; and whether the document authors are the same or different.
Therefore, analyzing only the relationship between the similarity value and retrieval precision cannot identify the factors that caused the retrieval to go poorly. Identifying them requires analyzing the relationship between finer-grained factors and retrieval precision, clearly distinguishing the factors that raise retrieval precision from those that lower it, and presenting them to the user quantitatively. The technique of Patent Document 3 says nothing about identifying the factors behind poor retrieval, and so cannot solve problem (3).
Summary of the invention
The present invention was made in view of the above technical background and the study of the prior art. It provides a technique aimed in particular at solving problems (3) and (4) among the four problems of similar document retrieval described above. That is, when similar document retrieval goes poorly, the invention lets the user understand why, and what to do next to obtain better search results. By solving these problems, the invention lets the user's retrieval work cycle run efficiently.
Means for solving the problem
To solve problem (3), the present invention defines factors that affect the precision of similar document retrieval and then, for a search result, calculates for each factor the retrieval precision as seen from that factor and/or its deviation from the average precision, and presents them to the user. For example, the similar document retrieval auxiliary device and program of the invention use hardware resources to perform the following processing. First, each factor is analyzed for pairs of past input documents and their correct answer documents, value ranges of the factor are associated with retrieval precision, and the associations are stored in a table. Next, the same factor analysis is performed on a new input document. Then, by referring to the table, the retrieval precision corresponding to the value range that contains the factor value of the new input document is determined, and the retrieval precision and/or its deviation from the average retrieval precision over all past input documents is presented to the user.
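The table-building and lookup steps above can be sketched for a single factor as follows. This is a sketch under stated assumptions: each past input document is reduced to one (factor value, retrieval precision) pair, the value ranges are fixed in advance, the precision of a range is the plain mean over the past documents falling in it, and the deviation is a simple difference from the overall mean. The text above does not fix any of these details, and the names `build_precision_table` and `assess` are hypothetical.

```python
from statistics import mean


def build_precision_table(records, bins):
    """Associate each factor value range with the mean precision observed in it.

    records: (factor_value, retrieval_precision) pairs from past input documents.
    bins: (low, high) value ranges, matched as low <= value < high.
    """
    table = []
    for low, high in bins:
        in_range = [p for value, p in records if low <= value < high]
        if in_range:
            table.append((low, high, mean(in_range)))
    return table


def assess(value, table, overall_mean):
    """Return (precision, deviation from overall mean) for a new factor value."""
    for low, high, precision in table:
        if low <= value < high:
            return precision, precision - overall_mean
    return None  # the value falls outside every known range


# Illustrative past data: factor = number of feature words (hypothetical).
records = [(5, 0.2), (8, 0.4), (15, 0.8), (18, 0.6)]
table = build_precision_table(records, [(0, 10), (10, 20)])
overall = mean(p for _, p in records)
result = assess(12, table, overall)
```

A positive deviation suggests the factor value falls in a range where past retrievals did better than average; a negative one flags the factor as a possible cause of poor results.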
To further solve problem (4), the present invention provides a countermeasure table. For each factor viewpoint, this table stores, per factor group, countermeasure information for helping the user obtain better similar document retrieval results: the countermeasure content, which states what to do; the operation method, which states how to carry out that countermeasure; and image information showing what to operate in order to carry out the operation method. Then, when the set of result documents is presented to the user, the factor values stored in the precision influence degree table and the retrieval precision and/or deviation value are shown, and at least one of the countermeasure content, operation method, and image information recorded in the countermeasure table is shown attached to the corresponding factor group.
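One way to picture the countermeasure table is as a mapping from a (factor, factor group) pair to the three pieces of countermeasure information. The sketch below is only an assumed data layout; the factor names, group labels, texts, and image path are all hypothetical placeholders, not entries from the actual table.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Countermeasure:
    content: str    # what to do
    operation: str  # how to carry it out
    image: str      # reference to an image showing what to operate


# Hypothetical entries keyed by (factor name, factor group).
COUNTERMEASURES = {
    ("number of feature words", "fewer than 10"): Countermeasure(
        content="Add feature words to the search condition.",
        operation="Open the search condition editing screen and add words.",
        image="images/add_feature_word.png",
    ),
}


def lookup_countermeasure(factor: str, group: str) -> Optional[Countermeasure]:
    """Return the countermeasure stored for a factor group, if any."""
    return COUNTERMEASURES.get((factor, group))


advice = lookup_countermeasure("number of feature words", "fewer than 10")
```

Keying by factor group rather than by raw factor value keeps the table small, since one entry covers every input document whose factor value falls in that group.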
Effects of the invention
With the present invention, the user can grasp the basis of similar document retrieval results. That is, the user can see how well the retrieval went and, if it went poorly, why. Furthermore, when at least one of the countermeasure content, operation method, and image information recorded in the countermeasure table is shown attached to the factor group along with the retrieval precision and/or deviation value, the user can see what to do next to obtain better search results when retrieval has gone poorly. As a result, the retrieval work cycle runs efficiently, retrieval work time is shortened, and high-quality search results are obtained. Problems, configurations, and effects other than those above will become clear from the description of the embodiments below.
Brief description of the drawings
Fig. 1 shows a configuration example of the input document specification screen.
Fig. 2 shows a configuration example of the search condition editing screen corresponding to countermeasure information.
Fig. 3 shows a configuration example of the summary display screen for similar document retrieval results.
Fig. 4A shows a configuration example of the detailed display screen for similar document retrieval results (upper part of the screen).
Fig. 4B shows a configuration example of the detailed display screen for similar document retrieval results (lower part of the screen).
Fig. 5 is a schematic diagram of the functional block configuration of the similar document retrieval auxiliary device.
Fig. 6 shows a configuration example of the catalog 505.
Fig. 7 shows a configuration example of the bibliographic table 507.
Fig. 8 shows a configuration example of the teacher document table 508.
Fig. 9 shows a configuration example of the feature word table 510.
Fig. 10 shows a configuration example of the search result table 512.
Fig. 11 shows a configuration example of the factor table 514.
Fig. 12 shows a configuration example of the feature word comparison table 515.
Fig. 13 shows an example of the processing method executed by the factor data extraction unit 513.
Fig. 14 shows a configuration example of the retrieval precision table 517.
Fig. 15 shows an example of the processing method executed by the retrieval precision analysis unit 516.
Fig. 16 shows a concrete example of the processing method executed by the retrieval precision analysis unit 516.
Fig. 17 shows a configuration example of the precision influence degree table 520.
Fig. 18 shows an example of the processing method of the precision influence degree calculation unit 519.
Fig. 19 shows a hardware configuration example of the similar document retrieval auxiliary device.
Fig. 20A shows another configuration example of the detailed display screen for search results (upper part of the screen).
Fig. 20B shows another configuration example of the detailed display screen for search results (lower part of the screen).
Fig. 21 shows a configuration example of the countermeasure information display screen.
Fig. 22 shows a configuration example of the countermeasure table 522.
Reference numerals:
501 Document database
502 Feature word extraction unit
503 Word dictionary
504 Catalog generation unit
505 Catalog
506 Bibliographic information extraction unit
507 Bibliographic table
508 Teacher document table
509 Feature word collection unit
510 Feature word table
511 Similar document retrieval unit
512 Search result table
513 Factor data extraction unit
514 Factor table
515 Feature word comparison table
516 Retrieval precision analysis unit
517 Retrieval precision table
518 New input document number
519 Precision influence degree calculation unit
520 Precision influence degree table
521 Search result output unit
522 Countermeasure table
530 Input device
540 Output device
Embodiment
Embodiments of the present invention are described below with reference to the drawings. The following embodiment assumes a similar patent retrieval system that takes a patent document as the search condition and retrieves past patent documents whose inventive content is similar to that of the input patent document. Specifically, it assumes the following use case: when searching past patent documents for prior art relevant to a patent application under examination, the entire application document is input and patent documents with similar inventive content are retrieved.
Embodiments of the invention are not, however, limited to this use case. Although patent documents are the retrieval target in this embodiment, papers, news articles, design documents, e-mails, web pages, and the like may also be the target.
As the basis of similar document search results, this embodiment provides functions that let the user understand: how much each feature word in the input document contributed to the output of the search results; how well the similar document retrieval went; why the retrieval went poorly, if it did; and, in that case, what to do next to obtain better search results.
First, the input and output of the system are described using screen examples. Fig. 1 shows a configuration example of the input document specification screen. In the input document specification screen 100, the user enters into input area 101 the patent application number that identifies the document to be searched for. After the application number has been entered, pressing the "Search" button 103 executes similar document retrieval, and the search results are output on another screen. Pressing the "Clear" button 102 erases the contents of input area 101.
The input document specification screen 100 offers two retrieval options: check box 104, which selects whether to perform pre-editing, in which the feature words extracted from the input document and their weights are checked and corrected before retrieval is executed; and check box 105, which selects whether to expand the feature words extracted from the input document into synonyms before retrieval is executed. If the "Search" button 103 is pressed while check box 104 and/or 105 is selected, the screen shown in Fig. 2 for editing search conditions such as feature words and synonyms is displayed. The detailed configuration of that screen is described later.
This embodiment assumes that a document ID such as an application number is entered to specify the input document, but the text of a patent may instead be copied and pasted into the input area, or typed into it directly. Alternatively, the input document may be specified by selecting any document from documents displayed in a form such as a list of document search results.
Fig. 3 shows a configuration example of the summary display screen 300, which displays the search results for similar documents. In the summary display screen 300, the documents retrieved as similar documents are displayed in descending order of similarity to the input document. For each retrieved document, the following items are displayed: rank 308, indicating the retrieval rank; similarity 309; document ID 310, which is the application number; title of the invention 311, which corresponds to the document name; and applicant 312. Other bibliographic or text information, such as classification or abstract, may of course also be displayed.
In this embodiment, the upper part of the summary display screen 300 has a "Bibliography" button 301, which displays the bibliographic data of the documents selected with check boxes 307, and a "Text" button 302, which displays their text data. Pressing the "Back" button 304, also in the upper part of the screen, returns the display to the input document specification screen 100. Pressing the "Next" button 306 displays the next ten result documents, and pressing the "Previous" button 305 displays the previous ten.
Figs. 4A and 4B show a configuration example of the detailed display screen for similar document retrieval results. This screen is displayed by pressing the "Details" button 303 in the upper part of the summary display screen 300 (Fig. 3). Because of space limitations, Fig. 4A shows table 400, displayed in the upper part of the screen, and Fig. 4B shows table 470, displayed in the lower part. Table 400 presents the result of analyzing whether the similar document retrieval output in Fig. 3 went well and what the reasons are.
Table 400 consists of factor 410, which affects similar document retrieval precision; factor value 440, obtained for each factor; and "influence on retrieval precision" 450. Factor 410 consists of factor category 420, the category to which each factor belongs, and factor name 430. Factor value 440 consists of value 441, the factor value for the current input document, and field average 442, the average factor value over multiple teacher input documents. "Influence on retrieval precision" 450 consists of "applicable factor group" 451, the group to which value 441 belongs; retrieval precision 452, corresponding to the applicable factor group 451; and influence degree 453, which expresses the factor's degree of influence on similar document retrieval precision as a deviation from the average retrieval precision over all teacher input documents. A factor whose influence degree 453 is positive contributes more to raising retrieval precision the larger its absolute value; a factor whose influence degree 453 is negative is more of a cause of lowered retrieval precision the larger its absolute value. By checking the influence degree values, the user can see whether the retrieval went well and which factors lowered retrieval precision. Of course, only one of retrieval precision 452 and influence degree 453 may be displayed.
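How a user might read the influence degree column can be sketched as follows. The sketch assumes, as a simplification, that influence degree is the plain difference between a factor's retrieval precision and the average precision; the actual deviation formula is not given in this passage. The function name, labels, and sample values are illustrative.

```python
def influence_report(rows, mean_precision):
    """Label each factor by the sign of its influence and sort worst first.

    rows: (factor_name, retrieval_precision) pairs for the applicable groups.
    Influence is assumed here to be precision minus the overall mean.
    """
    report = []
    for name, precision in rows:
        influence = precision - mean_precision
        if influence > 0:
            effect = "raises precision"
        elif influence < 0:
            effect = "lowers precision"
        else:
            effect = "neutral"
        report.append((name, influence, effect))
    # Most negative influence first: the likeliest causes of poor retrieval.
    return sorted(report, key=lambda row: row[1])


# Illustrative values only.
rows = [("total hits", 0.7), ("number of feature words", 0.3)]
report = influence_report(rows, mean_precision=0.5)
```

Sorting by influence puts the factors most likely to have lowered precision at the top, which is the question the user brings to table 400.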
Table 470 shows the weighted feature words extracted from the input document on the vertical axis and the result documents on the horizontal axis. In table 470, the value for each feature word in each result document 472 is represented by shading whose density varies with the weight of that feature word. The vertical axis lists all 20 feature words extracted from the input document in descending order of weight, and the horizontal axis lists the top 30 result documents 472 in descending order of similarity.
The data on feature words 471 of the input document consist of: name 473 of the feature word; "top-rank hit count" 474, the number of the top 30 result documents 472 in which the feature word appears; occurrence frequency 475 of the feature word in the input document; intrinsic degree 476, calculated from the number of documents in the document database in which the feature word appears; and weight 477 of the feature word, calculated from occurrence frequency 475 and intrinsic degree 476.
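The passage says only that the weight is calculated from the occurrence frequency and the intrinsic degree, and that the intrinsic degree is calculated from the number of documents containing the word. As one plausible reading, the sketch below substitutes a common TF-IDF-style formula: intrinsic degree as log(N / document count) and weight as frequency times intrinsic degree. These formulas are an assumption, not the embodiment's stated method.

```python
import math


def intrinsic_degree(total_docs, docs_with_word):
    """Assumed IDF-style measure: rarer words get a higher value."""
    return math.log(total_docs / docs_with_word)


def word_weight(frequency, total_docs, docs_with_word):
    """Assumed weight: occurrence frequency times intrinsic degree."""
    return frequency * intrinsic_degree(total_docs, docs_with_word)


def top_feature_words(frequencies, doc_counts, total_docs, limit=20):
    """Return up to `limit` (word, weight) pairs in descending order of weight."""
    weighted = [
        (word, word_weight(freq, total_docs, doc_counts[word]))
        for word, freq in frequencies.items()
    ]
    return sorted(weighted, key=lambda pair: pair[1], reverse=True)[:limit]


# Illustrative counts: "the" appears in every document, so its weight is 0.
frequencies = {"retrieval": 5, "the": 50}
doc_counts = {"retrieval": 10, "the": 1000}
top = top_feature_words(frequencies, doc_counts, total_docs=1000)
```

Under this assumed formula a word that appears in every document contributes nothing, however often it occurs in the input document.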
Similarity 479 is represented by shading whose density varies with the similarity value of each result document 472. For category 480, the categories assigned to the input document are compared with those assigned to each result document 472, and a result document 472 is shaded more densely the deeper the hierarchy level down to which its category matches. For applicant 481, the applicant and inventors of the input document are compared with those of each result document 472; result documents 472 with the same inventor are shown with denser shading, and those with the same applicant with slightly lighter shading.
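The category and applicant comparisons can be sketched as below. The representation of a category as a tuple of hierarchy levels, the numeric shade levels, and the dictionary keys are assumptions made for illustration.

```python
def category_match_depth(input_category, result_category):
    """Count how many hierarchy levels match, starting from the top level."""
    depth = 0
    for a, b in zip(input_category, result_category):
        if a != b:
            break
        depth += 1
    return depth


def applicant_shade(input_doc, result_doc):
    """Return 2 (denser) for a shared inventor, 1 (lighter) for the same
    applicant only, and 0 for no match. The numeric levels are assumed."""
    if set(input_doc["inventors"]) & set(result_doc["inventors"]):
        return 2
    if input_doc["applicant"] == result_doc["applicant"]:
        return 1
    return 0


# Illustrative documents; category levels are hypothetical splits of a code.
source = {"applicant": "Company X", "inventors": ["P", "Q"]}
same_inventor = {"applicant": "Company Y", "inventors": ["Q"]}
same_applicant = {"applicant": "Company X", "inventors": ["R"]}
depth = category_match_depth(("G", "06", "F", "17"), ("G", "06", "F", "16"))
```

A deeper match depth would map to denser shading in the category column.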
In addition, when any one of the elements 473 to 477 constituting the feature words 471 of the input file is selected, the rows are re-sorted in descending order with the selected element as the key and displayed again. Region 482 shows cells shaded with a density corresponding to the value of the weight Wij of feature word i in the retrieval result file of rank j. The darker a cell, the more important the feature word is in that retrieval result file; a colorless cell indicates that the feature word is not contained in that retrieval result file. Instead of the weight Wij, the cells may be shaded according to the partial similarity Sij, which is the portion of the similarity Sj of the retrieval result file of rank j that is attributable to feature word i.
In the present embodiment, the similarity between files is calculated as 100 times the cosine of the angle formed by the vectors composed of the weighted feature words of the files. Therefore, the partial similarity Sij can be calculated by multiplying the weight of feature word i in the input file by the weight of feature word i in retrieval result file j, and dividing the result by the product of the magnitude of the feature word vector of the input file and the magnitude of the feature word vector of retrieval result file j. By referring to table 470, the user can grasp visually and intuitively which feature words contributed, and to what degree, to the output of the retrieval results.
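Written as a formula, the partial similarity described above might look as follows. The symbols for the input file weight vector and the result file weight vector are introduced here only for illustration, and the factor of 100 is an assumption made so that the partial similarities add up to the similarity Sj; the text itself does not state whether Sij carries that factor.

```latex
% w^{in}_i : weight of feature word i in the input file (illustrative symbol)
% W_{ij}   : weight of feature word i in the retrieval result file of rank j
S_{ij} = 100 \cdot \frac{w^{\mathrm{in}}_{i}\, W_{ij}}
                        {\lVert \mathbf{w}^{\mathrm{in}} \rVert \, \lVert \mathbf{W}_{j} \rVert},
\qquad
S_{j} = \sum_{i} S_{ij}
```

Under this reading, a colorless cell (Wij = 0) contributes nothing to Sj, which is consistent with the shading rule for region 482.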
In addition, by referring to table 470 (Fig. 4B), the user can understand the details behind the degrees of influence shown in table 400 (Fig. 4A). For example, table 400 shows that the value of the essential factor "total hits" 432 is "166", and this matches the total number of shaded cells in table 470. Therefore, by surveying table 470, the user can see at a glance how the 166 shaded cells are distributed.
Likewise, table 400 shows that the value of the essential factor "many-hit feature word count" 436 is "5", and this matches the number of feature words in table 470 whose top-rank hit count 474 is at or above the threshold (24 in the present embodiment, corresponding to 80%). Therefore, by surveying table 470, or by surveying the top rows obtained when table 470 is sorted in descending order with the top-rank hit count 474 as the key, the user can identify which feature words are many-hit feature words.
In this way, table 400, which shows the essential factors relating to similar document retrieval precision and their degrees of influence, and table 470, which shows the correspondence between feature words and retrieval result files, are displayed as a pair. By relating the two appropriately, the user can understand the tendencies of the retrieval results more accurately and in greater depth.
Next, with reference to the drawings, the configuration, data structures, and processing method of the similar document retrieval assistance system are described, including the processing that calculates the retrieval precision for each essential factor and its degree of influence (deviation) shown in Fig. 4A and Fig. 4B.
Fig. 5 shows the functional block configuration of the similar document retrieval assistance device 500 of the present embodiment. The patent document data to be searched is stored in document database 501 via input device 530. Feature word extraction unit 502 extracts, from each patent document stored in document database 501, the feature words, the weight representing the importance of each feature word, and the occurrence frequency and intrinsic degree used to calculate the weight. In the present embodiment, feature word extraction unit 502 refers to word dictionary 503 to perform morphological analysis, which divides text into words, and extracts words whose part of speech is noun or verb as feature words. Catalog generation unit 504 aggregates the feature words of each file obtained by feature word extraction unit 502, together with the numerical data on their weights, and stores them in catalog 505 in a form that allows similar document retrieval to be performed efficiently. Record extraction unit 506 extracts bibliographic information such as the publication date, filing date, patent classification, applicant, and inventor from each patent document stored in document database 501, separates it into item names and item values, and stores it per file in record table 507. The processing performed by feature word extraction unit 502, catalog generation unit 504, and record extraction unit 506 is implemented in most commercially available similar document retrieval systems and is therefore not described further in the present embodiment. The processing of these three units is carried out in advance, before an input file is actually specified and similar document retrieval is performed.
Fig. 6 shows a configuration example of catalog 505. In the present embodiment, catalog 505 consists of a weight catalog 600, which takes the files and the feature words contained in document database 501 as its two axes and holds the corresponding weight as the value; an occurrence frequency catalog 610, which holds the corresponding occurrence frequency as the value; and an intrinsic degree catalog 620, which consists of the feature words and their intrinsic degrees.
In the present embodiment, the weight w of feature word T in file d is calculated as follows. First, the logarithm log TF of the occurrence frequency TF of feature word T in file d is obtained. Next, the intrinsic degree IDF of feature word T is obtained as log(N/n), the logarithm of the number of files N stored in document database 501 divided by the number of files n that contain feature word T. Finally, the weight w is calculated as (1 + log TF) × log(N/n). When TF = 0, however, the value of w is set to 0. This method is widely known as the TF-IDF method and is therefore not described further.
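The weight calculation above can be sketched in a few lines of Python. This is only an illustration: the function and parameter names are hypothetical, the natural logarithm is an assumption (the text does not state the base), and the sketch assumes that at least one file contains the feature word.

```python
import math


def feature_word_weight(tf: int, n_total: int, n_containing: int) -> float:
    """Return w = (1 + log TF) * log(N / n), or 0 when TF = 0.

    tf           -- occurrence frequency TF of the feature word in the file
    n_total      -- number of files N in the document database
    n_containing -- number of files n containing the feature word (assumed > 0)
    """
    if tf == 0:
        return 0.0
    # Intrinsic degree (IDF); natural log is an assumption of this sketch.
    intrinsic_degree = math.log(n_total / n_containing)
    return (1.0 + math.log(tf)) * intrinsic_degree
```

A feature word that appears in every file gets log(N/n) = 0 and therefore a weight of 0, regardless of how often it occurs in the file.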
Fig. 7 shows a configuration example of record table 507. Record table 507 consists of sequence number 700, file ID 701, item name 702, and item value 703. In the present embodiment, data on the publication date, the filing date, the IPC and theme used as patent classifications, the applicant, and the inventor are stored per file from among the bibliographic items of the patent, but other bibliographic items may also be stored.
Returning to the description of Fig. 5. Teacher file table 508 is data consisting of multiple pairs. Each pair consists of an input file for which the patent documents to be retrieved (hereinafter "correct answer files") are already known (hereinafter "teacher input file") and a correct answer file corresponding to that teacher input file. Teacher file table 508 is entered by the user or the system administrator via input device 530.
Fig. 8 shows a configuration example of teacher file table 508. Teacher file table 508 consists of teacher data ID 801, teacher input file ID 802, and correct answer file ID 803, and stores multiple entries in a form that associates these items with one another. In the present embodiment, a patent cited in an examiner's office action issued by the patent office for a patent application whose examination has already been completed is defined as a "correct answer file" corresponding to the teacher input file. Of course, the user or the system administrator may define correct answer files independently from any viewpoint and register and accumulate teacher input files and correct answer files in association with each other, and correct answer files may also be specified according to other definitions. There may be multiple correct answer files for one teacher input file. Furthermore, when multiple correct answer files exist, only the file most similar to the input file may be used as the correct answer, or only the highest-ranked correct answer file in the similar document retrieval results may be used as the correct answer file.
Returning to the description of Fig. 5. Feature word collection unit 509 refers to catalog 505, extracts the feature words corresponding to a teacher input file stored in teacher file table 508 or to the new input file number 518 specified by the user via input device 530, and stores the extracted result in feature word table 510.
In the present embodiment, it is assumed that the feature words and the bibliographic data for the new input file number 518 and for the teacher input files in teacher file table 508 are all stored in catalog 505 and record table 507, respectively. Therefore, the feature words for these input files can be collected easily by retrieving the feature words of the input file, together with their weights and occurrence frequencies, from catalog 505 and storing them in feature word table 510.
The intrinsic degree can likewise be collected easily by retrieving the intrinsic degree corresponding to each extracted feature word from catalog 505 and storing it in feature word table 510. However, if the input file specification screen 100 shown in Fig. 1 allows the user to enter arbitrary text, the feature words of that text are not stored in catalog 505. In that case, it suffices to pass the entered text to feature word extraction unit 502, which extracts the feature words and assigns their weights.
Fig. 9 shows a configuration example of feature word table 510. Feature word table 510 consists of file ID 901, name 902, occurrence frequency 903, intrinsic degree 904, and weight 905.
Returning to the description of Fig. 5. Similar document retrieval unit 511 refers to catalog 505, retrieves files similar to the set of weighted feature words that feature word collection unit 509 stored in feature word table 510, calculates their similarity, and stores the top 30 retrieval results in retrieval result table 512.
As described above, in the present embodiment the similarity between files is calculated as 100 times the cosine of the angle formed by the vectors composed of the weighted feature words of the files. The similarity therefore takes a value between 0 and 100, and the closer it is to 100, the higher the similarity. The method of treating a set of feature words as a vector and obtaining the similarity of two files from the angle or inner product of their vectors is widely known as the vector space model and is therefore not described further.
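As a rough illustration of this similarity calculation and of keeping the top 30 results, the following Python sketch represents each file as a dictionary from feature word to weight. The function names and the dictionary representation are assumptions made for this example, and a real system would use catalog 505 rather than scanning every file.

```python
import math


def similarity(query: dict, doc: dict) -> float:
    """Return 100 * cosine of the angle between two weighted feature word vectors."""
    dot = sum(w * doc.get(word, 0.0) for word, w in query.items())
    norm_q = math.sqrt(sum(w * w for w in query.values()))
    norm_d = math.sqrt(sum(w * w for w in doc.values()))
    if norm_q == 0.0 or norm_d == 0.0:
        return 0.0  # assumption: an empty vector is treated as dissimilar
    return 100.0 * dot / (norm_q * norm_d)


def top_results(query: dict, docs: dict, k: int = 30) -> list:
    """Return the k (file ID, similarity) pairs with the highest similarity."""
    scored = [(doc_id, similarity(query, vec)) for doc_id, vec in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

Because TF-IDF weights are non-negative, the cosine cannot be negative, which is why the similarity stays in the range 0 to 100.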
Fig. 10 shows a configuration example of retrieval result table 512. Retrieval result table 512 consists of input file ID 1001, retrieval rank 1002, similarity 1003, and retrieval result file ID 1004. When the similar document retrieval results are output, the following option may also be added: the filing dates and publication dates of the input file and the retrieval result files are compared, and only patents that were published before the filing date of the input file are retrieved. Here, the processing of feature word collection unit 509 and similar document retrieval unit 511 is applied to all the teacher input files stored in teacher file table 508, so that feature word table 510 and retrieval result table 512 hold the feature words and the retrieval results, respectively, for multiple teacher input files.
Returning to the description of Fig. 5. For each teacher input file, essential factor data extraction unit 513 refers to at least one of feature word table 510 and retrieval result table 512, which store the data obtained by the above processing, and record table 507, extracts the values 441 corresponding to the essential factors 410 shown in Fig. 4A, and stores them in essential factor table 514. In addition, in order to generate table 470 of Fig. 4B, essential factor data extraction unit 513 analyzes the correspondence between the feature words and the retrieval result files and stores it, together with the data on the feature words and their weights, in feature word comparison table 515.
As also illustrated in Fig. 4A and Fig. 4B, the present embodiment uses the following eight essential factors that affect similar document retrieval precision. These essential factors can be broadly divided into three essential factor categories.
(Essential factor category 1) Feature word hit tendency
This category concerns the hit tendency between the feature words of the input file and the retrieval result files. That is, these are essential factors that can be calculated from the data of table 470 itself (stored in feature word comparison table 515), which shows the hit status between the feature words and the retrieval result files as in Fig. 4B. Specifically, it comprises the following six essential factors.
(Essential factor 1) Effective feature word count
This is the number of feature words in table 470 whose top-rank hit count 474 is at or above a pre-specified threshold (4 in the present embodiment). If this value is small, few feature words serve as clues for similar document retrieval, which may adversely affect retrieval precision.
(Essential factor 2) Total hits
This is the number of shaded cells in table 470, in other words the sum of the top-rank hit counts 474. If this value is small, few retrieval result files are hit by the feature words, which may adversely affect retrieval precision. Conversely, if this value is large, many retrieval result files are hit by the feature words and the similar files cannot be narrowed down to a small number, which may also adversely affect retrieval precision.
(Essential factor 3) High hits
This is the number of shaded cells in table 470 whose value is at or above a pre-specified threshold ("20" in the present embodiment), that is, the darkly shaded cells. If this value is small, the feature words that hit have low importance in the retrieval result files, so it is difficult to narrow the range of similar files, which may adversely affect retrieval precision.
(Essential factor 4) High hit rate
This is the high hits divided by the total hits. If this value is small, many of the feature words of the input file are unimportant in the retrieval result files, which may adversely affect retrieval precision.
(Essential factor 5) Value average
This is the average value of the shaded cells in table 470. If this value is small, many of the feature words of the input file are unimportant in the retrieval result files, which may adversely affect retrieval precision.
(Essential factor 6) Many-hit feature word count
This is the number of feature words of the input file that are contained in at least a pre-specified threshold number of retrieval result files (24 in the present embodiment, corresponding to 80%). Many-hit feature words are mostly words that are used frequently in the technical field (classification) concerned or words that are also used frequently in general documents. If the many-hit feature word count is large, the range of related files can be narrowed roughly, but the range cannot be narrowed by the main point of the file content (in a patent, the part that expresses the features of the invention, namely its novelty and inventive step), which may adversely affect retrieval precision.
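The six essential factors of this category can all be derived from the cell values of table 470. The sketch below is one possible way to do so, assuming the table is held as a dictionary that maps each feature word of the input file to a list of its weights in the top-ranked retrieval result files (0 meaning the word is absent). The function name, the dictionary keys, and the choice to return 0 when there are no hits are assumptions of this example; the default thresholds are the values given for the present embodiment.

```python
def hit_tendency_factors(weights: dict,
                         effective_min_hits: int = 4,
                         high_weight: float = 20.0,
                         many_hit_min: int = 24) -> dict:
    """Compute essential factors 1 to 6 from per-feature-word weight rows."""
    # Number of result files containing each feature word (shaded cells per row).
    hit_counts = {word: sum(1 for w in row if w > 0) for word, row in weights.items()}
    # Values of all shaded cells in the table.
    hit_cells = [w for row in weights.values() for w in row if w > 0]

    total_hits = len(hit_cells)
    high_hits = sum(1 for w in hit_cells if w >= high_weight)
    return {
        "effective_feature_words": sum(1 for c in hit_counts.values()
                                       if c >= effective_min_hits),
        "total_hits": total_hits,
        "high_hits": high_hits,
        # Assumption: rates and averages default to 0 when nothing was hit.
        "high_hit_rate": high_hits / total_hits if total_hits else 0.0,
        "value_average": sum(hit_cells) / total_hits if total_hits else 0.0,
        "many_hit_feature_words": sum(1 for c in hit_counts.values()
                                      if c >= many_hit_min),
    }
```

For a table with 20 feature words and 30 result files, each row would hold 30 values, and the total hits would equal the number of non-zero entries across all rows.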
(Essential factor category 2) Bibliographic hit tendency
This category concerns the commonality between the bibliographic information of the input file and the bibliographic information of the retrieval result files. Bibliographic information can be extracted easily from record table 507, so the commonality can be analyzed by comparing these items. Specifically, it comprises the following essential factor.
(Essential factor 7) Classification hit count
This is the number of retrieval result files whose assigned classification is common to the classification assigned to the input file. Patent documents have multiple classification systems (IPC/FI, theme/F-term), each of which has a multi-level structure (section, group, main group, and so on). In the present embodiment, the number of retrieval result files with a common classification is calculated at the IPC main group level, but it may also be calculated at other levels.
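A minimal sketch of the classification hit count follows, assuming the classifications of each file have already been reduced to a set of IPC main group strings. The function name and the string format of the classification codes are hypothetical; a result file is counted when it shares at least one main group with the input file.

```python
def classification_hit_count(input_groups: set, result_groups: list) -> int:
    """Count result files sharing at least one IPC main group with the input file.

    input_groups  -- set of main group codes assigned to the input file
    result_groups -- one set of main group codes per retrieval result file
    """
    return sum(1 for groups in result_groups if input_groups & groups)
```

Calculating the count at another hierarchy level would only require truncating the codes to that level before building the sets.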
Other essential factors concerning the bibliographic hit tendency, besides the classification hit count, include the "applicant hit count", which is the number of retrieval result files with the same inventor or applicant, and the "filing date deviation", which is the number of retrieval result files whose filing date interval is at or above a threshold, or the mean of those intervals. These essential factors may also be used.
(Essential factor category 3) Similarity
This category concerns the values of the similarity of the retrieval result files with respect to the input file. Specifically, it comprises the following essential factor.
(Essential factor 8) Similarity attenuation rate
This quantifies how the similarity of the top-ranked retrieval result files decays as the rank descends. Specifically, the ratio of the similarity of the retrieval result file at a pre-specified rank R2 (30 in the present embodiment) to the similarity of the retrieval result file at a pre-specified rank R1 (1 in the present embodiment) is taken as the similarity attenuation rate of the retrieval result. If the similarity attenuation rate is low, a large number of similar files with comparable similarity are output, which may adversely affect retrieval precision.
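The ratio defined above is simple to compute from the ranked list of similarities. The following sketch assumes a list ordered by rank, so that index 0 holds the similarity of the rank 1 file; the function name is hypothetical, and the sketch assumes the similarity at rank R1 is not zero.

```python
def similarity_attenuation_rate(similarities: list, r1: int = 1, r2: int = 30) -> float:
    """Return similarity at rank R2 divided by similarity at rank R1.

    similarities -- similarities ordered by rank; similarities[k - 1] is rank k
    """
    # Assumption: the similarity at rank R1 is non-zero.
    return similarities[r2 - 1] / similarities[r1 - 1]
```

For example, if the rank 1 file has similarity 90 and the rank 30 file has similarity 45, the value stored for this essential factor would be 0.5.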
Fig. 11 shows a configuration example of essential factor table 514. Essential factor table 514 consists of input file ID 1101, correct answer file ID 1102 for the input file, classification 1103 assigned to the input file (in the present embodiment, the theme assigned to the patent document is stored), and retrieval rank 1104 of the correct answer file ID 1102 in the similar document retrieval results. The items from effective feature word count 1105 to similarity attenuation rate 1112 correspond to the essential factors described above and store the values (essential factor values) calculated for each input file ID 1101. As described later, when the degree of influence of each essential factor on similar document retrieval precision is to be calculated separately for each technical field, classification 1103 is used to filter the teacher input files.
Fig. 12 shows a configuration example of feature word comparison table 515. Feature word comparison table 515 is divided into a part 1201, which stores the data on the feature words of the input file, and a part 1210, which stores the weights of the feature words in the retrieval result files. The former consists of the name 1202 of the feature word, the hit count 1203 of the feature word among the 30 retrieval result files, the occurrence frequency 1204 of the feature word in the input file, the intrinsic degree 1205 of the feature word in document database 501, and the weight 1206 of the feature word. Feature word comparison table 515 is also referred to when table 470 shown in Fig. 4B is displayed.
Fig. 13 shows an example of the processing performed by essential factor data extraction unit 513. The processing of essential factor data extraction unit 513 consists of the following: feature word comparison table generation processing 1302, which, in order to efficiently extract the values of the essential factors belonging to the essential factor category "feature word hit tendency", generates feature word comparison table 515 storing data on which feature words of the input file hit in the retrieval result files; and processing 1303 to 1310, which calculate each essential factor value for each input file with reference to feature word comparison table 515 and other tables.
Essential factor data extraction unit 513 performs the following processing. In step 1301, essential factor data extraction unit 513 judges whether there is an unprocessed input file, and ends the processing if there is none. If there is an unprocessed input file, essential factor data extraction unit 513 performs feature word comparison table generation processing 1302.
Feature word comparison table generation processing 1302 consists of steps 1351 to 1356 described below. In step 1351, essential factor data extraction unit 513 retrieves the name, occurrence frequency, intrinsic degree, and weight of each feature word of the input file from feature word table 510 and stores them in the corresponding regions of feature word comparison table 515. In the following step 1352, essential factor data extraction unit 513 extracts from retrieval result table 512 the pre-specified number M (30 in the present embodiment) of top-ranked retrieval result files for this input file. In the following step 1353, essential factor data extraction unit 513 extracts, from weight catalog 600 of catalog 505, the feature words and weights corresponding to each of the M extracted retrieval result files.
In the following step 1354, essential factor data extraction unit 513 judges whether there is an unprocessed feature word for this input file. If there is no unprocessed feature word, essential factor data extraction unit 513 proceeds to step 1303. If there is an unprocessed feature word, essential factor data extraction unit 513 first, in step 1355, retrieves the weight of this feature word in each of the M retrieval result files that contain it, and stores each weight in the region of feature word comparison table 515 that corresponds to that retrieval result file and this feature word.
In the following step 1356, essential factor data extraction unit 513 counts how many of the M retrieval result files contain this feature word, stores the count in the "hit count 1203" region of feature word comparison table 515 (Fig. 12), and returns to step 1354.
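The loop over steps 1354 to 1356 can be pictured with the short Python sketch below. It assumes that the input file and each of the M retrieval result files are available as dictionaries from feature word to weight; the function name and the shape of the returned table are assumptions for illustration and do not reflect the actual layout of feature word comparison table 515.

```python
def build_comparison_table(input_words: dict, result_files: list) -> dict:
    """Build a per-feature-word row of weights and a hit count.

    input_words  -- feature word -> weight in the input file
    result_files -- M dictionaries (feature word -> weight), ordered by rank
    """
    table = {}
    for word in input_words:                      # step 1354: next feature word
        # Step 1355: weight of this word in each result file (0 if absent).
        row = [doc.get(word, 0.0) for doc in result_files]
        # Step 1356: number of result files that contain this word.
        hit_count = sum(1 for w in row if w > 0)
        table[word] = {"weights": row, "hit_count": hit_count}
    return table
```

Feature words that appear only in the result files are ignored, since the table is keyed by the feature words of the input file.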
Effective feature word count calculation processing 1303 calculates the value of the essential factor "effective feature word count" and consists of step 1373. In step 1373, essential factor data extraction unit 513 counts the feature words in feature word comparison table 515 (Fig. 12) whose "hit count 1203" is at or above the pre-specified threshold (4 in the present embodiment), and stores the count in the effective feature word count region of essential factor table 514.
Total hits calculation processing 1304 calculates the value of the essential factor "total hits" and consists of step 1374. In step 1374, essential factor data extraction unit 513 obtains the sum of the "hit count 1203" values in feature word comparison table 515 (Fig. 12), and stores it in the total hits region of essential factor table 514.
High hits calculation processing 1305 calculates the value of the essential factor "high hits" and consists of step 1375. In step 1375, essential factor data extraction unit 513 counts the feature word weights that were retrieved in step 1355 and stored in feature word comparison table 515 and that are at or above the pre-specified threshold (20 in the present embodiment), and stores the count in the high hits region of essential factor table 514.
High hit rate calculation processing 1306 calculates the value of the essential factor "high hit rate" and consists of step 1376. In step 1376, essential factor data extraction unit 513 divides the high hits obtained in step 1375 by the total hits obtained in step 1374, and stores the result in the high hit rate region of essential factor table 514.
Value average calculation processing 1307 calculates the value of the essential factor "value average" and consists of step 1377. In step 1377, essential factor data extraction unit 513 obtains the average of those feature word weights retrieved in step 1355 and stored in feature word comparison table 515 that are greater than 0, and stores it in the value average region of essential factor table 514.
Many-hit feature word count calculation processing 1308 calculates the value of the essential factor "many-hit feature word count" and consists of step 1378. In step 1378, essential factor data extraction unit 513 counts the feature words in feature word comparison table 515 (Fig. 12) whose "hit count 1203" is at or above the pre-specified threshold (24 in the present embodiment), and stores the count in the many-hit feature word count region of essential factor table 514.
Classification hit count calculation processing 1309 calculates the value of the essential factor "classification hit count" and consists of step 1379. In step 1379, essential factor data extraction unit 513 extracts from record table 507 the IPC main groups of this input file and of each of the M retrieval result files, obtains the number of retrieval result files that have at least one IPC main group in common with this input file, and stores it in the classification hit count region of essential factor table 514.
Similarity attenuation rate calculation processing 1310 calculates the value of the essential factor "similarity attenuation rate" and consists of step 1380. In step 1380, essential factor data extraction unit 513 obtains, from retrieval result table 512, the ratio of the similarity of the retrieval result file at the pre-specified rank R2 (30 in the present embodiment) to the similarity of the retrieval result file at the pre-specified rank R1 (1 in the present embodiment), and stores it in the similarity attenuation rate region of essential factor table 514. Essential factor data extraction unit 513 then returns to step 1301.
Returning to the description of Fig. 5. Retrieval precision analysis unit 516 calculates the retrieval precision for each essential factor from the essential factor data stored in essential factor table 514 for the set of teacher input files in teacher file table 508, and calculates its difference (deviation) from the mean retrieval precision of all the teacher input files. The calculated deviation is later presented to the user as an index of the degree of influence of each essential factor on retrieval precision. The results calculated here are stored in retrieval precision table 517. In the present embodiment, retrieval precision is defined as "the proportion of input files for which the retrieval rank of the correct answer file is within a pre-specified threshold R (100 in the present embodiment)". Other definitions may of course be used.
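Under this definition, retrieval precision and the deviation of a subset of input files from the overall mean could be computed as in the sketch below. The function names are hypothetical, the precision is returned as a fraction between 0 and 1 rather than a percentage, and the use of None for a correct answer file that was not found at all is an assumption of this example.

```python
from typing import List, Optional


def retrieval_precision(correct_ranks: List[Optional[int]], r: int = 100) -> float:
    """Fraction of input files whose correct answer file is ranked within R."""
    if not correct_ranks:
        return 0.0  # assumption: an empty set of input files has precision 0
    hits = sum(1 for rank in correct_ranks if rank is not None and rank <= r)
    return hits / len(correct_ranks)


def precision_deviation(group_ranks: List[Optional[int]],
                        all_ranks: List[Optional[int]],
                        r: int = 100) -> float:
    """Precision of one group minus the precision of all teacher input files."""
    return retrieval_precision(group_ranks, r) - retrieval_precision(all_ranks, r)
```

A positive deviation would mean that input files in the group are retrieved more accurately than average, and a negative one that they are retrieved less accurately.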
Fig. 14 shows a configuration example of retrieval precision table 517. Retrieval precision table 517 consists of essential factor ID 1401, which identifies the essential factor; essential factor category 1402, which groups the essential factors; essential factor name 1403; essential factor group ID 1404, which identifies each essential factor group into which an essential factor is divided; essential factor group name 1405; lower limit 1406 of the values the essential factor group can take; upper limit 1407 of the values the essential factor group can take; retrieval precision 1408 of the teacher input files belonging to the essential factor group; and "deviation from mean precision" 1409, which is the difference between retrieval precision 1408 and the retrieval precision of all the teacher input files.
In retrieval precision table 517, essential factor ID 1401, essential factor category 1402, and essential factor name 1403 are data fixed in advance. Each essential factor is divided into three groups in the present embodiment, but it may also be divided into a number of groups specified by the user.
Fig. 15 shows an example of the processing performed by retrieval precision analysis unit 516. Fig. 16 shows a concrete example of this processing.
As shown in Fig. 15, retrieval precision analysis unit 516 first judges in step 1501 whether there is an unprocessed essential factor, and ends the processing if there is none. If there is an unprocessed essential factor, retrieval precision analysis unit 516, in step 1502, retrieves from essential factor table 514 the input file ID 1101, the retrieval rank 1104, and the essential factor value (one of the values 1105 to 1112) corresponding to the essential factor being processed, and temporarily stores them as a two-dimensional array. An example of the result up to this point is shown in table 1600 at the left end of Fig. 16.
In the present embodiment, retrieval precision analysis unit 516 generates retrieval precision table 517 using all the teacher input files stored in teacher file table 508. However, the teacher input files may be filtered based on classification 1103 of essential factor table 514, so that retrieval precision table 517 is generated using only the data on teacher input files to which a specific classification has been assigned. Similar document retrieval precision can be expected to depend to a large extent on the technical field. It can therefore be expected that analyzing only the teacher input files that satisfy a specific condition is effective. The filtering criterion is not limited to classification 1103; the filing date, the applicant, or the like may also be used as the criterion.
Next, in step 1503, the retrieval precision analysis unit 516 calculates, over all the essential factor values taken out, the proportion of input files whose correct answer file has a retrieval rank within a predesignated threshold R (100 in the present embodiment), and takes this proportion as the "average precision".
Next, in step 1504, the retrieval precision analysis unit 516 sorts the two-dimensional array of input file ID, retrieval rank, and essential factor value stored in step 1502 in ascending order, using the essential factor value as the key. Table 1610 at the center of Figure 16 shows an example of the result up to this point.
Next, in step 1505, the retrieval precision analysis unit 516 divides the two-dimensional array, according to the magnitude of the essential factor value, into a predesignated number N of essential factor groups (3 in the present embodiment). Items 1612 to 1614 in table 1610 at the right of Figure 16 show an example of the result up to this point. In the example of Figure 16, the essential factor groups "low" and "high" each consist of 5 input files, and "medium" consists of 10 input files. The number or proportion of input files placed in each essential factor group may be the same for all groups or may vary from group to group, and may also be specified by the user.
In the following step 1506, the retrieval precision analysis unit 516 judges whether an unprocessed essential factor group exists. If none exists, the retrieval precision analysis unit 516 returns to step 1501 and proceeds to the next essential factor. If an unprocessed essential factor group exists, the retrieval precision analysis unit 516 first obtains, in step 1507, the upper limit and lower limit of the essential factor values for that group. Item 1614 in table 1610 at the right of Figure 16 shows an example of the result of this step.
Essential factor values include those that take discrete values and those that take continuous values. For example, the valid feature word count is a discrete value consisting of integers, whereas the similarity decay rate is a continuous value taking real numbers.
When the upper and lower limits of each essential factor group are determined, no value on the border between adjacent essential factor groups may be left without a group. If such a value exists, it must be decided which essential factor group it is placed in. In the case of Figure 16, for example, the upper limit of the essential factor group "low" is "12" and the lower limit of the essential factor group "medium" is "14", so it is undetermined where an essential factor value of "13" should go. The present embodiment therefore divides each essential factor into the three groups "low", "medium", and "high", and resolves the problem with the heuristic that every value left without a group is included in "medium". As shown at 1614 in table 1610 at the right of Figure 16, this processing changes the lower limit of the essential factor group "medium" from "14" to "13". Other methods may of course be used, such as computing the average of the upper limit of "low" and the lower limit of "medium" and dividing the gap evenly.
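The boundary heuristic can be illustrated with a short Python sketch. It assumes integer-valued essential factors and inclusive (lower, upper) pairs; the function name and every bound other than the "12" and "14" quoted from Figure 16 are hypothetical.

```python
def absorb_gaps_into_medium(low, medium, high):
    """Stretch "medium" so that no integer value between groups is unassigned.

    Each argument is an inclusive (lower, upper) pair of integer essential
    factor values. Assumption: values are discrete integers; a continuous
    factor would need a different boundary rule.
    """
    low_upper = low[1]
    high_lower = high[0]
    # "medium" starts right after "low" ends and stops right before "high" starts.
    new_medium = (min(medium[0], low_upper + 1), max(medium[1], high_lower - 1))
    return low, new_medium, high


# Figure 16 gives low.upper = 12 and medium.lower = 14; the other bounds
# below are made up for the illustration.
low, medium, high = absorb_gaps_into_medium((3, 12), (14, 20), (22, 30))
print(medium)
```

With these inputs the lower limit of "medium" moves from 14 to 13, matching the adjustment described for Figure 16.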
In the following step 1508, the retrieval precision analysis unit 516 calculates the retrieval precision for the retrieval ranks corresponding to the essential factor values in this essential factor group, by the same method as in step 1503. In the following step 1509, the retrieval precision analysis unit 516 subtracts the average precision calculated in step 1503 from the retrieval precision of this essential factor group calculated in step 1508, and thereby obtains the deviation value (difference) between the two. Table 1610 at the right of Figure 16 shows an example of the result up to this point. In that table, the essential factor group "low" contains 5 teacher input files, 2 of which have a retrieval rank within 100, so the retrieval precision of the group "low" is 40% (2/5). There are 20 teacher input files in total, and the average precision (overall retrieval precision) 1616 is 60% (12/20). The deviation value 1617 of the retrieval precision of the group "low" from the average precision is therefore -20% (= 40% - 60%). Likewise, the deviation values 1617 of the essential factor groups "medium" and "high" are 0% and +20%, respectively.
In the following step 1510, the retrieval precision analysis unit 516 stores the upper limit, lower limit, retrieval precision, and deviation value calculated for the essential factor group in the area of the retrieval precision table 517 for that group. The processing then returns to step 1506.
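Steps 1503 to 1510 for a single essential factor can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the names `GroupStats`, `precision`, and `analyze_factor` are hypothetical, and the group sizes are passed in explicitly because the text leaves the split rule open.

```python
from dataclasses import dataclass


@dataclass
class GroupStats:
    name: str
    lower: float      # smallest essential factor value in the group
    upper: float      # largest essential factor value in the group
    precision: float  # share of files whose correct answer ranks within R
    deviation: float  # group precision minus the overall average precision


def precision(ranks, r=100):
    """Proportion of retrieval ranks that fall within the threshold R."""
    return sum(1 for rank in ranks if rank <= r) / len(ranks)


def analyze_factor(rows, sizes, names, r=100):
    """rows: (input_file_id, retrieval_rank, factor_value) per teacher input file.

    sizes gives the number of files per group (an assumption of this sketch).
    """
    overall = precision([rank for _, rank, _ in rows], r)   # step 1503
    ordered = sorted(rows, key=lambda row: row[2])           # step 1504
    stats, start = [], 0
    for name, size in zip(names, sizes):                     # steps 1505 to 1510
        chunk = ordered[start:start + size]
        start += size
        values = [value for _, _, value in chunk]            # step 1507
        p = precision([rank for _, rank, _ in chunk], r)     # step 1508
        stats.append(GroupStats(name, min(values), max(values), p, p - overall))
    return overall, stats
```

Called with 20 rows split 5/10/5 into "low", "medium", and "high", where 2, 6, and 4 files respectively have a rank within 100, the sketch reproduces the 40%, 60%, and 80% group precisions and the 60% average of the Figure 16 example, up to floating-point rounding.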
Returning to the explanation of Fig. 5, the precision influence degree calculation unit 519 takes the new input file number 518 specified by the user and, in the same way as for teacher input files, compares the essential factor table 514 obtained through the following processing with the retrieval precision table 517. Here, the essential factor table 514 is obtained through (1) collection of the feature words and their weights by the feature word collection unit 509, (2) acquisition of the similar document retrieval result by the similar document retrieval unit 511, and (3) calculation of the essential factor values by the essential factor data extraction unit 513. Through this comparison, the precision influence degree calculation unit 519 determines, for each essential factor, the essential factor group corresponding to the essential factor value of the new input file, then determines the degree of influence on retrieval precision (the deviation value from the average precision), and stores it in the precision influence table 520.
Figure 17 shows a configuration example of the precision influence table 520. The precision influence table 520 consists of the essential factor ID 1701, the essential factor classification 1702, the essential factor name 1703, the essential factor value 1704, the corresponding essential factor group 1705, the retrieval precision 1706 for the corresponding essential factor group, and the deviation 1707 from the average precision.
Figure 18 shows an example of the processing performed by the precision influence degree calculation unit 519. In step 1801, the precision influence degree calculation unit 519 judges whether an unprocessed essential factor exists, and ends the processing if there is none. If an unprocessed essential factor exists, then in step 1802 the unit extracts from the essential factor table 514 the essential factor ID and the essential factor value of the new input file for that essential factor. In step 1803, the unit compares the extracted essential factor value with the upper and lower limits of the corresponding essential factor in the retrieval precision table 517 and determines the essential factor group to which the value belongs. In step 1804, the unit takes out the essential factor ID 1401, the essential factor classification 1402, the essential factor name 1403, the essential factor group name 1405, the retrieval precision 1408, and the deviation 1409 from the average precision for the determined essential factor group and, together with the essential factor value, stores them in the essential factor ID 1701, the essential factor classification 1702, the essential factor name 1703, the essential factor value 1704, the corresponding essential factor group 1705, the retrieval precision 1706, and the deviation 1707 from the average precision of the precision influence table 520.
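The range lookup of steps 1803 and 1804 can be sketched in Python as below. The table rows, the dictionary layout, and the function name `lookup_influence` are assumptions for illustration; the precisions and deviations follow the Figure 16 example, but only the bounds 12 and 13 come from the text.

```python
# One essential factor's rows of a retrieval precision table.
# Assumption: every bound except low.upper = 12 and medium.lower = 13
# is invented for this illustration.
PRECISION_TABLE = [
    {"group": "low", "lower": 1, "upper": 12, "precision": 0.40, "deviation": -0.20},
    {"group": "medium", "lower": 13, "upper": 18, "precision": 0.60, "deviation": 0.00},
    {"group": "high", "lower": 19, "upper": 40, "precision": 0.80, "deviation": 0.20},
]


def lookup_influence(value, table=PRECISION_TABLE):
    """Return the influence-table entry for a new input file's factor value."""
    for row in table:
        # Step 1803: compare the value with each group's lower and upper limits.
        if row["lower"] <= value <= row["upper"]:
            # Step 1804: copy the group's precision and deviation next to the value.
            return {"value": value, **row}
    return None  # value lies outside every stored range


print(lookup_influence(18))
```

A value outside all stored ranges returns `None` in this sketch; how such a value should be treated is not specified in the text, so a real system would need its own rule.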
Returning to the explanation of Fig. 5, the retrieval result output unit 521 generates the output screens shown in Fig. 4A and Fig. 4B from the feature word comparison table 515 and the precision influence table 520, and presents them to the user via the output device 540. Table 400 of Fig. 4A can be generated easily from the precision influence table 520, and table 470 of Fig. 4B can be generated easily from the feature word comparison table 515.
As described later, for essential factors that lower similar document retrieval precision (those whose deviation value from the average precision is negative), the countermeasure table 522 stores, in association with each essential factor, countermeasure information telling the user what to do next in order to improve similar document retrieval precision from the viewpoint of that essential factor.
As described above, the similar document retrieval assistance device of the present embodiment, configured with the functional blocks shown in Figure 5, can present to the user the essential factors affecting retrieval precision and their effect (the deviation from the average precision) as the grounds for a similar document retrieval result.
Figure 19 shows a hardware configuration example of the similar document retrieval assistance device of the present embodiment. The device is roughly divided into a processing device 1950 that performs computation, an input device 1930 used by the user to enter operations or data, an output device 1940 that outputs computation results to the user, and a storage device 1960 that stores the programs and data used in the processing of the processing device 1950.
The input device 1930 consists of a keyboard 1951 and a mouse 1952. The output device 1940 consists of an output monitor 1953. When input/output data is exchanged with another computer, it is sent and received via a network 1954.
The storage device 1960 consists of a work area 1961 that temporarily stores data being processed by the processing device 1950; areas that store data, namely a document database storage area 1962, a word dictionary storage area 1963, a catalog storage area 1964, a record table storage area 1965, a teacher file table storage area 1966, a retrieval result table storage area 1967, a feature word table storage area 1968, an essential factor table storage area 1969, a feature word comparison table storage area 1970, a retrieval precision table storage area 1971, a precision influence table storage area 1972, and a countermeasure table storage area 1973; and areas that store programs, namely a feature word extraction unit storage area 1974, a catalog generation unit storage area 1975, a record extraction unit storage area 1976, a feature word collection unit storage area 1977, a similar document retrieval unit storage area 1978, an essential factor data extraction unit storage area 1979, a retrieval precision analysis unit storage area 1980, a precision influence degree calculation unit storage area 1981, and a retrieval result output unit storage area 1982.
The processing device 1950 loads the necessary programs and data from the storage device 1960, repeatedly stores the results of execution in the storage device 1960, and thereby carries out the prescribed processing.
Next, variations of the above embodiment are described.
(Variation 1)
In the above embodiment, when calculating the retrieval precision for each essential factor from the teacher input files, the retrieval precision analysis unit 516 divides the essential factor into several essential factor groups and calculates the retrieval precision for each group; the precision influence degree calculation unit 519 then compares the essential factor value obtained from a new input file with the essential factor groups and determines the retrieval precision of the corresponding group.
In this variation, by contrast, no essential factor group is determined. Instead, the teacher input files whose essential factor value equals or is close to the value obtained from the new input file are determined, and the retrieval precision is calculated from those teacher input files.
For example, when the essential factor value obtained from a new input file is "18" in Figure 16, the above embodiment regards it as belonging to the essential factor group "medium", with a retrieval precision of 60% and a deviation value of 0%. In this variation, the teacher input files whose essential factor value is "18" or close to it are determined. If 6 teacher input files, corresponding to 30% of the whole, are extracted from the values around the essential factor value "18", the result is 6 teacher input files (#12 to #17 in table 1610 at the center of Figure 16) with essential factor values from "17" to "19". The retrieval precision for these 6 files is 67% (4/6), and the deviation value is +7% (67% - 60%).
This variation can be realized by having the precision influence degree calculation unit 519 perform two processes: extracting, from the essential factor data stored in the essential factor table 514, a fixed number of teacher input files whose essential factor value equals or is close to that of the new input file, and calculating the retrieval precision from the retrieval ranks of the extracted teacher input files.
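The neighborhood-based calculation of Variation 1 can be sketched in Python as follows. The text does not fix how "close" files are chosen, so this sketch assumes the k files with the smallest absolute difference in essential factor value, where k is a fraction of all teacher input files; `neighborhood_precision` and its parameters are hypothetical names.

```python
def neighborhood_precision(rows, new_value, fraction=0.3, r=100):
    """rows: (retrieval_rank, factor_value) for every teacher input file.

    Returns (local precision, deviation from the overall average precision).
    Assumption: "nearby" means the k closest values by absolute difference.
    """
    k = max(1, round(len(rows) * fraction))
    # Pick the k teacher input files whose factor value is closest to new_value.
    nearest = sorted(rows, key=lambda row: abs(row[1] - new_value))[:k]
    overall = sum(rank <= r for rank, _ in rows) / len(rows)
    local = sum(rank <= r for rank, _ in nearest) / len(nearest)
    return local, local - overall
```

With 20 teacher input files, a fraction of 0.3 selects 6 files; if 4 of those 6 have a correct answer ranked within 100 and the overall precision is 60%, the function returns roughly 0.67 and +0.07, in line with the worked example above.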
(Variation 2)
In the above embodiment, the degree of influence on retrieval precision is calculated as a deviation value for each of 8 kinds of essential factors, and the analysis is carried out on the premise that each essential factor is independent.
In this variation, by contrast, two or more essential factors are combined to form "integrated essential factor groups" that combine the essential factor groups of each essential factor. That is, the retrieval precision is calculated for each integrated essential factor group from the teacher input files, and the corresponding integrated essential factor group is determined from the combination of essential factor values obtained from the new input file. The corresponding retrieval precision and the deviation value from the average precision are then determined and presented to the user. Which essential factors are combined may be fixed in advance or may be selected by the user.
For example, the essential factor "total hits" is combined with the essential factor "similarity decay rate". If each consists of 3 essential factor groups, 9 (= 3 × 3) integrated essential factor groups are generated. In step 1504 of the processing shown in Figure 15, when sorting by essential factor, the retrieval precision analysis unit 516 sorts by the first essential factor value and divides the result into 3 groups, then sorts each of those groups by the second essential factor value and divides each into 3 groups. Repeating this processing generates the integrated essential factor groups. The subsequent processing is the same as in the embodiment.
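The nested sort-and-split that builds the integrated essential factor groups can be sketched in Python as below. Equal-sized splits are an assumption of this sketch (the text allows other group sizes), and `split` and `integrated_groups` are hypothetical helper names.

```python
def split(rows, n, key):
    """Sort rows by key and cut them into n consecutive, near-equal parts."""
    ordered = sorted(rows, key=key)
    size, extra = divmod(len(ordered), n)
    parts, start = [], 0
    for i in range(n):
        end = start + size + (1 if i < extra else 0)
        parts.append(ordered[start:end])
        start = end
    return parts


def integrated_groups(rows, n=3):
    """rows: (input_file_id, first_factor_value, second_factor_value).

    Returns a dict keyed by (first-factor group index, second-factor group
    index), giving n * n integrated groups.
    """
    groups = {}
    for i, part in enumerate(split(rows, n, key=lambda row: row[1])):
        # Within each first-factor group, sort and split again by the second factor.
        for j, sub in enumerate(split(part, n, key=lambda row: row[2])):
            groups[(i, j)] = sub
    return groups
```

Each resulting group can then be passed through the same precision and deviation calculation used for a single essential factor.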
(Extension example 1)
Next, an extension example of the above embodiment is described. In the above embodiment, the degree of influence of each essential factor on retrieval precision is presented to the user through the display shown in Fig. 4A and Fig. 4B. From the presented content, the user can understand which essential factors raise or lower retrieval precision.
However, not every user necessarily understands the countermeasure, that is, what specifically should be done and how to operate the system, in order to obtain the best retrieval result. A user who does not understand the countermeasure will stop the retrieval work at this point, and the work cannot proceed quickly and smoothly.
In this extension example, therefore, for essential factors that lower similar document retrieval precision (those whose deviation value from the average precision is negative), countermeasure information on what to do next to improve similar document retrieval precision from the viewpoint of that essential factor is associated with the essential factor and presented to the user. Specifically, the configuration is the same as the functional blocks shown in Figure 5 and includes the countermeasure table 522 storing the countermeasure information; in response to a request from the user, the "countermeasure content" describing what to do next and the "operation method" describing how specifically to do it are presented to the user.
Figure 20A and Figure 20B show a configuration example of the detailed display screen for similar document retrieval results in this extension example. In Figure 20A and Figure 20B, parts corresponding to those in Fig. 4A and Fig. 4B are given the same reference signs. Figure 20A shows table 400 displayed in the upper part of the screen, and Figure 20B shows table 470 displayed in the lower part of the screen. Table 400 of Figure 20A differs from table 400 of Fig. 4A in that an item showing the countermeasure method 2001 for each essential factor has been added.
For example, an essential factor whose degree of influence 453 is negative with a large absolute value (such as the classification hit count 437 or the valid feature word count 431) is an essential factor that lowers retrieval precision. A user who wants to know what to do to improve retrieval precision from the viewpoint of such an essential factor presses the "countermeasure" link 2002 of the countermeasure method 2001 for that essential factor. Then, as in the example shown in Figure 21, the countermeasure content 2103 and the operation method 2104 are displayed in association with the essential factor 2101 and the essential factor group 2102. Further, pressing the "edit screen" link 2105 in the operation method 2104 displays the search condition edit screen shown in Figure 2 as the screen for carrying out the countermeasure content. Guided in this way, the user can revise the search conditions appropriately without being at a loss about how to operate the system.
Figure 22 shows an example of the configuration of the countermeasure table 522. The countermeasure table 522 consists of the essential factor ID 2201, the essential factor name 2202, the essential factor group ID 2203, the essential factor group name 2204, the countermeasure content 2205 describing what to do next, the operation method 2206 describing how to operate the system to carry out the countermeasure content, and the destination screen 2207 to move to for the operation. The essential factor for which a countermeasure is selected in table 400 of Figure 20A can be matched to the data in the countermeasure table 522 using the essential factor name and the essential factor group as the key, so the matching data can be taken out of the countermeasure table 522 and the display in the form shown in Figure 21 can be realized easily.
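The keyed lookup into the countermeasure table can be sketched in Python as follows. The single table entry is illustrative only: pairing the valid feature word count and the group "low" with the "append feature words" countermeasure is an assumption of this sketch, as are the names `COUNTERMEASURES` and `countermeasure_for`.

```python
# Hypothetical contents of a countermeasure table, keyed by
# (essential factor name, essential factor group name).
COUNTERMEASURES = {
    ("valid feature word count", "low"): {
        "content": "Append feature words",
        "operation": "Append feature words on the feature word append sub-screen",
        "destination": "search condition edit screen",
    },
}


def countermeasure_for(factor_name, group_name, deviation):
    """Return countermeasure info only for factors that lower precision."""
    if deviation >= 0:
        return None  # the factor does not reduce precision, so nothing to suggest
    return COUNTERMEASURES.get((factor_name, group_name))


print(countermeasure_for("valid feature word count", "low", -0.20))
```

Keying on the name and group mirrors the matching described for Figure 22; a table keyed on the essential factor ID 2201 and group ID 2203 would work the same way.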
In Figure 21, the destination screen 2207 is reached through a link anchor in the text. Alternatively, a separate "move to screen" button may be displayed so that, when the user presses it, the display jumps to the destination screen defined in the countermeasure table 522.
Fig. 2 shows a configuration example of the search condition edit screen 200, on which the search conditions (appending or deleting feature words, correcting weights, expanding synonyms, narrowing by record, and so on) are edited in order to improve similar document retrieval precision.
The search condition edit screen 200 consists of a feature word edit sub-screen 201 for deleting feature words and correcting weights, a feature word append sub-screen 202 for appending feature words, a synonym expansion sub-screen 203 for expanding synonyms, and a record condition edit sub-screen 204 for narrowing or widening the retrieval results by records such as classification, applicant, or filing date.
The feature word edit sub-screen 201 displays data on the feature words used for retrieval. When a selection check box 211 is selected (the state marked with ×), the feature word is used for retrieval; when the selection is cleared (the state without ×), the feature word is not used for retrieval. The value of the weight 212 can also be changed to any value on this sub-screen.
The feature word append sub-screen 202 displays the feature words that are contained in the input file but not used for retrieval. Feature words contained in the retrieval result files may also be displayed. Feature words to be used for retrieval can be appended by selecting the selection check box 221, and the weight 222 of an appended feature word can be changed to any value.
The synonym expansion sub-screen 203 displays synonym data for the feature words used for retrieval. The synonym data may be stored in the word dictionary 503 or may be stored as separate data in a synonym dictionary. When any feature word ("circular" here) is selected from the feature word list 231, candidate synonyms and their certainty factors are displayed in the table 232 on the right. Selecting the check box of a word that is suitable as a synonym appends the selected word as a feature word.
The record condition edit sub-screen 204 is used to narrow the results by record. When any record item ("classification (IPC)" here) is selected from the record item list 241, the table 242 on the right displays, as file counts, the distribution of the values of that record item among the top-ranked retrieval result files. Selecting a record value with its check box narrows the retrieval results.
On this search condition edit screen 200, the user revises the search conditions according to the content suggested on the screen shown in Figure 21 and runs the similar document retrieval again. In Figure 21, for example, appending feature words is suggested, and "append feature words on the feature word append screen" is shown as the operation method 2206. The user therefore finds and appends suitable feature words on the feature word append sub-screen 202 and presses the retrieval button 250 to retrieve again. In Fig. 2 the sub-screens are shown together on one screen, but only the sub-screens the user needs may be shown instead.
With the embodiment, variations, and extension example described above, the user can understand, as the grounds for a retrieval result, how much each feature word in the input file contributed to the output of the similar document retrieval result, how well the similar document retrieval went, what the cause is when it went poorly, and what to do next to obtain a good retrieval result in that case. Because the user can move smoothly to the next action, the cycle of retrieval work runs efficiently and high-quality retrieval results can be obtained.

Claims (14)

1. A similar document retrieval assistance method comprising:
a feature word extraction processing step of analyzing retrieval target files stored in a document database, extracting feature words and weights representing their importance, and storing them in a catalog;
a similar document retrieval processing step of extracting the corresponding weighted feature words from an input file designated through operation of an input device, comparing them with the weighted feature words stored in the catalog, calculating the similarity between the input file and the retrieval target files, and determining a retrieval result file set in descending order of similarity; and
a retrieval result output processing step of presenting the retrieval result file set to the user,
the similar document retrieval assistance method further having:
a feature word collection processing step of, for each teacher input file constituting a teacher file table, extracting the corresponding weighted feature words from the text of the teacher input file by the feature word extraction processing step or collecting them from the catalog, and storing them in a feature word table, the teacher file table holding a plurality of pairs, each pair consisting of a teacher input file whose correct answer file is known and the correct answer file corresponding to that teacher input file;
an essential factor data extraction processing step of determining, based on the retrieval result file set determined by the similar document retrieval processing step for each teacher input file, the retrieval rank of the correct answer file corresponding to each teacher input file, extracting the essential factor values of each teacher input file by referring to one or more of the feature word table corresponding to each teacher input file, the retrieval result file set, description information, and the catalog, and storing them in an essential factor table, the essential factor values of each teacher input file corresponding to respective essential factors set in advance as factors affecting similar document retrieval precision;
a retrieval precision analysis processing step of, for the essential factor values stored in the essential factor table for the set of teacher input files in the teacher file table, dividing the set of teacher input files into essential factor groups based on the distribution of the essential factor values of one essential factor or on a combination of the distributions of the essential factor values of a plurality of essential factors, calculating the retrieval precision for each essential factor group from the retrieval ranks of the correct answer files corresponding to the teacher input files belonging to that essential factor group, calculating as a deviation value the difference between the calculated retrieval precision and the average retrieval precision calculated over all the teacher input files, and storing the essential factor group, the range of essential factor values falling within that essential factor group, the retrieval precision, and the deviation value in a retrieval precision table; and
an influence degree calculation processing step of comparing the essential factor values obtained for a new input file whose correct answer file is unknown with the value range of each essential factor group stored in the retrieval precision table, thereby extracting the retrieval precision and deviation value of the essential factor group whose value range is satisfied, and storing them together with the essential factor values of the new input file in an influence table,
wherein, in the retrieval result output processing step, the essential factor values corresponding to the new input file and the retrieval precision and/or the deviation value stored in the influence table are presented to the user.
2. similar document retrieval householder method comprises:
The feature word extracts treatment step, the searching object file that is stored in document data bank resolved, and the weight of extracting the feature word and representing its importance degree, and be stored in catalog;
Similar document retrieval treatment step, from appointed input file extracts corresponding weighted feature word by the operation of input media is imported, contrast with the weighted feature word that is stored in above-mentioned catalog, and calculate similar degree between above-mentioned input file and the above-mentioned searching object file, begin to determine successively to be the result for retrieval file set from the high searching object file of similar degree; And
Result for retrieval output treatment step is informed above-mentioned result for retrieval file set to the user,
In this similar document retrieval householder method, have:
Feature word collection and treatment step, extract treatment step by above-mentioned feature word, text in teacher's input file extracts or collects with the teacher's input file that constitutes teacher's file table from above-mentioned catalog distinguishes corresponding weighted feature word, and be stored in the feature vocabulary, it is a plurality of right to have in this teacher's file table, described to being that correctly to answer file be the right of known teacher's input file and the above-mentioned correct answer file corresponding with above-mentioned teacher's input file;
Essential factor data extraction step: based on the retrieval result file set determined by the similar document retrieval step for each teacher input file, determining the retrieval rank of the correct answer file corresponding to each teacher input file; extracting, for each predefined essential factor that affects similar document retrieval precision, the essential factor value of each teacher input file by referring to one or more of the feature word table corresponding to that teacher input file, the retrieval result file set, the description information, and the catalog; and storing the extracted essential factor values in an essential factor table; and
Degree-of-influence calculation step: for the essential factor values obtained for a new input file whose correct answer file is unknown, determining a file group consisting of the teacher input files whose value for one essential factor equals or is close to the corresponding value of the new input file, or whose values for a plurality of essential factors all equal or are close to the corresponding values of the new input file; calculating the retrieval precision of this file group from the retrieval ranks of the correct answer files corresponding to the teacher input files belonging to the file group; calculating, as a deviation value, the difference between the calculated retrieval precision and the mean retrieval precision calculated over all of the teacher input files; and storing the essential factor values, the retrieval precision, and the deviation value in an influence degree table,
wherein, in the retrieval result output step, the essential factor values corresponding to the new input file and the retrieval precision and/or the deviation value stored in the influence degree table are shown to the user.
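The degree-of-influence calculation of claim 2 can be illustrated with a minimal sketch. The teacher data, the rank cutoff, and the tolerance that defines "close to" are all hypothetical values chosen for illustration; the claim does not fix any of them.

```python
# Hypothetical teacher data: (essential factor value, retrieval rank of the correct answer file).
TEACHERS = [(0.10, 45), (0.15, 30), (0.40, 8), (0.45, 3), (0.50, 12), (0.80, 1), (0.85, 2)]
TOP_RANK = 10      # assumed "pre-specified rank" within which a correct answer counts as found
TOLERANCE = 0.10   # assumed meaning of "close to" the new input file's factor value


def precision(ranks, top_rank=TOP_RANK):
    """Share of teacher input files whose correct answer file ranked within top_rank."""
    return sum(r <= top_rank for r in ranks) / len(ranks)


def degree_of_influence(new_value, teachers=TEACHERS, tol=TOLERANCE):
    """Return (group precision, deviation from overall precision), or None if no teacher file is close."""
    # File group: teacher input files whose factor value is equal or close to the new file's value.
    group = [rank for value, rank in teachers if abs(value - new_value) <= tol]
    if not group:
        return None
    overall = precision([rank for _, rank in teachers])
    group_precision = precision(group)
    return group_precision, group_precision - overall
```

With this toy data, a new input file with factor value 0.45 falls into a group of three teacher files, two of which had their correct answer within the top 10, so the group precision is above the overall precision and the deviation is positive.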
3. The similar document retrieval auxiliary method according to claim 1 or 2, characterized in that
The essential factors affecting the similar document retrieval precision include at least one of (1) to (12) below:
(1) for the top-ranked retrieval result files, consisting of a pre-specified number of files, the total number of hits of the feature words of the input file, or its ratio;
(2) among the total hits of (1), the number of hits for which the weight, in the retrieval result file, of the feature word of the input file is at or above a pre-specified threshold, or its ratio;
(3) among the total hits of (1), the partial similarity attributable to the feature words of the input file, or the ratio of that partial similarity to the similarity of the retrieval result file;
(4) the value obtained by dividing the number or ratio of (2) by the number or ratio of (1);
(5) the value obtained by dividing the number or ratio of (3) by the number or ratio of (1);
(6) the number, or ratio, of feature words of the input file whose hit count among the top-ranked retrieval result files is at or above a pre-specified threshold;
(7) the number, or ratio, of feature words of the input file whose hit count among the top-ranked retrieval result files is at or below a pre-specified threshold;
(8) the rate at which the similarity of the top-ranked retrieval result files decays as the retrieval rank falls;
(9) the number, or ratio, of top-ranked retrieval result files that are assigned the classification assigned to the input file;
(10) the number, or ratio, of files among all search target files that are assigned the classification assigned to the input file;
(11) the number, or ratio, of top-ranked retrieval result files that share an author with the input file;
(12) the number, or ratio, of top-ranked retrieval result files whose issue date differs from that of the input file by no more than a pre-specified threshold.
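As an illustration of the hit-based factors in the list above, the following sketch computes factors (1), (2), (4), (6) and (7) for a toy result set. The data layout (one dictionary of word weights per top-ranked result file), the names, and every threshold are assumptions made for this example only.

```python
# Hypothetical data: for each top-ranked retrieval result file, the weight of every
# input-file feature word that occurs in it (a missing word means no hit).
TOP_RESULTS = [
    {"retrieval": 0.9, "index": 0.4},
    {"retrieval": 0.7, "vector": 0.2},
    {"index": 0.6},
]
INPUT_WORDS = ["retrieval", "index", "vector", "patent"]
WEIGHT_THRESHOLD = 0.5  # assumed threshold for factor (2)
MANY_HITS = 2           # assumed threshold for factor (6)
FEW_HITS = 1            # assumed threshold for factor (7)


def factor_values(results, words):
    """Compute a few of the hit-based essential factor values as plain counts."""
    # Number of top-ranked result files in which each feature word occurs.
    hits_per_word = {w: sum(w in r for r in results) for w in words}
    total_hits = sum(hits_per_word.values())  # factor (1)
    # Hits whose weight in the result file reaches the threshold: factor (2).
    heavy_hits = sum(r[w] >= WEIGHT_THRESHOLD for r in results for w in words if w in r)
    return {
        "f1_total_hits": total_hits,
        "f2_heavy_hits": heavy_hits,
        "f4_heavy_share": heavy_hits / total_hits if total_hits else 0.0,  # (2) divided by (1)
        "f6_frequent_words": sum(h >= MANY_HITS for h in hits_per_word.values()),
        "f7_rare_words": sum(h <= FEW_HITS for h in hits_per_word.values()),
    }
```

In this toy set the four feature words produce five hits in total, three of them with a weight of at least 0.5.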
4. The similar document retrieval auxiliary method according to claim 1 or 2, characterized in that
The retrieval precision is the proportion of teacher input files for which the similar document retrieval step ranks the corresponding correct answer file within a pre-specified rank.
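The definition of retrieval precision in claim 4 reduces to a simple proportion, sketched below. The function name and the convention that a rank of None means "correct answer file not retrieved" are assumptions for illustration.

```python
def retrieval_precision(correct_ranks, cutoff):
    """Proportion of teacher input files whose correct answer file ranked within cutoff.

    correct_ranks holds one retrieval rank per teacher input file; None is an
    assumed convention meaning the correct answer file was not retrieved at all.
    """
    if not correct_ranks:
        raise ValueError("at least one teacher input file is required")
    found = sum(1 for r in correct_ranks if r is not None and r <= cutoff)
    return found / len(correct_ranks)
```

For example, with correct answer ranks 1, 5, 12 and one miss, and a cutoff of 10, two of the four teacher input files count as successes.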
5. The similar document retrieval auxiliary method according to claim 1, characterized in that
The essential factor values in the essential factor table that are used in the retrieval precision analysis step consist only of the essential factor values of teacher input files that satisfy a pre-specified condition.
6. The similar document retrieval auxiliary method according to claim 1 or 2, characterized in that
In the retrieval result output step, when the essential factor values corresponding to the new input file and the retrieval precision and/or deviation value stored in the influence degree table are shown to the user, a correspondence table is displayed alongside them; the correspondence table has the feature words of the new input file and the top-ranked retrieval result files corresponding to the new input file as its two axes, and has as its values either the weight Wij of feature word j of the new input file in the top-ranked retrieval result file i, or the partial similarity Sij contributed by feature word j of the new input file in the top-ranked retrieval result file i.
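The correspondence table of claim 6 is a rank-by-feature-word matrix. The sketch below builds it from weights Wij and renders it as text; the data layout (one dictionary of word weights per top-ranked result file) and the function names are hypothetical.

```python
def correspondence_table(input_words, top_results):
    """Rows are ranks i, columns are feature words j, cells are weights Wij (0.0 when there is no hit)."""
    return [[result.get(word, 0.0) for word in input_words] for result in top_results]


def render(input_words, table):
    """Format the table as plain text, one line per top-ranked retrieval result file."""
    header = "rank  " + "  ".join(f"{w:>9}" for w in input_words)
    rows = [
        f"{i:>4}  " + "  ".join(f"{v:>9.2f}" for v in row)
        for i, row in enumerate(table, start=1)
    ]
    return "\n".join([header, *rows])
```

The same layout would serve for partial similarities Sij by passing per-word similarity contributions instead of weights.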
7. The similar document retrieval auxiliary method according to claim 1 or 2, characterized in that
A countermeasure table is provided which stores, for each essential factor group and from the viewpoint of each essential factor, countermeasure information for helping the user obtain better similar document retrieval results, the countermeasure information comprising countermeasure content describing what the user should do, an operation method describing how to carry out the countermeasure content, and image information to be operated on in order to carry out the operation method,
In the retrieval result output step, when the essential factor values and the retrieval precision and/or deviation value stored in the influence degree table are shown to the user, at least one of the countermeasure content, the operation method, and the image information recorded in the countermeasure table is displayed in association with the essential factor group.
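One way to picture the countermeasure table of claim 7 is as a lookup keyed by essential factor and essential factor group. The sketch below is only an assumed data structure; the factor name, group label, texts, and image file name are invented for illustration and do not come from the patent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Countermeasure:
    content: str    # what the user should do
    operation: str  # how to carry out the countermeasure content
    image: str      # image information associated with the operation


# Hypothetical countermeasure table keyed by (essential factor, essential factor group).
COUNTERMEASURES = {
    ("total_hits", "low"): Countermeasure(
        content="Add feature words that describe the input file more specifically.",
        operation="Edit the feature word list and run the search again.",
        image="feature_word_editor.png",
    ),
}


def countermeasure_for(factor, group, table=COUNTERMEASURES):
    """Return the countermeasure to display with this essential factor group, or None."""
    return table.get((factor, group))
```

The output step would then show the returned entry next to the essential factor value, retrieval precision, and deviation value of the same group.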
8. A similar document retrieval auxiliary device, having:
An input device that accepts operation input and/or data input from the user;
A document database that stores the search target files;
A feature word extraction unit that parses the search target files stored in the document database and extracts feature words together with weights representing their importance;
A catalog that stores the extracted weighted feature words;
A similar document retrieval unit that extracts weighted feature words from an input file designated by an operation input on the input device, compares them with the weighted feature words stored in the catalog, calculates the similarity between the input file and each search target file, and determines the search target files, in descending order of similarity, as the retrieval result file set; and
An output device that presents the retrieval result file set to the user,
the similar document retrieval auxiliary device further having:
A teacher file table holding a plurality of pairs, each pair consisting of a teacher input file whose correct answer file is known and the correct answer file corresponding to that teacher input file;
A feature word collection unit that, for each teacher input file, extracts the weighted feature words from the text of the teacher input file by means of the feature word extraction unit, or collects them from the catalog, and stores them in a feature word table;
An essential factor data extraction unit that, based on the retrieval result file set determined by the similar document retrieval unit for each teacher input file, determines the retrieval rank of the correct answer file corresponding to each teacher input file; extracts, for each predefined essential factor that affects similar document retrieval precision, the essential factor value of each teacher input file by referring to one or more of the feature word table corresponding to that teacher input file, the retrieval result file set, the description information, and the catalog; and stores the extracted essential factor values in an essential factor table;
A retrieval precision analysis unit that, for the essential factor values stored in the essential factor table for the set of teacher input files in the teacher file table, divides the set of teacher input files into essential factor groups based on the distribution of the values of one essential factor or on a combination of the distributions of the values of a plurality of essential factors; calculates the retrieval precision of each essential factor group from the retrieval ranks of the correct answer files corresponding to the teacher input files belonging to that group; calculates, as a deviation value, the difference between the calculated retrieval precision and the mean retrieval precision calculated over all of the teacher input files; and stores in a retrieval precision table each essential factor group, the range of essential factor values belonging to that group, the retrieval precision, and the deviation value; and
A degree-of-influence calculation unit that compares the essential factor values obtained for a new input file whose correct answer file is unknown with the value range of each essential factor group stored in the retrieval precision table, thereby extracts the retrieval precision and deviation value of the essential factor group whose value range is satisfied, and stores them together with the essential factor values of the new input file in an influence degree table,
wherein the output device shows the user the essential factor values corresponding to the new input file and the retrieval precision and/or the deviation value stored in the influence degree table.
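The retrieval precision table and the range lookup described in claim 8 can be sketched as follows for a single essential factor. The teacher data, the group boundaries, and the rank cutoff are hypothetical; the claim leaves open how the groups are derived from the value distribution, so fixed boundaries are used here as a stand-in.

```python
import bisect
import math

# Hypothetical teacher data: (essential factor value, retrieval rank of the correct answer file).
TEACHERS = [(0.10, 45), (0.15, 30), (0.40, 8), (0.45, 3), (0.50, 12), (0.80, 1), (0.85, 2)]
BOUNDARIES = [0.3, 0.7]  # assumed group edges: below 0.3, 0.3 up to 0.7, 0.7 and above
TOP_RANK = 10            # assumed "pre-specified rank"


def group_of(value, boundaries=BOUNDARIES):
    """Index of the essential factor group whose value range contains value."""
    return bisect.bisect_right(boundaries, value)


def build_precision_table(teachers=TEACHERS, boundaries=BOUNDARIES, top_rank=TOP_RANK):
    """Map each essential factor group to its value range, precision, and deviation."""
    edges = [-math.inf, *boundaries, math.inf]
    overall = sum(rank <= top_rank for _, rank in teachers) / len(teachers)
    ranks_by_group = {}
    for value, rank in teachers:
        ranks_by_group.setdefault(group_of(value, boundaries), []).append(rank)
    table = {}
    for g, ranks in sorted(ranks_by_group.items()):
        p = sum(r <= top_rank for r in ranks) / len(ranks)
        table[g] = {"range": (edges[g], edges[g + 1]), "precision": p, "deviation": p - overall}
    return table


def lookup(table, new_value, boundaries=BOUNDARIES):
    """Row for the group matching a new input file's factor value, or None if that group is empty."""
    return table.get(group_of(new_value, boundaries))
```

Unlike the neighbour-based calculation of claims 2 and 9, the table is built once from the teacher files, and each new input file only needs a range comparison.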
9. A similar document retrieval auxiliary device, having:
An input device that accepts operation input and/or data input from the user;
A document database that stores the search target files;
A feature word extraction unit that parses the search target files stored in the document database and extracts feature words together with weights representing their importance;
A catalog that stores the extracted weighted feature words;
A similar document retrieval unit that extracts weighted feature words from an input file designated by an operation input on the input device, compares them with the weighted feature words of the search target files stored in the catalog, calculates the similarity between the input file and each search target file, and determines the search target files, in descending order of similarity, as the retrieval result file set; and
An output device that presents the retrieval result file set to the user,
the similar document retrieval auxiliary device further having:
A teacher file table holding a plurality of pairs, each pair consisting of a teacher input file whose correct answer file is known and the correct answer file corresponding to that teacher input file;
A feature word collection unit that, for each teacher input file constituting the teacher file table, extracts the weighted feature words from the text of the teacher input file by means of the feature word extraction unit, or collects them from the catalog, and stores them in a feature word table;
An essential factor data extraction unit that, based on the retrieval result file set determined by the similar document retrieval unit for each teacher input file, determines the retrieval rank of the correct answer file corresponding to each teacher input file; extracts, for each predefined essential factor that affects similar document retrieval precision, the essential factor value of each teacher input file by referring to one or more of the feature word table corresponding to that teacher input file, the retrieval result file set, the description information, and the catalog; and stores the extracted essential factor values in an essential factor table; and
A degree-of-influence calculation unit that, for the essential factor values obtained for a new input file whose correct answer file is unknown, determines a file group consisting of the teacher input files whose value for one essential factor equals or is close to the corresponding value of the new input file, or whose values for a plurality of essential factors all equal or are close to the corresponding values of the new input file; calculates the retrieval precision of this file group from the retrieval ranks of the correct answer files corresponding to the teacher input files belonging to the file group; calculates, as a deviation value, the difference between the calculated retrieval precision and the mean retrieval precision calculated over all of the teacher input files; and stores the essential factor values, the retrieval precision, and the deviation value in an influence degree table,
wherein the output device shows the user the essential factor values corresponding to the new input file and the retrieval precision and/or the deviation value stored in the influence degree table.
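The similar document retrieval unit recited in claims 8 and 9 compares weighted feature words and ranks the search target files by similarity. The sketch below uses cosine similarity, which is an assumption: the claims do not name a specific similarity measure. The catalog layout (file id mapped to a dictionary of word weights) is likewise hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two {feature word: weight} dictionaries."""
    dot = sum(w * b[t] for t, w in a.items() if t in b)
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def retrieve(input_words, catalog, top_n=10):
    """Return up to top_n (file id, similarity) pairs in descending order of similarity."""
    scored = [(cosine(input_words, words), doc_id) for doc_id, words in catalog.items()]
    # Sort by similarity, highest first; ties are broken by file id for a stable order.
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [(doc_id, score) for score, doc_id in scored[:top_n]]
```

The position of a teacher input file's correct answer file in the returned list is the retrieval rank that the essential factor data extraction unit works from.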
10. The similar document retrieval auxiliary device according to claim 8 or 9, characterized in that
The essential factors affecting the similar document retrieval precision include at least one of (1) to (12) below:
(1) for the top-ranked retrieval result files, consisting of a pre-specified number of files, the total number of hits of the feature words of the input file, or its ratio;
(2) among the total hits of (1), the number of hits for which the weight, in the retrieval result file, of the feature word of the input file is at or above a pre-specified threshold, or its ratio;
(3) among the total hits of (1), the partial similarity attributable to the feature words of the input file, or the ratio of that partial similarity to the similarity of the retrieval result file;
(4) the value obtained by dividing the number or ratio of (2) by the number or ratio of (1);
(5) the value obtained by dividing the number or ratio of (3) by the number or ratio of (1);
(6) the number, or ratio, of feature words of the input file whose hit count among the top-ranked retrieval result files is at or above a pre-specified threshold;
(7) the number, or ratio, of feature words of the input file whose hit count among the top-ranked retrieval result files is at or below a pre-specified threshold;
(8) the rate at which the similarity of the top-ranked retrieval result files decays as the retrieval rank falls;
(9) the number, or ratio, of top-ranked retrieval result files that are assigned the classification assigned to the input file;
(10) the number, or ratio, of files among all search target files that are assigned the classification assigned to the input file;
(11) the number, or ratio, of top-ranked retrieval result files that share an author with the input file;
(12) the number, or ratio, of top-ranked retrieval result files whose issue date differs from that of the input file by no more than a pre-specified threshold.
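The factors in the list above that do not depend on feature word hits, namely (8), (9), (11) and (12), can be sketched as below. The record layout (keys "classes", "authors", "issued"), the one-year date threshold, and the particular decay formula are assumptions; the claim does not define how the decay rate is measured.

```python
from datetime import date


def decay_rate(similarities):
    """Factor (8), assumed form: relative drop from the first to the last top-ranked similarity."""
    first, last = similarities[0], similarities[-1]
    return (first - last) / first if first else 0.0


def metadata_factors(input_file, results, max_days=365):
    """Ratios for factors (9), (11) and (12) over the top-ranked retrieval result files."""
    n = len(results)
    if n == 0:
        return {"f9": 0.0, "f11": 0.0, "f12": 0.0}
    # (9): result files sharing at least one classification with the input file.
    same_class = sum(bool(set(r["classes"]) & set(input_file["classes"])) for r in results)
    # (11): result files sharing at least one author with the input file.
    same_author = sum(bool(set(r["authors"]) & set(input_file["authors"])) for r in results)
    # (12): result files issued within max_days of the input file's issue date.
    near_date = sum(abs((r["issued"] - input_file["issued"]).days) <= max_days for r in results)
    return {"f9": same_class / n, "f11": same_author / n, "f12": near_date / n}
```

Factor (10) would use the same classification test as (9), applied to all search target files instead of only the top-ranked results.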
11. The similar document retrieval auxiliary device according to claim 8 or 9, characterized in that
The retrieval precision is the proportion of teacher input files for which the similar document retrieval unit ranks the corresponding correct answer file within a pre-specified rank.
12. The similar document retrieval auxiliary device according to claim 8, characterized in that
The essential factor values in the essential factor table that are used in the retrieval precision analysis unit consist only of the essential factor values of teacher input files that satisfy a pre-specified condition.
13. The similar document retrieval auxiliary device according to claim 8 or 9, characterized in that
When the output device shows the user the essential factor values corresponding to the new input file and the retrieval precision and/or deviation value stored in the influence degree table, a correspondence table is displayed alongside them; the correspondence table has the feature words of the new input file and the top-ranked retrieval result files corresponding to the new input file as its two axes, and has as its values either the weight Wij of feature word j of the new input file in the top-ranked retrieval result file i, or the partial similarity Sij contributed by feature word j of the new input file in the top-ranked retrieval result file i.
14. The similar document retrieval auxiliary device according to claim 8 or 9, characterized in that
A countermeasure table is provided which stores, for each essential factor group and from the viewpoint of each essential factor, countermeasure information for helping the user obtain better similar document retrieval results, the countermeasure information comprising countermeasure content describing what the user should do, an operation method describing how to carry out the countermeasure content, and image information to be operated on in order to carry out the operation method,
In the retrieval result output unit, when the essential factor values and the retrieval precision and/or deviation value stored in the influence degree table are shown to the user, at least one of the countermeasure content, the operation method, and the image information recorded in the countermeasure table is displayed in association with the essential factor group.
CN201210539130.3A 2012-02-24 2012-12-13 Similar document retrieval auxiliary device and similar document retrieval auxiliary method Expired - Fee Related CN103294741B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012-038163 2012-02-24
JP2012038163A JP5324677B2 (en) 2012-02-24 2012-02-24 Similar document search support device and similar document search support program

Publications (2)

Publication Number Publication Date
CN103294741A (en) 2013-09-11
CN103294741B (en) 2016-12-21

Family

ID=49095624

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210539130.3A Expired - Fee Related CN103294741B (en) 2012-02-24 2012-12-13 Similar document retrieval auxiliary device and similar document retrieval auxiliary method

Country Status (2)

Country Link
JP (1) JP5324677B2 (en)
CN (1) CN103294741B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112352230A (en) * 2018-06-28 2021-02-09 三菱电机株式会社 Search device, search method, and machine learning device

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019028631A1 (en) * 2017-08-07 2019-02-14 深圳益强信息科技有限公司 Method for determining relative confidentiality of technical know-how
CN107609021A (en) * 2017-08-07 2018-01-19 深圳益强信息科技有限公司 The secret of know-how judges system
WO2019028628A1 (en) * 2017-08-07 2019-02-14 深圳益强信息科技有限公司 System for determining confidentiality of technical know-how
KR102004145B1 (en) * 2018-11-29 2019-07-29 한국과학기술정보연구원 Method for recommending content and apparatus thereof

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1477563A (en) * 2003-07-03 2004-02-25 复旦大学 High-dimensional vector data quick similar search method
CN1495641A (en) * 2002-08-06 2004-05-12 美商苹果电脑股份有限公司 Adaptive cotext sensitive analysis of abstention statement of limited copyright
CN1694100A (en) * 2004-04-15 2005-11-09 微软公司 Content propagation for enhanced document retrieval
US20080162455A1 (en) * 2006-12-27 2008-07-03 Rakshit Daga Determination of document similarity
CN101295307A (en) * 2007-04-27 2008-10-29 株式会社日立制作所 Document retrieval system and document retrieval method
JP2008282111A (en) * 2007-05-09 2008-11-20 Hitachi Ltd Similar document retrieval method, program and device

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3356519B2 (en) * 1993-03-12 2002-12-16 株式会社東芝 Document information retrieval device
JP2000311173A (en) * 1999-04-27 2000-11-07 Toshiba Corp Device and method for retrieving similar document
JP2002230032A (en) * 2001-01-30 2002-08-16 Canon Inc Document retrieval result display device, its display method and storage medium
JP2009151373A (en) * 2007-12-18 2009-07-09 Nec Corp Citation relation extraction system, citation relation extraction method, and citation relation extracting program

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1495641A (en) * 2002-08-06 2004-05-12 美商苹果电脑股份有限公司 Adaptive cotext sensitive analysis of abstention statement of limited copyright
CN1477563A (en) * 2003-07-03 2004-02-25 复旦大学 High-dimensional vector data quick similar search method
CN1694100A (en) * 2004-04-15 2005-11-09 微软公司 Content propagation for enhanced document retrieval
US20080162455A1 (en) * 2006-12-27 2008-07-03 Rakshit Daga Determination of document similarity
CN101295307A (en) * 2007-04-27 2008-10-29 株式会社日立制作所 Document retrieval system and document retrieval method
JP2008282111A (en) * 2007-05-09 2008-11-20 Hitachi Ltd Similar document retrieval method, program and device

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112352230A (en) * 2018-06-28 2021-02-09 三菱电机株式会社 Search device, search method, and machine learning device
CN112352230B (en) * 2018-06-28 2021-08-27 三菱电机株式会社 Search device, search method, and machine learning device

Also Published As

Publication number Publication date
JP2013174988A (en) 2013-09-05
CN103294741B (en) 2016-12-21
JP5324677B2 (en) 2013-10-23

Similar Documents

Publication Publication Date Title
Rai Identifying key product attributes and their importance levels from online customer reviews
Bauer et al. Quantitive evaluation of Web site content and structure
US10366117B2 (en) Computer-implemented systems and methods for taxonomy development
CN108563620A (en) The automatic writing method of text and system
US20100114561A1 (en) Latent metonymical analysis and indexing (lmai)
CN104077407B (en) A kind of intelligent data search system and method
CN105843897A (en) Vertical domain-oriented intelligent question and answer system
Gkotsis et al. It's all in the content: state of the art best answer prediction based on discretisation of shallow linguistic features
CN103294741A (en) Similar document retrieval auxiliary device and similar document retrieval auxiliary method
Homoceanu et al. Will I like it? Providing product overviews based on opinion excerpts
CN110134845A (en) Project public sentiment monitoring method, device, computer equipment and storage medium
CN111353044A (en) Comment-based emotion analysis method and system
Baowaly et al. Predicting the helpfulness of game reviews: A case study on the steam store
Haque et al. Opinion mining from bangla and phonetic bangla reviews using vectorization methods
Yordanova et al. Sentiment classification of hotel reviews in social media with decision tree learning
CN111125561A (en) Network heat display method and device
Rathan et al. Every post matters: a survey on applications of sentiment analysis in social media
US20220327445A1 (en) Workshop assistance system and workshop assistance method
KR20100069118A (en) Method for constructing query index database, method for recommending query by using the query index database
CN113010639A (en) Commodity analysis method and device based on E-commerce platform
CN115659961B (en) Method, apparatus and computer storage medium for extracting text views
CN109408808B (en) Evaluation method and evaluation system for literature works
CN110347934A (en) A kind of text data filtering method, device and medium
KR20090126862A (en) System and method for analyzing emotional information from natural language sentence, and medium for storaging program for the same
JP2016162357A (en) Analysis device and program of user's emotion to product

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20161221

Termination date: 20211213