CN1151558A - Information searching method and system - Google Patents

Information searching method and system

Info

Publication number
CN1151558A
Authority
CN
China
Prior art keywords
character string
mentioned
document
effective consistent
starting position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN 95118142
Other languages
Chinese (zh)
Inventor
久保田理惠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
International Business Machines Corp
Original Assignee
International Business Machines Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by International Business Machines Corp
Publication of CN1151558A

Abstract

Provided is a character string retrieval method whose fuzzy retrieval returns documents that agree with human intuition. A system is provided that uses an index file containing the in-document position information of character string patterns to retrieve, at high speed, documents containing a character string whose arrangement of characters is similar to that of a designated character string. In the system, the character string to be retrieved and a retrieval precision (greater than zero and not exceeding one) are designated. The system can then identify each document containing a "similar character string" whose "similarity" to the character string to be retrieved exceeds the designated retrieval precision, together with the in-document position of that "similar character string".

Description

Information retrieval method and system
The present invention relates to a system and method for retrieving, at high speed and with a desired degree of fuzziness, a large number of documents stored in text form, for example on a disk.
There has long been a demand for high-speed retrieval of large volumes of document data written in natural language and stored on disk, such as news reports, patent gazettes and scientific and technical literature, and various retrieval methods have been proposed. These methods can be roughly classified as follows.
(a) Keyword retrieval method
In this method, an index is created in advance from keywords that represent each document and its content. Keywords are determined by automatic detection based on morphological decomposition or the like, by manual keyword entry, or by a combination of the two. However, this method can retrieve only character strings for which a keyword index entry exists. Moreover, the accuracy of automatic keyword detection by morphological decomposition depends on the accuracy of the word and grammar dictionaries, so a great deal of manual labor must be spent compiling and editing those dictionaries. These are its drawbacks.
(b) Full-text retrieval method without an index
This method uses no index. Instead, every time a character string to be retrieved is specified, the full text of all documents subject to retrieval is scanned. It is also a method in which special hardware is used to raise the retrieval speed. However, a system that uses special hardware is more expensive and is also constrained by the user's operating environment, so it is limited to certain usable machine types.
(c) Full-text retrieval method with an index
The present invention belongs to the category of full-text retrieval with an index. This category attempts to achieve high-speed full-text retrieval by using an index, and the following techniques are known.
Japanese Unexamined Patent Publication (Kokai) No. 4-205560 discloses the following. The character strings subject to retrieval are divided into the retrieval units used at search time; an ascending-order code is attached to each retrieval unit; an attribute code representing the logical division is attached to each divided retrieval unit; and a character position number, representing the position of the character within its retrieval unit, is attached to each character of the character strings subject to retrieval. Character position information consisting of the retrieval unit identification code, the character position number and the attribute code is then created and stored in an area for each character type, and this is compiled into a retrieval file.
Kokai No. 4-215181 discloses that, in order to reduce the number of character string comparisons during retrieval and to allow high-speed comparison on a general-purpose information processing apparatus, character group position information, which represents the position within the character string of each character group making up the character strings subject to retrieval, is compiled into a retrieval file grouped by character group type.
However, users often want to retrieve not only character strings that match the search string exactly but also documents containing character strings that match it only partially. For example, the user's memory of the search string may be vague, or the string may occur in many variant forms, and it is sometimes difficult to enumerate all of those variants.
A typical way of specifying partially matching character strings in the prior art is to use regular expressions. With this approach one can specify any character repeated zero or more times, any character repeated one or more times, the beginning of a line, the end of a line, any character within a given character code range, and so on.
In addition, Kokai No. 63-99830 discloses a system that has a function of partially matching search string data against the string data being searched, and that is provided with a data table storing similar-word relationships for the search string data and a data table indicating whether the search string data appears in any of the string data being searched.
Kokai No. 62-221027 discloses that, when a character string segment cut from the beginning of the target character string cannot be found in the dictionary, a forward search is performed on the next segment, whose length is extended by only one character. This both reduces the number of wasted lookups and allows valid words to be read out at higher speed.
Kokai Nos. 4-326164 and 5-174067 disclose a retrieval method for a database retrieval system in which autocorrelation information is stored for each item subject to retrieval, the degree of match between the autocorrelation information of the search key and that of each item is computed, and item numbers are output in descending order of the degree of match.
With the prior-art character string retrieval methods described above, however, it is difficult to specify the degree of fuzziness of the search string, so the retrieval results often contain many character strings that the user did not want or that make no sense.
An object of the present invention is to provide a character string retrieval method in which the degree of fuzziness of the search string can be specified arbitrarily.
Another object of the present invention is to provide an index structure for realizing a character string retrieval method in which the degree of fuzziness of the search string can be specified arbitrarily.
A further object of the present invention is to provide a character string retrieval technique whose fuzzy retrieval produces results close to human intuition.
According to the present invention, in order to perform full-text retrieval on a database made up of a plurality of documents, a unique number (or code) is attached to each document, and every run of N consecutive characters in each document is stored in an index file together with the identification number of the document containing it and its position within that document. The index file suitably consists of two files, a font file (a "font" here meaning a pattern of N consecutive characters) and a position information file. The font file stores fonts and separators, together with the location in the position information file of the corresponding identification numbers and in-document position numbers. The position information file stores the identification numbers and in-document position numbers.
According to the present invention, there is provided a method that uses the above index file to retrieve, at high speed, documents containing a character string whose arrangement of characters is similar to that of a designated character string. In this method, the character string to be retrieved and a retrieval precision (greater than 0 and not exceeding 1) can be specified. The method then determines the documents containing a "similar character string" whose "similarity" to the character string to be retrieved exceeds the specified retrieval precision, as well as the position of the "similar character string" within each document.
Specifically, this method picks out from a document a "similar character string" resembling the character string to be retrieved and quantifies the "similarity" from two viewpoints: how many characters match consecutively, and how many extraneous characters are interposed.
The maximum value of the "similarity" is 1, which indicates that the character strings match exactly; whenever the character strings match exactly, the "similarity" is necessarily 1. When the similar character string contains extraneous characters that are not part of the character string to be retrieved, or when only part of the character string to be retrieved appears in the similar character string, the "similarity" takes a value less than 1. According to the present invention, such "similarity" values are useful to the extent that they agree well with human intuition.
Because the above index file allows any N consecutive characters in the documents to be looked up at high speed, successively comparing the N-character sequences of the search string against the index file makes it possible to detect quickly which characters match consecutively and how many extraneous characters are interposed between them.
Fig. 1 is a block diagram showing the hardware configuration.
Fig. 2 is a block diagram of the processing units.
Fig. 3 is a diagram showing the structure of the index file.
Fig. 4 is a flowchart showing the index file creation process.
Fig. 5 is a flowchart of the character string search process using the index file.
Fig. 6 is a flowchart of the fuzzy search process using the index.
Embodiments of the present invention are described below with reference to the drawings.
A. Hardware configuration
Fig. 1 shows a schematic of a system configuration for implementing the present invention. This is an ordinary configuration in which the following components are connected to a bus 101: a central processing unit (CPU) 102 having arithmetic and input/output control functions; a main memory (RAM) 104 into which programs are loaded and which provides a work area for the CPU 102; a keyboard 106 for entering commands, character strings and the like; a hard disk 108 storing the operating system that controls the CPU, the database files, the search engine, the index file and so on; a display 110 for displaying database search results; and a mouse 112 for pointing to an arbitrary position on the screen of the display 110 and sending that position information to the CPU.
The operating system is preferably one that supports a GUI multi-window environment as standard, such as Windows (trademark of Microsoft), OS/2 (trademark of IBM), or AIX (trademark of IBM) with the X Window System (trademark of MIT). However, the present invention can also be implemented in a character-based environment such as PC-DOS (trademark of IBM) or MS-DOS (registered trademark of Microsoft), and is not limited to any particular operating system environment.
Fig. 1 shows a stand-alone system, but database files generally require large-capacity disk drives. If the present invention is used in a client-server system, the database files and the search engine can therefore be placed on the server, with the client machines connected to the server over a local area network such as Ethernet or Token Ring, and only the display control portion for viewing the search results placed on the client side.
B. System configuration
Next, the system configuration of the present invention is described with reference to the block diagram of Fig. 2. Note that each unit shown as a block in Fig. 2 is stored on the hard disk 108 of Fig. 1 as a data file or program file, either individually or together with others.
The present invention mainly assumes that the database 202 stores a plurality of documents, as in a news report database or a patent gazette database. Note, however, that the scope of the invention is not limited to databases composed of a plurality of documents; it can also be applied to retrieval within a single document. The content of each individual document is stored in searchable form, for example as a text file. In addition, a unique identification number is attached to each document. The identification numbers are preferably consecutive numbers in ascending order starting from 1, but for a patent gazette database, for example, unique identifiers such as application numbers or publication numbers may be used instead. Documents may also be identified not by consecutive numbers but by symbols such as "ABC" or "&XYZ". However, such identifying symbols usually need more bytes than numbers to represent, so in practice identifying documents by consecutive numbers is the most suitable approach.
Searching directly through a volume of information as large as the news reports or patent gazettes stored in the database 202 would take a very long time. An index file 204 is therefore normally built in advance from the content stored in the database 202 by an index creation/update module 206. In the embodiment of the present invention described below, the index file 204 built in this way consists of two files, a font file and a position information file. The font file stores fonts and separators, together with the location in the position information file of the corresponding identification numbers and in-document position numbers. The position information file stores the identification numbers and in-document position numbers.
The database 202 may manage each document as a separate file, or may arrange all documents sequentially in a single continuous file. In short, what matters is that a unique number is attached to each document and that the content of each document can be accessed through that number. In the former case, the database 202 manages a table that associates the unique number of each document with the name of the file that actually stores it. In the latter case, the database 202 manages a table that associates the unique number of each document with its offset within the single database file and its size.
The search engine 208 takes a search string from the search string input module 210 as input, searches the index file 204, and returns the identification numbers (possibly more than one) of the documents containing the input search string and the positions (also possibly more than one) at which the string occurs in those documents. The search string input module 210 preferably consists of a dialog box in a multi-window environment, with an input field into which the desired search string is typed from the keyboard 106.
In addition, as a feature of the present invention, the search string input module 210 accepts the similarity for a fuzzy search as a number from 0 to 1 (or as a percentage from 0 to 100). For this purpose, the search string input module 210 displays a slider or scroll bar whose handle can indicate any position between 0 and 1. The handle may, for example, initially indicate the system default value of 1, and can be dragged with the mouse 112 to indicate another value.
The display module 212 accesses the database 202 according to the identification numbers and in-document occurrence positions returned as search results by the search engine 208, and displays the lines of each document corresponding to those positions in a suitable dedicated search result window. When the search results do not fit in the window, a scroll bar is displayed, and the user can move it to view the results in sequence.
C. Structure and creation of the index file
In the present invention, every run of N consecutive characters is written to a file together with its in-document position and the division information of the document, and an index is attached to form the index file. Typical division information within a document includes the separators used in text, such as periods and commas, and document separators in a broader sense, such as "Chapter 1" and "Abstract".
C1. Standardization of character strings
The first processing needed to create the index file is the standardization of character strings described below. In particular, when the documents to be searched are Japanese text files, they contain a mixture of half-width and full-width characters. Half-width characters are therefore converted to the corresponding full-width (double-byte) characters.
C2. Extraction of font information
The next step in creating the index file is to extract, at every character position of the standardized character string starting from the beginning, the N consecutive characters that begin there (hereinafter called a "font"), and to store each font in the index file together with the identification number and the in-document position number. N must satisfy N ≥ 1; for Japanese, N = 2 is suitable.
The in-document position number is a serial number, unique within the document, attached to every character subject to retrieval. The in-document position number of the first character of a font is used as the in-document position number of that font. When fewer than N characters remain at the end of a document, a predetermined filler character such as X'00' is appended to bring the total to N.
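The extraction rule described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions rather than the patent's implementation: the function name `extract_fonts` and the tuple layout are hypothetical, positions are 1-based as in the text, and `\x00` plays the role of the X'00' filler character.

```python
PAD = "\x00"  # assumed filler character, following the X'00' example in the text


def extract_fonts(text, doc_id, n=2):
    """Return one (font, doc_id, position) entry per character of `text`.

    A font is the run of n characters starting at a position; positions are
    1-based. Runs at the end of the document are padded to length n.
    """
    entries = []
    for i in range(len(text)):
        font = text[i:i + n].ljust(n, PAD)  # pad when fewer than n chars remain
        entries.append((font, doc_id, i + 1))
    return entries


if __name__ == "__main__":
    for entry in extract_fonts("本日は晴天なり。", doc_id=1):
        print(entry)
```

Every character position yields exactly one entry, so the number of entries equals the number of characters in the standardized document.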
In addition, in the present embodiment each individual document is divided into blocks according to a division scheme that is meaningful for retrieval, and this division information is stored in the index file. Division information is stored in the same format as the fonts described above. That is, instead of a font extracted from the standardized character string, a specially defined separator is stored together with the identification number of the document and the in-document position number of the character at the block boundary.
Because there can be several types of separator, several different division schemes can coexist. However, a separator must be defined so that it never coincides with a font extracted from the standardized character string. In the present embodiment, standardization converts 1-byte codes to 2-byte codes, so that every 2 bytes are treated as one word; no ordinary character code then has a word value of 255 or less. Any word value from 0 to 255 can therefore be assigned to each separator type.
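As a rough illustration of storing separators in the same (key, identification number, position) format, the sketch below emits separator entries for two division schemes. The names `SEPARATOR_1`, `SEPARATOR_2`, `SEPARATOR_RULES` and `separator_entries`, the code values 1 and 2, and the choice of boundary characters are all assumptions made for this example; the sample sentence is the familiar Japanese microphone-test phrase that this section's worked example appears to use.

```python
# Hypothetical separator codes: any word value in 0..255 is free to use,
# because standardized 2-byte characters never have such small values.
SEPARATOR_1 = 1  # assumed: marks sentence ends
SEPARATOR_2 = 2  # assumed: marks sentence ends and commas

# Assumed mapping from separator code to the boundary characters it marks.
SEPARATOR_RULES = {SEPARATOR_1: "。", SEPARATOR_2: "。、"}


def separator_entries(text, doc_id):
    """Return (separator_code, doc_id, position) for each block boundary.

    The position is the 1-based position of the boundary character, so the
    entries have the same shape as font entries.
    """
    entries = []
    for pos, ch in enumerate(text, start=1):
        for code, boundary_chars in SEPARATOR_RULES.items():
            if ch in boundary_chars:
                entries.append((code, doc_id, pos))
    return entries


if __name__ == "__main__":
    sample = "本日は晴天なり。ただいま、マイクのテスト中。"
    for entry in separator_entries(sample, doc_id=1):
        print(entry)
```

Because the entries share the shape of font entries, the same sorting and merging code can process both without a special case.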
Storing division information in the same format as fonts has the following advantages.
- Index creation and updating are simple. Division information needs no separate handling.
- The size of the index does not increase significantly. For example, compared with a format that attaches the number of the containing block to every in-document position number, the increase in size is very small.
C3. A concrete example of in-document position numbers
For example, suppose that a document beginning with the passage "本日は晴天なり。ただいま、マイクのテスト中。" ("It is fine today. Now testing the microphone.") is stored in the database 202 (Fig. 2). Attaching an in-document position number to each character of this passage gives the following.
In-document position number:    1  2  3  4  5  6  7  8  9  10 11 12 13 14 15 16 17 18 19 20 21 22
Standardized character string:  本 日 は 晴 天 な り 。 た だ い ま 、 マ イ ク の テ ス ト 中 。
Separation mode 1:              boundaries at positions 8 and 22
Separation mode 2:              boundaries at positions 8, 13 and 22
Now let the identification number of this document be 1 and the number of characters in a font be N = 2. The identification number and in-document position number associated with each font (of length 2) are then as follows.
Font            Identification number   In-document position number
本日            1                       1
日は            1                       2
は晴            1                       3
晴天            1                       4
天な            1                       5
なり            1                       6
り。            1                       7
。た            1                       8
separator 1     1                       8
separator 2     1                       8
ただ            1                       9
だい            1                       10
いま            1                       11
ま、            1                       12
、マ            1                       13
separator 2     1                       13
マイ            1                       14
イク            1                       15
クの            1                       16
のテ            1                       17
テス            1                       18
スト            1                       19
ト中            1                       20
中。            1                       21
。              1                       22
separator 1     1                       22
separator 2     1                       22
C4. Role of division information within a document
The value of in-document division information (separators) for retrieval is explained below.
Retrieval restricted to a specific part
For example, when a document consists of a title, an abstract and a body, users commonly want to search only a specific part such as the title or the abstract. Storing a separator and its position information at the end of the title and at the end of the abstract makes such retrieval possible.
Retrieval of documents in which several character strings are closely related
Users generally also want to retrieve documents in which several character strings are closely related in context.
For example, character strings that merely occur in the same document can be considered less closely related than strings in the same paragraph, and strings in the same sentence more closely related still. By adding separators at the ends of paragraphs and sentences and storing them with their position information, documents in which the strings fall within the same block can be retrieved, so that retrieval can take this closeness into account.
C5. Structure of the index file
Fonts, separators, their identification numbers and their in-document position numbers must be stored in a form that allows efficient extraction at search time. For this purpose, in the present embodiment the index file consists of a font file (which mainly stores fonts and separators) and a position information file (which mainly stores identification numbers and in-document position numbers). The font file stores fonts and separators, together with the location in the position information file of the corresponding identification numbers and in-document position numbers. The position information file stores the identification numbers and in-document position numbers.
An example of the font file and the corresponding position information file is shown in Fig. 3.
In Fig. 3, the entries of the font file 302 are the runs of N consecutive characters (fonts, here with N = 2) occurring in all documents of the database 202. To allow binary search, the entries of the font file 302 are preferably sorted in ascending order of the character code values of the standardized fonts. "Separator 1", "separator 2", "なり", "は晴" and so on are individual entries of the font file 302. "Separator 1", for example, collectively represents the symbols used to delimit clauses and sentences, such as commas and periods, and is assigned a 2-byte value.
The position information file 304 of Fig. 3 stores, for each entry of the font file 302, at least one identification number and, for each such identification number, at least one in-document position number.
To associate the entries of the font file 302 with those of the position information file 304, each entry of the font file 302 holds (although this is not shown in the figure) the offset of the corresponding entry from the beginning of the position information file 304 and the size of that entry. In Fig. 3, for example, the offset stored with "separator 2" in the font file 302 is used to seek from the beginning of the position information file 304, and the number of bytes given by the entry size is read from that position. In this way the in-document position numbers 8, 13, 22, ... associated with identification number 1, the in-document position numbers associated with identification number 2 (if any), and so on up to those associated with identification number n can all be read for "separator 2" in a single operation.
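The offset-and-size lookup just described can be sketched as follows. This is an illustration under assumptions, not the layout of Fig. 3 itself: the function names `build_files` and `lookup` are hypothetical, both files are held in memory, and every value in a position record is assumed to be a 4-byte little-endian unsigned integer (identification number, count of positions, then the positions). The key `"\x00\x02"` is only a stand-in for "separator 2".

```python
import bisect
import struct


def build_files(postings):
    """postings: {font: {doc_id: [positions]}} -> (keys, slots, blob).

    keys  : sorted fonts, standing in for the font file entries
    slots : (offset, size) of each font's entry inside `blob`
    blob  : bytes standing in for the position information file
    """
    keys, slots, blob = [], [], bytearray()
    for font in sorted(postings):  # ascending code order allows binary search
        start = len(blob)
        for doc_id in sorted(postings[font]):
            positions = postings[font][doc_id]
            blob += struct.pack(f"<II{len(positions)}I",
                                doc_id, len(positions), *positions)
        keys.append(font)
        slots.append((start, len(blob) - start))
    return keys, slots, bytes(blob)


def lookup(keys, slots, blob, font):
    """Binary-search the font, then decode its whole entry in one pass."""
    i = bisect.bisect_left(keys, font)
    if i == len(keys) or keys[i] != font:
        return {}
    offset, size = slots[i]
    end, result = offset + size, {}
    while offset < end:
        doc_id, count = struct.unpack_from("<II", blob, offset)
        offset += 8
        result[doc_id] = list(struct.unpack_from(f"<{count}I", blob, offset))
        offset += 4 * count
    return result


if __name__ == "__main__":
    files = build_files({"\x00\x02": {1: [8, 13, 22]}, "なり": {1: [6]}})
    print(lookup(*files, "\x00\x02"))
```

Keeping only the offset and size next to each font keeps the sorted key table small, while all postings for one font sit in a single contiguous region that can be read at once.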
The in-document position numbers associated with identification number i are generally stored in a format such as (identification number i: 4 bytes) (number of in-document position numbers k: 4 bytes) (1st in-document position number: 4 bytes) ... (k-th in-document position number: 4 bytes). In this example the field holding each in-document position number stores an absolute position within the document and uses 4 bytes. In practice, storing the displacement from the previous in-document position number instead can save 1 to 3 bytes per field.
C6. Index file creation process
The index file creation process is described below with reference to Fig. 4. This process is performed by the index creation/update module 206 of Fig. 2 when the database 202 is first built, or when documents are added to or deleted from the database 202.
In Fig. 4, a memory area is first secured in step 402. For example, a work area of a prescribed size can be obtained in the RAM 104 by calling an operating system function.
In step 404, one document is read from the database 202 into a suitable part of the memory area obtained in step 402.
In step 406, the document read in step 404 is standardized.
In step 408, the standardized document is scanned to produce fonts and separators, which are stored in the memory area obtained in step 402 together with the identification number of the document and their in-document position numbers.
As the processing of step 408 stores fonts, identification numbers and in-document position numbers in the memory area obtained in step 402, the free space in that area may run out. In step 410 it is therefore checked whether the memory area is full. If it is, then in step 412 the fonts and separators held in the memory area, together with their identification numbers and in-document position numbers, are sorted by, for example, the code value of the font or separator, the identification number and the in-document position number, and are written to the disk 108 (Fig. 1) as an intermediate file. The memory occupied by the data written to the intermediate file is thereby freed for use in subsequent processing. Processing then proceeds to step 414.
If it is determined in step 410 that the memory area still has room, processing proceeds directly to step 414.
In step 414, it is determined whether any documents not yet read in step 404 remain in the database 202. If so, processing returns to step 404.
If it is determined in step 414 that all documents in the database 202 have been read and processed, then in step 416 the fonts and separators remaining in the memory area obtained in step 402 that have not yet been written out, together with their identification numbers and in-document position numbers, are sorted by the code value of the font or separator, the identification number and the in-document position number, and are written to the disk 108 (Fig. 1) as an intermediate file.
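The flow of Fig. 4 resembles an external merge sort, and a compact sketch of it is given here, including the final merge of step 418 that the text describes next. All names (`build_runs`, `merge_runs`, `buffer_limit`) are hypothetical, in-memory lists stand in for the intermediate files on disk, standardization and separators are omitted, and the "memory full" test is simplified to an entry count.

```python
import heapq
from itertools import groupby


def build_runs(documents, n=2, buffer_limit=1000):
    """Steps 404-416: collect entries per document, flushing sorted runs.

    documents: iterable of (doc_id, text). Each returned run is a sorted list
    of (font, doc_id, position) tuples, standing in for one intermediate file.
    """
    runs, buffer = [], []
    for doc_id, text in documents:                 # step 404: read a document
        for i in range(len(text)):                 # step 408: emit fonts
            buffer.append((text[i:i + n].ljust(n, "\x00"), doc_id, i + 1))
        if len(buffer) >= buffer_limit:            # step 410: buffer full?
            runs.append(sorted(buffer))            # step 412: sort and flush
            buffer = []
    if buffer:                                     # step 416: final flush
        runs.append(sorted(buffer))
    return runs


def merge_runs(runs):
    """Step 418: merge sorted runs, combining repeated fonts into one entry."""
    merged = {}
    stream = heapq.merge(*runs)                    # k-way merge of sorted runs
    for font, group in groupby(stream, key=lambda entry: entry[0]):
        by_doc = {}
        for _, doc_id, pos in group:
            by_doc.setdefault(doc_id, []).append(pos)
        merged[font] = by_doc
    return merged


if __name__ == "__main__":
    runs = build_runs([(1, "abab"), (2, "ba")], buffer_limit=3)
    print(merge_runs(runs))
```

Sorting each run before it is written means the final merge only ever has to compare the heads of the runs, so memory use stays bounded no matter how large the database is.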
As a result of the intermediate file writes in steps 412 and 416, a plurality of intermediate files exist on the disk 108. Because each of these intermediate files has already been sorted, they are merged in step 418 using the well-known merge sort technique to produce the font file 302 and the position information file 304 shown in Fig. 3, which are stored on the disk 108. The same font may occur several times across the intermediate files, so the repeated entries for the same font are combined into one, with all the associated identification numbers and in-document position numbers attached to it.
D. Retrieval using the index file
An example of retrieval using the index file created as above is described below with reference to the flowchart of Fig. 5. First, in step 502, a dialog box with an input field, for example, is displayed to prompt the user to enter a search string in the input field.
When the user types a search string into the input field and clicks the OK button, the search string is standardized as necessary, and the font consisting of the first N characters of the search string is then looked up in the index file. The length N of this font is the same as the font length N of the index file, so a high-speed binary search of the index file can be performed using an N-character substring of the search string as the key. For Japanese documents, one suitable value is N = 2.
If it is determined in step 506 that the font consisting of the first N characters of the search string cannot be found, then in step 508 a message indicating that the search string was not found is displayed in a message box, and processing ends.
If it is determined in step 506 that the font consisting of the first N characters of the search string has been found, the index file returns one or more identification numbers, each with at least one in-document position number. In step 510 this information is stored in a prescribed buffer in main memory or on disk for use in subsequent processing.
In step 512, it is judged whether all N-character partial character strings of the search character string have been retrieved; if so, processing proceeds to step 520. If not, then in step 514 the index file is searched with the font of the next N characters of the search character string. The length of the search character string is generally not a multiple of N, so when the N-character fonts are retrieved one after another up to the partial character string near the end of the search character string, the remaining key can be shorter than N characters. In that case, the last N characters of the search character string are taken as the partial character string, which consequently overlaps the preceding N characters. When the search character string itself is shorter than N characters, the binary search yields a plurality of candidates, and the subsequent processing examines these candidates by sequential search.
In step 516, as in step 506, it is judged whether the N-character font of the search character string is found in the index file. Step 516 nevertheless differs from step 506 in substance. In step 516, "found" means that the font occurs under the same identification number as the font of the first N characters of the search character string, at an in-document position number equal to the position number of that first font plus N.
If the judgment in step 516 is that this N-character font of the search character string cannot be found, a message box displays in step 508 a message that the search character string was not found, and processing ends.
If the judgment in step 516 is that the font has been found, then in step 518 the identification numbers and in-document position numbers returned from the index file are examined in turn, and the entries that continue the font of the first N characters in the same document are stored in the designated buffer area in main memory or on disk for later processing.
When step 512 judges that the whole search character string has been retrieved, processing proceeds to step 520, where the identification number of each document containing the search character string, and the position of the string in it, are determined from the identification numbers and in-document position numbers stored in the buffer area. In step 522, the contents of database 202 are accessed using the document number and the in-document position number, and the line of the document that contains the search character string is displayed appropriately in another window.
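The flow of steps 506 to 520 can be sketched as below. This is a simplified illustration, not the patented implementation: the index is assumed to be a dictionary built in memory, positions are 1-based, and `build_index` and `exact_search` are hypothetical names. Search strings shorter than N, which the text handles by sequential search over several candidates, are deliberately left out.

```python
def build_index(docs, n=2):
    """Map every N-character font to the set of (document number, position)."""
    index = {}
    for doc_id, text in docs.items():
        for pos in range(len(text) - n + 1):
            index.setdefault(text[pos:pos + n], set()).add((doc_id, pos + 1))
    return index


def exact_search(index, query, n=2):
    """Return sorted (document number, start position) pairs of exact matches."""
    if len(query) < n:
        raise ValueError("queries shorter than N need the sequential-search path")
    # Take keys every N characters; if the length is not a multiple of N,
    # the last key is the final N characters and overlaps the previous key.
    offsets = list(range(0, len(query) - n + 1, n))
    if len(query) % n:
        offsets.append(len(query) - n)
    hits = None
    for off in offsets:
        postings = index.get(query[off:off + n], ())
        # Shift each posting back to the position where the query would start.
        shifted = {(doc_id, pos - off) for doc_id, pos in postings}
        hits = shifted if hits is None else hits & shifted
        if not hits:          # corresponds to the "not found" exit of step 508
            return []
    return sorted(hits)
```

For instance, with the documents `{1: "xabcdey", 2: "abcabc"}`, searching for `"abcde"` keeps only the occurrence in document 1, because the keys `"cd"` and `"de"` never follow `"ab"` at the required offsets in document 2.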
To check whether the search character string appears in a specific block of the document (for example, the third block), the block delimiter positions that occur in the document before the position at which the search character string appears are counted. This shows which block of the document contains the search character string, and the result can be compared with the number of the specified block.
E. Fuzzy search processing
The processing of Fig. 5 using the index file can be called exact retrieval. According to the present invention, however, the same index file also supports high-speed processing, referred to as fuzzy search, that searches each document of the database for character strings whose arrangement of characters is similar to that of a designated character string. Specifically, in this mode the character string to be retrieved and a retrieval precision (greater than 0 and not exceeding 1) can be designated, and the system can determine each document that contains a "similar character string" whose "similarity" to the character string to be retrieved exceeds the designated retrieval precision, together with the position of that "similar character string" in the document.
E1. Similarity of character strings as judged by human perception
To a person who understands Japanese, Japanese character strings whose arrangement of characters is similar and whose meaning is close fall into the following cases.
(1) Differences in katakana notation
Small versus full-size kana: "ソフトウェア" "ソフトウエア" (software)
Presence or absence of the long-vowel mark "-": "コンパイラ-" "コンパイラ" (compiler)
Presence or absence of the middle dot "·": "アイ·ビ-·エム" "アイビ-エム" (IBM)
Other: "ビルディング" "ビルヂング" (building)
(2) A particle or the like inserted between kanji words
"prosecution at residence" "prosecution while remaining at residence"
"political circles reorganization" "reorganization of political circles"
(3) A kanji compound word and compound words lacking part of it
"national museum of ethnology" "national museum" "museum of ethnology"
(4) Characters omitted through abbreviation or the like
"ソフトウエア開発" "ソフト開発" (software development)
(5) Misspelled loanwords
"カリフォルニア" "カリフォリニア" (California)
What these cases have in common is that the characters largely match in continuous runs, apart from some missing or extra characters.
Consider how similar particular words feel. In decreasing order of felt similarity to "ソフトメ-カ-" come "ソフトメ-カ-", "ソフト開発メ-カ-" and "ソフト開発メ-カ-"; likewise, compared with "political fund is advised positive bill", the strings that feel similar to it are, in decreasing order, "political fund rule execute", "political fund is just being advised" and "political fund".
Conversely, even though matching characters are present, it feels unreasonable to call a phrase that begins with "ソフトクリ-ム" and ends with "メ-カ-" (a machinery maker whose main business is manufacturing soft-serve ice cream machines) a character string similar to "ソフトメ-カ-".
Human perception of whether character strings are similar can be summarized as follows: (A) the more characters match continuously, the more similar the strings feel; (B) the more non-matching characters are interposed, the less similar they feel; and (C) when too many non-matching characters are interposed, the text is no longer perceived as a single character string.
A further case to consider is a character of the input character string being repeated at a nearby position in the document. For example, the input character string is "minister's To of science is taken up the post of", while the document contains "minister of portion To of science is taken up the post of". The repeated "portion" character is an extra character here, but compared with a character that has nothing to do with the input character string, it is reasonable to regard it as closer to a matching character.
E2. Structure of the index file and degree of matching
In the index file structure shown in Figure 3, each N-character font is indexed with identification numbers and in-document position numbers, and retrieval processing detects document numbers and in-document position numbers with one font as the smallest unit. To retrieve a character string shorter than N characters, however, every font that begins with the character string to be retrieved must be examined, and the number of such fonts can be considerable. When the input character string has N or more characters, the number of index lookups is at most the number of characters in the input character string, so by comparison the retrieval of an input character string shorter than N characters imposes a heavier load.
Matches of fewer than N characters should therefore be disregarded, and similar character strings determined from runs of N or more continuously matching characters; this is considered appropriate for preserving high speed.
E3. Rules for determining similar character strings and similarity
In outline, the rule is as follows: from the character strings in the document that match the input character string continuously over M or more characters, those that lie close to one another and occur in the same order as in the input character string are gathered into a similar character string, and the similarity is calculated from the number of matching characters and the number of non-matching characters.
First, the terms used in the explanation are defined.
Consistent character string:
A portion of M or more characters over which the character string to be retrieved and the document text match continuously. Among the matches that begin at the same character, the longest one is selected.
The character string that (example) is to be retrieved: political fund is advised positive bill
The document original text: ... fund is advised positive め To method カ In
Let M=2. Then "fund is advised positive" is a consistent character string. Because the longest match is selected, "fund" and "fund rule" are not called consistent character strings. "method" matches over only one character, which is fewer than 2, so it is not a consistent character string either.
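The definition can be illustrated with a brute-force sketch. It is an illustration only: the patent finds these runs through the index file rather than by comparing every pair of positions, the function names `match_length` and `consistent_strings` are hypothetical, and ASCII stand-in strings are used in place of the Japanese example.

```python
def match_length(query, doc, qi, di):
    """Length of the longest run on which query[qi:] and doc[di:] agree."""
    k = 0
    while (qi + k < len(query) and di + k < len(doc)
           and query[qi + k] == doc[di + k]):
        k += 1
    return k


def consistent_strings(query, doc, m=2):
    """Yield (query_start, doc_start, length) for maximal runs of >= m characters.

    Positions are 0-based. A run that is merely the tail of a longer run is
    skipped, which corresponds to selecting the longest match.
    """
    for qi in range(len(query)):
        for di in range(len(doc)):
            if qi > 0 and di > 0 and query[qi - 1] == doc[di - 1]:
                continue  # same run already reported from an earlier start
            k = match_length(query, doc, qi, di)
            if k >= m:
                yield (qi, di, k)
```

With the stand-ins `query = "abcdefgh"` and `doc = "zzcdefzzgz"`, only the four-character run `"cdef"` is reported; its sub-runs are not reported separately, and the isolated `"g"` is too short for M=2, mirroring the roles of "fund" and "method" in the example above.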
Effective consistent character string:
A consistent character string that forms part of a similar character string.
Maximum inconsistent string length L:
Non-matching characters contained in a similar character string may run for at most L characters in succession. L is a constant of 1 or more.
The method of selecting a "similar character string" and the method of quantifying the "similarity" are described below.
(1) Determining the 1st effective consistent character string
The first consistent character string in document order is taken as the 1st effective consistent character string.
Here,
the start position of the i-th effective consistent character string in the document is written s(D, i),
the end position of the i-th effective consistent character string in the document is written e(D, i),
the start position of the i-th effective consistent character string in the character string to be retrieved is written s(C, i), and
the end position of the i-th effective consistent character string in the character string to be retrieved is written e(C, i).
(2) Determining the next effective consistent character string
Once the i-th effective consistent character string has been determined, the (i+1)-th effective consistent character string is determined as follows.
The first consistent character string that satisfies both of the following conditions a) and b) is taken as the (i+1)-th effective consistent character string.
a) e(D, i) + 1 <= s(D, i+1) <= e(D, i) + L + 1
This means that at most L extra characters are allowed between the i-th and the (i+1)-th effective consistent character strings (see Example 3 below).
b) s(C, i+1) > e(C, i) - (M-1)
This is repeated until no further qualifying effective consistent character string can be selected.
(3) Determining the "similar character string" and its "similarity"
When no further effective consistent character string can be selected, the span from the first character of the 1st effective consistent character string to the last character of the last effective consistent character string is taken as the "similar character string", and the "similarity" is calculated as follows.
Similarity = the minimum of
(number of characters of the character string to be retrieved that belong to effective consistent character strings) / (number of characters of the character string to be retrieved)
and
(number of characters of the "similar character string" that belong to effective consistent character strings) / (number of characters of the "similar character string")
E4. Counting the characters of a "similar character string" that belong to effective consistent character strings
When two characters correspond to the same character of the character string to be retrieved, the first is counted as 1 and the second as 0.5. In all other cases a character is counted as 1 (see Example 4 below).
E5. Order in which "similar character strings" are determined
The 1st "similar character string" is determined by comparing from the beginning of the document. Once the i-th "similar character string" has been determined, its characters are scanned forward from its first character to find the first character that does not belong to any of the effective consistent character strings constituting it; comparison then resumes from that character to find the (i+1)-th similar character string.
By assigning suitable values to the constants L and M, a "similarity" can be calculated that agrees well with people's general judgment of whether arrangements of characters are similar.
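The rules of E3 and E4 can be put together in a brute-force sketch. Several points are assumptions made for illustration and are not stated in the text: the document is compared character by character instead of through the index file; at each document position the query position giving the longest match is chosen; every repeat of a query character is counted as 0.5; and the names `longest_match` and `similar_string` are hypothetical. Positions are 0-based.

```python
def longest_match(query, doc, d, c_min):
    """Best (length, query_start) for a run starting at doc[d], query_start >= c_min."""
    best_len, best_c = 0, -1
    for c in range(max(c_min, 0), len(query)):
        k = 0
        while (c + k < len(query) and d + k < len(doc)
               and query[c + k] == doc[d + k]):
            k += 1
        if k > best_len:
            best_len, best_c = k, c
    return best_len, best_c


def similar_string(query, doc, start=0, m=2, l=3):
    """Return (doc_start, doc_end, similarity) of the first similar string found
    at or after doc[start] (end inclusive), or None if there is none."""
    runs = []                                  # effective strings: (c, d, length)
    d_lo, d_hi, c_min = start, len(doc) - 1, 0
    while True:
        found = None
        for d in range(d_lo, min(d_hi, len(doc) - 1) + 1):
            k, c = longest_match(query, doc, d, c_min)
            if k >= m:
                found = (c, d, k)
                break
        if found is None:
            break
        runs.append(found)
        c, d, k = found
        e_c, e_d = c + k - 1, d + k - 1
        d_lo, d_hi = e_d + 1, e_d + l + 1      # condition a)
        c_min = e_c - (m - 1) + 1              # condition b)
    if not runs:
        return None
    used, doc_score = set(), 0.0
    for c, d, k in runs:
        for j in range(c, c + k):
            doc_score += 0.5 if j in used else 1.0   # E4: repeated query character
            used.add(j)
    s_d = runs[0][1]
    e_d = runs[-1][1] + runs[-1][2] - 1
    similarity = min(len(used) / len(query), doc_score / (e_d - s_d + 1))
    return s_d, e_d, similarity
```

With M=2 and L=3, `similar_string("アイビ-エム", "アイ·ビ-·エム")` gives a similarity of 0.75, which agrees with Example 1 below. Calling the function again with `start` set to the first non-effective character corresponds to the order of determination described in E5.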
In addition, a "similarity" of 1, the maximum value, indicates that the character strings match completely, and whenever the character strings match completely the "similarity" is necessarily 1.
E6. Flowchart of the fuzzy search
Expressed as a flowchart, the above processing is as shown in Figure 6. In Fig. 6, first, in step 602, the user is prompted to enter a search character string. In step 604, the user is prompted to enter a similarity between 0 and 1. The character string and the numerical value of steps 602 and 604 are normally entered in a dialog box by means of an input frame and a scroll bar.
In step 606, the number i of the effective consistent character string is set to 1, and in step 608 the effective consistent character string is retrieved. Given the condition that an effective consistent character string has a length of M or more, it is advantageous for the processing of Fig. 4 to build the index file from fonts of M characters, because with such an index file prepared in advance, any M-character font can be retrieved at high speed by binary search of the index file. Next, the index file is searched for the M-character font that begins one character after the start of the previous M-character font. If this search returns the same identification number as the previous M-character font, at the next consecutive in-document position number, an effective consistent character string of length M+1 is obtained. In the same way, each time the identification number is the same and the in-document position number is consecutive, the length of the effective consistent character string increases by 1. When a search of the index file for an M-character font finds nothing, or returns a different identification number or a non-consecutive in-document position number, the end position of that effective consistent character string has been reached.
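The one-character-at-a-time extension in step 608 can be sketched as follows, assuming an in-memory dictionary stands in for the index file. The names `build_index` and `extend_run`, the 0-based positions and the placeholder characters `XY` in the usage example are all illustrative assumptions.

```python
def build_index(docs, m=2):
    """Map every M-character font to the set of (document number, position)."""
    index = {}
    for doc_id, text in docs.items():
        for pos in range(len(text) - m + 1):
            index.setdefault(text[pos:pos + m], set()).add((doc_id, pos))
    return index


def extend_run(index, query, c, doc_id, pos, m=2):
    """Length of the effective consistent string that starts with query[c:c+m]
    at (doc_id, pos). The caller is assumed to have found that first font."""
    length = m
    while c + length < len(query):
        shift = length - m + 1                     # move one character along
        font = query[c + shift:c + shift + m]
        # Same document and consecutive position, otherwise the run ends here.
        if (doc_id, pos + shift) not in index.get(font, ()):
            break
        length += 1
    return length
```

For the query "ソフトウエアメ-カ-" and a document "ソフトXYメ-カ-" (with `XY` standing for two unrelated characters), the run starting at "ソフ" grows to length 3 and the run starting at "メ-" grows to length 4, using only index lookups.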
In some cases no effective consistent character string is found at all. In that case, the judgment of step 610 leads to step 626, where a not-found message is displayed and processing ends.
If step 610 judges that an effective consistent character string has been found, processing proceeds to step 612, where the characters from s(D, i) to e(D, i) in the document and from s(C, i) to e(C, i) in the search character string are marked as belonging to an effective character string.
In step 614, the index file is used again to retrieve an (i+1)-th effective consistent character string that satisfies the following conditions:
a) e(D, i) + 1 <= s(D, i+1) <= e(D, i) + L + 1
and
b) s(C, i+1) > e(C, i) - (M-1)
If one is found, processing returns to step 612, and for this (i+1)-th effective consistent character string the characters from s(D, i+1) to e(D, i+1) in the document and from s(C, i+1) to e(C, i+1) in the search character string are marked as belonging to an effective character string. (The increment of i in step 618 indicates the move to the next effective consistent character string.)
If, on the other hand, step 616 judges that no further effective consistent character string has been found, the similarity is calculated in step 620, for example with the formula given above:
Similarity = the minimum of
(number of characters of the character string to be retrieved that belong to effective consistent character strings) / (number of characters of the character string to be retrieved)
and
(number of characters of the "similar character string" that belong to effective consistent character strings) / (number of characters of the "similar character string")
Here, the "similar character string" is the character string extending from the start position of the first effective consistent character string to the last position of the last effective consistent character string.
In step 622, results are selected by comparing the similarity calculated in step 620 with the similarity entered in step 604; only when the calculated similarity is greater than the value entered in step 604 is the result displayed in step 624.
The processing in step 624 uses the identification number and in-document position number returned by the index file searches of steps 608 and 614 to access the document content stored in the database and display the line that contains this position.
Note that "similar character strings" for one search character string may be found in a plurality of documents at once, and in a plurality of places within one document. Steps 606 to 622 are therefore applied to each of these "similar character strings", and step 624 displays only those that satisfy the similarity condition.
E7. Examples of determining "similar character strings" and similarity
In the examples shown, M=2 and L=3.
(Example 1)
Character string C to be retrieved (positions 1 to 6): アイビ-エム (アイビ-エム is a trademark of IBM Corporation)
Document D (positions 1 to 8, ...): アイ·ビ-·エム
The longest consistent character string at the beginning is "アイ", so the 1st effective consistent character string is "アイ".
s(C,1)=1, e(C,1)=2
s(D,1)=1, e(D,1)=2
Because e(C,1)-(M-1)=1, character strings beginning at the 2nd or a later character of the character string to be retrieved are compared with character strings beginning at position 3, 4, 5 or 6 of the document (because e(D,1)+1=3 and e(D,1)+L+1=6) to retrieve the 2nd effective consistent character string.
The 2nd effective consistent character string is "ビ-".
s(C,2)=3, e(C,2)=4
s(D,2)=4, e(D,2)=5
Because e(C,2)-(M-1)=3, character strings beginning at the 4th or a later character of the character string to be retrieved are compared with character strings beginning at position 6, 7, 8 or 9 of the document (because e(D,2)+1=6 and e(D,2)+L+1=9) to retrieve the 3rd effective consistent character string.
The 3rd effective consistent character string is "エム".
s(C,3)=5, e(C,3)=6
s(D,3)=7, e(D,3)=8
Because the end of the character string to be retrieved has been reached, the 3rd effective consistent character string is the last one.
C: [アイ](1) [ビ-](2) [エム](3)
D: [アイ](1) · [ビ-](2) · [エム](3) ...
The numbers in parentheses are the numbers of the effective consistent character strings. The "similar character string" is therefore "アイ·ビ-·エム", from s(D,1) to e(D,3). "Similarity" = min(6/6, 6/8) = 6/8 = 0.75
(Example 2)
Character string C to be retrieved (positions 1 to 10): ソフトウエアメ-カ-
Document D (positions 1 to 9, ...): ソフト開発メ-カ-
C: [ソフト](1) ウエア [メ-カ-](2)
D: [ソフト](1) 開発 [メ-カ-](2)
"Similar character string" = "ソフト開発メ-カ-"; similarity = min(7/10, 7/9) = 0.7
(Example 3)
Character string C to be retrieved (positions 1 to 4): prosecute at residence
Document D (positions 1 to 9, ...): at residence ま ま In prosecution To ふ body I つ.
The longest consistent character string at the beginning is "at residence", so the 1st effective consistent character string is "at residence".
s(C,1)=1, e(C,1)=2
s(D,1)=1, e(D,1)=2
Character strings beginning at the 2nd or a later character of the character string to be retrieved (because e(C,1)-(M-1)=1) are compared with character strings beginning at position 3, 4, 5 or 6 of the document (because e(D,1)+1=3 and e(D,1)+L+1=6) to retrieve the 2nd effective consistent character string.
No 2nd effective consistent character string can be found, so the search for this similar character string ends and the 1st is the only effective consistent character string.
C: [at residence](1) prosecution
D: [at residence](1), four other characters, prosecution, ...
The 1st "similar character string" is therefore "at residence", from s(D,1) to e(D,1). Similarity = min(2/4, 2/2) = 0.5
The first non-effective character after "at residence" is the character at position 3 of the document. Retrieving the 2nd "similar character string" from that character onward gives:
C: at residence [prosecution](1)
D: at residence, four other characters, [prosecution](1), ...
In the document, "at residence" and "prosecution" are 4 characters apart, and L=3 in this example, so "prosecution" could not be regarded as an effective consistent character string of the 1st similar character string.
(Example 4)
Character string C to be retrieved (positions 1 to 3): bank person
Document D (positions 1 to 9): The office staff B さん of A bank
The longest consistent character string at the beginning is "bank", so the 1st effective consistent character string is "bank".
s(C,1)=1, e(C,1)=2
s(D,1)=2, e(D,1)=3
Character strings beginning at the 2nd or a later character of the character string to be retrieved (because e(C,1)-(M-1)=1) are compared with character strings beginning at position 4, 5, 6 or 7 of the document (because e(D,1)+1=4 and e(D,1)+L+1=7) to retrieve the 2nd effective consistent character string.
The 2nd effective consistent character string is "office staff".
s(C,2)=2, e(C,2)=3
s(D,2)=4, e(D,2)=5
Because the end of the character string to be retrieved has been reached, there are two effective consistent character strings. The 2nd character of the character string to be retrieved belongs to both of them.
Counts for document positions 2 to 5: 1, 1, 0.5, 1, giving a total of 3.5.
The "similar character string" is "office staff of bank", from s(D,1) to e(D,2). "Similarity" = min(3/3, 3.5/4) = 3.5/4 = 0.875
E8. Fuzzy search examples close to human perception
Character string to be retrieved: ソフトウエアメ-カ-
Document                        Similarity
ソフトウエアメ-カ-               0.909
ソフトウエア開発メ-カ-           0.833
ソフトウエア開発メ-カ-           0.769
This example shows that the "similarity" decreases as extra characters are interposed.
Document                        Similarity
ニツトウエアメ-カ-               0.800
ソフトメ-カ-                     0.700
ソフトウエア                     0.600
This example shows that the "similarity" decreases as the number of matching characters decreases.
Character string to be retrieved: Minister of science elects
Document                                          Similarity
minister of science elects                        1.000
The minister of Neo-Confucianism portion elects   0.929
Minister of science elects                        0.857
E9. Relation between the index structure and the retrieval of "similar character strings"
With a suitable setting of the value of M, the index structure of the present invention makes it possible to search for "similar character strings" by fuzzy search at high speed.
Method of determining the constants N and M
N: the number of characters of a font stored in the index.
M: the minimum length of an effective consistent character string in the fuzzy search.
L: in the fuzzy search, the maximum length of a run of non-effective characters within a "similar character string".
If N is made large, the number of distinct fonts increases and the amount of data to be examined for each font decreases, so retrieval becomes faster, but the size of the index file increases. For general Japanese documents, N=2 gives sufficient retrieval speed.
If M is chosen so that M >= N, sufficient retrieval speed is obtained in the fuzzy search. Since a smaller M allows a finer-grained fuzzy search, M=N is considered desirable.
E10. Second embodiment of determining similarity
The fuzzy search processing of the 2nd embodiment pays particular attention to the two observations "the more non-matching characters are interposed, the less similar the strings feel" and "when too many non-matching characters are interposed, the text is no longer perceived as the same character string", and takes both into account. Suppose the document contains, in sequence, a consistent character string, a non-matching character string and another consistent character string. It is unreasonable for the similarity to be lower when the similar character string extends through the second consistent character string than when it is cut off before it. For example, let the input character string be "prosecute at residence", let document 1 contain "at residence" followed by several kana characters and then "prosecution", and let document 2 contain only "at residence". A rule under which both documents contain a similar character string but document 2, with "at residence" alone, receives the higher similarity gives a result contrary to human perception. It is reasonable either for the phrase in document 1 to be judged more similar than "at residence", or for document 1 to be treated as containing the two similar character strings "at residence" and "prosecution".
The processing of the 2nd embodiment is described below. With reference to the flowchart of Fig. 6, steps 602 to 612 are the same in this embodiment, while step 614, which gives the search condition for the (i+1)-th effective consistent character string, is changed as follows.
s(C, i+1) > e(C, i) - (M-1) ... (formula 1)
s(D, i+1) > e(D, i) ... (formula 2)
and
s(D, i+1) - e(D, i) - 1 + max(e(C, i) - s(C, i+1) + 1, 0) <= L ... (formula 3)
The definitions of s(C, i), e(C, i), s(D, i), e(D, i) and so on are as given above.
Formula 1 allows up to M-1 repeated characters, such as the repeated "portion" character in the "minister of portion of science" example above; apart from that, it means that only character strings appearing in the same order as in the input character string are effective.
Formula 2 means that effective consistent character strings do not overlap in the document.
Formula 3 means that the interposed non-matching characters and the repeated characters, such as the "portion" character of "minister of portion of science", are allowed up to a combined total of L characters.
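Formulas 1 to 3 can be written as a single predicate. In this sketch each effective consistent character string is represented by an assumed tuple `(s_c, e_c, s_d, e_d)` of inclusive positions, and the name `is_next_effective` is hypothetical.

```python
def is_next_effective(prev, cand, m=2, l=3):
    """Check whether cand may follow prev as the next effective consistent string.

    prev and cand are (s_c, e_c, s_d, e_d): start/end in the search string C
    and start/end in the document D, all inclusive.
    """
    _, e_c, _, e_d = prev
    n_s_c, _, n_s_d, _ = cand
    gap = n_s_d - e_d - 1                   # non-matching document characters between
    overlap = max(e_c - n_s_c + 1, 0)       # search-string characters used again
    return (n_s_c > e_c - (m - 1)           # formula 1
            and n_s_d > e_d                 # formula 2
            and gap + overlap <= l)         # formula 3
```

For the values of Example 5 below, `is_next_effective((1, 2, 1, 2), (3, 4, 4, 5))` holds, since one interposed character and no repeated character stay within L=3. A candidate that begins five positions after the previous end, with four interposed characters, fails formula 3.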
In this embodiment, unlike the preceding embodiment, the similarity is not obtained by calculating, for the search character string and for each similar character string in the document, the proportion occupied by effective consistent character strings and taking the smaller proportion. Instead, points are awarded to the similar character string, and the similarity is calculated as the ratio of these points to the full score (the points awarded for a complete match). Each character is given points by the following rules, and the points are totaled to give the score of the similar character string. The following processing is therefore performed in step 620 of Fig. 6.
A character belonging to the 1st effective consistent character string: 1 point
A character belonging to the i-th (i>1) effective consistent character string:
position in the search character string >= e(C, i-1) + 1 (formula 4): 1 point
position in the search character string < e(C, i-1) + 1 (formula 5): -1/(2*L) points
A character not belonging to any effective consistent character string: -1/L points
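The point rules can be sketched as a small scoring function. The tuple layout `(s_c, e_c, s_d, e_d)` and the name `score_similar_string` are assumptions for illustration, and formula 5 is read here as "position at or before e(C, i-1)", so that formulas 4 and 5 do not overlap.

```python
def score_similar_string(runs, query_len, l=3):
    """Similarity of the 2nd embodiment for one similar character string.

    runs: effective consistent strings in order, each (s_c, e_c, s_d, e_d)
    with inclusive positions; query_len: length of the search string, which
    is also the full score of a complete match.
    """
    total = 0.0
    prev_e_c = prev_e_d = None
    for s_c, e_c, s_d, e_d in runs:
        if prev_e_d is not None:
            # characters between two runs belong to no effective string
            total -= (s_d - prev_e_d - 1) / l
        for pos in range(s_c, e_c + 1):
            if prev_e_c is not None and pos <= prev_e_c:
                total -= 1 / (2 * l)      # repeated character (formula 5)
            else:
                total += 1                # ordinary matching character
        prev_e_c, prev_e_d = e_c, e_d
    return total / query_len
```

With the three runs of Example 5 below, `score_similar_string([(1, 2, 1, 2), (3, 4, 4, 5), (5, 6, 7, 8)], 6)` evaluates to (6 - 2/3) / 6, about 0.889, which the example reports as 0.88.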
In this embodiment, once the i-th similar character string has been determined, its characters are scanned forward from its first character to find the first character that does not belong to an effective consistent character string constituting it, and comparison resumes from that character to find the (i+1)-th similar character string.
The negative points for characters that do not belong to an effective consistent character string are set so as to take into account both observations, "the more non-matching characters are interposed, the less similar the strings feel" and "when too many non-matching characters are interposed, the text is no longer perceived as the same character string". The negative points of one run of non-matching characters total at most (1/L)*L = 1. The positive points of the following consistent character string are at least its minimum length, so when N >= 1 (2 is especially recommended for Japanese) the negative points cannot exceed those positive points. Formula 5 describes a repeated character such as the "portion" character of "minister of portion of science" above, and formula 4 describes an ordinary matching character that is not a repetition. A character described by formula 5 receives negative points smaller in magnitude than those of a plain non-matching character, to deal with the case of repeated characters.
E11. Examples of determining similar character strings and similarity in the 2nd embodiment
In the examples, N=2 and L=3 as before.
(Example 5)
Input character string C: アイビ-エム
A part of document D (positions 1 to 8, ...): ... アイ·ビ-·エム
The longest consistent character string at the beginning is "アイ", so the 1st effective consistent character string is "アイ".
s(C,1)=1, e(C,1)=2
s(D,1)=1, e(D,1)=2
By formulas 1, 2 and 3, the 2nd effective consistent character string is "ビ-".
s(C,2)=3, e(C,2)=4
s(D,2)=4, e(D,2)=5
By formulas 1, 2 and 3, the 3rd effective consistent character string is "エム".
s(C,3)=5, e(C,3)=6
s(D,3)=7, e(D,3)=8
Because the end of the character string to be retrieved has been reached, there are 3 effective consistent character strings.
C: [アイ](1) [ビ-](2) [エム](3)
D: [アイ](1) · [ビ-](2) · [エム](3)
Points for the characters of D: 1, 1, -1/3, 1, 1, -1/3, 1, 1
The similar character string is "アイ·ビ-·エム", from s(D,1) to e(D,3). Similarity = (1*6 + (-1/3)*2) / 6 = 0.88
(Example 6)
Input character string C (positions 1 to 10): ソフトウエアメ-カ-
A part of document D (positions 1 to 9, ...): ... ソフト開発メ-カ-
C: [ソフト](1) ウエア [メ-カ-](2)
D: [ソフト](1) 開発 [メ-カ-](2)
Similar character string = "ソフト開発メ-カ-"; similarity = (1*7 + (-1/3)*2) / 10 = 0.63
(Example 7)
Input character string C (positions 1 to 4): prosecute at residence
A part of document D (positions 1 to 14, ...): at residence ま ま In prosecution To ふ body I つ.
The first consistent character string is "at residence", so the 1st effective consistent character string is "at residence". The next consistent character string, "prosecution", does not satisfy formula 3, so the 1st is the only effective consistent character string.
C: [at residence](1) prosecution
D: [at residence](1), intervening characters, prosecution, ...
The similar character string is "at residence". Similarity = 2/4 = 0.5
The first non-effective character after "at residence" is the character that immediately follows it, and the 2nd similar character string is retrieved from that character onward.
C: at residence [prosecution](1)
D: at residence, intervening characters, [prosecution](1), ...
The 2nd similar character string is therefore "prosecution".
(Example 8)
Input character string C (positions 1 to 7): minister's To of science is taken up the post of
A part of document D (positions 1 to 8): ... the minister of Neo-Confucianism portion To is taken up the post of ...
There are two effective consistent character strings, "portion of science" and "minister's To are taken up the post of". The 2nd effective consistent character string begins in C at a character that already belongs to the 1st.
Points for the characters of D: 1, 1, 1, -1/6, 1, 1, 1, 1
The similar character string is "minister of portion To of science is taken up the post of". The second "portion" character satisfies formula 5. Therefore, similarity = (1*7 + (-1/6)*1) / 7 = 0.97
E12. Summary of results of the 2nd embodiment
Input character string | Character string in document | Similarity
ソフトメ-カ- | ソフトのメ-カ- | 0.95
 | ソフト開発メ-カ- | 0.85
political fund is advised positive bill | political fund rule and is executed | 0.87
 | political fund | 0.50
理学部長に就任 | 理学部部長に就任 | 0.97
 | Minister's To of science takes up the post of | 0.95
As described above, the present invention applies a distinctive index structure to text or databases and thereby achieves high-speed fuzzy retrieval that matches human perception of similarity.
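The arithmetic in the worked examples above can be checked with a short sketch. The function name `similarity` and its parameters are hypothetical; the sketch only assumes what the examples show, namely that each matched character scores 1, each unmatched character inside the similar character string scores a negative penalty (1/3 in most examples, 1/6 for the character that the text says satisfies formula 5), and the total is divided by the length of the input character string C.

```python
def similarity(query_len, matched_chars, unmatched_chars, penalty):
    """Score-based similarity as used in the worked examples (an assumption
    drawn from the arithmetic shown there, not a definitive implementation).

    query_len       -- number of characters in the input character string C
    matched_chars   -- characters of D inside effective consistent strings
    unmatched_chars -- characters of D between effective consistent strings
    penalty         -- points deducted per unmatched character
    """
    return (1 * matched_chars - penalty * unmatched_chars) / query_len


# アイビ-エム against アイ·ビ-·エム: 6 matched, 2 unmatched, penalty 1/3
print(similarity(6, 6, 2, 1 / 3))
# ソフトウエアメ-カ- against ソフト開発メ-カ-: 7 matched, 2 unmatched, penalty 1/3
print(similarity(10, 7, 2, 1 / 3))
# Example 8: 7 matched, 1 unmatched, penalty 1/6
print(similarity(7, 7, 1, 1 / 6))
```

The three results are roughly 0.889, 0.633 and 0.976; the figures 0.88, 0.63 and 0.97 given in the examples appear to be these values cut to two decimal places.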

Claims (56)

1. An information retrieval method for determining, by computer processing, the similarity of a character string in a document stored in retrievable form that is similar to a search character string, the method comprising the following steps:
(a) a step of inputting a search character string;
(b) a step of taking, from the beginning of the search character string, a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and detecting in the document the start position and end position that match it (hereinafter, a partial character string of length M characters or more determined by this start position and end position is called an effective consistent character string);
(c) a step of, if no effective consistent character string is detected in step (b), shifting the start position of the partial character string in the search character string by one character, taking again a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and searching for the effective consistent character string;
(d) a step of, if an effective consistent character string is detected, shifting the start position of the partial character string in the search character string and the retrieval start position in the document each by the length of the effective consistent character string just detected, and searching for an effective consistent character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the effective consistent character string just detected;
(e) a step of repeating step (d) as long as an effective consistent character string is detected;
(f) a step of calculating, for at least the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, the similarity between that character string and the search character string, according to information on the effective consistent character strings present in it.
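Steps (b) to (e) can be sketched as a greedy scan. This is a minimal sketch under stated assumptions, not the claimed method itself: the function name `find_effective_matches` is hypothetical, positions are 0-based with exclusive ends (the description uses 1-based positions), only the first occurrence of the initial match in the document is tried, each match is extended to its longest length, and "within L characters" is read as "at most L unmatched document characters after the previous match", which agrees with the worked examples in the description.

```python
def find_effective_matches(query, doc, M=2, L=3):
    """Return effective matches as (q_start, q_end, d_start, d_end) tuples."""
    matches = []
    q = 0
    while q + M <= len(query):
        seed = query[q:q + M]
        if not matches:
            # Step (b): first match may start anywhere in the document.
            d = doc.find(seed)
        else:
            # Step (d): next match must start within L characters
            # after the end of the match just detected.
            prev_end = matches[-1][3]
            d = doc.find(seed, prev_end, prev_end + L + M)
        if d == -1:
            q += 1  # Step (c): shift the query start by one character.
            continue
        n = M
        # Extend the match as far as both strings agree.
        while q + n < len(query) and d + n < len(doc) and query[q + n] == doc[d + n]:
            n += 1
        matches.append((q, q + n, d, d + n))
        q += n  # Step (d): skip past the matched part of the query.
    return matches


print(find_effective_matches("アイビ-エム", "アイ·ビ-·エム"))
```

For the strings of the worked example this yields three matches, (0, 2, 0, 2), (2, 4, 3, 5) and (4, 6, 6, 8), which correspond to the 1-based positions s(D,1)=1, e(D,1)=2, s(D,2)=4, e(D,2)=5, s(D,3)=7, e(D,3)=8 given in the description.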
2. The information retrieval method according to claim 1, characterized in that M is 2 and L is 3 or more.
3. The information retrieval method according to claim 1, characterized in that the similarity is calculated as the smaller of two ratios: the ratio of the search character string occupied by the effective consistent character strings, and the ratio occupied by the effective consistent character strings in the character string of the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string.
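The smaller-of-two-ratios calculation of claim 3 can be sketched as follows. The function name and parameters are hypothetical, and the sample numbers are an illustration computed here, not a result reported in the description.

```python
def min_ratio_similarity(query_len, span_len, matched_chars):
    """Smaller of: matched share of the search string, and matched share
    of the document span from the first match start to the last match end."""
    return min(matched_chars / query_len, matched_chars / span_len)


# Illustration: a 6-character search string fully matched inside an
# 8-character document span (6 matched characters, 2 unmatched).
print(min_ratio_similarity(query_len=6, span_len=8, matched_chars=6))
```

Here the search string is fully covered (6/6) but the document span is not (6/8), so the smaller ratio, 0.75, is taken.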
4. The information retrieval method according to claim 1, characterized in that, for each character of the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, points are added when the character belongs to an effective consistent character string and deducted when it does not, and the similarity is calculated as the resulting score divided by the score of a complete match.
5. An information retrieval method for finding, by computer processing, the position at which a search character string appears in a document stored in retrievable form, the method comprising the following steps:
(a) a step of inputting a search character string;
(b) a step of inputting a similarity;
(c) a step of taking, from the beginning of the search character string, a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and detecting in the document the start position and end position that match it (hereinafter, a partial character string of length M characters or more determined by this start position and end position is called an effective consistent character string);
(d) a step of, if no effective consistent character string is detected in step (c), shifting the start position of the partial character string in the search character string by one character, taking again a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and searching for the effective consistent character string;
(e) a step of, if an effective consistent character string is detected, shifting the start position of the partial character string in the search character string and the retrieval start position in the document each by the length of the effective consistent character string just detected, and searching for an effective consistent character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the effective consistent character string just detected;
(f) a step of repeating step (e) as long as an effective consistent character string is detected;
(g) a step of calculating, for at least the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, the similarity between that character string and the search character string, according to information on the effective consistent character strings present in it;
(h) a step of, if the similarity obtained by the above calculation is greater than the similarity input in step (b), displaying the content of the effective consistent character strings contained in the document.
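Step (h) amounts to a threshold filter on the calculated similarity. A minimal sketch, assuming candidates are held as hypothetical (label, similarity) pairs and that "greater than" is a strict comparison; the labels and the threshold 0.6 are placeholders, while the three similarity values are those of the worked examples in the description.

```python
def select_hits(candidates, precision):
    """Keep candidates whose similarity is strictly greater than `precision`."""
    return [(span, sim) for span, sim in candidates if sim > precision]


candidates = [("span A", 0.88), ("span B", 0.63), ("span C", 0.50)]
print(select_hits(candidates, 0.6))
```

With the threshold 0.6, only the first two candidates would be displayed.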
6. The information retrieval method according to claim 5, characterized in that M is 2 and L is 3 or more.
7. The information retrieval method according to claim 5, characterized in that the similarity is calculated as the smaller of two ratios: the ratio of the search character string occupied by the effective consistent character strings, and the ratio occupied by the effective consistent character strings in the character string of the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string.
8. The information retrieval method according to claim 5, characterized in that, for each character of the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, points are added when the character belongs to an effective consistent character string and deducted when it does not, and the similarity is calculated as the resulting score divided by the score of a complete match.
9. The information retrieval method according to claim 8, characterized in that the points added are 1 point per character and the points deducted are 1/L point per character.
10. An information retrieval method for detecting, by computer processing, in a database storing a plurality of documents in retrievable form, the similarity of a character string in a document that is similar to a search character string, the method comprising the following steps:
(a) inputting a search character string;
(b) a step of taking, from the beginning of the search character string, a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and detecting in the same document of the database the start position and end position that match it (hereinafter, a partial character string of length M characters or more determined by this start position and end position is called an effective consistent character string);
(c) if no effective consistent character string is detected in step (b), shifting the start position of the partial character string in the search character string by one character, taking again a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and searching for the effective consistent character string;
(d) a step of, if an effective consistent character string is detected, shifting the start position of the partial character string in the search character string and the retrieval start position in the same document each by the length of the effective consistent character string just detected, and searching for an effective consistent character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the effective consistent character string just detected;
(e) a step of repeating step (d) as long as an effective consistent character string is found;
(f) a step of calculating, for at least the character string in the same document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, the similarity between that character string and the search character string, according to information on the effective consistent character strings present in it.
11. The information retrieval method according to claim 10, characterized in that M is 2 and L is 3.
12. The information retrieval method according to claim 10, characterized in that the similarity is calculated as the smaller of two ratios: the ratio of the search character string occupied by the effective consistent character strings, and the ratio occupied by the effective consistent character strings in the character string of the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string.
13. The information retrieval method according to claim 10, characterized in that, for each character of the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, points are added when the character belongs to an effective consistent character string and deducted when it does not, and the similarity is calculated as the resulting score divided by the score of a complete match.
14. An information retrieval method for detecting, by computer processing, the places at which a search character string appears in a database storing a plurality of documents in retrievable form, the method comprising the following steps:
(a) a step of inputting a search character string;
(b) a step of inputting a similarity;
(c) a step of taking, from the beginning of the search character string, a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and detecting in the same document of the database the start position and end position that match it (hereinafter, a partial character string of length M characters or more determined by this start position and end position is called an effective consistent character string);
(d) a step of, if no effective consistent character string is detected in step (c), shifting the start position of the partial character string in the search character string by one character, taking again a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and searching for the effective consistent character string;
(e) a step of, if an effective consistent character string is detected, shifting the start position of the partial character string in the search character string and the retrieval start position in the same document each by the length of the effective consistent character string just detected, and searching for an effective consistent character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the effective consistent character string just detected;
(f) a step of repeating step (e) as long as an effective consistent character string is found;
(g) a step of calculating, for at least the character string in the same document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, the similarity between that character string and the search character string, according to information on the effective consistent character strings present in it;
(h) a step of, if the similarity obtained by the above calculation is greater than the similarity input in step (b), displaying the content of the effective consistent character strings contained in the document.
15. The information retrieval method according to claim 14, characterized by having a step of assigning in advance a unique number or symbol to each of the plurality of documents.
16. The information retrieval method according to claim 15, characterized in that the unique numbers or symbols are sequential numbers.
17. The information retrieval method according to claim 14, characterized in that M is 2 and L is 3.
18. The information retrieval method according to claim 14, characterized in that the similarity is calculated as the smaller of two ratios: the ratio of the search character string occupied by the effective consistent character strings, and the ratio occupied by the effective consistent character strings in the character string of the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string.
19. The information retrieval method according to claim 14, characterized in that, for each character of the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, points are added when the character belongs to an effective consistent character string and deducted when it does not, and the similarity is calculated as the resulting score divided by the score of a complete match.
20. The information retrieval method according to claim 19, characterized in that the points added are 1 point per character and the points deducted are 1/L point per character.
21. An information retrieval system for detecting, by computer processing, the position at which a search character string appears in a document stored in retrievable form, the system having the following devices:
(a) a device for inputting a search character string;
(b) a device for taking a partial character string of the search character string of length M characters or more (M is a predetermined integer of 2 or more) and retrieving in the document the start position and end position that match it (hereinafter, a partial character string of length M characters or more determined by this start position and end position is called an effective consistent character string);
(c) a device that, starting from the beginning of the search character string and using device (b), if an effective consistent character string is detected, shifts the start position of the partial character string in the search character string and the retrieval start position in the document each by the length of the effective consistent character string just detected and searches for an effective consistent character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the effective consistent character string just detected, and, if no effective consistent character string is detected, shifts the start position of the partial character string in the search character string by one character, takes again a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and searches for the effective consistent character string;
(d) a device that continues the above step (d) as long as an effective consistent character string is detected;
(e) a device that calculates, for at least the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, the similarity between that character string and the search character string, according to information on the effective consistent character strings present in it.
22. An information retrieval system for detecting, by computer processing, the position at which a search character string appears in a document stored in retrievable form, the system having the following devices:
(a) a device for inputting a search character string;
(b) a device for inputting a similarity;
(c) a device for taking a partial character string of the search character string of length M characters or more (M is a predetermined integer of 2 or more) and retrieving in the document the start position and end position that match it (hereinafter, a partial character string of length M characters or more determined by this start position and end position is called an effective consistent character string);
(d) a device that, starting from the beginning of the search character string and using device (c), if an effective consistent character string is detected, shifts the start position of the partial character string in the search character string and the retrieval start position in the document each by the length of the effective consistent character string just detected and searches for an effective consistent character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the effective consistent character string just detected, and, if no effective consistent character string is detected, shifts the start position of the partial character string in the search character string by one character, takes again a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and searches for the effective consistent character string;
(e) a device that continues the above step (d) as long as an effective consistent character string is detected;
(f) a device that calculates, for at least the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, the similarity between that character string and the search character string, according to information on the effective consistent character strings present in it;
(g) a device that, if the calculated similarity is greater than the similarity input in the above step (b), displays the content of the effective consistent character strings contained in the document.
23. The information retrieval system according to claim 22, characterized in that M is 2 and L is 3.
24. The information retrieval system according to claim 22, characterized in that the similarity is calculated as the smaller of two ratios: the ratio of the search character string occupied by the effective consistent character strings, and the ratio occupied by the effective consistent character strings in the character string of the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string.
25. The information retrieval system according to claim 22, characterized in that, for each character of the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, points are added when the character belongs to an effective consistent character string and deducted when it does not, and the similarity is calculated as the resulting score divided by the score of a complete match.
26. The information retrieval system according to claim 25, characterized in that the points added are 1 point per character and the points deducted are 1/L point per character.
27. An information retrieval system for detecting, by computer processing, the position at which a search character string appears in a database storing a plurality of documents in retrievable form, the system having the following devices:
(a) a device for inputting a search character string;
(b) a device for inputting a similarity;
(c) a device for taking, from the beginning of the search character string, a partial character string of length M characters or more (M is a predetermined integer of 2 or more) and retrieving in the document the start position and end position that match it (hereinafter, a partial character string of length M characters or more determined by this start position and end position is called an effective consistent character string);
(d) a device that, starting from the beginning of the search character string and using device (c), if an effective consistent character string is detected, shifts the start position of the partial character string in the search character string and the retrieval start position in the same document each by the length of the effective consistent character string just detected and searches for an effective consistent character string whose start position is within L characters (L is a predetermined integer of 1 or more) of the effective consistent character string just detected, and, if no effective consistent character string is detected, shifts the start position of the partial character string in the search character string by one character, takes again a partial character string of length M characters or more (M is a predetermined integer of 2 or more), and searches for the effective consistent character string;
(e) a device that continues the above step (d) as long as an effective consistent character string is detected;
(f) a device that calculates, for at least the character string in the same document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, the similarity between that character string and the search character string, according to information on the effective consistent character strings present in it;
(g) a device that, if the calculated similarity is greater than the similarity input in the above step (b), displays the content of the effective consistent character strings contained in the document.
28. The information retrieval system according to claim 27, characterized in that M is 2 and L is 3.
29. The information retrieval system according to claim 27, characterized in that the similarity is calculated as the smaller of two ratios: the ratio of the search character string occupied by the effective consistent character strings, and the ratio occupied by the effective consistent character strings in the character string of the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string.
30. The information retrieval system according to claim 27, characterized in that, for each character of the character string in the document from the start position of the first effective consistent character string to the end position of the last effective consistent character string, points are added when the character belongs to an effective consistent character string and deducted when it does not, and the similarity is calculated as the resulting score divided by the score of a complete match.
31. The information retrieval system according to claim 30, characterized in that the points added are 1 point per character and the points deducted are 1/L point per character.
32. A method of preparing an index file that enables, by computer processing, high-speed retrieval of a character string pattern of length N characters in a document stored in retrievable form, the method comprising the following steps:
(a) a step of scanning the document sequentially and writing into a memory area each character string pattern of N characters that appears consecutively in the document, together with information on the position at which that pattern of N characters appears in the document;
(b) a step of, when the whole document has been scanned in step (a), sorting the information written in the memory area by character string pattern, and attaching to each distinct character string pattern the position information corresponding to it;
(c) a step of creating and outputting a file keyed by character string pattern, so that the position information attached to a character string pattern can be retrieved.
33. The index file preparation method according to claim 32, characterized in that the retrieval in step (c) is a binary search.
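The index preparation of claims 32 and 33 can be sketched in memory as follows. This is a minimal sketch under stated assumptions: the names `build_index` and `lookup` are hypothetical, positions are 0-based, and the index is kept as two parallel Python lists rather than written to a file as the claim describes. The lookup uses the standard-library `bisect_left` for the binary search.

```python
from bisect import bisect_left


def build_index(doc, n=2):
    """Scan `doc`, record every n-character pattern with its position,
    sort by pattern, and group positions under each distinct pattern."""
    entries = sorted((doc[i:i + n], i) for i in range(len(doc) - n + 1))
    keys, postings = [], []
    for gram, pos in entries:
        if keys and keys[-1] == gram:
            postings[-1].append(pos)   # same pattern: add another position
        else:
            keys.append(gram)          # new distinct pattern
            postings.append([pos])
    return keys, postings


def lookup(index, gram):
    """Binary-search the sorted keys and return the pattern's positions."""
    keys, postings = index
    i = bisect_left(keys, gram)
    if i < len(keys) and keys[i] == gram:
        return postings[i]
    return []


index = build_index("abcab")
print(lookup(index, "ab"))
```

Because the keys are sorted, each lookup costs a logarithmic number of comparisons in the number of distinct patterns, which is what makes the retrieval of M-character partial strings fast.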
34. The information retrieval method according to claim 1 or claim 5, characterized in that the retrieval in the document of a character string pattern of M characters or more uses an index file prepared according to the method of claim 33 and is carried out using the position information detected through this index file.
35. A method of creating an index file which, by computer processing, allows character patterns of length N characters to be retrieved at high speed from a database storing a plurality of documents in a searchable form, the method comprising the following steps:
(a) a step of attaching to each of the plurality of documents a symbol or number that identifies it uniquely;
(b) a step of sequentially scanning the database and writing, into the memory area corresponding to the attached document identification symbol or number, each character pattern of N consecutive characters occurring in each document of the database together with information on the position at which that pattern occurs in the document;
(c) a step of, in response to completion of the scan of all documents of the database in step (b), sorting the information written in the memory area by character pattern, and attaching to each distinct character pattern the corresponding document identification symbol or number and the position information within that document;
(d) a step of creating and outputting a file keyed by character pattern, so that the document identification symbol or number and the position information attached to a given character pattern can be retrieved.
36. The index file creation method according to claim 35, characterized in that the retrieval in step (d) is a binary search.
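For the multi-document case of claims 35 and 36, each posting carries a document identifier as well as a position. The sketch below is hedged in the same way as any illustration of a claim: `build_db_index` and `find` are hypothetical names, documents are numbered by their order in the input list, and positions are 0-based.

```python
from bisect import bisect_left

def build_db_index(documents, n=2):
    """Build a sorted N-gram index whose postings are (doc_id, position)."""
    records = []
    for doc_id, text in enumerate(documents):     # step (a): number each document
        for pos in range(len(text) - n + 1):      # step (b): scan every N-gram
            records.append((text[pos:pos + n], doc_id, pos))
    records.sort()                                # step (c): sort by pattern
    keys, postings = [], []
    for pattern, doc_id, pos in records:
        if not keys or keys[-1] != pattern:
            keys.append(pattern)
            postings.append([])
        postings[-1].append((doc_id, pos))
    return keys, postings                         # step (d): keyed by pattern

def find(index, pattern):
    """Binary search for a pattern; returns its (doc_id, position) list."""
    keys, postings = index
    k = bisect_left(keys, pattern)
    return postings[k] if k < len(keys) and keys[k] == pattern else []

db_index = build_db_index(["abcab", "xab"])
print(find(db_index, "ab"))  # [(0, 0), (0, 3), (1, 1)]
```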
37. The information retrieval method according to claim 10 or claim 14, characterized in that the retrieval, in the document, of character patterns of M or more characters is carried out using an index file created according to the method of claim 36 and the position information obtained from that index file.
38. The index file creation method according to claim 37, characterized in that M and N are 2 and L is 3 or more.
39. A method of creating an index file which, by computer processing, allows character patterns of length N characters to be retrieved at high speed from a document stored in a searchable form, the method comprising the following steps:
(a) a step of sequentially scanning the document, writing into the memory area each predesignated delimiter occurring in the document together with information on the position at which that delimiter occurs in the document, and at the same time writing into the memory area each character pattern of N consecutive characters occurring in the document together with information on the position at which that pattern occurs in the document;
(b) a step of, in response to completion of the scan of the whole document in step (a), sorting the information written in the memory area by character pattern, and attaching to each distinct character pattern the position information corresponding to it;
(c) a step of creating and outputting a file keyed by character pattern, so that the position information attached to a given character pattern can be retrieved.
40. The information retrieval method according to claim 1 or claim 5, characterized in that the retrieval, in the document, of character patterns of M or more characters is carried out using an index file created according to the method of claim 39 and the position information obtained from that index file.
41. The information retrieval method according to claim 40, characterized by a step of retrieving the delimiters in the document and attaching the detected delimiter positions, at the corresponding places, to the position information detected when character patterns of M or more characters were retrieved in the document.
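The combined scan of claim 39, which records delimiter positions alongside the N-character patterns, might look like the sketch below. Several points are assumptions not found in the claims: the names `scan_with_delimiters` and `same_segment`, the choice of `.` and `,` as delimiters, the decision to keep N-grams that contain a delimiter, and the idea of using the delimiter positions to test whether two hits fall in the same segment.

```python
from bisect import bisect_left

def scan_with_delimiters(document: str, n: int = 2, delimiters: str = ".,"):
    """Single pass: collect delimiter positions and sorted (pattern, position) records."""
    delimiter_positions = []
    records = []
    for pos, ch in enumerate(document):
        if ch in delimiters:
            delimiter_positions.append(pos)   # already in ascending order
        pattern = document[pos:pos + n]
        if len(pattern) == n:                 # skip the short tail at the end
            records.append((pattern, pos))
    records.sort()
    return delimiter_positions, records

def same_segment(delimiter_positions, p: int, q: int) -> bool:
    """Hypothetical use: True if no delimiter lies at or after min(p, q) and before max(p, q)."""
    lo, hi = sorted((p, q))
    return bisect_left(delimiter_positions, lo) == bisect_left(delimiter_positions, hi)

delims, records = scan_with_delimiters("ab.cd")
print(delims)                      # [2]
print(same_segment(delims, 0, 1))  # True
print(same_segment(delims, 1, 3))  # False
```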
42. A system for creating an index file which, by computer processing, allows character patterns of length N characters to be retrieved at high speed from a document stored in a searchable form, the system comprising the following means:
(a) means for sequentially scanning the document and writing into the memory area each character pattern of N consecutive characters occurring in the document together with information on the position at which that pattern occurs in the document;
(b) means for, in response to completion of the scan of the whole document by means (a), sorting the information written in the memory area by character pattern, and attaching to each distinct character pattern the position information corresponding to it;
(c) means for creating and outputting a file keyed by character pattern, so that the position information attached to a given character pattern can be retrieved.
43. A system for creating an index file which, by computer processing, allows character patterns of length N characters to be retrieved at high speed from a database storing a plurality of documents in a searchable form, the system comprising the following means:
(a) means for attaching to each of the plurality of documents a symbol or number that identifies it uniquely;
(b) means for sequentially scanning the database and writing, into the memory area corresponding to the attached document identification symbol or number, each character pattern of N consecutive characters occurring in each document of the database together with information on the position at which that pattern occurs in the document;
(c) means for, in response to completion of the scan of all documents of the database by means (b), sorting the information written in the memory area by character pattern, and attaching to each distinct character pattern the corresponding document identification symbol or number and the position information within that document;
(d) means for creating and outputting a file keyed by character pattern, so that the document identification symbol or number and the position information attached to a given character pattern can be retrieved.
44. The index file creation system according to claim 43, characterized in that N is 2.
45. A system for creating an index file which, by computer processing, allows character patterns of length N characters to be retrieved at high speed from a document stored in a searchable form, the system comprising the following means:
(a) means for sequentially scanning the document, writing into the memory area each predesignated delimiter occurring in the document together with information on the position at which that delimiter occurs in the document, and at the same time writing into the memory area each character pattern of N consecutive characters occurring in the document together with information on the position at which that pattern occurs in the document;
(b) means for, in response to completion of the scan of the whole document by means (a), sorting the information written in the memory area by character pattern, and attaching to each distinct character pattern the position information corresponding to it;
(c) means for creating and outputting a file keyed by character pattern, so that the position information attached to a given character pattern can be retrieved.
46. A system for creating an index file which, by computer processing, allows character patterns of length N characters to be retrieved at high speed from a database storing a plurality of documents in a searchable form, the system comprising the following means:
(a) means for attaching to each of the plurality of documents a symbol or number that identifies it uniquely;
(b) means for sequentially scanning the database, writing, into the memory area corresponding to the attached document identification symbol or number, each predesignated delimiter occurring in each document of the database together with information on the position at which that delimiter occurs in the document, and at the same time writing, into the memory area corresponding to the attached document identification symbol or number, each character pattern of N consecutive characters occurring in each document of the database together with information on the position at which that pattern occurs in the document;
(c) means for, in response to completion of the scan of all documents of the database by means (b), sorting the information written in the memory area by character pattern, and attaching to each distinct character pattern the corresponding document identification symbol or number and the position information within that document;
(d) means for creating and outputting a file keyed by character pattern, so that the document identification symbol or number and the position information attached to a given character pattern can be retrieved.
47. An information retrieval method for detecting, by computer processing, the positions at which a search string occurs in a database storing a plurality of documents in a searchable form, the method comprising the following steps:
(a) a step of inputting a search string;
(b) a step of inputting a similarity;
(c) a step of taking, starting from the beginning of the search string, partial character strings of length M characters or more (M being a predetermined integer of 2 or more) and retrieving, within the same document of the database, the starting position and end position at which each of them matches; hereinafter, a partial character string of M or more characters determined by such a starting position and end position is called an effective consistent character string;
(d) a step of retrieving, where
the starting position in the document of the i-th effective consistent character string is s(D, i),
the end position in the document of the i-th effective consistent character string is e(D, i),
the starting position in the search string of the i-th effective consistent character string is s(C, i), and
the end position in the search string of the i-th effective consistent character string is e(C, i),
an (i+1)-th effective consistent character string that satisfies the following two conditions:
e(D, i) + 1 ≤ s(D, i+1) ≤ e(D, i) + L + 1
and
s(C, i+1) > e(C, i) − (M − 1)
where L is a predetermined integer of 1 or more;
(e) a step of repeating step (d) as long as such an effective consistent character string is detected;
(f) a step of calculating, at least for the character string running from the starting position of the first effective consistent character string in the same document to the end position of the last effective consistent character string in that document, the similarity between that character string and the search string, based on the information on the effective consistent character strings it contains;
(g) a step of displaying the content of the effective consistent character strings contained in the document if the similarity obtained by the calculation is greater than the similarity input in step (b).
48. The information retrieval method according to claim 47, characterized in that M is 2 and L is 3 or more.
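The two chaining conditions of step (d) in claim 47 translate directly into a predicate. The sketch below is illustrative only: the `Match` tuple and the function `can_follow_47` are hypothetical names, positions are assumed to be inclusive integer offsets, and the defaults M = 2 and L = 3 are one choice permitted by claim 48.

```python
from typing import NamedTuple

class Match(NamedTuple):
    """One effective consistent character string (inclusive positions)."""
    s_d: int  # start in the document, s(D, i)
    e_d: int  # end in the document,   e(D, i)
    s_c: int  # start in the search string, s(C, i)
    e_c: int  # end in the search string,   e(C, i)

def can_follow_47(prev: Match, nxt: Match, M: int = 2, L: int = 3) -> bool:
    """True if nxt may be chained after prev under the conditions of claim 47."""
    # Document side: nxt starts after prev ends, skipping at most L characters.
    doc_ok = prev.e_d + 1 <= nxt.s_d <= prev.e_d + L + 1
    # Search-string side: nxt may overlap prev by at most M - 1 characters.
    query_ok = nxt.s_c > prev.e_c - (M - 1)
    return doc_ok and query_ok

first = Match(s_d=10, e_d=12, s_c=0, e_c=2)
print(can_follow_47(first, Match(14, 16, 3, 5)))  # True: one character skipped
print(can_follow_47(first, Match(17, 19, 3, 5)))  # False: four characters skipped, L = 3
```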
49. The information retrieval method according to claim 47, characterized in that the similarity calculation takes the smaller of two values: the proportion of the search string occupied by effective consistent character strings, and the proportion occupied by effective consistent character strings of the character string running from the starting position of the first effective consistent character string in the document to the end position of the last effective consistent character string in the document.
50. The information retrieval method according to claim 47, characterized in that the similarity calculation takes the smaller of two values: the proportion of the search string occupied by effective consistent character strings, and the proportion occupied by effective consistent character strings between the starting position of the first effective consistent character string in the document and the end position of the last effective consistent character string in the document.
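A "smaller of two proportions" similarity of this kind can be sketched as below. The claims do not say how the proportions are counted, so the sketch assumes they are character counts; `covered` and `similarity_min_ratio` are hypothetical names, and each match is an assumed tuple (s_D, e_D, s_C, e_C) of inclusive positions.

```python
def covered(intervals):
    """Number of distinct positions covered by inclusive (start, end) intervals."""
    positions = set()
    for start, end in intervals:
        positions.update(range(start, end + 1))
    return len(positions)

def similarity_min_ratio(matches, query_length: int) -> float:
    """Smaller of: matched share of the search string, matched share of the document span."""
    doc_spans = [(m[0], m[1]) for m in matches]
    query_spans = [(m[2], m[3]) for m in matches]
    # Document span: from the start of the first match to the end of the last.
    span_length = max(e for _, e in doc_spans) - min(s for s, _ in doc_spans) + 1
    query_ratio = covered(query_spans) / query_length
    doc_ratio = covered(doc_spans) / span_length
    return min(query_ratio, doc_ratio)

# Two matches covering all 6 search characters, with one stray document character between them.
matches = [(10, 12, 0, 2), (14, 16, 3, 5)]
print(round(similarity_min_ratio(matches, query_length=6), 3))  # 0.857
```

In this example the search string is fully covered (6/6), but only 6 of the 7 characters in the document span belong to a match, so the smaller value 6/7 is returned.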
51. An information retrieval method for detecting, by computer processing, the places at which a search string occurs in a database storing a plurality of documents in a searchable form, the method comprising the following steps:
(a) a step of inputting a search string;
(b) a step of inputting a similarity;
(c) a step of taking, starting from the beginning of the search string, partial character strings of length M characters or more (M being a predetermined integer of 2 or more) and detecting, within the same document of the database, the starting position and end position at which each of them matches; hereinafter, a partial character string of M or more characters determined by such a starting position and end position is called an effective consistent character string;
(d) a step of retrieving, where
the starting position in the document of the i-th effective consistent character string is s(D, i),
the end position in the document of the i-th effective consistent character string is e(D, i),
the starting position in the search string of the i-th effective consistent character string is s(C, i), and
the end position in the search string of the i-th effective consistent character string is e(C, i),
an (i+1)-th effective consistent character string that satisfies the following conditions:
s(C, i+1) > e(C, i) − (M − 1)
s(D, i+1) > e(D, i)
and
s(D, i+1) − e(D, i) − 1 + max(e(C, i) − s(C, i+1) + 1, 0) ≤ L
where L is a predetermined integer of 1 or more;
(e) a step of repeating step (d) as long as such an effective consistent character string is detected;
(f) a step of calculating, at least for the character string running from the starting position of the first effective consistent character string in the same document to the end position of the last effective consistent character string in that document, the similarity between that character string and the search string, based on the information on the effective consistent character strings it contains;
(g) a step of displaying the content of the effective consistent character strings contained in the document if the similarity obtained by the calculation is greater than the similarity input in step (b).
52. The information retrieval method according to claim 51, characterized in that M is 2 and L is 3 or more.
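The three conditions of step (d) in claim 51 differ from claim 47 in that the document gap and the overlap in the search string share a single budget L. A minimal sketch, assuming inclusive integer positions and matches given as (s_D, e_D, s_C, e_C) tuples; the name `can_follow_51` is hypothetical.

```python
def can_follow_51(prev, nxt, M: int = 2, L: int = 3) -> bool:
    """True if nxt may be chained after prev under the conditions of claim 51."""
    _, e_d, _, e_c = prev
    s_d_next, _, s_c_next, _ = nxt
    # Characters of the search string shared by prev and nxt (0 if none).
    overlap = max(e_c - s_c_next + 1, 0)
    # Document characters skipped between prev and nxt.
    gap = s_d_next - e_d - 1
    return (
        s_c_next > e_c - (M - 1)   # overlap in the search string is at most M - 1
        and s_d_next > e_d         # nxt starts after prev in the document
        and gap + overlap <= L     # gap and overlap together stay within L
    )

prev = (10, 12, 0, 2)
print(can_follow_51(prev, (14, 16, 2, 4)))  # True:  gap 1 + overlap 1 <= 3
print(can_follow_51(prev, (16, 18, 2, 4)))  # False: gap 3 + overlap 1 > 3
```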
53. The information retrieval method according to claim 51, characterized in that the similarity calculation takes the smaller of two values: the proportion of the search string occupied by effective consistent character strings, and the proportion occupied by effective consistent character strings of the character string running from the starting position of the first effective consistent character string in the document to the end position of the last effective consistent character string in the document.
54. The information retrieval method according to claim 51, characterized in that, for each character of the character string running from the starting position of the first effective consistent character string in the document to the end position of the last effective consistent character string in the document, points are added when the character belongs to an effective consistent character string and deducted when it does not, and the similarity is the value obtained by dividing the resulting score by the score of a perfect match.
55. The information retrieval method according to claim 51, characterized in that the added points are 1 point per character and the deducted points are 1/L point per character.
56. For a character that belongs to the i-th effective consistent character string and whose corresponding search string character also belongs to the (i−1)-th effective consistent character string, 1/(2L) points are deducted.
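The point-based similarity of claims 54 to 56 can be sketched as follows. Three things are assumptions: the perfect-match score is taken to be the length of the search string (1 point per character), the 1/(2L) deduction is applied on top of the 1 point the character earns, and matches are (s_D, e_D, s_C, e_C) tuples of inclusive positions in chain order. The name `similarity_score` is hypothetical.

```python
def similarity_score(matches, query_length: int, L: int = 3) -> float:
    """Score the document span character by character, then normalise."""
    span_start = matches[0][0]
    span_end = matches[-1][1]
    # Document positions that belong to some effective consistent character string.
    in_match = set()
    for s_d, e_d, _, _ in matches:
        in_match.update(range(s_d, e_d + 1))
    score = 0.0
    for pos in range(span_start, span_end + 1):
        # +1 for a matched character, -1/L for an unmatched one (claim 55).
        score += 1.0 if pos in in_match else -1.0 / L
    # -1/(2L) per character whose search-string character was already used
    # by the preceding match (claim 56).
    for prev, cur in zip(matches, matches[1:]):
        overlap = max(prev[3] - cur[2] + 1, 0)
        score -= overlap / (2 * L)
    # Assumed perfect-match score: one point per search-string character.
    return score / query_length

matches = [(10, 12, 0, 2), (14, 16, 3, 5)]
print(round(similarity_score(matches, query_length=6, L=3), 3))  # 0.944
```

Here six matched characters earn 6 points, the single unmatched character in between costs 1/3 point, and dividing by the assumed perfect score of 6 gives roughly 0.944.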
CN 95118142 1994-11-22 1995-11-01 Information searching method and system Pending CN1151558A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP6287642A JP2669601B2 (en) 1994-11-22 1994-11-22 Information retrieval method and system
JP287642/94 1994-11-22

Publications (1)

Publication Number Publication Date
CN1151558A true CN1151558A (en) 1997-06-11

Family

ID=17719873

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 95118142 Pending CN1151558A (en) 1994-11-22 1995-11-01 Information searching method and system

Country Status (3)

Country Link
JP (1) JP2669601B2 (en)
KR (1) KR960018993A (en)
CN (1) CN1151558A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100357946C (en) * 2004-06-09 2007-12-26 金宝电子(上海)有限公司 Electronic device and method for fast comparision searching word string
WO2008098495A1 (en) * 2007-02-14 2008-08-21 Jie Bai Method and device for determing object file
CN100424676C (en) * 1999-07-01 2008-10-08 株式会社日立制作所 Geographical name presentation method, method and apparatus for geographical name string identification
US7831241B2 (en) 2004-12-22 2010-11-09 Research In Motion Limited Entering contacts in a communication message on a mobile device
CN101517363B (en) * 2006-08-18 2012-09-26 谷歌公司 Providing routing information based on ambiguous locations
CN101939743B (en) * 2007-12-24 2013-10-16 高通股份有限公司 Apparatus and methods for retrieving/downloading content on a communication device
CN103425629A (en) * 2012-05-24 2013-12-04 富士通株式会社 Generation apparatus, generation method, searching apparatus, and searching method
CN104090875A (en) * 2013-04-01 2014-10-08 鸿富锦精密工业(深圳)有限公司 Information retrieval system and information retrieval method

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3275816B2 (en) 1998-01-14 2002-04-22 日本電気株式会社 Symbol string search method, symbol string search device, and recording medium recording symbol string search program
JP4042580B2 (en) * 2003-01-28 2008-02-06 ヤマハ株式会社 Terminal device for speech synthesis using pronunciation description language
CN1645374A (en) * 2005-01-17 2005-07-27 徐文新 Digit marking character string searching technology
US7827179B2 (en) 2005-09-02 2010-11-02 Nec Corporation Data clustering system, data clustering method, and data clustering program
JP5900367B2 (en) * 2013-01-30 2016-04-06 カシオ計算機株式会社 SEARCH DEVICE, SEARCH METHOD, AND PROGRAM
CN105550175B (en) * 2014-10-28 2019-03-01 阿里巴巴集团控股有限公司 The recognition methods of malice account and device
CN108133016A (en) * 2017-12-22 2018-06-08 大连景竣科技有限公司 One kind does public document alignment system and method
CN112733524A (en) * 2020-12-31 2021-04-30 浙江省方大标准信息有限公司 Method, system and device for automatically correcting standard serial numbers and batch checking standard states

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP3099298B2 (en) * 1991-03-20 2000-10-16 株式会社日立製作所 Document search method and apparatus
JPH07109603B2 (en) * 1990-12-12 1995-11-22 株式会社テレマティーク国際研究所 Information retrieval processing method and retrieval file creation device

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100424676C (en) * 1999-07-01 2008-10-08 株式会社日立制作所 Geographical name presentation method, method and apparatus for geographical name string identification
CN100357946C (en) * 2004-06-09 2007-12-26 金宝电子(上海)有限公司 Electronic device and method for fast comparision searching word string
US7831241B2 (en) 2004-12-22 2010-11-09 Research In Motion Limited Entering contacts in a communication message on a mobile device
CN101116357B (en) * 2004-12-22 2012-12-12 捷讯研究有限公司 Entering contacts in a communication message on a mobile device
US8675845B2 (en) 2004-12-22 2014-03-18 Blackberry Limited Entering contacts in a communication message on a mobile device
CN101517363B (en) * 2006-08-18 2012-09-26 谷歌公司 Providing routing information based on ambiguous locations
WO2008098495A1 (en) * 2007-02-14 2008-08-21 Jie Bai Method and device for determing object file
CN101939743B (en) * 2007-12-24 2013-10-16 高通股份有限公司 Apparatus and methods for retrieving/downloading content on a communication device
CN103425629A (en) * 2012-05-24 2013-12-04 富士通株式会社 Generation apparatus, generation method, searching apparatus, and searching method
CN103425629B (en) * 2012-05-24 2017-05-03 富士通株式会社 Generation apparatus, generation method, searching apparatus, and searching method
CN104090875A (en) * 2013-04-01 2014-10-08 鸿富锦精密工业(深圳)有限公司 Information retrieval system and information retrieval method

Also Published As

Publication number Publication date
KR960018993A (en) 1996-06-17
JP2669601B2 (en) 1997-10-29
JPH08147320A (en) 1996-06-07

Legal Events

Date Code Title Description
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C06 Publication
PB01 Publication
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication