Detailed description of the invention
As those of ordinary skill in the art will recognize, the present invention may be embodied as a method, a data processing system, or a program product. Software written according to the invention may be stored on a computer-readable medium, such as a memory or a CD-ROM, or transmitted over a network, and executed by a processor. The main principles of the invention, however, are described below in terms of a network intelligent information processing method or a network intelligent information processing system.
Fig. 1 shows a system according to the invention. A user computer 101 is connected, via Internet connections 108 and 109, to a web server 102 and to Internet resource locator servers, such as servers 103 and 104 of http://www.3721.com. The user computer 101 may be any kind of computer, including a PC running the Microsoft Windows operating system, a Macintosh computer, or an Internet appliance such as WebTV or a wireless Internet browsing device. The user computer 101 may be connected to the Internet by a dial-up modem, a DSL line, a cable modem, a dedicated line such as T1 or T3, or an optical fiber connection. As those of ordinary skill in the art will appreciate, the invention is not limited to any particular type of user computer or to any particular form of connection between the user computer and the Internet. The Internet resource locator servers 103 and 104 include a browser pattern database 105, URL patterns 106, and other patterns 107.
Fig. 2 shows a user computer 203 connected, via an Internet connection 202, to an Internet resource locator server 201, such as a 3721 server or another server containing the server software of the invention. A browser, whose screen image is shown, is running on the user computer 203. A small piece of client software is also running on the user computer 203 (see the small icon at the bottom of the screen). The client software intercepts the text message (msg) entered in the address box of the browser. This message is either sent to the Internet resource locator server 201 for processing or processed locally by the client software.
Fig. 3 illustrates the processing performed by the client software of the invention. The client software uses the Win32 hook technique to inject itself into all running processes. A hook is a point in the Microsoft Windows message-handling mechanism at which an application can install a subroutine or a separate module to monitor the message traffic in the system and to process certain types of messages. A hook procedure can be global, monitoring messages in all threads of the system, or thread-specific, monitoring the messages of a single thread. Some hooks can be set only with system scope (for example, WH_SYSMSGFILTER), but most hooks can have either system scope or thread scope. Technical information about Win32 hooks can be found on the Microsoft website (http://www.microsoft.com).
All running processes are examined to determine whether each one is a target that needs to be intercepted. If a process is a target, information about the process is used to locate the edit control of the browser in which the user enters URLs. This information can be used to search the browser pattern database in order to determine the version of the browser running on the user computer. This database can be updated automatically.
Once the edit control is found, it is subclassed. A message to this edit control may be a selection from a combo box or drop-down list, or keyboard input. If it is keyboard input, it is checked to determine whether it is a URL address; this is done by searching a rule base of URL patterns. If it is a selection from a combo box or drop-down list, it is processed as shown in Fig. 3.
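By way of illustration only, the check of keyboard input against a rule base of URL patterns might be sketched as follows. The three regular expressions and the function name `looks_like_url` are assumptions made for this sketch; the specification does not disclose the actual contents of the rule base.

```python
import re

# Hypothetical rule base: each entry describes one shape a URL may take.
# These patterns are illustrative assumptions, not the patterns of the invention.
URL_PATTERNS = [
    re.compile(r"^[a-z][a-z0-9+.-]*://\S+$", re.IGNORECASE),          # scheme://rest
    re.compile(r"^www\.[^\s.]+(\.[^\s.]+)+(/\S*)?$", re.IGNORECASE),  # www.host.tld
    re.compile(r"^[a-z0-9-]+(\.[a-z0-9-]+)*\.(com|net|org|cn)(/\S*)?$",
               re.IGNORECASE),                                        # host.tld
]


def looks_like_url(text: str) -> bool:
    """Return True if the typed text matches any pattern in the rule base."""
    text = text.strip()
    return any(pattern.match(text) for pattern in URL_PATTERNS)
```

Input that matches no pattern, such as a Chinese word or a pinyin string, would then be handed on to keyword retrieval rather than treated as an address.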
Fig. 4 shows a screen image of a Chinese-language browser interacting with the client software of the invention. When the user enters the Chinese word for "computer" in the address box of the browser, a list of Chinese addresses related to that word is produced.
Today, however, websites can be searched not only by URL or by English keywords, but also in other natural languages, such as Chinese. This calls for a method or system that can perform such network information retrieval efficiently and accurately in that natural language.
It will be appreciated that retrieval is usually performed against a database that contains specially designed keys to facilitate various retrieval tasks. Internet retrieval of Chinese information is no exception. For the purposes of retrieval according to the invention, the Internet resource locator server should include at least a search index table keyed by Chinese characters, a search index table keyed by pinyin spelling, and a search index table keyed by pinyin initial-letter abbreviations (pinyin prefixes).
Usually, when a keyword query is entered, the entered key phrase is broken down into several meaningful words, which are matched against pre-built keys. The retrieval results for the individual words are then considered together to determine the final query result. For some natural languages, such as Chinese, however, the query may be entered as a string of Chinese characters. Each character may or may not have a definite meaning of its own, and a character combined with other characters can produce Chinese words with different meanings. A simple decomposition of the Chinese character string therefore cannot guarantee the accuracy of the query result. Accordingly, the invention decomposes the phrase or query entered by the user into all meaningful Chinese words that can possibly be formed from it.
For example, the first character can simply be combined with the second and/or third character that follows it to obtain a meaningful word; it can also form other meaningful words with each of the subsequent characters. In the invention, the first character can be combined with any character of the input, so that all possible meaningful words are formed and used in the query. Because the results are drawn from all meaningful words that can possibly be formed, the query result obtained is assured to be correct.
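The decomposition described above might be sketched as follows. This is a simplified illustration under stated assumptions: it considers only contiguous runs of characters, whereas the description allows a character to be combined with any other character of the input, and the function name `candidate_words` and the sample word list are hypothetical.

```python
def candidate_words(query: str, vocabulary: set[str]) -> list[str]:
    """List every contiguous piece of the query that is a known word.

    Simplifying assumption: only adjacent characters are combined.
    """
    found: list[str] = []
    for start in range(len(query)):
        for end in range(start + 1, len(query) + 1):
            piece = query[start:end]
            # Keep each meaningful word once, in order of first appearance.
            if piece in vocabulary and piece not in found:
                found.append(piece)
    return found


# Hypothetical word list used only for this example.
SAMPLE_VOCABULARY = {"中国", "中国人", "国人", "人民"}
words = candidate_words("中国人民", SAMPLE_VOCABULARY)
```

Overlapping words such as "中国人" and "人民" are all retained, so that no meaningful reading of the query is lost before retrieval.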
A query for a Chinese website may be entered as Chinese characters, as a URL, or as pinyin, the latter including full pinyin spelling, pinyin prefix abbreviations, homophone pinyin, and pinyin affected by a southern accent. Before turning to the details of the method and system of the invention for each of these inputs, a discussion of existing Chinese input techniques will help in understanding the invention.
The main encoding systems for Chinese are Big5 and GB (the national standard). Big5 is generally used for traditional Chinese characters, and GB for simplified Chinese characters. In the Big5 encoding currently used in Hong Kong and Taiwan, the binary code of the character "天" (tian, "sky") is 1101000110100100. The GB code of "天" is 1110110011001100. Note that both the Big5 code and the GB code of "天" begin with 1, whereas the ASCII code of the letter "A" begins with 0. This example illustrates a general rule: all Chinese codes begin with 1, and all ASCII codes begin with 0. A system can therefore detect whether a given byte in a file containing mixed Chinese and English text is English or Chinese.
Entering and processing Chinese text on a computer is a very difficult problem, as the sheer number of Chinese characters shows. In the Chinese character (hanzi) writing system, 3,000 to 6,000 characters are in common use; if less frequently used characters are included, there are more than 10,000. Beyond this difficulty, the standardization of Chinese text, the multiplicity of homophones, unclear word boundaries, and similar problems all hinder computers from processing Chinese text effectively. Although a great deal of research has been carried out over several decades and hundreds of different methods exist, Chinese input and processing remains a major obstacle to the use of computers in China, particularly for text processing.
At present, the computer systems available for entering and processing Chinese text fall into three categories. The first is based on decomposing Chinese characters into basic graphic elements. The decomposition of a character under each such method is not unique, so these methods are quite difficult to learn.
The second and third categories are based on pronunciation, such as pinyin spelling methods, and run into the "homophone problem" of Chinese processing. The second category is phonetic input (for example, "pinyin", used in mainland China, and "zhuyin", or BPMF, used in Taiwan), which is the method commonly used by everyone except professional typists. The Chinese character writing system is both a conceptual and a practical obstacle to this method.
There are only about 1,300 distinct phonetic syllables for the many thousands of characters, so a single syllable can correspond to many different characters. In Mandarin, for example, the pronunciation "yi" can correspond to more than 100 characters. This creates ambiguity when the entered phonetic syllables are translated into the corresponding Chinese characters.
To deal with this homophone problem, most phonetic input systems use a multiple-choice method; see, for example, German Patent No. 3,142,138 of J. Heinzl et al., dated May 5, 1938; U.S. Patent No. 5,047,932 of K. C. Hsieh, dated September 10, 1991; and Chinese Patent Application No. 1064957 of Tan Shanguang, dated March 8, 1991. After a phonetic syllable is typed, the computer displays all possible characters with that pronunciation. In some cases there is not enough room on the screen to display them all, and scrolling up and down is required. These phonetic methods based on single syllables are therefore very slow.
An improvement on the multiple-choice method, based on the probabilities of adjacent Chinese characters, is disclosed in UK Patent Application No. 2,248,328 of R. W. Sproat, dated April 1, 1992. The probabilistic approach can be further combined with grammatical rules; see, for example, K. T. Lua et al., Computer Processing of Chinese and Oriental Languages, Vol. 6, No. 1, p. 85 (1992). However, the conversion accuracy (phonetic to character) of these methods generally reaches only about 80%.
The third category combines phonetic character input with additional non-phonetic letters. The non-phonetic letters are appended to the phonetic letters to distinguish artificially between characters with the same pronunciation. Examples include pinyin with radical marks (British Patent No. 2,158,776 of C. C. Chen, dated November 20, 1985) and pinyin with stroke counts (Chinese Patent Application No. 1066518 of G. Xie, dated November 25, 1992). These methods require memorizing specially devised rules or counting strokes, which in practice reduces input speed.
Other Chinese character input methods also exist, for example the one disclosed in U.S. Patent No. 6,073,146. The '146 patent discloses a system that uses a keyboard with additional symbol keys (with corresponding ASCII codes), which allow the user to mark each entered syllable of phonetic text with a distinguishing symbol representing the tone of the syllable. In the method carried out by this system, the keystroke of a distinguishing symbol (or delimiter) determines that a syllable has been entered. Each entered syllable is then compared with a table of acceptable phonetic syllables and abbreviations. If the entered syllable is in the table, the correctly spelled syllable and its tone are stored in memory and displayed in the phonetic portion of the display. Subsequent syllables are processed in the same way until a delimiter is entered. Once a delimiter is encountered, morphological and syntactic processing and/or a statistical language model is used to analyze the character string (defined as the string between two delimiters) and to determine unambiguously the appropriate Chinese characters representing the words in that string. This unique Chinese translation is stored in memory and displayed in the Chinese character portion of the graphical interface.
In the invention, the search index data structures used for Internet keyword queries are shown in Figs. 5A, 5B, and 5C. The invention has three search index tables of similar structure. To achieve high-speed intelligent retrieval of Internet keywords, it is very important to build efficient data structures suited to searching large volumes of data. The three data structures of the invention are: (1) an index table for intelligent retrieval by ordinary Chinese words and English words or phrases; (2) an index table for intelligent retrieval by pinyin spelling; and (3) an index table for intelligent retrieval by pinyin abbreviation.
Referring to Fig. 5A, the index table is a Chinese and English word list containing all Chinese and English words, for example "China", "software", "computer", "ibm", and so on. In this table, each word is linked to a list of Internet keyword nodes. Each node in the list is a pointer to the physical storage location of an Internet keyword that contains the word. All Internet keywords containing a given Chinese or English word can therefore be retrieved from the Internet keyword entry point list linked to that word.
Referring to Fig. 5B, the data structure is similar to that of Fig. 5A, except that the Chinese words on the left are in phonetic form, that is, pinyin spelling. For example, the Chinese words mentioned above become "zhongguo", "ruanjian", "diannao", and so on. The linked Internet keyword entry point list is the list of Internet keywords whose pinyin form contains the word.
Fig. 5C has a data structure similar to that of Fig. 5A. The difference is that each word in the word list on the left is in the form of a pinyin initial-letter abbreviation, such as "zg", "rj", "dn", and so on. The associated Internet keyword entry point list thus contains the Internet keywords whose pinyin abbreviation corresponds to the word. As these three figures show, the three basic intelligent retrieval methods have similar data structures; the words are simply stored in different forms: Chinese and English words, pinyin spelling, or pinyin initial-letter abbreviations (pinyin prefixes). It will therefore be appreciated that the internal algorithm of the three kinds of retrieval is the same. The key points are how the words in the query are grouped or selected to form meaningful search terms, and how the type of the query is determined. As described above, the query string is decomposed into all meaningful words that can possibly be formed from it, so that each possible search term points to Internet keywords in the list, and the query is judged to be Chinese character or English input, pinyin spelling input, or pinyin prefix abbreviation input. The related methods of the invention are discussed below.
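The three index tables of Figs. 5A to 5C might be sketched as follows, with each word, pinyin spelling, or abbreviation mapped to the list of Internet keywords that contain it. The sample keywords, their segmentation into syllables, and the function name `build_indexes` are assumptions made for this sketch; ordinary in-memory dictionaries stand in for the pointer-based tables of the figures.

```python
from collections import defaultdict

# Hypothetical sample data: each Internet keyword is given with its
# component words and the pinyin syllables of each word.
SAMPLE_KEYWORDS = {
    "中国软件": [("中国", ["zhong", "guo"]), ("软件", ["ruan", "jian"])],
    "中国电脑": [("中国", ["zhong", "guo"]), ("电脑", ["dian", "nao"])],
}


def build_indexes(keywords):
    """Build the word, pinyin-spelling and pinyin-abbreviation tables."""
    word_index = defaultdict(list)    # Fig. 5A: word -> keywords
    pinyin_index = defaultdict(list)  # Fig. 5B: pinyin spelling -> keywords
    abbrev_index = defaultdict(list)  # Fig. 5C: initial letters -> keywords
    for keyword, words in keywords.items():
        for word, syllables in words:
            word_index[word].append(keyword)
            pinyin_index["".join(syllables)].append(keyword)
            abbrev_index["".join(s[0] for s in syllables)].append(keyword)
    return word_index, pinyin_index, abbrev_index


word_index, pinyin_index, abbrev_index = build_indexes(SAMPLE_KEYWORDS)
```

Because the three tables share one shape and differ only in the form of the key, a single lookup routine can serve all three retrieval modes.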
Although simpler methods have been developed, entering Chinese characters remains a very difficult task, particularly when the Internet device is a handheld device, such as a personal digital assistant or a mobile phone with a wireless Internet connection. One aspect of the invention provides a method that simplifies Chinese character input. The invention is particularly suited to entering web addresses, natural-language keywords, or website (web page) names. Fig. 6 shows one embodiment of the invention. In this method, the user types the prefix of the pinyin spelling of a Chinese word, as shown at 501. The pinyin prefix is used to query a database, and a list of possible URLs is returned as the result, as shown at 502. The list can be ordered on the basis of statistical information, for example by listing the most frequently queried URLs first, as shown at 503.
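The prefix query of Fig. 6 might be sketched as follows. The table contents, the placeholder URLs, the query counts, and the function name `urls_for_prefix` are all hypothetical; the sketch assumes that a query count is stored with each URL and used as the statistical ordering criterion.

```python
# Hypothetical table: pinyin prefix -> list of (URL, number of past queries).
SAMPLE_PREFIX_TABLE = {
    "zg": [
        ("http://example.org/b", 45),
        ("http://example.com/a", 120),
        ("http://example.net/c", 80),
    ],
}


def urls_for_prefix(prefix, table, limit=10):
    """Return the URLs stored for a pinyin prefix, most queried first."""
    entries = table.get(prefix.lower(), [])
    # Order by query count, highest first (the ordering step shown at 503).
    ranked = sorted(entries, key=lambda entry: entry[1], reverse=True)
    return [url for url, _count in ranked[:limit]]


top_urls = urls_for_prefix("ZG", SAMPLE_PREFIX_TABLE, limit=2)
```

Limiting the list length suits the small displays of the handheld devices mentioned above.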
Fig. 7 shows another embodiment of the invention. At 601, the pinyin spelling of a Chinese word is entered. At 602, the spelling is checked to determine whether it is a common misspelling. A common cause of misspelling is accent: in southern China, many speakers make pinyin errors because of their southern accent. If a misspelling caused by a southern accent is found, the system of the invention can correct it automatically at 605. If the query string contains no misspelling, or once the misspelling has been corrected, the relevant URL database is searched at 603. At 604, the output is displayed.
A small piece of client software, supported by a back-end intelligent search engine and database, can serve as an example of an embodiment of the invention. This software can be downloaded from http://www.3721.com. The user does not need to know or type a long and complicated URL string; instead, the user simply types the Chinese characters of a familiar brand or product name in the address box and is taken to the desired target site or related web page. For example, the user can simply type "Legend computer" in Chinese to reach the desired website, without typing http://www.legend.com.cn.
Turning now to the main features of the invention, Fig. 8 shows the basic flowchart of Chinese and/or English retrieval according to the invention. At 801, a query string A in Chinese and/or English is entered. At 802, the system analyzes the query string A against the Chinese and English word list (CEWL) and decomposes it into one or more words: $W = \{W_1, W_2, W_3, \ldots, W_n\}$. At 803, for each word $W_x$ in $W$, the system looks up $W_x$ in the CEWL table to find its attached Internet keyword entry point list ($IKEPL_x$). Each node in $IKEPL_x$ points to an Internet keyword (IK) that contains the word $W_x$.
At 804, the system merges all of $IKEPL_1, IKEPL_2, \ldots, IKEPL_n$ to obtain the result $R$, that is, $R = IKEPL_1 \cup IKEPL_2 \cup \cdots \cup IKEPL_n$. Because each node in $IKEPL_x$ points to an IK that contains the word $W_x$, each IK in $R$ contains at least one word of $W$. At 805, during the merge, the system calculates a weight for each IK in $R$ according to specific rules, examples of which are as follows:
(1) Word-count weight: the number of words of $W$ contained in the IK.
(2) Length weight: the total length of the words of $W$ contained in the IK.
Finally, the system calculates a combined weight for each IK on the basis of the above rules. After the calculation, at 806, the system sorts the result $R$ by IK weight, so that the closest results appear at the head of the list, and the system may limit the number of results in $R$. Then, at 807, the final IK list $R$ is presented.
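Steps 803 to 807 might be sketched as follows, starting from words that have already been decomposed from the query. The description does not state how the word-count weight and the length weight are combined; this sketch assumes, purely for illustration, that the word count is compared first and the total length breaks ties. The function name `search` and the sample index are hypothetical.

```python
def search(words, index, limit=10):
    """Merge the keyword lists of all query words and rank the result."""
    # Steps 803-804: look up each word and take the union of the lists,
    # remembering which query words each Internet keyword (IK) contains.
    matched = {}
    for word in words:
        for ik in index.get(word, []):
            matched.setdefault(ik, set()).add(word)

    # Step 805: weight = (word-count weight, length weight).
    def weight(ik):
        contained = matched[ik]
        return (len(contained), sum(len(w) for w in contained))

    # Step 806: best matches first, with the list length limited.
    ranked = sorted(matched, key=weight, reverse=True)
    return ranked[:limit]


# Hypothetical index in the form of Fig. 5A.
SAMPLE_INDEX = {
    "中国": ["中国软件", "中国电脑"],
    "软件": ["中国软件", "软件下载"],
}
results = search(["中国", "软件"], SAMPLE_INDEX)
```

In this example the keyword containing both query words is ranked ahead of the keywords that contain only one of them.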
Similarly, referring to Fig. 9, at 901 the query string A is entered in the form of pinyin spelling. At 902, after the string A is entered, the system analyzes it against the Chinese pinyin spelling word list (FCPWL) and decomposes it into one or more pinyin words: $W = \{W_1, W_2, W_3, \ldots, W_n\}$. At 903, for each word $W_x$ in $W$, the system looks up $W_x$ in the FCPWL to find its attached Internet keyword entry point list $IKEPL_x$; each node in $IKEPL_x$ points to an Internet keyword (IK) whose pinyin contains $W_x$. Then, at 904, the system merges $IKEPL_1, IKEPL_2, \ldots, IKEPL_n$ to obtain the result $R = IKEPL_1 \cup IKEPL_2 \cup \cdots \cup IKEPL_n$. The pinyin of each IK in $R$ thus contains at least one word of $W$. The following steps 906-907 are essentially the same as steps 805-807: a weight is calculated for each IK in $R$ according to specific rules; the results in $R$ are sorted by IK weight, so that the closest results are placed at the head of the list; and the number of results in $R$ is limited, yielding the final IK result list $R$.
Similarly, referring to Fig. 10, at 11 the user enters a pinyin abbreviation string A. At 12, the system analyzes the string A against the Chinese pinyin abbreviation word list (ACPWL) and decomposes it into one or more pinyin abbreviations: $W = \{W_1, W_2, W_3, \ldots, W_n\}$. Then, at 13, for each word $W_x$ in $W$, the system looks up the word in the ACPWL to find its attached Internet keyword entry point list $IKEPL_x$; each node in $IKEPL_x$ points to an Internet keyword (IK) whose pinyin abbreviation contains the word $W_x$. Then, at 14, the system merges $IKEPL_1, IKEPL_2, \ldots, IKEPL_n$ to obtain the result $R = IKEPL_1 \cup IKEPL_2 \cup \cdots \cup IKEPL_n$, so that the pinyin abbreviation of each IK in $R$ contains at least one word of $W$. The following steps 15-17 are essentially the same as the corresponding steps of Fig. 8 and Fig. 9: a weight is calculated for each IK in $R$ according to specific rules; the results in $R$ are sorted by IK weight, so that the closest results are placed at the head of the list; and the number of results in $R$ is limited, yielding the final IK result list $R$.
On the basis of these three intelligent retrieval modes (Chinese and English words, pinyin spelling words, and pinyin abbreviations), the method and system of the invention for intelligent information processing in a wide area network determine whether an input query string consists of Chinese and English words, pinyin spelling words, or pinyin abbreviations, as shown in Fig. 11. After the string A is entered at 110, the system determines at 111 whether the query string A is in the form of pinyin spelling words. If it is, the system performs the pinyin spelling intelligent retrieval shown in Fig. 9.
If the string A is not in pinyin spelling form, the system determines at 112 whether the query string A is in the form of pinyin abbreviations. If it is, the system performs the pinyin abbreviation intelligent retrieval shown in Fig. 10. If it is not, the system concludes that the query string A consists of Chinese and English words and performs the calculation shown in Fig. 8. There is, however, one further case: at 113, the system determines whether the result of the pinyin spelling retrieval or of the pinyin abbreviation retrieval is empty. If the result is empty, the system performs the Chinese and English word retrieval of Fig. 8 instead. If the result of the retrieval of Fig. 9 or Fig. 10 is not empty, that result is taken as the final result.
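The decision flow of Fig. 11 might be sketched as follows. The two test functions and the three retrieval functions are passed in as parameters because the description does not specify how the form of a query is recognized; all of the names used here are hypothetical.

```python
def dispatch(query, is_pinyin, is_abbrev,
             search_pinyin, search_abbrev, search_words):
    """Choose a retrieval mode for the query, falling back on empty results."""
    if is_pinyin(query):            # decision at 111
        results = search_pinyin(query)
    elif is_abbrev(query):          # decision at 112
        results = search_abbrev(query)
    else:
        # Neither pinyin form: treat as Chinese and English words (Fig. 8).
        return search_words(query)
    # Decision at 113: an empty pinyin or abbreviation result triggers
    # the Chinese and English word retrieval instead.
    return results if results else search_words(query)


# Illustrative stand-ins for the real recognizers and retrieval routines.
outcome = dispatch(
    "ibm",
    is_pinyin=lambda q: False,
    is_abbrev=lambda q: q.isascii() and q.isalpha(),
    search_pinyin=lambda q: ["pinyin hit"],
    search_abbrev=lambda q: [],
    search_words=lambda q: ["word hit"],
)
```

Here "ibm" is first taken for an abbreviation, finds nothing, and is then retried as an English word, which is the case handled at 113.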
Fig. 12A shows the homophone pinyin spelling retrieval mode of the invention. After the query string A is entered at 121, the system analyzes it at 122 to obtain all possible homophone combinations as searchable spelling words. At 123, for each homophone spelling, the system performs the pinyin spelling retrieval shown in Fig. 9. After all retrieval results $R_N$ have been obtained, at 124 the system analyzes the results $R_N$ and obtains the final, most probable results, or limits the number of results.
Fig. 12B illustrates the pinyin spelling retrieval mode of the invention with correction of dialect-induced misspellings, which further extends the method and system of Fig. 7. After the spelling string A is entered at 125, the system of the invention analyzes the input at 126 against a table listing consonants or vowels that may be misspelled because of a southern accent, such as "huang" and "wang", "shi" and "si", "lu" and "l", and so on. In short, this table lists the spellings that are likely to be confused. The input query string is thus expanded into several pinyin words covering all possible spellings, and at 127 the pinyin spelling retrieval method is applied to them to obtain all possible IK results. Then, at 128, the retrieval results are analyzed to obtain the final, most probable results.
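The expansion performed at 126 might be sketched as follows. The sketch assumes that the input has already been divided into syllables and that the confusion table holds whole syllables; only the first two pairs named above are used, and the function name `spelling_variants` is hypothetical.

```python
from itertools import product

# Hypothetical confusion table: syllables within one group may be typed
# for one another because of a southern accent.
CONFUSABLE_GROUPS = [{"huang", "wang"}, {"shi", "si"}]


def spelling_variants(syllables):
    """Return every spelling obtained by swapping confusable syllables."""
    options = []
    for syllable in syllables:
        # Use the whole group if the syllable is confusable, else itself.
        group = next((g for g in CONFUSABLE_GROUPS if syllable in g),
                     {syllable})
        options.append(sorted(group))
    # One variant per combination of choices across all syllables.
    return ["".join(choice) for choice in product(*options)]


variants = spelling_variants(["huang", "shi"])
```

Each variant would then be submitted to the pinyin spelling retrieval of Fig. 9, and the combined results analyzed as described at 128.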
It is to be understood that the above description is illustrative and not restrictive. Many variations of the invention will be apparent to those of ordinary skill in the art upon reading the above description. The scope of the invention should therefore be determined not only with reference to the above description but also with reference to its variations and equivalents. Although the invention has been described with reference to specific embodiments, it will be understood that this is not intended to limit the invention to those embodiments. On the contrary, the invention is intended to cover the variations, modifications, and equivalents that may fall within its spirit and scope.