WO2007101391A1 - A discrete substring matching method for information searching and information inputting - Google Patents

A discrete substring matching method for information searching and information inputting

Info

Publication number
WO2007101391A1
WO2007101391A1 PCT/CN2007/000392 CN2007000392W WO2007101391A1 WO 2007101391 A1 WO2007101391 A1 WO 2007101391A1 CN 2007000392 W CN2007000392 W CN 2007000392W WO 2007101391 A1 WO2007101391 A1 WO 2007101391A1
Authority
WO
WIPO (PCT)
Prior art keywords
discrete
text
substring
character
pattern
Prior art date
Application number
PCT/CN2007/000392
Other languages
French (fr)
Chinese (zh)
Inventor
Guangyao Ding
Original Assignee
Guangyao Ding
Priority date
Filing date
Publication date
Priority claimed from CN 200610020427 (CN1811776A)
Priority claimed from CN 200610021280 (CN1869983A)
Application filed by Guangyao Ding
Publication of WO2007101391A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33: Querying
    • G06F16/3331: Query processing
    • G06F16/334: Query execution
    • G06F16/3344: Query execution using natural language analysis

Definitions

  • The invention relates to a discrete substring pattern matching method for information retrieval and information input.
  • In information retrieval, the input search term is used as a substring to perform a matching search on stored text such as a database or a web page. If the search term is a substring of a stored text, that text is output as a retrieved text; otherwise the text is discarded. If none of the stored texts matches the search term, no text is retrieved.
  • In information input, a character string entered on an input device such as a keyboard is used as a pattern substring to match text in a text library stored by an information processing device such as a computer; if the pattern substring matches a text, that text is selected and input.
  • The speed, recall rate, and precision of the substring pattern matching method are crucial for information retrieval and information input.
  • the position in S is i. That is, an existing substring must be composed of consecutive characters of the text string S; a character string composed of non-consecutive characters of the text string S is not a substring of S.
  • Substring pattern matching refers to determining whether a substring equal to the pattern P exists in the text S. In some application fields, in addition to this judgment, the matching degree and the position of occurrence must also be output.
  • The various pattern matching algorithms above all take the pattern P and search the text string S for a continuous substring matching P; the algorithms have been continuously improved with the aim of raising matching speed.
  • P matches S, that is, P is a substring of S.
  • As a concept of relevance, the substring does not reflect discrete features and is therefore not a complete concept of relevance.
  • Conceptually, the substring misses relevant text with discrete characteristics, which causes many difficulties in applications and increases the complexity of solving the problem.
  • The following example further illustrates the omission of discretely related text that is inherent in any information retrieval system based on the substring concept.
  • Example 4: In the spelling of the English word "procedure", it is easy to remember the initial letters of each syllable and the last letter, which form the string "prcde". This string does not satisfy the definition of a substring of "procedure"; however, the characters of "prcde" appear discretely in "procedure". If "prcde" is entered, existing substring matching cannot retrieve the English word.
  • Another type of pattern matching is inexact matching, which determines whether the pattern P is similar to the text S while allowing a limited number of errors under a similarity constraint, and returns the judgment result and the matched position. It is applied in information retrieval, information processing, and biology.
  • The main error factors that prevent an exact match include insertion errors, exchange errors, deletion errors, replacement errors, reversal errors, and so on. Because the error factors are so varied, inexact pattern matching methods consider several error factors together and, from different application angles and with various techniques, have produced solutions ranging from linear time complexity to non-deterministic polynomial time complexity (NP-complete problems). They attempt to solve the matching problem with a limited number of allowed errors, and their effectiveness is limited by how comprehensively errors are considered and by the number of errors allowed.
  • Consider how inexact matching methods deal with the problem in Example 4 above.
  • The inexact ED (Edit Distance) method treats the input as containing four deletion errors. Since the ED method also considers insertion errors and replacement errors, if four errors are allowed,
  • the combined handling of the three types of errors will match a large number of words from the English lexicon that satisfy the four-error constraint, making the matching result meaningless. Wildcard matching is one option for this type of problem, such as entering "pr*c*d*e", but the user must consider where to add wildcards and how many to add. For the general public, this solution is still difficult to operate.
  • Maximum matching can also solve the problem, but maximum matching amounts to considering insertion and deletion errors together, so the method itself is more complex, its time complexity is higher, and the number of candidate words in a thesaurus retrieval increases.
  • Hamming distance considers only replacement errors.
  • Similarity matching still considers the three kinds of error factors together and seeks the similarity between the pattern and the text.
  • The retrieval effect of the inexact methods above is limited by the number of allowed errors and by how comprehensively errors are considered. Therefore, existing inexact matching does not solve these typical discrete-correlation problems well.
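To make the edit-distance argument concrete, the following is a minimal Python sketch of the standard Levenshtein distance, in which insertions, deletions, and replacements each cost 1. The function name `edit_distance` and the sample word "price" are illustrative assumptions and do not come from the patent text.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic dynamic-programming Levenshtein distance (unit costs)."""
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # replace or keep
        prev = curr
    return prev[-1]


print(edit_distance("prcde", "procedure"))  # 4: the four omitted letters
print(edit_distance("prcde", "price"))      # 2: an unrelated word is closer
```

Under these assumptions, a threshold of four errors that admits "procedure" also admits unrelated words such as "price", which is the over-matching effect described above.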
  • The object of the present invention is to solve the above problems and to provide a discrete substring pattern matching method for information retrieval and information input that has a high recall rate, high precision, and easy positioning, and that makes information retrieval and information input simple, flexible, and fast.
  • Step a: Take the first character of the text S as the compared character, and take the first character of the pattern P as the comparison character. Step b: If the compared character or the comparison character is the end flag, go to step d.
  • Step c: If the compared character equals the comparison character, take the next character of the text S as the compared character and the next character of the pattern P as the comparison character, and go to step b; otherwise, take the next character of the text S as the compared character, leave the comparison character unchanged, and go to step b.
  • Step d: If the comparison character is the end flag, it is determined that the pattern P is a discrete substring of the text S, data representing the determination result "present" is output, and the matching ends; otherwise, no discrete substring equal to the pattern P exists in the text S, data representing the determination result "non-existent" is output, and the matching ends.
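Steps a to d can be sketched in Python as follows. This is a minimal illustration that assumes the "end flag" corresponds to running past the end of a Python string; the function name `is_discrete_substring` is hypothetical.

```python
def is_discrete_substring(pattern: str, text: str) -> bool:
    """Return True if the pattern's characters occur in the text in order,
    with gaps allowed between them."""
    i = 0  # index of the compared character in the text S (step a)
    j = 0  # index of the comparison character in the pattern P (step a)
    # Step b: stop as soon as either string is exhausted (the "end flag").
    while i < len(text) and j < len(pattern):
        # Step c: on a match advance both; otherwise advance only the text.
        if text[i] == pattern[j]:
            j += 1
        i += 1
    # Step d: the pattern is present only if it was consumed completely.
    return j == len(pattern)


print(is_discrete_substring("prcde", "procedure"))  # True
print("prcde" in "procedure")                       # False (ordinary substring)
```

The loop touches each character of the text at most once, so an irrelevant text is rejected after at most n comparisons.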
  • The discrete substring of the present invention is a character string consisting of any one or more characters of the text S. It extends the concept of the substring: the characters of the substring are not required to be consecutive characters in the text S.
  • The existing substring is only a special case of the discrete substring of the present invention. Whenever P is a discrete substring of the text S, the method reports that P matches S. Therefore, the recall rate of the present invention is significantly improved, and the omission of discretely related text in existing substring pattern matching is solved. It is also convenient to locate the discrete substrings with the further pattern matching methods.
  • In the method of the present invention, the text S is judged to be related to the pattern P only when all the characters of the pattern P appear in the text S completely and in order (in positional order), though possibly discretely; therefore, matching precision is guaranteed.
  • This decision method skips irrelevant text faster than the O(m+n) complexity of existing substring pattern matching methods (S. A. Cook's theorem). Therefore, the method of the present invention is quick and effective.
  • The search term may be composed of discrete characters taken in the order in which they appear in the text.
  • The choice of search terms is very simple and flexible. It can shorten the input code length and effectively avoid spelling errors or dialect errors.
  • This matching method is the basic discrete substring pattern matching method. When it is determined that a discrete substring exists, data representing the judgment result "present" is output; otherwise data representing the judgment result "non-existent" is output. It is suitable for information input judgment on short texts and large-capacity character string sets.
  • The above discrete substring pattern matching method for information retrieval and information input can be slightly modified to form a pattern matching method that outputs a simple matching degree. In this method, steps a, b, and c of the basic matching method above are unchanged, and step d is modified to:
  • When a discrete substring exists, this method outputs the simple matching degree $100 \times m \div n$ between the text S and the pattern P, where m is the length of the pattern P and n is the length of the text S. It is suitable for information input judgment on short texts and large-capacity string sets.
  • Using the simple matching degree, the retrieved texts can be output in descending order of matching degree, so that the user can first select the texts with a high matching degree.
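The simple matching degree and the descending ranking might look like this in Python. The sketch assumes that the degree is rounded to an integer and that "-1" signals "not present", as in the later variants; the names `simple_matching_degree` and `search` and the small lexicon are hypothetical.

```python
from typing import List, Tuple


def simple_matching_degree(pattern: str, text: str) -> int:
    """Return round(100 * m / n) if the pattern is a discrete substring of
    the text, else -1 (assumed "not present" value)."""
    j = 0  # number of pattern characters matched so far
    for ch in text:
        if j < len(pattern) and ch == pattern[j]:
            j += 1
    if j < len(pattern) or not text:
        return -1
    return round(100 * len(pattern) / len(text))


def search(pattern: str, lexicon: List[str]) -> List[Tuple[int, str]]:
    """Keep the matching words and list them by descending matching degree."""
    scored = [(simple_matching_degree(pattern, w), w) for w in lexicon]
    hits = [(d, w) for d, w in scored if d >= 0]
    return sorted(hits, key=lambda t: -t[0])


print(search("prcde", ["procedure", "proceed", "precede", "produce"]))
# [(71, 'precede'), (56, 'procedure')]
```

Shorter texts score higher for the same pattern, since the degree only compares the two lengths.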
  • Steps c and d are modified to:
  • Step c: If the compared character equals the comparison character, the position value of the compared character in the text S is stored in the position array pos[ ], at the same index as the position of the comparison character in the pattern P;
  • the next character of the text S is taken as the compared character, the next character of the pattern P is taken as the comparison character, and the method goes to step b; otherwise, the next character of the text S is taken as the compared character, the comparison character is unchanged, and the method goes to step b.
  • Step d: If the comparison character is not the end flag, it is determined that no discrete substring of the pattern P exists in the text S, the determination result "-1" is output, and the matching ends; otherwise, it is determined that the pattern P is a discrete substring of the text S, and the length n of the text S is obtained,
  • and the exact matching degree of the discrete substring, $\mathrm{Round}(100 \times (m - (g_m - g_1 - m + 1) \div n) \div n)$, is output, and the matching ends.
  • When a discrete substring exists, this method outputs the exact matching degree between the text S and the pattern P, $100 \times (m - (g_m - g_1 - m + 1) \div n) \div n$, where $g_1$ and $g_m$ are the first and last values in pos[ ].
  • The exact matching degree takes into account not only the lengths of the text S and the pattern P, but also the influence of the discrete number of the retrieved discrete substring on the matching degree. It is also suitable for information input judgment on short texts and large-capacity string sets.
  • Using the exact matching degree, the retrieved texts can be output in descending order, which makes it more convenient for the user to first select the texts with a high matching degree.
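A minimal Python sketch of the exact matching degree follows. It assumes 0-based positions in `pos` (the formula only uses the difference between the last and first position, so the indexing base does not change the result) and Python's built-in `round`; the function name is hypothetical.

```python
def exact_matching_degree(pattern: str, text: str) -> int:
    """Return Round(100 * (m - (g_m - g_1 - m + 1) / n) / n), or -1 when the
    pattern is not a discrete substring of the text."""
    pos = []  # pos[k] = position in the text of the k-th pattern character
    for i, ch in enumerate(text):
        if len(pos) < len(pattern) and ch == pattern[len(pos)]:
            pos.append(i)
    m, n = len(pattern), len(text)
    if m == 0 or len(pos) < m:
        return -1
    discrete_number = pos[-1] - pos[0] - m + 1  # characters skipped inside the span
    return round(100 * (m - discrete_number / n) / n)


print(exact_matching_degree("prcde", "procedure"))  # 51 (4 skipped characters)
print(exact_matching_degree("proc", "procedure"))   # 44 (contiguous, none skipped)
```

When the matched characters are contiguous the skipped-character term is zero and the value reduces to the simple matching degree.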
  • The discrete substring pattern matching method for information retrieval and information input that outputs the exact matching degree can be modified to form a pattern matching method that outputs the discrete number and positions of the discrete substring. In this method, relative to the above method for outputting the exact matching degree,
  • steps a, b, and c are unchanged, and step d is modified to:
  • Step d: If the comparison character is not the end flag, it is determined that no discrete substring of the pattern P exists in the text S, the determination result is output, and the matching ends; otherwise, the pattern P is the discrete substring $S_{g_1} S_{g_2} \ldots S_{g_m}$ of the text S;
  • the discrete number of the substring, $D = g_m - g_1 - m + 1$, and the position array pos[ ] are output, and the matching ends.
  • The discrete number of the discrete substring and the position of each of its characters in the text S are output. The discrete number reflects the degree of dispersion; the retrieved texts can be sorted in ascending order of discrete number, which, combined with the positioning information, makes the subsequent processing of information retrieval more accurate and effective.
  • This method is suitable for location retrieval in short texts, such as network information search and database information retrieval.
  • The time complexity of the decision method is independent of the pattern P. Only n comparisons are needed to skip an irrelevant text S, and in the worst case finding the first discrete substring also takes n character comparisons.
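The variant that outputs the discrete number and the positions can be sketched as below. This illustration assumes 0-based positions and returns `None` instead of a "not present" code; `locate` and the sample texts are hypothetical names and data.

```python
from typing import List, Optional, Tuple


def locate(pattern: str, text: str) -> Optional[Tuple[int, List[int]]]:
    """Return (D, pos) for the first discrete substring found, where
    D = g_m - g_1 - m + 1, or None if the pattern does not occur."""
    pos: List[int] = []
    for i, ch in enumerate(text):
        if len(pos) == len(pattern):
            break  # whole pattern matched, stop scanning
        if ch == pattern[len(pos)]:
            pos.append(i)
    if not pattern or len(pos) < len(pattern):
        return None
    return pos[-1] - pos[0] - len(pattern) + 1, pos


texts = ["procedure", "precede", "pre-coded essay"]
hits = [(locate("prcde", t), t) for t in texts]
# Ascending discrete number: the most compact matches come first.
ranked = sorted(((r[0], t, r[1]) for r, t in hits if r is not None),
                key=lambda item: item[0])
for d, t, pos in ranked:
    print(d, t, pos)
```

Each text is scanned once from left to right, so the cost per text is bounded by its length n regardless of the pattern.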
  • The above discrete substring pattern matching method for information retrieval and information input that outputs the discrete number and positions of the discrete substring can be modified to form a pattern matching method that outputs the discrete number and positions of a discrete substring based on a given discrete number $D_0$.
  • In this method, relative to the above method for outputting the discrete number and positions of the discrete substring,
  • steps a, b, and c are unchanged, and step d is modified to: Step d: If the comparison character is not the end flag, it is determined that no discrete substring of the pattern P exists in the text S,
  • the judgment result "-1" is output, and the matching ends; otherwise, the pattern P is the discrete substring $S_{g_1} S_{g_2} \ldots$
  • it is determined that the pattern P is the first discrete substring of the text S that satisfies the given discrete number $D_0$,
  • the discrete number D and the position array pos[ ] are output, and the matching ends.
  • The discrete substring pattern matching method based on a given discrete number can be applied to location search in both long and short texts, such as network information search and database information retrieval.
  • This method can adjust the discrete number $D_0$ to change the behaviour of discrete substring pattern matching, searching for discrete substrings and positions that satisfy the given discrete number. The smaller the discrete number, the more precise the positioning but the weaker the search, which may skip some relevant texts that contain discrete substrings; the larger the discrete number, the less precise the positioning but the stronger the search, which finds more texts containing matching discrete substrings.
  • The given discrete number can be set by the user. By changing it, a balance is sought among recall rate, precision, and positioning accuracy, thereby meeting users' different requirements for recall, precision, and positioning accuracy under different conditions.
  • With a discrete number $D_0 = 0$,
  • the retrieval function is equivalent to that of the substring, so compatibility with substring pattern matching is achieved.
  • It can be seen that the discrete number $D_0$ plays an important role in the discrete substring pattern matching method.
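The effect of the given discrete number can be illustrated with the Python sketch below. The text above does not survive in enough detail to show how matching resumes after a candidate exceeds $D_0$, so the restart rule used here (resume one character after the first matched position) is an assumption, as are the name `find_within` and the 0-based positions.

```python
from typing import List, Optional, Tuple


def find_within(pattern: str, text: str, d0: int) -> Optional[Tuple[int, List[int]]]:
    """Return (D, pos) for the first discrete substring with D <= d0, or None."""
    m = len(pattern)
    if m == 0:
        return None
    start = 0
    while start < len(text):
        pos: List[int] = []
        for i in range(start, len(text)):
            if text[i] == pattern[len(pos)]:
                pos.append(i)
                if len(pos) == m:
                    break
        if len(pos) < m:
            return None  # no further discrete substring exists
        d = pos[-1] - pos[0] - m + 1
        if d <= d0:
            return d, pos
        start = pos[0] + 1  # assumed restart rule, not taken from the source
    return None


print(find_within("ab", "a_b_ab", 1))         # (1, [0, 2])
print(find_within("ab", "a_b_ab", 0))         # (0, [4, 5])
print(find_within("prcde", "procedure", 0))   # None, same as a substring test
```

With `d0 = 0` only contiguous occurrences are accepted, which mirrors the statement that a discrete number of zero reproduces ordinary substring matching.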
  • The discrete substring pattern matching method for information retrieval and information input that outputs the exact matching degree can be slightly modified and extended to form a pattern matching method that outputs the discrete prime substring matching degree. In this method, relative to the above method for outputting the exact matching degree,
  • steps a, b, and c are unchanged, so a discrete substring is found first; the discrete prime substring is then found by the following steps d, e, and f, that is, it is searched for within the range of the discrete substring already found.
  • Step d: If the comparison character is not the end flag, go to step h; otherwise, move the position of the compared character in the text S back by 2 character positions and take the character at that position as the compared character, and move the position of the comparison character in the pattern P back by 2 character positions and take the character at that position as the comparison character.
  • Step e: If the first character of the pattern P has been compared, go to step g.
  • Step f: If the compared character equals the comparison character, the position value of the compared character in the text S is stored in the position array pos[ ], at the same index as the position of the comparison character in the pattern P;
  • the previous character of the text S is taken as the compared character, the previous character of the pattern P is taken as the comparison character, and the method goes to step e; otherwise, the previous character of the text S is taken as the compared character, the comparison character is unchanged, and the method goes to step e.
  • Step g: It is determined that the pattern P is a discrete prime substring of the text S; the length n of the text S and the length m of the pattern P are obtained, together with the positions in the text S of the first and last characters of the discrete prime substring: $g_1$, the first value in pos[ ],
  • and $g_m$, the last value in pos[ ];
  • the discrete prime substring matching degree $\mathrm{Round}(100 \times (m - (g_m - g_1 - m + 1) \div n) \div n)$ is output, and the matching ends. Step h: It is determined that no discrete prime substring of the pattern P exists in the text S, the determination result "-1" is output, and the matching ends.
  • The pattern matching method for outputting the discrete prime substring matching degree determines whether a discrete prime substring exists and outputs
  • the matching degree between the text S and the discrete prime substring of the pattern P, $100 \times (m - (g_m - g_1 - m + 1) \div n) \div n$.
  • The discrete prime substring reflects a better matching position than the discrete substring; therefore, the matching degree of the discrete prime substring better reflects the degree of matching.
  • This method is suitable for information input judgment on short texts and large-capacity string sets.
  • Using the matching degree, all the retrieved texts can be output in descending order, so that the user processes the texts with a high matching degree first, and retrieval efficiency is improved.
  • The method finds the first discrete prime substring with a time complexity of $O(n)$ and a number of character comparisons $f(n) \le n + (m + D_f) \le 2n - 1$, where $D_f$ is the discrete number of the first discrete prime substring found.
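The forward scan followed by the backward scan can be sketched in Python as follows. This is an illustration under stated assumptions: positions are 0-based, the backward pass restarts at the last matched character rather than two positions before the end flag (which yields the same positions), and the name `prime_match` and the returned tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple


def prime_match(pattern: str, text: str) -> Optional[Tuple[int, int, List[int]]]:
    """Return (matching degree, discrete number, positions) of the first
    discrete prime substring, or None if the pattern does not occur."""
    m, n = len(pattern), len(text)
    if m == 0:
        return None
    # Forward pass (steps a to c): find where the first discrete substring ends.
    i = j = 0
    while i < n and j < m:
        if text[i] == pattern[j]:
            j += 1
        i += 1
    if j < m:
        return None  # step h: not present
    # Backward pass (steps d to f): from the last matched character, match the
    # pattern right to left so each character takes its latest possible place.
    pos = [0] * m
    i, j = i - 1, m - 1
    while j >= 0:
        if text[i] == pattern[j]:
            pos[j] = i
            j -= 1
        i -= 1
    d = pos[-1] - pos[0] - m + 1
    return round(100 * (m - d / n) / n), d, pos  # step g


print(prime_match("abc", "axabxc"))  # (47, 1, [2, 3, 5])
```

In this example the forward pass alone would report positions 0, 3, 5 with a discrete number of 3; the backward pass tightens the start to position 2, which lowers the discrete number to 1.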
  • The discrete substring pattern matching method for information retrieval and information input that outputs the discrete prime substring matching degree can be slightly modified to form a pattern matching method that outputs the discrete number and positions of the discrete prime substring. In this method, relative to the above method for outputting the discrete prime substring matching degree,
  • steps a to f and step h are unchanged,
  • and step g is modified to:
  • After determining that a discrete prime substring exists, the method outputs the discrete number of the discrete prime substring and the position of each of its characters in the text S.
  • Positioning by discrete prime substring is better than positioning by discrete substring pattern matching. This is because whenever a discrete substring exists, a discrete prime substring must also exist, and the discrete number of the discrete prime substring must be less than or equal to that of the discrete substring. Using this degree of dispersion, the retrieved texts can be sorted in ascending order of discrete number and combined with the positioning information, so that the subsequent processing of information retrieval is more accurate and effective.
  • This method is suitable for location retrieval in short texts, such as network information search and database information retrieval.
  • When the decision method cannot find a discrete prime substring, only n comparisons are needed to skip the irrelevant text S; in the worst case, finding the first discrete prime substring takes $2n - 1$ character comparisons.
  • The discrete substring pattern matching method for information retrieval and information input that outputs the discrete number and positions of the discrete prime substring can be modified to form a pattern matching method that outputs the discrete number and positions of a discrete prime substring based on a given discrete number.
  • In this method, steps a to f and step h of the pattern matching method for outputting the discrete number and positions of the discrete prime substring are unchanged, and step g is modified to:
  • if the discrete number $D$ is less than or equal to the given discrete number $D_0$, it is determined that the pattern P is the first discrete prime substring of the text S that satisfies the given discrete number $D_0$; the discrete number D and the position array pos[ ] are output, and the matching ends.
  • This method can adjust the discrete number $D_0$ of the discrete prime substring to change the behaviour of discrete prime substring pattern matching, searching for discrete prime substrings and positions that satisfy the discrete number requirement.
  • The smaller the discrete number, the more precise the positioning but the weaker the search; some relevant texts containing discrete substrings may be skipped.
  • The larger the discrete number, the less precise the positioning, but the stronger the search.
  • The method can be applied to location retrieval in both long and short texts, such as network information search and database information retrieval.
  • The time complexity of finding the discrete prime substring is $O(n + k(m + D_a))$, and the number of character comparisons is $f(n) \le n + 2(k-1)(m + D_a - 1)$, where k is the number of times a discrete prime substring is found while searching for the given discrete number $D_0$,
  • and $D_a$ is the average discrete number of the discrete prime substrings found.
  • the discrete substring pattern matching method for information retrieval and information input which outputs the discrete substring matching degree, can be modified to form a pattern matching method for outputting the minimum discrete prime substring matching degree, and the method is the output discrete substring described above.
  • the af step is unchanged, the g step, the h step are modified, and the i step is added:
  • step (y strig- yr"ffl+l) 0, go to h step; otherwise, restart the matching of the next discrete substring, and change the position of the compared character in text S to the second of pos [ ] The value of the position, and take the character of the position as the compared character; modify the position of the comparison character in the mode P to the first character position of the mode P, and take the character of the position as the comparison character, and turn to step b.
  • the pattern matching method for outputting the minimum discrete element substring matching degree, the minimum discrete element substring matching degree of the output text S and the pattern P when determining the smallest discrete element substring in the text (l OO x (m- (y ra -y -m+l) ⁇ n) ⁇ n)
  • the smallest discrete element substring reflects the discrete substring with the smallest scatter in the text. Therefore, the matching degree of the smallest discrete substring can most accurately reflect the degree of matching.
  • This method is more suitable for information input judgment of short text and large-capacity string sets.
  • all the retrieved texts can be output in descending order, and the user first processes the text with high matching degree, which further improves the processing efficiency of information retrieval and input.
  • This method finds the smallest discrete prime substring with a time complexity of 0 (n+k (m+Da);) and the number of character comparisons is f (n) ⁇ n+2k(ra+Da-l), where k is the number of occurrences of the found discrete element substring, and Da is the average discrete number of the found discrete element substring.
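A Python sketch of the minimum discrete prime substring search is given below. It reads the restart rule as "resume the forward scan at the position stored as the second value of pos[ ]", keeps the best candidate in a variable playing the role of the array min[ ], and stops early when a discrete number of 0 is found. Positions are 0-based, and the name `min_prime_match` and the returned tuple layout are assumptions.

```python
from typing import List, Optional, Tuple


def min_prime_match(pattern: str, text: str) -> Optional[Tuple[int, int, List[int]]]:
    """Return (matching degree, discrete number, positions) of the discrete
    prime substring with the smallest discrete number, or None."""
    m, n = len(pattern), len(text)
    if m == 0:
        return None
    best: Optional[Tuple[int, List[int]]] = None  # plays the role of min[ ]
    start = 0
    while start < n:
        # Forward scan from `start` for the next discrete substring.
        i, j = start, 0
        while i < n and j < m:
            if text[i] == pattern[j]:
                j += 1
            i += 1
        if j < m:
            break  # no further discrete substring
        # Backward scan tightens it into a discrete prime substring.
        pos = [0] * m
        i, j = i - 1, m - 1
        while j >= 0:
            if text[i] == pattern[j]:
                pos[j] = i
                j -= 1
            i -= 1
        d = pos[-1] - pos[0] - m + 1
        if best is None or d < best[0]:
            best = (d, pos)
        if d == 0:
            break  # cannot be improved
        start = pos[1]  # d > 0 implies m >= 2, so pos[1] exists
    if best is None:
        return None
    d, pos = best
    return round(100 * (m - d / n) / n), d, pos


print(min_prime_match("abc", "a--b-c-ab-c-abc"))  # (20, 0, [12, 13, 14])
```

In the example the search meets candidates with discrete numbers 3, 1, and 0 in turn and keeps the last one, the contiguous occurrence at the end of the text.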
  • The discrete substring pattern matching method for information retrieval and information input that outputs the minimum discrete prime substring matching degree can be slightly modified to form a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring. In this method, relative to the above method for outputting the minimum discrete prime substring matching degree,
  • steps a to f are unchanged, and steps g, h, and i are modified to:
  • Step h: If the position array min[ ] of the current minimum discrete prime substring has not been assigned, it is determined that no discrete prime substring of the pattern P exists in the text S, the determination result "-1" is output, and the matching ends; otherwise, the length of the pattern P is obtained.
  • the character is taken as the compared character; the position of the comparison character in the pattern P is changed to the first character position of the pattern P, the character at that position is taken as the comparison character, and the method goes to step b.
  • This method further improves the positioning accuracy of discrete prime substring pattern matching, because if a discrete prime substring exists in the text S, there must be a discrete prime substring with the smallest discrete number. Finding the discrete prime substring with the smallest discrete number over the whole text is an optimal positioning scheme, which can effectively improve the efficiency and accuracy of information retrieval.
  • The discrete substring pattern matching method for information retrieval and information input that outputs the discrete number and positions of the minimum discrete prime substring can be slightly modified to form a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring based on a given discrete number.
  • In this method, steps a to g and step i of the pattern matching method for outputting the discrete number and positions of the minimum discrete prime substring are unchanged, and step h is modified to:
  • the determination result "-1" is output, and the matching ends; otherwise, the length of the pattern P is obtained.
  • it is determined that the pattern P is the minimum discrete prime substring of the text S that satisfies the given discrete number $D_0$; the discrete number D and the position array min[ ] are output, and the matching ends.
  • This method finds in the text S the minimum discrete prime substring that satisfies the given discrete number $D_0$.
  • It improves the function of discrete prime substring pattern matching by filtering out texts whose minimum discrete prime substrings have too large a discrete number, which improves the precision of information retrieval.
  • The method can be applied to the positioning and retrieval of both long and short texts, such as network information search and database information retrieval.
  • The time complexity of this method is the same as that of the minimum discrete prime substring pattern matching method.
  • The above discrete substring pattern matching method for information retrieval and information input can be modified and extended to form a two-dimensional discrete substring pattern matching method. It first extends the concepts of discrete substring and text into the concepts of two-dimensional discrete substring and two-dimensional text, and then, corresponding to the four steps a, b, c, and d of the discrete substring pattern matching method, defines four steps A, B, C, and D, with step C invoking steps a, b, c, and d, namely:
  • the two-dimensional discrete substring of the text $D_s$. The specific steps of this two-dimensional discrete substring pattern matching method are as follows:
  • Step A: Take the first text $S_1$ of the two-dimensional text $D_s$ as the compared text, and take the first pattern $P_1$ of the two-dimensional pattern $D_p$ as the comparison text.
  • Step B: If the compared text or the comparison text is the end flag, go to step D.
  • Step C: Apply steps a, b, c, and d of the basic discrete substring pattern matching method to the compared text and the comparison text. If the result of step d is "present", take the next text of the two-dimensional text $D_s$ as the compared text and the next pattern of the two-dimensional pattern $D_p$ as the comparison text, and go to step B; otherwise, take the next text of the two-dimensional text $D_s$ as the compared text, leave the comparison text unchanged, and go to step B.
  • Step D: If the comparison text is the end flag, the two-dimensional pattern $D_p$ is a two-dimensional discrete substring of the two-dimensional text $D_s$, and the number n of texts in the two-dimensional text $D_s$ and the number m of patterns in the two-dimensional pattern $D_p$ are obtained.
  • The two-dimensional discrete substring pattern matching method is used for two-dimensional space.
  • In keyboard input, for example, the pinyin of a single Chinese character and an English word are one-dimensional strings, while the pinyin of a Chinese phrase and an English phrase can be regarded as two-dimensional strings.
  • The two-dimensional discrete substring has all the characteristics of the discrete substring, and also contains discrete substrings.
  • The present invention allows arbitrary characters to be omitted in input and retrieval at the level of one-dimensional text space, and allows arbitrary one-dimensional texts to be omitted in input and retrieval at the level of two-dimensional space.
  • The relevant text can still be found, making information retrieval and information input simpler and more flexible.
  • After determining that a two-dimensional discrete substring exists, this two-dimensional discrete substring pattern matching method outputs the simple matching degree $100 \times m \div n$ of the two-dimensional discrete substring.
  • Using the matching degree, all the retrieved two-dimensional texts can be output in descending order, so that the user processes the two-dimensional texts with a high matching degree first, which improves the efficiency of retrieval processing. The method is suitable for retrieval judgment on short dictionary texts and large-capacity two-dimensional string sets.
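Steps A to D can be sketched in Python by treating a two-dimensional text as a list of words and reusing the one-dimensional test inside the outer loop. The representation as Python lists, the use of "-1" for "not present", and the names `is_discrete_substring` and `match_2d` are assumptions made for illustration.

```python
from typing import List


def is_discrete_substring(pattern: str, text: str) -> bool:
    """One-dimensional test (steps a to d): in-order match with gaps allowed."""
    j = 0
    for ch in text:
        if j < len(pattern) and ch == pattern[j]:
            j += 1
    return j == len(pattern)


def match_2d(patterns: List[str], texts: List[str]) -> int:
    """Return the simple matching degree round(100 * m / n) if every pattern
    matches some text in order, else -1."""
    i = j = 0  # step A: first text, first pattern
    while i < len(texts) and j < len(patterns):  # step B
        # Step C: on a one-dimensional match advance both, else only the text.
        if is_discrete_substring(patterns[j], texts[i]):
            j += 1
        i += 1
    if j < len(patterns) or not texts:  # step D
        return -1
    return round(100 * len(patterns) / len(texts))


phrase = ["discrete", "substring", "pattern", "matching"]
print(match_2d(["dsc", "mtch"], phrase))     # 50: two of four words matched
print(match_2d(["mtch", "dsc"], phrase))     # -1: word order is not respected
print(match_2d(["zh", "g"], ["zhong", "guo"]))  # 100: abbreviated pinyin
```

Both levels allow omission: letters may be dropped inside a word, and whole words may be dropped from the phrase, as long as the remaining order is preserved.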
  • the above-mentioned two-dimensional discrete substring pattern matching method for information retrieval and information input can be slightly modified to form a pattern matching method for output accurate matching degree, and the method is the above two-dimensional discrete substring pattern matching method. Steps and steps B are unchanged, and steps C and D are modified to:
  • Step C is to compare the text with the comparison text to perform the steps of step a, step b, step c, and step d in the method of discrete substring base matching. If the result of step d is present, the text to be compared is The position value in the two-dimensional text D s is stored in the position array pos [ ], and its storage position is the same as the position of the comparison text in the two-dimensional mode D p , and the next text of the two-dimensional text D s is taken as the comparison Text, take the next mode of the two-dimensional mode D p as the comparison text, and turn to step B; otherwise, take the next text of the two-dimensional text D s as the compared text, compare the text unchanged, and turn to step B;
  • the two-dimensional discrete substring of the two-dimensional pattern Dp does not exist in the two-dimensional text Ds, the determination result "-1" is output, and matching ends; otherwise, the two-dimensional pattern Dp is determined to be a two-dimensional discrete substring of the two-dimensional text Ds. Obtain the number n of texts in Ds, the number m of patterns in Dp, and the positions in Ds of the first and last text strings of the two-dimensional discrete substring: G1, the first value in pos[ ], and Gm, the last value in pos[ ]. Output the exact matching degree of the two-dimensional discrete substring, Round(100 × (m − (Gm − G1 − m + 1) ÷ n) ÷ n), and end the match.
  • This method determines the exact matching degree of the two-dimensional discrete substring, Round(100 × (m − (Gm − G1 − m + 1) ÷ n) ÷ n), when a two-dimensional discrete substring exists.
  • the exact matching degree of the two-dimensional discrete substring not only considers the number of one-dimensional texts of the two-dimensional text S and the two-dimensional pattern P, but also considers the influence of the discrete numbers of the retrieved two-dimensional discrete substrings on the matching degree.
  • all the retrieved two-dimensional texts can be output in descending order, and the user first processes the two-dimensional text with high matching degree, which further improves the retrieval processing efficiency of the two-dimensional space. It is also applicable to the retrieval judgment of dictionary short text and large-capacity two-dimensional string set.
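The two-dimensional exact matching degree described above can be illustrated with a minimal Python sketch. It is not the patent's reference implementation: the function names, the return value (-1, []) for "non-existence", the pinyin sample data, and the reading of Round as round-half-up are assumptions made for illustration. Positions in pos are 1-based, as in the text.

```python
from typing import List, Tuple


def is_discrete_substring(text: str, pattern: str) -> bool:
    """One-dimensional test: do the characters of pattern occur in text, in order?"""
    remaining = iter(text)
    return all(ch in remaining for ch in pattern)


def exact_matching_degree_2d(texts: List[str], patterns: List[str]) -> Tuple[int, List[int]]:
    """Return (degree, pos), or (-1, []) when no two-dimensional discrete substring exists."""
    n, m = len(texts), len(patterns)
    pos: List[int] = []  # pos[k] = 1-based position in texts of the text matched by patterns[k]
    k = 0
    for index, text in enumerate(texts, start=1):
        if k < m and is_discrete_substring(text, patterns[k]):
            pos.append(index)
            k += 1
    if m == 0 or k < m:
        return -1, []
    discrete_number = pos[-1] - pos[0] - m + 1  # Gm - G1 - m + 1
    # Round(100 * (m - discrete_number / n) / n), rounded half up using integers only.
    degree = (200 * (m * n - discrete_number) + n * n) // (2 * n * n)
    return degree, pos


# Hypothetical sample: the pinyin syllables of a seven-character phrase.
phrase = ["zhong", "hua", "ren", "min", "gong", "he", "guo"]
print(exact_matching_degree_2d(phrase, ["zh", "rn", "gu"]))
```

Here the three abbreviated patterns match the 1st, 3rd, and 7th texts, so the discrete number is 7 − 1 − 3 + 1 = 4 and the degree is lower than it would be for three adjacent texts.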
  • Step a: take the first character of the text S as the compared character, and the first character of the pattern P as the comparison character; Step b: if the compared character or the comparison character is the end flag, go to step d;
  • the next character of the text S is taken as the compared character and the next character of the pattern P as the comparison character, and go to step b; otherwise, take the next character of the text S as the compared character, leave the comparison character unchanged, and go to step b;
  • the pattern P is determined to be a discrete substring of the text S, data representing the determination result "present" is output, and matching ends; otherwise, no discrete substring of the pattern P exists in the text S, data representing the determination result "non-existent" is output, and matching ends.
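Steps a to d above amount to a single left-to-right scan with two cursors. The following Python sketch illustrates that reading; the function name and the Boolean return value (instead of output data for "present" and "non-existent") are assumptions, and reaching the end of a Python string stands in for the end flag.

```python
def is_discrete_substring(text: str, pattern: str) -> bool:
    """Return True if the characters of pattern appear in text in order, gaps allowed."""
    i = 0  # step a: index of the compared character in the text S
    j = 0  # step a: index of the comparison character in the pattern P
    # Step b: stop as soon as either string is exhausted (the "end flag").
    while i < len(text) and j < len(pattern):
        # Step c: on a match advance both cursors, otherwise only the text cursor.
        if text[i] == pattern[j]:
            j += 1
        i += 1
    # Step d: the pattern is a discrete substring only if all of it was consumed.
    return j == len(pattern)


print(is_discrete_substring("中文拼音、笔划、声调组合输入法", "拼音笔划声调"))
```

The scan visits each character of the text at most once, so its running time grows linearly with the length of the text.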
  • Embodiment 2:
  • The method of this example is a pattern matching method that outputs the simple matching degree, formed by slightly modifying the basic matching method: steps a, b, and c of the method of the first embodiment are unchanged.
  • Step d is modified to:
  • Round in the present invention is a rounding function, that is, an operation that rounds to the nearest integer.
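A short Python sketch of a simple matching degree follows. It assumes the one-dimensional degree has the same form as the two-dimensional one quoted earlier, Round(100 × m ÷ n), with n the length of the text and m the length of the pattern; that Round rounds halves up; and that -1 marks "non-existence". The function name is hypothetical. Integer arithmetic is used because Python's built-in round() rounds halves to the nearest even number.

```python
def simple_matching_degree(text: str, pattern: str) -> int:
    """Return Round(100 * m / n) if pattern is a discrete substring of text, else -1."""
    n, m = len(text), len(pattern)
    j = 0  # number of pattern characters matched so far
    for ch in text:
        if j < m and ch == pattern[j]:
            j += 1
    if n == 0 or j < m:
        return -1
    # floor(100 * m / n + 1/2), computed exactly with integers (round half up).
    return (200 * m + n) // (2 * n)


print(simple_matching_degree("中文拼音、笔划、声调组合输入法", "拼音笔划声调"))
```

In the sample call the pattern has 6 characters and the text 15, so the degree is 100 × 6 ÷ 15 = 40.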
  • The method of this example is likewise formed by slightly modifying the basic matching method, and outputs the exact matching degree: steps a and b of the method of the first embodiment are unchanged, and steps c and d are modified to:
  • the position of the compared character in the text S is stored in the position array pos[ ], at the same index as the position of the comparison character in the pattern P; the next character of the text S is then taken as the compared character and the next character of the pattern P as the comparison character, and go to step b; otherwise, take the next character of the text S as the compared character, leave the comparison character unchanged, and go to step b;
  • the determination result "-1" is output and matching ends; otherwise, the pattern P is determined to be a discrete substring of the text S, and the length n of the text S is obtained.
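The modified steps c and d can be sketched in Python as below. The sketch assumes that the one-dimensional exact matching degree has the same form as the two-dimensional formula given earlier, Round(100 × (m − (gm − g1 − m + 1) ÷ n) ÷ n), where g1 and gm are the first and last values in pos[ ]; the function name, the (-1, []) result for "non-existence", and round-half-up behaviour are also assumptions.

```python
from typing import List, Tuple


def exact_matching_degree(text: str, pattern: str) -> Tuple[int, List[int]]:
    """Return (degree, pos) for the first discrete substring found, else (-1, [])."""
    n, m = len(text), len(pattern)
    pos: List[int] = []  # pos[k] = 1-based position in text of pattern[k]
    j = 0
    for i, ch in enumerate(text, start=1):
        if j < m and ch == pattern[j]:
            pos.append(i)  # modified step c: remember where the character matched
            j += 1
    if m == 0 or j < m:
        return -1, []
    discrete_number = pos[-1] - pos[0] - m + 1  # characters skipped inside the match
    # Round(100 * (m - discrete_number / n) / n), rounded half up using integers only.
    degree = (200 * (m * n - discrete_number) + n * n) // (2 * n * n)
    return degree, pos


print(exact_matching_degree("中文拼音、笔划、声调组合输入法", "拼音笔划声调"))
```

For the sample call the match skips the two "、" characters, so the discrete number is 2 and the degree falls just below the simple value of 40.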
  • The method of this example is a pattern matching method that outputs the discrete number and positions of the discrete substring, formed by slightly modifying the exact-matching-degree pattern matching method of the third embodiment: steps a, b, and c of the method of the third embodiment are unchanged, and step d is modified to: Step d: if the comparison character is not the end flag, it is determined that no discrete substring of the pattern P exists in the text S, the determination result "-1" is output, and matching ends; otherwise, the pattern P is determined to be the discrete substring "Sg1Sg2…Sgm" of the text S.
  • The method of this example outputs the discrete number and positions of a discrete substring subject to a given discrete number, and is formed on the basis of the method of the fourth embodiment that outputs the discrete number and positions of the discrete substring: steps a, b, and c of the method of the fourth embodiment are unchanged, and step d is modified to:
  • the pattern P is determined to be the first discrete substring of the text S that meets the predetermined discrete number D0; the discrete number D is output, the position array pos[ ] is output, and matching ends.
  • the position of the compared character in the text S is changed to: (position of the currently compared character − length m of the pattern P − predetermined discrete number D0), and the character at that position is taken as the compared character; the position of the comparison character in the pattern P is changed to the first character position of the pattern P, and the character at that position is taken as the comparison character; go to step b.
  • The method of this example is a pattern matching method that outputs the matching degree of the discrete prime substring, formed by slightly modifying and extending the exact-matching-degree pattern matching method of the third embodiment. Steps a, b, and c of the method of the third embodiment are unchanged and first find the discrete substring; the discrete prime substring is then found through steps d, e, and f below, and its matching degree is determined by steps g and h:
  • Step d: if the comparison character is not the end mark, go to step h; otherwise, move the position of the compared character in the text S back by 1 character position and take the character at that position as the compared character, and move the position of the comparison character in the pattern P back by 2 character positions and take the character at that position as the comparison character.
  • Step e: if the first character of the pattern P has been compared, go to step g;
  • the position of the compared character in the text S is stored in the position array pos[ ], at the same index as the position of the comparison character in the pattern P; the previous character of the text S is then taken as the compared character and the previous character of the pattern P as the comparison character, and go to step e; otherwise, take the previous character of the text S as the compared character, leave the comparison character unchanged, and go to step e;
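The forward scan of steps a to c followed by the backward scan of steps d to f can be sketched in Python as follows. This is an illustration of the idea rather than a transcription of the individual steps: the function name, the use of None for "non-existence", and the 1-based positions are assumptions. The forward pass finds where the first discrete substring ends; the backward pass then re-matches the pattern from its last character, which pulls the start of the match as far right as possible.

```python
from typing import List, Optional


def first_discrete_prime_substring(text: str, pattern: str) -> Optional[List[int]]:
    """Return 1-based positions of the tightest match ending where the first match ends."""
    n, m = len(text), len(pattern)
    if m == 0:
        return None
    # Forward scan: find the index at which the first discrete substring ends.
    j, end = 0, -1
    for i in range(n):
        if text[i] == pattern[j]:
            j += 1
            if j == m:
                end = i
                break
    if end < 0:
        return None  # no discrete substring exists
    # Backward scan from that end, matching the pattern from its last character.
    # It cannot run past index 0, because the forward match lies within text[0..end].
    pos = [0] * m
    i, j = end, m - 1
    while j >= 0:
        if text[i] == pattern[j]:
            pos[j] = i + 1
            j -= 1
        i -= 1
    return pos


print(first_discrete_prime_substring("abcabd", "abd"))
```

In the sample call the forward scan alone would match positions 1, 2, and 6 (discrete number 3), while the backward scan settles on 4, 5, and 6 (discrete number 0).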
  • The method of this example is a pattern matching method that outputs the discrete number and positions of the discrete prime substring, formed by slightly modifying the discrete-prime-substring matching-degree method of the sixth embodiment: steps a to f and step h of the method of the sixth embodiment are unchanged, and step g is modified to:
  • This example is a pattern matching method that outputs the discrete number and positions of a discrete prime substring subject to a given discrete number, formed by slightly modifying the pattern matching method that outputs the discrete number and positions of the discrete prime substring.
  • Steps a to f and step h are unchanged, and step g is modified to:
  • the discrete number D does not exceed the predetermined discrete number D0, then the pattern P is determined to be the first discrete prime substring of the text S that meets the predetermined discrete number D0; the discrete number D is output, the position array pos[ ] is output, and matching ends.
  • This example is a pattern matching method that outputs the matching degree of the minimum discrete prime substring, formed by slightly modifying the discrete-prime-substring matching-degree method of the sixth embodiment.
  • Steps a to f of the method of the sixth embodiment are unchanged, steps g and h are modified, and step i is added:
  • Step i: if (gm − g1 − m + 1) = 0, go to step h; otherwise, restart matching for the next discrete substring: change the position of the compared character in the text S to the second value in pos[ ] and take the character at that position as the compared character; change the position of the comparison character in the pattern P to the first character position of the pattern P and take the character at that position as the comparison character, and go to step b.
  • Embodiment 10: This example is a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring, formed by slightly modifying the minimum-discrete-prime-substring matching-degree method of the ninth embodiment.
  • Steps a to f of the method of the ninth embodiment are unchanged, and steps g, h, and i are modified to:
  • the position in the text S: ⁇ 3 [ ]
  • Step h: if the position array min[ ] of the current minimum discrete prime substring has not been assigned, it is determined that no discrete substring of the pattern P exists in the text S, the determination result "-1" is output, and matching ends; otherwise, the pattern P is obtained.
  • the character is taken as the compared character; the position of the comparison character in the pattern P is changed to the first character position of the pattern P, and the character at that position is taken as the comparison character; go to step b.
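The search for the minimum discrete prime substring repeats the forward and backward scans, keeps the match with the smallest discrete number, and restarts from the second recorded position. The Python sketch below illustrates this under stated assumptions: the function name, the (-1, []) result for "non-existence", and the early exit when the discrete number reaches zero are choices made for this example, and the list `best` plays the role of the array min[ ].

```python
from typing import List, Tuple


def minimum_discrete_prime_substring(text: str, pattern: str) -> Tuple[int, List[int]]:
    """Return (discrete number, 1-based positions) of the tightest match, else (-1, [])."""
    n, m = len(text), len(pattern)
    best: List[int] = []
    best_d = -1
    start = 0  # 0-based index at which the next forward scan begins
    while m > 0 and start < n:
        # Forward scan: find the end of the next discrete substring.
        j, end = 0, -1
        for i in range(start, n):
            if text[i] == pattern[j]:
                j += 1
                if j == m:
                    end = i
                    break
        if end < 0:
            break  # no further discrete substring
        # Backward scan: tighten the match that ends at `end`.
        pos = [0] * m
        i, j = end, m - 1
        while j >= 0:
            if text[i] == pattern[j]:
                pos[j] = i + 1
                j -= 1
            i -= 1
        d = pos[-1] - pos[0] - m + 1
        if not best or d < best_d:
            best, best_d = pos, d
        if d == 0:
            break  # a contiguous match cannot be improved
        start = pos[1] - 1  # restart at the second matched position (m >= 2 here)
    return (best_d, best) if best else (-1, [])


print(minimum_discrete_prime_substring("a-b--ab-c-abc", "abc"))
```

Because each restart begins strictly to the right of the previous start, the loop terminates, at the cost of possibly rescanning parts of the text several times.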
  • This example is a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring subject to a given discrete number, based on the pattern matching method of the tenth embodiment that outputs the discrete number and positions of the minimum discrete prime substring.
  • Steps a to g and step i of the method of the tenth embodiment are unchanged, and step h is modified to:
  • Step h: if the position array min[ ] of the current minimum discrete prime substring has not been assigned, it is determined that no discrete substring of the pattern P exists in the text S, the determination result "-1" is output, and matching ends; otherwise, the length of the pattern P is obtained.
  • the discrete number D exceeds the predetermined discrete number D0, then it is determined that the text S contains no minimum discrete prime substring that satisfies the predetermined discrete number D0; the determination result is output, and matching ends.
  • the pattern P is determined to be the minimum discrete prime substring of the text S that meets the predetermined discrete number D0; the discrete number D is output, the position array min[ ] is output, and matching ends.
  • the expansion is performed to form a two-dimensional discrete substring pattern matching method, which first extends the concepts of discrete substring and text to the concepts of two-dimensional discrete substring and two-dimensional text, and then defines four steps A, B, C, and D analogous to the four steps a, b, c, and d of the discrete substring pattern matching method, with step C invoking those four steps a, b, c, and d, namely:
  • the specific steps of the substring pattern matching method are as follows:
  • Step A: take the first text S1 of the two-dimensional text Ds as the compared text, and the first pattern P1 of the two-dimensional pattern Dp as the comparison text;
  • Step B: if the compared text or the comparison text is the end mark, go to step D;
  • Step C: perform steps a, b, c, and d of the discrete substring pattern matching method on the compared text and the comparison text. If the result of step d is "present", take the next text of the two-dimensional text Ds as the compared text and the next pattern of the two-dimensional pattern Dp as the comparison text, and go to step B; otherwise, take the next text of the two-dimensional text Ds as the compared text, leave the comparison text unchanged, and go to step B;
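Steps A to C apply the same two-cursor scan one level up: the elements being compared are whole one-dimensional texts, and "equality" is replaced by the one-dimensional discrete substring test. A minimal Python sketch follows. The function names and the pinyin sample are hypothetical, and the final check (all patterns consumed) is an assumption about step D, which is not reproduced in the steps above.

```python
from typing import Sequence


def is_discrete_substring(text: str, pattern: str) -> bool:
    """Steps a to d: one-dimensional discrete substring test."""
    i = j = 0
    while i < len(text) and j < len(pattern):
        if text[i] == pattern[j]:
            j += 1
        i += 1
    return j == len(pattern)


def is_discrete_substring_2d(texts: Sequence[str], patterns: Sequence[str]) -> bool:
    """Steps A to C, with an assumed step D: two-dimensional discrete substring test."""
    i = j = 0  # Step A: first text of Ds and first pattern of Dp
    while i < len(texts) and j < len(patterns):  # Step B: stop at either end mark
        # Step C: on a one-dimensional match advance both, otherwise only the text.
        if is_discrete_substring(texts[i], patterns[j]):
            j += 1
        i += 1
    return j == len(patterns)  # assumed step D: every pattern found a text, in order


phrase = ["zhong", "hua", "ren", "min", "gong", "he", "guo"]
print(is_discrete_substring_2d(phrase, ["zh", "rn", "gu"]))
```

Both whole syllables and characters within a syllable may be omitted by the user, which is what the sample call demonstrates.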
  • This example is a two-dimensional discrete substring pattern matching method that outputs the exact matching degree, formed by slightly modifying the two-dimensional discrete substring pattern matching method for information retrieval and information input of the twelfth embodiment: steps A and B of that method are unchanged, and steps C and D are changed to:
  • Step C: perform steps a, b, c, and d of the discrete substring pattern matching method on the compared text and the comparison text. If the result of step d is "present", store the position of the compared text within the two-dimensional text Ds in the position array pos[ ], at the same index as the position of the comparison text within the two-dimensional pattern Dp; then take the next text of Ds as the compared text and the next pattern of Dp as the comparison text, and go to step B; otherwise, take the next text of Ds as the compared text, leave the comparison text unchanged, and go to step B;
  • the two-dimensional discrete substring of the two-dimensional pattern Dp does not exist in the two-dimensional text Ds, the determination result "-1" is output, and matching ends; otherwise, the two-dimensional pattern Dp is determined to be a two-dimensional discrete substring of the two-dimensional text Ds; obtain the number n of texts in Ds and the number m of patterns in Dp, and find the positions of the first text string and the last text string of the two-dimensional discrete substring in the two-dimensional text Ds
  • The pattern matching methods of the first, second, third, sixth, and ninth embodiments determine whether a discrete substring exists in the text and calculate the matching degree, but do not perform positioning; they are mainly used in the field of information input.
  • Table 1 shows the output of a specific pattern match for these pattern matching methods.
  • Table 1 Comparison of output results of the pattern matching methods of the first, second, third, sixth and ninth embodiments
  • The pattern matching method of the first embodiment only determines whether the pattern P exists in the text S, and cannot sort the retrieved texts for output.
  • The simple-matching-degree pattern matching method of the second embodiment can sort by the simple matching degree between the pattern P and the text S, but cannot reflect the influence of the discrete number of the discrete substring on the matching degree; the smaller the discrete number, the larger the matching degree should be.
  • The exact-matching-degree pattern matching method of the third embodiment reflects the influence of the discrete number on the matching degree, so different discrete numbers yield different matching degrees; however, the position it finds is not necessarily the best matching position.
  • The discrete-prime-substring matching-degree method of the sixth embodiment finds the position of the discrete prime substring. Because the discrete prime substring is the discrete substring with a smaller discrete number within the range of the corresponding discrete substring, it gives a more accurate matching position, so the output is more accurate.
  • The minimum-discrete-prime-substring matching-degree method of the ninth embodiment finds the position of the minimum discrete prime substring in the text, so the matching degree it outputs is the most accurate and the sorting result is optimal.
  • Table 2 lists the time complexity of the discrete substring pattern matching method of the first, second, third, sixth, and ninth embodiments described above.
  • Table 2 Time complexity analysis of the first, second, third, sixth and ninth methods
  • Tables 1 and 2 show that, with the minimum discrete prime substring pattern matching method of the ninth embodiment, sorting the output in descending order of matching degree accurately reflects how well each text matches, but the time complexity of that method is the highest and the method itself is also more complex. The appropriate method should therefore be selected according to the requirements of the practical problem, weighing all factors.
  • The pattern matching methods of the fourth, fifth, seventh, eighth, tenth, and eleventh embodiments determine whether a discrete substring exists in the text, calculate the discrete number to reflect the degree of correlation between the pattern and the text, and also give the position in the text of each character of the pattern. They are mainly used in the field of information retrieval, to retrieve relevant texts more simply and efficiently and to indicate the specific position in the text S of each character of the pattern P. The positioning results output by these methods for a specific pattern match are listed below.
  • The output results (examples) of the discrete substring pattern matching methods that support positioning are as follows; the search term is the pattern.
  • The fourth embodiment is a pattern matching method that outputs the discrete number and positions of the discrete substring; it locates the first discrete substring appearing in the text, imposes no limit on the discrete number, and is suitable for information retrieval over short texts.
  • The fifth embodiment is a pattern matching method that outputs the discrete number and positions of a discrete substring subject to a given discrete number; it improves on the method of the fourth embodiment and locates the first discrete substring appearing in the text that satisfies the predetermined discrete number D0. It is suitable for retrieval over both long and short texts.
  • The seventh embodiment is a pattern matching method that outputs the discrete number and positions of the discrete prime substring; it locates the first discrete prime substring appearing in the text. Because the discrete prime substring is the discrete substring with a smaller discrete number within the range of the corresponding discrete substring, positioning is more precise. The method imposes no limit on the discrete number and is suitable for information retrieval over short texts.
  • The eighth embodiment is a pattern matching method that outputs the discrete number and positions of a discrete prime substring subject to a given discrete number; it improves on the method of the seventh embodiment and locates the first discrete prime substring appearing in the text that satisfies the predetermined discrete number D0. It is suitable for information retrieval over both long and short texts.
  • Embodiment 10 is a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring; it locates the minimum discrete prime substring appearing in the text. The minimum discrete prime substring is the discrete substring with the smallest discrete number within the whole text, so positioning is the most accurate.
  • This method imposes no limit on the discrete number and is suitable for information retrieval over short texts.
  • Embodiment 11 is a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring subject to a given discrete number; it improves on the method of Embodiment 10 and locates the minimum discrete prime substring that satisfies the predetermined discrete number D0. It is suitable for information retrieval over both long and short texts.
  • Pattern matching and positioning based on the existing substring may cause discretely related texts to be omitted from retrieval.
  • With the discrete substring pattern matching methods of the present invention, undesired omission of discretely related texts occurs only when the given discrete number in the methods of the fifth, eighth, and eleventh embodiments is too small. The other discrete substring pattern matching methods do not omit discretely related texts.
  • The methods of the fifth, eighth, and eleventh embodiments are the most flexible: by adjusting the predetermined discrete number, they can balance recall, precision, and positioning accuracy. Under the same discrete number condition, the recall and precision for related texts are the same, but the latter is the best.
  • Table 3 lists the time complexity of the above discrete substring pattern matching methods that support positioning, and the scope of application of each method.
  • the above embodiments one, two, three, six, and nine are one-dimensional discrete substring pattern matching methods.
  • The pattern matching methods of the twelfth and thirteenth embodiments are two-dimensional discrete substring pattern matching methods formed by extending and modifying the above one-dimensional discrete substring pattern matching methods; they are suitable for pattern matching over short texts and large-capacity lexicons in two-dimensional space.
  • In keyboard input, the pinyin of a single Chinese character or an English word is a one-dimensional string, while the pinyin of a Chinese phrase or an English phrase can be regarded as a two-dimensional string.
  • Table 4 below shows the results of the method of Embodiments 12 and 13 specifically for pattern matching in two-dimensional text.
  • the underlined Chinese characters indicate the Chinese characters that the two-dimensional pattern Dp matches in the two-dimensional text.
  • The pattern and text content in Table 4 are actually the pinyin of Chinese characters; in the table the pinyin is represented by the Chinese characters themselves.
  • The pinyin of each Chinese character can also be searched one-dimensionally with arbitrary characters omitted.
  • The two-dimensional discrete substring pattern matching method of the twelfth embodiment can sort by the simple matching degree between the two-dimensional pattern Dp and the two-dimensional text Ds, but cannot reflect the influence of the discrete number of the two-dimensional discrete substring on the matching degree; the smaller the discrete number, the greater the matching degree should be.
  • The two-dimensional discrete substring pattern matching method of the thirteenth embodiment outputs the exact matching degree, which reflects the influence of the discrete number on the matching degree: different discrete numbers yield different matching degrees, so the sorting result is more reasonable.
  • The discrete substring proposed by the present invention conceptually greatly widens the range of related texts; in contrast to existing inexact matching based on error-distance calculations, the present invention proposes an approach to string matching based on discrete characteristics.
  • The pattern matching method based on discrete substrings requires that the characters of the pattern appear in the text completely, in order, and discretely (three characteristics). When the discrete number is zero, it reduces to exact substring pattern matching.
  • Discrete substrings include substrings; a substring is a special case of a discrete substring.
  • The discrete characteristics of discrete substrings are in line with how the public chooses search terms. Users can flexibly and simply choose search terms that are ordered and discrete.
  • Because the discrete substring pattern matching method satisfies completeness and order, and can discriminate the degree of correlation of the retrieved texts through the discrete number, recall is high, precision is assured, and matches can be located sensibly.
  • The discrete substring pattern matching method solves the inherent problem, present in information retrieval for the past 40 years, of omitting discretely related texts, and has important application value. It is applicable to the following areas of information retrieval and information input: database retrieval of various texts, network information search, intra-site retrieval, information inquiry, keyboard input, electronic dictionaries, operating system file retrieval, etc.
  • The output result "-1" in each pattern matching method of the present invention represents "non-existence"; any other specified data may be chosen as the non-existence flag.

Abstract

A discrete substring matching method for information searching and information inputting is disclosed. A discrete substring is a character string "Sg1Sg2…Sgm" (1 ≤ g1 < g2 < … < gm) formed from one or more characters of the text S = "S1S2…Sn". Discrete substring pattern matching determines whether the pattern P = "P1P2P3…Pm" (1 ≤ m ≤ n) is a discrete substring "Sg1Sg2…Sgm" of the text S. The method also provides the detailed steps of discrete substring pattern matching. The discrete substring widens the scope of the substring concept, and the pattern matching method solves the problem of omission when searching text with discrete features. Functionally, it improves the recall and precision of searching and makes positioning easy. In application, it makes information searching and information inputting simple, flexible, and quick.

Description

Discrete substring pattern matching method for information retrieval and information input
Technical field
The invention relates to a discrete substring pattern matching method for information retrieval and information input.
Background art
In the existing fields of information retrieval and information input, text must be pattern-matched against substrings. In information retrieval, for example, the input search term is used as a substring to search stored texts such as databases and web pages: if the search term is a substring of a stored text, that text is output as a retrieved text; otherwise the text is discarded. If none of the stored texts matches the search term, nothing is retrieved. In information input, a character string entered on an input device such as a keyboard is used as a pattern substring and matched against the texts in a text library stored by an information processing device such as a computer; if the pattern substring matches a text, that text is selected for subsequent processing. Clearly, the speed, recall, and precision of the substring pattern matching method are crucial for information retrieval and information input.
In the existing fields of information retrieval and information input, a substring is defined as follows: over a finite character set Σ, given a text string S = "S1S2…Sn" of length n and a pattern string P = "P1P2…Pm" of length m, if there exists "SiSi+1…Si+m−1" = "P1P2…Pm", then P is called a substring of S, and P occurs in S at position i. In other words, an existing substring must consist of consecutive characters of the text string S; a string formed from non-consecutive characters of S is not a substring of S. Substring pattern matching asks whether the text S contains a substring equal to the pattern P. In some applications, the matching degree and the position of occurrence must also be output along with the determination.
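The definition above can be checked directly by trying every start position i, which is the naive approach. The Python sketch below is only an illustration of the definition: the function name and the return value -1 for "not a substring" are assumptions.

```python
def substring_position(text: str, pattern: str) -> int:
    """Return the 1-based position i at which pattern occurs in text, or -1 if none."""
    n, m = len(text), len(pattern)
    for i in range(n - m + 1):
        # Compare the m consecutive characters starting at index i with the pattern.
        if text[i:i + m] == pattern:
            return i + 1
    return -1


print(substring_position("中文拼音、笔划、声调组合输入法", "拼音、笔划、声调"))
```

Each candidate position may require up to m character comparisons, so the worst case grows with the product of m and n.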
This is the simplest and most classic substring pattern matching problem. The earliest method for it is the Brute-Force method (the naive substring pattern matching method), whose worst-case time complexity is O(m*n). In 1970, S. A. Cook proved theoretically that the substring pattern matching problem can be solved in O(m+n) time. In the same year, Morris and Pratt constructed an algorithm modeled on Cook's proof, but its time complexity did not reach O(m+n). Knuth later improved this algorithm, and in 1976 the first algorithm to solve substring pattern matching within O(m+n) time complexity was born; it is abbreviated KMP (Knuth, Morris, Pratt), and its time complexity is markedly lower. In 1977, Boyer and Moore proposed another algorithm with linear time complexity O(m+n) (the BM algorithm). The BM algorithm matches from right to left and, in actual pattern matching, skips many useless characters, which makes it highly efficient, especially when matching substrings over a large character set, and it is widely used. Since then, further more efficient algorithms have been proposed, most of them improvements on the KMP or BM algorithm.
All of the above pattern matching algorithms search the text string S for a contiguous substring that matches the pattern P, and they have been continually improved with the aim of raising matching speed.
With such substring matching methods, the fields of information retrieval and information input have long suffered from the omission of discretely related texts.
Suppose the text S contains a substring "SiSi+1…Si+m−1" equal to (matching) the pattern P. The discrete characteristics of the substring within the text S then show in three respects:
a) Contiguous middle: the substring consists of m consecutive characters starting at the i-th character of S; b) Trailing omission: the last n − m − i + 1 characters of S are omitted;
c) Leading omission: the first i − 1 characters of S are omitted.
The following examples show clearly how matching with such substrings omits discretely related text:
Example 1: S = "中文拼音、笔划、声调组合输入法" (Chinese pinyin, stroke and tone combination input method); P = "拼音、笔划、声调" (pinyin, stroke, tone).
P matches S and is a substring of S.
Example 2: S = "中文拼音、笔划、声调组合输入法"; P = "拼音笔划声调" (pinyin stroke tone). The characteristic feature is that P appears discretely in S.
P does not match S and is not a substring of S.
Clearly, in information retrieval and information input, one would like P to count as a substring of S in the second case as well, but this plainly contradicts the definition of a substring. Existing substring matching methods cannot match P against S in the second case.
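As a minimal illustration of Examples 1 and 2, the sketch below applies Python's built-in `in` operator, which tests for a contiguous occurrence only. The variable names are illustrative and are not part of the original text.

```python
# Conventional (contiguous) substring test applied to Examples 1 and 2.
S = "中文拼音、笔划、声调组合输入法"
P1 = "拼音、笔划、声调"   # Example 1: occurs contiguously in S
P2 = "拼音笔划声调"       # Example 2: same characters with the separators left out

example1_matches = P1 in S   # contiguous substring test
example2_matches = P2 in S

print(example1_matches, example2_matches)
```

The second test fails even though every character of P2 occurs in S in the right order, which is the omission the text describes.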
This is a defect in how the existing substring concept expresses relevance: the problem arises from requiring the characters of a substring to be contiguous. With respect to discreteness, substring-based relevance is not a complete notion of relevance. By definition it omits related text that has discrete characteristics, which causes many difficulties for applications and makes problems more complex to solve.
The following examples further show the inherent omission of discretely related text throughout information retrieval systems based on the substring concept.
Example 3: File search
Suppose the hard disk contains the file "my_working_daily_plan.doc". With existing substring pattern matching, the file cannot be retrieved with the search string "mwdp", although "mwdp" appears discretely in "my_working_daily_plan.doc".
Example 4: When spelling the English word "procedure", people tend to remember the string "prcde", made up of the initial letters of the syllables plus the final letter. This string does not satisfy the definition of a substring of "procedure", yet "prcde" appears discretely in "procedure". Entering "prcde" with existing substring matching does not retrieve the word.
Example 5: Input of Chinese characters
Suppose the pinyin-and-stroke code of the character "床" (bed) is "chuangdhp" and is stored in the Chinese character code table. Because the code is long, can "床" be entered with freely abbreviated inputs such as "cugdh", "cdhl" or "cugd", thereby shortening the input code? Note that "cugdh", "cdhl" and "cugd" all appear discretely in "chuangdhp". Existing substring pattern matching cannot provide this capability.
Another class of pattern matching is inexact (approximate) matching, which decides whether a pattern P is similar to a text S while allowing a limited number of errors; subject to a similarity constraint, it returns the decision and the match location. It is used in many fields, including information retrieval, information processing and DNA matching in biotechnology. The error factors that affect exact matching mainly include insertion, transposition, deletion, substitution and reversal errors. Because there are many kinds of error, inexact pattern matching methods each take some subset of them into account and, from different application perspectives and with various techniques, yield solutions ranging from linear time complexity to non-deterministic polynomial time complexity (NP-complete problems). They attempt to solve matching with a limited number of errors, and their effectiveness is limited by the combination of error types considered and by the number of errors.
For example, when inexact matching is applied to the problem in Example 4, the ED (Edit Distance) method treats it as four deletion errors. Because ED also takes insertion and substitution errors into account, allowing four errors across all three error types would match a large number of words in an English lexicon that satisfy the four-error constraint, making the result meaningless. Wildcard matching is one alternative, for example entering "pr*c*d*e", but the user must decide where to place wildcards and how many to add, which remains difficult in practice for ordinary users. Maximum matching can also solve the problem, but it is equivalent to considering insertion and deletion errors together, so the method itself is more complex, its time complexity is higher, and for lexicon retrieval the number of candidate words grows. Hamming distance considers only substitution errors. Similarity matching again considers all three error types and measures how similar the pattern is to the text. The retrieval quality of all these inexact methods is limited by the number of errors allowed and by the combination of error types. Existing inexact matching therefore does not solve these typical discrete-relevance problems well.
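The edit-distance argument can be checked with a small sketch. The function below is a standard Levenshtein distance (insertions, deletions and substitutions each cost 1), written here for illustration only; the words "pride" and "price" are hypothetical lexicon entries chosen to show how unrelated words fall inside a four-error budget.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between a and b (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))           # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                            # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


d_procedure = edit_distance("procedure", "prcde")  # four deletions
d_pride = edit_distance("pride", "prcde")          # hypothetical lexicon word
d_price = edit_distance("price", "prcde")          # hypothetical lexicon word
print(d_procedure, d_pride, d_price)
```

Under this measure the intended word "procedure" is farther from "prcde" than the unrelated words, which is why a four-error threshold admits many irrelevant candidates.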
As networked information becomes ever more widespread, information acquisition and information input by the general public have become a bottleneck, and string pattern matching plays a central role in both. When information acquisition and input rely on existing string pattern matching methods, the omission of discretely related text described above causes the general public considerable inconvenience and urgently needs to be solved.
Summary of the invention
The object of the invention is to solve the above problems by providing a discrete substring pattern matching method for information retrieval and information input. The method has a high recall rate, assured precision and easy positioning, and it makes information retrieval and information input simple, flexible and fast.
The technical solution adopted by the invention is a discrete substring pattern matching method for information retrieval and information input, characterized in that a discrete substring is a string "S_{g_1} S_{g_2} … S_{g_m}" (1 ≤ g_1 < g_2 < … < g_m ≤ n) composed of any one or more characters of the text S = "S_1 S_2 … S_n". Discrete substring pattern matching determines whether the pattern P = "P_1 P_2 … P_m" (1 ≤ m ≤ n) is a discrete substring "S_{g_1} S_{g_2} … S_{g_m}" of the text S and outputs the result, by the following steps:
Step a: Take the first character of the text S as the compared character and the first character of the pattern P as the comparison character.
Step b: If the compared character or the comparison character is the end mark, go to step d.
Step c: If the compared character equals the comparison character, take the next character of S as the compared character and the next character of P as the comparison character, and go to step b. Otherwise, take the next character of S as the compared character, leave the comparison character unchanged, and go to step b.
Step d: If the comparison character is the end mark, determine that P is a discrete substring of S, output data representing the result "exists", and end the matching. Otherwise, determine that S contains no discrete substring equal to P, output data representing the result "does not exist", and end the matching.
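Steps a to d can be sketched as a single left-to-right scan. This is a minimal illustration, not the patent's own code: the function name is hypothetical, the "end mark" is modeled by running past the end of a Python string, and the results "exists" and "does not exist" are modeled as True and False.

```python
def is_discrete_substring(text: str, pattern: str) -> bool:
    """Return True if pattern occurs in text in order, not necessarily contiguously."""
    i = 0  # index of the compared character in the text S (step a)
    j = 0  # index of the comparison character in the pattern P (step a)
    # Step b: stop as soon as either string is exhausted (the "end mark").
    while i < len(text) and j < len(pattern):
        # Step c: on a match advance both; otherwise advance only the text.
        if text[i] == pattern[j]:
            j += 1
        i += 1
    # Step d: P is a discrete substring exactly when P was exhausted.
    return j == len(pattern)


print(is_discrete_substring("procedure", "prcde"))
print(is_discrete_substring("中文拼音、笔划、声调组合输入法", "拼音笔划声调"))
```

Each character of the text is examined at most once, so the scan performs at most n comparisons.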
Compared with the prior art, the invention has the following advantages:
1. A discrete substring in the invention is a string composed of any one or more characters of the text S. This extends the concept of a substring: the characters of the substring need not be contiguous in S, and the conventional substring is merely a special case of the discrete substring. Because the method reports that P matches S whenever P is a discrete substring of S, recall is markedly improved and the omission of discretely related text in existing substring pattern matching is resolved. The discrete substring can also be located conveniently by further pattern matching methods.
2. The text S is judged relevant to the pattern P only when the characters of P appear in S completely and in order (by position), possibly discretely. Matching precision is therefore assured.
3. Theoretical analysis shows that deciding that a discrete substring exists takes O(n) time with f(n) ≤ n character comparisons, and deciding that none exists takes O(n) time with f(n) = n character comparisons. Compared with the O(m+n) complexity of existing substring pattern matching methods (S. A. Cook's result), the decision skips irrelevant text more quickly. The method is therefore fast and effective.
4. Because a discrete substring is a string composed of any one or more characters of the text S = "S_1 S_2 … S_n", a search term used in information retrieval or information input may consist of characters of the text taken in order and possibly discretely. Choosing a search term is therefore simple and flexible, which both shortens the input code and helps avoid spelling errors and dialect-related errors.
5. Existing research on inexact string matching is based on computing distances over error factors; discrete substring pattern matching instead takes an entirely different approach based on discreteness. Error factors are the symptoms exhibited by string matching problems, whereas discreteness is their underlying regularity, and discreteness is not equivalent to any error factor. For example, deletion errors and discreteness differ conceptually: when matching text against a pattern, deletion errors may occur anywhere in the text, whereas discreteness concerns only the number of discrete characters within "S_{g_1} S_{g_2} … S_{g_m}".
This is the basic discrete substring pattern matching method: when a discrete substring is found it outputs data representing the result "exists", and otherwise data representing "does not exist". It is suited to information input decisions over short texts and large string collections.
With a slight modification, the above discrete substring pattern matching method becomes a pattern matching method that outputs a simple matching degree: steps a, b and c of the basic method are unchanged, and step d is modified to:
Step d: If the comparison character is the end mark, determine that P is a discrete substring of S, obtain the length n of S and the length m of P, output the simple matching degree of the discrete substring = Round(100 × m ÷ n), and end the matching. Otherwise, determine that S contains no discrete substring equal to P, output the result "-1", and end the matching.
When a discrete substring exists, this method outputs the simple matching degree (100 × m ÷ n) between the text S and the pattern P. It is suited to information input decisions over short texts and large string collections. Retrieved texts can be output in descending order of simple matching degree, so that the user can pick the best-matching texts first.
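A minimal sketch of the simple matching degree follows. The function name is hypothetical, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for the unspecified Round of the text, which is an assumption.

```python
def simple_match_degree(text: str, pattern: str) -> int:
    """Return Round(100 * m / n) if pattern is a discrete substring of text, else -1."""
    i = j = 0
    while i < len(text) and j < len(pattern):   # steps b and c of the basic method
        if text[i] == pattern[j]:
            j += 1
        i += 1
    if j < len(pattern):                        # modified step d: no discrete substring
        return -1
    return round(100 * len(pattern) / len(text))


# Rank a few candidate texts for the query "prcde" by descending matching degree.
candidates = ["procedure", "a procedure call", "price"]
ranked = sorted(candidates, key=lambda s: simple_match_degree(s, "prcde"), reverse=True)
print(ranked)
```

Shorter texts that contain the pattern score higher, because the degree depends only on the two lengths.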
For this method, deciding that a discrete substring exists takes O(n) time with f(n) ≤ n character comparisons, and deciding that none exists takes O(n) time with f(n) = n character comparisons, the same as for the basic pattern matching method.
With a slight modification, the above discrete substring pattern matching method becomes a pattern matching method that outputs a precise matching degree: steps a and b of the basic method are unchanged, and steps c and d are modified to:
Step c: If the compared character equals the comparison character, store the position of the compared character in S in the position array pos[ ], at the index equal to the position of the comparison character in P; take the next character of S as the compared character and the next character of P as the comparison character, and go to step b. Otherwise, take the next character of S as the compared character, leave the comparison character unchanged, and go to step b.
Step d: If the comparison character is not the end mark, determine that S contains no discrete substring equal to P, output the result "-1", and end the matching. Otherwise, determine that P is a discrete substring of S, obtain the length n of S and the length m of P, and obtain the positions in S of the first and last characters of the discrete substring: g_1 = the first value in pos[ ], g_m = the last value in pos[ ]. Output the precise matching degree, which reflects how dispersed the discrete substring is, = Round(100 × (m - (g_m - g_1 - m + 1) ÷ n) ÷ n), and end the matching.
When a discrete substring exists, this method outputs the precise matching degree 100 × (m - (g_m - g_1 - m + 1) ÷ n) ÷ n between the text S and the pattern P. The precise matching degree takes into account not only the lengths of S and P but also the effect of the discrete number of the retrieved discrete substring. It too is suited to information input decisions over short texts and large string collections. Retrieved texts can be output in descending order of precise matching degree, making it easier for the user to pick the best-matching texts first.
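The precise matching degree can be sketched as follows. Function names are hypothetical, positions are 1-based as in the text, a non-empty pattern is assumed, and Python's `round` is assumed to stand in for Round.

```python
from typing import List, Optional


def match_positions(text: str, pattern: str) -> Optional[List[int]]:
    """Leftmost discrete match: 1-based positions in text of each pattern character."""
    pos: List[int] = []
    i = 0
    while i < len(text) and len(pos) < len(pattern):
        if text[i] == pattern[len(pos)]:   # modified step c: record the position
            pos.append(i + 1)
        i += 1
    return pos if len(pos) == len(pattern) else None


def precise_match_degree(text: str, pattern: str) -> int:
    """Round(100 * (m - D / n) / n) with D = g_m - g_1 - m + 1, or -1 if no match."""
    pos = match_positions(text, pattern)
    if not pos:                            # no discrete substring (or empty pattern)
        return -1
    n, m = len(text), len(pattern)
    d = pos[-1] - pos[0] - m + 1           # number of skipped characters inside the span
    return round(100 * (m - d / n) / n)


print(match_positions("procedure", "prcde"), precise_match_degree("procedure", "prcde"))
```

A contiguous occurrence has D = 0, in which case the precise degree equals the simple matching degree.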
For this method, deciding that a discrete substring exists takes O(n) time with f(n) ≤ n character comparisons, and deciding that none exists takes O(n) time with f(n) = n character comparisons, the same as for the basic pattern matching method.
With a slight modification, the above method that outputs a precise matching degree becomes a pattern matching method that outputs the discrete number and positions of the discrete substring: steps a, b and c of the precise-matching-degree method are unchanged, and step d is modified to:
Step d: If the comparison character is not the end mark, determine that S contains no discrete substring equal to P, output the result "-1", and end the matching. Otherwise, P is the discrete substring "S_{g_1} S_{g_2} … S_{g_m}" of S; obtain the length m of P and the positions in S of the first and last characters of the discrete substring: g_1 = the first value in pos[ ], g_m = the last value in pos[ ]. Output the discrete number of the discrete substring D = g_m - g_1 - m + 1, output the position array pos[ ], and end the matching.
Once this method has determined that a discrete substring exists, it outputs the discrete number of the discrete substring and the position in S of each of its characters. The discrete number reflects the degree of dispersion, so retrieved texts can be sorted in ascending order of discrete number; combined with the position information, this makes subsequent processing of the retrieval results more accurate and effective. The method is suited to locating retrieval in short texts, such as web search and database retrieval.
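A minimal sketch of this variant returns the discrete number D together with the 1-based position array. The function name and the use of None for the "-1" result are illustrative assumptions; the sample file name is written with underscores, as in Example 3.

```python
from typing import List, Optional, Tuple


def discrete_number_and_positions(text: str, pattern: str) -> Optional[Tuple[int, List[int]]]:
    """Return (D, pos) for the leftmost discrete match, or None if there is none."""
    pos: List[int] = []
    i = 0
    while i < len(text) and len(pos) < len(pattern):
        if text[i] == pattern[len(pos)]:
            pos.append(i + 1)              # 1-based position in the text
        i += 1
    if not pos or len(pos) < len(pattern):
        return None                        # corresponds to the "-1" result
    d = pos[-1] - pos[0] - len(pattern) + 1
    return d, pos


print(discrete_number_and_positions("my_working_daily_plan.doc", "mwdp"))
print(discrete_number_and_positions("procedure", "prcde"))
```

Sorting retrieved texts by the first element of the returned pair gives the ascending order of discrete number described above.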
Time complexity of this method: finding the first discrete substring takes O(n) time with f(n) ≤ n character comparisons; failing to find a discrete substring takes O(n) time with f(n) = n character comparisons. The time complexity is thus independent of the pattern P: n comparisons suffice to skip an irrelevant text S, and the worst case for finding the first discrete substring is n character comparisons.
With a slight modification, the above method that outputs the discrete number and positions of the discrete substring becomes a pattern matching method that outputs the discrete number and positions of a discrete substring subject to a given discrete number: steps a, b and c of the method above are unchanged, and step d is modified to:
Step d: If the comparison character is not the end mark, determine that S contains no discrete substring equal to P, output the result "-1", and end the matching. Otherwise, P is the discrete substring "S_{g_1} S_{g_2} … S_{g_m}" of S; obtain the length m of P, the positions in S of the first and last characters of the discrete substring (g_1 = the first value in pos[ ], g_m = the last value in pos[ ]), and the discrete number of the discrete substring D = g_m - g_1 - m + 1.
If D ≤ the preset discrete number D_0, determine that P is the first discrete substring of S that satisfies D_0, output the discrete number D, output the position array pos[ ], and end the matching.
If D > D_0, restart matching for the next discrete substring: set the position of the compared character in S to (current compared-character position - length m of P - preset discrete number D_0) and take the character at that position as the compared character; set the position of the comparison character in P to the first character of P and take that character as the comparison character; go to step b.
This discrete substring pattern matching method based on a given discrete number D_0 can be applied to locating retrieval in both long and short texts, such as web search and database retrieval.
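The restart rule can be sketched as below. This is an illustrative reading of the modified step d, not a definitive implementation: the function name is hypothetical, None models the "-1" result, positions are reported 1-based, and the back-up position (current position - m - D_0) is applied to the 0-based scan index, which corresponds to the same character as in the 1-based description.

```python
from typing import List, Optional, Tuple


def find_within_discrete_number(text: str, pattern: str, d0: int) -> Optional[Tuple[int, List[int]]]:
    """First discrete match whose discrete number D is at most d0, as (D, pos), else None."""
    n, m = len(text), len(pattern)
    if m == 0 or d0 < 0:
        raise ValueError("pattern must be non-empty and d0 non-negative")
    pos = [0] * m
    i = j = 0
    while True:
        while i < n and j < m:             # steps b and c
            if text[i] == pattern[j]:
                pos[j] = i + 1             # 1-based position
                j += 1
            i += 1
        if j < m:
            return None                    # text exhausted: no qualifying match
        d = pos[-1] - pos[0] - m + 1
        if d <= d0:
            return d, list(pos)
        # D > D_0: back up so that the next attempt starts at the beginning of the
        # window of length m + D_0 that ends at the last matched character.
        i = i - m - d0
        j = 0


print(find_within_discrete_number("axxab", "ab", 0))
print(find_within_discrete_number("axxab", "ab", 3))
```

With d0 = 0 the sketch only accepts contiguous occurrences, so it behaves like an ordinary substring search; larger values of d0 accept progressively more dispersed matches.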
By adjusting the discrete number D_0, the behaviour of the matching can be changed so that it finds the discrete substrings, and their positions, that satisfy the given discrete number. The smaller the discrete number, the more precise the positioning but the lower the recall, since some relevant texts containing a discrete substring may be skipped; the larger the discrete number, the less precise the positioning but the higher the recall, since more texts satisfying discrete substring matching are found.
The user can therefore choose the discrete number and, by varying it, strike a balance among recall, precision and positioning accuracy to suit different retrieval requirements. When D_0 = 0, the method reduces to conventional substring pattern matching, with retrieval equivalent to substring search, which provides compatibility with substring pattern matching. The discrete number D_0 thus plays an important role in the discrete substring pattern matching method.
Time complexity: finding the first discrete substring that satisfies D_0 takes O(n + k(m + D_0)) time with f(n) ≤ n + (k - 1)(m + D_0) character comparisons; failing to find one takes O(n + k(m + D_0)) time with f(n) = n + k(m + D_0) character comparisons. Here k is the number of discrete substrings found and D_0 is the given discrete number.
With slight modification and extension, the above method that outputs a precise matching degree becomes a pattern matching method that outputs the matching degree of the discrete prime substring. Steps a, b and c of the precise-matching-degree method are unchanged and first find a discrete substring; steps d, e and f below then find the discrete prime substring, that is, the discrete substring with the smallest discrete number within the range of the discrete substring already found; steps g and h then determine and output the discrete prime substring matching degree:
Step d: If the comparison character is not the end mark, go to step h. Otherwise, move the position of the compared character in S back by two character positions (toward the start of S) and take the character there as the compared character, and move the position of the comparison character in P back by two character positions and take the character there as the comparison character.
Then carry out steps e, f, g and h in order:
Step e: If the first character of P has already been compared, go to step g.
Step f: If the compared character equals the comparison character, store the position of the compared character in S in the position array pos[ ], at the index equal to the position of the comparison character in P; take the previous character of S as the compared character and the previous character of P as the comparison character, and go to step e. Otherwise, take the previous character of S as the compared character, leave the comparison character unchanged, and go to step e.
Step g: Determine that P is a discrete prime substring of S, obtain the length n of S and the length m of P, and obtain the positions in S of the first and last characters of the discrete prime substring: g_1 = the first value in pos[ ], g_m = the last value in pos[ ]. Output the discrete prime substring matching degree = Round(100 × (m - (g_m - g_1 - m + 1) ÷ n) ÷ n), and end the matching.
Step h: Determine that S contains no discrete prime substring equal to P, output the result "-1", and end the matching.
This method determines whether a discrete prime substring exists and outputs the matching degree 100 × (m - (g_m - g_1 - m + 1) ÷ n) ÷ n between the text S and the pattern P for the discrete prime substring. The discrete prime substring represents a better match position, with a smaller discrete number, than the discrete substring, so its matching degree better reflects the quality of the match.
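Steps a to h can be sketched as a forward scan followed by a backward scan from the last matched character. The function names are hypothetical, positions are 1-based, a non-empty pattern is assumed, and Python's `round` is assumed to stand in for Round.

```python
from typing import List, Optional


def prime_positions(text: str, pattern: str) -> Optional[List[int]]:
    """1-based positions of the discrete prime substring, or None if pattern is absent."""
    n, m = len(text), len(pattern)
    pos = [0] * m
    i = j = 0
    while i < n and j < m:                 # steps a to c: forward scan
        if text[i] == pattern[j]:
            pos[j] = i + 1
            j += 1
        i += 1
    if m == 0 or j < m:
        return None                        # step h: no discrete substring at all
    i -= 2                                 # step d: character just before the last match
    j -= 2                                 # step d: second-to-last pattern character
    while j >= 0:                          # steps e and f: backward scan
        if text[i] == pattern[j]:
            pos[j] = i + 1                 # overwrite with the later (tighter) position
            j -= 1
        i -= 1
    return pos


def prime_match_degree(text: str, pattern: str) -> int:
    """Step g: Round(100 * (m - D / n) / n) for the discrete prime substring, or -1."""
    pos = prime_positions(text, pattern)
    if pos is None:
        return -1
    n, m = len(text), len(pattern)
    d = pos[-1] - pos[0] - m + 1
    return round(100 * (m - d / n) / n)


print(prime_positions("axxab", "ab"), prime_match_degree("axxab", "ab"))
```

For the text "axxab" and pattern "ab", the forward scan alone matches positions 1 and 5, while the backward scan tightens the match to positions 4 and 5, so the reported degree is higher.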
这种方法适合短文本、 大容量字符串集的信息输入判定。 利用该离散素子串匹配度, 可 以对检索出的所有文本进行降序排列输出, 使用户首先处理匹配度高的文本, 提高了检索的 效率。  This method is suitable for information input judgment of short text and large-capacity string sets. By using the discrete element substring matching degree, all the retrieved texts can be output in descending order, so that the user first processes the text with high matching degree, and the retrieval efficiency is improved.
该种方法找到第一个离散素子串的时间复杂度为 0 (n) , 字符比较次数 f (n) < n+ (m+Dr) < 2n-l , Df为找到的第一个离散素子串的离散数; 找不到离散素子串的时间复杂度为 0 (n) , 字符比较次数 f (n) =n。 从时间复杂度分析可知, 该判定方法找不到离散素子串的时间复杂度 与模式 P无关, 只需比较 n次, 即可跳过无关文本, 找到第一个离散素子串的最坏情形为比 较 2n- 1次字符。 The method finds the first discrete element substring with a time complexity of 0 (n) and the character comparison number f (n) < n+ (m+D r ) < 2n-l , and D f is the first discrete element found. The discrete number of the string; the time complexity of the discrete substring not found is 0 (n), and the number of character comparisons f (n) = n. From the time complexity analysis, the time complexity of the discrete method substring cannot be found in this decision method. It is only necessary to compare n times, and the unrelated text can be skipped. The worst case of finding the first discrete prime substring is Compare 2n-1 characters.
With a slight modification, the above discrete substring pattern matching method for information retrieval and information input, which outputs the discrete prime substring matching degree, becomes a pattern matching method that outputs the discrete number and positions of the discrete prime substring. Steps a to f and step h of the above method are unchanged, and step g is modified to:
Step g: Determine that pattern P is a discrete prime substring of text S. Obtain the length m of pattern P, then obtain the positions in text S of the first and last characters of the discrete prime substring: g_1 = the first value in pos[ ], g_m = the last value in pos[ ]. Output the discrete number D = g_m - g_1 - m + 1 and the position array pos[ ], and end the match.
After determining that a discrete prime substring exists, this method outputs its discrete number and the position in text S of each of its characters. Locating by discrete prime substring is better than locating by discrete substring, because wherever a discrete substring exists a discrete prime substring must also exist, and the discrete number of the discrete prime substring is always less than or equal to that of the discrete substring. Since the discrete number reflects the degree of dispersion, retrieved texts can be sorted in ascending order of discrete number and combined with the position information, making the subsequent processing of retrieval results more accurate and effective. The method is suited to locating retrieval of short texts, such as network information search and database information retrieval.
Time complexity analysis: when the first discrete prime substring is found, the time complexity is O(n) and the number of character comparisons satisfies f(n) ≤ 2n - 1; when no discrete prime substring is found, the time complexity is O(n) and f(n) = n. The cost of failing to find a discrete prime substring is therefore independent of pattern P: n comparisons suffice to skip an irrelevant text S, and the worst case for finding the first discrete prime substring is 2n - 1 character comparisons.
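A possible sketch of this variant in Python is given next. Steps a to f are not reproduced in this part of the text, so the sketch assumes that the discrete prime substring is obtained by a forward scan that finds the earliest end position, followed by a backward scan from that end that tightens the start. That assumption is consistent with the stated bound of about 2n comparisons, but it is an assumption. The name `first_prime_substring` is hypothetical, the pattern is assumed non-empty, and positions are returned 1-based.

```python
def first_prime_substring(text: str, pattern: str):
    """Return (D, pos) for the first discrete prime substring, or None if absent.

    Assumed two-pass scheme: forward scan to the earliest end, then a
    backward scan from that end to tighten the start.
    """
    m = len(pattern)
    j, end = 0, -1
    for i, ch in enumerate(text):          # forward pass
        if ch == pattern[j]:
            j += 1
            if j == m:
                end = i
                break
    if end < 0:
        return None                        # corresponds to the "-1" result
    pos = [0] * m
    j, i = m - 1, end
    while j >= 0:                          # backward pass from the matched end
        if text[i] == pattern[j]:
            pos[j] = i + 1                 # store 1-based positions
            j -= 1
        i -= 1
    d = pos[-1] - pos[0] - m + 1           # discrete number D = g_m - g_1 - m + 1
    return d, pos

print(first_prime_substring("aabc", "ac"))
```

For the text "aabc" and the pattern "ac", a plain forward match would pick positions 1 and 4 (D = 2), whereas the tightened result is positions 2 and 4 (D = 1).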
With a slight modification, the above discrete substring pattern matching method for information retrieval and information input, which outputs the discrete number and positions of the discrete prime substring, becomes a pattern matching method that outputs the discrete number and positions of a discrete prime substring subject to a given discrete number. Steps a to f and step h of the above method are unchanged, and step g is modified to:
Step g: Determine that pattern P is a discrete prime substring of text S. Obtain the length m of pattern P, then obtain the positions in text S of the first and last characters of the discrete prime substring, g_1 = the first value in pos[ ] and g_m = the last value in pos[ ], and the discrete number D = g_m - g_1 - m + 1.
If D ≤ D_0, where D_0 is the predetermined discrete number, determine that pattern P is the first discrete prime substring of text S that meets the D_0 requirement, output the discrete number D and the position array pos[ ], and end the match.
If D > D_0, restart matching for the next discrete substring: set the position of the compared character in text S to Max(the value at the second position of pos[ ], g_m + 1 - m - D_0) and take the character at that position as the compared character; set the position of the comparison character in pattern P to the first character of pattern P and take that character as the comparison character; go to step b.
Adjusting the given discrete number D_0 changes the behavior of discrete prime substring matching, so that the search returns only discrete prime substrings, and their positions, that meet the discrete number requirement. The smaller the discrete number, the more precise the positioning but the weaker the recall, and some relevant texts that do contain a discrete prime substring may be skipped. The larger the discrete number, the less precise the positioning but the stronger the recall, and more texts that satisfy discrete prime substring matching are found. By tuning the predetermined discrete number, the invention can meet the differing requirements for positioning precision and recall in different retrieval scenarios.
The method can be applied to locating retrieval of long and short texts, such as network information search and database information retrieval.
Time complexity analysis: when the first discrete prime substring satisfying D_0 is found, the time complexity is O(n + k(m + D_a)) and the number of character comparisons satisfies f(n) ≤ n + 2(k - 1)(m + D_a - 1); when no discrete prime substring satisfying D_0 is found, the time complexity is O(n + k(m + D_a)) and f(n) = n + 2k(m + D_a - 1). Here k is the number of discrete prime substrings found and D_a is their average discrete number.
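The restart rule can be illustrated with the following Python sketch. It rests on the same assumption as before, namely that a discrete prime substring is found by a forward scan followed by a backward scan, since steps a to f are not reproduced here. The helper `prime_from` and the function `first_prime_within` are hypothetical names, the pattern is assumed non-empty, and D_0 is assumed non-negative.

```python
def prime_from(text: str, pattern: str, start: int):
    """1-based positions of the first prime substring at or after 0-based start, or None."""
    m, j, end = len(pattern), 0, -1
    for i in range(start, len(text)):       # forward pass: earliest end
        if text[i] == pattern[j]:
            j += 1
            if j == m:
                end = i
                break
    if end < 0:
        return None
    pos, j, i = [0] * m, m - 1, end
    while j >= 0:                           # backward pass: tighten the start
        if text[i] == pattern[j]:
            pos[j] = i + 1
            j -= 1
        i -= 1
    return pos

def first_prime_within(text: str, pattern: str, d0: int):
    """Return (D, pos) for the first prime substring with D <= d0, or None."""
    m, start = len(pattern), 0
    while True:
        pos = prime_from(text, pattern, start)
        if pos is None:
            return None
        d = pos[-1] - pos[0] - m + 1
        if d <= d0:
            return d, pos
        # Restart at Max(second value of pos[ ], g_m + 1 - m - D_0), 1-based.
        second = pos[1] if m > 1 else pos[0] + 1
        start = max(second, pos[-1] + 1 - m - d0) - 1   # back to 0-based

print(first_prime_within("axxbab", "ab", 0))
```

With D_0 = 0 the scattered match at positions 1 and 4 is rejected and the search resumes, finding the contiguous match at positions 5 and 6. With D_0 = 2 the first match is accepted immediately.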
With a slight modification, the above discrete substring pattern matching method for information retrieval and information input, which outputs the discrete prime substring matching degree, becomes a pattern matching method that outputs the matching degree of the minimum discrete prime substring. Steps a to f of the above method are unchanged, steps g and h are modified, and a step i is added:
Step g: Determine that pattern P is a discrete prime substring of text S. If the positions y_1 and y_m in text S of the first and last characters of the current minimum discrete prime substring have not been assigned, set y_1 = the first value in pos[ ] and y_m = the last value in pos[ ], and go to step i. Otherwise, obtain the positions in text S of the first and last characters of the discrete prime substring: g_1 = the first value in pos[ ], g_m = the last value in pos[ ]. If (g_m - g_1) < (y_m - y_1), set y_1 = g_1 and y_m = g_m and go to step i; if (g_m - g_1) ≥ (y_m - y_1), go directly to step i.
Step h: If the positions y_1 and y_m in text S of the first and last characters of the current minimum discrete prime substring have not been assigned, determine that no discrete prime substring of pattern P exists in text S, output the determination result "-1", and end the match. Otherwise, obtain the length n of text S and the length m of pattern P, output the minimum discrete prime substring matching degree = Round(100 × (m - (y_m - y_1 - m + 1) ÷ n) ÷ n), and end the match.
Step i: If (y_m - y_1 - m + 1) = 0, go to step h. Otherwise, restart matching for the next discrete substring: set the position of the compared character in text S to the value at the second position of pos[ ] and take the character at that position as the compared character; set the position of the comparison character in pattern P to the first character of pattern P and take that character as the comparison character; go to step b.
When this pattern matching method determines that a minimum discrete prime substring exists in the text, it outputs the matching degree between text S and the minimum discrete prime substring of pattern P, namely 100 × (m - (y_m - y_1 - m + 1) ÷ n) ÷ n. The minimum discrete prime substring is the discrete substring with the smallest discrete number in the text, so its matching degree reflects the degree of match most precisely.
This method is even better suited to information input determination over short texts and large string sets. The matching degree allows all retrieved texts to be output in descending order, so that the user handles the best-matching texts first, which further improves the efficiency of information retrieval and input.
When the minimum discrete prime substring is found, the time complexity of this method is O(n + k(m + D_a)) and the number of character comparisons satisfies f(n) ≤ n + 2k(m + D_a - 1), where k is the number of discrete prime substrings found and D_a is their average discrete number. When no minimum discrete substring is found, the time complexity is O(n) and f(n) = n. The analysis shows that when the text string T is unrelated to pattern P, that is, when no discrete substring of P can occur in T, the method needs only n character comparisons to skip the irrelevant text.
With a slight modification, the above discrete substring pattern matching method for information retrieval and information input, which outputs the matching degree of the minimum discrete prime substring, becomes a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring. Steps a to f of the above method are unchanged, and steps g, h, and i are modified to:
Step g: A discrete prime substring has been found. If the position array min[ ] of the current minimum discrete prime substring has not been assigned, set min[ ] = pos[ ] and go to step i. Otherwise, obtain the positions in text S of the first and last characters of the discrete prime substring, g_1 = the first value in pos[ ] and g_m = the last value in pos[ ], and the positions in text S of the first and last characters of the current minimum discrete prime substring, y_1 = the first value in min[ ] and y_m = the last value in min[ ]. If (g_m - g_1) < (y_m - y_1), set min[ ] = pos[ ] and go to step i; if (g_m - g_1) ≥ (y_m - y_1), go directly to step i.
Step h: If the position array min[ ] of the current minimum discrete prime substring has not been assigned, determine that no discrete prime substring of pattern P exists in text S, output the determination result "-1", and end the match. Otherwise, obtain the length m of pattern P and the positions in text S of the first and last characters of the current minimum discrete prime substring, y_1 = the first value in min[ ] and y_m = the last value in min[ ]; output the discrete number D = y_m - y_1 - m + 1 and the position array min[ ], and end the match.
Step i: Obtain the positions in text S of the first and last characters of the current minimum discrete prime substring: y_1 = the first value in min[ ], y_m = the last value in min[ ]. If (y_m - y_1 - m + 1) = 0, go to step h. Otherwise, restart matching for the next discrete substring: set the position of the compared character in text S to the value at the second position of pos[ ] and take the character at that position as the compared character; set the position of the comparison character in pattern P to the first character of pattern P and take that character as the comparison character; go to step b.
This method further improves the positioning precision of discrete prime substring pattern matching. If text S contains a discrete prime substring, it must contain one with the smallest discrete number. Finding that substring over the whole range of the text is the best positioning scheme and most effectively improves the efficiency and accuracy of information retrieval.
Time complexity analysis: when the minimum discrete prime substring is found, the time complexity is O(n + k(m + D_a)) and the number of character comparisons satisfies f(n) ≤ n + 2(k - 1)(m + D_a - 1); when no minimum discrete prime substring is found, the time complexity is O(n + k(m + D_a)) and f(n) = n + 2k(m + D_a - 1). Here k is the number of discrete prime substrings found and D_a is their average discrete number.
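Steps g, h, and i of this variant might be sketched in Python as below. As in the text, the loop keeps the tightest match found so far in a `min`-style array, stops early once a contiguous match (D = 0) is found, and otherwise restarts from the second position of pos[ ]. The forward and backward scans inside the hypothetical helper `prime_from` are an assumption about steps a to f, which are not reproduced here; `min_prime_substring` is likewise a hypothetical name, and the pattern is assumed non-empty.

```python
def prime_from(text: str, pattern: str, start: int):
    """1-based positions of the first prime substring at or after 0-based start, or None."""
    m, j, end = len(pattern), 0, -1
    for i in range(start, len(text)):       # forward pass: earliest end
        if text[i] == pattern[j]:
            j += 1
            if j == m:
                end = i
                break
    if end < 0:
        return None
    pos, j, i = [0] * m, m - 1, end
    while j >= 0:                           # backward pass: tighten the start
        if text[i] == pattern[j]:
            pos[j] = i + 1
            j -= 1
        i -= 1
    return pos

def min_prime_substring(text: str, pattern: str):
    """Return (D, min_pos) for the minimum discrete prime substring, or None."""
    m, start, best = len(pattern), 0, None
    while True:
        pos = prime_from(text, pattern, start)
        if pos is None:
            break                                        # no further candidates: step h
        if best is None or pos[-1] - pos[0] < best[-1] - best[0]:
            best = pos                                   # step g: keep the tighter span
        if best[-1] - best[0] - m + 1 == 0:
            break                                        # step i: D = 0 cannot be improved
        start = pos[1] - 1                               # step i: restart at pos[ ] second value
    if best is None:
        return None
    return best[-1] - best[0] - m + 1, best

print(min_prime_substring("axbxxab", "ab"))
```

For the text "axbxxab" and the pattern "ab", the first candidate is at positions 1 and 3 (D = 1), and the search continues to the contiguous match at positions 6 and 7 (D = 0).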
With a slight modification, the above discrete substring pattern matching method for information retrieval and information input, which outputs the discrete number and positions of the minimum discrete prime substring, becomes a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring subject to a given discrete number. Steps a to g and step i of the above method are unchanged, and step h is modified to:
Step h: If the position array min[ ] of the current minimum discrete prime substring has not been assigned, determine that no discrete prime substring of pattern P exists in text S, output the determination result "-1", and end the match. Otherwise, obtain the length m of pattern P and the positions in text S of the first and last characters of the current minimum discrete prime substring, y_1 = the first value in min[ ] and y_m = the last value in min[ ], and compute the discrete number D = y_m - y_1 - m + 1. If D > D_0, where D_0 is the predetermined discrete number, determine that text S contains no minimum discrete prime substring satisfying D_0, output the determination result "-1", and end the match.
If D ≤ D_0, determine that pattern P is the minimum discrete prime substring of text S that meets the predetermined D_0 requirement, output the discrete number D and the position array min[ ], and end the match.
This method finds, in text S, the minimum discrete prime substring that satisfies the given discrete number D_0. It refines discrete prime substring pattern matching by filtering out texts whose minimum discrete prime substring has too large a discrete number, which improves retrieval accuracy.
The method can be applied to locating retrieval of long and short texts, such as network information search and database information retrieval.
The time complexity of this method is the same as that of the minimum discrete prime substring pattern matching method.
The above discrete substring pattern matching method for information retrieval and information input can be modified and extended into a two-dimensional discrete substring pattern matching method. The concepts of discrete substring and text are first extended to two-dimensional discrete substring and two-dimensional text; four steps A, B, C, and D are then carried out, analogous to steps a, b, c, and d of the discrete substring pattern matching method, with step C invoking steps a, b, c, and d. That is:
In the discrete substring pattern matching method for information retrieval and information input, there are multiple texts S. The texts S^1, S^2, …, S^n form a two-dimensional text Ds = "S^1 S^2 … S^n". A text string "S^(G_1)' S^(G_2)' … S^(G_m)'" (where 1 ≤ G_1 < G_2 < … < G_m ≤ n), composed of discrete substrings S^(G_i)' of any one or more texts S^(G_i) of Ds, is a two-dimensional discrete substring. Two-dimensional discrete substring pattern matching determines whether the two-dimensional pattern Dp = "P^1 P^2 … P^m" (1 ≤ m ≤ n) is a two-dimensional discrete substring of the two-dimensional text Ds. The specific steps of this method are as follows:
Step A: Take the first text S^1 of the two-dimensional text Ds as the compared text, and take the first pattern P^1 of the two-dimensional pattern Dp as the comparison text.
Step B: If the compared text or the comparison text is the end mark, go to step D.
Step C: Apply steps a, b, c, and d of the basic discrete substring pattern matching method to the compared text and the comparison text. If the result of step d is "exists", take the next text of Ds as the compared text, take the next pattern of Dp as the comparison text, and go to step B. Otherwise, take the next text of Ds as the compared text, leave the comparison text unchanged, and go to step B.
Step D: If the comparison text is the end mark, the two-dimensional pattern Dp is a two-dimensional discrete substring of the two-dimensional text Ds. Obtain the number n of texts in Ds and the number m of patterns in Dp, output the two-dimensional discrete substring simple matching degree = Round(100 × m ÷ n), and end the match. Otherwise, no two-dimensional discrete substring of Dp exists in Ds; output the determination result "-1" and end the match.
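Steps A to D above can be sketched in Python as follows. This is an illustrative sketch under stated assumptions: the two-dimensional text and pattern are modelled as lists of strings, the function names are hypothetical, Round is taken as rounding half up, and the pattern list is assumed non-empty.

```python
import math

def is_discrete_substring(text: str, pattern: str) -> bool:
    """Steps a to d: True if pattern is a (possibly scattered) subsequence of text."""
    it = iter(text)
    # "ch in it" advances the iterator until ch is found, so order is preserved.
    return all(ch in it for ch in pattern)

def two_dim_simple_degree(texts: list[str], patterns: list[str]) -> int:
    """Return Round(100 * m / n) if patterns match texts in order, else -1."""
    j = 0                                    # index of the comparison text in Dp
    for s in texts:                          # each s is the compared text in Ds
        if j < len(patterns) and is_discrete_substring(s, patterns[j]):
            j += 1                           # step C: advance to the next pattern
    if j < len(patterns):
        return -1                            # step D: some pattern was never matched
    return math.floor(100 * len(patterns) / len(texts) + 0.5)

print(two_dim_simple_degree(["zhong", "hua", "ren", "min"], ["zh", "rn"]))
```

In this example "zh" matches the first text and "rn" matches the third, so two of four texts are covered and the simple matching degree is 50. Reversing the pattern order yields -1, because the patterns must match texts in order.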
This two-dimensional discrete substring pattern matching method extends the discrete substring and realizes two-dimensional discrete substring pattern matching. A discrete substring is just a special case of a two-dimensional discrete substring: when m = 1 in the two-dimensional discrete substring "S^(G_1)' S^(G_2)' … S^(G_m)'" above, it reduces to a discrete substring.
The two-dimensional discrete substring pattern matching method applies to two-dimensional space. In keyboard input, for example, the pinyin of a single Chinese character or a single English word is a one-dimensional string, while the pinyin of a Chinese phrase or an English phrase can be regarded as a two-dimensional string. A two-dimensional discrete substring has all the characteristics of a discrete substring and also subsumes it. The invention can therefore omit arbitrary characters during input and retrieval at the level of one-dimensional text, and omit arbitrary one-dimensional texts at the two-dimensional level, while still finding the relevant text, which makes information retrieval and information input simpler and more flexible.
After determining that a two-dimensional discrete substring exists, this method outputs its simple matching degree, 100 × m ÷ n. The matching degree allows all retrieved two-dimensional texts to be output in descending order, so that the user handles the best-matching two-dimensional texts first, which improves the efficiency of retrieval processing. The method is suited to retrieval determination over dictionary-style short texts and large two-dimensional string sets.
Time complexity analysis: assume the lengths of the text strings in the two-dimensional text Ds are L_1, L_2, …, L_n, and let L = L_1 + L_2 + … + L_n. When a two-dimensional discrete substring exists, the time complexity is O(L) and the number of character comparisons satisfies f(L) ≤ L; when none exists, the time complexity is O(L) and f(L) = L, independent of the length of the two-dimensional pattern Dp. The method thus skips irrelevant texts as quickly as possible.
With a slight modification, the above two-dimensional discrete substring pattern matching method for information retrieval and information input becomes a pattern matching method that outputs a precise matching degree. Steps A and B of the above method are unchanged, and steps C and D are modified to:
Step C: Apply steps a, b, c, and d of the basic discrete substring pattern matching method to the compared text and the comparison text. If the result of step d is "exists", store the position of the compared text in the two-dimensional text Ds in the position array pos[ ], at the same index as the position of the comparison text in the two-dimensional pattern Dp; then take the next text of Ds as the compared text, take the next pattern of Dp as the comparison text, and go to step B. Otherwise, take the next text of Ds as the compared text, leave the comparison text unchanged, and go to step B.
Step D: If the comparison text is not the end mark, no two-dimensional discrete substring of the two-dimensional pattern Dp exists in the two-dimensional text Ds; output the determination result "-1" and end the match. Otherwise, determine that Dp is a two-dimensional discrete substring of Ds. Obtain the number n of texts in Ds and the number m of patterns in Dp, then obtain the positions in Ds of the first and last text strings of the two-dimensional discrete substring: G_1 = the first value in pos[ ], G_m = the last value in pos[ ]. Output the two-dimensional discrete substring precise matching degree = Round(100 × (m - (G_m - G_1 - m + 1) ÷ n) ÷ n) and end the match.
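The modified steps C and D might look like this in Python. As before, this is a sketch with hypothetical names: the two-dimensional text and pattern are modelled as lists of strings, text positions are recorded 1-based, Round is taken as rounding half up, and the pattern list is assumed non-empty.

```python
import math

def is_discrete_substring(text: str, pattern: str) -> bool:
    """Steps a to d: True if pattern is a (possibly scattered) subsequence of text."""
    it = iter(text)
    return all(ch in it for ch in pattern)

def two_dim_precise_degree(texts: list[str], patterns: list[str]) -> int:
    """Return the precise matching degree, or -1 if the patterns do not match in order."""
    n, m = len(texts), len(patterns)
    pos = []                                      # 1-based index in Ds of each matched text
    for idx, s in enumerate(texts, start=1):
        if len(pos) < m and is_discrete_substring(s, patterns[len(pos)]):
            pos.append(idx)                       # step C: record where this pattern matched
    if len(pos) < m:
        return -1                                 # step D: comparison text is not the end mark
    d = pos[-1] - pos[0] - m + 1                  # texts skipped between G_1 and G_m
    return math.floor(100 * (m - d / n) / n + 0.5)

print(two_dim_precise_degree(["zhong", "hua", "ren", "min"], ["zh", "rn"]))
```

Here the patterns match texts 1 and 3, so one text is skipped inside the span and the degree is 44, slightly below the simple matching degree of 50 for the same input.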
When this method determines that a two-dimensional discrete substring exists, it outputs the precise matching degree of the two-dimensional discrete substring, 100 × (m - (G_m - G_1 - m + 1) ÷ n) ÷ n. The precise matching degree considers not only the numbers of one-dimensional texts in the two-dimensional text Ds and the two-dimensional pattern Dp, but also the effect of the discrete number of the retrieved two-dimensional discrete substring. The matching degree allows all retrieved two-dimensional texts to be output in descending order, so that the user handles the best-matching two-dimensional texts first, which further improves retrieval efficiency in two-dimensional space. The method is likewise suited to retrieval determination over dictionary-style short texts and large two-dimensional string sets.
Time complexity analysis: when a two-dimensional discrete substring exists, the time complexity is O(L) and the number of character comparisons satisfies f(L) ≤ L; when none exists, the time complexity is O(L) and f(L) = L, independent of the length of the two-dimensional pattern Dp. The method thus skips irrelevant texts as quickly as possible.
The invention is described in further detail below with reference to specific embodiments.
Detailed Description
Embodiment 1
The first embodiment of the invention is a discrete substring pattern matching method for information retrieval and information input, characterized in that a discrete substring is a string "S_g1 S_g2 … S_gm" (1 ≤ g_1 < g_2 < … < g_m ≤ n) composed of any one or more characters of the text S = "S_1 S_2 … S_n". Discrete substring pattern matching determines whether the pattern P = "P_1 P_2 … P_m" (1 ≤ m ≤ n) is a discrete substring "S_g1 S_g2 … S_gm" of text S and outputs the determination result. The specific steps are as follows:
Step a: Take the first character of text S as the compared character, and take the first character of pattern P as the comparison character.
Step b: If the compared character or the comparison character is the end mark, go to step d.
Step c: If the compared character equals the comparison character, take the next character of text S as the compared character, take the next character of pattern P as the comparison character, and go to step b. Otherwise, take the next character of text S as the compared character, leave the comparison character unchanged, and go to step b.
Step d: If the comparison character is the end mark, determine that pattern P is a discrete substring of text S, output data representing the result "exists", and end the match. Otherwise, determine that no discrete substring of pattern P exists in text S, output data representing the result "does not exist", and end the match.
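Steps a to d amount to a single left-to-right scan, which can be sketched in Python as follows. The function name `is_discrete_substring` is hypothetical, and running off the end of a string stands in for the end mark.

```python
def is_discrete_substring(text: str, pattern: str) -> bool:
    """Return True if pattern is a discrete substring (subsequence) of text."""
    i = 0  # step a: index of the compared character in text S
    j = 0  # step a: index of the comparison character in pattern P
    # Step b: stop as soon as either string is exhausted (the end mark).
    while i < len(text) and j < len(pattern):
        # Step c: on a match advance in both strings; otherwise only in the text.
        if text[i] == pattern[j]:
            j += 1
        i += 1
    # Step d: the pattern is a discrete substring iff it was consumed completely.
    return j == len(pattern)

print(is_discrete_substring("zhongguo", "zg"))
print(is_discrete_substring("zhongguo", "gz"))
```

The first call reports a match because "z" and "g" occur in that order in "zhongguo"; the second does not, because no "z" follows a "g". Each text character is examined at most once, so the scan takes at most n comparisons.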
Embodiment 2
This embodiment slightly modifies the basic matching method of Embodiment 1 to form a pattern matching method that outputs a simple matching degree. Steps a, b and c of Embodiment 1 are unchanged, and step d is modified to:
Step d: If the comparison character is the end flag, determine that the pattern P is a discrete substring of the text S, obtain the length n of the text S and the length m of the pattern P, output the simple matching degree of the discrete substring $= \mathrm{Round}(100 \times m \div n)$, and end the matching. Otherwise, determine that no discrete substring matching the pattern P exists in the text S, output the determination result "-1", and end the matching.
Round in the present invention is the rounding function, that is, rounding to the nearest integer with halves rounded up.
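The modified step d can be sketched as below. The names `round_half_up` and `simple_match_degree` are hypothetical. Python's built-in `round` rounds halves to the nearest even integer, so the sketch uses `math.floor(x + 0.5)` to follow the half-up rounding described for Round.

```python
import math


def round_half_up(x: float) -> int:
    """Round to the nearest integer, with halves rounded up."""
    return math.floor(x + 0.5)


def simple_match_degree(text: str, pattern: str) -> int:
    """Return Round(100 * m / n), or -1 if the pattern does not match."""
    j = 0
    for ch in text:  # steps a to c of Embodiment 1
        if j < len(pattern) and ch == pattern[j]:
            j += 1
    if j < len(pattern):  # step d: the pattern was not exhausted
        return -1
    return round_half_up(100 * len(pattern) / len(text))
```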
Embodiment 3
This embodiment also slightly modifies the basic matching method of Embodiment 1, forming a pattern matching method that outputs a precise matching degree. Steps a and b of Embodiment 1 are unchanged, and steps c and d are modified to:
Step c: If the compared character equals the comparison character, store the position of the compared character in the text S in the position array pos[ ], at the index equal to the position of the comparison character in the pattern P; then take the next character of the text S as the compared character and the next character of the pattern P as the comparison character, and go to step b. Otherwise, take the next character of the text S as the compared character, leave the comparison character unchanged, and go to step b.
Step d: If the comparison character is not the end flag, determine that no discrete substring matching the pattern P exists in the text S, output the determination result "-1", and end the matching. Otherwise, determine that the pattern P is a discrete substring of the text S; obtain the length n of the text S, the length m of the pattern P, and the positions in S of the first and last characters of the discrete substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ]; output the precise matching degree, which reflects how dispersed the discrete substring is, $= \mathrm{Round}(100 \times (m - (g_m - g_1 - m + 1) \div n) \div n)$, and end the matching.
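The following sketch records positions as in the modified step c and evaluates the precise matching degree of step d. The function names are hypothetical, positions are 1-based as in the text, and a non-empty pattern is assumed.

```python
import math


def match_positions(text: str, pattern: str):
    """Steps a to c: 1-based positions of the matched characters, or None."""
    pos = []
    j = 0
    for i, ch in enumerate(text, start=1):
        if j == len(pattern):
            break  # the pattern is exhausted (end flag)
        if ch == pattern[j]:
            pos.append(i)  # pos[j] holds the text position of pattern[j]
            j += 1
    return pos if j == len(pattern) else None


def precise_match_degree(text: str, pattern: str) -> int:
    """Step d: Round(100 * (m - (g_m - g_1 - m + 1) / n) / n), or -1."""
    pos = match_positions(text, pattern)
    if pos is None:
        return -1
    n, m = len(text), len(pattern)
    gaps = pos[-1] - pos[0] - m + 1  # unmatched characters inside the span
    return math.floor(100 * (m - gaps / n) / n + 0.5)  # half-up rounding
```

For the text "abcde", the pattern "ace" scores 52 while the contiguous pattern "abc" scores 60, which shows how gaps between matched characters lower the degree.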
Embodiment 4
This embodiment slightly modifies the precise-matching-degree method of Embodiment 3 to form a pattern matching method that outputs the discrete number and the positions of the discrete substring. Steps a, b and c of Embodiment 3 are unchanged, and step d is modified to:
Step d: If the comparison character is not the end flag, determine that no discrete substring matching the pattern P exists in the text S, output the determination result "-1", and end the matching. Otherwise, the pattern P is a discrete substring $s_{g_1} s_{g_2} \ldots s_{g_m}$ of the text S; obtain the length m of the pattern P and the positions in S of the first and last characters of the discrete substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ]; output the discrete number of the discrete substring $D = g_m - g_1 - m + 1$ and the position array pos[ ], and end the matching.
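A short sketch of this variant follows. The function name is hypothetical, and returning `None` stands in for the "-1" result so that the return type stays uniform; that choice is an assumption of the sketch.

```python
def discrete_number_and_positions(text: str, pattern: str):
    """Return (D, pos) with 1-based positions, or None if there is no match."""
    pos = []
    j = 0
    for i, ch in enumerate(text, start=1):  # steps a to c of Embodiment 3
        if j < len(pattern) and ch == pattern[j]:
            pos.append(i)
            j += 1
    if j < len(pattern):
        return None  # corresponds to the "-1" result of step d
    d = pos[-1] - pos[0] - len(pattern) + 1  # D = g_m - g_1 - m + 1
    return d, pos
```

The discrete number D counts the unmatched text characters lying between the first and last matched characters, so a contiguous match has D = 0.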
Embodiment 5
This embodiment slightly modifies the method of Embodiment 4, which outputs the discrete number and positions of the discrete substring, to form a pattern matching method that outputs the discrete number and positions of a discrete substring subject to a given discrete number. Steps a, b and c of Embodiment 4 are unchanged, and step d is modified to:
Step d: If the comparison character is not the end flag, determine that no discrete substring matching the pattern P exists in the text S, output the determination result "-1", and end the matching. Otherwise, the pattern P is a discrete substring $s_{g_1} s_{g_2} \ldots s_{g_m}$ of the text S; obtain the length m of the pattern P, the positions in S of the first and last characters of the discrete substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], and the discrete number of the discrete substring $D = g_m - g_1 - m + 1$.
If $D \le D_0$, where $D_0$ is the discrete number given in advance, determine that the pattern P is the first discrete substring of the text S that meets the requirement of $D_0$, output the discrete number D and the position array pos[ ], and end the matching.
If $D > D_0$, restart the matching for the next discrete substring: set the position of the compared character in the text S to (current compared-character position $- m - D_0$) and take the character at that position as the compared character; set the position of the comparison character in the pattern P to the first character of P and take the character at that position as the comparison character; go to step b.
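The restart rule can be sketched as below. The function name and the `None` return value for "no match" are assumptions of this sketch. Because the index `i` has already advanced past the last matched character when a match completes, `i - m - d0` corresponds to "current compared-character position minus m minus D0" in either 0-based or 1-based counting.

```python
def first_within_discrete_number(text: str, pattern: str, d0: int):
    """Return (D, pos) for the first discrete substring with D <= d0, else None."""
    n, m = len(text), len(pattern)
    i = 0  # 0-based index of the compared character
    while True:
        pos = []
        j = 0
        while i < n and j < m:  # steps b and c
            if text[i] == pattern[j]:
                pos.append(i + 1)  # store the 1-based position
                j += 1
            i += 1
        if j < m:
            return None  # text exhausted: no qualifying discrete substring
        d = pos[-1] - pos[0] - m + 1
        if d <= d0:
            return d, pos
        # D > D0: restart from (current position - m - D0), pattern from start
        i = i - m - d0
```

Since D > D0 implies that the restart position lies after the first matched character, each restart moves strictly forward and the loop terminates.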
Embodiment 6
This embodiment slightly modifies and extends the precise-matching-degree method of Embodiment 3 to form a pattern matching method that outputs the matching degree of a discrete prime substring. Steps a, b and c of Embodiment 3 are unchanged and first find a discrete substring; steps d, e and f below then find the discrete prime substring, and steps g and h determine and output its matching degree:
Step d: If the comparison character is not the end flag, go to step h. Otherwise, move the position of the compared character in the text S back by one character position and take the character at that position as the compared character; move the position of the comparison character in the pattern P back by two character positions and take the character at that position as the comparison character.
Then perform steps e, f, g and h below in order:
Step e: If the first character of the pattern P has already been compared, go to step g.
Step f: If the compared character equals the comparison character, store the position of the compared character in the text S in the position array pos[ ], at the index equal to the position of the comparison character in the pattern P; then take the previous character of the text S as the compared character and the previous character of the pattern P as the comparison character, and go to step e. Otherwise, take the previous character of the text S as the compared character, leave the comparison character unchanged, and go to step e.
Step g: Determine that the pattern P is a discrete prime substring of the text S; obtain the length n of the text S, the length m of the pattern P, and the positions in S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ]; output the discrete prime substring matching degree $= \mathrm{Round}(100 \times (m - (g_m - g_1 - m + 1) \div n) \div n)$, and end the matching.
Step h: Determine that no discrete prime substring matching the pattern P exists in the text S, output the determination result "-1", and end the matching.
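A sketch of the forward pass followed by the backward pass is given below, with hypothetical function names. One detail is an assumption of the sketch rather than a literal reading of step d: the last matched position is kept as found by the forward pass, and the backward scan starts at the text character just before it, so that one text character is never matched against two pattern characters.

```python
import math


def prime_positions(text: str, pattern: str):
    """1-based positions of the discrete prime substring, or None."""
    m = len(pattern)
    pos = [0] * m
    i = j = 0
    while i < len(text) and j < m:  # forward pass (steps a to c)
        if text[i] == pattern[j]:
            pos[j] = i + 1
            j += 1
        i += 1
    if j < m:
        return None  # step d leads to step h: no match
    k = pos[-1] - 2  # 0-based index just before the last matched character
    j = m - 2        # next pattern character to re-match, going backward
    while j >= 0:    # steps e and f: pull earlier characters rightward
        if text[k] == pattern[j]:
            pos[j] = k + 1
            j -= 1
        k -= 1
    return pos


def prime_match_degree(text: str, pattern: str) -> int:
    """Step g: matching degree of the discrete prime substring, or -1."""
    pos = prime_positions(text, pattern)
    if pos is None:
        return -1
    n, m = len(text), len(pattern)
    gaps = pos[-1] - pos[0] - m + 1
    return math.floor(100 * (m - gaps / n) / n + 0.5)
```

For the text "a-ab-c" and the pattern "abc", the forward pass alone gives positions 1, 4, 6, whereas the backward pass tightens them to 3, 4, 6.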
Embodiment 7
This embodiment slightly modifies the method of Embodiment 6, which outputs the discrete prime substring matching degree, to form a pattern matching method that outputs the discrete number and positions of the discrete prime substring. Steps a to f and step h of Embodiment 6 are unchanged, and step g is modified to:
Step g: Determine that the pattern P is a discrete prime substring of the text S; obtain the length m of the pattern P and the positions in S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ]; output the discrete number $D = g_m - g_1 - m + 1$ and the position array pos[ ], and end the matching.
Embodiment 8
This embodiment slightly modifies the method of Embodiment 7, which outputs the discrete number and positions of the discrete prime substring, to form a pattern matching method that outputs the discrete number and positions of a discrete prime substring subject to a given discrete number. Steps a to f and step h of Embodiment 7 are unchanged, and step g is modified to:
Step g: Determine that the pattern P is a discrete prime substring of the text S; obtain the length m of the pattern P, the positions in S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], and the discrete number $D = g_m - g_1 - m + 1$.
If $D \le D_0$, where $D_0$ is the discrete number given in advance, determine that the pattern P is the first discrete prime substring of the text S that meets the requirement of $D_0$, output the discrete number D and the position array pos[ ], and end the matching.
If $D > D_0$, restart the matching for the next discrete substring: set the position of the compared character in the text S to $\mathrm{Max}(\text{value at the second position of pos[ ]},\ g_m + 1 - m - D_0)$, where Max returns the larger of its two arguments, and take the character at that position as the compared character; set the position of the comparison character in the pattern P to the first character of P and take the character at that position as the comparison character; go to step b.
Embodiment 9
This embodiment slightly modifies the method of Embodiment 6, which outputs the discrete prime substring matching degree, to form a pattern matching method that outputs the matching degree of the minimum discrete prime substring. Steps a to f of Embodiment 6 are unchanged, steps g and h are modified, and step i is added:
Step g: Determine that the pattern P is a discrete prime substring of the text S. If the positions $y_1$ and $y_m$ in S of the first and last characters of the current minimum discrete prime substring have not been assigned, set $y_1$ = the first value in pos[ ] and $y_m$ = the last value in pos[ ], and go to step i. Otherwise, obtain the positions in S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ]; if $(g_m - g_1) < (y_m - y_1)$, set $y_1 = g_1$ and $y_m = g_m$ and go to step i; if $(g_m - g_1) \ge (y_m - y_1)$, go directly to step i.
Step h: If the positions $y_1$ and $y_m$ in S of the first and last characters of the current minimum discrete prime substring have not been assigned, determine that no discrete prime substring matching the pattern P exists in the text S, output the determination result "-1", and end the matching. Otherwise, obtain the length n of the text S and the length m of the pattern P, output the minimum discrete prime substring matching degree $= \mathrm{Round}(100 \times (m - (y_m - y_1 - m + 1) \div n) \div n)$, and end the matching.
Step i: If $(y_m - y_1 - m + 1) = 0$, go to step h. Otherwise, restart the matching for the next discrete substring: set the position of the compared character in the text S to the value at the second position of pos[ ] and take the character at that position as the compared character; set the position of the comparison character in the pattern P to the first character of P and take the character at that position as the comparison character; go to step b.
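The search for the minimum discrete prime substring can be sketched as follows. The function name is hypothetical. As an assumption of the sketch, the backward pass starts just before the last matched character rather than following step d of Embodiment 6 literally, and exhausting the text is treated as the jump to step h.

```python
import math


def min_prime_match_degree(text: str, pattern: str) -> int:
    """Matching degree of the minimum discrete prime substring, or -1."""
    n, m = len(text), len(pattern)
    best = None  # (y_1, y_m) of the current minimum discrete prime substring
    i = 0
    while True:
        pos = [0] * m
        j = 0
        while i < n and j < m:  # forward pass
            if text[i] == pattern[j]:
                pos[j] = i + 1
                j += 1
            i += 1
        if j < m:
            break  # text exhausted: go to step h
        k, j = pos[-1] - 2, m - 2
        while j >= 0:  # backward pass yields the discrete prime substring
            if text[k] == pattern[j]:
                pos[j] = k + 1
                j -= 1
            k -= 1
        # step g: keep the prime substring with the smallest span
        if best is None or pos[-1] - pos[0] < best[1] - best[0]:
            best = (pos[0], pos[-1])
        # step i: a contiguous match cannot be improved
        if best[1] - best[0] - m + 1 == 0:
            break
        i = pos[1] - 1  # restart at the second stored position (0-based index)
    if best is None:
        return -1  # step h: nothing was found
    gaps = best[1] - best[0] - m + 1
    return math.floor(100 * (m - gaps / n) / n + 0.5)
```

For the text "a--b-ab--" and the pattern "ab", the first prime substring found spans positions 1 to 4, but the search continues and settles on the contiguous match at positions 6 and 7.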
Embodiment 10
This embodiment slightly modifies the method of Embodiment 9, which outputs the minimum discrete prime substring matching degree, to form a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring. Steps a to f of Embodiment 9 are unchanged, and steps g, h and i are modified to:
Step g: A discrete prime substring has been found. If the position array min[ ] of the current minimum discrete prime substring has not been assigned, set min[ ] = pos[ ] and go to step i. Otherwise, obtain the positions in S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], and the positions in S of the first and last characters of the current minimum discrete prime substring, $y_1$ = the first value in min[ ] and $y_m$ = the last value in min[ ]; if $(g_m - g_1) < (y_m - y_1)$, set min[ ] = pos[ ] and go to step i; if $(g_m - g_1) \ge (y_m - y_1)$, go directly to step i.
Step h: If the position array min[ ] of the current minimum discrete prime substring has not been assigned, determine that no discrete prime substring matching the pattern P exists in the text S, output the determination result "-1", and end the matching. Otherwise, obtain the length m of the pattern P and the positions in S of the first and last characters of the current minimum discrete prime substring, $y_1$ = the first value in min[ ] and $y_m$ = the last value in min[ ]; output the discrete number $D = y_m - y_1 - m + 1$ and the position array min[ ], and end the matching.
Step i: Obtain the positions in S of the first and last characters of the current minimum discrete prime substring, $y_1$ = the first value in min[ ] and $y_m$ = the last value in min[ ]. If $(y_m - y_1 - m + 1) = 0$, go to step h. Otherwise, restart the matching for the next discrete substring: set the position of the compared character in the text S to the value at the second position of pos[ ] and take the character at that position as the compared character; set the position of the comparison character in the pattern P to the first character of P and take the character at that position as the comparison character; go to step b.
Embodiment 11
This embodiment slightly modifies the method of Embodiment 10, which outputs the discrete number and positions of the minimum discrete prime substring, to form a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring subject to a given discrete number. Steps a to g and step i of Embodiment 10 are unchanged, and step h is modified to:
Step h: If the position array min[ ] of the current minimum discrete prime substring has not been assigned, determine that no discrete prime substring matching the pattern P exists in the text S, output the determination result "-1", and end the matching. Otherwise, obtain the length m of the pattern P and the positions in S of the first and last characters of the current minimum discrete prime substring, $y_1$ = the first value in min[ ] and $y_m$ = the last value in min[ ], and obtain the discrete number $D = y_m - y_1 - m + 1$.
If $D > D_0$, where $D_0$ is the discrete number given in advance, determine that no minimum discrete prime substring satisfying $D_0$ exists in the text S, output the determination result "-1", and end the matching.
If $D \le D_0$, determine that the pattern P is the minimum discrete prime substring of the text S that meets the requirement of $D_0$, output the discrete number D and the position array min[ ], and end the matching.
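The threshold check on the minimum discrete prime substring can be sketched as below. Both function names are hypothetical, `None` stands in for the "-1" result, and the backward pass starts just before the last matched character, which is an assumption of the sketch.

```python
def min_prime_substring(text: str, pattern: str):
    """1-based positions min[] of the minimum discrete prime substring, or None."""
    n, m = len(text), len(pattern)
    best = None
    i = 0
    while True:
        pos = [0] * m
        j = 0
        while i < n and j < m:  # forward pass
            if text[i] == pattern[j]:
                pos[j] = i + 1
                j += 1
            i += 1
        if j < m:
            return best  # text exhausted
        k, j = pos[-1] - 2, m - 2
        while j >= 0:  # backward pass
            if text[k] == pattern[j]:
                pos[j] = k + 1
                j -= 1
            k -= 1
        if best is None or pos[-1] - pos[0] < best[-1] - best[0]:
            best = pos  # min[] = pos[]
        if best[-1] - best[0] - m + 1 == 0:
            return best  # a contiguous match cannot be improved
        i = pos[1] - 1  # restart at the second stored position


def min_prime_within(text: str, pattern: str, d0: int):
    """Modified step h: (D, min[]) if D <= d0, otherwise None."""
    best = min_prime_substring(text, pattern)
    if best is None:
        return None
    d = best[-1] - best[0] - len(pattern) + 1
    return (d, best) if d <= d0 else None
```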
Embodiment 12
This embodiment modifies and extends the discrete substring pattern matching method for information retrieval and information input of Embodiment 1 to form a two-dimensional discrete substring pattern matching method. It first extends the concepts of discrete substring and text to those of two-dimensional discrete substring and two-dimensional text, and then performs four steps A, B, C and D that are analogous to steps a, b, c and d of the discrete substring pattern matching method, with step C invoking steps a, b, c and d, as follows:
In the discrete substring pattern matching method for information retrieval and information input, there are multiple texts S, and the texts $S^1 S^2 \ldots S^n$ form a two-dimensional text $Ds = S^1 S^2 \ldots S^n$. A text string $S^{G_1\prime} S^{G_2\prime} \ldots S^{G_m\prime}$ ($1 \le G_1 < G_2 < \ldots < G_m \le n$) composed of discrete substrings $S^{G_i\prime}$ of any one or more texts $S^{G_i}$ of the two-dimensional text Ds is a two-dimensional discrete substring. Two-dimensional discrete substring pattern matching determines whether a two-dimensional pattern $Dp = P^1 P^2 \ldots P^m$ ($1 \le m \le n$) is a two-dimensional discrete substring of the two-dimensional text Ds. The specific steps are as follows:
Step A: Take the first text $S^1$ of the two-dimensional text Ds as the compared text and the first pattern $P^1$ of the two-dimensional pattern Dp as the comparison text.
Step B: If the compared text or the comparison text is the end flag, go to step D.
Step C: Apply steps a, b, c and d of the discrete substring pattern matching method to the compared text and the comparison text. If the result of step d is "exists", take the next text of the two-dimensional text Ds as the compared text and the next pattern of the two-dimensional pattern Dp as the comparison text, and go to step B. Otherwise, take the next text of the two-dimensional text Ds as the compared text, leave the comparison text unchanged, and go to step B.
Step D: If the comparison text is the end flag, the two-dimensional pattern Dp is a two-dimensional discrete substring of the two-dimensional text Ds; obtain the number n of texts in Ds and the number m of patterns in Dp, output the two-dimensional discrete substring simple matching degree $= \mathrm{Round}(100 \times m \div n)$, and end the matching. Otherwise, no two-dimensional discrete substring matching the two-dimensional pattern Dp exists in the two-dimensional text Ds; output the determination result "-1" and end the matching.
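Steps A to D can be sketched as below, representing the two-dimensional text and the two-dimensional pattern as Python lists of strings. That representation and the function names are assumptions of this sketch.

```python
import math


def is_discrete_substring(text: str, pattern: str) -> bool:
    """Steps a to d of Embodiment 1 for one text and one pattern."""
    j = 0
    for ch in text:
        if j < len(pattern) and ch == pattern[j]:
            j += 1
    return j == len(pattern)


def two_dim_simple_match_degree(texts, patterns) -> int:
    """Steps A to D: Round(100 * m / n), or -1 if Dp does not match Ds."""
    j = 0  # index of the comparison text (current pattern) in Dp
    for text in texts:  # step C advances the compared text every time
        if j < len(patterns) and is_discrete_substring(text, patterns[j]):
            j += 1      # step C: this pattern is matched, take the next one
    if j < len(patterns):
        return -1       # step D: some pattern was never matched
    return math.floor(100 * len(patterns) / len(texts) + 0.5)
```

With the texts "hello world", "foo" and "discrete substring", the patterns "hlo" and "dsc" match the first and third texts in order, giving a degree of 67; the same patterns in the opposite order do not match.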
Embodiment 13
This embodiment slightly modifies the two-dimensional discrete substring pattern matching method for information retrieval and information input of Embodiment 12 to form a two-dimensional discrete substring pattern matching method that outputs a precise matching degree. Steps A and B of Embodiment 12 are unchanged, and steps C and D are modified to:
Step C: Apply steps a, b, c and d of the discrete substring pattern matching method to the compared text and the comparison text. If the result of step d is "exists", store the position of the compared text in the two-dimensional text Ds in the position array pos[ ], at the index equal to the position of the comparison text in the two-dimensional pattern Dp; then take the next text of the two-dimensional text Ds as the compared text and the next pattern of the two-dimensional pattern Dp as the comparison text, and go to step B. Otherwise, take the next text of the two-dimensional text Ds as the compared text, leave the comparison text unchanged, and go to step B.
Step D: If the comparison text is not the end flag, no two-dimensional discrete substring matching the two-dimensional pattern Dp exists in the two-dimensional text Ds; output the determination result "-1" and end the matching. Otherwise, determine that the two-dimensional pattern Dp is a two-dimensional discrete substring of the two-dimensional text Ds; obtain the number n of texts in Ds, the number m of patterns in Dp, and the positions in Ds of the first and last text strings of the two-dimensional discrete substring, $G_1$ = the first value in pos[ ] and $G_m$ = the last value in pos[ ]; output the precise matching degree of the two-dimensional discrete substring $= \mathrm{Round}(100 \times (m - (G_m - G_1 - m + 1) \div n) \div n)$, and end the matching.
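The two-dimensional precise matching degree can be sketched as follows, again with lists of strings standing in for Ds and Dp. The function names are hypothetical and a non-empty pattern list is assumed.

```python
import math


def is_discrete_substring(text: str, pattern: str) -> bool:
    """Steps a to d of Embodiment 1 for one text and one pattern."""
    j = 0
    for ch in text:
        if j < len(pattern) and ch == pattern[j]:
            j += 1
    return j == len(pattern)


def two_dim_precise_match_degree(texts, patterns) -> int:
    """Round(100 * (m - (G_m - G_1 - m + 1) / n) / n), or -1 if no match."""
    n, m = len(texts), len(patterns)
    pos = []  # 1-based positions in Ds of the texts matched by each pattern
    for idx, text in enumerate(texts, start=1):
        if len(pos) < m and is_discrete_substring(text, patterns[len(pos)]):
            pos.append(idx)  # modified step C: record the text position
    if len(pos) < m:
        return -1            # modified step D: Dp was not fully matched
    gaps = pos[-1] - pos[0] - m + 1  # unmatched texts inside the span
    return math.floor(100 * (m - gaps / n) / n + 0.5)
```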
The following applies the above embodiments to sample patterns and texts and presents the results output after pattern matching, together with an overall analysis.
The pattern matching methods of Embodiments 1, 2, 3, 6 and 9 determine whether a discrete substring exists in the text and calculate a matching degree, but they do not locate the pattern; they are mainly used in the field of information input. Table 1 gives the output of one specific pattern match for each of these methods.
Table 1: Comparison of the output results of the pattern matching methods of Embodiments 1, 2, 3, 6 and 9
(Table 1 is reproduced only as an image, imgf000019_0001, in the source publication.)
Note: Underlined characters indicate the characters of the text matched by the pattern P.
As Table 1 shows, the pattern matching method of Embodiment 1 only determines whether the pattern P exists in the text S and cannot rank the retrieved texts for output. The simple-matching-degree method of Embodiment 2 can rank texts by the simple matching degree between the pattern P and the text S, but it does not reflect the effect of the discrete number of the discrete substring on the matching degree: the smaller the discrete number, the larger the matching degree should be. The precise-matching-degree method of Embodiment 3 does reflect this effect, so different discrete numbers yield different matching degrees, but the match it finds is not necessarily at the best matching position. The discrete prime substring method of Embodiment 6 finds the position of a discrete prime substring; because a discrete prime substring is a discrete substring with a smaller discrete number lying within the span of the corresponding discrete substring, it gives a more precise matching position and therefore a more precise matching degree. The minimum discrete prime substring method of Embodiment 9 finds the position of the minimum discrete prime substring in the text, so its matching degree is the most precise and the resulting ranking is the best.
Table 2 lists the time complexity of the discrete substring pattern matching methods of Embodiments 1, 2, 3, 6 and 9 above.
Table 2: Time complexity analysis of the methods of Embodiments 1, 2, 3, 6 and 9
(Table 2 is reproduced only as an image, imgf000019_0002, in the source publication.)
(Here k is the number of discrete prime substrings found during the search, and Da is the average discrete number of the discrete prime substrings found.)
Tables 1 and 2 show that sorting the retrieved texts in descending order of the matching degree obtained with the minimum discrete prime substring method of Embodiment 9 reflects most precisely how well each text matches, but this method has the highest time complexity and is also the most complex to implement. A suitable method among those above can therefore be chosen for retrieval according to the requirements of the actual problem, weighing all the relevant factors.
The pattern matching methods of Embodiments 4, 5, 7, 8, 10 and 11 determine whether a discrete substring exists in the text and, on that basis, calculate the discrete number to reflect how closely the pattern and the text are related; they also give the position in the text of each character of the pattern, that is, they locate the pattern. They are mainly used in the field of information retrieval, where they retrieve relevant texts more conveniently and effectively and indicate the exact position in the text of each character of the pattern P. The located output of one specific pattern match for each of these methods is listed below.
能定位的几种离散子串模式匹配方法的输出结果(示例)如下, 检索词即为模式。  The output results (examples) of several discrete substring pattern matching methods that can be located are as follows, and the search term is the pattern.
Text (the same in every row): "拼音为主的中文拼音、笔划、声调组合输入法, 简称拼、笔、声组合输入法" ("A Chinese input method combining pinyin, strokes and tones, with pinyin as the main component; in short, the pin, bi, sheng combined input method")
Search term (the same in every row): "拼笔声"

| Method | Given discrete number | Result |
|---|---|---|
| Existing substring locating | none | Retrieval omission |
| Embodiment 4 | none | Located: the first 拼 (in 拼音为主), 笔 in 笔划, 声 in 声调 |
| Embodiment 5 | 25 | Located: the first 拼 (in 拼音为主), 笔 in 笔划, 声 in 声调 |
| Embodiment 5 | 10 | Located: 拼 in 中文拼音, 笔 in 笔划, 声 in 声调 |
| Embodiment 5 | 5 | Figure imgf000020_0001 |
| Embodiment 5 | 1 | Retrieval omission (does not meet the given discrete number of 1) |
| Embodiment 7 | none | Located: 拼 in 中文拼音, 笔 in 笔划, 声 in 声调 |
| Embodiment 8 | 25 | Located: 拼 in 中文拼音, 笔 in 笔划, 声 in 声调 |
| Embodiment 8 | 5 | Figure imgf000020_0002 |
| Embodiment 8 | 1 | Retrieval omission (does not meet the given discrete number of 1) |
| Embodiment 10 | none | Located: 拼, 笔, 声 in 简称拼、笔、声 |
| Embodiment 11 | 25 | Located: 拼, 笔, 声 in 简称拼、笔、声 |
| Embodiment 11 | 5 | Located: 拼, 笔, 声 in 简称拼、笔、声 |
| Embodiment 11 | 1 | Retrieval omission (does not meet the given discrete number of 1) |
Embodiment 4 is a pattern matching method that outputs the discrete number and positions of a discrete substring. It locates the first discrete substring that appears in the text, places no limit on the discrete number, and suits retrieval in short texts. Embodiment 5 outputs the discrete number and positions of a discrete substring subject to a given discrete number. It improves on Embodiment 4 by locating the first discrete substring in the text that satisfies the predetermined discrete number $D_0$, and suits retrieval in both long and short texts.
Embodiment 7 is a pattern matching method that outputs the discrete number and positions of a discrete prime substring. It locates the first discrete prime substring that appears in the text. A discrete prime substring is a discrete substring with a smaller discrete number lying within the span of the corresponding discrete substring, so the location is more precise. The method places no limit on the discrete number and suits retrieval in short texts. Embodiment 8 outputs the discrete number and positions of a discrete prime substring subject to a given discrete number. It improves on Embodiment 7 by locating the first discrete prime substring in the text that satisfies the predetermined discrete number $D_0$, and suits retrieval in both long and short texts.
Embodiment 10 is a pattern matching method that outputs the discrete number and positions of the minimum discrete prime substring. It locates the minimum discrete prime substring in the text, that is, the discrete substring with the smallest discrete number anywhere in the text, so the location is the most precise. The method places no limit on the discrete number and suits retrieval in short texts. Embodiment 11 outputs the discrete number and positions of the minimum discrete prime substring subject to a given discrete number. It improves on Embodiment 10 by locating the minimum discrete prime substring only if it satisfies the predetermined discrete number $D_0$, and suits retrieval in both long and short texts.
As the examples above show, locating by pattern matching on conventional substrings misses discretely related texts. Among the discrete substring pattern matching methods of the invention, such unwanted omissions occur only when the given discrete number in Embodiment 5, 8 or 11 is too small; the other discrete substring pattern matching methods do not miss discretely related texts. The methods of Embodiments 5, 8 and 11 are the most flexible: adjusting the predetermined discrete number trades off recall, precision and locating accuracy. Under the same given discrete number, the three have the same recall and precision for relevant texts, with the last of them locating best.
Table 3 lists the time complexity of the above discrete substring pattern matching methods that can locate a match, together with the range of application of each method.
Table 3: Comprehensive analysis of the discrete substring pattern matching methods that can locate a match
Figure imgf000022_0001
(where $D_0$ is the predetermined discrete number, k is the number of discrete prime substrings found during the search, and $D_a$ is the average discrete number of the discrete prime substrings found)
Embodiments 1, 2, 3, 6 and 9 above are one-dimensional discrete substring pattern matching methods. The pattern matching methods of Embodiments 12 and 13 are two-dimensional discrete substring pattern matching methods obtained by extending and modifying the one-dimensional methods; they suit pattern matching of short texts in two-dimensional space against large vocabularies. For example, in keyboard input the pinyin of a single Chinese character, or an English word, is a one-dimensional string, whereas the pinyin of a Chinese phrase, or an English phrase, can be regarded as a two-dimensional string.
Table 4 below gives the results of applying the methods of Embodiments 12 and 13 to pattern matching in two-dimensional text.
Table 4: Results of the two-dimensional discrete substring pattern matching methods (example)
Figure imgf000022_0002
Note: underlined Chinese characters are the characters matched by the two-dimensional pattern Dp in the two-dimensional text.
The patterns and texts in Table 4 are actually the pinyin of Chinese characters; the characters themselves are shown in place of their pinyin so that the two-dimensional discrete property is easier to see. The pinyin of each character can in turn be searched one-dimensionally with letters omitted at will.
The two-dimensional discrete substring pattern matching method of Embodiment 12 can rank results by the simple matching degree of the two-dimensional discrete substring between the two-dimensional pattern Dp and the two-dimensional text Ds. It cannot, however, reflect the influence of the discrete number of the two-dimensional discrete substring on the matching degree, although a smaller discrete number should give a larger matching degree. The method of Embodiment 13 outputs an exact matching degree that does reflect this influence: different discrete numbers yield different matching degrees, so the resulting ranking is more reasonable.
Compared with substrings, the discrete substring proposed by the invention conceptually enlarges the range of relevant texts considerably. Compared with existing approximate matching based on error-distance calculation, the invention proposes a line of string-matching research based on discreteness. Pattern matching based on discrete substrings requires the characters of the pattern to appear in the text completely, in order, and possibly discretely (three properties). When the discrete number is zero, it reduces to exact substring matching. Discrete substrings include substrings; a substring is a special case of a discrete substring.
In applications, the discreteness of discrete substrings matches the way people ordinarily choose search terms: users can flexibly and simply choose search terms that are ordered and may be discrete. In terms of retrieval, discrete substring pattern matching satisfies completeness and order and can constrain the relevance of the retrieved text through the discrete number, so recall is high, accuracy is assured, and matches can be located sensibly.
The discrete substring pattern matching method solves the inherent problem, present in information retrieval for forty years, of missing discretely related texts, and has significant practical value. It applies to the following areas of information retrieval and information input: database retrieval in any script, web search, on-site search, information query, keyboard input, electronic dictionaries, operating-system file search, and so on.
In the pattern matching methods of the invention, the output "-1" represents "does not exist"; any other prescribed value may be chosen instead as the flag output for non-existence.

Claims

WO 2007/101391 Claims PCT/CN2007/000392
1. A discrete substring pattern matching method for information retrieval and information input, characterized in that: a discrete substring is a character string $S_{g_1}S_{g_2}\ldots S_{g_m}$ ($1 \le g_1 < g_2 < \ldots < g_m \le n$) composed of any one or more characters of the text $S = S_1S_2\ldots S_n$; discrete substring pattern matching determines whether the pattern $P = P_1P_2\ldots P_m$ ($1 \le m \le n$) is a discrete substring $S_{g_1}S_{g_2}\ldots S_{g_m}$ of the text S and outputs the determination result, by the following specific steps:
Step a: take the first character of the text S as the compared character and the first character of the pattern P as the comparison character;
Step b: if the compared character or the comparison character is the end marker, go to step d;
Step c: if the compared character equals the comparison character, take the next character of the text S as the compared character and the next character of the pattern P as the comparison character, and go to step b; otherwise, take the next character of the text S as the compared character, leave the comparison character unchanged, and go to step b;
Step d: if the comparison character is the end marker, determine that the pattern P is a discrete substring of the text S, output data representing the determination result "exists", and end matching; otherwise, determine that no discrete substring of the pattern P exists in the text S, output data representing the determination result "does not exist", and end matching.
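Steps a to d of claim 1 amount to a single left-to-right scan with two cursors. The following is a minimal Python sketch, assuming the end marker can be modelled by running off the end of either string; the function name `is_discrete_substring` is illustrative and does not come from the claims.

```python
def is_discrete_substring(text: str, pattern: str) -> bool:
    """Return True if the characters of `pattern` all occur in `text`,
    in order, with arbitrary gaps between them."""
    i = 0  # cursor on the compared character of the text (step a)
    j = 0  # cursor on the comparison character of the pattern (step a)
    # Step b: leaving either string plays the role of the end marker.
    while i < len(text) and j < len(pattern):
        # Step c: on a match advance both cursors, otherwise only the text.
        if text[i] == pattern[j]:
            j += 1
        i += 1
    # Step d: the pattern is a discrete substring iff it was fully consumed.
    return j == len(pattern)
```

The scan touches each text character at most once, so the cost is linear in the length of the text. Note that this sketch treats an empty pattern as matching, whereas the claim assumes a pattern of at least one character.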
2. The discrete substring pattern matching method for information retrieval and information input according to claim 1, characterized in that:
Step d: if the comparison character is the end marker, determine that the pattern P is a discrete substring of the text S, obtain the length n of the text S and the length m of the pattern P, output the discrete substring simple matching degree $= \mathrm{Round}(100 \times m \div n)$, and end matching; otherwise, determine that no discrete substring of the pattern P exists in the text S, output the determination result "-1", and end matching.
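The simple matching degree of claim 2 depends only on the two lengths once a match is known to exist. A hedged Python sketch follows; `simple_matching_degree` is an illustrative name, and Python's built-in `round` (which rounds exact halves to the nearest even integer) stands in for the claim's `Round`, whose tie-breaking rule the claim does not specify.

```python
def simple_matching_degree(text: str, pattern: str) -> int:
    """Return Round(100 * m / n) if `pattern` is a discrete substring of
    `text`, otherwise -1 (the "does not exist" flag)."""
    i = j = 0
    while i < len(text) and j < len(pattern):
        if text[i] == pattern[j]:
            j += 1
        i += 1
    # An empty pattern is outside the claim's range 1 <= m <= n.
    if not pattern or j < len(pattern):
        return -1
    return round(100 * len(pattern) / len(text))
```

Because the degree ignores where the matched characters sit, two texts of equal length containing the pattern score the same regardless of how scattered the match is.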
3. The discrete substring pattern matching method for information retrieval and information input according to claim 1, characterized in that:
Step c: if the compared character equals the comparison character, store the position of the compared character in the text S in the position array pos[ ], at the same index as the position of the comparison character in the pattern P, take the next character of the text S as the compared character and the next character of the pattern P as the comparison character, and go to step b; otherwise, take the next character of the text S as the compared character, leave the comparison character unchanged, and go to step b;
Step d: if the comparison character is not the end marker, determine that no discrete substring of the pattern P exists in the text S, output the determination result "-1", and end matching; otherwise, determine that the pattern P is a discrete substring of the text S, obtain the length n of the text S and the length m of the pattern P, obtain the positions in the text S of the first and last characters of the discrete substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], output the exact matching degree, which reflects how dispersed the discrete substring is, $= \mathrm{Round}(100 \times (m - (g_m - g_1 - m + 1) \div n) \div n)$, and end matching.
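The position array and the exact matching degree of claim 3 can be sketched as follows. This is an illustration under assumptions: positions are 1-based as in the claim, the function names are hypothetical, and Python's `round` again stands in for the unspecified `Round`.

```python
from typing import List, Optional


def first_discrete_substring(text: str, pattern: str) -> Optional[List[int]]:
    """Return the 1-based positions pos[] of the first discrete substring
    found by the left-to-right scan, or None if there is none."""
    pos: List[int] = []
    j = 0
    for i, ch in enumerate(text, start=1):
        if j < len(pattern) and ch == pattern[j]:
            pos.append(i)  # step c: record where pattern[j] matched
            j += 1
    return pos if pattern and j == len(pattern) else None


def exact_matching_degree(text: str, pattern: str) -> int:
    """Return Round(100 * (m - D / n) / n) with D = g_m - g_1 - m + 1,
    or -1 if the pattern is not a discrete substring of the text."""
    pos = first_discrete_substring(text, pattern)
    if pos is None:
        return -1
    n, m = len(text), len(pattern)
    d = pos[-1] - pos[0] - m + 1  # discrete number: gaps inside the match
    return round(100 * (m - d / n) / n)
```

With a discrete number of zero the formula reduces to the simple matching degree; every additional gap character lowers the score slightly.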
4. The discrete substring pattern matching method for information retrieval and information input according to claim 3, characterized in that:
Step d: if the comparison character is not the end marker, determine that no discrete substring of the pattern P exists in the text S, output the determination result "-1", and end matching; otherwise, the pattern P is a discrete substring $S_{g_1}S_{g_2}\ldots S_{g_m}$ of the text S; obtain the length m of the pattern P, obtain the positions in the text S of the first and last characters of the discrete substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], output the discrete number of the discrete substring $D = g_m - g_1 - m + 1$, output the position array pos[ ], and end matching.
5. The discrete substring pattern matching method for information retrieval and information input according to claim 4, characterized in that:
Step d: if the comparison character is not the end marker, determine that no discrete substring of the pattern P exists in the text S, output the determination result "-1", and end matching; otherwise, the pattern P is a discrete substring $S_{g_1}S_{g_2}\ldots S_{g_m}$ of the text S; obtain the length m of the pattern P, the positions in the text S of the first and last characters of the discrete substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], and the discrete number of the discrete substring $D = g_m - g_1 - m + 1$;
if $D \le D_0$, where $D_0$ is the predetermined discrete number, determine that the pattern P is the first discrete substring of the text S that meets the requirement of the discrete number $D_0$, output the discrete number D, output the position array pos[ ], and end matching;
if $D > D_0$, restart matching for the next discrete substring: change the position of the compared character in the text S to (current position of the compared character) $- m - D_0$, and take the character at that position as the compared character; change the position of the comparison character in the pattern P to the first character of the pattern P, and take the character at that position as the comparison character; go to step b.
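The restart rule of claim 5 can be sketched in Python as below. Assumptions: positions are 1-based, the acceptance test is read as D ≤ D0 (the complement of the D > D0 branch), and the name `first_within_discrete_number` is illustrative only.

```python
from typing import List, Optional, Tuple


def first_within_discrete_number(
    text: str, pattern: str, d0: int
) -> Optional[Tuple[int, List[int]]]:
    """Return (D, pos) for the first discrete substring whose discrete
    number D is at most d0, or None if there is none."""
    n, m = len(text), len(pattern)
    if m == 0:
        return None
    pos = [0] * m
    i, j = 1, 0  # i: 1-based text position, j: matched pattern characters
    while i <= n:
        if text[i - 1] == pattern[j]:
            pos[j] = i
            j += 1
        i += 1
        if j == m:  # a complete discrete substring was found
            d = pos[-1] - pos[0] - m + 1
            if d <= d0:
                return d, pos
            # Restart: current compared position minus m minus d0.
            # Since D > d0, this lands strictly after pos[0], so the
            # search always makes progress.
            i = i - m - d0
            j = 0
    return None
```

A small `d0` forces the scan to skip loose matches and keep looking for a tighter one further right; if none exists, the text is reported as not matching.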
6. The discrete substring pattern matching method for information retrieval and information input according to claim 3, characterized in that:
Step d: if the comparison character is not the end marker, go to step h; otherwise, move the position of the compared character in the text S back by 2 character positions and take the character at that position as the compared character, and move the position of the comparison character in the pattern P back by 2 character positions and take the character at that position as the comparison character;
then perform the following steps e, f, g and h in order:
Step e: if the first character of the pattern P has already been compared, go to step g;
Step f: if the compared character equals the comparison character, store the position of the compared character in the text S in the position array pos[ ], at the same index as the position of the comparison character in the pattern P, take the previous character of the text S as the compared character and the previous character of the pattern P as the comparison character, and go to step e; otherwise, take the previous character of the text S as the compared character, leave the comparison character unchanged, and go to step e;
Step g: determine that the pattern P is a discrete prime substring of the text S, obtain the length n of the text S and the length m of the pattern P, obtain the positions in the text S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], output the discrete prime substring matching degree $= \mathrm{Round}(100 \times (m - (g_m - g_1 - m + 1) \div n) \div n)$, and end matching;
Step h: determine that no discrete prime substring of the pattern P exists in the text S, output the determination result "-1", and end matching.
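The backward pass of claim 6 keeps the last matched character fixed and pulls every earlier pattern character as far right as it will go. A minimal sketch, assuming 1-based positions and the hypothetical name `first_discrete_prime_substring`:

```python
from typing import List, Optional


def first_discrete_prime_substring(text: str, pattern: str) -> Optional[List[int]]:
    """Return 1-based positions of the first discrete prime substring:
    forward scan to find the earliest end, then backward scan to tighten."""
    m = len(pattern)
    if m == 0:
        return None
    # Forward scan (steps a-c): earliest position where the match completes.
    pos: List[int] = []
    for i, ch in enumerate(text, start=1):
        if ch == pattern[len(pos)]:
            pos.append(i)
            if len(pos) == m:
                break
    if len(pos) < m:
        return None  # step h: no discrete substring at all
    # Backward scan (steps e-f): start one position before the last match
    # and one pattern character before the last, i.e. "back by 2" from the
    # cursors that had already stepped past the final match.
    i, j = pos[-1] - 1, m - 2
    while j >= 0:
        if text[i - 1] == pattern[j]:
            pos[j] = i
            j -= 1
        i -= 1
    return pos
```

The backward loop cannot run off the start of the text, because the forward scan has already shown that the remaining pattern characters occur before the fixed end position.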
7. The discrete substring pattern matching method for information retrieval and information input according to claim 6, characterized in that:
Step g: determine that the pattern P is a discrete prime substring of the text S, obtain the length m of the pattern P, obtain the positions in the text S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], output the discrete number $D = g_m - g_1 - m + 1$, output the position array pos[ ], and end matching.
8. The discrete substring pattern matching method for information retrieval and information input according to claim 7, characterized in that:
Step g: determine that the pattern P is a discrete prime substring of the text S, obtain the length m of the pattern P, the positions in the text S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], and the discrete number $D = g_m - g_1 - m + 1$;
if $D \le D_0$, where $D_0$ is the predetermined discrete number, determine that the pattern P is the first discrete prime substring of the text S that meets the requirement of the discrete number $D_0$, output the discrete number D, output the position array pos[ ], and end matching;
if $D > D_0$, restart matching for the next discrete substring: change the position of the compared character in the text S to $\mathrm{Max}(\text{the value at the second position of pos[ ]},\ g_m + 1 - m - D_0)$, and take the character at that position as the compared character; change the position of the comparison character in the pattern P to the first character of the pattern P, and take the character at that position as the comparison character; go to step b.
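Claim 8 combines the backward tightening of claim 6 with a restart rule. The sketch below is one possible reading, assuming 1-based positions, an acceptance test of D ≤ D0, and a non-negative `d0`; the name `prime_within_discrete_number` is hypothetical.

```python
from typing import List, Optional, Tuple


def prime_within_discrete_number(
    text: str, pattern: str, d0: int
) -> Optional[Tuple[int, List[int]]]:
    """Return (D, pos) for the first discrete prime substring whose
    discrete number D is at most d0, or None if there is none."""
    n, m = len(text), len(pattern)
    if m == 0 or d0 < 0:
        return None
    start = 1
    while True:
        # Forward scan from `start` (steps b-c).
        pos: List[int] = []
        i = start
        while i <= n and len(pos) < m:
            if text[i - 1] == pattern[len(pos)]:
                pos.append(i)
            i += 1
        if len(pos) < m:
            return None
        # Backward scan (steps e-f) to tighten the match.
        k, j = pos[-1] - 1, m - 2
        while j >= 0:
            if text[k - 1] == pattern[j]:
                pos[j] = k
                j -= 1
            k -= 1
        d = pos[-1] - pos[0] - m + 1
        if d <= d0:
            return d, pos
        # D > d0 >= 0 implies m >= 2, so pos[1] exists. The new start is
        # always beyond pos[0], which guarantees termination.
        start = max(pos[1], pos[-1] + 1 - m - d0)
```

Taking the maximum of the two candidates skips as much text as the claim allows while never moving back to or before the first character of the rejected match.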
9. The discrete substring pattern matching method for information retrieval and information input according to claim 6, characterized in that:
Step g: determine that the pattern P is a discrete prime substring of the text S; if the positions $y_1$, $y_m$ in the text S of the first and last characters of the current minimum discrete prime substring have not yet been assigned, set $y_1$ = the first value in pos[ ] and $y_m$ = the last value in pos[ ], and go to step i; otherwise, obtain the positions in the text S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ]; if $(g_m - g_1) < (y_m - y_1)$, set $y_1 = g_1$ and $y_m = g_m$ and go to step i; if $(g_m - g_1) \ge (y_m - y_1)$, go directly to step i;
Step h: if the positions $y_1$, $y_m$ in the text S of the first and last characters of the current minimum discrete prime substring have not been assigned, determine that no discrete prime substring of the pattern P exists in the text S, output the determination result "-1", and end matching; otherwise, obtain the length n of the text S and the length m of the pattern P, output the minimum discrete prime substring matching degree $= \mathrm{Round}(100 \times (m - (y_m - y_1 - m + 1) \div n) \div n)$, and end matching;
Step i: if $(y_m - y_1 - m + 1) = 0$, go to step h; otherwise, restart matching for the next discrete substring: change the position of the compared character in the text S to the value at the second position of pos[ ], and take the character at that position as the compared character; change the position of the comparison character in the pattern P to the first character of the pattern P, and take the character at that position as the comparison character; go to step b.
10. The discrete substring pattern matching method for information retrieval and information input according to claim 9, characterized in that:
Step g: a discrete prime substring has been found; if the position array min[ ] of the current minimum discrete prime substring has not yet been assigned, set min[ ] = pos[ ] and go to step i; otherwise, obtain the positions in the text S of the first and last characters of the discrete prime substring, $g_1$ = the first value in pos[ ] and $g_m$ = the last value in pos[ ], and the positions in the text S of the first and last characters of the current minimum discrete prime substring, $y_1$ = the first value in min[ ] and $y_m$ = the last value in min[ ]; if $(g_m - g_1) < (y_m - y_1)$, set min[ ] = pos[ ] and go to step i; if $(g_m - g_1) \ge (y_m - y_1)$, go directly to step i;
Step h: if the position array min[ ] of the current minimum discrete prime substring has not been assigned, determine that no discrete prime substring of the pattern P exists in the text S, output the determination result "-1", and end matching; otherwise, obtain the length m of the pattern P, obtain the positions in the text S of the first and last characters of the current minimum discrete prime substring, $y_1$ = the first value in min[ ] and $y_m$ = the last value in min[ ], output the discrete number $D = y_m - y_1 - m + 1$, output the position array min[ ], and end matching;
Step i: obtain the positions in the text S of the first and last characters of the current minimum discrete prime substring, $y_1$ = the first value in min[ ] and $y_m$ = the last value in min[ ]; if $(y_m - y_1 - m + 1) = 0$, go to step h; otherwise, restart matching for the next discrete substring: change the position of the compared character in the text S to the value at the second position of pos[ ], and take the character at that position as the compared character; change the position of the comparison character in the pattern P to the first character of the pattern P, and take the character at that position as the comparison character; go to step b.
11、根据权利要求 10所述的一种用于信息检索与信息输入的离散子串模式匹配方法, 其 特征在于, 所述的:  11. A discrete substring pattern matching method for information retrieval and information input according to claim 10, wherein:
h步 如果当前最小离散素子串的位置数組 min [】 未被赋值, 则判定文本 S中不存在模 式 P的离散素子串, 输出判定结果 "-1", 结束匹配;否则, 求出模式 P的长度 m, 求出当前 最小离散素子串首字符及末字符在文本 S 中的位置: =ηιίη [ ]中的第一个数值, ym=min [ ] 中的最后一个数值, 求出离散数 D=yn-y -m+l; Step h If the position array min [] of the current smallest discrete prime substring is not assigned, it is determined that there is no discrete substring of the pattern P in the text S, and the determination result "-1" is output, and the matching is ended; otherwise, the mode P is obtained. The length m, find the position of the first and last characters of the current smallest discrete substring in the text S: =ηιίη [the first value in [ ], the last value in y m =min [ ], find the discrete Number D=y n -y -m+l;
如果离散数 D>预先给定的离散数 Do, 则判定文本 S 中不存在满足预先给定的离散数 Do 的模式 P的最小离散素子串, 输出判定结果 "-1" , 结束匹配;  If the discrete number D> is a predetermined discrete number Do, it is determined that there is no minimum discrete element substring of the pattern P satisfying the predetermined discrete number Do in the text S, and the determination result "-1" is output, and the matching is ended;
If the discrete number D ≤ the predetermined discrete number D_0, determine that the pattern P is a minimum discrete prime substring of the text S that meets the requirement of the predetermined discrete number D_0, output the discrete number D and the position array min[ ], and end matching.
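The discrete-number check of steps i and h can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the claimed step sequence: it assumes that a discrete substring means the characters of P occur in S in order but not necessarily adjacently, it finds the tightest occurrence by a greedy scan from every occurrence of the first pattern character instead of the pos[ ]/min[ ] bookkeeping of steps a to i, and it treats D = D_0 as acceptable. The function names `greedy_positions` and `min_discrete_match` are hypothetical.

```python
def greedy_positions(text, pattern, start):
    """Match pattern as an in-order, possibly gapped sequence in text,
    scanning from index start. Returns 1-based positions, or None."""
    positions = []
    i = start
    for ch in pattern:
        i = text.find(ch, i)
        if i == -1:
            return None
        positions.append(i + 1)  # the claims count positions from 1
        i += 1
    return positions


def min_discrete_match(text, pattern, d0):
    """Return (D, positions) for the tightest match with D <= d0, else -1."""
    m = len(pattern)
    if m == 0:
        return -1
    best = None  # plays the role of the min[ ] array (assumption)
    for start, ch in enumerate(text):
        if ch != pattern[0]:
            continue
        pos = greedy_positions(text, pattern, start)
        if pos is None:
            break  # if this start fails, every later start fails too
        if best is None or pos[-1] - pos[0] < best[-1] - best[0]:
            best = pos
    if best is None:
        return -1  # min[ ] was never assigned
    d = best[-1] - best[0] - m + 1  # discrete number D = y_m - y_1 - m + 1
    return (d, best) if d <= d0 else -1
```

For S = "abxcabc" and P = "abc", the tightest occurrence lies at positions 5, 6, 7, so D = 7 - 5 - 3 + 1 = 0 and the match is accepted for any D_0 ≥ 0. For S = "axbxc" the only occurrence is at positions 1, 3, 5, giving D = 2, which is rejected when D_0 = 1.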
12. The discrete substring pattern matching method for information retrieval and information input according to claim 1, wherein: there are a plurality of texts S, and the texts S^1, S^2, ..., S^n form a two-dimensional text Ds = "S^1 S^2 ... S^n"; a text string "S^G1' S^G2' ... S^Gm'" (where 1 ≤ G1 < G2 < ... < Gm ≤ n), composed of discrete substrings S^Gi' of any one or more texts S^Gi of the two-dimensional text Ds, is a two-dimensional discrete substring; two-dimensional discrete substring pattern matching determines whether the two-dimensional pattern Dp = "P^1 P^2 ... P^m" (1 ≤ m ≤ n) is a two-dimensional discrete substring of the two-dimensional text Ds, and its specific steps are as follows:
Step A: Take the first text S^1 of the two-dimensional text Ds as the compared text, and take the first pattern P^1 of the two-dimensional pattern Dp as the comparison text;
Step B: If the compared text or the comparison text is the end mark, go to step D;
Step C: Apply steps a, b, c and d of the discrete substring pattern matching method to the compared text and the comparison text. If the result of step d is that a match exists, take the next text of the two-dimensional text Ds as the compared text, take the next pattern of the two-dimensional pattern Dp as the comparison text, and go to step B; otherwise, take the next text of the two-dimensional text Ds as the compared text, leave the comparison text unchanged, and go to step B;
Step D: If the comparison text is the end mark, the two-dimensional pattern Dp is a two-dimensional discrete substring of the two-dimensional text Ds; obtain the number n of texts in the two-dimensional text Ds and the number m of patterns in the two-dimensional pattern Dp, output the simple two-dimensional discrete substring matching degree = Round(100 × m ÷ n), and end matching; otherwise, no two-dimensional discrete substring of the two-dimensional pattern Dp exists in the two-dimensional text Ds; output the result "-1" and end matching.
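Steps A to D can be sketched in Python as follows. This is a hedged illustration only: it assumes that the single-text test of steps a to d amounts to checking that the characters of a pattern occur in a text in order (not necessarily adjacently), it represents Ds and Dp as Python lists whose exhaustion stands in for the end mark, and it uses Python's built-in `round`, which rounds halves to the nearest even integer, because the claim does not say how Round treats ties. The names `is_discrete_substring` and `simple_2d_match` are hypothetical.

```python
def is_discrete_substring(text, pattern):
    """True if the characters of pattern occur in text in order
    (assumed stand-in for steps a to d of the single-text method)."""
    it = iter(text)
    return all(ch in it for ch in pattern)


def simple_2d_match(texts, patterns):
    """Return the simple matching degree Round(100 * m / n), or -1."""
    n, m = len(texts), len(patterns)
    if n == 0 or m == 0:
        return -1
    j = 0  # index of the current comparison text (pattern) in Dp
    for text in texts:  # step C: always advance the compared text
        if j == m:  # step B: Dp is exhausted
            break
        if is_discrete_substring(text, patterns[j]):
            j += 1  # step C: advance the pattern only on a match
    if j < m:  # step D: Dp was not exhausted
        return -1
    return round(100 * m / n)
```

With the texts "information retrieval", "pattern matching", "discrete substring" and the patterns "info", "dsub", the first pattern matches text 1 and the second matches text 3, so m = 2, n = 3 and the simple matching degree is 67.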
13. The discrete substring pattern matching method for information retrieval and information input according to claim 12, wherein:
Step C: Apply steps a, b, c and d of the discrete substring pattern matching method to the compared text and the comparison text. If the result of step d is that a match exists, store the position of the compared text within the two-dimensional text Ds in the position array pos[ ], at the same index as the position of the comparison text within the two-dimensional pattern Dp; then take the next text of the two-dimensional text Ds as the compared text, take the next pattern of the two-dimensional pattern Dp as the comparison text, and go to step B; otherwise, take the next text of the two-dimensional text Ds as the compared text, leave the comparison text unchanged, and go to step B;
Step D: If the comparison text is not the end mark, no two-dimensional discrete substring of the two-dimensional pattern Dp exists in the two-dimensional text Ds; output the result "-1" and end matching; otherwise, determine that the two-dimensional pattern Dp is a two-dimensional discrete substring of the two-dimensional text Ds, obtain the number n of texts in the two-dimensional text Ds and the number m of patterns in the two-dimensional pattern Dp, find the positions in the two-dimensional text Ds of the first and last text strings of the two-dimensional discrete substring (G_1 = the first value in pos[ ], G_m = the last value in pos[ ]), output the precise two-dimensional discrete substring matching degree = Round(100 × (m - (G_m - G_1 - m + 1) ÷ n) ÷ n), and end matching.
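The precise matching degree can be sketched in the same way. The Python below is an illustration under the same assumptions as before: the single-text test of steps a to d is modelled as an in-order character check, list exhaustion stands in for the end mark, and Python's built-in `round` (ties to even) stands in for the unspecified Round. The names `is_discrete_substring` and `precise_2d_match` are hypothetical.

```python
def is_discrete_substring(text, pattern):
    """True if the characters of pattern occur in text in order
    (assumed stand-in for steps a to d of the single-text method)."""
    it = iter(text)
    return all(ch in it for ch in pattern)


def precise_2d_match(texts, patterns):
    """Return (matching degree, pos) per the formula in step D, or -1."""
    n, m = len(texts), len(patterns)
    if n == 0 or m == 0:
        return -1
    pos = []  # pos[k] = 1-based position in Ds of the text matching P^(k+1)
    for index, text in enumerate(texts, start=1):
        if len(pos) == m:  # every pattern has been matched
            break
        if is_discrete_substring(text, patterns[len(pos)]):
            pos.append(index)
    if len(pos) < m:
        return -1
    g_first, g_last = pos[0], pos[-1]
    gap = g_last - g_first - m + 1  # texts skipped inside the matched span
    degree = round(100 * (m - gap / n) / n)
    return degree, pos
```

With the texts "information retrieval", "pattern matching", "discrete substring" and the patterns "info", "dsub", pos[ ] is [1, 3], so G_m - G_1 - m + 1 = 1 and the degree is Round(100 × (2 - 1 ÷ 3) ÷ 3) = 56, lower than the simple degree of 67 because one text was skipped between the two matches.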
PCT/CN2007/000392 2006-03-07 2007-02-05 A discrete substring matching method for information searching and information inputting WO2007101391A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN 200610020427 CN1811776A (en) 2006-03-07 2006-03-07 Random default substring mode matching judging and positioning method used for information inputting and retrieving
CN200610020427.3 2006-03-07
CN200610021280.X 2006-06-27
CN 200610021280 CN1869983A (en) 2006-06-27 2006-06-27 Generalized substring pattern matching method for information retrieval and information input

Publications (1)

Publication Number Publication Date
WO2007101391A1 true WO2007101391A1 (en) 2007-09-13

Family

ID=38474598

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2007/000392 WO2007101391A1 (en) 2006-03-07 2007-02-05 A discrete substring matching method for information searching and information inputting

Country Status (1)

Country Link
WO (1) WO2007101391A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5983223A (en) * 1997-05-06 1999-11-09 Novell, Inc. Method and apparatus for determining a longest matching prefix from a dictionary of prefixes
CN1559072A (en) * 2001-09-30 2004-12-29 Reverse searching system and method
CN1641631A (en) * 2004-01-13 2005-07-20 中国科学院计算技术研究所 Machine translation automatic evaluating method and system thereof

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YAN W. ET AL.: "Pattern matching algorithm for string", DATA STRUCTURE (C LANGUAGE), TSINGHUA UNIVERSITY PUBLISHING HOUSE, April 1997 (1997-04-01), pages 79 - 80 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application
NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07710866

Country of ref document: EP

Kind code of ref document: A1