CN116450896A - Text fuzzy matching method, device, electronic equipment and readable storage medium - Google Patents

Info

Publication number
CN116450896A
Authority
CN
China
Prior art keywords
matching
score
text
cost
initial
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310233802.6A
Other languages
Chinese (zh)
Inventor
李滨君
庞建新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ubtech Robotics Corp
Original Assignee
Ubtech Robotics Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/904 Browsing; Visualisation therefor

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a text fuzzy matching method, a text fuzzy matching device, electronic equipment and a readable storage medium, and belongs to the technical field of computers. The method comprises the following steps: obtaining a corresponding initial matching score according to the modification cost between each character of the matching template and each character of the text to be matched; obtaining final matching scores between subsequences of the matching template and subsequences of the text to be matched according to the initial matching scores, the insertion cost and the deletion cost; determining the highest matching score among the final matching scores, and determining a single word average score according to the number of characters of the matching template and the highest matching score; if the single word average score is larger than a preset matching threshold, determining a matching path according to the matching template; editing the text to be matched according to the matching path to obtain a modified text; and matching the modified text with the matching template. The method is thus more tolerant of errors introduced by the front-end system and reduces their effect on text matching.

Description

Text fuzzy matching method, device, electronic equipment and readable storage medium
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to a text fuzzy matching method, a device, an electronic apparatus, and a readable storage medium.
Background
In existing text processing systems, matching remains a stable and effective recognition method. The strictest form is exact matching, which requires the two matched texts to agree completely in character order and character form, with no interfering characters anywhere in the sequence. Fuzzy matching instead relies on some assumption or precondition and tolerates a certain amount of error, so complete agreement is not required. Matching operations fall roughly into two kinds according to the relationship being matched. One is keyword matching, in which one of the matched objects is contained in the other. The other is full sentence matching, in which similarity is usually the criterion for the degree of matching. When the similarity is 1, the two texts are exactly the same and the match is exact. Unlike general similarity discrimination, full sentence matching is characterized in that the matching relation is reflexive and the text used for comparison comes from a credible and controllable source.
Most existing keyword matching methods rest on the assumption of character string similarity. For example, if the character string is "c% a% b% cda" and the content to be matched is "abc", exact matching finds no corresponding content. But if "%" is regarded as an interference character and a certain amount of interference is allowed within the content to be matched, the corresponding result can be found by fuzzy matching. Another fuzzy matching method tolerates deletions in the content to be matched: when the character string is "Mik has go" and the content to be matched is "Mike", "Mik" and "Mike" differ only by the letter "e", so fuzzy matching can be performed according to rules. Such rules cope well with extra-character and missing-character errors caused by human input mistakes, but they handle errors introduced by speech recognition and optical character recognition poorly.
In speech recognition, misrecognition may result from inaccurate pronunciation, for example "that's" recognized as "thas" because of linked pronunciation in English, or "tiger" and "old" confused because of similar pronunciation in Chinese. It may also result from a recognition lexicon that is not updated in time, as when "new crown" (新冠) is recognized as the homophone "new officer" (新官). In optical character recognition, visually similar characters such as "rabbit" (兔) and "exempt" (免) may be confused. When such words are the content to be matched, the misrecognized result is difficult to identify from the degree of string similarity alone.
Existing full sentence matching techniques often depend on character intersection sets or deep learning models, and perform poorly when characters are misrecognized: in a character intersection set, similar characters are counted as completely different items, and in a deep learning model they may carry completely different meanings, so the extracted information is incorrect and the final result suffers. In summary, existing text matching techniques are strongly affected by errors from the front-end system.
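The two rule-based approaches described above can be sketched as follows. This is only an illustration under stated assumptions: the function names `contains_ignoring`, `matches_with_one_deletion` and `find_keyword` are hypothetical, and treating the space character as interference alongside "%" is an assumption made so that the string can be used exactly as quoted.

```python
def contains_ignoring(text, keyword, interference="% "):
    """Keyword match that skips interference characters (assumed: '%' and space)."""
    cleaned = "".join(ch for ch in text if ch not in interference)
    return keyword in cleaned


def matches_with_one_deletion(token, keyword):
    """True if token equals keyword, or keyword with exactly one character deleted."""
    if token == keyword:
        return True
    return any(keyword[:i] + keyword[i + 1:] == token for i in range(len(keyword)))


def find_keyword(text, keyword):
    """Return the whitespace-separated tokens that match keyword under the deletion rule."""
    return [tok for tok in text.split() if matches_with_one_deletion(tok, keyword)]


print("abc" in "c% a% b% cda")                   # exact matching fails
print(contains_ignoring("c% a% b% cda", "abc"))  # succeeds once '%' is ignored
print(find_keyword("Mik has go", "Mike"))        # "Mik" differs from "Mike" only by "e"
```

Rules of this kind look only at string shape, which is why they cannot tell a homophone or a look-alike character from an unrelated one.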
Disclosure of Invention
In order to solve the technical problems, the application provides a text fuzzy matching method, a text fuzzy matching device, electronic equipment and a readable storage medium.
In a first aspect, the present application provides a text fuzzy matching method, where the method includes:
acquiring corresponding initial matching scores according to the modification cost of each character of the matching template and each character of the text to be matched;
obtaining a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each initial matching score, the insertion cost and the deletion cost;
determining the highest matching score from a plurality of final matching scores, and determining a single word average score according to the number of characters of the matching template and the highest matching score;
if the average score of the single words is larger than a preset matching threshold value, determining a matching path according to the matching template;
editing the text to be matched according to the matching path to obtain a modified text;
and carrying out matching processing on the modified text and the matching template.
In an embodiment, the obtaining the corresponding initial matching score according to the modification cost of each character of the matching template and each character of the text to be matched includes:
if the modification cost is greater than a preset cost threshold, determining the initial matching score as 0;
and if the modification cost is smaller than or equal to a preset cost threshold value, taking the modification cost as the initial matching score.
In an embodiment, the obtaining a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each initial matching score, the insertion cost, and the deletion cost includes:
traversing each initial matching score, and determining a first adjacent initial matching score, a second adjacent initial matching score and a third adjacent initial matching score of the current initial matching score;
calculating a sum value of the first adjacent initial matching score and the current initial matching score, a first difference value between the second adjacent initial matching score and the insertion cost, and a second difference value between the third adjacent initial matching score and the deletion cost;
and determining the maximum value of the sum value, the first difference value and the second difference value as the final matching score.
In an embodiment, the calculating the sum value according to the first adjacent initial matching score and the current initial matching score includes:
and when the first adjacent initial matching score is not 0, calculating the sum according to the first adjacent initial matching score, the current initial matching score and a preset compensation score.
In one embodiment, the method further comprises:
if the current initial matching score does not have the first adjacent initial matching score, the second adjacent initial matching score and the third adjacent initial matching score, setting the first adjacent initial matching score, the second adjacent initial matching score and the third adjacent initial matching score of the current initial matching score to be 0.
In an embodiment, obtaining the modification cost of each character of the matching template and each character of the text to be matched includes:
if the error source is an image recognition error, determining single word similarity according to the character pattern related characteristics, and taking the single word similarity as the modification cost;
and if the error source is a voice recognition error, adopting the phoneme similarity as the modification cost.
In an embodiment, the editing the text to be matched according to the matching path to obtain a modified text includes:
determining modification operation and matching character string substrings according to the matching paths;
and editing the text to be matched according to the modification operation and the matching character string substring to obtain the modification text.
In a second aspect, the present application provides a text fuzzy matching device, the device including:
the first acquisition module is used for acquiring corresponding initial matching scores according to the modification cost of each character of the matching template and each character of the text to be matched;
the second acquisition module is used for acquiring a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each initial matching score, the insertion cost and the deletion cost;
the first determining module is used for determining the highest matching score from a plurality of final matching scores, and determining a single word average score according to the number of characters of the matching template and the highest matching score;
the second determining module is used for determining a matching path according to the matching template if the single word average score is larger than a preset matching threshold;
the editing module is used for editing the text to be matched according to the matching path to obtain a modified text;
and the matching module is used for matching the modified text with the matching template.
In a third aspect, the present application provides an electronic device comprising a memory and a processor, the memory being configured to store a computer program which, when executed by the processor, performs the text fuzzy matching method provided in the first aspect.
In a fourth aspect, the present application provides a computer readable storage medium storing a computer program which, when run on a processor, performs the text fuzzy matching method provided in the first aspect.
According to the text fuzzy matching method, device, electronic equipment and readable storage medium provided by the present application, corresponding initial matching scores are obtained according to the modification cost of each character of the matching template and each character of the text to be matched; a final matching score between the subsequence of the matching template and the subsequence of the text to be matched is obtained according to each initial matching score, the insertion cost and the deletion cost; the highest matching score is determined from a plurality of final matching scores, and a single word average score is determined according to the number of characters of the matching template and the highest matching score; if the single word average score is larger than a preset matching threshold, a matching path is determined according to the matching template; the text to be matched is edited according to the matching path to obtain a modified text; and the modified text is matched with the matching template. The method therefore tolerates front-end system errors better, is less likely to miss a recognition, makes text matching less susceptible to front-end system errors, and provides greater robustness for both the local module and the whole system.
Drawings
In order to more clearly illustrate the technical solutions of the present application, the drawings that are required for the embodiments will be briefly described, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope of protection of the present application. Like elements are numbered alike in the various figures.
FIG. 1 shows a flow diagram of the text fuzzy matching method provided by the present application;
FIG. 2 shows a flow diagram of the text fuzzy matching method provided by the present application;
FIG. 3 shows a flow diagram of the text fuzzy matching method provided by the present application;
FIG. 4 shows a schematic structural diagram of the text fuzzy matching device provided by the present application.
Description of main reference numerals: 400-text fuzzy matching device, 401-first acquisition module, 402-second acquisition module, 403-first determination module, 404-second determination module, 405-editing module, 406-matching module.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application.
The components of the present application, as generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
In the following, the terms "comprises", "comprising", "having" and their cognates, as used in the various embodiments of the present application, are intended only to indicate a particular feature, number, step, operation, element, component, or combination of the foregoing, and should not be understood as excluding the existence of, or the possibility of adding, one or more other features, numbers, steps, operations, elements, components, or combinations of the foregoing.
Furthermore, the terms "first," "second," "third," and the like are used merely to distinguish between descriptions and should not be construed as indicating or implying relative importance.
Unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which various embodiments of this application belong. The terms (such as those defined in commonly used dictionaries) will be interpreted as having a meaning that is identical to the meaning of the context in the relevant art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein in connection with the various embodiments.
Example 1
The application provides a text fuzzy matching method.
Referring to FIG. 1, the text fuzzy matching method includes steps S101 to S106, and each step is described below.
Step S101, obtaining corresponding initial matching scores according to the modification cost of each character of the matching template and each character of the text to be matched.
In this embodiment, the text to be matched is an input text and may be a character string entered by the user, for example "how is an individual protected against Jin Guanzhuang virus?" (in which "Jin Guanzhuang virus" stands for 金冠状病毒). The text to be matched may also be another character string, which is not limited herein. A plurality of matching templates can be stored in a matching library, and each matching template can be a key character string labeled manually or a key character string identified through artificial intelligence. For example, the matching template may be "new coronavirus" (新冠病毒).
It should be noted that the modification cost between a character of the matching template and a character of the text to be matched is calculated in different ways for different error sources. For example, in image recognition, the probability of confusing "rabbit" with "exempt" should be much higher than the probability of confusing "rabbit" with "tiger". In speech recognition, the difference between "golden crown" and "new crown" should be small. Image recognition and speech recognition should therefore use different modification costs. The modification cost may be determined by the error source: for text from an image recognition source, glyph-related features may be extracted and the single word similarity calculated as the modification cost; for text from a speech recognition source, phoneme similarity is used as the modification cost. A phoneme is the smallest phonetic unit divided according to the natural attributes of speech; for example, Chinese pinyin and English phonetic symbols can be used as phonemes.
In an embodiment, obtaining the modification cost of each character of the matching template and each character of the text to be matched includes:
if the error source is an image recognition error, determining single word similarity according to the character pattern related characteristics, and taking the single word similarity as the modification cost;
and if the error source is a voice recognition error, adopting the phoneme similarity as the modification cost.
For speech recognition errors, similarity can be judged on Chinese pinyin and English phonetic symbols, and the modification cost can be calculated from phoneme confusion costs determined by manual experience. For image recognition errors, the single word similarity is determined from glyph-related features, and the modification cost is determined by the single word similarity.
In this way, an appropriate modification cost can be determined for each source, and the deviation introduced by the error source is taken into account so as to accommodate possible errors in the result of the preceding recognition technology. Using phonemes and glyphs as the cost basis of the edit distance reduces matching errors caused by wrongly written characters, recognition errors and untimely updating of the hot word system.
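A minimal sketch of choosing the modification cost by error source is given below. Everything in it is assumed for illustration: the lookup tables `PINYIN`, `PHONEME_SIMILARITY` and `GLYPH_SIMILARITY`, the function name `modification_cost`, and the value 0.9 are hypothetical; only the value 0.8 for "gold" (金) versus "new" (新) is taken from the worked example in this description. Treating identical characters and exact homophones as 1 is likewise an assumption.

```python
# Hypothetical lookup tables. The description only states that costs come from
# glyph-related features (image source) or phoneme similarity (speech source).
PINYIN = {"金": "jin", "新": "xin", "冠": "guan", "官": "guan"}
PHONEME_SIMILARITY = {frozenset({"jin", "xin"}): 0.8}  # 0.8 as in the worked example
GLYPH_SIMILARITY = {frozenset({"兔", "免"}): 0.9}       # assumed value


def modification_cost(template_char, text_char, error_source):
    """Return the similarity-based modification cost for one character pair."""
    if template_char == text_char:
        return 1.0  # identical characters (assumed to score 1)
    if error_source == "image":
        return GLYPH_SIMILARITY.get(frozenset({template_char, text_char}), 0.0)
    if error_source == "speech":
        p1, p2 = PINYIN.get(template_char), PINYIN.get(text_char)
        if p1 is None or p2 is None:
            return 0.0  # no phoneme information available
        if p1 == p2:
            return 1.0  # homophones (assumption)
        return PHONEME_SIMILARITY.get(frozenset({p1, p2}), 0.0)
    raise ValueError(f"unknown error source: {error_source!r}")


print(modification_cost("新", "金", "speech"))  # 0.8
print(modification_cost("兔", "免", "image"))   # 0.9
```

In a real system the tables would presumably be replaced by a glyph feature extractor and an empirically tuned phoneme confusion matrix; the dictionary form is used here only to keep the sketch self-contained.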
Referring to fig. 2, step S101 includes:
step S1011, if the modification cost is greater than a preset cost threshold, determining the initial matching score as 0;
step S1012, if the modification cost is less than or equal to a preset cost threshold, taking the modification cost as the initial matching score.
In this embodiment, the preset cost threshold may be a threshold set empirically; for example, it may be set to 0.5 or to another value, which is not limited herein. By judging whether the modification cost is larger than the preset cost threshold, the modification difficulty between each character of the matching template and each character of the text to be matched can be distinguished. If the modification cost is greater than the preset cost threshold, the modification difficulty between the character of the matching template and the character of the text to be matched is relatively high, that is, their matching degree is relatively low, and their initial matching score can be determined to be 0. If the modification cost is smaller than or equal to the preset cost threshold, the modification difficulty is relatively low, that is, the matching degree is relatively high, and the modification cost can be used as the initial matching score to represent the matching degree between the character of the matching template and the character of the text to be matched.
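Steps S1011 and S1012 can be sketched as below. The sketch follows the comparison direction exactly as worded in the steps and uses the example threshold of 0.5; the names `initial_match_score`, `initial_score_table` and `cost_fn` are hypothetical, and the cost function is supplied by the caller.

```python
def initial_match_score(modification_cost, cost_threshold=0.5):
    """S1011/S1012 as worded: 0 above the threshold, otherwise the cost itself."""
    if modification_cost > cost_threshold:
        return 0.0
    return modification_cost


def initial_score_table(template, text, cost_fn, cost_threshold=0.5):
    """One row per template character, one column per character of the text."""
    return [
        [initial_match_score(cost_fn(t, x), cost_threshold) for x in text]
        for t in template
    ]


# Toy cost function, for illustration only.
toy_cost = lambda t, x: 0.3 if t == x else 0.9
print(initial_score_table("ab", "abc", toy_cost))
```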
Step S102, obtaining a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each initial matching score, the insertion cost and the deletion cost.
It should be noted that an insert operation and a delete operation are generally regarded as equivalent with respect to the two character strings. For example, given two texts A and B, inserting a character into A has the same effect as deleting a character from B, so the deletion cost and the insertion cost can be set to the same cost value. For example, the deletion cost and the insertion cost may both be set to 0.4.
Referring to fig. 3, step S102 includes:
step S1021, traversing each initial matching score, and determining a first adjacent initial matching score, a second adjacent initial matching score and a third adjacent initial matching score of the current initial matching score;
step S1022, calculating a sum value of the first adjacent initial matching score and the current initial matching score, a first difference value between the second adjacent initial matching score and the insertion cost, and a second difference value between the third adjacent initial matching score and the deletion cost;
step S1023, determining the maximum value among the sum value, the first difference value, and the second difference value as the final matching score.
In this embodiment, an initial matching score table may be built from the initial matching scores between each character of the matching template and each character of the text to be matched. The first column of the table lists the characters of the matching template in order, the first row of the table lists the characters of the text to be matched in order, and the initial matching score corresponding to the i-th character of the matching template and the j-th character of the text to be matched is stored in the cell at row i, column j. If the current initial matching score is the one in the cell at row i, column j, then the first adjacent initial matching score is the one at row i-1, column j-1, the second adjacent initial matching score is the one at row i-1, column j, and the third adjacent initial matching score is the one at row i, column j-1.
In this embodiment, the method further includes:
and if the current initial matching score does not have the first adjacent initial matching score, the second adjacent initial matching score and the third adjacent initial matching score, setting the first adjacent initial matching score, the second adjacent initial matching score and the third adjacent initial matching score of the current initial matching score to be 0.
It should be noted that, in the foregoing initial matching score table, when the cell above or to the left of the current initial matching score lies outside the table, the first adjacent initial matching score, the second adjacent initial matching score and the third adjacent initial matching score concerned are regarded as 0.
For example, consider the initial matching score in the cell at row 2, column 2. The cells at row 1, column 1, at row 1, column 2 and at row 2, column 1 hold no initial matching score, so this score has no first, second or third adjacent initial matching score, and according to the rule all three may be set to 0. In other words, because no initial matching scores exist above or to the left of the cell at row 2, column 2, the scores of those three cells are regarded as 0 to fill in the missing data. For the other cells in row 2 and in column 2, the missing adjacent initial matching scores may likewise be set to 0 according to the foregoing rule, which is not described again herein.
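The neighbor definitions and the boundary rule can be sketched with a small helper. The sketch assumes the table is stored as a list of rows with zero-based indices and no header row or column; the function name `adjacent_scores` is hypothetical.

```python
def adjacent_scores(table, i, j):
    """Return (first, second, third) adjacent scores of cell (i, j).

    first  = upper-left cell (i-1, j-1)
    second = upper cell      (i-1, j)
    third  = left cell       (i,   j-1)
    Any neighbor that falls outside the table is treated as 0.
    """
    def at(r, c):
        return table[r][c] if r >= 0 and c >= 0 else 0.0

    return at(i - 1, j - 1), at(i - 1, j), at(i, j - 1)


scores = [[1.0, 2.0],
          [3.0, 4.0]]
print(adjacent_scores(scores, 0, 0))  # no neighbors: all three are 0
print(adjacent_scores(scores, 1, 1))  # all three neighbors exist
```

An equivalent and common design is to pad the table with an extra row and column of zeros, which removes the bounds check from the inner loop.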
It should be noted that the final matching score in this embodiment can be understood as the degree of similarity between a subsequence of the matching template and a subsequence of the text to be matched. This degree of similarity may also be called an edit distance variant, and the subsequence of the text to be matched may also be called a key sequence. In most matching scenarios, the key sequence may be any subsequence of the text to be matched.
In the prior art, full-text matching is affected by various recognition errors, and deep learning fuzzy matching methods based on the full sequence perform poorly on key sequences that occupy only a small proportion of the sequence and are easily affected by context-independent information. The edit distance variant of the present application can fuzzily match a key sequence regardless of the proportion of the whole sequence it occupies, and eliminates the influence of context-independent information. The process of calculating the final matching score is the process of calculating the edit distance variant.
It should be noted that the edit distance in the prior art is a very common sequence similarity feature. A text is a sequence, and "text" and "sequence" can be treated as approximately equivalent when describing content related to the edit distance. The edit distance can be defined as follows: to compute the similarity between sequence A and sequence B, three operations are available, namely inserting, deleting and modifying an element at any position, and each operation has its own cost score; the edit distance is the minimum total cost of converting the two sequences into the same sequence using these three operations. The conventional edit distance is computed between two whole sequences, whereas in this embodiment the process of calculating the edit distance variant is the process of calculating the final matching score described above. The edit distance variant is used to calculate the correlation between the matching template and the text to be matched; it can ignore context-independent information and search only for the matching degree of the keywords.
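For comparison, the conventional edit distance defined above can be sketched as a standard dynamic program over the two whole sequences. Unit costs for the three operations are an assumption made for the illustration, and the function name `edit_distance` is hypothetical.

```python
def edit_distance(a, b, insert_cost=1, delete_cost=1, modify_cost=1):
    """Minimum total cost of turning sequence a into sequence b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = [j * insert_cost for j in range(len(b) + 1)]
    for i, ca in enumerate(a, start=1):
        cur = [i * delete_cost]
        for j, cb in enumerate(b, start=1):
            cur.append(min(
                prev[j] + delete_cost,                        # delete ca
                cur[j - 1] + insert_cost,                     # insert cb
                prev[j - 1] + (0 if ca == cb else modify_cost),  # keep or modify
            ))
        prev = cur
    return prev[-1]


print(edit_distance("Mik", "Mike"))        # 1
print(edit_distance("kitten", "sitting"))  # 3
```

Because this distance is taken over both complete sequences, unrelated context in a long text inflates it, which is the limitation the edit distance variant is meant to avoid.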
In an embodiment, the calculating the sum value according to the first adjacent initial matching score and the current initial matching score includes:
and when the first adjacent initial matching score is not 0, calculating the sum according to the first adjacent initial matching score, the current initial matching score and a preset compensation score.
Illustratively, when the first adjacent initial matching score is not 0, the sum of the first adjacent initial matching score, the current initial matching score and a preset compensation score is calculated; when the first adjacent initial matching score is 0, the sum of the first adjacent initial matching score and the current initial matching score is calculated. Adding the preset compensation score makes a longer matching sequence more likely to be selected. The preset compensation score may be a small score less than 1; for example, it may be 0.1, which is not limited herein.
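The computation for a single cell can be sketched as follows, combining the compensated sum with the two difference values. The function names and parameter names are hypothetical; the default values 0.1 and 0.4 are the example values given in this description.

```python
def sum_value(first_adjacent, current, compensation=0.1):
    """Sum for the diagonal step; the compensation applies only after a non-zero neighbor."""
    if first_adjacent != 0:
        return first_adjacent + current + compensation
    return first_adjacent + current


def final_match_score(first_adjacent, second_adjacent, third_adjacent, current,
                      insertion_cost=0.4, deletion_cost=0.4, compensation=0.1):
    """Maximum of the sum value and the two difference values."""
    return max(
        sum_value(first_adjacent, current, compensation),
        second_adjacent - insertion_cost,   # first difference value
        third_adjacent - deletion_cost,     # second difference value
    )


print(sum_value(0, 0.8))                          # 0.8, no compensation
print(final_match_score(0.0, 0.0, 1.5, 0.0))      # left neighbor minus deletion cost
```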
Referring to table 1 below, table 1 shows a schematic representation of the final match scores.
TABLE 1 final match score schematic form
By way of further explanation, in the initial matching score table described above, the initial matching score between "gold" and "new" is 0.8, the initial matching score between "poison" and "guard" is 0.3, and the initial matching score of every exact match is 1. The insertion cost and the deletion cost are both 0.4, the preset compensation score is 0.01, and the preset matching threshold is 0.8. The final matching scores shown in Table 1 are obtained by calculation according to the foregoing steps. As shown in Table 1, the 1st column lists the characters of the matching template "new coronavirus", the 1st row lists the characters of the text to be matched "gold-protecting coronavirus", and the value recorded in the cell at row i, column j is the final matching score between the 1st to i-th characters of the matching template and the 1st to j-th characters of the text to be matched. For example, in Table 1, the value in the cell where the row of the "poison" character in column 1 meets the column of the "poison" character in row 1 is 3.43, representing the final matching score of "new coronavirus" up to the "poison" character.
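The calculation behind this example can be sketched as below. Several points are assumptions rather than statements of the source: the Chinese strings are a reconstruction of the translated example (template 新冠病毒, text 护金冠状病毒), all unlisted unequal pairs are given an initial score of 0, and the adjacent scores used in the recurrence are taken from the final score table as it is being filled. Under these assumptions the sketch arrives at a highest score of 3.43, in agreement with the value quoted above.

```python
# Assumed reconstruction of the example strings.
TEMPLATE = "新冠病毒"      # "new coronavirus", 4 characters
TEXT = "护金冠状病毒"      # "gold-protecting coronavirus"

# Initial matching scores quoted in the description; other unequal pairs are assumed 0.
CONFUSABLE = {("新", "金"): 0.8, ("毒", "护"): 0.3}


def initial_score(template_char, text_char):
    if template_char == text_char:
        return 1.0
    return CONFUSABLE.get((template_char, text_char), 0.0)


def final_score_table(template, text, ins=0.4, dele=0.4, comp=0.01):
    """Row 0 and column 0 are zero padding for neighbors outside the table."""
    H = [[0.0] * (len(text) + 1) for _ in range(len(template) + 1)]
    for i in range(1, len(template) + 1):
        for j in range(1, len(text) + 1):
            diag = H[i - 1][j - 1]
            total = diag + initial_score(template[i - 1], text[j - 1])
            if diag != 0:
                total += comp                     # reward extending a match
            H[i][j] = max(total,                  # sum value
                          H[i - 1][j] - ins,      # first difference value
                          H[i][j - 1] - dele)     # second difference value
    return H


table = final_score_table(TEMPLATE, TEXT)
best = max(max(row) for row in table)
for ch, row in zip(TEMPLATE, table[1:]):
    print(ch, [round(v, 2) for v in row[1:]])
print("highest:", round(best, 2))
```

The best path pairs 新 with 金 (0.8), matches 冠 (1 + 0.01), skips the extra 状 (minus 0.4), and matches 病 and 毒 (1 + 0.01 each): 0.8 + 1.01 - 0.4 + 1.01 + 1.01 = 3.43.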
Step S103, determining the highest matching score from a plurality of final matching scores, and determining the single word average score according to the number of characters of the matching template and the highest matching score.
Exemplarily, the highest matching score among the plurality of final matching scores in Table 1 is 3.43. Since the matching template "new coronavirus" has 4 characters, the single word average score is calculated as 3.43 / 4 = 0.8575. The single word average score 0.8575 is greater than the preset matching threshold 0.8, so a matching path can be obtained. When the single word average score is less than or equal to the preset matching threshold, the current matching template can be discarded because the matching degree is insufficient, and other matching templates are retrieved from the matching library and processed in the same way until a matching template whose single word average score is greater than the preset matching threshold is found.
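The threshold check in this step is plain arithmetic and can be sketched as follows. The function names `single_word_average` and `accept_template` are hypothetical; the numeric values are those of the Table 1 example.

```python
def single_word_average(highest_score: float, template_length: int) -> float:
    """Highest final matching score divided by the template's character count."""
    return highest_score / template_length


def accept_template(highest_score: float, template_length: int,
                    threshold: float = 0.8) -> bool:
    """A template is kept only if its average is strictly above the threshold."""
    return single_word_average(highest_score, template_length) > threshold


average = single_word_average(3.43, 4)  # values from the Table 1 example
accepted = accept_template(3.43, 4)
print(f"{average:.4f}", accepted)
```

Note that the comparison is strict: an average exactly equal to the threshold leads to the template being discarded, as described above.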
Step S104, if the single word average score is larger than a preset matching threshold, a matching path is determined according to the matching template.
For example, in Table 1, since the single word average score 0.8575 is greater than the preset matching threshold 0.8, the matching path of the text to be matched "how personal protected Jin Guanzhuang virus" can be determined according to "new coronavirus".
Step S105, editing the text to be matched according to the matching path to obtain a modified text.
It should be noted that, the modified text obtained based on the foregoing steps is text obtained by using fewer modification operations, and is a low-cost modified text.
For example, characters in the text to be matched are replaced by the corresponding characters in the matching template. In Table 1, "Jin Guanzhuang virus" is edited toward "new coronavirus": "gold" is replaced by "new", and the extra character "shape" is retained. Conventional fuzzy matching, such as similarity-based matching, is then performed.
In this way, the text to be matched is modified and corrected before the whole sentence matching operation, so that the semantic extraction error caused by the recognition error is prevented, and the matching result is further prevented from being wrong.
Step S106, performing matching processing on the modified text and the matching template.
In this embodiment, the modified text and the matching template are matched, so as to prevent the semantic extraction difference in the long text caused by the recognition error, and further avoid misjudgment of a text processing system.
In one embodiment, step S105 includes:
determining a modification operation and a matching substring according to the matching path;
and editing the text to be matched according to the modification operation and the matching substring to obtain the modified text.
For example, in Table 1, the matching path matches "Jin Guanzhuang virus" to "new coronavirus", the modification operation is to modify "gold" in "Jin Guanzhuang virus" to "new", and the matching substring is "Jin Guanzhuang virus". The "gold" in "Jin Guanzhuang virus" is modified to "new" and the "shape" character is retained, so that the modified text is "new coronavirus".
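The editing step can be sketched as applying a list of per-character operations to the matching substring. The representation of the matching path as `(operation, source, target)` tuples, the function name `apply_modifications`, and the English token labels are all assumptions made for illustration; the text does not specify a data format.

```python
from typing import List, Tuple

Operation = Tuple[str, str, str]  # (operation name, source token, target token)


def apply_modifications(text: List[str], start: int,
                        operations: List[Operation]) -> List[str]:
    """Rewrite the matched span of `text` that begins at index `start`."""
    edited = list(text[:start])
    position = start
    for name, source, target in operations:
        if text[position] != source:
            raise ValueError(f"expected {source!r} at index {position}")
        # "replace" substitutes the template character; "keep" retains the
        # original character, including extra characters such as "shape".
        edited.append(target if name == "replace" else source)
        position += 1
    edited.extend(text[position:])
    return edited


text = ["guard", "gold", "crown", "shape", "disease", "poison"]
operations = [
    ("replace", "gold", "new"),
    ("keep", "crown", "crown"),
    ("keep", "shape", "shape"),
    ("keep", "disease", "disease"),
    ("keep", "poison", "poison"),
]
modified = apply_modifications(text, 1, operations)
print(modified)
```

Only the span covered by the matching substring is touched; tokens before and after it are copied unchanged, which keeps the number of modification operations small.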
Existing keyword matching is easily affected by factors such as recognition errors and untimely updating of the system lexicon, which change the text to be matched and thereby affect the matching result. To address this, the present embodiment provides a text fuzzy matching method that treats the errors of the front-end system as editing operation costs, so that errors introduced by the front-end system are tolerated and the robustness of the whole system is improved. For the case in which misrecognition causes errors in whole-sentence text matching results, an error correction mechanism based on editing cost is added and error correction is performed separately for each matching template, so that errors introduced by front-end system recognition are tolerated and a more robust result is obtained.
Existing error correction technology applies the same language model assumption to all texts and requires the hotword database to be updated in time. By contrast, the text fuzzy matching method of the present embodiment takes the errors present in the text source into account, tries to reduce them, and uses them as a basis for matching. The method differs from existing error correction technology in that the latter corrects all results with the same language model, whereas the present method is based on the template and the errors of the front-end system, so the error correction results of the same text differ between templates. For example, if the text to be matched is "Hu Jinguan virus expert", the result of existing error correction may be one of "Hu Jinguan | virus expert" or "Hu | new coronavirus | expert", whereas with the text fuzzy matching method of the present embodiment both can be matched successfully.
According to the text fuzzy matching method provided by the embodiment, corresponding initial matching scores are obtained according to the modification cost of each character of the matching template and each character of the text to be matched; obtaining a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each initial matching score, the insertion cost and the deletion cost; determining the highest matching score from a plurality of final matching scores, and determining a single word average score according to the number of characters of the matching template and the highest matching score; if the average score of the single words is larger than a preset matching threshold value, determining a matching path according to the matching template; editing the text to be matched according to the matching path to obtain a modified text; and carrying out matching processing on the modified text and the matching template. Therefore, the method has higher error tolerance of the front-end system, is less prone to occurrence of recognition omission, has lower possibility of being influenced by the front-end system error in text matching, and can provide higher robustness for local and whole systems.
Example 2
In addition, the application provides a text fuzzy matching device.
As shown in Fig. 4, the text fuzzy matching device 400 includes:
the first obtaining module 401 is configured to obtain a corresponding initial matching score according to modification costs of each character of the matching template and each character of the text to be matched;
a second obtaining module 402, configured to obtain a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each of the initial matching score, the insertion cost, and the deletion cost;
a first determining module 403, configured to determine a highest matching score from a plurality of the final matching scores, and determine a single word average score according to the number of characters of the matching template and the highest matching score;
a second determining module 404, configured to determine a matching path according to the matching template if the average score of the single word is greater than a preset matching threshold;
an editing module 405, configured to edit the text to be matched according to the matching path, so as to obtain a modified text;
and the matching module 406 is used for matching the modified text with the matching template.
In an embodiment, the first obtaining module 401 is further configured to determine the initial matching score as 0 if the modification cost is greater than a preset cost threshold;
and if the modification cost is smaller than or equal to a preset cost threshold value, taking the modification cost as the initial matching score.
In an embodiment, the second obtaining module 402 is further configured to traverse each of the initial matching scores to determine a first adjacent initial matching score, a second adjacent initial matching score, and a third adjacent initial matching score of the current initial matching score;
calculating a sum value of the first adjacent initial matching score and the current initial matching score, a first difference value between the second adjacent initial matching score and the insertion cost, and a second difference value between the third adjacent initial matching score and the deletion cost;
and determining the maximum value of the sum value, the first difference value and the second difference value as the final matching score.
In an embodiment, the second obtaining module 402 is further configured to calculate the sum according to the first adjacent initial matching score, the current initial matching score, and a preset compensation score when the first adjacent initial matching score is not 0.
In one embodiment, the text fuzzy matching device 400 further includes:
a first processing module, configured to set the first adjacent initial matching score, the second adjacent initial matching score, and the third adjacent initial matching score of the current initial matching score to 0 if the current initial matching score does not have the first adjacent initial matching score, the second adjacent initial matching score, and the third adjacent initial matching score.
In one embodiment, the text fuzzy matching device 400 further includes:
the second processing module is used for determining single word similarity according to the character pattern related characteristics if the error source is an image recognition error, and taking the single word similarity as the modification cost;
and if the error source is a voice recognition error, adopting the phoneme similarity as the modification cost.
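The choice of modification cost by error source can be sketched as a simple dispatch. The similarity functions are passed in by the caller because the text does not specify how glyph-related (character pattern) similarity or phoneme similarity is computed; the toy functions, the string labels `"image"` and `"speech"`, and the function name `modification_cost` are all hypothetical.

```python
from typing import Callable

Similarity = Callable[[str, str], float]


def modification_cost(template_char: str, text_char: str, error_source: str,
                      glyph_similarity: Similarity,
                      phoneme_similarity: Similarity) -> float:
    """Pick the similarity measure that matches the source of the error."""
    if error_source == "image":
        # Image recognition errors: similarity from glyph-related features.
        return glyph_similarity(template_char, text_char)
    if error_source == "speech":
        # Voice recognition errors: phoneme similarity.
        return phoneme_similarity(template_char, text_char)
    raise ValueError(f"unsupported error source: {error_source}")


# Toy stand-ins for real similarity models (illustrative values only).
def toy_glyph(a: str, b: str) -> float:
    return 1.0 if a == b else 0.0


def toy_phoneme(a: str, b: str) -> float:
    return 1.0 if a == b else {("new", "gold"): 0.8}.get((a, b), 0.0)


cost = modification_cost("new", "gold", "speech", toy_glyph, toy_phoneme)
print(cost)
```

Passing the similarity functions as parameters keeps the dispatch independent of any particular recognition system, so a different front-end only requires a different similarity function.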
In an embodiment, the editing module 405 is further configured to determine a modification operation and a matching substring according to the matching path;
and edit the text to be matched according to the modification operation and the matching substring to obtain the modified text.
The text fuzzy matching device 400 provided in this embodiment can implement the text fuzzy matching method provided in embodiment 1, and in order to avoid repetition, a description thereof will be omitted.
According to the text fuzzy matching device provided by the embodiment, corresponding initial matching scores are obtained according to the modification cost of each character of the matching template and each character of the text to be matched; obtaining a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each initial matching score, the insertion cost and the deletion cost; determining the highest matching score from a plurality of final matching scores, and determining a single word average score according to the number of characters of the matching template and the highest matching score; if the average score of the single words is larger than a preset matching threshold value, determining a matching path according to the matching template; editing the text to be matched according to the matching path to obtain a modified text; and carrying out matching processing on the modified text and the matching template. Therefore, the method has higher error tolerance of the front-end system, is less prone to occurrence of recognition omission, has lower possibility of being influenced by the front-end system error in text matching, and can provide higher robustness for local and whole systems.
Example 3
Furthermore, the present application provides an electronic device comprising a memory and a processor, the memory storing a computer program which, when run on the processor, performs the text fuzzy matching method provided by embodiment 1.
The electronic device provided in this embodiment may implement the text fuzzy matching method provided in embodiment 1, and in order to avoid repetition, details are not repeated here.
Example 4
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the text fuzzy matching method provided by embodiment 1.
In the present embodiment, the computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or the like.
The computer readable storage medium provided in this embodiment may implement the text fuzzy matching method provided in embodiment 1, and in order to avoid repetition, a description thereof will be omitted.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article or terminal comprising the element.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art in the form of a software product stored in a storage medium (such as ROM/RAM, magnetic disk, optical disk), including several instructions for causing a terminal (which may be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) to perform the method described in the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above-described embodiments, which are merely illustrative and not restrictive, and many forms may be made by those of ordinary skill in the art without departing from the spirit of the present application and the scope of the claims, which are also within the protection of the present application.

Claims (10)

1. A method for fuzzy matching of text, the method comprising:
acquiring corresponding initial matching scores according to the modification cost of each character of the matching template and each character of the text to be matched;
obtaining a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each initial matching score, the insertion cost and the deletion cost;
determining the highest matching score from a plurality of final matching scores, and determining a single word average score according to the number of characters of the matching template and the highest matching score;
if the average score of the single words is larger than a preset matching threshold value, determining a matching path according to the matching template;
editing the text to be matched according to the matching path to obtain a modified text;
and carrying out matching processing on the modified text and the matching template.
2. The method according to claim 1, wherein the obtaining the corresponding initial matching score according to the modification cost of each character of the matching template and each character of the text to be matched comprises:
if the modification cost is greater than a preset cost threshold, determining the initial matching score as 0;
and if the modification cost is smaller than or equal to a preset cost threshold value, taking the modification cost as the initial matching score.
3. The method according to claim 1, wherein the obtaining a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each of the initial matching score, the insertion cost, and the deletion cost comprises:
traversing each initial matching score, and determining a first adjacent initial matching score, a second adjacent initial matching score and a third adjacent initial matching score of the current initial matching score;
calculating a sum value of the first adjacent initial matching score and the current initial matching score, a first difference value between the second adjacent initial matching score and the insertion cost, and a second difference value between the third adjacent initial matching score and the deletion cost;
and determining the maximum value of the sum value, the first difference value and the second difference value as the final matching score.
4. A method according to claim 3, wherein said calculating a sum value from said first adjacent initial match score and a current initial match score comprises:
and when the first adjacent initial matching score is not 0, calculating the sum according to the first adjacent initial matching score, the current initial matching score and a preset compensation score.
5. A method according to claim 3, characterized in that the method further comprises:
if the current initial matching score does not have the first adjacent initial matching score, the second adjacent initial matching score and the third adjacent initial matching score, setting the first adjacent initial matching score, the second adjacent initial matching score and the third adjacent initial matching score of the current initial matching score to be 0.
6. The method of claim 1, wherein obtaining the modification cost of each character of the matching template and each character of the text to be matched comprises:
if the error source is an image recognition error, determining single word similarity according to the character pattern related characteristics, and taking the single word similarity as the modification cost;
and if the error source is a voice recognition error, adopting the phoneme similarity as the modification cost.
7. The method according to claim 1, wherein editing the text to be matched according to the matching path to obtain a modified text comprises:
determining a modification operation and a matching substring according to the matching path;
and editing the text to be matched according to the modification operation and the matching substring to obtain the modified text.
8. A text fuzzy matching device, the device comprising:
the first acquisition module is used for acquiring corresponding initial matching scores according to the modification cost of each character of the matching template and each character of the text to be matched;
the second acquisition module is used for acquiring a final matching score between the subsequence of the matching template and the subsequence of the text to be matched according to each initial matching score, the insertion cost and the deletion cost;
the first determining module is used for determining the highest matching score from a plurality of final matching scores, and determining a single word average score according to the number of characters of the matching template and the highest matching score;
the second determining module is used for determining a matching path according to the matching template if the single word average score is larger than a preset matching threshold;
the editing module is used for editing the text to be matched according to the matching path to obtain a modified text;
and the matching module is used for matching the modified text with the matching template.
9. An electronic device comprising a memory and a processor, the memory storing a computer program that, when executed by the processor, performs the text fuzzy matching method of any of claims 1 to 7.
10. A computer-readable storage medium, characterized in that it stores a computer program which, when run on a processor, performs the text fuzzy matching method of any one of claims 1 to 7.
CN202310233802.6A 2023-02-28 2023-02-28 Text fuzzy matching method, device, electronic equipment and readable storage medium Pending CN116450896A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310233802.6A CN116450896A (en) 2023-02-28 2023-02-28 Text fuzzy matching method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310233802.6A CN116450896A (en) 2023-02-28 2023-02-28 Text fuzzy matching method, device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN116450896A true CN116450896A (en) 2023-07-18

Family

ID=87129255

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310233802.6A Pending CN116450896A (en) 2023-02-28 2023-02-28 Text fuzzy matching method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN116450896A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116881405A (en) * 2023-09-07 2023-10-13 深圳市金政软件技术有限公司 Chinese character fuzzy matching method, device, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination