CN116881405A

CN116881405A - Chinese character fuzzy matching method, device, equipment and medium

Info

Publication number: CN116881405A
Application number: CN202311147233.XA
Authority: CN
Inventors: 李铀; 石自军; 于鹏; 余方波
Original assignee: Shenzhen Jinzheng Software Technology Co ltd
Current assignee: Shenzhen Jinzheng Software Technology Co ltd
Priority date: 2023-09-07
Filing date: 2023-09-07
Publication date: 2023-10-13

Abstract

The invention relates to the field of text matching and discloses a Chinese character fuzzy matching method, device, equipment and medium. In the method, a matching text is split to obtain a matching word set and a matching single word set, and the elements of both sets are assigned scores. A target phrase is split to obtain a target word set and a target single word set, and is matched by traversal against the matching word set and the matching single word set respectively; a first score is calculated from the successful matches based on a score correspondence table. A second score is obtained by applying a similarity algorithm to the matching text and the target phrase. The matching degree is the product of the first score and the second score, and it determines how similar the target phrase is to the matching text. Because the text is deeply deconstructed, matched at the level of the split words and single words, and combined with character similarity, fuzzy matching tracks the semantic relations of the words more closely and matching accuracy is improved.

Description

Chinese character fuzzy matching method, device, equipment and medium

Technical Field

The invention relates to the field of text matching, in particular to a Chinese character fuzzy matching method, a device, equipment and a medium.

Background

With the rapid development of big data, more enterprise groups are paying attention to how their data is used, and building a data warehouse that meets the enterprise's own requirements has become a top priority. Most systems used by companies have gone through many version changes, and users enter information in an unconstrained way, so the original data is highly redundant and messy. Matching and cleaning this data during data governance requires a large amount of manual work, which wears down an enterprise's enthusiasm and patience for building a data warehouse.

Current fuzzy matching mainly relies on the edit distance algorithm. The edit distance is the minimum number of editing operations needed to convert one character string into another. The allowed editing operations are replacing one character with another, inserting one character, and deleting one character. In general, the smaller the edit distance, the fewer operations are needed and the greater the similarity of the two character strings.
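
The edit distance described above can be illustrated with a minimal Python sketch. It uses the standard dynamic-programming formulation of Levenshtein distance; the function name `edit_distance` is chosen here for illustration and does not come from the source.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of single-character replacements, insertions
    and deletions needed to turn string a into string b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # replace ca with cb (or keep it)
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

The function compares characters only, which is the limitation the background section goes on to point out: two strings with related meaning but different characters still receive a large distance.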

However, this matching method considers only character similarity; it does not consider semantic similarity and cannot capture deeper semantic relations.

Disclosure of Invention

The invention mainly aims to provide a Chinese character fuzzy matching method, a device, equipment and a medium for solving the technical problems.

The first aspect of the present invention provides a fuzzy matching method for Chinese characters, comprising:

obtaining a matching text and splitting the matching text to obtain a matching word set and a matching single word set, wherein the matching word set comprises at least one matching word, and the matching single word set comprises at least one matching single word;

assigning a score to each matching word and each matching single word to obtain a score correspondence table;

matching, according to the score correspondence table, the target phrase against each matching word and each matching single word respectively to obtain a first score;

based on a preset character similarity algorithm, performing similarity calculation on the matching text and the target phrase to obtain a second score;

and calculating the product of the first score and the second score to obtain the matching degree of the target phrase relative to the matching text, wherein the larger the matching degree is, the more similar the target phrase is to the matching text.

Optionally, in a second implementation manner of the first aspect of the present invention, when a word obtained by freely combining any N matching single words is the same as any matching word in the matching word set, the sum of the assigned scores of the N matching single words is equal to the assigned score of that matching word, where N > 1 and N is a positive integer.

Optionally, in a third implementation manner of the first aspect of the present invention, the matching, according to the score correspondence table, of the target phrase against each matching word and each matching single word respectively to obtain a first score includes:

splitting the target phrase to obtain a target word set, wherein the target word set comprises at least one target word;

splitting the target words to obtain a target single word set, wherein the target single word set comprises at least two target single words.

Optionally, in a fourth implementation manner of the first aspect of the present invention, the matching, according to the score correspondence table, of the target phrase against each matching word and each matching single word respectively to obtain a first score includes:

performing traversal matching of the target words in the target word set against each matching word;

and summing, based on the score correspondence table, the assigned scores of the successfully matched matching words to obtain an A score.

Optionally, in a fifth implementation manner of the first aspect of the present invention, the matching, according to the score correspondence table, of the target phrase against each matching word and each matching single word respectively to obtain a first score includes:

judging whether there is a target word that failed to match every matching word;

if there is a target word that failed to match, matching the target single words in the target single word set against each matching single word, summing, based on the score correspondence table, the assigned scores of the successfully matched matching single words to obtain a B score, and calculating the sum of the A score and the B score to obtain the first score;

and if there is no target word that failed to match, determining the A score as the first score.

Optionally, in a sixth implementation manner of the first aspect of the present invention, the performing, based on a preset character similarity algorithm, a similarity calculation on the matching text and the target phrase to obtain a second score includes:

determining the character length of the matching text to obtain a matching length;

determining the character length of the target phrase to obtain a target length;

and calculating a second score from the matching length and the target length based on a preset character similarity algorithm, wherein the character similarity algorithm is a cosine similarity algorithm.

Optionally, in a seventh implementation manner of the first aspect of the present invention, a formula of the cosine similarity algorithm is:

[Formula not reproduced in the source text]

wherein X is the matching length, Y is the target length, and the value given by the formula is the second score.

The second aspect of the present invention provides a fuzzy matching device for Chinese characters, comprising:

the phrase disassembling module is used for obtaining a matching text and splitting the matching text to obtain a matching word set and a matching single word set, wherein the matching word set comprises at least one matching word, and the matching single word set comprises at least one matching single word;

the assignment module is used for assigning a score to each matching word and each matching single word respectively to obtain a score correspondence table;

the first score calculating module is used for matching, according to the score correspondence table, the target phrase against each matching word and each matching single word respectively to obtain a first score;

the second score calculating module is used for carrying out similarity calculation on the matching text and the target phrase based on a preset character similarity algorithm to obtain a second score;

and the similarity judging module is used for calculating the product of the first score and the second score to obtain the matching degree of the target phrase relative to the matching text, wherein the larger the matching degree is, the more similar the target phrase is to the matching text.

The third aspect of the present invention provides a fuzzy matching device for Chinese characters, comprising: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line; the at least one processor invokes the instructions in the memory to cause the Chinese character fuzzy matching device to perform the Chinese character fuzzy matching method described above.

A fourth aspect of the present invention provides a computer-readable storage medium having instructions stored therein that, when executed on a computer, cause the computer to perform the above-described Chinese character fuzzy matching method.

In the embodiment of the invention, the matching text is split to obtain a matching word set and a matching single word set, and the elements of both sets are assigned scores. The target phrase is split to obtain a target word set and a target single word set, and is matched by traversal against the matching word set and the matching single word set respectively; the first score is calculated from the successful matches based on the score correspondence table. The second score is obtained by applying a similarity algorithm to the matching text and the target phrase. The matching degree is the product of the first score and the second score, and it determines how similar the target phrase is to the matching text. Because the text is deeply deconstructed, matched at the level of the split words and single words, and combined with character similarity, fuzzy matching tracks the semantic relations of the text more closely and matching accuracy is improved.

Drawings

FIG. 1 is a diagram showing a first embodiment of a fuzzy matching method for Chinese characters according to an embodiment of the present invention;

FIG. 2 is a diagram showing a step 103 of the fuzzy matching method for Chinese characters according to an embodiment of the present invention;

FIG. 3 is a diagram showing a second embodiment of step 103 in the fuzzy matching method of Chinese characters according to the embodiment of the present invention;

FIG. 4 is a diagram showing a third embodiment of step 103 in the fuzzy matching method of Chinese characters according to the embodiment of the present invention;

FIG. 5 is a schematic diagram showing a specific embodiment of step 105 in the fuzzy matching method of Chinese characters according to the embodiment of the present invention;

FIG. 6 is a schematic diagram of an embodiment of a Chinese character fuzzy matching device according to an embodiment of the present invention;

FIG. 7 is a schematic diagram of an embodiment of a Chinese character fuzzy matching device according to an embodiment of the present invention.

Detailed Description

The embodiment of the invention provides a Chinese character fuzzy matching method, a device, equipment and a medium.

The terms "first," "second," "third," "fourth" and the like in the description and in the claims and in the above drawings, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus.

For ease of understanding, a specific flow of an embodiment of the present invention is described below. Referring to FIG. 1 to FIG. 5, an embodiment of the Chinese character fuzzy matching method in the embodiment of the present invention includes:

101. obtaining a matching text and splitting the matching text to obtain a matching word set and a matching single word set, wherein the matching word set comprises at least one matching word, and the matching single word set comprises at least one matching single word;

In this embodiment, the matching text may be any specified or obtained article, paragraph, sentence or even phrase. The matching text is then split in two ways: word splitting and single word splitting. Word splitting produces multi-character words, for example two-character, three-character, four-character or five-character words; the specific meaning of such a multi-character word can be obtained from sources such as a dictionary or an online lexicon. Single word splitting breaks the matching text into individual characters, and combinations of these characters correspond to the multi-character words.
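
The two kinds of splitting in step 101 can be sketched as follows. The source does not say how word boundaries are found, so this sketch assumes a greedy longest-match lookup against a small lexicon; the function `split_text`, the `LEXICON` contents and the Chinese strings (an assumed rendering of the "information communication institute" example) are all illustrative.

```python
def split_text(text: str, lexicon: set[str], max_len: int = 5) -> tuple[list[str], list[str]]:
    """Return (multi-character words, single characters) for a text.

    Assumption: words are found by greedy forward maximum matching
    against a lexicon; the source does not prescribe a segmentation method.
    """
    words: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            if text[i:i + size] in lexicon:
                words.append(text[i:i + size])
                i += size
                break
        else:
            i += 1  # no lexicon word starts here; move on one character
    singles = list(text)  # single word splitting: one entry per character
    return words, singles


LEXICON = {"信息", "通信", "研究院"}  # hypothetical lexicon
words, singles = split_text("信息通信研究院", LEXICON)
print(words)    # ['信息', '通信', '研究院']
print(singles)  # ['信', '息', '通', '信', '研', '究', '院']
```

Any segmenter that yields multi-character words could replace the lexicon lookup; the single-character list is independent of that choice.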

102. Assigning a score to each matching word and each matching single word to obtain a score correspondence table;

In this embodiment, a matching single word may be assigned 1 or 2, or another value. A matching word is assigned a score according to the number of characters it contains: for example, a two-character word is assigned 3, a three-character word is assigned 6, a four-character word is assigned 10, and so on. The assignment may follow such an incrementally stacked sequence, or substitute assigned scores may be used. It should be noted that when a word obtained by freely combining any N matching single words is identical to a matching word in the matching word set, the sum of the assigned scores of the N matching single words equals the assigned score of that matching word, where N > 1 and N is a positive integer. In other words, the assigned scores of the single words that make up a matching word add up to the assigned score of that word. A score correspondence table is established separately to record the score assigned to each word and each single word, so that the score can be looked up quickly during matching.
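
A score correspondence table that satisfies the sum constraint can be sketched as below. Two points are assumptions made for this sketch: the word scores 3, 6, 10 are generalised as n(n+1)/2 for an n-character word, and single characters are given 1 each with the last newly seen character absorbing the remainder as a compensation value. The names `word_score` and `build_score_table` are illustrative.

```python
def word_score(n_chars: int) -> int:
    # Assumed generalisation of the examples: 2 -> 3, 3 -> 6, 4 -> 10.
    return n_chars * (n_chars + 1) // 2


def build_score_table(words: list[str]) -> dict[str, int]:
    """Map each word and each of its characters to an assigned score so that
    the character scores of a word sum to the score of that word."""
    table: dict[str, int] = {}
    for word in words:
        total = word_score(len(word))
        table[word] = total
        new_chars = [ch for ch in word if ch not in table]
        for ch in new_chars[:-1]:
            table[ch] = 1  # small default score for a single character
        if new_chars:
            last = new_chars[-1]
            # Compensation value: whatever is still missing from the word score.
            table[last] = total - sum(table.get(ch, 0) for ch in word if ch != last)
        # Reject assignments that break the constraint (for example when a
        # character repeats inside a word or all characters were fixed earlier).
        if sum(table[ch] for ch in word) != total or min(table[ch] for ch in word) < 1:
            raise ValueError(f"cannot satisfy the sum constraint for {word!r}")
    return table


table = build_score_table(["信息", "通信", "研究院"])
print(table["研究院"], table["研"], table["究"], table["院"])  # 6 1 1 4
```

Storing words and single characters in one dictionary keeps lookup to a single step, which is the purpose the text gives for the separate table.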

103. Matching, according to the score correspondence table, the target phrase against each matching word and each matching single word respectively to obtain a first score;

In this embodiment, the matching mode is traversal matching: the target phrase is first matched against each matching word one by one, and any part that is not matched is then matched against each matching single word, until every word in the target phrase has been processed.

1031. Splitting the target phrase to obtain a target word set, wherein the target word set comprises at least one target word;

1032. splitting the target words to obtain a target single word set, wherein the target single word set comprises at least two target single words.

In steps 1031 and 1032, the target phrase may be extracted from any journal, paper, sentence, etc. The terminal may receive a text selection operation input by the user and take the text selected by the user as the target phrase, or the terminal may receive text sent by another client as the target phrase. In one implementation scenario, the terminal provides a phrase matching page in which the user enters the target phrase to be matched, and the terminal thereby obtains the target phrase. Before matching, the target phrase must be split. Like the matching text, it is split into target words and target single words. Because the target phrase is in fact a combination of several words, obtaining the target single words only requires splitting those target words further.

1033. Performing traversal matching of the target words in the target word set against each matching word;

1034. and summing, based on the score correspondence table, the assigned scores of the successfully matched matching words to obtain an A score.

1035. Judging whether there is a target word that failed to match every matching word;

1036. if there is a target word that failed to match, matching the target single words in the target single word set against each matching single word, summing, based on the score correspondence table, the assigned scores of the successfully matched matching single words to obtain a B score, and calculating the sum of the A score and the B score to obtain the first score;

1037. and if there is no target word that failed to match, determining the A score as the first score.

In steps 1033 to 1037, each target word is first matched by traversal against each matching word, and the assigned scores of the successfully matched words are summed according to the score correspondence table to obtain the A score. For example, take "information communication institute" as the matching text. Here "information", "communication" and "institute" are the matching words and form the matching word set, and the individual characters of these words are the matching single words and form the matching single word set. Take "standard technical institute" as the target phrase. Here "standard", "technical" and "institute" are the target words and form the target word set, and their individual characters are the target single words and form the target single word set.

Matching the target words "standard", "technical" and "institute" against the matching words "information", "communication" and "institute" gives an A score of 6: the score comes from the match on "institute", and the other words are not matched and contribute no score. If all target words had been matched, the A score would be the first score. In this example, however, "standard" and "technical" are not matched, so steps 1035 to 1037 are performed: the single words of the unmatched target words are matched against each matching single word. None of them matches here, so the B score is 0, and the first score, the sum of the A score and the B score, equals the A score.

Note that a single word is generally assigned 1 or 2, and a single word may also carry a compensation value. Take "institute", whose assigned score is 6: each of its three single words may be assigned 2. Alternatively, two of the single words may be assigned 1 each; their sum is then only 2, which is 4 short of 6, and the last single word carries the difference as a compensation value. Smaller assigned values are chosen in order to widen the score gap, which makes the subsequent matching degree calculation easier.
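
Steps 1033 to 1037 can be sketched in Python as follows. The sketch follows the worked example in matching only the characters of target words that failed at the word level; the function name `first_score`, the two score dictionaries and the Chinese strings (an assumed rendering of the translated example) are illustrative and not taken from the source.

```python
def first_score(target_words: list[str],
                word_scores: dict[str, int],
                single_scores: dict[str, int]) -> int:
    """A score from word-level matches plus B score from the characters
    of any target word that failed to match at the word level."""
    a_score = 0
    unmatched: list[str] = []
    for word in target_words:            # step 1033: traverse target words
        if word in word_scores:
            a_score += word_scores[word]  # step 1034: sum matched word scores
        else:
            unmatched.append(word)        # step 1035: remember failures
    b_score = 0
    for word in unmatched:                # step 1036: fall back to characters
        for ch in word:
            b_score += single_scores.get(ch, 0)
    return a_score + b_score              # step 1037 is the case b_score == 0


# Assumed scores for the matching text of the example.
WORD_SCORES = {"信息": 3, "通信": 3, "研究院": 6}
SINGLE_SCORES = {"信": 1, "息": 2, "通": 2, "研": 2, "究": 2, "院": 2}

print(first_score(["标准", "技术", "研究院"], WORD_SCORES, SINGLE_SCORES))  # 6
```

In the example only the three-character word matches, so the A score is 6 and the B score is 0. A target phrase that shared a single character with the matching text, but not the whole word, would pick up that character's score through the fallback.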

104. Based on a preset character similarity algorithm, performing similarity calculation on the matching text and the target phrase to obtain a second score;

105. and calculating the product of the first score and the second score to obtain the matching degree of the target phrase relative to the matching text, wherein the larger the matching degree is, the more similar the target phrase is to the matching text.

1051. Determining the character length of the matching text to obtain a matching length;

1052. determining the character length of the target phrase to obtain a target length;

1053. calculating the matching length and the target length based on a preset character similarity algorithm to obtain a second score, wherein the character similarity algorithm is a cosine similarity algorithm, and the cosine similarity algorithm has the formula:

[Formula not reproduced in the source text]

wherein X is the matching length, Y is the target length, and the value given by the formula is the second score.

In steps 104 to 1053, after the first score is determined, a similarity is calculated from the character lengths of the matching text and the target phrase. For the "standard technical institute" example above, the second score is 1, because the matching text and the target phrase are both seven-character phrases. The matching degree is then the product of the first score and the second score, and the larger the final result, the higher the similarity.
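
Steps 104 and 105 can be sketched as below. Because the cosine similarity formula is not reproduced in this text, the sketch does not implement it; it substitutes a simple length ratio, min(X, Y) / max(X, Y), purely as a stand-in. That stand-in is an assumption and not the source's formula, although it does agree with the example in giving 1 for two seven-character phrases. The names `length_ratio` and `matching_degree` are illustrative.

```python
from typing import Callable


def length_ratio(x: int, y: int) -> float:
    """Stand-in similarity on two lengths (an assumption, not the source's formula)."""
    if x == 0 or y == 0:
        return 0.0
    return min(x, y) / max(x, y)


def matching_degree(first_score: float,
                    matching_text: str,
                    target_phrase: str,
                    similarity: Callable[[int, int], float] = length_ratio) -> float:
    x = len(matching_text)             # step 1051: matching length X
    y = len(target_phrase)             # step 1052: target length Y
    second_score = similarity(x, y)    # step 1053: second score
    return first_score * second_score  # step 105: product of both scores


# First score 6 from the worked example; both phrases have seven characters.
print(matching_degree(6, "信息通信研究院", "标准技术研究院"))  # 6.0
```

The similarity function is passed in as a parameter so that the actual formula can be plugged in once it is available, without touching the product step.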

In this embodiment, the matching text is split to obtain a matching word set and a matching single word set, and the elements of both sets are assigned scores. The target phrase is split to obtain a target word set and a target single word set, and is matched by traversal against the matching word set and the matching single word set respectively; the first score is calculated from the successful matches based on the score correspondence table. The second score is obtained by applying a similarity algorithm to the matching text and the target phrase. The matching degree is the product of the first score and the second score, and it determines how similar the target phrase is to the matching text. Because the text is deeply deconstructed, matched at the level of the split words and single words, and combined with character similarity, fuzzy matching tracks the semantic relations of the matching text more closely and matching accuracy is improved.

The method for fuzzy matching Chinese characters in the embodiment of the present invention is described above, and the fuzzy matching device for Chinese characters in the embodiment of the present invention is described below, referring to fig. 6, one embodiment of the fuzzy matching device for Chinese characters in the embodiment of the present invention includes:

the phrase disassembling module 201 is configured to obtain a matching text and split the matching text to obtain a matching word set and a matching single word set, where the matching word set includes at least one matching word, and the matching single word set includes at least one matching single word;

the assignment module 202 is configured to assign a score to each matching word and each matching single word, so as to obtain a score correspondence table;

the first score calculating module 203 is configured to match, according to the score correspondence table, the target phrase against each matching word and each matching single word, so as to obtain a first score;

a second score calculating module 204, configured to perform a similarity calculation on the matching text and the target phrase based on a preset character similarity algorithm, to obtain a second score;

and the similarity judging module 205 is configured to calculate a product of the first score and the second score to obtain a matching degree of the target phrase with respect to the matching text, where the greater the matching degree is, the more similar the target phrase is to the matching text.

Another embodiment of the Chinese character fuzzy matching device in the embodiment of the invention comprises:

Further, the assignment module 202 specifically further includes:

when the word obtained by freely combining any N matched single words is the same as any matched word in the matched word set, the sum of the assigned scores of the N matched single words is equal to the assigned score of the corresponding matched word, wherein N is greater than 1, and N is a positive integer.

Further, the first score calculating module 203 may specifically perform:

splitting the target words to obtain a target single word set, wherein the target single word set comprises at least two target single words;

summarizing the assigned scores of the matched words successfully matched based on the score corresponding table to obtain a score A;

Further, the preprocessing module 202 may specifically perform:

respectively carrying out normalization processing on the N cutting pictures to obtain N normalization data;

and calculating the N normalization data and preset RGB data to obtain N prediction data, wherein the data type of the prediction data is floating point type.

Further, the similarity determination module 205 may further specifically perform:

calculating the matching length and the target length based on a preset character similarity algorithm to obtain a second score, wherein the character similarity algorithm is a cosine similarity algorithm, and the cosine similarity algorithm has the formula:

[Formula not reproduced in the source text]

wherein X is the matching length, Y is the target length, and the value given by the formula is the second score.

In this embodiment, the matching text is split to obtain a matching word set and a matching single word set, and the elements of both sets are assigned scores. The target phrase is split to obtain a target word set and a target single word set, and is matched by traversal against the matching word set and the matching single word set respectively; the first score is calculated from the successful matches based on the score correspondence table. The second score is obtained by applying a similarity algorithm to the matching text and the target phrase. The matching degree is the product of the first score and the second score, and it determines how similar the target phrase is to the matching text. Because the text is deeply deconstructed, matched at the level of the split words and single words, and combined with character similarity, fuzzy matching tracks the semantic relations of the text more closely and matching accuracy is further improved.

The above figure 6 describes the Chinese character fuzzy matching device in the embodiment of the present invention in detail from the point of view of modularized functional entities, and the following describes the Chinese character fuzzy matching device in the embodiment of the present invention in detail from the point of view of hardware processing.

FIG. 7 is a schematic structural diagram of a Chinese character fuzzy matching device according to an embodiment of the present invention. The Chinese character fuzzy matching device 300 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPUs) 310, a memory 320, and one or more storage media 330 (for example, one or more mass storage devices) storing application programs 333 or data 332. The memory 320 and the storage medium 330 may provide transitory or persistent storage. The program stored in the storage medium 330 may include one or more modules (not shown), and each module may include a series of instruction operations for the Chinese character fuzzy matching device 300. Further, the processor 310 may be configured to communicate with the storage medium 330 and to execute, on the Chinese character fuzzy matching device 300, the series of instruction operations in the storage medium 330.

The Chinese character fuzzy matching device 300 may also include one or more power supplies 340, one or more wired or wireless network interfaces 350, one or more input/output interfaces 330, and/or one or more operating systems 331, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, etc. It will be appreciated by those skilled in the art that the Chinese character fuzzy matching device structure illustrated in FIG. 7 is not limiting: the device may include more or fewer components than illustrated, may combine certain components, or may arrange the components differently.

The invention also provides a computer readable storage medium, which can be a nonvolatile computer readable storage medium, and can also be a volatile computer readable storage medium, wherein the computer readable storage medium stores instructions which when run on a computer cause the computer to execute the steps of the Chinese character fuzzy matching method.

It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the system or apparatus and unit described above may refer to the corresponding process in the foregoing method embodiment, which is not repeated herein.

The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A Chinese character fuzzy matching method is characterized by comprising the following steps:

2. The Chinese character fuzzy matching method of claim 1, wherein when the word obtained by free combination of any N of the matching single words is identical to any one of the matching words in the matching word set, the sum of the assigned scores of the N matching single words is equal to the assigned score of the corresponding matching word, where N > 1 and N is a positive integer.

3. The Chinese character fuzzy matching method of claim 2, wherein the matching of the target phrase against each of the matching words and each of the matching single words according to the score correspondence table to obtain a first score comprises:

4. The Chinese character fuzzy matching method of claim 3, wherein the matching of the target phrase against each of the matching words and each of the matching single words according to the score correspondence table to obtain a first score comprises:

5. The Chinese character fuzzy matching method of claim 4, wherein the matching of the target phrase against each of the matching words and each of the matching single words according to the score correspondence table to obtain a first score comprises:

6. The fuzzy matching method of Chinese characters of claim 1, wherein the performing a similarity calculation on the matching text and the target phrase based on a preset character similarity algorithm to obtain a second score comprises:

7. The fuzzy matching method of Chinese characters of claim 6, wherein the cosine similarity algorithm has the formula:

[Formula not reproduced in the source text]

wherein X is the matching length, Y is the target length, and the value given by the formula is the second score.

8. The Chinese character fuzzy matching device is characterized by comprising:

and the similarity judging module is used for calculating the product of the first score and the second score to obtain the matching degree of the target phrase relative to the matching text.

9. A Chinese character fuzzy matching apparatus, characterized in that the Chinese character fuzzy matching apparatus comprises: a memory and at least one processor, the memory having instructions stored therein, the memory and the at least one processor being interconnected by a line;

the at least one processor invoking the instructions in the memory to cause the apparatus to perform the Chinese character fuzzy matching method of any of claims 1-7.

10. A computer readable storage medium having a computer program stored thereon, wherein the computer program when executed by a processor implements the Chinese character fuzzy matching method of any one of claims 1-7.