CN115905473B - Full noun fuzzy matching method, device and storage medium - Google Patents


Publication number
CN115905473B
Authority
CN
China
Prior art keywords
matching
text
clause
clauses
word
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211638615.8A
Other languages
Chinese (zh)
Other versions
CN115905473A (en)
Inventor
裴俊枫
秦周
郁彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Wuxi Xiyin Jinke Information Technology Co ltd
Original Assignee
Wuxi Xiyin Jinke Information Technology Co ltd
Application filed by Wuxi Xiyin Jinke Information Technology Co ltd
Priority to CN202211638615.8A
Publication of CN115905473A
Application granted
Publication of CN115905473B


Abstract

The application discloses a full noun fuzzy matching method, a device and a storage medium, relating to the field of text matching. The method performs word segmentation on the text clauses in an original text and the matching clauses in a comparison text using a text word stock, and calculates a weight score for each matching clause; calculates, by polling, the matching degree scores between each text clause and all the matching clauses based on the weight scores, and determines candidate matching clauses; and, when a plurality of candidate matching clauses are matched, determines a unique target matching clause from the candidate matching clauses based on a target matching condition and updates the text word stock and the word segmentation weights according to the candidate matching clauses. By customizing a keyword lexicon and a word segmentation weight table, determining the target matching clause from the matching degree scores between clauses, and updating the content of the text word stock and the word segmentation weights in real time according to the matching result, the scheme improves the precision and accuracy of subsequent matching as well as the efficiency of text matching and auditing for full nouns.

Description

Full noun fuzzy matching method, device and storage medium
Technical Field
The embodiment of the application relates to the field of text matching, in particular to a full-name word fuzzy matching method, a full-name word fuzzy matching device and a storage medium.
Background
Text matching is a directed search and matching method that takes text content as the search target. It includes full word matching, in which the whole text is used as the search term, and non-full word matching, in which the text is split into characters or words that are used as target search terms and a result is returned as long as any one element matches. Of course, the results are usually optimized so that results more closely tied to the elements are presented first. In practical applications, text comparison and matching degree calculation often need to take the actual application scene and context into account. Especially in large-batch text matching, where one-to-one and one-to-n matching are needed, text matching often has to be combined with context. For example, "Jiangsu province xx traffic management department" and "Jiangsu xx traffic management department" differ in characters but express the same meaning; in data management and approval work they should be matched as identical, in line with the result of manual matching. Full noun (exact) matching fails in this case, while fuzzy matching may produce mismatched, erroneous or even non-unique results, which affects auditing efficiency.
In the related art, batch fuzzy matching of text generally either adopts an edit distance fuzzy matching algorithm or uses a bag-of-words model to perform batch similarity matching directly. The edit distance is also called the Levenshtein distance. Unlike the Hamming distance (the number of differing characters at corresponding positions of two equal-length strings), the edit distance allows characters to be inserted and deleted as well as replaced. For two strings of lengths m and n, the time complexity of the algorithm is O(m×n). If the number of texts t is large, the text set must be traversed, the edit distance of each keyword-text pair calculated, and a TOP_K selection then performed. The time complexity is O(m×n×t+log t), so there is a performance problem when t is large. Moreover, since the algorithm simply matches text content and does not take the actual context into account, its results can differ greatly from what the actual context requires.
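To make the cost argument concrete, the following is a minimal Python sketch of the textbook dynamic-programming edit distance together with a brute-force top-k scan. It is the generic algorithm, not code from the application; the function names `levenshtein` and `top_k` are illustrative.

```python
def levenshtein(a, b):
    """Edit distance (insert, delete, substitute) in O(len(a) * len(b)) time."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def top_k(keyword, texts, k):
    """Scan every text (t distance computations) and keep the k closest."""
    return sorted(texts, key=lambda text: levenshtein(keyword, text))[:k]
```

Every query pays one full distance computation per text in the set, which is where the dependence on t comes from.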
The bag-of-words model is a simplified representation used in natural language processing and information retrieval (IR). Under this model, a piece of text (such as a sentence or a document) is represented as a bag of its words, disregarding grammar and word order. The greatest problem of this method is that few samples are available in actual use scenes, so the precision is insufficient, and a certain deviation occurs in the full noun scene, which differs from natural-language context.
Disclosure of Invention
The application provides a full-name fuzzy matching method, a full-name fuzzy matching device and a storage medium, which solve the problem that text comparison calculation matching degree is not high in different application scenes and contexts under the condition of full-name.
In one aspect, the present application provides a full-name word fuzzy matching method, where the method includes:
performing word segmentation on text clauses in an original text and matching clauses in a comparison text through a text word stock, and calculating weight scores of the matching clauses; the original text and the comparison text comprise a plurality of text clauses to be matched and matching clauses, and the weight score is obtained based on word segmentation weight calculation after word segmentation processing;
calculating matching degree scores between the text clause and all the matching clauses based on the weight polling, and determining candidate matching clauses;
when a plurality of candidate matching clauses are matched, determining a unique target matching clause from the candidate matching clauses based on a target matching condition, and updating the text word stock and the word segmentation weight according to the candidate matching clauses.
Specifically, the word segmentation processing is performed on the text clause in the original text and the matching clause in the comparison text through the text word stock, and the weight score of the matching clause is calculated, including:
reading the text word stock, and splitting the text clause and the matching clause according to the word stock content;
calculating word segmentation weight sum of each word segmentation in the matched clause according to a word segmentation weight table corresponding to the text word stock to obtain the weight score S; the word segmentation weight table comprises word segmentation weights of all the words in the text word stock.
Specifically, after determining the weight score of the matching clause, the method further includes:
polling the comparison text for each text clause in turn; when a matching clause whose text content and word order are completely consistent with the text clause exists, determining that matching clause as the target matching clause; otherwise, calculating the matching degree scores of the matching clauses by polling.
Specifically, the calculating the matching degree scores between the text clause and all the matching clauses based on the weight polling, and determining candidate matching clauses includes:
determining the same word segments in the text clause and the matching clause and the corresponding matching weight score S1, and determining the matching degree score P of the text clause and the matching clause according to a matching degree calculation formula; the matching degree calculation formula is as follows:
$$P_i = \frac{S_1}{S + 1}$$
wherein S1 represents the sum of the word segmentation weights of the word segments that are the same in the two clauses, S represents the weight score of the matching clause, and P_i represents the matching degree score of the i-th matching clause polled;
when the candidate matching clause uniquely exceeding the matching degree score threshold exists, determining the candidate matching clause as the target matching clause;
when the candidate matching clause exceeding the matching degree score threshold value does not exist, outputting a null;
when there are a plurality of the candidate matching clauses exceeding a matching degree score threshold, they are determined as the candidate matching clauses.
Specifically, the determining, based on the target matching condition, a unique target matching clause from the candidate matching clauses includes:
sorting the candidate matching clauses according to the magnitude of the matching degree score;
determining the candidate matching clause with the highest matching degree score as the target matching clause, and determining the remaining candidate matching clauses as alternative matching clauses; or
selecting, based on a confirmation instruction of manual auditing, the target matching clause from the candidate matching clauses other than the one with the highest matching degree score, and determining the rest as alternative matching clauses.
Specifically, when the target matching clause is selected from the candidate matching clauses; the updating the text word stock and the word segmentation weight according to the candidate matching clause comprises the following steps:
determining the matching degree score of the target matching clause as a target matching score, and determining the alternative matching clause higher than the target matching score as a refusing clause;
performing difference set operation on the word segmentation split by the rejecting clause and the target matching clause, and determining invalid words in the rejecting clause;
and updating the word segmentation weight table based on the invalid words, and updating the text word stock according to the keywords determined by word stock updating conditions.
Specifically, the updating the word segmentation weight table based on the invalid words includes:
looking up the current word segmentation weight of the invalid word in the word segmentation weight table, and comparing the current word segmentation weight with the lowest weight value;
if the current word segmentation weight value is larger than the lowest weight value, reducing the weight value according to a preset gradient progressive method;
if the current word segmentation weight value is not greater than the lowest weight value, the updating is not carried out.
Specifically, the updating the text word stock according to the keyword determined by the word stock updating condition includes:
performing difference set operation on the word segments split from the target matching clause and the rejecting clause, and determining candidate keywords in the target matching clause;
subtracting the sum of the updated word segmentation weights of the invalid words from the sum of the initialized word segmentation weights of the candidate keywords to obtain a keyword context score;
when the keyword context score is larger than an admission threshold value, determining the candidate keyword as a keyword and adding it to the text word stock;
and when the keyword context score is smaller than an admission threshold value, not updating the text word stock.
Specifically, the text word stock comprises a standard word stock and a custom word stock; the standard vocabulary entries of each field are stored in the standard vocabulary library, and keywords added when the vocabulary library is updated are stored in the custom vocabulary library.
On the other hand, the application provides a full-name word fuzzy matching device, which comprises:
the weight score calculating module is used for carrying out word segmentation processing on the text clauses in the original text and the matching clauses in the comparison text through the text word stock and calculating the weight score of the matching clauses; the original text and the comparison text comprise a plurality of text clauses to be matched and matching clauses, and the weight score is obtained based on word segmentation weight calculation after word segmentation processing;
the matching degree calculation module is used for calculating the matching degree scores between the text clauses and all the matching clauses based on the weight polling and determining candidate matching clauses;
and the updating module is used for determining a unique target matching clause from the candidate matching clauses based on target matching conditions when a plurality of candidate matching clauses are matched, and updating the text word stock and the word segmentation weight according to the candidate matching clauses.
The beneficial effects that this application provided technical scheme brought include at least: the method comprises the steps of carrying out word segmentation and splitting processing on input text clauses and matching clauses through a constructed text word library, calculating the weight score of each matching clause according to word segmentation weight, carrying out polling matching on the selected text clauses, calculating the matching degree score for the text clauses according to the weight scores of the matching clauses, and preliminarily finding out a certain number of similar candidate matching clauses from massive matching texts. Meanwhile, the text word stock and the word segmentation weight are updated aiming at the target matching clause and the rest candidate matching clauses, so that the content of the text word stock and the word segmentation weight can be updated in real time according to the matching result, and the precision and the accuracy of subsequent matching are improved.
Drawings
FIG. 1 is a flowchart of a full-name fuzzy matching method provided by an embodiment of the present application;
FIG. 2 is a flow chart of a full-name fuzzy matching method provided in another embodiment of the present application;
FIG. 3 is an algorithm flow chart of a full-name word fuzzy matching method provided by an embodiment of the present application;
fig. 4 is a block diagram of a full-name word fuzzy matching device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
References herein to "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The technical problems of the bag-of-words model in the related art are described by the following two text examples.
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
Based on the above two texts, the following list can be constructed:
["John","likes","to","watch","movies","also","football","games","Mary","too"]
The list contains 10 different words, and each text can be represented by a vector of length 10 indexed by the positions of the list:
(1)[1,2,1,1,2,0,0,0,1,1]
(2)[1,1,1,1,0,1,1,1,0,0]
Each entry of a vector is the number of times the corresponding word in the list occurs in the text.
For example, the first two entries of the first vector (text one) are 1 and 2. The first entry corresponds to "John", the first word in the list, and its value is 1 because "John" occurs once; the second corresponds to "likes", which occurs twice. This vector representation does not preserve the order of the words in the original sentence. The representation has many successful applications, such as mail filtering.
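The two vectors above can be reproduced with a short Python sketch. The vocabulary order is copied from the list in the text; the regular-expression tokenizer and the name `bag_of_words` are assumptions made for illustration.

```python
import re
from collections import Counter

# Vocabulary in the same order as the list given in the text.
VOCAB = ["John", "likes", "to", "watch", "movies",
         "also", "football", "games", "Mary", "too"]


def bag_of_words(text, vocab=VOCAB):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(re.findall(r"[A-Za-z]+", text))
    return [counts[word] for word in vocab]


vec1 = bag_of_words("John likes to watch movies. Mary likes movies too.")
vec2 = bag_of_words("John also likes to watch football games.")
print(vec1)  # [1, 2, 1, 1, 2, 0, 0, 0, 1, 1]
print(vec2)  # [1, 1, 1, 1, 0, 1, 1, 1, 0, 0]
```

Shuffling the words of a sentence leaves its vector unchanged, which is the loss of word order noted above.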
In the above example, the document vector contains term frequencies, and different approaches are commonly used for term weights in IR and text classification. A common method is tf-idf. The greatest problem of the method is that samples in an actual use scene are less, so that the precision is insufficient, and certain deviation occurs under the full noun scene, which is different from the natural context.
Aiming at the technical problems, the scheme adopts a compatible full-noun fuzzy matching scheme to identify and match similar text contents, and fig. 1 is a flow chart of a full-noun fuzzy matching method provided by the embodiment of the application, and comprises the following steps:
Step 101, word segmentation is performed on the text clauses in the original text and the matching clauses in the comparison text using the text word stock, and the weight scores of the matching clauses are calculated.
The original text and the comparison text each contain a plurality of text clauses, and may be files from a specific field and scene. Word segmentation separates the text into individual word segments based on a text word stock, which can be set up according to the use scene. For example, in the asset inventory and capital verification scene of a rural agricultural bureau, report data of the national rural collective asset inventory and capital verification management system is imported as the original sample, and data of the city-level three-resource operation platform is imported as the comparison sample. The aim is to match a specific text in the original sample with a specific text in the comparison file and establish a one-to-one correspondence, in particular to find the correspondence in massive data in the presence of abbreviations, errors and the like, which facilitates asset verification. For instance, "Jiangsu Province Wuxi ABM Limited Liability Company" found in the original file is matched with "Wuxi ABM Company" in the comparison file, and the two records are optionally displayed together as comparison information, so that no manual searching and matching is needed, manual comparison and verification are easier, and verification work is faster. The text word stock here consists of words from the asset inventory management domain. The weight score is calculated over the word segments of a clause: the segments of the clause are determined, each is assigned a weight, and the sum of the segment weights is the weight score of the matching clause.
Step 102, the matching degree scores between the text clause and all the matching clauses are calculated by polling based on the weight scores, and candidate matching clauses are determined.
Extracting text clauses in the original text one by one, matching with all matching clauses in the comparison file according to a polling mode of 1:n, generating a matching degree score of each matching clause, and then selecting candidate matching clauses according to the matching degree score.
Step 103, when a plurality of candidate matching clauses are matched, a unique target matching clause is determined from the candidate matching clauses based on the target matching condition, and the text word stock and the word segmentation weights are updated according to the candidate matching clauses.
For example, for text A in the original text, suppose polling determines that text B and text C, which match text A, are candidate matching clauses; a unique target matching clause must then be determined from them according to the target matching condition. After the target matching clause is determined, the text word stock and the word segmentation weights are updated, for example by adding new keywords for matching and recognition or by lowering or raising word segmentation weights, so that matching accuracy improves when subsequent clauses are matched.
It should be noted that, when the matching degree score cannot be determined or there is no candidate matching clause, the result is displayed as empty, and when the result contains only one candidate matching clause, that clause is displayed as the unique target matching clause.
In summary, the word splitting process is performed on the input text clauses and the matching clauses through the constructed text word library, and the weight score of each matching clause is calculated according to the word splitting weight, so that polling matching can be performed on the selected text clauses, the matching degree score of the text clause is calculated according to the weight scores of the matching clauses, a certain number of similar candidate matching clauses can be initially found out from massive matching texts, and the matching clauses are required to be further analyzed under the condition of more than one candidate matching clauses because of the full-term fuzzy matching method, and unique target matching clauses are determined from the candidate matching clauses according to target matching conditions. Meanwhile, the text word stock and the word segmentation weight are updated aiming at the target matching clause and the rest candidate matching clauses, so that the content of the text word stock and the word segmentation weight can be updated in real time according to the matching result, and the precision and the accuracy of subsequent matching are improved.
Fig. 2 is a flowchart of a full-name fuzzy matching method according to another embodiment of the present application, including the following steps:
step 201, reading a text word stock, and splitting text clauses and matching clauses according to word stock contents.
The splitting step mainly depends on the text word stock, which in this scheme is divided into a standard word stock and a keyword word stock, as shown in fig. 3. The standard word stock stores the standard entries of each field and is built for a specific use scene or field; for example, for the asset inventory and capital verification scene of the rural agricultural bureau it is built only from the data content of the rural agricultural bureaus. The keyword word stock is set up to improve matching precision and accuracy: based on the item correlation of each match, words that are matched with high frequency may be recorded in the keyword word stock. Both the standard word stock and the keyword word stock are maintained by system maintenance personnel, who can insert proper names manually.
Step 202, calculating word segmentation weight sum of each word segmentation in the matched clause according to the word segmentation weight table corresponding to the text word stock to obtain a weight score S.
As shown in fig. 3, a word segmentation weight table is provided for the established keyword word stock and standard word stock, recording the weight value, namely the word segmentation weight, of every word in them. For example, "John likes to watch movies. Mary likes movies too." is split into word segments (where a specific conjunction or verb is present, whether to split it out as a segment is chosen according to the actual situation; the same applies to Chinese). Each word segment is matched to its corresponding word segmentation weight. The weight score of the matching clause is thus the sum of the weights of its individual word segments.
In a possible implementation, a configuration item storing the parameter information is set for the matching scheme. The default initialization value of the word segmentation weight is 100, and after updating in the subsequent matching process the default word segmentation weight of all keywords in the keyword lexicon is 200, which increases the importance of keywords within a clause.
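Steps 201 and 202 can be sketched in Python as follows. The application does not say how clauses are split against the word stock, so forward maximum matching is used here purely as an assumption; the names `segment`, `weight_score` and `unknown_weight`, the example entries and the weights in the table are likewise illustrative.

```python
def segment(clause, lexicon):
    """Split a clause by forward maximum matching against the lexicon (assumed strategy)."""
    entries = sorted(lexicon, key=len, reverse=True)  # try longest entries first
    segments, i = [], 0
    while i < len(clause):
        if clause[i].isspace():
            i += 1
            continue
        for entry in entries:
            if clause.startswith(entry, i):
                segments.append(entry)
                i += len(entry)
                break
        else:
            # No lexicon entry starts here: keep the text up to the next space.
            j = i
            while j < len(clause) and not clause[j].isspace():
                j += 1
            segments.append(clause[i:j])
            i = j
    return segments


def weight_score(segments, weight_table, unknown_weight=20):
    """Weight score S: the sum of the word segmentation weights of all segments."""
    return sum(weight_table.get(seg, unknown_weight) for seg in segments)


# Hypothetical weight table: one keyword with a raised weight, the rest standard entries.
weight_table = {"Jiangsu Province": 100, "Wuxi": 100, "ABM": 200,
                "Limited Liability Company": 100, "Company": 100}

segments = segment("Wuxi ABM Company", weight_table)
S = weight_score(segments, weight_table)
print(segments, S)  # ['Wuxi', 'ABM', 'Company'] 400
```

Giving out-of-lexicon segments a low fallback weight is a design choice of this sketch so that unknown text contributes little to S.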
Step 203, polling the comparison text sequentially based on the text clauses; when a matching clause whose text content and word order are completely consistent with the text clause exists, determining it as the target matching clause; otherwise, polling to calculate the matching degree scores of the matching clauses.
The text clauses in the original text are selected in turn and matched against all the matching clauses in the comparison text. In this step the whole text clause is used as the search term to find out whether a matching clause exists whose text content and word order are completely consistent with it, namely a full word matching search. When a completely consistent result is found, it is taken as the target matching clause, processing of the current clause ends, and polling continues with the next text clause. Of course, this is not the case the scheme focuses on; the focus is text matching where the two texts are not identical, which requires calculating a matching degree score against each matching clause.
Step 204, determining the same word and the corresponding matching weight score in the text clause and the matching clause, and determining the matching degree score of the text clause and the matching clause according to the matching degree calculation formula.
The matching degree calculation formula is as follows:
$$P_i = \frac{S_1}{S + 1}$$
wherein S1 represents the sum of the word segmentation weights of the word segments that are the same in the two clauses, S represents the weight score of the matching clause, and P_i represents the matching degree score of the i-th matching clause polled.
Put simply, the sum of the weights of the word segments shared by the two clauses, S1, is the numerator, and the weight score S of the whole matching clause plus 1 is the denominator. The purpose of adding 1 is to keep the score below 100% when the two clauses contain identical word segments in a different order. A matching degree score of 100% corresponds to the full word match of step 203.
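As a worked illustration of the score described above (shared-segment weight S1 over the matching clause's weight score S plus 1), here is a small Python sketch. The function name `match_score`, the `default_weight` parameter and the example segments and weights are assumptions, and treating shared segments as a set intersection is one possible reading.

```python
def match_score(text_segments, match_segments, weight_table, default_weight=100):
    """Matching degree score P = S1 / (S + 1) for one text clause and one matching clause."""
    def weight(seg):
        return weight_table.get(seg, default_weight)

    shared = set(text_segments) & set(match_segments)
    s1 = sum(weight(seg) for seg in match_segments if seg in shared)  # shared weight S1
    s = sum(weight(seg) for seg in match_segments)                    # weight score S
    return s1 / (s + 1)


weights = {"ABM": 200}  # hypothetical keyword weight; other segments use the default
text = ["Jiangsu Province", "Wuxi", "ABM", "Limited Liability Company"]
match = ["Wuxi", "ABM", "Company"]
p = match_score(text, match, weights)  # 300 / 401, roughly 0.748
```

With the same segments in a different order, S1 equals S and the score is S / (S + 1), which stays just below 1.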
Step 205, when there is a candidate matching clause that is unique beyond the matching score threshold, it is determined to be the target matching clause.
All the polled matching clauses have matching degree scores, and the scheme is provided with a matching degree score threshold value, and only if the matching degree score threshold value (for example, 80 percent) is exceeded, the matching clauses can be used as candidate matching clauses.
When the calculation result has only one candidate matching clause, the target matching clause can be directly determined.
Step 206, outputting null when there is no candidate matching clause exceeding the matching degree score threshold.
Step 207, when there are a plurality of candidate matching clauses exceeding the matching degree score threshold, they are all determined as candidate matching clauses.
The scheme focuses on the situation that a plurality of candidate matching clauses exist, and in this case, the candidate matching clauses need to be determined first and then screened for a second time.
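The three outcomes of steps 205 to 207 can be sketched as a single function over the polled scores. This is a minimal sketch: the name `classify`, the return shape and the use of a strict "greater than" comparison are assumptions; the 0.8 threshold follows the 80 percent example in the text.

```python
def classify(scores, threshold=0.8):
    """Map polled scores {matching clause: P_i} to (target, candidates)."""
    candidates = sorted(
        ((clause, p) for clause, p in scores.items() if p > threshold),
        key=lambda item: item[1], reverse=True)
    if not candidates:
        return None, []               # step 206: output null
    if len(candidates) == 1:
        return candidates[0][0], []   # step 205: unique target matching clause
    return None, candidates           # step 207: needs secondary screening


print(classify({"B": 0.9, "C": 0.5}))    # ('B', [])
print(classify({"B": 0.85, "C": 0.95}))  # (None, [('C', 0.95), ('B', 0.85)])
```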
Step 208, determining a unique target matching clause from the candidate matching clauses based on the target matching condition.
The target matching condition is determined according to the actual situation. For example, in one possible implementation, the candidate matching clauses are sorted by matching degree score, the candidate matching clause with the highest score is determined as the target matching clause, and the remaining candidate matching clauses are determined as alternative matching clauses. This approach suits automatic recognition in the later period, once training on large amounts of data has stabilized: the target matching clause is determined automatically and displayed to the operator on the interface.
In other embodiments, all candidate matching clauses may first be displayed on the interface, and, based on a confirmation instruction of manual auditing, the target matching clause is selected from the candidate matching clauses other than the one with the highest matching degree score, with the rest determined as alternative matching clauses. Because of differences in grammar and context, the scheme can have accuracy problems in the early stage of use, so manual confirmation and auditing are required. When the candidate matching clause with the highest matching degree score is not the one selected during manual auditing (that is, the score does not agree with the actual selection, for example the clause with the second or third highest score is selected as the target), this indicates that the word stock and the word segmentation weight table are distorted or not yet fully trained. In that case the background needs to update the text word stock and the word segmentation weights according to the operation.
Step 209, determining the matching degree score of the target matching clause as the target matching score, and determining each alternative matching clause whose score is higher than the target matching score as a rejecting clause.
When a clause whose score does not rank first in the candidate list is selected as the target matching clause, the remaining clauses are confirmed as alternative clauses. The alternative clauses whose scores are higher than that of the target matching clause were rejected as not meeting the target matching condition, so the word stock and the weight table need to be updated according to these rejecting clauses.
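Steps 208 and 209 can be sketched together: pick the target either automatically or from a manual choice, then mark the higher-scoring alternatives as rejecting clauses. The function name `select_target`, the `manual_choice` parameter and the return shape are assumptions made for illustration.

```python
def select_target(candidates, manual_choice=None):
    """candidates: list of (clause, score). Returns (target, alternatives, rejecting)."""
    ranked = sorted(candidates, key=lambda item: item[1], reverse=True)
    if manual_choice is None:
        target, target_score = ranked[0]   # automatic: highest score wins
    else:
        # manual audit: the reviewer's pick becomes the target
        target, target_score = next(item for item in ranked if item[0] == manual_choice)
    alternatives = [(c, p) for c, p in ranked if c != target]
    # Alternatives that outscored the chosen target are the rejecting clauses.
    rejecting = [c for c, p in alternatives if p > target_score]
    return target, [c for c, _ in alternatives], rejecting


candidates = [("B", 0.85), ("C", 0.95), ("D", 0.90)]
print(select_target(candidates))                     # ('C', ['D', 'B'], [])
print(select_target(candidates, manual_choice="D"))  # ('D', ['C', 'B'], ['C'])
```

An empty rejecting list means the scores agreed with the selection, so no weight update is triggered.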
Step 210, performing difference set operation on the word split between the rejecting clause and the target matching clause, and determining invalid words in the rejecting clause.
The difference set operation works on the word segment sets obtained by splitting the rejecting clause (reject item) and the target matching clause (matching item), defined as word set A (reject item) and word set B (matching item). The difference set of set A and set B is all the word segments that belong to A and not to B, called the invalid words. Invalid words are special word segments or problematic word groups in the text whose word segmentation weights are too large; they inflate the weight score when summed, which distorts the result.
Step 211, updating the word segmentation weight table based on the invalid words.
The "distortion" is handled by reducing the word segmentation weight of the invalid words. In one possible implementation, the word segmentation weight in the standard word library is 100 by default. The weight cannot be reduced without limit, so a lowest weight value (for example, 20 by default) needs to be set; before a weight is reduced, the current word segmentation weight of the invalid word is looked up in the word segmentation weight table and compared with the lowest weight value.
If the current word segmentation weight value is larger than the lowest weight value, the weight value is reduced according to a preset gradient progressive method.
If the current word segmentation weight value is not greater than the lowest weight value, the updating is not carried out.
The gradient decreasing method may be set according to the actual situation, for example, setting the decreasing gradient to 1.
It should be noted here that for invalid words that may be matched out of the word stock, the default lowest weight value is set.
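The floor-and-gradient rule of step 211 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the function name `decrease_invalid_word_weights` and the dictionary form of the weight table are hypothetical, and giving an invalid word that is missing from the table the lowest weight is only one reading of the note above. The values 20 and 1 are the examples given in the text.

```python
LOWEST_WEIGHT = 20  # example lowest weight value from the text
GRADIENT = 1        # example decreasing gradient from the text


def decrease_invalid_word_weights(weight_table, invalid_words,
                                  gradient=GRADIENT, lowest=LOWEST_WEIGHT):
    """Return a copy of weight_table with the invalid words' weights reduced."""
    updated = dict(weight_table)
    for word in invalid_words:
        if word not in updated:
            # Assumption: a word outside the word stock gets the lowest weight.
            updated[word] = lowest
        elif updated[word] > lowest:
            # Reduce by one gradient step. Clamping at the floor is an assumption
            # that only matters when the step would overshoot the lowest value.
            updated[word] = max(lowest, updated[word] - gradient)
        # Otherwise the weight is already at or below the floor: no update.
    return updated


weights = {"share limited": 100, "company": 20}
print(decrease_invalid_word_weights(weights, ["share limited", "company", "DEF"]))
# {'share limited': 99, 'company': 20, 'DEF': 20}
```

Returning a copy rather than mutating the table in place is a design choice of this sketch; it keeps the pre-update weights available for comparison.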
Step 212, performing a difference set operation on the word segments split from the target matching clause and from the rejecting clause, and determining candidate keywords in the target matching clause.
Following step 210, the difference set of set B and set A consists of the word segments that belong to B and not to A, i.e. the candidate keywords. Candidate keywords appearing in the written language may be the key phrases that characterize the target matching clause. For example, if the text of set B is "Jiangsu province ABC limited liability company" and the text of set A is "Jiangsu province DEF share limited company", then the invalid words are "DEF" and "share limited", and the candidate keywords are "ABC" and "limited liability".
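The two difference sets of steps 210 and 212 can be reproduced with ordinary set operations. The sketch below assumes a particular segmentation of the two example clauses (the text does not list the segments explicitly), chosen so that it is consistent with the invalid words and candidate keywords stated above.

```python
# Hypothetical segmentation of the two example clauses.
reject_segments = {"Jiangsu province", "DEF", "share limited", "company"}      # set A
match_segments = {"Jiangsu province", "ABC", "limited liability", "company"}   # set B

# Segments in A but not in B: invalid words of the rejecting clause.
invalid_words = reject_segments - match_segments
# Segments in B but not in A: candidate keywords of the target matching clause.
candidate_keywords = match_segments - reject_segments

print(sorted(invalid_words))       # ['DEF', 'share limited']
print(sorted(candidate_keywords))  # ['ABC', 'limited liability']
```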
Step 213, subtracting the sum of the updated word segmentation weights of the invalid words from the sum of the initialized word segmentation weights of the candidate keywords, so as to obtain the keyword context score.
Here the candidate keywords undergo a second screening to determine the keywords that actually decide the target match item. All candidate keywords are summed at their initialized word segmentation weight, i.e. each is counted as 100, and the sum of the word segmentation weights of the invalid words after the preceding gradient update is subtracted from this sum. The difference is the keyword context score of the candidate keywords.
Step 214, updating the text word stock based on the keyword context score.
When the keyword context score is greater than the admission threshold, the candidate keywords are determined to be keywords and added to the text word stock. When the keyword context score is less than the admission threshold, the keyword word stock in the text word stock is not updated. The admission threshold is set according to the actual situation, for example 200, in which case only candidate keywords whose keyword context score is greater than 200 can be determined as keywords.
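Steps 213 and 214 amount to one subtraction and one threshold comparison. The sketch below is illustrative only: the function names and the set representation of the keyword word stock are hypothetical, the invalid words are assumed to be present in the already updated weight table, and a score exactly equal to the threshold is treated as not admitted because the text only specifies the "greater than" and "less than" cases.

```python
INITIAL_WEIGHT = 100       # initialized weight counted per candidate keyword (from the text)
ADMISSION_THRESHOLD = 200  # example admission threshold (from the text)


def keyword_context_score(candidate_keywords, invalid_words, weight_table):
    """Sum of initialized candidate weights minus sum of updated invalid-word weights."""
    candidate_sum = INITIAL_WEIGHT * len(candidate_keywords)
    invalid_sum = sum(weight_table[word] for word in invalid_words)
    return candidate_sum - invalid_sum


def update_keyword_stock(keyword_stock, candidate_keywords, invalid_words,
                         weight_table, threshold=ADMISSION_THRESHOLD):
    """Return the keyword word stock, extended only if the score passes the threshold."""
    score = keyword_context_score(candidate_keywords, invalid_words, weight_table)
    if score > threshold:
        return set(keyword_stock) | set(candidate_keywords)
    return set(keyword_stock)  # unchanged copy


updated_weights = {"X": 40}  # hypothetical weight of one invalid word after decay
stock = update_keyword_stock(set(), ["A", "B", "C"], ["X"], updated_weights)
print(sorted(stock))  # ['A', 'B', 'C'] because 300 - 40 = 260 > 200
```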
Take an asset inventory and capital verification scenario of a rural agriculture bureau as an example. The report data of the national rural collective asset inventory and capital verification management system is imported as the original sample, and the data of the municipal-level three-resource operation platform is imported as the comparison sample. The data are compared and trained through the above flow. After training on 800,000 samples, the first-pass matching accuracy exceeds 95%. Once the accuracy reaches 95%, all manual review can be removed and the process runs fully automatically.
In summary, in the embodiment of the present application, the text word stock is formed from the standard word stock and the keyword word stock, the input text clauses are split into word segments, and a weight score is calculated for each matching clause based on the word segmentation weight table corresponding to the word stock. Each text clause can then be matched by polling: the matching degree scores of all matching clauses are calculated, and one or more candidate matching clauses are screened out. In the initial stage of training, the target matching clause is selected by manual review, which triggers the background flow for updating the word stock and the word segmentation weights.
The word segmentation weights are updated based on a difference set operation between the target matching clause and a rejecting clause that scores higher than the target matching clause. The result of this difference set operation is the invalid words, whose weights are reduced according to a preset gradient-decreasing method. To update the word stock, the difference set of the word segments of the target matching clause and those of the rejecting clause is computed to determine the candidate keywords; the sum of the updated word segmentation weights of the invalid words is subtracted from the sum of the initialized word segmentation weights of the candidate keywords to obtain the keyword context score; and the keywords are then determined by comparison with the admission threshold and added to the word stock. Because the background update flow is triggered by manual review, high-precision matching can be achieved in big-data iteration scenarios, which improves text matching over massive data as well as search and matching precision.
Fig. 3 is an algorithm flow chart of the full noun fuzzy matching method provided in an embodiment of the present application. The method comprises the following steps:
1. Import the original text and the comparison text to be compared.
2. Perform word segmentation on the original text and the comparison text.
3. Read the word segmentation weights corresponding to all word segments.
4. Check whether the two clauses match completely. If they do, jump to step 7; if not, jump to step 5. (A complete match means that the two texts contain exactly the same word segments in exactly the same order.)
5. Poll 1-to-n to calculate the matching degree score of each matching clause (matched-part word segmentation weight / (matching-text weight score + 1)). The matched part is the intersection of the two sets of word segments of the original clause and the matching clause. The purpose of the +1 is to set a complete match apart: the set intersection does not distinguish the order in which the word segments appear, so even when the two sets are identical the score is not 100%.
6. Judge whether there are candidate matching clauses above the matching degree score threshold. If there are, jump to step 7; if not, exit the loop.
7. Through manual intervention, confirm one target match item, or confirm that none of the items match, and end the loop.
Begin the invalid word recognition flow
8. Judge whether there is more than one candidate matching clause, the matching degree score of the match item is lower than that of other items, and the score is not 100% (this identifies cases where, with multiple word segments matched, the calculated score does not agree with the actual match). If so, jump to step 9; otherwise, end the current flow.
9. Take the difference set of the word segments of each reject item that scores higher than the match item and the word segments of the match item, and update the word segmentation weight table. End the current flow.
Begin the keyword recognition flow
10. Obtain the candidate keyword table from the difference set of the match item's word segments and the reject item's word segments.
11. Perform a second calculation on the word segmentation weights of the candidate keywords (the total score minus the invalid word score is the word segmentation weight in the keyword).
12. Judge whether the value is higher than the keyword admission threshold. If not, end the current flow; if so, record the keywords in the keyword word stock. (Finally, entries can be added from the keyword word-frequency stock to the custom word stock either automatically according to word frequency or after manual review.)
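Steps 4 to 6 of the flow can be sketched as below. This is a hedged illustration rather than the method as claimed: it reads the score of step 5 as S1 / (S + 1), where S1 is the weight sum of the shared word segments and S is the weight score of the matching clause, and it returns 1.0 for a complete match. The function names, the fallback weight for unknown segments, the clause labels and the threshold value 0.6 are all assumptions made for the example.

```python
DEFAULT_WEIGHT = 100  # default standard word stock weight mentioned in the text


def weight_score(segments, weight_table):
    """Sum the weights of the distinct segments (unknown ones use the default: assumption)."""
    return sum(weight_table.get(seg, DEFAULT_WEIGHT) for seg in set(segments))


def matching_degree(text_segments, match_segments, weight_table):
    """Step 4: complete match; step 5: shared weight over (clause weight score + 1)."""
    if list(text_segments) == list(match_segments):
        return 1.0  # same segments in the same order
    shared = set(text_segments) & set(match_segments)
    s1 = weight_score(shared, weight_table)
    s = weight_score(match_segments, weight_table)
    return s1 / (s + 1)  # stays below 1.0 even when the two sets are identical


def candidate_clauses(text_segments, matching_clauses, weight_table, threshold):
    """Step 6: keep clauses scoring above the threshold, best first."""
    scored = [(name, matching_degree(text_segments, segs, weight_table))
              for name, segs in matching_clauses.items()]
    kept = [item for item in scored if item[1] > threshold]
    return sorted(kept, key=lambda item: item[1], reverse=True)


text = ["Jiangsu province", "ABC", "limited liability", "company"]
clauses = {
    "reordered": ["ABC", "Jiangsu province", "limited liability", "company"],
    "other": ["Jiangsu province", "DEF", "share limited", "company"],
}
for name, score in candidate_clauses(text, clauses, {}, threshold=0.6):
    print(name, round(score, 4))  # reordered 0.9975
```

With every weight at the default of 100, the reordered clause scores 400/401 and the other clause scores 200/401, so only the first passes the assumed threshold, and neither is mistaken for a complete match.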
Fig. 4 is a block diagram of a full noun fuzzy matching device provided in an embodiment of the present application, which includes the following structures:
a weight score calculating module 401, configured to perform word segmentation on the text clauses in an original text and the matching clauses in a comparison text through a text word stock, and to calculate the weight scores of the matching clauses; the original text and the comparison text respectively comprise a plurality of text clauses to be matched and a plurality of matching clauses, and each weight score is calculated from the word segmentation weights obtained after word segmentation;
a matching degree calculation module 402, configured to calculate, by polling and based on the weight scores, the matching degree scores between the text clause and all the matching clauses, and to determine candidate matching clauses;
and an updating module 403, configured to determine, when a plurality of candidate matching clauses are matched, a unique target matching clause from the candidate matching clauses based on a target matching condition, and to update the text word stock and the word segmentation weights according to the candidate matching clauses.
In addition, the application further provides a computer readable storage medium on which program instructions are stored; when executed by a processor, the program instructions implement the full noun fuzzy matching method described above.
The foregoing describes preferred embodiments of the present invention; it is to be understood that the invention is not limited to the specific embodiments described above, and that devices and structures not described in detail are to be understood as being implemented in a manner common in the art; any person skilled in the art may make many possible variations and modifications, or adapt the embodiments into equivalent embodiments, without departing from the technical solution of the present invention, and these do not affect the essential content of the present invention; therefore, any simple modification, equivalent variation and modification of the above embodiments according to the technical substance of the present invention still falls within the scope of the technical solution of the present invention.

Claims (8)

1. A full noun fuzzy matching method, the method comprising:
performing word segmentation on text clauses in an original text and matching clauses in a comparison text through a text word stock, and calculating weight scores of the matching clauses; the original text and the comparison text comprise a plurality of text clauses to be matched and matching clauses; reading the text word stock, and splitting the text clause and the matching clause according to the word stock content;
calculating word segmentation weight sum of each word segmentation in the matched clause according to a word segmentation weight table corresponding to the text word stock to obtain the weight score S; the word segmentation weight table comprises word segmentation weights of all word segmentation in the text word stock;
calculating matching degree scores between the text clause and all the matching clauses based on the weight polling, and determining candidate matching clauses; the method specifically comprises the steps of determining the same word and the corresponding matching weight score S1 in the text clause and the matching clause, and determining the matching degree score P of the text clause and the matching clause according to a matching degree calculation formula; the matching degree calculation formula is as follows:
wherein S1 represents the sum of the word segmentation weights of the same word segmentation part matched in the two clauses,a matching score representing the ith matching clause of the poll;
when exactly one candidate matching clause exceeds the matching degree score threshold, determining that candidate matching clause as the target matching clause;
when no candidate matching clause exceeds the matching degree score threshold, outputting a null result;
when a plurality of candidate matching clauses exceed the matching degree score threshold, determining all of them as candidate matching clauses;
when a plurality of candidate matching clauses are matched, determining a unique target matching clause from the candidate matching clauses based on a target matching condition, and updating the text word stock and the word segmentation weight according to the candidate matching clauses.
2. The method of claim 1, wherein after determining the weight score for the matching clause, further comprising:
comparing the text clause with the matching clauses one by one by polling; when there is a matching clause whose text content and word order are completely consistent with the text clause, determining that matching clause as the target matching clause; otherwise, calculating the matching degree scores of the matching clauses by polling.
3. The method of claim 2, wherein the determining a unique target matching clause from the candidate matching clauses based on target matching conditions comprises:
sorting the candidate matching clauses according to the magnitude of the matching degree score;
determining the candidate matching clause with the highest matching degree score as the target matching clause, and determining the remaining candidate matching clauses as alternative matching clauses; or
selecting, based on a confirmation instruction from manual review, the target matching clause from the candidate matching clauses other than the one with the highest matching degree score, and determining the rest as alternative matching clauses.
4. The method of claim 3, wherein, when the target matching clause is selected from the candidate matching clauses, the updating the text word stock and the word segmentation weight according to the candidate matching clauses comprises:
determining the matching degree score of the target matching clause as a target matching score, and determining each alternative matching clause that scores higher than the target matching score as a rejecting clause;
performing a difference set operation on the word segments split from the rejecting clause and from the target matching clause, and determining invalid words in the rejecting clause;
and updating the word segmentation weight table based on the invalid words, and updating the text word stock according to the keywords determined by word stock updating conditions.
5. The method of claim 4, wherein the updating the segmentation weight table based on the invalid word comprises:
matching the current word segmentation weight of the invalid word from the word segmentation weight table, and comparing the current word segmentation weight with the lowest weight value;
if the current word segmentation weight value is larger than the lowest weight value, reducing the weight value according to a preset gradient-decreasing method;
if the current word segmentation weight value is not greater than the lowest weight value, the updating is not carried out.
6. The method of claim 4, wherein the updating the text word stock according to the keywords determined by word stock updating conditions comprises:
performing a difference set operation on the word segments split from the target matching clause and from the rejecting clause, and determining candidate keywords in the target matching clause;
subtracting the sum of the updated word segmentation weight values of the invalid words from the sum of the initialized word segmentation weight values of the candidate keywords, so as to obtain a keyword context score;
when the keyword context score is larger than an admission threshold value, determining the candidate keywords as keywords and adding them into the text word stock;
and when the keyword context score is smaller than the admission threshold value, not updating the text word stock.
7. The method of claim 1, wherein the text word stock comprises a standard word stock and a keyword word stock; the standard vocabulary entries of each field are stored in the standard vocabulary library, and keywords added when the vocabulary library is updated are stored in the keyword vocabulary library.
8. A full noun fuzzy matching device, comprising:
the weight score calculating module is used for carrying out word segmentation processing on the text clauses in the original text and the matching clauses in the comparison text through the text word stock and calculating the weight score of the matching clauses; the original text and the comparison text comprise a plurality of text clauses to be matched and matching clauses; reading the text word stock, and splitting the text clause and the matching clause according to the word stock content;
calculating word segmentation weight sum of each word segmentation in the matched clause according to a word segmentation weight table corresponding to the text word stock to obtain the weight score S; the word segmentation weight table comprises word segmentation weights of all word segmentation in the text word stock;
the matching degree calculation module is used for calculating the matching degree scores between the text clauses and all the matching clauses based on the weight polling and determining candidate matching clauses; the method specifically comprises the steps of determining the same word and the corresponding matching weight score S1 in the text clause and the matching clause, and determining the matching degree score P of the text clause and the matching clause according to a matching degree calculation formula; the matching degree calculation formula is as follows:
wherein S1 represents the sum of the word segmentation weights of the same word segmentation part matched in the two clauses,a matching score representing the ith matching clause of the poll;
when exactly one candidate matching clause exceeds the matching degree score threshold, determining that candidate matching clause as the target matching clause;
when no candidate matching clause exceeds the matching degree score threshold, outputting a null result;
when a plurality of candidate matching clauses exceed the matching degree score threshold, determining all of them as candidate matching clauses;
and the updating module is used for determining a unique target matching clause from the candidate matching clauses based on target matching conditions when a plurality of candidate matching clauses are matched, and updating the text word stock and the word segmentation weight according to the candidate matching clauses.
CN202211638615.8A 2022-12-20 2022-12-20 Full noun fuzzy matching method, device and storage medium Active CN115905473B (en)

Publications (2)

Publication Number Publication Date
CN115905473A CN115905473A (en) 2023-04-04
CN115905473B true CN115905473B (en) 2024-03-05

Family

ID=86481664

Country Status (1)

Country Link
CN (1) CN115905473B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102955772A (en) * 2011-08-17 2013-03-06 北京百度网讯科技有限公司 Similarity computing method and similarity computing device on basis of semanteme
CN112883730A (en) * 2021-03-25 2021-06-01 平安国际智慧城市科技股份有限公司 Similar text matching method and device, electronic equipment and storage medium
CN113268986A (en) * 2021-05-24 2021-08-17 交通银行股份有限公司 Unit name matching and searching method and device based on fuzzy matching algorithm

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110020422B (en) * 2018-11-26 2020-08-04 阿里巴巴集团控股有限公司 Feature word determining method and device and server

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant