CN115964457A

CN115964457A - Fuzzy matching method for document character string codes

Info

Publication number: CN115964457A
Application number: CN202111192730.2A
Authority: CN
Inventors: 姚昊; 刘忠良; 杨沥铭; 任宇阳; 尚鑫鑫; 葛旭阳; 楼宝川; 潘炼; 汤奔; 杜梦娟
Original assignee: CNNC Nuclear Power Operation Management Co Ltd
Current assignee: CNNC Nuclear Power Operation Management Co Ltd
Priority date: 2021-10-13
Filing date: 2021-10-13
Publication date: 2023-04-14

Abstract

The invention relates to the technical field of data processing, and in particular discloses a fuzzy matching method for document character string codes. The method comprises the following steps: constructing a labeled character string coding information base; acquiring document character string coding information, and preprocessing it and performing feature selection to form a feature set; performing feature extraction on the feature items in the feature set to construct a coding vector; constructing a support vector machine classifier, training it with the coding vectors, and obtaining classification result labels for the document codes; when performing fuzzy matching on document character strings, dividing the character strings to be queried and adding indexes; and, when a character string code is queried, performing length filtering and matching filtering, and adding the matching character strings to a result set. The method improves text classification efficiency and classification accuracy, ensures that paragraphs of different lengths do not affect the matching result, and requires fewer edit distance verification operations.

Description

Fuzzy matching method for document character string codes

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a fuzzy matching method for document character string codes.

Background

At present, different texts need to be classified. Because the volume of information is huge and the same or similar information is presented in different forms, text information cannot be processed accurately and quickly, so Word documents need to be classified for work to proceed in a normal and orderly manner. Existing classification methods are inefficient and have low classification accuracy, and calculation errors can easily disrupt normal work.

There are many studies on the fuzzy query problem for character strings, and most of them are based on a filter-and-verify framework. In the filtering stage, a threshold t is used as a filter that removes most of the dissimilar character strings and yields a candidate set. In the verification stage, the actual edit distance between each character string in the candidate set and the query string is calculated to obtain the result set. Edit distance measures the degree of similarity between character strings, but it is computed by dynamic programming, so calculating the edit distance for every pair of character strings in a data set to decide whether they match incurs a huge cost. Existing methods also suffer from complex calculation and an excessive number of edit distance verifications.
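To make the cost of the verification stage concrete, the following is a minimal sketch of the standard Levenshtein edit distance computed by dynamic programming. The function name `edit_distance` is illustrative and does not come from the patent; the patent does not state which edit distance variant it uses, so unit-cost insertion, deletion and substitution are assumed here.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, keeping only two rows.

    Assumption: unit cost for insertion, deletion and substitution.
    """
    if len(a) < len(b):
        a, b = b, a  # iterate the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distance from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]
```

Each call fills a table of size |a| x |b|, which is why running it on every pair of strings in a data set is expensive and why a filtering stage is applied first.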

Disclosure of Invention

The invention aims to provide a fuzzy matching method for document character string codes that solves the problems of low text classification efficiency and low classification accuracy in the prior art, ensures that paragraphs of different lengths do not affect the matching result, and reduces the number of edit distance verification operations.

The technical scheme of the invention is as follows: a fuzzy matching method for document character string codes, which specifically comprises the following steps:

constructing a character string coding information base with labels;

acquiring document character string coding information, and performing preprocessing and feature selection on the document character string coding information to form a feature set;

extracting the features of the feature items in the feature set to construct a coding vector;

constructing a support vector machine classifier, training the support vector machine through the coding vector and obtaining a classification result label of the document coding;

when fuzzy matching is carried out on the document character strings, dividing the inquired character strings and adding indexes;

and when the character string codes are inquired, carrying out character string code length filtering and matching filtering, and when the position relations are consistent, adding the character strings into a result set.

After the step of adding character strings to the result set is completed, the method further comprises verifying the edit distance of the character string codes that were not added to the result set, as follows:

computing the edit distance between the character string to be verified and the query string, and judging whether it is within the distance threshold to decide whether the character string is a matching result.

The step of extracting the features of the feature items in the feature set to construct the coding vector specifically comprises: performing feature extraction on the feature items in the document feature set, and processing the document character string codes with the TF-IDF algorithm to obtain the coding vector.

The step of obtaining the document character string coding information, preprocessing the document character string coding information and selecting the characteristics of the document character string coding information to form a characteristic set specifically comprises the following steps:

segmenting words of the document character string coding information to form a set of a plurality of codes and corresponding labels thereof;

filtering the corresponding label of each code split from the document;

and filtering all the coded data of the document to generate a feature set.
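The segmentation and label-filtering steps above can be sketched as follows. This is only an illustration: the patent does not name the word segmentation tool or the label scheme, so the regular-expression tokenizer, the three labels (`code`, `number`, `word`) and the example equipment codes are all hypothetical.

```python
import re

# Assumption: a token is a run of ASCII letters and digits.
TOKEN = re.compile(r"[A-Za-z0-9]+")

def tag_token(token):
    """Assign a hypothetical label: 'code' mixes letters and digits."""
    has_alpha = any(c.isalpha() for c in token)
    has_digit = any(c.isdigit() for c in token)
    if has_alpha and has_digit:
        return "code"
    if has_digit:
        return "number"
    return "word"

def segment(text):
    """Split text into (token, label) pairs."""
    return [(tok, tag_token(tok)) for tok in TOKEN.findall(text)]

def filter_by_label(pairs, keep=("code",)):
    """Keep only the tokens whose label is in `keep`."""
    return [tok for tok, label in pairs if label in keep]

pairs = segment("Replace valve 3RIS012VP and pump 1RCP001PO in 2021")
codes = filter_by_label(pairs)
```

Here `codes` retains only the two mixed letter-digit tokens; a real system would substitute the labels produced by its own segmentation tool.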

When fuzzy matching is carried out on the document character strings, the steps of dividing the inquired character strings and adding indexes are specifically as follows:

grouping character strings in a document data set to be queried according to length, and dividing the character strings with the same length into a group;

and constructing a complete binary tree for each character string according to the length, and recording the binary tree as a character string search tree, wherein each node in the character string search tree stores the divided character string, the original character string ID of the character string, the length of the character string and the starting position of the character string.
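One possible reading of the grouping and search-tree construction is sketched below. The patent specifies what each node stores (the divided string, the original string ID, the length and the starting position) but not how a string is divided at each level, so recursive halving down to a minimum segment length is an assumption, as are the names `SegmentNode`, `build_search_tree`, `group_by_length` and `min_len`.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class SegmentNode:
    segment: str      # the divided character string
    string_id: int    # ID of the original character string
    length: int       # length of the divided character string
    start: int        # start position within the original string
    left: Optional["SegmentNode"] = None
    right: Optional["SegmentNode"] = None

def build_search_tree(s, string_id, start=0, min_len=2):
    """Assumed splitting rule: halve the string until halves would be shorter than min_len."""
    node = SegmentNode(s, string_id, len(s), start)
    if len(s) >= 2 * min_len:
        mid = len(s) // 2
        node.left = build_search_tree(s[:mid], string_id, start, min_len)
        node.right = build_search_tree(s[mid:], string_id, start + mid, min_len)
    return node

def group_by_length(strings):
    """Group strings of equal length; each entry is the root of that string's tree."""
    groups = defaultdict(list)
    for string_id, s in enumerate(strings):
        groups[len(s)].append(build_search_tree(s, string_id))
    return groups

groups = group_by_length(["ABCD1234", "XY", "QRST5678"])
```

With halving, the tree is complete when the string length is a power of two times `min_len`; other lengths give a balanced but not necessarily complete tree.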

When the character string code is inquired, the character string code length filtering and matching filtering specifically comprise the following steps:

querying and length filtering the character string codes: inputting the query string q and the distance threshold t, and searching for the corresponding character string codes using these two parameters, wherein the length of the character strings searched lies in the range [|q| - t, |q| + t];

dividing the inquired character strings according to the length of the character strings in the document paragraphs to obtain an inquiry character string set;

and matching the character string codes, and adding the character string to a result set after matching is finished.
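The length filter follows from the fact that the edit distance between two strings is never smaller than the difference of their lengths, so any string whose length lies outside [|q| - t, |q| + t] cannot be within distance t of q. A minimal sketch, assuming the strings are already grouped by length in a dictionary (the names `group_strings_by_length` and `length_filter` are illustrative):

```python
from collections import defaultdict

def group_strings_by_length(strings):
    """Map each length to the list of strings of that length."""
    groups = defaultdict(list)
    for s in strings:
        groups[len(s)].append(s)
    return groups

def length_filter(groups, query, t):
    """Return the strings whose length lies in [|q| - t, |q| + t]."""
    low = max(0, len(query) - t)
    high = len(query) + t
    candidates = []
    for length in range(low, high + 1):
        candidates.extend(groups.get(length, []))
    return candidates

groups = group_strings_by_length(["AB", "ABC", "ABCD", "ABCDE", "ABCDEFG"])
candidates = length_filter(groups, "ABCD", t=1)
```

Because the groups are keyed by length, the filter touches only 2t + 1 groups regardless of how many strings are stored.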

The specific steps for matching the character string codes are as follows:

when a divided character string (segment) in the paragraph matches a segment of the query string, adding the length of that segment to the matching degree of the original character string that the segment indexes; when the matching degree of a character string is greater than a preset upper bound value, verifying the position list of the matched segments in the original character string against the position list in the query string; and when the position lists contain no repeated elements, adding the character string to the result set.
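The accumulation of the matching degree and the position-list check can be sketched as follows. For brevity the sketch divides strings into fixed-length segments and looks them up in a dictionary rather than in the search tree; that simplification, the function names and the example data are assumptions, not the patent's construction.

```python
from collections import defaultdict

def split_fixed(s, seg_len):
    """Split s into consecutive (start, segment) pairs of seg_len characters."""
    return [(i, s[i:i + seg_len]) for i in range(0, len(s), seg_len)]

def build_segment_index(strings, seg_len):
    """Map each segment to the (string ID, start position) pairs where it occurs."""
    index = defaultdict(list)
    for string_id, s in enumerate(strings):
        for start, seg in split_fixed(s, seg_len):
            index[seg].append((string_id, start))
    return index

def match_segments(query, index, seg_len):
    """Add the segment length to the matching degree of every string the segment indexes."""
    degree = defaultdict(int)
    positions = defaultdict(list)  # string ID -> [(start in original, start in query)]
    for q_start, seg in split_fixed(query, seg_len):
        for string_id, start in index.get(seg, []):
            degree[string_id] += len(seg)
            positions[string_id].append((start, q_start))
    return degree, positions

def definite_matches(degree, positions, upper):
    """Accept strings above the upper bound whose position lists have no repeats."""
    accepted = []
    for string_id, d in degree.items():
        if d <= upper:
            continue
        in_original = [p for p, _ in positions[string_id]]
        in_query = [q for _, q in positions[string_id]]
        if len(set(in_original)) == len(in_original) and len(set(in_query)) == len(in_query):
            accepted.append(string_id)
    return accepted

strings = ["abcdefgh", "abababab", "abcdxxxx"]
index = build_segment_index(strings, seg_len=2)
degree, positions = match_segments("abcdefgh", index, seg_len=2)
accepted = definite_matches(degree, positions, upper=6)
```

In this example the string "abababab" reaches the same matching degree as the exact match, but only because the single query segment "ab" is counted four times; the repeated query position exposes the double counting, so only string 0 is accepted.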

The specific steps of verifying the edit distance of the character string codes which are not added into the result set are as follows:

when the matching degree of the character string is judged to be smaller than a preset lower bound value, the character string code is directly filtered; and when the matching degree of the character string is between a preset lower bound value and a preset upper bound value, carrying out editing distance verification on the character string, and adding the character string passing the verification into a result set.

The verifying the editing distance of the character string specifically comprises the following steps:

judging the relation between the editing distance of the character string and a distance threshold, and adding the character string to a result set when the editing distance of the character string is less than or equal to the distance threshold t;

and when the editing distance of the character string is greater than a distance threshold value t, directly filtering the character string.
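The three-way decision described above (accept directly, filter directly, or verify by edit distance) can be sketched as follows. Two points are assumptions: strings above the upper bound that fail the position check are sent to verification, which the patent does not state explicitly, and the verification uses an early exit that stops as soon as every entry of the current dynamic-programming row exceeds t. All names are illustrative.

```python
def within_distance(a, b, t):
    """Return True if the unit-cost edit distance between a and b is at most t."""
    if abs(len(a) - len(b)) > t:
        return False  # the length difference is a lower bound on the distance
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ca != cb)))
        if min(current) > t:
            return False  # row minima never decrease, so the result must exceed t
        previous = current
    return previous[-1] <= t

def triage(query, candidates, lower, upper, t):
    """candidates: iterable of (string, matching degree, position check passed)."""
    results = []
    for text, degree, positions_ok in candidates:
        if degree > upper and positions_ok:
            results.append(text)            # definite match, no verification needed
        elif degree < lower:
            continue                        # definite non-match, filtered directly
        elif within_distance(text, query, t):
            results.append(text)            # possible match that passed verification
    return results

candidates = [("PUMP-01A", 8, True), ("PUMP-01B", 6, True),
              ("PUMP-99Z", 5, True), ("VALVE-7", 0, True)]
results = triage("PUMP-01A", candidates, lower=4, upper=7, t=1)
```

Only the two middle candidates reach `within_distance`, which is the saving in verification operations that the method aims for.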

The specific steps of processing the document character string code by using the TF-IDF algorithm to obtain the code vector are as follows:

obtaining the term frequency $tf_{ik}$ of the feature code $t_k$ in document $D_i$ and the inverse document frequency $idf_k$ of $t_k$; the weight of $t_k$ in the coding vector of document $D_i$ can then be expressed as:

$w_{ik} = tf_{ik} \cdot idf_k$

where $idf_k$ is computed from $N$ and $N_k$ (the formula is not reproduced in this text); $N$ represents the number of all documents in the coding library, and $N_k$ represents the number of documents in which the feature code $t_k$ appears;

normalizing the document character string coding vector to obtain the final coding vector (the normalization formula is not reproduced in this text).
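The inverse document frequency and normalization formulas are not reproduced in this text. The sketch below therefore assumes the common definitions idf_k = log(N / N_k) and Euclidean (L2) normalization; the patent's own formulas may differ, for example by adding a smoothing constant. The function name and the sample documents are illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs, features):
    """Build one L2-normalized TF-IDF vector per document.

    Assumptions: tf is the raw count, idf_k = log(N / N_k), L2 normalization.
    docs is a list of token lists; features fixes the vector dimensions.
    """
    n_docs = len(docs)
    doc_freq = {f: sum(1 for d in docs if f in d) for f in features}
    vectors = []
    for d in docs:
        counts = Counter(d)
        weights = []
        for f in features:
            idf = math.log(n_docs / doc_freq[f]) if doc_freq[f] else 0.0
            weights.append(counts[f] * idf)
        norm = math.sqrt(sum(w * w for w in weights))
        vectors.append([w / norm for w in weights] if norm else weights)
    return vectors

docs = [["pump", "valve", "pump"], ["valve", "pipe"], ["pipe", "pipe", "tank"]]
features = ["pump", "valve", "pipe", "tank"]
vectors = tfidf_vectors(docs, features)
```

A feature that occurs in every document gets idf 0 under this definition and so contributes nothing to the vector, which is the intended down-weighting of uninformative codes.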

The specific steps of generating the feature set after filtering all the coded data of the document are as follows:

selecting feature items from the filtered document character string coding data to form a feature item set;

evaluating all the characteristic items in the characteristic item set, and sorting in a descending order according to the evaluation value of each characteristic item;

and selecting the characteristic items with the top rank according to a preset threshold value or a determined characteristic quantity value to obtain a final characteristic set.
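The rank-and-select step can be sketched as follows. The patent does not say which evaluation function scores the feature items, so the scores are taken as given; the function name `select_features` and the example values are hypothetical.

```python
def select_features(scores, top_k=None, threshold=None):
    """Sort feature items by score in descending order, then keep the top ones.

    scores: dict mapping feature item -> evaluation value (metric left open).
    threshold: keep items whose score is at least this value, if given.
    top_k: keep at most this many items, if given.
    """
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    if threshold is not None:
        ranked = [(f, s) for f, s in ranked if s >= threshold]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [f for f, _ in ranked]

scores = {"pump": 3, "valve": 5, "the": 1, "pipe": 4}
by_count = select_features(scores, top_k=2)
by_threshold = select_features(scores, threshold=3)
```

Either criterion can be used alone, matching the choice between a preset threshold and a fixed number of features.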

The step of constructing the support vector machine classifier, training the support vector machine through the coding vector and obtaining the classification result label of the document coding specifically comprises:

constructing a support vector machine classifier, taking the coded features and the labels thereof as a training set, and training the parameters of the support vector machine;

and inputting the document coding vectors to be classified into the support vector machine classifier model, classifying the document codes, and obtaining the corresponding classification result labels.
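The train-then-classify flow can be illustrated with a very small linear support vector machine trained by sub-gradient descent on the hinge loss. This is a sketch only: the patent does not specify the kernel, the solver or the hyperparameters, the example handles two classes labelled +1 and -1, and a production system would more likely use an established SVM library. All names and values are illustrative.

```python
def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200):
    """Sub-gradient descent on the L2-regularized hinge loss (labels are +1 / -1)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]   # regularization shrink
            if margin < 1.0:                          # point violates the margin
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(w, b, x):
    """Return the class label for the coding vector x."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy coding vectors: class +1 loads on the first feature, class -1 on the second.
X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
y = [1, 1, -1, -1]
w, b = train_linear_svm(X, y)
labels = [predict(w, b, x) for x in X]
```

For more than two classes, one classifier per label (one-vs-rest) is a common extension, though the patent does not say which scheme it uses.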

The invention has the following notable effects: (1) by compiling character string coding information and establishing a coding database, it provides a basis for information lookup and reference in subsequent classification and matching, which helps improve classification efficiency and accuracy; (2) the Word document character string codes to be classified are acquired and preprocessed, and codes that carry no classification information are removed, which facilitates classification in the subsequent steps and improves classification efficiency; feature selection simplifies the feature set and further refines the coded data, which improves the precision of code classification; (3) during fuzzy matching of character string codes, the character strings are divided into a set of definite matches, a set of possible matches, and a set of non-matches; for the possible matches, a character string whose edit distance is less than or equal to the distance threshold t is added to the result set, and a character string whose edit distance is greater than t is filtered out directly, which reduces the number of edit distance verification operations; and when the matching degree of a character string is greater than the preset upper bound value, the position list of the matched segments in the original character string code and the position list in the query string are verified, which avoids the problems of segments that swap positions and segments that are matched repeatedly. In addition, when a divided character string (segment) in the paragraph matches a segment of the query string, the length of that segment is added to the matching degree of the original character string that the segment indexes, so that the different influence of segments of different lengths on the matching degree is reflected.

Drawings

FIG. 1 illustrates the fuzzy matching method for document character string codes according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the figures and the embodiments.

As shown in fig. 1, a fuzzy matching method for document string codes specifically includes the following steps:

S1, constructing a character string coding information base with labels;

collecting document data and labeling the character string information in the documents to form a labeled character string coding information base, wherein the documents can be Word documents of various versions or other similar document data;

S2, acquiring document character string coding information, and performing preprocessing and feature selection on the document character string coding information to form a feature set;

acquiring the document character string coding information, and preprocessing it through word segmentation and filtering to obtain a feature item set;

S2.1, performing word segmentation on the document character string coding information to form a set of a plurality of codes and their corresponding labels;

acquiring the document character string coding information, performing word segmentation on it with a word segmentation tool, and splitting it into a series of codes and their corresponding label data;

S2.2, filtering each code split from the document according to its corresponding label;

filtering each code split from the document according to its corresponding label, and filtering out unnecessary coded data;

S2.3, filtering all coded data of the document to generate a feature set;

S2.3.1, selecting feature items from the filtered document character string coding data to form a feature item set;

S2.3.2, evaluating all the feature items in the feature item set, and sorting them in descending order according to the evaluation value of each feature item;

S2.3.3, selecting the top-ranked feature items according to a preset threshold value or a determined number of features to obtain the final feature set;

S3, extracting features of the feature items in the feature set to construct a coding vector;

extracting the features of the feature items in the document feature set, and processing the document character string codes with the TF-IDF algorithm to obtain the coding vector:

$w_{ik} = tf_{ik} \cdot idf_k$

where $tf_{ik}$ is the term frequency of the feature code $t_k$ in document $D_i$, and $idf_k$ is the inverse document frequency of $t_k$;

S4, constructing a support vector machine classifier, training the support vector machine with the document coding vectors, and obtaining the classification result labels of the document codes;

S4.1, constructing a support vector machine classifier, taking the coded features and their labels as a training set, and training the parameters of the support vector machine;

S4.2, inputting the document coding vectors to be classified into the support vector machine classifier model, classifying the document codes, and obtaining the corresponding classification result labels;

S5, dividing the character strings to be queried and adding indexes when fuzzy matching is performed on the document character strings;

when character string fuzzy matching is performed, dividing the character strings to be queried and adding an index that contains the ID number of the original character string in the document character string codes;

the process of dividing the document character strings to be queried and adding the index specifically comprises the following steps:

S5.1, grouping the character strings in the document data set to be queried according to length, so that character strings of the same length fall into one group;

S5.2, constructing a complete binary tree for each character string according to its length, and recording the binary tree as a character string search tree, wherein each node in the character string search tree stores a divided character string, the original character string ID of the divided character string, its length, and its starting position;

S6, when the character string codes are queried, performing character string code length filtering and matching filtering, and when the position relations are consistent, adding the character strings to the result set;

S6.1, querying and length filtering the character string codes;

inputting the query string q and the distance threshold t, and searching for the corresponding character string codes using these two parameters, wherein the length of the character strings searched lies in the range [|q| - t, |q| + t];

S6.2, matching and filtering the character string codes;

S6.2.1, dividing the query string according to the length of the character strings in the document paragraphs to obtain a query character string set;

S6.2.2, matching the character string codes, and adding the character strings to the result set after matching is completed;

S7, judging whether a character string is a matching result by computing the edit distance between the character string to be verified and the query string and checking whether it is within the distance threshold;

performing edit distance verification on the character string codes that were not added to the result set: when the matching degree of a character string is smaller than a preset lower bound value, filtering out the character string code directly; and when the matching degree of a character string is between the preset lower bound value and the preset upper bound value, performing edit distance verification on the character string and adding the character strings that pass verification to the result set.

Claims

1. A fuzzy matching method for document character string editing is characterized in that: the method specifically comprises the following steps:

constructing a character string coding information base with labels;

2. The fuzzy matching method for document character string editing according to claim 1, characterized in that: after the step of adding the character string into the result set is completed, the step of verifying the editing distance of the character string codes which are not added into the result set is further included as follows:

computing the edit distance between the character string to be verified and the query string, and judging whether it is within the distance threshold to decide whether the character string is a matching result.

3. The fuzzy matching method for document string editing according to claim 1, wherein: the feature extraction of the feature items in the feature set is carried out to construct the coding vector, and the TF-IDF algorithm is utilized to process the document character string code to obtain the coding vector by carrying out the feature extraction of the feature items in the document feature combination.

4. The fuzzy matching method for document string editing according to claim 1, wherein: the step of obtaining the document character string coding information, preprocessing the document character string coding information and selecting the features to form the feature set specifically comprises the following steps:

filtering the corresponding label of each code split from the document;

and filtering all the coded data of the document to generate a feature set.

5. The fuzzy matching method for document character string editing according to claim 1, characterized in that: when fuzzy matching is carried out on the document character strings, the steps of dividing the inquired character strings and adding indexes are specifically as follows:

6. The fuzzy matching method for document string editing according to claim 1, wherein: when the character string code is inquired, the character string code length filtering and matching filtering specifically comprise the following steps:

querying and length filtering the character string codes: inputting the query string q and the distance threshold t, and searching for the corresponding character string codes using these two parameters, wherein the length of the character strings searched lies in the range [|q| - t, |q| + t];

dividing the inquired character strings according to the length of the character strings in the document paragraphs to obtain an inquiry character string collection;

7. The fuzzy matching method for document character string editing according to claim 6, characterized in that: the specific steps of matching the character string codes are as follows:

when a divided character string (segment) in the paragraph matches a segment of the query string, adding the length of that segment to the matching degree of the original character string that the segment indexes; when the matching degree of a character string is greater than a preset upper bound value, verifying the position list of the matched segments in the original character string against the position list in the query string; and when the position lists contain no repeated elements, adding the character string to the result set.

8. The fuzzy matching method for document character string editing according to claim 2, characterized in that: the specific steps of verifying the editing distance of the character string codes which are not added into the result set are as follows:

when the matching degree of the character string is smaller than a preset lower bound value, directly filtering the character string code; and when the matching degree of the character string is between a preset lower bound value and a preset upper bound value, carrying out editing distance verification on the character string, and adding the character string passing the verification into a result set.

9. The fuzzy matching method for document string editing according to claim 8, wherein: the verifying the editing distance of the character string specifically comprises the following steps:

and when the editing distance of the character string is greater than a distance threshold t, directly filtering the character string.

10. The fuzzy matching method for document character string editing according to claim 3, characterized in that: the specific steps of processing the document character string codes by using the TF-IDF algorithm to obtain the coding vectors are as follows:

obtaining the term frequency $tf_{ik}$ of the feature code $t_k$ in document $D_i$ and the inverse document frequency $idf_k$ of $t_k$; the weight of $t_k$ in the coding vector of document $D_i$ can then be expressed as:

$w_{ik} = tf_{ik} \cdot idf_k$

where $idf_k$ is computed from $N$ and $N_k$ (the formula is not reproduced in this text); $N$ represents the number of all documents in the coding library, and $N_k$ represents the number of documents in which the feature code $t_k$ appears;

11. The fuzzy matching method for document character string editing according to claim 4, wherein: the specific steps of generating the feature set after filtering all the coded data of the document are as follows:

12. The fuzzy matching method for document character string editing according to claim 1, characterized in that: the step of constructing the support vector machine classifier, training the support vector machine through the coding vector and obtaining the classification result label of the document coding specifically comprises: