CN115964457A - Fuzzy matching method for document character string codes

Publication number
CN115964457A
CN115964457A CN202111192730.2A CN202111192730A CN115964457A CN 115964457 A CN115964457 A CN 115964457A CN 202111192730 A CN202111192730 A CN 202111192730A CN 115964457 A CN115964457 A CN 115964457A
Authority
CN
China
Prior art keywords
character string
document
coding
feature
codes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111192730.2A
Other languages
Chinese (zh)
Inventor
姚昊
刘忠良
杨沥铭
任宇阳
尚鑫鑫
葛旭阳
楼宝川
潘炼
汤奔
杜梦娟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
CNNC Nuclear Power Operation Management Co Ltd
Original Assignee
CNNC Nuclear Power Operation Management Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-10-13
Filing date
2021-10-13
Publication date
2023-04-14
Application filed by CNNC Nuclear Power Operation Management Co Ltd
Priority to CN202111192730.2A
Publication of CN115964457A

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention relates to the technical field of data processing, and in particular discloses a fuzzy matching method for document character string codes. The method comprises the following steps: constructing a labeled character string coding information base; acquiring document character string coding information, and performing preprocessing and feature selection on it to form a feature set; extracting features from the feature items in the feature set to construct a coding vector; constructing a support vector machine classifier, training it with the coding vectors, and obtaining classification result labels for the document codes; when fuzzy matching is performed on the document character strings, dividing the character strings to be queried and adding indexes; and, when a character string code is queried, performing character string code length filtering and matching filtering, and adding the matching character strings to a result set. The method improves text classification efficiency and accuracy, reflects the differing influence of paragraphs of different lengths on the matching result, and requires fewer edit distance verification operations.

Description

Fuzzy matching method for document character string codes
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a fuzzy matching method for document character string codes.
Background
At present, different texts need to be classified. Because the volume of information is huge and the same or similar information is presented in different forms, it is difficult to process text information accurately and quickly, so Word documents need to be classified for work to proceed in an orderly way. Existing classification methods are inefficient and have low classification accuracy, and calculation errors can easily disrupt normal work.
The fuzzy query problem for character strings has been studied extensively, and most approaches are based on a filter-and-verify framework. In the filtering stage, a threshold t is used as a filter that removes most dissimilar character strings and yields a candidate set. In the verification stage, the actual edit distance between each character string in the candidate set and the query string is calculated to obtain the result set. The edit distance measures the degree of similarity between character strings, but it is computed by dynamic programming, so computing it for every pair of character strings in a data set to decide whether they match incurs a huge cost. Existing methods also suffer from complex calculation and an excessive number of edit distance verifications.
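To make the cost of the verification stage concrete, the following is a minimal Python sketch of the dynamic-programming edit distance (Levenshtein distance) that the background refers to. The function name `edit_distance` is an illustrative choice and does not come from the patent; the point is that every verification costs time proportional to the product of the two string lengths, which is why cheap filters are applied first.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming, O(len(a) * len(b)) time."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop to save memory
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]
```

For example, `edit_distance("kitten", "sitting")` evaluates to 3 (two substitutions and one insertion).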
Disclosure of Invention
The invention aims to provide a fuzzy matching method for document character string codes that solves the low text classification efficiency and accuracy of the prior art, reflects the differing influence of paragraphs of different lengths on the matching result, and reduces the number of edit distance verification operations.
The technical scheme of the invention is as follows: a fuzzy matching method for document character string codes, comprising the following steps:
constructing a labeled character string coding information base;
acquiring document character string coding information, and performing preprocessing and feature selection on the document character string coding information to form a feature set;
extracting features from the feature items in the feature set to construct a coding vector;
constructing a support vector machine classifier, training the support vector machine with the coding vectors, and obtaining classification result labels for the document codes;
when fuzzy matching is performed on the document character strings, dividing the character strings to be queried and adding indexes;
and, when a character string code is queried, performing character string code length filtering and matching filtering, and adding a character string to the result set when the position relations are consistent.
After the step of adding character strings to the result set is completed, the method further comprises verifying the edit distance of the character string codes that were not added to the result set, as follows:
determining, by computing the edit distance between the character string to be verified and the query character string, whether a matching result is obtained within the distance threshold.
The step of extracting features from the feature items in the feature set to construct the coding vector comprises extracting features from the feature items in the document feature set and processing the document character string codes with the TF-IDF algorithm to obtain the coding vector.
The step of acquiring the document character string coding information and performing preprocessing and feature selection on it to form a feature set specifically comprises:
performing word segmentation on the document character string coding information to form a set of codes and their corresponding labels;
filtering each code split from the document according to its corresponding label;
and generating the feature set after all the coded data of the document has been filtered.
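The segmentation and label-filtering steps above can be sketched as follows. The patent does not name a word segmentation tool or a label scheme, so the regular expression, the four labels ("code", "number", "word", "punct") and the sample text are all hypothetical stand-ins chosen for illustration only.

```python
import re

# Hypothetical tokenizer: alphanumeric runs (optionally hyphen-joined) or any
# single non-space character. A real system would use a proper segmenter.
TOKEN_RE = re.compile(r"[A-Za-z0-9]+(?:-[A-Za-z0-9]+)*|\S")

def label_of(token: str) -> str:
    """Assign an illustrative label to a token (assumed labelling rule)."""
    has_digit = any(ch.isdigit() for ch in token)
    has_alpha = any(ch.isalpha() for ch in token)
    if has_digit and has_alpha:
        return "code"
    if token.isdigit():
        return "number"
    if token.isalpha():
        return "word"
    return "punct"

def segment(text: str) -> list[tuple[str, str]]:
    """Split text into (code, label) pairs."""
    return [(tok, label_of(tok)) for tok in TOKEN_RE.findall(text)]

def filter_by_label(pairs, keep=frozenset({"code"})) -> list[str]:
    """Keep only the codes whose label is in `keep`; drop everything else."""
    return [tok for tok, lab in pairs if lab in keep]
```

With the made-up input `"Valve 1RCP-001 closed."`, `segment` yields four pairs and `filter_by_label` keeps only `"1RCP-001"`.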
When fuzzy matching is performed on the document character strings, the step of dividing the character strings to be queried and adding indexes specifically comprises:
grouping the character strings in the document data set to be queried by length, so that character strings of the same length fall into one group;
and constructing, for each character string, a complete binary tree according to its length, referred to as a character string search tree, wherein each node of the tree stores a divided character string, the ID of its original character string, its length, and its starting position.
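One possible reading of the index construction above is sketched below. The patent does not say how a character string is divided, so recursive halving to a fixed depth is an assumption, as are the names `Node`, `build_tree`, `build_index`, `leaves` and the `depth` parameter. Each node stores the four fields listed in the text.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    segment: str      # the divided character string
    string_id: int    # ID of the original character string
    length: int       # length of this segment
    start: int        # starting position inside the original string
    left: Optional["Node"] = None
    right: Optional["Node"] = None

def build_tree(s: str, string_id: int, start: int = 0, depth: int = 2) -> Node:
    """Recursively halve `s` (assumed division rule) into a binary tree."""
    node = Node(s, string_id, len(s), start)
    if depth > 0 and len(s) >= 2:
        mid = len(s) // 2
        node.left = build_tree(s[:mid], string_id, start, depth - 1)
        node.right = build_tree(s[mid:], string_id, start + mid, depth - 1)
    return node

def build_index(strings: list[str], depth: int = 2) -> dict[int, list[Node]]:
    """Group strings by length and build one search tree per string."""
    groups = defaultdict(list)
    for string_id, s in enumerate(strings):
        groups[len(s)].append(build_tree(s, string_id, depth=depth))
    return dict(groups)

def leaves(node: Node) -> list[Node]:
    """Return the leaf segments of a search tree, left to right."""
    if node.left is None and node.right is None:
        return [node]
    return leaves(node.left) + leaves(node.right)
```

Grouping by length first means a query only has to visit the groups whose length is compatible with the distance threshold.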
When a character string code is queried, the character string code length filtering and matching filtering specifically comprise:
querying and length-filtering the character string codes: inputting the query character string q and the distance threshold t, and searching for the corresponding character string codes using these two parameters, wherein the length of a candidate character string lies in the range [|q| - t, |q| + t];
dividing the query character string according to the lengths of the character strings in the document paragraphs to obtain a set of query character strings;
and matching the character string codes, and adding the matched character strings to the result set.
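The length filter above rests on a simple fact: the edit distance between two strings is at least the difference of their lengths, so any string whose length falls outside [|q| - t, |q| + t] cannot be within distance t of q. A minimal sketch, assuming the index is a plain dictionary from length to group (the function and parameter names are illustrative):

```python
def length_filter(groups_by_length: dict, query: str, t: int) -> dict:
    """Keep only the length groups that can contain a string within distance t."""
    lo = max(0, len(query) - t)
    hi = len(query) + t
    return {n: group for n, group in groups_by_length.items() if lo <= n <= hi}
```

With a six-character query and t = 1, only the groups of length 5, 6 and 7 survive, and all other groups are skipped without any per-string work.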
The specific steps for matching the character string codes are as follows:
when a character string in a paragraph matches a character string in the query character string set, the length of that character string is added to the matching degree of the original character string it is indexed to; when the matching degree of a character string is greater than a preset upper bound, the position list of the matched character strings in the original character string and the position list in the query string are verified; and when the position lists contain no repeated elements, the character string is added to the result set.
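The matching-degree accumulation and position-list check described above might look like the following sketch. It simplifies the patent's procedure in one respect, which is an assumption of this example: instead of pre-dividing the query, each indexed segment is located in the query with `str.find`. The names `matching_degrees`, `no_repeats` and `sure_matches` are hypothetical.

```python
from collections import defaultdict

def matching_degrees(segments, query: str):
    """segments: iterable of (segment, string_id, start) triples from the index."""
    degree = defaultdict(int)     # matching degree per original string
    src_pos = defaultdict(list)   # matched positions inside the original string
    qry_pos = defaultdict(list)   # matched positions inside the query string
    for seg, sid, start in segments:
        pos = query.find(seg)
        if pos != -1:
            degree[sid] += len(seg)   # longer segments contribute more
            src_pos[sid].append(start)
            qry_pos[sid].append(pos)
    return degree, src_pos, qry_pos

def no_repeats(positions) -> bool:
    return len(positions) == len(set(positions))

def sure_matches(segments, query: str, upper: int) -> list[int]:
    """IDs whose degree exceeds `upper` and whose position lists have no repeats."""
    degree, src_pos, qry_pos = matching_degrees(segments, query)
    return sorted(sid for sid, d in degree.items()
                  if d > upper and no_repeats(src_pos[sid]) and no_repeats(qry_pos[sid]))
```

The repeat check is what rejects a string such as "abab" against the query "abcd": both of its "ab" segments map to the same query position, so its high matching degree is an artifact of double counting.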
The specific steps of verifying the edit distance of the character string codes that were not added to the result set are as follows:
when the matching degree of a character string is smaller than a preset lower bound, the character string code is filtered out directly; and when the matching degree of a character string lies between the preset lower bound and the preset upper bound, edit distance verification is performed on the character string, and a character string that passes verification is added to the result set.
Verifying the edit distance of a character string specifically comprises:
comparing the edit distance of the character string with the distance threshold, and adding the character string to the result set when its edit distance is less than or equal to the distance threshold t;
and filtering out the character string directly when its edit distance is greater than the distance threshold t.
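The three-way routing above (filter, verify, accept) can be sketched as below. This is an illustrative sketch under stated assumptions: matching degrees are taken as given, strings above the upper bound are assumed to have already passed the position-list check, and a degree exactly equal to either bound is treated as "between" the bounds, which the text does not specify. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance by dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def route_candidates(degrees, strings, query, lower, upper, t):
    """Return (sorted result IDs, number of edit distance verifications run)."""
    results, verifications = [], 0
    for sid, degree in degrees.items():
        if degree < lower:
            continue                      # certain non-match: filtered directly
        if degree > upper:
            results.append(sid)           # certain match (position check assumed done)
            continue
        verifications += 1                # possible match: verify explicitly
        if edit_distance(strings[sid], query) <= t:
            results.append(sid)
    return sorted(results), verifications
```

Only the middle band pays for the dynamic-programming computation, which is where the reduction in verification operations comes from.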
The specific steps of processing the document character string codes with the TF-IDF algorithm to obtain the coding vector are as follows:
obtaining the term frequency $tf_{ik}$ of feature code $t_k$ in document $D_i$ and the inverse document frequency $idf_k$, so that each component of the coding vector of document $D_i$ can be expressed as:
$$w_{ik} = tf_{ik} \cdot idf_k$$
where, in the standard TF-IDF form consistent with the definitions below,
$$idf_k = \log\frac{N}{n_k}$$
in the above formula, $N$ is the number of all documents in the coding library and $n_k$ is the number of documents in which feature code $t_k$ appears;
normalizing the document character string coding vector gives the coding vector:
$$w_{ik} = \frac{tf_{ik} \cdot idf_k}{\sqrt{\sum_{j} \left(tf_{ij} \cdot idf_j\right)^2}}$$
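The weighting and normalization described above can be sketched in a few lines of Python. The sketch assumes the standard inverse document frequency log(N / n_k) and L2 normalization; the patent's formula images are not reproduced in this text, so the exact variant used there (for example, any smoothing term) is not confirmed. The function name `tfidf_vectors` is illustrative.

```python
import math
from collections import Counter

def tfidf_vectors(docs: list[list[str]]):
    """docs: one list of feature codes per document. Returns (vocabulary, vectors)."""
    n_docs = len(docs)                       # N
    df = Counter()
    for doc in docs:
        df.update(set(doc))                  # n_k: documents containing code k
    vocab = sorted(df)
    vectors = []
    for doc in docs:
        tf = Counter(doc)                    # tf_ik: raw count of code k in doc i
        raw = [tf[t] * math.log(n_docs / df[t]) for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw))
        vectors.append([w / norm if norm else 0.0 for w in raw])
    return vocab, vectors
```

A code that appears in every document gets idf = log(1) = 0 and therefore contributes nothing to any vector, which is the intended effect of the weighting.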
The specific steps of generating the feature set after filtering all the coded data of the document are as follows:
selecting feature items from the filtered document character string coding data to form a feature item set;
evaluating all the feature items in the feature item set, and sorting them in descending order of evaluation value;
and selecting the top-ranked feature items according to a preset threshold or a predetermined number of features to obtain the final feature set.
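The evaluate, sort and select procedure above is sketched below. The patent does not specify the evaluation function, so document frequency is used here purely as a placeholder score; `select_features`, `top_k` and `min_score` are hypothetical names covering the two selection modes mentioned in the text (a fixed number of features or a threshold).

```python
from collections import Counter

def select_features(docs, top_k=None, min_score=None) -> list[str]:
    """Rank feature items by score (descending) and keep the top-ranked ones."""
    scores = Counter()
    for doc in docs:
        scores.update(set(doc))  # placeholder score: document frequency (assumed)
    # Sort by descending score; break ties alphabetically for determinism.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    if min_score is not None:
        ranked = [(f, s) for f, s in ranked if s >= min_score]
    if top_k is not None:
        ranked = ranked[:top_k]
    return [f for f, _ in ranked]
```

Any other scoring function, such as information gain or a chi-square statistic, could be substituted for the placeholder without changing the ranking and truncation logic.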
The step of constructing the support vector machine classifier, training the support vector machine with the coding vectors, and obtaining the classification result labels of the document codes specifically comprises:
constructing a support vector machine classifier, and training its parameters using the coded features and their labels as the training set;
and inputting the document coding vectors to be classified into the support vector machine classifier model, classifying the document codes, and obtaining the corresponding classification result labels.
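As a rough illustration of the training and classification steps above, the following is a dependency-free sketch of a binary linear support vector machine trained by stochastic subgradient descent on the regularized hinge loss. The patent does not state the kernel, the solver, or the number of classes, so the linear model, the two-class setting, the hyperparameters and the toy data are all assumptions; a production system would more likely use an established SVM library.

```python
def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Train on feature vectors X with labels y in {-1, +1}. Returns (w, b)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (dot(w, xi) + b)
            w = [wj * (1.0 - lr * lam) for wj in w]   # L2 regularization shrink
            if margin < 1.0:                          # hinge loss is active
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(model, x) -> int:
    """Return the predicted label (+1 or -1) for coding vector x."""
    w, b = model
    return 1 if dot(w, x) + b >= 0.0 else -1
```

In use, the rows of `X` would be the normalized coding vectors and `y` the labels from the coding information base; multi-class labels could be handled with one such classifier per class.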
The remarkable effects of the invention are as follows. The fuzzy matching method for document character string codes has the following advantages: (1) By collecting statistics on character string coding information and establishing a coding database, it provides a basis for lookup and reference during subsequent classification and matching, which helps improve classification efficiency and accuracy. (2) The Word document character string codes to be classified are acquired and preprocessed, and codes that carry no classification information are removed, which makes the codes easier to classify in subsequent steps and improves classification efficiency; feature selection simplifies the feature set and further refines the coded data, which improves the precision of code classification. (3) During fuzzy matching of character string codes, the character strings are divided into a set that definitely matches, a set that may match, and a set that does not match, so that edit distance verification is needed only for the possible matches: a character string is added to the result set when its edit distance is less than or equal to the distance threshold t and is filtered out directly when its edit distance is greater than t, which reduces the number of edit distance verification operations. When the matching degree of a character string is greater than the preset upper bound, the position list of the matched character string codes in the original character string codes and the position list in the query string are verified, which avoids false matches caused by character strings that swap positions and by character string segments that are matched repeatedly.
In addition, when a character string in a paragraph matches a character string in the query string, the length of that character string is added to the matching degree of the original character string it is indexed to, so that the differing influence of paragraphs of different lengths on the matching degree is reflected.
Drawings
FIG. 1 illustrates the fuzzy matching method for document character string codes according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the figures and the embodiments.
As shown in FIG. 1, a fuzzy matching method for document character string codes specifically comprises the following steps:
S1, constructing a labeled character string coding information base;
collecting document data and labeling the character string information in the documents to form a labeled character string coding information base, wherein the documents may be Word documents of various versions or other similar document data;
S2, acquiring document character string coding information, and performing preprocessing and feature selection on the document character string coding information to form a feature set;
acquiring the document character string coding information, and preprocessing it through word segmentation and filtering to obtain a feature item set;
S2.1, performing word segmentation on the document character string coding information to form a set of codes and their corresponding labels;
using a word segmentation tool to split the acquired document character string coding information into a series of codes and their corresponding label data;
S2.2, filtering each code split from the document according to its corresponding label;
filtering each code split from the document according to its corresponding label, and discarding unnecessary coded data;
S2.3, generating the feature set after all the coded data of the document has been filtered;
S2.3.1, selecting feature items from the filtered document character string coding data to form a feature item set;
S2.3.2, evaluating all the feature items in the feature item set, and sorting them in descending order of evaluation value;
S2.3.3, selecting the top-ranked feature items according to a preset threshold or a predetermined number of features to obtain the final feature set;
S3, extracting features from the feature items in the feature set to construct a coding vector;
extracting features from the feature items in the document feature set, and processing the document character string codes with the TF-IDF algorithm to obtain the coding vector;
obtaining the term frequency $tf_{ik}$ of feature code $t_k$ in document $D_i$ and the inverse document frequency $idf_k$, so that each component of the coding vector of document $D_i$ can be expressed as:
$$w_{ik} = tf_{ik} \cdot idf_k$$
where, in the standard TF-IDF form consistent with the definitions below,
$$idf_k = \log\frac{N}{n_k}$$
in the above formula, $N$ is the number of all documents in the coding library and $n_k$ is the number of documents in which feature code $t_k$ appears;
normalizing the document character string coding vector gives the coding vector:
$$w_{ik} = \frac{tf_{ik} \cdot idf_k}{\sqrt{\sum_{j} \left(tf_{ij} \cdot idf_j\right)^2}}$$
S4, constructing a support vector machine classifier, training the support vector machine with the document coding vectors, and obtaining classification result labels for the document codes;
S4.1, constructing a support vector machine classifier, and training its parameters using the coded features and their labels as the training set;
S4.2, inputting the document coding vectors to be classified into the support vector machine classifier model, classifying the document codes, and obtaining the corresponding classification result labels;
S5, when fuzzy matching is performed on the document character strings, dividing the character strings to be queried and adding indexes;
when character string fuzzy matching is performed, the character strings to be queried are divided and an index is added, wherein the index contains the ID number of the original character string in the document character string codes;
the process of dividing the document character strings to be queried and adding the index specifically comprises the following steps:
S5.1, grouping the character strings in the document data set to be queried by length, so that character strings of the same length fall into one group;
S5.2, constructing, for each character string, a complete binary tree according to its length, referred to as a character string search tree, wherein each node of the tree stores a divided character string, the ID of its original character string, its length, and its starting position;
S6, when a character string code is queried, performing character string code length filtering and matching filtering, and adding a character string to the result set when the position relations are consistent;
S6.1, querying and length-filtering the character string codes;
inputting the query character string q and the distance threshold t, and searching for the corresponding character string codes using these two parameters, wherein the length of a candidate character string lies in the range [|q| - t, |q| + t];
S6.2, matching and filtering the character string codes;
S6.2.1, dividing the query character string according to the lengths of the character strings in the document paragraphs to obtain a set of query character strings;
S6.2.2, matching the character string codes, and adding the matched character strings to the result set;
when a character string in a paragraph matches a character string in the query character string set, the length of that character string is added to the matching degree of the original character string it is indexed to; when the matching degree of a character string is greater than a preset upper bound, the position list of the matched character strings in the original character string and the position list in the query string are verified; and when the position lists contain no repeated elements, the character string is added to the result set;
S7, determining, by computing the edit distance between the character string to be verified and the query character string, whether a matching result is obtained within the distance threshold;
edit distance verification is performed on the character string codes that were not added to the result set: when the matching degree of a character string is smaller than a preset lower bound, the character string code is filtered out directly; and when the matching degree of a character string lies between the preset lower bound and the preset upper bound, edit distance verification is performed on the character string, and a character string that passes verification is added to the result set.
Verifying the edit distance of a character string specifically comprises:
comparing the edit distance of the character string with the distance threshold, and adding the character string to the result set when its edit distance is less than or equal to the distance threshold t;
and filtering out the character string directly when its edit distance is greater than the distance threshold t.

Claims (12)

1. A fuzzy matching method for document character string codes, characterized in that the method comprises the following steps:
constructing a labeled character string coding information base;
acquiring document character string coding information, and performing preprocessing and feature selection on the document character string coding information to form a feature set;
extracting features from the feature items in the feature set to construct a coding vector;
constructing a support vector machine classifier, training the support vector machine with the coding vectors, and obtaining classification result labels for the document codes;
when fuzzy matching is performed on the document character strings, dividing the character strings to be queried and adding indexes;
and, when a character string code is queried, performing character string code length filtering and matching filtering, and adding a character string to the result set when the position relations are consistent.
2. The fuzzy matching method for document character string codes according to claim 1, characterized in that: after the step of adding character strings to the result set is completed, the method further comprises verifying the edit distance of the character string codes that were not added to the result set, as follows:
determining, by computing the edit distance between the character string to be verified and the query character string, whether a matching result is obtained within the distance threshold.
3. The fuzzy matching method for document character string codes according to claim 1, characterized in that: the step of extracting features from the feature items in the feature set to construct the coding vector comprises extracting features from the feature items in the document feature set and processing the document character string codes with the TF-IDF algorithm to obtain the coding vector.
4. The fuzzy matching method for document character string codes according to claim 1, characterized in that: the step of acquiring the document character string coding information and performing preprocessing and feature selection on it to form the feature set specifically comprises:
performing word segmentation on the document character string coding information to form a set of codes and their corresponding labels;
filtering each code split from the document according to its corresponding label;
and generating the feature set after all the coded data of the document has been filtered.
5. The fuzzy matching method for document character string codes according to claim 1, characterized in that: when fuzzy matching is performed on the document character strings, the step of dividing the character strings to be queried and adding indexes specifically comprises:
grouping the character strings in the document data set to be queried by length, so that character strings of the same length fall into one group;
and constructing, for each character string, a complete binary tree according to its length, referred to as a character string search tree, wherein each node of the tree stores a divided character string, the ID of its original character string, its length, and its starting position.
6. The fuzzy matching method for document character string codes according to claim 1, characterized in that: when a character string code is queried, the character string code length filtering and matching filtering specifically comprise:
querying and length-filtering the character string codes: inputting the query character string q and the distance threshold t, and searching for the corresponding character string codes using these two parameters, wherein the length of a candidate character string lies in the range [|q| - t, |q| + t];
dividing the query character string according to the lengths of the character strings in the document paragraphs to obtain a set of query character strings;
and matching the character string codes, and adding the matched character strings to the result set.
7. The fuzzy matching method for document character string codes according to claim 6, characterized in that: the specific steps of matching the character string codes are as follows:
when a character string in a paragraph matches a character string in the query character string set, the length of that character string is added to the matching degree of the original character string it is indexed to; when the matching degree of a character string is greater than a preset upper bound, the position list of the matched character strings in the original character string and the position list in the query character string are verified; and when the position lists contain no repeated elements, the character string is added to the result set.
8. The fuzzy matching method for document character string codes according to claim 2, characterized in that: the specific steps of verifying the edit distance of the character string codes that were not added to the result set are as follows:
when the matching degree of a character string is smaller than a preset lower bound, the character string code is filtered out directly; and when the matching degree of a character string lies between the preset lower bound and the preset upper bound, edit distance verification is performed on the character string, and a character string that passes verification is added to the result set.
9. The fuzzy matching method for document character string codes according to claim 8, characterized in that: verifying the edit distance of a character string specifically comprises:
comparing the edit distance of the character string with the distance threshold, and adding the character string to the result set when its edit distance is less than or equal to the distance threshold t;
and filtering out the character string directly when its edit distance is greater than the distance threshold t.
10. The fuzzy matching method for document character string codes according to claim 3, characterized in that: the specific steps of processing the document character string codes with the TF-IDF algorithm to obtain the coding vector are as follows:
obtaining the term frequency $tf_{ik}$ of feature code $t_k$ in document $D_i$ and the inverse document frequency $idf_k$, so that each component of the coding vector of document $D_i$ can be expressed as:
$$w_{ik} = tf_{ik} \cdot idf_k$$
where, in the standard TF-IDF form consistent with the definitions below,
$$idf_k = \log\frac{N}{n_k}$$
in the above formula, $N$ is the number of all documents in the coding library and $n_k$ is the number of documents in which feature code $t_k$ appears;
normalizing the document character string coding vector gives the coding vector:
$$w_{ik} = \frac{tf_{ik} \cdot idf_k}{\sqrt{\sum_{j} \left(tf_{ij} \cdot idf_j\right)^2}}$$
11. The fuzzy matching method for document character string codes according to claim 4, characterized in that: the specific steps of generating the feature set after filtering all the coded data of the document are as follows:
selecting feature items from the filtered document character string coding data to form a feature item set;
evaluating all the feature items in the feature item set, and sorting them in descending order of evaluation value;
and selecting the top-ranked feature items according to a preset threshold or a predetermined number of features to obtain the final feature set.
12. The fuzzy matching method for document character string codes according to claim 1, characterized in that: the step of constructing the support vector machine classifier, training the support vector machine with the coding vectors, and obtaining the classification result labels of the document codes specifically comprises:
constructing a support vector machine classifier, and training its parameters using the coded features and their labels as the training set;
and inputting the document coding vectors to be classified into the support vector machine classifier model, classifying the document codes, and obtaining the corresponding classification result labels.

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN202111192730.2A, published as CN115964457A (en) | 2021-10-13 | 2021-10-13 | Fuzzy matching method for document character string codes

Applications Claiming Priority (1)

Application Number | Priority Date | Filing Date | Title
CN202111192730.2A, published as CN115964457A (en) | 2021-10-13 | 2021-10-13 | Fuzzy matching method for document character string codes

Publications (1)

Publication Number | Publication Date
CN115964457A (en) | 2023-04-14

Family

ID=87351491

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
CN202111192730.2A (Pending), published as CN115964457A (en) | Fuzzy matching method for document character string codes | 2021-10-13 | 2021-10-13

Country Status (1)

Country | Link
CN (1) | CN115964457A (en)

Cited By (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN117194490A * | 2023-11-07 | 2023-12-08 | 长春金融高等专科学校 | Financial big data storage query method based on artificial intelligence
CN117194490B * | 2023-11-07 | 2024-04-05 | 长春金融高等专科学校 | Financial big data storage query method based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination