CN115392235A - Character matching method and device, electronic equipment and readable storage medium - Google Patents

Publication number
CN115392235A
Authority
CN
China
Prior art keywords
character string
word
similarity
character
matching
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210976998.3A
Other languages
Chinese (zh)
Inventor
阳毅
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongdian Jinxin Software Co Ltd
Original Assignee
Zhongdian Jinxin Software Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongdian Jinxin Software Co Ltd filed Critical Zhongdian Jinxin Software Co Ltd
Priority to CN202210976998.3A priority Critical patent/CN115392235A/en
Publication of CN115392235A publication Critical patent/CN115392235A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Abstract

The embodiments of the application provide a character matching method and device, electronic equipment, and a readable storage medium, and relate to the technical field of computers. The method comprises the following steps: acquiring two character strings to be compared, performing word segmentation on each character string, taking the character string with fewer words after segmentation as a first character string, and taking the character string with more words as a second character string; then calculating the similarity between each first word in the first character string and each second word in the second character string, and obtaining the target similarity between the first character string and the second character string from the calculated similarities. Because the method calculates the similarity between each first word and each second word and then combines the calculated similarities into the target similarity, the target similarity between the two character strings can be calculated accurately even when word order is ignored. This overcomes the defect that conventional similarity calculation methods do not support calculation when word order is ignored.

Description

Character matching method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a character matching method, device, electronic device, and readable storage medium.
Background
With the development of computer technology, more and more services can be processed online, and the range of services covered has become broader and more comprehensive. Once business processes are digitized, many services need to compare the similarity between character strings. For international business in particular, the languages and name orders used in the documents exchanged between enterprises or organizations in different countries may differ, so evaluating the similarity between character strings more accurately has become an urgent problem.
At present, when names involved in international business are processed, a string similarity algorithm often needs to ignore word order when calculating the similarity between a query character string and a hit character string. Ignoring word order, however, may make the calculated similarity inaccurate, so that it no longer represents the similarity between the character strings reasonably. This causes problems in business processing and increases the risk to business and funds.
Disclosure of Invention
The purpose of the embodiment of the application is to solve the problem that the similarity between character strings is difficult to accurately calculate.
The embodiment of the application provides a character matching method, which comprises the following steps:
acquiring two character strings to be processed, and performing word segmentation processing on each character string to obtain a corresponding word;
respectively determining the number of words of the two character strings, and if the number of words of the two character strings is not equal, using the character string with less number of words as a first character string, and using the character string with more number of words as a second character string;
if the number of words in the two character strings is equal, selecting one character string from the two character strings as a first character string, and selecting the other character string as a second character string;
for each first word of the first character string, respectively calculating the similarity between the first word and each second word of the second character string;
and acquiring the target similarity between the first character string and the second character string based on the similarity between each first word and each second word.
In an optional embodiment of the first aspect, obtaining the target similarity between the first character string and the second character string based on the similarity between each first word and each second word comprises:
for each first word, determining the maximum similarity corresponding to the first word from the similarity between the first word and each second word;
acquiring the comprehensive similarity of the first character string to the second character string based on the maximum similarity corresponding to each first word;
determining, from the first character string and the second character string, the target character string with the smaller number of characters; and acquiring the target similarity between the first character string and the second character string based on the comprehensive similarity and the number of characters of the target character string.
In an optional embodiment of the first aspect, obtaining the comprehensive similarity of the first character string with respect to the second character string based on the maximum similarity corresponding to each first word includes:
acquiring the weight corresponding to each first word;
and carrying out weighted summation on the maximum similarity corresponding to each first word based on the weight corresponding to each first word to obtain the comprehensive similarity of the first character string to the second character string.
In an optional embodiment of the first aspect, obtaining the weight corresponding to each first word comprises:
and calculating the proportion of the length of the first word in the total length of the first character string for each first word, and taking the proportion as the weight corresponding to the first word.
In an optional embodiment of the first aspect, obtaining the target similarity between the first character string and the second character string based on the comprehensive similarity and the number of characters of the target character string includes:
and dividing the comprehensive similarity by the number of the characters of the target character string to obtain the target similarity between the first character string and the second character string.
In an optional embodiment of the first aspect, for each character string, performing word segmentation processing to obtain a corresponding word includes:
for each character string, taking a preset number of characters in the character string as the matching field for each match;
matching the matching fields with entries in a word segmentation dictionary one by one according to a preset word segmentation dictionary, and if the word segmentation dictionary contains the matching fields, successfully matching, and taking the matching fields as a segmented word;
if the matching fails, one character in the matching field is deleted, and the matching is carried out again until all characters in the character string are successfully matched.
In a second aspect, there is provided a character matching apparatus, the apparatus including:
the character string word segmentation module is used for acquiring two character strings to be processed, and performing word segmentation processing on each character string to obtain a corresponding word;
the word number comparison module is used for respectively determining the word number of the two character strings, taking the character string with the small word number as a first character string and taking the character string with the large word number as a second character string;
the similarity calculation module is used for calculating the similarity between each first word of the first character string and each second word of the second character string;
and the similarity processing module is used for acquiring the target similarity between the first character string and the second character string based on the similarity between each first word and each second word.
In a third aspect, an electronic device is provided, which includes:
a memory, a processor, and a program stored in the memory and executable on the processor, wherein the processor executes the program to implement the character matching method of any one of the above embodiments.
In a fourth aspect, a readable storage medium is provided, where a program is stored, and the program is executed by a processor to implement the character matching method of any of the above embodiments.
The character matching method comprises the following steps: acquiring two character strings to be compared, performing word segmentation on each character string, taking the character string with fewer words after segmentation as a first character string, and taking the character string with more words as a second character string; then calculating the similarity between each first word in the first character string and each second word in the second character string, and obtaining the target similarity between the first character string and the second character string from the calculated similarities. Because the method calculates the similarity between each first word and each second word and then combines the calculated similarities into the target similarity, the target similarity between the two character strings can be calculated accurately even when word order is ignored. This overcomes the defect that conventional similarity calculation methods do not support calculation when word order is ignored.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings used in the description of the embodiments of the present application will be briefly described below.
Fig. 1 is a schematic flowchart of a character matching method according to an embodiment of the present disclosure;
Fig. 2 is a schematic flowchart of a character matching method according to an embodiment of the present application;
Fig. 3 is a schematic flowchart of a character matching method according to an embodiment of the present disclosure;
Fig. 4 is a schematic structural diagram of a character matching apparatus according to an embodiment of the present application;
Fig. 5 is a schematic structural diagram of an electronic device for character matching according to an embodiment of the present disclosure.
Detailed Description
Embodiments of the present application are described below in conjunction with the drawings in the present application. It should be understood that the embodiments set forth below in connection with the drawings are exemplary descriptions for explaining technical solutions of the embodiments of the present application, and do not limit the technical solutions of the embodiments of the present application.
As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, information, data, steps, operations, elements, and/or components, but do not preclude the presence or addition of other features, information, data, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. The term "and/or" as used herein indicates at least one of the items defined by the term, e.g., "A and/or B" may be implemented as "A", or as "B", or as "A and B".
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
For international business, because the languages and name orders used for business transaction files among enterprises or organizations in various countries may be different, how to accurately evaluate the similarity between character strings becomes a problem to be solved.
At present, when names involved in international business are processed, a string similarity algorithm often needs to ignore word order when calculating the similarity between a query character string and a hit character string. Ignoring word order may make the calculated similarity insufficiently accurate, so that it no longer represents the similarity between the character strings reasonably. This in turn causes problems in business processing and increases the risk to business and funds.
The application provides a character string similarity method, a character string similarity device, an electronic device and a readable storage medium, and aims to solve the technical problems in the prior art.
The technical solutions of the embodiments of the present application and the technical effects they produce are explained below through several exemplary embodiments. It should be noted that the following embodiments may be cross-referenced or combined with each other, and the description of the same terms, similar features, similar implementation steps, and the like is not repeated across embodiments.
An embodiment of the present application provides a character matching method, and as shown in fig. 1, the method includes:
Step S101, two character strings to be processed are obtained, and word segmentation processing is carried out on each character string to obtain its corresponding words.
In the embodiment of the present application, the two character strings to be processed may be any two character strings that need to be compared, and their specific content or meaning is not limited. A character string may include, but is not limited to, Chinese characters, English letters, common symbols, and the like.
Word segmentation is performed on each character string to obtain the words corresponding to that character string.
The word segmentation processing may be performed on each character string through a preset word segmentation algorithm, and specifically, the word segmentation algorithm that may be used includes, but is not limited to, the following two categories:
(1) Methods based on character string matching: the character string matching method, also called the mechanical word segmentation method or dictionary matching method, uses no rule knowledge or statistical information. It matches the Chinese character string to be segmented against the entries in a dictionary one by one; if an entry is found in the dictionary, the match succeeds, and otherwise other corresponding processing is performed. Mechanical word segmentation can be divided into forward matching, reverse matching, and bidirectional matching according to the direction in which the text is scanned; into simple word segmentation and integrated segmentation-and-tagging according to whether segmentation is combined with part-of-speech tagging; and into maximum matching and minimum matching according to whether each match gives priority to long words or short words. Common string-matching segmenters usually combine several of these single methods, for example forward maximum matching, reverse maximum matching, bidirectional maximum matching, and minimum-cut segmentation of the character string.
(2) Statistics-based methods: a word is a stable combination of characters, so the more often adjacent characters appear together in text, the more likely they are to form a word. The likelihood of word formation can therefore be judged from the co-occurrence probability of adjacent characters in context. By counting the frequency of character combinations in a corpus, the mutual information of adjacent co-occurring characters is calculated. The mutual information reflects how tightly the Chinese characters are bound together, and when it exceeds a certain threshold the character group can be judged to form a word. The advantages of this approach are that it is not restricted to the domain of the text being processed and requires no special dictionary. Statistical word segmentation is grounded in probability theory: the occurrence of a Chinese character string in context is abstracted as a random process whose parameters can be obtained by training on a large-scale corpus. Statistics-based segmentation uses mutual information, N-gram models, and other statistical models such as hidden Markov models, conditional random field models, neural network models, and maximum entropy models.
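As a rough illustration of the statistical idea in category (2), the sketch below scores adjacent character pairs by pointwise mutual information estimated from a tiny corpus. The function name `adjacent_pmi`, the base-2 logarithm, and the toy corpus are assumptions made for illustration; the source does not specify an estimator or a threshold.

```python
import math
from collections import Counter


def adjacent_pmi(corpus):
    """Return pointwise mutual information for every adjacent character pair.

    `corpus` is a list of strings. Probabilities are plain relative
    frequencies, which is an illustrative assumption.
    """
    char_counts = Counter()
    pair_counts = Counter()
    for sentence in corpus:
        char_counts.update(sentence)
        # Every two-character window is one adjacent pair.
        pair_counts.update(sentence[i:i + 2] for i in range(len(sentence) - 1))

    n_chars = sum(char_counts.values())
    n_pairs = sum(pair_counts.values())
    pmi = {}
    for pair, count in pair_counts.items():
        p_xy = count / n_pairs
        p_x = char_counts[pair[0]] / n_chars
        p_y = char_counts[pair[1]] / n_chars
        # High PMI: the two characters co-occur more often than chance predicts.
        pmi[pair] = math.log2(p_xy / (p_x * p_y))
    return pmi


scores = adjacent_pmi(["abab", "abc"])
# "ab" occurs three times and "ba" once, so "ab" is the more word-like pair.
likely_word = max(scores, key=scores.get)
```

A pair whose score exceeds a chosen threshold would then be treated as a candidate word; choosing that threshold is outside what the source describes.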
Taking a method based on character string matching as an example, in the character matching method provided by the present application, performing word segmentation processing on each character string to obtain a corresponding word may include the following steps:
(1) For each character string, a preset number of characters in the character string is taken as the matching field for each match. Specifically, the preset number may be the number of characters in the longest entry of the word segmentation dictionary.
(2) The matching field is matched against the entries of a preset word segmentation dictionary one by one. If the word segmentation dictionary contains the matching field, the match succeeds and the matching field is taken as one segmented word;
if the match fails, one character is deleted from the matching field and matching is performed again, until all characters in the character string have been matched successfully.
The application does not limit how one character is deleted from the matching field. When the matching field is taken from the head of the character string, the last character of the matching field may be deleted; when the matching field is taken from the end of the character string, the first character of the matching field may be deleted.
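The dictionary-based steps above correspond to forward maximum matching, which can be sketched as follows. The function name `forward_max_match`, the sample dictionary, and the fallback of emitting an unmatched single character as its own word are assumptions for illustration; the source only says that matching repeats until every character is matched.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment `text` by forward maximum matching against a set of entries."""
    if max_len is None:
        # Preset window size: length of the longest dictionary entry.
        max_len = max((len(entry) for entry in dictionary), default=1)
    max_len = max(1, max_len)

    words = []
    i = 0
    while i < len(text):
        # Take up to max_len characters from position i as the matching field.
        size = min(max_len, len(text) - i)
        while size > 1 and text[i:i + size] not in dictionary:
            # Match failed: drop the last character of the field and retry.
            size -= 1
        # Assumption: a lone character not in the dictionary still becomes a word.
        words.append(text[i:i + size])
        i += size
    return words


entries = {"研究", "研究生", "生命", "的", "起源"}
print(forward_max_match("研究生命的起源", entries))
```

With this dictionary the greedy forward scan picks "研究生" first, which shows why practical segmenters often combine forward and reverse matching.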
With the development of computer technology, word segmentation algorithms for character strings or texts are developed vigorously, and the application does not limit the word segmentation algorithms used when the character matching method provided by the application is applied specifically.
Step S102, the number of words of each of the two character strings is determined, and if the numbers of words are not equal, the character string with fewer words is used as a first character string and the character string with more words is used as a second character string.
Specifically, after the word segmentation processing is performed on the two character strings, each character string has a plurality of words after word segmentation, the number of words in the two character strings can be determined respectively, the character string with the small number of words is used as the first character string, and the character string with the large number of words is used as the second character string.
In the embodiment of the present application, if the number of words in two character strings is equal, one character string is selected as a first character string and the other character string is selected as a second character string in the two character strings.
It is to be understood that the prefixes of "first" and "second" in the first string and the second string in the present application are for convenience of description of subsequent steps, and do not limit the naming of the strings.
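Step S102 amounts to ordering the two segmented strings by word count. A minimal sketch, assuming each string is already a list of words and that ties keep the input order (the source allows either choice when the counts are equal); `order_by_word_count` is a hypothetical name:

```python
def order_by_word_count(words_a, words_b):
    """Return (first_words, second_words) with the shorter word list first."""
    if len(words_a) <= len(words_b):
        # Fewer words, or a tie: keep the input order.
        return words_a, words_b
    return words_b, words_a


first, second = order_by_word_count(["a", "b", "c", "d"], ["x", "y", "z"])
```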
Step S103, for each first word of the first character string, calculating a similarity between the first word and each second word of the second character string.
In the embodiment of the present application, after determining which of the two character strings is the first character string and which is the second character string, the similarity between the first word and each second word of the second character string may be calculated for each first word of the first character string.
For example, it is assumed that the words resulting from the word segmentation of the first character string are "word 1", "word 2", and "word 3", and the words resulting from the word segmentation of the second character string are "word a", "word b", "word c", and "word d".
The similarity between "word 1" and each of "word a", "word b", "word c", and "word d" can be calculated; likewise the similarity between "word 2" and each of "word a", "word b", "word c", and "word d"; and the similarity between "word 3" and each of "word a", "word b", "word c", and "word d". Four similarities are thus obtained for each of "word 1", "word 2", and "word 3".
The method for calculating the similarity between two words may be an edit distance algorithm, a cosine similarity algorithm, or a matrix similarity algorithm, and the present application is not limited thereto.
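One way to realize step S103 is a similarity matrix built from edit distance. The sketch below is a minimal example under assumptions: the names `edit_distance`, `word_similarity`, and `similarity_matrix` are hypothetical, and the normalization `1 - distance / max(len)` is one common choice, not something the source prescribes.

```python
def edit_distance(a, b):
    """Levenshtein distance between a and b, using two rolling rows."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def word_similarity(a, b):
    """Assumed normalization: map edit distance into the range [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a, b) / longest


def similarity_matrix(first_words, second_words):
    """Row i holds the similarities of first word i to every second word."""
    return [[word_similarity(f, s) for s in second_words] for f in first_words]


matrix = similarity_matrix(["john", "smith"], ["smyth", "jon", "a"])
```

Each row has one entry per second word, matching the "4 similarities per first word" count in the example above when the second string has four words.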
Step S104, based on the similarity between each first word and each second word, the target similarity between the first character string and the second character string is obtained.
In this embodiment of the present application, obtaining a target similarity between a first character string and a second character string based on a similarity between each first word and each second word may include the following steps:
(1) For each first word, the maximum similarity corresponding to the first word is determined from the similarities between the first word and each second word.
For example, assuming that, among the similarities between "word 1" and each of "word a", "word b", "word c", and "word d", the similarity between "word 1" and "word a" is the greatest, the similarity between "word 1" and "word a" can be taken as the maximum similarity corresponding to "word 1".
(2) The comprehensive similarity of the first character string to the second character string is acquired based on the maximum similarity corresponding to each first word.
Specifically, the maximum similarities corresponding to the first words may be merged to obtain the comprehensive similarity of the first character string to the second character string. The merging scheme is described in detail later.
(3) The target character string with the smaller number of characters is determined from the first character string and the second character string, and the target similarity between the first character string and the second character string is acquired based on the comprehensive similarity and the number of characters of the target character string (i.e., the smaller of the two character counts).
The target similarity can be used as a basis for finally evaluating the similarity between the two character strings.
In this embodiment of the present application, acquiring the target similarity between the first character string and the second character string based on the comprehensive similarity and the number of characters of the target character string may include: dividing the comprehensive similarity by the number of characters of the target character string to obtain the target similarity between the first character string and the second character string.
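Steps (1) and (3) above can be sketched as two small helpers. This is a literal reading of the text under assumptions: `max_similarities` and `target_similarity` are hypothetical names, each matrix row is assumed non-empty, and character counts are taken with `len` on the raw strings. The source does not state the scale on which the comprehensive similarity is expressed, so the range of the final value is not asserted here.

```python
def max_similarities(matrix):
    """For each row (one first word) return (best_score, best_column_index)."""
    result = []
    for row in matrix:
        # Index of the second word most similar to this first word.
        best_index = max(range(len(row)), key=row.__getitem__)
        result.append((row[best_index], best_index))
    return result


def target_similarity(comprehensive, first_string, second_string):
    """Divide the comprehensive similarity by the shorter string's length."""
    target_len = min(len(first_string), len(second_string))
    if target_len == 0:
        raise ValueError("both character strings must be non-empty")
    return comprehensive / target_len


best = max_similarities([[0.9, 0.2, 0.1, 0.0],
                         [0.3, 1.0, 0.5, 0.2]])
```

Here `best` pairs each first word with its highest score and the position of the second word that produced it.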
The character matching method comprises the following steps: acquiring two character strings to be compared, performing word segmentation on each character string, taking the character string with fewer words after segmentation as a first character string, and taking the character string with more words as a second character string; then calculating the similarity between each first word in the first character string and each second word in the second character string, and obtaining the target similarity between the first character string and the second character string from the calculated similarities. Because the method calculates the similarity between each first word and each second word and then combines the calculated similarities into the target similarity, the target similarity between the two character strings can be calculated accurately even when word order is ignored. This overcomes the defect that conventional similarity calculation methods do not support calculation when word order is ignored.
The embodiment of the present application provides a possible implementation manner, where the obtaining of the comprehensive similarity of the first character string with respect to the second character string based on the maximum similarity corresponding to each first word may include the following steps:
(1) The weight corresponding to each first word is acquired. The weight corresponding to a first word characterizes the importance of that word in the first character string. The weight may be determined from the proportion of the first character string's length that the word occupies, or may be set in other ways, for example by presetting the weights of certain important words in the character string to highlight their importance.
(2) Based on the weight corresponding to each first word, the maximum similarities corresponding to the first words are weighted and summed to obtain the comprehensive similarity of the first character string to the second character string.
For example, the first character string has two words, "word 1" and "word 2". The maximum similarity of "word 1" is 90% and the maximum similarity of "word 2" is 100%, while the weight of "word 1" is 25% and the weight of "word 2" is 75%. The comprehensive similarity of the entire character string is then calculated as 90% × 25% + 100% × 75% = 22.5% + 75% = 97.5%.
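The weighted summation in this example can be checked with a few lines of code. `comprehensive_similarity` is a hypothetical helper name, and similarities and weights are written as fractions rather than percentages:

```python
def comprehensive_similarity(max_sims, weights):
    """Weighted sum of the per-word maximum similarities."""
    if len(max_sims) != len(weights):
        raise ValueError("need exactly one weight per first word")
    return sum(sim * weight for sim, weight in zip(max_sims, weights))


# "word 1": similarity 0.90, weight 0.25; "word 2": similarity 1.00, weight 0.75.
score = comprehensive_similarity([0.90, 1.00], [0.25, 0.75])
```

The result agrees with the 97.5% figure above, up to floating-point rounding.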
The embodiment of the present application provides a possible implementation manner, and obtaining a weight corresponding to each first word includes:
and calculating the proportion of the length of the first word in the total length of the first character string for each first word, and taking the proportion as the weight corresponding to the first word.
In the embodiment of the present application, the length of the character string may include the following two definitions, which are not limited in the present application:
(1) The length of a string may generally refer to the number of characters contained in the string;
(2) It may also refer to the number of bytes occupied by the string; for example, in some systems an English letter counts as half a character and a Chinese character counts as one character.
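The two definitions of length give different numbers for mixed text, as the short sketch below shows. The helper names and the choice of the UTF-8 and GBK encodings are assumptions for illustration; the source does not name an encoding.

```python
def length_in_chars(text):
    """Definition (1): number of characters (Unicode code points in Python)."""
    return len(text)


def length_in_bytes(text, encoding="utf-8"):
    """Definition (2): number of bytes under an assumed encoding."""
    return len(text.encode(encoding))


sample = "ab中"
char_len = length_in_chars(sample)         # 3 characters
utf8_len = length_in_bytes(sample)         # 1 + 1 + 3 bytes
gbk_len = length_in_bytes(sample, "gbk")   # 1 + 1 + 2 bytes
```

Under GBK an English letter takes one byte and a Chinese character two, which matches the "half a character" convention mentioned above.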
In one example, the first character string has two words, "word 1" and "word 2"; the maximum similarity of "word 1" is 90% and the maximum similarity of "word 2" is 100%.
If "word 1" accounts for 25% of the total length of the character string and "word 2" accounts for 75%, then the weight of "word 1" is 25% and the weight of "word 2" is 75%. The comprehensive similarity of the entire character string is calculated as 90% × 25% + 100% × 75% = 22.5% + 75% = 97.5%.
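Length-proportional weights can be computed directly from the segmented words. In this sketch `length_weights` is a hypothetical name, and the total length is assumed to be the sum of the word lengths (separators between words are not counted), which the source does not spell out:

```python
def length_weights(words):
    """Weight of each word = its length / total length of all words."""
    total = sum(len(word) for word in words)
    if total == 0:
        raise ValueError("the first character string has no characters")
    return [len(word) / total for word in words]


# A 2-character word and a 6-character word give weights 25% and 75%.
weights = length_weights(["ab", "cdefgh"])
```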
In one example, the character matching method provided by the present application is shown in fig. 2, and may include the following steps:
Step S201, acquiring two character strings to be processed, and performing word segmentation processing on each character string to obtain the corresponding words;
Step S202, determining the number of words of each of the two character strings, taking the character string with fewer words as the first character string, and taking the character string with more words as the second character string;
Step S203, for each first word, determining the maximum similarity corresponding to the first word from the similarities between the first word and each second word;
Step S204, acquiring the weight corresponding to each first word; specifically, for each first word, calculating the proportion of the length of the first word in the total length of the first character string, and taking the proportion as the weight corresponding to the first word;
Step S205, performing a weighted summation of the maximum similarities corresponding to the first words based on the weight corresponding to each first word, to obtain the comprehensive similarity of the first character string with respect to the second character string;
Step S206, determining, from the first character string and the second character string, the target character string with fewer characters, and acquiring the target similarity between the first character string and the second character string based on the comprehensive similarity and the number of characters of the target character string; specifically, the target similarity is obtained by dividing the comprehensive similarity by the number of characters of the target character string.
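Steps S201 to S206 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation of the present application: words are assumed to be separated by whitespace, the word-level similarity is a stand-in taken from the standard library's `difflib.SequenceMatcher`, string length is counted in characters excluding whitespace, and all function names are hypothetical.

```python
from difflib import SequenceMatcher
from typing import List


def word_similarity(a: str, b: str) -> float:
    """Stand-in word similarity in [0, 1] (assumption, not from the source)."""
    return SequenceMatcher(None, a, b).ratio()


def comprehensive_similarity(first: List[str], second: List[str]) -> float:
    # S203: best match of each first word among all second words.
    max_sims = [max(word_similarity(w, v) for v in second) for w in first]
    # S204: weight = word length / total length of the first string's words.
    total_len = sum(len(w) for w in first)
    weights = [len(w) / total_len for w in first]
    # S205: weighted sum of the maximum similarities.
    return sum(wt * m for wt, m in zip(weights, max_sims))


def target_similarity(s1: str, s2: str) -> float:
    # S201: segmentation (assumption: whitespace-delimited words).
    words1, words2 = s1.split(), s2.split()
    if not words1 or not words2:
        raise ValueError("both strings must contain at least one word")
    # S202: the string with fewer words is the first (reference) string.
    if len(words1) <= len(words2):
        first, second = words1, words2
    else:
        first, second = words2, words1
    comprehensive = comprehensive_similarity(first, second)
    # S206: divide by the character count of the string with fewer characters
    # (assumption: whitespace is not counted).
    target_chars = min(sum(len(w) for w in words1),
                       sum(len(w) for w in words2))
    return comprehensive / target_chars
```

For example, `comprehensive_similarity(["new", "york"], ["york", "new", "city"])` evaluates to approximately 1.0 even though the words appear in a different order, which is the order-independence the method aims for; `target_similarity("new york", "york new city")` then divides that value by 7, the character count of the shorter string.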
In an actual application scenario, the method for calculating the similarity of the character strings can be applied to a list management monitoring platform and used for calculating the similarity between the character strings in a list so as to more accurately evaluate the similarity between the character strings.
Specifically, the method for calculating the similarity of character strings provided by the present application may be as shown in fig. 3, and includes the following steps:
First, obtaining a first character string and a second character string to be processed, where the first character string may be a source character string and the second character string may be a target character string;
Second, performing word segmentation processing on the source character string and the target character string respectively, to obtain at least one word for each of the source character string and the target character string;
Third, after the word segmentation processing, comparing the number of words in the source character string with the number of words in the target character string, and, taking the character string with fewer words as the reference, calculating the similarity between each word in that character string and each word in the other character string.
Specifically, if the number of words obtained by segmenting the source character string is less than the number of words obtained by segmenting the target character string, the source character string is taken as the reference, and the similarity between each word of the source character string and each word of the target character string is calculated. For each word in the source character string, the largest of the similarities corresponding to that word is taken as the maximum similarity of the word. In this way, the maximum similarity corresponding to each word in the source character string can be determined.
Similarly, if the number of words obtained by segmenting the target character string is less than the number of words obtained by segmenting the source character string, the target character string is taken as the reference, and the similarity between each word of the target character string and each word of the source character string is calculated. For each word in the target character string, the largest of the similarities corresponding to that word is taken as the maximum similarity of the word. In this way, the maximum similarity corresponding to each word in the target character string can be determined.
The similarity may be calculated by a preset algorithm, and the preset algorithm may be an edit distance algorithm.
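As an illustration of an edit distance algorithm used as the preset algorithm, the following Python sketch computes the Levenshtein distance and converts it to a similarity. The conversion (one minus the distance divided by the longer length) is an assumption; the description does not state how a distance is mapped to a similarity. The function names are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance via dynamic programming, keeping two rows."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


def edit_similarity(a: str, b: str) -> float:
    """Map the distance to [0, 1]; this normalization is an assumption."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty words are treated as identical
    return 1.0 - edit_distance(a, b) / longest


print(edit_distance("kitten", "sitting"))  # 3
```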
Fourth, merging the similarities to obtain the comprehensive similarity between the character strings. The weight corresponding to each word of the reference character string may be obtained, and the maximum similarities corresponding to the words are weighted and summed based on these weights, to obtain the comprehensive similarity between the character strings.
Specifically, if the source character string is used as the reference, the weight corresponding to each word in the source character string is obtained, and the maximum similarities corresponding to the words are weighted and summed based on the weight of each word in the source character string, to obtain the comprehensive similarity between the character strings.
If the target character string is used as the reference, the weight corresponding to each word in the target character string is obtained, and the maximum similarities corresponding to the words are weighted and summed based on the weight of each word in the target character string, to obtain the comprehensive similarity between the character strings.
The method for calculating the weight includes, but is not limited to: for each word, calculating the proportion of the length of the word in the total length of the character string to which the word belongs, and taking the calculated proportion as the weight of the word.
Fifth, acquiring the number of characters of whichever of the source character string and the target character string has fewer characters (the target character number), and dividing the comprehensive similarity between the two character strings by the target character number to obtain the target similarity between the two character strings.
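The third to fifth steps can be summarized in formulas. The notation is introduced here only for illustration and does not appear in the original: a_i are the words of the reference character string, b_j are the words of the other character string, sim is the preset word-level similarity, ℓ is the word length, and n_source and n_target are the numbers of characters of the two character strings.

```latex
\begin{aligned}
m_i &= \max_{j}\ \operatorname{sim}(a_i, b_j), \\
w_i &= \frac{\ell(a_i)}{\sum_{k} \ell(a_k)}, \\
S   &= \sum_{i} w_i\, m_i, \\
T   &= \frac{S}{\min\bigl(n_{\text{source}},\, n_{\text{target}}\bigr)}.
\end{aligned}
```

With the figures of the earlier two-word example (weights 25% and 75%, maximum similarities 90% and 100%), the comprehensive similarity is S = 0.25 × 0.90 + 0.75 × 1.00 = 0.975.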
An embodiment of the present application provides a character matching apparatus. As shown in fig. 4, the character matching apparatus 40 may include: a character string segmentation module 401, a word number comparison module 402, a similarity calculation module 403, and a similarity processing module 404, wherein,
the character string segmentation module 401 is configured to obtain two character strings to be processed, and perform segmentation processing on each character string to obtain a corresponding word;
a word number comparison module 402, configured to determine the number of words in two character strings, respectively, and if the number of words in two character strings is not equal, take a character string with a small number of words as a first character string, and take a character string with a large number of words as a second character string; if the number of words in the two character strings is equal, selecting one character string from the two character strings as a first character string, and selecting the other character string as a second character string;
a similarity calculation module 403, configured to calculate, for each first word of the first character string, a similarity between the first word and each second word of the second character string;
and a similarity processing module 404, configured to obtain a target similarity between the first character string and the second character string based on a similarity between each first word and each second word.
The character matching apparatus described above acquires two character strings to be compared, performs word segmentation processing on each character string, takes the character string with fewer words after word segmentation as the first character string, and takes the character string with more words as the second character string. The similarity between each first word in the first character string and each second word in the second character string is calculated, and the target similarity between the first character string and the second character string is acquired from the calculated similarities. Because the similarity between each first word and each second word is calculated and the calculated similarities are merged to obtain the target similarity, the target similarity between the two character strings is accurately calculated while the order of the words in the character strings is ignored, which overcomes the defect that conventional similarity calculation methods do not support calculation that ignores word order.
In this embodiment of the application, when the similarity processing module 404 obtains the target similarity between the first character string and the second character string based on the similarity between each first word and each second word, the similarity processing module is specifically configured to:
for each first word, determining the maximum similarity corresponding to the first word from the similarity between the first word and each second word;
acquiring the comprehensive similarity of the first character string to the second character string based on the maximum similarity corresponding to each first word;
determining a target character string with a small number of characters from the first character string and the second character string; and acquiring the target similarity between the first character string and the second character string based on the comprehensive similarity and the number of the characters of the target character string.
In this embodiment of the present application, when the similarity processing module 404 obtains the comprehensive similarity of the first character string with respect to the second character string based on the maximum similarity corresponding to each first word, the similarity processing module is specifically configured to:
acquiring the weight corresponding to each first word;
and performing weighted summation on the maximum similarity corresponding to each first word based on the weight corresponding to each first word to obtain the comprehensive similarity of the first character string to the second character string.
In this embodiment of the present application, when the similarity processing module 404 obtains the weight corresponding to each first word, the similarity processing module is specifically configured to:
for each first word, calculating the proportion of the length of the first word in the total length of the first character string, and taking the proportion as the weight corresponding to the first word.
In this embodiment of the present application, when the similarity processing module 404 obtains the target similarity between the first character string and the second character string based on the comprehensive similarity and the number of characters of the target character string, the similarity processing module is specifically configured to:
dividing the comprehensive similarity by the number of characters of the target character string to obtain the target similarity between the first character string and the second character string.
In this embodiment of the present application, when the character string segmentation module performs segmentation processing on each character string to obtain a corresponding word, the character string segmentation module is specifically configured to:
for each character string, taking a preset number of characters of the character string as the matching field in each round of matching;
matching the matching field against the entries of a preset word segmentation dictionary one by one; if the word segmentation dictionary contains the matching field, the matching succeeds and the matching field is taken as a segmented word;
if the matching fails, deleting one character from the matching field and matching again, until all characters in the character string have been successfully matched.
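The dictionary-based segmentation described above resembles forward maximum matching and can be sketched in Python as follows. Several details are assumptions, since the description does not fix them: the character deleted after a failed match is the last one in the matching field, a single character that is not in the dictionary is emitted as a word of its own so that the loop always terminates, and the function name, parameter names, and sample dictionary are hypothetical.

```python
from typing import List, Set


def segment(text: str, dictionary: Set[str], max_len: int = 4) -> List[str]:
    """Segment text by repeatedly shrinking a matching field from the right."""
    words = []
    pos = 0
    while pos < len(text):
        # Start with a matching field of up to max_len characters.
        field = text[pos:pos + max_len]
        # On failure, delete the last character and try again; stop when the
        # field is a dictionary entry or only one character remains.
        while len(field) > 1 and field not in dictionary:
            field = field[:-1]
        words.append(field)
        pos += len(field)
    return words


# Hypothetical dictionary for illustration only.
entries = {"中国", "人民", "银行", "中国人民银行"}
print(segment("中国人民银行", entries, max_len=4))  # ['中国', '人民', '银行']
print(segment("中国人民银行", entries, max_len=6))  # ['中国人民银行']
```

The preset number of characters (`max_len` here) bounds the longest word that can be recognized, so the same dictionary can yield different segmentations for different values.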
The apparatus in the embodiments of the present application can execute the method provided in the embodiments of the present application, and the implementation principles are similar. The actions executed by the modules of the apparatus correspond to the steps of the method; for a detailed functional description of the modules, refer to the description of the corresponding method above, which is not repeated here.
The embodiments of the present application provide an electronic device comprising a memory, a processor, and a computer program stored on the memory, wherein the processor executes the computer program to implement the steps of the character matching method. Compared with the related art, the following can be achieved: the target similarity between two character strings is accurately calculated while the order of the words in the character strings is ignored, which overcomes the defect that conventional similarity calculation methods do not support calculation that ignores word order.
In an alternative embodiment, an electronic device is provided. As shown in fig. 5, the electronic device 4000 comprises a processor 4001 and a memory 4003. The processor 4001 is coupled to the memory 4003, for example via a bus 4002. Optionally, the electronic device 4000 may further include a transceiver 4004, which may be used for data interaction between the electronic device and other electronic devices, such as transmitting and/or receiving data. In practical applications, the number of transceivers 4004 is not limited to one, and the structure of the electronic device 4000 does not constitute a limitation on the embodiments of the present application.
The processor 4001 may be a CPU (Central Processing Unit), a general-purpose processor, a DSP (Digital Signal Processor), an ASIC (Application Specific Integrated Circuit), an FPGA (Field Programmable Gate Array) or other programmable logic device, a transistor logic device, a hardware component, or any combination thereof. It may implement or perform the various illustrative logical blocks, modules, and circuits described in connection with the present disclosure. The processor 4001 may also be a combination that performs a computing function, for example a combination of one or more microprocessors, or a combination of a DSP and a microprocessor.
Bus 4002 may include a path that carries information between the aforementioned components. The bus 4002 may be a PCI (Peripheral Component Interconnect) bus, an EISA (Extended Industry Standard Architecture) bus, or the like. The bus 4002 may be divided into an address bus, a data bus, a control bus, and the like. For ease of illustration, only one thick line is shown in FIG. 5, but that does not indicate only one bus or one type of bus.
The memory 4003 may be a ROM (Read-Only Memory) or another type of static storage device that can store static information and instructions, a RAM (Random Access Memory) or another type of dynamic storage device that can store information and instructions, an EEPROM (Electrically Erasable Programmable Read-Only Memory), a CD-ROM (Compact Disc Read-Only Memory) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store a computer program and that can be read by a computer, without limitation.
The memory 4003 is used for storing the computer program for executing the embodiments of the present application, and its execution is controlled by the processor 4001. The processor 4001 is used to execute the computer program stored in the memory 4003 to implement the steps shown in the foregoing method embodiments.
The electronic devices include, but are not limited to: mobile terminals such as mobile phones, notebook computers, and PADs (tablet computers), and fixed terminals such as digital TVs and desktop computers.
The embodiments of the present application provide a computer-readable storage medium, on which a computer program is stored, where the computer program, when executed by a processor, can implement the steps of the foregoing method embodiments and corresponding content.
Embodiments of the present application further provide a computer program product, which includes a computer program, and when the computer program is executed by a processor, the steps and corresponding contents of the foregoing method embodiments can be implemented.
The terms "first," "second," "third," "fourth," "1," "2," and the like in the description and in the claims of the present application and in the above-described drawings (if any) are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It should be understood that the data so used are interchangeable under appropriate circumstances such that the embodiments of the application described herein are capable of operation in other sequences than illustrated or otherwise described herein.
It should be understood that, although each operation step is indicated by an arrow in the flowchart of the embodiment of the present application, the implementation order of the steps is not limited to the order indicated by the arrow. In some implementation scenarios of the embodiments of the present application, the implementation steps in the flowcharts may be performed in other sequences as needed, unless explicitly stated otherwise herein. In addition, some or all of the steps in each flowchart may include multiple sub-steps or multiple stages based on an actual implementation scenario. Some or all of these sub-steps or stages may be performed at the same time, or each of these sub-steps or stages may be performed at different times, respectively. In a scenario where execution times are different, an execution sequence of the sub-steps or the phases may be flexibly configured according to requirements, which is not limited in the embodiment of the present application.
The foregoing is only an optional implementation manner of a part of implementation scenarios in the present application, and it should be noted that, for those skilled in the art, other similar implementation means based on the technical idea of the present application are also within the protection scope of the embodiments of the present application without departing from the technical idea of the present application.

Claims (9)

1. A character matching method, comprising:
acquiring two character strings to be processed, and performing word segmentation processing on each character string to obtain a corresponding word;
respectively determining the number of words of two character strings, and if the number of words of the two character strings is not equal, taking the character string with small number of words as a first character string and taking the character string with large number of words as a second character string;
if the number of words in the two character strings is equal, selecting one character string from the two character strings as a first character string, and selecting the other character string as a second character string;
for each first word of the first character string, respectively calculating the similarity between the first word and each second word of the second character string;
and acquiring the target similarity between the first character string and the second character string based on the similarity between each first word and each second word.
2. The character matching method according to claim 1, wherein the obtaining of the target similarity between the first character string and the second character string based on the similarity between each first word and each second word comprises:
for each first word, determining the maximum similarity corresponding to the first word from the similarities of the first word and each second word;
acquiring the comprehensive similarity of the first character string with respect to the second character string based on the maximum similarity corresponding to each first word;
determining a target character string with a smaller number of characters from the first character string and the second character string; and acquiring the target similarity between the first character string and the second character string based on the comprehensive similarity and the number of the characters of the target character string.
3. The character matching method according to claim 2, wherein said obtaining the comprehensive similarity of the first character string with respect to the second character string based on the maximum similarity corresponding to each first word comprises:
acquiring the weight corresponding to each first word;
and carrying out weighted summation on the maximum similarity corresponding to each first word based on the weight corresponding to each first word to obtain the comprehensive similarity of the first character string to the second character string.
4. The character matching method according to claim 3, wherein the acquiring of the weight corresponding to each first word comprises:
and calculating the proportion of the length of the first word in the total length of the first character string for each first word, and taking the proportion as the weight corresponding to the first word.
5. The character matching method according to any one of claims 2 to 4, wherein the obtaining of the target similarity between the first character string and the second character string based on the comprehensive similarity and the number of characters of the target character string includes:
and dividing the comprehensive similarity by the number of the characters of the target character string to obtain the target similarity between the first character string and the second character string.
6. The character matching method according to claim 1, wherein the performing word segmentation processing on each character string to obtain a corresponding word comprises:
for each character string, taking a preset number of characters of the character string as the matching field in each round of matching;
matching the matching field against the entries of a preset word segmentation dictionary one by one; if the word segmentation dictionary contains the matching field, the matching succeeds and the matching field is taken as a segmented word;
if the matching fails, deleting one character from the matching field and matching again, until all characters in the character string have been successfully matched.
7. A character matching apparatus, comprising:
the character string word segmentation module is used for acquiring two character strings to be processed, and performing word segmentation processing on each character string to obtain a corresponding word;
the word number comparison module is used for respectively determining the word number of the two character strings, if the word number of the two character strings is not equal, the character string with the small word number is used as a first character string, and the character string with the large word number is used as a second character string; if the number of words in the two character strings is equal, selecting one character string from the two character strings as a first character string, and selecting the other character string as a second character string;
a similarity calculation module, configured to calculate, for each first word of the first character string, a similarity between the first word and each second word of the second character string;
and the similarity processing module is used for acquiring the target similarity between the first character string and the second character string based on the similarity between each first word and each second word.
8. An electronic device comprising a memory, a processor and a computer program stored on the memory, characterized in that the processor executes the computer program to implement the steps of the method of any of claims 1-6.
9. A computer-readable storage medium, on which a computer program is stored, which, when being executed by a processor, carries out the steps of the character matching method according to any one of claims 1 to 6.
CN202210976998.3A 2022-08-15 2022-08-15 Character matching method and device, electronic equipment and readable storage medium Pending CN115392235A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210976998.3A CN115392235A (en) 2022-08-15 2022-08-15 Character matching method and device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210976998.3A CN115392235A (en) 2022-08-15 2022-08-15 Character matching method and device, electronic equipment and readable storage medium

Publications (1)

Publication Number Publication Date
CN115392235A true CN115392235A (en) 2022-11-25

Family

ID=84117852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210976998.3A Pending CN115392235A (en) 2022-08-15 2022-08-15 Character matching method and device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN115392235A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116757189A (en) * 2023-08-11 2023-09-15 四川互慧软件有限公司 Patient name disambiguation method based on Chinese character features
CN116757189B (en) * 2023-08-11 2023-10-31 四川互慧软件有限公司 Patient name disambiguation method based on Chinese character features
CN117573943A (en) * 2024-01-11 2024-02-20 云筑信息科技(成都)有限公司 Data comparison method based on serialization similarity calculation

Similar Documents

Publication Publication Date Title
CN111368043A (en) Event question-answering method, device, equipment and storage medium based on artificial intelligence
CN111797210A (en) Information recommendation method, device and equipment based on user portrait and storage medium
CN115392235A (en) Character matching method and device, electronic equipment and readable storage medium
CN110941951B (en) Text similarity calculation method, text similarity calculation device, text similarity calculation medium and electronic equipment
CN110427487B (en) Data labeling method and device and storage medium
CN107885717B (en) Keyword extraction method and device
CN111143556A (en) Software function point automatic counting method, device, medium and electronic equipment
CN111459977A (en) Conversion of natural language queries
CN111259262A (en) Information retrieval method, device, equipment and medium
CN111930792A (en) Data resource labeling method and device, storage medium and electronic equipment
CN114116973A (en) Multi-document text duplicate checking method, electronic equipment and storage medium
CN113934848B (en) Data classification method and device and electronic equipment
CN110362798B (en) Method, apparatus, computer device and storage medium for judging information retrieval analysis
CN113486178B (en) Text recognition model training method, text recognition method, device and medium
CN110704608A (en) Text theme generation method and device and computer equipment
CN110929499B (en) Text similarity obtaining method, device, medium and electronic equipment
CN111143515B (en) Text matching method and device
CN112417875A (en) Configuration information updating method and device, computer equipment and medium
CN114691835A (en) Audit plan data generation method, device and equipment based on text mining
CN111062208B (en) File auditing method, device, equipment and storage medium
CN110276001B (en) Checking page identification method and device, computing equipment and medium
CN112926297A (en) Method, apparatus, device and storage medium for processing information
CN111967248A (en) Pinyin identification method and device, terminal equipment and computer readable storage medium
CN111985235A (en) Text processing method and device, computer readable storage medium and electronic equipment
CN113377922B (en) Method, device, electronic equipment and medium for matching information

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination