CN111324784B - Character string processing method and device - Google Patents

Info

Publication number
CN111324784B
Authority
CN
China
Prior art keywords
character
string
semantic
sub
character string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010065546.0A
Other languages
Chinese (zh)
Other versions
CN111324784A (en)
Inventor
魏爱勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Advanced New Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Advanced New Technologies Co Ltd filed Critical Advanced New Technologies Co Ltd
Priority to CN202010065546.0A priority Critical patent/CN111324784B/en
Publication of CN111324784A publication Critical patent/CN111324784A/en
Application granted granted Critical
Publication of CN111324784B publication Critical patent/CN111324784B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The application discloses a character string processing method comprising: acquiring a character string to be identified; segmenting the character string to be identified to obtain character sub-strings to be identified; determining the semantic weight of each character sub-string to be identified; searching for a target character string according to the character sub-strings to be identified; segmenting the target character string to obtain target character sub-strings; determining the semantic weight of each target character sub-string; determining the semantic editing distance between the character string to be identified and the target character string according to the semantic weights of the character sub-strings to be identified and of the target character sub-strings; and determining the similarity between the character string to be identified and the target character string according to the semantic editing distance. Because the semantic editing distance is determined from character sub-strings that carry semantic weights, the accuracy of character string similarity recognition is greatly improved, which addresses the poor accuracy of existing character string recognition. The application also discloses a corresponding character string processing device.

Description

Character string processing method and device
The present application is a divisional application of patent application No. 201510103200.4, filed on March 9, 2015 and entitled "A character string processing method and apparatus".
Technical Field
The present disclosure relates to the field of computer technologies, and in particular to a method and an apparatus for processing character strings.
Background
The internet increasingly shapes daily life, and the volume of internet data has grown explosively, so storing and identifying data of all kinds has become an increasingly important problem. Some application scenarios require identifying and classifying addresses, blacklists, problem names and the like, which involves calculating the similarity of character strings in a huge database.
In the internet field, a service provider's database stores huge amounts of commodity, service and user data, including user addresses, company names, commodity names and the like. If the character strings representing addresses and company names come directly from information filled in by users, they take many forms. For example, if the full name of a company is "Shanghai XXX stock limited company", staff of the company may enter the name as "XXX", "XXX company", "Shanghai XXX" and so on. In this case it is often difficult to recognize that "XXX", "XXX company" and "Shanghai XXX" all match the full name "Shanghai XXX stock limited company".
At present, the similarity of character strings in a database is calculated with the edit distance (Levenshtein distance) algorithm, which computes the minimum number of insertions, deletions and substitutions needed to transform the original string S into the target string T. The string similarity formula is: $\mathrm{similarity}(S, T) = 1 - \dfrac{\text{edit distance}}{\max(\mathrm{length}(S), \mathrm{length}(T))}$. However, the ordinary edit distance algorithm operates mechanically on the individual characters that make up the strings, so the similarity it produces is not accurate enough. For example, the similarity of "ABC information technology limited company" and "ABC" may come out as 27.27%, while that of "ABC information technology limited company" and "XYZ information technology limited company" may come out as 72.73%. These results differ greatly from the actual similarity, the accuracy is low, and subsequent processing such as recognition and classification is correspondingly inaccurate.
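The character-level baseline described above can be sketched in Python as follows. This is only an illustration of the conventional Levenshtein similarity, not part of the claimed method. The function names are chosen here, and the Chinese company names are an assumed back-translation of the examples (a three-letter prefix followed by eight characters), which is consistent with the 27.27% and 72.73% figures quoted in the text.

```python
def levenshtein(s: str, t: str) -> int:
    """Minimum number of single-character insertions, deletions and substitutions."""
    prev = list(range(len(t) + 1))          # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                          # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # delete cs
                            curr[j - 1] + 1,      # insert ct
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_similarity(s: str, t: str) -> float:
    """1 - edit distance / max(len(s), len(t)); two empty strings count as identical."""
    longest = max(len(s), len(t))
    return 1.0 if longest == 0 else 1.0 - levenshtein(s, t) / longest


# Assumed original-language forms of the example names (hypothetical back-translation).
full_name = "ABC信息技术有限公司"
print(f"{char_similarity(full_name, 'ABC'):.2%}")                  # 27.27%
print(f"{char_similarity(full_name, 'XYZ信息技术有限公司'):.2%}")  # 72.73%
```

The two names that denote the same company score lower than the two names that denote different companies, which is the inaccuracy the application sets out to address.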
Disclosure of Invention
The embodiments of the present application provide a character string processing method to solve the following problem in the prior art: when character strings such as lists and addresses are identified, the edit distance algorithm is applied to strings treated as sets of single characters, which yields string similarities of low accuracy and therefore poor accuracy in subsequent processing such as identification and classification.
The embodiments of the present application also provide a character string processing device to solve the same problem.
The embodiment of the application adopts the following technical scheme:
a character string processing method, comprising:
acquiring a character string to be identified;
word segmentation is carried out on the character strings to be identified, and each character sub-string to be identified is obtained;
determining the semantic weight of each character substring to be identified;
searching a target character string according to each character sub-string to be identified;
word segmentation is carried out on the target character strings to obtain target character sub-strings;
determining the semantic weight of each target character substring;
determining the semantic editing distance between the character string to be recognized and the target character string according to the semantic weights of the character strings to be recognized and the target character strings;
determining the similarity between the character string to be identified and the target character string according to the semantic editing distance;
and carrying out subsequent processing on the character strings to be identified according to the similarity.
A character string processing apparatus comprising:
an acquisition unit for acquiring a character string to be identified;
the searching unit is used for searching the target character string according to the character string to be identified;
the word segmentation unit is used for respectively segmenting the character strings to be identified and the target character strings to obtain character sub-strings to be identified and target character sub-strings;
the semantic weight determining unit is used for determining the semantic weight of each character sub-string to be identified and each target character sub-string;
the editing distance determining unit is used for determining the semantic editing distance between the character string to be recognized and the target character string according to the semantic weights of the character strings to be recognized and the target character strings;
the similarity determining unit is used for determining the similarity between the character string to be identified and the target character string according to the semantic editing distance;
and the subsequent processing unit is used for performing subsequent processing on the character string to be identified according to the similarity.
At least one of the technical solutions adopted in the embodiments of the present application can achieve the following beneficial effects:
In processing internet data, character strings are segmented into character sub-strings, each sub-string is given a semantic weight, and the semantic editing distance is calculated from the weighted sub-strings. This improves the accuracy of identifying character strings according to their semantics, and solves the prior-art problem that applying the edit distance algorithm to strings treated as sets of single characters, when identifying character strings such as lists and addresses, yields string similarities of low accuracy and therefore poor accuracy in subsequent processing such as identification and classification.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute an undue limitation to the application. In the drawings:
FIG. 1 is a flowchart of a method for processing a character string according to an embodiment of the present disclosure;
FIG. 2 is a schematic structural diagram of a character string processing device according to a second embodiment of the present application.
Detailed Description
For the purposes, technical solutions and advantages of the present application, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art without undue burden from the present disclosure, are within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Embodiment 1
FIG. 1 is a flowchart of the character string processing method according to Embodiment 1 of the present application. The method decomposes a character string into character sub-strings with semantic weights, calculates the semantic editing distance between character strings from those weights, and then calculates the similarity. This effectively improves the accuracy of similarity measured in semantic units and facilitates subsequent processing of the character strings, such as classification and identification. The method comprises the following steps:
S101: Acquire the character string to be identified.
The acquired character string to be identified S includes one or more of a company name, address, commodity name, blacklist, problem name or description entered by a user.
For example, a user needs to enter a shipping address on a service website, a service provider needs to enter commodity names, and some users may need to set up blacklists. Such data may contain character strings that express the same meaning in different ways, and the amount of data the service website must store keeps growing, so the system needs to identify the data entered by users in order to facilitate subsequent operations such as classifying, adding and replacing.
S102: Segment the character string to be identified to obtain the character sub-strings to be identified.
The character string to be identified S is segmented according to semantic units to obtain the character sub-strings with semantics, $S = \{s_1, s_2, s_3, \ldots, s_i\}$. This step uses a grammar analysis unit to perform the word segmentation.
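The application does not specify how the grammar analysis unit segments a string. As one possible illustration only, the sketch below uses forward maximum matching against a vocabulary, a common dictionary-based segmentation technique that is substituted here for the unspecified unit. The function name `segment`, the vocabulary and the sample string are all assumptions made for this sketch.

```python
def segment(text: str, vocabulary: set[str]) -> list[str]:
    """Greedy longest-match segmentation; uncovered runs become their own sub-string."""
    max_len = max((len(word) for word in vocabulary), default=1)
    tokens: list[str] = []
    unknown = ""   # run of characters not covered by the vocabulary (e.g. a brand name)
    pos = 0
    while pos < len(text):
        match = ""
        # Try the longest candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if candidate in vocabulary:
                match = candidate
                break
        if match:
            if unknown:
                tokens.append(unknown)
                unknown = ""
            tokens.append(match)
            pos += len(match)
        else:
            unknown += text[pos]
            pos += 1
    if unknown:
        tokens.append(unknown)
    return tokens


# Hypothetical vocabulary: information, technology, limited, company.
vocab = {"信息", "技术", "有限", "公司"}
print(segment("ABC信息技术有限公司", vocab))
```

A production system would more likely rely on a full lexical analyzer; the point here is only that the output is a list of semantic units rather than single characters.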
S103: Determine the semantic weight of each character sub-string to be identified.
A semantic weight table $W_n$ is stored in a local database. The table is computed in advance from samples stored in the database, as follows:
Extract a number of character string samples; the samples may be, for example, more than 10000 lines of lists or addresses of the same class. De-duplicate the extracted samples, that is, remove identical character strings so that no sample is counted twice. Segment the extracted samples, using the same word segmentation processing as in step S102, to obtain sample substrings that are semantic units. Finally, apply the inverse document frequency (IDF) measure of the general importance of a word, taken from term frequency-inverse document frequency (TF-IDF):
$\mathrm{idf}_i = \log \dfrac{|D|}{|\{j : t_i \in d_j\}|}$
to calculate the semantic weight of each sample substring, $W_n = \{(w_1, \mathrm{idf}_1), (w_2, \mathrm{idf}_2), (w_3, \mathrm{idf}_3), \ldots, (w_n, \mathrm{idf}_n)\}$, where $|D|$ is the total number of sample strings and $|\{j : t_i \in d_j\}|$ is the number of sample strings that contain the sample substring $t_i$. If a substring does not occur in the samples, the denominator is zero, so $1 + |\{j : t_i \in d_j\}|$ is typically used instead. If the weight set for a class of samples is generally applicable, the set is stored under a class name, such as "W(companyName)" or "W(address)", so that the corresponding weight set can be called directly the next time the same scenario arises.
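The table-building procedure above (de-duplicate, segment, compute IDF) might look like the following Python sketch. It assumes the samples have already been segmented, uses the smoothed denominator $1 + |\{j : t_i \in d_j\}|$ mentioned in the text, and uses the natural logarithm because the application does not state a base. The function name and the toy samples are hypothetical, and the sketch makes no attempt to reproduce the weight values used in the worked examples.

```python
import math
from collections import Counter


def build_weight_table(samples: list[list[str]]) -> dict[str, float]:
    """Map each sample substring to idf = log(|D| / (1 + number of samples containing it))."""
    # De-duplication: identical sample strings are counted once.
    unique_samples = {tuple(sample) for sample in samples}
    total = len(unique_samples)

    # Document frequency: in how many distinct samples does each substring occur?
    doc_freq: Counter[str] = Counter()
    for sample in unique_samples:
        doc_freq.update(set(sample))

    return {sub: math.log(total / (1 + count)) for sub, count in doc_freq.items()}


# Toy, already-segmented samples (hypothetical); the third duplicates the first.
samples = [
    ["ABC", "company"],
    ["XYZ", "company"],
    ["ABC", "company"],
    ["DEF", "limited", "company"],
]
w_company_name = build_weight_table(samples)
print(w_company_name["ABC"] > w_company_name["company"])  # True
```

With the smoothed denominator, a substring that occurs in every sample receives a slightly negative value, so an implementation may want to clamp or rescale the weights before using them as edit costs; the application does not say how its table values are scaled.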
The semantic weight table is then searched for each character sub-string to be identified, and the corresponding semantic weight is looked up, giving the weighted character sub-strings to be identified $Sw = \{(s_1, sw_1), (s_2, sw_2), (s_3, sw_3), \ldots, (s_m, sw_m)\}$.
S104: Search for a target character string according to each character sub-string to be identified.
The target character string T is one or more of the correct company names, addresses, commodity names, blacklists, problem names or descriptions stored in the local database.
From the character string to be identified, select the character sub-strings whose semantic weight is greater than a set threshold; then search the target character string database with the selected sub-strings to find the target character strings.
One or more character sub-strings may exceed the threshold, one or more target character strings may be found, and each target character string contains the selected character sub-string to be identified.
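A minimal Python sketch of this candidate search is given below. Several details are assumptions made for illustration: the target character string database is modeled as an in-memory list, the threshold value 0.5 is arbitrary, a target must contain every selected sub-string (the text leaves open whether all or any must match), and an empty selection returns no candidates.

```python
def select_key_substrings(weighted: list[tuple[str, float]], threshold: float) -> list[str]:
    """Keep the sub-strings whose semantic weight is greater than the threshold."""
    return [sub for sub, weight in weighted if weight > threshold]


def find_target_strings(
    weighted: list[tuple[str, float]],
    database: list[str],
    threshold: float,
) -> list[str]:
    """Return the stored strings that contain all selected high-weight sub-strings."""
    keys = select_key_substrings(weighted, threshold)
    if not keys:  # assumption: with nothing distinctive to search for, return no candidates
        return []
    return [target for target in database if all(key in target for key in keys)]


# Hypothetical stand-in for the target character string database.
database = [
    "ABC information technology limited company",
    "XYZ information technology limited company",
]
sw = [("ABC", 0.98), ("company", 0.01)]
print(find_target_strings(sw, database, threshold=0.5))
# ['ABC information technology limited company']
```

Filtering on high-weight sub-strings keeps the candidate set small, so the more expensive distance calculation runs only on plausible matches.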
S105: Segment the target character string to obtain the target character sub-strings.
The selected target character strings are segmented one by one, in the same way as in step S102, to obtain the target character sub-strings $T = \{t_1, t_2, t_3, \ldots, t_n\}$.
S106: Determine the semantic weight of each target character sub-string.
As in step S103, the semantic weight table is searched for each target character sub-string and the corresponding semantic weight is looked up, giving the weighted target character sub-strings $Tw = \{(t_1, tw_1), (t_2, tw_2), (t_3, tw_3), \ldots, (t_n, tw_n)\}$.
S107: Determine the semantic editing distance between the character string to be identified and the target character string according to the semantic weights of the character sub-strings to be identified and of the target character sub-strings.
In this step, the semantic editing distance is calculated according to the following formulas:
when $i = 0$ and $j = 0$, $\mathrm{edit}(0, 0) = 0$;
when $i = 0$ and $j > 0$, $\mathrm{edit}(0, j) = \mathrm{edit}(0, j-1) + tw_j$;
when $i > 0$ and $j = 0$, $\mathrm{edit}(i, 0) = \mathrm{edit}(i-1, 0) + sw_i$;
when $i > 0$ and $j > 0$, $\mathrm{edit}(i, j) = \min\bigl(\mathrm{edit}(i-1, j) + sw_i,\ \mathrm{edit}(i, j-1) + tw_j,\ \mathrm{edit}(i-1, j-1) + f(i, j)\bigr)$;
where $i$ is the number of character sub-strings to be identified and $j$ is the number of target character sub-strings; $tw_j$ is the semantic weight of the target character sub-string $t_j$, and $sw_i$ is the semantic weight of the character sub-string to be identified $s_i$; $\mathrm{edit}(i, j)$ is the semantic editing distance from the set of character sub-strings to be identified $(s_1, s_2, s_3, \ldots, s_i)$ to the set of target character sub-strings $(t_1, t_2, t_3, \ldots, t_j)$. When $i$ and $j$ equal the total numbers of sub-strings contained in the character string to be identified S and the target character string T respectively, $\mathrm{edit}(i, j)$ equals the semantic editing distance $\mathrm{edit}(S, T)$ between S and T. $f(i, j)$ is the semantic editing distance incurred by converting the i-th character sub-string to be identified $s_i$ into the j-th target character sub-string $t_j$: when $s_i = t_j$, $f(i, j) = 0$; when $s_i \neq t_j$, $f(i, j) = \max(sw_i, tw_j)$.
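The recurrence above can be sketched in Python as follows. This is a minimal illustration rather than the implementation claimed in the application: the function name `semantic_edit_distance` and the representation of weighted sub-strings as `(sub_string, weight)` tuples are assumptions made here. The sample data are the weighted sub-strings used in Example 1 of this embodiment.

```python
def semantic_edit_distance(
    sw: list[tuple[str, float]],
    tw: list[tuple[str, float]],
) -> float:
    """Weighted edit distance over sub-strings, following the edit(i, j) recurrence."""
    m, n = len(sw), len(tw)
    edit = [[0.0] * (n + 1) for _ in range(m + 1)]

    # Boundary cases: delete every s_i, or insert every t_j.
    for i in range(1, m + 1):
        edit[i][0] = edit[i - 1][0] + sw[i - 1][1]
    for j in range(1, n + 1):
        edit[0][j] = edit[0][j - 1] + tw[j - 1][1]

    for i in range(1, m + 1):
        s_i, sw_i = sw[i - 1]
        for j in range(1, n + 1):
            t_j, tw_j = tw[j - 1]
            # f(i, j): zero for identical sub-strings, otherwise the larger weight.
            f = 0.0 if s_i == t_j else max(sw_i, tw_j)
            edit[i][j] = min(
                edit[i - 1][j] + sw_i,   # delete s_i
                edit[i][j - 1] + tw_j,   # insert t_j
                edit[i - 1][j - 1] + f,  # convert s_i into t_j
            )
    return edit[m][n]


sw = [("ABC", 0.98), ("information", 0.02), ("technology", 0.02),
      ("limited", 0.01), ("company", 0.01)]
tw = [("XYZ", 0.99), ("information", 0.02), ("technology", 0.02),
      ("limited", 0.01), ("company", 0.01)]
print(f"{semantic_edit_distance(sw, tw):.2f}")  # 0.99
```

Only the high-weight sub-strings "ABC" and "XYZ" differ, so almost the entire distance comes from converting one into the other; the four shared low-weight sub-strings contribute nothing.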
S108: Determine the similarity between the character string to be identified and the target character string according to the semantic editing distance.
In this step, the similarity between the character string to be identified and the target character string is calculated from the semantic editing distance obtained in step S107. The similarity calculation formula is:
$\mathrm{similarity}(S, T) = 1 - \dfrac{\mathrm{edit}(S, T)}{\max(\mathrm{length}(S), \mathrm{length}(T))}$
where $\mathrm{edit}(S, T)$ is the semantic editing distance between the character string to be identified S and the target character string T, $\mathrm{length}(S)$ is the sum of the semantic weights of all character sub-strings to be identified in S, and $\mathrm{length}(T)$ is the sum of the semantic weights of all target character sub-strings in T.
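As a small illustration of this formula, the Python sketch below takes an already computed semantic editing distance together with the two lists of semantic weights. The function name is an assumption, as is the convention that two strings with zero total weight count as identical. The numbers are the distance and weights used in Example 2 of this embodiment.

```python
def semantic_similarity(distance: float, s_weights: list[float], t_weights: list[float]) -> float:
    """1 - edit(S, T) / max(length(S), length(T)), where length() is a sum of weights."""
    longest = max(sum(s_weights), sum(t_weights))
    if longest == 0:  # assumption: nothing to compare, treat as identical
        return 1.0
    return 1.0 - distance / longest


# Semantic editing distance and weights taken from Example 2.
similarity = semantic_similarity(
    0.05,
    [0.98, 0.01],                    # ABC, company
    [0.98, 0.02, 0.02, 0.01, 0.01],  # ABC, information, technology, limited, company
)
print(f"{similarity:.2%}")  # 95.19%
```

Because the lengths are sums of semantic weights rather than character counts, missing low-weight sub-strings such as "limited" barely reduce the similarity.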
S109: Perform subsequent processing on the character string to be identified according to the similarity.
Depending on the application scenario, this step uses the similarity result to do one or more of the following: classify the character string to be identified, replace it with a target character string that satisfies the similarity condition, set it as a blacklist, and so on.
Example 1: The acquired character string to be identified S is "ABC information technology limited company". Word segmentation of S gives the character sub-strings S = {ABC, information, technology, limited, company}, i = 5. A target character string T is searched for in the target character string database according to the character sub-strings to be identified; suppose one of the target character strings found is "XYZ information technology limited company". Word segmentation of T gives the target character sub-strings T = {XYZ, information, technology, limited, company}, j = 5. The weights of these character sub-strings in the semantic weight table $W_n$ are shown in Table 1 below:
Character sub-string   Semantic weight
ABC                    0.98
XYZ                    0.99
information            0.02
technology             0.02
limited                0.01
company                0.01
TABLE 1
The weighted character sub-strings to be identified are therefore Sw = {(ABC, 0.98), (information, 0.02), (technology, 0.02), (limited, 0.01), (company, 0.01)}, and the weighted target character sub-strings are Tw = {(XYZ, 0.99), (information, 0.02), (technology, 0.02), (limited, 0.01), (company, 0.01)}.
The edit distance with semantic weight between the character sub-string to be identified and the target character sub-string is calculated according to the edit distance formula, which is called semantic edit distance herein. The two-dimensional matrix with semantic weights between the character sub-string Sw to be recognized and the target character sub-string Tw is established as shown in the following table 2, and the calculation is similar to the existing edit distance algorithm, and is not described in detail, except that the number of operation steps is converted into semantic weights to perform the calculation, and the semantic edit distance between the character sub-string Sw to be recognized and the target character sub-string Tw can be obtained according to the two-dimensional matrix table to be 0.99.
[Image of the two-dimensional matrix between Sw and Tw not reproduced]
TABLE 2
Then, according to the similarity calculation formula:
$\mathrm{similarity}(S, T) = 1 - \dfrac{\mathrm{edit}(S, T)}{\max(\mathrm{length}(S), \mathrm{length}(T))}$
the similarity between the character sub-strings to be identified S = {ABC, information, technology, limited, company} and the target character sub-strings T = {XYZ, information, technology, limited, company} is calculated as: 1 - 0.99 / max((0.98 + 0.02 + 0.02 + 0.01 + 0.01), (0.99 + 0.02 + 0.02 + 0.01 + 0.01)) = 5.71%. The similarity between the character string to be identified S and the target character string T is therefore small.
Example 2: The acquired character string to be identified S is "ABC company". Word segmentation of S gives the character sub-strings S = {ABC, company}, i = 2. A target character string T is searched for in the target character string database according to the character sub-strings to be identified; suppose one of the target character strings found is "ABC information technology limited company". Word segmentation of T gives the target character sub-strings T = {ABC, information, technology, limited, company}, j = 5. The weights of these character sub-strings in the semantic weight table $W_n$ are shown in Table 3 below:
Character sub-string   Semantic weight
ABC                    0.98
information            0.02
technology             0.02
limited                0.01
company                0.01
TABLE 3
The weighted character sub-strings to be identified are therefore Sw = {(ABC, 0.98), (company, 0.01)}, and the weighted target character sub-strings are Tw = {(ABC, 0.98), (information, 0.02), (technology, 0.02), (limited, 0.01), (company, 0.01)}.
The edit distance with semantic weight between the character sub-string to be identified and the target character sub-string is calculated according to the edit distance formula, which is called semantic edit distance herein. The two-dimensional matrix with semantic weights between the character sub-string Sw to be recognized and the target character sub-string Tw is established as shown in the following table 4, where the calculation is similar to the existing edit distance algorithm, and details are not repeated, and the difference is that the number of operation steps is converted into semantic weights to perform the calculation, and the semantic edit distance between the character sub-string Sw to be recognized and the target character sub-string Tw can be obtained according to the two-dimensional matrix table to be 0.05.
              0      ABC    company
0             0      0.98   0.99
ABC           0.98   0      0.01
information   1.00   0.02   0.03
technology    1.02   0.04   0.05
limited       1.03   0.05   0.06
company       1.04   0.06   0.05
TABLE 4
Then, according to the similarity calculation formula:
$\mathrm{similarity}(S, T) = 1 - \dfrac{\mathrm{edit}(S, T)}{\max(\mathrm{length}(S), \mathrm{length}(T))}$
the similarity between the character sub-strings to be identified S = {ABC, company} and the target character sub-strings T = {ABC, information, technology, limited, company} is calculated as: 1 - 0.05 / max((0.98 + 0.01), (0.98 + 0.02 + 0.02 + 0.01 + 0.01)) = 95.19%. The similarity between the character string to be identified S and the target character string T is therefore very high, and the character string to be identified can undergo subsequent processing, such as being classified into the same class as the target character string, being replaced directly by the target character string, or being set as a blacklist.
Embodiment 2
Based on the same idea as the character string processing method provided above, Embodiment 2 of the present application further provides a corresponding character string processing device, as shown in FIG. 2.
Fig. 2 is a schematic structural diagram of a character string processing device according to a second embodiment, which specifically includes:
an acquisition unit 201, configured to acquire a character string to be recognized;
a searching unit 202, configured to search for a target character string according to the character string to be identified;
the word segmentation unit 203 is configured to segment the character string to be identified and the target character string respectively to obtain each character sub-string to be identified and each target character sub-string;
a semantic weight determining unit 204, configured to determine semantic weights of each character sub-string to be identified and each target character sub-string;
a semantic editing distance determining unit 205, configured to determine a semantic editing distance between the character string to be recognized and the target character string according to semantic weights of the character strings to be recognized and the target character strings;
a similarity determining unit 206, configured to determine a similarity between the character string to be identified and the target character string according to the semantic editing distance;
and a subsequent processing unit 207, configured to perform subsequent processing on the character string to be identified according to the similarity.
The character string S to be recognized acquired by the acquisition unit 201 includes one or more of a company name, an address, a commodity name, a blacklist, a problem name, or a description input by a user.
For example, a user needs to enter a shipping address on a service website, a service provider needs to enter commodity names, and some users may need to set up blacklists. Such data may contain character strings that express the same meaning in different ways, and the amount of data the service website must store keeps growing, so the system needs to identify the data entered by users in order to facilitate subsequent operations such as classifying, adding and replacing.
The word segmentation unit 203 segments the acquired character string to be identified S according to semantic units to obtain the character sub-strings with semantics, $S = \{s_1, s_2, s_3, \ldots, s_i\}$. The device uses a grammar analysis unit to perform the word segmentation.
In the semantic weight determining unit 204 or the local database, there is a semantic weight table Wn, which is obtained by performing calculation in advance according to samples stored in the database, and the calculation method includes:
Extract a number of character string samples; the samples may be, for example, more than 10000 lines of lists or addresses of the same class. De-duplicate the extracted samples, that is, remove identical character strings so that no sample is counted twice. Segment the extracted samples, using the same word segmentation processing as in step S102, to obtain sample substrings that are semantic units. Finally, apply the inverse document frequency (IDF) measure of the general importance of a word, taken from term frequency-inverse document frequency (TF-IDF):
$\mathrm{idf}_i = \log \dfrac{|D|}{|\{j : t_i \in d_j\}|}$
to calculate the semantic weight of each sample substring, $W_n = \{(w_1, \mathrm{idf}_1), (w_2, \mathrm{idf}_2), (w_3, \mathrm{idf}_3), \ldots, (w_n, \mathrm{idf}_n)\}$, where $|D|$ is the total number of sample strings and $|\{j : t_i \in d_j\}|$ is the number of sample strings that contain the sample substring $t_i$. If a substring does not occur in the samples, the denominator is zero, so $1 + |\{j : t_i \in d_j\}|$ is typically used instead. If the weight set for a class of samples is generally applicable, the set is stored under a class name, such as "W(companyName)" or "W(address)", so that the corresponding weight set can be called directly the next time the same scenario arises.
The semantic weight determining unit 204 searches the semantic weight table for each character sub-string to be identified and looks up the corresponding semantic weight, giving the weighted character sub-strings to be identified $Sw = \{(s_1, sw_1), (s_2, sw_2), (s_3, sw_3), \ldots, (s_m, sw_m)\}$.
The searching unit 202 searches for the target character string according to the character sub-strings to be identified in the character strings to be identified. Selecting character substrings to be recognized with semantic weights larger than a set threshold value from the character strings to be recognized; and then searching a target character string database by adopting the selected character sub-string to be identified, and finding out the target character string T.
The target string T is one or more of the correct company name, address, commodity name, blacklist, problem name or description stored in the local database.
The character sub-strings to be identified with the semantic weight larger than a set threshold value refer to one or more character sub-strings to be identified with the semantic weight larger than a certain threshold value, the target character strings can be one or more, and each target character string comprises the selected character sub-string to be identified.
After the target character string T is obtained, the word segmentation unit 203 segments the target character string according to semantic units to obtain the target character sub-strings $T = \{t_1, t_2, t_3, \ldots, t_n\}$. The semantic weight determining unit 204 then searches the semantic weight table for each target character sub-string and looks up the corresponding semantic weight, giving the weighted target character sub-strings $Tw = \{(t_1, tw_1), (t_2, tw_2), (t_3, tw_3), \ldots, (t_n, tw_n)\}$.
The semantic editing distance determining unit 205 determines the semantic editing distance between the character string to be recognized and the target character string according to the semantic weights of the character sub-strings to be recognized and of the target character sub-strings, including:
calculating the semantic editing distance according to the following formula:
when $i = 0$ and $j = 0$, $\mathrm{edit}(0, 0) = 0$;
when $i = 0$ and $j > 0$, $\mathrm{edit}(0, j) = \mathrm{edit}(0, j-1) + tw_j$;
when $i > 0$ and $j = 0$, $\mathrm{edit}(i, 0) = \mathrm{edit}(i-1, 0) + sw_i$;
when $i > 0$ and $j > 0$, $\mathrm{edit}(i, j) = \min(\mathrm{edit}(i-1, j) + sw_i,\ \mathrm{edit}(i, j-1) + tw_j,\ \mathrm{edit}(i-1, j-1) + f(i, j))$;
wherein $i$ represents the number of character sub-strings to be recognized, and $j$ represents the number of target character sub-strings; $tw_j$ represents the semantic weight of the target character sub-string $t_j$, and $sw_i$ represents the semantic weight of the character sub-string to be recognized $s_i$; $\mathrm{edit}(i, j)$ represents the semantic editing distance from the subset of character sub-strings to be recognized $(s_1, s_2, s_3, \dots, s_i)$ to the subset of target character sub-strings $(t_1, t_2, t_3, \dots, t_j)$; when $i$ and $j$ are the numbers of all sub-strings contained in the character string to be recognized S and the target character string T respectively, $\mathrm{edit}(i, j)$ is equal to the semantic editing distance $\mathrm{edit}(S, T)$ between the character string to be recognized S and the target character string T; $f(i, j)$ represents the semantic editing distance produced by converting the $i$-th character sub-string to be recognized $s_i$ into the $j$-th target character sub-string $t_j$: when $s_i = t_j$, $f(i, j) = 0$; when $s_i \neq t_j$, $f(i, j) = \max(sw_i, tw_j)$.
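The recurrence above is a weighted variant of the classic edit-distance dynamic program, and can be sketched as below. The function name `semantic_edit_distance`, the representation of each string as a list of (sub-string, weight) pairs, and the sample weights are choices made for this illustration only.

```python
from typing import List, Tuple

Weighted = List[Tuple[str, float]]  # (sub-string, semantic weight) pairs


def semantic_edit_distance(s: Weighted, t: Weighted) -> float:
    """Fill the edit(i, j) table from the recurrence and return edit(m, n)."""
    m, n = len(s), len(t)
    edit = [[0.0] * (n + 1) for _ in range(m + 1)]  # edit[0][0] = 0

    # First column: delete the first i sub-strings of s, paying sw_i each.
    for i in range(1, m + 1):
        edit[i][0] = edit[i - 1][0] + s[i - 1][1]
    # First row: insert the first j sub-strings of t, paying tw_j each.
    for j in range(1, n + 1):
        edit[0][j] = edit[0][j - 1] + t[j - 1][1]

    for i in range(1, m + 1):
        s_i, sw_i = s[i - 1]
        for j in range(1, n + 1):
            t_j, tw_j = t[j - 1]
            # f(i, j): free when the sub-strings match, else the larger weight.
            f_ij = 0.0 if s_i == t_j else max(sw_i, tw_j)
            edit[i][j] = min(
                edit[i - 1][j] + sw_i,      # delete s_i
                edit[i][j - 1] + tw_j,      # insert t_j
                edit[i - 1][j - 1] + f_ij,  # keep or substitute
            )
    return edit[m][n]


# Hypothetical weighted sub-strings for two company-name-like strings.
s = [("acme", 3.0), ("trading", 1.0), ("ltd", 0.5)]
t = [("acme", 3.0), ("logistics", 2.0), ("ltd", 0.5)]
distance = semantic_edit_distance(s, t)  # substituting "trading" costs max(1.0, 2.0)
```

In this example the cheapest path substitutes "trading" with "logistics" at cost 2.0, which beats deleting one and inserting the other at cost 3.0.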
The similarity determining unit 206 determines the similarity between the character string to be recognized and the target character string according to the semantic editing distance between the character string to be recognized S and the target character string T obtained by the semantic editing distance determining unit 205.
The similarity calculation formula is:
Figure SMS_9
wherein $\mathrm{edit}(S, T)$ represents the semantic editing distance between the character string to be recognized S and the target character string T, $\mathrm{length}(S)$ represents the sum of the semantic weights of all the character sub-strings to be recognized in the character string to be recognized S, and $\mathrm{length}(T)$ represents the sum of the semantic weights of all the target character sub-strings in the target character string T.
The subsequent processing unit 207 performs subsequent processing on the character string to be recognized according to the similarity between the character string to be recognized S and the target character string T determined by the similarity determining unit 206.
The subsequent processing unit 207 performs different processing functions in different application scenarios. For example, according to the similarity result, it may perform one or more of the following: classifying the character string to be recognized, replacing it with a target character string that meets the similarity condition, adding it to a blacklist, and the like.
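One of the subsequent processing options, replacing the input with a sufficiently similar target string, might look like the sketch below. The function name `best_replacement`, the use of a fixed minimum similarity as the "similarity condition", and the sample scores are hypothetical; the similarity values are taken as given inputs rather than computed here.

```python
from typing import List, Optional, Tuple


def best_replacement(
    candidates: List[Tuple[str, float]], min_similarity: float
) -> Optional[str]:
    """Return the most similar target string if it meets the threshold.

    `candidates` pairs each target string with its similarity to the input.
    ASSUMPTION: the similarity condition is a simple lower bound.
    """
    if not candidates:
        return None
    target, score = max(candidates, key=lambda pair: pair[1])
    return target if score >= min_similarity else None


# Hypothetical similarity scores for two retrieved target strings.
scored = [("acme trading ltd", 0.9), ("acme logistics ltd", 0.6)]
replacement = best_replacement(scored, min_similarity=0.8)
```

Returning `None` when no candidate qualifies leaves the caller free to fall back to another action, such as classification or manual review.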
Examples of the present apparatus are the same as examples 1 and 2 in the first embodiment.
It should be noted that, the execution subjects of the steps of the method provided in the first embodiment may be the same device, or the method may also be executed by different devices.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of storage media for a computer include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.

Claims (18)

1. A character string processing method, comprising:
word segmentation is carried out on the first character strings according to semantics, and each first character sub-string is obtained;
word segmentation is carried out on the second character strings according to semantics, and each second character sub-string is obtained; the second character sub-strings are one or more and each second character sub-string comprises a selected first character sub-string; the selected first character substring is a first character substring with semantic weight larger than a preset threshold value;
determining the semantic editing distance between the first character string and the second character string according to the semantic weights of the first character sub-strings and the second character sub-strings;
and determining the similarity between the first character string and the second character string according to the semantic editing distance.
2. The method of claim 1, further comprising, prior to semantically segmenting the second string:
and retrieving the second character string according to each first character sub-string.
3. The method of claim 2, wherein retrieving the second character string according to each first character sub-string comprises:
selecting a first character substring with a semantic weight greater than a set threshold;
and searching a character string database by using the selected first character sub-string to find out the second character string.
4. The method of claim 1, wherein the word segmentation of the first string according to semantics comprises:
word segmentation is carried out on the first character string according to semantic units;
word segmentation is carried out on the second character string according to semantics, and the method specifically comprises the following steps:
and word segmentation is carried out on the second character string according to the semantic unit.
5. The method of claim 1, further comprising, prior to determining the semantic editing distance according to the respective semantic weights of the first character sub-strings and the second character sub-strings:
searching a semantic weight table according to the first character substrings and the second character substrings;
and finding out the semantic weights corresponding to the first character sub-strings and the second character sub-strings from the semantic weight table.
6. The method of claim 5, wherein the semantic weight table is calculated from samples stored in a database, comprising:
pre-extracting a certain number of non-repeated character string samples;
segmenting the certain number of character string samples to obtain a plurality of sample sub-strings, each being a semantic unit;
and calculating the semantic weight of each sample sub-string according to the inverse document frequency (IDF) formula, which is a measure of the general importance of a word.
7. The method according to any one of claims 1-6, wherein determining the semantic editing distance between the first character string and the second character string according to the respective semantic weights of the first character sub-string and the second character sub-string specifically comprises:
calculating a semantic editing distance according to the following formula:
when $i = 0$ and $j = 0$, $\mathrm{edit}(0, 0) = 0$;
when $i = 0$ and $j > 0$, $\mathrm{edit}(0, j) = \mathrm{edit}(0, j-1) + tw_j$;
when $i > 0$ and $j = 0$, $\mathrm{edit}(i, 0) = \mathrm{edit}(i-1, 0) + sw_i$;
when $i > 0$ and $j > 0$, $\mathrm{edit}(i, j) = \min(\mathrm{edit}(i-1, j) + sw_i,\ \mathrm{edit}(i, j-1) + tw_j,\ \mathrm{edit}(i-1, j-1) + f(i, j))$;
wherein $i$ represents the number of first character sub-strings and $j$ represents the number of second character sub-strings; $tw_j$ represents the semantic weight of the second character sub-string $t_j$, and $sw_i$ represents the semantic weight of the first character sub-string $s_i$; $\mathrm{edit}(i, j)$ represents the semantic editing distance from the subset of first character sub-strings $(s_1, s_2, s_3, \dots, s_i)$ to the subset of second character sub-strings $(t_1, t_2, t_3, \dots, t_j)$; when $i$ and $j$ are the numbers of all sub-strings contained in the first character string S and the second character string T, respectively, $\mathrm{edit}(i, j)$ is equal to the semantic editing distance $\mathrm{edit}(S, T)$ between the first character string S and the second character string T; $f(i, j)$ represents the semantic editing distance produced by converting the $i$-th first character sub-string $s_i$ into the $j$-th second character sub-string $t_j$: when $s_i = t_j$, $f(i, j) = 0$; when $s_i \neq t_j$, $f(i, j) = \max(sw_i, tw_j)$.
8. The method of claim 7, determining the similarity between the first string and the second string according to the semantic editing distance, specifically comprising:
calculating the similarity between the first character string S and the second character string T according to a character string similarity calculation formula;
the similarity calculation formula is:
Figure QLYQS_1
wherein the edit (S, T) represents a semantic editing distance between the first string S and the second string T, length (S) represents a sum of semantic weights of all the first character sub-strings in the first string S, and length (T) represents a sum of semantic weights of all the second character sub-strings in the second string T.
9. The method of claim 8, wherein the subsequent processing of the first string specifically includes:
one or more of: classifying the first character string, replacing the first character string with a second character string meeting the similarity condition, and adding the first character string to a blacklist.
10. A character string processing apparatus, comprising:
the word segmentation unit is used for segmenting the first character string and the second character string according to semantics respectively to obtain first character sub-strings and second character sub-strings; the second character sub-strings are one or more and each second character sub-string comprises a selected first character sub-string; the selected first character substring is a first character substring with semantic weight larger than a preset threshold value;
an edit distance determining unit for determining a semantic edit distance between the first character string and the second character string according to the semantic weights of the first character sub-strings and the second character sub-strings;
and the similarity determining unit is used for determining the similarity between the first character string and the second character string according to the semantic editing distance.
11. The apparatus of claim 10, further comprising, prior to the word segmentation of the second string according to semantics:
and retrieving the second character string according to each first character sub-string.
12. The apparatus of claim 11, wherein retrieving the second character string according to each first character sub-string specifically comprises:
selecting a first character substring with a semantic weight greater than a set threshold;
and searching a character string database by using the selected first character sub-string to find out the second character string.
13. The apparatus of claim 10, wherein the first character string and the second character string are each segmented according to semantics, and specifically comprising:
word segmentation is carried out on the first character string according to semantic units;
and word segmentation is carried out on the second character string according to the semantic unit.
14. The apparatus of claim 10, further comprising, prior to determining the semantic editing distance according to the respective semantic weights of the first character sub-strings and the second character sub-strings:
searching a semantic weight table according to the first character substrings and the second character substrings;
and finding out the semantic weights corresponding to the first character sub-strings and the second character sub-strings from the semantic weight table.
15. The apparatus of claim 14, the semantic weight table being calculated in advance from samples stored in a database, comprising:
pre-extracting a certain number of non-repeated character string samples;
segmenting the certain number of character string samples to obtain a plurality of sample sub-strings, each being a semantic unit;
and calculating the semantic weight of each sample sub-string according to the inverse document frequency (IDF) formula, which is a measure of the general importance of a word.
16. The apparatus of any of claims 10-15, determining a semantic editing distance between the first character string and the second character string according to respective semantic weights of the first character sub-string and the second character sub-string, specifically comprising:
calculating a semantic editing distance according to the following formula:
when $i = 0$ and $j = 0$, $\mathrm{edit}(0, 0) = 0$;
when $i = 0$ and $j > 0$, $\mathrm{edit}(0, j) = \mathrm{edit}(0, j-1) + tw_j$;
when $i > 0$ and $j = 0$, $\mathrm{edit}(i, 0) = \mathrm{edit}(i-1, 0) + sw_i$;
when $i > 0$ and $j > 0$, $\mathrm{edit}(i, j) = \min(\mathrm{edit}(i-1, j) + sw_i,\ \mathrm{edit}(i, j-1) + tw_j,\ \mathrm{edit}(i-1, j-1) + f(i, j))$;
wherein $i$ represents the number of first character sub-strings and $j$ represents the number of second character sub-strings; $tw_j$ represents the semantic weight of the second character sub-string $t_j$, and $sw_i$ represents the semantic weight of the first character sub-string $s_i$; $\mathrm{edit}(i, j)$ represents the semantic editing distance from the subset of first character sub-strings $(s_1, s_2, s_3, \dots, s_i)$ to the subset of second character sub-strings $(t_1, t_2, t_3, \dots, t_j)$; when $i$ and $j$ are the numbers of all sub-strings contained in the first character string S and the second character string T, respectively, $\mathrm{edit}(i, j)$ is equal to the semantic editing distance $\mathrm{edit}(S, T)$ between the first character string S and the second character string T; $f(i, j)$ represents the semantic editing distance produced by converting the $i$-th first character sub-string $s_i$ into the $j$-th second character sub-string $t_j$: when $s_i = t_j$, $f(i, j) = 0$; when $s_i \neq t_j$, $f(i, j) = \max(sw_i, tw_j)$.
17. The apparatus of claim 16, determining a similarity between the first string and the second string according to the semantic editing distance, specifically comprising:
calculating the similarity between the first character string S and the second character string T according to a character string similarity calculation formula;
the similarity calculation formula is:
Figure QLYQS_2
wherein the edit (S, T) represents a semantic editing distance between the first string S and the second string T, length (S) represents a sum of semantic weights of all the first character sub-strings in the first string S, and length (T) represents a sum of semantic weights of all the second character sub-strings in the second string T.
18. The apparatus of claim 17, wherein the subsequent processing of the first string specifically comprises:
one or more of: classifying the first character string, replacing the first character string with a second character string meeting the similarity condition, and adding the first character string to a blacklist.
CN202010065546.0A 2015-03-09 2015-03-09 Character string processing method and device Active CN111324784B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010065546.0A CN111324784B (en) 2015-03-09 2015-03-09 Character string processing method and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010065546.0A CN111324784B (en) 2015-03-09 2015-03-09 Character string processing method and device
CN201510103200.4A CN106033416B (en) 2015-03-09 2015-03-09 Character string processing method and device

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN201510103200.4A Division CN106033416B (en) 2015-03-09 2015-03-09 Character string processing method and device

Publications (2)

Publication Number Publication Date
CN111324784A CN111324784A (en) 2020-06-23
CN111324784B true CN111324784B (en) 2023-05-16

Family

ID=57149686

Family Applications (2)

Application Number Title Priority Date Filing Date
CN201510103200.4A Active CN106033416B (en) 2015-03-09 2015-03-09 Character string processing method and device
CN202010065546.0A Active CN111324784B (en) 2015-03-09 2015-03-09 Character string processing method and device

Family Applications Before (1)

Application Number Title Priority Date Filing Date
CN201510103200.4A Active CN106033416B (en) 2015-03-09 2015-03-09 Character string processing method and device

Country Status (1)

Country Link
CN (2) CN106033416B (en)

Also Published As

Publication number Publication date
CN111324784A (en) 2020-06-23
CN106033416A (en) 2016-10-19
CN106033416B (en) 2019-12-24

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201012

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

TA01 Transfer of patent application right
GR01 Patent grant