The present application is a divisional application of the patent application No. 201510103200.4, filed on March 9, 2015 and entitled "A character string processing method and apparatus".
Detailed Description
To make the purposes, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art from the present disclosure without creative effort fall within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Embodiment 1
Fig. 1 is a flowchart of a character string processing method according to an embodiment of the present application. In the method, a character string is decomposed into character substrings carrying semantic weights, and similarity is calculated after the semantic edit distance between character strings is determined from those weights. This effectively improves similarity measurement between character strings at the level of semantic units, and facilitates subsequent processing of the character strings such as classification and identification. The method comprises the following steps:
S101: acquiring the character string to be recognized.
The acquired character string S to be recognized includes one or more of a company name, an address, a commodity name, a blacklist entry, or a problem name or description input by a user.
For example, a user may need to enter a shipping address at a service website, a service provider may need to enter commodity names, and some users may need to set blacklists. In all such data, character strings expressing the same meaning may appear in different forms, and the amount of data the service website must store keeps growing. The system therefore needs to identify the data input by users so as to facilitate subsequent operations such as classification, addition and replacement.
S102: segmenting the character string to be recognized to obtain character substrings to be recognized.
The character string S to be recognized is segmented into words according to semantic units, yielding the semantic character substrings S = {s1, s2, s3, …, si}. This step uses a grammar analysis unit to perform the word segmentation.
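The grammar analysis unit is not specified further; as a minimal sketch of step S102, a greedy longest-match segmenter against a hypothetical vocabulary can stand in for it:

```python
# Minimal sketch of step S102, assuming a dictionary-based longest-match
# segmenter stands in for the unspecified grammar analysis unit.
def segment(text, vocabulary):
    """Split `text` into semantic substrings by greedy longest match."""
    substrings = []
    i = 0
    while i < len(text):
        # Try the longest dictionary match first; fall back to one character.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocabulary:
                substrings.append(piece)
                i += length
                break
    return substrings

# Hypothetical vocabulary of semantic units.
vocab = {"ABC", "Information", "Technology", "Limited", "Company"}
```

With this vocabulary, `segment("ABCInformationTechnologyLimitedCompany", vocab)` yields the five semantic units used in the examples below.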
S103: determining the semantic weight of each character substring to be recognized.
First, a semantic weight table Wn is kept in a local database. The table is computed in advance from samples stored in the database, as follows:

A number of character string samples are extracted; the samples may be, for example, more than 10,000 lines of name lists or addresses of the same class. A de-duplication operation is performed on the extracted samples, i.e., identical strings among them are removed so that no sample is duplicated. The extracted samples are then segmented into words to obtain sample substrings, each being a semantic unit; the segmentation is the same as that of step S102. Finally, following the term frequency-inverse document frequency (TF-IDF) measure of a word's general importance, the semantic weight of each sample substring is calculated as

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

giving Wn = {(w1, idf1), (w2, idf2), (w3, idf3), …, (wn, idfn)}, where |D| is the total number of sample strings and |{j : t_i ∈ d_j}| is the number of samples containing the sample substring t_i. If a sample substring does not occur in any sample, the denominator would be zero, so 1 + |{j : t_i ∈ d_j}| is typically used instead. If a weight set for a class of sample substrings is generally applicable, it is stored under a class name, such as "W(companyName)" or "W(address)", so that the corresponding weight set can be called directly the next time the same scenario arises.
The semantic weight table is first searched according to each character substring to be recognized; the semantic weight corresponding to each substring is then read from the table, giving the weighted character substrings Sw = {(s1, sw1), (s2, sw2), (s3, sw3), …, (sm, swm)}.
S104: searching for a target character string according to the character substrings to be recognized.
The target string T is one or more of the correct company name, address, commodity name, blacklist, problem name or description stored in the local database.
A character substring to be recognized whose semantic weight is greater than a set threshold is selected from the character string to be recognized; the target string database is then searched with the selected substring to find the target character string.
Here, "character substrings to be recognized whose semantic weight is greater than a set threshold" means one or more such substrings. There may likewise be one or more target character strings, each of which contains the selected character substring to be recognized.
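The threshold-based lookup of step S104 can be sketched as below; the in-memory list, the threshold value 0.5, and the substring-containment test are illustrative assumptions standing in for the target-string database:

```python
# Sketch of step S104: keep only substrings whose semantic weight exceeds
# the threshold, then return every stored string containing one of them.
# The in-memory list stands in for the target-string database.
def find_targets(weighted_substrings, database, threshold):
    keys = [s for s, w in weighted_substrings if w > threshold]
    return [t for t in database if any(k in t for k in keys)]

db = ["ABC Information Technology Limited Company",
      "XYZ Information Technology Limited Company"]
sw = [("ABC", 0.98), ("Company", 0.01)]
# Only "ABC" (weight 0.98 > 0.5) is used as a search key, so low-weight
# substrings like "Company" do not pull in unrelated targets.
```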
S105: segmenting the target character string to obtain target character substrings.
The selected target character strings are segmented one by one; the segmentation in this step is the same as in step S102, and yields the target character substrings T = {t1, t2, t3, …, tn}.
S106: determining the semantic weight of each target character substring.
As in step S103, the semantic weight table is searched according to each target character substring; the weight corresponding to each target character substring is then read from the table, giving the weighted target character substrings Tw = {(t1, tw1), (t2, tw2), (t3, tw3), …, (tn, twn)}.
S107: determining the semantic edit distance between the character string to be recognized and the target character string according to their semantic weights.
In this step the semantic edit distance is calculated by the following recurrence:

when i = 0 and j = 0, edit(0, 0) = 0;
when i = 0 and j > 0, edit(0, j) = edit(0, j-1) + tw_j;
when i > 0 and j = 0, edit(i, 0) = edit(i-1, 0) + sw_i;
when i > 0 and j > 0, edit(i, j) = min( edit(i-1, j) + sw_i, edit(i, j-1) + tw_j, edit(i-1, j-1) + f(i, j) );

where i is the index of a character substring to be recognized and j is the index of a target character substring; tw_j is the semantic weight of the target character substring t_j, and sw_i is the semantic weight of the character substring s_i to be recognized. edit(i, j) denotes the semantic edit distance from the substring subset (s1, s2, s3, …, si) to the target substring subset (t1, t2, t3, …, tj); when i and j equal the total numbers of substrings contained in the character string S to be recognized and in the target character string T respectively, edit(i, j) equals the semantic edit distance edit(S, T) between S and T. f(i, j) is the semantic edit distance incurred by converting the i-th character substring s_i to be recognized into the j-th target character substring t_j: when s_i = t_j, f(i, j) = 0; when s_i ≠ t_j, f(i, j) = max(sw_i, tw_j).
S108: determining the similarity between the character string to be recognized and the target character string according to the semantic edit distance.
This step calculates the similarity between the character string to be recognized and the target character string from the semantic edit distance obtained in step S107. The similarity calculation formula is:

similarity(S, T) = 1 − edit(S, T) / max( length(S), length(T) )

where edit(S, T) is the semantic edit distance between the character string S to be recognized and the target character string T, length(S) is the sum of the semantic weights of all character substrings to be recognized in S, and length(T) is the sum of the semantic weights of all target character substrings in T.
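The normalisation of step S108 can be sketched directly from the formula; the weights below reuse those of Example 1 further on and are illustrative:

```python
# Sketch of step S108: similarity is one minus the semantic edit distance,
# normalised by the larger of the two weight sums length(S) and length(T).
def similarity(edit_distance, sw, tw):
    length_s = sum(w for _, w in sw)
    length_t = sum(w for _, w in tw)
    return 1 - edit_distance / max(length_s, length_t)

sw = [("ABC", 0.98), ("Information", 0.02), ("Technology", 0.02),
      ("Limited", 0.01), ("Company", 0.01)]
tw = [("XYZ", 0.99), ("Information", 0.02), ("Technology", 0.02),
      ("Limited", 0.01), ("Company", 0.01)]
```

With a semantic edit distance of 0.99 this gives 1 − 0.99/1.05 ≈ 5.71%, i.e. two company names that differ only in their distinctive part are rated dissimilar despite sharing four of five substrings.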
S109: performing subsequent processing on the character string to be recognized according to the similarity.
Depending on the application scenario, this step uses the similarity result to perform one or more of: classifying the character string to be recognized, replacing it with a target character string that satisfies the similarity condition, setting the target character string as a blacklist entry, and the like.
Example 1: suppose the acquired character string S to be recognized is "ABC Information Technology Limited Company". Segmenting S yields the character substrings S = {ABC, Information, Technology, Limited, Company}, i = 5. A target character string T is searched for in the target string database according to these substrings; suppose one target character string found is "XYZ Information Technology Limited Company". Segmenting T yields the target character substrings T = {XYZ, Information, Technology, Limited, Company}, j = 5. The semantic weight table Wn gives the weights of the character substrings shown in Table 1 below:
TABLE 1
The weighted character substrings to be recognized are then Sw = {(ABC, 0.98), (Information, 0.02), (Technology, 0.02), (Limited, 0.01), (Company, 0.01)}, and the weighted target character substrings are Tw = {(XYZ, 0.99), (Information, 0.02), (Technology, 0.02), (Limited, 0.01), (Company, 0.01)}.
The edit distance with semantic weights between the character substrings to be recognized and the target character substrings, referred to herein as the semantic edit distance, is calculated by the edit distance formula above. A two-dimensional matrix with semantic weights between Sw and Tw is built as shown in Table 2 below. The calculation is analogous to the conventional edit distance algorithm and is not described in detail; the difference is that the operation counts are replaced with semantic weights. From the two-dimensional matrix, the semantic edit distance between Sw and Tw is 0.99.
TABLE 2
Then, according to the similarity calculation formula, the similarity between the character substrings to be recognized S = {ABC, Information, Technology, Limited, Company} and the target character substrings T = {XYZ, Information, Technology, Limited, Company} is: 1 − 0.99 / max((0.98+0.02+0.02+0.01+0.01), (0.99+0.02+0.02+0.01+0.01)) = 5.71%. The similarity between the character string S to be recognized and the target character string T is therefore small.
Example 2: suppose the acquired character string S to be recognized is "ABC Company". Segmenting S yields the character substrings S = {ABC, Company}, i = 2. A target character string T is searched for according to these substrings; suppose one target character string found is "ABC Information Technology Limited Company". Segmenting T yields the target character substrings T = {ABC, Information, Technology, Limited, Company}, j = 5. The semantic weight table Wn gives the weights of the character substrings shown in Table 3 below:
TABLE 3
The weighted character substrings to be recognized are then Sw = {(ABC, 0.98), (Company, 0.01)}, and the weighted target character substrings are Tw = {(ABC, 0.98), (Information, 0.02), (Technology, 0.02), (Limited, 0.01), (Company, 0.01)}.
The edit distance with semantic weights between the character substrings to be recognized and the target character substrings, referred to herein as the semantic edit distance, is calculated by the edit distance formula above. A two-dimensional matrix with semantic weights between Sw and Tw is built as shown in Table 4 below. The calculation is analogous to the conventional edit distance algorithm and is not described in detail; the difference is that the operation counts are replaced with semantic weights. From the two-dimensional matrix, the semantic edit distance between Sw and Tw is 0.05.
|             | 0    | ABC  | Company |
| 0           | 0    | 0.98 | 0.99    |
| ABC         | 0.98 | 0    | 0.01    |
| Information | 1.00 | 0.02 | 0.03    |
| Technology  | 1.02 | 0.04 | 0.05    |
| Limited     | 1.03 | 0.05 | 0.06    |
| Company     | 1.04 | 0.06 | 0.05    |

TABLE 4
Then, according to the similarity calculation formula, the similarity between the character string S = {ABC, Company} and the target character substrings T = {ABC, Information, Technology, Limited, Company} is: 1 − 0.05 / max((0.98+0.01), (0.98+0.02+0.02+0.01+0.01)) = 95.19%. The similarity between the character string S to be recognized and the target character string T is therefore very large, and the character string to be recognized may undergo subsequent processing such as being classified into the same class as the target character string, being directly replaced with the target character string, or being set as a blacklist entry.
Embodiment 2
Based on the same idea as the character string processing method described above, a second embodiment of the present application further provides a corresponding character string processing device, as shown in Fig. 2.
Fig. 2 is a schematic structural diagram of a character string processing device according to a second embodiment, which specifically includes:
an acquisition unit 201, configured to acquire a character string to be recognized;
a searching unit 202, configured to search for a target character string according to the character string to be identified;
the word segmentation unit 203 is configured to segment the character string to be identified and the target character string respectively to obtain each character sub-string to be identified and each target character sub-string;
a semantic weight determining unit 204, configured to determine semantic weights of each character sub-string to be identified and each target character sub-string;
a semantic editing distance determining unit 205, configured to determine a semantic editing distance between the character string to be recognized and the target character string according to semantic weights of the character strings to be recognized and the target character strings;
a similarity determining unit 206, configured to determine a similarity between the character string to be identified and the target character string according to the semantic editing distance;
and a subsequent processing unit 207, configured to perform subsequent processing on the character string to be identified according to the similarity.
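As an illustrative sketch only, the units 201-207 of Fig. 2 can be modelled as one pipeline class; whitespace segmentation, a dict-backed weight table and an in-memory target list are assumed stand-ins for the grammar analysis unit and the database:

```python
# Sketch of the Fig. 2 device; each method mirrors one unit (201-207).
class StringProcessor:
    def __init__(self, weight_table, database, threshold):
        self.weights = weight_table    # semantic weight table Wn (unit 204)
        self.database = database       # target-string store (unit 202)
        self.threshold = threshold

    def segment(self, text):           # word segmentation unit 203
        return text.split()

    def weigh(self, substrings):       # semantic weight determining unit 204
        return [(s, self.weights.get(s, 0.0)) for s in substrings]

    def search(self, weighted):        # searching unit 202
        keys = [s for s, w in weighted if w > self.threshold]
        return [t for t in self.database if any(k in t for k in keys)]

    def edit_distance(self, sw, tw):   # semantic editing distance unit 205
        m, n = len(sw), len(tw)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + sw[i - 1][1]
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + tw[j - 1][1]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                f = 0.0 if sw[i - 1][0] == tw[j - 1][0] else max(sw[i - 1][1],
                                                                 tw[j - 1][1])
                d[i][j] = min(d[i - 1][j] + sw[i - 1][1],
                              d[i][j - 1] + tw[j - 1][1],
                              d[i - 1][j - 1] + f)
        return d[m][n]

    def similarity(self, text):        # similarity determining unit 206
        sw = self.weigh(self.segment(text))
        best = 0.0
        for target in self.search(sw):
            tw = self.weigh(self.segment(target))
            dist = self.edit_distance(sw, tw)
            best = max(best, 1 - dist / max(sum(w for _, w in sw),
                                            sum(w for _, w in tw)))
        return best

weights = {"ABC": 0.98, "Information": 0.02, "Technology": 0.02,
           "Limited": 0.01, "Company": 0.01}
db = ["ABC Information Technology Limited Company"]
processor = StringProcessor(weights, db, 0.5)
```

Subsequent processing (unit 207) is deliberately left out, since the action taken on the similarity score is application-specific.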
The character string S to be recognized acquired by the acquisition unit 201 includes one or more of a company name, an address, a commodity name, a blacklist, a problem name, or a description input by a user.
For example, a user may need to enter a shipping address at a service website, a service provider may need to enter commodity names, and some users may need to set blacklists. In all such data, character strings expressing the same meaning may appear in different forms, and the amount of data the service website must store keeps growing. The system therefore needs to identify the data input by users so as to facilitate subsequent operations such as classification, addition and replacement.
The word segmentation unit 203 segments the acquired character string S to be recognized according to semantic units, obtaining the semantic character substrings S = {s1, s2, s3, …, si}. The device uses a grammar analysis unit to perform the word segmentation.
A semantic weight table Wn is kept in the semantic weight determining unit 204 or in the local database. The table is computed in advance from samples stored in the database, as follows:
A number of character string samples are extracted; the samples may be, for example, more than 10,000 lines of name lists or addresses of the same class. A de-duplication operation is performed on the extracted samples, i.e., identical strings among them are removed so that no sample is duplicated. The extracted samples are then segmented into words to obtain sample substrings, each being a semantic unit; the segmentation is the same as that of step S102. Finally, following the term frequency-inverse document frequency (TF-IDF) measure of a word's general importance, the semantic weight of each sample substring is calculated as

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

giving Wn = {(w1, idf1), (w2, idf2), (w3, idf3), …, (wn, idfn)}, where |D| is the total number of sample strings and |{j : t_i ∈ d_j}| is the number of samples containing the sample substring t_i. If a sample substring does not occur in any sample, the denominator would be zero, so 1 + |{j : t_i ∈ d_j}| is typically used instead. If a weight set for a class of sample substrings is generally applicable, it is stored under a class name, such as "W(companyName)" or "W(address)", so that the corresponding weight set can be called directly the next time the same scenario arises.
The semantic weight determining unit 204 searches the semantic weight table according to each character substring to be recognized, and then reads the corresponding semantic weight of each substring from the table, giving the weighted character substrings Sw = {(s1, sw1), (s2, sw2), (s3, sw3), …, (sm, swm)}.
The searching unit 202 searches for the target character string according to the character substrings to be recognized in the character string to be recognized: character substrings whose semantic weight is greater than a set threshold are selected from the character string to be recognized, and the target string database is then searched with the selected substrings to find the target character string T.
The target string T is one or more of the correct company name, address, commodity name, blacklist, problem name or description stored in the local database.
Here, "character substrings to be recognized whose semantic weight is greater than a set threshold" means one or more such substrings. There may likewise be one or more target character strings, each of which contains the selected character substring to be recognized.
After the target character string T is obtained, the word segmentation unit 203 segments it according to semantic units to obtain the target character substrings T = {t1, t2, t3, …, tn}. The semantic weight determining unit 204 then searches the semantic weight table according to each target character substring and reads the corresponding weights from the table, giving the weighted target character substrings Tw = {(t1, tw1), (t2, tw2), (t3, tw3), …, (tn, twn)}.
The semantic editing distance determining unit 205 determines the semantic edit distance between the character string to be recognized and the target character string according to their semantic weights, the distance being calculated by the following recurrence:

when i = 0 and j = 0, edit(0, 0) = 0;
when i = 0 and j > 0, edit(0, j) = edit(0, j-1) + tw_j;
when i > 0 and j = 0, edit(i, 0) = edit(i-1, 0) + sw_i;
when i > 0 and j > 0, edit(i, j) = min( edit(i-1, j) + sw_i, edit(i, j-1) + tw_j, edit(i-1, j-1) + f(i, j) );

where i is the index of a character substring to be recognized and j is the index of a target character substring; tw_j is the semantic weight of the target character substring t_j, and sw_i is the semantic weight of the character substring s_i to be recognized. edit(i, j) denotes the semantic edit distance from the substring subset (s1, s2, s3, …, si) to the target substring subset (t1, t2, t3, …, tj); when i and j equal the total numbers of substrings contained in the character string S to be recognized and in the target character string T respectively, edit(i, j) equals the semantic edit distance edit(S, T) between S and T. f(i, j) is the semantic edit distance incurred by converting the i-th character substring s_i to be recognized into the j-th target character substring t_j: when s_i = t_j, f(i, j) = 0; when s_i ≠ t_j, f(i, j) = max(sw_i, tw_j).
The similarity determining unit 206 determines the similarity between the character string to be recognized and the target character string according to the semantic edit distance between the character string S to be recognized and the target character string T obtained by the semantic editing distance determining unit 205.
The similarity calculation formula is:

similarity(S, T) = 1 − edit(S, T) / max( length(S), length(T) )

where edit(S, T) is the semantic edit distance between the character string S to be recognized and the target character string T, length(S) is the sum of the semantic weights of all character substrings to be recognized in S, and length(T) is the sum of the semantic weights of all target character substrings in T.
The subsequent processing unit 207 performs subsequent processing on the character string to be recognized according to the similarity between the character string S to be recognized and the target character string T determined by the similarity determining unit 206.
The subsequent processing unit 207 performs different processing functions in different application scenarios, for example, one or more of classifying the character string to be recognized, replacing it with a target character string that satisfies the similarity condition, setting it as a blacklist entry, and the like, according to the similarity result.
The examples for the present device are the same as Examples 1 and 2 in the first embodiment.
It should be noted that, the execution subjects of the steps of the method provided in the first embodiment may be the same device, or the method may also be executed by different devices.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.