The present application is a divisional application of the patent application No. 201510103200.4, filed on March 9, 2015 and entitled "A character string processing method and apparatus".
Detailed Description
To make the purposes, technical solutions and advantages of the present application clearer, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and the corresponding drawings. It is apparent that the described embodiments are only some, rather than all, of the embodiments of the present application. All other embodiments obtained by one of ordinary skill in the art from the present disclosure without creative effort fall within the scope of the present disclosure.
The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.
Embodiment 1
Fig. 1 is a flowchart of a character string processing method according to an embodiment of the present application. In the method, a character string is decomposed into character substrings carrying semantic weights, and similarity is calculated after the semantic edit distance between character strings is determined from those weights. This effectively improves similarity measurement between character strings at the level of semantic units, and facilitates subsequent processing of the character strings such as classification and identification. The method comprises the following steps:
S101: acquiring the character string to be recognized.
The acquired character string S to be recognized includes one or more of a company name, an address, a commodity name, a blacklist entry, or a problem name or description input by a user.
For example, a user may need to enter a shipping address at a service website, a service provider may need to enter commodity names, and some users may need to set blacklists. In all such data, character strings expressing the same meaning may appear in different forms, and the amount of data the service website must store keeps growing. The system therefore needs to identify the data input by users so as to facilitate subsequent operations such as classification, addition and replacement.
S102: segmenting the character string to be recognized to obtain character substrings to be recognized.
The character string S to be recognized is segmented into words according to semantic units, yielding the semantic character substrings S = {s1, s2, s3, …, si}. This step uses a grammar analysis unit to perform the word segmentation.
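The grammar analysis unit is not specified further; as a minimal sketch of step S102, a greedy longest-match segmenter against a hypothetical vocabulary can stand in for it:

```python
# Minimal sketch of step S102, assuming a dictionary-based longest-match
# segmenter stands in for the unspecified grammar analysis unit.
def segment(text, vocabulary):
    """Split `text` into semantic substrings by greedy longest match."""
    substrings = []
    i = 0
    while i < len(text):
        # Try the longest dictionary match first; fall back to one character.
        for length in range(len(text) - i, 0, -1):
            piece = text[i:i + length]
            if length == 1 or piece in vocabulary:
                substrings.append(piece)
                i += length
                break
    return substrings

# Hypothetical vocabulary of semantic units.
vocab = {"ABC", "Information", "Technology", "Limited", "Company"}
```

With this vocabulary, `segment("ABCInformationTechnologyLimitedCompany", vocab)` yields the five semantic units used in the examples below.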
S103: determining the semantic weight of each character substring to be recognized.
First, a semantic weight table Wn is kept in a local database. The table is computed in advance from samples stored in the database, as follows:

A number of character string samples are extracted; the samples may be, for example, more than 10,000 lines of name lists or addresses of the same class. A de-duplication operation is performed on the extracted samples, i.e., identical strings among them are removed so that no sample is duplicated. The extracted samples are then segmented into words to obtain sample substrings, each being a semantic unit; the segmentation is the same as that of step S102. Finally, following the term frequency-inverse document frequency (TF-IDF) measure of a word's general importance, the semantic weight of each sample substring is calculated as

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

giving Wn = {(w1, idf1), (w2, idf2), (w3, idf3), …, (wn, idfn)}, where |D| is the total number of sample strings and |{j : t_i ∈ d_j}| is the number of samples containing the sample substring t_i. If a sample substring does not occur in any sample, the denominator would be zero, so 1 + |{j : t_i ∈ d_j}| is typically used instead. If a weight set for a class of sample substrings is generally applicable, it is stored under a class name, such as "W(companyName)" or "W(address)", so that the corresponding weight set can be called directly the next time the same scenario arises.
The semantic weight table is first searched according to each character substring to be recognized; the semantic weight corresponding to each substring is then read from the table, giving the weighted character substrings Sw = {(s1, sw1), (s2, sw2), (s3, sw3), …, (sm, swm)}.
S104: searching for a target character string according to the character substrings to be recognized.
The target string T is one or more of the correct company name, address, commodity name, blacklist, problem name or description stored in the local database.
A character substring to be recognized whose semantic weight is greater than a set threshold is selected from the character string to be recognized; the target string database is then searched with the selected substring to find the target character string.
Here, "character substrings to be recognized whose semantic weight is greater than a set threshold" means one or more such substrings. There may likewise be one or more target character strings, each of which contains the selected character substring to be recognized.
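The threshold-based lookup of step S104 can be sketched as below; the in-memory list, the threshold value 0.5, and the substring-containment test are illustrative assumptions standing in for the target-string database:

```python
# Sketch of step S104: keep only substrings whose semantic weight exceeds
# the threshold, then return every stored string containing one of them.
# The in-memory list stands in for the target-string database.
def find_targets(weighted_substrings, database, threshold):
    keys = [s for s, w in weighted_substrings if w > threshold]
    return [t for t in database if any(k in t for k in keys)]

db = ["ABC Information Technology Limited Company",
      "XYZ Information Technology Limited Company"]
sw = [("ABC", 0.98), ("Company", 0.01)]
# Only "ABC" (weight 0.98 > 0.5) is used as a search key, so low-weight
# substrings like "Company" do not pull in unrelated targets.
```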
S105: segmenting the target character string to obtain target character substrings.
The selected target character strings are segmented one by one; the segmentation in this step is the same as in step S102, and yields the target character substrings T = {t1, t2, t3, …, tn}.
S106: determining the semantic weight of each target character substring.
As in step S103, the semantic weight table is searched according to each target character substring; the weight corresponding to each target character substring is then read from the table, giving the weighted target character substrings Tw = {(t1, tw1), (t2, tw2), (t3, tw3), …, (tn, twn)}.
S107: determining the semantic edit distance between the character string to be recognized and the target character string according to their semantic weights.
In this step the semantic edit distance is calculated by the following recurrence:

when i = 0 and j = 0, edit(0, 0) = 0;
when i = 0 and j > 0, edit(0, j) = edit(0, j-1) + tw_j;
when i > 0 and j = 0, edit(i, 0) = edit(i-1, 0) + sw_i;
when i > 0 and j > 0, edit(i, j) = min( edit(i-1, j) + sw_i, edit(i, j-1) + tw_j, edit(i-1, j-1) + f(i, j) );

where i is the index of a character substring to be recognized and j is the index of a target character substring; tw_j is the semantic weight of the target character substring t_j, and sw_i is the semantic weight of the character substring s_i to be recognized. edit(i, j) denotes the semantic edit distance from the substring subset (s1, s2, s3, …, si) to the target substring subset (t1, t2, t3, …, tj); when i and j equal the total numbers of substrings contained in the character string S to be recognized and in the target character string T respectively, edit(i, j) equals the semantic edit distance edit(S, T) between S and T. f(i, j) is the semantic edit distance incurred by converting the i-th character substring s_i to be recognized into the j-th target character substring t_j: when s_i = t_j, f(i, j) = 0; when s_i ≠ t_j, f(i, j) = max(sw_i, tw_j).
S108: determining the similarity between the character string to be recognized and the target character string according to the semantic edit distance.
This step calculates the similarity between the character string to be recognized and the target character string from the semantic edit distance obtained in step S107. The similarity calculation formula is:

similarity(S, T) = 1 − edit(S, T) / max( length(S), length(T) )

where edit(S, T) is the semantic edit distance between the character string S to be recognized and the target character string T, length(S) is the sum of the semantic weights of all character substrings to be recognized in S, and length(T) is the sum of the semantic weights of all target character substrings in T.
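The normalisation of step S108 can be sketched directly from the formula; the weights below reuse those of Example 1 further on and are illustrative:

```python
# Sketch of step S108: similarity is one minus the semantic edit distance,
# normalised by the larger of the two weight sums length(S) and length(T).
def similarity(edit_distance, sw, tw):
    length_s = sum(w for _, w in sw)
    length_t = sum(w for _, w in tw)
    return 1 - edit_distance / max(length_s, length_t)

sw = [("ABC", 0.98), ("Information", 0.02), ("Technology", 0.02),
      ("Limited", 0.01), ("Company", 0.01)]
tw = [("XYZ", 0.99), ("Information", 0.02), ("Technology", 0.02),
      ("Limited", 0.01), ("Company", 0.01)]
```

With a semantic edit distance of 0.99 this gives 1 − 0.99/1.05 ≈ 5.71%, i.e. two company names that differ only in their distinctive part are rated dissimilar despite sharing four of five substrings.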
S109: performing subsequent processing on the character string to be recognized according to the similarity.
Depending on the application scenario, this step uses the similarity result to perform one or more of: classifying the character string to be recognized, replacing it with a target character string that satisfies the similarity condition, setting the target character string as a blacklist entry, and the like.
Example 1: suppose the acquired character string S to be recognized is "ABC Information Technology Limited Company". Segmenting S yields the character substrings S = {ABC, Information, Technology, Limited, Company}, i = 5. A target character string T is searched for in the target string database according to these substrings; suppose one target character string found is "XYZ Information Technology Limited Company". Segmenting T yields the target character substrings T = {XYZ, Information, Technology, Limited, Company}, j = 5. The semantic weight table Wn gives the weights of the character substrings shown in Table 1 below:
TABLE 1
The weighted character substrings to be recognized are then Sw = {(ABC, 0.98), (Information, 0.02), (Technology, 0.02), (Limited, 0.01), (Company, 0.01)}, and the weighted target character substrings are Tw = {(XYZ, 0.99), (Information, 0.02), (Technology, 0.02), (Limited, 0.01), (Company, 0.01)}.
The edit distance with semantic weights between the character substrings to be recognized and the target character substrings, referred to herein as the semantic edit distance, is calculated by the edit distance formula above. A two-dimensional matrix with semantic weights between Sw and Tw is built as shown in Table 2 below. The calculation is analogous to the conventional edit distance algorithm and is not described in detail; the difference is that the operation counts are replaced with semantic weights. From the two-dimensional matrix, the semantic edit distance between Sw and Tw is 0.99.
TABLE 2
Then, according to the similarity calculation formula, the similarity between the character substrings to be recognized S = {ABC, Information, Technology, Limited, Company} and the target character substrings T = {XYZ, Information, Technology, Limited, Company} is: 1 − 0.99 / max((0.98+0.02+0.02+0.01+0.01), (0.99+0.02+0.02+0.01+0.01)) = 5.71%. The similarity between the character string S to be recognized and the target character string T is therefore small.
Example 2: suppose the acquired character string S to be recognized is "ABC Company". Segmenting S yields the character substrings S = {ABC, Company}, i = 2. A target character string T is searched for according to these substrings; suppose one target character string found is "ABC Information Technology Limited Company". Segmenting T yields the target character substrings T = {ABC, Information, Technology, Limited, Company}, j = 5. The semantic weight table Wn gives the weights of the character substrings shown in Table 3 below:
TABLE 3
The weighted character substrings to be recognized are then Sw = {(ABC, 0.98), (Company, 0.01)}, and the weighted target character substrings are Tw = {(ABC, 0.98), (Information, 0.02), (Technology, 0.02), (Limited, 0.01), (Company, 0.01)}.
The edit distance with semantic weights between the character substrings to be recognized and the target character substrings, referred to herein as the semantic edit distance, is calculated by the edit distance formula above. A two-dimensional matrix with semantic weights between Sw and Tw is built as shown in Table 4 below. The calculation is analogous to the conventional edit distance algorithm and is not described in detail; the difference is that the operation counts are replaced with semantic weights. From the two-dimensional matrix, the semantic edit distance between Sw and Tw is 0.05.
|             | 0    | ABC  | Company |
| 0           | 0    | 0.98 | 0.99    |
| ABC         | 0.98 | 0    | 0.01    |
| Information | 1.00 | 0.02 | 0.03    |
| Technology  | 1.02 | 0.04 | 0.05    |
| Limited     | 1.03 | 0.05 | 0.06    |
| Company     | 1.04 | 0.06 | 0.05    |

TABLE 4
Then, according to the similarity calculation formula, the similarity between the character string S = {ABC, Company} and the target character substrings T = {ABC, Information, Technology, Limited, Company} is: 1 − 0.05 / max((0.98+0.01), (0.98+0.02+0.02+0.01+0.01)) = 95.19%. The similarity between the character string S to be recognized and the target character string T is therefore very large, and the character string to be recognized may undergo subsequent processing such as being classified into the same class as the target character string, being directly replaced with the target character string, or being set as a blacklist entry.
Embodiment 2
Based on the same idea as the character string processing method described above, a second embodiment of the present application further provides a corresponding character string processing device, as shown in Fig. 2.
Fig. 2 is a schematic structural diagram of a character string processing device according to a second embodiment, which specifically includes:
an acquisition unit 201, configured to acquire a character string to be recognized;
a searching unit 202, configured to search for a target character string according to the character string to be identified;
the word segmentation unit 203 is configured to segment the character string to be identified and the target character string respectively to obtain each character sub-string to be identified and each target character sub-string;
a semantic weight determining unit 204, configured to determine semantic weights of each character sub-string to be identified and each target character sub-string;
a semantic editing distance determining unit 205, configured to determine a semantic editing distance between the character string to be recognized and the target character string according to semantic weights of the character strings to be recognized and the target character strings;
a similarity determining unit 206, configured to determine a similarity between the character string to be identified and the target character string according to the semantic editing distance;
and a subsequent processing unit 207, configured to perform subsequent processing on the character string to be identified according to the similarity.
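As an illustrative sketch only, the units 201-207 of Fig. 2 can be modelled as one pipeline class; whitespace segmentation, a dict-backed weight table and an in-memory target list are assumed stand-ins for the grammar analysis unit and the database:

```python
# Sketch of the Fig. 2 device; each method mirrors one unit (201-207).
class StringProcessor:
    def __init__(self, weight_table, database, threshold):
        self.weights = weight_table    # semantic weight table Wn (unit 204)
        self.database = database       # target-string store (unit 202)
        self.threshold = threshold

    def segment(self, text):           # word segmentation unit 203
        return text.split()

    def weigh(self, substrings):       # semantic weight determining unit 204
        return [(s, self.weights.get(s, 0.0)) for s in substrings]

    def search(self, weighted):        # searching unit 202
        keys = [s for s, w in weighted if w > self.threshold]
        return [t for t in self.database if any(k in t for k in keys)]

    def edit_distance(self, sw, tw):   # semantic editing distance unit 205
        m, n = len(sw), len(tw)
        d = [[0.0] * (n + 1) for _ in range(m + 1)]
        for i in range(1, m + 1):
            d[i][0] = d[i - 1][0] + sw[i - 1][1]
        for j in range(1, n + 1):
            d[0][j] = d[0][j - 1] + tw[j - 1][1]
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                f = 0.0 if sw[i - 1][0] == tw[j - 1][0] else max(sw[i - 1][1],
                                                                 tw[j - 1][1])
                d[i][j] = min(d[i - 1][j] + sw[i - 1][1],
                              d[i][j - 1] + tw[j - 1][1],
                              d[i - 1][j - 1] + f)
        return d[m][n]

    def similarity(self, text):        # similarity determining unit 206
        sw = self.weigh(self.segment(text))
        best = 0.0
        for target in self.search(sw):
            tw = self.weigh(self.segment(target))
            dist = self.edit_distance(sw, tw)
            best = max(best, 1 - dist / max(sum(w for _, w in sw),
                                            sum(w for _, w in tw)))
        return best

weights = {"ABC": 0.98, "Information": 0.02, "Technology": 0.02,
           "Limited": 0.01, "Company": 0.01}
db = ["ABC Information Technology Limited Company"]
processor = StringProcessor(weights, db, 0.5)
```

Subsequent processing (unit 207) is deliberately left out, since the action taken on the similarity score is application-specific.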
The character string S to be recognized acquired by the acquisition unit 201 includes one or more of a company name, an address, a commodity name, a blacklist, a problem name, or a description input by a user.
For example, a user may need to enter a shipping address at a service website, a service provider may need to enter commodity names, and some users may need to set blacklists. In all such data, character strings expressing the same meaning may appear in different forms, and the amount of data the service website must store keeps growing. The system therefore needs to identify the data input by users so as to facilitate subsequent operations such as classification, addition and replacement.
The word segmentation unit 203 segments the acquired character string S to be recognized according to semantic units, obtaining the semantic character substrings S = {s1, s2, s3, …, si}. The device uses a grammar analysis unit to perform the word segmentation.
A semantic weight table Wn is kept in the semantic weight determining unit 204 or in the local database. The table is computed in advance from samples stored in the database, as follows:
A number of character string samples are extracted; the samples may be, for example, more than 10,000 lines of name lists or addresses of the same class. A de-duplication operation is performed on the extracted samples, i.e., identical strings among them are removed so that no sample is duplicated. The extracted samples are then segmented into words to obtain sample substrings, each being a semantic unit; the segmentation is the same as that of step S102. Finally, following the term frequency-inverse document frequency (TF-IDF) measure of a word's general importance, the semantic weight of each sample substring is calculated as

idf_i = log( |D| / |{j : t_i ∈ d_j}| )

giving Wn = {(w1, idf1), (w2, idf2), (w3, idf3), …, (wn, idfn)}, where |D| is the total number of sample strings and |{j : t_i ∈ d_j}| is the number of samples containing the sample substring t_i. If a sample substring does not occur in any sample, the denominator would be zero, so 1 + |{j : t_i ∈ d_j}| is typically used instead. If a weight set for a class of sample substrings is generally applicable, it is stored under a class name, such as "W(companyName)" or "W(address)", so that the corresponding weight set can be called directly the next time the same scenario arises.
The semantic weight determining unit 204 searches the semantic weight table according to each character substring to be recognized, and then reads the corresponding semantic weight of each substring from the table, giving the weighted character substrings Sw = {(s1, sw1), (s2, sw2), (s3, sw3), …, (sm, swm)}.
The searching unit 202 searches for the target character string according to the character substrings to be recognized in the character string to be recognized: character substrings whose semantic weight is greater than a set threshold are selected from the character string to be recognized, and the target string database is then searched with the selected substrings to find the target character string T.
The target string T is one or more of the correct company name, address, commodity name, blacklist, problem name or description stored in the local database.
Here, "character substrings to be recognized whose semantic weight is greater than a set threshold" means one or more such substrings. There may likewise be one or more target character strings, each of which contains the selected character substring to be recognized.
After the target character string T is obtained, the word segmentation unit 203 segments it according to semantic units to obtain the target character substrings T = {t1, t2, t3, …, tn}. The semantic weight determining unit 204 then searches the semantic weight table according to each target character substring and reads the corresponding weights from the table, giving the weighted target character substrings Tw = {(t1, tw1), (t2, tw2), (t3, tw3), …, (tn, twn)}.
The semantic editing distance determining unit 205 determines the semantic edit distance between the character string to be recognized and the target character string according to their semantic weights, the distance being calculated by the following recurrence:

when i = 0 and j = 0, edit(0, 0) = 0;
when i = 0 and j > 0, edit(0, j) = edit(0, j-1) + tw_j;
when i > 0 and j = 0, edit(i, 0) = edit(i-1, 0) + sw_i;
when i > 0 and j > 0, edit(i, j) = min( edit(i-1, j) + sw_i, edit(i, j-1) + tw_j, edit(i-1, j-1) + f(i, j) );

where i is the index of a character substring to be recognized and j is the index of a target character substring; tw_j is the semantic weight of the target character substring t_j, and sw_i is the semantic weight of the character substring s_i to be recognized. edit(i, j) denotes the semantic edit distance from the substring subset (s1, s2, s3, …, si) to the target substring subset (t1, t2, t3, …, tj); when i and j equal the total numbers of substrings contained in the character string S to be recognized and in the target character string T respectively, edit(i, j) equals the semantic edit distance edit(S, T) between S and T. f(i, j) is the semantic edit distance incurred by converting the i-th character substring s_i to be recognized into the j-th target character substring t_j: when s_i = t_j, f(i, j) = 0; when s_i ≠ t_j, f(i, j) = max(sw_i, tw_j).
The similarity determining unit 206 determines the similarity between the character string to be recognized and the target character string according to the semantic edit distance between the character string S to be recognized and the target character string T obtained by the semantic editing distance determining unit 205.
The similarity calculation formula is:

similarity(S, T) = 1 − edit(S, T) / max( length(S), length(T) )

where edit(S, T) is the semantic edit distance between the character string S to be recognized and the target character string T, length(S) is the sum of the semantic weights of all character substrings to be recognized in S, and length(T) is the sum of the semantic weights of all target character substrings in T.
The subsequent processing unit 207 performs subsequent processing on the character string to be recognized according to the similarity between the character string S to be recognized and the target character string T determined by the similarity determining unit 206.
The subsequent processing unit 207 performs different processing functions in different application scenarios, for example, one or more of classifying the character string to be recognized, replacing it with a target character string that satisfies the similarity condition, setting it as a blacklist entry, and the like, according to the similarity result.
The examples for the present device are the same as Examples 1 and 2 in the first embodiment.
It should be noted that, the execution subjects of the steps of the method provided in the first embodiment may be the same device, or the method may also be executed by different devices.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM) and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media, including both permanent and non-permanent, removable and non-removable media, may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium, which can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and changes may be made to the present application by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc. which are within the spirit and principles of the present application are intended to be included within the scope of the claims of the present application.