CN102929891B

CN102929891B - Method and apparatus for processing text

Info

Publication number: CN102929891B
Application number: CN201110230270.8A
Authority: CN
Inventors: 许泰清; 徐磊石; 胡四海
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Cloud Computing Ltd
Priority date: 2011-08-11
Filing date: 2011-08-11
Publication date: 2015-09-16
Anticipated expiration: 2031-08-11
Also published as: CN102929891A; HK1176432A1

Abstract

This application provides a method and apparatus for processing text, to address the poor effectiveness of text detection in the prior art. The method comprises: looking up, in an inverted index, the keywords of a pending text chunk; counting, for each text chunk in a prestored text chunk set, the number of times that text chunk or its identifier appears in the entries containing those keywords; and selecting multiple text chunks from the prestored text chunk set in descending order of this count, where the inverted index is built over the prestored text chunk set and comprises multiple entries, each entry comprising one keyword and correspondingly storing the text chunks containing that keyword or their identifiers; calculating the similarity between the pending text chunk and each of the selected text chunks to obtain multiple similarity values; and judging whether the minimum of the similarity values is within a set range and, if so, outputting information with preset content.

Description

Method and apparatus for processing text

Technical field

This application relates to computer technology, and in particular to a method and apparatus for processing text.

Background

On the internet, text processing is commonly needed to prevent the spread of junk or harmful information. For example, in an anti-spam setup, a mail receiving device such as mail client software exactly matches the sender address of a message against prestored blacklist addresses and rejects the message if all characters of the two are identical. The text processed in this case is the e-mail address. As another example, some users of an e-commerce system may commit fraud; to curb this, the addresses these users provide (usually mailing addresses) need to be checked. An address blacklist is currently used here as well: each address is exactly matched, and if all of its characters are identical to all characters of at least one address in the blacklist, the user is considered suspected of fraud.

In this "address blacklist" scenario, some e-mail senders or e-commerce users evade detection by changing their address, typically by altering a few characters of the address text, and the detection method above cannot catch such an address.

Text processing also encounters the "historical address comparison" scenario: for a given address, determine whether an address that has previously appeared in an existing address list is similar to it, so as to analyze, for example, how many times different addresses have appeared. Traditional exact matching cannot recognize the altered addresses mentioned above, which makes the determination inaccurate.

In the "address blacklist" scenario, the result of text processing is used to decide whether the address being judged is suspected of fraud. Existing exact matching can only handle identical addresses; if a few characters of an address are modified, the modified address cannot be detected directly and the blacklist loses its real effect. Moreover, because the blacklist is maintained manually, even if every modified address could be obtained and added to it, the list would become very large and hard to maintain.

In the "historical address comparison" scenario, the result of text processing is used to decide whether the address being judged is a historical address that has appeared before, and to compute metrics such as the number of times different addresses have appeared. Existing matching techniques likewise match only identical addresses, so two similar addresses are judged to be two different addresses even though they are in fact the same address. Traditional processing therefore makes the address analysis inaccurate.

For these two application scenarios, existing text detection methods perform poorly, and no effective solution has yet been proposed.

Summary of the invention

The main purpose of this application is to provide a method and apparatus for processing text, to solve the problem of poor text detection in the prior art.

To achieve this purpose, according to one aspect of this application, a method for processing text is provided.

The method for processing text of this application comprises: looking up, in an inverted index, the keywords of a pending text chunk; counting, for each text chunk in a prestored text chunk set, the number of times that text chunk or its identifier appears in the entries containing the keywords; selecting multiple text chunks from the prestored text chunk set in descending order of this count, wherein the inverted index is built over the prestored text chunk set and comprises multiple entries, each entry comprising one keyword and correspondingly storing the text chunks containing that keyword or the identifiers of those text chunks; calculating the similarity between the pending text chunk and each of the selected text chunks to obtain multiple similarity values; and judging whether the minimum of the multiple similarity values is within a set range and, if so, outputting information with preset content.

Further, calculating the similarity between the pending text chunk and each of the selected text chunks comprises: using a string similarity comparison algorithm to calculate the similarity between the pending text chunk and each of the selected text chunks.

Further, when the result of the judgment is yes, the pending text chunk is added to the text chunk set.

Further, before calculating the similarity between the pending text chunk and multiple text chunks in the prestored text chunk set, the method also comprises: removing from the text chunk set, according to a preset criterion for judging string similarity, text chunks that are similar to other text chunks.

Further, the prestored text chunk set is the historical user mailing addresses in an e-commerce computer system; the information with preset content comprises historical address information, which indicates that the pending text chunk belongs to the historical user mailing addresses.

Further, the prestored text chunk set is the blacklist mailing addresses in an e-commerce computer system; the information with preset content comprises blacklisted-user information, which indicates that the pending text chunk belongs to the blacklist mailing addresses.

According to another aspect of this application, a device for processing text is provided.

The device for processing text of this application comprises: a text chunk selection module, configured to look up all keywords of a pending text chunk in an inverted index, count for each text chunk in a prestored text chunk set the number of times that text chunk or its identifier appears in the entries containing the keywords, and select multiple text chunks from the prestored text chunk set in descending order of this count, wherein the inverted index is built over the prestored text chunk set and comprises multiple entries, each entry comprising one keyword and correspondingly storing the text chunks containing that keyword or the identifiers of those text chunks; a computing module, configured to calculate the similarity between the pending text chunk and each of the selected text chunks to obtain multiple similarity values; and a judging module, configured to judge whether the minimum of the multiple similarity values is within a set range and, if so, output information with preset content.

Further, the computing module is also configured to use a string similarity comparison algorithm to calculate the similarity between the pending text chunk and each of the selected text chunks.

Further, the judging module is also configured to add the pending text chunk to the text chunk set when the judgment result is yes.

Further, the device also comprises a preprocessing module, configured to remove from the text chunk set, before the computing module performs its calculation and according to a preset criterion for judging string similarity, text chunks that are similar to other text chunks.

Further, the device also comprises a storage module, configured to store the historical user mailing addresses in an e-commerce computer system as the text chunk set; the judging module is also configured to output historical address information, which indicates that the pending text chunk belongs to the historical user mailing addresses.

Further, the device also comprises a storage module, configured to store the blacklist mailing addresses in an e-commerce computer system as the text chunk set; the judging module is also configured to output blacklisted-user information, which indicates that the pending text chunk belongs to the blacklist mailing addresses.

According to the technical solution of this application, an inverted index records the addresses and the keywords in the addresses; multiple stored addresses are selected according to the number of times each appears in the index entries relevant to the current pending address, and similarity is then calculated between these addresses and the current pending address. This greatly speeds up the similarity calculation, so that whether the current pending address is a historical address or a blacklisted address can be confirmed quickly, improving the processing capability of the e-commerce system.

Brief description of the drawings

The accompanying drawings are provided for further understanding of this application and form a part of it. The illustrative embodiments of this application and their description serve to explain the application and do not unduly limit it. In the drawings:

Fig. 1 is a flowchart of the main steps of the method for processing text according to an embodiment of this application;

Fig. 2 is a schematic diagram of the content of the inverted index according to an embodiment of this application;

Fig. 3 is a schematic diagram of computing address similarity-degree statistics using the inverted index according to an embodiment of this application;

Fig. 4 is a schematic diagram of the main modules of the device for processing text according to an embodiment of this application.

Detailed description of the embodiments

It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.

Fig. 1 is a flowchart of the main steps of the method for processing text according to an embodiment of this application. As shown in Fig. 1, the method mainly comprises the following steps:

Step S11: Obtain a pending text chunk.

Step S13: Calculate the similarity between the pending text chunk and multiple text chunks in the prestored text chunk set, obtaining multiple similarity values.

This step may use any string similarity comparison algorithm, existing or yet to appear, for example the Levenshtein distance algorithm, the LCS algorithm, or the vector product algorithm. Given two strings, such an algorithm computes their distance; the distance is a decimal between 0 and 1, and the larger the value, the more the two strings differ. In this embodiment, the similarity value between two addresses is obtained by subtracting this distance from 1. The similarity value is therefore also a decimal between 0 and 1: the larger it is, the more similar the two strings are; a value of 1 means the two strings are identical, and a value of 0 means they are completely different.
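A minimal sketch of the similarity value described in step S13, assuming the Levenshtein distance is scaled into the 0-to-1 range by dividing by the length of the longer string. The text does not say how the distance is normalized, so that scaling and the function names `levenshtein` and `similarity` are illustrative assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Similarity in [0, 1]: 1 minus the normalized distance (assumed scaling)."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    distance = levenshtein(a, b) / longest  # decimal between 0 and 1
    return 1.0 - distance
```

Under this assumed scaling, two ten-character addresses that differ in two characters, such as "12 MAIN ST" and "12 MAIN RD", get a similarity of about 0.8.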

Step S15: Judge whether the minimum of the obtained similarity values is within the set range; if so, go to step S17, otherwise go to step S19.

The calculation in step S13 yields similarity values between 0 and 1; in this embodiment, a range is set for the similarity value to serve as the basis for the decision.

Step S17: Output information with preset content. The preset content in this step depends on the application scenario. For example, in the "address blacklist" scenario, the similarity range set in step S15 is "greater than 0.7"; if the similarity value is greater than 0.7, the address represented by the pending text is considered suspected of fraud, and the information output may be blacklisted-user information indicating that the pending text chunk belongs to the blacklist mailing addresses, for example: "The current address is a fraudulent address". As another example, in the "historical address comparison" scenario, the similarity range set in step S15 is "greater than 0.75"; if the similarity value is greater than 0.75, the address represented by the pending text is considered to have appeared in the historical address list, and the information output indicates that the pending text chunk belongs to the historical user mailing addresses, for example: "The current address is a historical address".
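The scenario-dependent output of step S17 can be sketched as a small lookup. The thresholds 0.7 and 0.75 and the two messages come from the examples above; the names `SCENARIOS` and `preset_output`, and returning `None` when the value is outside the range, are assumptions made for illustration.

```python
from typing import Optional

# Threshold and preset message per scenario, taken from the two examples in the text.
SCENARIOS = {
    "address blacklist": (0.7, "The current address is a fraudulent address"),
    "historical address comparison": (0.75, "The current address is a historical address"),
}


def preset_output(scenario: str, similarity_value: float) -> Optional[str]:
    """Return the preset message when the value lies in the scenario's range, else None."""
    threshold, message = SCENARIOS[scenario]
    if similarity_value > threshold:  # the set range is read as "greater than threshold"
        return message
    return None  # assumption: nothing is output outside the range
```

Keeping the thresholds in one table reflects the point made later in the text that administrators choose the range to suit their own system.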

Step S19: Obtain the next pending text chunk, then return to step S13. If all pending text chunks have been processed, the current flow ends.

As the steps above show, this embodiment processes text by calculating the similarity between strings, which avoids the limitations of exact string matching. In the "address blacklist" scenario, users who evade detection by changing some of the characters in an address are a fact of e-commerce information exchange; the scheme of this embodiment can find addresses in which some characters have been changed, and thus helps detect address fraud more comprehensively. The set range in step S15 can be chosen according to the actual conditions of the system and the experience of its administrators, so that administrators can make the system meet their practical needs, such as checking for address fraud or confirming that an address is a historical address.

Next, the method for processing text in this embodiment is described further, taking as an example the case where the text chunk is a user mailing address in e-commerce. A user mailing address is an address provided by an e-commerce user, such as the shipping address filled in when buying goods. It usually consists of multiple keywords that identify a specific geographic location, such as country, region, street, and house number.

First, a list of addresses is obtained; the addresses in this list are compared with the current pending address. In the "address blacklist" scenario, this list is the list of fraudulent addresses; in the "historical address comparison" scenario, it is the list of historical addresses. Step S13 is described in detail below, taking the "historical address comparison" scenario as an example.

In step S13, the similarity between the pending text chunk and multiple text chunks in the prestored text chunk set is calculated. Here the pending text chunk is the pending mailing address, and the prestored text chunk set is the blacklist in the "address blacklist" scenario or the address set in the "historical address comparison" scenario. Let this set be Q and the current pending address be the string S. In step S13, similarity can be calculated between S and every address in Q; alternatively, the several addresses in Q most similar to S can first be obtained (let these addresses form Q₀), and similarity is then calculated between S and the addresses in Q₀. The latter approach speeds up the comparison. Before the calculation, S can be tokenized and normalized. How keywords are separated depends on the language of the string; English, for example, can be split on spaces. The tokenization and normalization step also converts all lowercase letters to uppercase.
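For English addresses, the tokenization and normalization described above might look like the following sketch. Splitting on whitespace and upper-casing follow the text; stripping surrounding punctuation is an added assumption, and `tokenize` is a hypothetical name.

```python
import string
from typing import List


def tokenize(address: str) -> List[str]:
    """Split an English address on whitespace and upper-case each keyword."""
    keywords = []
    for raw in address.split():
        # Assumption: punctuation around a keyword (such as a trailing comma) is dropped.
        keyword = raw.strip(string.punctuation).upper()
        if keyword:
            keywords.append(keyword)
    return keywords
```

With this sketch, `tokenize("12 Main St., Springfield")` yields the keywords `12`, `MAIN`, `ST` and `SPRINGFIELD`.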

To obtain the "several most similar addresses" in Q, a way of comparing similarity degree can be determined in advance as the principle for comparing S with the addresses in Q. All addresses in Q are compared with S, and the m (for example, 10) addresses most similar to S are then selected from Q, in descending order of similarity degree, as Q₀. The similarity degree here differs from the similarity above: in this embodiment, the similarity value is calculated by a string similarity comparison algorithm, whereas the similarity degree is obtained by comparison. The comparison principle depends on the language of the address; for English, for example, it can be the largest number of keywords in common. S can be compared with the addresses in Q directly, or the comparison can be accelerated with an inverted index; the latter is explained below.

Each address in Q consists of several keywords. An inverted index is built over the keywords that occur in Q; that is, each keyword corresponds to a list of addresses or address identifiers, and every address in that list contains the keyword. Fig. 2 is a schematic diagram of the content of the inverted index according to an embodiment of this application. As shown in Fig. 2, the left column of the table holds the keywords (shown in the figure as "keyword 1", "keyword 2", ... "keyword N"), and the right column holds the sequence numbers of the addresses containing each keyword (each address in Q can be assigned a sequence number), such as "address 1", "address 2", and so on. One keyword can correspond to the sequence numbers of one or more addresses. The right column of the table can of course also hold the addresses themselves.
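The table of Fig. 2 can be sketched as a dictionary from keyword to a set of address sequence numbers. This is an illustrative assumption about the data structure, not the patent's implementation; `build_inverted_index` is a hypothetical name, and sequence numbers start at 1 to mirror "address 1", "address 2" in the figure.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(q: List[str]) -> Dict[str, Set[int]]:
    """Map each keyword occurring in Q to the sequence numbers of the addresses containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for sequence_number, address in enumerate(q, start=1):
        for keyword in address.upper().split():  # whitespace split, upper-cased
            index[keyword].add(sequence_number)
    return dict(index)


# Made-up sample addresses, for illustration only.
sample_q = ["1 Main St", "2 Main Rd", "1 Oak St"]
sample_index = build_inverted_index(sample_q)
# sample_index["MAIN"] is {1, 2}; sample_index["OAK"] is {3}
```

Storing sequence numbers instead of the address strings keeps each entry small; as the text notes, the addresses themselves could be stored instead.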

Next, all keywords of S are looked up in the inverted index, and for each text chunk in Q, the number of times that text chunk or its identifier appears in the inverted index entries containing these keywords is counted. This counting is illustrated in Fig. 3, which is a schematic diagram of computing address similarity-degree statistics using the inverted index according to an embodiment of this application.

For ease of description, suppose the current pending address contains three keywords, namely keyword 1, keyword 2, and keyword 3, and Q contains four addresses, address 1 to address 4. Box 31 on the left of Fig. 3 shows part of the inverted index, including keyword 1, keyword 2, and keyword 3; the entry for each of these three keywords contains one or more of address 1 to address 4. Box 32 on the right of Fig. 3 shows the statistics for each address: the left column lists the addresses and the right column gives the resulting metric, namely the number of times the address appears across the entries for keyword 1, keyword 2, and keyword 3 (that is, all keywords of the current pending address).

Specifically, as shown in Fig. 3, address 1 appears in the entries for keyword 1 and keyword 3, so its metric is 2; address 2 appears in the entries for keyword 1, keyword 2, and keyword 3, so its metric is 3. Addresses 3 and 4 are counted in the same way, and their metrics are 2 and 1 respectively, as shown in box 32. Because the metric of address 2 is 3, which is greater than that of every other address, address 2 has the highest similarity degree to the current pending address. This simple example obtains from Q the single address most similar to the pending address. In practice Q usually contains many more addresses, and multiple addresses, for example 10, can be taken from Q in descending order of similarity degree. These addresses have a relatively high similarity degree to the current pending address (that is, S); they are the "several most similar addresses" mentioned above, that is, Q₀.
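The counting in the Fig. 3 example can be reproduced with a short sketch. The entries for address 1 and address 2 follow the text; which keyword entries contain address 3 and address 4 is not stated, so the membership below is an assumption chosen to give the stated metrics 2 and 1. The function name `select_candidates` and the tie-breaking by sequence number are also assumptions.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple


def select_candidates(index: Dict[str, Set[int]],
                      pending_keywords: List[str],
                      m: int = 10) -> List[Tuple[int, int]]:
    """Return up to m (address, count) pairs, highest count first."""
    counts: Counter = Counter()
    # Each distinct keyword of the pending address is looked up once (assumption).
    for keyword in dict.fromkeys(pending_keywords):
        for address in index.get(keyword, ()):
            counts[address] += 1
    # Sort by descending count; ties are broken by address number for determinism.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))[:m]


# Partial inverted index in the spirit of box 31 (entries for addresses 3 and 4 assumed).
fig3_index = {
    "keyword 1": {1, 2},
    "keyword 2": {2, 3},
    "keyword 3": {1, 2, 3, 4},
}
fig3_ranking = select_candidates(fig3_index, ["keyword 1", "keyword 2", "keyword 3"])
# fig3_ranking is [(2, 3), (1, 2), (3, 2), (4, 1)], matching box 32
```

Only the index entries for the pending address's own keywords are read, so the cost of this step does not depend on how many unrelated addresses Q contains.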

After Q₀ is obtained, similarity is calculated between S and each address in Q₀ in turn, which yields multiple similarity values and completes step S13. In the subsequent step S15, the minimum of these similarity values is found; if this minimum is less than the preset value, S is considered a historical address. Otherwise S is not a historical address, and S can then be added to Q to update the historical address list.
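Putting the pieces together, the flow from candidate selection to the judgment of step S15 might be sketched as below. It follows the text literally in judging the minimum of the similarity values, and takes the set range as a caller-supplied predicate because the range is a configuration choice. The names `edit_similarity` and `is_known_address`, the length-based normalization, and returning `False` when no stored address shares a keyword are all assumptions.

```python
from collections import Counter
from typing import Callable, Dict, List, Set


def edit_similarity(a: str, b: str) -> float:
    """1 minus Levenshtein distance divided by the longer length (assumed scaling)."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            current.append(min(previous[j] + 1,
                               current[j - 1] + 1,
                               previous[j - 1] + (ch_a != ch_b)))
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def is_known_address(s: str, q: List[str],
                     in_range: Callable[[float], bool], m: int = 10) -> bool:
    """Select up to m candidates from q via an inverted index, then judge the minimum similarity."""
    s = s.upper()
    addresses = [address.upper() for address in q]

    # Inverted index: keyword -> positions of the addresses containing it.
    index: Dict[str, Set[int]] = {}
    for position, address in enumerate(addresses):
        for keyword in address.split():
            index.setdefault(keyword, set()).add(position)

    # Count how often each stored address appears in the entries for the keywords of s.
    counts: Counter = Counter()
    for keyword in dict.fromkeys(s.split()):
        for position in index.get(keyword, ()):
            counts[position] += 1

    # Q0: the m addresses with the highest counts (ties broken by position).
    q0 = sorted(counts, key=lambda position: (-counts[position], position))[:m]
    if not q0:
        return False  # assumption: no shared keyword means no match

    values = [edit_similarity(s, addresses[position]) for position in q0]
    return in_range(min(values))


# Made-up addresses; the range "greater than 0.75" is borrowed from the earlier example.
history = ["12 Main Street Springfield", "98 Oak Avenue Shelbyville"]
above_075 = lambda value: value > 0.75
typo_match = is_known_address("12 Main Street Springfeld", history, above_075)  # True
unrelated = is_known_address("7 Elm Road Ogdenville", history, above_075)       # False
```

Because the judgment is made on the minimum, its outcome depends on every selected candidate, so `m` and the range would need to be chosen together in practice.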

The above explains how S is compared with all or some of the addresses in Q. In addition, before this comparison, Q can be deduplicated by removing addresses that are similar to other addresses in Q. The criterion for judging similarity can be preset, and the judgment can follow the approach described earlier in this embodiment. This is described further below.

Let the historical address set before deduplication be P. First, one address is taken from P as the first element of the set Q. Another address is then taken from P as S, and the similarity degree between S and the addresses in Q is compared using the similarity-degree comparison described above (either comparing S directly with every element of Q, or first selecting Q₀ from Q and comparing S with it). If the result is that S is not similar, S is added to Q. Another address is then taken from P and compared with the addresses in the current Q in the same way. In this way some of the addresses in P enter Q; every historical address in P is eventually screened for similarity degree, so the addresses in the resulting Q have a relatively low similarity degree to one another.
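A minimal sketch of this deduplication, reading the procedure as comparing each address taken from P with the addresses already kept in Q. The similarity criterion is left to the caller, as the text says it can be preset; the keyword-overlap measure `keyword_overlap` and the 0.8 cut-off used in the example are illustrative assumptions, not values from the patent.

```python
from typing import Callable, List


def keyword_overlap(a: str, b: str) -> float:
    """Share of distinct keywords the two addresses have in common (Jaccard index)."""
    keywords_a, keywords_b = set(a.upper().split()), set(b.upper().split())
    if not keywords_a or not keywords_b:
        return 0.0
    return len(keywords_a & keywords_b) / len(keywords_a | keywords_b)


def deduplicate(p: List[str], is_similar: Callable[[str, str], bool]) -> List[str]:
    """Build Q from P, keeping an address only if it is not similar to any address already in Q."""
    q: List[str] = []
    for s in p:
        if not any(is_similar(s, kept) for kept in q):
            q.append(s)
    return q


# Made-up addresses, for illustration only.
p = [
    "12 Main Street Springfield",
    "12 MAIN STREET springfield",      # same keywords, different case
    "98 Oak Avenue Shelbyville",
    "12 Main Street Springfield USA",  # shares 4 of 5 distinct keywords with the first
]
q = deduplicate(p, lambda a, b: keyword_overlap(a, b) >= 0.8)
# q keeps the first and third addresses
```

This direct version compares each address with every element of Q; for a large P, the candidate selection through the inverted index described earlier could replace the inner scan.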

Processing with the inverted index and the similarity degree greatly reduces the amount of work, especially when the address list is very large. Without the inverted index, the address list must be traversed once, comparing every keyword of every address with every keyword of the pending address. With the inverted index, only the keywords in the index need to be compared with the keywords of the pending address; this is equivalent to building a cache for the address list and greatly speeds up processing. The similarity-degree step, in turn, sharply reduces the number of calls to the string comparison algorithm. Without it, the pending address must be compared with every address in the address list; with it, the pending address only needs to be compared with a few addresses in the list. String comparison algorithms are usually time-consuming, so the similarity-degree step greatly reduces the amount of computation.

The device for processing text in the embodiment of this application is explained below. Fig. 4 is a schematic diagram of the main modules of the device for processing text according to an embodiment of this application. As shown in Fig. 4, the device 40 for processing text mainly comprises a text chunk selection module 41, a computing module 42, and a judging module 43. The text chunk selection module 41 looks up all keywords of the pending text chunk in the inverted index, counts for each text chunk in the prestored text chunk set the number of times that text chunk or its identifier appears in the inverted index entries containing those keywords, and selects multiple text chunks from the text chunk set in descending order of this count. The inverted index comprises multiple entries; each entry holds one keyword together with all text chunks in the text chunk set that contain this keyword, or their identifiers, and the keywords in the entries come from the text chunk set. The computing module 42 calculates the similarity between the pending text chunk and each of the selected text chunks, obtaining multiple similarity values. The judging module 43 judges whether the minimum of the similarity values obtained by the computing module 42 is within the set range and, if so, outputs information with preset content.

The text chunk selection module 41 can also be used to calculate, with a string similarity comparison algorithm, the similarity between the pending text chunk and each of the selected text chunks. The judging module 43 can also be used to add the pending text chunk to the text chunk set when the judgment result is yes.

The device 40 for processing text can also comprise a preprocessing module (not shown), which, before the computing module 42 performs its calculation, removes from the text chunk set, according to a preset criterion for judging string similarity, text chunks that are similar to other text chunks.

The device 40 for processing text can also comprise a storage module (not shown), which stores the historical user mailing addresses in an e-commerce computer system as the text chunk set; correspondingly, the judging module 43 can also be used to output historical address information, which indicates that the pending text chunk belongs to the historical user mailing addresses. The storage module can also be used to store the blacklist mailing addresses in an e-commerce computer system as the text chunk set; correspondingly, the judging module 43 can also be used to output blacklisted-user information, which indicates that the pending text chunk belongs to the blacklist mailing addresses.

According to the technical solution of the embodiment of this application, each address is matched against the historical addresses, and similarity is used to measure whether a changed address is the same as the original address, which helps improve text detection. Modifying a few characters of an address to evade system detection is widespread in e-commerce; the technical solution of the embodiment helps identify similar addresses and thus improves the system's detection of fraudulent addresses. Moreover, in this embodiment an inverted index records the addresses and the keywords in the addresses; multiple stored addresses are selected according to the number of times each appears in the index entries relevant to the current pending address, and similarity is then calculated between these addresses and the current pending address. This greatly speeds up the similarity calculation, so that whether the current pending address is a historical address or a blacklisted address can be confirmed quickly, improving the processing capability of the e-commerce system.

Obviously, those skilled in the art should understand that the modules or steps of this application described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; or they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The application is thus not limited to any specific combination of hardware and software.

The above describes only preferred embodiments of this application and does not limit it; for those skilled in the art, this application can have various modifications and variations. Any modification, equivalent replacement, or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims

1. A method for processing text, characterized by comprising:

looking up, in an inverted index, the keywords of a pending text chunk; counting, for each text chunk in a prestored text chunk set, the number of times that text chunk or its identifier appears in the entries containing the keywords; and selecting multiple text chunks from the prestored text chunk set in descending order of this count, wherein the inverted index is built over the prestored text chunk set and comprises multiple entries, each entry comprising one keyword and correspondingly storing the text chunks containing that keyword or the identifiers of those text chunks;

calculating the similarity between the pending text chunk and each of the selected text chunks to obtain multiple similarity values;

judging whether the minimum of the multiple similarity values is within a set range and, if so, outputting information with preset content;

wherein calculating the similarity between the pending text chunk and each of the selected text chunks comprises: using a string similarity comparison algorithm to calculate the similarity between the pending text chunk and each of the selected text chunks.

2. The method according to claim 1, characterized in that, when the result of the judgment is yes, the pending text chunk is added to the text chunk set.

3. The method according to claim 1, characterized in that, before calculating the similarity between the pending text chunk and multiple text chunks in the prestored text chunk set, the method also comprises: removing from the text chunk set, according to a preset criterion for judging string similarity, text chunks that are similar to other text chunks.

4. The method according to any one of claims 1 to 3, characterized in that

the prestored text chunk set is the historical user mailing addresses in an e-commerce computer system;

the information with preset content comprises historical address information, which indicates that the pending text chunk belongs to the historical user mailing addresses.

5. The method according to any one of claims 1 to 3, characterized in that

the prestored text chunk set is the blacklist mailing addresses in an e-commerce computer system;

the information with preset content comprises blacklisted-user information, which indicates that the pending text chunk belongs to the blacklist mailing addresses.

6. A device for processing text, characterized by comprising:

a text chunk selection module, configured to look up, in an inverted index, the keywords of a pending text chunk, count for each text chunk in a prestored text chunk set the number of times that text chunk or its identifier appears in the entries containing the keywords, and select multiple text chunks from the prestored text chunk set in descending order of this count, wherein the inverted index is built over the prestored text chunk set and comprises multiple entries, each entry comprising one keyword and correspondingly storing the text chunks containing that keyword or the identifiers of those text chunks;

a computing module, configured to calculate the similarity between the pending text chunk and each of the selected text chunks to obtain multiple similarity values;

a judging module, configured to judge whether the minimum of the multiple similarity values is within a set range and, if so, output information with preset content;

wherein the text chunk selection module is also configured to use a string similarity comparison algorithm to calculate the similarity between the pending text chunk and each of the selected text chunks.

7. The device according to claim 6, characterized in that the judging module is also configured to add the pending text chunk to the text chunk set when the judgment result is yes.

8. The device according to claim 6, characterized by also comprising a preprocessing module, configured to remove from the text chunk set, before the computing module performs its calculation and according to a preset criterion for judging string similarity, text chunks that are similar to other text chunks.

9. The device according to any one of claims 6 to 8, characterized in that

the device also comprises a storage module, configured to store the historical user mailing addresses in an e-commerce computer system as the text chunk set;

the judging module is also configured to output historical address information, which indicates that the pending text chunk belongs to the historical user mailing addresses.

10. The device according to any one of claims 6 to 8, characterized in that

the device also comprises a storage module, configured to store the blacklist mailing addresses in an e-commerce computer system as the text chunk set;

the judging module is also configured to output blacklisted-user information, which indicates that the pending text chunk belongs to the blacklist mailing addresses.