CN102929891B - Method and apparatus for processing text - Google Patents


Publication number
CN102929891B
CN102929891B CN201110230270.8A CN201110230270A CN102929891B CN 102929891 B CN102929891 B CN 102929891B CN 201110230270 A CN201110230270 A CN 201110230270A CN 102929891 B CN102929891 B CN 102929891B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201110230270.8A
Other languages
Chinese (zh)
Other versions
CN102929891A (en)
Inventor
许泰清
徐磊石
胡四海
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Group Holding Ltd
Application filed by Alibaba Group Holding Ltd
Priority to CN201110230270.8A
Publication of CN102929891A
Priority to HK13103758.1A (HK1176432A1)
Application granted
Publication of CN102929891B

Abstract

This application provides a method and apparatus for processing text, to solve the problem that text detection in the prior art performs poorly. The method comprises: looking up the keywords of a text chunk to be processed in an inverted index; counting, for each text chunk in a prestored text chunk set, the number of times that text chunk or its identifier appears in the entries containing those keywords; and selecting multiple text chunks from the prestored text chunk set in descending order of that count. The inverted index is built over the prestored text chunk set and comprises multiple entries, each of which contains one keyword and stores the text chunks, or the identifiers of the text chunks, that contain that keyword. The method further comprises: calculating the similarity between the text chunk to be processed and each of the selected text chunks to obtain multiple similarity values; and judging whether the minimum of those similarity values falls within a set range and, if so, outputting information with preset content.

Description

Method and apparatus for processing text
Technical field
This application relates to computer technology, and in particular to a method and apparatus for processing text.
Background
On the Internet, text processing is usually needed to prevent the spread of junk or harmful information. For example, in an anti-spam setup, a mail receiving apparatus such as mail client software exactly matches the address of an incoming message against prestored blacklist addresses and rejects the message if all characters of the two are identical. The text processed in this case is an e-mail address. As another example, some users of an e-commerce system may commit fraud; to limit fraud, the addresses these users leave (usually mailing addresses) need to be checked. An address blacklist is currently used here as well: each address is matched exactly, and if all characters of the address are identical to all characters of at least one address in the blacklist, the user is considered to be suspected of fraud.
In this "address blacklist" scenario, some e-mail senders or e-commerce users evade detection by altering their addresses. The usual approach is to change a few characters in the address text, and the detection method described above cannot detect such an address.
Text processing also encounters the "historical address comparison" scenario, in which a given address is checked against an existing address list to determine whether a similar address has appeared before, so that statistics such as the number of times different addresses appear can be analyzed. Traditional exact matching cannot recognize the altered addresses mentioned above, so the result of the judgment is inaccurate.
In the "address blacklist" scenario, the result of text processing is used to decide whether the address under examination is suspected of fraud. Existing exact matching can only handle identical addresses; if a few characters of an address are modified, the modified address cannot be detected directly, and the blacklist loses its real effect. Moreover, because the blacklist must be maintained manually, even if every modified address could be obtained and added to it, the list would become very large and hard to maintain.
In the "historical address comparison" scenario, the result of text processing is used to decide whether the address under examination is a historical address that has appeared before, and to compute indicators such as the number of times different addresses appear. Existing matching techniques can likewise only match identical addresses, so two similar addresses are judged to be two different addresses even though they are in fact the same address. Traditional processing therefore makes the address analysis inaccurate.
For these two application scenarios, existing text detection methods perform poorly, and no effective solution has yet been proposed.
Summary of the invention
The main purpose of this application is to provide a method and apparatus for processing text, to solve the problem that text detection in the prior art performs poorly.
To achieve this purpose, according to one aspect of this application, a method for processing text is provided.
The method for processing text of this application comprises: looking up the keywords of a text chunk to be processed in an inverted index; counting, for each text chunk in a prestored text chunk set, the number of times that text chunk or its identifier appears in the entries containing the keywords; selecting multiple text chunks from the prestored text chunk set in descending order of that count, where the inverted index is built over the prestored text chunk set and comprises multiple entries, each entry containing one keyword and storing the text chunks, or the identifiers of the text chunks, that contain that keyword; calculating the similarity between the text chunk to be processed and each of the selected text chunks to obtain multiple similarity values; and judging whether the minimum of the similarity values falls within a set range and, if so, outputting information with preset content.
Further, calculating the similarity between the text chunk to be processed and each of the selected text chunks comprises: using a string similarity comparison algorithm to calculate the similarity between the text chunk to be processed and each of the selected text chunks.
Further, when the result of the judgment is yes, the text chunk to be processed is added to the text chunk set.
Further, before the similarity between the text chunk to be processed and the multiple text chunks in the prestored text chunk set is calculated, the method also comprises: removing, according to a preset criterion for string similarity, text chunks in the text chunk set that are similar to other text chunks.
Further, the prestored text chunk set is the historical user mailing addresses in the computer system of an e-commerce service; the information with preset content comprises historical address information, which expresses that the text chunk to be processed belongs to the historical user mailing addresses.
Further, the prestored text chunk set is the blacklist mailing addresses in the computer system of an e-commerce service; the information with preset content comprises blacklisted-user information, which expresses that the text chunk to be processed belongs to the blacklist mailing addresses.
According to another aspect of this application, an apparatus for processing text is provided.
The apparatus for processing text of this application comprises: a text chunk selection module, configured to look up all keywords of a text chunk to be processed in an inverted index, count, for each text chunk in a prestored text chunk set, the number of times that text chunk or its identifier appears in the entries containing the keywords, and select multiple text chunks from the prestored text chunk set in descending order of that count, where the inverted index is built over the prestored text chunk set and comprises multiple entries, each entry containing one keyword and storing the text chunks, or the identifiers of the text chunks, that contain that keyword; a calculation module, configured to calculate the similarity between the text chunk to be processed and each of the selected text chunks to obtain multiple similarity values; and a judgment module, configured to judge whether the minimum of the similarity values falls within a set range and, if so, output information with preset content.
Further, the calculation module is also configured to use a string similarity comparison algorithm to calculate the similarity between the text chunk to be processed and each of the selected text chunks.
Further, the judgment module is also configured to add the text chunk to be processed to the text chunk set when the result of the judgment is yes.
Further, the apparatus also comprises a preprocessing module, configured to remove, before the calculation module performs its calculation and according to a preset criterion for string similarity, text chunks in the text chunk set that are similar to other text chunks.
Further, the apparatus also comprises a storage module, configured to store the historical user mailing addresses in the computer system of an e-commerce service as the text chunk set; the judgment module is also configured to output historical address information, which expresses that the text chunk to be processed belongs to the historical user mailing addresses.
Further, the apparatus also comprises a storage module, configured to store the blacklist mailing addresses in the computer system of an e-commerce service as the text chunk set; the judgment module is also configured to output blacklisted-user information, which expresses that the text chunk to be processed belongs to the blacklist mailing addresses.
According to the technical solution of this application, an inverted index records the addresses and the keywords in them. Multiple stored addresses are selected according to how often each of them appears in the index entries for the keywords of the address currently being processed, and similarity is then calculated only between these addresses and the current address. This greatly speeds up the similarity calculation, so that it can be confirmed quickly whether the current address is a historical address or a blacklist address, improving the processing capability of the e-commerce system.
Brief description of the drawings
The accompanying drawings provide a further understanding of this application and form a part of it. The illustrative embodiments of this application and their description serve to explain the application and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the main steps of the method for processing text according to an embodiment of this application;
Fig. 2 is a schematic diagram of the content of the inverted index according to an embodiment of this application;
Fig. 3 is a schematic diagram of address similarity-degree statistics computed with the inverted index according to an embodiment of this application;
Fig. 4 is a schematic diagram of the main modules of the apparatus for processing text according to an embodiment of this application.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in them may be combined with one another. The application is described in detail below with reference to the drawings and in conjunction with the embodiments.
Fig. 1 is a flowchart of the main steps of the method for processing text according to an embodiment of this application. As shown in Fig. 1, the method mainly comprises the following steps:
Step S11: obtain the text to be processed.
Step S13: calculate the similarity between the text chunk to be processed and multiple text chunks in the prestored text chunk set, obtaining multiple similarity values.
Any string similarity comparison algorithm, existing or developed in the future, may be used for the calculation in this step, for example the Levenshtein distance algorithm, the LCS algorithm, or the vector product algorithm. Such an algorithm computes a distance from two given strings; the distance is a decimal between 0 and 1, and the larger the value, the more the two strings differ. In this embodiment, the similarity between two addresses is obtained by subtracting this distance from 1, so the similarity is also a decimal between 0 and 1. The larger the similarity, the more alike the two strings are: a similarity of 1 means the two strings are identical, and a similarity of 0 means they are completely different.
Step S15: judge whether the minimum of the similarity values obtained falls within a set range; if so, go to step S17, otherwise go to step S19.
The calculation in step S13 yields similarity values between 0 and 1. In this embodiment, a range is set for the similarity value and serves as the basis for the decision.
Step S17: output information with preset content. The preset content in this step is determined by the application scenario. For example, in the "address blacklist" scenario, the similarity range set in step S15 is greater than 0.7; if the similarity value is greater than 0.7, the address represented by the text to be processed is considered to be suspected of fraud, and the output may be blacklisted-user information expressing that the text chunk to be processed belongs to the blacklist mailing addresses, for example: "The current address is a fraud address". As another example, in the "historical address comparison" scenario, the similarity range set in step S15 is greater than 0.75; if the similarity value is less than 0.75, the address represented by the text to be processed is considered to have appeared in the historical address list, and the output expresses that the text chunk to be processed belongs to the historical user mailing addresses, for example: "The current address is a historical address".
Step S19: obtain the next text chunk to be processed, then return to step S13. If all text chunks to be processed have been handled, the current flow ends.
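The distance-to-similarity conversion described for step S13 can be sketched in Python as follows. This is a minimal illustration and not the patent's own implementation: it uses the Levenshtein distance, and it assumes the distance is scaled into the range 0 to 1 by dividing by the length of the longer string, since the text only states that the distance lies between 0 and 1. The function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity = 1 - distance, with the distance scaled into [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are identical
    # Assumption: normalize by the longer string's length.
    distance = levenshtein(a, b) / longest
    return 1.0 - distance
```

With this normalization, identical strings score 1 and strings of equal length that share no character position score 0, which matches the end points described above.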
As the steps above show, this embodiment processes text by calculating the similarity between strings, which avoids the limitations of exact string matching. In the "address blacklist" scenario described above, users do in practice evade detection in e-commerce by changing some characters of an address; the scheme of this embodiment can find addresses in which some characters have been changed, and so helps detect address fraud more thoroughly. The range in step S15 can be set according to the actual conditions of the system and the experience of the system administrators, so that administrators can have the system meet their own practical needs, such as checking for address fraud or confirming that an address is a historical address.
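The decision in steps S15 and S17 can be sketched as a small Python function. This is only an illustration under stated assumptions: the set range is passed in as a predicate so that administrators can configure it, and the function name `decide` and its parameters are hypothetical rather than taken from the patent.

```python
from typing import Callable, Optional, Sequence


def decide(similarities: Sequence[float],
           in_set_range: Callable[[float], bool],
           preset_message: str) -> Optional[str]:
    """Return the preset message if the minimum similarity is in the set range."""
    if not similarities:
        return None  # nothing was compared, so nothing is output
    lowest = min(similarities)  # step S15 examines the minimum value
    return preset_message if in_set_range(lowest) else None


# "Address blacklist" scenario from the text: the set range is "greater than 0.7".
result = decide([0.90, 0.82, 0.74],
                lambda value: value > 0.7,
                "The current address is a fraud address")
print(result)
```

Passing the range as a predicate keeps the threshold outside the comparison logic, so the same function serves both application scenarios.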
The method for processing text in this embodiment is further described below, taking as an example the case where the text chunk is a user mailing address in e-commerce. A user mailing address is an address provided by an e-commerce user, such as the shipping address filled in when buying goods. It usually consists of multiple keywords that identify a specific geographic location, such as country, region, street and house number.
First a list of addresses is obtained; the addresses in this list are compared with the address currently being processed. In the "address blacklist" scenario this list is the list of fraud addresses; in the "historical address comparison" scenario it is the list of historical addresses. Step S13 above is described in detail below, taking the "historical address comparison" scenario as an example.
In step S13, the similarity between the text chunk to be processed and multiple text chunks in the prestored text chunk set is calculated. Here the text chunk to be processed is the mailing address to be processed, and the prestored text chunk set is the blacklist in the "address blacklist" scenario or the address set in the "historical address comparison" scenario. Let this set be Q and the address currently being processed be the string S. In step S13, similarity may be calculated between S and every address in Q; alternatively, the several addresses in Q most similar to S may be obtained first (let these addresses form Q0) and similarity then calculated between S and the addresses in Q0. The latter approach speeds up the comparison. Before the calculation, S may be tokenized and normalized. The tokenization and normalization steps depend on the language of the string; for English, for example, keywords can be separated at spaces and all lowercase letters converted to uppercase.
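For English addresses, the tokenization and normalization described above can be sketched in one line of Python. The function name is hypothetical, and handling of punctuation is deliberately left out because the text does not specify it.

```python
from typing import List


def tokenize_and_normalize(address: str) -> List[str]:
    """Convert an English address to uppercase and split it at whitespace."""
    return address.upper().split()


print(tokenize_and_normalize("12 main St  springfield"))
# prints ['12', 'MAIN', 'ST', 'SPRINGFIELD']
```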
To obtain the "several most similar addresses" in Q, a way of comparing similarity degree can be fixed in advance as the principle for comparing S with the addresses in Q. All addresses in Q are compared with S, and the m addresses most similar to S (for example 10) are then selected from Q, in descending order of similarity degree, to form Q0. The similarity degree here differs from the similarity described earlier. In this embodiment, the similarity value is calculated by a string similarity comparison algorithm, whereas the similarity degree is obtained by a comparison whose principle depends on the language of the address; for English, for example, it can be the largest number of keywords in common. S can be compared directly with the addresses in Q, or the comparison can be accelerated with an inverted index. The latter approach is explained below.
Each address in Q consists of several keywords. An inverted index is built over the keywords that appear in Q: each keyword corresponds to a list of addresses, or of address identifiers, and every address in that list contains the keyword. Fig. 2 is a schematic diagram of the content of the inverted index according to an embodiment of this application. As shown in Fig. 2, the left column of the table holds the keywords (shown in the figure as "keyword 1", "keyword 2", ..., "keyword N"), and the right column holds the serial numbers of the addresses that contain the keyword (a serial number can be assigned to each address in Q), such as "address 1" and "address 2". One keyword may correspond to the serial numbers of one or more addresses. The right column of the table may of course also hold the addresses themselves.
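An inverted index of the kind shown in Fig. 2 can be sketched in Python as a mapping from each keyword to the set of serial numbers of the addresses that contain it. This is an illustrative sketch: the function name is hypothetical, serial numbers are assumed to start at 1, and keywords are assumed to be obtained by uppercasing and splitting at spaces as in the English example.

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(addresses: List[str]) -> Dict[str, Set[int]]:
    """Map every keyword in Q to the serial numbers of addresses containing it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for serial, address in enumerate(addresses, start=1):
        for keyword in address.upper().split():
            index[keyword].add(serial)
    return dict(index)


q = ["12 Main St", "12 Oak St"]
index = build_inverted_index(q)
print(sorted(index["ST"]))   # serial numbers of addresses containing "ST"
```

Storing serial numbers rather than the addresses themselves keeps each entry small; the text notes that either choice is possible.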
Next, all keywords of S are looked up in the inverted index, and for each text chunk in Q, the number of times the text chunk or its identifier appears in the index entries containing those keywords is counted. This counting is illustrated in Fig. 3, which is a schematic diagram of address similarity-degree statistics computed with the inverted index according to an embodiment of this application.
For ease of description, suppose the address currently being processed contains three keywords, keyword 1, keyword 2 and keyword 3, and that Q contains four addresses, address 1 to address 4. Box 31 on the left of Fig. 3 shows part of the inverted index, including keyword 1, keyword 2 and keyword 3; the entry for each of these three keywords contains one or more of address 1 to address 4. Box 32 on the right of Fig. 3 shows the statistics for each address: the left column lists the addresses and the right column lists the resulting indicator, which is the number of times the address appears across the entries for keyword 1, keyword 2 and keyword 3 (that is, all keywords of the address currently being processed).
Specifically, as shown in Fig. 3, address 1 appears in the entries for keyword 1 and keyword 3, so its indicator is 2; address 2 appears in the entries for keyword 1, keyword 2 and keyword 3, so its indicator is 3. Address 3 and address 4 are counted in the same way, and their indicators are 2 and 1 respectively, as shown in box 32. Because the indicator of address 2 is 3, which is greater than that of any other address, address 2 has the highest similarity degree to the address currently being processed. This simple example obtains from Q the single address most similar to the address being processed. In practice the number of addresses in Q is large, and multiple addresses, for example 10, can be obtained from Q in descending order of similarity degree. These addresses have a relatively high similarity degree to the address currently being processed (S); they are the "several most similar addresses" mentioned above, that is, Q0.
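The counting illustrated in Fig. 3 can be sketched in Python as follows. The index contents are an assumption chosen to reproduce the indicators stated above (2, 3, 2 and 1); the text does not say which entries contain address 3 and address 4. The function names, the tie-breaking rule (lower serial number first) and the de-duplication of query keywords are also assumptions made for this sketch.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set


def count_hits(index: Dict[str, Set[int]], keywords: Iterable[str]) -> Counter:
    """Count how many of the query's keyword entries each address appears in."""
    hits: Counter = Counter()
    for keyword in set(keywords):          # each query keyword is looked up once
        for serial in index.get(keyword, ()):
            hits[serial] += 1
    return hits


def select_candidates(index: Dict[str, Set[int]],
                      keywords: Iterable[str], m: int) -> List[int]:
    """Return the m addresses with the highest counts (this forms Q0)."""
    hits = count_hits(index, keywords)
    ranked = sorted(hits.items(), key=lambda item: (-item[1], item[0]))
    return [serial for serial, _ in ranked[:m]]


# Assumed index contents consistent with the indicators given for Fig. 3.
fig3_index = {
    "keyword 1": {1, 2, 3},
    "keyword 2": {2, 4},
    "keyword 3": {1, 2, 3},
}
query = ["keyword 1", "keyword 2", "keyword 3"]
print(dict(count_hits(fig3_index, query)))
print(select_candidates(fig3_index, query, m=1))
```

Only the entries for the query's own keywords are visited, so the cost depends on the length of those entries rather than on the size of Q.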
After Q0 is obtained, similarity is calculated between S and each address in Q0 in turn, yielding multiple similarity values; this completes step S13. In step S15 that follows, the minimum of the similarity values is found; if this minimum is less than a preset value, S is considered to be a historical address. Otherwise S is not a historical address, and S may then be added to Q to update the historical address list.
The above describes how S is compared with all or some of the addresses in Q. In addition, before this comparison, Q may be deduplicated by removing addresses that are similar to other addresses in Q. The criterion for "similar" here can be preset, and the comparison can be carried out in the way described earlier in this embodiment. This is explained further below.
Let the historical address set before deduplication be P. First take one address from P as the first element of set Q. Then take another address from P as S and compare S with the addresses in Q using the similarity-degree comparison described above (either comparing S directly with every element of Q, or first selecting Q0 from Q and comparing against it). If the comparison finds S dissimilar, add S to Q. Then take another address from P and compare it with the addresses in the current Q in the same way. In this way some of the addresses in P enter Q; once every historical address in P has been screened for similarity degree, the addresses in the resulting Q have a low degree of similarity to one another.
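The deduplication pass from P to Q can be sketched in Python as a single loop. This is a minimal sketch, not the patent's implementation: the similarity criterion is passed in as a function because the text leaves it to be preset, and the toy criterion `differs_by_at_most_one_char` is a hypothetical stand-in used only to make the example runnable. It compares S directly with every element of Q, the simpler of the two options described above.

```python
from typing import Callable, List


def deduplicate(p: List[str],
                is_similar: Callable[[str, str], bool]) -> List[str]:
    """Build Q from P, keeping an address only if it resembles nothing in Q."""
    q: List[str] = []
    for s in p:
        # The first address always enters Q because Q is still empty.
        if not any(is_similar(s, kept) for kept in q):
            q.append(s)
    return q


def differs_by_at_most_one_char(a: str, b: str) -> bool:
    """Toy criterion (assumption): same length and at most one differing position."""
    return len(a) == len(b) and sum(x != y for x, y in zip(a, b)) <= 1


p = ["12 MAIN ST", "12 MAIM ST", "9 OAK AVE"]
print(deduplicate(p, differs_by_at_most_one_char))
```

The result depends on the order in which addresses are taken from P, since each address is compared only with those already accepted into Q.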
Using the inverted index together with the similarity degree greatly reduces the amount of processing, especially when the address list is very large. Without the inverted index, the address list must be traversed once and every keyword of every address compared with every keyword of the address being processed. With the inverted index, only the keywords in the index need to be compared with the keywords of the address being processed; this is equivalent to building a cache for the address list and greatly speeds up processing. The similarity-degree step also sharply reduces the number of calls to the string comparison algorithm. Without it, the address being processed must be compared with every address in the address list; with it, the address only needs to be compared with a few addresses in the list. String comparison algorithms are usually time-consuming, so the similarity-degree step greatly reduces the amount of computation.
The apparatus for processing text in the embodiments of this application is described below. Fig. 4 is a schematic diagram of the main modules of the apparatus for processing text according to an embodiment of this application. As shown in Fig. 4, the apparatus 40 for processing text mainly comprises a text chunk selection module 41, a calculation module 42 and a judgment module 43. The text chunk selection module 41 is configured to look up all keywords of the text chunk to be processed in the inverted index, count, for each text chunk in the prestored text chunk set, the number of times the text chunk or its identifier appears in the index entries containing those keywords, and select multiple text chunks from the text chunk set in descending order of that count. The inverted index comprises multiple entries; each entry holds one keyword and all text chunks in the text chunk set, or their identifiers, that contain that keyword, and the keywords in the entries come from the text chunk set. The calculation module 42 is configured to calculate the similarity between the text chunk to be processed and each of the selected text chunks, obtaining multiple similarity values. The judgment module 43 is configured to judge whether the minimum of the similarity values obtained by the calculation module 42 falls within a set range and, if so, to output information with preset content.
The text chunk selection module 41 may also be configured to use a string similarity comparison algorithm to calculate the similarity between the text chunk to be processed and each of the selected text chunks. The judgment module 43 may also be configured to add the text chunk to be processed to the text chunk set when the result of the judgment is yes.
The apparatus 40 for processing text may also comprise a preprocessing module (not shown), configured to remove, before the calculation module 42 performs its calculation and according to a preset criterion for string similarity, text chunks in the text chunk set that are similar to other text chunks.
The apparatus 40 for processing text may also comprise a storage module (not shown), configured to store the historical user mailing addresses in the computer system of an e-commerce service as the text chunk set; correspondingly, the judgment module 43 may also be configured to output historical address information, which expresses that the text chunk to be processed belongs to the historical user mailing addresses. The storage module may also be configured to store the blacklist mailing addresses in the computer system of an e-commerce service as the text chunk set; correspondingly, the judgment module 43 may also be configured to output blacklisted-user information, which expresses that the text chunk to be processed belongs to the blacklist mailing addresses.
According to the technical solution of the embodiments of this application, each address is matched against the historical addresses, and similarity is used to measure whether a changed address is the same as the original address, which helps improve the effect of text detection. Modifying a few characters of an address to evade system detection is widespread in e-commerce; the technical solution of the embodiments helps identify similar addresses and thereby improves the system's ability to detect fraud addresses. Moreover, in this embodiment an inverted index records the addresses and the keywords in them. Multiple stored addresses are selected according to how often each of them appears in the index entries for the keywords of the address currently being processed, and similarity is then calculated only between these addresses and the current address. This greatly speeds up the similarity calculation, so that it can be confirmed quickly whether the current address is a historical address or a blacklist address, improving the processing capability of the e-commerce system.
Those skilled in the art will understand that the modules or steps of this application described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The application is therefore not limited to any specific combination of hardware and software.
The foregoing describes only preferred embodiments of this application and does not limit the application. For those skilled in the art, the application may have various modifications and variations. Any modification, equivalent replacement or improvement made within the spirit and principles of this application shall fall within its scope of protection.

Claims (10)

1. process a method for text, it is characterized in that, comprising:
The keyword in pending text chunk is searched in inverted index, each text chunk in the text chunk set that statistics prestores or the number of times identified in the entry comprising now described keyword of text section, from the described text chunk set prestored, multiple text chunk is selected by this number of times order from high to low, described inverted index is the inverted index set up the text chunk set prestored, it comprises multiple entry, each entry comprises a keyword, and correspondence preserves the mark of text chunk or the text section comprising this keyword;
Calculate the similarity between each text chunk in multiple text chunks of pending text chunk and selection, obtain the value of multiple similarity;
Judge whether the minimum value in the value of described multiple similarity is in setting range, if so, then exports the information of preset content;
Wherein, the similarity between each text chunk in the text chunk that described calculating is pending and multiple text chunks of selection comprises: the similarity between each text chunk using the algorithm of similarity of character string comparison to calculate in multiple text chunks of pending text chunk and selection.
2. method according to claim 1, is characterized in that, when the result of described judgement is for being, is added in described text chunk set by described pending text chunk.
3. method according to claim 1, it is characterized in that, before similarity between multiple text chunks in the text chunk that described calculating is pending and the text chunk set that prestores, also comprise: according to the judgment criterion that the character string preset is similar, remove text chunk similar to other text chunks in described text chunk set.
4. The method according to any one of claims 1 to 3, characterized in that,
the pre-stored text chunk set consists of the historical user mailing addresses in a computer system for e-commerce;
the information of preset content comprises historical address information, which indicates that the pending text chunk belongs to the historical user mailing addresses.
5. The method according to any one of claims 1 to 3, characterized in that,
the pre-stored text chunk set consists of the blacklisted mailing addresses in a computer system for e-commerce;
the information of preset content comprises blacklisted user information, which indicates that the pending text chunk belongs to the blacklisted mailing addresses.
6. A device for processing text, characterized by comprising:
a text chunk selection module, configured to look up, in an inverted index, the keywords in a pending text chunk, to count the number of times each text chunk in a pre-stored text chunk set, or the identifier of that text chunk, appears in the entries containing the keywords, and to select multiple text chunks from the pre-stored text chunk set in descending order of that count, wherein the inverted index is built over the pre-stored text chunk set and comprises multiple entries, each entry comprising one keyword and correspondingly storing the text chunks, or the identifiers of the text chunks, that contain that keyword;
a computing module, configured to calculate the similarity between the pending text chunk and each of the selected text chunks to obtain multiple similarity values;
a judging module, configured to judge whether the minimum of the multiple similarity values falls within a set range and, if so, to output information of preset content;
wherein the text chunk selection module is further configured to use a string similarity comparison algorithm to calculate the similarity between the pending text chunk and each of the selected text chunks.
7. The device according to claim 6, characterized in that the judging module is further configured to add the pending text chunk to the text chunk set when the result of the judging is yes.
8. The device according to claim 6, characterized by further comprising a preprocessing module, configured to remove from the text chunk set, before the computing module performs its calculation and according to a preset criterion for judging string similarity, the text chunks that are similar to other text chunks in the set.
9. The device according to any one of claims 6 to 8, characterized in that,
the device further comprises a saving module, configured to save the historical user mailing addresses in a computer system for e-commerce as the text chunk set;
the judging module is further configured to output historical address information, which indicates that the pending text chunk belongs to the historical user mailing addresses.
10. The device according to any one of claims 6 to 8, characterized in that,
the device further comprises a saving module, configured to save the blacklisted mailing addresses in a computer system for e-commerce as the text chunk set;
the judging module is further configured to output blacklisted user information, which indicates that the pending text chunk belongs to the blacklisted mailing addresses.
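
As a rough illustration of the method of claim 1, the following Python sketch builds an inverted index over a small stored set, selects candidates by keyword hit count, and tests the minimum edit distance against a threshold. It is not the patented implementation: whitespace tokenization, Levenshtein distance as the string similarity comparison algorithm, the range [0, max_distance], the returned message, and all function names and parameter values are assumptions made for illustration.

```python
from collections import Counter, defaultdict


def keywords(text):
    # Assumption: keywords are lower-cased, whitespace-separated tokens.
    return set(text.lower().split())


def build_inverted_index(chunks):
    # Each entry maps one keyword to the identifiers of the chunks containing it.
    index = defaultdict(set)
    for chunk_id, chunk in enumerate(chunks):
        for word in keywords(chunk):
            index[word].add(chunk_id)
    return index


def select_candidates(index, pending, top_n):
    # Count how often each chunk identifier appears in the entries of the
    # pending chunk's keywords, then keep the top_n by descending count.
    hits = Counter()
    for word in keywords(pending):
        hits.update(index.get(word, ()))
    return [chunk_id for chunk_id, _ in hits.most_common(top_n)]


def edit_distance(a, b):
    # Levenshtein distance, computed row by row with dynamic programming.
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def check_pending(chunks, index, pending, top_n=3, max_distance=3):
    # Hypothetical end-to-end check: returns a message when the minimum
    # similarity value lies in the assumed range [0, max_distance].
    candidates = select_candidates(index, pending, top_n)
    if not candidates:
        return None
    distances = [edit_distance(pending, chunks[c]) for c in candidates]
    if min(distances) <= max_distance:
        return "pending text chunk matches a stored address"
    return None


stored = [
    "12 west lake road hangzhou",
    "88 nanjing road shanghai",
    "5 zhongshan road guangzhou",
]
stored_index = build_inverted_index(stored)
print(check_pending(stored, stored_index, "12 west lake rd hangzhou"))
```

Because only the few chunks with the most keyword hits reach the edit-distance step, the comparatively expensive string comparison runs on a short candidate list rather than on the whole stored set. With edit distance as the measure, a smaller value means a closer match, which fits the claim's test on the minimum value.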
CN201110230270.8A 2011-08-11 2011-08-11 The method and apparatus of process text Active CN102929891B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201110230270.8A CN102929891B (en) 2011-08-11 2011-08-11 The method and apparatus of process text
HK13103758.1A HK1176432A1 (en) 2011-08-11 2013-03-26 Method and device for processing text

Publications (2)

Publication Number Publication Date
CN102929891A CN102929891A (en) 2013-02-13
CN102929891B true CN102929891B (en) 2015-09-16

Family

ID=47644690

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201110230270.8A Active CN102929891B (en) 2011-08-11 2011-08-11 The method and apparatus of process text

Country Status (2)

Country Link
CN (1) CN102929891B (en)
HK (1) HK1176432A1 (en)

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104624509B (en) * 2015-01-16 2017-06-13 浙江百世技术有限公司 A kind of express delivery Automated Sorting System and automatic sorting method
CN106302202B (en) * 2015-05-15 2020-07-28 阿里巴巴集团控股有限公司 Data current limiting method and device
CN105468742B (en) * 2015-11-25 2018-11-20 小米科技有限责任公司 The recognition methods of malice order and device
CN106961492A (en) * 2017-04-21 2017-07-18 广东浪潮大数据研究有限公司 IP address duplicate checking method and apparatus under a kind of linux system
CN108416062A (en) * 2018-03-26 2018-08-17 国家电网公司客户服务中心 A kind of electric network data correlating method based on address matching technology
CN110147531B (en) * 2018-06-11 2024-04-23 广州腾讯科技有限公司 Method, device and storage medium for identifying similar text content
CN112598321A (en) * 2018-07-10 2021-04-02 创新先进技术有限公司 Risk prevention and control method, system and terminal equipment
CN110866407B (en) * 2018-08-17 2024-03-01 阿里巴巴集团控股有限公司 Analysis method, device and equipment for determining similarity between text of mutual translation
CN109829150B (en) * 2018-11-27 2023-11-14 创新先进技术有限公司 Insurance claim text processing method and apparatus
CN112148843B (en) * 2020-11-25 2021-05-07 中电科新型智慧城市研究院有限公司 Text processing method and device, terminal equipment and storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1928864A (en) * 2006-09-22 2007-03-14 浙江大学 FAQ based Chinese natural language ask and answer method
CN101308496A (en) * 2008-07-04 2008-11-19 沈阳格微软件有限责任公司 Large scale text data external clustering method and system
CN101373532A (en) * 2008-07-10 2009-02-25 昆明理工大学 FAQ Chinese request-answering system implementing method in tourism field

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050027717A1 (en) * 2003-04-21 2005-02-03 Nikolaos Koudas Text joins for data cleansing and integration in a relational database management system
US8645417B2 (en) * 2008-06-18 2014-02-04 Microsoft Corporation Name search using a ranking function

Also Published As

Publication number Publication date
CN102929891A (en) 2013-02-13
HK1176432A1 (en) 2013-07-26

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
REG Reference to a national code
  Ref country code: HK; Ref legal event code: DE; Ref document number: 1176432; Country of ref document: HK
C14 Grant of patent or utility model
GR01 Patent grant
REG Reference to a national code
  Ref country code: HK; Ref legal event code: GR; Ref document number: 1176432; Country of ref document: HK
TR01 Transfer of patent right
  Effective date of registration: 20200821
  Address after: Building 8, No. 16, Zhuantang science and technology economic block, Xihu District, Hangzhou City, Zhejiang Province
  Patentee after: ALIYUN COMPUTING Co.,Ltd.
  Address before: A four-storey 847 mailbox in Grand Cayman Capital Building, British Cayman Islands
  Patentee before: Alibaba Group Holding Ltd.