CN107704538A

CN107704538A - Rubbish text processing method, device, equipment and storage medium

Info

Publication number: CN107704538A
Application number: CN201710865928.XA
Authority: CN
Inventors: 谢永恒; 陶小龙; 火莽; 火一莽; 万月亮
Original assignee: Beijing Ruian Technology Co Ltd
Current assignee: Beijing Ruian Technology Co Ltd
Priority date: 2017-09-22
Filing date: 2017-09-22
Publication date: 2018-02-16

Abstract

The invention discloses a rubbish text processing method, device, equipment and storage medium. The method includes: obtaining the uniform resource locators (URLs) of Hypertext Transfer Protocol (HTTP) data within a preset time period, each URL including a request address and request parameters; screening the request addresses based on a preset screening rule to obtain pending URLs, and selecting the corresponding pending entity files according to the request addresses of the pending URLs; performing word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifying the value elements in the word segmentation results, and performing statistical analysis on the value elements to generate statistical results; and determining, according to the statistical results, whether to add the corresponding pending URL to a filter URL list. Embodiments of the invention address the prior-art problems of low rubbish text recognition accuracy and heavy load on the online system, minimizing the load placed on the online system and improving the reliability of recognition.

Description

Rubbish text processing method, device, equipment and storage medium

Technical field

Embodiments of the present invention relate to computer technology, and in particular to a rubbish text processing method, device, equipment and storage medium.

Background

Text mining is an interdisciplinary field that draws on data mining, machine learning, pattern recognition, artificial intelligence, statistics, computational linguistics, computer network technology, informatics and other areas. It is a set of methods and tools for discovering implicit knowledge and patterns in large collections of documents. It grew out of data mining but differs from traditional data mining in many respects. The objects of text mining are massive, heterogeneous, distributed documents whose content is written in human natural language and lacks semantics that a computer can readily interpret. In practice, data appears in many patterns with no uniformity among them, and unstructured data of various patterns is mixed together, for example HTTP POST data and cookies. The volume of such data is huge, yet most of it has no value; it is referred to as rubbish text. Rubbish text occupies a large amount of storage space and seriously degrades system performance. There are currently two main approaches to handling rubbish text. The first is document classification: text features are selected, a model is trained on data labeled in advance, and the trained model judges whether a text under examination is rubbish text. The second is rule-based filtering: text is filtered according to rules set in advance by business experts.

For document classification, the content of HTTP POST request data is highly varied and has no stable format features or keyword features, so it is difficult to select usable features for learning and training. In addition, document classification has limited precision and may discard valuable data. For rule-based filtering, the rules usually have to be determined in advance, yet clear rules are hard to find in disorderly HTTP POST data. Moreover, a large number of rules is needed to improve precision, and deploying those rules in the online environment reduces the processing performance of the system.

No effective solution to the above problems has yet been proposed.

Summary of the invention

The present invention provides a rubbish text processing method, device, equipment and storage medium, so as to minimize the load on the online system and improve the reliability of recognition.

In a first aspect, an embodiment of the invention provides a rubbish text processing method, the method including:

obtaining the uniform resource locators (URLs) of Hypertext Transfer Protocol (HTTP) data within a preset time period, each URL including a request address and request parameters;

screening the request addresses based on a preset screening rule to obtain pending URLs, and selecting the corresponding pending entity files according to the request addresses of the pending URLs;

performing word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifying the value elements in the word segmentation results, and performing statistical analysis on the value elements to generate statistical results; and

determining, according to the statistical results, whether to add the corresponding pending URL to a filter URL list.

Further, screening the request addresses based on the preset screening rule to obtain the pending URLs includes:

grouping the HTTP data by request address, and counting the number of HTTP data items in each group; and

sorting the request addresses of the groups in descending order by that number, computing the cumulative share, and selecting the URLs corresponding to the request addresses that satisfy a preset cumulative share as the pending URLs.

Further, identifying the value elements in the word segmentation results and performing statistical analysis on the value elements to generate statistical results includes:

matching the word segmentation results against preset value elements based on preset recognition rules, to obtain the value elements in the word segmentation results; and

counting the number of times the value elements occur in the pending entity file, to generate the statistical result.

Further, determining, according to the statistical results, whether to add the corresponding pending URL to the filter URL list includes:

when the statistical result is lower than a preset recommendation result, obtaining the pending URL and the pending entity file corresponding to the statistical result;

submitting the pending URL and the pending entity file for manual judgment, and obtaining the manual judgment result; and

determining, based on the manual judgment result, whether to add the pending URL to the filter URL list.

Further, the preset word segmentation algorithm includes the reverse maximum matching algorithm.

In a second aspect, an embodiment of the invention further provides a rubbish text processing device, the device including:

a URL acquisition module, configured to obtain the URLs of the HTTP data within a preset time period, each URL including a request address and request parameters;

a pending entity file acquisition module, configured to screen the request addresses based on a preset screening rule to obtain pending URLs, and to select the corresponding pending entity files according to the request addresses of the pending URLs;

a statistical result generation module, configured to perform word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, to identify the value elements in the word segmentation results, and to perform statistical analysis on the value elements to generate statistical results; and

a filter URL list generation module, configured to determine, according to the statistical results, whether to add the corresponding pending URL to the filter URL list.

Further, the pending entity file acquisition module includes:

a grouped-statistics unit, configured to group the HTTP data by request address and to count the number of HTTP data items in each group; and

a pending URL acquisition unit, configured to sort the request addresses of the groups in descending order by that number, to compute the cumulative share, and to select the URLs corresponding to the request addresses that satisfy a preset cumulative share as the pending URLs.

Further, the statistical result generation module includes:

a value element recognition unit, configured to match the word segmentation results against preset value elements based on preset recognition rules, to obtain the value elements in the word segmentation results; and

a statistical result generation unit, configured to count the number of times the value elements occur in the pending entity file, to generate the statistical result.

In a third aspect, an embodiment of the invention further provides equipment, the equipment including:

one or more processors; and

a memory for storing one or more programs;

wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the rubbish text processing method described above.

In a fourth aspect, an embodiment of the invention further provides a computer-readable storage medium on which a computer program is stored, the program implementing the rubbish text processing method described above when executed by a processor.

The present invention obtains the URLs (Uniform Resource Locators) of HTTP (Hypertext Transfer Protocol) data within a preset time period, each URL including a request address and request parameters; screens the request addresses based on a preset screening rule to obtain pending URLs and selects the corresponding pending entity files according to the request addresses of the pending URLs; performs word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifies the value elements in the word segmentation results and performs statistical analysis on them to generate statistical results; and then determines, according to the statistical results, whether to add the corresponding pending URL to the filter URL list. This addresses the prior-art problems of low rubbish text recognition accuracy and heavy load on the online system, minimizes the load on the online system and improves the reliability of recognition.

Brief description of the drawings

Fig. 1 is a flow chart of a rubbish text processing method in embodiment one of the present invention;

Fig. 2 is a flow chart of a rubbish text processing method in embodiment two of the present invention;

Fig. 3 is a structural diagram of a rubbish text processing device in embodiment three of the present invention;

Fig. 4 is a structural diagram of equipment in embodiment four of the present invention.

Detailed description of the embodiments

The present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the invention and do not limit it. It should also be noted that, for ease of description, the drawings show only the parts relevant to the invention rather than the entire structure.

Embodiment one

Fig. 1 is a flow chart of a rubbish text processing method provided by embodiment one of the present invention. This embodiment is applicable to accurately identifying and filtering out text that contains no valuable information. The method can be executed by a rubbish text processing device, which can be implemented in software and/or hardware and can be configured in a terminal, typically a mobile phone, computer or tablet. As shown in Fig. 1, the method includes the following steps:

Step S110: obtain the URLs of the HTTP data within a preset time period, each URL including a request address and request parameters.

In a specific embodiment of the invention, to ensure data coverage, the preset time period preferably spans one week, or at least one day, of data; the actual period can be set according to circumstances and is not specifically limited here. HTTP is an application-layer protocol used mainly to transfer hypertext from World Wide Web (WWW) servers to a local browser. A URL, also called a Web address, has the following basic overall format: scheme (or protocol) + "://" + host domain name (or IP address) + ":" + port number + directory path + file name, for example "scheme://authority/path?query". Examples include "http://www.sogou.com/sie?hdq=AQxRG-0000&query=URL&ie=utf8" and "http://www.amdc.m.taobao.com/amdc/mobileDispatch?platform=android&v=3.1&deviceId=&appkey=umeng%3A58b7fafb07fe6513dc001456". The full syntax of a common URL with an authority part is: scheme://username:password@subdomain.domain.tld:port/directory/filename.extension?parameter=value#fragment. An example is "http://blog.163.com/xianyu_405@126/blog/static/161729131201082614930373/".

Specifically, a URL consists of a request address and request parameters separated by "?": the part before "?" is the request address and the part after it is the request parameters. For example, in "http://www.sogou.com/sie?hdq=AQxRG-0000&query=URL&ie=utf8", "http://www.sogou.com/sie" is the request address and "hdq=AQxRG-0000&query=URL&ie=utf8" is the request parameters. Request parameters are passed as key=value pairs, with "&" separating the pairs.
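The split described above can be sketched with Python's standard `urllib.parse` module. This is only an illustration, assuming the separator between request address and request parameters is "?" as in standard URL syntax; the function name `split_url` is hypothetical and does not come from the source.

```python
from urllib.parse import urlsplit, parse_qsl


def split_url(url):
    """Split a URL into (request address, list of (key, value) parameters)."""
    parts = urlsplit(url)
    # Request address: everything before "?" (scheme, host and path).
    request_address = f"{parts.scheme}://{parts.netloc}{parts.path}"
    # Request parameters: key=value pairs separated by "&".
    # keep_blank_values=True retains pairs such as "deviceId=".
    params = parse_qsl(parts.query, keep_blank_values=True)
    return request_address, params


address, params = split_url(
    "http://www.sogou.com/sie?hdq=AQxRG-0000&query=URL&ie=utf8"
)
print(address)
print(params)
```

Only the request address is needed for the later grouping step; the parameters are shown here to make the key=value structure visible.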

It should be noted that the content of HTTP data with the same request address generally has the same structure. For example, "http://s.taobao.com/search?q=水果&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170916" and "http://s.taobao.com/search?q=%E6%B0%B4%E6%9E%9C%E6%96%B0%E9%B2%9C&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170916" share the request address "http://s.taobao.com/search". Entering these URLs in a browser shows that the two requests return similar content, both concerning "fruit". This provides the basis for the subsequent screening of request addresses based on the preset screening rule to obtain pending URLs.

Step S120: screen the request addresses based on a preset screening rule to obtain pending URLs, and select the corresponding pending entity files according to the request addresses of the pending URLs.

In a specific embodiment of the invention, the preset screening rule is a data processing rule determined in advance according to actual conditions or through continuous testing and refinement, which reduces the amount of data to be processed subsequently. Optionally, the preset screening rule is founded on the observation that the content of HTTP data with the same request address has the same structure. Accordingly, the preset screening rule may be a grouped-statistics rule, a probabilistic-model rule, or the like.

Step S130: perform word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identify the value elements in the word segmentation results, and perform statistical analysis on the value elements to generate statistical results.

In a specific embodiment of the invention, word segmentation refers to the process of recombining a continuous character sequence into a word sequence according to a given specification. In written English, spaces serve as natural delimiters between words, as in "I say a boy." In Chinese, characters, sentences and paragraphs can be delimited simply by obvious delimiters, but words have no formal delimiter, so dividing a sentence such as "this song is too bland" into a word sequence is considerably more complex. The process by which a computer determines which character strings in a sentence are words and which are not is called a word segmentation algorithm. Optionally, the preset word segmentation algorithm belongs to one of three major classes: algorithms based on string matching, algorithms based on understanding, and algorithms based on statistics. An algorithm based on string matching, also called a mechanical segmentation algorithm, matches the Chinese character string to be analyzed against the entries of a "sufficiently large" machine dictionary according to a certain strategy; if a string is found in the dictionary, the match succeeds (a word is recognized). The basic idea of an algorithm based on understanding is to perform syntactic and semantic analysis during segmentation and to use syntactic and semantic information to resolve ambiguity. The basic idea of an algorithm based on statistics is that the frequency or probability with which adjacent characters co-occur reflects well the confidence that they form a word. The frequency of each combination of adjacent co-occurring characters in a corpus is counted and their mutual information is computed; mutual information reflects how tightly the Chinese characters are bound together. When the tightness exceeds a certain threshold, the character group is considered likely to form a word and is added to a candidate word sequence, with the final decision made by manual inspection. This kind of algorithm only needs to count character-group frequencies in the corpus and does not require a segmentation dictionary. Common string-matching segmentation algorithms include the forward maximum matching algorithm, the reverse maximum matching algorithm (Reverse Maximum Matching Method, RMM) and the minimum segmentation matching algorithm. Preferably, the pending entity files are segmented using the reverse maximum matching algorithm, which is used as the example below.

Specifically, because Chinese contains many modifier-head phrases, matching in reverse, that is from back to front, can reduce the error rate. The basic procedure of the reverse maximum matching algorithm is as follows:

Step 1: let MaxLen be the number of characters in the longest entry of the segmentation dictionary, let S1 be the Chinese character string to be segmented, and set S2 = "".

Step 2: check whether S1 is empty. If it is empty, go to step 7; otherwise go to step 3.

Step 3: take up to MaxLen characters from the right end of S1 as the candidate string W.

Step 4: look W up in the segmentation dictionary. If W is in the dictionary, the match succeeds: set S2 = "/" + W + S2 and S1 = S1 - W, then go to step 2. If W is not in the dictionary, the match fails: go to step 5.

Step 5: remove the leftmost character of W to obtain a new W.

Step 6: check whether W is a single character. If it is, accept W as a word: set S2 = "/" + W + S2 and S1 = S1 - W, then go to step 2. Otherwise go to step 4.

Step 7: output the result S2.

The distinguishing feature of this algorithm is that when a dictionary lookup fails, the first character on the left is removed. For example, for the string ABCD in a text, where CD ∈ W, BCD ∈ W and ABCD ∉ W (W here denoting the segmentation dictionary), the segmentation A/BCD is chosen.

As an example, let the Chinese character string to be segmented be S1 = "计算语言学课程有意思" ("the computational linguistics course is interesting"), let the longest dictionary entry have MaxLen = 5 characters, and let S2 = "" with the separator "/".

S1 is not empty, so the candidate string W = "课程有意思" is taken from the right end of S1. W is not in the segmentation dictionary, so its leftmost character is removed, giving W = "程有意思". This is not in the dictionary either, so the leftmost character is removed again, giving W = "有意思" and then W = "意思". "意思" (meaning) is in the dictionary, so W is added to S2, giving S2 = "/意思", and W is removed from S1, leaving S1 = "计算语言学课程有".

S1 is not empty, so W = "言学课程有" is taken from the right end of S1. W is not in the dictionary, and removing the leftmost character one at a time gives W = "学课程有", "课程有" and "程有", none of which is in the dictionary, and finally W = "有". W is now a single character, so it is added to S2, giving S2 = "/有/意思", and removed from S1, leaving S1 = "计算语言学课程".

S1 is not empty, so W = "语言学课程" is taken from the right end of S1. W is not in the dictionary, and removing the leftmost character one at a time gives W = "言学课程" and "学课程", neither of which is in the dictionary, and then W = "课程". "课程" (course) is in the dictionary, so W is added to S2, giving S2 = "/课程/有/意思", and removed from S1, leaving S1 = "计算语言学".

S1 is not empty, so W = "计算语言学" is taken from the right end of S1. "计算语言学" (computational linguistics) is in the dictionary, so W is added to S2, giving S2 = "计算语言学/课程/有/意思" (leading separator omitted), and removed from S1, leaving S1 = "". S1 is now empty, so S2 is output as the word segmentation result and the segmentation process ends.
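The procedure above can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the function name `reverse_maximum_match`, the use of a Python set as the dictionary, and returning a list instead of a "/"-joined string are all assumptions. A single character left over after all lookups fail is accepted as a word, as in the worked example. The Chinese test string is the example sentence as reconstructed from the translated text.

```python
def reverse_maximum_match(text, dictionary, max_len=None):
    """Segment text from right to left, preferring the longest dictionary match."""
    if max_len is None:
        # MaxLen: length of the longest dictionary entry (at least 1).
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)

    words = []
    remaining = text
    while remaining:
        # Take up to max_len characters from the right end as the candidate W.
        candidate = remaining[-max_len:]
        # On a failed lookup, drop the leftmost character and retry;
        # a single remaining character is accepted as a word.
        while len(candidate) > 1 and candidate not in dictionary:
            candidate = candidate[1:]
        words.insert(0, candidate)
        remaining = remaining[: len(remaining) - len(candidate)]
    return words


print("/".join(reverse_maximum_match("ABCD", {"CD", "BCD"})))
print("/".join(reverse_maximum_match(
    "计算语言学课程有意思", {"计算语言学", "课程", "意思"}, max_len=5)))
```

A set gives constant-time lookups on average, which matters because each position may trigger up to MaxLen lookups.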

In a specific embodiment of the invention, optionally, value elements include information of interest to the user, unknown value information and/or currently unrecognized value information. Information of interest to the user includes, but is not limited to, hardware feature information and real identity information. Hardware feature information refers to a unique identifier of a device, such as an IMEI or MAC address; real identity information refers to information that uniquely identifies a user's identity, such as a mobile phone number or identity card number. Unknown value information refers to value elements that are hard to express with rules, such as user names. Currently unrecognized value information refers to virtual identities, that is, information that uniquely identifies a user within a particular network application, such as an account, UID or email address.

On this basis, optionally, the value elements in the word segmentation results are identified by recognition rules set in advance and are then statistically analyzed, for example by counting the number of times each value element occurs in each word segmentation result.

Step S140: determine, according to the statistical results, whether to add the corresponding pending URL to the filter URL list.

In a specific embodiment of the invention, a filter URL is a URL that contains no value elements or whose number of value elements is below a preset threshold. The value elements referred to here have the same meaning as in step S130.
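The selection criterion just stated can be sketched as a simple threshold test. The function name, the example URLs and the threshold value below are hypothetical and chosen only for illustration; the source does not specify a threshold.

```python
def select_filter_candidates(stats, threshold):
    """Return URLs whose value-element count is zero or below the threshold.

    stats maps each pending URL to the number of value elements counted
    in its pending entity file.
    """
    return [url for url, count in stats.items()
            if count == 0 or count < threshold]


# Hypothetical statistics for three pending URLs.
stats = {
    "http://example.com/a": 20,
    "http://example.com/b": 1,
    "http://example.com/c": 0,
}
print(select_filter_candidates(stats, threshold=2))
```

In the variant that includes manual review, the returned URLs would be candidates submitted for judgment rather than entries added to the list directly.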

It should be noted that steps S110 to S140 are performed offline, which reduces the load on the online system. The online system (the system the user is using) only needs to apply the corresponding filtering operation to the entity files, that is, the text data, that correspond to the URLs in the filter URL list, so the impact on users is also kept to a minimum.

The technical solution of this embodiment obtains the URLs of the HTTP data within a preset time period, each URL including a request address and request parameters; screens the request addresses based on a preset screening rule to obtain pending URLs and selects the corresponding pending entity files according to the request addresses of the pending URLs; performs word segmentation on the pending entity files using a preset word segmentation algorithm to generate word segmentation results, identifies the value elements in the word segmentation results and performs statistical analysis on them to generate statistical results; and then determines, according to the statistical results, whether to add the corresponding pending URL to the filter URL list. This addresses the prior-art problems of low rubbish text recognition accuracy and heavy load on the online system, minimizes the load on the online system and improves the reliability of recognition.

Further, on the basis of the above technical solution, screening the request addresses based on the preset screening rule to obtain the pending URLs includes:

grouping the HTTP data by request address, and counting the number of HTTP data items in each group;

In a specific embodiment of the invention, because the content of HTTP data with the same request address has the same structure, the HTTP data can first be grouped by request address, that is, HTTP data with identical request addresses are placed in the same group. For example, consider the HTTP data "http://www.sogou.com/sie?hdq=AQxRG-0000&query=%E7%BB%9F%E4%B8%80%E8%B5%84%E6%BA%90%E5%AE%9A%E4%BD%8D%E7%AC%A6&ie=utf8", "http://www.sogou.com/sie?hdq=AQxRG-0000&query=统一资源定位符&ie=utf8", "http://s.taobao.com/search?q=水果&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170916" and "http://s.taobao.com/search?q=%E6%B0%B4%E6%9E%9C%E6%96%B0%E9%B2%9C&imgfile=&commend=all&ssid=s5-e&search_type=item&sourceId=tb.index&spm=a21bo.50862.201856-taobao-item.1&ie=utf8&initiative_id=tbindexz_20170916". According to the request address, the first two are placed in one group (request address "http://www.sogou.com/sie") and the last two in another (request address "http://s.taobao.com/search"). Other HTTP data is grouped in the same way and is not described further here.

On this basis, the number of HTTP data items in each group is counted. For example, the number of HTTP data items with the request address "http://www.sogou.com/sie" is 250, and the number with the request address "http://s.taobao.com/search" is 400.
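Grouping by request address and counting each group can be sketched with `collections.Counter`. The helper names are hypothetical, the request address is assumed to be the part of the URL before "?", and the taobao URLs are shortened for readability.

```python
from collections import Counter
from urllib.parse import urlsplit


def request_address(url):
    """Return the part of the URL before "?" (scheme, host and path)."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}{parts.path}"


def count_by_request_address(urls):
    """Group URLs by request address and count the items in each group."""
    return Counter(request_address(url) for url in urls)


urls = [
    "http://www.sogou.com/sie?hdq=AQxRG-0000&query=%E7%BB%9F%E4%B8%80&ie=utf8",
    "http://www.sogou.com/sie?hdq=AQxRG-0000&query=统一资源定位符&ie=utf8",
    "http://s.taobao.com/search?q=水果&imgfile=&commend=all",
    "http://s.taobao.com/search?q=%E6%B0%B4%E6%9E%9C&imgfile=&commend=all",
]
for address, count in count_by_request_address(urls).most_common():
    print(address, count)
```

`Counter.most_common()` already returns the groups in descending order of count, which is the order the next step requires.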

sorting the request addresses of the groups in descending order by that number, computing the cumulative share, and selecting the URLs corresponding to the request addresses that satisfy a preset cumulative share as the pending URLs.

In a specific embodiment of the invention, to reduce the amount of data processed, not all HTTP data is processed; instead it is screened, and only the HTTP data that meets the preset condition is processed subsequently.

Specifically, the request addresses of the groups are first sorted in descending order by the counted number of items in each group. For example, group 1 "http://s.taobao.com/search" contains $f_1 = 400$ items, group 2 "http://www.sogou.com/sie" contains $f_2 = 250$, group 3 "http://baike.sogou.com/v431372.htm" contains $f_3 = 230$, group 4 "http://list.jd.com/list.html" contains $f_4 = 120$, and group 5 "https://mst.vip.com/k1WvCu4H5gssBn3KLODlHQ.php" contains $f_5 = 100$. Sorting these from largest to smallest gives $f_1, f_2, f_3, f_4, f_5$. Next, the cumulative share is computed according to the formula $x_i = \frac{\sum_{j=1}^{i} f_j}{\sum_{j=1}^{n} f_j}$, where $n$ is the number of groups. With a total of 1100 items, the results are $x_1 = 400/1100 \approx 36.4\%$, $x_2 = 650/1100 \approx 59.1\%$, $x_3 = 880/1100 = 80\%$, $x_4 = 1000/1100 \approx 90.9\%$ and $x_5 = 1100/1100 = 100\%$. Finally, the URLs corresponding to the request addresses that satisfy the preset cumulative share are selected as the pending URLs. Earlier findings indicate that the groups within the top 80% of the cumulative share are few in number yet cover a sufficiently large portion of the data, so the cost of study and analysis is low. On this basis, the preset cumulative share is preferably set to 80%. Since $x_3 = 80\%$ satisfies this condition, the URLs corresponding to the request addresses of groups 1, 2 and 3 are selected as the pending URLs.
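The cumulative-share selection can be sketched as below, using the group counts from the example. Two points are assumptions rather than statements from the source: groups are taken in descending order until the cumulative share first reaches the threshold, and `Fraction` is used so that the comparison with 80% is exact. The function name is hypothetical.

```python
from fractions import Fraction


def select_by_cumulative_share(counts, threshold=Fraction(80, 100)):
    """Return (address, count, cumulative share) for the selected groups."""
    total = sum(counts.values())
    if total == 0:
        return []
    # Sort groups by count, largest first.
    ranked = sorted(counts.items(), key=lambda item: item[1], reverse=True)
    selected, running = [], 0
    for address, count in ranked:
        running += count
        share = Fraction(running, total)  # x_i = (f_1 + ... + f_i) / total
        selected.append((address, count, share))
        if share >= threshold:
            break
    return selected


counts = {
    "http://s.taobao.com/search": 400,
    "http://www.sogou.com/sie": 250,
    "http://baike.sogou.com/v431372.htm": 230,
    "http://list.jd.com/list.html": 120,
    "https://mst.vip.com/k1WvCu4H5gssBn3KLODlHQ.php": 100,
}
for address, count, share in select_by_cumulative_share(counts):
    print(address, count, f"{float(share):.1%}")
```

With these counts the loop stops at the third group, whose cumulative share is exactly 880/1100 = 80%.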

It should be noted that the number of groups, the number of HTTP data items in each group and the preset cumulative share must be determined according to actual conditions and are not specifically limited here.

In this way, the amount of data processed and the cost of study and analysis are reduced while sufficiently broad data coverage is maintained.

Further, on the basis of the above technical solution, identifying the value elements in the word segmentation results and performing statistical analysis on the value elements to generate statistical results includes:

matching the word segmentation results against preset value elements based on preset recognition rules, to obtain the value elements in the word segmentation results; and

counting the number of times the value elements occur in the pending entity file, to generate the statistical result.

In a specific embodiment of the invention, a preset recognition rule is a rule for identifying value elements that is supplied in advance in the form of a configuration file. For example, when the preset value elements are mobile phone numbers, identity card numbers, IMEIs, email addresses, MAC addresses and the like, the preset recognition rules are regular-expression rules; when the preset value element is a user name, the preset recognition rule is a rule that specifies information such as the application and the attribute identifier. The preset value elements may belong to the same category, meaning that they use the same kind of recognition rule; mobile phone numbers and identity card numbers, for instance, are in the same category. They may also belong to different categories, meaning that they use different recognition rules; mobile phone numbers and user names, for instance, are in different categories. These can be set according to actual conditions and are not specifically limited here. Preferably, the preset value elements include value elements of different categories, with at least one type of value element in each category. The following description takes mobile phone number, identity card number, IMEI, email address, MAC address and user name as the preset value elements.

Because the preset value elements are a mobile phone number, ID card number, IMEI, email address, MAC address and user name, the preset recognition rules are correspondingly the regular expression rule and the rule that specifies information such as the application and the attribute identifier. Based on these rules, it is determined whether the word segmentation result contains a mobile phone number, ID card number, IMEI, email address, MAC address or user name, and the value elements in the word segmentation result are obtained by matching. For example, word segmentation result 1 is found to contain the value element mobile phone number, word segmentation result 2 the value element user name, and word segmentation result 3 the value elements mobile phone number and MAC address. Word segmentation results 1, 2 and 3 are generated by applying the preset segmentation algorithm to pending entity files 1, 2 and 3, respectively. The number of times the value elements occur in each pending entity file is then counted. For example, the mobile phone number occurs 20 times in pending entity file 1, the user name occurs once in pending entity file 2, and the number of occurrences of the mobile phone number and MAC address in pending entity file 3 is 2.

It should be noted that the statistical result obtained above is the sum of the numbers of occurrences of all value elements. For example, suppose the word segmentation result obtained by applying the segmentation algorithm to a pending entity file is matched against the preset value elements based on the preset recognition rule, and the value elements found are a mobile phone number and an ID card number. The statistical result generated for that pending entity file is then the number of occurrences of the mobile phone number plus the number of occurrences of the ID card number.

The statistical result obtained above provides the basis for subsequently determining whether to add the corresponding pending URL to the filtering URL list.
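The recognition and counting described above can be illustrated with a minimal Python sketch. The rule names and regular expressions below are hypothetical simplifications chosen for illustration (the embodiment loads its rules from a configuration file and does not disclose the actual patterns), and user-name recognition by application and attribute identifier is not modeled.

```python
import re

# Hypothetical recognition rules: one regular expression per preset value element.
# These patterns are illustrative assumptions, not the rules used in the embodiment.
RECOGNITION_RULES = {
    "mobile_number": re.compile(r"1[3-9]\d{9}"),
    "email_address": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "mac_address": re.compile(r"(?:[0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}"),
}


def count_value_elements(tokens):
    """Count, per value element, the tokens that fully match its rule."""
    counts = {name: 0 for name in RECOGNITION_RULES}
    for token in tokens:
        for name, pattern in RECOGNITION_RULES.items():
            if pattern.fullmatch(token):
                counts[name] += 1
                break  # a token is counted under at most one value element
    return counts


def statistical_result(tokens):
    """The statistical result is the sum of the counts of all value elements."""
    return sum(count_value_elements(tokens).values())


# Toy word segmentation result for one pending entity file.
tokens = ["user", "13812345678", "login", "00:1A:2B:3C:4D:5E", "13812345678"]
print(count_value_elements(tokens))
print(statistical_result(tokens))
```

In this toy input the mobile phone number occurs twice and the MAC address once, so the statistical result is their sum.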

Further, on the basis of the above technical solution, determining, according to the statistical result, whether to add the corresponding pending URL to the filtering URL list includes:

When the statistical result is less than a preset recommendation result, obtaining the pending URL and the pending entity file corresponding to the statistical result;

In a specific embodiment of the present invention, the preset recommendation result can be determined through prior statistical analysis and is not specifically limited here. Preferably, the preset recommendation result is set to 2. When the statistical result is less than 2, the pending URL and pending entity file corresponding to the statistical result need to be obtained; when the statistical result is greater than or equal to 2, this operation is not needed, because the corresponding pending entity file is determined to contain a fair amount of valuable information. For example, since the user name described above occurs once in pending entity file 2, the statistical result is less than 2, so the pending URL and pending entity file corresponding to the statistical result, namely the pending URL and pending entity file 2, need to be obtained.

Recommending the pending URL and the pending entity file for manual judgment, and obtaining a manual judgment result;

In a specific embodiment of the present invention, the purpose of the manual judgment is as follows. In practice, entity files often contain virtual identities (such as email addresses), as explained earlier. These are in essence valuable information, but because they may not be recognized well, the statistical result may fall below the preset recommendation result. Manual judgment is therefore needed as a check to ensure that valuable information is not deleted, which reduces the misjudgment rate. Meanwhile, for recommended items that contain extractable valuable information that was not recognized (such as email addresses), manual judgment can discover the data extraction pattern and produce a corresponding data extraction rule. That extraction rule can then be added to the preset recognition rules described above for subsequent use, which in turn reduces the number of manual judgments.

Determining, based on the manual judgment result, whether to add the pending URL to the filtering URL list.

In a specific embodiment of the present invention, for example, it is manually judged whether pending entity file 2 contains an email address. If it does, the condition that the statistical result is less than the preset recommendation result is no longer satisfied, so the pending URL corresponding to pending entity file 2 is not added to the filtering URL list; otherwise, that pending URL is added to the filtering URL list.
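The decision flow above can be sketched as follows. This is only an illustration under stated assumptions: the function name `decide`, its return values, and the `manual_judge` callback (a stand-in for the human reviewer) are all hypothetical and do not appear in the embodiment; the threshold of 2 is the preferred value given above.

```python
PRESET_RECOMMENDATION_RESULT = 2  # preferred value stated in this embodiment


def decide(pending_url, entity_file, stat_result, manual_judge, filter_list):
    """Add pending_url to filter_list unless its content is found valuable.

    manual_judge is a hypothetical callback standing in for the human
    reviewer; it returns True when the entity file holds valuable information.
    """
    if stat_result >= PRESET_RECOMMENDATION_RESULT:
        return "kept"  # enough value elements were recognized; no review needed
    if manual_judge(pending_url, entity_file):
        return "kept"  # the reviewer found value that the rules missed
    filter_list.add(pending_url)
    return "filtered"


filter_list = set()
# Toy reviewer: treats any text containing "@" as holding an email address.
reviewer = lambda url, text: "@" in text
print(decide("http://example.com/a", "contact: bob@example.com", 1, reviewer, filter_list))
print(decide("http://example.com/b", "session=abc123", 0, reviewer, filter_list))
print(filter_list)
```

Keeping the reviewer behind a callback reflects the point made above: once a reviewer discovers a new extraction rule, it can move into the preset recognition rules, and fewer files reach the manual step.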

Embodiment two

Fig. 2 is a flowchart of a rubbish text processing method provided by Embodiment two of the present invention. This embodiment is applicable to accurately identifying and filtering out text that contains no valuable information. The method can be performed by a rubbish text processing device, which can be implemented in software and/or hardware and can be configured in a terminal, typically a mobile phone, computer, tablet computer or the like. As shown in Fig. 2, the method specifically includes the following steps:

Step S210, obtaining the URL of HTTP data within a preset time, the URL including a request address and request parameters;

Step S220, grouping the HTTP data by request address, counting the number of HTTP data items in each group, sorting the request addresses of the groups in descending order of that number, and calculating the cumulative proportion;

Step S230, selecting the URLs corresponding to the request addresses that satisfy a preset cumulative proportion as pending URLs, and selecting the corresponding pending entity files according to the request addresses of the pending URLs;

Step S240, performing word segmentation on the pending entity file using a preset segmentation algorithm to generate a word segmentation result, and, based on a preset recognition rule, matching the word segmentation result against preset value elements to obtain the value elements in the word segmentation result;

In a specific embodiment of the present invention, the pending entity file is preferably segmented using the reverse maximum matching algorithm.

Step S250, counting the number of times the value elements occur in the pending entity file to generate a statistical result;

Step S260, when the statistical result is less than a preset recommendation result, obtaining the pending URL and the pending entity file corresponding to the statistical result;

Step S270, recommending the pending URL and the pending entity file for manual judgment, and obtaining a manual judgment result;

Step S280, determining, based on the manual judgment result, whether to add the pending URL to the filtering URL list.

It should be noted that steps S210 to S280 above are performed offline, which achieves the purpose of reducing the load on the online system. The online system (the system in use by users) only needs to perform the corresponding filtering operation according to the result, so the impact on users is minimized.
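The reverse maximum matching algorithm preferred in step S240 can be sketched as follows. This is a generic textbook form of the technique, not code from the embodiment; the function name, the dictionary and the sample sentence are illustrative assumptions.

```python
def reverse_maximum_matching(text, dictionary, max_len=None):
    """Segment text by scanning from its end and taking the longest dictionary word."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    max_len = max(1, max_len)  # guard against a zero-width window
    result = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the window from the left until it is a dictionary word
        # or only a single character remains.
        while start < end - 1 and text[start:end] not in dictionary:
            start += 1
        result.append(text[start:end])
        end = start
    result.reverse()  # words were collected from the end of the text
    return result


dictionary = {"研究", "研究生", "生命", "的", "起源"}
print(reverse_maximum_matching("研究生命的起源", dictionary))
```

Scanning from the end yields the segmentation 研究 / 生命 / 的 / 起源 for this sentence, whereas a forward scan with the same dictionary would first take 研究生.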

Embodiment three

Fig. 3 is a schematic structural diagram of a rubbish text processing device provided by Embodiment three of the present invention. This embodiment is applicable to accurately identifying and filtering out text that contains no valuable information. The device can be implemented in software and/or hardware and can be configured in a terminal, typically a mobile phone, computer, tablet computer or the like. As shown in Fig. 3, the device specifically includes:

URL acquisition module 310, configured to obtain the URL of HTTP data within a preset time, the URL including a request address and request parameters;

Pending entity file acquisition module 320, configured to screen the request address based on a preset screening rule to obtain a pending URL, and to select the corresponding pending entity file according to the request address of the pending URL;

Statistical result generation module 330, configured to perform word segmentation on the pending entity file using a preset segmentation algorithm to generate a word segmentation result, identify the value elements in the word segmentation result, and perform statistical analysis on the value elements to generate a statistical result;

Filtering URL list generation module 340, configured to determine, according to the statistical result, whether to add the corresponding pending URL to the filtering URL list.

In the technical solution of this embodiment, the URL acquisition module 310 obtains the URL of HTTP data within a preset time, the URL including a request address and request parameters. The pending entity file acquisition module 320 screens the request address based on a preset screening rule to obtain a pending URL and selects the corresponding pending entity file according to the request address of the pending URL. The statistical result generation module 330 performs word segmentation on the pending entity file using a preset segmentation algorithm to generate a word segmentation result, identifies the value elements in the word segmentation result, and performs statistical analysis on the value elements to generate a statistical result. The filtering URL list generation module 340 determines, according to the statistical result, whether to add the corresponding pending URL to the filtering URL list. This solves the problems in the prior art that rubbish text recognition accuracy is low and the load on the online system is high, minimizes the load on the online system, and improves the reliability of recognition.
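As an illustration of the first module's output, the sketch below splits a URL into a request address and request parameters using Python's standard `urllib.parse`. It assumes that the request address is the part of the URL before the query string; this section does not restate the exact definition, and the function name and sample URL are hypothetical.

```python
from urllib.parse import urlsplit, parse_qs


def split_url(url):
    """Split a URL into its request address and its request parameters."""
    parts = urlsplit(url)
    # Assumption: the request address is scheme, host and path, without the query.
    request_address = f"{parts.scheme}://{parts.netloc}{parts.path}"
    request_parameters = parse_qs(parts.query)
    return request_address, request_parameters


address, params = split_url("http://example.com/api/login?user=alice&id=42")
print(address)
print(params)
```

Separating the two parts is what allows later stages to group HTTP data by request address regardless of the parameter values carried by each request.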

Further, on the basis of the above technical solution, the pending entity file acquisition module 320 includes:

Grouping and counting unit, configured to group the HTTP data by request address and count the number of HTTP data items in each group;

Pending URL acquisition unit, configured to sort the request addresses of the groups in descending order of that number, calculate the cumulative proportion, and select the URLs corresponding to the request addresses that satisfy a preset cumulative proportion as the pending URLs.
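The two units above can be sketched together as follows. The selection criterion is an assumption: this sketch keeps the most frequent request addresses until their cumulative share of all HTTP data reaches the preset cumulative proportion. The function name and the value 0.8 are hypothetical; the embodiment does not fix either.

```python
from collections import Counter


def select_pending_addresses(request_addresses, preset_cumulative_share=0.8):
    """Pick the most frequent request addresses up to a cumulative share."""
    counts = Counter(request_addresses)  # group by address and count each group
    total = sum(counts.values())
    selected = []
    cumulative = 0
    # most_common() yields the groups in descending order of count.
    for address, count in counts.most_common():
        if total == 0 or cumulative >= preset_cumulative_share * total:
            break
        selected.append(address)
        cumulative += count
    return selected


# Toy input: one request address per HTTP data item.
addresses = ["/a"] * 6 + ["/b"] * 3 + ["/c"] * 1
print(select_pending_addresses(addresses))
```

With this toy input, "/a" and "/b" together account for 90% of the data, so the rarely requested "/c" is not selected for further processing.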

Further, on the basis of the above technical solution, the statistical result generation module 330 includes:

Value element recognition unit, configured to match the word segmentation result against preset value elements based on a preset recognition rule to obtain the value elements in the word segmentation result;

Statistical result generation unit, configured to count the number of times the value elements occur in the pending entity file to generate the statistical result.

Further, on the basis of the above technical solution, the filtering URL list generation module 340 is specifically configured to:

The rubbish text processing device configured in a terminal provided by the embodiments of the present invention can perform the rubbish text processing method applied to a terminal provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method performed.

Embodiment four

Fig. 4 is a schematic structural diagram of an equipment provided by Embodiment four of the present invention. Fig. 4 shows a block diagram of an exemplary equipment 412 suitable for implementing embodiments of the present invention. The equipment 412 shown in Fig. 4 is only an example and should not impose any limitation on the function and scope of use of the embodiments of the present invention.

As shown in Fig. 4, the equipment 412 takes the form of a general-purpose computing device. The components of the equipment 412 may include, but are not limited to: one or more processors 416, a system memory 428, and a bus 418 connecting different system components (including the system memory 428 and the processors 416).

The bus 418 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.

The equipment 412 typically includes a variety of computer-system-readable media. These media can be any available media accessible by the equipment 412, including volatile and non-volatile media, and removable and non-removable media.

The system memory 428 may include computer-system-readable media in the form of volatile memory, such as random access memory (RAM) 430 and/or cache memory 432. The equipment 412 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 434 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Fig. 4, commonly called a "hard disk drive"). Although not shown in Fig. 4, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (such as a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (such as a CD-ROM, DVD-ROM or other optical media) may be provided. In these cases, each drive may be connected to the bus 418 through one or more data media interfaces. The memory 428 may include at least one program product having a set of (for example, at least one) program modules configured to perform the functions of the embodiments of the present invention.

A program/utility 440 having a set of (at least one) program modules 442 may be stored, for example, in the memory 428. Such program modules 442 include, but are not limited to, an operating system, one or more application programs, other program modules and program data, and each or some combination of these examples may include an implementation of a network environment. The program modules 442 generally perform the functions and/or methods of the embodiments described in the present invention.

The equipment 412 may also communicate with one or more external devices 414 (such as a keyboard, pointing device, display 424, etc.), with one or more devices that enable a user to interact with the equipment 412, and/or with any device (such as a network card, modem, etc.) that enables the equipment 412 to communicate with one or more other computing devices. Such communication can take place through an input/output (I/O) interface 422. The equipment 412 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN) and/or a public network such as the Internet) through a network adapter 420. As shown in the figure, the network adapter 420 communicates with the other modules of the equipment 412 through the bus 418. It should be understood that, although not shown in Fig. 4, other hardware and/or software modules may be used in conjunction with the equipment 412, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.

The processor 416 runs the programs stored in the system memory 428 to perform various functional applications and data processing, for example to implement the rubbish text processing method provided by the embodiments of the present invention, including:

Obtaining the uniform resource locator (URL) of Hypertext Transfer Protocol (HTTP) data within a preset time, the URL including a request address and request parameters;

Screening the request address based on a preset screening rule to obtain a pending URL, and selecting the corresponding pending entity file according to the request address of the pending URL;

Performing word segmentation on the pending entity file using a preset segmentation algorithm to generate a word segmentation result, identifying the value elements in the word segmentation result, and performing statistical analysis on the value elements to generate a statistical result;

Determining, according to the statistical result, whether to add the corresponding pending URL to the filtering URL list.

Embodiment five

Embodiment five of the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the rubbish text processing method provided by the embodiments of the present invention, the method including:

The computer storage medium of the embodiments of the present invention may use any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer diskette, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), an optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus or device.

A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate or transmit a program for use by or in connection with an instruction execution system, apparatus or device.

Program code contained on a computer-readable medium may be transmitted over any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the above.

Computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. Where a remote computer is involved, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments and substitutions can be made without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them and may include other equivalent embodiments without departing from the inventive concept; the scope of the present invention is determined by the scope of the appended claims.

Claims

1. A rubbish text processing method, characterized by comprising:

Obtaining the uniform resource locator (URL) of Hypertext Transfer Protocol (HTTP) data within a preset time, the URL including a request address and request parameters;

Screening the request address based on a preset screening rule to obtain a pending URL, and selecting the corresponding pending entity file according to the request address of the pending URL;

Performing word segmentation on the pending entity file using a preset segmentation algorithm to generate a word segmentation result, identifying the value elements in the word segmentation result, and performing statistical analysis on the value elements to generate a statistical result;

Determining, according to the statistical result, whether to add the corresponding pending URL to a filtering URL list.
2. The method according to claim 1, characterized in that screening the request address based on the preset screening rule to obtain the pending URL comprises:

Grouping the HTTP data by the request address, and counting the number of HTTP data items in each group;

Sorting the request addresses of the groups in descending order of the number, calculating the cumulative proportion, and selecting the URLs corresponding to the request addresses that satisfy a preset cumulative proportion as the pending URL.
3. The method according to claim 1 or 2, characterized in that identifying the value elements in the word segmentation result and performing statistical analysis on the value elements to generate a statistical result comprises:

Based on a preset recognition rule, matching the word segmentation result against preset value elements to obtain the value elements in the word segmentation result;

Counting the number of times the value elements occur in the pending entity file to generate the statistical result.
4. The method according to claim 3, characterized in that determining, according to the statistical result, whether to add the corresponding pending URL to the filtering URL list comprises:

When the statistical result is less than a preset recommendation result, obtaining the pending URL and the pending entity file corresponding to the statistical result;

Recommending the pending URL and the pending entity file for manual judgment, and obtaining a manual judgment result;

Determining, based on the manual judgment result, whether to add the pending URL to the filtering URL list.
5. The method according to claim 4, characterized in that the preset segmentation algorithm includes the reverse maximum matching algorithm.
6. A rubbish text processing device, characterized by comprising:

URL acquisition module, configured to obtain the URL of HTTP data within a preset time, the URL including a request address and request parameters;

Pending entity file acquisition module, configured to screen the request address based on a preset screening rule to obtain a pending URL, and to select the corresponding pending entity file according to the request address of the pending URL;

Statistical result generation module, configured to perform word segmentation on the pending entity file using a preset segmentation algorithm to generate a word segmentation result, identify the value elements in the word segmentation result, and perform statistical analysis on the value elements to generate a statistical result;

Filtering URL list generation module, configured to determine, according to the statistical result, whether to add the corresponding pending URL to a filtering URL list.
7. The device according to claim 6, characterized in that the pending entity file acquisition module comprises:

Grouping and counting unit, configured to group the HTTP data by the request address and count the number of HTTP data items in each group;

Pending URL acquisition unit, configured to sort the request addresses of the groups in descending order of the number, calculate the cumulative proportion, and select the URLs corresponding to the request addresses that satisfy a preset cumulative proportion as the pending URL.
8. The device according to claim 6 or 7, characterized in that the statistical result generation module comprises:

Value element recognition unit, configured to match the word segmentation result against preset value elements based on a preset recognition rule to obtain the value elements in the word segmentation result;

Statistical result generation unit, configured to count the number of times the value elements occur in the pending entity file to generate the statistical result.
9. An equipment, characterized by comprising:

One or more processors;

Memory, configured to store one or more programs;

When the one or more programs are executed by the one or more processors, the one or more processors implement the rubbish text processing method according to any one of claims 1-5.
10. A computer-readable storage medium storing a computer program, characterized in that the program, when executed by a processor, implements the rubbish text processing method according to any one of claims 1-5.