CN103886016A - Equipment and method for determining junk text messages in page - Google Patents

Publication number
CN103886016A
Authority
CN
China
Prior art keywords
information
rubbish text
candidate
text information
rubbish
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410058591.8A
Other languages
Chinese (zh)
Other versions
CN103886016B (en)
Inventor
施鹏
牛章鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410058591.8A
Publication of CN103886016A
Application granted
Publication of CN103886016B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides equipment and a method for determining junk text messages in a page. The method includes: acquiring an initial page to be processed; determining one or more candidate junk text messages corresponding to the initial page; determining cheating degree messages corresponding to the candidate junk text messages; and determining, according to the cheating degree messages, one or more junk text messages corresponding to the initial page from the one or more candidate junk text messages. Compared with the prior art, the invention screens the candidate junk text messages by their cheating degree messages, so that junk text messages in the initial page are effectively recognized, which improves the safety and efficiency with which users acquire information and improves their search and browsing experience.

Description

A method and apparatus for determining rubbish text information in a page
Technical Field
The present invention relates to the field of Internet technology, and in particular to a technique for determining rubbish text information in a page.
Background
With the development of Internet technology and the penetration of Internet applications into users' study, work and life, people increasingly obtain information over the network, for example by entering keywords in the search box of a search engine to express a need and then obtaining the corresponding search results. When the website corresponding to a search result may pose a security risk, the search engine or browser warns the user about the security of that website, for example by prompting the user about the risk the website may carry. Usually, however, not every page of a website carries a security risk; rather, some information in some pages does. For example, when a website itself carries no security risk but some of its pages contain rubbish text information, a security prompt at the coarse granularity of the whole website cannot detect the rubbish text information in those pages. This impairs both the security and the efficiency with which users obtain information and degrades the search and browsing experience.
Summary of the Invention
The object of the present invention is to provide a method and apparatus for determining rubbish text information in a page.
According to one aspect of the present invention, a method for determining rubbish text information in a page is provided, wherein the method comprises the following steps:
A. obtaining an initial page to be processed;
B. determining one or more pieces of candidate rubbish text information corresponding to the initial page;
C. determining cheating degree information corresponding to the candidate rubbish text information;
D. determining, according to the cheating degree information, one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information.
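The four steps above can be sketched as a small pipeline. This is only an illustration under stated assumptions: the function name `determine_rubbish_text`, the two callables and the threshold rule in step D are hypothetical, since the text says only that selection is made according to the cheating degree information.

```python
from typing import Callable, List


def determine_rubbish_text(
    page: str,
    find_candidates: Callable[[str], List[str]],
    cheating_degree: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Run steps B to D on a page that step A has already obtained."""
    candidates = find_candidates(page)                      # step B
    degrees = {c: cheating_degree(c) for c in candidates}   # step C
    # Step D: keeping candidates at or above a threshold is an assumption
    # of this sketch; the text only says selection uses the cheating degree.
    return [c for c in candidates if degrees[c] >= threshold]


# Toy usage with stand-in callables (all values are illustrative).
scores = {"brand-x": 100.0, "probiotics": 1.0}
result = determine_rubbish_text(
    "try brand-x or probiotics",
    find_candidates=lambda page: [t for t in scores if t in page],
    cheating_degree=scores.__getitem__,
    threshold=10.0,
)
print(result)
```

Passing the candidate finder and the scoring function as parameters mirrors the way the description later offers several interchangeable ways of performing steps B and C.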
According to another aspect of the present invention, rubbish text determining equipment for determining rubbish text information in a page is also provided, wherein the rubbish text determining equipment comprises:
an acquisition device for obtaining an initial page to be processed;
a candidate determining device for determining one or more pieces of candidate rubbish text information corresponding to the initial page;
a cheating degree determining device for determining cheating degree information corresponding to the candidate rubbish text information;
a rubbish determining device for determining, according to the cheating degree information, one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information.
Compared with the prior art, the present invention determines the cheating degree information of the one or more pieces of candidate rubbish text information corresponding to the initial page and, according to that cheating degree information, determines the one or more pieces of rubbish text information corresponding to the initial page from the candidates. The candidate rubbish text information is thus screened by cheating degree information and the rubbish text information in the initial page is effectively identified, which improves both the security and the efficiency with which users obtain information and, accordingly, the user's search and browsing experience. Moreover, the present invention can also generate a target page corresponding to the initial page, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information. Providing this target page to the user marks the rubbish text information in the initial page as a prompt, which further improves the security and efficiency with which users obtain information and the search and browsing experience. In addition, when determining the cheating degree information, the present invention can take into account not only the library frequency information and user presentation ratio information corresponding to the candidate rubbish text information but also the presentation probability information corresponding to it. This makes the resulting cheating degree information more accurate and thereby further improves the security and efficiency with which users obtain information and the search and browsing experience.
Brief Description of the Drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 shows a schematic diagram of equipment for determining rubbish text information in a page according to one aspect of the present invention;
Fig. 2 shows a schematic diagram of an obtained initial page to be processed;
Fig. 3 shows a schematic diagram of a generated target page corresponding to the initial page shown in Fig. 2, wherein the target page includes explicit identification information for the rubbish text information;
Fig. 4 shows a schematic diagram of equipment for determining rubbish text information in a page according to a preferred embodiment of the present invention;
Fig. 5 shows a flow chart of a method for determining rubbish text information in a page according to another aspect of the present invention;
Fig. 6 shows a flow chart of a method for determining rubbish text information in a page according to a preferred embodiment of the present invention.
In the drawings, the same or similar reference numerals denote the same or similar components.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings.
Fig. 1 shows rubbish text determining equipment 1 for determining rubbish text information in a page according to one aspect of the present invention, wherein the rubbish text determining equipment 1 comprises an acquisition device 11, a candidate determining device 12, a cheating degree determining device 13 and a rubbish determining device 14. Specifically, the acquisition device 11 obtains an initial page to be processed; the candidate determining device 12 determines one or more pieces of candidate rubbish text information corresponding to the initial page; the cheating degree determining device 13 determines cheating degree information corresponding to the candidate rubbish text information; and the rubbish determining device 14 determines, according to the cheating degree information, one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information. Here, the rubbish text determining equipment 1 includes, but is not limited to, Internet platforms through which users can present their own original content or provide it to other users, such as: i) network platforms or terminal platforms that provide logged-in users with information storage space so that they can upload and share content such as documents, videos and pictures, and that also let users read online, download and exchange content shared by other users, for example Baidu Wenku, Docin and Sina iAsk, wherein the terminal platforms include but are not limited to user equipment such as mobile terminals and PCs; ii) network platforms or terminal platforms that provide logged-in users with information access, information sharing, information publishing or synchronization, for example third-party websites such as social networking sites, post bars, forums, knowledge Q&A sharing platforms, spaces, blogs and microblogs. Here, the rubbish text determining equipment 1 may be implemented by network equipment, by user equipment, or by equipment formed by integrating network equipment and user equipment through a network. The network equipment includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a cloud-computing-based set of computers. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, where cloud computing is a form of distributed computing: a super virtual computer made up of a group of loosely coupled computers. The user equipment may be any electronic product capable of human-computer interaction with a user through a keyboard, mouse, touch pad, touch screen, handwriting device or the like, such as a computer, mobile phone, PDA, pocket PC (PPC) or tablet computer. The network includes, but is not limited to, the Internet, wide area networks, metropolitan area networks, local area networks, VPNs and wireless ad hoc networks. Those skilled in the art will understand that the above rubbish text determining equipment 1 is given only as an example; other existing or future network equipment or user equipment, where applicable to the present invention, should also fall within the scope of protection of the present invention and is incorporated herein by reference. Here, both the network equipment and the user equipment include an electronic device capable of automatically performing numerical computation and information processing according to preset or stored instructions, whose hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), digital signal processors (DSPs) and embedded devices.
For example, when the rubbish text determining equipment 1 is implemented by user equipment, it may obtain, through the browser on the user equipment, the page access request submitted by the user, so as to obtain the initial page to be processed. It then determines the one or more pieces of candidate rubbish text information corresponding to the initial page, determines the cheating degree information corresponding to the candidate rubbish text information, and, according to the cheating degree information, determines the one or more pieces of rubbish text information corresponding to the initial page from the candidates, so that the rubbish text information can be provided to the user equipment through the browser and thus presented to the user.
For example, when the rubbish text determining equipment 1 is implemented by network equipment, it may receive the page access request sent by the user through user equipment, forward the request to a page server, and receive the page that the page server returns for the request, so as to obtain the initial page to be processed. It then determines the one or more pieces of candidate rubbish text information corresponding to the initial page, determines the cheating degree information corresponding to the candidate rubbish text information, and, according to the cheating degree information, determines the one or more pieces of rubbish text information corresponding to the initial page from the candidates, so that the rubbish text information can be sent to the user equipment, for example displayed in the browser on the user equipment, and thus presented to the user.
For example, when the rubbish text determining equipment 1 is implemented by user equipment and network equipment in cooperation, the user equipment may first obtain the initial page to be processed and send it to the corresponding network equipment. The network equipment then determines the one or more pieces of candidate rubbish text information corresponding to the initial page, determines the cheating degree information corresponding to the candidate rubbish text information, and, according to the cheating degree information, determines the one or more pieces of rubbish text information corresponding to the initial page from the candidates. The network equipment then sends the rubbish text information to the user equipment, which presents it to the user. As another example, the user equipment may first obtain the initial page to be processed and itself determine the one or more pieces of candidate rubbish text information corresponding to the initial page, then send the candidate rubbish text information to the network equipment. The network equipment determines the cheating degree information corresponding to the candidate rubbish text information and, according to it, determines the one or more pieces of rubbish text information corresponding to the initial page from the candidates. The network equipment then sends the rubbish text information to the user equipment, which presents it to the user. Those skilled in the art should understand that, when user equipment and network equipment cooperate to implement the rubbish text determining equipment 1, the division of labor between them may be changed in any suitable way, and all such variations fall within the scope of protection of the present invention.
Specifically, the acquisition device 11 obtains the initial page to be processed through an application programming interface (API) provided by a third-party device such as a browser or search engine; or it obtains, through dynamic web page techniques such as JSP or ASP, a query operation submitted by the user through the user equipment, such as clicking a link in a page, and then obtains the page the link points to through the API provided by the browser; or it obtains, through dynamic web page techniques such as JSP or ASP, the query sequence entered by the user through the user equipment, submits the query sequence to a search engine, and receives the search results that the search engine feeds back for that query sequence as the initial pages to be processed; or it obtains the initial page to be processed through an agreed communication protocol such as HTTP or HTTPS. For example, user A uses a PC to enter the keyword "What should I do if my baby digests milk powder poorly?" in the search box of a search engine such as Baidu Zhidao and clicks the search button. The acquisition device 11 obtains the query sequence entered by user A through dynamic web page techniques such as ASP or JSP, submits a search request to the search engine based on the query sequence, and obtains, through the API provided by the search engine, the one or more search results that the search engine found by matching against the keyword, for example search result1: "What should I do if my baby digests milk powder poorly? _Parenting Q&A_Babytree", search result2: "What should I do if my baby digests milk powder poorly? - Parenting Q&A - Parenting Net", search result3: "What should I do if my baby digests milk powder poorly? _Baidu Zhidao", search result4: "What causes indigestion in a nursing baby? _Baidu Zhidao", and so on, as the initial pages to be processed.
Those skilled in the art will understand that the above ways of obtaining the initial page to be processed are given only as examples; other existing or future ways of obtaining the initial page to be processed, where applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
The candidate determining device 12 determines the one or more pieces of candidate rubbish text information corresponding to the initial page. Here, rubbish text information refers to unimportant information, risk information and the like present in a page. For example, if a user always recommends a certain target object when answering other users' questions, and that target object does not necessarily answer the question, the target object is rubbish text information. The target object refers to any article or service offered to the market that can satisfy some need of a consumer or user. Here, the ways in which the candidate determining device 12 determines the candidate rubbish text information include, but are not limited to, at least any one of the following:
1) Determining the one or more pieces of candidate rubbish text information corresponding to the initial page according to the user operation feature information corresponding to the text strings in the page content information of the initial page. Specifically, the candidate determining device 12 first obtains the page content information of the initial page, for example by analyzing the HTML tags of the initial page or by a wrapper-based extraction method; it then performs semantic analysis on the page content information to obtain the text strings contained in it; and it then determines, according to the user operation feature information corresponding to those text strings, the one or more pieces of candidate rubbish text information corresponding to the initial page, that is, which text strings are candidate rubbish text. Here, the user operation feature information includes, but is not limited to: i) user repetition behavior information corresponding to the text string; ii) user deletion behavior information corresponding to the text string.
For example, suppose the initial page to be processed obtained by the acquisition device 11 is search result2: "What should I do if my baby digests milk powder poorly? - Parenting Q&A - Parenting Net". The candidate determining device 12 first analyzes its HTML tags and performs semantic analysis on the page content information of the initial page search result2, obtaining text strings such as "having some probio to baby", "baby of my family drinks Heng Shi milk powder always, there is not indigestion phenomenon, parent can have a try", "precious milk powder is thought by the Switzerland that can have a try, good absorption, do not get angry can strong baby's enteron aisle and strengthen immunity", "not being milk powder problem" and "baby occurs that such situation proves to be not too applicable to this milk powder, the plate of can trying to change". Suppose the answers containing the text string "Heng Shi milk powder" come from the same user, and that this user has repeatedly posted answers containing "Heng Shi milk powder". This indicates that the user is quite likely recommending "Heng Shi milk powder" maliciously, so the candidate determining device 12 may take "Heng Shi milk powder" as candidate rubbish text information. As another example, suppose the user who posted the answer containing the text string "precious milk powder is thought by Switzerland" has repeatedly posted answers and then deleted them. This indicates a strong suspicion of cheating for the text string "precious milk powder is thought by Switzerland", so the candidate determining device 12 may take "precious milk powder is thought by Switzerland" as candidate rubbish text information.
As another example, suppose the initial page to be processed obtained by the acquisition device 11 is search result3 as shown in Fig. 2: "What should I do if my baby digests milk powder poorly? _Baidu Zhidao", and that this page contains the following answers I to IV to the question:
I: The baby's poor digestion may be caused by the milk powder. Try Jia Beiaite milk powder; I have seen others use it with good results;
II: The stomach and intestines absorb slowly. To solve a baby's indigestion, paediatric specialists often recommend newborn good shellfish probio. It helps the gastrointestinal tract produce various organic acids and digestive enzymes, helps the baby digest food and improves appetite; the lactose, acetic acid and so on that it produces strengthen the baby's intestinal peristalsis and promote digestion;
III: First try changing the milk powder; regularly giving the baby some probiotics can also improve the stomach and aid digestion;
IV: My baby uses Jia Beiaite milk powder and has had no indigestion; parents can give it a try.
The candidate determining device 12 first performs semantic analysis on answers I to IV and obtains text strings such as "try Jia Beiaite milk powder", "to solve a baby's indigestion, paediatric specialists often recommend newborn good shellfish probio", "first try changing the milk powder; regularly giving the baby some probiotics can also improve the stomach and aid digestion" and "my baby uses Jia Beiaite milk powder and has had no indigestion; parents can give it a try". Suppose answers I and IV come from the same user, and that this user also recommends "Jia Beiaite" milk powder when answering other questions about babies digesting milk powder poorly. This indicates that the user is quite likely recommending "Jia Beiaite" milk powder maliciously, so the candidate determining device 12 may take "Jia Beiaite" as candidate rubbish text information. As another example, suppose the user who posted the content containing the text "newborn good shellfish probio" in answer II has repeatedly posted answers and then deleted them. This indicates a strong suspicion of cheating for the text string "newborn good shellfish probio", so the candidate determining device 12 may take "newborn good shellfish probio" as candidate rubbish text information.
Those skilled in the art will understand that the above user operation feature information is given only as an example; other existing or future user operation feature information, where applicable to the present invention, should also fall within the scope of protection of the present invention and is incorporated herein by reference.
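The two user operation features described above (repeated answers and answer-then-delete behavior) can be illustrated with a short sketch. The `Answer` record, the function name and both thresholds are hypothetical choices for illustration; the text does not state how many repetitions or deletions are required.

```python
from collections import Counter
from typing import Iterable, List, NamedTuple


class Answer(NamedTuple):
    user: str      # hypothetical user identifier
    term: str      # text string extracted from the answer
    deleted: bool  # True if the user later deleted this answer


def candidates_from_user_behaviour(
    answers: Iterable[Answer],
    min_repeats: int = 3,  # assumed threshold for "repeatedly answers"
    min_deletes: int = 2,  # assumed threshold for "answers, then deletes"
) -> List[str]:
    answers = list(answers)
    # How often each user posted each term, and how often they deleted it.
    repeats = Counter((a.user, a.term) for a in answers)
    deletes = Counter((a.user, a.term) for a in answers if a.deleted)
    flagged = set()
    for (user, term), count in repeats.items():
        if count >= min_repeats or deletes[(user, term)] >= min_deletes:
            flagged.add(term)
    return sorted(flagged)


# Illustrative records loosely following the example in the text.
sample = (
    [Answer("u1", "Jia Beiaite", False)] * 3
    + [Answer("u2", "newborn good shellfish probio", True)] * 2
    + [Answer("u3", "probiotics", False)]
)
flagged_terms = candidates_from_user_behaviour(sample)
print(flagged_terms)
```

Counting per (user, term) pair rather than per term keeps a string that many different users each mention once from being flagged, which matches the text's focus on one user's repeated behavior.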
2) Performing a matching query in a rubbish text information library according to the text strings in the page content information of the initial page, so as to obtain the one or more pieces of candidate rubbish text information corresponding to the initial page. For example, continuing the example above, the candidate determining device 12 may take the text strings in the page content information of the initial page search result2, such as "having some probio to baby", "baby of my family drinks Heng Shi milk powder always, there is not indigestion phenomenon, parent can have a try", "precious milk powder is thought by the Switzerland that can have a try, good absorption, do not get angry can strong baby's enteron aisle and strengthen immunity", "not being milk powder problem" and "baby occurs that such situation proves to be not too applicable to this milk powder, the plate of can trying to change", and perform a matching query in the rubbish text information library, obtaining candidate rubbish text information for the initial page such as "Heng Shi milk powder" and "precious milk powder is thought by Switzerland". Here, the rubbish text information library may reside in the rubbish text determining equipment 1, or in other equipment connected to the rubbish text determining equipment 1 through a network, such as a server.
Those skilled in the art will understand that the above ways of determining the one or more pieces of candidate rubbish text information corresponding to the initial page are given only as examples; other existing or future ways of determining them, where applicable to the present invention, should also fall within the scope of protection of the present invention and are incorporated herein by reference.
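The library lookup in way 2) can be sketched as follows. The text does not say how matching is performed, so plain substring matching is an assumption here, and the function name and sample data are hypothetical.

```python
from typing import Iterable, List


def match_rubbish_library(strings: Iterable[str], library: Iterable[str]) -> List[str]:
    """Return library entries that occur in at least one page text string.

    Substring matching is an assumption of this sketch; a real system
    might normalise text or use an index instead.
    """
    strings = list(strings)
    found: List[str] = []
    for entry in library:
        if entry not in found and any(entry in s for s in strings):
            found.append(entry)
    return found


# Illustrative page strings and library entries.
page_strings = [
    "my baby always drinks Heng Shi milk powder and has no indigestion",
    "it is not a milk powder problem",
]
rubbish_library = ["Heng Shi milk powder", "Jia Beiaite"]
matches = match_rubbish_library(page_strings, rubbish_library)
print(matches)
```

Iterating over the library rather than the page strings keeps the result in library order and free of duplicates; whether the library lives on the same machine or on a server, as the text allows, only changes where `library` comes from.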
The cheating degree determining device 13 determines the cheating degree information corresponding to the candidate rubbish text information. Here, the cheating degree information reflects the degree to which the candidate rubbish text information is unimportant information and/or carries risk: the larger the cheating degree information of a piece of candidate rubbish text information, the more it is unimportant information and/or the higher its risk. Here, the ways in which the cheating degree determining device 13 determines the cheating degree information corresponding to the candidate rubbish text information include, but are not limited to, at least any one of the following:
1) Determining the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information. Here, the ways in which the cheating degree determining device 13 does so include, but are not limited to, at least any one of the following:
a) The cheating degree determining device 13 may determine the cheating degree information according to the following formula (1):
$$y' = \frac{C}{\sum_{i=1}^{n} \frac{1}{B_i}} \qquad (1)$$
Here, $C$ denotes the library frequency information corresponding to the candidate rubbish text information, for example the number of posts in the knowledge base that contain the candidate rubbish text information. The knowledge base comprises the text database of the website to which the initial page belongs; for forum or Q&A pages, for example, the knowledge base is the text database of the posts and answers that users have published on that website. $B_i$ denotes the user presentation ratio information of user $i$ with respect to the candidate rubbish text information, for example the proportion of all texts published by user $i$ that contain the candidate rubbish text information. The sum $\sum_{i=1}^{n} \frac{1}{B_i}$ expresses the users' willingness to present the candidate rubbish text information: the smaller the sum, the greater the willingness to present and, correspondingly, the larger the $B_i$; the larger the sum, the smaller the willingness to present and, correspondingly, the smaller the $B_i$. $n$ denotes the total number of users who have published text containing the candidate rubbish text information, and $y'$ denotes the cheating degree information. For example, suppose the candidate determining device 12 determines that the candidate rubbish text information of the initial page search result3 shown in Fig. 2, "What should I do if my baby digests milk powder poorly? _Baidu Zhidao", is "Jia Beiaite" and "newborn good shellfish probio", and suppose the website text database (that is, the knowledge base) corresponding to search result3 is post database3. In it, the $C$ value of the candidate "Jia Beiaite" is 1000, and 3 users have mentioned "Jia Beiaite" in their posts, with $B_i$ values of 1/2, 1/3 and 1/5 respectively; the $C$ value of the candidate "newborn good shellfish probio" is 500, and 2 users have mentioned it in their posts, with $B_i$ values of 1/15 and 1/25 respectively. According to formula (1), the cheating degree determining device 13 then calculates the cheating degree information of "Jia Beiaite" and "newborn good shellfish probio" as 1000/(2+3+5) = 100 and 500/(15+25) = 12.5 respectively.
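The worked example for formula (1) can be checked with a few lines of code. This is a minimal sketch: it reads the formula as the storehouse (library) frequency C divided by the sum of the reciprocals 1/B_i, which is the reading that reproduces the two results stated in the text, and the function name is hypothetical. Exact fractions are used so that the arithmetic is not affected by floating-point rounding.

```python
from fractions import Fraction
from typing import Iterable


def cheating_degree(library_frequency: int, presentation_ratios: Iterable[Fraction]) -> Fraction:
    """Formula (1): y' = C / sum(1 / B_i) over the n users who posted the text."""
    ratios = [Fraction(b) for b in presentation_ratios]
    if not ratios:
        # With no posting users the sum is empty and y' is undefined.
        raise ValueError("at least one user presentation ratio is required")
    inverse_sum = sum(1 / b for b in ratios)
    return Fraction(library_frequency) / inverse_sum


# Values taken from the example in the text.
jia_beiaite = cheating_degree(1000, [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5)])
probio = cheating_degree(500, [Fraction(1, 15), Fraction(1, 25)])
print(float(jia_beiaite), float(probio))  # 100.0 12.5
```

Because each 1/B_i grows as a user's presentation ratio shrinks, a string that is frequent in the library but posted by users who rarely mention it ends up with a lower score than one pushed heavily by a few users.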
B) Determining the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information, in combination with the page subject information of the initial page. Specifically, cheating degree determining device 13 first performs word segmentation on the page content information of the initial page to obtain a plurality of keywords corresponding to the initial page, and then performs statistical processing on these keywords, e.g., taking the keyword with the highest number of occurrences as the page subject information of the initial page. Next, cheating degree determining device 13 determines the subject type information corresponding to the page subject information, e.g., whether the page is a presentation page for articles and/or services, so as to determine the adjustment parameter d of the page subject information with respect to the cheating degree information. For example, when the page subject information indicates a presentation page for articles and/or services, the adjustment parameter is d = 1; when it does not, the adjustment parameter is d ∈ (0, 1). The value of d may be predetermined or may be obtained by machine learning. Cheating degree determining device 13 can then determine the cheating degree information according to the following formula (2):
$$y'' = d \cdot y' = d \cdot \frac{C}{\sum_{i=1}^{n} \frac{1}{B_i}} \qquad (2)$$
For example, suppose cheating degree determining device 13 determines that the page subject information of initial page search result3 ("Baby digests milk powder poorly, what should I do? _Baidu Knows") is "reasons for a baby's poor digestion of milk powder", so that the page is not a presentation page for articles and/or services. Cheating degree determining device 13 can then set the adjustment parameter of this page subject information with respect to the cheating degree information to, e.g., d = 0.1, and can calculate according to formula (2) that the cheating degree information corresponding to the candidate rubbish text information "Jia Beiaite" and "newborn good shellfish probio" is 100 × 0.1 = 10 and 12.5 × 0.1 = 1.25, respectively.
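The page-subject adjustment of formula (2) can be sketched as follows. This is only an illustration under stated assumptions: the helper names `page_subject` and `adjusted_degree` are hypothetical, the keyword list is assumed to come from a word segmenter that is not shown, and the default d = 0.1 is simply the value used in the example.

```python
from collections import Counter


def page_subject(keywords):
    """Return the most frequent keyword as the page subject information."""
    if not keywords:
        raise ValueError("the page yielded no keywords")
    return Counter(keywords).most_common(1)[0][0]


def adjusted_degree(base_degree, is_presentation_page, d_other=0.1):
    """Return y'' = d * y' as in formula (2).

    d is 1 for pages presenting articles and/or services and a
    value in (0, 1) otherwise; d_other is an assumed default.
    """
    if not 0.0 < d_other < 1.0:
        raise ValueError("d for non-presentation pages must lie in (0, 1)")
    d = 1.0 if is_presentation_page else d_other
    return d * base_degree


print(page_subject(["digestion", "milk powder", "digestion"]))
print(adjusted_degree(100, False), adjusted_degree(12.5, False))
```

How a page subject is classified as a presentation page is left open in the text, so the sketch takes that decision as a boolean input.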
C) Determining the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information, in combination with the user deletion ratio information corresponding to the candidate rubbish text information. Here, the user deletion ratio information refers to the ratio of the number of times users delete content they published that contains the candidate rubbish text information to the total number of pieces of content they published that contain the candidate rubbish text information. For example, if users have published m posts in total that contain the candidate rubbish text information "candidate garbage text" and have deleted n of these m posts, the user deletion ratio information corresponding to "candidate garbage text" is n/m (in this ratio, n counts deleted posts and is distinct from the user count n in the formulas). Specifically, cheating degree determining device 13 can determine the cheating degree information according to the following formula (3):
$$y'' = (1 + d') \cdot y' = (1 + d') \cdot \frac{C}{\sum_{i=1}^{n} \frac{1}{B_i}} \qquad (3)$$
Here, d' denotes the user deletion ratio information corresponding to the candidate rubbish text information. For example, suppose the user deletion ratio information corresponding to the candidate rubbish text information "Jia Beiaite" and "newborn good shellfish probio" is 0.5 and 0.3, respectively. Cheating degree determining device 13 can then calculate according to formula (3) that the cheating degree information corresponding to "Jia Beiaite" and "newborn good shellfish probio" is 100 × (1 + 0.5) = 150 and 12.5 × (1 + 0.3) = 16.25, respectively.
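The deletion-based adjustment of formula (3) can be sketched as follows; the function names `deletion_ratio` and `degree_with_deletions` are hypothetical and serve only to illustrate the arithmetic.

```python
def deletion_ratio(deleted_posts, total_posts):
    """Return d' = (deleted posts) / (all posts containing the candidate)."""
    if total_posts <= 0:
        raise ValueError("total_posts must be positive")
    if not 0 <= deleted_posts <= total_posts:
        raise ValueError("deleted_posts must lie between 0 and total_posts")
    return deleted_posts / total_posts


def degree_with_deletions(base_degree, ratio):
    """Return y'' = (1 + d') * y' as in formula (3)."""
    return (1.0 + ratio) * base_degree


# Base values 100 and 12.5 and ratios 0.5 and 0.3 are taken from the example.
print(degree_with_deletions(100, 0.5), degree_with_deletions(12.5, 0.3))
```

Because d' lies in [0, 1], this adjustment can at most double the base cheating degree.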
D) Determining the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information, in combination with the presentation probability information corresponding to the candidate rubbish text information. Here, the presentation probability information refers to the probability with which the candidate rubbish text information occurs in the vocabulary of the aforementioned knowledge base. Specifically, cheating degree determining device 13 can determine the cheating degree information according to the following formula (4):
$$y = \frac{C}{\alpha \cdot \sum_{i=1}^{n} \frac{1}{B_i}} \qquad (4)$$
Here, α denotes the presentation probability information corresponding to the candidate rubbish text information, e.g., the probability with which the candidate rubbish text information, whose library frequency is given by the value C, occurs in the vocabulary of the corresponding knowledge base. If users' willingness to present the candidate rubbish text information is strong, they are more likely to use an uncommon word, and the α of the candidate rubbish text information is correspondingly smaller. y denotes the cheating degree information. For example, suppose the presentation probability information corresponding to the candidate rubbish text information "Jia Beiaite" and "newborn good shellfish probio" is 0.1 and 0.5, respectively. Cheating degree determining device 13 can then calculate according to formula (4) that the cheating degree information corresponding to "Jia Beiaite" and "newborn good shellfish probio" is 100 × (1/0.1) = 1000 and 12.5 × (1/0.5) = 25, respectively.
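Formula (4) can be sketched in the same way. The function name `degree_with_probability` is hypothetical, and the requirement that α lie in (0, 1] is an assumption made here because α is described as a probability and appears in the denominator.

```python
def degree_with_probability(library_frequency, presentation_ratios, alpha):
    """Return y = C / (alpha * sum(1 / B_i)) as in formula (4)."""
    if not presentation_ratios:
        raise ValueError("at least one user presentation ratio is required")
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha is assumed to be a probability in (0, 1]")
    reciprocal_sum = sum(1.0 / b for b in presentation_ratios)
    # A rarer term (smaller alpha) yields a larger cheating degree.
    return library_frequency / (alpha * reciprocal_sum)


# Inputs taken from the examples in the text.
y_jia = degree_with_probability(1000, [1 / 2, 1 / 3, 1 / 5], 0.1)
y_probio = degree_with_probability(500, [1 / 15, 1 / 25], 0.5)
print(round(y_jia, 6), round(y_probio, 6))
```

Up to floating-point rounding, the results correspond to the values 1000 and 25 in the example.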
2) First performing word segmentation on each piece of candidate rubbish text information to obtain one or more pieces of segmented word information corresponding to the candidate rubbish text information, and then determining the cheating degree information of the candidate rubbish text information according to the cheating degree information corresponding to those pieces of segmented word information. For example, suppose the candidate rubbish text determined by candidate determining device 12 for an initial page, e.g., initial web, is "Chongqing red building hospital". Cheating degree determining device 13 first performs word segmentation on this candidate rubbish text information and obtains the segmented word information word1 "Chongqing red building" and word2 "red building hospital". Suppose cheating degree determining device 13 has previously determined that the cheating degree information corresponding to word1 "Chongqing red building" and word2 "red building hospital" is y1 and y2, respectively. Cheating degree determining device 13 can then determine the cheating degree information of the candidate rubbish text information "Chongqing red building hospital" from the cheating degree information of word1 and word2, e.g., by taking their mean, so that the cheating degree information corresponding to "Chongqing red building hospital" is (y1 + y2)/2.
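The averaging step of this example can be sketched as follows. The function name `degree_from_segments` and the lookup table are hypothetical, and the numeric values standing in for y1 and y2 are placeholders rather than figures from the patent.

```python
def degree_from_segments(segments, known_degrees):
    """Average the previously determined cheating degrees of the segments."""
    if not segments:
        raise ValueError("the candidate produced no segments")
    scores = [known_degrees[segment] for segment in segments]
    return sum(scores) / len(scores)


# Placeholder values for y1 and y2 (assumed, not from the source).
known = {"Chongqing red building": 80.0, "red building hospital": 40.0}
segments = ["Chongqing red building", "red building hospital"]
print(degree_from_segments(segments, known))
```

A segment missing from the table raises `KeyError` in this sketch; the text does not say how unseen segments should be handled.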
Here, the present invention can determine the cheating degree information of a piece of candidate rubbish text information from the cheating degree information corresponding to its one or more pieces of segmented word information. Thus, by using a collocation library and the previously known cheating degree information of rubbish words, when new candidate rubbish text information is encountered, its cheating degree information can be determined without applying to it the procedure used for determining the cheating degree information of other candidate rubbish text information. This achieves the beneficial effect of simplifying the determination of the cheating degree information of candidate rubbish text information.
Those skilled in the art will understand that the above manners of determining the cheating degree information corresponding to the candidate rubbish text information are merely examples; other existing or future manners of determining the cheating degree information corresponding to the candidate rubbish text information, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.
Rubbish determining device 14 determines, according to the cheating degree information, one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information, e.g., by filtering out the candidate rubbish text information whose cheating degree information meets a predetermined threshold and taking it as the rubbish text information corresponding to the initial page. For example, suppose cheating degree determining device 13 determines that, for initial page search result3 shown in Fig. 2 ("Baby digests milk powder poorly, what should I do? _Baidu Knows"), the cheating degree information corresponding to the candidate rubbish text information "Jia Beiaite" and "newborn good shellfish probio" is 100 × (1/0.1) = 1000 and 12.5 × (1/0.5) = 25, respectively. Rubbish determining device 14 can then determine, according to this cheating degree information, the one or more pieces of rubbish text information corresponding to initial page search result3 from these two pieces of candidate rubbish text information, e.g., taking the candidate rubbish text information "Jia Beiaite", whose cheating degree information meets a predetermined threshold such as 100, as the rubbish text information.
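The threshold filtering described here can be sketched as follows. The function name `select_rubbish` is hypothetical, and reading "meets the threshold" as "greater than or equal to" is an assumption, since the text does not say whether the comparison is strict.

```python
def select_rubbish(candidate_degrees, threshold):
    """Keep the candidates whose cheating degree reaches the threshold."""
    return [
        text
        for text, degree in candidate_degrees.items()
        if degree >= threshold  # assumed non-strict comparison
    ]


# Cheating degrees and threshold taken from the example in the text.
degrees = {"Jia Beiaite": 1000, "newborn good shellfish probio": 25}
print(select_rubbish(degrees, 100))
```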
Those skilled in the art will understand that the above manner of determining the one or more pieces of rubbish text information corresponding to the initial page is merely an example; other existing or future manners of determining the one or more pieces of rubbish text information corresponding to the initial page, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.
The devices of rubbish text determining equipment 1 work continuously with one another. Specifically, acquisition device 11 continuously obtains initial pages to be processed; candidate determining device 12 continuously determines the one or more pieces of candidate rubbish text information corresponding to the initial page; cheating degree determining device 13 continuously determines the cheating degree information corresponding to the candidate rubbish text information; and rubbish determining device 14 continuously determines, according to the cheating degree information, the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information. Here, those skilled in the art will understand that "continuously" means that the devices of rubbish text determining equipment 1 respectively keep performing the obtaining of initial pages, the determining of candidate rubbish text information, the determining of cheating degree information, and the determining of rubbish text information, until rubbish text determining equipment 1 stops obtaining initial pages for an extended period.
Preferably, rubbish text determining equipment 1 further comprises a webpage generating device (not shown) and a generator (not shown). Specifically, the webpage generating device generates a target page corresponding to the initial page, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information; the generator provides the target page to the corresponding user.
Specifically, the webpage generating device generates a target page corresponding to the initial page, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information. Here, the explicit identification information includes but is not limited to the background color, font color, font size, and display mode corresponding to the rubbish text information, e.g., marking with a background color and/or a box that is distinguishable from the initial page, or identifying with floating-layer text. Specifically, the webpage generating device first determines the explicit identification information corresponding to the rubbish text information, e.g., according to the cheating degree information corresponding to the rubbish text information. If the cheating degree information is large, e.g., exceeds a predetermined threshold of 90, the rubbish text information is marked in red and enclosed in a box; if the cheating degree information lies in the interval (50, 90], the rubbish text information is marked in orange. The webpage generating device then updates the initial page according to the explicit identification information corresponding to the rubbish text information, e.g., by identifying the rubbish text information with its corresponding explicit identification information, and thereby generates the target page corresponding to the initial page, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information.
For example, suppose rubbish determining device 14 determines that the rubbish text information corresponding to initial page search result3 is "Jia Beiaite", and that the page background color of initial page search result3 is light grey. The webpage generating device can determine that the explicit identification information corresponding to the rubbish text information "Jia Beiaite" is marking with a color distinguishable from the initial page, e.g., dark grey. Alternatively, the webpage generating device can determine the explicit identification information according to the cheating degree information corresponding to "Jia Beiaite": if that cheating degree information is 100, which exceeds the predetermined threshold of 90, the webpage generating device can determine that "Jia Beiaite" is to be marked in red and enclosed in a box. The webpage generating device then updates initial page search result3 according to the explicit identification information corresponding to "Jia Beiaite", e.g., identification in dark grey, by identifying "Jia Beiaite" with its corresponding explicit identification information, and generates the target page corresponding to the initial page, as shown in Fig. 3, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information.
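The threshold-based marking described above can be sketched as follows. Rendering the marks as HTML `span` elements with inline CSS is an assumption made for illustration; the text only speaks of colors and boxes. The function names `marker_style` and `mark_rubbish` are hypothetical, and the sketch does not handle rubbish texts that overlap one another.

```python
import html


def marker_style(degree):
    """Map a cheating degree to an inline style using thresholds 50 and 90."""
    if degree > 90:
        return "color: red; border: 1px solid red;"  # red text inside a box
    if 50 < degree <= 90:
        return "color: orange;"
    return ""  # at or below 50: left unmarked in this sketch


def mark_rubbish(page_text, rubbish_degrees):
    """Return page_text as HTML with each rubbish text wrapped in a span."""
    marked = html.escape(page_text)
    for text, degree in rubbish_degrees.items():
        style = marker_style(degree)
        if style:
            escaped = html.escape(text)
            marked = marked.replace(
                escaped, f'<span style="{style}">{escaped}</span>'
            )
    return marked


print(mark_rubbish("try Jia Beiaite milk powder", {"Jia Beiaite": 100}))
```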
Those skilled in the art will understand that the above manner of determining the explicit identification information corresponding to the rubbish text information is merely an example; other existing or future manners of determining the explicit identification information corresponding to the rubbish text information, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.
Those skilled in the art will understand that the above manner of generating the target page corresponding to the initial page is merely an example; other existing or future manners of generating the target page corresponding to the initial page, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.
The generator provides the target page to the corresponding user by means of dynamic web page technologies such as ASP, JSP, or PHP, or by other agreed communication means, e.g., communication protocols such as HTTP or HTTPS, so as to alert the user.
More preferably, the webpage generating device comprises a presentation mode determining unit (not shown) and a page generation unit (not shown). Specifically, the presentation mode determining unit determines, according to the cheating degree information corresponding to at least one of the one or more pieces of rubbish text information, the presentation mode corresponding to that at least one piece of rubbish text information; the page generation unit generates, according to the presentation mode, the target page corresponding to the initial page, wherein the target page includes explicit identification information, corresponding to the presentation mode, for at least one of the one or more pieces of rubbish text information.
Specifically, the presentation mode determining unit determines, according to the cheating degree information corresponding to at least one of the one or more pieces of rubbish text information, the presentation mode corresponding to that at least one piece of rubbish text information. Rubbish text information with different cheating degree information is handled in different ways; e.g., rubbish text information whose cheating degree information is greater than a predetermined threshold may be deleted and not displayed. Here, the presentation mode includes but is not limited to the presentation position, presentation color, and presentation manner corresponding to the rubbish text information. For example, suppose rubbish determining device 14 determines that the rubbish text information corresponding to initial page search result3 is "Jia Beiaite" and cheating degree determining device 13 determines that its cheating degree information is 100. The presentation mode determining unit can then determine that the corresponding presentation mode is to mark it with a color distinguishable from initial page search result3, e.g., dark grey, and to display it. As another example, suppose cheating degree determining device 13 determines that the cheating degree information of the rubbish text information "Jia Beiaite" is 1000, which is greater than a predetermined threshold such as 500. The presentation mode determining unit can then determine that the corresponding presentation mode is to delete it and not display it.
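The choice between marking and deleting can be sketched as follows. The mode names "highlight" and "delete", the function names, and the bracket notation used to stand in for a visual mark are all hypothetical; the default threshold of 500 is the value from the example.

```python
def presentation_mode(degree, delete_threshold=500):
    """Choose how a piece of rubbish text is presented on the target page."""
    return "delete" if degree > delete_threshold else "highlight"


def apply_mode(page_text, rubbish_text, degree):
    """Apply the chosen mode; brackets stand in for a visual mark."""
    if presentation_mode(degree) == "delete":
        return page_text.replace(rubbish_text, "")
    return page_text.replace(rubbish_text, f"[{rubbish_text}]")


print(presentation_mode(100), presentation_mode(1000))
print(apply_mode("try Jia Beiaite milk powder", "Jia Beiaite", 100))
```

In a real page the bracket placeholder would be replaced by whatever explicit identification the page format supports.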
Those skilled in the art will understand that the above presentation modes are merely examples; other existing or future presentation modes, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.
Those skilled in the art will understand that the above manner of determining the presentation mode is merely an example; other existing or future manners of determining the presentation mode, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.
The page generation unit generates, according to the presentation mode, the target page corresponding to the initial page, wherein the target page includes explicit identification information, corresponding to the presentation mode, for at least one of the one or more pieces of rubbish text information. Here, the manner in which the page generation unit generates the target page corresponding to the initial page according to the presentation mode is identical or substantially identical to the manner in which the aforementioned webpage generating device generates the target page; for brevity, it is not repeated here and is incorporated herein by reference.
In another preferred embodiment, the above rubbish text determining equipment 1 for determining rubbish text information in a page may be combined with an existing browser to form a new browser. Existing browsers include, for example, the IE browser of Microsoft, the Netscape browser of Netscape, the Firefox browser of Mozilla, the Chrome browser of Google, the Maxthon browser of Maxthon, the Opera browser of Opera, the 360 browser of 360, the Sogou browser of Sohu, and the Tencent TT browser of Tencent.
In another preferred embodiment, the above rubbish text determining equipment 1 for determining rubbish text information in a page may be combined with an existing browser plug-in to form a new browser plug-in. Existing browser plug-ins include, for example, the Flash plug-in, the RealPlayer plug-in, the MMS plug-in, the MIDI staff plug-in, and ActiveX plug-ins.
In another preferred embodiment, the above rubbish text determining equipment 1 for determining rubbish text information in a page may be combined with an existing mobile phone browser app to form a new mobile phone browser app. Existing mobile phone browser apps include, for example, UC browser, UCmobile, UEWEB, the Baidu mobile phone browser, and QQ browser.
In another preferred embodiment, the above rubbish text determining equipment 1 for determining rubbish text information in a page may be combined with an existing search engine to form a new search engine. Existing search engines include but are not limited to the Google search engine of Google, the Baidu search engine of Baidu, and Baidu Knows.
In another preferred embodiment, the above rubbish text determining equipment 1 for determining rubbish text information in a page may be combined with an existing search engine plug-in to form a new search engine plug-in. Existing search engine plug-ins include but are not limited to the Google ToolBar of Google, the Baidu Souba toolbar of Baidu, and the MSN ToolBar of Microsoft.
Fig. 4 illustrates a schematic diagram of equipment for determining rubbish text information in a page in accordance with a preferred embodiment of the present invention, wherein rubbish text determining equipment 1 comprises acquisition device 11', candidate determining device 12', cheating degree determining device 13', and rubbish determining device 14'. Specifically, acquisition device 11' obtains an initial page to be processed; candidate determining device 12' detects character strings in the initial page that meet predetermined rubbish text features and takes those character strings as one or more pieces of candidate rubbish text information; cheating degree determining device 13' determines the cheating degree information corresponding to the candidate rubbish text information; and rubbish determining device 14' determines, according to the cheating degree information, the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information. Here, acquisition device 11', cheating degree determining device 13', and rubbish determining device 14' are identical or substantially identical to the corresponding devices in the embodiment of Fig. 1; for brevity, they are not described again here and are incorporated herein by reference.
Specifically, candidate determining device 12' detects character strings in the initial page that meet predetermined rubbish text features and takes those character strings as one or more pieces of candidate rubbish text information. Here, the predetermined rubbish text features include but are not limited to: 1) character strings that match a predetermined phrase pattern, e.g., recommendation sentence patterns such as "I have used product XXX, it is pretty good, you should try it too", "XXX works well for weight loss", or "try XXX milk powder, I saw people use it before and it worked well"; 2) character strings that match a predetermined prefix feature and/or suffix feature, e.g., character strings whose prefix and/or suffix is a place name; 3) character strings located at a predetermined rubbish text position, e.g., at the beginning or end of a paragraph; 4) character strings that match a predetermined part-of-speech combination, e.g., several consecutive nouns. For example, suppose the initial page to be processed that acquisition device 11' obtains is search result3 shown in Fig. 2 ("Baby digests milk powder poorly, what should I do? _Baidu Knows"), and this page contains answers I to IV to the question. Answer I contains a character string that matches the predetermined phrase pattern "try XXX milk powder, I saw people use it before and it worked well", namely "The baby's poor digestion may be a milk powder problem; try Jia Beiaite milk powder, I saw people use it before and it worked well", and answer IV contains a character string that matches a predetermined phrase pattern, namely "My baby uses Jia Beiaite milk powder and has had no indigestion; parents can give it a try". When candidate determining device 12' performs semantic analysis on the page content information of initial page search result3, it can detect that initial page search result3 contains a character string meeting the predetermined rubbish text features, namely "Jia Beiaite", and takes this character string as the candidate rubbish text information of initial page search result3. As another example, suppose the initial page to be processed that acquisition device 11' obtains is initial web, and this initial page contains the character string "have individual Chongqing red building hospital". When candidate determining device 12' performs syntactic analysis on the page content information of initial page initial web, it can detect that the character string "have individual Chongqing red building hospital" matches a predetermined part-of-speech combination, e.g., several consecutive nouns, and takes this character string as the candidate rubbish text information of initial page initial web.
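Detection by predetermined phrase patterns, feature 1) above, can be sketched with regular expressions as follows. The two English patterns are simplified stand-ins for the recommendation sentence patterns in the example and are assumptions, as is the function name `find_candidates`; the patent itself describes semantic analysis of the page content rather than any particular matching technique.

```python
import re

# Hypothetical patterns; the named group captures the promoted product name.
PHRASE_PATTERNS = [
    re.compile(r"try (?P<name>[\w ]+?) milk powder"),
    re.compile(r"uses (?P<name>[\w ]+?) milk powder"),
]


def find_candidates(answers):
    """Return product names captured by any phrase pattern, without repeats."""
    found = []
    for answer in answers:
        for pattern in PHRASE_PATTERNS:
            for match in pattern.finditer(answer):
                name = match.group("name")
                if name not in found:
                    found.append(name)
    return found


answers = [
    "The baby's poor digestion may be a milk powder problem; "
    "try Jia Beiaite milk powder, I saw people use it before and it worked well",
    "My baby uses Jia Beiaite milk powder and has had no indigestion; "
    "parents can give it a try",
]
print(find_candidates(answers))
```

Both answers yield the same captured name, so the result contains a single candidate.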
Those skilled in the art will understand that the above predetermined rubbish text features are merely examples; other existing or future predetermined rubbish text features, if applicable to the present invention, shall also fall within the protection scope of the present invention and are incorporated herein by reference.
In a preferred embodiment (with reference to Fig. 4), rubbish text determining equipment 1 comprises acquisition device 11', candidate determining device 12', cheating degree determining device 13', rubbish determining device 14', and a pretreatment unit (not shown). The preferred embodiment is described below with reference to Fig. 4. Specifically, acquisition device 11' obtains an initial page to be processed; candidate determining device 12' detects character strings in the initial page that match a predetermined part-of-speech combination and takes those character strings as one or more pieces of candidate rubbish text information; the pretreatment unit preprocesses the one or more pieces of candidate rubbish text information according to the grammar feature information corresponding to the candidate rubbish text information, to obtain one or more pieces of preprocessed candidate rubbish text information; cheating degree determining device 13' determines the cheating degree information corresponding to the preprocessed candidate rubbish text information; and rubbish determining device 14' determines, according to the cheating degree information corresponding to the preprocessed candidate rubbish text information, the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of preprocessed candidate rubbish text information. Here, acquisition device 11' is identical or substantially identical to the corresponding device in the embodiment of Fig. 1; for brevity, it is not described again here and is incorporated herein by reference.
Specifically, candidate determining device 12' detects character strings in the initial page that match a predetermined part-of-speech combination and takes those character strings as one or more pieces of candidate rubbish text information. For example, suppose the initial page to be processed that acquisition device 11' obtains is initial web, and this initial page contains the character string "have individual Chongqing red building hospital". When candidate determining device 12' performs syntactic analysis on the page content information of initial page initial web, it can detect that the character string "have individual Chongqing red building hospital" matches a predetermined part-of-speech combination, e.g., several consecutive nouns, and takes this character string as the candidate rubbish text information of initial page initial web.
The pretreatment unit preprocesses the one or more pieces of candidate rubbish text information according to the grammar feature information corresponding to the candidate rubbish text information, to obtain one or more pieces of preprocessed candidate rubbish text information. Here, the grammar feature information refers to whether the position of the candidate rubbish text information within the whole sentence to which it belongs conforms to the corresponding syntactic structure; e.g., for a verb-object construction, whether the candidate rubbish text information fits the object of that construction, and for a subject-predicate phrase, whether the candidate rubbish text information fits the subject or the object of that phrase. The preprocessing includes but is not limited to cutting and pruning the candidate rubbish text information. For example, continuing the preceding example, suppose the candidate rubbish text "have individual Chongqing red building hospital" that candidate determining device 12' determined in initial page initial web should be the object of a verb-object construction in the sentence to which it belongs, but was not cut along the syntactic boundaries of that sentence. The pretreatment unit then needs to prune this candidate rubbish text, e.g., by splitting it into "have individual / Chongqing red building hospital", so that the candidate rubbish text information obtained after pruning is "Chongqing red building hospital".
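The pruning step in this example can be sketched as follows, assuming the candidate has already been split into tokens with part-of-speech tags by a tagger that is not shown. The function name `prune_candidate`, the tag labels, and the rule of discarding everything before the first noun token are assumptions chosen to reproduce the example, not the patent's stated procedure.

```python
def prune_candidate(tagged_tokens, keep_tags=("noun",)):
    """Drop leading tokens until the first token whose tag is in keep_tags.

    tagged_tokens is a list of (token, tag) pairs; the remaining tokens
    are joined with spaces. An empty string means nothing qualified.
    """
    for index, (_, tag) in enumerate(tagged_tokens):
        if tag in keep_tags:
            return " ".join(token for token, _ in tagged_tokens[index:])
    return ""


# Hypothetical tagging of the split "have individual / Chongqing red building hospital".
tagged = [("have individual", "verb"), ("Chongqing red building hospital", "noun")]
print(prune_candidate(tagged))
```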
Those skilled in the art will understand that the above manner of preprocessing the one or more pieces of candidate rubbish text information is only an example; other existing or future manners of preprocessing the one or more pieces of candidate rubbish text information, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
The cheating degree determining device 13' determines the cheating degree corresponding to the preprocessed candidate rubbish text information. Here, the manner in which the cheating degree determining device 13' does so is the same or substantially the same as the manner in which the cheating degree determining device 13 in Fig. 1 determines the cheating degree corresponding to the candidate rubbish text information; for brevity, it is not repeated here and is incorporated by reference.
The rubbish determining device 14' determines, according to the cheating degree corresponding to the preprocessed candidate rubbish text information, one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of preprocessed candidate rubbish text information. Here, the manner in which the rubbish determining device 14' does so is the same or substantially the same as the manner in which the rubbish determining device 14 in Fig. 1 determines the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information; for brevity, it is not repeated here and is incorporated by reference.
Fig. 5 shows a flowchart of a method for determining rubbish text information in a page according to another aspect of the present invention.
Specifically, in step S1, the rubbish text determining device 1 obtains a pending initial page; in step S2, the rubbish text determining device 1 determines one or more pieces of candidate rubbish text information corresponding to the initial page; in step S3, the rubbish text determining device 1 determines the cheating degree information corresponding to the candidate rubbish text information; in step S4, the rubbish text determining device 1 determines, according to the cheating degree information, one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information. Here, the rubbish text determining device 1 includes, but is not limited to, an internet platform through which users can display their own original content or provide it to other users, such as: i) a network platform or terminal platform that provides information storage space for its logged-in users, so that they can upload and share content such as documents, videos and pictures, and that also lets users read, download and exchange online the content shared by other users, such as Baidu library, beans fourth, Sina like to ask, etc., where the terminal platform includes, but is not limited to, user equipment such as mobile terminals and PCs; ii) a network platform or terminal platform that provides its logged-in users with information access, information sharing, information publishing or synchronization, such as third-party websites including social networking sites, mhkc, forums, knowledge Q&A sharing platforms, spaces, blogs and microblogs. Here, the rubbish text determining device 1 can be implemented by a network device, by user equipment, or by a device formed by integrating a network device and user equipment through a network. The network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud consists of a large number of hosts or network servers based on cloud computing, where cloud computing is a kind of distributed computing: a super virtual computer composed of a group of loosely coupled computers. The user equipment can be any electronic product that can interact with a user through a keyboard, mouse, touch pad, touch screen, handwriting device or the like, such as a computer, mobile phone, PDA, pocket PC (PPC) or tablet computer. The network includes, but is not limited to, the internet, wide area networks, metropolitan area networks, local area networks, VPN networks, wireless ad hoc networks, etc. Those skilled in the art will understand that the above rubbish text determining device 1 is only an example; other existing or future network devices or user equipment, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference. Here, both the network device and the user equipment include an electronic device that can automatically perform numerical computation and information processing according to instructions set or stored in advance, whose hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, etc.
For example, when the rubbish text determining device 1 is implemented by user equipment, it can obtain the page access request submitted by the user through the browser on the user equipment, so as to obtain the pending initial page; it then determines the one or more pieces of candidate rubbish text information corresponding to the initial page; next, it determines the cheating degree information corresponding to the candidate rubbish text information; finally, according to the cheating degree information, it determines the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information, so that this rubbish text information is provided to the user equipment through the browser and thereby to the user.
For example, when the rubbish text determining device 1 is implemented by a network device, it can receive the page access request sent by the user through user equipment, forward this request to a page server, and receive the page returned by the page server for this request, so as to obtain the pending initial page; it then determines the one or more pieces of candidate rubbish text information corresponding to the initial page; next, it determines the cheating degree information corresponding to the candidate rubbish text information; finally, according to the cheating degree information, it determines the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information, and sends this rubbish text information to the user equipment, where it is, for example, displayed by the browser and thereby provided to the user.
For example, when the rubbish text determining device 1 is implemented by user equipment and a network device in cooperation, the user equipment can first obtain the pending initial page and send it to the corresponding network device. The network device then determines the one or more pieces of candidate rubbish text information corresponding to the initial page, determines the cheating degree information corresponding to the candidate rubbish text information and, according to the cheating degree information, determines the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information. The network device then sends this rubbish text information to the user equipment, which provides it to the user. As another example, when the rubbish text determining device 1 is implemented by user equipment and a network device in cooperation, the user equipment can also first obtain the pending initial page and determine the one or more pieces of candidate rubbish text information corresponding to it, and then send the candidate rubbish text information to the network device. The network device determines the cheating degree information corresponding to the candidate rubbish text information and, according to the cheating degree information, determines the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information; it then sends this rubbish text information to the user equipment, which provides it to the user. Here, those skilled in the art should understand that, when the user equipment and the network device cooperate to implement the rubbish text determining device 1, the division of labor between them can be changed as appropriate, and all such variations fall within the scope of protection of the present invention.
Specifically, in step S1, the rubbish text determining device 1 obtains the pending initial page through an application programming interface (API) provided by a third-party device such as a browser or a search engine; or, through dynamic web page technologies such as JSP and ASP, it obtains a query operation submitted by the user through user equipment, such as clicking a link in a page, and then obtains the page the link points to through an API provided by the browser, as the pending initial page; or, through dynamic web page technologies such as JSP and ASP, it obtains the query sequence entered by the user through user equipment, submits this query sequence to a search engine, and receives the search results fed back by the search engine for this query sequence, as the pending initial pages; or it obtains the pending initial page through agreed communication protocols such as HTTP and HTTPS. For example, user A uses a PC to enter the keyword "My baby digests milk powder poorly, what should I do?" into the search box of a search engine such as Baidu Knows and clicks the search button. In step S1, the rubbish text determining device 1 obtains the query sequence entered by user A through dynamic web page technologies such as ASP and JSP, submits a search request based on this query sequence to the search engine, and, through the API provided by the search engine, obtains the one or more search results that the search engine has matched to this keyword, such as search result1: "My baby digests milk powder poorly, what should I do?_Parenting Q&A_Baby Tree", search result2: "What should I do if my baby digests milk powder poorly? - Parenting Q&A - Parenting Net", search result3: "My baby digests milk powder poorly, what should I do?_Baidu Knows", search result4: "What causes indigestion in a nursing baby?_Baidu Knows", etc., as the pending initial pages.
Those skilled in the art will understand that the above manner of obtaining the pending initial page is only an example; other existing or future manners of obtaining the pending initial page, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S2, the rubbish text determining device 1 determines one or more pieces of candidate rubbish text information corresponding to the initial page. Here, rubbish text information refers to the unimportant information, risk information, etc. that exists in a page. For example, when a user always recommends a certain target object while answering other users' questions, and this target object does not necessarily answer the question, this target object is rubbish text information. The target object refers to any article or service offered to the market that can satisfy some need of consumers or users. Here, in step S2, the manner in which the rubbish text determining device 1 determines the candidate rubbish text information includes, but is not limited to, at least any one of the following:
1) Determine the one or more pieces of candidate rubbish text information corresponding to the initial page according to the user operating feature information corresponding to the character strings in the page content information of the initial page. Specifically, in step S2, the rubbish text determining device 1 first obtains the page content information of the initial page, for example by performing HTML tag analysis on the initial page or by an extraction method based on a wrapper; it then performs semantic analysis on the page content information to obtain the text strings it contains; finally, according to the user operating feature information corresponding to these character strings, it determines the one or more pieces of candidate rubbish text information corresponding to the initial page, that is, which text strings are candidate rubbish text. Here, the user operating feature information includes, but is not limited to: i) user repetition behavior information corresponding to the character string; ii) user deletion behavior information corresponding to the character string.
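The HTML tag analysis mentioned above can be sketched with Python's standard `html.parser` module. This is a minimal illustration that only collects the non-empty text between tags; the class name `TextCollector` and the function `extract_strings` are hypothetical, and the semantic analysis described in the text is not attempted here.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the non-empty text chunks found between HTML tags."""

    def __init__(self):
        super().__init__()
        self.strings = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.strings.append(text)


def extract_strings(html):
    """Return the text strings contained in an HTML page."""
    collector = TextCollector()
    collector.feed(html)
    collector.close()
    return collector.strings


page = "<div><p>try Jia Beiaite milk powder</p><p> it is not a milk powder problem </p></div>"
strings = extract_strings(page)
```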
For example, suppose that in step S1 the pending initial page obtained by the rubbish text determining device 1 is search result2: "What should I do if my baby digests milk powder poorly? - Parenting Q&A - Parenting Net". In step S2, the rubbish text determining device 1 first performs HTML tag analysis on it and then performs semantic analysis on the page content information of search result2, obtaining character strings such as "give the baby some probiotics", "my baby has always drunk Heng Shi milk powder and has had no indigestion; parents can give it a try", "you can try 'precious milk powder is thought by Switzerland': it is well absorbed, does not cause internal heat, and can strengthen the baby's intestines and boost immunity", "it is not a milk powder problem", and "this shows the milk powder does not suit the baby very well; try changing brands". Suppose the answers containing the text string "Heng Shi milk powder" all come from the same user, and this user has repeatedly posted answers containing "Heng Shi milk powder"; this indicates that the user is probably recommending "Heng Shi milk powder" maliciously, so in step S2 the rubbish text determining device 1 can take "Heng Shi milk powder" as candidate rubbish text information. As another example, suppose the user who posted the answer containing the text string "precious milk powder is thought by Switzerland" has repeatedly posted answers and then deleted them; this indicates a strong suspicion of cheating for that text string, so in step S2 the rubbish text determining device 1 can take "precious milk powder is thought by Switzerland" as candidate rubbish text information.
As another example, suppose that in step S1 the pending initial page obtained by the rubbish text determining device 1 is search result3 as shown in Fig. 2: "My baby digests milk powder poorly, what should I do?_Baidu Knows", and this page contains the following answers I to IV to the question:
I: The baby's poor digestion may be a milk powder problem; try Jia Beiaite milk powder, which I have seen others use with good results;
II: The stomach absorbs slowly. To solve a baby's indigestion, paediatric specialists often recommend newborn good shellfish probio, which makes the stomach and intestines produce a variety of organic acids and digestive enzymes, helps the baby digest and absorb food, and improves appetite; the lactose, acetic acid, etc. that are produced strengthen the baby's intestinal peristalsis and promote digestion;
III: First try changing the milk powder; giving the baby some probiotics regularly can also improve the stomach and aid digestion;
IV: My baby uses Jia Beiaite milk powder and has had no indigestion; parents can give it a try.
In step S2, the rubbish text determining device 1 first performs semantic analysis on the above answers I to IV, obtaining character strings such as "try Jia Beiaite milk powder", "to solve a baby's indigestion, paediatric specialists often recommend newborn good shellfish probio", "first try changing the milk powder; giving the baby some probiotics regularly can also improve the stomach and aid digestion", and "my baby uses Jia Beiaite milk powder and has had no indigestion; parents can give it a try". Suppose answers I and IV come from the same user, and this user also recommends "Jia Beiaite" milk powder when answering other questions about babies digesting milk powder poorly; this indicates that the user is probably recommending "Jia Beiaite" milk powder maliciously, so in step S2 the rubbish text determining device 1 can take "Jia Beiaite" as candidate rubbish text information. As another example, suppose the user who posted answer II, which contains the text "newborn good shellfish probio", has repeatedly posted answers and then deleted them; this indicates a strong suspicion of cheating for the text string "newborn good shellfish probio", so in step S2 the rubbish text determining device 1 can take "newborn good shellfish probio" as candidate rubbish text information.
Those skilled in the art will understand that the above user operating feature information is only an example; other existing or future user operating feature information, if applicable to the present invention, shall also fall within the scope of protection of the present invention and is incorporated herein by reference.
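The two user operating features can be sketched as follows. The sketch makes several assumptions that are not in the text: posts are given as (user, text, deleted) tuples, the strings to check are supplied by the caller, a string counts as repeated when one user mentions it in at least `min_repeats` posts, and a single deleted post containing the string counts as deletion behavior. All names are hypothetical.

```python
from collections import defaultdict


def candidates_from_user_behaviour(posts, strings, min_repeats=2):
    """Select strings that show repetition or deletion behavior by one user."""
    mentions = defaultdict(int)   # (user, string) -> posts containing it
    deletions = defaultdict(int)  # (user, string) -> deleted posts containing it
    for user, text, deleted in posts:
        for s in strings:
            if s in text:
                mentions[(user, s)] += 1
                if deleted:
                    deletions[(user, s)] += 1
    found = []
    for s in strings:
        repeated = any(n >= min_repeats for (_, t), n in mentions.items() if t == s)
        removed = any(n > 0 for (_, t), n in deletions.items() if t == s)
        if repeated or removed:
            found.append(s)
    return found


posts = [
    ("user1", "try Jia Beiaite milk powder", False),
    ("user1", "my baby uses Jia Beiaite milk powder", False),
    ("user2", "specialists recommend newborn good shellfish probio", True),
    ("user3", "first try changing the milk powder", False),
]
found = candidates_from_user_behaviour(
    posts, ["Jia Beiaite", "newborn good shellfish probio", "Heng Shi"]
)
```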
2) Perform a matching query in a rubbish text information library according to the character strings in the page content information of the initial page, to obtain the one or more pieces of candidate rubbish text information corresponding to the initial page. Continuing the example above, in step S2 the rubbish text determining device 1 can take the character strings in the page content information of the initial page search result2, such as "give the baby some probiotics", "my baby has always drunk Heng Shi milk powder and has had no indigestion; parents can give it a try", "you can try 'precious milk powder is thought by Switzerland': it is well absorbed, does not cause internal heat, and can strengthen the baby's intestines and boost immunity", "it is not a milk powder problem", and "this shows the milk powder does not suit the baby very well; try changing brands", and perform a matching query in the rubbish text information library, obtaining candidate rubbish text information for the initial page such as "Heng Shi milk powder" and "precious milk powder is thought by Switzerland". Here, the rubbish text information library can be located in the rubbish text determining device 1, or in another device connected to the rubbish text determining device 1 through a network, such as a server.
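A matching query against a rubbish text information library can be sketched as a substring lookup. This is an assumption made for illustration: the text does not say how matching is performed, and the function name `match_rubbish_library` and the in-memory list standing in for the library are hypothetical.

```python
def match_rubbish_library(page_strings, rubbish_library):
    """Return the library entries that occur in at least one page string."""
    return [
        entry
        for entry in rubbish_library
        if any(entry in s for s in page_strings)
    ]


page_strings = [
    "give the baby some probiotics",
    "my baby has always drunk Heng Shi milk powder and has had no indigestion",
    "it is not a milk powder problem",
]
library = ["Heng Shi milk powder", "Jia Beiaite"]
matched = match_rubbish_library(page_strings, library)
```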
Those skilled in the art will understand that the above manner of determining the one or more pieces of candidate rubbish text information corresponding to the initial page is only an example; other existing or future manners of determining the one or more pieces of candidate rubbish text information corresponding to the initial page, if applicable to the present invention, shall also fall within the scope of protection of the present invention and are incorporated herein by reference.
In step S3, the rubbish text determining device 1 determines the cheating degree information corresponding to the candidate rubbish text information. Here, the cheating degree information reflects the degree to which the candidate rubbish text information is unimportant information and/or the degree of risk it carries: the larger the cheating degree information corresponding to a piece of candidate rubbish text information, the more it is unimportant information and/or the higher its degree of risk. Here, in step S3, the manner in which the rubbish text determining device 1 determines the cheating degree information corresponding to the candidate rubbish text information includes, but is not limited to, at least any one of the following:
1) Determine the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information. Here, in step S3, the manner in which the rubbish text determining device 1 determines the cheating degree information according to the library frequency information and the user presentation ratio information includes, but is not limited to, at least any one of the following:
A) In step S3, the rubbish text determining device 1 can determine the cheating degree information according to the following formula (5):
$$y' = \frac{C}{\sum_{i=1}^{n} \frac{1}{B_i}} \tag{5}$$
Here, $C$ denotes the library frequency information corresponding to the candidate rubbish text information, such as the number of posts in the knowledge base that contain the candidate rubbish text information. The knowledge base comprises the text database of the website to which the initial page belongs; for forum or Q&A type web pages, for example, the knowledge base is the text database of the posts and answers published by the users of the website. $B_i$ denotes the user presentation ratio information of user $i$ for the candidate rubbish text information, such as the proportion of texts containing the candidate rubbish text information among all texts published by user $i$. The sum $\sum_{i=1}^{n} \frac{1}{B_i}$ expresses the users' willingness to present the candidate rubbish text information: the smaller the sum, the greater the willingness and, correspondingly, the larger the $B_i$; the larger the sum, the smaller the willingness and the smaller the $B_i$. $n$ denotes the total number of users who have published texts containing the candidate rubbish text information, and $y'$ denotes the cheating degree information. For example, suppose that in step S2 the rubbish text determining device 1 determines that the candidate rubbish text information of the initial page search result3 shown in Fig. 2, "My baby digests milk powder poorly, what should I do?_Baidu Knows", is "Jia Beiaite" and "newborn good shellfish probio", and that the website text database (that is, the knowledge base) corresponding to search result3 is post database3. Suppose the value of $C$ for "Jia Beiaite" is 1000, that 3 users have mentioned "Jia Beiaite" in their posts, and that the corresponding values of $B_i$ are 1/2, 1/3 and 1/5; and suppose the value of $C$ for "newborn good shellfish probio" is 500, that 2 users have mentioned "newborn good shellfish probio" in their posts, and that the corresponding values of $B_i$ are 1/15 and 1/25. Then, in step S3, the rubbish text determining device 1 can calculate from formula (5) that the cheating degree information corresponding to "Jia Beiaite" and "newborn good shellfish probio" is 1000/(2+3+5) = 100 and 500/(15+25) = 12.5 respectively.
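The calculation in formula (5) can be reproduced with a short sketch. The function name `cheating_degree` is hypothetical, and exact fractions are used here only to avoid floating-point rounding in the illustration.

```python
from fractions import Fraction


def cheating_degree(library_frequency, presentation_ratios):
    """Formula (5): y' = C / sum(1 / B_i) over all users i."""
    denominator = sum(1 / Fraction(b) for b in presentation_ratios)
    return Fraction(library_frequency) / denominator


# Values taken from the example in the text.
y_jia = cheating_degree(1000, [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5)])
y_probio = cheating_degree(500, [Fraction(1, 15), Fraction(1, 25)])
print(float(y_jia), float(y_probio))  # 100.0 12.5
```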
B) Determine the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information, in combination with the page subject information of the initial page. Specifically, in step S3, the rubbish text determining device 1 first performs word segmentation on the page content information of the initial page to obtain multiple keywords corresponding to the initial page, and then performs statistics on these keywords, for example taking the keyword that occurs most often as the page subject information of the initial page. Then, in step S3, the rubbish text determining device 1 determines the subject type information corresponding to the page subject information, such as whether the page is a presentation page for articles and/or services, in order to determine the adjustment parameter of the page subject information with respect to the cheating degree information: when the page subject information indicates a presentation page for articles and/or services, the adjustment parameter is d = 1; when it does not, the adjustment parameter is d ∈ (0, 1). Here, the value of the adjustment parameter d can be predetermined or can be obtained by machine learning. Then, in step S3, the rubbish text determining device 1 can determine the cheating degree according to the following formula (6):
$$y'' = d \cdot y' = d \cdot \frac{C}{\sum_{i=1}^{n} \frac{1}{B_i}} \tag{6}$$
For example, suppose that in step S3 the rubbish text determining device 1 determines that the page subject information of the initial page search result3, "My baby digests milk powder poorly, what should I do?_Baidu Knows", is "reasons why a baby digests milk powder poorly", which is not a presentation page for articles and/or services. In step S3, the rubbish text determining device 1 can then determine that the adjustment parameter of this page subject information with respect to the cheating degree information is, for example, d = 0.1, and can calculate from formula (6) that the cheating degree information corresponding to the candidate rubbish text information "Jia Beiaite" and "newborn good shellfish probio" is 100*0.1 = 10 and 12.5*0.1 = 1.25 respectively.
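The subject-based adjustment in formula (6) can be sketched as follows. The sketch assumes the keywords have already been produced by word segmentation and that whether the page presents articles or services is decided elsewhere; the function names and the default d = 1/10 (taken from the example above) are illustrative choices.

```python
from collections import Counter
from fractions import Fraction


def page_subject(keywords):
    """Take the most frequent keyword as the page subject information."""
    return Counter(keywords).most_common(1)[0][0]


def adjusted_cheating_degree(base_degree, is_product_page, d=Fraction(1, 10)):
    """Formula (6): y'' = d * y', with d = 1 for article/service pages."""
    factor = Fraction(1) if is_product_page else Fraction(d)
    return factor * base_degree


subject = page_subject(["digestion", "milk powder", "digestion"])
y_jia = adjusted_cheating_degree(100, is_product_page=False)
y_probio = adjusted_cheating_degree(Fraction(25, 2), is_product_page=False)
```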
C) Determine the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information, in combination with the user deletion ratio information corresponding to the candidate rubbish text information. Here, the user deletion ratio information refers to the ratio of the number of times a user has deleted content he or she published that contains the candidate rubbish text information to the number of all contents containing the candidate rubbish text information that the user has published. For example, if a user has published m posts containing the candidate rubbish text information candidate garbage text and has deleted n of these m posts, the user deletion ratio information corresponding to candidate garbage text is n/m. Specifically, in step S3, the rubbish text determining device 1 can determine the cheating degree information according to the following formula (7):
$$y'' = (1 + d') \cdot y' = (1 + d') \cdot \frac{C}{\sum_{i=1}^{n} \frac{1}{B_i}} \tag{7}$$
Here, d' denotes the user deletion ratio information corresponding to the candidate rubbish text information. For example, suppose the user deletion ratio information corresponding to the candidate rubbish text information "Jia Beiaite" and "newborn good shellfish probio" is 0.5 and 0.3 respectively. In step S3, the rubbish text determining device 1 can then calculate from formula (7) that the corresponding cheating degree information is 100*(1+0.5) = 150 and 12.5*(1+0.3) = 16.25 respectively.
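Formula (7) and the deletion ratio n/m can be sketched as follows; the function names are hypothetical and exact fractions are used only to keep the arithmetic free of rounding.

```python
from fractions import Fraction


def deletion_ratio(deleted_posts, total_posts):
    """User deletion ratio n/m for posts containing the candidate."""
    return Fraction(deleted_posts, total_posts)


def degree_with_deletions(base_degree, ratio):
    """Formula (7): y'' = (1 + d') * y'."""
    return (1 + Fraction(ratio)) * base_degree


# Values taken from the example in the text: d' = 0.5 and d' = 0.3.
y_jia = degree_with_deletions(100, Fraction(1, 2))
y_probio = degree_with_deletions(Fraction(25, 2), Fraction(3, 10))
```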
D) Determine the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information, in combination with the presentation probability information corresponding to the candidate rubbish text information. Here, the presentation probability information refers to the probability with which the candidate rubbish text information occurs in the vocabulary contained in the aforementioned knowledge base. Specifically, in step S3, the rubbish text determining device 1 can determine the cheating degree information by the following formula (8):
$$y = \frac{C}{\alpha \cdot \sum_{i=1}^{n} \frac{1}{B_i}} \tag{8}$$
Here, α denotes the presentation probability information corresponding to the candidate rubbish text information, such as the probability with which the candidate rubbish text information counted by the library frequency value C occurs in the vocabulary of the corresponding knowledge base. If users have a strong willingness to present the candidate rubbish text information, they are more likely to use uncommon words, and correspondingly the α of this candidate rubbish text information is smaller; y denotes the cheating degree information. For example, suppose the presentation probability information corresponding to the candidate rubbish text information "Jia Beiaite" and "newborn good shellfish probio" is 0.1 and 0.5 respectively. In step S3, the rubbish text determining device 1 can then calculate from formula (8) that the corresponding cheating degree information is 100*(1/0.1) = 1000 and 12.5*(1/0.5) = 25 respectively.
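Formula (8) can be checked against the example values with the sketch below. The function name `cheating_degree_with_probability` is hypothetical, and the inputs repeat the values of C, B_i and α given in the text.

```python
from fractions import Fraction


def cheating_degree_with_probability(library_frequency, presentation_ratios, alpha):
    """Formula (8): y = C / (alpha * sum(1 / B_i))."""
    total = sum(1 / Fraction(b) for b in presentation_ratios)
    return Fraction(library_frequency) / (Fraction(alpha) * total)


y_jia = cheating_degree_with_probability(
    1000, [Fraction(1, 2), Fraction(1, 3), Fraction(1, 5)], Fraction(1, 10)
)
y_probio = cheating_degree_with_probability(
    500, [Fraction(1, 15), Fraction(1, 25)], Fraction(1, 2)
)
```

A smaller α, meaning a rarer term in the knowledge base vocabulary, raises the resulting cheating degree.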
2) First perform word segmentation on each piece of candidate rubbish text information, to obtain one or more word segments corresponding to the candidate rubbish text information; then determine the cheating degree information according to the cheating degree information corresponding to those word segments. For example, suppose that in step S2 the rubbish text determining device 1 determines that the candidate rubbish text information corresponding to an initial page such as initial web is "Chongqing red building hospital". In step S3, the rubbish text determining device 1 first performs word segmentation on this candidate rubbish text information, obtaining word segments such as word1 "Chongqing red building" and word2 "red building hospital". Suppose the rubbish text determining device 1 has previously determined that the cheating degree information corresponding to word1 "Chongqing red building" and word2 "red building hospital" is y1 and y2 respectively. It can then determine the cheating degree information corresponding to "Chongqing red building hospital" from the cheating degree information of word1 and word2, for example by taking their mean, which gives (y1+y2)/2.
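Taking the mean of the word segments' cheating degrees can be sketched as follows. The dictionary of previously determined degrees and the values 40 and 60 are made up for illustration, since the text only names them y1 and y2.

```python
def degree_from_segments(segments, known_degrees):
    """Mean of the cheating degrees already known for each word segment."""
    values = [known_degrees[segment] for segment in segments]
    return sum(values) / len(values)


# Hypothetical prior values standing in for y1 and y2.
known = {"Chongqing red building": 40, "red building hospital": 60}
y = degree_from_segments(["Chongqing red building", "red building hospital"], known)
```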
At this, the present invention can be according to the corresponding cheating degree of the corresponding one or more points of word informations of candidate's rubbish text information information, determine the cheating degree information of this candidate's rubbish text information, make to utilize collocations storehouse and the corresponding cheating degree of the rubbish word information of priori, running into after new candidate's rubbish text information the mode that need not adopt when determining the corresponding cheating degree of other candidate's rubbish text information information this new candidate's rubbish text information and determine the cheating degree information of its correspondence, realize the beneficial effect of simplifying the cheating degree information of determining candidate's rubbish text information.
Those skilled in the art will be understood that the above-mentioned mode of determining the corresponding cheating degree of described candidate's rubbish text information information is only for giving an example; the mode of the corresponding cheating degree of other definite described candidate's rubbish text information existing or that may occur from now on information is as applicable to the present invention; also should be included in protection domain of the present invention, and be contained in this at this with way of reference.
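The two ways of determining the cheating degree information described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function names are hypothetical, the values 100 and 12.5 are taken from the worked example as the scores obtained before the presentation probability is applied, and the segment scores y1 = 80 and y2 = 40 are invented purely for illustration.

```python
from statistics import mean


def score_with_presentation_probability(base_score: float, alpha: float) -> float:
    """Way 1: scale a base score by 1/alpha, as in the worked example for formula (8)."""
    if alpha <= 0:
        raise ValueError("alpha must be a positive probability")
    return base_score * (1.0 / alpha)


def score_from_segments(segment_scores: list[float]) -> float:
    """Way 2: average the cheating degree of the segmented words."""
    if not segment_scores:
        raise ValueError("at least one segment score is required")
    return mean(segment_scores)


# Way 1: (base score, alpha) pairs from the text.
way1 = {
    "Jia Beiaite": score_with_presentation_probability(100.0, 0.1),
    "newborn good shellfish probio": score_with_presentation_probability(12.5, 0.5),
}

# Way 2: y1 and y2 are hypothetical values for word1 and word2.
y1, y2 = 80, 40
way2 = score_from_segments([y1, y2])  # (y1 + y2) / 2
```

The smaller α is, the larger the resulting score, which matches the statement that a candidate expressed with an uncommon word is treated as more suspicious.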
In step S4, the rubbish text determining equipment 1 determines, according to the cheating degree information, one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information. For example, according to the cheating degree information, it selects from the one or more pieces of candidate rubbish text information those whose cheating degree information meets a predetermined threshold, and takes them as the rubbish text information corresponding to the initial page. For example, suppose that in step S3 the rubbish text determining equipment 1 has determined that, for the initial page search result3 shown in Fig. 2, "Baby digests milk powder poorly, what should I do? _Baidu Knows", the cheating degree information corresponding to the candidate rubbish text information "Jia Beiaite" and "newborn good shellfish probio" is 100*(1/0.1)=1000 and 12.5*(1/0.5)=25 respectively. In step S4, the rubbish text determining equipment 1 can then determine, according to this cheating degree information, the one or more pieces of rubbish text information corresponding to the initial page search result3 from these two pieces of candidate rubbish text information, e.g. by taking the candidate rubbish text information "Jia Beiaite", whose cheating degree information meets a predetermined threshold of, say, 100, as the rubbish text information.
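The threshold filtering of step S4 can be sketched as follows. The sketch assumes that "meets the predetermined threshold" means "greater than or equal to"; the text does not state the comparison, and the function name `select_rubbish_text` is hypothetical.

```python
def select_rubbish_text(cheating_degree: dict[str, float], threshold: float) -> list[str]:
    """Keep the candidates whose cheating degree reaches the threshold (assumed >=)."""
    return [text for text, score in cheating_degree.items() if score >= threshold]


# Scores and threshold from the worked example for search result3.
candidates = {"Jia Beiaite": 1000.0, "newborn good shellfish probio": 25.0}
rubbish = select_rubbish_text(candidates, threshold=100.0)
```

With the example values, only "Jia Beiaite" is retained.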
Those skilled in the art will understand that the above way of determining the one or more pieces of rubbish text information corresponding to the initial page is only an example; other existing or future ways of determining the one or more pieces of rubbish text information corresponding to the initial page, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
The steps performed by the rubbish text determining equipment 1 operate continuously. Specifically, in step S1, the rubbish text determining equipment 1 continuously obtains initial pages to be processed; in step S2, it continuously determines the one or more pieces of candidate rubbish text information corresponding to the initial page; in step S3, it continuously determines the cheating degree information corresponding to the candidate rubbish text information; in step S4, it continuously determines, according to the cheating degree information, the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information. Here, those skilled in the art will understand that "continuously" means that the steps of the rubbish text determining equipment 1 respectively keep performing the obtaining of initial pages, the determination of candidate rubbish text information, the determination of cheating degree information and the determination of rubbish text information, until the rubbish text determining equipment 1 stops obtaining initial pages for an extended period.
Preferably, the rubbish text determining equipment 1 further performs step S5 (not shown) and step S6 (not shown). Specifically, in step S5, the rubbish text determining equipment 1 generates a target page corresponding to the initial page, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information; in step S6, the rubbish text determining equipment 1 provides the target page to the corresponding user.
Specifically, in step S5, the rubbish text determining equipment 1 generates a target page corresponding to the initial page, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information. Here, the explicit identification information includes, but is not limited to, the background color, font color, font size, display mode and the like of the rubbish text information, for example marking it with a background color that is distinguishable from the initial page and/or with a box, or identifying it with floating-layer text. Specifically, in step S5, the rubbish text determining equipment 1 first determines the explicit identification information corresponding to the rubbish text information, for example according to the cheating degree information corresponding to the rubbish text information: if the cheating degree information is large, e.g. exceeds a predetermined threshold of 90, the rubbish text information is marked in red and enclosed in a box; if the cheating degree information lies in the interval (50, 90], it is marked in orange. Then, in step S5, the rubbish text determining equipment 1 updates the initial page according to the explicit identification information corresponding to the rubbish text information, e.g. by marking the rubbish text information with its explicit identification information, thereby generating the target page corresponding to the initial page, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information.
For example, suppose that in step S4 the rubbish text determining equipment 1 has determined that the rubbish text information corresponding to the initial page search result3 is "Jia Beiaite", and that the page background color of the initial page search result3 is light grey. In step S5, the rubbish text determining equipment 1 can determine that the explicit identification information corresponding to the rubbish text information "Jia Beiaite" is marking in a color distinguishable from the initial page, e.g. dark grey. Alternatively, in step S5, the rubbish text determining equipment 1 can determine the explicit identification information according to the cheating degree information corresponding to "Jia Beiaite": if that cheating degree information is 100, it exceeds the predetermined threshold of 90, so the rubbish text determining equipment 1 can determine that the explicit identification information corresponding to "Jia Beiaite" is marking in red and enclosing in a box. Then, in step S5, the rubbish text determining equipment 1 updates the initial page search result3 according to the explicit identification information corresponding to "Jia Beiaite", e.g. marking in dark grey: the rubbish text information "Jia Beiaite" is marked with its explicit identification information, generating the target page corresponding to the initial page, as shown in Fig. 3, wherein the target page includes explicit identification information for at least one of the one or more pieces of rubbish text information.
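The threshold-based choice of explicit identification information in step S5 can be sketched as follows. The thresholds 90 and 50 and the red/orange/box markings come from the text; rendering the marking as an inline-styled HTML `span`, returning `None` for scores of 50 or below, and the function names are assumptions made only for this illustration.

```python
from typing import Optional


def choose_marker(cheating_degree: float) -> Optional[dict]:
    """Map a cheating degree to a marking style, following the thresholds in the text."""
    if cheating_degree > 90:
        return {"color": "red", "box": True}       # above 90: red, enclosed in a box
    if 50 < cheating_degree <= 90:
        return {"color": "orange", "box": False}   # interval (50, 90]: orange
    return None  # assumption: the text does not specify a marking for lower scores


def mark_in_page(page_html: str, term: str, marker: dict) -> str:
    """Wrap every occurrence of `term` in a styled span (hypothetical rendering)."""
    style = f"color: {marker['color']};"
    if marker["box"]:
        style += " border: 1px solid;"
    return page_html.replace(term, f'<span style="{style}">{term}</span>')


marker = choose_marker(100)
target_page = mark_in_page("Try Jia Beiaite milk powder", "Jia Beiaite", marker)
```

A production page generator would operate on the parsed page rather than on raw strings; plain string replacement is used here only to keep the sketch short.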
Those skilled in the art will understand that the above way of determining the explicit identification information corresponding to the rubbish text information is only an example; other existing or future ways of determining the explicit identification information corresponding to the rubbish text information, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
Those skilled in the art will understand that the above way of generating the target page corresponding to the initial page is only an example; other existing or future ways of generating the target page corresponding to the initial page, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In step S6, the rubbish text determining equipment 1 provides the target page to the corresponding user, so as to alert the user, by means of dynamic web page technologies such as ASP, JSP or PHP, or by other agreed communication modes, e.g. communication protocols such as http or https.
More preferably, step S5 comprises step S51 (not shown) and step S52 (not shown). Specifically, in step S51, the rubbish text determining equipment 1 determines, according to the cheating degree information corresponding to at least one of the one or more pieces of rubbish text information, the presentation mode corresponding to that at least one piece of rubbish text information; in step S52, the rubbish text determining equipment 1 generates, according to the presentation mode, the target page corresponding to the initial page, wherein the target page includes explicit identification information, corresponding to the presentation mode, for at least one of the one or more pieces of rubbish text information.
Specifically, in step S51, the rubbish text determining equipment 1 determines, according to the cheating degree information corresponding to at least one of the one or more pieces of rubbish text information, the presentation mode corresponding to that at least one piece of rubbish text information. Rubbish text information with different cheating degree information is handled differently; for example, rubbish text information whose cheating degree information is greater than a predetermined threshold can be deleted and not displayed. Here, the presentation mode includes, but is not limited to, the presentation position, presentation color, presentation manner and the like of the rubbish text information. For example, suppose that in step S4 the rubbish text determining equipment 1 has determined that the rubbish text information corresponding to the initial page search result3 is "Jia Beiaite", and that in step S3 it has determined that the corresponding cheating degree information is 100. In step S51, the rubbish text determining equipment 1 can then determine that the corresponding presentation mode is to mark the rubbish text information in a color distinguishable from the initial page search result3, e.g. dark grey, and to display it. As another example, suppose that in step S3 the rubbish text determining equipment 1 has determined that the cheating degree information of the rubbish text information "Jia Beiaite" is 1000, which is greater than a predetermined threshold of, say, 500. In step S51, the rubbish text determining equipment 1 can then determine that the corresponding presentation mode is to delete the rubbish text information and not display it.
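The choice of presentation mode in step S51 can be sketched as follows. The deletion threshold of 500 and the two example scores (100 and 1000) come from the text; the mode names, the function names and the use of square brackets as a stand-in for the dark grey marking are assumptions made for illustration.

```python
def presentation_mode(cheating_degree: float, delete_threshold: float = 500.0) -> str:
    """Return 'delete' above the threshold, otherwise 'highlight' (hypothetical mode names)."""
    return "delete" if cheating_degree > delete_threshold else "highlight"


def render(page_text: str, term: str, cheating_degree: float) -> str:
    """Apply the presentation mode to every occurrence of `term` in the page text."""
    if presentation_mode(cheating_degree) == "delete":
        return page_text.replace(term, "")
    # Square brackets stand in for a visual marking such as a dark grey color.
    return page_text.replace(term, f"[{term}]")


shown = render("Try Jia Beiaite milk powder", "Jia Beiaite", 100)
hidden = render("Try Jia Beiaite milk powder", "Jia Beiaite", 1000)
```

Keeping the mode decision separate from the rendering mirrors the split between steps S51 and S52.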
Those skilled in the art will understand that the above presentation modes are only examples; other existing or future presentation modes, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
Those skilled in the art will understand that the above way of determining the presentation mode is only an example; other existing or future ways of determining the presentation mode, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In step S52, the rubbish text determining equipment 1 generates, according to the presentation mode, the target page corresponding to the initial page, wherein the target page includes explicit identification information, corresponding to the presentation mode, for at least one of the one or more pieces of rubbish text information. Here, the way in which the rubbish text determining equipment 1 generates the target page according to the presentation mode in step S52 is identical or substantially identical to the way in which it generates the target page in step S5 described above; for brevity, it is not repeated here and is incorporated herein by reference.
Fig. 6 shows a flowchart of a method for determining rubbish text information in a page according to a preferred embodiment of the present invention.
The method comprises step S1', step S2', step S3' and step S4'. Specifically, in step S1', the rubbish text determining equipment 1 obtains an initial page to be processed; in step S2', the rubbish text determining equipment 1 detects, in the initial page, character strings that meet predetermined rubbish features, and takes the character strings that meet the predetermined rubbish features as one or more pieces of candidate rubbish text information; in step S3', the rubbish text determining equipment 1 determines the cheating degree information corresponding to the candidate rubbish text information; in step S4', the rubbish text determining equipment 1 determines, according to the cheating degree information, the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information. Here, steps S1', S3' and S4' are identical or substantially identical to the corresponding steps in the embodiment of Fig. 5; for brevity, they are not repeated here and are incorporated herein by reference.
Specifically, in step S2', the rubbish text determining equipment 1 detects, in the initial page, character strings that meet predetermined rubbish features, and takes the character strings that meet the predetermined rubbish features as one or more pieces of candidate rubbish text information. Here, the predetermined rubbish features include, but are not limited to: 1) character strings that match a predetermined phrase pattern, e.g. recommendation patterns such as "I have used the XXX product, it is quite good, you should try it too", "XXX works well for weight loss", or "try XXX milk powder, I have seen others use it with good results"; 2) character strings that match a predetermined prefix feature and/or suffix feature, e.g. character strings whose prefix and/or suffix is a place name; 3) character strings located at a predetermined rubbish text position, e.g. character strings at the beginning or end of a paragraph; 4) character strings that match a predetermined part-of-speech combination, e.g. several consecutive nouns. For example, suppose that in step S1' the initial page to be processed that the rubbish text determining equipment 1 obtains is search result3 as shown in Fig. 2, "Baby digests milk powder poorly, what should I do? _Baidu Knows", and that this page contains answers I to IV to the question. Answer I contains a character string that matches the predetermined phrase pattern "try XXX milk powder, I have seen others use it with good results", namely "The baby's digestion is poor, it may be a problem with the milk powder; try Jia Beiaite milk powder, I have seen others use it with good results", and answer IV contains a character string that matches a predetermined phrase pattern, namely "My baby uses Jia Beiaite milk powder and has had no indigestion; parents can give it a try". In step S2', when the rubbish text determining equipment 1 performs semantic analysis on the page content information of the initial page search result3, it can detect that the page contains a character string that meets the predetermined rubbish features, e.g. "Jia Beiaite"; the rubbish text determining equipment 1 then takes the character string "Jia Beiaite" as candidate rubbish text information of the initial page search result3. As another example, suppose that in step S1' the initial page to be processed that the rubbish text determining equipment 1 obtains is initial web, and that this initial page contains the character string "have individual Chongqing red building hospital". In step S2', when the rubbish text determining equipment 1 performs syntactic analysis on the page content information of the initial page initial web, it can detect that the character string "have individual Chongqing red building hospital" matches a predetermined part-of-speech combination, e.g. several consecutive nouns; the rubbish text determining equipment 1 then takes the character string "have individual Chongqing red building hospital" as candidate rubbish text information of the initial page initial web.
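Detection by predetermined phrase pattern, the first of the rubbish features listed above, can be sketched with regular expressions. The two patterns below are hypothetical English renderings of recommendation patterns, with the named group `name` playing the role of the "XXX" slot; the function name and the English answer texts are likewise illustrative, and real patterns would be written for the language of the page.

```python
import re

# Hypothetical English renderings of two recommendation patterns.
PATTERNS = [
    re.compile(r"\btry (?P<name>.+?) milk powder", re.IGNORECASE),
    re.compile(r"\buses? (?P<name>.+?) milk powder", re.IGNORECASE),
]


def find_candidates(answers: list[str]) -> list[str]:
    """Return the distinct strings captured by the 'XXX' slot, in order of appearance."""
    found: list[str] = []
    for answer in answers:
        for pattern in PATTERNS:
            for match in pattern.finditer(answer):
                name = match.group("name").strip()
                if name not in found:
                    found.append(name)
    return found


answers = [
    "The baby's digestion is poor, it may be a problem with the milk powder; "
    "try Jia Beiaite milk powder, I have seen others use it with good results",
    "My baby uses Jia Beiaite milk powder and has had no indigestion; "
    "parents can give it a try",
]
candidates = find_candidates(answers)
```

Each answer is scanned separately so that a pattern cannot match across the boundary between two answers.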
Those skilled in the art will understand that the above predetermined rubbish features are only examples; other existing or future predetermined rubbish features, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In a preferred embodiment (with reference to Fig. 6), the method comprises step S1', step S2', step S3', step S4' and step S7' (not shown). This preferred embodiment is described below with reference to Fig. 6. Specifically, in step S1', the rubbish text determining equipment 1 obtains an initial page to be processed; in step S2', the rubbish text determining equipment 1 detects, in the initial page, character strings that match a predetermined part-of-speech combination, and takes those character strings as one or more pieces of candidate rubbish text information; in step S7', the rubbish text determining equipment 1 pre-processes the one or more pieces of candidate rubbish text information according to the grammatical feature information corresponding to the candidate rubbish text information, to obtain one or more pieces of pre-processed candidate rubbish text information; in step S3', the rubbish text determining equipment 1 determines the cheating degree information corresponding to the pre-processed candidate rubbish text information; in step S4', the rubbish text determining equipment 1 determines, according to the cheating degree information corresponding to the pre-processed candidate rubbish text information, the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of pre-processed candidate rubbish text information. Here, step S1' is identical or substantially identical to the corresponding step in the embodiment of Fig. 5; for brevity, it is not repeated here and is incorporated herein by reference.
Specifically, in step S2', the rubbish text determining equipment 1 detects, in the initial page, character strings that match a predetermined part-of-speech combination, and takes those character strings as one or more pieces of candidate rubbish text information. For example, suppose that in step S1' the initial page to be processed that the rubbish text determining equipment 1 obtains is initial web, and that this initial page contains the character string "have individual Chongqing red building hospital". In step S2', when the rubbish text determining equipment 1 performs syntactic analysis on the page content information of the initial page initial web, it can detect that the character string "have individual Chongqing red building hospital" matches a predetermined part-of-speech combination, e.g. several consecutive nouns; the rubbish text determining equipment 1 then takes the character string "have individual Chongqing red building hospital" as candidate rubbish text information of the initial page initial web.
In step S7', the rubbish text determining equipment 1 pre-processes the one or more pieces of candidate rubbish text information according to the grammatical feature information corresponding to the candidate rubbish text information, to obtain one or more pieces of pre-processed candidate rubbish text information. Here, the grammatical feature information indicates whether the position of the candidate rubbish text information within the whole sentence to which it belongs fits the corresponding syntactic structure: for a verb-object construction, whether the candidate rubbish text information fits the object of that construction; for a subject-predicate phrase, whether it fits the subject or the object of that phrase. Here, the pre-processing includes, but is not limited to, splitting and pruning the candidate rubbish text information. For example, continuing the example above, suppose that the candidate rubbish text "have individual Chongqing red building hospital" that the rubbish text determining equipment 1 determined in the initial page initial web in step S2' should be the object of a verb-object construction in the sentence to which it belongs, but is split by the syntax of that sentence. In step S7', the rubbish text determining equipment 1 then needs to prune this candidate rubbish text, e.g. by splitting it into "have individual / Chongqing red building hospital", so that the candidate rubbish text information obtained after pruning is "Chongqing red building hospital".
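The pruning in step S7' can be sketched as follows. This is a deliberate simplification: instead of a full syntactic analysis of the verb-object construction, it assumes that the candidate has already been tokenized and tagged with parts of speech, and it keeps only the first run of consecutive nouns. The tag names ("v", "q", "n"), the tokenization and the function name are hypothetical.

```python
def prune_to_noun_run(tagged_tokens: list[tuple[str, str]], noun_tag: str = "n") -> str:
    """Drop tokens before the first noun and keep the run of nouns that follows."""
    kept: list[str] = []
    for token, tag in tagged_tokens:
        if tag == noun_tag:
            kept.append(token)
        elif kept:
            break  # the noun run has ended
    return " ".join(kept)


# Hypothetical tagging of the candidate from the example.
tagged = [
    ("have", "v"),
    ("individual", "q"),
    ("Chongqing", "n"),
    ("red building", "n"),
    ("hospital", "n"),
]
pruned = prune_to_noun_run(tagged)
```

With this tagging, the leading verb and measure word are removed and the object "Chongqing red building hospital" remains.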
Those skilled in the art will understand that the above way of pre-processing the one or more pieces of candidate rubbish text information is only an example; other existing or future ways of pre-processing the one or more pieces of candidate rubbish text information, if applicable to the present invention, should also fall within the protection scope of the present invention and are incorporated herein by reference.
In step S3', the rubbish text determining equipment 1 determines the cheating degree information corresponding to the pre-processed candidate rubbish text information. Here, the way in which the rubbish text determining equipment 1 determines the cheating degree information corresponding to the pre-processed candidate rubbish text information in step S3' is identical or substantially identical to the way in which it determines the cheating degree information corresponding to the candidate rubbish text information in step S3 of Fig. 5; for brevity, it is not repeated here and is incorporated herein by reference.
In step S4', the rubbish text determining equipment 1 determines, according to the cheating degree information corresponding to the pre-processed candidate rubbish text information, the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of pre-processed candidate rubbish text information. Here, the way in which the rubbish text determining equipment 1 determines the one or more pieces of rubbish text information corresponding to the initial page from the pre-processed candidate rubbish text information in step S4' is identical or substantially identical to the way in which it determines the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information in step S4 of Fig. 5; for brevity, it is not repeated here and is incorporated herein by reference.
It should be noted that the present invention can be implemented in software and/or in a combination of software and hardware, for example using an application-specific integrated circuit (ASIC), a general-purpose computer or any other similar hardware device. In one embodiment, the software program of the present invention can be executed by a processor to implement the steps or functions described above. Likewise, the software program of the present invention (including related data structures) can be stored in a computer-readable recording medium, for example a RAM memory, a magnetic or optical drive, a floppy disk or a similar device. In addition, some steps or functions of the present invention can be implemented in hardware, for example as a circuit that cooperates with a processor to perform each step or function.
In addition, part of the present invention can be applied as a computer program product, for example computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of that computer. The program instructions that invoke the method of the present invention may be stored in a fixed or removable recording medium, and/or transmitted via a data stream in a broadcast or other signal-bearing medium, and/or stored in the working memory of a computer device that runs according to the program instructions. Here, one embodiment of the present invention comprises an apparatus that includes a memory for storing computer program instructions and a processor for executing the program instructions, wherein, when the computer program instructions are executed by the processor, the apparatus is triggered to operate on the basis of the methods and/or technical solutions according to the foregoing embodiments of the present invention.
It is obvious to those skilled in the art that the present invention is not limited to the details of the above exemplary embodiments, and that the present invention can be realized in other specific forms without departing from its spirit or essential characteristics. Therefore, the embodiments should in every respect be regarded as exemplary and non-restrictive; the scope of the present invention is defined by the appended claims rather than by the above description, and all changes that fall within the meaning and scope of equivalents of the claims are therefore intended to be embraced by the present invention. No reference numeral in a claim should be construed as limiting the claim concerned. In addition, the word "comprising" obviously does not exclude other units or steps, and the singular does not exclude the plural. Multiple units or devices recited in a device claim may also be implemented by a single unit or device through software or hardware. Words such as "first" and "second" are used to denote names and do not denote any particular order.

Claims (20)

1. A method for determining rubbish text information in a page, wherein the method comprises the following steps:
a. obtaining an initial page to be processed;
b. determining one or more pieces of candidate rubbish text information corresponding to the initial page;
c. determining cheating degree information corresponding to the candidate rubbish text information;
d. determining, according to the cheating degree information, one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of candidate rubbish text information.
2. The method according to claim 1, wherein step b comprises:
- detecting, in the initial page, character strings that meet predetermined rubbish features, and taking the character strings that meet the predetermined rubbish features as one or more pieces of candidate rubbish text information.
3. The method according to claim 2, wherein step b comprises:
- detecting, in the initial page, character strings that match a predetermined part-of-speech combination, and taking the character strings that match the predetermined part-of-speech combination as one or more pieces of candidate rubbish text information;
wherein the method further comprises:
- pre-processing the one or more pieces of candidate rubbish text information according to grammatical feature information corresponding to the candidate rubbish text information, to obtain one or more pieces of pre-processed candidate rubbish text information;
wherein step c comprises:
- determining the cheating degree information corresponding to the pre-processed candidate rubbish text information;
wherein step d comprises:
- determining, according to the cheating degree information corresponding to the pre-processed candidate rubbish text information, the one or more pieces of rubbish text information corresponding to the initial page from the one or more pieces of pre-processed candidate rubbish text information.
4. The method according to claim 1 or 2, wherein step c comprises:
- determining the cheating degree information according to library frequency information and user presentation ratio information corresponding to the candidate rubbish text information.
5. The method according to claim 4, wherein step c comprises:
- determining the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information, in combination with page subject information of the initial page.
6. The method according to claim 4, wherein step c comprises:
- determining the cheating degree information according to the library frequency information and the user presentation ratio information corresponding to the candidate rubbish text information, in combination with presentation probability information corresponding to the candidate rubbish text information.
7. The method according to claim 6, wherein step c comprises:
- determining the cheating degree information in the following manner:
$$y = \frac{C}{\alpha} \cdot \sum_{i=1}^{n} \frac{1}{B_i}$$
wherein C denotes the library frequency information corresponding to the candidate rubbish text information, $B_i$ denotes the user presentation ratio information of user i with respect to the candidate rubbish text information, n denotes the total number of users who have published text containing the candidate rubbish text information, α denotes the presentation probability information corresponding to the candidate rubbish text information, and y denotes the cheating degree information.
8. The method according to claim 1 or 2, wherein step c comprises:
- performing word segmentation on each piece of candidate rubbish text information to obtain one or more pieces of segmented word information corresponding to the candidate rubbish text information;
- determining the cheating degree information according to the cheating degree information corresponding to the one or more pieces of segmented word information of the candidate rubbish text information.
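A rough sketch of the segmentation-based scoring in claim 8 is given below. The claim does not say how the per-word cheating degrees are combined, so the arithmetic mean used here is an assumption, as are the whitespace segmenter (a real system would use a proper word segmenter for Chinese text) and the hypothetical `word_scores` lookup table.

```python
from typing import Dict, List


def segment(text: str) -> List[str]:
    # Stand-in segmenter (assumption): split on whitespace.
    return text.split()


def cheating_degree_from_words(candidate: str,
                               word_scores: Dict[str, float],
                               default: float = 0.0) -> float:
    """Combine per-word cheating degrees into one score for the candidate.

    Assumption: unknown words get `default`, and the per-word scores
    are combined with an arithmetic mean.
    """
    words = segment(candidate)
    if not words:
        return default
    scores = [word_scores.get(word, default) for word in words]
    return sum(scores) / len(scores)
```

Other aggregations, such as the maximum per-word score, would fit the wording of the claim equally well.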
9. The method according to any one of claims 1 to 8, wherein the method further comprises:
m. generating a target page corresponding to the initial page, wherein the target page comprises explicit identification information for at least one of the one or more pieces of rubbish text information;
- providing the target page to the corresponding user.
10. The method according to claim 9, wherein step m comprises:
- determining, according to the cheating degree information corresponding to at least one of the one or more pieces of rubbish text information, a presentation mode corresponding to the at least one of the one or more pieces of rubbish text information;
- generating, according to the presentation mode, the target page corresponding to the initial page, wherein the target page comprises explicit identification information, corresponding to the presentation mode, for the at least one of the one or more pieces of rubbish text information.
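Claims 9 and 10 can be illustrated with the sketch below, which picks a presentation mode from the cheating degree and marks the rubbish text in the generated page. The two mode names, the threshold value, and the HTML `span` markup are all assumptions made for illustration; the claims do not prescribe any of them.

```python
import html
from typing import Dict


def presentation_mode(cheating_degree: float,
                      strong_threshold: float = 100.0) -> str:
    # Assumption: two hypothetical modes separated by one threshold.
    return "strong" if cheating_degree >= strong_threshold else "mild"


def build_target_page(initial_text: str,
                      rubbish_scores: Dict[str, float]) -> str:
    """Return page markup in which each rubbish text is explicitly marked."""
    page = html.escape(initial_text)
    for text, score in rubbish_scores.items():
        mode = presentation_mode(score)
        escaped = html.escape(text)
        marked = f'<span class="rubbish-{mode}">{escaped}</span>'
        # Simplification: plain string replacement, so a rubbish text that
        # also occurs inside earlier inserted markup would be marked again.
        page = page.replace(escaped, marked)
    return page
```

A style sheet on the target page could then render the `rubbish-strong` and `rubbish-mild` classes differently, for instance with different highlight colours.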
11. A rubbish text determining device for determining rubbish text information in a page, wherein the rubbish text determining device comprises:
an acquisition device, configured to obtain an initial page to be processed;
a candidate determining device, configured to determine one or more pieces of candidate rubbish text information corresponding to the initial page;
a cheating degree determining device, configured to determine cheating degree information corresponding to the candidate rubbish text information;
a rubbish determining device, configured to determine, according to the cheating degree information, one or more pieces of rubbish text information corresponding to the initial page from among the one or more pieces of candidate rubbish text information.
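The four devices of claim 11 form a simple pipeline, sketched below. The claim only says that rubbish text is selected "according to" the cheating degree, so the fixed threshold used for the final selection is an assumption, and the two callables are hypothetical placeholders for the candidate determining and cheating degree determining devices.

```python
from typing import Callable, Dict, Iterable, List


def determine_rubbish_text(
    initial_page: str,
    find_candidates: Callable[[str], Iterable[str]],
    score_candidate: Callable[[str], float],
    threshold: float,
) -> List[str]:
    """Run candidate detection, scoring, and selection on one page."""
    # Candidate determining device followed by cheating degree determining device.
    scores: Dict[str, float] = {
        candidate: score_candidate(candidate)
        for candidate in find_candidates(initial_page)
    }
    # Rubbish determining device (assumption: keep scores at or above a threshold).
    return [candidate for candidate, score in scores.items() if score >= threshold]
```

Passing the detection and scoring steps in as functions keeps the sketch independent of any particular candidate detector or scoring formula.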
12. The rubbish text determining device according to claim 11, wherein the candidate determining device is configured to:
- detect, in the initial page, character strings that match predetermined spam characteristics, and take the character strings that match the predetermined spam characteristics as the one or more pieces of candidate rubbish text information.
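One way to picture the detection in claim 12 is pattern matching over the page text, as in the sketch below. The two regular expressions (long digit runs, such as contact numbers, and URLs) are hypothetical examples of "predetermined spam characteristics"; the claim itself does not list specific patterns.

```python
import re
from typing import List

# Hypothetical spam characteristics (assumptions, not from the patent).
SPAM_PATTERNS = [
    re.compile(r"\d{6,}"),         # runs of six or more digits
    re.compile(r"https?://\S+"),   # bare URLs
]


def find_candidates(page_text: str) -> List[str]:
    """Return the distinct substrings that match any spam pattern."""
    candidates: List[str] = []
    for pattern in SPAM_PATTERNS:
        for match in pattern.finditer(page_text):
            text = match.group(0)
            if text not in candidates:  # keep first occurrence only
                candidates.append(text)
    return candidates
```

Candidates found this way would still need to be scored before any of them is treated as rubbish text.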
13. The rubbish text determining device according to claim 12, wherein the candidate determining device is configured to:
- detect, in the initial page, character strings that match a predetermined part-of-speech combination, and take the character strings that match the predetermined part-of-speech combination as the one or more pieces of candidate rubbish text information;
wherein the rubbish text determining device further comprises:
a preprocessing unit, configured to preprocess the one or more pieces of candidate rubbish text information according to grammatical feature information corresponding to the candidate rubbish text information, to obtain one or more pieces of preprocessed candidate rubbish text information;
wherein the cheating degree determining device is configured to:
- determine the cheating degree information corresponding to the preprocessed candidate rubbish text information;
wherein the rubbish determining device is configured to:
- determine, according to the cheating degree information corresponding to the preprocessed candidate rubbish text information, one or more pieces of rubbish text information corresponding to the initial page from among the one or more pieces of preprocessed candidate rubbish text information.
14. The rubbish text determining device according to claim 11 or 12, wherein the cheating degree determining device is configured to:
- determine the cheating degree information according to library frequency information and user occurrence ratio information corresponding to the candidate rubbish text information.
15. The rubbish text determining device according to claim 14, wherein the cheating degree determining device is configured to:
- determine the cheating degree information according to the library frequency information and the user occurrence ratio information corresponding to the candidate rubbish text information, in combination with page topic information of the initial page.
16. The rubbish text determining device according to claim 14, wherein the cheating degree determining device is configured to:
- determine the cheating degree information according to the library frequency information and the user occurrence ratio information corresponding to the candidate rubbish text information, in combination with occurrence probability information corresponding to the candidate rubbish text information.
17. The rubbish text determining device according to claim 16, wherein the cheating degree determining device is configured to:
- determine the cheating degree information in the following manner:
$$ y = \frac{C}{\alpha} \cdot \sum_{i=1}^{n} \frac{1}{B_i} $$
wherein C denotes the library frequency information corresponding to the candidate rubbish text information, B_i denotes the user occurrence ratio information of user i with respect to the candidate rubbish text information, n denotes the total number of users who have posted text containing the candidate rubbish text information, α denotes the occurrence probability information corresponding to the candidate rubbish text information, and y denotes the cheating degree information.
18. The rubbish text determining device according to claim 11 or 12, wherein the cheating degree determining device is configured to:
- perform word segmentation on each piece of candidate rubbish text information to obtain one or more pieces of segmented word information corresponding to the candidate rubbish text information;
- determine the cheating degree information according to the cheating degree information corresponding to the one or more pieces of segmented word information of the candidate rubbish text information.
19. The rubbish text determining device according to any one of claims 11 to 18, wherein the rubbish text determining device further comprises:
a webpage generating device, configured to generate a target page corresponding to the initial page, wherein the target page comprises explicit identification information for at least one of the one or more pieces of rubbish text information;
a providing device, configured to provide the target page to the corresponding user.
20. The rubbish text determining device according to claim 19, wherein the webpage generating device comprises:
a presentation mode determining unit, configured to determine, according to the cheating degree information corresponding to at least one of the one or more pieces of rubbish text information, a presentation mode corresponding to the at least one of the one or more pieces of rubbish text information;
a page generation unit, configured to generate, according to the presentation mode, the target page corresponding to the initial page, wherein the target page comprises explicit identification information, corresponding to the presentation mode, for the at least one of the one or more pieces of rubbish text information.
CN201410058591.8A 2014-02-20 2014-02-20 Method and apparatus for determining rubbish text information in a page Active CN103886016B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410058591.8A CN103886016B (en) 2014-02-20 2014-02-20 A kind of method and apparatus for being used to determine the rubbish text information in the page

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410058591.8A CN103886016B (en) 2014-02-20 2014-02-20 Method and apparatus for determining rubbish text information in a page

Publications (2)

Publication Number Publication Date
CN103886016A true CN103886016A (en) 2014-06-25
CN103886016B CN103886016B (en) 2017-11-03

Family

ID=50954908

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410058591.8A Active CN103886016B (en) 2014-02-20 2014-02-20 Method and apparatus for determining rubbish text information in a page

Country Status (1)

Country Link
CN (1) CN103886016B (en)

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104408087A (en) * 2014-11-13 2015-03-11 百度在线网络技术(北京)有限公司 Method and system for identifying cheating text
CN105704005A (en) * 2014-11-28 2016-06-22 深圳市腾讯计算机系统有限公司 Malicious user reporting method and device, and reporting information processing method and device
CN106411988A (en) * 2016-03-31 2017-02-15 北京金山安全软件有限公司 Garbage treatment method and device and mobile terminal
CN107544967A (en) * 2016-06-23 2018-01-05 北京搜狗科技发展有限公司 Network access method and device and electronic equipment
CN107544967B (en) * 2016-06-23 2022-03-25 北京搜狗科技发展有限公司 Network access method and device and electronic equipment
CN108804413A (en) * 2018-04-28 2018-11-13 百度在线网络技术(北京)有限公司 Method and device for recognizing text cheating
CN111460110A (en) * 2019-01-22 2020-07-28 阿里巴巴集团控股有限公司 Abnormal text detection method, abnormal text sequence detection method and device
CN111460110B (en) * 2019-01-22 2023-04-25 阿里巴巴集团控股有限公司 Abnormal text detection method, abnormal text sequence detection method and device
CN110688540A (en) * 2019-10-08 2020-01-14 腾讯科技(深圳)有限公司 Cheating account screening method, device, equipment and medium
CN110688540B (en) * 2019-10-08 2022-06-10 腾讯科技(深圳)有限公司 Cheating account screening method, device, equipment and medium

Also Published As

Publication number Publication date
CN103886016B (en) 2017-11-03

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant