CN104504027A - Method and device for automatically selecting webpage content - Google Patents

Info

Publication number
CN104504027A
Authority
CN
China
Prior art keywords
web page
webpage
selection result
page contents
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410769099.1A
Other languages
Chinese (zh)
Other versions
CN104504027B (en
Inventor
陈俊宏
余德乐
杨韬
赵冬玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410769099.1A
Publication of CN104504027A
Application granted
Publication of CN104504027B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for automatically selecting webpage content. The method comprises the following steps: reading the webpage content in a source database; selecting the webpage content according to a preset keyword dictionary and a preset selecting parameter to obtain a webpage selecting result; extracting a preset label information dictionary; adding one or more types of labels in the label information dictionary to the webpage selecting result; and, according to the label information added to the webpage selecting result, executing the function processing corresponding to the label information on the webpage selecting result to obtain the automatically selected webpage content. The method and the device solve the prior-art problem that manually selecting the large amount of webpage content updated every day is tedious and inefficient.

Description

Method and device for automatically screening web page content
Technical field
The present invention relates to the field of computers, and in particular to a method and a device for automatically screening web page content.
Background
At present, public opinion monitoring systems that monitor web page content allow a user to re-screen the text content of interest and to operate on the re-screened text (for example, by classifying or labelling it), which meets a variety of user needs well. One problem remains, however: web page content on the network is updated every day, and the volume of data updated each day is huge. A user who needs to keep monitoring the latest news must therefore analyze every batch of updated web page content along the classification dimensions of interest. This means that, every day, all the text content has to be screened by hand and the screened text then operated on by hand, which is a lengthy and cumbersome process.
No effective solution has yet been proposed for the problem in the prior art that manually screening the large amount of web page content updated every day is a cumbersome and inefficient process.
Summary of the invention
The main purpose of the present invention is to provide a method and a device for automatically screening web page content, so as to solve the problem in the prior art that manually screening the large amount of web page content updated every day is cumbersome and inefficient.
To achieve this purpose, according to one aspect of the embodiments of the present invention, a method for automatically screening web page content is provided. The method comprises: reading the web page content in a source database; screening the web page content according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result; extracting a preset label information dictionary; adding labels of any one or more types in the label information dictionary to the webpage screening result; and, according to the label information added to the webpage screening result, performing on the webpage screening result the function processing corresponding to the label information, to obtain the automatically screened web page content.
To achieve this purpose, according to another aspect of the embodiments of the present invention, a device for automatically screening web page content is provided. The device comprises: a first read module, for reading the web page content in a source database; a screening module, for screening the web page content according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result; an extraction module, for extracting a preset label information dictionary; a first processing module, for adding, according to the web page content in the webpage screening result, labels of any one or more types in the label information dictionary to the webpage screening result; and a second processing module, for performing on the webpage screening result, according to the label information added to it, the function processing corresponding to the label information, to obtain the automatically screened web page content.
According to the embodiments of the invention, the web page content in a source database is read; the web page content is screened according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result; a preset label information dictionary is extracted; labels of any one or more types in the label information dictionary are added to the webpage screening result; and, according to the label information added to the webpage screening result, the function processing corresponding to the label information is performed on the webpage screening result to obtain the automatically screened web page content. This solves the problem in the prior art that manually screening the large amount of web page content updated every day is cumbersome and inefficient, and achieves the effect of screening web pages automatically and processing them according to their content.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for a further understanding of the present invention. The illustrative embodiments of the present invention and their description serve to explain the present invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for automatically screening web page content according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a preferred method for automatically screening web page content according to Embodiment 1 of the present invention;
Fig. 3 is a structural diagram of the device for automatically screening web page content according to Embodiment 2 of the present invention; and
Fig. 4 is a structural diagram of a preferred device for automatically screening web page content according to Embodiment 2 of the present invention.
Detailed description
It should be noted that, where no conflict arises, the embodiments in this application and the features in the embodiments can be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art on the basis of the embodiments of the present invention, without creative work, shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the invention described herein can be implemented. In addition, the terms "comprise" and "have" and any variations of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
Embodiment 1
This embodiment of the present invention provides a method for automatically screening web page content.
Fig. 1 is a flowchart of the method for automatically screening web page content according to the embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S11: reading the web page content in a source database.
Specifically, in step S11 the web page content stored in the source database is read. The source database is used to store web page content that is updated regularly.
Step S13: screening the web page content according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result.
Specifically, in step S13 the web page content read from the source database is screened. The web page content may first be screened according to the preset keyword dictionary, and the result may then be screened by the preset screening parameters to obtain the webpage screening result.
Step S15: extracting a preset label information dictionary.
Specifically, in step S15 the label information dictionary preset for web page content is extracted, giving the label information dictionary that is used to mark web pages.
Step S17: according to the web page content in the webpage screening result, adding to the webpage screening result the labels of any one or more types in the label information dictionary that match the web page content.
Specifically, in step S17 the content of the webpage screening result is matched against the labels in the label information dictionary, yielding one or more labels, of one or more types, that match the content of the webpage screening result. The label information dictionary contains label information of several different types.
Step S19: according to the label information added to the webpage screening result, performing on the webpage screening result the function processing corresponding to the label information, to obtain the automatically screened web page content.
Specifically, in step S19, for the label information of the one or more types attached to a webpage screening result, the function corresponding to the label type is called to process the web page content in that webpage screening result, thereby realizing automatic screening of the webpage screening result.
Through steps S11 to S19, after the web page content in the source database has been read, it is first screened according to the preset keyword dictionary and screening parameters. Web page content containing one or more keywords from the keyword dictionary is obtained and is then screened further according to the screening parameters, giving the web page content that satisfies one or more of the screening conditions in the screening parameters; this is the webpage screening result. The webpage screening result is then marked further according to the label information dictionary: when a piece of web page content in the webpage screening result matches one or more label types in the label information dictionary, labels are added to that web page content. Finally, the function matching each label type is called to process the web page content in the webpage screening result.
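The flow of steps S11 to S19 can be sketched roughly as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: pages are modelled as plain dictionaries with a `text` field, the label information dictionary is modelled as label type → label name → trigger words, and all names (`screen`, `add_labels`, `process`, `run_pipeline`) are hypothetical.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]  # hypothetical page model: at least a "text" field

def screen(pages: List[Page], keywords: List[str],
           predicates: List[Callable[[Page], bool]]) -> List[Page]:
    """S13: keep pages that contain any keyword, then apply every predicate."""
    hits = [p for p in pages if any(k in p["text"] for k in keywords)]
    return [p for p in hits if all(pred(p) for pred in predicates)]

def add_labels(pages: List[Page],
               label_dict: Dict[str, Dict[str, List[str]]]) -> List[Page]:
    """S15 + S17: attach (label type, label name) pairs whose trigger words match."""
    for p in pages:
        p["labels"] = [
            (ltype, name)
            for ltype, entries in label_dict.items()
            for name, triggers in entries.items()
            if any(t in p["text"] for t in triggers)
        ]
    return pages

def process(pages: List[Page],
            handlers: Dict[str, Callable[[Page, str], Optional[Page]]]) -> List[Page]:
    """S19: call the handler registered for each label type; None drops the page."""
    out = []
    for p in pages:
        current: Optional[Page] = p
        for ltype, name in p["labels"]:
            if current is None:
                break
            handler = handlers.get(ltype)
            if handler is not None:
                current = handler(current, name)
        if current is not None:
            out.append(current)
    return out

def run_pipeline(source_pages, keywords, predicates, label_dict, handlers):
    pages = [dict(p) for p in source_pages]       # S11: read (copy) the stored pages
    pages = screen(pages, keywords, predicates)   # S13
    pages = add_labels(pages, label_dict)         # S15 + S17
    return process(pages, handlers)               # S19
```

In this sketch a handler returns either a (possibly modified) page or `None`; treating `None` as "remove the page" is a design choice made here for brevity and is not taken from the source.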
In summary, the present invention solves the problem in the prior art that manually screening the large amount of web page content updated every day is cumbersome and inefficient, and achieves the effect of screening web pages automatically and processing them according to their content.
Preferably, in the above embodiment of the present application, the label information dictionary may comprise one or more of the following label types: classification label, sentiment label, region label, blacklist label and to-be-deleted label. Step S19, in which the function processing corresponding to the label information is performed on the webpage screening result according to the label information added to it to obtain the automatically screened web page content, comprises:
Step S191: reading the label information of the one or more labels in the webpage screening result.
Step S193: calling the corresponding function according to the type of the label information to process the webpage screening result, generating the automatically screened web page content.
Specifically, in steps S191 to S193 the label information of the labels in the webpage screening result is read and the function corresponding to each label type is called. The different functions process the web page content in the webpage screening result accordingly.
Preferably, in the above embodiment of the present application, step S193, in which the corresponding function is called according to the type of the label information to process the webpage screening result and generate the automatically screened web page content, comprises at least any one or more of the following schemes:
Scheme one: when the label is a classification label, the function called is a classification function, so that the webpage screening result is classified and classified web page content is generated.
Scheme two: when the label is a sentiment label, the function called is a function that revises the sentiment label, so that the sentiment label of the webpage screening result is corrected and revised web page content is generated.
Scheme three: when the label is a region label, the function called is a function that revises the region label, so that the region label of the webpage screening result is corrected and revised web page content is generated.
Scheme four: when the label is a blacklist label, the function called is a function that screens the webpage screening result corresponding to the blacklist label, so that screening is performed on the webpage screening result according to its blacklist label and screened web page content is generated.
Scheme five: when the label is a to-be-deleted label, the function called is a function that deletes the webpage screening result corresponding to the to-be-deleted label, so that the webpage screening result carrying the to-be-deleted label is deleted and the web page content remaining after deletion is generated.
Specifically, in practice, other processing schemes may also be applied to the webpage screening result as actual conditions require; the processing is not limited to the five schemes above.
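The five schemes amount to dispatching on the label type. A minimal sketch of such a dispatch table is given below, assuming each label is a `(type, value)` pair stored on the page; the handler names and the page fields (`category`, `sentiment`, `region`, `blacklisted`) are hypothetical, and the treatment of the blacklist label (marking the page rather than dropping it) is an assumption, since the source does not spell out what the blacklist screening does.

```python
from typing import Any, Callable, Dict, Optional

Page = Dict[str, Any]

def classify(page: Page, value: str) -> Optional[Page]:
    return {**page, "category": value}        # scheme one: classification

def revise_sentiment(page: Page, value: str) -> Optional[Page]:
    return {**page, "sentiment": value}       # scheme two: revise sentiment label

def revise_region(page: Page, value: str) -> Optional[Page]:
    return {**page, "region": value}          # scheme three: revise region label

def blacklist(page: Page, value: str) -> Optional[Page]:
    return {**page, "blacklisted": True}      # scheme four (assumed behaviour)

def delete(page: Page, value: str) -> Optional[Page]:
    return None                               # scheme five: delete the page

# Label type -> function, as in step S193.
HANDLERS: Dict[str, Callable[[Page, str], Optional[Page]]] = {
    "classification": classify,
    "sentiment": revise_sentiment,
    "region": revise_region,
    "blacklist": blacklist,
    "to_delete": delete,
}

def apply_labels(page: Page) -> Optional[Page]:
    """S191: read the labels; S193: call the function for each label type."""
    current: Optional[Page] = page
    for label_type, value in page.get("labels", []):
        handler = HANDLERS.get(label_type)
        if handler is None:
            continue  # unknown label types are skipped in this sketch
        current = handler(current, value)
        if current is None:
            return None
    return current
```

Because the source allows further schemes, additional label types could be supported here by registering more entries in `HANDLERS` without changing `apply_labels`.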
Preferably, in the above embodiment of the present application, step S13, in which the web page content is screened according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result, comprises:
Step S131: screening the web page content according to the preset keyword dictionary to obtain first pre-processed web page content.
Step S133: screening the first pre-processed web page content according to the preset screening parameters to obtain second pre-processed web page content, wherein the second pre-processed web page content is the webpage screening result.
Specifically, in steps S131 to S133 the web pages are first screened according to the preset keyword dictionary, giving the first pre-processed web page content, which matches keywords contained in the keyword dictionary. The first pre-processed web page content is then screened according to the preset screening parameters, giving the second pre-processed web page content, which satisfies the conditions of the screening parameters. The second pre-processed web page content is the webpage screening result.
Preferably, in the above embodiment of the present application, step S133, in which the first pre-processed web page content is screened according to the preset screening parameters to obtain the second pre-processed web page content as the webpage screening result, comprises:
S1331: reading the preset screening parameters and screening order.
S1333: screening the first pre-processed web page content with each screening parameter as a condition, one after another in the screening order, to obtain the second pre-processed web page content, wherein the screening parameters comprise at least any one or more of the following screening items: web page text content, web page text source, web page text author, web page text acquisition time, web page sentiment label and web page text publication region information.
Specifically, in steps S1331 to S1333 the first pre-processed web page content is screened by the screening parameters one after another in the screening order. This layer-by-layer screening progressively narrows the scope, removes the dirty data from the first pre-processed web page content and yields the webpage screening result.
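The layer-by-layer screening of steps S1331 and S1333 can be illustrated with a small sketch in which each screening parameter is a named predicate and the screening order is a list of those names. The concrete conditions in `FILTERS` (allowed sources, a non-empty author, a cut-off date) are invented for illustration only.

```python
from datetime import datetime
from typing import Any, Callable, Dict, List

Page = Dict[str, Any]

# Hypothetical screening parameters: screening item -> condition on a page.
FILTERS: Dict[str, Callable[[Page], bool]] = {
    "source": lambda p: p["source"] in {"news", "blog"},
    "author": lambda p: p["author"] != "",
    "acquired": lambda p: p["acquired"] >= datetime(2014, 1, 1),
}

def screen_in_order(pages: List[Page],
                    filters: Dict[str, Callable[[Page], bool]],
                    order: List[str]) -> List[Page]:
    """Apply each screening parameter in the configured order (S1333).

    Every pass narrows the remaining set, so later conditions only
    see pages that survived the earlier ones.
    """
    remaining = list(pages)
    for name in order:
        remaining = [p for p in remaining if filters[name](p)]
    return remaining
```

Since all conditions must hold, the final set does not depend on the order in this sketch; the order only determines how quickly the candidate set shrinks, which matters when some conditions are more expensive to evaluate than others.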
Preferably, as shown in Fig. 2, in the above embodiment of the present application, before the web page content in the source database is read in step S11, the method further comprises:
Step S101: reading a preset target web page address.
Step S103: downloading the web page content corresponding to the target web page address.
Step S105: storing the downloaded target web page content in the source database.
Specifically, in steps S101 to S105 the web page content at the preset target web page address is captured. The target web page address is accessed to obtain the corresponding web page content, and the obtained web page content is stored in the source database as source data.
Preferably, in the above embodiment of the present application, the web page content may comprise one or more of the following: web page text content, web page text source, web page text author, web page text acquisition time and web page text publication region information.
Specifically, in the step of downloading the web page content corresponding to the target web page address, while the web page text content is downloaded, the web page text source, web page text author, web page text acquisition time, web page text publication region information and so on that correspond to that text content also need to be saved. This information is stored in the source database in association with the text content.
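Steps S101 to S105, together with the stored fields listed above, could look roughly like the sketch below. It assumes SQLite as the source database and takes the download step as an injected `fetch` callable so that no particular HTTP library or page parser is implied; the `PageRecord` type, the `pages` table layout and both function names are hypothetical.

```python
import sqlite3
from dataclasses import astuple, dataclass
from typing import Callable, Iterable

@dataclass
class PageRecord:
    url: str
    text: str          # web page text content
    source: str        # web page text source
    author: str        # web page text author
    acquired_at: str   # acquisition time, stored here as an ISO 8601 string
    region: str        # publication region information

def init_db(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, text TEXT, source TEXT, "
        "author TEXT, acquired_at TEXT, region TEXT)"
    )

def crawl(urls: Iterable[str],
          fetch: Callable[[str], PageRecord],
          conn: sqlite3.Connection) -> int:
    """Download each target address and store the result; returns the count."""
    count = 0
    for url in urls:                       # S101: read a preset target address
        record = fetch(url)                # S103: download the page content
        conn.execute(                      # S105: store it in the source database
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?, ?)",
            astuple(record),
        )
        count += 1
    conn.commit()
    return count
```

Using `INSERT OR REPLACE` keyed on the URL means a page that is re-crawled after an update overwrites its earlier row; keeping a history of versions instead would be an equally plausible design that the source does not rule out.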
Preferably, in the above embodiment of the present application, after the web page content is screened in step S13 according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result, the method further comprises:
Step S141: reading a preset text sentiment dictionary, wherein the text sentiment dictionary comprises a positive sentiment word dictionary and a negative sentiment word dictionary.
Step S143: judging the content of the webpage screening result according to the text sentiment dictionary to obtain the sentiment label corresponding to the content of the webpage screening result.
Specifically, in steps S141 and S143 sentiment analysis is performed on the web page content in the webpage screening result according to the preset text sentiment dictionary. The web page content is matched against one or more sentiment entries in the positive sentiment word dictionary and in the negative sentiment word dictionary of the text sentiment dictionary, giving the sentiment label corresponding to the content of the webpage screening result.
Preferably, in the above embodiment of the present application, step S143, in which the content of the webpage screening result is judged according to the text sentiment dictionary to obtain the corresponding sentiment label, comprises:
Step S1431: judging the content of the webpage screening result according to the text sentiment dictionary;
when the number of positive sentiment words from the positive sentiment word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the sentiment label of the content of the webpage screening result is positive sentiment;
when the number of negative sentiment words from the negative sentiment word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the sentiment label of the content of the webpage screening result is negative sentiment.
Specifically, in step S1431 the sentiment label corresponding to the content of the webpage screening result is determined. When the number of positive sentiment words contained in the content of a webpage screening result exceeds the preset threshold, the content is judged to be of positive sentiment; when the number of negative sentiment words contained in the content exceeds the preset threshold, the content is judged to be of negative sentiment.
In practice, a web page may contain both positive and negative sentiment words. In that case, the number of positive sentiment words in the web page can be compared with the number of negative sentiment words. When there are more positive sentiment words than negative sentiment words, the sentiment of the web page is judged to be positive; when there are fewer positive sentiment words than negative sentiment words, the sentiment of the web page is judged to be negative.
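The dictionary-based sentiment judgment can be sketched as a word count followed by the two rules described above. The source does not say how the threshold rule and the comparison rule interact, nor what happens on a tie, so this sketch makes two labelled assumptions: when both kinds of words occur, only the comparison is used, and a tie yields no label. The function name and return values are hypothetical.

```python
from typing import Iterable, Optional

def sentiment_label(text: str,
                    positive_words: Iterable[str],
                    negative_words: Iterable[str],
                    threshold: int) -> Optional[str]:
    """Return "positive", "negative", or None when no rule applies."""
    # Count occurrences by substring matching; no word segmentation is done here.
    pos = sum(text.count(w) for w in positive_words)
    neg = sum(text.count(w) for w in negative_words)

    if pos > 0 and neg > 0:
        # Mixed page: compare the two counts (assumption: this takes priority).
        if pos > neg:
            return "positive"
        if neg > pos:
            return "negative"
        return None  # tie: not specified in the source
    if pos > threshold:
        return "positive"   # step S1431, positive branch
    if neg > threshold:
        return "negative"   # step S1431, negative branch
    return None
```

Substring counting is used because it works without a tokenizer; for text where sentiment words can appear inside longer unrelated words, a segmentation step before counting would likely be needed.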
In summary, in practice, to overcome the prior-art drawback that all text content has to be filtered every day and then operated on after filtering, multiple rule-based judgments can be set up for web page content, and users can preset screening conditions according to their own preferences (for example: which keywords are contained, which channel the content comes from, what the time range is, what the sentiment score is, who the author is, and so on). Then, whenever new web page text content has been captured from the Internet, it is automatically judged whether that content satisfies the previously set judgment rules and screening rules. If it does, the web page text is operated on automatically (for example: classifying, labelling, changing the sentiment, changing the region attribute, adding a star mark, adding to the blacklist, deleting, and so on).
To realize the above functions, two modules can be used: a front-end automatic screening condition setting module and a crawler robot module (i.e. the webpage capture module).
Front-end automatic screening condition setting module:
The user can set automatic screening and operations in the product front end. First, it is judged whether the content contains the keywords. Then, screening conditions are set through the screening parameters, for example conditions on the text source, the text sentiment, information about the text author and the text capture time. Finally, when both of these conditions are satisfied, the web page content is operated on automatically (for example: classifying, labelling, changing the sentiment, changing the region attribute, adding a star mark, adding to the blacklist, deleting, and so on).
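A rule configured in such a front end can be thought of as a keyword condition, a set of screening conditions and an operation to perform when both hold. The sketch below is one hypothetical way to represent and evaluate such rules; the `Rule` fields, the use of allowed-value sets for conditions and the action strings are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Set

@dataclass
class Rule:
    keywords: List[str]                 # condition 1: page must contain any of these
    conditions: Dict[str, Set[str]]     # condition 2: field -> allowed values
    action: str                         # operation to apply, e.g. "blacklist"

def matches(rule: Rule, page: Dict[str, Any]) -> bool:
    """A rule fires only when the keyword check and every screening condition hold."""
    has_keyword = any(k in page["text"] for k in rule.keywords)
    meets_conditions = all(
        page.get(field) in allowed for field, allowed in rule.conditions.items()
    )
    return has_keyword and meets_conditions

def actions_for(rules: List[Rule], page: Dict[str, Any]) -> List[str]:
    """Collect the operations of all rules that a newly captured page satisfies."""
    return [r.action for r in rules if matches(r, page)]
```

For example, a rule with keywords `["merger"]`, conditions `{"source": {"news"}, "sentiment": {"negative"}}` and action `"blacklist"` would fire only for negative news items that mention a merger.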
Webpage capture module:
The webpage capture module can be a piece of JavaScript code. The scope to be crawled (the websites) can be determined in advance. The crawler can capture all the text and image content on those sites, store it in the database and judge whether it satisfies the screening conditions set by the user; if it does, the content is processed accordingly.
In the present invention, the user manually sets screening rules in advance; content on the Internet is then screened automatically after it has been captured, and subsequent operations are performed automatically on the web page content obtained by screening. Apart from the initial setup, the whole process requires no user intervention. This approach gives the user much greater initiative: each batch of newly captured web page content can be screened according to the preset rules and the screened content operated on accordingly, which reduces the repeated manual work that users would otherwise have to do whenever a website or web page is updated.
Embodiment 2
This embodiment of the present invention further provides a device for automatically screening web page content. As shown in Fig. 3, the device may comprise: a first read module 31, a screening module 33, an extraction module 35, a first processing module 37 and a second processing module 39.
The first read module 31 is used for reading the web page content in a source database.
Specifically, the first read module 31 reads the web page content stored in the source database. The source database is used to store web page content that is updated regularly.
The screening module 33 is used for screening the web page content according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result.
Specifically, the screening module 33 screens the web page content read from the source database. The web page content may first be screened according to the preset keyword dictionary, and the result may then be screened by the preset screening parameters to obtain the webpage screening result.
The extraction module 35 is used for extracting a preset label information dictionary.
Specifically, the extraction module 35 extracts the label information dictionary preset for web page content, giving the label information dictionary that is used to mark web pages.
The first processing module 37 is used for adding, according to the web page content in the webpage screening result, labels of any one or more types in the label information dictionary to the webpage screening result.
Specifically, the first processing module 37 matches the content of the webpage screening result against the labels in the label information dictionary, yielding one or more labels, of one or more types, that match the content of the webpage screening result. The label information dictionary contains label information of several different types.
The second processing module 39 is used for performing on the webpage screening result, according to the label information added to it, the function processing corresponding to the label information, to obtain the automatically screened web page content.
Specifically, for the label information of the one or more types attached to a webpage screening result, the second processing module 39 calls the function corresponding to the label type to process the web page content in that webpage screening result, thereby realizing automatic screening of the webpage screening result.
Through the first read module 31, the screening module 33, the extraction module 35, the first processing module 37 and the second processing module 39, after the web page content in the source database has been read, it is first screened according to the preset keyword dictionary and screening parameters. Web page content containing one or more keywords from the keyword dictionary is obtained and is then screened further according to the screening parameters, giving the web page content that satisfies one or more of the screening conditions in the screening parameters; this is the webpage screening result. The webpage screening result is then marked further according to the label information dictionary: when a piece of web page content in the webpage screening result matches one or more label types in the label information dictionary, labels are added to that web page content. Finally, the function matching each label type is called to process the web page content in the webpage screening result.
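The way the five modules cooperate can be pictured as a simple composition, sketched below with each module represented by a callable. This is an architectural illustration only; the class name, the field names and the callable signatures are hypothetical and merely mirror the module numbering used above.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict, List

@dataclass
class ScreeningDevice:
    read_module: Callable[[], List[Any]]                        # first read module 31
    screening_module: Callable[[List[Any]], List[Any]]          # screening module 33
    extraction_module: Callable[[], Dict[str, Any]]             # extraction module 35
    first_processing_module: Callable[[List[Any], Dict[str, Any]], List[Any]]  # module 37
    second_processing_module: Callable[[List[Any]], List[Any]]  # module 39

    def run(self) -> List[Any]:
        pages = self.read_module()                       # read from the source database
        result = self.screening_module(pages)            # keyword + parameter screening
        label_dict = self.extraction_module()            # load the label dictionary
        labelled = self.first_processing_module(result, label_dict)  # add labels
        return self.second_processing_module(labelled)   # label-driven processing
```

Passing the modules in as callables keeps each one replaceable and testable on its own, which is one reasonable reading of the modular structure in Fig. 3 rather than something the source prescribes.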
In summary, the invention solves in prior art and manual screening is carried out, the problem of the lengthy and jumbled inefficiency of the process caused to a large amount of web page contents that every day upgrades, achieve and automatically webpage is screened, and according to the effect that web page contents processes webpage.
Further, the label information dictionary may comprise one or more of the following label types: classification label, mood label, region label, blacklist label, and to-be-deleted label. In the second processing module 39, the step of performing, on the webpage screening result, the function processing corresponding to the label information added to it, to obtain the automatically screened web page content, comprises:
Reading the label information of the one or more labels in the webpage screening result.
Calling the corresponding function according to the type of the label information to process the webpage screening result, generating the automatically screened web page content.
Specifically, the above steps read the label information of the labels in the webpage screening result and call the function corresponding to each label type. The web page content in the webpage screening result is processed accordingly by the different functions.
Further, the step of calling the corresponding function according to the type of the label information to process the webpage screening result and generate the automatically screened web page content comprises at least one or more of the following schemes:
Scheme one: when the label is a classification label, the function called is a classification function, which classifies the webpage screening result and generates classified web page content.
Scheme two: when the label is a mood label, the function called is a function for correcting the mood label, which corrects the mood label of the webpage screening result and generates corrected web page content.
Scheme three: when the label is a region label, the function called is a function for correcting the region label, which corrects the region label of the webpage screening result and generates corrected web page content.
Scheme four: when the label is a blacklist label, the function called is a function for screening the webpage screening result corresponding to the blacklist label, which screens the webpage screening result by its blacklist label and generates screened web page content.
Scheme five: when the label is a to-be-deleted label, the function called is a function for deleting the webpage screening result corresponding to the to-be-deleted label, which deletes the webpage screening result carrying that label and generates the web page content remaining after deletion.
Specifically, in practical applications, other processing schemes may also be applied to the webpage screening result as actual conditions require; processing is not limited to the five schemes above.
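The five schemes above amount to dispatching on label type. The following is a minimal sketch of that dispatch, not the patent's implementation: the page record layout (a dict with a `labels` mapping), the handler names, and the label-type strings are all hypothetical, and "correcting" a label is modelled simply as copying the label value onto the page.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]  # hypothetical record: {"text": ..., "labels": {label_type: value}}


def handle_classification(page: Page) -> Optional[Page]:
    # Scheme one: file the page under the category named by its label.
    page["category"] = page["labels"]["classification"]
    return page


def handle_mood(page: Page) -> Optional[Page]:
    # Scheme two: overwrite the page's mood with the value carried by the label.
    page["mood"] = page["labels"]["mood"]
    return page


def handle_region(page: Page) -> Optional[Page]:
    # Scheme three: overwrite the page's region with the value carried by the label.
    page["region"] = page["labels"]["region"]
    return page


def handle_blacklist(page: Page) -> Optional[Page]:
    # Scheme four: blacklisted pages are screened out (assumed to mean dropped).
    return None


def handle_to_delete(page: Page) -> Optional[Page]:
    # Scheme five: pages marked for deletion are removed.
    return None


# Dispatch table: label type -> function called for that type.
HANDLERS: Dict[str, Callable[[Page], Optional[Page]]] = {
    "classification": handle_classification,
    "mood": handle_mood,
    "region": handle_region,
    "blacklist": handle_blacklist,
    "to_delete": handle_to_delete,
}


def process_screening_result(results: List[Page]) -> List[Page]:
    """Read each page's labels and call the function matching each label type."""
    processed: List[Page] = []
    for page in results:
        current: Optional[Page] = dict(page)  # shallow copy so the input list is not altered
        for label_type in page.get("labels", {}):
            handler = HANDLERS.get(label_type)
            if handler is None:
                continue  # unknown label types are ignored in this sketch
            current = handler(current)
            if current is None:
                break  # page was screened out or deleted
        if current is not None:
            processed.append(current)
    return processed
```

A table keyed by label type keeps the mechanism open to the "other processing schemes" the text mentions: adding a scheme means registering one more handler.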
Further, in the screening module 33, the step of screening the web page content according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result comprises:
Screening the web page content according to the preset keyword dictionary to obtain first preprocessed web page content.
Screening the first preprocessed web page content according to the preset screening parameters to obtain second preprocessed web page content, wherein the second preprocessed web page content is the webpage screening result.
Specifically, through the above steps, the web pages are first screened according to the preset keyword dictionary, yielding the first preprocessed web page content that matches keywords contained in the keyword dictionary. The first preprocessed web page content is then screened according to the preset screening parameters, yielding the second preprocessed web page content that matches the conditions of the screening parameters. The second preprocessed web page content is the webpage screening result.
Further, in the above steps, the step of screening the first preprocessed web page content according to the preset screening parameters to obtain the second preprocessed web page content as the webpage screening result comprises:
Reading the preset screening parameters and screening order.
According to the screening parameters and the screening order, screening the first preprocessed web page content successively in the screening order, with the screening parameters as conditions, to obtain the second preprocessed web page content, wherein the screening parameters comprise at least one or more of the following screening items: web page text content, web page text source, web page text author, web page text acquisition time, web page mood label, and web page text publication region information.
Specifically, through the above steps, the first preprocessed web page content is screened by each screening parameter in turn, following the screening order. This layer-by-layer screening progressively narrows the scope, removes the dirty data from the first preprocessed web page content, and yields the webpage screening result.
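The two-stage screening described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: pages are modelled as dicts of string fields, keyword matching is plain substring matching, and each screening parameter is a hypothetical `(field, predicate)` pair whose position in the list gives the screening order.

```python
from typing import Callable, Dict, Iterable, List, Tuple

Page = Dict[str, str]                            # hypothetical record, e.g. {"text": ..., "source": ...}
Condition = Tuple[str, Callable[[str], bool]]    # (field name, test applied to that field)


def screen_by_keywords(pages: Iterable[Page], keywords: Iterable[str]) -> List[Page]:
    """Stage 1: keep pages whose text contains at least one dictionary keyword."""
    keyword_list = list(keywords)
    return [p for p in pages if any(k in p.get("text", "") for k in keyword_list)]


def screen_by_parameters(pages: Iterable[Page], conditions: List[Condition]) -> List[Page]:
    """Stage 2: apply each screening parameter in list order, narrowing the set each time."""
    remaining = list(pages)
    for field, predicate in conditions:
        remaining = [p for p in remaining if predicate(p.get(field, ""))]
    return remaining


def screen(pages: Iterable[Page], keywords: Iterable[str], conditions: List[Condition]) -> List[Page]:
    first_preprocessed = screen_by_keywords(pages, keywords)
    second_preprocessed = screen_by_parameters(first_preprocessed, conditions)
    return second_preprocessed  # this is the webpage screening result
```

Note that this sketch requires a page to pass every condition in the list, which is one reading of the layer-by-layer screening; the text also speaks of meeting "one or more" screening conditions, so a real configuration might list only the conditions that apply.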
Preferably, as shown in Figure 4, in the above embodiments of the present application, the device further comprises: a second read module 301, a download module 303 and a storage module 305.
The second read module 301 is configured to read a preset target web page address.
The download module 303 is configured to download, according to the target web page address, the web page content corresponding to the target web page address.
The storage module 305 is configured to store the downloaded target web page content in the source database.
Specifically, the second read module 301, download module 303 and storage module 305 crawl the web page content at the preset target web page address. The target web page address is accessed to obtain the corresponding web page content, and the obtained web page content is stored in the source database as source data.
Further, the web page content may comprise one or more of the following: web page text content, web page text source, web page text author, web page text acquisition time, and web page text publication region information.
Specifically, in the step of downloading the web page content corresponding to the target web page address, the web page text source, web page text author, web page text acquisition time, web page text publication region information and the like corresponding to the web page text content need to be saved while the web page text content is downloaded. This information is stored in the source database in association with the text content.
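The read, download and store steps above can be sketched with an in-memory SQLite table standing in for the source database. Everything named here is an assumption made for illustration: the `pages` table and its columns, the `crawl` function, and the injected `fetch` callable, which stands in for whatever downloader a real system would use and is expected to return the text plus its metadata.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable


def open_source_db(path: str = ":memory:") -> sqlite3.Connection:
    """Create the (hypothetical) source table holding text content and its metadata."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, text TEXT, source TEXT, "
        "author TEXT, acquired_at TEXT, region TEXT)"
    )
    return conn


def crawl(conn: sqlite3.Connection,
          urls: Iterable[str],
          fetch: Callable[[str], Dict[str, str]]) -> int:
    """Download each preset target address and store text and metadata together."""
    stored = 0
    for url in urls:
        page = fetch(url)  # assumed to return {"text", "source", "author", "region"}
        conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?, ?)",
            (
                url,
                page["text"],
                page.get("source", ""),
                page.get("author", ""),
                datetime.now(timezone.utc).isoformat(),  # acquisition time
                page.get("region", ""),
            ),
        )
        stored += 1
    conn.commit()
    return stored
```

Passing the downloader in as a parameter keeps the sketch runnable without network access and leaves the choice of HTTP client and HTML parsing open.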
Further, in the above embodiments of the present application, the device further comprises: a third read module 341 and a judgment module 343.
The third read module 341 is configured to read a preset text mood dictionary, wherein the text mood dictionary comprises a positive mood word dictionary and a negative mood word dictionary.
The judgment module 343 is configured to judge the content of the webpage screening result according to the text mood dictionary and obtain the mood label corresponding to the content of the webpage screening result.
Specifically, the third read module 341 and judgment module 343 perform mood analysis on the web page content in the webpage screening result according to the preset text mood dictionary. The web page content is matched against the mood entries in the positive mood word dictionary and in the negative mood word dictionary of the text mood dictionary, yielding the mood label corresponding to the content of the webpage screening result.
Further, the above step of judging the content of the webpage screening result according to the text mood dictionary to obtain the mood label corresponding to the content of the webpage screening result comprises:
Judging the content of the webpage screening result according to the text mood dictionary;
When the number of positive mood words from the positive mood word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the mood label of that content is positive mood;
When the number of negative mood words from the negative mood word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the mood label of that content is negative mood.
Specifically, the above steps determine the mood label corresponding to the content of the webpage screening result. When the number of positive mood words contained in the content exceeds the preset threshold, the content is judged to be positive mood; when the number of negative mood words contained in the content exceeds the preset threshold, the content is judged to be negative mood.
In practical applications, a single web page may contain both positive mood words and negative mood words. In that case, the number of positive mood words in the web page can be compared with the number of negative mood words. When the number of positive mood words is greater than the number of negative mood words, the mood of the web page is judged to be positive; when the number of positive mood words is less than the number of negative mood words, the mood of the web page is judged to be negative.
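The threshold rule and the comparison rule above can be combined in a short sketch. Several points are assumptions, since the text does not fix them: words are counted by substring occurrence, the comparison of counts is applied only when both counts exceed the threshold, and a page with equal counts or with neither count above the threshold gets no label (`None`). The function name `judge_mood` is hypothetical.

```python
from typing import Iterable, Optional


def count_words(text: str, words: Iterable[str]) -> int:
    """Total occurrences of dictionary words in the text (substring matching)."""
    return sum(text.count(w) for w in words)


def judge_mood(text: str,
               positive_words: Iterable[str],
               negative_words: Iterable[str],
               threshold: int) -> Optional[str]:
    pos = count_words(text, positive_words)
    neg = count_words(text, negative_words)
    pos_hit = pos > threshold  # "exceeds the preset threshold"
    neg_hit = neg > threshold
    if pos_hit and neg_hit:
        # Page contains both kinds of words in quantity: compare the counts.
        if pos > neg:
            return "positive"
        if pos < neg:
            return "negative"
        return None  # tie: not specified in the text, left unlabeled here
    if pos_hit:
        return "positive"
    if neg_hit:
        return "negative"
    return None  # neither count exceeds the threshold
```

Substring counting suits text without word delimiters; for space-delimited languages a tokenizer would avoid counting a dictionary word inside a longer word.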
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions; however, those skilled in the art should understand that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, mobile terminal, server, network device, or the like) to perform all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a portable hard disk, a magnetic disk, or an optical disc.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. An automatic screening method for web page content, characterized by comprising:
Reading web page content in a source database;
Screening the web page content according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result;
Extracting a preset label information dictionary;
According to the web page content in the webpage screening result, adding to the webpage screening result any one or more types of label in the label information dictionary that match the web page content;
According to the label information added to the webpage screening result, performing on the webpage screening result the function processing corresponding to the label information, to obtain automatically screened web page content.
2. The method according to claim 1, characterized in that the label information dictionary comprises one or more of the following label types: classification label, mood label, region label, blacklist label, and to-be-deleted label, wherein the step of performing on the webpage screening result, according to the label information added to the webpage screening result, the function processing corresponding to the label information, to obtain the automatically screened web page content, comprises:
Reading the label information of the one or more labels in the webpage screening result;
Calling the corresponding function according to the type of the label information to process the webpage screening result, generating the automatically screened web page content.
3. The method according to claim 2, characterized in that the step of calling the corresponding function according to the type of the label information to process the webpage screening result and generate the automatically screened web page content comprises at least one or more of the following schemes:
Scheme one: when the label is the classification label, the function called is a classification function, which classifies the webpage screening result and generates classified web page content;
Scheme two: when the label is the mood label, the function called is a function for correcting the mood label, which corrects the mood label of the webpage screening result and generates corrected web page content;
Scheme three: when the label is the region label, the function called is a function for correcting the region label, which corrects the region label of the webpage screening result and generates corrected web page content;
Scheme four: when the label is the blacklist label, the function called is a function for screening the webpage screening result corresponding to the blacklist label, which screens the webpage screening result by its blacklist label and generates screened web page content;
Scheme five: when the label is the to-be-deleted label, the function called is a function for deleting the webpage screening result corresponding to the to-be-deleted label, which deletes the webpage screening result carrying the to-be-deleted label and generates the web page content remaining after deletion.
4. The method according to claim 1, characterized in that the step of screening the web page content according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result comprises:
Screening the web page content according to the preset keyword dictionary to obtain first preprocessed web page content;
Screening the first preprocessed web page content according to the preset screening parameters to obtain second preprocessed web page content, wherein the second preprocessed web page content is the webpage screening result.
5. The method according to claim 4, characterized in that the step of screening the first preprocessed web page content according to the preset screening parameters to obtain the second preprocessed web page content as the webpage screening result comprises:
Reading the preset screening parameters and screening order;
According to the screening parameters and the screening order, screening the first preprocessed web page content successively in the screening order, with the screening parameters as conditions, to obtain the second preprocessed web page content, wherein the screening parameters comprise at least one or more of the following screening items: web page text content, web page text source, web page text author, web page text acquisition time, web page mood label, and web page text publication region information.
6. The method according to claim 1, characterized in that, before reading the web page content in the source database, the method further comprises:
Reading a preset target web page address;
Downloading, according to the target web page address, the web page content corresponding to the target web page address;
Storing the downloaded target web page content in the source database.
7. The method according to claim 6, characterized in that the web page content comprises one or more of the following: web page text content, web page text source, web page text author, web page text acquisition time, and web page text publication region information.
8. The method according to claim 1, characterized in that, after screening the web page content according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result, the method further comprises:
Reading a preset text mood dictionary, wherein the text mood dictionary comprises a positive mood word dictionary and a negative mood word dictionary;
Judging the content of the webpage screening result according to the text mood dictionary to obtain the mood label corresponding to the content of the webpage screening result.
9. The method according to claim 8, characterized in that the step of judging the content of the webpage screening result according to the text mood dictionary to obtain the mood label corresponding to the content of the webpage screening result comprises:
Judging the content of the webpage screening result according to the text mood dictionary;
When the number of positive mood words from the positive mood word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the mood label of the content of the webpage screening result is positive mood;
When the number of negative mood words from the negative mood word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the mood label of the content of the webpage screening result is negative mood.
10. An automatic screening device for web page content, characterized by comprising:
A first read module, configured to read web page content in a source database;
A screening module, configured to screen the web page content according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result;
An extraction module, configured to extract a preset label information dictionary;
A first processing module, configured to add to the webpage screening result, according to the web page content in the webpage screening result, any one or more types of label in the label information dictionary that match the web page content;
A second processing module, configured to perform on the webpage screening result, according to the label information added to the webpage screening result, the function processing corresponding to the label information, to obtain automatically screened web page content.
11. The device according to claim 10, characterized in that the device further comprises:
A second read module, configured to read a preset target web page address;
A download module, configured to download, according to the target web page address, the web page content corresponding to the target web page address;
A storage module, configured to store the downloaded target web page content in the source database.
12. The device according to claim 10, characterized in that the device further comprises:
A third read module, configured to read a preset text mood dictionary, wherein the text mood dictionary comprises a positive mood word dictionary and a negative mood word dictionary;
A judgment module, configured to judge the content of the webpage screening result according to the text mood dictionary and obtain the mood label corresponding to the content of the webpage screening result.
CN201410769099.1A 2014-12-12 2014-12-12 The auto-screening method and device of web page contents Active CN104504027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410769099.1A CN104504027B (en) 2014-12-12 2014-12-12 The auto-screening method and device of web page contents

Publications (2)

Publication Number Publication Date
CN104504027A true CN104504027A (en) 2015-04-08
CN104504027B CN104504027B (en) 2019-11-12

Family

ID=52945425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410769099.1A Active CN104504027B (en) 2014-12-12 2014-12-12 The auto-screening method and device of web page contents

Country Status (1)

Country Link
CN (1) CN104504027B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294785A (en) * 2016-08-12 2017-01-04 北京创新乐知信息技术有限公司 Content Selection method and system
WO2017092355A1 (en) * 2015-12-01 2017-06-08 乐视控股(北京)有限公司 Data service system
WO2017161577A1 (en) * 2016-03-25 2017-09-28 马岩 Data removal method and system
CN107545905A (en) * 2017-08-21 2018-01-05 北京合光人工智能机器人技术有限公司 Emotion identification method based on sound property
CN108095740A (en) * 2017-12-20 2018-06-01 姜涵予 A kind of user emotion appraisal procedure and device
CN109918516A (en) * 2019-03-13 2019-06-21 百度在线网络技术(北京)有限公司 A kind of data processing method, device and terminal
CN110717110A (en) * 2019-10-12 2020-01-21 北京达佳互联信息技术有限公司 Multimedia resource filtering method and device, electronic equipment and storage medium
CN114647466A (en) * 2020-12-17 2022-06-21 国信君和(北京)科技有限公司 Page content extraction method, device, equipment and computer readable storage medium

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593200A (en) * 2009-06-19 2009-12-02 淮海工学院 Chinese Web page classification method based on the keyword frequency analysis
US20120271810A1 (en) * 2009-07-17 2012-10-25 Erzhong Liu Method for inputting and processing feature word of file content
CN102902790A (en) * 2012-09-29 2013-01-30 北京奇虎科技有限公司 Web page classification system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
秦鹏,李恒训,张华平,刘金刚: "基于关键词提取的搜索结果聚类研究", 《第五届全国信息检索学术会议CCIR2009》 *

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2017092355A1 (en) * 2015-12-01 2017-06-08 乐视控股(北京)有限公司 Data service system
WO2017161577A1 (en) * 2016-03-25 2017-09-28 马岩 Data removal method and system
CN106294785A (en) * 2016-08-12 2017-01-04 北京创新乐知信息技术有限公司 Content Selection method and system
CN107545905A (en) * 2017-08-21 2018-01-05 北京合光人工智能机器人技术有限公司 Emotion identification method based on sound property
CN107545905B (en) * 2017-08-21 2021-01-05 北京合光人工智能机器人技术有限公司 Emotion recognition method based on sound characteristics
CN108095740A (en) * 2017-12-20 2018-06-01 姜涵予 A kind of user emotion appraisal procedure and device
CN108095740B (en) * 2017-12-20 2021-06-22 姜涵予 User emotion assessment method and device
CN109918516A (en) * 2019-03-13 2019-06-21 百度在线网络技术(北京)有限公司 A kind of data processing method, device and terminal
CN109918516B (en) * 2019-03-13 2021-07-30 百度在线网络技术(北京)有限公司 Data processing method and device and terminal
CN110717110A (en) * 2019-10-12 2020-01-21 北京达佳互联信息技术有限公司 Multimedia resource filtering method and device, electronic equipment and storage medium
CN114647466A (en) * 2020-12-17 2022-06-21 国信君和(北京)科技有限公司 Page content extraction method, device, equipment and computer readable storage medium

Also Published As

Publication number Publication date
CN104504027B (en) 2019-11-12

Similar Documents

Publication Publication Date Title
CN104504027A (en) Method and device for automatically selecting webpage content
CN108021929B (en) Big data-based mobile terminal e-commerce user portrait establishing and analyzing method and system
CN106484858B (en) hot content pushing method and device
CN112632385A (en) Course recommendation method and device, computer equipment and medium
CN108038119A (en) Utilize the method, apparatus and storage medium of new word discovery investment target
CN109726327A (en) A kind of information-pushing method and device
CN110008378A (en) Corpus collection method, device, equipment and storage medium based on artificial intelligence
CN105447184A (en) Information capturing method and device
WO2017107571A1 (en) Method and system for determining quality of application on basis of user behaviors of application management
CN112148881B (en) Method and device for outputting information
CN106557410B (en) User behavior analysis method and apparatus based on artificial intelligence
CN103024169A (en) Method and device for starting communication terminal application program through voice
JP2017168057A (en) Device, system, and method for sorting images
CN109241392A (en) Recognition methods, device, system and the storage medium of target word
CN102542061A (en) Intelligent product classification method
CN111931809A (en) Data processing method and device, storage medium and electronic equipment
CN105718543A (en) Sentence display method and device
CN113962199B (en) Text recognition method, text recognition device, text recognition equipment, storage medium and program product
CN111355628A (en) Model training method, business recognition device and electronic device
CN103475532A (en) Hardware detection method and system thereof
CN106293650B (en) Folder attribute setting method and device
CN116383521B (en) Subject word mining method and device, computer equipment and storage medium
Garg et al. Android app behaviour classification using topic modeling techniques and outlier detection using app permissions
CN116166867A (en) Content filtering method, device, equipment and storage medium for network acquisition
CN111338811A (en) User writing behavior analysis method, server, terminal, system and electronic equipment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

CB02 Change of applicant information
GR01 Patent grant