CN104504027B - The auto-screening method and device of web page contents - Google Patents

Method and device for automatic screening of web page content

Info

Publication number
CN104504027B
Authority
CN
China
Prior art keywords
webpage
web page
label
selection result
page contents
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410769099.1A
Other languages
Chinese (zh)
Other versions
CN104504027A (en)
Inventor
陈俊宏
余德乐
杨韬
赵冬玲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201410769099.1A
Publication of CN104504027A
Application granted
Publication of CN104504027B
Legal status: Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/953: Querying, e.g. by the use of web search engines
    • G06F16/9535: Search customisation based on user profiles and personalisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a device for automatically screening web page content. The method comprises: reading web page content from a source database; screening the web page content according to a preset keyword dictionary and preset screening parameters to obtain a web page screening result; extracting a preset label information dictionary; adding labels of any one or more types from the label information dictionary to the web page screening result; and, according to the label information added to the web page screening result, performing on the web page screening result the functional processing that corresponds to the label information, to obtain the automatically screened web page content. The invention solves the prior-art problem that manually screening the large volume of web page content updated every day is cumbersome and inefficient.

Description

Method and device for automatic screening of web page content
Technical field
The present invention relates to the field of computers, and in particular to a method and a device for automatically screening web page content.
Background art
At present, public opinion monitoring systems that monitor web page content allow users to screen for the text content they need and then operate on the screened text (for example, by classifying or labelling it), which serves users' varied needs well. One problem remains, however: web page content on the network is updated every day, and the daily volume of updates is huge. A user who needs to monitor the latest developments continuously must therefore analyse the updated web page content along his or her own classification dimensions each time. This means that all the text content has to be screened by hand every day and then operated on again after screening, which is a cumbersome and laborious process.
No effective solution has yet been proposed for the prior-art problem that manually screening the large volume of web page content updated every day is cumbersome and inefficient.
Summary of the invention
The main object of the present invention is to provide a method and a device for automatically screening web page content, so as to solve the prior-art problem that manually screening the large volume of web page content updated every day is cumbersome and inefficient.
To achieve the above object, according to one aspect of the embodiments of the present invention, a method for automatically screening web page content is provided. The method comprises: reading web page content from a source database; screening the web page content according to a preset keyword dictionary and preset screening parameters to obtain a web page screening result; extracting a preset label information dictionary; adding labels of any one or more types from the label information dictionary to the web page screening result; and, according to the label information added to the web page screening result, performing on the web page screening result the functional processing corresponding to the label information, to obtain the automatically screened web page content.
To achieve the above object, according to another aspect of the embodiments of the present invention, a device for automatically screening web page content is provided. The device comprises: a first reading module, configured to read web page content from a source database; a screening module, configured to screen the web page content according to a preset keyword dictionary and preset screening parameters to obtain a web page screening result; an extraction module, configured to extract a preset label information dictionary; a first processing module, configured to add, according to the web page content in the web page screening result, labels of any one or more types from the label information dictionary to the web page screening result; and a second processing module, configured to perform on the web page screening result, according to the label information added to it, the functional processing corresponding to the label information, to obtain the automatically screened web page content.
According to the embodiments of the invention, web page content is read from a source database; the web page content is screened according to a preset keyword dictionary and preset screening parameters to obtain a web page screening result; a preset label information dictionary is extracted; labels of any one or more types from the label information dictionary are added to the web page screening result; and, according to the label information added to the web page screening result, the functional processing corresponding to the label information is performed on the web page screening result to obtain the automatically screened web page content. This solves the prior-art problem that manually screening the large volume of web page content updated every day is cumbersome and inefficient, and achieves the effect of screening web pages automatically and processing them according to their content.
Brief description of the drawings
The drawings, which form a part of this application, are provided for a further understanding of the present invention. The exemplary embodiments of the invention and their descriptions serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for automatically screening web page content according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a preferred method for automatically screening web page content according to Embodiment 1 of the present invention;
Fig. 3 is a schematic structural diagram of the device for automatically screening web page content according to Embodiment 2 of the present invention; and
Fig. 4 is a schematic structural diagram of a preferred device for automatically screening web page content according to Embodiment 2 of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of the present application and the features in those embodiments may be combined with one another. The present invention is described in detail below with reference to the accompanying drawings and in conjunction with the embodiments.
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention, without creative effort, shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings of the present invention are used to distinguish similar objects and are not necessarily used to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments of the present invention described herein can be practiced. In addition, the terms "comprise" and "have", and any variants of them, are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to such a process, method, product or device.
Embodiment 1
An embodiment of the present invention provides a method for automatically screening web page content.
Fig. 1 is a flowchart of the method for automatically screening web page content according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S11: read the web page content in the source database.
Specifically, in step S11 the web page content stored in the source database is read. The source database is used to store web page content that is updated regularly.
Step S13: screen the web page content according to the preset keyword dictionary and the preset screening parameters to obtain a web page screening result.
Specifically, in step S13 the web page content read from the source database is screened. The web page content may first be screened according to the preset keyword dictionary, and the result of that screening may then be screened according to the preset screening parameters, yielding the web page screening result.
Step S15: extract the preset label information dictionary.
Specifically, in step S15 the label information dictionary preset for the web page content is extracted, giving the label information dictionary used to label web pages.
Step S17: according to the web page content in the web page screening result, add to the web page screening result the labels of any one or more types in the label information dictionary that match the web page content.
Specifically, in step S17 the content of the web page screening result is matched against the labels in the label information dictionary, and the matching yields one or more labels, of one or more types, that match the content of the web page screening result. The label information dictionary contains several different types of label information.
Step S19: according to the label information added to the web page screening result, perform on the web page screening result the functional processing corresponding to the label information, to obtain the automatically screened web page content.
Specifically, in step S19, for the label information of the one or more types attached to the web page screening result, the function corresponding to each label type is called to process the web page content in the web page screening result, thereby automatically screening the web page screening result.
In steps S11 to S19, after the web page content in the source database is read, it is first screened according to the keyword dictionary and the preset screening parameters: web page content containing one or more keywords from the keyword dictionary is obtained, and this content is then further screened according to the screening parameters to obtain the web page content that meets one or more of the screening conditions in the screening parameters, which is the web page screening result. The web page screening result is then labelled according to the label information dictionary: when an item of web page content in the screening result matches one or more label types in the label information dictionary, the corresponding labels are added to that item. Finally, the function matching each label type is called to process the web page content in the web page screening result.
In summary, the present invention solves the prior-art problem that manually screening the large volume of web page content updated every day is cumbersome and inefficient, and achieves the effect of screening web pages automatically and processing them according to their content.
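As a rough illustration of steps S11 to S17, the following Python sketch screens a handful of in-memory pages and attaches labels to the survivors. It is not taken from the patent: the record layout, the function names `screen` and `add_labels`, the substring matching, the label names, and the choice to require every screening condition to hold are all assumptions made for the example.

```python
from typing import Callable, Dict, List

Page = Dict[str, object]  # hypothetical record: {"text": str, "source": str, ...}


def screen(pages: List[Page], keywords: List[str],
           conditions: List[Callable[[Page], bool]]) -> List[Page]:
    """S13: keep pages that contain any keyword and meet every condition."""
    kept = []
    for page in pages:
        text = str(page["text"])
        if any(k in text for k in keywords) and all(c(page) for c in conditions):
            kept.append(page)
    return kept


def add_labels(pages: List[Page], label_dict: Dict[str, List[str]]) -> List[Page]:
    """S15/S17: attach every label whose trigger terms occur in the page text."""
    labelled = []
    for page in pages:
        text = str(page["text"])
        labels = [name for name, terms in label_dict.items()
                  if any(t in text for t in terms)]
        labelled.append({**page, "labels": labels})
    return labelled


# S11: stand-in for rows read from the source database.
source_pages = [
    {"text": "Acme recalls faulty phone battery", "source": "news"},
    {"text": "Weather forecast for Friday", "source": "news"},
    {"text": "Acme phone giveaway, click here", "source": "ads"},
]
screened = screen(source_pages, ["Acme"], [lambda p: p["source"] == "news"])
result = add_labels(screened, {"classification:product": ["phone"],
                               "emotion:negative": ["recall", "faulty"]})
print(result)
```

Only the first page survives both the keyword check and the source condition, and it picks up both labels because terms from each label entry occur in its text.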
Preferably, in the above embodiment of the present application, the label information dictionary may include one or more of the following label types: classification label, emotion label, region label, blacklist label and to-be-deleted label. Step S19, in which the functional processing corresponding to the label information is performed on the web page screening result according to the label information added to it to obtain the automatically screened web page content, comprises:
Step S191: read the label information of the one or more labels in the web page screening result.
Step S193: call the corresponding function according to the type of the label information to process the web page screening result, generating the automatically screened web page content.
Specifically, in steps S191 to S193 the label information of the labels in the web page screening result is read and the function corresponding to each label type is called. The different functions perform the corresponding processing on the web page content in the web page screening result.
Preferably, in the above embodiment of the present application, step S193, in which the corresponding function is called according to the type of the label information to process the web page screening result and generate the automatically screened web page content, includes at least any one or more of the following schemes:
Scheme one: when the label is a classification label, the function called is a classification function, which classifies the web page screening result and generates the classified web page content.
Scheme two: when the label is an emotion label, the function called is a function that corrects the emotion label, which corrects the emotion label of the web page screening result and generates the corrected web page content.
Scheme three: when the label is a region label, the function called is a function that corrects the region label, which corrects the region label of the web page screening result and generates the corrected web page content.
Scheme four: when the label is a blacklist label, the function called is a function that screens the web page screening results corresponding to the blacklist label, which screens the web page screening results carrying the blacklist label and generates the screened web page content.
Scheme five: when the label is a to-be-deleted label, the function called is a function that deletes the web page screening results corresponding to the to-be-deleted label, which deletes the web page screening results carrying the to-be-deleted label and generates the web page content remaining after deletion.
Specifically, in practical applications other processing schemes may also be applied to the web page screening result as the actual situation requires; the processing is not limited to the five schemes above.
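One way to picture the five schemes is a lookup table from label type to handler function. The Python sketch below is only an illustration under assumptions: the label type names, the `(type, value)` label encoding, the handler names, and the decision to flag blacklisted pages rather than remove them are hypothetical and not specified in the text.

```python
from typing import Callable, Dict, List, Optional

Page = Dict[str, object]  # hypothetical record with a "labels" list of (type, value)
Handler = Callable[[Page, str], Optional[Page]]


def classify(page: Page, value: str) -> Optional[Page]:
    return {**page, "category": value}          # scheme one


def correct_emotion(page: Page, value: str) -> Optional[Page]:
    return {**page, "emotion": value}           # scheme two


def correct_region(page: Page, value: str) -> Optional[Page]:
    return {**page, "region": value}            # scheme three


def blacklist(page: Page, value: str) -> Optional[Page]:
    # Assumption: blacklisted pages are kept but flagged for later screening.
    return {**page, "blacklisted": True}        # scheme four


def delete(page: Page, value: str) -> Optional[Page]:
    return None                                 # scheme five: drop the page


HANDLERS: Dict[str, Handler] = {
    "classification": classify,
    "emotion": correct_emotion,
    "region": correct_region,
    "blacklist": blacklist,
    "to_delete": delete,
}


def process(pages: List[Page]) -> List[Page]:
    """S191/S193: read each page's labels and call the matching handler."""
    output = []
    for page in pages:
        current: Optional[Page] = page
        for label_type, value in page.get("labels", []):
            handler = HANDLERS.get(label_type)
            if handler is None:
                continue                        # unknown label types are ignored
            current = handler(current, value)
            if current is None:
                break                           # page was deleted
        if current is not None:
            output.append(current)
    return output


pages = [
    {"text": "a", "labels": [("classification", "finance"), ("emotion", "negative")]},
    {"text": "b", "labels": [("to_delete", "")]},
    {"text": "c", "labels": [("blacklist", ""), ("region", "Beijing")]},
]
processed = process(pages)
print(processed)
```

Because the dispatch goes through a dictionary, a further processing scheme could be supported by registering one more entry in `HANDLERS`, which matches the remark that the schemes are not limited to these five.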
Preferably, in the above embodiment of the present application, step S13, in which the web page content is screened according to the preset keyword dictionary and the preset screening parameters to obtain the web page screening result, comprises:
Step S131: screen the web page content according to the preset keyword dictionary to obtain first pre-processed web page content.
Step S133: screen the first pre-processed web page content according to the preset screening parameters to obtain second pre-processed web page content, which is the web page screening result.
Specifically, in steps S131 to S133 the web pages are first screened according to the preset keyword dictionary to obtain the first pre-processed web page content, which matches the keywords contained in the keyword dictionary. The first pre-processed web page content is then screened according to the preset screening parameters to obtain the second pre-processed web page content, which matches the conditions of the screening parameters. The second pre-processed web page content is the web page screening result.
Preferably, in the above embodiment of the present application, step S133, in which the first pre-processed web page content is screened according to the preset screening parameters to obtain the second pre-processed web page content serving as the web page screening result, comprises:
Step S1331: read the preset screening parameters and the screening order.
Step S1333: according to the screening parameters and the screening order, screen the first pre-processed web page content successively, in the screening order, using each screening parameter as a condition, to obtain the second pre-processed web page content. The screening parameters include at least any one or more of the following screening items: web page text content, web page text source, web page text author, web page text acquisition time, web page emotion label and web page text publication region information.
Specifically, in steps S1331 to S1333 the first pre-processed web page content is screened against the screening parameters one after another, in the screening order. This layer-by-layer screening narrows the range step by step and removes the dirty data from the first pre-processed web page content, yielding the web page screening result.
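The two-stage screening of steps S131 and S1331 to S1333 might look like the following Python sketch. The field names, the representation of each screening parameter as a set of allowed values, and the function names are assumptions for illustration; the patent does not prescribe a data format.

```python
from typing import Dict, List, Set

Page = Dict[str, str]  # hypothetical flat record of text and metadata fields


def keyword_stage(pages: List[Page], keywords: List[str]) -> List[Page]:
    """S131: first pre-processed content = pages containing any keyword."""
    return [p for p in pages if any(k in p["text"] for k in keywords)]


def screen_in_order(pages: List[Page],
                    screening_params: Dict[str, Set[str]],
                    order: List[str]) -> List[Page]:
    """S1331/S1333: apply each screening parameter in the configured order."""
    remaining = pages
    for field in order:
        allowed = screening_params[field]
        # Each pass narrows the set left over from the previous pass.
        remaining = [p for p in remaining if p.get(field) in allowed]
    return remaining


pages = [
    {"text": "Acme earnings beat forecast", "source": "news", "emotion": "positive"},
    {"text": "Acme store opening", "source": "forum", "emotion": "positive"},
    {"text": "Acme lawsuit filed", "source": "news", "emotion": "negative"},
    {"text": "Local weather", "source": "news", "emotion": "positive"},
]
first = keyword_stage(pages, ["Acme"])
second = screen_in_order(first,
                         {"source": {"news"}, "emotion": {"positive"}},
                         ["source", "emotion"])
print(len(first), len(second))
```

In this sketch the screening order does not change the final set, since every condition must hold; ordering the most selective parameter first would only reduce the work done by later passes.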
Preferably, as shown in Fig. 2, in the above embodiment of the present application, before the web page content in the source database is read in step S11, the method further comprises:
Step S101: read the preset target web page addresses.
Step S103: download, according to each target web page address, the web page content corresponding to that address.
Step S105: store the downloaded target web page content in the source database.
Specifically, in steps S101 to S105 the web page content at the preset target web page addresses is crawled: each target web page address is accessed to obtain the corresponding web page content, and the content obtained is stored in the source database as source data.
Preferably, in the above embodiment of the present application, the web page content may include one or more of the following: web page text content, web page text source, web page text author, web page text acquisition time and web page text publication region information.
Specifically, in the step of downloading the web page content corresponding to a target web page address, when the web page text content is downloaded, the web page text source, web page text author, web page text acquisition time, web page text publication region information and so on that correspond to that text content also need to be saved, and this information is stored in the source database in association with the text content.
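Steps S101 to S105 could be sketched as below, using Python's built-in `sqlite3` module as a stand-in source database. The table layout, the `crawl` function, and the injected `fetch` callable are hypothetical; a stub fetcher replaces real network access so that the example runs offline.

```python
import sqlite3
from datetime import datetime, timezone
from typing import Callable, Dict, Iterable


def init_source_db(conn: sqlite3.Connection) -> None:
    # Assumed schema: one row per page, text plus its associated metadata.
    conn.execute(
        """CREATE TABLE IF NOT EXISTS pages (
               url TEXT PRIMARY KEY, text TEXT, source TEXT,
               author TEXT, fetched_at TEXT, region TEXT)"""
    )


def crawl(conn: sqlite3.Connection, urls: Iterable[str],
          fetch: Callable[[str], Dict[str, str]]) -> int:
    """S101-S105: read target addresses, download each page, store it."""
    count = 0
    for url in urls:                       # S101: preset target addresses
        page = fetch(url)                  # S103: download the page content
        conn.execute(                      # S105: store text with its metadata
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?, ?)",
            (url, page["text"], page["source"], page["author"],
             datetime.now(timezone.utc).isoformat(), page["region"]),
        )
        count += 1
    conn.commit()
    return count


def fake_fetch(url: str) -> Dict[str, str]:
    # Stub standing in for a real HTTP download and HTML parse.
    return {"text": f"body of {url}", "source": "example",
            "author": "unknown", "region": "n/a"}


conn = sqlite3.connect(":memory:")
init_source_db(conn)
stored = crawl(conn, ["https://example.com/a", "https://example.com/b"], fake_fetch)
rows = conn.execute("SELECT url, text, source FROM pages ORDER BY url").fetchall()
print(stored, rows)
```

Keeping the fetcher as a parameter means the storage logic can be exercised without a network; a real crawler would supply a function that performs the download and extracts the text, source, author and region fields.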
Preferably, in the above embodiment of the present application, after the web page content is screened in step S13 according to the preset keyword dictionary and the preset screening parameters to obtain the web page screening result, the method further comprises:
Step S141: read a preset text emotion dictionary, where the text emotion dictionary includes a positive emotion word dictionary and a negative emotion word dictionary.
Step S143: judge the content of the web page screening result according to the text emotion dictionary to obtain the emotion label corresponding to the content of the web page screening result.
Specifically, in steps S141 and S143 emotion analysis is performed on the web page content in the web page screening result according to the preset text emotion dictionary. The web page content is matched against one or more emotion entries in the positive emotion word dictionary and in the negative emotion word dictionary of the text emotion dictionary, yielding the emotion label corresponding to the content of the web page screening result.
Preferably, in the above embodiment of the present application, step S143, in which the content of the web page screening result is judged according to the text emotion dictionary to obtain the emotion label corresponding to that content, comprises:
Step S1431: judge the content of the web page screening result according to the text emotion dictionary:
when the number of positive emotion words from the positive emotion word dictionary contained in the content of the web page screening result exceeds a preset threshold, determine that the emotion label of the content of the web page screening result is positive emotion;
when the number of negative emotion words from the negative emotion word dictionary contained in the content of the web page screening result exceeds a preset threshold, determine that the emotion label of the content of the web page screening result is negative emotion.
Specifically, in step S1431 the emotion label corresponding to the content of the web page screening result is determined. If the number of positive emotion words contained in the content of the web page screening result exceeds the preset threshold, the content is determined to be positive emotion; if the number of negative emotion words contained in the content exceeds the preset threshold, the content is determined to be negative emotion.
In practical applications, a web page may contain both positive emotion words and negative emotion words. In that case, the judgment can be based on the difference between the number of positive emotion words and the number of negative emotion words in the web page: when the number of positive emotion words is greater than the number of negative emotion words, the emotion of the web page is determined to be positive; when the number of positive emotion words is less than the number of negative emotion words, the emotion of the web page is determined to be negative.
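The threshold rule of step S1431 and the tie-breaking comparison for mixed pages can be combined as in the Python sketch below. Counting by substring occurrence, using one shared threshold for both polarities, and returning `None` when the counts are equal or neither exceeds the threshold are assumptions; the text does not say how those cases are handled.

```python
from typing import List, Optional


def count_hits(text: str, words: List[str]) -> int:
    """Count occurrences of dictionary words in the text (substring match)."""
    return sum(text.count(w) for w in words)


def emotion_label(text: str, positive_words: List[str],
                  negative_words: List[str], threshold: int = 0) -> Optional[str]:
    pos = count_hits(text, positive_words)
    neg = count_hits(text, negative_words)
    # A polarity wins only if it exceeds the threshold and outnumbers the other.
    if pos > threshold and pos > neg:
        return "positive"
    if neg > threshold and neg > pos:
        return "negative"
    return None  # assumption: no label when neither rule applies


POSITIVE = ["great", "good"]
NEGATIVE = ["bad", "poor"]

label_a = emotion_label("great great service, bad price", POSITIVE, NEGATIVE, threshold=1)
label_b = emotion_label("bad and poor", POSITIVE, NEGATIVE, threshold=1)
label_c = emotion_label("good bad", POSITIVE, NEGATIVE)
print(label_a, label_b, label_c)
```

Substring counting is a simplification that suits unsegmented text; for languages with word boundaries, a tokenizer would avoid matching a dictionary word inside a longer word.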
In summary, in practical applications, to overcome the prior-art drawback that all text content has to be filtered every day and then operated on again after filtering, judgment logic made up of multiple rules can be set for the web page content, and users can preset screening conditions according to their own preferences (for example: which keywords are included, which channel the content comes from, what the time range is, what the emotion level is, who the author is, and so on). Then, whenever new web page text content is crawled from the Internet, the system automatically determines whether it satisfies the previously set judgment rules and screening rules. If it does, the web page text is operated on automatically (for example, classified, labelled, given a changed emotion or region attribute, starred, blacklisted or deleted).
To implement the above functions, two modules can be used: a front-end automatic screening condition setting module and a robot crawler module (that is, a web page capture module).
Front-end automatic screening condition setting module:
Users can configure automatic screening and operations in the product front end. First, the content is checked for the keywords. Then screening conditions are set through the screening parameters, for example conditions on the text source, the text emotion, information about the text author and the text crawl time. Finally, when both of the above conditions are met, the web page content is operated on automatically (for example, classified, labelled, given a changed emotion or region attribute, starred, blacklisted or deleted).
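A user-defined rule of the kind this module collects could be represented as a small data class. The following Python sketch is hypothetical throughout: the class name `ScreeningRule`, its fields, the action strings, and the convention that an unset condition matches anything are illustrative assumptions rather than the product's actual configuration format.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional, Set


@dataclass
class ScreeningRule:
    """Hypothetical shape of one rule configured in the front end."""
    keywords: List[str]
    sources: Optional[Set[str]] = None       # None means "any source"
    authors: Optional[Set[str]] = None
    emotions: Optional[Set[str]] = None
    fetched_after: Optional[datetime] = None
    fetched_before: Optional[datetime] = None
    actions: List[str] = field(default_factory=list)  # e.g. ["star"]

    def matches(self, page: Dict[str, object]) -> bool:
        text = str(page["text"])
        if not any(k in text for k in self.keywords):      # keyword check first
            return False
        if self.sources is not None and page.get("source") not in self.sources:
            return False
        if self.authors is not None and page.get("author") not in self.authors:
            return False
        if self.emotions is not None and page.get("emotion") not in self.emotions:
            return False
        fetched = page.get("fetched_at")
        if self.fetched_after is not None and (fetched is None or fetched < self.fetched_after):
            return False
        if self.fetched_before is not None and (fetched is None or fetched > self.fetched_before):
            return False
        return True


rule = ScreeningRule(keywords=["Acme"], sources={"news"},
                     fetched_after=datetime(2014, 12, 1), actions=["star"])
page_ok = {"text": "Acme update", "source": "news", "fetched_at": datetime(2014, 12, 12)}
page_old = {"text": "Acme update", "source": "news", "fetched_at": datetime(2014, 11, 1)}

# The rule's actions would be applied only to pages that satisfy every condition.
actions = rule.actions if rule.matches(page_ok) else []
print(actions, rule.matches(page_old))
```

Storing the actions alongside the conditions keeps each rule self-describing, so newly crawled pages can be checked and operated on without further user input.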
Web page capture module:
The web page capture module may be a piece of JavaScript code. The range to be crawled (the websites) can be defined in advance; the crawler fetches all the online text and image content within that range, stores it in the database, and determines whether it meets the screening conditions set by the user. If it does, the corresponding processing is performed on it.
The present invention uses screening rules set manually by the user in advance to screen content automatically after it has been crawled from the Internet, and automatically performs the subsequent operations on the web page content obtained by screening. Apart from the initial setup, no user intervention is needed anywhere in the process. The method greatly enhances the user's initiative: each newly crawled batch of web page content is screened according to the preset rules and the corresponding operations are performed on the content obtained by screening, which reduces the repeated manual operations that users would otherwise have to perform whenever a website or web page is updated.
Embodiment 2
An embodiment of the present invention also provides a device for automatically screening web page content. As shown in Fig. 3, the device may include: a first reading module 31, a screening module 33, an extraction module 35, a first processing module 37 and a second processing module 39.
The first reading module 31 is configured to read the web page content in the source database.
Specifically, the first reading module 31 reads the web page content stored in the source database. The source database is used to store web page content that is updated regularly.
The screening module 33 is configured to screen the web page content according to the preset keyword dictionary and the preset screening parameters to obtain a web page screening result.
Specifically, the screening module 33 screens the web page content read from the source database. The web page content may first be screened according to the preset keyword dictionary, and the result of that screening may then be screened according to the preset screening parameters, yielding the web page screening result.
The extraction module 35 is configured to extract the preset label information dictionary.
Specifically, the extraction module 35 extracts the label information dictionary preset for the web page content, giving the label information dictionary used to label web pages.
The first processing module 37 is configured to add, according to the web page content in the web page screening result, labels of any one or more types from the label information dictionary to the web page screening result.
Specifically, the first processing module 37 matches the content of the web page screening result against the labels in the label information dictionary, and the matching yields one or more labels, of one or more types, that match the content of the web page screening result. The label information dictionary contains several different types of label information.
The second processing module 39 is configured to perform on the web page screening result, according to the label information added to it, the functional processing corresponding to the label information, to obtain the automatically screened web page content.
Specifically, for the label information of the one or more types attached to the web page screening result, the second processing module 39 calls the function corresponding to each label type to process the web page content in the web page screening result, thereby automatically screening the web page screening result.
With the first reading module 31, the screening module 33, the extraction module 35, the first processing module 37 and the second processing module 39, after the web page content in the source database is read, it is first screened according to the keyword dictionary and the preset screening parameters: web page content containing one or more keywords from the keyword dictionary is obtained, and this content is then further screened according to the screening parameters to obtain the web page content that meets one or more of the screening conditions in the screening parameters, which is the web page screening result. The web page screening result is then labelled according to the label information dictionary: when an item of web page content in the screening result matches one or more label types in the label information dictionary, the corresponding labels are added to that item. Finally, the function matching each label type is called to process the web page content in the web page screening result.
In summary, the present invention solves the prior-art problem that manually screening the large volume of web page content updated every day is cumbersome and inefficient, and achieves the effect of screening web pages automatically and processing them according to their content.
Further, the label information dictionary may include one or more of the following label types: classification label, emotion label, region label, blacklist label and to-be-deleted label. In the second processing module 39, the step of performing on the web page screening result, according to the label information added to it, the functional processing corresponding to the label information to obtain the automatically screened web page content comprises:
Read the label information of one or more labels in webpage the selection result.
Corresponding power function is called to handle webpage the selection result according to the type of label information, generates automatic screening Web page contents afterwards.
Specifically, the reading to the label information of label in webpage the selection result through the above steps, calling and tag class The corresponding power function of type.The web page contents in webpage the selection result are performed corresponding processing by different power functions.
Further, corresponding power function is called to handle webpage the selection result according to the type of label information, it is raw In after automatic screening the step of web page contents, following any one or more schemes are included at least:
Scheme one: in the case where label is tag along sort, the power function of calling is classification feature function, so as to net Page the selection result carries out classification processing, generates sorted web page contents.
Scheme two: in the case where label is mood label, the power function of calling is the function letter for correcting mood label Number, so that the mood label to webpage the selection result is modified processing, generates revised web page contents.
Scheme three: in the case where label is region label, the power function of calling is the function letter for correcting region label Number, so that the region label to webpage the selection result is modified processing, generates revised web page contents.
Scheme four: in the case where label is blacklist label, the power function of calling is screening and blacklist label pair The power function for the webpage the selection result answered generates sieve so that the blacklist label to webpage the selection result carries out Screening Treatment Web page contents after choosing.
Scheme five: in the case where label is label to be deleted, the power function of calling is to delete and label pair to be deleted The power function for the webpage the selection result answered, so that delete processing is carried out to the webpage the selection result with label to be deleted, it is raw At the web page contents after deletion.
Specifically, in practical application according to the needs of actual conditions, other can also be carried out to webpage the selection result Processing scheme, however it is not limited to five kinds of above-mentioned schemes.
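The five schemes amount to dispatching on label type. A minimal sketch of such a dispatch table is given below. All names (`HANDLERS`, `process`, the label keys) are hypothetical, and the handler bodies are placeholders: the specification does not say how a label is "corrected" or what blacklist screening does, so correction is modeled here as simple normalization and blacklisted or to-be-deleted pages are simply dropped.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]  # hypothetical record: {"text": str, "labels": {type: value}}


def _with_label(page: Page, label_type: str, value: str) -> Page:
    """Return a copy of the page with one label value replaced."""
    labels = dict(page["labels"])
    labels[label_type] = value
    return {**page, "labels": labels}


def classify(page: Page) -> Optional[Page]:
    # Scheme one: file the page under its classification label.
    return {**page, "category": page["labels"]["classification"]}


def correct_emotion(page: Page) -> Optional[Page]:
    # Scheme two: "correction" is modeled as normalization (assumption).
    return _with_label(page, "emotion", page["labels"]["emotion"].strip().lower())


def correct_region(page: Page) -> Optional[Page]:
    # Scheme three: likewise modeled as whitespace cleanup (assumption).
    return _with_label(page, "region", page["labels"]["region"].strip())


def screen_blacklisted(page: Page) -> Optional[Page]:
    # Scheme four: blacklisted pages are screened out (assumption).
    return None


def delete_page(page: Page) -> Optional[Page]:
    # Scheme five: pages marked for deletion are removed.
    return None


HANDLERS: Dict[str, Callable[[Page], Optional[Page]]] = {
    "classification": classify,
    "emotion": correct_emotion,
    "region": correct_region,
    "blacklist": screen_blacklisted,
    "to_delete": delete_page,
}


def process(pages: List[Page]) -> List[Page]:
    """Call the handler for each label type; drop pages a handler removes."""
    result: List[Page] = []
    for page in pages:
        current: Optional[Page] = page
        for label_type in list(page["labels"]):
            handler = HANDLERS.get(label_type)
            if handler is None:
                continue  # unknown label types are ignored in this sketch
            current = handler(current)
            if current is None:
                break
        if current is not None:
            result.append(current)
    return result
```

Because the schemes are not limited to these five, further processing could be added by registering another entry in the table rather than changing `process`.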
Further, in the screening module 33, the step of screening the web page contents according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result includes:
Screening the web page contents according to the preset keyword dictionary to obtain first preprocessed web page contents.
Screening the first preprocessed web page contents according to the preset screening parameters to obtain second preprocessed web page contents, wherein the second preprocessed web page contents are the webpage screening result.
Specifically, through the above steps, the webpages are first screened according to the preset keyword dictionary to obtain first preprocessed web page contents that match the keywords in the keyword dictionary. The first preprocessed web page contents are then screened according to the preset screening parameters, yielding second preprocessed web page contents that match the conditions of the screening parameters. The second preprocessed web page contents are the webpage screening result.
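The two-stage screening can be sketched as below. This is an illustration under assumptions: pages are taken to be dictionaries with a `text` field, keywords are matched as substrings, and screening parameters are represented as a mapping from a field name to its set of allowed values. None of these representations is specified in the text.

```python
from typing import Any, Dict, Iterable, List, Set

Page = Dict[str, Any]  # hypothetical record, e.g. {"text": ..., "source": ...}


def keyword_screen(pages: Iterable[Page], keywords: Set[str]) -> List[Page]:
    """Stage 1: keep pages whose text contains at least one dictionary keyword."""
    return [p for p in pages if any(k in p["text"] for k in keywords)]


def parameter_screen(pages: Iterable[Page], params: Dict[str, Set[Any]]) -> List[Page]:
    """Stage 2: keep pages whose fields fall within the allowed values."""
    return [
        p for p in pages
        if all(p.get(field) in allowed for field, allowed in params.items())
    ]


def screen(pages: Iterable[Page], keywords: Set[str],
           params: Dict[str, Set[Any]]) -> List[Page]:
    first_preprocessed = keyword_screen(pages, keywords)
    second_preprocessed = parameter_screen(first_preprocessed, params)
    return second_preprocessed  # this is the webpage screening result
```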
Further, in the above steps, the step of screening the first preprocessed web page contents according to the preset screening parameters to obtain the second preprocessed web page contents serving as the webpage screening result includes:
Reading the preset screening parameters and screening order.
Screening the first preprocessed web page contents successively in the screening order, using each screening parameter as a condition, to obtain the second preprocessed web page contents, wherein the screening parameters include at least any one or more of the following screening items: webpage text content, webpage text source, webpage text author, webpage text acquisition time, webpage emotion label, and webpage text publication region information.
Specifically, through the above steps, the first preprocessed web page contents are screened successively by the screening parameters in the screening order. This layer-by-layer screening progressively narrows the scope and removes dirty data from the first preprocessed web page contents, yielding the webpage screening result.
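The layer-by-layer narrowing can be illustrated with an ordered list of predicates. In this sketch, `conditions` and `order` are hypothetical names for the screening parameters and the screening order, and each condition is assumed to be expressible as a function from a page to a boolean.

```python
from typing import Any, Callable, Dict, List, Sequence, Tuple

Page = Dict[str, Any]
Condition = Callable[[Page], bool]


def screen_in_order(
    pages: Sequence[Page],
    conditions: Dict[str, Condition],
    order: Sequence[str],
) -> Tuple[List[Page], List[Tuple[str, int]]]:
    """Apply each screening condition in the configured order.

    Returns the surviving pages and a trace of how many pages remained
    after each stage, which makes the progressive narrowing visible.
    """
    remaining = list(pages)
    trace: List[Tuple[str, int]] = []
    for name in order:
        predicate = conditions[name]
        remaining = [p for p in remaining if predicate(p)]
        trace.append((name, len(remaining)))
    return remaining, trace
```

Placing the most selective or cheapest condition first in the order reduces the work done by later stages, which is one plausible reason for making the order configurable.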
Preferably, as shown in Fig. 4, in the above embodiment of the present application, the device further includes: a second read module 301, a download module 303 and a memory module 305.
The second read module 301 is configured to read a preset target webpage address.
The download module 303 is configured to download the web page contents corresponding to the target webpage address.
The memory module 305 is configured to store the downloaded target web page contents in the source database.
Specifically, the second read module 301, the download module 303 and the memory module 305 crawl the web page contents at the preset target webpage addresses. The target webpage address is accessed to obtain the corresponding web page contents, which are then stored in the source database as source data.
Further, the web page contents may include one or more of the following items: webpage text content, webpage text source, webpage text author, webpage text acquisition time, and webpage text publication region information.
Specifically, in the step of downloading the web page contents corresponding to the target webpage address, while the webpage text content is downloaded, the webpage text source, webpage text author, webpage text acquisition time, webpage text publication region information and the like corresponding to that text content also need to be saved. This information is stored in the source database in correspondence with the text content.
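A minimal sketch of the read, download and store steps follows, using SQLite as a stand-in for the source database. The table layout, the function names and the injected `fetch` callable are assumptions for illustration; how the author, source or region are extracted from a downloaded page is not described in the text and is left to `fetch`.

```python
import sqlite3
from typing import Any, Callable, Dict, Iterable

Record = Dict[str, Any]


def init_source_db(conn: sqlite3.Connection) -> None:
    """Create a hypothetical source table holding text plus its metadata."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, text TEXT, source TEXT, "
        "author TEXT, acquired_at TEXT, region TEXT)"
    )


def crawl(conn: sqlite3.Connection, urls: Iterable[str],
          fetch: Callable[[str], Record]) -> None:
    """Download each target address and store text and metadata together."""
    for url in urls:
        record = fetch(url)  # caller-supplied downloader and parser
        conn.execute(
            "INSERT OR REPLACE INTO pages VALUES (?, ?, ?, ?, ?, ?)",
            (
                url,
                record["text"],
                record.get("source"),
                record.get("author"),
                record.get("acquired_at"),
                record.get("region"),
            ),
        )
    conn.commit()
```

Keeping the metadata in the same row as the text is what later allows screening by source, author, acquisition time or region without re-downloading the page.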
Further, in the above embodiment of the present application, the device further includes: a third read module 341 and a judgment module 343.
The third read module 341 is configured to read a preset text emotion dictionary, wherein the text emotion dictionary includes a positive emotion word dictionary and a negative emotion word dictionary.
The judgment module 343 is configured to judge the content of the webpage screening result according to the text emotion dictionary, obtaining the emotion label corresponding to the content of the webpage screening result.
Specifically, the third read module 341 and the judgment module 343 perform emotion analysis on the web page contents in the webpage screening result according to the preset text emotion dictionary. The web page contents are matched against one or more emotion entries in the positive emotion word dictionary and the negative emotion word dictionary of the text emotion dictionary, yielding the emotion label corresponding to the content of the webpage screening result.
Further, the step of judging the content of the webpage screening result according to the text emotion dictionary to obtain the emotion label corresponding to the content of the webpage screening result includes:
Judging the content of the webpage screening result according to the text emotion dictionary;
When the number of positive emotion words from the positive emotion word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the emotion label of the content of the webpage screening result is positive emotion;
When the number of negative emotion words from the negative emotion word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the emotion label of the content of the webpage screening result is negative emotion.
Specifically, the above steps determine the emotion label corresponding to the content of the webpage screening result. When the number of positive emotion words contained in the content exceeds the preset threshold, the content is determined to carry positive emotion; when the number of negative emotion words contained in the content exceeds the preset threshold, the content is determined to carry negative emotion.
In practical applications, a webpage may contain both positive and negative emotion words. In that case, the judgment may be based on the difference between the number of positive emotion words and the number of negative emotion words in the webpage: when positive emotion words outnumber negative ones, the webpage's emotion is determined to be positive; when they are outnumbered, it is determined to be negative.
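The threshold rule and the tie-breaking comparison can be sketched as follows. Several points are assumptions of this sketch: emotion words are counted as substring occurrences (so no word segmentation is needed), a single threshold is shared by both dictionaries, and `None` is returned when no label can be determined, a case the text does not address.

```python
from typing import Iterable, Optional


def judge_emotion(
    text: str,
    positive_words: Iterable[str],
    negative_words: Iterable[str],
    threshold: int = 0,
) -> Optional[str]:
    """Return "positive", "negative", or None if undetermined."""
    pos = sum(text.count(w) for w in positive_words)
    neg = sum(text.count(w) for w in negative_words)
    pos_hit = pos > threshold
    neg_hit = neg > threshold
    if pos_hit and neg_hit:
        # Both kinds of words exceed the threshold: compare the counts.
        if pos > neg:
            return "positive"
        if neg > pos:
            return "negative"
        return None  # equal counts are not covered by the description
    if pos_hit:
        return "positive"
    if neg_hit:
        return "negative"
    return None
```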
It should be noted that, for brevity, each of the foregoing method embodiments is described as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units, for instance, is only a division by logical function; other divisions are possible in an actual implementation, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical or take other forms.
Units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes a number of instructions that cause a computer device (which may be a personal computer, mobile terminal, server, network device or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk or an optical disc.
The above are only preferred embodiments of the present invention and are not intended to limit it. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (11)

1. An automatic screening method for web page contents, characterized by comprising:
Reading web page contents in a source database;
Screening the web page contents according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result;
Extracting a preset label information dictionary;
According to the web page contents in the webpage screening result, adding to the webpage screening result any one or more types of labels in the label information dictionary that match the web page contents;
According to the label information added to the webpage screening result, executing on the webpage screening result the function processing corresponding to the label information to obtain automatically screened web page contents;
Wherein the label information dictionary includes one or more of the following label types: classification label, emotion label, region label, blacklist label, and to-be-deleted label, and wherein the step of executing on the webpage screening result, according to the label information added to the webpage screening result, the function processing corresponding to the label information to obtain the automatically screened web page contents includes:
Reading the label information of the one or more labels in the webpage screening result;
Calling the corresponding function according to the type of the label information to process the webpage screening result, generating the automatically screened web page contents.
2. The method according to claim 1, characterized in that the step of calling the corresponding function according to the type of the label information to process the webpage screening result and generate the automatically screened web page contents includes at least any one or more of the following schemes:
Scheme one: when the label is the classification label, the function called is a classification function, so that the webpage screening result is classified, generating classified web page contents;
Scheme two: when the label is the emotion label, the function called is a function that corrects the emotion label, so that the emotion label of the webpage screening result is corrected, generating corrected web page contents;
Scheme three: when the label is the region label, the function called is a function that corrects the region label, so that the region label of the webpage screening result is corrected, generating corrected web page contents;
Scheme four: when the label is the blacklist label, the function called is a function that screens the webpage screening result corresponding to the blacklist label, so that screening is performed on the blacklist label of the webpage screening result, generating screened web page contents;
Scheme five: when the label is the to-be-deleted label, the function called is a function that deletes the webpage screening result corresponding to the to-be-deleted label, so that the webpage screening result carrying the to-be-deleted label is deleted, generating the web page contents after deletion.
3. The method according to claim 1, characterized in that the step of screening the web page contents according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result includes:
Screening the web page contents according to the preset keyword dictionary to obtain first preprocessed web page contents;
Screening the first preprocessed web page contents according to the preset screening parameters to obtain second preprocessed web page contents, wherein the second preprocessed web page contents are the webpage screening result.
4. The method according to claim 3, characterized in that the step of screening the first preprocessed web page contents according to the preset screening parameters to obtain the second preprocessed web page contents serving as the webpage screening result includes:
Reading the preset screening parameters and screening order;
Screening the first preprocessed web page contents successively in the screening order, using each screening parameter as a condition, to obtain the second preprocessed web page contents, wherein the screening parameters include at least any one or more of the following screening items: webpage text content, webpage text source, webpage text author, webpage text acquisition time, webpage emotion label, and webpage text publication region information.
5. The method according to claim 1, characterized in that, before the web page contents in the source database are read, the method further includes:
Reading a preset target webpage address;
Downloading the web page contents corresponding to the target webpage address according to the target webpage address;
Storing the downloaded target web page contents in the source database.
6. The method according to claim 5, characterized in that the web page contents include one or more of the following items: webpage text content, webpage text source, webpage text author, webpage text acquisition time, and webpage text publication region information.
7. The method according to claim 1, characterized in that, after the web page contents are screened according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result, the method further includes:
Reading a preset text emotion dictionary, wherein the text emotion dictionary includes a positive emotion word dictionary and a negative emotion word dictionary;
Judging the content of the webpage screening result according to the text emotion dictionary to obtain the emotion label corresponding to the content of the webpage screening result.
8. The method according to claim 7, characterized in that the step of judging the content of the webpage screening result according to the text emotion dictionary to obtain the emotion label corresponding to the content of the webpage screening result includes:
Judging the content of the webpage screening result according to the text emotion dictionary;
When the number of positive emotion words from the positive emotion word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the emotion label of the content of the webpage screening result is positive emotion;
When the number of negative emotion words from the negative emotion word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the emotion label of the content of the webpage screening result is negative emotion.
9. An automatic screening device for web page contents, characterized by comprising:
A first read module, configured to read web page contents in a source database;
A screening module, configured to screen the web page contents according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result;
An extraction module, configured to extract a preset label information dictionary;
A first processing module, configured to add to the webpage screening result, according to the web page contents in the webpage screening result, any one or more types of labels in the label information dictionary that match the web page contents;
A second processing module, configured to execute on the webpage screening result, according to the label information added to the webpage screening result, the function processing corresponding to the label information to obtain automatically screened web page contents;
Wherein the label information dictionary includes one or more of the following label types: classification label, emotion label, region label, blacklist label, and to-be-deleted label, and wherein the second processing module is configured to:
Read the label information of the one or more labels in the webpage screening result;
Call the corresponding function according to the type of the label information to process the webpage screening result, generating the automatically screened web page contents.
10. The device according to claim 9, characterized in that the device further includes:
A second read module, configured to read a preset target webpage address;
A download module, configured to download the web page contents corresponding to the target webpage address according to the target webpage address;
A memory module, configured to store the downloaded target web page contents in the source database.
11. The device according to claim 9, characterized in that the device further includes:
A third read module, configured to read a preset text emotion dictionary, wherein the text emotion dictionary includes a positive emotion word dictionary and a negative emotion word dictionary;
A judgment module, configured to judge the content of the webpage screening result according to the text emotion dictionary to obtain the emotion label corresponding to the content of the webpage screening result.
CN201410769099.1A 2014-12-12 2014-12-12 The auto-screening method and device of web page contents Active CN104504027B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410769099.1A CN104504027B (en) 2014-12-12 2014-12-12 The auto-screening method and device of web page contents

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201410769099.1A CN104504027B (en) 2014-12-12 2014-12-12 The auto-screening method and device of web page contents

Publications (2)

Publication Number Publication Date
CN104504027A CN104504027A (en) 2015-04-08
CN104504027B true CN104504027B (en) 2019-11-12

Family

ID=52945425

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410769099.1A Active CN104504027B (en) 2014-12-12 2014-12-12 The auto-screening method and device of web page contents

Country Status (1)

Country Link
CN (1) CN104504027B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105893416A (en) * 2015-12-01 2016-08-24 乐视网信息技术(北京)股份有限公司 Data service system
CN105849729A (en) * 2016-03-25 2016-08-10 马岩 Date elimination method and system
CN106294785A (en) * 2016-08-12 2017-01-04 北京创新乐知信息技术有限公司 Content Selection method and system
CN107545905B (en) * 2017-08-21 2021-01-05 北京合光人工智能机器人技术有限公司 Emotion recognition method based on sound characteristics
CN108095740B (en) * 2017-12-20 2021-06-22 姜涵予 User emotion assessment method and device
CN109918516B (en) * 2019-03-13 2021-07-30 百度在线网络技术(北京)有限公司 Data processing method and device and terminal
CN110717110B (en) * 2019-10-12 2022-04-22 北京达佳互联信息技术有限公司 Multimedia resource filtering method and device, electronic equipment and storage medium
CN114647466A (en) * 2020-12-17 2022-06-21 国信君和(北京)科技有限公司 Page content extraction method, device, equipment and computer readable storage medium

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101593200B (en) * 2009-06-19 2012-10-03 淮海工学院 Method for classifying Chinese webpages based on keyword frequency analysis
CN101694666B (en) * 2009-07-17 2011-03-30 刘二中 Method for inputting and processing characteristic words of file contents
CN102902790B (en) * 2012-09-29 2017-06-06 北京奇虎科技有限公司 Web page classification system and method

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
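Research on Search Result Clustering Based on Keyword Extraction; Qin Peng, Li Hengxun, Zhang Huaping, Liu Jingang; Proceedings of the 5th China Conference on Information Retrieval (CCIR 2009); 20100628; full text *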

Also Published As

Publication number Publication date
CN104504027A (en) 2015-04-08

Similar Documents

Publication Publication Date Title
CN104504027B (en) The auto-screening method and device of web page contents
CN106528769A (en) Data acquisition method and apparatus
CN110275935A (en) Processing method, device and storage medium, the electronic device of policy information
CN109726327A (en) A kind of information-pushing method and device
CN105302815B (en) The filter method and device of the uniform resource position mark URL of webpage
CN110135693A (en) A kind of Risk Identification Method, device, equipment and storage medium
CN106649334B (en) Processing method and device of associated word set
CN108021806A (en) A kind of recognition methods of malice installation kit and device
CN106815206A (en) The analysis method and device of law judgement document
CN112491643A (en) Deep packet inspection method, device, equipment and storage medium
CN107748898A (en) File classifying method, device, computing device and computer-readable storage medium
CN109948639A (en) A kind of picture rubbish recognition methods based on deep learning
CN108334895A (en) Sorting technique, device, storage medium and the electronic device of target data
CN106844482A (en) A kind of retrieval information matching method and device based on search engine
CN109117172A (en) A kind of method and device of the terminal versions number identification of target terminal
CN107340954A (en) A kind of information extracting method and device
CN108984514A (en) Acquisition methods and device, storage medium, the processor of word
CN106844412A (en) A kind of human face data collection method and device
CN106933916A (en) The processing method and processing device of JSON character strings
CN110472230A (en) The recognition methods of Chinese text and device
CN109783678B (en) Image searching method and device
CN114186102A (en) Tree structure data construction method and device and computer equipment
EP3576024A1 (en) Accessible machine learning
CN111064996B (en) Method, system and storage medium for identifying user watching video content preference
CN107544994A (en) The treating method and apparatus of associated data

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant