CN104504027B - The auto-screening method and device of web page contents - Google Patents
- Publication number: CN104504027B
- Application number: CN201410769099.1A
- Authority: CN (China)
- Prior art keywords
- webpage
- web page
- label
- selection result
- page contents
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
- G06F16/9535—Search customisation based on user profiles and personalisation
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method and a device for automatically screening web page contents. The method comprises: reading web page contents from a source database; screening the web page contents according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result; extracting a preset label information dictionary; adding labels of any one or more types from the label information dictionary to the webpage screening result; and, according to the label information added to the webpage screening result, performing the function processing corresponding to that label information on the webpage screening result to obtain the automatically screened web page contents. The invention solves the problem in the prior art that manually screening the large volume of web page contents updated every day makes the process lengthy and inefficient.
Description
Technical field
The present invention relates to the computer field, and in particular to a method and a device for automatically screening web page contents.
Background
At present, public opinion monitoring systems that monitor web page contents allow the user to re-screen the required text content and to operate on the re-screened text (for example, classification operations and labeling operations), which meets diverse user needs well. One problem remains, however: web page contents on the network are updated every day, and the volume of data updated daily is huge. A user who needs to monitor the latest situation continuously must therefore analyze the updated web page contents from the desired classification dimensions every time. This means manually screening all of the text content every day and then operating on the screened content again, a process that is lengthy and troublesome.
No effective solution has yet been proposed for the problem in the prior art that manually screening the large volume of web page contents updated every day makes the process lengthy and inefficient.
Summary of the invention
The main purpose of the present invention is to provide a method and a device for automatically screening web page contents, so as to solve the problem in the prior art that manually screening the large volume of web page contents updated every day makes the process lengthy and inefficient.
To achieve the above purpose, according to one aspect of the embodiments of the present invention, a method for automatically screening web page contents is provided. The method comprises: reading web page contents from a source database; screening the web page contents according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result; extracting a preset label information dictionary; adding labels of any one or more types from the label information dictionary to the webpage screening result; and, according to the label information added to the webpage screening result, performing the function processing corresponding to the label information on the webpage screening result to obtain the automatically screened web page contents.
To achieve the above purpose, according to another aspect of the embodiments of the present invention, a device for automatically screening web page contents is provided. The device comprises: a first reading module for reading web page contents from a source database; a screening module for screening the web page contents according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result; an extraction module for extracting a preset label information dictionary; a first processing module for adding, according to the web page contents in the webpage screening result, labels of any one or more types from the label information dictionary to the webpage screening result; and a second processing module for performing, according to the label information added to the webpage screening result, the function processing corresponding to the label information on the webpage screening result to obtain the automatically screened web page contents.
According to the embodiments of the invention, web page contents are read from a source database; the web page contents are screened according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result; a preset label information dictionary is extracted; labels of any one or more types from the label information dictionary are added to the webpage screening result; and, according to the label information added to the webpage screening result, the function processing corresponding to the label information is performed on the webpage screening result to obtain the automatically screened web page contents. This solves the problem in the prior art that manually screening the large volume of web page contents updated every day makes the process lengthy and inefficient, and achieves the effect of screening web pages automatically and processing them according to their contents.
Brief description of the drawings
The accompanying drawings, which form a part of this application, are provided for further understanding of the present invention. The illustrative embodiments of the invention and their descriptions are used to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 is a flowchart of the method for automatically screening web page contents according to Embodiment 1 of the present invention;
Fig. 2 is a flowchart of a preferred method for automatically screening web page contents according to Embodiment 1 of the present invention;
Fig. 3 is a structural schematic diagram of the device for automatically screening web page contents according to Embodiment 2 of the present invention; and
Fig. 4 is a structural schematic diagram of a preferred device for automatically screening web page contents according to Embodiment 2 of the present invention.
Detailed description of the embodiments
It should be noted that, where no conflict arises, the embodiments of this application and the features in the embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative work shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and so on in the description, the claims and the above drawings are used to distinguish similar objects and not to describe a particular order or sequence. It should be understood that data so used are interchangeable where appropriate, so that the embodiments described herein can be implemented. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion: for example, a process, method, system, product or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
Embodiment 1
This embodiment of the present invention provides a method for automatically screening web page contents.
Fig. 1 is a flowchart of the method for automatically screening web page contents according to an embodiment of the present invention. As shown in Fig. 1, the method comprises the following steps:
Step S11: read the web page contents from the source database.
Specifically, in step S11 the web page contents stored in the source database are read. The source database stores web page contents that are updated regularly.
Step S13: screen the web page contents according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result.
Specifically, in step S13 the web page contents read from the source database are screened. The screening may first filter the web page contents against the preset keyword dictionary and then filter that intermediate result against the preset screening parameters, yielding the webpage screening result.
Step S15: extract the preset label information dictionary.
Specifically, in step S15 the label information dictionary preset for the web page contents is extracted, giving the label information dictionary used to mark web pages.
Step S17: according to the web page contents in the webpage screening result, add to the webpage screening result the labels of any one or more types in the label information dictionary that match those contents.
Specifically, in step S17 the contents of the webpage screening result are matched against the labels in the label information dictionary, yielding one or more labels, of one or more types, that match those contents. The label information dictionary contains several different types of label information.
Step S19: according to the label information added to the webpage screening result, perform the function processing corresponding to the label information on the webpage screening result to obtain the automatically screened web page contents.
Specifically, in step S19, for the label information of the one or more types attached to the webpage screening result, the processing function corresponding to each label type is called to process the web page contents in that result, thereby screening the webpage screening result automatically.
Through steps S11 to S19, after the web page contents in the source database are read, they are first screened according to the keyword dictionary and the preset screening parameters: web page contents containing one or more keywords from the keyword dictionary are obtained, and these are further screened according to the screening parameters to obtain the web page contents that satisfy one or more of the screening conditions, which form the webpage screening result. The webpage screening result is then marked according to the label information dictionary: when a piece of web page content in the result matches one or more label types in the label information dictionary, the corresponding label is added to that content. Finally, the processing function matching each label type is called to process the web page contents in the webpage screening result.
In summary, the present invention solves the problem in the prior art that manually screening the large volume of web page contents updated every day makes the process lengthy and inefficient, and achieves the effect of screening web pages automatically and processing them according to their contents.
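As an illustration only, steps S13 to S19 might be sketched in Python as follows. The patent does not specify data structures, so the `Page` record, the shape of the label information dictionary (label type, then label, then trigger words), the handler signature, and the use of a minimum text length as the single screening parameter are all assumptions made for this sketch.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    """A web page record; the field names are assumptions for illustration."""
    text: str
    labels: list = field(default_factory=list)  # (label_type, label) pairs


def screen(pages, keywords, min_length):
    """S13: keep pages containing a keyword, then apply one screening parameter."""
    keyword_hits = [p for p in pages if any(k in p.text for k in keywords)]
    # min_length is a hypothetical stand-in for the preset screening parameters.
    return [p for p in keyword_hits if len(p.text) >= min_length]


def add_labels(pages, label_dictionary):
    """S15/S17: attach each label whose trigger words occur in the page text."""
    for page in pages:
        for label_type, labels in label_dictionary.items():
            for label, triggers in labels.items():
                if any(t in page.text for t in triggers):
                    page.labels.append((label_type, label))
    return pages


def process(pages, handlers):
    """S19: call the handler registered for each label type found on a page."""
    for page in pages:
        for label_type, label in page.labels:
            handler = handlers.get(label_type)
            if handler is not None:
                handler(page, label)
    return pages


def run_pipeline(pages, keywords, min_length, label_dictionary, handlers):
    """Chain S13, S17 and S19 over pages already read from the source database (S11)."""
    screened = screen(pages, keywords, min_length)
    labeled = add_labels(screened, label_dictionary)
    return process(labeled, handlers)
```

Keeping each step as a separate function mirrors the step numbering of the method, so a different keyword matcher or screening parameter can be swapped in without touching the other steps.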
Preferably, in the above embodiment of this application, the label information dictionary may include one or more of the following label types: classification label, mood label, region label, blacklist label and to-be-deleted label. Step S19, in which the function processing corresponding to the label information is performed on the webpage screening result according to the label information added to it to obtain the automatically screened web page contents, comprises:
Step S191: read the label information of the one or more labels in the webpage screening result.
Step S193: call the corresponding processing function according to the type of the label information to process the webpage screening result, generating the automatically screened web page contents.
Specifically, in steps S191 to S193 the label information of the labels in the webpage screening result is read and the processing function corresponding to each label type is called. The different processing functions perform the corresponding processing on the web page contents in the webpage screening result.
Preferably, in the above embodiment of this application, step S193, in which the corresponding processing function is called according to the type of the label information to process the webpage screening result and generate the automatically screened web page contents, includes at least one or more of the following schemes:
Scheme one: when the label is a classification label, the function called is a classification function, which classifies the webpage screening result and generates the classified web page contents.
Scheme two: when the label is a mood label, the function called is a function for correcting the mood label, which corrects the mood label of the webpage screening result and generates the corrected web page contents.
Scheme three: when the label is a region label, the function called is a function for correcting the region label, which corrects the region label of the webpage screening result and generates the corrected web page contents.
Scheme four: when the label is a blacklist label, the function called is a function for screening the webpage screening results that correspond to the blacklist label, which screens the webpage screening result by its blacklist label and generates the screened web page contents.
Scheme five: when the label is a to-be-deleted label, the function called is a function for deleting the webpage screening results that correspond to the to-be-deleted label, which deletes the webpage screening results carrying that label and generates the web page contents after deletion.
Specifically, in practical applications other processing schemes may also be applied to the webpage screening result as the situation requires; processing is not limited to the five schemes above.
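One possible way to organize the five schemes is a dispatch table from label type to processing function, sketched below. The dictionary-based page records, the label type names, and the behaviour of each handler (in particular, treating the blacklist scheme as setting a flag) are assumptions for illustration; the patent only states which kind of function is called for which label type.

```python
def classify(page, label):
    """Scheme one: record the classification on the page."""
    page["category"] = label
    return page


def correct_mood(page, label):
    """Scheme two: overwrite the page's mood label."""
    page["mood"] = label
    return page


def correct_region(page, label):
    """Scheme three: overwrite the page's region label."""
    page["region"] = label
    return page


def blacklist(page, label):
    """Scheme four: mark the page as blacklisted (one assumed interpretation)."""
    page["blacklisted"] = True
    return page


def delete(page, label):
    """Scheme five: returning None drops the page from the output."""
    return None


# Hypothetical label type names mapped to their processing functions.
HANDLERS = {
    "classification": classify,
    "mood": correct_mood,
    "region": correct_region,
    "blacklist": blacklist,
    "to_delete": delete,
}


def apply_labels(pages):
    """S191/S193: read each page's labels and call the matching handler."""
    processed = []
    for page in pages:
        current = page
        for label_type, label in page.get("labels", []):
            handler = HANDLERS.get(label_type)
            if handler is None:
                continue  # unknown label types leave the page unchanged
            current = handler(current, label)
            if current is None:
                break  # the page was deleted; skip its remaining labels
        if current is not None:
            processed.append(current)
    return processed
```

Because the text notes that processing is not limited to these five schemes, a table like this lets a further scheme be added by registering one more entry in `HANDLERS`.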
Preferably, in the above embodiment of this application, step S13, in which the web page contents are screened according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result, comprises:
Step S131: screen the web page contents according to the preset keyword dictionary to obtain first preprocessed web page contents.
Step S133: screen the first preprocessed web page contents according to the preset screening parameters to obtain second preprocessed web page contents, which are the webpage screening result.
Specifically, in steps S131 to S133 the web pages are first screened according to the preset keyword dictionary, yielding the first preprocessed web page contents that match the keywords contained in the keyword dictionary. The first preprocessed web page contents are then screened according to the preset screening parameters, yielding the second preprocessed web page contents that match the conditions of the screening parameters. The second preprocessed web page contents are the webpage screening result.
Preferably, in the above embodiment of this application, step S133, in which the first preprocessed web page contents are screened according to the preset screening parameters to obtain the second preprocessed web page contents serving as the webpage screening result, comprises:
Step S1331: read the preset screening parameters and the screening order.
Step S1333: according to the screening parameters and the screening order, screen the first preprocessed web page contents in that order, using each screening parameter in turn as the condition, to obtain the second preprocessed web page contents. The screening parameters include at least any one or more of the following screening items: web page text content, web page text source, web page text author, web page text collection time, web page mood label, and web page text publication region information.
Specifically, in steps S1331 to S1333 the first preprocessed web page contents are screened against the screening parameters one after another in the screening order. This layer-by-layer screening progressively narrows the range, removes the dirty data from the first preprocessed web page contents, and yields the webpage screening result.
Preferably, as shown in Fig. 2, in the above embodiment of this application, before the web page contents in the source database are read in step S11, the method further comprises:
Step S101: read the preset target web page addresses.
Step S103: download the web page contents corresponding to each target web page address.
Step S105: store the downloaded target web page contents in the source database.
Specifically, in steps S101 to S105 the web page contents at the preset target web page addresses are crawled. The target web page addresses are accessed to obtain the corresponding web page contents, and the obtained web page contents are stored in the source database as source data.
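Steps S101 to S105 could be sketched as follows, using Python's standard `urllib.request` for downloading and an SQLite database as a stand-in for the source database. The table layout, the function names and the UTF-8 decoding are assumptions; the patent does not name a storage engine or a schema.

```python
import sqlite3
import urllib.request
from datetime import datetime, timezone


def fetch(url, timeout=10):
    """S103: download one page; decoding as UTF-8 is a simplification."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.read().decode("utf-8", errors="replace")


def crawl_to_source_db(conn, target_urls, fetch_page=fetch):
    """S101 to S105: download each preset address and store it as source data."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT, text TEXT, collected_at TEXT)"
    )
    for url in target_urls:
        body = fetch_page(url)
        collected_at = datetime.now(timezone.utc).isoformat()
        conn.execute(
            "INSERT INTO pages (url, text, collected_at) VALUES (?, ?, ?)",
            (url, body, collected_at),
        )
    conn.commit()
```

Passing `fetch_page` as a parameter keeps the storage logic testable without network access; for example, `crawl_to_source_db(sqlite3.connect(":memory:"), urls, fetch_page=lambda u: "stub")` exercises the database path only.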
Preferably, in the above embodiment of this application, the web page contents may include one or more of the following: web page text content, web page text source, web page text author, web page text collection time and web page text publication region information.
Specifically, in the step of downloading the web page contents corresponding to a target web page address, while the web page text content is downloaded, the web page text source, web page text author, web page text collection time, web page text publication region information and so on that correspond to that text content also need to be saved. This information is stored in the source database together with the text content it belongs to.
Preferably, in the above embodiment of this application, after the web page contents are screened according to the preset keyword dictionary and the preset screening parameters in step S13 to obtain the webpage screening result, the method further comprises:
Step S141: read the preset text mood dictionary, which includes a positive mood word dictionary and a negative mood word dictionary.
Step S143: judge the contents of the webpage screening result according to the text mood dictionary to obtain the mood label corresponding to those contents.
Specifically, in steps S141 and S143 mood analysis is performed on the web page contents in the webpage screening result according to the preset text mood dictionary. The web page contents are matched against the mood entries in the positive mood word dictionary and in the negative mood word dictionary of the text mood dictionary, yielding the mood label that corresponds to the contents of the webpage screening result.
Preferably, in the above embodiment of this application, step S143, in which the contents of the webpage screening result are judged according to the text mood dictionary to obtain the corresponding mood label, comprises:
Step S1431: judge the contents of the webpage screening result according to the text mood dictionary;
when the number of positive mood words from the positive mood word dictionary contained in the contents of the webpage screening result exceeds a preset threshold, the mood label of those contents is determined to be positive mood;
when the number of negative mood words from the negative mood word dictionary contained in the contents of the webpage screening result exceeds a preset threshold, the mood label of those contents is determined to be negative mood.
Specifically, step S1431 determines the mood label for the contents of the webpage screening result. If the number of positive mood words in the contents exceeds the preset threshold, the contents are determined to be positive mood; if the number of negative mood words in the contents exceeds the preset threshold, the contents are determined to be negative mood.
In practical applications a web page may contain both positive mood words and negative mood words. In that case, the number of positive mood words in the page can be compared with the number of negative mood words. When the number of positive mood words is greater than the number of negative mood words, the mood of the page is determined to be positive; when the number of positive mood words is less than the number of negative mood words, the mood of the page is determined to be negative.
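A minimal sketch of the mood judgment described above is given below. The text does not say how the threshold rule and the comparison rule interact, nor what happens on a tie, so two assumptions are made here: the counts are compared only when both exceed the threshold, and a tie (or neither count exceeding the threshold) yields no label. Counting by simple substring occurrence is also an assumption.

```python
def mood_label(text, positive_words, negative_words, threshold):
    """Return "positive", "negative" or None for one page's text."""
    # Count occurrences of dictionary words by plain substring matching.
    positive = sum(text.count(word) for word in positive_words)
    negative = sum(text.count(word) for word in negative_words)

    positive_over = positive > threshold
    negative_over = negative > threshold

    if positive_over and negative_over:
        # Mixed page: compare the two counts (assumed tie-break: no label).
        if positive > negative:
            return "positive"
        if negative > positive:
            return "negative"
        return None
    if positive_over:
        return "positive"
    if negative_over:
        return "negative"
    return None
```

For instance, with `positive_words=["good"]`, `negative_words=["bad"]` and `threshold=1`, the text `"good good bad"` has two positive words and one negative word, so only the positive count exceeds the threshold and the function returns `"positive"`.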
In conclusion, in order to make up in the prior art, being carried out daily to all content of text in practical application
This disadvantage is operated again after filter and filtering, it can be by the way that the logic that multiple rules judge, and use be arranged for web page contents
Family can according to oneself the pre-set screening conditions of hobby (such as: include what keyword, to be originated from which channel, when
Between range what is, emotion degree is how many, author personage etc.).Then, whenever there is new webpage text content to grab from internet
After getting, the judgment rule and screening rule set before whether meeting is judged automatically.If satisfied, then automatically to web page text
Operated (such as classifying, label, change mood, change Regional Property, add asterisk, pull in blacklist, deletion etc. operation).
It in order to realize the above functions, can be by front end automatic screening condition setting module and robot crawler module (i.e.
Webpage capture module) realization of the two modules.
Front end automatic screening condition setting module:
User can set automatic screening and operation in product front end: firstly, by judging whether comprising keyword;Then,
Screening conditions are set by screening parameter, such as: judging text source, text mood, the relevant information of text author, text crawl
The conditions such as time.Finally, web page contents are operated automatically when meeting the two above-mentioned conditions (such as: classification, mark
Label, change mood, change Regional Property, add asterisk, pull in blacklist, deletion etc. operation).
Webpage crawling module:
The webpage crawling module may be a piece of JavaScript code for which the range to be crawled (the websites) can be defined in advance. The crawler grabs all of the online text and image content, stores it in the database, and judges whether it satisfies the screening conditions set by the user; if so, the corresponding processing is performed on it.
Using screening rules that the user sets manually in advance, the present invention automatically crawls content from the Internet, screens it, and automatically performs the subsequent operations on the web page contents obtained by screening. Throughout the process, apart from the initial setup, no user intervention is needed. This approach gives the user much greater initiative: each batch of newly crawled web page contents is screened according to the preset rules and the corresponding operations are performed on the screened contents, which reduces the repeated manual work that website or web page updates would otherwise require of the user.
Embodiment 2
This embodiment of the present invention also provides a device for automatically screening web page contents. As shown in Fig. 3, the device may include a first reading module 31, a screening module 33, an extraction module 35, a first processing module 37 and a second processing module 39.
The first reading module 31 is used to read the web page contents from the source database.
Specifically, the first reading module 31 reads the web page contents stored in the source database. The source database stores web page contents that are updated regularly.
The screening module 33 is used to screen the web page contents according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result.
Specifically, the screening module 33 screens the web page contents read from the source database. The screening may first filter the web page contents against the preset keyword dictionary and then filter that intermediate result against the preset screening parameters, yielding the webpage screening result.
The extraction module 35 is used to extract the preset label information dictionary.
Specifically, the extraction module 35 extracts the label information dictionary preset for the web page contents, giving the label information dictionary used to mark web pages.
The first processing module 37 is used to add, according to the web page contents in the webpage screening result, labels of any one or more types from the label information dictionary to the webpage screening result.
Specifically, the first processing module 37 matches the contents of the webpage screening result against the labels in the label information dictionary, yielding one or more labels, of one or more types, that match those contents. The label information dictionary contains several different types of label information.
The second processing module 39 is used to perform, according to the label information added to the webpage screening result, the function processing corresponding to the label information on the webpage screening result to obtain the automatically screened web page contents.
Specifically, for the label information of the one or more types attached to the webpage screening result, the second processing module 39 calls the processing function corresponding to each label type to process the web page contents in that result, thereby screening the webpage screening result automatically.
Through the first reading module 31, the screening module 33, the extraction module 35, the first processing module 37 and the second processing module 39, after the web page contents in the source database are read, they are first screened according to the keyword dictionary and the preset screening parameters: web page contents containing one or more keywords from the keyword dictionary are obtained, and these are further screened according to the screening parameters to obtain the web page contents that satisfy one or more of the screening conditions, which form the webpage screening result. The webpage screening result is then marked according to the label information dictionary: when a piece of web page content in the result matches one or more label types in the label information dictionary, the corresponding label is added to that content. Finally, the processing function matching each label type is called to process the web page contents in the webpage screening result.
In summary, the present invention solves the problem in the prior art that manually screening the large volume of web page contents updated every day makes the process lengthy and inefficient, and achieves the effect of screening web pages automatically and processing them according to their contents.
Further, the label information dictionary may include one or more of the following label types: classification label, mood label, region label, blacklist label and to-be-deleted label. In the second processing module 39, the step of performing the function processing corresponding to the label information on the webpage screening result, according to the label information added to it, to obtain the automatically screened web page contents comprises:
Reading the label information of the one or more labels in the webpage screening result.
Calling the corresponding processing function according to the type of the label information to process the webpage screening result, generating the automatically screened web page contents.
Specifically, through the above steps the label information of the labels in the webpage screening result is read and the processing function corresponding to each label type is called. The different processing functions perform the corresponding processing on the web page contents in the webpage screening result.
Further, corresponding power function is called to handle webpage the selection result according to the type of label information, it is raw
In after automatic screening the step of web page contents, following any one or more schemes are included at least:
Scheme one: where the label is a classification label, the function called is a classification function, which classifies the webpage screening result and generates classified web page content.
Scheme two: where the label is a mood label, the function called is a function for correcting the mood label, which corrects the mood label of the webpage screening result and generates corrected web page content.
Scheme three: where the label is a region label, the function called is a function for correcting the region label, which corrects the region label of the webpage screening result and generates corrected web page content.
Scheme four: where the label is a blacklist label, the function called is a function for screening the webpage screening result corresponding to the blacklist label, which screens the webpage screening result according to its blacklist label and generates screened web page content.
Scheme five: where the label is a to-be-deleted label, the function called is a function for deleting the webpage screening result corresponding to the to-be-deleted label, which deletes the webpage screening result carrying the to-be-deleted label and generates the web page content remaining after deletion.
Specifically, in practical applications, other processing schemes may also be applied to the webpage screening result as the actual situation requires; the processing is not limited to the five schemes above.
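By way of a non-limiting illustration, the label-driven function call of schemes one to five might be sketched as a lookup from label type to handler. The record layout (a dict with a "labels" mapping), the label-type keys, the handler names, and the choice to drop blacklisted and to-be-deleted pages from the output are all assumptions made for this sketch; the specification does not fix any of them.

```python
from typing import Any, Callable, Dict, List, Optional

Page = Dict[str, Any]                      # hypothetical page record
Handler = Callable[[Page], Optional[Page]]  # returns None when the page is removed


def classify(page: Page) -> Optional[Page]:
    # Scheme one: file the page under its classification label.
    return {**page, "category": page["labels"]["classification"]}


def correct_mood(page: Page) -> Optional[Page]:
    # Scheme two: write the corrected mood label onto the page.
    return {**page, "mood": page["labels"]["mood"]}


def correct_region(page: Page) -> Optional[Page]:
    # Scheme three: write the corrected region label onto the page.
    return {**page, "region": page["labels"]["region"]}


def screen_blacklist(page: Page) -> Optional[Page]:
    # Scheme four (assumed behaviour): blacklisted pages are screened out.
    return None


def delete_page(page: Page) -> Optional[Page]:
    # Scheme five: pages carrying the to-be-deleted label are removed.
    return None


# Hypothetical label-type keys mapped to their functions.
HANDLERS: Dict[str, Handler] = {
    "classification": classify,
    "mood": correct_mood,
    "region": correct_region,
    "blacklist": screen_blacklist,
    "to_delete": delete_page,
}


def process(screening_result: List[Page]) -> List[Page]:
    """Read each page's labels and call the function matching each label type."""
    output: List[Page] = []
    for page in screening_result:
        current: Optional[Page] = page
        for label_type in page.get("labels", {}):
            handler = HANDLERS.get(label_type)
            if handler is None:
                continue  # unknown label types are left untouched
            current = handler(current)
            if current is None:
                break  # the page was screened out or deleted
        if current is not None:
            output.append(current)
    return output
```

Further schemes could be supported under the same assumptions by registering another entry in the lookup table, which is one way to keep the processing open beyond the five schemes listed.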
Further, in the screening module 33, the step of screening the web page content according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result includes:
Screening the web page content according to the preset keyword dictionary to obtain first preprocessed web page content.
Screening the first preprocessed web page content according to the preset screening parameters to obtain second preprocessed web page content, wherein the second preprocessed web page content is the webpage screening result.
Specifically, through the above steps, the web pages are first screened according to the preset keyword dictionary to obtain the first preprocessed web page content, which matches the keywords contained in the keyword dictionary. The first preprocessed web page content is then screened according to the preset screening parameters, yielding the second preprocessed web page content, which satisfies the conditions of the screening parameters. The second preprocessed web page content is the webpage screening result.
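As a minimal sketch of the two-stage screening just described, the following assumes that each page is a dict with a "text" field, that a keyword "matches" when it occurs as a substring of the text, and that the screening parameters can be expressed as a single predicate. These are illustrative assumptions only; the specification does not prescribe a matching rule.

```python
from typing import Callable, Dict, Iterable, List

Page = Dict[str, str]  # hypothetical page record


def keyword_screen(pages: Iterable[Page], keywords: Iterable[str]) -> List[Page]:
    """First stage: keep pages whose text contains at least one dictionary keyword."""
    keyword_list = list(keywords)
    return [p for p in pages if any(k in p["text"] for k in keyword_list)]


def screen(pages: Iterable[Page],
           keywords: Iterable[str],
           condition: Callable[[Page], bool]) -> List[Page]:
    """Run the keyword stage, then the screening-parameter stage."""
    first_preprocessed = keyword_screen(pages, keywords)
    # Second stage: the pages that also satisfy the screening condition
    # form the webpage screening result.
    return [p for p in first_preprocessed if condition(p)]
```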
Further, in the above steps, the step of screening the first preprocessed web page content according to the preset screening parameters to obtain the second preprocessed web page content serving as the webpage screening result includes:
Reading the preset screening parameters and screening order.
Screening the first preprocessed web page content in the screening order, using each screening parameter in turn as the condition, to obtain the second preprocessed web page content, wherein the screening parameters include at least any one or more of the following screening items: web page text content, web page text source, web page text author, web page text acquisition time, web page mood label, and web page text publication region information.
Specifically, through the above steps, the first preprocessed web page content is screened against the screening parameters one after another in the screening order. This layer-by-layer screening narrows the range step by step and removes dirty data from the first preprocessed web page content, yielding the webpage screening result.
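The layer-by-layer screening above could be sketched as follows. The sketch assumes that each screening parameter is given as a set of allowed values for one page field and that the screening order is a list of field names; both representations, and the field names used, are hypothetical.

```python
from typing import Any, Dict, Iterable, List, Set

Page = Dict[str, Any]  # hypothetical page record


def ordered_screen(pages: Iterable[Page],
                   screening_parameters: Dict[str, Set[Any]],
                   screening_order: List[str]) -> List[Page]:
    """Apply each screening parameter in the given order, narrowing the set each time."""
    remaining = list(pages)
    for field in screening_order:
        allowed = screening_parameters[field]
        # Pages whose field value is missing or not allowed are treated as dirty data.
        remaining = [p for p in remaining if p.get(field) in allowed]
    return remaining
```

Because every layer only removes pages, the final result under these assumptions does not depend on the order; the order mainly affects how quickly the range narrows.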
Preferably, as shown in Fig. 4, in the above embodiments of the present application, the device further includes: a second reading module 301, a download module 303, and a storage module 305.
The second reading module 301 is configured to read preset target webpage addresses.
The download module 303 is configured to download, according to a target webpage address, the web page content corresponding to that address.
The storage module 305 is configured to store the downloaded target web page content in the source database.
Specifically, the second reading module 301, the download module 303, and the storage module 305 crawl the web page content at the preset target webpage addresses. The web page content corresponding to each target webpage address is obtained by accessing that address, and the obtained content is stored in the source database as source data.
Further, the web page content may include one or more of the following items: web page text content, web page text source, web page text author, web page text acquisition time, and web page text publication region information.
Specifically, in the step of downloading the web page content corresponding to the target webpage address, while the web page text content is downloaded, the web page text source, web page text author, web page text acquisition time, web page text publication region information, and so on that correspond to that text content also need to be saved, and this information is stored in the source database together with the corresponding text content.
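For illustration only, the read, download, and store steps might look like the sketch below. It assumes an SQLite database stands in for the source database, that the table and column names are free choices, and that downloading is delegated to a caller-supplied `fetch` function returning the text together with its metadata; none of these details come from the specification.

```python
import sqlite3
from typing import Callable, Dict, Iterable

# Hypothetical schema for the source database.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS source_pages (
    url TEXT PRIMARY KEY,
    text TEXT,
    source TEXT,
    author TEXT,
    acquired_at TEXT,
    region TEXT
)
"""


def store_pages(conn: sqlite3.Connection,
                target_urls: Iterable[str],
                fetch: Callable[[str], Dict[str, str]]) -> int:
    """Download each target address and store text plus metadata as source data."""
    conn.execute(CREATE_TABLE)
    stored = 0
    for url in target_urls:
        page = fetch(url)  # assumed to return text, source, author, acquired_at, region
        conn.execute(
            "INSERT OR REPLACE INTO source_pages VALUES (?, ?, ?, ?, ?, ?)",
            (url, page["text"], page["source"], page["author"],
             page["acquired_at"], page["region"]),
        )
        stored += 1
    conn.commit()
    return stored
```

Keeping the metadata in the same row as the text is one simple way to preserve the correspondence between the text content and its source, author, acquisition time, and region.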
Further, in the above embodiments of the present application, the device further includes: a third reading module 341 and a judgment module 343.
The third reading module 341 is configured to read a preset text mood dictionary, wherein the text mood dictionary includes a positive mood word dictionary and a negative mood word dictionary.
The judgment module 343 is configured to judge the content of the webpage screening result according to the text mood dictionary and obtain the mood label corresponding to the content of the webpage screening result.
Specifically, the third reading module 341 and the judgment module 343 perform mood analysis on the web page content in the webpage screening result according to the preset text mood dictionary. The web page content is matched against one or more mood entries in the positive mood word dictionary and the negative mood word dictionary of the text mood dictionary, yielding the mood label corresponding to the content of the webpage screening result.
Further, the above step of judging the content of the webpage screening result according to the text mood dictionary to obtain the mood label corresponding to the content of the webpage screening result includes:
Judging the content of the webpage screening result according to the text mood dictionary;
When the number of positive mood words from the positive mood word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the mood label of the content of the webpage screening result is positive mood;
When the number of negative mood words from the negative mood word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the mood label of the content of the webpage screening result is negative mood.
Specifically, the mood label corresponding to the content of the webpage screening result is judged through the above steps. When the number of positive mood words contained in the content of the webpage screening result exceeds the preset threshold, the content of that webpage screening result is determined to be positive mood; when the number of negative mood words contained in the content of the webpage screening result exceeds the preset threshold, the content of that webpage screening result is determined to be negative mood.
In practical applications, a single web page may contain both positive mood words and negative mood words. In this case, the judgment may be based on the difference between the number of positive mood words and the number of negative mood words in the web page. When the number of positive mood words is greater than the number of negative mood words, the mood of the web page is determined to be positive mood; when the number of positive mood words is less than the number of negative mood words, the mood of the web page is determined to be negative mood.
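A minimal sketch of the mood judgment, under stated assumptions, is given below. It assumes that mood words are counted as substring occurrences, that the count comparison is used whenever both kinds of word appear, that the threshold rule is used otherwise, and that no label (None) is returned on a tie or when neither count exceeds the threshold. The specification leaves these points open, so they are illustrative choices only.

```python
from typing import Iterable, Optional


def count_hits(text: str, words: Iterable[str]) -> int:
    """Count occurrences of every dictionary word in the text (substring match)."""
    return sum(text.count(word) for word in words)


def judge_mood(text: str,
               positive_words: Iterable[str],
               negative_words: Iterable[str],
               threshold: int) -> Optional[str]:
    positive = count_hits(text, positive_words)
    negative = count_hits(text, negative_words)
    if positive > 0 and negative > 0:
        # Mixed page: compare the two counts directly.
        if positive > negative:
            return "positive"
        if negative > positive:
            return "negative"
        return None  # tie: behaviour not specified, assumed unlabeled
    if positive > threshold:
        return "positive"
    if negative > threshold:
        return "negative"
    return None  # below threshold: assumed unlabeled
```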
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in another order or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the present invention.
In the above embodiments, the description of each embodiment has its own emphasis; for a part not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed device may be implemented in other ways. For example, the device embodiments described above are merely illustrative. The division into units, for instance, is only a division by logical function; in an actual implementation there may be other ways of dividing them, for example multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, read-only memory (ROM), random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disc.
The foregoing is only a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within the protection scope of the present invention.
Claims (11)
1. A method for automatically screening web page content, characterized by comprising:
reading web page content from a source database;
screening the web page content according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result;
extracting a preset label information dictionary;
adding to the webpage screening result, according to the web page content in the webpage screening result, labels of any one or more types in the label information dictionary that match the web page content;
executing on the webpage screening result, according to the label information added to the webpage screening result, the function processing corresponding to the label information, to obtain automatically screened web page content;
wherein the label information dictionary includes one or more of the following label types: classification label, mood label, region label, blacklist label, and to-be-deleted label, and wherein the step of executing on the webpage screening result, according to the label information added to the webpage screening result, the function processing corresponding to the label information to obtain the automatically screened web page content comprises:
reading the label information of one or more labels in the webpage screening result;
calling the corresponding function according to the type of the label information to process the webpage screening result, generating the automatically screened web page content.
2. The method according to claim 1, characterized in that the step of calling the corresponding function according to the type of the label information to process the webpage screening result and generate the automatically screened web page content includes at least any one or more of the following schemes:
Scheme one: where the label is the classification label, the function called is a classification function, which classifies the webpage screening result and generates classified web page content;
Scheme two: where the label is the mood label, the function called is a function for correcting the mood label, which corrects the mood label of the webpage screening result and generates corrected web page content;
Scheme three: where the label is the region label, the function called is a function for correcting the region label, which corrects the region label of the webpage screening result and generates corrected web page content;
Scheme four: where the label is the blacklist label, the function called is a function for screening the webpage screening result corresponding to the blacklist label, which screens the webpage screening result according to its blacklist label and generates screened web page content;
Scheme five: where the label is the to-be-deleted label, the function called is a function for deleting the webpage screening result corresponding to the to-be-deleted label, which deletes the webpage screening result carrying the to-be-deleted label and generates the web page content remaining after deletion.
3. The method according to claim 1, characterized in that the step of screening the web page content according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result comprises:
screening the web page content according to the preset keyword dictionary to obtain first preprocessed web page content;
screening the first preprocessed web page content according to the preset screening parameters to obtain second preprocessed web page content, wherein the second preprocessed web page content is the webpage screening result.
4. The method according to claim 3, characterized in that the step of screening the first preprocessed web page content according to the preset screening parameters to obtain the second preprocessed web page content serving as the webpage screening result comprises:
reading the preset screening parameters and screening order;
screening the first preprocessed web page content in the screening order, using each screening parameter in turn as the condition, to obtain the second preprocessed web page content, wherein the screening parameters include at least any one or more of the following screening items: the web page text content, the web page text source, the web page text author, the web page text acquisition time, the web page mood label, and the web page text publication region information.
5. The method according to claim 1, characterized in that, before the web page content in the source database is read, the method further comprises:
reading a preset target webpage address;
downloading, according to the target webpage address, the web page content corresponding to the target webpage address;
storing the downloaded target web page content in the source database.
6. The method according to claim 5, characterized in that the web page content includes one or more of the following items: web page text content, web page text source, web page text author, web page text acquisition time, and web page text publication region information.
7. The method according to claim 1, characterized in that, after the web page content is screened according to the preset keyword dictionary and the preset screening parameters to obtain the webpage screening result, the method further comprises:
reading a preset text mood dictionary, wherein the text mood dictionary includes a positive mood word dictionary and a negative mood word dictionary;
judging the content of the webpage screening result according to the text mood dictionary to obtain the mood label corresponding to the content of the webpage screening result.
8. The method according to claim 7, characterized in that the step of judging the content of the webpage screening result according to the text mood dictionary to obtain the mood label corresponding to the content of the webpage screening result comprises:
judging the content of the webpage screening result according to the text mood dictionary;
when the number of positive mood words from the positive mood word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the mood label of the content of the webpage screening result is positive mood;
when the number of negative mood words from the negative mood word dictionary contained in the content of the webpage screening result exceeds a preset threshold, determining that the mood label of the content of the webpage screening result is negative mood.
9. A device for automatically screening web page content, characterized by comprising:
a first reading module, configured to read web page content from a source database;
a screening module, configured to screen the web page content according to a preset keyword dictionary and preset screening parameters to obtain a webpage screening result;
an extraction module, configured to extract a preset label information dictionary;
a first processing module, configured to add to the webpage screening result, according to the web page content in the webpage screening result, labels of any one or more types in the label information dictionary that match the web page content;
a second processing module, configured to execute on the webpage screening result, according to the label information added to the webpage screening result, the function processing corresponding to the label information, to obtain automatically screened web page content;
wherein the label information dictionary includes one or more of the following label types: classification label, mood label, region label, blacklist label, and to-be-deleted label, and wherein the second processing module is configured to:
read the label information of one or more labels in the webpage screening result;
call the corresponding function according to the type of the label information to process the webpage screening result, generating the automatically screened web page content.
10. The device according to claim 9, characterized in that the device further comprises:
a second reading module, configured to read a preset target webpage address;
a download module, configured to download, according to the target webpage address, the web page content corresponding to the target webpage address;
a storage module, configured to store the downloaded target web page content in the source database.
11. The device according to claim 9, characterized in that the device further comprises:
a third reading module, configured to read a preset text mood dictionary, wherein the text mood dictionary includes a positive mood word dictionary and a negative mood word dictionary;
a judgment module, configured to judge the content of the webpage screening result according to the text mood dictionary to obtain the mood label corresponding to the content of the webpage screening result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410769099.1A CN104504027B (en) | 2014-12-12 | 2014-12-12 | The auto-screening method and device of web page contents |
Publications (2)
Publication Number | Publication Date |
---|---|
CN104504027A CN104504027A (en) | 2015-04-08 |
CN104504027B true CN104504027B (en) | 2019-11-12 |
Family
ID=52945425
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410769099.1A Active CN104504027B (en) | 2014-12-12 | 2014-12-12 | The auto-screening method and device of web page contents |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN104504027B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105893416A (en) * | 2015-12-01 | 2016-08-24 | 乐视网信息技术(北京)股份有限公司 | Data service system |
CN105849729A (en) * | 2016-03-25 | 2016-08-10 | 马岩 | Date elimination method and system |
CN106294785A (en) * | 2016-08-12 | 2017-01-04 | 北京创新乐知信息技术有限公司 | Content Selection method and system |
CN107545905B (en) * | 2017-08-21 | 2021-01-05 | 北京合光人工智能机器人技术有限公司 | Emotion recognition method based on sound characteristics |
CN108095740B (en) * | 2017-12-20 | 2021-06-22 | 姜涵予 | User emotion assessment method and device |
CN109918516B (en) * | 2019-03-13 | 2021-07-30 | 百度在线网络技术(北京)有限公司 | Data processing method and device and terminal |
CN110717110B (en) * | 2019-10-12 | 2022-04-22 | 北京达佳互联信息技术有限公司 | Multimedia resource filtering method and device, electronic equipment and storage medium |
CN114647466A (en) * | 2020-12-17 | 2022-06-21 | 国信君和(北京)科技有限公司 | Page content extraction method, device, equipment and computer readable storage medium |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101593200B (en) * | 2009-06-19 | 2012-10-03 | 淮海工学院 | Method for classifying Chinese webpages based on keyword frequency analysis |
CN101694666B (en) * | 2009-07-17 | 2011-03-30 | 刘二中 | Method for inputting and processing characteristic words of file contents |
CN102902790B (en) * | 2012-09-29 | 2017-06-06 | 北京奇虎科技有限公司 | Web page classification system and method |
Non-Patent Citations (1)
Title |
---|
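Research on Search Result Clustering Based on Keyword Extraction; Qin Peng, Li Hengxun, Zhang Huaping, Liu Jingang; Proceedings of the 5th National Conference on Information Retrieval (CCIR 2009); 2010-06-28; full text * |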
Also Published As
Publication number | Publication date |
---|---|
CN104504027A (en) | 2015-04-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | ||
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing; Applicant after: Beijing Guoshuang Technology Co.,Ltd.; Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing; Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
GR01 | Patent grant | ||
GR01 | Patent grant |