CN103699602B - Method and apparatus for setting up a model essay webpage database - Google Patents
- Publication number
- CN103699602B (application CN201310684068.1A)
- Authority
- CN
- China
- Prior art keywords
- model essay
- webpage
- building block
- model
- essay
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides a method for setting up a model essay webpage database, comprising: crawling model essay webpages from websites; extracting the model essay data of each model essay webpage according to a keyword and an extraction rule corresponding to the website type; and setting up the model essay webpage database. The model essay webpage database includes multiple model essay web data items, each of which includes a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage. For model essay search requests that contain a word-count keyword, the model essay webpage database can provide more accurate retrieval results, and it makes it possible to show the word count of each model essay on the results page, which greatly helps users choose among model essay webpages. The invention also provides a device for setting up a model essay webpage database.
Description
Technical field
The present invention relates to the field of Internet search, and in particular to a method and apparatus for setting up a model essay webpage database used for model essay search.
Background
Model essay search is an important need in web search. The model essays searched for are of many types, including but not limited to official documents of all kinds, secretarial letters, work plans, final reports, reflections on lessons learned, speeches, school compositions, and various papers. During peak periods, such as outside student holidays or when year-end work summaries are due, model essay searches can account for about 1% of total daily web search volume. In practice, most model essay needs come with a fixed word-count requirement, so many users enter a word count when searching, for example "reaction to an article 400 words" or "scholarship application 800 words". Even when no word count is entered explicitly, there is often a latent word-count requirement: for example, a model essay of the paper type is generally no fewer than 8000 words, and a model essay for a party membership application generally requires 3000 to 5000 words.
For model essay search, there are currently two main problems. First, current search mechanisms can only satisfy a word-count requirement by matching the title or the page content, which is unfair to the ranking of webpages that do not mention a word count; and because no word-count field is available, recall is also insufficient. Second, users can only judge roughly from the title and snippet of a retrieval result whether it is the information they want; for many pages, whether the page is deceptive and whether the word count meets the requirement cannot be known in advance.
Fig. 1 shows a schematic search results page for a current model essay search, where the model essay search request entered by the user is "composition on the topic of family, 350 words". On the results page, only the title and snippet of the first result directly hit "350 words"; the word counts of the other results are unknown, so the keyword "350 words" can only be dropped when ranking, which is very unfair to results whose length is actually close to 350 words. The user also cannot tell which results are suitable and has to click through them one by one, which is inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method, and a corresponding device, for setting up a model essay webpage database for model essay search that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for setting up a model essay webpage database is provided, comprising:
crawling model essay webpages from websites;
extracting the model essay data of each model essay webpage according to a keyword and an extraction rule corresponding to the website type; and
setting up a model essay webpage database; wherein
the model essay webpage database includes multiple model essay web data items, each of which includes a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage.
Optionally, the website type is a Q&A community website, whose webpages include a main building block (the post that poses the question) and secondary building blocks (the posts that answer it). The extraction rule for a Q&A community website includes: matching the keyword against the text content of the webpage's main building block; if they match, judging whether the word count of the text content of each secondary building block is greater than a predetermined threshold; if so, determining the secondary building blocks whose word count is greater than the predetermined threshold to be the secondary building blocks to be extracted; and extracting the model essay data of the webpage, where the model essay data includes the title, the body text, and the word count of the text content of each secondary building block to be extracted.
Optionally, the step of determining the secondary building blocks to be extracted further includes: determining a meta keyword from the keyword; matching the meta keyword against the text content of the secondary building blocks whose word count is greater than the predetermined threshold; and, if they match, determining the matching secondary building blocks to be the secondary building blocks to be extracted.
Optionally, if the webpage contains multiple secondary building blocks to be extracted, the model essay web data item for that webpage includes multiple pieces of model essay data, one for each secondary building block to be extracted.
Optionally, the website type is a word website, whose webpages include a text title and body content. The extraction rule for a word website includes: matching the keyword against the text title; and, if they match, extracting the model essay data of the webpage, where the model essay data includes the text title, the body content, and the word count of the body content.
Optionally, the website type is a library resource website, whose webpages include a URL resource link to a model essay document and text describing that document. The extraction rule for a library resource website includes: matching the keyword against the text describing the model essay document; if they match, downloading the model essay document via the URL resource link; and extracting the model essay data of the webpage, where the model essay data includes the text describing the model essay document and the model essay document itself.
Optionally, the model essay type corresponds to the keyword.
Optionally, the predetermined threshold is set in advance according to the model essay type.
According to another aspect of the present invention, a device for setting up a model essay webpage database is provided, comprising:
a webpage capture unit, adapted to crawl model essay webpages from websites;
a model essay data unit, adapted to extract the model essay data of each model essay webpage according to a keyword and an extraction rule corresponding to the website type; and
a database setup unit, adapted to set up a model essay webpage database; wherein
the model essay webpage database includes multiple model essay web data items, each of which includes a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage.
Optionally, the website type is a Q&A community website, whose webpages include a main building block (the post that poses the question) and secondary building blocks (the posts that answer it). The model essay data unit further includes: a matching unit, adapted to match the keyword against the text content of the webpage's main building block; a secondary building block determining unit, adapted to judge, if they match, whether the word count of the text content of each secondary building block is greater than a predetermined threshold and, if so, to determine the secondary building blocks whose word count is greater than the predetermined threshold to be the secondary building blocks to be extracted; and an extraction unit, adapted to extract the model essay data of the webpage, where the model essay data includes the title, the body text, and the word count of the text content of each secondary building block to be extracted.
Optionally, the secondary building block determining unit is further adapted to determine a meta keyword from the keyword; to match the meta keyword against the text content of the secondary building blocks whose word count is greater than the predetermined threshold; and, if they match, to determine the matching secondary building blocks to be the secondary building blocks to be extracted.
Optionally, if the webpage contains multiple secondary building blocks to be extracted, the model essay web data item for that webpage includes multiple pieces of model essay data, one for each secondary building block to be extracted.
Optionally, the website type is a word website, whose webpages include a text title and body content. The model essay data unit further includes: a matching unit, adapted to match the keyword against the text title; and an extraction unit, adapted to extract, if they match, the model essay data of the webpage, where the model essay data includes the text title, the body content, and the word count of the body content.
Optionally, the website type is a library resource website, whose webpages include a URL resource link to a model essay document and text describing that document. The model essay data unit further includes: a matching unit, adapted to match the keyword against the text describing the model essay document; a download unit, adapted to download, if they match, the model essay document via the URL resource link; and an extraction unit, adapted to extract the model essay data of the webpage, where the model essay data includes the text describing the model essay document and the model essay document itself.
According to the method and apparatus of the present invention for setting up a model essay webpage database, the model essay webpage database includes multiple model essay web data items, each of which includes a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage. For each website type, the model essay data of the corresponding webpages is extracted, typically including the title, body text, and word count of the extracted model essay. Thus, when a user sends a model essay search request, the model essay webpage database is searched in addition to the conventional search over the basic web page library collected by the web crawler. Because the model essay webpage database contains the title, body text, and word count of each model essay webpage, model essay webpages whose word count is equal or close to what the user requires appear on the search results page and are ranked near the top; the word count of each model essay can also be shown to the user on the results page, which improves search quality and user experience.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and practiced according to the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a schematic diagram of a search results page for a model essay search in the prior art;
Fig. 2 is a flow chart of a method for setting up a model essay webpage database according to an embodiment of the invention;
Fig. 3 is a schematic diagram of the data structure of a model essay webpage database according to an embodiment of the invention;
Fig. 4 is a flow chart of a method for extracting model essay data according to another embodiment of the invention;
Fig. 5 is a schematic diagram of model essay web data according to another embodiment of the invention;
Fig. 6 is a structural diagram of a device for setting up a model essay webpage database according to a further embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and so that its scope can be conveyed fully to those skilled in the art.
Embodiment one
This embodiment provides a method for setting up a model essay webpage database. The model essay webpage database is set up on the server side of the search service provider and is called by the search engine when a user sends it a model essay search request.
Fig. 2 shows a flow chart of the method for setting up a model essay webpage database according to this embodiment, which includes at least steps S202 to S206:
Step S202: crawl the model essay webpages of websites;
Step S204: extract the model essay data of the model essay webpages; and
Step S206: set up the model essay webpage database.
In step S202, the method of this embodiment uses a web crawler to crawl the model essay webpages of model essay resource websites on the Internet. A web crawler is a mature technology: a program that automatically fetches webpages from the Internet according to set rules, downloading pages for the search engine, and it is an important component of a search engine. All webpages fetched by the crawler are stored on the server side; they can also be analyzed and filtered, and an index can be built, to generate the basic search library (or index database) used for user searches. According to this embodiment, the web crawler can crawl model essay webpages across the whole web, or only within a specified set of model essay resource websites; the specified websites can be continually added to and updated by the search service provider and/or by users.
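The choice between whole-web crawling and crawling a specified set of sites can be illustrated with a small scope check. This is only a sketch under stated assumptions: the host names, the `ALLOWED_SITES` set, and the functions `in_crawl_scope` and `add_site` are hypothetical and do not come from the patent.

```python
from urllib.parse import urlparse

# Hypothetical allow-list of model essay resource sites.
ALLOWED_SITES = {"qa.example.com", "essays.example.org"}


def in_crawl_scope(url, allowed_sites=None):
    """Return True if the URL may be crawled.

    allowed_sites=None stands for whole-web crawling; otherwise only
    URLs whose host is in the set are in scope.
    """
    if allowed_sites is None:
        return True
    host = urlparse(url).hostname or ""
    return host in allowed_sites


def add_site(allowed_sites, host):
    """Add a newly specified model essay resource site to the set."""
    allowed_sites.add(host.lower())
```

A crawler loop could call `in_crawl_scope` before fetching each URL, and `add_site` whenever the provider or a user registers a new site.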
After the webpages have been crawled, step S204 is performed: for the crawled model essay webpages stored on the server side, the model essay data of webpages of a given model essay type is extracted according to the keyword corresponding to that type. Specifically, the keyword is first matched against the content of the model essay webpage; if they match, the model essay data is extracted from the content of the page. The inventors found that model essay resource websites on the Internet fall mainly into three categories: Q&A community websites, word websites, and library resource websites. Preferably, a dedicated model essay data extraction rule is used for each category of model essay resource website, so that model essay data can be provided more accurately.
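One way to picture "a rule per website type" is a lookup from site type to the page field that the keyword is matched against. The sketch below assumes pages are plain dictionaries and that matching means substring containment; the names `SiteType`, `field_to_match`, and `keyword_matches`, and the field names, are hypothetical.

```python
from enum import Enum


class SiteType(Enum):
    QA_COMMUNITY = "qa"
    WORD = "word"
    LIBRARY = "library"


def field_to_match(site_type):
    """Page field the keyword is matched against for each website type."""
    return {
        SiteType.QA_COMMUNITY: "main_block_text",  # text of the main building block
        SiteType.WORD: "title",                    # text title
        SiteType.LIBRARY: "description",           # text describing the document
    }[site_type]


def keyword_matches(page, site_type, keyword):
    """Assumed matching rule: the keyword occurs in the relevant field."""
    return keyword in page.get(field_to_match(site_type), "")
```

Keeping the per-type difference in one table makes it easy to add a further website type later without touching the matching logic.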
Then step S206 is performed: the model essay webpage database is set up from the extracted model essay data. The model essay webpage database includes multiple model essay web data items, each of which includes a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage. Typically, the model essay data includes the model essay title, the model essay body text, and the model essay word count. Fig. 3 schematically shows the data structure of the model essay webpage database of this embodiment; each model essay webpage corresponds to one model essay web data item.
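The record layout described above might be modeled as follows. This is a minimal sketch: the class and field names are hypothetical, and keying the database by URL is an assumption made for illustration, not something the patent specifies.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ModelEssayData:
    title: str
    body: str
    word_count: int


@dataclass
class ModelEssayWebData:
    essay_type: str   # model essay type, e.g. "scholarship application"
    url: str          # URL of the model essay webpage
    essays: List[ModelEssayData] = field(default_factory=list)


def build_database(items) -> Dict[str, ModelEssayWebData]:
    """Collect model essay web data items, keyed by URL (assumed key)."""
    return {item.url: item for item in items}
```

The `essays` list allows one webpage to carry several pieces of model essay data, which matters for Q&A pages with more than one qualifying answer.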
The method of this embodiment sets up a model essay webpage database in which each model essay web data item contains model essay data such as the word count. When a user sends a model essay search request, the keywords carried by the request are used to search the model essay webpage database as well as the ordinary basic search library (that is, the index database). Because model essay search requests usually carry a word-count keyword such as "500 words", the model essay webpage database can provide more accurate retrieval results. It also makes it possible to show users the word count of each model essay on the results page, which greatly helps them choose among model essay webpages.
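To show how a stored word count could be used at query time, the sketch below splits a request into a keyword and a word-count target, then orders records by how close their word count is to the target. The patent does not give a ranking formula, so ranking by absolute difference is an assumption, as are the function names and the English "N words" query pattern.

```python
import re


def split_query(query):
    """Split 'scholarship application 500 words' into (keyword, 500)."""
    m = re.search(r"(\d+)\s*words?\b", query)
    if not m:
        return query.strip(), None
    keyword = query[:m.start()] + query[m.end():]
    return " ".join(keyword.split()), int(m.group(1))


def rank_by_word_count(records, target):
    """Order records so word counts closest to the target come first."""
    return sorted(records, key=lambda r: abs(r["word_count"] - target))
```

Because `sorted` is stable, records that are equally close to the target keep their original relative order.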
Embodiment two
This embodiment provides a model essay data extraction method for Q&A community websites. On a Q&A community website, a main building block poses a question and multiple secondary building blocks answer it; this format meets users' need to find answers directly and quickly, can address almost any everyday problem, and has therefore accumulated a huge body of content. At present there are many influential Q&A community websites in China, such as Baidu Zhidao ("Baidu Knows"), 360 Wenda, Soso Wenwen, and Tianya Wenda.
Fig. 4 shows a flow chart of the model essay data extraction method for Q&A community websites according to this embodiment. Taking the keyword "scholarship application" as an example, the steps shown in Fig. 4 are used below to describe in detail how model essay data of the type "scholarship application" is extracted from a webpage of a Q&A community website.
Step S402: judge whether the text content of the main building block matches the keyword "scholarship application". The text content of the main building block and of each secondary building block of the Q&A community webpage is extracted by the web crawler; a typical main building block text is, for example, "how to write a scholarship application".
When the text content of the main building block matches the keyword "scholarship application", the method proceeds to step S404 and judges whether the word count of the text content of each secondary building block is greater than a predetermined threshold. The predetermined threshold is set according to the minimum number of words a scholarship application generally needs, for example 100 words, and secondary building blocks with fewer words than the threshold are rejected. On a Q&A community webpage, the content of a secondary building block is often not an answer to the question posed in the main building block at all, for example "don't know" or "I would also like to know"; a secondary building block with more than 100 words, by contrast, is very likely to be a genuine scholarship application model essay.
Of course, different predetermined word-count thresholds should be set for different model essay types. For example, for model essays of the type "written request for leave" the threshold can be set relatively low, for example 10 words, whereas for model essays of the type "party membership application" the threshold should be set relatively high, for example 2000 words.
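The per-type thresholds can be kept in a simple table. The three values below are the examples given in the text; the fallback `DEFAULT_THRESHOLD` and the function name `passes_threshold` are assumptions added for illustration.

```python
# Example thresholds from the text, in words.
THRESHOLDS = {
    "scholarship application": 100,
    "written request for leave": 10,
    "party membership application": 2000,
}

# Assumption: fallback for model essay types not listed above.
DEFAULT_THRESHOLD = 100


def passes_threshold(essay_type, word_count):
    """True if the word count is greater than the threshold for this type."""
    return word_count > THRESHOLDS.get(essay_type, DEFAULT_THRESHOLD)
```

With this table, a 30-word answer would be kept as a candidate leave request but rejected as a candidate party membership application.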
Preferably, once the word count of a secondary building block has been found to exceed the predetermined threshold of 100 words, the secondary building blocks can be screened further on the basis of the keyword: the method proceeds to step S406 and judges whether the content of each secondary building block whose word count exceeds the threshold matches the meta keywords. A meta keyword here is either the keyword itself or a term extracted from the keyword. For the keyword "scholarship application" in this example, the meta keywords are determined to be "application" and "scholarship". On a Q&A community webpage, secondary building blocks can usually be added by any network user, so the content of a secondary building block whose word count exceeds the threshold may still be unrelated to the question posed in the main building block; it may, for example, be an advertisement pasted maliciously by a network user. Matching the content of the secondary building block against the meta keywords further establishes how relevant that content is to scholarship applications. On the other hand, a model essay on scholarship applications in a secondary building block may never contain the complete phrase "scholarship application"; matching on the meta keywords "application" and "scholarship" ensures that such an essay is not missed.
If a secondary building block matches the meta keywords, the method proceeds to step S408 and determines that this secondary building block is a secondary building block to be extracted, that is, that its content includes a model essay on the subject of "scholarship application".
Finally, step S410 is performed: the model essay data, including the model essay title, the model essay body text, and the model essay word count, is extracted from the "scholarship application" model essay in the secondary building block to be extracted. How the model essay data is extracted from the text content of the secondary building block is not the inventive point of the present invention, so the implementation details are not repeated here.
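Steps S402 to S410 can be sketched as a single function. Several choices here are assumptions rather than the patent's method: words are counted as whitespace-separated tokens (Chinese text would more likely be counted by character), matching is case-insensitive substring containment, every meta keyword must occur in the answer, and the first line of an answer is treated as its title. The function and parameter names are hypothetical.

```python
def count_words(text):
    # Assumption: whitespace-separated tokens stand in for the word count.
    return len(text.split())


def extract_from_qa_page(main_text, replies, keyword, meta_keywords, threshold):
    """Return one model essay record per qualifying secondary building block."""
    # S402: the main building block must match the keyword.
    if keyword.lower() not in main_text.lower():
        return []
    records = []
    for reply in replies:
        words = count_words(reply)
        # S404: reject secondary building blocks at or below the threshold.
        if words <= threshold:
            continue
        # S406: assumed rule, every meta keyword must occur in the reply.
        lowered = reply.lower()
        if not all(mk.lower() in lowered for mk in meta_keywords):
            continue
        # S408/S410: first line taken as the title (hypothetical convention).
        title = reply.strip().splitlines()[0]
        records.append({"title": title, "body": reply, "word_count": words})
    return records
```

Because the function returns a list, a page with several qualifying answers naturally yields several pieces of model essay data.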
The inventors note that, for one model essay webpage of a Q&A community website, several of its secondary building blocks may be determined to be secondary building blocks to be extracted, that is, several secondary building blocks may satisfy both the word-count requirement and the meta keyword match. The model essay web data item for that webpage will then include multiple pieces of model essay data, as shown in Fig. 5, each corresponding to the content of one secondary building block to be extracted.
With this embodiment, the model essay data contained in Q&A community websites can be extracted accurately, and invalid content and malicious advertising are removed to the greatest possible extent.
Embodiment three
This embodiment provides a model essay data extraction method for word websites and library resource websites. It is described below using the keyword "scholarship application" as an example.
A webpage of a word website is text-based: the main region of the page presents the page's main content in a form such as an article. News websites and blog websites are examples. Typically, every webpage of a word website includes a text title and body content, both of which can be obtained by the web crawler.
According to the model essay data extraction method of this embodiment, for a word website the keyword "scholarship application" is first matched against the text title. If they match, the webpage is determined to be a model essay webpage of the type "scholarship application", and the text title, the body content, and the word count of the body content are then extracted as the model essay data of the webpage.
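The word website rule is short enough to sketch directly. As before, this assumes case-insensitive substring matching and a whitespace-token word count; the function name and the dictionary keys are hypothetical.

```python
def extract_from_word_page(title, body, keyword):
    """Return model essay data if the text title matches the keyword, else None."""
    if keyword.lower() not in title.lower():
        return None
    return {
        "title": title,
        "body": body,
        # Assumption: whitespace-separated tokens stand in for the word count.
        "word_count": len(body.split()),
    }
```

Only the title is checked, so a page whose body merely mentions the keyword is not treated as a model essay webpage of that type.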
A library resource website provides users with a download service for articles and papers of all kinds, such as ten-thousand-ton train net. Typically, a webpage of a library resource website includes a URL resource link to the model essay document and text describing that document.
According to the model essay data extraction method of this embodiment, for a library resource website the keyword "scholarship application" is first matched against the text on the library resource webpage that describes the model essay document. If they match, the library resource webpage is determined to be a model essay webpage of the type "scholarship application", and the model essay document is downloaded via the URL resource link. The text describing the model essay document and the downloaded model essay document are then extracted as the model essay data of the webpage.
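The library resource rule adds a download step after the match. In the sketch below the download is passed in as a callable so that the example stays self-contained and makes no network request; the function name, parameter names, and dictionary keys are hypothetical, and case-insensitive substring matching is again an assumption.

```python
def extract_from_library_page(description, doc_url, keyword, download):
    """Match the keyword against the description, then fetch the document.

    `download` is any callable that takes a URL and returns the document
    text; it is only invoked when the description matches.
    """
    if keyword.lower() not in description.lower():
        return None
    document_text = download(doc_url)
    return {"description": description, "document_text": document_text}
```

In a real crawler the `download` argument might wrap an HTTP client call; deferring it until after the match avoids fetching documents for pages that are not of the requested model essay type.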
Embodiment four
This embodiment provides a device for setting up a model essay webpage database. The model essay webpage database is set up on the server side of the search service provider and is called by the search engine when a user sends it a model essay search request.
Fig. 6 shows a structural diagram of the device for setting up a model essay webpage database according to this embodiment, including:
a webpage capture unit 602, adapted to crawl the model essay webpages of websites;
a model essay data unit 604, adapted to extract the model essay data of the model essay webpages; and
a database setup unit 606, adapted to set up the model essay webpage database. The model essay webpage database includes multiple model essay web data items, each of which includes a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage.
In the device of this embodiment, the webpage capture unit 602 is adapted to use a web crawler to crawl the model essay webpages of model essay resource websites on the Internet. A web crawler is a mature technology: a program that automatically fetches webpages from the Internet according to set rules, downloading pages for the search engine, and it is an important component of a search engine.
The model essay data unit 604 is adapted, for the crawled model essay webpages stored on the server side, to extract the model essay data of webpages of a given model essay type according to the keyword corresponding to that type.
Further, the model essay data unit 604 includes: a matching unit, adapted to match the keyword against the text content of the main building block of a Q&A community webpage; a secondary building block determining unit, adapted to judge, when the keyword matches the main building block, whether the word count of the text content of each secondary building block is greater than a predetermined threshold and, if so, to determine the secondary building blocks whose word count is greater than the predetermined threshold to be the secondary building blocks to be extracted; and an extraction unit, adapted to extract the model essay data of the webpage, where the model essay data includes the title, the body text, and the word count of the text content of each secondary building block to be extracted.
Preferably, after determining that the word count of a secondary building block is greater than the predetermined threshold, the secondary building block determining unit further matches the content of that secondary building block against the meta keywords (the keyword itself, or terms determined from the keyword); if they match, it determines the matching secondary building block to be a secondary building block to be extracted.
Alternatively, model essay data cell 604 includes matching unit, suitable for by the text mark of the keyword and word webpage
Topic is matched;And extraction unit, suitable in keyword and text title match, extracting the model essay data of the webpage;Its
Described in model essay data include:The number of words of text title, body matter, and body matter.
Alternatively, the model essay data unit 604 includes a matching unit, adapted to match the keyword against the text content that describes the corresponding model essay document in a library resource webpage; a download unit, adapted to download the model essay document via the URL resource link when the keyword matches the descriptive text; and an extraction unit, adapted to extract the model essay data of the webpage; wherein the model essay data include: the text content describing the model essay document, and the model essay document.
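The library resource rule above can be sketched in the same way. In this sketch the download step is passed in as a callable so that the logic can be exercised without network access; `extract_from_library_page` and the `download` parameter are hypothetical names, and substring matching is an assumption.

```python
from typing import Callable, Optional


def extract_from_library_page(
    keyword: str,
    description: str,
    resource_url: str,
    download: Callable[[str], bytes],
) -> Optional[dict]:
    # The keyword must match the text that describes the model essay document.
    if keyword not in description:
        return None
    # Only on a match is the document fetched via its URL resource link.
    document = download(resource_url)
    return {"description": description, "document": document}
```

In practice the callable might wrap something like `urllib.request.urlopen(url).read()` from the Python standard library; the source does not say how the download is performed.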
The database unit 606 is adapted to build a model essay webpage database from the model essay data of the extracted model essay webpages; wherein the model essay webpage database includes multiple items of model essay web data, and each item of model essay web data includes the model essay type, the model essay data of the webpage text, and the corresponding URL of the model essay webpage. Typically, the model essay data include the model essay title, the model essay body text, and the model essay word count.
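One possible layout for such a database is sketched below using SQLite from the Python standard library. The table name `essay_pages`, the column names, and the helper functions are all assumptions made for illustration; the source specifies only which fields each item of model essay web data contains.

```python
import sqlite3


def create_essay_db(conn: sqlite3.Connection) -> None:
    # One row per item of model essay data; several rows may share one URL
    # when a webpage yields more than one model essay.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS essay_pages (
            id INTEGER PRIMARY KEY,
            essay_type TEXT NOT NULL,
            title TEXT NOT NULL,
            body TEXT NOT NULL,
            word_count INTEGER NOT NULL,
            url TEXT NOT NULL
        )
        """
    )


def add_essay(conn, essay_type, title, body, word_count, url) -> None:
    # Parameterized insert; the essay type corresponds to the crawl keyword.
    conn.execute(
        "INSERT INTO essay_pages (essay_type, title, body, word_count, url) "
        "VALUES (?, ?, ?, ?, ?)",
        (essay_type, title, body, word_count, url),
    )
    conn.commit()
```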
The device according to this embodiment builds a model essay webpage database in which each item of model essay web data contains model essay data such as the model essay word count. When a user sends a model essay search request, model essays can be retrieved from the model essay webpage database. For a search request that contains a word-count keyword, the model essay webpage database can provide more accurate retrieval results. The database also makes it possible to show the user the corresponding model essay word count on the results page, which makes it much easier for the user to choose among model essay webpages.
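How a stored word count could sharpen retrieval is sketched below. The source does not describe the retrieval procedure itself, so everything here is an assumption: the function name `search_essays`, the record layout, the 20% tolerance band, and the ranking by closeness to the requested word count.

```python
from typing import List, Optional


def search_essays(
    records: List[dict],
    essay_type: str,
    target_words: Optional[int] = None,
    tolerance: float = 0.2,
) -> List[dict]:
    # Filter by model essay type first.
    hits = [r for r in records if r["essay_type"] == essay_type]
    if target_words is not None:
        # Assumption: accept essays within +/- tolerance of the requested count.
        low = target_words * (1 - tolerance)
        high = target_words * (1 + tolerance)
        hits = [r for r in hits if low <= r["word_count"] <= high]
        # Rank the remaining essays by closeness to the requested count.
        hits.sort(key=lambda r: abs(r["word_count"] - target_words))
    return hits
```

Each returned record still carries its `word_count`, so a results page could display it next to the link.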
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the invention described herein, and that the above description of a specific language is given to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided here. It is to be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail, so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments of the invention. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules or units or components of an embodiment may be combined into one module or unit or component, and may furthermore be divided into multiple sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, equivalent, or similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.
The embodiments of the various components of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for building a model essay webpage database according to embodiments of the invention. The invention may also be implemented as a device or apparatus program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order. These words may be interpreted as names.
Claims (14)
1. A method for building a model essay webpage database, comprising:
capturing model essay webpages of a website;
extracting model essay data of the model essay webpages according to a keyword and an extraction rule corresponding to the website type;
building a model essay webpage database; wherein
the model essay webpage database includes multiple items of model essay web data, each item of model essay web data including a model essay type, the model essay data of the model essay webpage, and the corresponding URL of the model essay webpage, wherein the model essay type corresponds to the keyword.
2. The method according to claim 1, wherein the website type is a question-and-answer community website whose webpage includes a main building block that poses a question and secondary building blocks that answer the question; the extraction rule corresponding to the question-and-answer community website includes:
matching the keyword against the text content of the main building block of the webpage;
if they match, judging whether the word count of the text content of a secondary building block exceeds a predetermined threshold;
if so, determining that the secondary building block whose word count exceeds the predetermined threshold is a secondary building block to be extracted; and
extracting the model essay data of the webpage; wherein the model essay data include: the title of the text content of the secondary building block to be extracted, the body text of the text content of the secondary building block to be extracted, and the word count of the text content of the secondary building block to be extracted.
3. The method according to claim 1 or 2, wherein the step of determining the secondary building block to be extracted further includes:
determining a first keyword according to the keyword;
matching the first keyword against the text content of the secondary building blocks whose word count exceeds the predetermined threshold;
if they match, determining that the matching secondary building block is the secondary building block to be extracted.
4. The method according to claim 1 or 2, wherein, if the webpage contains multiple secondary building blocks to be extracted, the model essay web data corresponding to the webpage include multiple items of model essay data, corresponding in number to the secondary building blocks to be extracted.
5. The method according to claim 1 or 2, wherein the website type is a text website whose webpage includes a text title and body content; the extraction rule corresponding to the text website includes:
matching the keyword against the text title;
if they match, extracting the model essay data of the webpage; wherein the model essay data include: the text title, the body content, and the word count of the body content.
6. The method according to claim 1 or 2, wherein the website type is a library resource website whose webpage includes a URL resource link to a model essay document and text content describing the corresponding model essay document; the extraction rule corresponding to the library resource website includes:
matching the keyword against the text content describing the corresponding model essay document;
if they match, downloading the model essay document via the URL resource link;
extracting the model essay data of the webpage; wherein the model essay data include: the text content describing the model essay document, and the model essay document.
7. The method according to claim 2, wherein the predetermined threshold is predefined according to different model essay types.
8. A device for building a model essay webpage database, comprising:
a webpage capture unit, adapted to capture model essay webpages of a website;
a model essay data unit, adapted to extract model essay data of the model essay webpages according to a keyword and an extraction rule corresponding to the website type;
a database unit, adapted to build a model essay webpage database; wherein
the model essay webpage database includes multiple items of model essay web data, each item of model essay web data including a model essay type, the model essay data of the model essay webpage, and the corresponding URL of the model essay webpage, wherein the model essay type corresponds to the keyword.
9. The device according to claim 8, wherein the website type is a question-and-answer community website whose webpage includes a main building block that poses a question and secondary building blocks that answer the question; the model essay data unit further includes:
a matching unit, adapted to match the keyword against the text content of the main building block of the webpage;
a secondary building block determining unit, adapted to:
if they match, judge whether the word count of the text content of a secondary building block exceeds a predetermined threshold;
if so, determine that the secondary building block whose word count exceeds the predetermined threshold is a secondary building block to be extracted; and
an extraction unit, adapted to extract the model essay data of the webpage; wherein the model essay data include: the title of the text content of the secondary building block to be extracted, the body text of the text content of the secondary building block to be extracted, and the word count of the text content of the secondary building block to be extracted.
10. The device according to claim 8 or claim 9, wherein the secondary building block determining unit is further adapted to:
determine a first keyword according to the keyword;
match the first keyword against the text content of the secondary building blocks whose word count exceeds the predetermined threshold;
if they match, determine that the matching secondary building block is the secondary building block to be extracted.
11. The device according to claim 8 or claim 9, wherein, if the webpage contains multiple secondary building blocks to be extracted, the model essay web data corresponding to the webpage include multiple items of model essay data, corresponding in number to the secondary building blocks to be extracted.
12. The device according to claim 8 or claim 9, wherein the website type is a text website whose webpage includes a text title and body content; the model essay data unit further includes:
a matching unit, adapted to match the keyword against the text title;
an extraction unit, adapted to:
if they match, extract the model essay data of the webpage; wherein the model essay data include: the text title, the body content, and the word count of the body content.
13. The device according to claim 8 or claim 9, wherein the website type is a library resource website whose webpage includes a URL resource link to a model essay document and text content describing the corresponding model essay document; the model essay data unit further includes:
a matching unit, adapted to match the keyword against the text content describing the corresponding model essay document;
a download unit, adapted to download the model essay document via the URL resource link if they match; and
an extraction unit, adapted to extract the model essay data of the webpage; wherein the model essay data include: the text content describing the model essay document, and the model essay document.
14. The device according to claim 9, wherein the predetermined threshold is predefined according to different model essay types.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310684068.1A CN103699602B (en) | 2013-12-13 | 2013-12-13 | A kind of method and apparatus for setting up model essay webpage database |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201310684068.1A CN103699602B (en) | 2013-12-13 | 2013-12-13 | A kind of method and apparatus for setting up model essay webpage database |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103699602A CN103699602A (en) | 2014-04-02 |
CN103699602B true CN103699602B (en) | 2017-08-29 |
Family
ID=50361130
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201310684068.1A Active CN103699602B (en) | 2013-12-13 | 2013-12-13 | A kind of method and apparatus for setting up model essay webpage database |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103699602B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543049B (en) * | 2018-11-23 | 2021-09-07 | 广东小天才科技有限公司 | Method and system for automatically pushing materials according to writing characteristics |
Citations (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101013439A (en) * | 2007-01-19 | 2007-08-08 | 徐源 | Control method for inquiring information with data base in website |
CN203012717U (en) * | 2013-01-15 | 2013-06-19 | 黑龙江工程学院 | Practical writing inquiry device |
CN103399862A (en) * | 2013-07-04 | 2013-11-20 | 百度在线网络技术(北京)有限公司 | Method and equipment for confirming searching guide information corresponding to target query sequences |
Non-Patent Citations (2)
Title |
---|
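Wei Shunping. Generation of a large corpus of Chinese primary school student compositions. 《现代教育技术》 (Modern Educational Technology). 2008, pp. 45-49. * |
Wei Shunping. Research on creating a primary school Chinese reading environment supported by a corpus. 《网络教育与远程教育》 (Online Education and Distance Education). 2008, pp. 45-51. * |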
Also Published As
Publication number | Publication date |
---|---|
CN103699602A (en) | 2014-04-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Jäschke et al. | Tag recommendations in folksonomies | |
Patil Swati et al. | Search engine optimization: A study | |
CN112749284B (en) | Knowledge graph construction method, device, equipment and storage medium | |
CN106156372B (en) | A kind of classification method and device of internet site | |
CN103838798B (en) | Page classifications system and page classifications method | |
CN102693271A (en) | Network information recommending method and system | |
CN103530364B (en) | The method and system of download link are provided | |
CN103617213B (en) | Method and system for identifying newspage attributive characters | |
CN103530414B (en) | Web Page Key Words open up word method and apparatus | |
CN107341399A (en) | Assess the method and device of code file security | |
CN103177036A (en) | Method and system for label automatic extraction | |
CN106021418A (en) | News event clustering method and device | |
CN103116635A (en) | Field-oriented method and system for collecting invisible web resources | |
CN107977420A (en) | The abstract extraction method, apparatus and readable storage medium storing program for executing of a kind of evolved document | |
CN105095175A (en) | Method and device for obtaining truncated web title | |
CN105630937A (en) | Method and device for searching answers to exam questions | |
CN112989824A (en) | Information pushing method and device, electronic equipment and storage medium | |
CN110502680A (en) | A kind of abstracting method and device of acceptance of the bid bulletin relevant field | |
CN106611029A (en) | Method and device for improving site search efficiency in website | |
WO2017000659A1 (en) | Enriched uniform resource locator (url) identification method and apparatus | |
CN103678601A (en) | Model essay retrieval request processing method and device | |
WO2015149550A1 (en) | Method and apparatus for determining grades of links within website | |
CN102929948B (en) | list page identification system and method | |
CN103699602B (en) | A kind of method and apparatus for setting up model essay webpage database | |
CN104036015A (en) | Electronic terminal question classification method and device, and solution provision method, system and device based on electronic terminal question classification device and method |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | ||
Effective date of registration: 20220718 Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015 Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park) Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd. Patentee before: Qizhi software (Beijing) Co.,Ltd. |