CN103699602B - Method and apparatus for setting up a model essay webpage database - Google Patents

Info

Publication number: CN103699602B
Authority: CN (China)
Application number: CN201310684068.1A
Other languages: Chinese (zh)
Other versions: CN103699602A (en)
Inventor: 侯小虎
Current Assignee: Beijing Qihoo Technology Co Ltd
Original Assignee: Beijing Qihoo Technology Co Ltd; Qizhi Software Beijing Co Ltd
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90: Details of database functions independent of the retrieved data types
    • G06F16/95: Retrieval from the web
    • G06F16/951: Indexing; Web crawling techniques

Abstract

The invention provides a method for setting up a model essay webpage database, comprising: capturing model essay webpages from websites; extracting the model essay data of each model essay webpage according to a keyword and an extraction rule corresponding to the website type; and setting up the model essay webpage database. The model essay webpage database comprises multiple model essay webpage data items, each of which includes the model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage. For a model essay search request that contains a word-count keyword, the model essay webpage database can provide more accurate search results and makes it possible to show the user the word count of each model essay on the results page, which makes it much easier for the user to choose among model essay webpages. The invention also provides an apparatus for setting up a model essay webpage database.

Description

Method and apparatus for setting up a model essay webpage database
Technical field
The present invention relates to the field of Internet search, and in particular to a method and apparatus for setting up a model essay webpage database used for model essay search.
Background
Model essay search is an important need in web search. Many types of model essay are searched for, including but not limited to official documents of all kinds, secretarial letters, work plans, summary reports, reflections, speeches, composition assignments, and papers of various kinds. In peak periods, such as when students are not on holiday or when year-end work summaries are due, model essay searches can account for about 1% of total daily web search volume. In practice, most model essay needs come with a fixed word-count requirement, so many users include a word count when searching, for example "book report 400 words" or "scholarship application 800 words". Even when no word count is entered explicitly, there is often an implicit requirement on the length of the model essay: for example, a model paper is generally no shorter than 8,000 words, and a model application to join the Party generally needs 3,000 to 5,000 words.
For model essay search, there are currently two main problems. First, the current search mechanism can only satisfy a word-count requirement by matching against titles and page content, which is unfair to the ranking of webpages that do not mention a word count; and because no field holding the word count is available, recall is also insufficient. Second, users can only roughly judge whether a result is what they want from the title and snippet of each search result, so for many pages they cannot tell in advance whether the page is deceptive or whether its word count meets the requirement.
Fig. 1 is a schematic diagram of a results page for a current model essay search, where the search request entered by the user is "composition on the topic of family, 350 words". On the results page, only the title and snippet of the first result directly match "350 words". The word counts of the other results are unknown, so they can only be ranked with the keyword "350 words" ignored, which is unfair to results whose length may in fact be exactly or nearly 350 words. The user also cannot tell which results are useful and has to click through them one by one, which is inefficient.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a method, and a corresponding apparatus, for setting up a model essay webpage database for model essay search that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a method for setting up a model essay webpage database is provided, comprising:
capturing model essay webpages from websites;
extracting the model essay data of each model essay webpage according to a keyword and an extraction rule corresponding to the website type; and
setting up the model essay webpage database; wherein
the model essay webpage database comprises multiple model essay webpage data items, and each model essay webpage data item includes the model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage.
Optionally, the website type is a question-and-answer community website, whose webpages include a main post that asks a question and reply posts that answer it. The extraction rule corresponding to question-and-answer community websites includes: matching the keyword against the text of the webpage's main post; if they match, determining whether the word count of the text of each reply post exceeds a predetermined threshold; if so, determining that the reply post whose word count exceeds the predetermined threshold is a reply post to be extracted; and extracting the model essay data of the webpage, wherein the model essay data includes the title, the body text, and the word count of the text of the reply post to be extracted.
Optionally, the step of determining the reply post to be extracted further includes: determining meta keywords from the keyword; matching the meta keywords against the text of each reply post whose word count exceeds the predetermined threshold; and, if they match, determining that the matching reply post is a reply post to be extracted.
Optionally, when the webpage contains multiple reply posts to be extracted, the model essay webpage data item corresponding to the webpage includes multiple pieces of model essay data, one for each reply post to be extracted.
Optionally, the website type is a text website, whose webpages include a text title and body content. The extraction rule corresponding to text websites includes: matching the keyword against the text title; and, if they match, extracting the model essay data of the webpage, wherein the model essay data includes the text title, the body content, and the word count of the body content.
Optionally, the website type is a document library website, whose webpages include a URL resource link to a model essay document and text describing that document. The extraction rule corresponding to document library websites includes: matching the keyword against the text describing the model essay document; if they match, downloading the model essay document via the URL resource link; and extracting the model essay data of the webpage, wherein the model essay data includes the text describing the model essay document and the model essay document itself.
Optionally, the model essay type corresponds to the keyword.
Optionally, the predetermined threshold is set in advance according to the model essay type.
According to another aspect of the present invention, an apparatus for setting up a model essay webpage database is provided, comprising:
a webpage capture unit, adapted to capture model essay webpages from websites;
a model essay data unit, adapted to extract the model essay data of each model essay webpage according to a keyword and an extraction rule corresponding to the website type; and
a database unit, adapted to set up the model essay webpage database; wherein
the model essay webpage database comprises multiple model essay webpage data items, and each model essay webpage data item includes the model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage.
Optionally, the website type is a question-and-answer community website, whose webpages include a main post that asks a question and reply posts that answer it. The model essay data unit further includes: a matching unit, adapted to match the keyword against the text of the webpage's main post; a reply post determining unit, adapted to determine, if they match, whether the word count of the text of each reply post exceeds a predetermined threshold and, if so, to determine that the reply post whose word count exceeds the predetermined threshold is a reply post to be extracted; and an extraction unit, adapted to extract the model essay data of the webpage, wherein the model essay data includes the title, the body text, and the word count of the text of the reply post to be extracted.
Optionally, the reply post determining unit is further adapted to: determine meta keywords from the keyword; match the meta keywords against the text of each reply post whose word count exceeds the predetermined threshold; and, if they match, determine that the matching reply post is a reply post to be extracted.
Optionally, when the webpage contains multiple reply posts to be extracted, the model essay webpage data item corresponding to the webpage includes multiple pieces of model essay data, one for each reply post to be extracted.
Optionally, the website type is a text website, whose webpages include a text title and body content. The model essay data unit further includes: a matching unit, adapted to match the keyword against the text title; and an extraction unit, adapted to extract the model essay data of the webpage if they match, wherein the model essay data includes the text title, the body content, and the word count of the body content.
Optionally, the website type is a document library website, whose webpages include a URL resource link to a model essay document and text describing that document. The model essay data unit further includes: a matching unit, adapted to match the keyword against the text describing the model essay document; a download unit, adapted to download the model essay document via the URL resource link if they match; and an extraction unit, adapted to extract the model essay data of the webpage, wherein the model essay data includes the text describing the model essay document and the model essay document itself.
With the method and apparatus for setting up a model essay webpage database according to the present invention, the model essay webpage database comprises multiple model essay webpage data items, each including the model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage. The model essay data of a webpage is extracted according to the website type and typically includes the title, body text, and word count of the extracted model essay. Thus, when a user sends a model essay search request, the model essay webpage database is searched in addition to the routine search of the basic webpage library built by the web crawler. Because the model essay webpage database contains the title, body text, and word count of each model essay webpage, genuine model essay webpages whose word count equals or is close to the word count the user asked for appear on the results page and rank near the top. The model essay word count can also be shown to the user on the results page. Search quality and user experience are thereby improved.
The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and practiced in accordance with the specification, and so that the above and other objects, features, and advantages of the invention are more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the following detailed description of the preferred embodiments. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference numerals denote the same parts. In the drawings:
Fig. 1 is a schematic diagram of a search results page for model essay search in the prior art;
Fig. 2 is a flowchart of a method for setting up a model essay webpage database according to one embodiment of the invention;
Fig. 3 is a schematic diagram of the data structure of a model essay webpage database according to one embodiment of the invention;
Fig. 4 is a flowchart of a method for extracting model essay data according to another embodiment of the invention;
Fig. 5 is a schematic diagram of a model essay webpage data item according to another embodiment of the invention;
Fig. 6 is a schematic structural diagram of an apparatus for setting up a model essay webpage database according to yet another embodiment of the invention.
Detailed description
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope can be conveyed fully to those skilled in the art.
Embodiment one
This embodiment provides a method for setting up a model essay webpage database. The model essay webpage database is set up on the server side of a search service provider and is called by the search engine when a user sends it a model essay search request.
Fig. 2 shows a flowchart of the method for setting up a model essay webpage database according to this embodiment, which includes at least steps S202 to S206:
Step S202: capture model essay webpages from websites;
Step S204: extract the model essay data of the model essay webpages; and
Step S206: set up the model essay webpage database.
In step S202, the method of this embodiment uses a web crawler to capture model essay webpages from model essay resource websites on the Internet. A web crawler is a mature technology: a program that automatically fetches webpages from the Internet and downloads them for a search engine according to set rules, and it is an important component of a search engine. All webpages captured by the web crawler are stored on the server side; they can also be analyzed and filtered to some extent and indexed, producing the basic search library (or index database) used for user searches. According to this embodiment, the web crawler can capture model essay webpages across the whole web, or only within a specified set of model essay resource websites; the specified model essay resource websites can be added to and updated continually by the search service provider and/or by users.
After the webpages have been captured, step S204 is performed: for the captured model essay webpages stored on the server side, the model essay data of webpages of a given type is extracted according to the keyword corresponding to the model essay type to be extracted. Specifically, the keyword is first matched against the content of the model essay webpage; if they match, the model essay data is extracted from the content of the webpage. The inventor found that model essay resource websites on the Internet fall mainly into three categories: question-and-answer community websites, text websites, and document library websites. Preferably, a dedicated extraction rule is used for each type of model essay resource website, so that model essay data can be provided more accurately.
Then step S206 is performed: the model essay webpage database is set up from the extracted model essay data of the model essay webpages. The model essay webpage database comprises multiple model essay webpage data items, each including the model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage. Typically, the model essay data includes the model essay title, the model essay body text, and the model essay word count. Fig. 3 schematically shows the data structure of the model essay webpage database according to this embodiment; each model essay webpage corresponds to one model essay webpage data item.
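As an illustration of the data item described above, a minimal sketch in Python might look as follows. The class and field names are hypothetical, not taken from the patent, and `count_words` assumes that counting non-whitespace characters is an acceptable stand-in for the word count, which suits Chinese text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ModelEssayData:
    """One extracted model essay: title, body text, and word count."""
    title: str
    text: str
    word_count: int


@dataclass
class ModelEssayWebpageRecord:
    """One data item per model essay webpage (hypothetical field names)."""
    essay_type: str                       # for example "scholarship application"
    url: str                              # URL of the model essay webpage
    essays: List[ModelEssayData] = field(default_factory=list)


def count_words(text: str) -> int:
    # Assumption: count non-whitespace characters as the word count.
    return sum(1 for ch in text if not ch.isspace())


body = "I am applying for the scholarship because ..."
record = ModelEssayWebpageRecord(
    essay_type="scholarship application",
    url="http://example.com/essay/1",     # placeholder URL
    essays=[ModelEssayData("Scholarship application", body, count_words(body))],
)
```

Holding the essays in a list lets one webpage carry several pieces of model essay data, which a page with more than one usable essay would need.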
The method of this embodiment sets up a model essay webpage database in which each model essay webpage data item contains model essay data such as the model essay word count. When a user sends a model essay search request, the keywords carried by the request are used to search the model essay webpage database as well as the ordinary basic search library (that is, the index database). Because a model essay search request usually carries a word-count keyword such as "500 words", the model essay webpage database can provide more accurate search results; it also makes it possible to show the user the corresponding model essay word count on the results page, which makes it much easier for the user to choose among model essay webpages.
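How a stored word count could be used at query time can be sketched as follows. This is only an illustration under stated assumptions: the patent does not specify a ranking formula, so ordering by absolute distance from the requested word count, the regular expression used to find the count, and the in-memory `RECORDS` list are all hypothetical.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical in-memory stand-in for the database: (essay type, word count, URL).
RECORDS: List[Tuple[str, int, str]] = [
    ("composition", 350, "http://example.com/a"),
    ("composition", 800, "http://example.com/b"),
    ("composition", 360, "http://example.com/c"),
    ("scholarship application", 340, "http://example.com/d"),
]


def parse_word_count(query: str) -> Optional[int]:
    """Return the word count named in a query such as 'composition 350 words'."""
    match = re.search(r"(\d+)\s*(?:words?|字)", query)
    return int(match.group(1)) if match else None


def search(query: str, essay_type: str) -> List[Tuple[str, int, str]]:
    """Return records of the given type, closest word count first."""
    hits = [r for r in RECORDS if r[0] == essay_type]
    target = parse_word_count(query)
    if target is None:
        return hits  # no word count in the query: keep the stored order
    # Assumption: rank by absolute distance from the requested word count.
    return sorted(hits, key=lambda r: abs(r[1] - target))
```

For the query "composition about family 350 words", the 350-word and 360-word records would come before the 800-word one, and each hit carries its word count so that it can be shown on the results page.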
Embodiment two
This embodiment provides a model essay data extraction method for question-and-answer community websites. In a question-and-answer community website, a main post asks a question and multiple reply posts answer it. This form meets users' need to find answers directly and quickly, can address almost any question in daily life, and has therefore built up a huge content resource. There are currently many influential question-and-answer community websites in China, such as Baidu Zhidao, 360 Wenda, Soso Wenwen, and Tianya Wenda.
Fig. 4 shows a flowchart of the model essay data extraction method for question-and-answer community websites according to this embodiment. Taking the keyword "scholarship application" as an example, the following describes in detail, with reference to the steps shown in Fig. 4, how model essay data of type "scholarship application" is extracted from a webpage of a question-and-answer community website.
Step S402: determine whether the text of the main post matches the keyword "scholarship application". The text of the main post and of each reply post of the question-and-answer community webpage is obtained by the web crawler; a typical main post reads, for example, "How do I write a scholarship application?".
When the text of the main post matches the keyword "scholarship application", the method proceeds to step S404 and determines whether the word count of the text of each reply post exceeds a predetermined threshold. The predetermined threshold is set according to the minimum word count generally needed for a scholarship application, for example 100 words; reply posts shorter than the 100-word threshold are rejected. In question-and-answer community webpages, a reply post is often not really an answer to the question in the main post, for example "I don't know" or "I'd like to know too"; a reply post of more than 100 words, by contrast, is quite likely to be a genuine model scholarship application.
Of course, different predetermined thresholds should be set for different model essay types. For example, for model essays of type "leave request", the predetermined threshold can be set relatively low, for example 10 words; for model essays of type "application to join the Party", it should be set relatively high, for example 2,000 words.
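The per-type thresholds can be pictured as a simple lookup table. The sketch below uses the three example figures from the text; the dictionary name, the English type labels, and the fallback value for unlisted types are assumptions made for illustration.

```python
# Hypothetical per-type minimum word counts, using the example figures in the text.
MIN_WORDS_BY_TYPE = {
    "leave request": 10,
    "scholarship application": 100,
    "application to join the Party": 2000,
}
DEFAULT_MIN_WORDS = 100  # assumption: fallback for essay types not listed above


def passes_threshold(essay_type: str, word_count: int) -> bool:
    """True if a reply post is longer than the threshold for its essay type."""
    return word_count > MIN_WORDS_BY_TYPE.get(essay_type, DEFAULT_MIN_WORDS)
```

Keeping the thresholds in data rather than in code makes it straightforward to tune them for each model essay type.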
Preferably, once the text of a reply post has been found to exceed the 100-word threshold, the reply posts can be screened further by keyword: the method proceeds to step S406 and determines whether the content of each reply post whose word count exceeds the predetermined threshold matches the meta keywords. A meta keyword here is either the keyword itself or a term extracted from the keyword. For the keyword "scholarship application" in this example, the meta keywords are determined to be "application" and "scholarship". In question-and-answer community webpages, reply posts can usually be added by any network user, so a reply post longer than the predetermined threshold may still be unrelated to the question in the main post; for example, it may be an advertisement pasted maliciously by a user. Matching the reply post content against the meta keywords further establishes how relevant the reply post is to scholarship applications. Conversely, a model scholarship application in a reply post may never contain the complete phrase "scholarship application"; using the meta keywords "application" and "scholarship" ensures that such a reply post is not missed.
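The meta keyword check in step S406 can be sketched as below. The text does not say how meta keywords are derived from a keyword or whether all of them must occur, so the lookup table and the "every meta keyword must appear" rule are assumptions of this sketch.

```python
from typing import Dict, List

# Hypothetical mapping from a keyword to its meta keywords; the derivation
# of meta keywords is not specified in the text.
META_KEYWORDS: Dict[str, List[str]] = {
    "scholarship application": ["scholarship", "application"],
}


def meta_keywords_for(keyword: str) -> List[str]:
    """Fall back to the keyword itself when no meta keywords are listed."""
    return META_KEYWORDS.get(keyword, [keyword])


def matches_meta_keywords(keyword: str, reply_text: str) -> bool:
    """Assumption: a reply post matches only if every meta keyword occurs in it."""
    text = reply_text.lower()
    return all(mk.lower() in text for mk in meta_keywords_for(keyword))
```

With this rule, a reply post that says "my application for the national scholarship" matches even though the exact phrase "scholarship application" never appears, while a pasted advertisement does not.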
If a reply post matches the meta keywords, the method proceeds to step S408 and determines that this reply post is a reply post to be extracted, that is, that its content contains a model essay on the subject of "scholarship application".
Finally, step S410 is performed: the model essay data, including the model essay title, the model essay body text, and the model essay word count, is extracted from the "scholarship application" model essay in the reply post to be extracted. How this data is extracted from the text of the reply post is not the inventive point of the present invention, and the implementation details are not repeated here.
The inventor notes that, for a single model essay webpage of a question-and-answer community website, several of its reply posts may be determined to be reply posts to be extracted, that is, several reply posts may satisfy the word-count requirement and match the meta keywords. The model essay webpage data item for that webpage then includes multiple pieces of model essay data, as shown in Fig. 5, each corresponding to the content of one reply post to be extracted.
With this embodiment, the model essay data contained in question-and-answer community websites can be extracted accurately, and invalid content and malicious advertising are removed as far as possible.
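Steps S402 to S410 can be put together in a short sketch. Several details here are assumptions rather than the patent's method: matching is plain case-insensitive substring search, every meta keyword must occur, the word count is the number of non-whitespace characters, and the first line of a reply post is taken as the essay title (the text leaves title extraction unspecified).

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ModelEssayData:
    title: str
    text: str
    word_count: int


def count_words(text: str) -> int:
    # Assumption: non-whitespace characters stand in for the word count.
    return sum(1 for ch in text if not ch.isspace())


def extract_from_qa_page(main_post: str, replies: List[str], keyword: str,
                         meta_keywords: List[str], threshold: int) -> List[ModelEssayData]:
    """Apply steps S402 to S410 to one question-and-answer page."""
    # S402: the main post must match the keyword.
    if keyword.lower() not in main_post.lower():
        return []
    essays = []
    for reply in replies:
        words = count_words(reply)
        # S404: reject reply posts that do not exceed the threshold.
        if words <= threshold:
            continue
        # S406: reject long reply posts that lack a meta keyword (e.g. adverts).
        lowered = reply.lower()
        if not all(mk.lower() in lowered for mk in meta_keywords):
            continue
        # S408/S410: treat the reply post as a model essay and extract its data.
        title = reply.strip().splitlines()[0]  # assumption: first line is the title
        essays.append(ModelEssayData(title=title, text=reply, word_count=words))
    return essays
```

Because the function returns a list, a page with several qualifying reply posts yields several pieces of model essay data for the same webpage.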
Embodiment three
This embodiment provides model essay data extraction methods for text websites and for document library websites. The description below again uses the keyword "scholarship application" as an example.
Webpages of a text website are mainly text: the main content of the page is presented in the main area of the page in a form such as an article. Examples are news websites and blog websites. Typically, a webpage of a text website includes a text title and body content, both of which can be obtained by the web crawler.
According to the model essay data extraction method of this embodiment, for a text website the keyword "scholarship application" is first matched against the text title. If they match, the webpage is determined to be a model essay webpage of type "scholarship application", and the text title, the body content, and the word count of the body content are then extracted as the model essay data of the webpage.
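The text website rule is short enough to sketch directly. The function name and the returned dictionary keys are hypothetical, matching is assumed to be a case-insensitive substring test, and the word count is assumed to be the number of non-whitespace characters.

```python
from typing import Dict, Optional, Union


def count_words(text: str) -> int:
    # Assumption: non-whitespace characters stand in for the word count.
    return sum(1 for ch in text if not ch.isspace())


def extract_from_text_page(title: str, body: str,
                           keyword: str) -> Optional[Dict[str, Union[str, int]]]:
    """Return the model essay data if the title matches the keyword, else None."""
    if keyword.lower() not in title.lower():
        return None
    return {"title": title, "text": body, "word_count": count_words(body)}
```

A page titled "Scholarship application sample" would be extracted for the keyword "scholarship application", while a news article with an unrelated title would be skipped.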
A document library website offers users a download service for articles and papers of all kinds, for example ten-thousand-ton train net. Typically, a webpage of a document library website includes a URL resource link to a model essay document and text describing that document.
According to the model essay data extraction method of this embodiment, for a document library website the keyword "scholarship application" is first matched against the text describing the model essay document on the document library webpage. If they match, the document library webpage is determined to be a model essay webpage of type "scholarship application", and the model essay document is downloaded via the URL resource link. The text describing the model essay document and the downloaded model essay document are then extracted as the model essay data of the webpage.
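The document library rule can be sketched in the same way. To keep the sketch runnable without network access, the download step is passed in as a function; that parameter, the function name, and the dictionary keys are assumptions made for illustration.

```python
from typing import Callable, Dict, Optional


def extract_from_library_page(description: str, doc_url: str, keyword: str,
                              download: Callable[[str], str]) -> Optional[Dict[str, str]]:
    """Match the description, then download the document via its URL resource link."""
    if keyword.lower() not in description.lower():
        return None  # no match: the document is never downloaded
    document_text = download(doc_url)
    return {"description": description, "document": document_text}


def fake_download(url: str) -> str:
    # Stand-in for a real downloader, used only for demonstration.
    return "Dear committee, I am applying for the scholarship ..."


data = extract_from_library_page(
    "Scholarship application template (Word document)",
    "http://example.com/docs/123",   # placeholder URL
    "scholarship application",
    fake_download,
)
```

Checking the description before downloading means that only documents likely to be the wanted model essay type are fetched.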
Embodiment four
This embodiment provides an apparatus for setting up a model essay webpage database. The model essay webpage database is set up on the server side of a search service provider and is called by the search engine when a user sends it a model essay search request.
Fig. 6 shows a schematic structural diagram of the apparatus for setting up a model essay webpage database according to this embodiment, which includes:
webpage capture unit 602, adapted to capture model essay webpages from websites;
model essay data unit 604, adapted to extract the model essay data of the model essay webpages; and
database unit 606, adapted to set up the model essay webpage database, wherein the model essay webpage database comprises multiple model essay webpage data items, each including the model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage.
According to the present embodiment described device, webpage capture unit 602 is suitable to provide model essay on internet by web crawlers The model essay webpage of source website is captured.Web crawlers is a technology maturation, can automatically extract the journey of webpage on internet Sequence, it is search engine contained network page above and below internet according to set rule, is the important composition of search engine.
Model essay data cell 604 is suitable to for be stored in server side, the model essay webpage that has been crawled, according to institute The corresponding keyword of the model essay type to be extracted, extracts the model essay data of the type model essay webpage.
Further, model essay data cell 604 includes:Matching unit, suitable for by keyword and Ask-Answer Community webpage main building The word content of block is matched;Secondary building block determining unit, suitable in keyword and main building Block- matching, judging time text of building block Whether the number of words of word content is more than predetermined threshold, and if secondary building block word is more than predetermined threshold, it is determined that number of words is more than pre- It is to be extracted building block to determine the secondary building block of threshold value;And extraction unit, the model essay data suitable for extracting the webpage;Wherein described model Literary data include the title of the word content of to be extracted building block, the text of the word content of to be extracted building block, to be extracted time The number of words of the word content of building block.
Preferably, it is determined that the content number of words of secondary building block is more than after predetermined threshold, secondary building block determining unit is further by word Content and first keyword of the number more than the secondary building block of predetermined threshold(Keyword is determined in itself, or according to the keyword)Progress Match somebody with somebody;If it matches, determining that the secondary building block of matching is to be extracted building block.
Alternatively, model essay data cell 604 includes matching unit, suitable for by the text mark of the keyword and word webpage Topic is matched;And extraction unit, suitable in keyword and text title match, extracting the model essay data of the webpage;Its Described in model essay data include:The number of words of text title, body matter, and body matter.
Alternatively, model essay data cell 604 include matching unit, suitable for by the keyword with being retouched in the resource webpage of library The word content for stating correspondence model essay document is matched;Download unit, suitable for when keyword is matched with descriptive text, via institute State URL resource links and download model essay document;And extraction unit, the model essay data suitable for extracting the webpage;Wherein described model essay number According to including:The word content of model essay document, and the model essay document are described.
The database unit 606 is adapted to set up a model essay webpage database based on the extracted model essay data of the model essay webpages. The model essay webpage database includes multiple model essay webpage data items, each of which includes a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage. Typically, the model essay data include a model essay title, a model essay body, and a model essay word count.
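One possible layout for such a database is sketched below with Python's built-in `sqlite3` module. The table name `model_essay_pages`, the column names, and the choice of SQLite are assumptions for illustration; the text does not prescribe a storage engine or schema.

```python
import sqlite3
from typing import Iterable, Tuple

# (essay_type, title, body, word_count, url)
Record = Tuple[str, str, str, int, str]


def build_database(records: Iterable[Record], path: str = ":memory:") -> sqlite3.Connection:
    conn = sqlite3.connect(path)
    # One row per model essay webpage data item.
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS model_essay_pages (
            essay_type TEXT NOT NULL,
            title      TEXT NOT NULL,
            body       TEXT NOT NULL,
            word_count INTEGER NOT NULL,
            url        TEXT NOT NULL
        )
        """
    )
    conn.executemany(
        "INSERT INTO model_essay_pages VALUES (?, ?, ?, ?, ?)", records
    )
    conn.commit()
    return conn
```

Storing the word count as its own column means it can be filtered on and displayed without re-counting the body at query time.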
The device according to this embodiment sets up a model essay webpage database in which each model essay webpage data item contains model essay data such as the model essay word count. When a user sends a model essay search request, model essays can be retrieved from the model essay webpage database. For a search request that contains a word-count keyword, the database can provide more accurate search results. The database also makes it possible to show the user the word count of each model essay on the search results page, which makes it much easier for the user to choose among model essay webpages.
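To illustrate how a stored word count can sharpen retrieval, the following sketch filters records by model essay type and, when the request carries a target word count, keeps only essays within a tolerance band around it. The function name `search_model_essays`, the record layout, and the 20% tolerance are hypothetical; the text does not specify how a word-count keyword is matched.

```python
from typing import List, Optional


def search_model_essays(
    records: List[dict],
    essay_type: str,
    target_words: Optional[int] = None,
    tolerance: float = 0.2,
) -> List[dict]:
    # First narrow the results to the requested model essay type.
    hits = [r for r in records if r["essay_type"] == essay_type]
    if target_words is not None:
        # Assumption: an essay qualifies if its word count lies within
        # +/- tolerance of the requested word count.
        low = target_words * (1 - tolerance)
        high = target_words * (1 + tolerance)
        hits = [r for r in hits if low <= r["word_count"] <= high]
        # Closest word count first.
        hits.sort(key=lambda r: abs(r["word_count"] - target_words))
    return hits
```

Each returned record still carries its `word_count`, so a results page can display it next to the link.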
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other apparatus. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Moreover, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the invention described herein, and that the above description of a specific language is given to disclose the best mode of the invention.
Numerous specific details are set forth in the specification provided herein. It should be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques have not been shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid understanding of one or more of the various inventive aspects, features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, this method of disclosure should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single embodiment disclosed above. Therefore, the claims following the detailed description are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will appreciate that the modules in the device of an embodiment may be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of an embodiment may be combined into a single module, unit, or component, and may furthermore be divided into multiple sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings), and all processes or units of any method or device so disclosed, may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will appreciate that, although some embodiments described herein include some features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
Embodiments of the various components of the present invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for setting up a model essay webpage database according to embodiments of the invention. The invention may also be implemented as an apparatus or device program (for example, a computer program and a computer program product) for performing part or all of the method described herein. Such a program implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, and third does not indicate any order; these words may be interpreted as names.

Claims (14)

1. A method for setting up a model essay webpage database, including:
capturing model essay webpages of a website;
extracting model essay data of the model essay webpages according to a keyword and an extracting rule corresponding to the website type;
setting up the model essay webpage database; wherein
the model essay webpage database includes multiple model essay webpage data items, each model essay webpage data item including a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage, wherein the model essay type corresponds to the keyword.
2. The method according to claim 1, wherein the website type is a Q&A community website, whose webpages include a main post posing a question and secondary posts answering the question; the extracting rule corresponding to the Q&A community website includes:
matching the keyword against the text content of the main post of the webpage;
if they match, judging whether the word count of the text content of a secondary post is greater than a predetermined threshold;
if so, determining that the secondary post whose word count is greater than the predetermined threshold is a secondary post to be extracted; and
extracting the model essay data of the webpage; wherein the model essay data include: the title of the text content of the secondary post to be extracted, the body of the text content of the secondary post to be extracted, and the word count of the text content of the secondary post to be extracted.
3. The method according to claim 1 or 2, wherein the step of determining the secondary post to be extracted further includes:
determining a first keyword according to the keyword;
matching the first keyword against the text content of the secondary post whose word count is greater than the predetermined threshold;
if they match, determining that the matching secondary post is a secondary post to be extracted.
4. The method according to claim 1 or 2, wherein there are multiple secondary posts to be extracted in the webpage, and the model essay webpage data item corresponding to the webpage includes multiple model essay data items corresponding in number to the secondary posts to be extracted.
5. The method according to claim 1 or 2, wherein the website type is a text website, whose webpages include a text title and body content; the extracting rule corresponding to the text website includes:
matching the keyword against the text title;
if they match, extracting the model essay data of the webpage; wherein the model essay data include: the text title, the body content, and the word count of the body content.
6. The method according to claim 1 or 2, wherein the website type is a document library website, whose webpages include a URL resource link of a model essay document and text content describing the corresponding model essay document; the extracting rule corresponding to the document library website includes:
matching the keyword against the text content describing the corresponding model essay document;
if they match, downloading the model essay document via the URL resource link;
extracting the model essay data of the webpage; wherein the model essay data include: the text content describing the model essay document, and the model essay document.
7. The method according to claim 2, wherein the predetermined threshold is predetermined according to different model essay types.
8. A device for setting up a model essay webpage database, including:
a webpage capture unit, adapted to capture model essay webpages of a website;
a model essay data unit, adapted to extract model essay data of the model essay webpages according to a keyword and an extracting rule corresponding to the website type;
a database unit, adapted to set up the model essay webpage database; wherein
the model essay webpage database includes multiple model essay webpage data items, each model essay webpage data item including a model essay type, the model essay data of the model essay webpage, and the URL of the model essay webpage, wherein the model essay type corresponds to the keyword.
9. The device according to claim 8, wherein the website type is a Q&A community website, whose webpages include a main post posing a question and secondary posts answering the question; the model essay data unit further includes:
a matching unit, adapted to match the keyword against the text content of the main post of the webpage;
a secondary post determining unit, adapted to:
if they match, judge whether the word count of the text content of a secondary post is greater than a predetermined threshold;
if so, determine that the secondary post whose word count is greater than the predetermined threshold is a secondary post to be extracted; and
an extraction unit, adapted to extract the model essay data of the webpage; wherein the model essay data include: the title of the text content of the secondary post to be extracted, the body of the text content of the secondary post to be extracted, and the word count of the text content of the secondary post to be extracted.
10. The device according to claim 8 or 9, wherein the secondary post determining unit is further adapted to:
determine a first keyword according to the keyword;
match the first keyword against the text content of the secondary post whose word count is greater than the predetermined threshold;
if they match, determine that the matching secondary post is a secondary post to be extracted.
11. The device according to claim 8 or 9, wherein there are multiple secondary posts to be extracted in the webpage, and the model essay webpage data item corresponding to the webpage includes multiple model essay data items corresponding in number to the secondary posts to be extracted.
12. The device according to claim 8 or 9, wherein the website type is a text website, whose webpages include a text title and body content; the model essay data unit further includes:
a matching unit, adapted to match the keyword against the text title;
an extraction unit, adapted to:
if they match, extract the model essay data of the webpage; wherein the model essay data include: the text title, the body content, and the word count of the body content.
13. The device according to claim 8 or 9, wherein the website type is a document library website, whose webpages include a URL resource link of a model essay document and text content describing the corresponding model essay document; the model essay data unit further includes:
a matching unit, adapted to match the keyword against the text content describing the corresponding model essay document;
a download unit, adapted to download the model essay document via the URL resource link if they match; and
an extraction unit, adapted to extract the model essay data of the webpage; wherein the model essay data include: the text content describing the model essay document, and the model essay document.
14. The device according to claim 9, wherein the predetermined threshold is predetermined according to different model essay types.
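As an informal illustration of how the claimed elements fit together (not part of the claims), the sketch below selects an extracting rule by website type and looks up the word-count threshold by model essay type. All names (`THRESHOLDS`, `RULES`, `extract`), the threshold values, and the use of the keyword as the model essay type are assumptions made for this example.

```python
from typing import Callable, Dict, List

# Hypothetical per-type thresholds; the values are illustrative only.
THRESHOLDS: Dict[str, int] = {"resignation letter": 200, "thesis": 3000}
DEFAULT_THRESHOLD = 300


def threshold_for(essay_type: str) -> int:
    return THRESHOLDS.get(essay_type, DEFAULT_THRESHOLD)


def qa_rule(keyword: str, page: dict) -> List[dict]:
    # Q&A community rule: main post must match, long replies are kept.
    if keyword not in page["main_post"]:
        return []
    # Assumption: the keyword doubles as the model essay type.
    limit = threshold_for(keyword)
    return [{"body": p} for p in page["secondary_posts"] if len(p) > limit]


def text_rule(keyword: str, page: dict) -> List[dict]:
    # Text website rule: the title must match the keyword.
    return [page] if keyword in page["title"] else []


# One extracting rule per website type.
RULES: Dict[str, Callable[[str, dict], List[dict]]] = {
    "qa": qa_rule,
    "text": text_rule,
}


def extract(site_type: str, keyword: str, page: dict) -> List[dict]:
    rule = RULES.get(site_type)
    if rule is None:
        raise ValueError(f"no extracting rule for website type: {site_type}")
    return rule(keyword, page)
```

Keeping the rules in a lookup table means a further website type, such as a document library, can be supported by registering one more function.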
Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310684068.1A CN103699602B (en) 2013-12-13 2013-12-13 A kind of method and apparatus for setting up model essay webpage database

Publications (2)

Publication Number Publication Date
CN103699602A CN103699602A (en) 2014-04-02
CN103699602B true CN103699602B (en) 2017-08-29

Family

ID=50361130

Country Status (1)

Country Link
CN (1) CN103699602B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220718

Address after: Room 801, 8th floor, No. 104, floors 1-19, building 2, yard 6, Jiuxianqiao Road, Chaoyang District, Beijing 100015

Patentee after: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Address before: 100088 room 112, block D, 28 new street, new street, Xicheng District, Beijing (Desheng Park)

Patentee before: BEIJING QIHOO TECHNOLOGY Co.,Ltd.

Patentee before: Qizhi software (Beijing) Co.,Ltd.