CN106934036A - Method and system for aggregated querying of online learning resources - Google Patents
- Publication number
- CN106934036A (application number CN201710152062.8A)
- Authority
- CN
- China
- Prior art keywords
- code
- targeted website
- search
- sites
- website
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
Abstract
The invention provides a method and system for aggregated querying of online learning resources. It aims to overcome the drawbacks of existing approaches, which require technical support from the target websites and structured processing of the collected data, while avoiding copyright disputes and meeting the needs of personalized search. The technical scheme uses CURL multithreading functions to send the query request and query keyword to several target websites simultaneously, extracts the search result list region from the code returned by each target website with a regular expression, corrects the URLs in the returned code, and finally loads the returned code into the system's search results page. The advantages of the invention are: the unstructured data obtained from the target websites need not be converted into structured data; the data collected from the target websites need not be stored on the system's server, so no copyright dispute arises; and no technical support from the target websites is required. The scheme meets personalized search needs and is simple and practical.
Description
Technical field
The present invention relates to the field of computer network technology, and in particular to a method and system for aggregated querying of online learning resources.
Background
At present, learning resources on the Internet (such as courseware, instructional videos, and study documents) are very rich. Many learners like to search for online learning resources for self-study, and many teachers like to search for courseware and similar resources when preparing lessons. The most common method is to search for these resources with a general-purpose search engine (such as Baidu).
However, searching for learning resources with a general-purpose search engine usually turns up only scattered resources. In recent years, China's major publishing houses have concentrated on building supporting resources for their textbooks, and many publishing house websites have accumulated rich teaching resources that accompany those textbooks. Such resources are usually provided by the book authors and are relatively systematic. In addition, many MOOC websites and premium-course websites in China have accumulated large numbers of course resources. Yet the teaching resources on these websites can hardly be found through search engines such as Baidu, because a search engine's crawler cannot pass arbitrary user-entered keywords into a target website's search form to obtain the list of search results.
If a user visits each publishing house or course website separately and enters keywords on each one to search for teaching resources, the process is very tedious. One solution is to have each target website provide an interface that returns structured data (such as JSON or XML). Ajax (Asynchronous JavaScript and XML) or CURL (a command-line tool and library for transferring data with URLs) can then be used to obtain the structured data from the target websites and aggregate it on the results website.
A second solution is to run a unified query through the publishing houses' search interfaces, structure the returned data, store it in a local database, and then query the local database. Queries are fast this way, but because the content of the publishing house websites is copied into local storage, copyright disputes may arise.
In summary, current schemes for aggregated querying of online learning resources have at least one of the following shortcomings: 1. the target website must provide structured data and an access interface; 2. the content collected from the target website must be structured and stored in a local database, and copying the target website's content locally may cause copyright disputes; 3. queries run against the copied content in the local database, so there is no guarantee that the results reflect the latest content of the target website; 4. the target website must provide its database structure or other technical support.
Summary of the invention
The present invention is proposed to overcome the drawbacks of requiring technical support from the target websites and of having to structure the collected data, to avoid copyright disputes, and to meet the needs of personalized search. It provides a method that overcomes the above problems or at least partially solves them.
According to the present invention, a method for aggregated querying of online learning resources is provided, comprising the following steps:
First step: Store the information of all target websites to be queried, such as URL, encoding, and HTTP request method, in one table of a database (say the table is named sites, with the fields id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp). To add a new website to query, simply insert the new website's information into the sites table as one record.
Second step: The system provides a form on a web page in which the user enters a search keyword.
Third step: Obtain the search keyword and URL-encode it according to the encoding type of the target website recorded in the charset field of the sites table, so that the character encoding of the converted keyword matches that of the target website.
Fourth step: Use CURL multithreading functions to send the encoded keyword simultaneously to the search processing page of each target website (the url field of the sites table stores the address of the search processing page). If the postdata field in the sites table is not empty, embed the keyword in the postdata field value and send it to the target website by POST; if the postdata field is empty, send the keyword to the target website by GET.
Fifth step: Define an array to receive the HTML code of the search results page returned by each target website.
Sixth step: Convert all the returned HTML code to a unified encoding (for example, UTF-8).
Seventh step: Extract the search result list region. First, manually locate (for example, with the "Inspect" function of the Chrome browser) the code at the start and end of the search result list region. Then, based on that start and end code, hand-write a regular expression that matches the whole region and save it in the pregmatch field of the sites table. Finally, use a regular expression matching function (such as preg_match) to extract the search result content from the HTML code.
Eighth step: Correct the relative URLs of images and hyperlinks in the HTML code. First use a DOM (Document Object Model) manipulation class (such as simple html dom) to find all a elements and img elements in the returned HTML code, then prepend the domain name and path prefix string of the original website to the href attribute value of each a element and the src attribute value of each img element (the asrcp and imgsrcp fields of the sites table store these prefix strings).
Ninth step: Load the corrected search result list code into the system, placing each corrected code segment in its own HTML container element.
Tenth step: Add style code to the search result lists, lay out and beautify all the HTML container elements, and output them to the system's search results page.
Further, a back-end management system is provided with which the user can add, delete, and modify the information of the target websites. The user can set the value of the valid field to control whether a given target website is searched, and set the value of the sort field to control the order of the target websites.
The advantages of the invention are: the unstructured data obtained from the target websites need not be converted into structured data; the data collected from the target websites need not be stored on the system's server, so no copyright dispute arises; and no technical support from the target websites is required. The scheme meets personalized search needs and is simple and practical.
Brief description of the drawings
Fig. 1 is a flow chart of the working principle of an implementation of the present invention.
Fig. 2 is an exemplary schematic diagram of the web page display interface (the home page) according to one embodiment of the present invention.
Fig. 3 is a schematic diagram of the web page display interface (the search results page) according to one embodiment of the present invention.
Detailed description of the embodiments
The technical scheme is further described below with reference to the accompanying drawings and a specific embodiment.
The main idea of the technical scheme is to use CURL multithreading functions to send the query request and query keyword to several target websites simultaneously, extract the search result list region from the code returned by each target website with a regular expression, correct the URLs in the returned code, and finally load the returned code into the system's search results page.
The embodiment comprises a browser side and a server side, which may be connected over a network or deployed on the same computer. The server side has the web server software Apache, MySQL, and the CURL extension installed and configured.
The workflow of the embodiment comprises the following steps:
Step 101: Store the information of all target websites to be queried, such as URL and encoding, in one table of the database. The structure of the table is as follows:
sites(id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp)
The meaning of each field is as follows:
- id: sequence number
- name: website name
- url: URL of the website's search entry
- charset: encoding type of the website
- pregmatch: regular expression that matches the content region
- valid: whether the website is enabled
- postdata: stored POST data
- imgsrcp: URL prefix for images
- asrcp: URL prefix for hyperlinks
- sort: sort order
- descp: description
To add a new website to query, simply add its information to the sites table.
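The table in Step 101 could be sketched as follows. This is a minimal illustration, not the embodiment's actual schema: the column names come from the text, while the column types, the use of SQLite in place of MySQL, and the helper names `open_sites_db` and `add_site` are assumptions.

```python
import sqlite3

# Column names follow the sites table in the text; the types, the use of
# SQLite instead of MySQL, and the defaults are assumptions for illustration.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sites (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,      -- website name
    url       TEXT NOT NULL,      -- URL of the search entry
    charset   TEXT NOT NULL,      -- character encoding of the site
    pregmatch TEXT,               -- regex matching the result list region
    valid     INTEGER DEFAULT 1,  -- assumed: 1 means the site is searched
    postdata  TEXT,               -- stored POST data; empty means GET
    imgsrcp   TEXT,               -- prefix for relative image URLs
    asrcp     TEXT,               -- prefix for relative hyperlink URLs
    sort      INTEGER DEFAULT 0,  -- sort order
    descp     TEXT                -- description
)
"""


def open_sites_db(path=":memory:"):
    """Open a database and make sure the sites table exists."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def add_site(conn, **fields):
    """Register a new target site by inserting one record."""
    columns = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    cur = conn.execute(
        f"INSERT INTO sites ({columns}) VALUES ({marks})", tuple(fields.values())
    )
    conn.commit()
    return cur.lastrowid
```

Adding a site is then a single call such as `add_site(conn, name="Example Press", url="http://press.example/search?q=", charset="gbk")`, where the site name and URL are placeholders.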
Step 102: As shown in Fig. 2, the embodiment provides a search form on a web page in which the user enters a search keyword.
Step 103: Obtain the keyword from the form and URL-encode it according to the encoding type of the target website recorded in the sites table, so that the encoding of the converted keyword matches the encoding of the target website.
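Step 103 can be illustrated with a short Python sketch that percent-encodes the keyword in the target site's own character set. The function name `encode_keyword` is hypothetical, and Python's `urllib.parse.quote` stands in for whatever URL-encoding routine the embodiment uses.

```python
from urllib.parse import quote


def encode_keyword(keyword: str, charset: str) -> str:
    """Percent-encode the keyword using the target site's character encoding."""
    # safe="" also escapes "/" so the keyword cannot alter the URL structure.
    return quote(keyword, safe="", encoding=charset, errors="strict")
```

The same keyword yields different byte sequences per site: a GBK site and a UTF-8 site each receive the keyword in the encoding recorded in their charset field.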
Step 104: Use CURL multithreading functions to send the encoded keyword simultaneously to the search processing page of each target website. The address of the search processing page is the value of the url field in the sites table. If the postdata field in the sites table is not empty, embed the keyword in the postdata field value and send it to the target website by POST; if the postdata field is empty, send the keyword to the target website by GET.
The postdata field value is obtained manually (for example, with the "Inspect" function of the Chrome browser) by analysing the POST data that the target website sends to its server in the HTTP request, for example:
method1=1&keyzy=name&keyword=
This value is stored in advance in the postdata field.
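The GET/POST selection and the simultaneous dispatch of Step 104 might look like the sketch below. It makes several assumptions: a thread pool from the Python standard library stands in for the CURL multithreading functions, the keyword is appended to the end of the postdata template (which fits the example above, ending in `keyword=`) or to the end of the url value, and the names `build_request`, `fetch`, and `fetch_all` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def build_request(site: dict, encoded_keyword: str) -> Request:
    """Use POST when the site's postdata value is non-empty, otherwise GET."""
    postdata = site.get("postdata") or ""
    if postdata:
        # Assumption: the template ends with the keyword parameter.
        body = (postdata + encoded_keyword).encode("ascii")
        return Request(
            site["url"],
            data=body,
            method="POST",
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
    # Assumption: the url value ends where the keyword belongs.
    return Request(site["url"] + encoded_keyword, method="GET")


def fetch(request: Request, timeout: float = 10.0) -> bytes:
    """Send one request and return the raw response body."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_all(requests, fetch_one=fetch, max_workers=8):
    """Send all requests concurrently; results come back in request order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_one, requests))
```

In this sketch a failure on any one site raises out of `fetch_all`; a real deployment would probably catch errors per site so that one slow or unreachable website does not block the others.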
Step 105: Define an array to receive the HTML code of the search results page returned by each target website.
Step 106: Convert all the returned HTML code to a unified encoding (for example, UTF-8).
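Steps 105 and 106 amount to collecting the raw responses in a list and normalising their encodings. A minimal Python sketch, assuming each response is decoded with the charset recorded for its site (the name `decode_pages` is hypothetical):

```python
def decode_pages(raw_pages, charsets):
    """Decode each site's raw response with that site's recorded charset."""
    # Python strings are Unicode, so decoding is the unification step; the
    # text can be written out as UTF-8 when the results page is served.
    return [
        raw.decode(charset, errors="replace")
        for raw, charset in zip(raw_pages, charsets)
    ]
```

Using `errors="replace"` is a design choice for robustness: a page with a few invalid bytes still yields usable text instead of aborting the whole query.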
Step 107: Extract the search result list region. First, manually locate (for example, with the "Inspect" function of the Chrome browser) the code at the start and end of the search result list region. Then, based on that start and end code, hand-write a regular expression that matches the whole region and save it in the pregmatch field of the table. Finally, use the regular expression matching function preg_match to extract the content part.
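The extraction in Step 107 can be sketched with Python's `re.search` playing the role of preg_match. The function name `extract_result_region` and the sample pattern are assumptions; a pattern stored for PHP would carry delimiters and flags (for example a trailing `s`), which this sketch replaces with the `re.DOTALL` flag.

```python
import re


def extract_result_region(html: str, pattern: str):
    """Return the first region of the page matched by the stored pattern."""
    # DOTALL lets "." span line breaks, since result lists cover many lines.
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(0) if match else None
```

A hand-written pattern built from the head and tail code might be `<ul class="result">.*?</ul>`, where the non-greedy `.*?` stops at the first closing tag. Returning `None` when nothing matches lets the caller skip a site whose page layout has changed.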
Step 108: Correct the relative URLs of images and hyperlinks. First use a DOM (Document Object Model) manipulation class (simple_html_dom) to find all a elements and img elements in the returned code, then prepend the domain name and path prefix string of the original website to the href attribute value of each a element and the src attribute value of each img element (the asrcp and imgsrcp fields of the sites table store these prefix strings).
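The URL correction of Step 108 is sketched below. Note the swapped-in technique: the embodiment uses a DOM manipulation class, whereas this sketch uses regular-expression substitution from the Python standard library as a simplified stand-in. The function names are hypothetical, and the sketch assumes the stored prefix ends with "/" and the relative value does not start with one.

```python
import re

# Values that are already absolute (scheme, protocol-relative) or are
# in-page anchors are left untouched.
_ABSOLUTE = re.compile(r"^(?:[a-zA-Z][a-zA-Z0-9+.-]*:|//|#)")


def _prefix_attr(html, tag, attr, prefix):
    """Prepend prefix to relative values of one attribute on one tag type."""
    pattern = re.compile(
        rf'(<{tag}\b[^>]*?\b{attr}\s*=\s*)(["\'])(.*?)\2',
        re.IGNORECASE | re.DOTALL,
    )

    def fix(m):
        value = m.group(3)
        if _ABSOLUTE.match(value):
            return m.group(0)
        return f"{m.group(1)}{m.group(2)}{prefix}{value}{m.group(2)}"

    return pattern.sub(fix, html)


def fix_relative_urls(html, asrcp, imgsrcp):
    """Apply the hyperlink prefix to a/href and the image prefix to img/src."""
    html = _prefix_attr(html, "a", "href", asrcp)
    return _prefix_attr(html, "img", "src", imgsrcp)
```

A regex pass is easier to show in a few lines but is less robust than a real DOM parser, for example with unquoted attribute values, so the DOM approach described in the text remains the safer choice for production use.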
Step 109: Load the corrected search result list code. As shown in Fig. 3, this embodiment places the HTML code returned by each website in its own div with the class name cbs, and adds an h2 tag above it to hold the name of the publishing house or course website, together with a "More" link that points to the original search list page of the target website.
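The container markup of Step 109 could be generated as in the following sketch. The class name cbs, the h2 heading, and the "More" link come from the text; the function name `render_site_block`, the `more` class on the link, and the exact position of the link are assumptions.

```python
from html import escape


def render_site_block(site_name, search_url, fragment_html):
    """Wrap one site's corrected result fragment in its own container."""
    return (
        '<div class="cbs">'
        f"<h2>{escape(site_name)}</h2>"
        # Assumed placement: the link to the original result list follows
        # the heading and precedes the fragment.
        f'<a class="more" href="{escape(search_url, quote=True)}">More</a>'
        f"{fragment_html}"
        "</div>"
    )
```

The site name and URL are escaped because they come from the sites table, while the fragment is inserted as-is because it is already HTML returned by the target website.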
Step 110: Add CSS style code to the search result lists. For layout and appearance, the embodiment uses CSS to give the pictures from all collected target websites the same size and the text the same colour and size. It sets the float property on each list item in the search result list so that the text is arranged to the right of the picture, and applies an :after pseudo-element selector to each list item to clear the float.
To round out the functionality, this embodiment uses Ajax to add a waterfall-flow effect to the search results page: initially the page loads only part of the search results, and it loads the next part when the user scrolls down.
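The batching behind the waterfall-flow effect might be served by a helper like the one below, which an Ajax endpoint could call on each scroll event. This is only a sketch of the server-side slicing: the function name `next_batch`, the batch size, and the shape of the return value are assumptions, and the browser-side scroll handling is not shown.

```python
def next_batch(blocks, offset=0, size=3):
    """Return the next slice of result blocks plus paging information."""
    batch = blocks[offset:offset + size]
    next_offset = offset + len(batch)
    has_more = next_offset < len(blocks)
    return batch, next_offset, has_more
```

The client would start with `offset=0`, append the returned blocks to the page, and request again with `next_offset` until `has_more` is false.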
To round out the functionality, this embodiment also provides a back-end management system with which the user can add, delete, and modify the information of the target websites. The user can set the value of the valid field to control whether a given target website is searched, and set the value of the sort field to control the order of the target websites.
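The effect of the valid and sort fields can be sketched as a single query that selects the sites to search. The sketch assumes SQLite in place of MySQL, that valid = 1 means enabled, and that lower sort values come first; the function name `active_sites` is hypothetical.

```python
import sqlite3


def active_sites(conn):
    """Return the enabled target sites in their configured display order."""
    conn.row_factory = sqlite3.Row
    return conn.execute(
        "SELECT * FROM sites WHERE valid = 1 ORDER BY sort, id"
    ).fetchall()
```

With this arrangement the back-end management pages only need to update the two fields; the query path picks up the change on the next search without any code changes.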
Claims (7)
1. A method and system for aggregated querying of online learning resources, characterized by comprising the following steps:
First step: store the information of all target websites to be queried, such as URL, encoding, and HTTP request method, in one table of a database (say the table is named sites, with the fields id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp); to add a new website to query, simply insert the new website's information into the sites table as one record;
Second step: the system provides a form on a web page in which the user enters a search keyword;
Third step: obtain the search keyword and URL-encode it according to the encoding type of the target website recorded in the charset field of the sites table, so that the character encoding of the converted keyword matches that of the target website;
Fourth step: use CURL multithreading functions to send the encoded keyword simultaneously to the search processing page of each target website (the url field of the sites table stores the address of the search processing page); if the postdata field in the sites table is not empty, embed the keyword in the postdata field value and send it to the target website by POST; if the postdata field is empty, send the keyword to the target website by GET;
Fifth step: define an array to receive the HTML code of the search results page returned by each target website;
Sixth step: convert all the returned HTML code to a unified encoding;
Seventh step: extract the search result list region: first manually locate the code at the start and end of the search result list region, then hand-write a regular expression that matches the whole region based on that start and end code and save it in the pregmatch field of the sites table, and finally use a regular expression matching function (such as preg_match) to extract the search result content from the HTML code;
Eighth step: correct the relative URLs of images and hyperlinks in the HTML code: first use a DOM (Document Object Model) manipulation class (such as simple html dom) to find all a elements and img elements in the returned HTML code, then prepend the domain name and path prefix string of the original website to the href attribute value of each a element and the src attribute value of each img element (the asrcp and imgsrcp fields of the sites table store these prefix strings);
Ninth step: load the corrected search result list code into the system, placing each corrected code segment in its own HTML container element;
Tenth step: add style code to the search result lists, lay out and beautify all the HTML container elements, and output them to the system's search results page.
2. The method for aggregated querying of online learning resources according to claim 1, characterized in that, when sending requests, CURL multithreading functions are used to send them to all target websites simultaneously rather than one by one.
3. The method for aggregated querying of online learning resources according to claim 1, characterized in that the HTTP request method, POST or GET, is selected automatically according to the postdata field value.
4. The method for aggregated querying of online learning resources according to claim 1, characterized in that, if the request is sent by POST, the search keyword string is embedded in the string stored in the postdata field before sending.
5. The method for aggregated querying of online learning resources according to claim 1, characterized in that a regular expression is used to match the search result list region of the target website.
6. The method for aggregated querying of online learning resources according to claim 1, characterized in that a DOM manipulation class is used to correct the URLs of hyperlinks and image files in the code.
7. The method for aggregated querying of online learning resources according to claim 1, characterized in that the system does not store the HTML code of any target website's search results; instead, after layout and beautification with CSS style code, the code is output directly to the system's search results page for display.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710152062.8A (published as CN106934036A) | 2017-03-15 | 2017-03-15 | Method and system for aggregated querying of online learning resources |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106934036A | 2017-07-07 |
Family
ID=59432439
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710152062.8A (Pending, published as CN106934036A) | Method and system for aggregated querying of online learning resources | 2017-03-15 | 2017-03-15 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106934036A (en) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101882163A (en) * | 2010-06-30 | 2010-11-10 | 中国科学院地理科学与资源研究所 | Fuzzy Chinese address geographic evaluation method based on matching rule |
CN103124966A (en) * | 2010-07-09 | 2013-05-29 | 诺基亚公司 | Method and apparatus for aggregating and linking place data |
Title |
---|
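Zhang Wei: "Research on unified retrieval of heterogeneous digital resources based on CURL" (基于CURL异构数字资源统一检索的研究), Chinese Agricultural Science Bulletin (《中国农学通报》) * |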
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110196965A (en) * | 2018-02-26 | 2019-09-03 | 北大方正集团有限公司 | The method and device of XML file conversion Word file |
CN109271409A (en) * | 2018-11-08 | 2019-01-25 | 成都索贝数码科技股份有限公司 | Database fragmentation execution method based on container resource allocation |
CN109271409B (en) * | 2018-11-08 | 2021-11-02 | 成都索贝数码科技股份有限公司 | Database fragmentation execution method based on container resource allocation |
CN112783410A (en) * | 2019-11-07 | 2021-05-11 | 北京拉酷网络科技有限公司 | Information processing method, medium, device and computing equipment |
CN112783410B (en) * | 2019-11-07 | 2023-11-24 | 北京拉酷网络科技有限公司 | Information processing method, medium, device and computing equipment |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20170707 |