CN106934036A - Method and system for aggregated querying of online learning resources - Google Patents

Method and system for aggregated querying of online learning resources

Info

Publication number: CN106934036A
Authority: CN (China)
Prior art keywords: code, targeted website, search, sites, website
Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Application number: CN201710152062.8A
Other languages: Chinese (zh)
Inventors: 唐四薪, 林睦纲, 唐琼
Current Assignee: Hengyang Normal University (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Original Assignee: Hengyang Normal University
Priority date: 2017-03-15 (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date: 2017-03-15
Publication date: 2017-07-07
Application filed by Hengyang Normal University
Priority to CN201710152062.8A
Publication of CN106934036A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Abstract

The invention provides a method and system for aggregated querying of online learning resources. It aims to overcome the drawbacks of existing approaches, which require technical support from the target websites and require the collected data to be structured, while avoiding copyright disputes and meeting the need for personalized search. The technical solution uses the CURL multi-threading functions to send the query request and query keyword to several target websites simultaneously, uses a regular expression to extract the search result list region from the code returned by each target website, corrects the URLs in the returned code, and finally loads the returned code into the system's search results page. The advantages of the invention are: the unstructured data obtained from the target websites need not be converted into structured data; the data collected from the target websites need not be stored on the system's server, so no copyright dispute arises; and no technical support from the target websites is required. The approach meets personalized search needs and is simple and practical.

Description

Method and system for aggregated querying of online learning resources
Technical field
The present invention relates to the technical field of computer networks, and in particular to a method and system for aggregated querying of online learning resources.
Background
At present, learning resources on the Internet (such as courseware, instructional videos, and study documents) are very abundant. Many learners like to search for online learning resources for self-study, and many teachers like to search for courseware and similar resources when preparing lessons. The most common approach is to search for these resources with a general-purpose search engine (such as Baidu).
However, searching for learning resources with a general-purpose search engine usually turns up only scattered resources. In recent years, China's major publishing houses have focused on building supporting resources for their textbooks, and many publishing house websites have accumulated abundant teaching resources that accompany those textbooks. These resources are typically provided by the books' authors and are comparatively systematic. In addition, China's many existing MOOC websites and quality course websites have gathered a large volume of teaching resources within their many courses. Yet the resources on these websites can almost never be found through search engines such as Baidu, because a search engine's crawler cannot submit arbitrary user-entered keywords to a target website's search form to obtain the search result list.
If a user visits each publishing house or course website separately and enters the keyword on each site one by one to search for teaching resources, the process is very tedious. One solution is to have each target website provide a structured data interface (such as JSON or XML data). The structured data of the target websites can then be fetched with Ajax (Asynchronous JavaScript and XML) or CURL (Client URL, a command-line tool and library for transferring data addressed by URLs) and aggregated on the results site.
A second solution is to run a unified query through the publishing houses' search interfaces, structure the retrieved data, and store it in a local database, which is then queried. Queries of this kind are fast, but because the content of the publishing house websites is copied into local storage, copyright disputes may arise.
In short, current schemes for aggregated querying of online learning resources have one or more of the following shortcomings: 1. the target website must provide structured data and an access interface; 2. the content collected from the target website must be structured and stored in a local database, and copying the target website's content locally may cause copyright disputes; 3. queries run against the copied content in the local database, so there is no guarantee that the results reflect the latest content of the target website; 4. the target website must provide its database structure or other technical support.
Summary of the invention
The present invention is proposed to overcome the drawbacks of requiring technical support from the target websites and requiring the collected data to be structured, to avoid copyright disputes, and to meet the need for personalized search. It provides a method that overcomes the above problems or at least partially solves them.
According to the present invention, a method for aggregated querying of online learning resources is provided, comprising the following steps:
Step 1: Store the URL, character encoding, HTTP request method, and other information of every target website to be queried in one database table (suppose the table is named sites, with the fields id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp). To add a website to the query, simply insert its information into the sites table as one record.
Step 2: The system provides a form on a web page in which the user enters the search keyword.
Step 3: Obtain the search keyword and URL-encode it according to the target website's encoding type recorded in the charset field of the sites table, so that the character encoding of the converted keyword matches that of the target website.
Step 4: Use the CURL multi-threading functions to send the encoded keyword simultaneously to the search processing page of each target website (the url field of the sites table stores the address of the search processing page). If the postdata field in the sites table is not empty, embed the keyword in the postdata field value and send it to the target website by POST; if the postdata field is empty, send the keyword to the target website by GET.
Step 5: Define an array to receive the HTML code of the search results page returned by each target website.
Step 6: Convert all the returned HTML code to a single encoding (for example, UTF-8).
Step 7: Extract the search result list region. First, manually locate the start and end code of the search result list region (for example, with the "Inspect" function of the Chrome browser). Then, based on that start and end code, manually write a regular expression that matches the whole region and save it in the pregmatch field of the sites table. Finally, use a regular expression matching function (such as preg_match) to extract the search result content from the HTML code.
Step 8: Correct the relative URLs of images and hyperlinks in the HTML code. First use a DOM (Document Object Model) manipulation class (such as simple html dom) to find all a elements and img elements in the returned HTML code, then prepend the domain name and path prefix string of the original website to each src attribute value (the asrcp and imgsrcp fields of the sites table store these prefix strings).
Step 9: Load the corrected search result list code into the system, placing each corrected code fragment in its own HTML container element.
Step 10: Add style code to the search result lists, lay out and style all the HTML container elements, and output them to the system's search results page.
Further, a back-end management system is provided in which the user can add, delete, and modify target website information. The user can set the value of the valid field to control whether a given target website is searched, and set the value of the sort field to order the target websites.
The advantages of the invention are: the unstructured data obtained from the target websites need not be converted into structured data; the data collected from the target websites need not be stored on the system's server, so no copyright dispute arises; and no technical support from the target websites is required. The approach meets personalized search needs and is simple and practical.
Brief description of the drawings
Fig. 1 is a flow chart of the working principle of an implementation of the invention.
Fig. 2 is a schematic diagram of the web page display interface (the home page of the embodiment) according to one embodiment of the invention.
Fig. 3 is a schematic diagram of the web page display interface (the search results page of the embodiment) according to one embodiment of the invention.
Detailed description of the embodiment
The technical solution of the invention is further described below with reference to the drawings and a specific embodiment.
The main idea of the technical solution is to use the CURL multi-threading functions to send the query request and query keyword to several target websites simultaneously, use a regular expression to extract the search result list region from the code returned by each target website, correct the URLs in the returned code, and finally load the returned code into the system's search results page.
The embodiment comprises a browser side and a server side. The two can be connected over a network, or both can be deployed on the same computer. The server side has the web server software Apache, MySQL, and the CURL extension installed and configured.
The workflow of the embodiment comprises the following steps:
Step 101: Store the URL, encoding, and other information of every target website to be queried in one database table with the following structure:
sites(id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp)
The fields have the following meanings:
sites (serial number, website name, site search entry URL, encoding type of the website, regular expression matching the content region, whether the site is enabled, stored POST data, image URL prefix, hyperlink URL prefix, sort order, description)
To add a website to the query, only its information needs to be added to the sites table.
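The table in Step 101 can be sketched as follows. This is a minimal illustration in Python with SQLite, not the embodiment's own code (which runs on MySQL): the column names come from the text, while the column types, the `add_site` helper, and the sample site are assumptions.

```python
import sqlite3

# Column names follow the sites table in the text; the column types,
# the use of SQLite, and the sample row are assumptions for illustration.
SCHEMA = """
CREATE TABLE sites (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,      -- website name
    url       TEXT NOT NULL,      -- search entry URL
    charset   TEXT NOT NULL,      -- character encoding of the site
    pregmatch TEXT,               -- regular expression for the result-list region
    valid     INTEGER DEFAULT 1,  -- 1 = include this site in searches
    postdata  TEXT,               -- POST body template; empty means GET
    imgsrcp   TEXT,               -- prefix for relative image URLs
    asrcp     TEXT,               -- prefix for relative hyperlink URLs
    sort      INTEGER DEFAULT 0,  -- display order
    descp     TEXT                -- description
)
"""


def add_site(conn, **fields):
    """Insert one site record; adding a new target site is a single INSERT.

    Column names are taken from the keyword arguments, so they must come
    from trusted code, never from user input.
    """
    columns = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    conn.execute(
        f"INSERT INTO sites ({columns}) VALUES ({marks})",
        tuple(fields.values()),
    )


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
add_site(
    conn,
    name="Example Press",  # hypothetical site
    url="http://www.example-press.com/search.php?q=",
    charset="gbk",
    postdata="",
)
```

Because every per-site difference lives in a row of this table, supporting another website needs no code change, only another call to `add_site`.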
Step 102: As shown in Fig. 2, the embodiment provides a search form on a web page in which the user enters the search keyword.
Step 103: Obtain the keyword from the form and URL-encode it according to the target website's encoding type recorded in the sites table, so that the encoding of the converted keyword matches that of the target website.
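The encoding conversion in Step 103 matters because the same keyword percent-encodes to different bytes under different character sets. A minimal sketch in Python, where the function name `encode_keyword` is an assumption:

```python
from urllib.parse import quote


def encode_keyword(keyword: str, charset: str) -> str:
    """Percent-encode the keyword using the target site's character encoding."""
    return quote(keyword, safe="", encoding=charset)


# The same keyword yields different query strings for a UTF-8 site
# and a GBK site, which is why the charset is stored per site.
utf8_form = encode_keyword("数据库", "utf-8")  # "%E6%95%B0%E6%8D%AE%E5%BA%93"
gbk_form = encode_keyword("数据库", "gbk")
```

A site that expects GBK would typically misread the UTF-8 form as garbled characters and return no results, so the conversion is done per target site before sending.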
Step 104: Use the CURL multi-threading functions to send the encoded keyword simultaneously to the search processing page of each target website. The address of the search processing page is the value of the url field in the sites table. If the postdata field in the sites table is not empty, embed the keyword in the postdata field value and send it to the target website by POST; if the postdata field is empty, send the keyword to the target website by GET.
The postdata field value is obtained manually (for example, with the "Inspect" function of the Chrome browser) by analyzing the POST data that the target website's search page sends to its server in the HTTP request, for example:
method1=1&keyzy=name&keyword=
This value is stored in the postdata field in advance.
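The request logic of Step 104 can be sketched as below. Python's thread pool stands in for the CURL multi functions named in the text, so this is a swapped-in technique, not the embodiment's implementation. The sketch assumes the stored POST template ends with `keyword=` (as in the example above) and that a GET site's stored URL ends with its query parameter; `build_request`, `fetch`, and `fetch_all` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def build_request(site: dict, encoded_keyword: str) -> Request:
    """Choose POST or GET from the postdata field, as the step describes."""
    if site.get("postdata"):
        # Assumption: the stored template ends with "keyword=", so the
        # encoded keyword is appended to it to form the POST body.
        body = (site["postdata"] + encoded_keyword).encode("ascii")
        return Request(site["url"], data=body, method="POST")
    # Assumption: for GET, the stored URL ends with the query parameter name.
    return Request(site["url"] + encoded_keyword, method="GET")


def fetch(request: Request, timeout: float = 10.0) -> bytes:
    """Send one request and return the raw response body."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_all(sites, encoded_keywords, fetch=fetch):
    """Send all requests concurrently; results keep the order of `sites`.

    `encoded_keywords` holds one keyword per site, already encoded in
    that site's charset.
    """
    requests = [build_request(s, k) for s, k in zip(sites, encoded_keywords)]
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        return list(pool.map(fetch, requests))
```

Sending the requests concurrently keeps the total wait close to that of the slowest site instead of the sum over all sites. In this sketch a failure on any one site raises from `fetch_all`; a fuller version would likely catch errors per site so the remaining results still display.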
Step 105: Define an array to receive the HTML code of the search results page returned by each target website.
Step 106: Convert all the returned HTML code to a single encoding (for example, UTF-8).
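Steps 105 and 106 amount to collecting the raw responses in a list and decoding each one with its site's charset. A minimal Python sketch follows; the function names and the choice to replace undecodable bytes are assumptions.

```python
def to_text(raw: bytes, charset: str) -> str:
    """Decode one response using the site's charset from the sites table.

    Undecodable bytes are replaced rather than raising, which is an
    assumption about the desired failure behavior.
    """
    return raw.decode(charset, errors="replace")


def unify(pages: list, charsets: list) -> list:
    """Decode every returned page so later steps work on one representation."""
    return [to_text(raw, cs) for raw, cs in zip(pages, charsets)]
```

In Python the decoded strings are Unicode text, and the UTF-8 bytes are produced only when the results page is written out; in a byte-oriented language the conversion would instead be an explicit transcoding call.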
Step 107: Extract the search result list region. First, manually locate the start and end code of the search result list region (for example, with the "Inspect" function of the Chrome browser). Then, based on that start and end code, manually write a regular expression that matches the whole region and save it in the pregmatch field of the data table. Finally, use the regular expression matching function preg_match to extract the content part.
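The extraction in Step 107 can be sketched with Python's `re` module in place of preg_match. The pattern shown is a hypothetical example of what might be stored in the pregmatch field; patterns written for PHP usually carry delimiters and flags (such as `/.../s`) that would have to be removed before use here.

```python
import re


def extract_result_list(html: str, pattern: str) -> str:
    """Return the first region of `html` matched by the stored pattern, or ""."""
    # DOTALL lets "." span line breaks, since result lists cover many lines.
    match = re.search(pattern, html, flags=re.DOTALL)
    return match.group(0) if match else ""


page = (
    '<div id="nav">menu</div>'
    '<ul class="result"><li>Book A</li>\n<li>Book B</li></ul>'
    "<div>footer</div>"
)
# Hypothetical pattern: non-greedy match from the list's opening tag to its end tag.
region = extract_result_list(page, r'<ul class="result">.*?</ul>')
```

The non-greedy `.*?` stops at the first closing tag, so a pattern like this works only when the list contains no nested `ul`. This is one reason the text has each pattern written by hand per site.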
Step 108: Correct the relative URLs of images and hyperlinks. First use a DOM (Document Object Model) manipulation class (simple_html_dom) to find all a elements and img elements in the returned code, then prepend the domain name and path prefix string of the original website to each src attribute value (the asrcp and imgsrcp fields of the sites table store these prefix strings).
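The URL correction in Step 108 can be sketched with Python's standard `html.parser` in place of simple_html_dom. The sketch assumes that hyperlinks carry their URL in `href` and images in `src`, and that absolute URLs, protocol-relative URLs, and pure fragments are left alone; the class and function names are hypothetical. Comments in the input are dropped by this simple re-serializer.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit


def is_relative(url: str) -> bool:
    """True for URLs with no scheme and no host, excluding pure fragments."""
    parts = urlsplit(url)
    return not parts.scheme and not parts.netloc and not url.startswith("#")


class PrefixRewriter(HTMLParser):
    """Re-serialize HTML, prefixing relative a/href and img/src values."""

    def __init__(self, a_prefix: str, img_prefix: str):
        super().__init__(convert_charrefs=False)
        self.targets = {"a": ("href", a_prefix), "img": ("src", img_prefix)}
        self.out = []

    def _tag(self, tag, attrs, closing=""):
        attr_name, prefix = self.targets.get(tag, (None, ""))
        parts = [tag]
        for name, value in attrs:
            if value is None:  # boolean attribute such as "hidden"
                parts.append(name)
                continue
            if name == attr_name and is_relative(value):
                value = prefix + value
            parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + closing + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._tag(tag, attrs, closing=" /"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")


def fix_relative_urls(html: str, a_prefix: str, img_prefix: str) -> str:
    """Apply the asrcp and imgsrcp prefixes to one extracted fragment."""
    rewriter = PrefixRewriter(a_prefix, img_prefix)
    rewriter.feed(html)
    rewriter.close()
    return "".join(rewriter.out)
```

Without this step, a link such as `/book/1.html` would resolve against the aggregating site's own domain and break. Storing the prefix per site, as the text describes, handles sites whose relative paths start from different base directories.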
Step 109: Load the corrected search result list code. As shown in Fig. 3, this embodiment places the HTML code returned by each website in its own div with the class name cbs, and adds an h2 element above it to hold the name of the publishing house or course website together with a "more" link that points to the original search list page of the target website.
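The wrapping in Step 109 can be sketched as a small template function. The class name cbs, the h2 heading, and the "more" link come from the text; the function name, its parameters, and the exact markup inside the heading are assumptions.

```python
from html import escape


def wrap_results(site_name: str, search_url: str, fragment: str) -> str:
    """Wrap one site's result fragment in a div.cbs with a heading and a "more" link."""
    return (
        '<div class="cbs">'
        f"<h2>{escape(site_name)}"
        f' <a href="{escape(search_url, quote=True)}">more</a></h2>'
        f"{fragment}"  # third-party HTML, inserted as-is
        "</div>"
    )


block = wrap_results(
    "Example Press",  # hypothetical site
    "http://www.example-press.com/search.php?q=a&page=1",
    "<ul><li>Book A</li></ul>",
)
```

The fragment is inserted unescaped because it is the markup to display. Since it comes from a third-party site, a deployment would probably also want to strip scripts and event-handler attributes from it, which the text does not cover.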
Step 110: Add CSS style code to the search result lists. For layout and styling, the embodiment uses CSS to give the pictures from all collected target websites the same size and the text the same color and size. It sets the float property on each list item in the search result list so that the text sits to the right of the picture, and clears the float on each list item with the :after pseudo-element selector.
To round out the functionality, this embodiment uses Ajax to add a waterfall-flow effect to the search results page: initially the page loads only part of the search results, and it loads the next part when the user scrolls down the page.
To round out the functionality further, this embodiment provides a back-end management system in which the user can add, delete, and modify target website information. The user can also set the value of the valid field to control whether a given target website is searched, and set the value of the sort field to order the target websites.
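The effect of the valid and sort fields can be sketched as a single query. The sketch uses Python with SQLite and a reduced table; the sample rows, the convention that valid = 1 means enabled, and ascending sort order are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Reduced version of the sites table with only the fields used here.
conn.execute(
    "CREATE TABLE sites (id INTEGER PRIMARY KEY, name TEXT, valid INTEGER, sort INTEGER)"
)
conn.executemany(
    "INSERT INTO sites (name, valid, sort) VALUES (?, ?, ?)",
    [("Press A", 1, 2), ("Course Site B", 0, 1), ("Press C", 1, 1)],  # hypothetical rows
)


def enabled_sites(conn):
    """Return the names of enabled sites in display order."""
    rows = conn.execute("SELECT name FROM sites WHERE valid = 1 ORDER BY sort, id")
    return [name for (name,) in rows]
```

With these rows, `enabled_sites(conn)` skips the disabled course site and lists Press C before Press A, so a user can personalize which sites are searched and in what order without any code change.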

Claims (7)

1. A method and system for aggregated querying of online learning resources, characterized by comprising the following steps: Step 1: store the URL, character encoding, HTTP request method, and other information of every target website to be queried in one database table (suppose the table is named sites, with the fields id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp); to add a website to the query, simply insert its information into the sites table as one record; Step 2: the system provides a form on a web page in which the user enters the search keyword; Step 3: obtain the search keyword and URL-encode it according to the target website's encoding type recorded in the charset field of the sites table, so that the character encoding of the converted keyword matches that of the target website; Step 4: use the CURL multi-threading functions to send the encoded keyword simultaneously to the search processing page of each target website (the url field of the sites table stores the address of the search processing page); if the postdata field in the sites table is not empty, embed the keyword in the postdata field value and send it to the target website by POST; if the postdata field is empty, send the keyword to the target website by GET; Step 5: define an array to receive the HTML code of the search results page returned by each target website; Step 6: convert all the returned HTML code to a single encoding; Step 7: extract the search result list region: first manually locate the start and end code of the search result list region, then, based on that start and end code, manually write a regular expression that matches the whole region and save it in the pregmatch field of the sites table, and finally use a regular expression matching function (such as preg_match) to extract the search result content from the HTML code; Step 8: correct the relative URLs of images and hyperlinks in the HTML code: first use a DOM (Document Object Model) manipulation class (such as simple html dom) to find all a elements and img elements in the returned HTML code, then prepend the domain name and path prefix string of the original website to each src attribute value (the asrcp and imgsrcp fields of the sites table store these prefix strings); Step 9: load the corrected search result list code into the system, placing each corrected code fragment in its own HTML container element; Step 10: add style code to the search result lists, lay out and style all the HTML container elements, and output them to the system's search results page.
2. The method for aggregated querying of online learning resources as claimed in claim 1, characterized in that the CURL multi-threading functions are used to send the requests to all target websites simultaneously rather than one at a time.
3. The method for aggregated querying of online learning resources as claimed in claim 1, characterized in that the HTTP request method, POST or GET, is selected automatically according to the postdata field value.
4. The method for aggregated querying of online learning resources as claimed in claim 1, characterized in that, if the request is sent by POST, the search keyword string is embedded in the string stored in the postdata field and sent.
5. The method for aggregated querying of online learning resources as claimed in claim 1, characterized in that a regular expression is used to match the search result list region of the target website.
6. The method for aggregated querying of online learning resources as claimed in claim 1, characterized in that a DOM manipulation class is used to correct the URLs of hyperlinks and image files in the code.
7. The method for aggregated querying of online learning resources as claimed in claim 1, characterized in that the system does not store the HTML code of any target website's search results, but styles and lays it out with CSS style code and outputs it directly to the system's search results page for display.
CN201710152062.8A 2017-03-15 2017-03-15 Method and system for aggregated querying of online learning resources Pending CN106934036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710152062.8A CN106934036A (en) 2017-03-15 2017-03-15 Method and system for aggregated querying of online learning resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710152062.8A CN106934036A (en) 2017-03-15 2017-03-15 Method and system for aggregated querying of online learning resources

Publications (1)

Publication Number Publication Date
CN106934036A (en) 2017-07-07

Family

ID=59432439

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710152062.8A Pending CN106934036A (en) 2017-03-15 2017-03-15 Method and system for aggregated querying of online learning resources

Country Status (1)

Country Link
CN (1) CN106934036A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109271409A (en) * 2018-11-08 2019-01-25 成都索贝数码科技股份有限公司 Database fragmentation execution method based on container resource allocation
CN110196965A (en) * 2018-02-26 2019-09-03 北大方正集团有限公司 The method and device of XML file conversion Word file
CN112783410A (en) * 2019-11-07 2021-05-11 北京拉酷网络科技有限公司 Information processing method, medium, device and computing equipment

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101882163A (en) * 2010-06-30 2010-11-10 中国科学院地理科学与资源研究所 Fuzzy Chinese address geographic evaluation method based on matching rule
CN103124966A (en) * 2010-07-09 2013-05-29 诺基亚公司 Method and apparatus for aggregating and linking place data

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
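Zhang Wei (张卫): "Research on unified retrieval of heterogeneous digital resources based on CURL" (基于CURL异构数字资源统一检索的研究), Chinese Agricultural Science Bulletin (《中国农学通报》) *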

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110196965A (en) * 2018-02-26 2019-09-03 北大方正集团有限公司 The method and device of XML file conversion Word file
CN109271409A (en) * 2018-11-08 2019-01-25 成都索贝数码科技股份有限公司 Database fragmentation execution method based on container resource allocation
CN109271409B (en) * 2018-11-08 2021-11-02 成都索贝数码科技股份有限公司 Database fragmentation execution method based on container resource allocation
CN112783410A (en) * 2019-11-07 2021-05-11 北京拉酷网络科技有限公司 Information processing method, medium, device and computing equipment
CN112783410B (en) * 2019-11-07 2023-11-24 北京拉酷网络科技有限公司 Information processing method, medium, device and computing equipment

Similar Documents

Publication Publication Date Title
CN103544176B (en) Method and apparatus for generating the page structure template corresponding to multiple pages
CN102063476B (en) Video searching method and system
Patil Swati et al. Search engine optimization: A study
CN103294781B (en) A kind of method and apparatus for processing page data
CN103631794B (en) A kind of method, apparatus and equipment for being ranked up to search result
TWI695277B (en) Automatic website data collection method
US20070198727A1 (en) Method, apparatus and system for extracting field-specific structured data from the web using sample
CN102982117B (en) Information search method and device
CN101097578A (en) Network resource searching method and system
US20230229714A1 (en) Identifying Information Using Referenced Text
CN106570750B (en) Browser plug-in-based automatic tax declaring method and system and browser plug-in
CN103399862B (en) Determine the method and apparatus of search index information corresponding to target query sequence
CN102982118A (en) Searching method and device based on favorites
CN102760150A (en) Webpage extraction method based on attribute reproduction and labeled path
CN106156143A (en) Page processor and web page processing method
CN104881428B (en) A kind of hum pattern extraction, search method and the device of hum pattern webpage
CN106934036A (en) Method and system for aggregated querying of online learning resources
JP5181192B2 (en) Method and apparatus for providing Internet search result information in a language circle
CN102314494A (en) Method and equipment for processing webpage contents
Bin et al. A study on tactics for college website at search engine optimization
CN105117434A (en) Webpage classification method and webpage classification system
CN101894109A (en) Database building method and device
CN102236713A (en) Digital television interaction service page information extraction method and device
WO2017000659A1 (en) Enriched uniform resource locator (url) identification method and apparatus
CN109558123A (en) The method of webpage conversion electrons book, electronic equipment, storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20170707
