CN106934036A

CN106934036A - Method and system for aggregate querying of network learning resources

Info

Publication number: CN106934036A
Application number: CN201710152062.8A
Authority: CN
Inventors: 唐四薪; 林睦纲; 唐琼
Original assignee: Hengyang Normal University
Current assignee: Hengyang Normal University
Priority date: 2017-03-15
Filing date: 2017-03-15
Publication date: 2017-07-07

Abstract

The invention provides a method and system for aggregate querying of network learning resources. Its purpose is to overcome two shortcomings of existing approaches, namely the need for target websites to provide technical support and the need to convert collected data into structured form, while avoiding copyright disputes and meeting personalized search needs. The technical scheme uses cURL multithreading functions to send the query request and query keyword to several target websites simultaneously, extracts the search result list region from the code each target website returns using a regular expression, corrects the URLs in the returned code, and finally loads the returned code into the system's search results page. The advantages of the invention are: unstructured data obtained from target websites need not be converted into structured data; data collected from target websites need not be stored on the system's server, so no copyright dispute arises; and target websites need not provide any technical support. The approach meets personalized search needs and is simple and practical.

Description

A method and system for aggregate querying of network learning resources

Technical field

The present invention relates to the technical field of computer networks, and in particular to a method and system for aggregate querying of network learning resources.

Background art

At present, learning resources on the Internet (such as courseware, instructional videos, and study documents) are very abundant. Many learners like to search for network learning resources for self-study, and many teachers like to search for courseware and other learning resources to prepare lessons. The most common method is to search for these resources with a general-purpose search engine (such as Baidu).

However, searching for learning resources with a general-purpose search engine typically turns up only scattered resources. In recent years, China's major publishing houses have focused on building supporting resources for their textbooks, and many publishing house websites have accumulated abundant teaching resources that accompany those textbooks. Such resources are usually provided by the book authors and are relatively systematic. In addition, China's many MOOC websites and quality-course websites have also gathered a vast amount of teaching resources. Yet the course resources on these websites can hardly be found through search engines such as Baidu, because a search engine's crawler cannot submit arbitrary user-entered keywords to a target website's search form to obtain a search result list.

If a user visits each publishing house or course website separately and enters keywords on each one to search for teaching resources, the process is very tedious. One solution is to have each target website provide a structured data interface (such as JSON or XML data). Ajax (Asynchronous JavaScript and XML) or cURL (a command-line tool and library for transferring data with URLs) can then be used to fetch the structured data from the target websites and aggregate it on the result website.

A second solution is to run a unified query through the publishing houses' search interfaces, structure the retrieved data, and store it in a local database, which is then queried. Queries in this approach are fast, but because the content of the publishing house websites is copied into local storage, copyright disputes may arise.

In summary, current schemes for aggregate querying of network learning resources have at least one of the following deficiencies: 1. the target website must provide structured data and an access interface; 2. the content collected from the target website must be structured and stored in a local database, and because this copies the target website's content locally, it may cause copyright disputes; 3. the query runs against copied content in the local database, so there is no guarantee that the results reflect the latest content of the target website; 4. the target website must provide its database structure or other technical support.

Summary of the invention

The present invention is proposed in order to overcome the need for target websites to provide technical support and the need to structure the collected data, while avoiding copyright disputes and meeting personalized search needs. It provides a method and system that overcome the above problems or at least partially solve them.

According to the present invention, a method for aggregate querying of network learning resources is provided, comprising the following steps:

First step: Store information about all target websites to be queried, such as URL, encoding, and HTTP request method, in a database table (suppose the table is named sites, with the fields id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp). To add a website to the query, simply insert the new website's information into the sites table as one record.

Second step: The system provides a form on a web page for the user to enter a search keyword.

Third step: Obtain the search keyword and URL-encode it according to the target website's encoding type recorded in the charset field of the sites table, so that the character encoding of the converted keyword matches the character encoding of the target website.

Fourth step: Use cURL multithreading functions to send the encoded keyword simultaneously to the search processing page of each target website (the url field of the sites table holds the address of the search processing page). If the postdata field value in the sites table is not empty, embed the keyword into the postdata field value and send it to the target website by POST; if the postdata field value is empty, send the keyword to the target website by GET.

Fifth step: Define an array to receive the HTML code of the search result page returned by each target website.

Sixth step: Convert all of the returned HTML code to a unified encoding (for example, UTF-8).

Seventh step: Extract the search result list region. First, manually (for example, with the "Inspect" function of the Chrome browser) find the start and end code of the search result list region; then, based on that start and end code, manually write a regular expression that matches the whole region and save it in the pregmatch field of the sites table; finally, use a regular expression matching function (such as preg_match) to extract the search result content from the HTML code.

Eighth step: Correct the relative URLs of images and hyperlinks in the HTML code. First use a DOM (Document Object Model) manipulation class (such as simple_html_dom) to find all a elements and img elements in the returned HTML code, then prepend the original website's domain name and path prefix string to their src attribute values (the asrcp and imgsrcp fields of the sites table hold the prefix strings).

Ninth step: Load the corrected search result list code into the system, placing each corrected code segment into its own HTML container element.

Tenth step: Add style code to the search result lists, lay out and beautify all of the HTML container elements, and output them to the system's search results page.

Further, a back-end management system is provided for the user to add, delete, and modify the information of the target websites. The user can set the value of the valid field to control whether a specified target website is searched, and set the value of the sort field to order the target websites.

The advantages of the invention are: unstructured data obtained from target websites need not be converted into structured data; data collected from target websites need not be stored on the system's server, so no copyright dispute arises; and target websites need not provide any technical support. The approach meets personalized search needs and is simple and practical.

Brief description of the drawings

Fig. 1 is a flow chart of the working principle of an implementation of the present invention.

Fig. 2 is a schematic diagram of a web page display interface (the home page of the embodiment) according to one embodiment of the present invention.

Fig. 3 is a schematic diagram of a web page display interface (the search results page of the embodiment) according to one embodiment of the present invention.

Detailed description of the embodiment

The technical scheme of the present invention is further described below with reference to the accompanying drawings and a specific embodiment.

The main idea of the technical scheme is to use cURL multithreading functions to send the query request and query keyword to several target websites simultaneously, extract the search result list region from the code each target website returns using a regular expression, correct the URLs in the returned code, and finally load the returned code into the system's search results page.

The embodiment comprises a browser side and a server side. The two may be connected over a network, or both may be deployed on the same computer. The server side has the web server software Apache, MySQL, and the cURL extension installed and configured.

The workflow of the embodiment comprises the following steps:

Step 101: Store information about all target websites to be queried, such as URL and encoding, in a database table with the following structure:

sites(id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp)

The meaning of each field, in the same order, is as follows:

sites (serial number, website name, URL of the site's search entry, encoding type of the website, regular expression matching code for the content area, whether valid, stored POST data, image URL prefix, hyperlink URL prefix, sort order, description)

To add a website to the query, simply add the new website's information to the sites table.
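The table in step 101 can be sketched as follows. This is only an illustration: the embodiment uses MySQL, whereas the sketch uses Python's built-in sqlite3 module, and the column types, defaults, helper names (open_sites_db, add_site), and example site are assumptions that do not come from the patent.

```python
import sqlite3

# Column names follow the sites table in the text; types and defaults are assumed.
SCHEMA = """
CREATE TABLE IF NOT EXISTS sites (
    id        INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,      -- website name
    url       TEXT NOT NULL,      -- address of the search processing page
    charset   TEXT NOT NULL,      -- encoding type of the website
    pregmatch TEXT,               -- regular expression for the result list region
    valid     INTEGER DEFAULT 1,  -- whether the site is searched
    postdata  TEXT,               -- stored POST data, empty for GET sites
    imgsrcp   TEXT,               -- prefix for image URLs
    asrcp     TEXT,               -- prefix for hyperlink URLs
    sort      INTEGER DEFAULT 0,  -- display order
    descp     TEXT                -- description
)
"""


def open_sites_db(path=":memory:"):
    """Open (or create) a database that holds the sites table."""
    conn = sqlite3.connect(path)
    conn.row_factory = sqlite3.Row
    conn.execute(SCHEMA)
    return conn


def add_site(conn, **fields):
    """Insert one target website as a single record and return its id.

    Field names are interpolated into the SQL text, so they must come
    from trusted code, never from user input.
    """
    columns = ", ".join(fields)
    marks = ", ".join("?" for _ in fields)
    cur = conn.execute(
        f"INSERT INTO sites ({columns}) VALUES ({marks})", tuple(fields.values())
    )
    conn.commit()
    return cur.lastrowid
```

Adding a new target website is then a single add_site call, which mirrors the statement that only one record needs to be inserted.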

Step 102: As shown in Fig. 2, the embodiment provides a search form on a web page for the user to enter a search keyword.

Step 103: Obtain the keyword from the form and URL-encode it according to the target website's encoding type recorded in the sites table, so that the encoding of the converted keyword matches the encoding of the target website.
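A minimal sketch of the per-site keyword encoding in step 103, written in Python rather than the PHP of the embodiment. The function name encode_keyword is hypothetical; the charset argument stands for the value of the charset field.

```python
from urllib.parse import quote


def encode_keyword(keyword, charset):
    """Percent-encode the keyword using the target website's character encoding."""
    # safe="" makes every reserved character, including "/", percent-encoded.
    return quote(keyword, safe="", encoding=charset, errors="strict")
```

The same keyword produces different byte sequences for a GBK site and a UTF-8 site, which is why the encoding has to be looked up per target website before the request is sent.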

Step 104: Use cURL multithreading functions to send the encoded keyword simultaneously to the search processing page of each target website. The address of the search processing page is the value of the url field in the sites table. If the postdata field value in the sites table is not empty, embed the keyword into the postdata field value and send it to the target website by POST; if the postdata field value is empty, send the keyword to the target website by GET.

The postdata field value is obtained manually (for example, with the "Inspect" function of the Chrome browser) by analyzing the POST data that the target website sends to its server in the HTTP request, for example:

method1=1&keyzy=name&keyword=

This value is stored in advance in the postdata field.
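Step 104 can be sketched as below, under several assumptions that are not stated in the patent: the keyword is embedded by appending it to the end of the stored postdata string (as the example value ending in "keyword=" suggests), the url field of a GET site already ends with its query parameter, and a Python thread pool stands in for the cURL multithreading functions of the PHP embodiment. The names build_request, fetch, and fetch_all are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def build_request(site, encoded_keyword):
    """Choose POST or GET from the postdata field and attach the keyword."""
    postdata = site.get("postdata")
    if postdata:
        # Assumption: the keyword is appended to the stored POST string.
        body = (postdata + encoded_keyword).encode("ascii")
        return Request(
            site["url"],
            data=body,
            method="POST",
            headers={"Content-Type": "application/x-www-form-urlencoded"},
        )
    # Assumption: the url field ends with the query parameter, e.g. "...?q=".
    return Request(site["url"] + encoded_keyword, method="GET")


def fetch(request, timeout=10):
    """Send one request and return the raw response bytes."""
    with urlopen(request, timeout=timeout) as response:
        return response.read()


def fetch_all(requests, fetch_one=fetch):
    """Send all requests concurrently; results keep the order of the requests.

    An exception raised for one site propagates to the caller; a real system
    would probably catch it and skip that site.
    """
    with ThreadPoolExecutor(max_workers=max(1, len(requests))) as pool:
        return list(pool.map(fetch_one, requests))
```

Because the requests run concurrently, the total waiting time is roughly that of the slowest target website rather than the sum over all of them. The list returned by fetch_all plays the role of the array that receives the returned HTML code in step 105.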

Step 105: Define an array to receive the HTML code of the search result page returned by each target website.

Step 106: Convert all of the returned HTML code to a unified encoding (for example, UTF-8).
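A minimal sketch of the encoding unification in step 106, assuming each response arrives as raw bytes together with the charset value recorded for its site. In Python the decoded str is the unified representation, and it is encoded as UTF-8 only when the results page is written out. The function name unify_pages is hypothetical.

```python
def unify_pages(raw_pages, charsets):
    """Decode each page from its site's encoding into a Python str."""
    unified = []
    for raw, charset in zip(raw_pages, charsets):
        # errors="replace" keeps the page usable if a few bytes are malformed.
        unified.append(raw.decode(charset, errors="replace"))
    return unified
```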

Step 107: Extract the search result list region. First, manually (for example, with the "Inspect" function of the Chrome browser) find the start and end code of the search result list region; then, based on that start and end code, manually write a regular expression that matches the whole region and save it in the pregmatch field of the data table; finally, use the regular expression matching function preg_match to extract the content part.
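The extraction in step 107 can be sketched with Python's re module in place of preg_match. The sketch assumes the pattern stored in pregmatch is written without PHP-style delimiters and that "." should also match newlines; the function name extract_result_list and the sample markup in the usage note are hypothetical.

```python
import re


def extract_result_list(html, pattern):
    """Return the search result list region matched by the stored pattern.

    If the pattern has a capture group, the first group is returned;
    otherwise the whole match is returned. No match yields an empty string.
    """
    match = re.search(pattern, html, flags=re.DOTALL)
    if match is None:
        return ""
    if match.re.groups:
        return match.group(1) or ""
    return match.group(0)
```

For a site whose results sit in a list marked up as ul class="list", a stored pattern such as <ul class="list">.*?</ul> would return that list and drop the site's navigation and footer.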

Step 108: Correct the relative URLs of images and hyperlinks. First use a DOM (Document Object Model) manipulation class (simple_html_dom) to find all a elements and img elements in the returned code, then prepend the original website's domain name and path prefix string to their src attribute values (the asrcp and imgsrcp fields of the sites table hold the prefix strings).
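A sketch of the URL correction in step 108, using Python's standard html.parser module as a stand-in for the simple_html_dom class of the embodiment. Two assumptions are made here that the text does not state: for a elements the attribute that is corrected is href, and URLs that already carry a scheme or host are left unchanged. The names UrlPrefixer and fix_urls are hypothetical.

```python
from html import escape
from html.parser import HTMLParser
from urllib.parse import urlsplit


def is_relative(url):
    """True when the URL has neither a scheme nor a network location."""
    parts = urlsplit(url)
    return not parts.scheme and not parts.netloc


class UrlPrefixer(HTMLParser):
    """Re-emit an HTML fragment, prefixing relative a/href and img/src values."""

    def __init__(self, a_prefix, img_prefix):
        super().__init__(convert_charrefs=False)
        self.targets = {"a": ("href", a_prefix), "img": ("src", img_prefix)}
        self.out = []

    def _render(self, tag, attrs, close=""):
        target = self.targets.get(tag)
        parts = [tag]
        for name, value in attrs:
            if value is None:  # attribute without a value, e.g. "hidden"
                parts.append(name)
                continue
            if target and name == target[0] and is_relative(value):
                value = target[1] + value
            parts.append(f'{name}="{escape(value, quote=True)}"')
        return "<" + " ".join(parts) + close + ">"

    def handle_starttag(self, tag, attrs):
        self.out.append(self._render(tag, attrs))

    def handle_startendtag(self, tag, attrs):
        self.out.append(self._render(tag, attrs, " /"))

    def handle_endtag(self, tag):
        self.out.append(f"</{tag}>")

    def handle_data(self, data):
        self.out.append(data)

    def handle_entityref(self, name):
        self.out.append(f"&{name};")

    def handle_charref(self, name):
        self.out.append(f"&#{name};")

    def handle_comment(self, data):
        self.out.append(f"<!--{data}-->")


def fix_urls(fragment, a_prefix, img_prefix):
    """Apply the asrcp and imgsrcp prefixes to one extracted fragment."""
    parser = UrlPrefixer(a_prefix, img_prefix)
    parser.feed(fragment)
    parser.close()
    return "".join(parser.out)
```

Without this correction, a relative link such as /book/1.html would resolve against the aggregating system's own domain and break once the fragment is shown on the results page.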

Step 109: Load the corrected search result list code. As shown in Fig. 3, this embodiment places the HTML code returned by each website into its own div with the class name cbs, and adds an h2 tag above it to hold the name of the publishing house or course website, together with a "more" link that points to the original search list page of the target website.
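The container described in step 109 can be sketched as a small rendering function. The exact markup is an assumption: the text specifies a div with class cbs, an h2 for the site name, and a "more" link, but not whether the link sits inside the h2. The function name wrap_site_block is hypothetical.

```python
from html import escape


def wrap_site_block(site_name, more_url, fragment):
    """Wrap one site's corrected result list in its own container element."""
    title = escape(site_name)
    link = escape(more_url, quote=True)
    return (
        '<div class="cbs">'
        f'<h2>{title} <a href="{link}">more</a></h2>'
        f"{fragment}"  # already-corrected HTML from the target website
        "</div>"
    )
```

Keeping each site in its own container lets the style code of the next step address all sites uniformly through the cbs class.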

Step 110: Add CSS style code to the search result lists. For layout and beautification, the embodiment uses CSS to give the pictures from all collected target websites the same size and the text the same color and size. It sets the float property on each list item in the search result lists so that the text is arranged to the right of the picture, and sets an :after pseudo-element selector on each list item to clear the float.

To round out the functionality, this embodiment uses Ajax to add a waterfall-flow effect to the search results page: initially the page loads only part of the search results, and the next part is loaded when the user scrolls down the page.

To round out the functionality, this embodiment also provides a back-end management system for the user to add, delete, and modify the information of the target websites. The user can set the value of the valid field to control whether a specified target website is searched, and set the value of the sort field to order the target websites.
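How the valid and sort fields might drive site selection can be sketched as follows, again with sqlite3 standing in for MySQL. The sketch assumes that valid = 1 means the site is searched and that smaller sort values are shown first; neither convention is stated in the patent, and the function names are hypothetical. It expects a sites table with at least the id, name, valid, and sort columns.

```python
import sqlite3


def active_sites(conn):
    """Names of the target websites to search, in display order."""
    rows = conn.execute("SELECT name FROM sites WHERE valid = 1 ORDER BY sort, id")
    return [row[0] for row in rows]


def set_site_flags(conn, site_id, valid=None, sort=None):
    """Update the valid and/or sort field of one target website."""
    if valid is not None:
        conn.execute("UPDATE sites SET valid = ? WHERE id = ?", (int(valid), site_id))
    if sort is not None:
        conn.execute("UPDATE sites SET sort = ? WHERE id = ?", (sort, site_id))
    conn.commit()
```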

Claims

1. A method and system for aggregate querying of network learning resources, characterized by comprising the following steps: First step: store information about all target websites to be queried, such as URL, encoding, and HTTP request method, in a database table (suppose the table is named sites, with the fields id, name, url, charset, pregmatch, valid, postdata, imgsrcp, asrcp, sort, descp), and, to add a website to the query, simply insert the new website's information into the sites table as one record; Second step: the system provides a form on a web page for the user to enter a search keyword; Third step: obtain the search keyword and URL-encode it according to the target website's encoding type recorded in the charset field of the sites table, so that the character encoding of the converted keyword matches the character encoding of the target website; Fourth step: use cURL multithreading functions to send the encoded keyword simultaneously to the search processing page of each target website (the url field of the sites table holds the address of the search processing page), wherein, if the postdata field value in the sites table is not empty, the keyword is embedded into the postdata field value and sent to the target website by POST, and, if the postdata field value is empty, the keyword is sent to the target website by GET; Fifth step: define an array to receive the HTML code of the search result page returned by each target website; Sixth step: convert all of the returned HTML code to a unified encoding; Seventh step: extract the search result list region: first manually find the start and end code of the search result list region, then, based on that start and end code, manually write a regular expression that matches the whole region and save it in the pregmatch field of the sites table, and finally use a regular expression matching function (such as preg_match) to extract the search result content from the HTML code; Eighth step: correct the relative URLs of images and hyperlinks in the HTML code: first use a DOM (Document Object Model) manipulation class (such as simple_html_dom) to find all a elements and img elements in the returned HTML code, then prepend the original website's domain name and path prefix string to their src attribute values (the asrcp and imgsrcp fields of the sites table hold the prefix strings); Ninth step: load the corrected search result list code into the system, placing each corrected code segment into its own HTML container element; Tenth step: add style code to the search result lists, lay out and beautify all of the HTML container elements, and output them to the system's search results page.

2. The method for aggregate querying of network learning resources according to claim 1, characterized in that, when requests are sent, cURL multithreading functions are used to send them to all target websites simultaneously rather than one by one.

3. The method for aggregate querying of network learning resources according to claim 1, characterized in that the HTTP request method, POST or GET, is selected automatically according to the postdata field value.

4. The method for aggregate querying of network learning resources according to claim 1, characterized in that, if the request is sent by POST, the search keyword string is embedded into the string stored in the postdata field and then sent.

5. The method for aggregate querying of network learning resources according to claim 1, characterized in that a regular expression is used to match the search result list region of the target website.

6. The method for aggregate querying of network learning resources according to claim 1, characterized in that a DOM manipulation class is used to correct the URLs of hyperlinks and image files in the code.

7. The method for aggregate querying of network learning resources according to claim 1, characterized in that the system does not store the HTML code of any target website's search results; instead, after layout and beautification with CSS style code, the code is output directly to the system's search results page for display.