CN101477576B - Method, equipment and system for providing network materials to search engine - Google Patents

Info

Publication number
CN101477576B
Authority
CN
China
Prior art keywords
message
html
http message
user
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN 200910105235
Other languages
Chinese (zh)
Other versions
CN101477576A (en)
Inventor
张瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN 200910105235 priority Critical patent/CN101477576B/en
Publication of CN101477576A publication Critical patent/CN101477576A/en
Application granted granted Critical
Publication of CN101477576B publication Critical patent/CN101477576B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention relates to the field of telecommunications and discloses a method for obtaining network materials for a search engine. The method comprises the following steps: a message from the network side is received, and when the received message is an HTTP message, the web page information carried in the message is obtained and stored. The method can obtain materials immediately after a user accesses the network, so that the search engine's network materials database is updated in a timely manner. The invention further discloses a network device and a system for providing network materials to a search engine.

Description

Method, device and system for obtaining network materials for a search engine
Technical field
The present invention relates to the communications field, and in particular to a method, device and system for obtaining network materials for a search engine.
Background
With the spread of the Internet, more and more websites have appeared, and how to find the desired content among so many websites has become a major concern for Internet users. Search engines offer a good answer to this problem. A user logs in to a search engine and enters the keyword to be searched; the search engine searches its own materials database for that keyword and returns the relevant web pages or website links to the user. The quality of a search engine depends to a large extent on how its materials database is built.
Early search engines built their materials databases by manual entry. The greatest benefit of this approach is that the search results are fully controllable, and the ranking can be adjusted in exchange for advertising fees. However, manual entry is very inefficient: thousands of websites appear or disappear every day, and the content of existing websites is continually updated, so manual entry misses many newly emerging websites and web pages.
As technology developed, a new generation of search engines adopted web spider (Spider) technology to build their materials databases. With this technology, the search engine's server visits the web pages of other websites and analyzes their content; when it finds a hyperlink in a page, it automatically follows that hyperlink to the linked page. All web pages visited in this process are recorded in the materials database. Web spider technology automates the collection of search engine materials, and the timeliness of the materials is substantially better than with manual entry. However, material collection by web spider is limited by access speed, and one update of the materials database generally takes 1-2 days. Moreover, web spider technology can hardly identify the category or identity of users, and therefore cannot provide targeted search services.
Summary of the invention
In view of this, embodiments of the present invention provide a method, device and system for providing network materials to a search engine, to solve the prior-art problem that the search engine's network materials database is updated slowly.
A method for providing network materials to a search engine comprises:
receiving a message from the network side;
when the message is a Hypertext Transfer Protocol (HTTP) message, obtaining the web page information carried in the HTTP message; and
storing the web page information carried in the HTTP message.
A network device for providing network materials to a search engine comprises:
a receiving module, configured to receive a message from the network side;
a parsing module, configured to obtain the web page information carried in the HTTP message when the message received by the receiving module is an HTTP message; and
a storage module, configured to store the web page information that the parsing module obtained from the HTTP message.
A system for providing network materials to a search engine comprises:
a network device, configured to obtain the web page information carried in an HTTP message from the network side and to send the obtained web page information to a storage device; and
the storage device, configured to store the received web page information.
With the method and devices provided by the embodiments of the present invention, materials can be obtained immediately after a user accesses the network, so that the search engine's network materials database is updated in a timely manner.
Description of the drawings
Fig. 1 is a flowchart of a method for providing network materials to a search engine according to one embodiment of the present invention;
Fig. 2 shows a network device for providing network materials to a search engine according to another embodiment of the present invention;
Fig. 3 is a structural diagram of the parsing module 210 in another embodiment of the present invention;
Fig. 4 shows a network device for providing network materials to a search engine according to another embodiment of the present invention;
Fig. 5 shows a network device for providing network materials to a search engine according to another embodiment of the present invention;
Fig. 6 shows a system for providing network materials to a search engine according to another embodiment of the present invention.
Detailed description of the embodiments
To make the purpose, technical solutions and advantages of the embodiments of the present invention clearer, the embodiments are described in further detail below with reference to the accompanying drawings.
HTTP (Hypertext Transfer Protocol) is the transfer protocol used to deliver hypertext from a website's server to the user. It not only ensures that hypertext documents are transferred correctly and quickly, but can also determine which part of a document is transferred and which part of the content is displayed first. HTML (Hypertext Markup Language) is the most widely used language on the network today and is the main language in which web documents are written.
Generally, HTTP communication between a user and a website's server proceeds as follows:
The user sends an HTTP request to the website's server;
After receiving the request, the website's server sends the corresponding HTTP response to the user;
The HTTP response returned by the website's server carries not only the web page information of the website but also information related to the user.
Therefore, in the embodiments of the present invention, the HTTP message from the network side is parsed to obtain the web page information it carries. Materials can thus be obtained immediately after a user accesses the network, so that the search engine's network materials database is updated in a timely manner.
Fig. 1 is a flowchart of a method for providing network materials to a search engine according to one embodiment of the present invention. As shown in Fig. 1, the method comprises:
100. Receive a message from the network side.
110. When the message is an HTTP message, obtain the web page information carried in the HTTP message.
On a network, HTTP communication usually takes place over a TCP/IP (Transmission Control Protocol/Internet Protocol) connection, and the default port is TCP 80. Therefore, in this embodiment, the message is parsed to obtain its source port number and protocol type; when the source port number of the message is 80 and the protocol type is TCP, the message is determined to be an HTTP message. Of course, HTTP can also run over other protocols, and other ports can be used as well; these cases are not described further here.
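The port-and-protocol check described above can be sketched as follows. This is only an illustration under stated assumptions: the message is taken to be a raw IPv4 packet held in a byte string, and the function name `is_http_message` is hypothetical and does not come from the patent.

```python
import struct

TCP_PROTOCOL = 6   # IANA protocol number for TCP
HTTP_PORT = 80     # default HTTP port named in the description


def is_http_message(ip_packet: bytes) -> bool:
    """Return True if an IPv4 packet is TCP with source port 80.

    Assumption: ip_packet starts at the IPv4 header (no link-layer framing).
    """
    if len(ip_packet) < 20:
        return False
    version = ip_packet[0] >> 4
    header_len = (ip_packet[0] & 0x0F) * 4  # IHL field counts 32-bit words
    if version != 4 or header_len < 20 or len(ip_packet) < header_len + 4:
        return False
    if ip_packet[9] != TCP_PROTOCOL:        # byte 9 is the protocol field
        return False
    # The TCP source port is the first 16-bit field after the IP header.
    (source_port,) = struct.unpack_from("!H", ip_packet, header_len)
    return source_port == HTTP_PORT
```

As the description itself notes, HTTP may use other ports, so a check of this kind would miss such traffic.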
In this embodiment, the web page information carried in the message can be obtained by identifying the HTML head and the HTML tail in the HTTP message.
HTML can be understood as text composed of a series of tags. Generally, these tags appear in pairs, called an opening tag and a closing tag. For example, the header of a web page lies between the <head> tag and the </head> tag. The body of the web page lies between the <body> tag and the </body> tag; that is, everything displayed on the page is contained between these two tags. The whole HTML page begins with the <html> tag and ends with the </html> tag. Here, <head>, <body> and <html> are opening tags, and </head>, </body> and </html> are closing tags. Therefore, in this embodiment, identifying the HTML head and the HTML tail in the HTTP message means identifying the <html> tag and the </html> tag; the information in the message from the <html> tag to the </html> tag is the web page information.
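A minimal sketch of identifying the HTML head and HTML tail is given below. It assumes that the whole page arrives in a single, uncompressed payload; reassembly across several TCP segments, chunked transfer encoding and compression are outside the sketch. The name `extract_web_page` is hypothetical.

```python
import re
from typing import Optional

# Opening <html ...> tag (attributes allowed) and closing </html> tag.
HTML_HEAD = re.compile(rb"<html\b[^>]*>", re.IGNORECASE)
HTML_TAIL = re.compile(rb"</html\s*>", re.IGNORECASE)


def extract_web_page(payload: bytes) -> Optional[bytes]:
    """Return the bytes from the <html> tag through the </html> tag.

    Returns None when either tag is missing from the payload.
    """
    head = HTML_HEAD.search(payload)
    if head is None:
        return None
    tail = HTML_TAIL.search(payload, head.end())
    if tail is None:
        return None
    return payload[head.start():tail.end()]
```

The returned span includes both tags, which matches the description's statement that the web page information runs from the <html> tag to the </html> tag.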
120. Store the web page information carried in the HTTP message.
In this embodiment, the stored web page information can be classified. For example, classifying by source IP address, that is, by the IP address of the web page, yields content similar to that obtained by web spider technology; further, other web page content linked from these pages can be obtained by web spider technology.
Of course, the web page information can also be classified according to actual needs, for example by the frequency with which words appear in the web page information. The embodiments of the present invention do not limit the basis for classification.
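The two classification bases mentioned above, source IP address and word frequency, might be sketched as follows. The record layout (pairs of source IP and HTML text) and both function names are assumptions made for illustration; the tokenizer only handles ASCII letters and digits.

```python
import re
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple

TAG = re.compile(r"<[^>]+>")       # crude tag stripper, enough for a sketch
WORD = re.compile(r"[a-z0-9]+")    # ASCII words only


def classify_by_source_ip(records: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group stored pages by the IP address of the web page's server."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for source_ip, html in records:
        groups[source_ip].append(html)
    return dict(groups)


def word_frequencies(html: str) -> Counter:
    """Count how often each word appears in the visible text of a page."""
    text = TAG.sub(" ", html).lower()
    return Counter(WORD.findall(text))
```

Word counts of this kind could then feed whatever classification rule is chosen; the patent leaves that rule open.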
With the method for providing network materials to a search engine provided by this embodiment, materials can be obtained immediately after the user accesses a web page, so that the search engine's network materials database is updated in a timely manner.
Preferably, in another embodiment of the present invention, identification information of the user can also be obtained from the received HTTP message and stored. The identification information of the user may be the user's IP address or the user's user name.
For example, the HTTP message carries user-related information such as the physical address, the IP address (that is, the destination IP address in the HTTP message) and the port number. After the HTTP message is received, obtaining its destination IP address yields the user's IP address. Of course, after the HTTP message is received, the user's physical address, IP address, port number and similar information in the message can also be matched to obtain the user's user name. Here, the user name may be the account with which the user logs in to the network, the user's mobile phone number, or other information that can identify the user.
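Obtaining the user's IP address from the destination address, and then matching it to a user name, could look like the sketch below. The `subscribers` table that maps IP addresses to user names is a hypothetical stand-in for whatever session data the device holds, and the message is again assumed to be a raw IPv4 packet.

```python
import ipaddress
from typing import Mapping, Optional, Tuple


def destination_ip(ip_packet: bytes) -> str:
    """Return the destination address of an IPv4 packet (header bytes 16-19)."""
    if len(ip_packet) < 20 or ip_packet[0] >> 4 != 4:
        raise ValueError("not an IPv4 packet")
    return str(ipaddress.IPv4Address(ip_packet[16:20]))


def identify_user(ip_packet: bytes,
                  subscribers: Mapping[str, str]) -> Tuple[str, Optional[str]]:
    """Return (user IP, user name); the name is None if no match is found.

    For a response travelling from the website to the user, the
    destination address is the user's address.
    """
    user_ip = destination_ip(ip_packet)
    return user_ip, subscribers.get(user_ip)
```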
Once the identification information of the user has been obtained, the stored web page information can be classified by user to obtain the content accessed by a specific user. For example, one can learn which web content a user visits frequently and which content the user prefers. With this information, more targeted information can be provided to the user, for example advertising related to the user's preferences.
Of course, once the identification information of the user has been obtained, different search results can also be provided to different users, realizing user-based search engine functionality. For example, for a high-end business user searching with a catering keyword, results that rank business hotel information first are provided; for a low-end user, results that rank inexpensive restaurants first are provided. For teenage users, unhealthy information can be filtered out.
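The user-based ranking example above can be illustrated with a small sketch. The patent does not say how results or user classes are represented, so the `Result` fields, the `price_level` scale and the class labels `"business"`, `"budget"` and `"teen"` are all assumptions chosen only to make the idea concrete.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class Result:
    title: str
    price_level: int          # hypothetical scale: 1 = inexpensive, 3 = expensive
    adult_only: bool = False  # hypothetical flag for content unsuitable for minors


def rank_for_user(results: Sequence[Result], user_class: str) -> List[Result]:
    """Order or filter search results according to a user class."""
    if user_class == "teen":
        # Filter unsuitable results, keep the original order otherwise.
        return [r for r in results if not r.adult_only]
    if user_class == "business":
        # Upmarket venues first.
        return sorted(results, key=lambda r: -r.price_level)
    if user_class == "budget":
        # Inexpensive venues first.
        return sorted(results, key=lambda r: r.price_level)
    return list(results)
```

Here price is used as a rough proxy for "business hotel" versus "inexpensive restaurant"; a real system would need a richer notion of category.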
The method for providing network materials to a search engine provided by this further embodiment not only updates the network materials database in a timely manner but also enables user-based search engine services.
Another embodiment of the present invention provides a network device for providing network materials to a search engine. As shown in Fig. 2, the network device comprises:
a receiving module 200, configured to receive a message from the network side;
a parsing module 210, configured to obtain the web page information carried in the HTTP message when the message received by the receiving module 200 is an HTTP message; and
a storage module 220, configured to store the web page information that the parsing module 210 obtained from the HTTP message.
Preferably, in another embodiment of the present invention, as shown in Fig. 3, the parsing module 210 comprises:
an identifying unit 211, configured to identify the HTML head and the HTML tail in the HTTP message; and
an obtaining unit 212, configured to obtain the web page information carried in the HTTP message; in this embodiment, the web page information comprises the HTML head, the HTML tail, and the information between the HTML head and the HTML tail.
Preferably, in another embodiment of the present invention, as shown in Fig. 4, in addition to the receiving module 200, the parsing module 210 and the storage module 220, the network device for providing network materials to a search engine further comprises:
a judging module 230, configured to judge whether the message from the network side received by the receiving module 200 is an HTTP message, and to trigger the parsing module 210 when the message received by the receiving module 200 is an HTTP message.
Preferably, in another embodiment of the present invention, the storage module 220 is further configured to store identification information of the user obtained from the HTTP message received by the receiving module 200. The identification information of the user may be the user's IP address or the user's user name.
For example, the HTTP message carries user-related information such as the physical address, the IP address (that is, the destination IP address in the HTTP message) and the port number. After the network device receives the HTTP message, obtaining its destination IP address yields the user's IP address. Of course, after receiving the HTTP message, the network device can also match the user's physical address, IP address, port number and similar information in the message (in this embodiment, the user names of users are stored in the network device) to obtain the user's user name. Here, the user name may be the account with which the user logs in to the network, the user's mobile phone number, or other information that can identify the user.
Preferably, in another embodiment of the present invention, as shown in Fig. 5, the network device for providing network materials to a search engine may further comprise:
a classification module 240, configured to classify the web page information stored in the storage module 220.
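How the modules described above might fit together is sketched below. The `Message` record with already-decoded fields, the class and method names, and the case-sensitive tag search are simplifying assumptions; the reference numerals in the comments only indicate which module each method loosely corresponds to.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional

HTML_TAIL = "</html>"


@dataclass
class Message:
    """Hypothetical, already-decoded view of a message from the network side."""
    protocol: str
    source_port: int
    source_ip: str   # IP address of the web page's server
    dest_ip: str     # IP address of the user
    payload: str


@dataclass
class StoredPage:
    source_ip: str
    user_ip: str
    html: str


class NetworkDevice:
    def __init__(self) -> None:
        self.pages: List[StoredPage] = []          # storage module 220

    def receive(self, message: Message) -> None:   # receiving module 200
        if not self.is_http(message):              # judging module 230
            return
        html = self.parse(message)                 # parsing module 210
        if html is not None:
            self.pages.append(StoredPage(message.source_ip, message.dest_ip, html))

    @staticmethod
    def is_http(message: Message) -> bool:
        return message.protocol == "TCP" and message.source_port == 80

    @staticmethod
    def parse(message: Message) -> Optional[str]:
        start = message.payload.find("<html")
        if start == -1:
            return None
        end = message.payload.find(HTML_TAIL, start)
        if end == -1:
            return None
        return message.payload[start:end + len(HTML_TAIL)]

    def classify_by_user(self) -> Dict[str, List[str]]:  # classification module 240
        groups: Dict[str, List[str]] = defaultdict(list)
        for page in self.pages:
            groups[page.user_ip].append(page.html)
        return dict(groups)
```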
In a real network, the network device for providing network materials to a search engine described in the embodiments of the present invention may be a network device in a data communication network, for example a router, an SR (Service Router) or a BRAS (Broadband Remote Access Server); it may also be a network device in a wireless communication network, for example a GGSN (Gateway GPRS Support Node); of course, it may also be another network device.
With the network device for providing network materials to a search engine provided by the embodiments of the present invention, materials can be obtained on a single device, so that the search engine's network materials database is updated in a timely manner.
Of course, the embodiments of the present invention can be implemented not only on one device but also on multiple devices. As shown in Fig. 6, another embodiment of the present invention provides a system for providing network materials to a search engine, comprising a network device 300 and a storage device 310, wherein:
the network device 300 is configured to obtain the web page information carried in an HTTP message from the network side and to send the obtained web page information to the storage device 310; and
the storage device 310 is configured to store the received web page information.
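The split between the network device 300 and the storage device 310 can be sketched as two objects connected by a transport. The patent does not specify how the web page information is sent, so the JSON record format and the `send` callable used here are assumptions; in practice the transport would be some network connection between the two devices.

```python
import json
from typing import Callable, Dict, List


class StorageDevice:
    """Stands in for storage device 310: stores whatever it receives."""

    def __init__(self) -> None:
        self.pages: List[Dict[str, str]] = []

    def receive(self, record: bytes) -> None:
        self.pages.append(json.loads(record.decode("utf-8")))


class ForwardingNetworkDevice:
    """Stands in for network device 300: forwards extracted pages."""

    def __init__(self, send: Callable[[bytes], None]) -> None:
        self._send = send  # hypothetical transport to the storage device

    def forward_page(self, source_ip: str, html: str) -> None:
        record = json.dumps({"source_ip": source_ip, "html": html})
        self._send(record.encode("utf-8"))
```

Passing `storage.receive` directly as the transport keeps the sketch runnable in one process while leaving the choice of a real connection open.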
From the above description of the embodiments, those of ordinary skill in the art can clearly understand that the embodiments of the present invention can be implemented by software together with the necessary general-purpose hardware platform, and of course can also be implemented by hardware. Based on this understanding, the technical solutions of the embodiments of the present invention can be embodied in the form of a software product. The computer software product can be stored in a storage medium, such as a ROM/RAM, a magnetic disk or an optical disc, and includes instructions that cause a computer device, a server or another network device to carry out the methods described in the embodiments of the present invention or in parts of the embodiments.
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (10)

1. A method for obtaining network materials for a search engine, characterized by comprising:
receiving a message from the network side;
when the message is a Hypertext Transfer Protocol (HTTP) message, obtaining, for the search engine, the web page information carried in the HTTP message;
storing, for the search engine, the web page information carried in the HTTP message;
classifying the stored web page information carried in the HTTP message according to the IP address of the web page, the frequency with which words appear in the web page information, or the identification information of the user; and
obtaining, by web spider technology, other web page content linked from the web page.
2. The method of claim 1, characterized in that, when the message from the network side is received, it is judged whether the message is an HTTP message; when the source port number of the message is 80 and the protocol type of the message is Transmission Control Protocol (TCP), the message is an HTTP message.
3. The method of claim 2, characterized in that obtaining the web page information carried in the HTTP message comprises:
identifying the Hypertext Markup Language (HTML) head and the HTML tail in the HTTP message; and
obtaining the web page information carried in the HTTP message, the web page information comprising the HTML head, the HTML tail, and the information between the HTML head and the HTML tail in the HTTP message.
4. The method of claim 3, characterized in that the HTML head is the <html> tag and the HTML tail is the </html> tag.
5. The method of claim 4, characterized in that the method further comprises:
when the message is a Hypertext Transfer Protocol (HTTP) message, obtaining identification information of the user from the HTTP message; and
storing the identification information of the user.
6. The method of claim 5, characterized in that the identification information of the user comprises the user's IP address or the user's user name;
wherein the user's user name comprises the account with which the user logs in to the network, or the user's mobile phone number.
7. A network device for obtaining network materials for a search engine, characterized by comprising:
a receiving module, configured to receive a message from the network side;
a parsing module, configured to obtain, for the search engine, the web page information carried in the HTTP message when the message received by the receiving module is an HTTP message;
a storage module, configured to store the web page information that the parsing module obtained from the HTTP message;
a classification module, configured to classify the stored web page information carried in the HTTP message according to the IP address of the web page, the frequency with which words appear in the web page information, or the identification information of the user; and
a linked-page obtaining module, configured to obtain, by web spider technology, other web page content linked from the web page.
8. The network device of claim 7, characterized in that the network device further comprises:
a judging module, configured to judge whether the message received by the receiving module is an HTTP message, and to trigger the parsing module when the message received by the receiving module is an HTTP message.
9. The network device of claim 8, characterized in that the storage module is further configured to store identification information of the user obtained from the received HTTP message.
10. The network device of any one of claims 7-9, characterized in that the parsing module comprises:
an identifying unit, configured to identify the HTML head and the HTML tail in the HTTP message; and
an obtaining unit, configured to obtain the web page information carried in the HTTP message, the web page information comprising the HTML head, the HTML tail, and the information between the HTML head and the HTML tail in the HTTP message.
CN 200910105235 2009-01-20 2009-01-20 Method, equipment and system for providing network materials to search engine Expired - Fee Related CN101477576B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 200910105235 CN101477576B (en) 2009-01-20 2009-01-20 Method, equipment and system for providing network materials to search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 200910105235 CN101477576B (en) 2009-01-20 2009-01-20 Method, equipment and system for providing network materials to search engine

Publications (2)

Publication Number Publication Date
CN101477576A CN101477576A (en) 2009-07-08
CN101477576B true CN101477576B (en) 2013-08-28

Family

ID=40838292

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 200910105235 Expired - Fee Related CN101477576B (en) 2009-01-20 2009-01-20 Method, equipment and system for providing network materials to search engine

Country Status (1)

Country Link
CN (1) CN101477576B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8639773B2 (en) * 2010-06-17 2014-01-28 Microsoft Corporation Discrepancy detection for web crawling
CN103235785B (en) * 2013-03-28 2016-02-24 四三九九网络股份有限公司 A kind of method of batch extracting web page resources material
CN106790105B (en) * 2016-12-26 2020-08-21 携程旅游网络技术(上海)有限公司 Crawler identification interception method and system based on business data
CN117574010B (en) * 2023-11-03 2024-07-19 中信建投证券股份有限公司 Data acquisition method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN101477576A (en) 2009-07-08

Similar Documents

Publication Publication Date Title
CN102812452B (en) Be used for system, server, terminal, the method for display buffer webpage and record the computer readable recording medium storing program for performing of the method
CN101847160B (en) Method and device for pushing personalized pages to mobile terminal
CA2865187C (en) Method and system relating to salient content extraction for electronic content
CN102521251A (en) Method for directly realizing personalized search, device for realizing method, and search server
CN101986306B (en) Method and equipment for acquiring yellow page information based on query sequence
US20180247035A1 (en) Method and Apparatus for Identifying User Behavior Object Based on Traffic Analysis
JP2009532797A (en) SYSTEM AND METHOD FOR PROVIDING ADAPTIVE RECOMMENDED WORDS BY USER AND COMPUTER-READABLE RECORDING MEDIUM CONTAINING PROGRAM FOR EXECUTING THE METHOD
CN101833570A (en) Method and device for optimizing page push of mobile terminal
CN102750352A (en) Method and device for classified collection of historical access records in browser
CN101542482A (en) Bookmarks and ranking
CN102624756B (en) Data download terminal and data download method
CN103092857A (en) Method and device for sorting historical records
CN101567906A (en) Method and server for confirming the webpage content language
CN102622402B (en) Server, method and system for providing information search service by using sheaf of pages
CN101477576B (en) Method, equipment and system for providing network materials to search engine
CN102348171A (en) Message processing method and system thereof
CN104572719A (en) Information collecting method and device
US20120158796A1 (en) Method, apparatus and system for generating bookmarks
WO2005121982A1 (en) Information providing system, method, program, information communication terminal, and information display switching program
KR101307578B1 (en) System for supplying a representative phone number information with a search function
EP3026567B1 (en) Method and system for exchanging messages on the basis of current position
CN101296201A (en) Network information sharing method, system and instant communication device
CN103365860A (en) Method, device and terminal for processing web pages
CN102447788A (en) Method and device for reading multimedia message through mobile phone browser
US20120173341A1 (en) Information publishing method, apparatus and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20130828

Termination date: 20180120
