WO2012025040A1 - Visualized search engine system and implementation method and application thereof - Google Patents

Visualized search engine system and implementation method and application thereof


Publication number
WO2012025040A1
Authority
WO
WIPO (PCT)
Prior art keywords
page
focus
text
user
display
Prior art date
Application number
PCT/CN2011/078725
Other languages
French (fr)
Chinese (zh)
Inventor
黄斌
Original Assignee
Huang Bin
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201010264871.6A external-priority patent/CN101916294B/en
Priority claimed from CN 201010590806 external-priority patent/CN102054028B/en
Priority claimed from CN 201110052339 external-priority patent/CN102129453B/en
Priority claimed from CN201110234356.8A external-priority patent/CN102270331B/en
Application filed by Huang Bin filed Critical Huang Bin
Publication of WO2012025040A1 publication Critical patent/WO2012025040A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • the invention relates to a visual search engine system for displaying Internet search results in an illustrated manner, and relates to a method for realizing display control of Internet search results by the visual search engine system and its application in network shopping navigation, and belongs to the field of Internet search technology.
  • search engines play an irreplaceable role in helping users get the information they need quickly from the inexhaustible internet data.
  • A search engine collects information on the Internet according to a certain strategy using a specific computer program and, after organizing and processing that information, displays it to the user; it is thus a system that provides users with a retrieval service. Instead of actually searching the web pages of the Internet, it searches a pre-organized web index database.
  • A vertical search engine provides valuable information and related services for a specific field, a specific group of people, or a specific need, and is a subdivision and extension of the general search engine. It integrates a certain type of specialized information in the web index database, extracts the required data from targeted fields for processing, and then returns it to the user in some form.
  • The vertical search engine is characterized as "specialized, precise, deep" and has an industry flavor. Compared with the massive, disordered information of the general search engine, the vertical search engine is more focused, specific, and in-depth.
  • Vertical search engines generally include the following technologies: 1. Search engine crawler: used to crawl related web pages on the Internet; 2. Web page structured information extraction or metadata collection: used to extract structured information from web pages; 3. Data segmentation and indexing: used to store and index data; 4. Data presentation: since the stored data is not simple web page data, its display must be designed according to industry needs.
  • Search engines such as Google, Baidu, and Bing use different layouts for their search result pages, but the similarities outweigh the differences; for example, they all display search results in a text-only manner.
  • The page title is displayed for each search result, followed by a summary description of the page.
  • This layout can present more search results in one page. However, since only the text summary of the web page is displayed, a user who clicks on a search result based on the text summary may find that the page is far from the one he or she wants. The user can then only click back and try another search result, which makes for a poor user experience.
  • the user will see a magnifying glass logo to the right of the search results. Click on the magnifying glass to see a thumbnail preview of the page. Users can also swipe down to see a preview of all search results.
  • the hardware and software costs of achieving the above effects are enormous.
  • There are also other technical means of implementing the page preview function, such as using a CGI program to capture the image area of the browser, or using the drawing function of the browser to generate the image.
  • However, the prior art offers no solution that uses a web crawler to implement the page preview function.
  • Existing web crawlers generally only analyze the webpage file and extract its content.
  • Some web crawlers go a step further and perform simple processing on this content, such as semantic annotation, which makes it easier for search engines to sort the results.
  • These web crawlers generally do not have a page rendering function, so the search result page preview function cannot be conveniently implemented with them.
  • An existing search engine searches its own database system based on the keywords input by the user, and feeds the retrieval results back to the user.
  • The biggest problem is that the user does not know which keywords to enter in order to accurately express the information he or she needs.
  • The search service provider has to analyze and judge the information input by the user and provide search results according to that judgment. As a result, there is often a mismatch between the judgment of the search service provider and the needs of the user.
  • Intelligent search uses a word segmentation dictionary, a synonym dictionary, and a homonym dictionary to improve retrieval. Further, it can assist the query at the knowledge level or the concept level: through a topic dictionary, a hypernym/hyponym dictionary, and a related-term dictionary, it forms a knowledge system or concept network that gives the user intelligent knowledge prompts, which ultimately helps the user obtain the best retrieval results.
  • For example, when querying "computer", information related to "computer" can also be retrieved; the query can be narrowed to "microcomputer" or "server", expanded to "information technology", or extended to related categories such as "electronic technology", "software", and "computer application".
  • Some search engines also provide association functions, which statistically analyze previous users' selections and, based on that analysis, offer the results users are most likely to choose. But this does not actually solve the problem of web search accuracy: statistical regularities hold across a large number of people, but mean little for one particular search by one particular user.
  • Vertical search engines have many applications, such as enterprise search, supply and demand information search engine, shopping search, property search, talent search, map search, mp3 search, image search, and so on.
  • The overall workflow is as follows: after web pages are crawled, product information such as the product name, price, and description is extracted; the information is then cleaned, deduplicated, classified, analyzed, compared, and mined; finally, user search is provided through a word segmentation index, and market reports are provided through analysis and mining.
  • a first technical problem to be solved by the present invention is to provide a visual search engine system.
  • the visual search engine system can display Internet search results in an illustrated manner.
  • a second technical problem to be solved by the present invention is to provide a method for the visual search engine system to implement display control of Internet search results.
  • A third technical problem to be solved by the present invention is to provide a network shopping navigation method based on the above-described visual search engine system.
  • a visual search engine system comprising a web crawler device, a display control device and a semantic analysis device, characterized in that:
  • the web crawler device further includes a plurality of information collectors, a page analyzer, a URL filter, a page filter, a URL manager, a picture generator, a URL library, and a page library; wherein
  • the information collector is located at the bottom layer of the web crawler device and interacts directly with the Internet to obtain web pages; the page analyzer is connected to the information collector and, on the one hand, parses the URLs carrying link marks out of the page content and hands them to the URL filter for parsing; on the other hand, it parses the page content into text format and submits it to the page filter for processing;
  • after the URL filter filters the URLs to limit the site scope and the topic, the filtered URLs are stored in the URL library; after the page filter performs redundancy detection on the page content, the checked pages are stored in the page library;
  • the picture generator is connected to the URL library and generates, for each URL stored in the URL library, a picture corresponding to the page;
  • the display control device further includes:
  • a text search result display unit for displaying text search results in a list manner
  • a graphic search result display unit configured to display a webpage thumbnail corresponding to the text search result
  • a focus tracking unit for capturing text focus and/or graphic focus of the user's attention
  • a focus webpage thumbnail display unit configured to display a graphic focus corresponding to a text focus selected by the user
  • a synchronous display control unit configured to display the text focus and the graphic focus synchronously in the display page, and to realize synchronized, coordinated changes through a bidirectional circular linked list with a head pointer;
  • the text search result display unit is located at the left middle of the entire display page, the focus webpage thumbnail display unit is located in the central area of the entire display page, and the graphic search result display units are located at the upper right corner and the lower right corner of the focus webpage thumbnail display unit, respectively;
  • the semantic analysis device further includes:
  • an input word segmentation unit configured to accept a target information descriptor input by the user and perform a word segmentation operation on it;
  • a semantic determining unit configured to determine whether the target information descriptor has complete semantics;
  • a reference vocabulary unit configured to provide the user with vocabulary associated with the target information descriptor if the descriptor does not have complete semantics;
  • a secondary input unit configured to accept a secondary input from the user, thereby determining the semantics of the target information descriptor, so that subsequent retrieval is performed according to those semantics.
  • a method for implementing display control of Internet search results by the above visual search engine system comprising a page rendering step, a display control step and a semantic analysis step, wherein:
  • the page rendering step includes the following sub-steps:
  • the display control step includes the following sub-steps:
  • the text search result is vertically arranged in parallel with the corresponding webpage thumbnail, and the central part of the display page is a focus display area for displaying the graphic focus corresponding to the text focus selected by the user;
  • the semantic analysis step includes the following substeps:
  • the user performs a secondary input to determine the semantics of the target information descriptor, and performs subsequent retrieval based on the semantics.
  • a web crawler with page rendering function characterized in that:
  • the web crawler includes a plurality of information collectors, a page analyzer, a URL filter, a page filter, a URL manager, a picture generator, a URL library, and a page library;
  • the information collector is located at the bottom layer of the web crawler and interacts directly with the Internet to obtain web pages; the page analyzer is connected to the information collector and, on the one hand, parses the URLs carrying link marks out of the page content and hands them to the URL filter for parsing; on the other hand, it parses the page content into text format and submits it to the page filter for processing;
  • after the URL filter filters the URLs to limit the site scope and the topic, the filtered URLs are stored in the URL library; after the page filter performs redundancy detection on the page content, the checked pages are stored in the page library;
  • the picture generator is connected to the URL library, and generates a picture corresponding to the page for the URL stored in the URL library.
  • the information collector starts from the information source, sends a request over the HTTP protocol, and downloads a web page; the page analyzer analyzes the page and extracts the links, after which the information collector continues to access the network iteratively.
  • the information collector searches the web page by using a graph traversal algorithm.
  • the URL filter uses the semantic information of the extended metadata to perform topic correlation prediction on the URL extracted from the Web page, and performs pruning processing according to the principle of collecting related links and discarding irrelevant links.
  • the URL manager, on the one hand, obtains a URL list from the URL library and, after arranging the tasks, assigns them to a plurality of information collectors; on the other hand, it obtains new URL lists from the information collectors and saves them to the URL library.
  • a method for implementing a page rendering function by a web crawler device, comprising the following step: when a picture tag is found to reference a picture, a request is sent to the server; rendering of the subsequent code continues in the meantime, and once the server returns the picture file, that part of the code is re-rendered.
  • a display control device for displaying search results in an image and text manner, comprising: a text search result display unit, configured to display a text search result in a list manner;
  • a graphic search result display unit configured to display a webpage thumbnail corresponding to the text search result
  • a focus tracking unit configured to capture a text focus and/or a graphic focus of the user's attention
  • a focus webpage thumbnail display unit configured to display a graphic focus corresponding to a text focus selected by the user
  • a synchronous display control unit configured to display the text focus and the graphic focus synchronously in the display page, and to realize synchronized, coordinated changes through a bidirectional circular linked list with a head pointer;
  • the text search result display unit is located at the left middle of the entire display page, the focus webpage thumbnail display unit is located in the central area of the entire display page, and the graphic search result display units are located at the upper right corner and the lower right corner of the focus webpage thumbnail display unit, respectively.
  • the text search result is vertically arranged in parallel with the corresponding web page thumbnail.
  • the position of the head pointer in the bidirectional circular linked list corresponds to the position of the text focus in the text search result.
  • a display control method for displaying search results in an illustrated manner characterized in that:
  • the text search results are arranged vertically in parallel with the corresponding webpage thumbnails, and the central part of the display page is a focus display area for displaying the graphic focus corresponding to the text focus selected by the user; the text focus and the graphic focus are displayed synchronously in the display page, and their synchronized, coordinated change is realized by a bidirectional circular linked list with a head pointer, wherein the head pointer is used to determine the position of the text focus.
  • a method for realizing accurate search by using semantic analysis which is characterized by the following steps:
  • the user performs a secondary input to determine the semantics of the target information descriptor, and performs subsequent retrieval based on the semantics.
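  • The semantic analysis flow described above (word segmentation, a completeness check, reference vocabulary, then a secondary input) can be sketched roughly as follows. This is only an illustration under stated assumptions: the dictionaries `KNOWN_WORDS` and `AMBIGUOUS`, the whitespace-based segmentation, and the rule that a single ambiguous word lacks complete semantics are all hypothetical and do not come from the patent text.

```python
# Hypothetical dictionaries; the patent does not list their contents.
KNOWN_WORDS = ["apple", "phone", "laptop", "case"]
AMBIGUOUS = {"apple": ["apple fruit", "apple phone", "apple laptop"]}


def segment(query):
    """Toy word segmentation: split on whitespace and keep known words."""
    return [w for w in query.lower().split() if w in KNOWN_WORDS]


def has_complete_semantics(words):
    """Assumed rule: a query is incomplete only if it is one ambiguous word."""
    return not (len(words) == 1 and words[0] in AMBIGUOUS)


def analyse(query):
    """Return either the words to search for or vocabulary to offer the user."""
    words = segment(query)
    if has_complete_semantics(words):
        return {"search": words, "suggest": []}
    # Incomplete semantics: offer reference vocabulary for a secondary input.
    return {"search": None, "suggest": AMBIGUOUS[words[0]]}


first = analyse("apple")         # ambiguous, so suggestions are returned
second = analyse("apple phone")  # the secondary input resolves the meaning
```

  • In this sketch the user's secondary input is simply a second call to `analyse` with one of the suggested phrases.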
  • a network shopping navigation method, implemented based on a visual search engine system including a web crawler device and a display control device, wherein the web crawler device is configured to capture web pages and generate webpage thumbnails, characterized in that: when the visual search engine system is used for network shopping navigation, first, according to the shopping object keyword input by the user, the display control device displays the text search results for the shopping object on the left side of the search result page, displays the webpage thumbnails corresponding to the text search results at the upper right corner and the lower right corner of the search result page, and displays in the central area of the search result page the thumbnail of the focus webpage of the shopping object currently selected by the user;
  • a selection column is set in the search result page, and the user puts the selected search result into the selection column for comparison, and then enters the webpage where the shopping object is located from the selection column to purchase.
  • a webpage ID is set for each target webpage for the shopping object, and the webpage ID is transit managed.
  • the selection column temporarily saves the webpage ID, and the item to be purchased is added or discarded by the operation of the webpage ID.
  • a favorites folder is further set in the search result page; the favorites folder is open to registered users and stores the webpage IDs selected by a registered user for a long time.
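  • The selection column and favorites folder described above both operate on webpage IDs. A minimal sketch of that bookkeeping might look like the following; the class names `SelectionColumn` and `Favorites`, the method names, and the sample IDs are illustrative assumptions, not terms defined by the patent.

```python
class SelectionColumn:
    """Temporarily holds the webpage IDs chosen for comparison."""

    def __init__(self):
        self.page_ids = []

    def add(self, page_id):
        # Adding an ID that is already present has no effect.
        if page_id not in self.page_ids:
            self.page_ids.append(page_id)

    def discard(self, page_id):
        if page_id in self.page_ids:
            self.page_ids.remove(page_id)


class Favorites:
    """Long-term storage of webpage IDs, keyed by registered user."""

    def __init__(self):
        self.by_user = {}

    def save(self, user, page_id):
        self.by_user.setdefault(user, set()).add(page_id)


column = SelectionColumn()
for pid in ("p1", "p2", "p3"):
    column.add(pid)
column.discard("p2")          # the user drops one candidate after comparing

favorites = Favorites()
favorites.save("alice", "p1")  # a registered user keeps one item long term
```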
  • the thumbnails of the shopping objects captured and generated by the web crawler are grouped together for selection by the user.
  • the user enters the online shop page where the shopping object is located to make a purchase, and the online shop page is displayed in a virtual floating manner.
  • The visual search engine system and its implementation method use the web crawler device to render web pages directly and save the rendering result in an image format, thereby laying a technical foundation for a low-cost, efficient page preview function; they display both the text summary and the webpage thumbnail of Internet search results in the display page, so that users can accurately identify the content they need; and they use semantic analysis to achieve accurate search, so that the search engine can accurately provide from its database the information the user most wants.
  • The "search", "view", and "comparison" steps of the online shopping process are integrated inside the visual search engine system, thereby forming a complete online shopping navigation process that effectively improves the user's online shopping experience.
  • FIG. 1 is a schematic diagram of an overall architecture of a visual search engine system provided by the present invention
  • FIG. 2 is a schematic diagram of the overall composition of the web crawler device in the visual search engine system
  • FIG. 3 is a schematic flowchart of the web crawler device implementing the basic web crawling function
  • FIG. 4 is a schematic flowchart of the web crawler device implementing the page rendering function
  • FIG. 5 is a schematic diagram of a display page of the display control device in the visual search engine system
  • FIG. 6 is a schematic diagram of the bidirectional circular linked list with a head pointer used to implement the synchronous display control unit
  • FIG. 7 is a schematic diagram showing a correspondence between an initial state of a page and a bidirectional circular linked list
  • FIG. 8 is a schematic diagram showing a correspondence between an intermediate state of a page and a bidirectional circular linked list
  • FIG. 9 is a flow chart of a method for realizing accurate search by using semantic analysis in the present invention.
  • FIG. 10 is a diagram showing an example of a homepage of the visual search engine system as a network shopping search engine
  • FIG. 11 is a diagram showing an example of a search result page when performing a “search” according to user input information
  • FIG. 12 is a diagram showing an example of a display page in which the user gathers the preliminarily selected shopping objects together to "view" them for "comparison";
  • FIG. 13 is a diagram showing an example of a user entering the online shop where the shopping object is located after the user compares the shopping object to be purchased.
  • The visual search engine system provided by the present invention mainly solves technical problems in three aspects.
  • The first is to realize the search result page preview function at low cost and high efficiency.
  • The second is to display the text summary and webpage thumbnail of Internet search results together in the display page, so that the user can accurately identify the content he or she needs.
  • The third is to use semantic analysis to achieve accurate search, so that the search engine can accurately provide the user with the most desired information from the database.
  • The visual search engine system can implement multiple service functions such as web page collection, web page sorting and indexing, page rendering, and query service.
  • These service functions are mainly realized through the cooperation of the web crawler device, the display control device, and the semantic analysis device, as described below:
  • the web crawler in the visual search engine system is mainly composed of the following parts:
  • Each information collector is a web spider (Web Spider) located at the bottom layer of the web crawler. It is the interface part through which the web crawler interacts directly with massive Internet information (such as forums, blogs, WAP sites, documents, and audio and video materials).
  • The role of the information collector is to obtain web pages. It usually starts from an information source (such as a user query, a URL list, or a certain page), sends a request over the HTTP protocol, downloads a web page, analyzes the page, and extracts its links; the information collector then accesses the Internet iteratively in this way.
  • In a specific embodiment of the invention, the information collector preferably searches web pages using a graph traversal algorithm, such as a breadth-first or depth-first strategy.
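  • As an illustration of the breadth-first strategy mentioned above, the following sketch traverses a small in-memory link graph. It is a simplified assumption-based example: `crawl_breadth_first`, `fetch_links`, and `TOY_WEB` are hypothetical names, and the dictionary stands in for real HTTP downloads and link extraction.

```python
from collections import deque


def crawl_breadth_first(seeds, fetch_links, max_pages=100):
    """Visit pages breadth-first, starting from the seed URLs."""
    queue = deque(seeds)
    seen = set(seeds)
    order = []
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)
        # fetch_links stands in for "download the page and extract its links".
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return order


# A tiny in-memory "web" replaces real network access in this sketch.
TOY_WEB = {
    "a": ["b", "c"],
    "b": ["d"],
    "c": ["d", "a"],
    "d": [],
}
visit_order = crawl_breadth_first(["a"], lambda url: TOY_WEB.get(url, []))
```

  • Replacing `queue.popleft()` with `queue.pop()` would turn the same loop into a depth-first traversal.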
  • the web crawler uses multi-threading technology for each information collector based on the parallel mechanism.
  • each information collector can start hundreds of threads simultaneously for page information collection.
  • The URL manager manages the queue of URLs to be collected by means of interleaved access and allocates collection tasks to each information collector, ensuring that at most one thread of the same information collector is connected to the same web server. This effectively prevents a sudden surge in visits from congesting or even crashing the web server.
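  • One simple way to approximate interleaved access is to reorder the pending URLs so that consecutive requests go to different hosts. The sketch below shows this idea only; the function name `interleave_by_host` and the round-robin policy are assumptions, and the patent does not specify how its URL manager schedules the queue.

```python
from collections import OrderedDict, deque
from urllib.parse import urlparse


def interleave_by_host(urls):
    """Reorder URLs so that consecutive entries cycle through the hosts."""
    per_host = OrderedDict()
    for url in urls:
        per_host.setdefault(urlparse(url).netloc, deque()).append(url)
    ordered = []
    while per_host:
        # Take one URL from each host in turn; drop hosts that run empty.
        for host in list(per_host):
            ordered.append(per_host[host].popleft())
            if not per_host[host]:
                del per_host[host]
    return ordered


pending = [
    "http://a.example/1", "http://a.example/2", "http://a.example/3",
    "http://b.example/1", "http://c.example/1",
]
schedule = interleave_by_host(pending)
```

  • Once only one host remains, its URLs still end up adjacent, so a real scheduler would also need a per-host delay or lock.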
  • The URL library stores the URLs extracted from the collected pages. To avoid the "topic drift" problem during page collection, these URLs must undergo topic relevance prediction before entering the URL library.
  • The URL filter uses the semantic information of extended metadata (i.e., HTML tags such as Anchor) to perform this topic relevance prediction, and prunes links according to the principle of collecting related links and discarding irrelevant ones. This pruning reduces the number of unrelated pages collected by the system, thereby saving considerable system operating cost and effectively improving the speed and efficiency of topic information search.
  • Links (URLs) that the URL filter predicts to point to topic-related pages enter the URL library, and are then distributed by the URL manager as to-be-collected URLs to the information collectors, which collect the web pages the URLs point to.
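  • The pruning idea above can be illustrated with a deliberately simple relevance score computed from anchor text. The scoring rule (the fraction of topic terms found in the anchor text), the threshold, and the names `topic_relevance` and `prune_links` are assumptions made for this sketch; the patent does not disclose its actual prediction method.

```python
def topic_relevance(anchor_text, topic_terms):
    """Fraction of topic terms that occur in the anchor text."""
    words = set(anchor_text.lower().split())
    hits = sum(1 for term in topic_terms if term in words)
    return hits / len(topic_terms) if topic_terms else 0.0


def prune_links(links, topic_terms, threshold=0.5):
    """Keep the URLs whose anchor text is predicted to be on-topic."""
    return [url for url, anchor in links
            if topic_relevance(anchor, topic_terms) >= threshold]


links = [
    ("http://shop.example/laptops", "Cheap laptop deals"),
    ("http://news.example/sports", "Football results"),
]
kept = prune_links(links, topic_terms=["laptop", "deals"])
```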
  • Building on this, the page filter of this web crawler adopts the idea of the traditional vector space model and filters page content with a concept-based vector space method: vocabulary is mapped to the concept level, so that the relevance of text is analyzed at the level of the concepts the words express, that is, at the semantic level.
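  • A concept-based vector space comparison can be sketched as follows: words are first mapped to concepts, and cosine similarity is then computed over concept counts instead of raw words. The `CONCEPTS` table and the function names are hypothetical; a real system would rely on a thesaurus or ontology rather than a hand-written dictionary.

```python
import math
from collections import Counter

# Hypothetical word-to-concept table used only for this illustration.
CONCEPTS = {
    "laptop": "computer", "notebook": "computer", "pc": "computer",
    "price": "cost", "cheap": "cost",
}


def concept_vector(text):
    """Count concepts rather than raw words; unknown words are ignored."""
    return Counter(CONCEPTS[w] for w in text.lower().split() if w in CONCEPTS)


def cosine(u, v):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(u[k] * v[k] for k in u)
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0


topic = concept_vector("laptop price")
page = concept_vector("cheap notebook")
similarity = cosine(topic, page)
```

  • The two texts share no words, yet they map to the same concepts, so their similarity is high; a purely word-based vector space model would score them as unrelated.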
  • The main function of the page analyzer is to parse the content of the captured page. This work has two parts: one parses out the URLs carrying link marks and hands them to the URL filter, which parses them to extract the links; the other parses the page content into text format and hands it to the page filter for processing.
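  • The two outputs of the page analyzer (links for the URL filter, plain text for the page filter) can be illustrated with Python's standard `html.parser` module. The class name `PageAnalyzer` and the helper `analyse_page` are assumptions for this sketch and are not taken from the patent.

```python
from html.parser import HTMLParser


class PageAnalyzer(HTMLParser):
    """Collects link targets and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.urls = []
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        # Link marks: keep the href of every anchor tag.
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.urls.append(value)

    def handle_data(self, data):
        if data.strip():
            self.text_parts.append(data.strip())


def analyse_page(html):
    """Return (urls for the URL filter, text for the page filter)."""
    analyzer = PageAnalyzer()
    analyzer.feed(html)
    analyzer.close()
    return analyzer.urls, " ".join(analyzer.text_parts)


urls, text = analyse_page('<p>Hello <a href="http://x.example/">shop</a></p>')
```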
  • the main function of the URL Manager is to manage URL tasks.
  • On the one hand, the URL manager obtains a list of URLs from the URL library and, after arranging the tasks, assigns them to multiple information collectors.
  • On the other hand, the URL manager obtains new lists of URLs from the information collectors and saves them to the URL library according to a certain strategy.
  • The URL manager starts the information collectors, which begin collecting web pages and store the collected pages. Each page is then analyzed by the page analyzer to obtain two parts: the link marks and the page content. The link marks are parsed by the URL filter and, after being filtered to limit the site scope and topic, the URLs are stored in the URL library. The page content is sent to the page filter and, after redundancy detection, stored in the page library. Thereafter, the picture generator connected to the URL library starts working and generates, for each URL stored in the URL library, a picture of the corresponding page. A detailed description is given below.
  • the user enters a URL to make a request to the server, and the server returns a web page in html format;
  • The page parser starts to load the HTML source code. If it finds a <link> tag inside the <head> tag that references an external CSS file, it issues a request for the CSS file, and the server returns the CSS file; the page parser then continues to load the code in the <body> section of the HTML and starts rendering the page.
  • the start tag used to generate an Html file
  • This step is mainly used to render the content in the template.
  • The life cycle stages of the tag are called in turn. That is, rendering proceeds recursively from the upper-level tag to the lower-level tags, and only after the lower-level tags have been rendered does the calling component continue with the subsequent phase.
  • This step is generally used to generate an end tag or to control the execution flow of the inline tag.
  • the present invention has been described above by taking a Web page in the html format as an example.
  • the web crawler having the page rendering function provided by the present invention is not limited to processing pages in the html format, and web pages in other formats can be directly processed.
  • The Internet search result to be displayed includes two types of data: text search result data and the corresponding webpage thumbnail data, rather than a single type of text data or graphic data.
  • The display control device in the visual search engine system adopts the display position setting scheme shown in FIG. 5: the text display area and the graphic display area are arranged vertically in parallel, and the focus display area is set in the central part of the display page.
  • The selected text focus is combined with the corresponding graphic focus and arranged on the same horizontal line.
  • This display position setting scheme takes into account that text is read from left to right. To match people's reading habits, related content (i.e., the corresponding text focus and graphic focus) must be shown on the same horizontal line from left to right.
  • the display control device in the visual search engine system simultaneously displays a text summary (ie, a text search result) of the search result and a corresponding web page thumbnail in the display page.
  • the display control device includes at least three display function units, which are a text search result display unit, a focus web page thumbnail display unit, and a graphic search result display unit.
  • the text search result display unit is located at the middle of the left side of the entire display page
  • the focus webpage thumbnail display unit is located in the central area of the entire display page
  • there may be multiple graphic search result display units, located at the upper right corner and the lower right corner of the focus webpage thumbnail display unit, respectively.
  • the text search result of the web search can be displayed in a list.
  • the text search result 1 to the text search result 5 are displayed in a list.
  • The user can further click on the text search results with the mouse, for example clicking text search result 3 as the result of interest, and then click the corresponding link.
  • The display control device associates the display content of the text search result display unit with that of the focus webpage thumbnail display unit and the graphic search result display units: the webpage corresponding to the text search result selected by the user in the text search result display unit (i.e., the text focus) is associated with the focus webpage thumbnail display unit, and the webpages corresponding to the other text search results are associated with the graphic search result display units.
  • The thumbnail of the webpage corresponding to the text focus is therefore always displayed in the focus webpage thumbnail display unit, while the graphic search result display units display the thumbnails of the webpages corresponding to the other text search results, which the user has not selected.
  • Webpage thumbnail display can be implemented using techniques such as a web crawler (Web Crawler).
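  • The association between the text focus and the two kinds of thumbnail display units amounts to splitting the result list into one focus item and the rest. A minimal sketch, assuming each result is a dictionary with hypothetical `text` and `thumbnail` keys, is:

```python
def layout(results, focus_index):
    """Split results into the focus thumbnail and the remaining thumbnails."""
    focus = results[focus_index]["thumbnail"]
    others = [r["thumbnail"] for i, r in enumerate(results) if i != focus_index]
    return {"focus_area": focus, "side_areas": others}


results = [
    {"text": "result 1", "thumbnail": "thumb1.png"},
    {"text": "result 2", "thumbnail": "thumb2.png"},
    {"text": "result 3", "thumbnail": "thumb3.png"},
]
page = layout(results, focus_index=1)  # the user selected text search result 2
```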
  • the display area occupied by the focus web page thumbnail display unit is large and always located in the center area of the display page.
  • the thumbnail of the webpage corresponding to the text search result selected by the user can be clearly and comprehensively displayed, which is convenient for the user to decide whether to perform further click operations.
  • By contrast, when the webpage thumbnail is displayed only on the right side of the search result with a small display area, it is difficult for the user to see the specific content of the thumbnail, and inconvenient to decide whether to click further.
  • the display content of the focus web page thumbnail display unit can be changed.
  • as the user's focus of attention, indicated by the position of the mouse operated by the user, changes, the text focus and the corresponding graphic focus change accordingly, thereby achieving effective focus conversion.
  • a focus tracking unit is specifically provided in the display control device for capturing the text focus and/or the graphic focus of the user's attention, together with a synchronous display control unit that keeps the display content of the focus webpage thumbnail display unit synchronized with the user's focus.
  • the above-described synchronous display control unit can synchronously control three kinds of data (text stream, picture stream, and display focus) using a bidirectional circular linked list with a head pointer L as shown in FIG.
  • This bidirectional circular linked list can be traversed cyclically.
  • the control linked list is adjusted accordingly.
  • the adjustment of the control linked list drives the adjustment of the other data streams, so that the whole can be effectively synchronized.
  • This circular linked list must be bidirectional: the display focus can be moved from top to bottom in the text stream or from bottom to top, and the same holds for the picture stream.
  • the bidirectional circular linked list also has a head pointer L, which is used to determine the focus position. When the position of the head pointer L changes, the entire focus display content changes, which in turn changes the display of the text focus and the graphic focus.
  • Figure 7 is a schematic diagram showing the correspondence between the initial state of the page and the bidirectional circular linked list.
  • both the text focus and the graphic focus are located at the top of the display page, and the head pointer L in the corresponding bidirectional circular list is located at the leftmost position.
  • Fig. 8 is a schematic diagram showing the correspondence between the intermediate state of the page and the bidirectional circular linked list.
  • both the text focus and the graphic focus are in the middle of the display page, and the head pointer L in the corresponding bidirectional circular list is also in the middle position.
  • the bidirectional circular linked list here is a controller that drives both the text stream and the picture stream. It supplies the parameters that tell the text stream and the picture stream which items should be in the focus display state or need to be changed.
  • the number of nodes in the bidirectional circular linked list equals the number of pairs of search results to be displayed, and each node corresponds to one pair of search results (a text search result and the corresponding web page thumbnail).
  • the focus is on whichever pair of search results the head pointer L points to.
  • the web page thumbnail of that pair is displayed by the focus webpage thumbnail display unit, and the corresponding text search result is in the focus position in the text search result display unit.
  • the display position of other search results is adjusted accordingly.
  • the position of the head pointer L also changes accordingly, but the relationship between the text stream and the picture stream has been fixed by the bidirectional circular linked list, realizing the synchronous display.
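The synchronization described above can be illustrated with a minimal sketch. It assumes each node stores one (text result, thumbnail) pair and that moving the head pointer L is the only way the focus changes; the class and method names (`FocusList`, `move`, `view`) are hypothetical and do not come from the patent.

```python
class Node:
    """One pair of search results: a text result and its page thumbnail."""

    def __init__(self, text, thumbnail):
        self.text = text
        self.thumbnail = thumbnail
        self.prev = None
        self.next = None


class FocusList:
    """Bidirectional circular linked list with a head pointer L (hypothetical API)."""

    def __init__(self, pairs):
        self.head = None  # head pointer L: the node currently in focus
        last = None
        for text, thumbnail in pairs:
            node = Node(text, thumbnail)
            if self.head is None:
                self.head = node
            else:
                last.next = node
                node.prev = last
            last = node
        if self.head is not None:
            # Close the ring in both directions.
            last.next = self.head
            self.head.prev = last

    def move(self, steps):
        """Move L forward (steps > 0) or backward (steps < 0) around the ring."""
        if self.head is None:
            return
        for _ in range(abs(steps)):
            self.head = self.head.next if steps > 0 else self.head.prev

    def view(self):
        """Return (text focus, graphic focus, thumbnails of the other results)."""
        others = []
        node = self.head.next
        while node is not self.head:
            others.append(node.thumbnail)
            node = node.next
        return self.head.text, self.head.thumbnail, others
```

Because the text and the thumbnail live in the same node, a single change to L updates both display units together, which is the fixed relationship between the two streams that the text attributes to the linked list.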
  • the present invention converts the "input, word segmentation, retrieval" processing used by existing search engines into "input, word segmentation, semantic judgment, retrieval": after the word segmentation operation, a semantic judgment determines whether the input word carries definite semantics. If so, the subsequent retrieval is performed directly; if not, the user is provided with vocabulary associated with the input word. The user then performs a second input (selecting from the associated vocabulary), which accurately determines the true semantics of the user's input, so that accurate web search results are obtained based on those semantics.
  • in most cases the network search method requires two input operations. After the user inputs the most important target information descriptor (the first input operation), the related vocabulary is retrieved and provided to the user, and the user selects from it (the second input operation), thereby clarifying the specific search goal, so that the search engine can accurately provide the user with the most desired information from the database.
  • the web search method adds a "meta vocabulary association database" to the search engine and encourages the user to input the word in the "meta vocabulary association database" that best represents the search target; this word is the most important target descriptor of the information the user needs.
  • When the search engine accepts the words input by the user and performs maximum segmentation, it judges whether the input has complete semantics. If so, it directly performs the subsequent search. If not, the search engine performs association analysis on the input words in the "meta vocabulary association database" and, based on the results, offers the user a multiple choice, so that the user can describe the required target information more accurately through a further selection. The multiple choice covers the information related to the first input word, so the user's real purpose becomes clear to the search engine, which can then provide the required search results quickly and cost-effectively.
  • the network search method does not simply decompose the existing one-step search into a two-step optional search, but instead discards the usual one-step search method and discards the multi-step or indefinite step query method. This is based on the results of two studies, namely:
  • the "meta-lexical association database” is not represented by one-dimensional data of a simple relational database, but by a multi-dimensional associated vocabulary matrix. Specifically, we uniformly encode each meta-word and use the relevant code as its representation.
  • the specific coding scheme can be implemented by using various existing technologies, such as an XML method, an N3 mode, and a triplet mode, and will not be described in detail herein.
  • effective association analysis between vocabulary items generates the association information of a given item: for a meta vocabulary item S, S{ci, dj} is used to store the associated vocabulary, where ci is the first-layer classification and dj is the second-layer classification.
  • ci, dj can be selected according to the needs of the designer, for example, classifying ci according to the subject of knowledge, and classifying dj according to people's needs.
  • a vocabulary produces a linked vocabulary.
  • for the word "car", the first-layer classification of its associated vocabulary can be expressed as (model, brand, manufacturer, seller, repair, performance, picture), etc., while the second-layer classification under, for example, "model" includes (small car, truck, trucks, etc.). Its complete representation is:
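The complete representation is not reproduced in this text. Purely as an illustration of the two-layer structure S{ci, dj}, the following sketch stores the first-layer categories named above as dictionary keys and fills in second-layer entries only for "model", since the text gives no others. The nested-dictionary layout and the helper names are assumptions and do not reflect the patent's actual encoding.

```python
# Hypothetical in-memory form of S{ci, dj} for the meta word "car".
# First-layer keys (ci) are the categories listed in the text; only "model"
# has second-layer entries (dj) because the text names no others.
ASSOCIATIONS = {
    "car": {
        "model": ["small car", "truck"],
        "brand": [],
        "manufacturer": [],
        "seller": [],
        "repair": [],
        "performance": [],
        "picture": [],
    }
}


def first_layer(word):
    """Return the first-layer categories (ci) stored for a meta word."""
    return list(ASSOCIATIONS.get(word, {}))


def second_layer(word, category):
    """Return the second-layer entries (dj) under one first-layer category."""
    return list(ASSOCIATIONS.get(word, {}).get(category, []))
```

A lookup such as `second_layer("car", "model")` would then supply the options offered to the user for the second input.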
  • a user enters the search engine through the Internet in a computer at his home (such as College Road, Haidian District, Beijing), the purpose of which is to search for the location of the ICBC nearest to his home. So he entered the word "ICBC" in the search box.
  • the usual search engine will immediately display the text information about the four words "ICBC" on the Internet to the user according to their own sorting method.
  • the user selects a webpage that may include the information he or she needs from at least tens of pages of the selection, and then clicks on the link to enter it, and finds the information he needs from the webpage.
  • the search engine first uses a maximization word segmentation algorithm (such as the forward maximization, reverse maximization, or probability maximization word segmentation algorithm) to determine that the input is a single vocabulary item. Applying the principle of complete semantics, it judges that this is not an input with complete semantics but a vocabulary item, and understands that the information required by the user is related to ICBC (this understanding assumes that the user's input is valid rather than unintended, so it can be readily determined from a single input word).
  • the search engine can accurately determine that the information the user wants is actually "the address of the ICBC branch closest to the location of the IP address." In this way, the search engine can retrieve accurately from its own database and display the web page containing "College Road ICBC address" directly to the user, who quickly gets the information actually needed.
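The forward variant of the segmentation step mentioned above is commonly implemented as greedy longest-match against a dictionary. The sketch below shows that general technique, not the patent's own implementation. It assumes the query behind "ICBC" is the four-character string 工商银行 and uses a small made-up dictionary; the function name `forward_max_match` is hypothetical.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Greedy forward segmentation: at each position take the longest dictionary word."""
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so that unknown text still advances.
            if size == 1 or piece in dictionary:
                words.append(piece)
                i += size
                break
    return words


# Made-up dictionary for illustration: "ICBC", its two halves, and "address".
DICTIONARY = {"工商银行", "工商", "银行", "地址"}
print(forward_max_match("工商银行", DICTIONARY))      # a single vocabulary item
print(forward_max_match("工商银行地址", DICTIONARY))  # two items
```

With this dictionary the first call yields one word, which matches the situation in the example: the input is a lone vocabulary item, so the engine goes on to offer associated vocabulary instead of retrieving immediately.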
  • the homepage of the online shopping navigation website provided is shown in FIG.
  • the search box is located at the top of the page, the left side is the eye-catching "Picture” logo, and the bottom is a series of commonly used search shortcuts, including "Latest”, “Recommended”, “Cosmetics”, “Group purchase”, “comprehensive shopping”, “shopping discount”, “digital home appliances”, “female fashion”, “mother and baby children”, “clothing apparel” and so on.
  • Below the search box and search shortcuts above is a selection of themes consisting of a series of thumbnails of web pages. These web page thumbnails are all generated by crawling by web crawlers in the visual search engine system.
  • the use process of the graphic search provided by the present invention fully respects the usage habits of ordinary users, including the steps of searching, screening, comparing and screening (this step can be omitted), and entering the online shop page where the shopping object is located. These steps are very similar to using other shopping search engines. But when the existing shopping search engine is in use,
  • this Tubu search integrates "search”, "view” and “comparison” in the online shopping process into the interior of the visual search engine system through the cooperation of the web crawler device and the display control device. This forms a complete online shopping navigation process that greatly improves the user's shopping experience.
  • FIG. 11 is a diagram showing an example of a search result page when "searching" is performed based on user input information.
  • the user inputs the shopping object keyword "car" in the search box; the text search results related to the shopping object "car" are then displayed on the left side of the display page, and thumbnails of the web pages corresponding to the text search results are displayed in the upper right and lower right corners of the page.
  • the focus web page thumbnail of the shopping object currently selected by the user is located in the center area of the entire display page.
  • the basic frame of the display page shown in Fig. 11 is determined by the display control device in the visual search engine system, and thus is very similar to the display page shown in Fig. 5.
  • the user can complete the "viewing" operation without clicking the webpage where the shopping object is located. Since the "view” operation is completely done inside the map search, the user's operation is greatly simplified.
  • the user can use the map purchase search to see the relevant information and price of the object to be purchased, and select the shopping object by selecting its webpage, so that selection takes place at the search engine level and the online shopping navigation role of the map purchase search becomes more prominent.
  • the selection bar and the favorites are set in the search result page shown in FIG.
  • a webpage ID is set for each target webpage of a shopping object, and the webpage ID is passed along during the processing of the target shopping object.
  • the selection bar can use a cookie and a back-end selection temporary storage library to store the temporarily saved web page IDs, and items to be purchased are added or discarded by operating on the web page IDs.
  • the search results are numerous, and must be placed in the selection bar for "comparison", and then from the selection bar to the web page where the target shopping object is located.
  • In the search, the selection bar is open to any user, while the favorites are open only to registered users.
  • the web page IDs stored in the selection bar are saved only temporarily. When the user does not use the Tesco search for a certain period of time, the corresponding selection bar is automatically cleared.
  • the favorites used by registered users can save the web page ID selected by the user for a long time, so that they can be called at any time in the future.
  • A notable feature of this Motobu search is that the user's search results must first be placed in the selection bar for "comparison", and only then does the user go from the selection bar to the web page where the target shopping object is located.
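The difference between the selection bar and the favorites can be sketched as two small stores of web page IDs. Everything below is an assumption made for illustration: the class names, the 30-minute idle timeout (the text only says "a certain period of time"), and the in-memory storage standing in for the cookie and the back-end temporary storage library.

```python
import time


class SelectionBar:
    """Temporary store of web page IDs, open to any user (hypothetical sketch)."""

    def __init__(self, ttl_seconds=1800.0, clock=time.time):
        self.ttl = ttl_seconds  # assumed idle timeout; the source gives no figure
        self.clock = clock
        self.page_ids = []
        self.last_used = clock()

    def _touch(self):
        # Clear the bar if it has been idle longer than the timeout.
        now = self.clock()
        if now - self.last_used > self.ttl:
            self.page_ids.clear()
        self.last_used = now

    def add(self, page_id):
        self._touch()
        if page_id not in self.page_ids:
            self.page_ids.append(page_id)

    def discard(self, page_id):
        self._touch()
        if page_id in self.page_ids:
            self.page_ids.remove(page_id)

    def items(self):
        self._touch()
        return list(self.page_ids)


class Favorites:
    """Long-term store of web page IDs, available to registered users only."""

    def __init__(self):
        self._by_user = {}

    def save(self, user_id, page_id, registered):
        if not registered:
            raise PermissionError("favorites are open to registered users only")
        self._by_user.setdefault(user_id, set()).add(page_id)

    def load(self, user_id):
        return sorted(self._by_user.get(user_id, set()))
```

Only IDs are stored in either place, which fits the description of adding or discarding candidate purchases by operating on web page IDs; the thumbnails for the "comparison" view would be looked up from those IDs.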
  • Figure 12 is an example of a display page where the user puts the initially selected purchase objects together for "comparison".
  • the thumbnail of the shopping object displayed during the "comparison” process is still captured and generated by the web crawler. Since the web crawler in the map search has a strong webpage thumbnail crawling capability, it is possible to realize arbitrary display of the shopping object thumbnails inside the cartographic search, so that the users can collectively put them together for "comparison". In the "comparison” process, the user still does not leave the platform provided by the Tesco search, thus avoiding the trouble that the existing shopping search engine needs to perform repeated page jumps when performing "comparison", which greatly simplifies the user's operation.
  • FIG. 13 is a diagram showing an example of a user entering the shop page where a shopping object is located. In this operation, the online shop page where the shopping object is located is displayed in a virtual floating manner and the direction of the target is controlled, so that the transition happens directly within the search engine results. The user does not feel they have left the Tesco search, which further improves the user's shopping experience.
  • used as a web shopping portal, the search only provides the online shopping navigation function and does not sell any merchandise itself. Therefore, after comparing and deciding on the object to be purchased, the user needs to follow the link through the map purchase search to the online store page where the shopping object is located in order to make the purchase.

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

Disclosed are a visualized search engine system and an implementation method and application thereof. In the visualized search engine system, a web crawler apparatus implements the search result page preview function cost-effectively and efficiently; a display control apparatus then concurrently displays the text abstract and thumbnail of each Internet search result on the displayed page, helping users accurately identify their desired content; finally, semantic analysis enables precise search, allowing the search engine to accurately provide the most desirable information from the database to the user. Also disclosed is a method for network shopping navigation based on the visualized search engine system.

Description

Visualized search engine system and implementation method and application thereof
Technical Field
The present invention relates to a visualized search engine system that displays Internet search results with both text and images, and also to a method by which the visualized search engine system implements display control of Internet search results and its application in online shopping navigation. It belongs to the field of Internet search technology.
Background Art
The Internet has become one of the main sources from which people obtain information. Search engines play an irreplaceable role in helping users quickly obtain the information they need from the boundless data of the Internet. A search engine is an information service system that collects information on the Internet according to a certain strategy using specific computer programs, organizes and processes the information, and displays the processed information to users, thereby providing them with a retrieval service. It does not actually search the web pages of the Internet; instead, it searches a pre-organized web page index database.
A vertical search engine provides information and related services of a certain value for a specific field, a specific group of people, or a specific need; it is a subdivision and extension of the general search engine. It integrates a certain category of specialized information in the web page index database, extracts the required data field by field in a targeted manner, processes the data, and returns it to the user in some form. The vertical search engine is characterized by being "specialized, refined, and deep" and has an industry flavor. Compared with the massive, disordered information of a general search engine, a vertical search engine is more focused, specific, and in-depth.
Vertical search engines broadly involve the following technologies: 1. search engine crawler: used to crawl relevant web pages on the Internet; 2. web page structured information extraction or metadata collection technology: used to extract structured data from web pages; 3. word segmentation and indexing: used to store and index data; 4. data presentation: since the stored data is not simple web page data, presentation needs to take industry needs into account.
In terms of data presentation, current mainstream search engines such as Google, Baidu, and Bing each have a different layout for their search result pages. But their similarities outnumber their differences; for example, they all display search results one by one in plain text. For each search result they show the web page title, followed by a summary description of the page. This layout can present more search results on one page, but because only a text summary is shown, a user may click on a search result based on the summary only to find that the page is far from what was wanted. The user then has to go back and click another search result, which makes for a poor user experience. For this reason, Google launched a visual preview function for search results in 2010, which lets users preview each page as a thumbnail directly in the search result list. The user sees a magnifying glass icon to the right of a search result and can click it to see a thumbnail preview of the page. The user can also scroll down to view previews of all search results. However, the hardware and software costs of achieving this effect are enormous.
There are also some technical means of implementing a page preview function, such as using a CGI program to capture the browser's image area and using the browser's drawing function to generate an image. However, the prior art offers no solution that uses a web crawler device to implement page preview. During operation, existing web crawler devices generally only analyze the content of a web page as a web page file and extract its content. Some web crawler devices go a step further and perform simple processing on that content, such as semantic annotation, to make sorting and ranking easier for the search engine. However, these web crawler devices generally lack page rendering capability, so they cannot conveniently implement a search result page preview function.
In terms of word segmentation and indexing, existing search engines query their own database systems with the keywords entered by the user and feed the retrieval results back to the user. The biggest problem in this process is that users do not know what keywords they should enter to accurately express the information they need to search for. The search service provider, for its part, must analyze and judge the information entered by the user and provide search information based on that judgment. As a result, the search service provider's judgment often fails to match the user's needs.
With the continuous development of web search technology, the concept of intelligent search has emerged. So-called intelligent retrieval uses word segmentation dictionaries, synonym dictionaries, and homophone dictionaries to improve retrieval results. It can further assist queries at the knowledge or concept level: through retrieval processing with subject dictionaries, hypernym and hyponym dictionaries, and related same-level dictionaries, it forms a knowledge system or concept network, gives users intelligent knowledge hints, and ultimately helps them obtain the best retrieval results. For example, when querying "computer" (计算机), information related to its synonym (电脑) can also be retrieved; the query can be narrowed to "microcomputer" or "server", broadened to "information technology", or extended to related categories such as "electronic technology", "software", and "computer applications". In addition, some existing search engines provide a so-called "association" function: they statistically analyze previous users' selections, make predictions from the analysis, and offer the most likely results for the user to choose from. But this does not actually solve the accuracy problem of web search, because statistical regularities exist for large populations, whereas for one particular search by one particular user, statistical regularities do not mean much.
Vertical search engines have many application directions, such as enterprise directory search, supply and demand information search, shopping search, real estate search, talent search, map search, mp3 search, and image search. Taking a shopping search engine as an example, the overall workflow is roughly as follows: after web pages are crawled, product information is extracted from them, including product name, price, and description; the information is then cleaned, deduplicated, classified, analyzed and compared, and mined; finally, user search is provided through a word segmentation index, and market reports are provided through analysis and mining.
However, existing search-engine-based online shopping navigation technologies share a common shortcoming: the experience from searching to viewing to purchasing is broken up by page jumps, and in the end users often cannot find their original purchase path and have to use the search engine to search again, wasting a great deal of time and effort.
Summary of the Invention
The first technical problem to be solved by the present invention is to provide a visualized search engine system. The visualized search engine system can display Internet search results with both text and images.
The second technical problem to be solved by the present invention is to provide a method by which the visualized search engine system implements display control of Internet search results.
The third technical problem to be solved by the present invention is to provide an online shopping navigation method implemented on the basis of the above visualized search engine system.
To achieve the above objects, the present invention adopts the following technical solutions:
A visualized search engine system, comprising a web crawler device, a display control device, and a semantic analysis device, characterized in that:
the web crawler device further comprises a plurality of information collectors, a page analyzer, a URL filter, a page filter, a URL manager, a picture generator, a URL library, and a page library; wherein
the information collectors are located at the bottom layer of the web crawler device and interact directly with the Internet to obtain Web pages; the page analyzer is connected to the information collectors; on the one hand it parses URLs with link tags out of the page content and passes them to the URL filter for parsing, and on the other hand it parses the page content into text format and passes it to the page filter for processing;
the URL filter filters URLs by restricting site scope and topic and then stores them in the URL library; the page filter performs redundancy detection on page content and then stores the checked pages in the page library;
the picture generator is connected to the URL library and generates a picture of the corresponding page for each URL stored in the URL library;
the display control device further comprises:
a text search result display unit, used to display text search results as a list;
a graphic search result display unit, used to display the web page thumbnails corresponding to the text search results;
a focus tracking unit, used to capture the text focus and/or graphic focus of the user's attention;
a focus webpage thumbnail display unit, used to display the graphic focus corresponding to the text focus selected by the user;
a synchronous display control unit, used to display the text focus and the graphic focus synchronously in the display page and to coordinate their changes by means of a bidirectional circular linked list with a head pointer; wherein
the text search result display unit is located at the middle of the left side of the display page, the focus webpage thumbnail display unit is located in the central area of the display page, and the graphic search result display units are located respectively at the upper right corner and lower right corner of the focus webpage thumbnail display unit;
the semantic analysis device further comprises:
an input word segmentation unit, used to accept the target information descriptor input by the user and perform a word segmentation operation on it;
a semantic judgment unit, used to judge whether the target information descriptor has complete semantics;
a reference vocabulary unit, used to provide the user with vocabulary associated with the target information descriptor when the descriptor does not have complete semantics;
a secondary input unit, used for the user to perform a second input, thereby determining the semantics of the target information descriptor, with subsequent retrieval performed according to those semantics.
A method by which the above visual search engine system implements display control of Internet search results, comprising a page rendering step, a display control step and a semantic analysis step, characterized in that:
The page rendering step includes the following sub-steps:
(1) generating the start tag of the Web page;
(2) rendering the content of the page template, wherein each time a tag is entered, the life-cycle stages of that tag are invoked in turn;
(3) rendering the body of the Web page;
(4) generating the end tag of the Web page;
(5) clearing the data;
The display control step includes the following sub-steps:
(6) in the display page, arranging the text search results and the corresponding web page thumbnails vertically in parallel, the central part of the display page being a focus display area that shows the graphic focus corresponding to the text focus selected by the user;
(7) displaying the text focus and the graphic focus synchronously in the display page and changing them in step by means of a bidirectional circular linked list with a head pointer, wherein the head pointer is used to determine the position of the text focus;
The semantic analysis step includes the following sub-steps:
(8) accepting a target information descriptor entered by the user and performing word segmentation on it;
(9) determining whether the target information descriptor has complete semantics;
(10) if so, proceeding directly to subsequent retrieval; if not, offering the user words associated with the target information descriptor;
(11) the user making a second input that determines the semantics of the target information descriptor, subsequent retrieval being performed according to those semantics.
A web crawler device with a page rendering function, characterized in that:
The web crawler device includes a plurality of information collectors, a page analyzer, a URL filter, a page filter, a URL manager, a picture generator, a URL library and a page library; wherein
The information collectors are located at the bottom layer of the web crawler device and interact directly with the Internet to obtain Web pages; the page analyzer is connected to the information collectors and, on the one hand, parses the URLs carrying link tags out of the page content and passes them to the URL filter, and, on the other hand, parses the page content into text format and passes it to the page filter;
The URL filter filters the URLs by restricting site scope and topic and then stores them in the URL library; the page filter performs redundancy detection on the page content and then stores the checked pages in the page library;
The picture generator is connected to the URL library and generates, for each URL stored in the URL library, a picture of the corresponding page.
Wherein, the information collector starts from an information source and downloads Web pages through HTTP requests, the page analyzer analyzes each page and extracts its links, and the information collector then accesses the network iteratively.
The information collector searches Web pages using a graph traversal algorithm.
The URL filter uses the semantic information of extended metadata to predict the topic relevance of the URLs extracted from Web pages, and prunes them on the principle that relevant links are collected and irrelevant links are discarded.
The URL manager, on the one hand, obtains a URL list from the URL library, orders the tasks and assigns them to the information collectors; on the other hand, it obtains new URL lists from the information collectors and saves them to the URL library.
A method by which a web crawler device implements a page rendering function, characterized by comprising the following steps:
(1) generating the start tag of the Web page;
(2) rendering the content of the page template, wherein each time a tag is entered, the life-cycle stages of that tag are invoked in turn;
(3) rendering the body of the Web page;
(4) generating the end tag of the Web page;
(5) clearing the data.
A method by which a web crawler device implements a page rendering function, characterized by comprising the following steps:
When an image tag is found to reference a picture, a request is sent to the server while rendering continues with the code that follows; once the server returns the picture file, the affected part of the code is re-rendered.
When a <script> tag containing JavaScript code is found, the statements are executed, the affected part of the code is re-rendered, and a picture is then generated from the rendered result.
A display control device for displaying search results with both text and pictures, characterized by comprising:
a text search result display unit, configured to display the text search results as a list;
a graphic search result display unit, configured to display the web page thumbnails corresponding to the text search results;
a focus tracking unit, configured to capture the text focus and/or graphic focus that the user is attending to;
a focus web page thumbnail display unit, configured to display the graphic focus corresponding to the text focus selected by the user;
a synchronous display control unit, configured to display the text focus and the graphic focus synchronously in the display page and to change them in step by means of a bidirectional circular linked list with a head pointer; wherein
The text search result display unit is located in the middle of the left side of the display page, the focus web page thumbnail display unit is located in the central area of the display page, and the graphic search result display units are located at the upper right and lower right corners of the focus web page thumbnail display unit.
In the display page of the display control device, the text search results and the corresponding web page thumbnails are arranged vertically in parallel.
The position of the head pointer in the bidirectional circular linked list corresponds to the position of the text focus among the text search results.
A display control method for displaying search results with both text and pictures, characterized in that:
In the display page, the text search results and the corresponding web page thumbnails are arranged vertically in parallel, and the central part of the display page is a focus display area that shows the graphic focus corresponding to the text focus selected by the user;
The text focus and the graphic focus are displayed synchronously in the display page and change in step by means of a bidirectional circular linked list with a head pointer, wherein the head pointer is used to determine the position of the text focus.
A method of achieving precise search using semantic analysis, characterized by comprising the following steps:
(1) accepting a target information descriptor entered by the user and performing word segmentation on it;
(2) determining whether the target information descriptor has complete semantics;
(3) if so, proceeding directly to subsequent retrieval; if not, offering the user words associated with the target information descriptor;
(4) the user making a second input that determines the semantics of the target information descriptor, subsequent retrieval being performed according to those semantics.
An online shopping navigation method, implemented on a visual search engine system that includes a web crawler device and a display control device, wherein the web crawler device is used to crawl web pages and generate web page thumbnails, characterized in that:
When the visual search engine system is used for online shopping navigation, the display control device first, according to the shopping-object keywords entered by the user, displays the text search results for the shopping objects on the left side of the search result page, displays the web page thumbnails corresponding to those text search results at the upper right and lower right corners of the search result page, and displays the focus web page thumbnail of the shopping object currently selected by the user in the central area of the search result page;
A selection bar is provided in the search result page; the user places the selected search results in the selection bar for comparison and then, from the selection bar, enters the web page of the chosen shopping object to make the purchase.
Preferably, in the selection bar, a web page ID is set for each target web page of a shopping object, and transit management is applied to the web page IDs.
Preferably, the selection bar temporarily stores the web page IDs, and items the user wants to buy are added or discarded by operating on the web page IDs.
Preferably, a favorites folder is also provided in the search result page; the favorites folder is open to registered users and stores the web page IDs selected by a registered user over the long term.
Preferably, during comparison, the shopping-object thumbnails crawled and generated by the web crawler device are gathered together for the user to choose from.
Preferably, after deciding which shopping object to buy, the user follows a link to the online shop page of that shopping object to make the purchase, and the online shop page is displayed in a floating manner.
The visual search engine system and implementation method provided by the present invention use the web crawler device to render web pages directly and to save the rendering results directly as pictures, which lays the technical foundation for a low-cost, efficient page preview function. The display page shows both the text summary and the web page thumbnail of each Internet search result, so that users can accurately identify the content they need. Semantic analysis is used to achieve precise search, so that the search engine can accurately provide from its database the information the user most wants. When the visual search engine system serves as a vertical search engine for online shopping, the "search", "view" and "compare" stages of online shopping are all completed inside the visual search engine system. This forms a complete online shopping navigation process and effectively improves the user's online shopping experience.
Brief Description of the Drawings
The present invention is described in further detail below with reference to the drawings and specific embodiments.
FIG. 1 is a schematic diagram of the overall architecture of the visual search engine system provided by the present invention;
FIG. 2 is a schematic diagram of the overall composition of the web crawler device in the visual search engine system;
FIG. 3 is a flowchart of how the web crawler device implements the basic functions of a web crawler;
FIG. 4 is a flowchart of how the web crawler device implements the page rendering function;
FIG. 5 is a schematic diagram of the display page of the display control device in the visual search engine system;
FIG. 6 is a schematic diagram of the bidirectional circular linked list with a head pointer used to implement the synchronous display control unit;
FIG. 7 is a schematic diagram of the correspondence between the initial state of the display page and the bidirectional circular linked list;
FIG. 8 is a schematic diagram of the correspondence between an intermediate state of the display page and the bidirectional circular linked list;
FIG. 9 is a flowchart of the method of achieving precise search using semantic analysis in the present invention;
FIG. 10 is an example of the home page of the visual search engine system used as an online shopping search engine;
FIG. 11 is an example of the search result page when a "search" is performed according to the information entered by the user;
FIG. 12 is an example of the display page on which the user gathers the preliminarily selected shopping objects together to "view" them for "comparison";
FIG. 13 is an example of the online shop page that the user enters after determining, by "comparison", the shopping object to be purchased.
Detailed Description
The visual search engine system provided by the present invention addresses three technical problems. The first is to implement a preview of search result pages at low cost and with high efficiency. The second is to show both the text summary and the web page thumbnail of each Internet search result on the display page, so that users can accurately identify the content they need. The third is to use semantic analysis to achieve precise search, so that the search engine can accurately provide from its database the information the user most wants. Each is described in detail below.
As shown in FIG. 1, the visual search engine system provides several service functions, including web page collection, web page sorting and indexing, page rendering, and query service. These functions are realized mainly through the cooperation of the web crawler device, the display control device and the semantic analysis device, as described below.
As shown in FIG. 2, the web crawler device in the visual search engine system consists mainly of the following parts:
1. Information collector
Each information collector is a web spider located at the bottom layer of the web crawler device. It is the interface through which the web crawler device interacts directly with the vast amount of information on the Internet (forums, blogs, WAP pages, documents, audio and video material, and so on). The role of the information collector is to obtain Web pages. It usually starts from an information source (such as a user query, a URL list or a particular page) and downloads Web pages through HTTP requests; the page analyzer analyzes each page and extracts its links, and the information collector then accesses the network iteratively. In a specific embodiment of the present invention, the information collector preferably searches Web pages with a graph traversal algorithm (such as a breadth-first or depth-first strategy).
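The breadth-first strategy mentioned above can be sketched as follows. This is a minimal illustration, not the patented implementation: `crawl_breadth_first`, `fetch_links` and the toy link graph `TOY_WEB` are hypothetical names, and `fetch_links` stands in for the combined work of an information collector and the page analyzer.

```python
from collections import deque


def crawl_breadth_first(seeds, fetch_links, max_pages=100):
    """Visit pages in breadth-first order, starting from the seed URLs.

    fetch_links is a hypothetical callable that returns the URLs found
    on a page; a real collector would download the page over HTTP.
    """
    visited = []            # pages in the order they were collected
    seen = set(seeds)       # URLs already queued, to avoid revisiting
    queue = deque(seeds)
    while queue and len(visited) < max_pages:
        url = queue.popleft()
        visited.append(url)
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited


# Toy link graph used instead of real network access (an assumption).
TOY_WEB = {"a": ["b", "c"], "b": ["d"], "c": ["d", "a"], "d": []}
order = crawl_breadth_first(["a"], lambda url: TOY_WEB.get(url, []))
print(order)
```

Replacing `queue.popleft()` with `queue.pop()` would turn the same loop into a depth-first traversal.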
To obtain the information in Web pages at high speed, the web crawler device applies multithreading to each information collector on top of the parallel mechanism. In general, each information collector can start hundreds of threads at the same time to collect page information. The URL manager manages the queue of URLs to be collected by interleaved access when assigning collection tasks to the information collectors. This ensures that at most one thread of any one information collector is connected to a given Web server, which prevents that Web server from becoming congested or even going down because of a sudden surge in requests.
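One way to read the interleaved access described above is to group the pending URLs by host and hand them out in batches that contain each host at most once, so that the threads working on one batch never share a Web server. The sketch below assumes that reading; the function name `batches_one_per_host` and the example URLs are illustrative and do not come from the patent.

```python
from collections import deque
from urllib.parse import urlsplit


def batches_one_per_host(urls):
    """Split URLs into batches in which every host appears at most once."""
    per_host = {}
    for url in urls:
        per_host.setdefault(urlsplit(url).netloc, deque()).append(url)
    batches = []
    while per_host:
        batch = []
        # Take one pending URL from each host that still has work.
        for host in list(per_host):
            batch.append(per_host[host].popleft())
            if not per_host[host]:
                del per_host[host]
        batches.append(batch)
    return batches


pending = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://a.example/3",
]
batches = batches_one_per_host(pending)
print(batches)
```

A collector that finishes one batch before starting the next would then never have two threads connected to the same server at once.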
2. Link (URL) filter
The URL library stores all the URLs extracted from the collected pages. To avoid "topic drift" in the collected pages, every URL must pass a topic relevance prediction before it enters the URL library. We use the semantic information of extended metadata (that is, HTML tag information such as anchors) to predict the topic relevance of the URLs extracted from the collected pages, and prune them on the principle that relevant links are collected and irrelevant links are discarded. This reduces the number of irrelevant pages the system collects, which greatly lowers operating cost and improves the speed and efficiency of topic information search. The link filter stores the links (URLs) predicted to point to topic-related pages in the URL library; the URL manager then assigns them, as URLs to be collected, to the information collectors, which collect the Web pages they point to.
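The pruning rule above can be illustrated with a deliberately simple relevance predictor. The patent does not specify the prediction model, so the keyword-overlap score, the threshold of 0.5 and the names `predict_relevance` and `prune_links` are all assumptions made for this sketch.

```python
def predict_relevance(anchor_text, topic_terms):
    """Score a link by the share of topic terms found in its anchor text."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)


def prune_links(links, topic_terms, threshold=0.5):
    """Keep links predicted to be on topic; discard the rest."""
    return [
        url
        for url, anchor_text in links
        if predict_relevance(anchor_text, topic_terms) >= threshold
    ]


topic = {"camera", "lens"}
extracted = [
    ("http://shop.example/dslr", "Digital camera and lens deals"),
    ("http://news.example/sports", "Football scores"),
    ("http://shop.example/tripod", "Camera tripod"),
]
kept = prune_links(extracted, topic)
print(kept)
```

Only the URLs in `kept` would be written to the URL library; the sports link is dropped before any page is downloaded for it.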
3. Page filter
To further improve the precision of the system, the collected pages must be judged for topic relevance, that is, page filtering. This is essentially a text topic classification process. Removing pages with low relevance (below a set threshold) improves the precision of the system. According to full information theory, natural language, as the "state of motion of things and the way it changes" expressed by a cognitive subject, has three aspects: form, meaning, and utility to the cognitive subject. These are called the syntactic information, semantic information and pragmatic information of a thing, and the three together are called "full information". Natural language text exhibits synonymy and polysemy, and Web text is a special carrier of natural language. Therefore, when judging whether a text is relevant to the collection topic of the system, we must consider not only the syntactic information of the text but also its semantic accuracy. On this basis, the page filter of the web crawler device draws on the traditional vector space model and filters page content with a concept-based vector space method: words are mapped to the concept level, and the relevance of the text is analyzed at the level of the concepts the words express, that is, the semantic level.
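A concept-based vector space comparison of this kind might look like the sketch below. The word-to-concept table `CONCEPTS`, the cosine measure and the 0.5 threshold are assumptions chosen for illustration; the patent does not state which concept dictionary or similarity measure is used.

```python
import math
from collections import Counter

# Hypothetical word-to-concept table; synonyms map to one concept.
CONCEPTS = {
    "laptop": "computer",
    "notebook": "computer",
    "cheap": "price",
    "discount": "price",
}


def concept_vector(text, concept_of):
    """Count concepts instead of raw words (unknown words map to themselves)."""
    return Counter(concept_of.get(word, word) for word in text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[key] * v[key] for key in u if key in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def is_on_topic(page_text, topic_text, concept_of, threshold=0.5):
    """Keep a page only if its concept vector is close enough to the topic."""
    page = concept_vector(page_text, concept_of)
    topic = concept_vector(topic_text, concept_of)
    return cosine(page, topic) >= threshold


print(is_on_topic("discount notebook", "cheap laptop", CONCEPTS))
print(is_on_topic("football match report", "cheap laptop", CONCEPTS))
```

The first page shares no literal word with the topic, yet it passes the filter because both texts map to the same concepts; a plain word-level vector space model would have rejected it.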
4. Page analyzer
The main function of the page analyzer is to parse the content of the fetched pages. Its work falls into two parts: one is to parse out the URLs carrying link tags and pass them to the URL filter, which extracts the links; the other is to parse the page content into text format and pass it to the page filter.
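The two outputs of the page analyzer can be sketched with Python's standard `html.parser` module. The class name `PageAnalyzer`, the helper `analyze` and the sample page are assumptions for illustration; the patent does not say which parser the device uses.

```python
from html.parser import HTMLParser


class PageAnalyzer(HTMLParser):
    """Collect link URLs and visible text from one HTML page."""

    def __init__(self):
        super().__init__()
        self.links = []        # destined for the URL filter
        self.text_parts = []   # destined for the page filter
        self._skip_depth = 0   # inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())


def analyze(html):
    """Return (links, text) for the given HTML source."""
    parser = PageAnalyzer()
    parser.feed(html)
    parser.close()
    return parser.links, " ".join(parser.text_parts)


SAMPLE = (
    "<html><head><title>Shop</title><style>p{}</style></head>"
    '<body><p>Hello <a href="http://example.com/x">link</a></p></body></html>'
)
links, text = analyze(SAMPLE)
print(links)
print(text)
```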
5. URL manager
The main function of the URL manager is to manage URL tasks. On the one hand, the URL manager obtains a URL list from the URL library, orders the tasks and assigns them to the information collectors. On the other hand, it obtains new URL lists from the information collectors and saves them to the URL library according to a given policy.
As shown in FIG. 3, when the web crawler device performs the basic crawler functions, the URL manager first starts the information collectors, which begin collecting Web pages and store the collected pages. The page analyzer then analyzes each page and obtains two parts: the tags and the page content. The tags are sent to the URL filter for parsing, while the page content is sent to the page filter, which performs content redundancy detection and stores the page in the page library. The URLs, after being filtered by the URL filter for site scope and topic, are sent to the URL library. The picture generator connected to the URL library then starts working and generates, for each URL stored in the URL library, a picture of the corresponding page. This is described in detail below.
First, the user enters a web address to send a request to the server, and the server returns a Web page in html format. The page parser begins loading the html source code. If it finds a <link> tag inside the <head> tag that references an external CSS file, it issues a request for the CSS file and the server returns that file. The page parser then continues loading the code in the <body> part of the html and begins rendering the page.
As shown in FIG. 4, the web crawler device implements the page rendering function through the following steps:
1. Rendering preparation
Performs the preparatory operations before rendering, such as initializing some data.
2. Generating the start tag
Generates the start tag of an html file.
3. Rendering the template
This step renders the content of the template. There are usually several tags to render at this stage. Each time a tag is entered, the life-cycle stages of that tag are invoked in turn. In other words, this step is the recursive entry from an upper-level tag into its lower-level tags, and the calling component continues with its subsequent stages only after the lower-level tags have finished rendering.
4. Rendering the body
Similar to rendering the template, this step also renders a piece of template content. For example, for the a tag (<a href="pagelink">this is body</a>), the body is the text "this is body".
5. Generating the end tag
This step generally generates an end tag, or controls the execution flow of nested tags.
6. Clearing the data
The other stages are not used often; they serve mainly to keep the life cycle complete.
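The six stages above, and in particular the recursion from an upper-level tag into its lower-level tags, can be sketched as follows. The `Tag` class and `render` function are hypothetical and greatly simplified (attributes are omitted); they only illustrate the order in which the stages run.

```python
class Tag:
    """A hypothetical template node: a tag name, body text and child tags."""

    def __init__(self, name, body="", children=()):
        self.name = name
        self.body = body
        self.children = list(children)


def render(tag, out=None):
    """Run the life-cycle stages for one tag, recursing into lower-level tags."""
    out = [] if out is None else out
    state = {"name": tag.name}            # 1. rendering preparation
    out.append(f"<{state['name']}>")      # 2. generate the start tag
    for child in tag.children:            # 3. render the template:
        render(child, out)                #    each child runs its own life cycle
    out.append(tag.body)                  # 4. render the body
    out.append(f"</{state['name']}>")     # 5. generate the end tag
    state.clear()                         # 6. clear the data
    return "".join(out)


page = Tag("html", children=[Tag("body", children=[Tag("a", body="this is body")])])
markup = render(page)
print(markup)
```

The outer tag emits its end tag only after every nested `render` call has returned, which mirrors the rule that the calling component resumes only when the lower-level tags have finished rendering.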
Note that when an <img> tag is found to reference a picture, a request is sent to the server. Rendering does not wait for the picture to finish downloading but continues with the code that follows. When the server returns the picture file, the picture occupies a certain area and affects the layout of the later paragraphs, so that part of the code must be re-rendered. When a <script> tag containing JavaScript code is found, the statements are executed and the part of the page code handled during JavaScript execution is re-rendered. The picture generator then generates a picture from the rendered result.
The present invention has been described above using Web pages in html format as an example, but the web crawler device with a page rendering function provided by the present invention is not limited to pages in html format; Web pages in other formats can also be processed directly.
With the web crawler device in the visual search engine system, a search by web page address reveals not only the basic content of the page but, more importantly, its basic appearance, so that the user learns more about the content of the whole page.
In the present invention, the Internet search results to be displayed comprise two types of data, namely text search result data and the corresponding web page thumbnail data, rather than a single type of text data or graphic data. To show as many search results as possible on one display page while keeping effective control over both kinds of data and reflecting the association between them, the display control device in the visual search engine system adopts the layout shown in FIG. 5: the text display area and the graphic display area are arranged vertically in parallel, and a focus display area is set in the central part of the display page. The selected text focus and the corresponding graphic focus are placed together on the same horizontal line. This layout takes into account that text is read from left to right; to follow people's reading habits, associated content (that is, the text focus and the graphic focus that correspond to each other) must be laid out from left to right on the same horizontal line.
As shown in FIG. 5, the display control device in the visual search engine system shows both the text summaries of the search results (the text search results) and the corresponding web page thumbnails on the display page. To achieve a good display effect, the display control device includes at least three display function units: the text search result display unit, the focus web page thumbnail display unit and the graphic search result display unit. The text search result display unit is located in the middle of the left side of the display page, and the focus web page thumbnail display unit is located in the central area of the display page. There may be several graphic search result display units, located at the upper right and lower right corners of the focus web page thumbnail display unit (other positions are also possible).
In the text search result display unit, the text results of a web search can be shown as a list. In the embodiment of FIG. 5, for example, text search results 1 to 5 are listed. When the display control device is a computer display, the user can use the mouse to select among these text search results, for example selecting text search result 3 as the result of interest, and can then click its link. To keep the user from clicking a text search result only to find that the page is far from what was wanted, the display control device links the content shown in the text search result display unit with that shown in the focus web page thumbnail display unit and the graphic search result display units. The web page corresponding to the text search result selected by the user (the text focus) is associated with the focus web page thumbnail display unit, and the web pages corresponding to the other text search results are associated with the graphic search result display units. In other words, the focus web page thumbnail display unit always shows the web page thumbnail (the graphic focus) corresponding to the text search result selected by the user (the text focus), while the graphic search result display units show the thumbnails corresponding to the other, unselected text search results. The focus web page thumbnail display unit and the graphic search result display units can display web page thumbnails using techniques such as a web crawler.
In this display control device, the focus web page thumbnail display unit occupies a large display area and always sits in the central area of the display page. The thumbnail of the page corresponding to the selected text search result is therefore shown clearly and in full, which helps the user decide whether to click through. In the visual preview of search results offered by Google, by contrast, the thumbnail appears only to the right of the search result and occupies a small area, so the user can hardly make out its content and cannot easily judge whether to click further.
The specific method by which the display control device presents Internet search results with both text and images is described next. Once the positions of the text focus and/or graphic focus have been arranged, the more important task is synchronized control of the text stream, the picture stream, and the display focus: when any one of them is adjusted, the related data streams must be adjusted in step.
In this display control device, the content of the focus web page thumbnail display unit can change. When the user's focus of attention (reflected by where the user's mouse pointer rests) changes, the text focus and the corresponding graphic focus change with it, producing an effective focus switch.
To this end, the display control device includes a focus tracking unit, which captures the text focus and/or graphic focus the user is attending to, and a synchronous display control unit, which lets the user freely adjust the content of the focus web page thumbnail display unit and keeps the text focus and the graphic focus displayed and changing in step on the display page.
The synchronous display control unit can use a doubly linked circular list with a head pointer L, as shown in FIG. 6, to synchronize the three kinds of data (text stream, picture stream, and display focus). The list is circular: whenever the user adjusts or clicks on the text stream, the picture stream, or the display focus, the control list is adjusted accordingly, and that adjustment drives the other data streams, so the whole display stays synchronized. The list must also be doubly linked, so that the display focus can move through the text stream from top to bottom or from bottom to top; the same holds for the picture stream. Finally, the list has a head pointer L, which identifies the focus position. When the position of L changes, the focus content as a whole changes, and the displayed text focus and graphic focus change with it.
FIG. 7 shows the correspondence between the initial state of the display page and the doubly linked circular list. In the initial state, the text focus and the graphic focus are both at the top of the display page, and the head pointer L is at the leftmost position of the list. FIG. 8 shows the correspondence between an intermediate state of the display page and the list. Here the text focus and the graphic focus are both in the middle of the display page, and L is likewise at a middle position. The doubly linked circular list acts as a controller that drives both the text stream and the picture stream. It supplies them with the parameters that identify what should currently be in focus or what needs to change. The number of nodes in the list equals the number of search result pairs to display, and each node corresponds to one pair (a text search result and its web page thumbnail). The focus is on whichever pair L points to. The thumbnail of that pair is shown by the focus web page thumbnail display unit, and its text search result occupies the focus position in the text search result display unit. The display positions of the other search results are adjusted accordingly. As the text focus and the graphic focus move, the position of L changes with them, but the association between the text stream and the picture stream is fixed by the list, so the display stays synchronized.
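The controller described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the implementation of the invention: the class names `Node` and `FocusRing`, the method names, and the sample thumbnail file names are all hypothetical, and the list is assumed to hold at least one result pair.

```python
class Node:
    """One search result pair: a text result and its page thumbnail."""

    def __init__(self, text, thumbnail):
        self.text = text
        self.thumbnail = thumbnail
        self.prev = self  # a single node is linked to itself
        self.next = self


class FocusRing:
    """Doubly linked circular list whose head pointer L marks the focus."""

    def __init__(self, pairs):
        self.head = None  # plays the role of the head pointer L
        self.size = 0
        for text, thumbnail in pairs:
            self._append(Node(text, thumbnail))

    def _append(self, node):
        if self.head is None:
            self.head = node
        else:
            tail = self.head.prev
            tail.next = node
            node.prev = tail
            node.next = self.head
            self.head.prev = node
        self.size += 1

    def move(self, steps):
        """Move L forward (steps > 0) or backward (steps < 0), wrapping around."""
        for _ in range(abs(steps)):
            self.head = self.head.next if steps > 0 else self.head.prev

    def focus_on(self, text):
        """Move L to the node whose text result the user clicked."""
        node = self.head
        for _ in range(self.size):
            if node.text == text:
                self.head = node
                return True
            node = node.next
        return False

    def view(self):
        """Return (text focus, graphic focus, remaining thumbnails in ring order)."""
        others = []
        node = self.head.next
        while node is not self.head:
            others.append(node.thumbnail)
            node = node.next
        return self.head.text, self.head.thumbnail, others


ring = FocusRing([(f"result {i}", f"thumb{i}.png") for i in range(1, 6)])
ring.focus_on("result 3")   # the user clicks text search result 3
print(ring.view())
ring.move(-1)               # the focus moves one result upward
print(ring.view()[0])
```

Because each node carries both halves of a pair, moving the single pointer L changes the text focus and the graphic focus together, which is the synchronization property the description relies on.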
In addition, the invention changes the "input, segmentation, retrieval" processing used by existing search engines into "input, segmentation, semantic judgment, retrieval". After word segmentation, a semantic judgment determines whether the input carries definite semantics. If it does, retrieval proceeds directly. If it does not, the user is offered words associated with the input. The user then makes a second input (a choice among the associated words), so that the true meaning of the input can be determined accurately and precise search results obtained on that basis.
As shown in FIG. 9, this search method in most cases takes the form of two input operations. After the user enters the most important descriptor of the target information (the first input operation), the associated words are retrieved and offered to the user, who chooses among them (the second input operation). This pins down the specific search target, so the search engine can accurately return from its database the information the user most wants. Specifically, the method adds a "meta-vocabulary association database" to the search engine and encourages the user to enter the word that best represents the search target, that is, the most important descriptor of the information needed. After the search engine accepts the input and applies maximum-matching word segmentation, it judges whether the input has complete semantics. If so, it carries out the search directly. If not, it performs an association analysis of the input word in the meta-vocabulary association database and, based on the result, offers the user a set of choices through which the user can describe the target information more precisely. The choices cover all the information related to the first input word, so the search engine captures the user's real intent accurately and can deliver the desired results quickly and at low cost.
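The two-input flow can be sketched as follows. The sketch rests on simplifying assumptions that are not part of the source: whitespace splitting stands in for word segmentation, "complete semantics" is approximated by the input containing at least two words, and the small `ASSOCIATIONS` table is a hypothetical stand-in for the meta-vocabulary association database. All function names are illustrative.

```python
# Hypothetical miniature of the meta-vocabulary association database.
ASSOCIATIONS = {
    "car": ["vehicle type", "brand", "manufacturer", "dealer",
            "repair", "performance", "pictures"],
}


def segment(query):
    """Stand-in for word segmentation: split on whitespace (an assumption)."""
    return query.split()


def has_complete_semantics(words):
    """Toy rule (an assumption): a single word is not a complete query."""
    return len(words) >= 2


def first_input(query):
    """Handle the first input: either search directly or offer choices."""
    words = segment(query)
    if has_complete_semantics(words):
        return "search", words
    options = ASSOCIATIONS.get(words[0], []) if words else []
    if not options:
        return "search", words  # nothing to offer, so search as entered
    return "choose", options


def second_input(first_word, choice):
    """Combine the first word with the user's choice into the final query."""
    return [first_word, choice]


print(first_input("car"))
print(first_input("car repair"))
print(second_input("car", "repair"))
```

A single word such as "car" yields a list of choices, while an input that already names two parts goes straight to retrieval.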
Note that this search method does not simply split the existing one-step search into an optional two-step search. It abandons the usual one-step search and also rejects multi-step or open-ended questioning. This choice rests on the following two findings:
1. For the meta-vocabulary, a fairly widely accepted basic set is on the order of one million words, and in a one-step search a vocabulary of that size is not enough to keep up with newly emerging terms. A two-step search, however, can in theory express a vocabulary space on the order of one million squared, that is, on the order of one trillion (10^6 × 10^6 = 10^12). That should be enough to express all possible metadata in the existing information space. Moreover, the size of the association database can be limited or refined according to practical needs, which reduces the search engine's computational overhead and lowers cost.
2. Implementing this search method relies on semantic analysis. A meaning is formed from two parts, the "ontology" (the entity) and the "behavior". A meaningful semantic unit arises only when both parts are present and associated (this also covers other cases, such as two parts that are both definite ontologies). The two-step search therefore first judges whether the user's input forms complete semantics. If it does not, the first step is aimed at determining the ontology and the second step at determining the behavior. Two steps thus make up one complete and effective "semantic search" that gives the user information suited to the need.
In this search method, the meta-vocabulary association database is not represented as the one-dimensional data of a simple relational database but as a multi-dimensional matrix of associated words. Specifically, every meta-word is given a uniform code, and that code serves as its representation. The coding scheme can use any of several existing technologies, such as XML, N3, or triples, and is not detailed here. Once the words are coded, association analysis between words yields the association information for each word: for a meta-word S, its associated words are stored as S{ci, dj}, where ci is the first-level classification and dj the second-level classification. The designer can define ci and dj as needed, for example classifying ci by field of knowledge and dj by what people use the word for. This two-level classification gives each word a library of associated words. For the word "car", for example, the first-level classification could be (vehicle type, brand, manufacturer, dealer, repair, performance, pictures), and the second-level classification under "vehicle type" would include (sedan, truck, van). The complete representation is:
S { vehicle type: sedan, truck, van
[remaining rows of the matrix, one per first-level category, are not recoverable from the source]
}
When a user of a search engine that adopts this method first enters the word "car", the engine looks up S = "car", retrieves the associated vocabulary matrix of "car", and presents the text of the first-level classification ci, that is, the associated words "vehicle type, brand, manufacturer, dealer, repair, performance, pictures". If the user then chooses "vehicle type" at the second confirmation, the engine searches all texts that contain words such as "sedan", "truck", "van", and "vehicle type", and so retrieves the content the user actually needs.
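One way to picture the two-level structure S{ci, dj} is a nested mapping from a meta-word to its first-level categories and from each category to its second-level words. The sketch below is an assumption for illustration only: the source does not prescribe this data structure, `VOCAB_MATRIX` and both function names are hypothetical, and only the second-level words for "vehicle type" come from the text, so the other rows are left empty rather than invented.

```python
# Hypothetical in-memory form of the associated vocabulary matrix S{ci, dj}.
VOCAB_MATRIX = {
    "car": {
        "vehicle type": ["sedan", "truck", "van"],
        # Second-level words for the categories below are not given in the
        # text, so they are left empty here instead of being made up.
        "brand": [],
        "manufacturer": [],
        "dealer": [],
        "repair": [],
        "performance": [],
        "pictures": [],
    },
}


def first_level(word):
    """Return the first-level categories (ci) offered after the first input."""
    return list(VOCAB_MATRIX.get(word, {}))


def expand_query(word, category):
    """Return the terms searched once the user has chosen a category:
    the category itself plus its second-level words (dj)."""
    second_level = VOCAB_MATRIX.get(word, {}).get(category, [])
    return [category, *second_level]


print(first_level("car"))
print(expand_query("car", "vehicle type"))
```

The second call returns the category together with "sedan", "truck", and "van", matching the retrieval step described for the "car" example.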
In another embodiment of the invention, a user at home (say on Xueyuan Road, Haidian District, Beijing) opens this search engine over the Internet to find the branch of ICBC (工商银行, the Industrial and Commercial Bank of China) nearest home, and types "ICBC" into the search box. An ordinary search engine would immediately list, in its own ranking order, the text on the Internet that relates to the word "ICBC". The user would have to pick, from at least dozens of pages of results, the pages that might contain the needed information, click through, and look for it there. With a search engine that uses this method, when the user enters "ICBC", a maximum-matching segmentation algorithm (forward maximum matching, reverse maximum matching, or maximum-probability segmentation, for example) first establishes that the input is a single word. Applying the complete-semantics test, the engine then judges that the input is not a semantically complete query but just one word, and infers that the user wants information related to ICBC. (This inference assumes the input was deliberate rather than accidental, so it can be made simply from the fact that a word was entered.) Knowing this purpose, the engine can list from its meta-vocabulary association database more or less all the words related to "ICBC", ordered by their degree of association with it, such as "ICBC branches", "ICBC business hours", and "ICBC online banking". Alternatively it can list only the most strongly associated words, such as "stock", "company profile", "latest news", "address", "service process", and "corporate culture". Faced with these associated words, the user can choose "address". From the user's IP address, the engine can then determine that what the user is really looking for is "the address of the ICBC branch nearest the location of this IP address". It can search its own database precisely and show the user pages matching "Xueyuan Road ICBC address" directly, so the user quickly obtains the information actually needed.
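Forward maximum matching, one of the segmentation algorithms named above, can be sketched in a few lines. This is a generic textbook version, not the engine's own code: the function name and the tiny lexicon are hypothetical, and a real lexicon would be far larger. The example uses 工商银行, the Chinese term rendered as "ICBC".

```python
def forward_max_match(text, lexicon, max_len=None):
    """Segment text greedily from left to right, always taking the longest
    lexicon entry that starts at the current position. A character that
    starts no lexicon entry is emitted on its own."""
    if max_len is None:
        max_len = max((len(word) for word in lexicon), default=1)
    max_len = max(1, max_len)  # guarantee progress on every iteration
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


# Hypothetical miniature lexicon for illustration.
lexicon = {"工商", "银行", "工商银行", "地址"}

print(forward_max_match("工商银行", lexicon))
print(forward_max_match("工商银行的地址", lexicon))
```

Because the four-character entry is tried before the two-character ones, 工商银行 comes out as a single word, which is the outcome the embodiment depends on before the complete-semantics test is applied.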
When the visual search engine system serves as a vertical search engine for online shopping (referred to below as Tugou Search), the home page of the shopping navigation site it provides is as shown in FIG. 10. On the Tugou Search home page, the search box is at the top of the page with the prominent "Tugou" (图购) logo to its left. Below it is a row of common search shortcuts, including "Latest", "Recommended", "Cosmetics", "Group Buying", "General Shopping", "Discounts", "Digital and Home Appliances", "Women's Fashion", "Mother, Baby and Children", and "Clothing and Accessories". Below the search box and shortcuts is a themed selection made up of web page thumbnails, all of which are captured and generated by the web crawler device of the visual search engine system.
Using Tugou Search follows the habits of ordinary users closely. The steps are: search, filter, compare, filter again (this step may be omitted), and finally go to the online shop page of the chosen item to buy it. These steps are much the same as with other shopping search engines. With existing shopping search engines, however, "viewing" and "comparing" usually require leaving the search engine's site, which is inconvenient, and after a complicated series of page jumps users often cannot find their way back to the original search page. To solve this problem, Tugou Search has the web crawler device and the display control device work together so that the "search", "view", and "compare" steps of online shopping are all completed inside the visual search engine system. This forms a complete shopping navigation process and greatly improves the user's shopping experience.
FIG. 11 is an example of the search result page produced when a "search" is performed on the user's input. In this example the user has entered the shopping keyword "car" in the search box. The text search results related to "car" are shown on the left of the display page, and the web page thumbnails corresponding to the text search results are shown at the upper right and lower right. The focus web page thumbnail of the item the user has currently selected is in the central area of the page. The basic layout of the page in FIG. 11 is determined by the display control device of the visual search engine system and is therefore very similar to the display page in FIG. 5.
Because the display page in FIG. 11 clearly shows the thumbnail of the page on which an item appears, the user can "view" the item without clicking through to that page. The "view" step is completed entirely within Tugou Search, which greatly simplifies the user's work. The user can also see the item and related information such as its price using Tugou Search alone, and selects items by selecting pages. Items are thus found at the search engine level, which strengthens Tugou Search's role as a shopping navigator.
To make selection and comparison easier, the search result page in FIG. 11 includes a selection bar and a favorites folder. In the selection bar, each target page for a shopping item is assigned a page ID, and this page ID is used as the handle while the item is being processed. The selection bar can use a cookie and a back-end temporary selection store to hold the temporarily saved page IDs, and items are added to or dropped from the selection by operating on those IDs.
A shopping search in Tugou Search returns many results. The user must first put results into the selection bar in order to "compare" them, and then goes from the selection bar to the page of the chosen item. The selection bar is open to every user, while the favorites folder is open only to registered users. Page IDs in the selection bar are saved only temporarily: when the user has not used Tugou Search for some time, the selection bar is cleared automatically. The favorites folder of a registered user keeps the selected page IDs for the long term, so that they can be recalled at any time.
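The difference between the temporary selection bar and the persistent favorites folder can be sketched as follows. Everything here is an assumption made for illustration: the class names, the 30-minute idle timeout, and the in-memory storage are hypothetical, whereas the text itself only says that a cookie and a back-end temporary store hold the page IDs and that the bar is cleared after a period of inactivity.

```python
import time


class SelectionBar:
    """Temporary list of page IDs that is cleared after a period of inactivity."""

    def __init__(self, ttl_seconds=1800, clock=time.monotonic):
        self.ttl = ttl_seconds  # idle time before clearing (assumed value)
        self.clock = clock
        self.page_ids = []
        self.last_used = clock()

    def _touch(self):
        """Clear the bar if it has been idle too long, then record this use."""
        now = self.clock()
        if now - self.last_used > self.ttl:
            self.page_ids.clear()
        self.last_used = now

    def add(self, page_id):
        self._touch()
        if page_id not in self.page_ids:
            self.page_ids.append(page_id)

    def discard(self, page_id):
        self._touch()
        if page_id in self.page_ids:
            self.page_ids.remove(page_id)

    def items(self):
        self._touch()
        return list(self.page_ids)


class Favorites:
    """Long-term store of page IDs, keyed by registered user."""

    def __init__(self):
        self.by_user = {}

    def save(self, user_id, page_id):
        saved = self.by_user.setdefault(user_id, [])
        if page_id not in saved:
            saved.append(page_id)

    def items(self, user_id):
        return list(self.by_user.get(user_id, []))


bar = SelectionBar()
bar.add("page-101")
bar.add("page-205")
bar.discard("page-101")
print(bar.items())
```

Operating on page IDs rather than on page content keeps both stores small, which fits the description of the page ID as the handle used throughout item processing.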
A notable feature of Tugou Search is that search results must first be put into the selection bar for "comparison" before the user goes from the selection bar to the page of the chosen item. FIG. 12 is an example of the display page on which the user "views" the initially selected items gathered together in order to "compare" them. The item thumbnails shown during comparison are again captured and generated by the web crawler device. Because the crawler device in Tugou Search is highly capable of capturing web page thumbnails, item thumbnails can be displayed in any arrangement inside Tugou Search, letting the user gather them for comparison. Throughout the comparison the user never leaves the Tugou Search platform, which avoids the repeated page jumps that existing shopping search engines require and greatly simplifies the user's work.
As an online shopping portal, Tugou Search provides shopping navigation only and sells no goods itself. Once the user has compared items and decided what to buy, the user follows the link provided by Tugou Search to the online shop page of that item to make the purchase. FIG. 13 is an example of the user entering the shop page of an item. In this step the shop page is displayed as a floating layer, and the navigation target is controlled so that the switch takes place within the search results. The user thus does not feel that they have left Tugou Search, which further improves the shopping experience.
The visual search engine system provided by the invention, its implementation method, and its application have been described in detail above. Any obvious modification made to it by a person skilled in the art without departing from the essential spirit of the invention will constitute an infringement of the patent right of the invention and will incur the corresponding legal liability.

Claims

1. A visual search engine system comprising a web crawler device, a display control device, and a semantic analysis device, characterized in that:
the web crawler device further comprises a plurality of information collectors, a page analyzer, a URL filter, a page filter, a URL manager, a picture generator, a URL library, and a page library; wherein
the information collectors are located at the bottom layer of the web crawler device and interact directly with the Internet to obtain web pages; the page analyzer is connected to the information collectors and, on the one hand, parses URLs carrying link tags out of the page content and passes them to the URL filter and, on the other hand, parses the page content into text format and passes it to the page filter;
the URL filter filters the URLs by restricting site scope and topic and then stores them in the URL library; the page filter performs redundancy detection on the page content and then stores the checked pages in the page library;
the picture generator is connected to the URL library and generates, for each URL stored in the URL library, a picture of the corresponding page;
the display control device further comprises:
a text search result display unit for displaying text search results as a list;
a graphic search result display unit for displaying the web page thumbnails corresponding to the text search results;
a focus tracking unit for capturing the text focus and/or graphic focus the user is attending to;
a focus web page thumbnail display unit for displaying the graphic focus corresponding to the text focus selected by the user;
a synchronous display control unit for displaying the text focus and the graphic focus synchronously on the display page and for coordinating their changes by means of a doubly linked circular list with a head pointer; wherein
the text search result display unit is located at the middle left of the display page, the focus web page thumbnail display unit is located in the central area of the display page, and the graphic search result display units are located at the upper right and lower right of the focus web page thumbnail display unit;
the semantic analysis device further comprises:
an input segmentation unit for accepting a target information descriptor entered by the user and segmenting it into words;
a semantic judgment unit for judging whether the target information descriptor has complete semantics;
a reference vocabulary unit for providing the user with words associated with the target information descriptor when the descriptor does not have complete semantics;
a secondary input unit for receiving a second input from the user, thereby determining the semantics of the target information descriptor so that subsequent retrieval is performed according to those semantics.
2. A method by which the visual search engine system of claim 1 controls the display of Internet search results, comprising a page rendering step, a display control step, and a semantic analysis step, characterized in that:
the page rendering step comprises the following sub-steps:
(1) generating the start tag of the web page;
(2) rendering the content of the page template, wherein each time a tag is entered, the life-cycle phases of that tag are invoked in turn;
(3) rendering the body of the web page;
(4) generating the end tag of the web page;
(5) clearing the data;
the display control step comprises the following sub-steps:
(6) on the display page, arranging the text search results and the corresponding web page thumbnails vertically in parallel, the central part of the display page being a focus display area that shows the graphic focus corresponding to the text focus selected by the user;
(7) displaying the text focus and the graphic focus synchronously on the display page and coordinating their changes by means of a doubly linked circular list with a head pointer, wherein the head pointer is used to determine the position of the text focus;
the semantic analysis step comprises the following sub-steps:
(8) accepting a target information descriptor entered by the user and segmenting it into words;
(9) judging whether the target information descriptor has complete semantics;
(10) if so, performing the subsequent retrieval directly; if not, providing the user with words associated with the target information descriptor;
(11) the user making a second input, thereby determining the semantics of the target information descriptor, and performing the subsequent retrieval according to those semantics.
3. A web crawler device with a page rendering function, characterized in that:
The web crawler device comprises a plurality of information collectors, a page analyzer, a URL filter, a page filter, a URL manager, a picture generator, a URL library, and a page library; wherein
the information collectors are located at the bottom layer of the web crawler device and interact directly with the Internet to obtain Web pages; the page analyzer is connected to the information collectors and, on the one hand, parses the URLs carrying link tags out of the page content and hands them to the URL filter, and, on the other hand, parses the page content into text format and hands it to the page filter for processing;
the URL filter filters the URLs by restricting site scope and topic and then stores them in the URL library; the page filter performs redundancy detection on the page content and then stores the checked pages in the page library;
the picture generator is connected to the URL library and generates, for each URL stored in the URL library, a picture of the corresponding page.
4. The web crawler device according to claim 3, characterized in that:
the information collector starts from an information source and downloads Web pages through HTTP requests; the page analyzer analyzes each page and extracts its links; and the information collector then continues to access the network iteratively.
5. The web crawler device according to claim 3 or 4, characterized in that:
the information collector searches Web pages using a graph traversal algorithm.
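The iterative link following of claims 4 and 5 can be sketched as a graph traversal. The claims do not say which traversal is used, so breadth-first order is an assumption here, and the in-memory `LINKS` table, the `crawl` function, and the `example.com` URLs are hypothetical stand-ins for real HTTP downloads and link extraction.

```python
from collections import deque

# Hypothetical in-memory "web": each URL maps to the links found on that page.
LINKS = {
    "http://example.com/": ["http://example.com/a", "http://example.com/b"],
    "http://example.com/a": ["http://example.com/b", "http://example.com/c"],
    "http://example.com/b": ["http://example.com/"],
    "http://example.com/c": [],
}


def crawl(seed, fetch_links, max_pages=100):
    """Visit pages breadth-first from `seed`; return URLs in visit order."""
    order = []
    seen = {seed}
    queue = deque([seed])
    while queue and len(order) < max_pages:
        url = queue.popleft()
        order.append(url)              # stands in for downloading the page
        for link in fetch_links(url):  # stands in for the page analyzer
            if link not in seen:       # never schedule the same URL twice
                seen.add(link)
                queue.append(link)
    return order


visit_order = crawl("http://example.com/", lambda url: LINKS.get(url, []))
```

Swapping the `deque` for a stack would give depth-first order; the `seen` set is what keeps the traversal from looping on cyclic links.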
6. The web crawler device according to claim 3, characterized in that:
the URL filter uses the semantic information of extended metadata to predict the topic relevance of the URLs extracted from Web pages, and prunes them on the principle that relevant links are collected and irrelevant links are discarded directly.
7. The web crawler device according to claim 3, characterized in that:
the URL manager, on the one hand, obtains a URL list from the URL library, orders the tasks, and distributes them to the plurality of information collectors; on the other hand, it obtains new URL lists from the plurality of information collectors and saves these lists to the URL library.
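As a rough illustration of claims 6 and 7, the sketch below prunes candidate URLs and then distributes the survivors to collectors. The claims rely on semantic information from extended metadata to predict relevance; the keyword-in-path test used here is only a crude stand-in for that prediction, and `url_filter`, `assign`, the host names, and the round-robin policy are all assumptions.

```python
from urllib.parse import urlparse


def url_filter(urls, allowed_hosts, topic_words):
    """Keep URLs that are inside the site scope and look topic-relevant."""
    kept = []
    for url in urls:
        parts = urlparse(url)
        if parts.hostname not in allowed_hosts:
            continue  # outside the permitted site scope
        if not any(word in parts.path.lower() for word in topic_words):
            continue  # judged irrelevant: pruned, never collected
        kept.append(url)
    return kept


def assign(urls, n_collectors):
    """Distribute URLs to collectors round-robin (one possible task ordering)."""
    batches = [[] for _ in range(n_collectors)]
    for i, url in enumerate(urls):
        batches[i % n_collectors].append(url)
    return batches


candidates = [
    "http://shop.example.com/camera/1",
    "http://shop.example.com/about",
    "http://other.example.org/camera/2",
    "http://shop.example.com/lens/3",
]
kept = url_filter(candidates, {"shop.example.com"}, ["camera", "lens"])
batches = assign(kept, 2)
```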
8. A method for implementing a page rendering function with the web crawler device according to claim 3, characterized by comprising the following steps:
(1) generating the start tag of the Web page;
(2) rendering the content of the page template, wherein each time a tag is entered, the life-cycle phases of that tag are invoked in sequence;
(3) rendering the body of the Web page;
(4) generating the end tag of the Web page;
(5) clearing the data.
9. The method for implementing a page rendering function with a web crawler device according to claim 8, characterized in that:
in step (2), invoking the life-cycle phases of the tag means recursive entry from an upper-level tag into a lower-level tag; the calling component continues with its subsequent phases only after the lower-level tag has finished rendering.
10. The method for implementing a page rendering function with a web crawler device according to claim 8, characterized in that:
in step (4), the operation of generating the end tag is replaced by an operation that controls the execution flow of the embedded tags.
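The recursive life cycle in claims 8 and 9 can be pictured with a small tag tree. This is a minimal sketch under stated assumptions: the `Tag` class, the `render` and `render_page` functions, and the `begin`/`end` phase names are hypothetical, and real templates would have more phases than the two logged here.

```python
from dataclasses import dataclass, field
from html import escape
from typing import List


@dataclass
class Tag:
    name: str
    text: str = ""
    children: List["Tag"] = field(default_factory=list)


def render(tag, out, log):
    """Run one tag's phases; children finish before the parent continues."""
    log.append(f"begin:{tag.name}")
    out.append(f"<{tag.name}>")        # start tag
    if tag.text:
        out.append(escape(tag.text))   # body text of this tag
    for child in tag.children:
        render(child, out, log)        # recursive entry into the lower-level tag
    out.append(f"</{tag.name}>")       # end tag, only after all children are done
    log.append(f"end:{tag.name}")


def render_page(root):
    out, log = [], []
    render(root, out, log)
    markup = "".join(out)
    out.clear()                        # step (5): clear the working data
    return markup, log


page = Tag("html", children=[Tag("body", children=[Tag("p", "hello")])])
markup, phases = render_page(page)
```

The `phases` log shows the nesting described in claim 9: every `end` entry for an outer tag appears after the `end` entries of the tags inside it.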
11. A method for implementing a page rendering function with a web crawler device according to claim 8, characterized by comprising the following steps:
when an image tag is found to reference an image, sending a request to the server; continuing meanwhile to render the subsequent code; and, once the server returns the image file, re-rendering that part of the code.
12. The method for implementing a page rendering function with a web crawler device according to claim 11, characterized in that:
when a <script> tag containing JavaScript code is found, executing its statements, re-rendering part of the code, and then generating a picture from the rendered result.
13. A display control device for displaying search results with both text and graphics, characterized by comprising:
a text search result display unit, configured to display text search results as a list;
a graphic search result display unit, configured to display the webpage thumbnails corresponding to the text search results;
a focus tracking unit, configured to capture the text focus and/or graphic focus the user is attending to;
a focus webpage thumbnail display unit, configured to display the graphic focus corresponding to the text focus selected by the user;
a synchronous display control unit, configured to display the text focus and the graphic focus synchronously on the display page and to coordinate their changes through a doubly linked circular list with a head pointer; wherein
the text search result display unit is located at the middle left of the display page, the focus webpage thumbnail display unit is located in the central area of the display page, and the graphic search result display units are located at the upper right corner and the lower right corner of the focus webpage thumbnail display unit, respectively.
14. The display control device for displaying search results with both text and graphics according to claim 13, characterized in that:
on the display page of the display control device, the text search results and the corresponding webpage thumbnails are arranged vertically in parallel.
15. The display control device for displaying search results with both text and graphics according to claim 13, characterized in that:
the position of the head pointer in the doubly linked circular list corresponds to the position of the text focus among the text search results.
16. A display control method for displaying search results with both text and graphics, characterized in that:
on the display page, the text search results and the corresponding webpage thumbnails are arranged vertically in parallel, the central part of the display page being a focus display area that shows the graphic focus corresponding to the text focus selected by the user;
the text focus and the graphic focus are displayed synchronously on the display page, and their changes are coordinated through a doubly linked circular list with a head pointer, wherein the head pointer is used to determine the position of the text focus.
17. The display control method for displaying search results with both text and graphics according to claim 16, characterized in that:
the position of the head pointer in the doubly linked circular list corresponds to the position of the text focus among the text search results.
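One way to read the doubly linked circular list with a head pointer is sketched below: each node pairs a text result with its thumbnail, so moving the head pointer changes the text focus and the graphic focus together. The `Node` and `FocusRing` classes, their method names, and the sample file names are hypothetical.

```python
class Node:
    def __init__(self, title, thumbnail):
        self.title = title          # text search result
        self.thumbnail = thumbnail  # matching webpage thumbnail
        self.prev = self
        self.next = self


class FocusRing:
    """Doubly linked circular list; `head` marks the currently focused result."""

    def __init__(self, results):
        if not results:
            raise ValueError("at least one search result is required")
        self.head = None
        for title, thumbnail in results:
            node = Node(title, thumbnail)
            if self.head is None:
                self.head = node
            else:
                tail = self.head.prev   # splice the new node in before head
                tail.next = node
                node.prev = tail
                node.next = self.head
                self.head.prev = node

    def focus(self):
        # Text focus and graphic focus always come from the same node.
        return self.head.title, self.head.thumbnail

    def move_next(self):
        self.head = self.head.next
        return self.focus()

    def move_prev(self):
        self.head = self.head.prev
        return self.focus()


ring = FocusRing([("Result A", "a.png"), ("Result B", "b.png"), ("Result C", "c.png")])
after_next = ring.move_next()                  # A -> B
wrapped = (ring.move_prev(), ring.move_prev()) # B -> A, then wraps to C
```

Because the list is circular, stepping backwards from the first result lands on the last one without any boundary check.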
18. A method for implementing precise search by using semantic analysis, characterized by comprising the following steps:
(1) accepting a target information descriptor entered by the user and performing word segmentation on it;
(2) determining whether the target information descriptor has complete semantics;
(3) if so, performing the subsequent retrieval directly; if not, providing the user with vocabulary associated with the target information descriptor;
(4) the user making a secondary input that determines the semantics of the target information descriptor, subsequent retrieval being performed according to those semantics.
19. The method for implementing precise search by using semantic analysis according to claim 18, characterized in that:
in step (1), the word segmentation uses a maximization word segmentation algorithm.
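The claim does not define the "maximization word segmentation algorithm". If it is read as forward maximum matching, which is an assumption, the segmentation in step (1) could look like the sketch below; the `segment` function and the small `VOCAB` dictionary are hypothetical.

```python
def segment(text, vocabulary, max_len=None):
    """Forward maximum matching: take the longest dictionary word at each position."""
    if max_len is None:
        max_len = max((len(word) for word in vocabulary), default=1)
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocabulary:
                words.append(piece)  # a single character is the fallback
                i += size
                break
    return words


VOCAB = {"北京", "北京大学", "大学", "图书馆", "图书"}
tokens = segment("北京大学图书馆", VOCAB)
```

With this dictionary the longest match wins at each position, so the input is split into "北京大学" and "图书馆" rather than into shorter words.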
20. The method for implementing precise search by using semantic analysis according to claim 18, characterized in that:
in step (2), when the target information descriptor contains both an "ontology" and a "behavior", and the "ontology" and the "behavior" are associated, the target information descriptor is considered to have complete semantics.
21. The method for implementing precise search by using semantic analysis according to claim 18, characterized in that:
in step (2), if the target information descriptor does not have complete semantics, the "ontology" in the target information descriptor is determined first.
22. The method for implementing precise search by using semantic analysis according to claim 21, characterized in that:
in step (4), the "behavior" corresponding to the target information descriptor is further determined through the user's secondary input.
23. The method for implementing precise search by using semantic analysis according to claim 18, characterized in that:
in step (3), the vocabulary associated with the target information descriptor is stored in a meta-vocabulary association database.
24. The method for implementing precise search by using semantic analysis according to claim 23, characterized in that:
in the meta-vocabulary association database, for a given meta-word S, its associated vocabulary is stored as S{ci, dj}, with ci as the first-level classification and dj as the second-level classification.
25. The method for implementing precise search by using semantic analysis according to claim 18, characterized in that:
in step (4), the user makes the secondary input by selecting from the vocabulary associated with the target information descriptor.
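Claims 20 to 25 can be read together as a small decision flow, sketched below. Treating the first-level classification ci as the candidate "behaviors" of an "ontology" is an assumption, as are the `ASSOCIATIONS` table, its sample entries, and the `analyze` and `refine` functions.

```python
# Hypothetical meta-vocabulary association database in the S{ci, dj} shape:
# meta-word S -> first-level classification ci -> second-level entries dj.
ASSOCIATIONS = {
    "camera": {
        "buy": ["price", "store"],
        "repair": ["service center", "warranty"],
    },
}


def analyze(tokens, db=ASSOCIATIONS):
    """Report whether the segmented descriptor has an ontology and a linked behavior."""
    ontology = next((t for t in tokens if t in db), None)
    if ontology is None:
        return {"complete": False, "ontology": None, "behavior": None, "suggestions": []}
    behavior = next((t for t in tokens if t in db[ontology]), None)
    # Without a behavior, offer the associated vocabulary for a secondary input.
    suggestions = [] if behavior else sorted(db[ontology])
    return {
        "complete": behavior is not None,
        "ontology": ontology,
        "behavior": behavior,
        "suggestions": suggestions,
    }


def refine(tokens, choice):
    """Secondary input: the user picks one of the suggested words."""
    return tokens + [choice]


first_pass = analyze(["camera"])
second_pass = analyze(refine(["camera"], first_pass["suggestions"][0]))
```

Here `first_pass` is incomplete and carries suggestions, while `second_pass`, built from the user's choice, is complete and can go straight to retrieval.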
26. An online shopping navigation method, implemented on a visual search engine system comprising a web crawler device and a display control device, wherein the web crawler device is used to crawl web pages and generate webpage thumbnails, characterized in that:
when the visual search engine system is used for online shopping navigation, the display control device first displays, according to the shopping-object keywords entered by the user, the text search results for the shopping objects on the left side of the search result page; the upper right and lower right corners of the search result page display the webpage thumbnails corresponding to the text search results; and the central area of the search result page displays the focus webpage thumbnail of the shopping object currently selected by the user;
a selection bar is provided on the search result page; the user places the chosen search results into the selection bar for comparison and then proceeds from the selection bar to the web page of the shopping object to make the purchase.
27. The online shopping navigation method according to claim 26, characterized in that:
in the selection bar, a webpage ID is set for each target web page of a shopping object, and transit management is performed on the webpage IDs.
28. The online shopping navigation method according to claim 27, characterized in that:
the selection bar temporarily stores the webpage IDs, and items the user wants to buy are added or discarded by operating on the webpage IDs.
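The selection bar of claims 27 and 28 amounts to a temporary store keyed by webpage ID. The sketch below is illustrative only: the `SelectionBar` class, its methods, the sequential integer IDs, and the sample URLs are assumptions rather than anything the claims specify.

```python
class SelectionBar:
    """Temporary holding area that tracks candidate pages by webpage ID."""

    def __init__(self):
        self._pages = {}   # webpage ID -> URL of the shopping object's page
        self._next_id = 1

    def add(self, url):
        """Assign a webpage ID to the page and keep it for comparison."""
        page_id = self._next_id
        self._next_id += 1
        self._pages[page_id] = url
        return page_id

    def discard(self, page_id):
        """Drop an item the user no longer wants; unknown IDs are ignored."""
        self._pages.pop(page_id, None)

    def url_for(self, page_id):
        """Resolve an ID back to the page to visit for the purchase."""
        return self._pages[page_id]

    def page_ids(self):
        return sorted(self._pages)


bar = SelectionBar()
first = bar.add("http://shop.example.com/camera/1")
second = bar.add("http://shop.example.com/camera/2")
bar.discard(first)
remaining = bar.page_ids()
```

Working with IDs rather than URLs keeps add and discard operations cheap, and the same IDs could later be copied into a longer-lived store such as the favorites of claim 29.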
29. The online shopping navigation method according to claim 26, characterized in that:
a favorites folder is also provided on the search result page; the favorites folder is open to registered users and stores the webpage IDs selected by a registered user over the long term.
30. The online shopping navigation method according to claim 26, characterized in that:
during comparison, the shopping-object thumbnails crawled and generated by the web crawler device are gathered together for the user to choose from.
31. The online shopping navigation method according to claim 26, characterized in that:
after deciding which shopping object to buy, the user follows a link to the online shop page of that shopping object to make the purchase, the online shop page being displayed in a floating manner.
PCT/CN2011/078725 2010-08-27 2011-08-22 Visualized search engine system and implementation method and application thereof WO2012025040A1 (en)

Applications Claiming Priority (8)

Application Number Priority Date Filing Date Title
CN201010264871.6A CN101916294B (en) 2010-08-27 2010-08-27 Method for realizing exact search by utilizing semantic analysis
CN201010264871.6 2010-08-27
CN201010590806.2 2010-12-10
CN 201010590806 CN102054028B (en) 2010-12-10 2010-12-10 Method for implementing web-rendering function by using web crawler system
CN 201110052339 CN102129453B (en) 2011-03-04 2011-03-04 Display control device and method capable of displaying search result in mode of text completed with graphs
CN201110052339.2 2011-03-04
CN201110234356.8 2011-08-14
CN201110234356.8A CN102270331B (en) 2011-08-14 2011-08-14 Network shopping navigating method based on visual search

Publications (1)

Publication Number Publication Date
WO2012025040A1 (en) 2012-03-01

Family

ID=45722900

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/078725 WO2012025040A1 (en) 2010-08-27 2011-08-22 Visualized search engine system and implementation method and application thereof

Country Status (1)

Country Link
WO (1) WO2012025040A1 (en)

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101042695A (en) * 2006-03-20 2007-09-26 腾讯科技(深圳)有限公司 Method for breviary displaying the result of page searching
CN101114285A (en) * 2006-07-25 2008-01-30 腾讯科技(深圳)有限公司 Internet topics file searching method, reptile system and search engine
CN101206662A (en) * 2006-12-13 2008-06-25 佳能株式会社 Document retrieving apparatus, document retrieving method
CN101114294A (en) * 2007-08-22 2008-01-30 杭州经合易智控股有限公司 Self-help intelligent uprightness searching method
CN101246492A (en) * 2008-02-26 2008-08-20 华中科技大学 Full text retrieval system based on natural language
CN101510221A (en) * 2009-02-17 2009-08-19 北京大学 Enquiry statement analytical method and system for information retrieval
CN101551819A (en) * 2009-04-30 2009-10-07 用友软件股份有限公司 A method for rendering large-scale Web page
CN101916294A (en) * 2010-08-27 2010-12-15 黄斌 Method for realizing exact search by utilizing semantic analysis
CN102054028A (en) * 2010-12-10 2011-05-11 黄斌 Web crawler system with page-rendering function and implementation method thereof
CN102129453A (en) * 2011-03-04 2011-07-20 黄斌 Display control device and method capable of displaying search result in mode of text completed with graphs

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104216892A (en) * 2013-05-31 2014-12-17 亿览在线网络技术(北京)有限公司 Non-semantic non-word-group switching method in song search
CN104216892B (en) * 2013-05-31 2018-01-02 亿览在线网络技术(北京)有限公司 The switching method of non-semantic in song search, non-phrase
CN117874319A (en) * 2024-03-11 2024-04-12 江西顶易科技发展有限公司 Search engine-based information mining method and device and computer equipment
CN117874319B (en) * 2024-03-11 2024-05-17 江西顶易科技发展有限公司 Search engine-based information mining method and device and computer equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11819415

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11819415

Country of ref document: EP

Kind code of ref document: A1