WO2012088706A1 - Retrieval method and system - Google Patents

Retrieval method and system

Info

Publication number
WO2012088706A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
search term
search
string
original
Prior art date
Application number
PCT/CN2010/080578
Other languages
English (en)
French (fr)
Inventor
肖岩
Original Assignee
Xiao Yan
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiao Yan
Priority to US13/977,528 (US9870392B2)
Priority to PCT/CN2010/080578 (WO2012088706A1)
Priority to CN201080071023.1A (CN103314371B)
Publication of WO2012088706A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing

Definitions

  • The present invention relates to the field of computer information processing, and more particularly to a retrieval method and system.

Background technique

  • The purpose of a search engine is to guide the user toward the information he or she cares about.
  • The user starts from known information and uses the search engine to obtain information of interest that is still unknown.
  • The user may be unable to find an appropriate and accurate textual description of the unknown information; even a user who knows the main keyword hopes to receive good and sufficient prompts for the information related to that keyword.
  • The prior art includes: 1. Chinese patent application No. 200610112822.4, entitled "method of searching for prompts based on the inverted list"; 2. Baidu related search, the Google keyword tool, and so on. These techniques all generate search hints from statistics on the query terms entered by users.
  • The content of these prompts is the top-ranked data that remains after screening. The data is only a list and is incomplete in content.
  • Because the data is presented to the user as a list, each piece of information exists independently of the others while still containing prompt information related to the search term. The amount of data therefore becomes very large, and the user's effort to find useful information increases significantly.
  • There is no logical structure or semantic relationship between these pieces of information, so the search hints leave the user unsure how to proceed.
  • Baidu related search gives only 10 related search hints.
  • The Google Keyword Tool can give up to 150 related search hints, but these hints are unorganized and have no logical relationship to one another.
  • These search hints are mostly based on modeling the behavior of a large number of users, on the assumption that what most people do reflects what the individual user needs. For example, statistics over a large number of users showed a correlation between "Beijing" and "Olympics" in 2008, between "Beijing" and "turbulent" in the spring of 2009, and between "Beijing" and "Blizzard" on New Year's Day 2010. Once the search clicks of a large number of users become sudden and directed, the search prompts are directly affected by such massive search clicks. In view of this, a better search hint and a better method of displaying the search information are needed.

Summary of the invention

  • The present invention refines semantics, guides the user, and returns retrieval information through the process of retrieving a character string. In addition, the search prompts are more concise, clear, and logically complete.
  • A retrieval method provided to achieve the purpose of the present invention includes the steps of:
  • querying the search term directory table according to a search term entered by the user on the terminal, to obtain a first data item set that includes the input search term; wherein the data items of the first data item set have a kinship relationship with one another;
  • combining the first data item set and transmitting it to the terminal; wherein the data items of the first data item set are combined in a recursive manner;
  • For step A, the following steps are further included:
  • Acquiring the second data item set includes the following steps:
  • the second data item set is obtained by querying the information index data table with each data item of the first data item set, using recursive combination matching.
  • In step A1, generating the search term directory table includes the following steps:
  • A12. Determining, according to the inclusion relationship, a parent-child relationship between each pair of matched original strings;
  • n is greater than or equal to 1; the data item sets D1, D2, ..., Dn constitute the search term directory table; wherein the original strings of the data items of the data item set Dn have a kinship relationship.
  • The inclusion relationship comprises:
  • Step A12 comprises the following steps: if at least two original strings form a left inclusion or a right inclusion relationship, the two original strings are set as parent and child, and the included original string is the parent;
  • the two original string sets are set as parent and child, and the included original string set is the parent.
  • When the inclusion relationship is a right inclusion relationship, forming the search term directory table from the data item sets further comprises reverse-sorting the character strings of the data items and forming the search term directory table on that basis; reversing and sorting the strings includes the following steps:
  • Step A131: Generate a reverse search term field by reversing the search term field;
  • Step A136: If the inherited direct tree stack is empty, jump to step A138;
  • The elder nodes are pushed onto the temporary direct tree stack; the last node among the elder nodes is the parent node, and the current stack cursor value is updated to that of the parent node.
  • Step A139: If no elder node is found in the inherited direct tree stack, set the stack cursor value to 0; assign the contents of the temporary direct tree stack to the inherited direct tree stack, push the current [number, reverse search term] data onto the direct tree stack, and then jump to step A139;
  • Step A138: Push the current [number, reverse search term] data onto the inherited direct tree stack, and update the current stack cursor value to 0;
  • Step A139: Increment the current stack cursor value by 1, return to step A134, read the next [number, reverse search term] data, and repeat the loop;
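
The idea behind the reversed, sorted strings and the ancestor stack can be illustrated with a short Python sketch. This is not a transcription of steps A131 to A139, whose full text is not reproduced here; it is a minimal sketch under the assumption that "right inclusion" means "the parent is a suffix of the child". The function name `build_right_inclusion_parents` is hypothetical.

```python
def build_right_inclusion_parents(terms):
    """Map each term to the longest other term that it ends with (or None)."""
    # Reversing each term turns "is a suffix of" into "is a prefix of",
    # so after sorting, every term appears after all of its ancestors.
    ordered = sorted(set(terms), key=lambda t: t[::-1])
    parents = {}
    stack = []  # chain of ancestors of the most recently processed term
    for term in ordered:
        # Discard stacked terms that are not suffixes of the current term.
        while stack and not term.endswith(stack[-1]):
            stack.pop()
        parents[term] = stack[-1] if stack else None
        stack.append(term)
    return parents


if __name__ == "__main__":
    sample = ["coat", "cotton coat", "wool coat", "ladies wool coat", "tea", "jasmine tea"]
    for child, parent in build_right_inclusion_parents(sample).items():
        print(child, "<-", parent)
```

After the initial sort, each term is pushed and popped at most once, so the tree is built in a single non-recursive pass over the data.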
  • The method further includes the step of generating the original search term data table, which includes the following steps:
  • the information index data of the information index data table is deduplicated to generate the original string set, from which the original search term data table is obtained.
  • The method further includes the step of generating the information index data table, which includes the following steps:
  • an inverted data table is created using index words or index word sets, and the information index data table is generated from the inverted data table.
  • The method for searching further includes the following steps:
  • A processing system for information retrieval, including:
  • an information index data table module configured to generate an information index data table from all the information content held by the information data table module, and to store it;
  • an original search term data table module configured to deduplicate the information index data table to obtain original strings; the original strings form an original string set, and the whole set is stored as the original search term data table;
  • a search term directory table module configured to generate data item sets according to the kinship relationship between each original string and the other original strings in the original search term data table; the data item sets are stored as the search term directory table;
  • a search engine module configured to query the search term directory table according to a search term input by the user on the terminal, to obtain a first data item set that includes the input search term, wherein the data items of the first data item set have a kinship relationship with one another; to query the information index data table according to each data item of the first data item set associated with the input search term, to acquire a second data item set; and to combine the first data item set and transmit it, together with the second data item set, to the terminal.
  • The inclusion relationship includes no inclusion, left inclusion, right inclusion, and center inclusion;
  • The mutual matching in the search term directory table module means that each original string is matched against the other original strings; the other original strings that stand in a left and/or right inclusion relationship with it, together with the original string itself, generate the data item.
  • The data item sets of the search term directory table constitute a sorted multi-way search tree.
  • The search engine module is further configured to present the first data item set related to the input search term according to the recursive relationship of the search tree.
  • The second data item set is associated with the first data item set.
  • The processing system further includes an alias data table module configured to store alias strings of the original strings; the alias strings are used to perform alias queries for the original strings;
  • The second data item set is also associated with the first data item set through the alias data table.
  • The present invention has at least the following advantages:
  • The search results provided by the present invention are highly deterministic: they depend only on whether the search string exists in the search term directory table and the corresponding information index data table. They do not depend on the search behavior of other users, nor are they affected by the behavior of content providers;
  • The search results provided by the present invention are accurate. If the user's search string exists in the search term directory table and the information index data table, accurate feedback is obtained; if it does not exist, no semantically unrelated results are returned.
  • The technical solution of the present invention allows logical deduction, making it possible to obtain other, unknown information that is logically associated with the search string.
  • The search results provided by the technical solution of the present invention are intuitive in form and are displayed as a recursive directory tree.
  • The content of the search term directory table, displayed as a recursive directory tree, corresponds to the content of the information index data table, so the target information in the information index data table can be located faster and more accurately.
  • The present invention does not require massive amounts of data and does not require statistics.
  • The invention can greatly compress redundant retrieval hint content; it is highly exhaustive, which improves the recall rate, and highly consistent, which improves precision.
  • The present invention has good versatility and portability.
  • FIG. 1 is a structural example of a search term list in the embodiment of the present invention.
  • the result of the search term directory table is displayed in a mixed manner
  • FIG. 3 is a screenshot of a system prototype of an embodiment of the present invention.
  • Figure 4 shows the data returned using the Google keyword tool;
  • Figure 5 shows the data returned using Baidu Instant Alert;
  • Figure 6 shows the results of the search from the General Administration of Customs
  • FIG. 7 is a diagram showing an example of a search term directory constructed using an embodiment of the present invention.
  • FIG. 8 is a list of search terms displayed after processing the CDMA product according to the embodiment of the present invention
  • FIG. 9 is a view showing an example of the insurance structure concept tree in the embodiment of the present invention.
  • FIG. 10 is a system structural diagram of an embodiment of the present invention.
  • FIG. 10_1 is another display manner of the recursive directory tree structure provided by the embodiment of the present invention.
  • FIG. 10 is a schematic diagram of a search result provided by an embodiment of the present invention.
  • FIG. 11 is a flowchart of a method for searching according to an embodiment of the present invention.
  • FIG. 12 is a flowchart of obtaining a search result according to an embodiment of the present invention.

Detailed description

  • The basic principle of the present invention is to prompt the user with the objective relationship between the search string already entered and the target information the user needs to find, thereby improving the objectivity of the search results. This contrasts with the prior art, which relies on modeling the search behavior of massive numbers of users; such modeling is susceptible to human interference, which destroys the objectivity of the association between the search string and the target information, damages the accuracy of the search, and degrades the user experience. Moreover, in the embodiment of the present invention the objective relationship between the search string and the target information is fed back to the user as a search term directory table in which search strings are the nodes; this display is visual, intuitive, and accurate.
  • The target information itself is formalized.
  • For example: coat, wool coat, women's wool coat, men's wool coat, imported women's wool coat.
  • The correlation between these pieces of target information is likewise embedded in their similar form.
  • The connection between computers, desktops, laptops, and LCD laptops is also closely tied to the form in which the information is described. Such relevance of form and content between similar pieces of target information is widespread and ubiquitous.
  • The relevance of form and content among such target information plays an important role in improving the customer experience. If, after the user inputs a search string, the associations among the target information related to that search string can be provided, the user obtains not only the normal search results but also other target information related to the search string; the result of the search is determined by the objective semantics and classification of the search string itself and is not interfered with by the subjective behavior of other users.
  • An embodiment of the present invention provides a processing system for information retrieval, including:
  • an information data table module 1004, configured to store all information content;
  • an information index data table module 1003, configured to generate an information index data table from all the information content of the information data table module 1004, and to store it;
  • an original search term data table module 1005, configured to deduplicate the information index data table to obtain original strings; the original strings form an original string set, and the whole set is stored as the original search term data table;
  • a search term directory table module 1006, configured to generate data item sets according to the kinship relationship between each original string and the other original strings in the original search term data table; the data item sets are stored as the search term directory table;
  • an alias data table module 1007, configured to store alias strings of the original strings; the alias strings are used to perform alias queries for the original strings;
  • a search engine module 1002, configured to query the search term directory table according to a search term input by the user on the terminal 1001, to obtain a first data item set that includes the input search term, wherein the data items of the first data item set have a kinship relationship with one another; to query the information index data table according to each data item of the first data item set associated with the input search term, to acquire a second data item set; and to combine the first data item set and transmit it, together with the second data item set, to the terminal 1001.
  • The information index data table module 1003 treats all the stored information content as the complete data to be queried, performs full-text or abstract indexing (Chinese word segmentation) on it to extract the original strings, and creates an inverted data table that serves as the information index data table.
  • The original strings extracted by full-text or abstract indexing are the search keywords processed by the system.
  • These search keywords are determined objectively by the information to be queried; each piece of information to be queried may correspond to one or more search keywords.
  • The original search term data table module 1005 selects the unique original strings from the information index data table to generate the original search term data table. The table consists of unique original strings, which are the source of the retrieval hint data.
  • The search term directory table module 1006 matches each original string in the original search term data table against the other original strings in that table to determine the inclusion relationships between original strings or original string sets.
  • The inclusion relationship between original strings reflects their semantic relationship. Based on the inclusion relationship, a parent-child relationship is established between each pair of directly related original strings, thereby constructing the search term directory table.
  • The search engine module 1002 performs a query based on the search term input by the user from the terminal 1001 and obtains a recursive directory tree search prompt together with the data matching the input search term.
  • The search engine module 1002 queries the search term directory table to obtain a first data item set that includes the input search term, wherein the data items of the first data item set have a kinship relationship with one another; it queries the information index data table according to each data item of the first data item set associated with the input search term to acquire a second data item set; and it combines the first data item set and transmits it, together with the second data item set, to the terminal 1001.
  • After receiving the search request, the search engine module 1002 queries the search term directory table according to the input search term and returns a directory-form search term structure that includes the input search term, thereby providing prompts for the input search term contained in the search request.
  • Using that result, it queries the information index data table to obtain the information data associated with the input search term contained in the search request.
  • The information is presented in the form of a search term directory, clearly showing the recursive hierarchy among the related search information.
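
The two-stage lookup described above (directory table first, information index second) can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the directory table is modeled as a hypothetical `children` mapping from a term to its child terms, the information index as a hypothetical `index` mapping from a term to a set of record numbers, and every directory term is assumed to appear in the index. None of these names come from the patent.

```python
def search(term, children, index):
    """Return (first_set, second_set) for an input search term.

    first_set  : list of (depth, term) pairs, i.e. the input term and all of
                 its descendants in directory order.
    second_set : record numbers indexed under any term of first_set.
    """
    if term not in index:
        # Unknown term: nothing is returned, so the caller can ask the
        # user to re-enter the search term.
        return [], set()

    first_set = []
    stack = [(term, 0)]  # explicit stack instead of recursion
    while stack:
        node, depth = stack.pop()
        first_set.append((depth, node))
        # Push children in reverse so they are visited in sorted order.
        for child in sorted(children.get(node, []), reverse=True):
            stack.append((child, depth + 1))

    second_set = set()
    for _, node in first_set:
        second_set |= index.get(node, set())
    return first_set, second_set
```

The depth stored with each term is enough for a terminal to render the first data item set as an indented directory, for example by showing only depth 0 and 1 by default.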
  • The original strings of the search term directory table of the search term directory table module 1006 may come either from the original search term data table obtained by indexing or word segmentation, or from manually compiled data.
  • The original search term data table module 1005 filters the original search term data table to remove unnecessary or meaningless original strings.
  • Manual compilation supplements and re-edits the results of indexing or word segmentation.
  • Newly created search strings can be added to, or edited into, the search term directory table.
  • The search term directory table is generated with a stack-based, non-recursive method that can process massive amounts of data in linear time.
  • When generating the recursive directory tree search prompt, the alias data table module 1007 is queried at the same time, and a search prompt for the aliases of the search string input by the user is returned, thereby improving the user experience.
  • The information retrieval processing system provided by the embodiment of the present invention provides retrieval prompts based on a search term directory table. Compared with the prior art, it has the following advantages:
  • The search strings are displayed in the structure of the search term directory table. Because the structure is hierarchical, only the first layer of the directory is displayed by default rather than all the data. If the user wishes, all the data can be displayed, which gives the embodiment of the present invention a high degree of exhaustiveness and improves the recall rate.
  • The method for generating the search term directory table is a linear, non-recursive algorithm that is fast and easy to implement, which makes the present invention more practical to implement.
  • A retrieval method provided by an embodiment of the present invention includes the following steps:
  • a second set of data items in the information index data table that match the input search term is transmitted to the terminal 1001 of the user.
  • The method may further include:
  • the alias data table is queried to obtain the associated word corresponding to the input search term (i.e., the alias search term); the alias search term is then used as the new input search term, and steps 1102-1104 are repeated.
  • The method may further include:
  • the user is prompted to re-enter the search term.
  • The method for searching further includes the step of generating the search term directory table.
  • Acquiring the second set of data items includes the following steps:
  • the second data item set is obtained by querying the information index data table with each data item of the first data item set, using recursive combination matching.
  • Generating the search term directory table comprises the following steps:
  • D1, D2, ..., Dn, where n is greater than or equal to 1, and the data items constituting the data item set Dn have a kinship relationship.
  • The inclusion relationship is one of: left inclusion, right inclusion, center inclusion, or no inclusion.
  • Generating the search term directory table according to the parent-child relationship may further include:
  • the two strings are set as parent and child, and the included string is the parent.
  • The original search string set is generated and serves as the original search term data table.
  • The information is indexed to obtain index words; an inverted data table is created using the index words or index word set names and serves as the information index data table.
  • Search term directory table: a structure in which all search strings are arranged into a recursive directory tree according to the relationships between them. Directly related search strings form parent-child relationships, and the superposition of multiple layers of parent-child relationships forms the tree structure of a recursive directory.
  • The search strings come mainly from the original search term data table, and may also be manually supplemented and edited on that basis.
  • A person or terminal 1001 capable of issuing an information query request; this may include persons or query terminals 1001 distributed throughout the country that use the system to perform information queries.
  • Search string: the string entered by the user when making a query.
  • The search term directory table is composed of search strings.
  • The search engine returns content data related to the search string entered by the user.
  • Information data table module 1004: a data table for storing the specific content; the specific content can be stored in multiple fields.
  • For example, a product information data table storing product information may include the product type, brand, price, manufacturer, and the like.
  • Information index data table module 1003: an inverted data table generated by automatically indexing or segmenting all or part of the content of the information data table. For example, for the product information database above, the inverted data table may be generated for the product type only.
  • Original search term data table module 1005: the data obtained by removing duplicates from the index of the information index data table.
  • Alias data table module 1007: stores alias data for keywords. Alias data tables can be constructed in a variety of ways, for example from synonyms or near-synonyms. For example, the potato is known by several different names in Chinese. This table is mainly used for query expansion of the search string: when the user queries an alias or a real name, the system also displays the real name corresponding to the alias, or the aliases corresponding to the real name, which widens the scope of the search and enhances the user experience.
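
The query expansion performed with the alias data table can be sketched in Python as follows. This is a minimal sketch under the assumption that the alias table is stored as groups of interchangeable names; the function name `expand_with_aliases` and the sample group are hypothetical and are not taken from the patent.

```python
def expand_with_aliases(term, alias_groups):
    """Return the input term followed by its aliases, if any are recorded."""
    for group in alias_groups:
        if term in group:
            # Real name and aliases are treated symmetrically: whichever
            # one the user typed, the remaining names are returned too.
            others = sorted(set(group) - {term})
            return [term] + others
    return [term]


if __name__ == "__main__":
    groups = [{"laptop", "notebook computer"}]  # hypothetical alias group
    print(expand_with_aliases("laptop", groups))
```

Each returned name can then be used as a new input search term against the search term directory table.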
  • The inverted list is a commonly used data structure in search engines.
  • The inverted list uses words as the index and, for each word, stores the set of documents containing it, so that the documents containing a given word or words can be found quickly.
  • Records need to be looked up by attribute value, so each entry in the index table consists of two parts: a specific attribute value, and the addresses of the records in which that value appears.
  • In this way, starting from the value of an attribute, we can find in reverse the storage address of a record, or the corresponding key of the record. We call this kind of index an inverted index.
  • The information index data table is an inverted table; the words used as indexes can be obtained by automatically indexing or segmenting the main data of the information table, either in full text or by abstract. These index words are the search strings processed by the system.
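
A minimal Python sketch of such an inverted table is shown below. It assumes that each record's content is a comma-separated list of terms, so that splitting on commas stands in for the automatic indexing or Chinese word segmentation step; the function name `build_inverted_index` and the sample records are hypothetical.

```python
from collections import defaultdict


def build_inverted_index(records):
    """Map each index word to the set of record numbers that contain it.

    records: dict of record number -> content string.
    Splitting on commas is only a stand-in for real word segmentation.
    """
    index = defaultdict(set)
    for record_id, content in records.items():
        for word in content.split(","):
            word = word.strip()
            if word:
                index[word].add(record_id)
    return index


if __name__ == "__main__":
    sample = {1: "cotton coat, wool coat", 2: "wool coat"}
    for word, ids in build_inverted_index(sample).items():
        print(word, sorted(ids))
```

Looking up a word in the resulting mapping returns the numbers of all records indexed under it, which is the reverse lookup the text describes.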
  • the information table is a carrier of the target information, and may generally include a business information table, a legal information table, a public resource information table, and the like.
  • the sample data contained in the information data table is represented by a table below, and the database field is a simplified field.
  • Table 1 shows the table structure of the information data table
  • Table 3 shows the table structure of the information index data table.
  • Tables 2 and 4 are specific sample data.
  • The generation process of the information index data table is illustrated by the step from Table 2 to Table 4. Following this process, those skilled in the art can generate information index data tables in various other forms.
  • The information data table stores the specific information data; its table name is xinxi.
  • The information index data table stores the index data of the information; its table name is xinxiindex.
  • The information index data table includes three fields: Indexed (the information index number), Xinxiid (the number of the information item to which the index entry refers), and Tmsp (index).
  • The index field is derived by full-text automatic indexing or word segmentation of the content field in the information data table. For example, the first record in Table 2, "cotton coat, wool coat", becomes two index rows in Table 4.
  • Each index entry also records the xinxiid of the information item it indexes, to facilitate data lookup.
  • Table 4-2 The relationship between the index word set name and the index word set is:
  • The original search term data table can be generated by selecting the unique values of the index field of the information index data table.
  • The original search term data table is the raw data from which the search term directory table is generated.
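
The unique selection over the index field can be sketched in Python as follows. This is a minimal sketch assuming the index field is available as a plain sequence of words and that search term numbers are assigned in order of first appearance; the function name `build_original_terms` is hypothetical.

```python
def build_original_terms(index_words):
    """Deduplicate index words and number them in order of first appearance.

    Returns a list of (number, search term) rows.
    """
    numbers = {}
    for word in index_words:
        if word not in numbers:
            numbers[word] = len(numbers) + 1  # numbering starts at 1
    return [(number, word) for word, number in numbers.items()]


if __name__ == "__main__":
    rows = build_original_terms(["cotton coat", "wool coat", "wool coat", "tea"])
    for number, term in rows:
        print(number, term)
```

In a relational database the same effect would typically be obtained with a distinct selection over the index column.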
  • The original search term data table can be updated as needed by updating its search strings.
  • For example, when the data of the information index data table changes, the index field may change accordingly; if the unique selection over the index field then yields a new search string, the original search term data table is updated.
  • The original search term data table stores the initial set of search strings; one possible implementation is as follows:
  • the original search term data table includes two fields of a search term number and a search term, and the table name can be exemplified as
  • A new search string can be appended directly at the end of Table 6 with a new number, so that the existing search string numbering is not disturbed.
  • When a search string is to be removed, the corresponding search string and its number can be deleted; when an original search string needs to be replaced, the new search string can be added, or the corresponding search term and its number in the original search term data table can be replaced.
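
The maintenance operations on the original search term data table can be sketched in Python as follows. This is a minimal sketch assuming the table is held as a mapping from search term number to search term; the function names `add_term` and `delete_term` are hypothetical.

```python
def add_term(table, term):
    """Append a new term under the next unused number.

    Existing numbers are never changed. Returns the new number, or None
    if the term is already present.
    """
    if term in table.values():
        return None
    number = max(table, default=0) + 1
    table[number] = term
    return number


def delete_term(table, term):
    """Remove a term together with its number; other rows keep their numbers."""
    for number, existing in list(table.items()):
        if existing == term:
            del table[number]


if __name__ == "__main__":
    table = {1: "coat", 2: "tea"}
    add_term(table, "wool coat")
    delete_term(table, "coat")
    print(table)
```

Because numbers are only ever appended or removed, references to the remaining search terms stay valid after an update.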
  • The focus of attention differs for different types of target information.
  • For product information, the product name, type, price, manufacturer, and origin may be what the consumer usually cares about.
  • For users of a legal database, the focus may be more specifically on the subject, object, conditions, time limits, etc. specified in the legal system. This does not exceed the scope of protection of the embodiments of the present invention, and the technical solutions of the embodiments apply equally.
  • the generation of the search term table is mainly achieved by including the matching of the relationship. Any search string in the original search term data table is matched with other search strings in the original search term data table to determine the inclusion relationship with other search strings. After processing according to different inclusion relationships, a list of search terms is obtained.
  • the inclusion relationship of the word characters is considered by default to indicate the semantic inclusion relationship of the words.
  • “wool coat” is the upper concept of “women's wool coat”
  • “wool coat” semantically includes “women's wool coat”.
  • the containing word is usually a subcategory, or subordinate concept, of the contained word.
  • the remaining part generally modifies or supplements the contained word, such as "pen” and “pen”, “pen” and "refill”.
  • the inclusion relationship between the words is divided into four types: no inclusion, left inclusion, right inclusion, and center inclusion.
  • the search string can be constructed into a tree-shaped search term list according to the above four kinds of inclusion relationships between the strings.
  • the search term table is a sorted multiway tree. For two adjacent nodes on the same branch, the parent node is the longest string contained in the child node.
  • the root node of the tree is a virtual node. By removing the virtual node, you can get a string forest.
  • FIG. 1 shows a structural example of a search term list in the embodiment of the present invention.
  • the strings included are: coat, cotton coat, cashmere coat, wool coat, women's wool coat, quilt cover, tea, water tea, fruit tea, flower tea, jasmine tea.
  • the above example shows the right inclusion relationship, such as: "cotton coat” right contains "coat”.
  • the non-contained relationship is discarded, and only three other inclusion relationships need to be identified.
  • the center inclusion relationship can be expressed as a combination of "left contains" and "right contains".
  • for example, “women's wool coat” contains “wool” in the middle. This can be split into two steps: “women's wool coat” right-contains “wool coat”, and “wool coat” left-contains “wool”.
  • matching the inclusion relationship therefore involves two passes, one for the left inclusion relationship and one for the right inclusion relationship. The two passes are similar, and after both have run, the inclusion relationship between any two search strings can be identified.
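The four inclusion relationships can be illustrated with a small sketch. It assumes character-level matching on plain strings (the English terms stand in for the original Chinese ones), and the function names `inclusion` and `split_center` are hypothetical, not taken from the patent. When a string is both a prefix and a suffix, this sketch arbitrarily reports "right".

```python
def inclusion(longer: str, shorter: str) -> str:
    """Classify how `longer` contains `shorter`: 'right', 'left', 'center' or 'none'."""
    if not shorter or longer == shorter or shorter not in longer:
        return "none"
    if longer.endswith(shorter):
        return "right"   # `shorter` is a suffix, e.g. "cotton coat" / "coat"
    if longer.startswith(shorter):
        return "left"    # `shorter` is a prefix, e.g. "wool coat" / "wool"
    return "center"      # `shorter` sits strictly inside `longer`


def split_center(longer: str, shorter: str) -> str:
    """Return the intermediate string that reduces a center inclusion to
    one right inclusion followed by one left inclusion."""
    start = longer.find(shorter)
    return longer[start:]  # `longer` right-contains it; it left-contains `shorter`


middle = split_center("women's wool coat", "wool")
print(inclusion("women's wool coat", "wool"), "->", middle)
```

Here the center case is reduced exactly as the text describes: the intermediate string is right-contained by the long string and left-contains the short one.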
  • all of the search strings can be constructed into a search term list according to the inclusion relationship between the search strings in a linear time.
  • the string 1 is considered to be an elder node of the string 2. For example, “coat” is an elder node of both “wool coat” and “women's wool coat”, and “wool coat” is also an elder node of “men's wool coat”;
  • a node can have several elder nodes.
  • the lengths of these elder nodes may be different.
  • "wool coat” and “coat” are the elder nodes of "women's wool coat”.
  • when a string has a parent node, the string consists of two parts: a parent node string and a prefix string. The parent node string, which sits at the end of the whole string, is called the main suffix string; for example, "coat” is the main suffix string of “wool coat”.
  • the other part, which precedes the main suffix string, is called the prefix string; for example, “wool” is the prefix string of “wool coat”.
  • Brother relationship refers to two or more nodes juxtaposed under the same parent node. For example, "women's wool coats” and “men's wool coats” under “wool coats”.
  • kinship is a general term for the relationships between all nodes under the same root node, including father-son relationships, brother relationships, and the relationships between nodes formed by chains of father-son relationships. It is like a family: all descendants of a common ancestor are kin to one another, although some are in father-son relationships, some in brother relationships, some in uncle-nephew relationships, and so on.
  • Nodes that do not have direct parent-child relationships or sibling relationships belong to the same directory tree based on this kinship relationship.
  • Such kinship such as father and son, brother, direct, sideline, etc., is the basis for prompting the user for a recursive directory tree.
  • the search term tree is like a family lineage displayed in a recursive directory tree.
  • the search term list can be stored in the form of a data table.
  • the simplest model is taken as an example, and the optimization can be improved on the basis of the core.
  • the data table model of the search term table in the embodiment of the present invention mainly includes four fields:
  • the number field id, which stores the number of the string;
  • the search term field tmsp, which stores the content of the search string;
  • the reverse search term field untmsp, which stores the reversed string of the search term field tmsp;
  • the parent node field bplus, which stores the number of the parent node of the search term and is 0 if there is no parent node.
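As a minimal sketch of this four-field model, the record below uses the field names from the text (`id`, `tmsp`, `untmsp`, `bplus`); the class name `TermRow` and the choice of a Python dataclass instead of a database table are assumptions made for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class TermRow:
    id: int                # number of the search string
    tmsp: str              # the search string itself
    bplus: int = 0         # id of the parent node, 0 when there is none
    untmsp: str = field(init=False, default="")  # tmsp reversed character by character

    def __post_init__(self) -> None:
        # Derive the reversed string once, when the row is created.
        self.untmsp = self.tmsp[::-1]


row = TermRow(id=3, tmsp="wool coat", bplus=1)
print(row)
```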
  • the search term table is a sorted multiway tree. According to the inventor's research, sorting the index name strings by their reversed characters yields exactly a depth-first traversal of the search term table. Because each index name string is first reversed and then sorted, strings in a parent-child relationship must be adjacent, since they share the same right-contained string. The description below refers to Table 11.
  • the data in question is the subtree in which "coat” is located. For example, "cotton coat”, “wool coat”, “women's wool coat” and “cashmere coat” are child nodes of "coat" under the right inclusion relationship, or child nodes of those child nodes.
  • “women's wool coat” is a child node that right-contains “wool coat”. After both strings are reversed character by character, the reversed “wool coat” is a prefix of the reversed “women's wool coat”, so the two strings must be adjacent once sorted.
  • the child nodes of “coat” and the child nodes of those child nodes, sorted by their reversed strings, appear in the order: coat, wool coat, women's wool coat, cotton coat, cashmere coat. It can be seen that a parent node and its descendants are grouped together: a child node appears immediately after its parent node, followed by the other child nodes of the same parent. This arrangement is exactly a depth-first traversal of the tree.
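The reverse-then-sort observation can be checked with a short sketch. It assumes the English terms as stand-ins for the original Chinese strings and Python's default code-point ordering, so the order of siblings differs from the order quoted in the text; what matters is that each parent is immediately followed by its own subtree.

```python
terms = ["coat", "cashmere coat", "wool coat",
         "women's wool coat", "cotton coat", "quilt cover"]


def reverse_sorted(strings):
    """Sort strings by their character-reversed form."""
    return sorted(strings, key=lambda s: s[::-1])


order = reverse_sorted(terms)
for term in order:
    # The shared suffix "coat" becomes the shared prefix "taoc" after reversal.
    print(f"{term[::-1]:<20} <- {term}")
```

In this ordering "coat" comes directly before its descendants, and "women's wool coat" directly follows "wool coat", which is the depth-first layout the text relies on.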
  • on the basis of this reverse ordering of the strings, the data items can be combined into the search term list.
  • processing uses a stack rather than recursion, so that the computation runs in linear time.
  • the stack is implemented with data structures already available in the computer. In a stored procedure, the stack can be implemented as a string, for example: 1, coat; 2, wool coat; 3, women's wool coat.
  • this stack is called the inherited direct tree stack; the cell before a semicolon is the parent of the cell after that semicolon.
  • the content of a stack cell consists of two parts, the field id and the field tmsp, separated by a comma.
  • the stack cells are separated by semicolons.
  • the push and pop operations are implemented with string functions. A push appends the content of the cell to the stack string; a pop keeps only the part of the stack string up to the second-to-last semicolon.
  • a temporary direct tree stack string tmpstackstring is set up to perform the stack processing.
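A string-encoded stack of this kind might look as follows. This is only a sketch: the helper names `push`, `pop` and `cells` are hypothetical, every cell is assumed to end with a semicolon, and terms are assumed not to contain semicolons.

```python
def push(stack: str, node_id: int, term: str) -> str:
    """Append one 'id,term;' cell to the stack string."""
    return f"{stack}{node_id},{term};"


def pop(stack: str) -> str:
    """Drop the last cell: keep everything up to the second-to-last semicolon."""
    cut = stack.rfind(";", 0, len(stack) - 1)  # ignore the trailing semicolon
    return stack[: cut + 1]                    # empty string when one cell was left


def cells(stack: str) -> list:
    """Decode the stack string into (id, term) pairs, bottom cell first."""
    return [tuple(cell.split(",", 1)) for cell in stack.split(";") if cell]


s = ""
for node_id, term in [(1, "coat"), (3, "wool coat"), (4, "women's wool coat")]:
    s = push(s, node_id, term)
print(s)
print(pop(s))
```

Each cell in the string is the parent of the cell that follows it, so the whole string spells out one direct line from an ancestor down to the current node.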
  • the example data is: 1. coat, 2. cashmere coat, 3. wool coat, 4. women's wool coat, 5. cotton coat, 6. quilt cover. The specific steps are as follows:
  • Step 1) generate the untmsp field by reversing the tmsp field character by character; for example, the reversed form of “cotton coat” is stored in untmsp;
  • Step 2) initialize the inherited direct tree stack string stackstring to empty. The stack holds only parent-child relationships, not brother relationships. (A brother relationship means two or more nodes juxtaposed under the same parent node; for example, wool coat and cashmere coat are brothers with respect to coat.) The stack content is the id and untmsp data of the currently inherited elder nodes;
  • Step 3) sort all [id, untmsp] data by untmsp. For example, the data are returned in the order: 1, coat; 3, wool coat; 4, women's wool coat; 5, cotton coat; 2, cashmere coat; 6, quilt cover (each entry is named here by its original term; the sort key is the reversed string);
  • Step 4) read the current data [id, untmsp] for processing against the inherited direct tree stack stackstring; if there is no data, i.e. the current data [id, untmsp] is empty, jump to step 10);
  • Step 5) initialize the temporary direct tree stack string tmpstackstring to empty (parent-child relationships only, no brother relationships);
  • Step 6) if the inherited direct tree stack stackstring is empty, jump to step 8);
  • Step 7) if the inherited direct tree stack stackstring is not empty, search it for the elder nodes of the current data [id, untmsp];
  • if elder nodes are found, push them onto the temporary direct tree stack tmpstackstring; the last of these elder nodes is the parent node of the current node, and the current stack cursor bplus is set to the id of that parent node;
  • if no elder node is found in the inherited direct tree stack stackstring, set the stack cursor bplus to 0; assign the value of the temporary direct tree stack tmpstackstring to the inherited direct tree stack stackstring, push the current data [id, untmsp] onto the inherited direct tree stack, then jump to step 9);
  • Step 8) push the current data [id, untmsp] onto the inherited direct tree stack stackstring, and set the current stack cursor bplus to 0;
  • Step 9) the current stack cursor value is incremented by 1; jump to step 4), read the next data [id, untmsp], and continue the loop;
  • Step 10) end.
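The steps above can be condensed into the following sketch. It is a simplification rather than the procedure as claimed: it keeps the direct line in an ordinary Python list instead of a stack string, and instead of copying elder nodes into a temporary stack it pops entries until the top of the stack is a prefix of the current reversed string. The function name `assign_parents` is hypothetical, and the input is assumed to be deduplicated.

```python
def assign_parents(rows):
    """rows: iterable of (id, term) pairs. Returns {id: bplus} under right inclusion."""
    # Steps 1 and 3: reverse every term, then sort by the reversed form.
    ordered = sorted((term[::-1], node_id) for node_id, term in rows)
    stack = []   # direct line of (reversed term, id), nearest ancestor on top
    bplus = {}
    for reversed_term, node_id in ordered:
        # Discard stack entries that are not elder nodes of the current string.
        while stack and not reversed_term.startswith(stack[-1][0]):
            stack.pop()
        # The remaining top entry, if any, is the parent; otherwise this is a root.
        bplus[node_id] = stack[-1][1] if stack else 0
        stack.append((reversed_term, node_id))
    return bplus


sample = [(1, "coat"), (2, "cashmere coat"), (3, "wool coat"),
          (4, "women's wool coat"), (5, "cotton coat"), (6, "quilt cover")]
parents = assign_parents(sample)
print(parents)
```

Every row is pushed once and popped at most once, so after the initial sort the pass over the data does a linear number of stack operations, which is the property the text attributes to the non-recursive approach.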
  • after string reversal, the sample data look as shown in Table 10.
  • the Bplus parent node field is initialized to 0.
  • sorting Table 10 by the untmsp field yields Table 11.
  • in Table 11, Bplus has already been assigned according to this algorithm.
  • an item with an id of 28, 29 is added, and the assignment of Bplus is supplemented.
  • the Bplus value indicates the parent-child relationship after processing the data in Table 9.
  • Bplus of the coat is 0, indicating that it is the root node, there is no parent node
  • the Bplus value of the wool coat is 1, indicating that it is a child node of the coat (ID value of 1)
  • the Bplus value of the women's wool coat is 3, indicating that it is a child node of the wool coat (ID value of 3).
  • the sub-node of the hood is a quilt cover (ID value is 28, Bplus is 6), and so on.
  • the tree structure constructed from the generated parent-child relationships (id, bplus) is shown in Figure 1.
  • Figure 1 also includes tea, water tea, fruit tea and flower tea. This part of the tree structure is formed in the same way as the tree structure associated with "coat", so the process is not repeated.
  • the search result list table is displayed in a mixed manner.
  • "cotton coat” and "wool coat” are each in a right inclusion relationship with “coat”;
  • "coat button” and “coat” are in a left inclusion relationship;
  • "metal coat button” and “coat button” are in a right inclusion relationship.
  • "metal coat button” and "coat” are in a center inclusion relationship. The center inclusion relationship here can be handled as one right inclusion relationship plus one left inclusion relationship.
  • Optimization methods include:
  • Processing of inclusion relationships between two substrings under the same parent string. For example, "large object transport” and “large object lifting transport” are not in a left inclusion, right inclusion or center inclusion relationship with each other, but they have the same parent string "transport”. Once the parent string is removed, "large object lifting” left-contains “large object”. In this case a second pass can be performed, and "large object lifting transport” is reclassified as a substring of "large object transport”. This processing reduces the number of child nodes directly under the same node.
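One way to picture this second pass is the sketch below. It assumes right inclusion (the parent string is a suffix of each child), compares the remaining prefixes character by character, and uses a hypothetical function name `regroup_siblings`; the quadratic comparison of siblings is chosen for brevity, not efficiency.

```python
from __future__ import annotations


def regroup_siblings(parent: str, children: list[str]) -> dict[str, str]:
    """Map each child to its new parent: the sibling whose prefix part is the
    longest proper prefix of the child's own prefix part, else `parent`."""
    # Strip the shared parent suffix to obtain each child's prefix string.
    stems = {c: c[: len(c) - len(parent)] for c in children if c.endswith(parent)}
    regrouped = {}
    for child, stem in stems.items():
        best, best_len = parent, 0
        for other, other_stem in stems.items():
            if other == child or not other_stem:
                continue
            if stem.startswith(other_stem) and len(other_stem) > best_len:
                best, best_len = other, len(other_stem)
        regrouped[child] = best
    return regrouped


regrouped = regroup_siblings(
    "transport",
    ["large object transport", "large object lifting transport"],
)
print(regrouped)
```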
  • one table is the search term directory table. By querying the search term table, a first set of data items containing the search term is obtained, and this set is displayed to the user in the form of a recursive directory tree.
  • the other table is the information index data table. By querying this table, a second set of data items related to the search term is obtained, that is, the data in the information index data table associated with the first set of data items, and this data is then returned to the user.
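The two lookups can be sketched with in-memory dictionaries. Everything concrete here is an assumption made for illustration: the table contents, the record numbers 101 to 104, the names `term_table`, `info_index` and `search`, and the use of substring matching to decide which terms contain the query.

```python
# Search term directory table: id -> (tmsp, bplus). Hypothetical sample rows.
term_table = {
    1: ("coat", 0),
    2: ("cashmere coat", 1),
    3: ("wool coat", 1),
    4: ("women's wool coat", 3),
}

# Information index data table: search term -> ids of matching information records.
info_index = {
    "coat": [101],
    "wool coat": [102],
    "women's wool coat": [103, 104],
}


def search(query: str):
    """Return (first set, second set) for a query string."""
    # First set: search terms from the directory table that contain the query.
    first = [tmsp for tmsp, _ in term_table.values() if query in tmsp]
    # Second set: information records associated with any term of the first set.
    second = sorted({rec for term in first for rec in info_index.get(term, [])})
    return first, second


first_set, second_set = search("wool coat")
print(first_set, second_set)
```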
  • search term table may contain all of the search strings and may also contain partial search strings.
  • search strings such as clothes, electronic products, and legal libraries can be included in a search term list, or they can be placed in three search term lists.
  • the clothing category can also be divided into a plurality of search term lists; none of these choices goes beyond the scope of the present invention. In every case, only the search term lists that contain the search string entered by the user are returned to the user; lists that do not contain it are not returned.
  • the search term directory table can be understood as a plurality of search term directory tables combined by a plurality of root nodes, wherein each root node separately constitutes a search term directory table.
  • Each search string belongs to a list of search terms under a root node.
  • FIG. 3 is a screenshot of a system prototype of an embodiment of the present invention.
  • the search string entered by the user is "coat”.
  • taking the input search string "coat” as an example, the content returned to the user mainly includes two parts: the first part is obtained from the search term table, and the second part is the content returned from the information index data table.
  • the first part is to obtain a related search string set of the search string after querying the search term table by using the search string.
  • the search term list in which the related search strings are located is displayed. Here the parent string "coat” and its substrings "half coat”, "cotton coat”, "wool coat”, "cashmere coat”, "various coats” and so on are presented to the user as a recursive directory tree of search terms containing the search string. Since the search strings are in parent-child relationships with one another, the user can see the parent string and the substrings to which the search string belongs; as shown in Figure 3, these are the strings related to "coat”, such as "wool coat” and "cashmere coat”.
  • the second part is obtained by looking up the search string in the system and retrieving the data related to it from the information index data table, that is, the summary content of the information related to the search, as shown on the right side of FIG. 3.
  • the product information database may generally include, for example, a manufacturer name, a brand, a price, a product name, and the like.
  • for a legal information database, it can usually include the legal subject, the category of legal liability, etc.; if the query is insurance-related, the various types of insurance are displayed.
  • the result of the query is combined with the information data table for the content returned by the search.
  • Xinxiindex.tmsp 'keyword*" 0
  • query results can be sorted by additional fields of the information data table, such as time or weight. Exact matching gives more precise search results, while fuzzy matching gives broader results; each has its own uses, and the choice between them can be offered through function buttons.
  • because parent strings and child strings are in parent-child relationships, these relationships can be displayed in the form of a search term table. In the display of FIG. 3, the user not only obtains the relevant search results from the content on the right side, but also sees, on the left side, the parent-child relationships between the strings containing the search string, so that the logical connections between the search results are shown as well.
  • a prior-art search engine, which simply ranks results by an ordering derived from the number of user clicks, does not display the parent-child relationships between the search strings of these results.
  • the ranking of prior-art search results depends on the number of user clicks or links, not on the logical relationships between the retrieved strings or on the parent-child relationships between the strings containing them. There is therefore a fundamental difference between the two.
  • the query process for the search term table can be as follows. First check whether the search string exists in the search term table. If it does, the search string is taken as the root of the tree, and its child nodes are queried from the search term table; the query condition is that the parent node number bplus of the child node equals the number id of the current search string. For example, in the algorithm example, the number of "coat” is 1, its child nodes are "cotton coat”, “cashmere coat” and "wool coat”, and their bplus values are all 1.
  • Child nodes under the same parent node can be sorted in order of node characters for easy searching, or in other order, depending on the specific needs of the user.
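A recursive directory tree can be rendered from the (id, tmsp, bplus) rows along these lines. The sketch assumes the rows of the algorithm example, orders siblings alphabetically, and uses a hypothetical function name `render_tree`; real display code would depend on the client technology.

```python
def render_tree(rows, root_term):
    """rows: list of (id, tmsp, bplus). Returns indented lines for the subtree
    rooted at `root_term`, or an empty list if the term is not in the table."""
    children = {}
    for node_id, tmsp, bplus in rows:
        children.setdefault(bplus, []).append((tmsp, node_id))
    root_id = next((i for i, t, _ in rows if t == root_term), None)
    if root_id is None:
        return []
    lines = []

    def walk(node_id, term, depth):
        lines.append("  " * depth + term)
        # Child query condition from the text: the child's bplus equals this id.
        for child_term, child_id in sorted(children.get(node_id, [])):
            walk(child_id, child_term, depth + 1)

    walk(root_id, root_term, 0)
    return lines


rows = [(1, "coat", 0), (2, "cashmere coat", 1), (3, "wool coat", 1),
        (4, "women's wool coat", 3), (5, "cotton coat", 1), (6, "quilt cover", 0)]
tree_lines = render_tree(rows, "coat")
print("\n".join(tree_lines))
```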
  • the search term table structure containing the search string is generated from two kinds of relationships: the left inclusion relationship and the right inclusion relationship. Two sub search term tables can be generated, one for each relationship, and the two can then be formally merged into one search term list.
  • the alias can also be used as a search string: the information index data table and the search term table are queried with it, and the results are returned.
  • the alias data table is queried to find the alias corresponding to the current search string; the alias is then used as the search string to query the information index data table and the search term table, and the corresponding results are returned.
  • the system will use "potato" as a new search string; that is, the information index data table and the search term table are queried for matching information, which is then returned to the user.
  • the data of the alias data table can usually be created manually.
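The alias lookup might be sketched as follows. The alias pair "spud" to "potato", the record numbers, and the names `aliases`, `info_index` and `search_with_alias` are all hypothetical; only the idea of re-running the query with the alias comes from the text.

```python
# Manually maintained alias data table (hypothetical entry).
aliases = {"spud": "potato"}

# Hypothetical information index: search term -> information record ids.
info_index = {"potato": [7, 8]}


def search_with_alias(query, aliases, lookup):
    """Run `lookup` on the query and, if an alias exists, on the alias as well."""
    results = {query: lookup(query)}
    alias = aliases.get(query)
    if alias is not None:
        # The alias is treated as a new search string and queried the same way.
        results[alias] = lookup(alias)
    return results


found = search_with_alias("spud", aliases, lambda term: info_index.get(term, []))
print(found)
```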
  • Figure 4 shows the data returned using the Google keyword tool.
  • with the Google Keyword Tool, a search for "coat” returns up to 150 items of data.
  • the data returned includes “coat”, “coat scarf”, “coat man”, “coat woman”, “coat Taobao”, “VERO MODA coat”, “JACK JONES coat”, “fashion coat”, “coats”, “popular coats”, “men's coats”, “women's coats”, “coats wholesale”, etc. These returned items are independent of one another, with no logical inclusion relationship shown between them, and there is no search term list that displays such relationships.
  • Figure 5 shows the data returned using Baidu Instant Alert.
  • the data suggested is “coat rejection”, “coat rejection picture”, “coat rejection internal structure diagram”, “coat match”, “coat style”, “coat bow tie”, “coat picture”, etc. Apart from sharing the search string "coat", these items show no logical inclusion relationship with one another, and no search term table reflecting the logical relationships between them is displayed.
  • the technical solution provided by the embodiment of the present invention can be applied to the automatic generation of a product catalog, and the generation of a concept tree in a professional technical field, in addition to generating a search prompt.
  • Application example 1: Customs codes (HS codes) in the customs tariff.
  • Figure 6 shows the result of a search on the website of the General Administration of Customs.
  • under "Home > Home > Product Query”, using the search string "coat”, the website's fuzzy query engine returns 8 pages with a total of 158 records, each including the product code and product name, presented to the searcher in the form of a list.
  • however, it does not show the logical connections between the names of these goods, for example that “coat” is the superordinate concept of "men's coat”, “pure wool woven men's coat”, “cotton woven women's coat” and other commodities.
  • when each commodity name is used as a parent string or a substring in the search term table and the search term list is displayed to the searcher, as shown in FIG. 3 of the present invention, the user experience is greatly improved and the scope of the user's query is narrowed.
  • Application example 2: Product and service classification table for trademark registration.
  • the keywords are "cord cloth” and "cloth", respectively.
  • the diagram consists of two parts, the left side is a partial screenshot of the "cord cloth” directory tree, and the right side is a partial screenshot of the "cloth” directory tree.
  • Application example 3: Automatic classification of products in specific industries in the National Economic Industry Classification. The textile industry, the clothing industry, the leather industry, the electronics industry and so on appear in the National Economic Industry Classification only as simple industry descriptions. Products in these industries generally lack more specific and detailed classifications, and product types generally change quickly.
  • the classification can be automated: whenever a new product type is added, the related product information is automatically indexed or segmented according to the technical solution provided by the embodiment of the present invention, the search strings are extracted, and the search term table and the information index data table are generated. The new product is thereby absorbed into the existing search term list and information index data table, which gives the embodiment of the present invention a strong capacity to accommodate new product types.
  • take CDMA products as an example.
  • known product types include: CDMA mobile phone, BlackBerry CDMA mobile phone, luxury CDMA mobile phone, dual SIM card CDMA mobile phone, CDMA mobile phone charger, CDMA antenna, CDMA wireless antenna, CDMA network card, CDMA wireless network card, CDMA telephone, CDMA fixed wireless telephone, CDMA coin-operated telephone, outdoor CDMA coin-operated telephone.
  • FIG. 8 shows the search term list displayed after the CDMA products are processed in the embodiment of the present invention.
  • by automatically indexing or segmenting the CDMA products and generating the search term list, every CDMA product can find a suitable position in the search term list shown in the figure, and the approach adapts well to the addition of new products.
  • Application example 4: The concept domain of the legal profession. FIG. 9 is an exemplary diagram of a concept tree for "insurance” in the embodiment of the present invention.
  • the search term directory table constructed by the embodiment of the present invention forms a tree-shaped semantic network: the root node is the superordinate concept of the concept set of the specific domain, and the child nodes are subordinate concepts.
  • the superordinate concept is a generalization of the attributes of all its subordinate concepts.
  • a subordinate concept is a refinement of the superordinate concept from a particular angle.
  • the embodiment of the present invention is implemented as follows:
  • 1. Establish an inverted index. After the legal provisions are automatically indexed or segmented, the conceptual phrases are used to establish an inverted index. Phrases such as insurance, property insurance, vehicle insurance, local insurance, export information insurance, industry insurance, third party liability statutory insurance, the insurance index, industrial injury insurance, national statutory insurance, cargo transportation insurance, life insurance, labor insurance, pension insurance and travel accident insurance are entered into the inverted index, which can be built in any of the ways known in the prior art;
  • Generate a search term directory table based on the original search term data table.
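The inverted index step above can be sketched in a few lines. The text leaves the construction to the prior art, so this is just one common way to do it; the provision ids (`art-1` and so on), the pre-extracted phrases, and the function name `build_inverted_index` are assumptions for illustration.

```python
from collections import defaultdict


def build_inverted_index(documents):
    """documents: {doc_id: [index phrases]} -> {phrase: sorted list of doc_ids}."""
    index = defaultdict(set)
    for doc_id, phrases in documents.items():
        for phrase in phrases:
            index[phrase].add(doc_id)
    return {phrase: sorted(doc_ids) for phrase, doc_ids in index.items()}


# Hypothetical legal provisions, already indexed or segmented into phrases.
provisions = {
    "art-1": ["insurance", "property insurance"],
    "art-2": ["insurance", "life insurance"],
    "art-3": ["property insurance", "vehicle insurance"],
}
inverted = build_inverted_index(provisions)
print(inverted["insurance"])
```

The keys of such an index are the candidate search strings, so deduplicating them gives the input from which the search term directory table is generated.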
  • the search term list is displayed in the form of a recursive directory tree, which is based on the legal concept.
  • the relevant legal terms can be classified, and the search can be performed on this basis, and the relevant search strings and their logical connections can be clearly seen, so that the precision is higher.
  • the concept tree is represented in the same way as the recursive directory tree.
  • the representation form of the recursive directory tree is various, and is not limited to the various modes in the drawing, and the core lies in the kinship relationship of each search string.
  • FIG. 10 is a schematic diagram of the data flow when the structure of the embodiment of the present invention is in use.
  • when the search engine receives the search string input by the user, it looks up the search string in the search term list and in the information index data table. The information index data table is generated from the information data table; the original search term data table is in turn generated from the information index data table; and the search term list is generated from the original search term data table.
  • the search engine will also retrieve the alias data table to query the alias corresponding to the search string.
  • the parent-child relationship (which may be determined by using the bplus field or other manner) between the data item and the data item in the same directory tree data item set in the search term directory table may be configured as a directory tree.
  • the display of the directory tree can be implemented with whatever technology suits the deployment, including stand-alone software, client/server mode and browser/server mode. For example: programming a custom client to render the tree structure; using HTML to present the results as a tree structure in a browser; or using WML to present the results as a tree structure in a mobile browser.
  • the specific display manner does not exceed the protection scope of the embodiment of the present invention.
  • a recursive directory tree helps display the hierarchy between the various search strings.
  • the directory tree is the most common and intuitive way. The specific display may differ, however, and many display variants exist. For example, a multi-level menu with a directory function is a variant of a directory tree, in which each lower-level menu holds the sub-items of the menu above it. However diverse the presentation, the parent-child relationships and the hierarchy between upper and lower levels that it expresses remain the same.
  • FIG. 10-1 is an example of a multi-level menu with a directory function provided by an embodiment of the present invention.
  • the child nodes under the same node in the tree structure are transformed into the next level menu.
  • the lower menu is dynamically displayed when the cursor moves over. For example, when the user pays attention to the wool coat, the cursor is moved to it, and accordingly the next level node men's wool coat and women's wool coat are displayed. At this point, the nature of the recursive directory tree structure has not changed.
  • These child nodes have a common parent node "coat", only the way it is displayed has changed. This gives the user another way to display the directory tree structure.
  • FIG. 10-2 is an example of a multi-level menu with a directory function, together with product information results, provided by an embodiment of the present invention.
  • when "cashmere coat" is selected with the cursor, the search term directory table is queried for it accordingly.
  • the search term table returns relevant search strings, such as cotton coats, cashmere coats, wool coats and coat buttons.
  • the information index data table returns product information related to the currently selected search string, such as various brands of cashmere coats.
  • sending the second set of data items to the user may proceed as follows:
  • a second data item set is obtained by simple matching of the search string and the first data item set against the information index data table, which specifically includes the following two methods:
  • the first data item set can prompt, guide and link to information that does not exactly match the search string but is indirectly related to it; as shown in Figure 10-2, product information for "cotton coat” is displayed, which does not exactly match the search string "coat” but is indirectly related to it.
  • a second set of data items is obtained by recursive matching of the search string and the first data item set, specifically comprising:
  • when the user's search string has no result in the information index data table, the recursive matching result is returned;
  • for example, the search string input by the user is "rabbit coat”. If there is no search result for it in the information index data table, the results for an upper node of the search string can be returned, that is, for a string in an inclusion relationship with it, such as "coat”;
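The fallback described here can be sketched as follows. The sketch assumes that upper nodes are right-contained, so it simply tries ever shorter suffixes of the query until one has results; the index contents and the function name `search_with_fallback` are hypothetical.

```python
def search_with_fallback(query, info_index):
    """Return (matched term, records). If the query itself has no records,
    fall back to the longest suffix of the query that does."""
    for start in range(len(query)):
        candidate = query[start:]        # start == 0 tries the query itself
        records = info_index.get(candidate)
        if records:
            return candidate, records
    return None, []


# Hypothetical information index: search term -> information record ids.
info_index = {"coat": [11, 12], "wool coat": [13]}
matched = search_with_fallback("rabbit coat", info_index)
print(matched)
```

An alternative, when the query already exists in the search term table, would be to follow its bplus chain upward instead of scanning suffixes.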
  • the search string is obtained, a search term directory table is generated, and the search term list table with the search string as a node is returned to the user.
  • the embodiments of the present invention well solve the technical problems in terms of data sources, construction methods, display modes, and the like of the search term list, and have good market application prospects.

Description

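A retrieval method and system
Technical field
The present invention relates to the field of computer information processing, and in particular to a retrieval method and system.
Background art
In existing information retrieval technology there are many methods for processing search strings. The most common are technical solutions based on statistical methods; there are also dedicated information retrieval methods constructed from specific semantic rules. Where semantics are not processed, that is, without targeting [text missing in the source], the only option at present is to use the aforementioned statistical methods to induce and guess the semantics. Beyond that, the only known approach is the Google PageRank method, which exploits the link structure of the massive information on the Internet and is based on citations (inbound links).
In summary, existing information retrieval processing technology clearly has inconveniences and defects in practical use, so improvement is necessary.
Consider also search prompts. The purpose of a search engine is to guide users to the information they care about: users rely on the search engine to obtain unknown information of interest on the basis of what they already know. However, users cannot always find an appropriate and accurate wording for unknown information, and even when they know the main keyword, they want good and sufficient prompts for all information related to that keyword. The prior art includes: 1. Chinese patent application No. 200610112822.4, titled "Method for search prompting based on an inverted list". 2. Baidu related search, Google's keyword tool, and the like. These techniques all generate search prompts from statistics over the query terms entered by users. Their drawbacks are as follows. First, the prompted content consists of filtered, top-ranked data. The data is merely a list and is incomplete in content. Second, because the data presented to the user as a list are mutually independent, with each item standing alone while carrying prompt information related to the search term, the data volume is often very large, and the effort the user must spend to find useful information increases significantly. Third, the items have no logical structure or semantic features, and the prompts leave the user at a loss. Baidu related search gives only 10 related search prompts. The Google keyword tool can give up to 150 related search prompts, but these prompts are unorganized and have no logical relevance. In addition, these search prompts are mostly based on modeling the behavior of massive numbers of users, on the assumption that what most people do is what the searching user needs. For example, in 2008, statistics over massive numbers of users showed an association between Beijing and the Olympics; in the spring of 2009 Beijing was associated with influenza A, and on New Year's Day 2010 with blizzards. Once the search clicks of massive numbers of users become sudden or orchestrated, the search prompts are directly affected by them. In view of this, a better method of prompting searches and presenting retrieved information is also needed.
Summary of the invention
The object of the present invention is to provide an information retrieval processing method and system. The invention distills semantics, and guides and returns retrieved information, through a formalization process applied to search strings. In addition, it makes search prompts more concise, clear and logically complete.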
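To achieve the object of the present invention, a retrieval method is provided, comprising the steps of:
A. querying a search term directory table according to the search term input by a user on a terminal, and obtaining a first data item set containing the input search term, wherein kinship relationships exist among the data items of the first data item set;
B. querying an information index data table according to each data item of the first data item set associated with the input search term, and obtaining a second data item set;
C. combining and sending the first data item set to the terminal, wherein the first data item set is combined recursively; and
sending the second data item set to the terminal.
Preferably, step A further comprises the following step:
A1. generating the search term directory table.
Preferably, in step B, obtaining the second data item set comprises the following steps:
querying the information index table with each data item of the first data item set and performing simple matching to obtain the second data item set; or
querying the information index table with each data item of the first data item set and performing recursive combination matching to obtain the second data item set.
Preferably, in step A1, generating the search term directory table comprises the following steps:
A11. matching the original strings in the original search term data table pairwise against each other to determine the inclusion relationships between them;
A12. determining, according to the inclusion relationships, the parent-child relationships between the pairwise-matched original strings;
A13. generating data item sets D1, D2, ..., Dn respectively from the pairwise-matched original strings that have parent-child relationships, where n is greater than or equal to 1; the data item sets D1, D2, ..., Dn make up the search term directory table; wherein the original strings of the data items of a data item set Dn are in kinship relationships with one another.
Preferably, the inclusion relationship comprises:
left inclusion, right inclusion, center inclusion, or no inclusion.
Preferably, step A12 comprises the following steps: if the at least two original strings are in a left inclusion or right inclusion relationship, setting the two original strings as parent and child, the contained original string being the parent; and
if the at least two original string sets are in an inclusion relationship, setting the two original string sets as parent and child, the contained original string set being the parent.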
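Preferably, in step A13, when the containment relationship is right containment, forming the search term directory table from the data item sets further comprises forming it on the basis of sorting the data items after string reversal.
Sorting after string reversal comprises the steps of:
A131. generating a reversed search term field by reversing the search term field;
A132. initializing the inherited direct-line tree stack to empty;
A133. sorting by the reversed search term field to obtain all [id, reversed search term] records;
A134. reading the current [id, reversed search term] record for processing against the inherited direct-line tree stack; if there is no data, that is, the current record is empty, jumping to step A1310; otherwise proceeding to step A135;
A135. initializing the temporary direct-line tree stack to empty;
A136. if the inherited direct-line tree stack is empty, jumping to step A138;
A137. if the inherited direct-line tree stack is not empty, searching it for the elder nodes of the current [id, reversed search term] record;
if elder nodes are found, pushing them onto the temporary direct-line tree stack, the last of these elder nodes being the parent node of the current record, and setting the value at the current cursor to the id of that parent node;
if no elder node is found in the inherited direct-line tree stack, setting the value at the cursor to 0; then assigning the temporary direct-line tree stack to the inherited direct-line tree stack, pushing the current [id, reversed search term] record onto the inherited direct-line tree stack, and jumping to step A139;
A138. pushing the current [id, reversed search term] record onto the inherited direct-line tree stack and setting the value at the current cursor to 0;
A139. advancing the cursor by 1 and jumping to step A134 to read the next [id, reversed search term] record, thereby looping;
A1310. end.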
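Preferably, step A11 further comprises a step of generating the original search term data table, which comprises:
deduplicating the information index data of the information index data table to generate the set of original strings, thereby obtaining the original search term data table.
Preferably, the retrieval method further comprises a step of generating the information index data table, which comprises:
obtaining index terms from the information data table;
building an inverted data table from the index terms or index term sets, and generating the information index data table from the inverted data table.
Preferably, the retrieval method further comprises the steps of:
querying the alias data table according to the input search term to obtain the alias search terms corresponding to the input search term; and repeating steps A to C with each alias search term as a new input search term.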
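To achieve the object of the invention, an information retrieval processing system is also provided, comprising:
- an information data table module, for storing all the information content;
- an information index data table module, for generating the information index data table from all the information content of the information data table module and storing it;
- an original search term data table module, for deduplicating the information index data table to obtain original strings, the original strings forming original string sets, and all the original string sets being composed into and stored as the original search term data table;
- a search term directory table module, for generating data item sets by matching each original string against the other original strings in the original search term data table according to their kinship, the data item sets being composed into and stored as the search term directory table;
- a search engine module, for querying the search term directory table according to the search term input by the user at the terminal and obtaining the first data item set containing the input search term, wherein the data items of the first data item set are related by kinship; querying the information index data table according to the data items of the first data item set associated with the input search term and obtaining the second data item set; and combining and sending the first data item set, and sending the second data item set, to the terminal.
Preferably, the containment relationship is one of no containment, left containment, right containment and middle containment;
the mutual matching in the search term directory table module means matching each original string against the other original strings for containment, and generating the data item from each original string together with the other original strings that are in a left and/or right containment relationship with it.
Preferably, the data item sets of the search term directory table form a sorted multiway search tree.
Preferably, the search engine module is further configured to present the first data item set related to the input search term according to the recursive relationships of the search tree.
Preferably, the second data item set is associated with the first data item set.
Preferably, the processing system further comprises an alias data table module, for storing alias strings of the original strings, the alias strings being used for alias queries on the original strings;
the second data item set is also associated with the first data item set through the alias data table.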
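Compared with the prior art, the present invention has at least the following advantages:
First, the retrieval results are highly deterministic: they depend only on whether the search string exists in the search term directory table and on the corresponding information index data table. They do not depend on the search behaviour of other users, nor are they affected by the behaviour of information content providers.
Second, the retrieval or search results are accurate: if the user's search string exists in the search term directory table and the information index data table, accurate feedback is obtained; if it does not, no semantically unrelated result is returned.
Third, the technical solution allows logical deduction: other, unknown information that is logically related to the search string can be obtained.
Fourth, the retrieval results are presented intuitively, displayed in the form of a recursive directory tree [...]
Finally, the search term directory table, displayed as a recursive directory tree, corresponds to the content of the information index data table, which makes finding the target information in the information index data table faster and more accurate.
In summary, the invention requires neither massive data nor statistics. It can greatly compress redundant search suggestions; its high exhaustiveness improves recall, and its strong consistency helps improve precision. The invention also has good generality and portability.
Brief description of the drawings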
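The drawings used in describing the embodiments or the prior art are listed below. They show only some embodiments of the invention; a person of ordinary skill in the art can obviously derive other drawings from them without creative effort.
Figure 1 shows an example structure of the search term directory table in an embodiment of the invention;
Figure 2 shows the result of displaying the search term directory table in mixed form when left-containing and right-containing search strings coexist;
Figure 3 is a screenshot of a system prototype of an embodiment;
Figure 4 shows data returned by the Google keyword tool;
Figure 5 shows data returned by Baidu instant suggestions;
Figure 6 shows the result of a search on the General Administration of Customs website;
Figure 7 shows an example of a search term directory table constructed with an embodiment;
Figure 8 shows the search term directory table displayed after an embodiment processes CDMA products;
Figure 9 shows an example of a concept tree constructed for insurance in an embodiment;
Figure 10 is a system structure diagram of an embodiment;
Figure 10-1 shows another way of displaying the recursive directory tree structure provided by an embodiment;
Figure 10-2 is a schematic diagram of a search result provided by an embodiment;
Figure 11 is a flowchart of a retrieval method provided by an embodiment;
Figure 12 is a flowchart of obtaining retrieval results provided by an embodiment.
Detailed description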
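The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the invention.
The basic principle of the invention is to use the objective, pre-existing connection between search strings and the target information the user wants to find, and to show the user the formal connection between search string and target information, thereby making retrieval results more objective. It does not rely on modelling the search behaviour of massive numbers of users as the prior art does, because such modelling is easily disturbed by human intervention, which destroys the objectivity of the association between search string and target information, harms retrieval accuracy and degrades the user experience. Moreover, in the embodiments the objective connection between search strings and target information is fed back to the user as a search term directory table whose nodes are search strings, which is a vivid, intuitive and accurate form of display.
Through long exploration the inventor found that target information is itself formalizable, for example: 大衣 (overcoat) - 羊毛大衣 (wool overcoat) - 女士羊毛大衣 (ladies' wool overcoat) - 男士羊毛大衣 (men's wool overcoat) - 进口女士羊毛大衣 (imported ladies' wool overcoat). The associations among such pieces of target information are also latent in this form. Consider the logical containment among these five strings: "大衣" is the superordinate concept of every kind of overcoat, and "羊毛大衣" is in turn the superordinate concept of the other wool overcoats. Similarly, the associations among 电脑 (computer), 台式电脑 (desktop computer), 笔记本电脑 (notebook computer) and 液晶笔记本电脑 (LCD notebook computer) are closely tied to the form in which they are described. Such associations between form and content are numerous and widespread.
This association between the form and content of target information matters greatly for the user experience. If, after the user enters a search string, the associations among the target information related to that string can be shown, the user obtains not only the usual retrieval results but also other target information associated with the search string. These results are determined by the objective semantics and classification of the search string itself and are not disturbed by the subjective behaviour of other users.
Exploiting this property helps users locate target information quickly, matches their search needs as closely as possible, and significantly improves retrieval efficiency as well as the information capacity and speed of search suggestions.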
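Based on the above principle, as shown in Figure 10, an embodiment of the invention provides an information retrieval processing system, comprising:
- an information data table module 1004, for storing all the information content;
- an information index data table module 1003, for generating the information index data table from all the information content of the information data table module 1004 and storing it;
- an original search term data table module 1005, for deduplicating the information index data table to obtain original strings, the original strings forming original string sets, and all the original string sets being composed into and stored as the original search term data table;
- a search term directory table module 1006, for generating data item sets by matching each original string against the other original strings in the original search term data table according to their kinship, the data item sets being composed into and stored as the search term directory table;
- an alias data table module 1007, for storing alias strings of the original strings, the alias strings being used for alias queries on the original strings;
- a search engine module 1002, for querying the search term directory table according to the search term input by the user at the terminal 1001 and obtaining the first data item set containing the input search term, wherein the data items of the first data item set are related by kinship; querying the information index data table according to the data items of the first data item set associated with the input search term and obtaining the second data item set; and combining and sending the first data item set, and sending the second data item set, to the terminal 1001.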
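The information retrieval processing system of the embodiment works as follows:
(1) Generating the information index data table. The information index data table module 1003 treats all stored information content as the full data to be queried, indexes its full text or abstracts (Chinese text requires word segmentation), extracts the original strings and builds an inverted data table, which serves as the information index data table.
Full-text or abstract indexing of information content and extraction of original strings are prior art and are therefore not described one by one in the embodiments.
The original strings extracted after indexing are the search string keywords the system processes. These keywords are determined by the objective nature of the information to be queried; each piece of information may correspond to one or more keywords.
(2) Generating the original search term data table. The original search term data table module 1005 selects the unique original strings from the information index data table to generate the original search term data table. The table consists of unique original strings, which are the source of the search suggestion data.
(3) Generating the search term directory table. The search term directory table module 1006 matches each original string in the original search term data table against the other original strings in the table and determines the containment relationships among these original strings or original string sets.
A match has only four possible outcomes: no containment, left containment, right containment or middle containment. Containment between original strings reflects a semantic relationship. From the containment relationships, a parent-child relationship is generated for each pair of directly related original strings, and the search term directory table is thereby constructed.
(4) Querying according to the search term the user inputs at the terminal 1001 to obtain the data matching it. The search engine module 1002 queries the search term directory table and obtains the first data item set containing the input search term, wherein the data items of the first data item set are related by kinship; queries the information index data table according to the data items of the first data item set associated with the input search term and obtains the second data item set; and combines and sends the first data item set, and sends the second data item set, to the terminal 1001.
The search engine module 1002 thus obtains both a search suggestion in the form of a recursive directory tree and the data matching the input search term. On receiving a search request, it queries the search term directory table with the input search term and returns the directory-table-shaped structure of search terms containing it, thereby suggesting information related to the input search term; it also queries the information index data table with that result to obtain the information data associated with the input search term. The suggested information appears in the form of the search term directory table and clearly shows the recursive hierarchy among the related search terms.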
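In another possible implementation, the original strings of the search term directory table of module 1006 may come from manually compiled data in addition to the original search term data table obtained by indexing or word segmentation.
In one possible implementation, the original search term data table module 1005 filters the original search term data table to remove unnecessary or meaningless original strings.
Manual compilation supplements and re-edits the result of indexing or word segmentation. When the target information lags behind newly appearing search strings, those strings can be added to or edited into the search term directory table.
Preferably, the search term directory table is generated with a stack-based, non-recursive method that can process massive data in a short, linear time.
Preferably, when the recursive directory tree search suggestion is generated, the alias data table of module 1007 is queried at the same time and search suggestions for the aliases of the user's search string are returned, which improves the user experience.
The information retrieval processing system provided by the embodiment is therefore a system that makes search suggestions on the basis of a search term directory table. Compared with the prior art it has the following advantages:
(1) Search strings are presented in the structure of the search term directory table. Because the structure is hierarchical, by default only the first level of the table needs to be shown rather than all the data. If the user wishes, all the data can be displayed, so the embodiment has high exhaustiveness and improved recall.
(2) In the embodiment, because containment between search terms is applied when the search term directory table is generated, [...] the item sought either does not exist or is certain to be in this search term directory table, which greatly improves the user experience. Because the semantics between search strings in the table is one-directional, the embodiment has strong one-way consistency, which helps improve precision.
(3) The search term directory table is generated by a linear, non-recursive algorithm that is fast and convenient to implement, which makes the invention easier to put into practice.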
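Figure 11 shows a retrieval method provided by an embodiment of the invention, comprising the steps of:
1101) receiving the search term input by the user at the terminal 1001;
1102) querying the search term directory table with the input search term and obtaining the first data item set containing it, the data items of the first data item set being related by kinship;
1103) querying the information index data table according to the data items of the first data item set associated with the input search term, and obtaining the second data item set;
1104) combining the first data item set containing the search string and sending it to the user's terminal 1001, the data items of the first data item set being combined into a recursive directory tree according to their kinship; and
sending to the user's terminal 1001 the second data item set in the information index data table that matches the input search term.
Preferably, after the search string input by the user is received, the method may further comprise:
querying the alias data table according to the search term input at the terminal 1001 to obtain the associated terms (that is, alias search terms) corresponding to the input search term, and then repeating steps 1102 to 1104 with each alias search term as a new input search term.
Preferably, after the search string input by the user is received, the method may further comprise:
if the input search term does not hit the information index data table or the search term directory table, prompting the user to re-enter it.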
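Preferably, the retrieval method further comprises a step of generating the search term directory table.
Preferably, obtaining the second data item set comprises:
querying the information index table with the data items of the first data item set and obtaining the second data item set by simple matching; or
querying the information index table with the data items of the first data item set and obtaining the second data item set by recursive combination matching.
Preferably, generating the search term directory table comprises the steps of:
matching the original strings in the original search term data table against one another pairwise to determine the containment relationships between them;
determining, from the containment relationships, the parent-child relationships between the pairwise-matched original strings;
generating the search term directory table from the parent-child relationships, the table comprising data item sets D1, D2, ..., Dn (n ≥ 1), the data items of a data item set Dn being related by kinship.
Preferably, the containment relationship is one of: left containment, right containment, middle containment, or no containment.
Preferably, generating the search term directory table from the parent-child relationships may further comprise:
if at least two strings are in a left or right containment relationship, setting the two strings as parent and child, the contained string being the parent.
The original search string set is generated by deduplicating the information index data of the information index data table, and serves as the original search term data table.
The information index data table is generated by obtaining index terms from the information to be queried and building an inverted data table from the index terms or index term set names; this inverted table serves as the information index data table.
Embodiments of the retrieval method and system of the invention are described below with reference to the drawings.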
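The concepts used in the embodiments and their meanings are as follows:
Search term directory table: all search strings arranged, according to the containment relationships among them, into a recursive directory tree structure in which directly related search strings are parent and child, and multiple layers of parent-child relationships stack into a recursive directory tree. The search strings come mainly from the original search term data table, and may also be data manually supplemented or edited on that basis.
User: a person or terminal 1001 able to issue an information query request, which may include persons or query terminals 1001 in various locations that use the system to query information.
Search string: the string the user enters when querying; the search term directory table is made up of search terms.
Search engine: returns content data related to the search string the user has entered.
Information data table module 1004: the data table that stores the actual content, which may be held in several fields. For a commodity information data table, for example, the content may include commodity type, brand, price, manufacturer and so on.
Information index data table module 1003: the inverted data table generated by automatically indexing or segmenting all or part of the content of the information data table. For instance, the commodity database above may generate an inverted data table for the commodity type only.
Original search term data table module 1005: the data obtained by removing duplicates from the index of the information index data table.
Alias data table module 1007: stores alias data for keywords. The alias data table can be built in various ways, for example from synonyms or near-synonyms; 马铃薯 (potato) is also called 土豆 or 洋芋. The table is mainly used for query expansion of search strings: when the user queries an alias or a canonical name, the canonical name corresponding to the alias, or the aliases corresponding to the canonical name, are displayed as well, which broadens the search and improves the user experience.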
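The steps of the embodiment are described in detail below:
1. Generating the information index data table
An inverted list is a data structure commonly used in search engines. It is indexed by term, and each entry is the set of documents containing that term, so the set of documents containing one or several terms can be found quickly.
When records must be looked up by data items that are not keys, that is, by attribute value, an index is built on the attribute values: each entry in the index table consists of an attribute value that may occur and the addresses of the records in which that value occurs. The storage address of a record, or its key, can then be found in reverse from one of its attribute values. Such an index is called an inverted index.
The information index data table is an inverted list. The terms used as the index can be obtained by automatic full-text or abstract indexing, or word segmentation, of the main data of the information table. These index terms are the search strings the system processes. The information table is the carrier of the target information and can typically be a business information table, a legal information table, a public resource information table and so on.
The example data contained in the information data table is shown below in tabular form, with simplified database fields. Table 1 gives the structure of the information data table and Table 3 that of the information index data table; Tables 2 and 4 hold the example data, and the step from Table 2 to Table 4 illustrates how the information index data table is generated. A person skilled in the art can generate other forms of information index data table by following the generation process described in the embodiment.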
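1.1 Information data table
The information data table stores the actual information data; the table name is xinxi.
Figure imgf000014_0001
Table 1: Information data table
Figure imgf000014_0002
Table 2-2: Example of information data table set data
1.2 Information index data table
The information index data table stores the index data of the information; the table name is xinxiindex.
Figure imgf000015_0001
Table 3: Information index data table
Example data
Figure imgf000015_0002
Figure imgf000015_0003
Table 4-2: Example of information index data table set data
In this example the information index data table has three fields: Indexed (the id of the index entry), Xinxiid (the id of the information item that the entry indexes) and Tmsp (the index term). The index term comes from automatic full-text indexing or word segmentation of the content field of the information data table. For example, the first record in Table 2, "棉大衣, 羊毛大衣" (cotton overcoat, wool overcoat), becomes two index rows in Table 4. Besides the keyword, each index entry records the xinxiid of the information item it indexes, so that the data can be looked up.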
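The step from Table 2 to Table 4 can be sketched in Python as follows. This is only an illustration under stated assumptions: the comma splitter stands in for real automatic indexing or word segmentation, the field name indexid and the second and third sample rows are hypothetical (the table images are not available), and only the first row "棉大衣, 羊毛大衣" comes from the text.

```python
from collections import defaultdict


def split_terms(content):
    """Stand-in for automatic indexing: split on ASCII or full-width commas."""
    return [t.strip() for t in content.replace(",", ",").split(",") if t.strip()]


def build_index(xinxi_rows, tokenize=split_terms):
    """Return xinxiindex-style rows (indexid, xinxiid, tmsp), one per index term."""
    rows = []
    for xinxiid, content in xinxi_rows:
        for tmsp in tokenize(content):
            rows.append((len(rows) + 1, xinxiid, tmsp))
    return rows


def postings(index_rows):
    """Group index rows into term -> list of xinxiid (the inverted view)."""
    inverted = defaultdict(list)
    for _indexid, xinxiid, tmsp in index_rows:
        inverted[tmsp].append(xinxiid)
    return dict(inverted)


# Row 1 is the record quoted in the text; rows 2 and 3 are made-up samples.
xinxi = [(1, "棉大衣, 羊毛大衣"), (2, "女式羊毛大衣"), (3, "羊绒大衣")]
xinxiindex = build_index(xinxi)
inverted = postings(xinxiindex)
```

The first information record yields two index rows that both point back to xinxiid 1, which is the behaviour the paragraph above describes.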
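In Table 4-2 the relationship between index term set names and index term sets is:
用户界面 (user interface) {configure user interface, upgrade}
用户界面¥ {print, install, configure user interface, upgrade, migrate, universal serial bus}
用户界面∑ {print, install, configure user interface, upgrade, migrate, universal serial bus, usage wizard, install}
Those skilled in the art know that by adding, deleting or changing fields of the information index data table, its function can be edited accordingly; this does not depart from the core idea of the embodiments and is obvious.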
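2. Generating the original search term data table
The original search term data table can be generated by selecting the unique values of the index field of the information index data table. It is the raw data from which the search term directory table is generated. The table can be updated as needed by updating the search strings in it, and may be updated continually if required. For example, when the data of the information index data table changes, the index field may change accordingly; if selecting the unique values of the index field then yields new search strings, the original search term data table is updated.
The original search term data table stores the initial set of search strings. In one possible implementation it has two fields, the search term id and the search term, and the table name may be, for example, tmsporiginal.
Tmsporiginal:
Figure imgf000016_0001
Table 6: Example data of the original search term data table
Taking the information index data table of Table 4 as an example, it contains four search strings: 棉大衣 (cotton overcoat), 羊毛大衣 (wool overcoat), 女式羊毛大衣 (ladies' wool overcoat) and 羊绒大衣 (cashmere overcoat). Assigning an id to each gives the original search term data table shown in Table 6.
When an original search string is to be added, it can be appended directly after the rows of Table 6 and given an id, so that the existing id order is not disturbed.
When an original search string is to be deleted, the string and its id can be removed. When an original search string is to be replaced, either a new original search string can be added or the corresponding search term and its id in the original search term table can be replaced.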
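The uniqueness selection and the append-only update can be sketched as follows. The function names are hypothetical, and the ids are assigned in first-seen order purely for illustration; the actual ids of Table 6 are in an image that is not available here.

```python
def build_original_terms(index_terms):
    """Deduplicate index terms, giving each distinct term an id in first-seen order."""
    ids = {}
    for term in index_terms:
        if term not in ids:
            ids[term] = len(ids) + 1
    return [(tid, term) for term, tid in ids.items()]


def append_term(table, term):
    """Append a new term with the next id, leaving the existing ids untouched."""
    if any(t == term for _tid, t in table):
        return table
    next_id = max((tid for tid, _t in table), default=0) + 1
    return table + [(next_id, term)]


tmsporiginal = build_original_terms(
    ["棉大衣", "羊毛大衣", "女式羊毛大衣", "羊绒大衣", "羊毛大衣"]
)
```

Appending with `max(id) + 1` rather than `len(table) + 1` keeps ids unique even after rows have been deleted.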
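Those skilled in the art know that different types of target information call for attention to different things. For a commodity database, consumers usually care about the product name, type, price, manufacturer and place of origin. Users of a legal database are more likely to care about the subjects, objects, conditions, time limits and so on laid down in a particular legal regime. This does not go beyond the scope of protection of the embodiments, and such data can equally be processed with the technical solution of the embodiments.
3. Generating the search term directory table
The search term directory table is generated mainly by matching containment relationships. Each search string in the original search term data table is matched against the other search strings in the table to determine its containment relationship with them. Processing according to the different containment relationships then yields the search term directory table.
In the embodiments, containment between the characters of terms is taken by default to indicate semantic containment between the terms. For example, "羊毛大衣" (wool overcoat) is the superordinate concept of "女式羊毛大衣" (ladies' wool overcoat) and semantically contains it. Grammatical analysis shows that when two terms are in a containment relationship, the containing term is usually a type subdivision, or subordinate concept, of the contained term. After the contained term is removed from the containing term, the remainder generally modifies or complements the contained term, as with "钢笔" (fountain pen) and "笔" (pen), or "笔" (pen) and "笔芯" (pen refill). According to whether one term string contains another, and where the contained string sits within the containing one, containment between terms is divided into four kinds: no containment, left containment, right containment and middle containment.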
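The four-way classification can be sketched as a small Python function. The function name and the returned labels are hypothetical; checking the right end before the left end is an arbitrary choice for strings that match at both ends, and identical strings are reported as "none" on the assumption that the term table has already been deduplicated.

```python
def containment(longer, shorter):
    """Classify how `longer` contains `shorter`: 'right', 'left', 'middle' or 'none'."""
    if not shorter or shorter == longer:
        return "none"
    if longer.endswith(shorter):      # contained string sits at the right end
        return "right"
    if longer.startswith(shorter):    # contained string sits at the left end
        return "left"
    if shorter in longer:             # contained string sits strictly inside
        return "middle"
    return "none"


examples = [
    ("棉大衣", "大衣"),          # right containment
    ("羊毛大衣", "羊毛"),        # left containment
    ("女士羊毛大衣", "羊毛"),    # middle containment
    ("被罩", "大衣"),            # no containment
]
labels = [containment(a, b) for a, b in examples]
```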
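By arranging the search strings according to the four containment relationships above, a tree-shaped search term directory table can be constructed. The table is a sorted multiway tree; between two adjacent nodes on the same branch, the parent node is the longest string contained in the child node. The root of the tree is a virtual node; removing it leaves a forest of strings.
Figure 1 shows an example structure of the search term directory table in an embodiment. It contains the strings 大衣 (overcoat), 棉大衣 (cotton overcoat), 羊绒大衣 (cashmere overcoat), 羊毛大衣 (wool overcoat), 女士羊毛大衣 (ladies' wool overcoat), 被罩 (quilt cover), 茶 (tea), 水茶, 果茶 (fruit tea), 花茶 (scented tea) and 茉莉花茶 (jasmine tea). The example shows right containment; for instance, "棉大衣" right-contains "大衣".
In processing containment, no-containment pairs are discarded; only the other three relationships need to be recognized. Moreover, middle containment can be realized by compounding left containment and right containment: for example, "女士羊毛大衣" middle-contains "羊毛", which can be decomposed into "女士羊毛大衣" right-contains "羊毛大衣" and "羊毛大衣" left-contains "羊毛". In the embodiments, containment matching consists of two passes, one for left containment and one for right containment. The two passes follow similar flows, and together they identify the containment relationship between any two search strings. The description of the implementation below mainly uses right containment matching; examples of left and middle containment are given later. The embodiments can arrange all search strings into a search term directory table, according to their containment relationships, in linear time.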
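3.1 Concepts used in the search term directory table
Parent-child relationship and nodes
In the search term directory tree, the strings at different nodes stand in parent-child relationships, defined as follows:
A string consists of a number of characters: string = (character n, ..., character 3, character 2, character 1).
When string 2 ends with all the characters of string 1, that is, string 2 right-contains string 1, string 1 is regarded as an elder node of string 2. For example, "大衣" is an elder node of both "羊毛大衣" and "女士羊毛大衣", and "羊毛大衣" is also an elder node of "女士羊毛大衣".
A node can have several elder nodes, which may differ in length; "羊毛大衣" and "大衣" are both elder nodes of "女士羊毛大衣".
Parent node: if string 1 is the longest (that is, the best-matching) elder node of string 2, string 1 is the parent node of string 2. Thus "羊毛大衣" is the parent node of "女士羊毛大衣", and "大衣" is the parent node of "羊毛大衣".
When a string has a parent node, it consists of two parts: the parent node string and an attached prefix string. The parent node string, at the end of the whole string, is called the main suffix string, as "大衣" is to "羊毛大衣"; the part in front of it is called the attached prefix string, as "羊毛" is to "羊毛大衣".
Sibling relationship: two or more mutually parallel nodes under the same parent node, such as "女式羊毛大衣" (ladies' wool overcoat) and "男式羊毛大衣" (men's wool overcoat) under "羊毛大衣".
Kinship: the general term for the relationships among all nodes under the same root, including parent-child relationships, sibling relationships, and the relationships between nodes formed by chains of these. Just as all the descendants of one ancestor in a family are related to one another, whether as parent and child, as siblings, as uncle and nephew and so on, so in the directory tree containing "大衣" every other node stands in a direct or indirect parent-child relationship with it, and all the nodes of that tree, "大衣" included, are related by kinship; every search string other than "大衣" is a direct or indirect child string of "大衣". Nodes with no direct parent-child or sibling relationship belong to the same directory tree by virtue of this kinship. Kinship, whether parent-child, sibling, lineal or collateral, is the basis on which the recursive directory tree is shown to the user. The search term directory tree is like a family genealogy displayed as a recursive directory tree.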
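The definitions of elder node, parent node and sibling under right containment can be written down directly as a brute-force Python sketch. It is quadratic and intended only to make the definitions concrete, not to reproduce the linear stack method the description uses later; all function names are hypothetical.

```python
def elders(term, terms):
    """All strings that `term` right-contains (ends with), shortest first."""
    return sorted((t for t in terms if t and t != term and term.endswith(t)), key=len)


def parent(term, terms):
    """The longest elder, or None when the term is a root."""
    found = elders(term, terms)
    return found[-1] if found else None


def siblings(term, terms):
    """Other terms that share the same parent as `term`."""
    p = parent(term, terms)
    return [t for t in terms if t != term and parent(t, terms) == p]


terms = ["大衣", "羊毛大衣", "女士羊毛大衣", "男士羊毛大衣", "棉大衣"]
```

Because every elder of a term is a suffix of that term, two elders of the same length would be identical, so sorting by length orders them unambiguously.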
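3.2 Search term directory table
In the embodiments the search term directory table can be stored as a data table. The simplest model is described below; it can be improved and optimized on this core. The data table model of the search term directory table in the embodiment has four main fields:
id, the identifier field, storing the id of the string;
tmsp, the search term field, storing the content of the search string;
untmsp, the reversed search term field, storing the search term obtained by reversing the tmsp field;
bplus, the parent node field, storing the id of the search term's parent node, or 0 when there is no parent.
The search term directory table is a sorted multiway tree. The inventor found that sorting the index name strings character by character after reversing them yields exactly a depth-first traversal of the search term directory table: once the strings are reversed and then sorted, strings in a parent-child relationship share the same right-contained string and therefore fall in one contiguous run. This is illustrated with Table 11, using the subtree rooted at "大衣". "棉大衣", "羊毛大衣", "女士羊毛大衣" and "羊绒大衣" are all children, or children of children, of "大衣" under right containment. Reversed, the strings "衣大", "衣大棉", "衣大毛羊", "衣大毛羊士女" and "衣大绒羊" share the left substring "衣大", so they are necessarily adjacent after sorting.
For example, "女士羊毛大衣" is a right-containment child of "羊毛大衣". Reversed, they become "衣大毛羊" and "衣大毛羊士女", and these two strings are likewise necessarily adjacent after sorting.
Taking "大衣": it, its children and their children sort after reversal as "衣大, 衣大毛羊, 衣大毛羊士女, 衣大棉, 衣大绒羊". Children and their children are grouped together; a child appears immediately after its parent, followed only then by the other children of the same parent. This order is exactly a depth-first traversal of the tree.
On the basis of this logic, the search term directory table can therefore be formed by sorting the data items after string reversal.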
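The claim that sorting the reversed strings gives a depth-first order can be checked with a short Python sketch. Note one assumption: Python's `sorted` compares Unicode code points, whereas the order quoted in the text appears to follow a different collation (for example pinyin), so siblings may come out in a different order here; the depth-first property itself does not depend on the collation. The helper names are hypothetical.

```python
def reversed_order(terms):
    """Sort terms by their reversed form (the untmsp field)."""
    return sorted(terms, key=lambda s: s[::-1])


def is_depth_first(ordered):
    """True if every term is followed immediately by one contiguous run of
    the terms that right-contain it, with none appearing later."""
    for i, t in enumerate(ordered):
        j = i + 1
        while j < len(ordered) and ordered[j].endswith(t):
            j += 1
        if any(u.endswith(t) for u in ordered[j:]):
            return False
    return True


terms = ["大衣", "羊绒大衣", "羊毛大衣", "女士羊毛大衣", "棉大衣", "被罩"]
ordered = reversed_order(terms)
```

In the sorted result "女士羊毛大衣" always comes directly after "羊毛大衣", because a reversed string is a prefix of the reversed form of every string that right-contains it.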
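In the embodiment a stack is used instead of recursion, so that the computation takes linear time. The stack is implemented with an existing data structure; in a stored procedure it can be implemented as a string, for example: 1,大衣; 2,羊毛大衣; 3,女士羊毛大衣. This stack is called the inherited direct-line tree stack: the unit before a semicolon is the parent node of the unit after it. Each stack unit has two parts, the id field and the tmsp field, separated by a comma; units are separated by semicolons. Push and pop are implemented with string functions: a push appends the unit to the stack string, and a pop returns the string up to the second-to-last semicolon.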
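A string-encoded stack of this kind can be sketched in Python as follows. The exact encoding is an assumption: every unit here ends with a semicolon, so a pop keeps everything up to and including the second-to-last semicolon, and terms are assumed to contain no semicolon. The function names are hypothetical.

```python
def push(stack, node_id, text):
    """Append one 'id,text;' unit to the stack string."""
    return f"{stack}{node_id},{text};"


def pop(stack):
    """Drop the last unit by cutting after the second-to-last semicolon."""
    cut = stack.rfind(";", 0, len(stack) - 1)
    return stack[: cut + 1]


def units(stack):
    """Decode the stack string into (id, text) pairs, bottom first."""
    return [tuple(u.split(",", 1)) for u in stack.split(";") if u]


s = push(push("", 1, "大衣"), 3, "羊毛大衣")
```

In a general-purpose language a list would normally serve as the stack; the string form matters mainly where, as in a stored procedure, only string functions are at hand.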
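In the embodiment a temporary direct-line tree stack string tmpstackstring is set up for the stack processing. The example data is: 1 大衣, 2 羊绒大衣, 3 羊毛大衣, 4 女士羊毛大衣, 5 棉大衣, 6 被罩. The steps are:
Step 1): generate the untmsp field by reversing the tmsp field; for example "棉大衣" reversed becomes "衣大棉";
Step 2): initialize the inherited direct-line tree stack string stackstring to empty (it holds only parent-child relationships, no sibling relationships; siblings are two or more parallel nodes under the same parent, as 羊毛大衣 and 羊绒大衣 are under 大衣). The stack holds the id and untmsp of the elder nodes inherited so far;
Step 3): sort by untmsp to obtain all [id, untmsp] records. For the example the records are returned in the order "1,衣大; 3,衣大毛羊; 4,衣大毛羊士女; 5,衣大棉; 2,衣大绒羊; 6,罩被";
Step 4): read the current [id, untmsp] record; if there is no data, that is, the current record is empty, jump to step 10). For the example data the first record read is tmpid=1, tmpuntmsp='衣大'; the second is tmpid=3, tmpuntmsp='衣大毛羊'; the third is tmpid=4, tmpuntmsp='衣大毛羊士女'; and so on;
Step 5): initialize the temporary direct-line tree stack string tmpstackstring to empty (parent-child only, no siblings);
Step 6): if the inherited stack stackstring is empty, jump to step 8);
Step 7): if the inherited stack stackstring is not empty, search it for the elder nodes of the current [id, untmsp] record;
if elder nodes are found, push them onto the temporary stack tmpstackstring; the last of them is the parent node of the current record, and the bplus value at the current cursor is set to the id of that parent;
if no elder node is found in stackstring, set the bplus value at the cursor to 0; then assign tmpstackstring to stackstring, push the current [id, untmsp] record onto stackstring, and jump to step 9);
For example, when the cursor reaches the second record "3,衣大毛羊", stackstring="1,衣大;". "衣大" is found to be the parent node of "衣大毛羊", so the Bplus value at the cursor is set to 1 and tmpstackstring="1,衣大;". The temporary stack is assigned to the inherited stack and the current record "3,衣大毛羊" is pushed, giving stackstring="1,衣大; 3,衣大毛羊";
Step 8): push the current [id, untmsp] record onto stackstring (after the first push, for example, stackstring="1,衣大;") and set the Bplus field at the current cursor to 0;
Step 9): advance the cursor to the next record and jump to step 4) to read the next [id, untmsp] record, looping;
Step 10): end.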
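Steps 1) to 10) can be condensed into the following Python sketch. It is not the stored-procedure form described above: a list replaces the stack string, and instead of copying elder nodes into a temporary stack it pops non-elders off the top, which leaves the same chain of elders because each stack entry is a prefix of the one above it. The function name is hypothetical. The sort dominates the cost; the stack pass after it is linear in the number of records.

```python
def assign_parents(terms):
    """terms: dict of id -> tmsp. Returns dict of id -> bplus (0 means root)."""
    # Steps 1 and 3: reverse each term and sort the (id, untmsp) records.
    rows = sorted(((tid, t[::-1]) for tid, t in terms.items()), key=lambda r: r[1])
    stack = []   # inherited direct-line stack: (id, untmsp) of current ancestors
    bplus = {}
    for tid, untmsp in rows:                      # steps 4 and 9: cursor loop
        # Step 7: keep only elders, i.e. entries that are prefixes of untmsp.
        while stack and not untmsp.startswith(stack[-1][1]):
            stack.pop()
        bplus[tid] = stack[-1][0] if stack else 0  # parent is the last elder
        stack.append((tid, untmsp))                # push the current record
    return bplus


terms = {1: "大衣", 2: "羊绒大衣", 3: "羊毛大衣", 4: "女士羊毛大衣",
         5: "棉大衣", 6: "被罩", 28: "棉被罩", 29: "厚棉被罩"}
bplus = assign_parents(terms)
```

Run on the example data plus the two extra rows (ids 28 and 29), this gives the same id-to-parent assignment that the description reports for Table 11.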
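After string reversal the example data looks as shown in Table 10, in which the Bplus parent node field is initialized to 0 for every row.
Figure imgf000021_0001
Table 10
Sorting Table 10 by the untmsp field gives Table 11, in which Bplus has already been assigned by the algorithm. Rows with id 28 and 29 have been added to this table to further illustrate the Bplus assignment.
Figure imgf000021_0002
id tmsp untmsp Bplus
1 大衣 衣大 0
3 羊毛大衣 衣大毛羊 1
4 女士羊毛大衣 衣大毛羊士女 3
5 棉大衣 衣大棉 1
2 羊绒大衣 衣大绒羊 1
6 被罩 罩被 0
28 棉被罩 罩被棉 6
29 厚棉被罩 罩被棉厚 28
Table 11
In Table 11 the Bplus values express the parent-child relationships obtained by processing the data of Table 10. For example, 大衣 has Bplus 0, meaning it is a root with no parent; 羊毛大衣 has Bplus 1, meaning it is a child of 大衣 (id 1); 女士羊毛大衣 has Bplus 3, meaning it is a child of 羊毛大衣 (id 3). Likewise the child of 被罩 (id 6) is 棉被罩 (id 28, Bplus 6), and so on.
The tree built from the generated parent-child relationships (id, bplus) is shown in Figure 1. Figure 1 also includes 茶, 水茶, 果茶, 花茶 and so on; that part of the tree is generated in the same way as the part for "大衣" and is not described again.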
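Figure 2 shows the result of displaying the search term directory table in mixed form when left-containing and right-containing search strings coexist. "棉大衣" and "羊毛大衣" each stand in a right containment relationship with "大衣"; "大衣紐扣" (overcoat button) stands in a left containment relationship with "大衣"; and "金属大衣紐扣" (metal overcoat button) stands in a right containment relationship with "大衣紐扣". "金属大衣紐扣" and "大衣" thus form a middle containment relationship, which is handled with one right containment pass and one left containment pass.
The processing described above can be further improved and optimized. The optimizations include:
(1) Handling containment between two child strings of the same parent string. For example, "大型物件运输" (transport of large objects) and "大型物件起重运输" (lifting and transport of large objects) are not in a left, right or middle containment relationship, but they share the parent string "运输" (transport). Once the parent string is removed, "大型物件起重" left-contains "大型物件"; in this case a second pass can classify "大型物件起重运输" as a child string of "大型物件运输". This reduces the number of child nodes under a single node.
(2) Adding auxiliary fields to speed up display of the search term directory table data, such as the depth of the current node in the directory, or the ordinal of the current node among the children of its root, and so on.
4. Returning to the user the directory-tree search suggestion and the data matching the search string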
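After receiving the search string entered by the user, the search engine queries two tables. One is the search term directory table: querying it yields the first data item set containing the search term, which is displayed to the user as a recursive directory tree. The other is the information index data table: querying it yields the second data item set related to the search term, that is, the data in the information index data table associated with the first data item set, which is then returned to the user.
A person skilled in the art will understand that a search term directory table may contain all search strings or only some. For example, search strings for clothing, electronic products and a legal library may all be placed in one search term directory table or in three separate ones. Clothing alone may also be split across several tables. None of these arrangements departs from the scope of protection of the invention. In every case, only the search term directory table containing the search string entered by the user is returned; tables that do not contain it are not. From this point of view the search term directory table can be understood as a plurality of directory tables formed from a plurality of root nodes, each root node forming one table on its own, and every search string belonging to the table under one root node.
Figure 3 is a screenshot of a system prototype of an embodiment. The search string entered by the user is "大衣". Once the search string has been entered, the content returned to the user has two main parts: the first comes from the search term directory table and the second from the information index data table.
The first part is the set of related search strings obtained by querying the search term directory table with the search string. As shown on the left of Figure 3, the search term directory table containing the related search strings is displayed, in which the parent string "大衣" and its child strings "半大衣", "棉大衣", "羊毛大衣", "羊绒大衣", "各式大衣", [...] are presented to the user as a recursive directory tree of search terms containing the search string. Because of the parent-child relationships among search strings, the user can see the parent string of the search string as well as its child strings, such as "羊毛大衣" and "羊绒大衣" suggested in Figure 3 in relation to "大衣".
The second part is obtained by searching the system for the search string and fetching the related data from the information index data table, that is, summary information related to the search. As shown on the right of Figure 3, this is the relevant information in the information index data table that contains the search string. For a commodity database it typically includes the manufacturer name, brand, price, commodity name and so on; for a legal database it typically includes the various legal subjects, categories of legal liability and so on, and for insurance-related queries the various types of insurance.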
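The search string is looked up in the information index data table by exact match, and the result is joined with the information data table to produce the content returned by the search. In the embodiment the following join query can be used to return the result: "select * from xinxi, xinxiindex where xinxi.xinxiid = xinxiindex.xinxiid and xinxiindex.tmsp = 'keyword'". The query can be ordered by additional fields of the information data table, such as a time weight. Exact matching gives more accurate results and fuzzy matching gives broader ones; each has its uses, and the choice can be offered through a function button.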
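The join query can be tried out with Python's built-in sqlite3 module, as in the sketch below. The table and column names xinxi, xinxiindex, xinxiid, content and tmsp come from the description; the indexid column, the sample rows, the `search` helper and the use of LIKE for the fuzzy variant are assumptions made for illustration. The keyword is passed as a bound parameter rather than pasted into the SQL text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE xinxi (xinxiid INTEGER PRIMARY KEY, content TEXT);
    CREATE TABLE xinxiindex (indexid INTEGER PRIMARY KEY, xinxiid INTEGER, tmsp TEXT);
""")
conn.executemany("INSERT INTO xinxi VALUES (?, ?)",
                 [(1, "棉大衣, 羊毛大衣"), (2, "女式羊毛大衣")])
conn.executemany("INSERT INTO xinxiindex VALUES (?, ?, ?)",
                 [(1, 1, "棉大衣"), (2, 1, "羊毛大衣"), (3, 2, "女式羊毛大衣")])


def search(conn, keyword, fuzzy=False):
    """Exact match by default; with fuzzy=True match any index term containing the keyword."""
    op, arg = ("LIKE", f"%{keyword}%") if fuzzy else ("=", keyword)
    sql = ("SELECT DISTINCT xinxi.xinxiid, xinxi.content FROM xinxi "
           "JOIN xinxiindex ON xinxi.xinxiid = xinxiindex.xinxiid "
           f"WHERE xinxiindex.tmsp {op} ? ORDER BY xinxi.xinxiid")
    return conn.execute(sql, (arg,)).fetchall()


exact = search(conn, "羊毛大衣")
broad = search(conn, "羊毛大衣", fuzzy=True)
```

With the sample rows, the exact search returns only record 1, while the fuzzy search also returns record 2, whose index term merely contains the keyword.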
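Because of the parent-child relationships between parent strings and child strings, they can be arranged and displayed in the form of the search term directory table. In Figure 3 the user not only obtains the retrieval results from the content on the right, but also sees on the left the parent-child relationships among the strings containing the search string, so that the logical relatedness of the results is shown as well.
In the prior art, by contrast, the results are ordered purely by an inverted ranking formed from user click counts, and the search engine does not display the parent-child relationships of the search strings of those results. The ordering of prior-art results depends on the number of user clicks or links, not on the logical relationships among search strings, let alone on the parent-child relationships among strings containing the search string. The two approaches are therefore fundamentally different.
The search term directory table can be queried as follows. First check whether the search string exists in the table. If it does, take it as the root and query the table for its child nodes, the condition being that the child's parent id bplus equals the id of the current search string. In the algorithm example, "大衣" has id 1, and its children "棉大衣", "羊绒大衣" and "羊毛大衣" all have bplus 1. The children of children can be returned recursively in the same way, or Ajax can be used to generate a node's children automatically when the node is clicked. Children of the same parent can be sorted by the characters of the node string to ease lookup, or arranged in some other order, depending on the user's needs.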
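The recursive lookup of child nodes by bplus can be sketched in Python as follows, using the rows of Table 11 as in-memory data. The function name and the nested-tuple result shape are hypothetical, and children are sorted by plain string comparison, which is only one of the orderings the text allows.

```python
def subtree(rows, term):
    """rows: list of (id, tmsp, bplus). Returns (tmsp, [child subtrees]) rooted
    at `term`, or None if the term is not in the directory table."""
    children = {}
    ids = {}
    for rid, tmsp, bplus in rows:
        children.setdefault(bplus, []).append((rid, tmsp))
        ids[tmsp] = rid
    if term not in ids:
        return None

    def build(rid, tmsp):
        kids = sorted(children.get(rid, []), key=lambda r: r[1])
        return (tmsp, [build(*k) for k in kids])

    return build(ids[term], term)


rows = [(1, "大衣", 0), (3, "羊毛大衣", 1), (4, "女士羊毛大衣", 3), (5, "棉大衣", 1),
        (2, "羊绒大衣", 1), (6, "被罩", 0), (28, "棉被罩", 6), (29, "厚棉被罩", 28)]
quilt = subtree(rows, "被罩")
coat = subtree(rows, "大衣")
```

For a lazily expanded display, only the first level (`children[rid]`) would be fetched per click instead of the whole subtree.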
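The structure of the search term directory table containing the search string is produced through two relationships, left containment and right containment. Each yields a sub-table, and the two sub-tables can be merged in form into one search term directory table.
As an optimization, since the user's search string may have aliases, the aliases can also be used as search strings to query the information index data table and the search term directory table and return the corresponding results, improving the accuracy and coverage of retrieval. In the embodiment the alias data table is queried to find the aliases corresponding to the current search string, and each alias is then used as a search string in the same two tables. For example, if the user's search string is "土豆" (potato), the alias data table shows that it has the alias "马铃薯", so the system takes "马铃薯" as a new search string, queries the information index data table and the search term directory table for matching information, and returns it to the user. The data of the alias data table can usually be created manually.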
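Alias expansion can be sketched in Python as follows. Representing the alias data table as a list of synonym groups is an assumption made for illustration; the group itself (马铃薯, 土豆, 洋芋) is the example given in the description, and the function name is hypothetical.

```python
def expand_with_aliases(term, alias_groups):
    """Return the input term followed by its aliases, without duplicates."""
    expanded = [term]
    for group in alias_groups:
        if term in group:
            expanded.extend(t for t in group if t not in expanded)
    return expanded


alias_groups = [["马铃薯", "土豆", "洋芋"]]
queries = expand_with_aliases("土豆", alias_groups)
```

Each string in the returned list would then be run through the same directory table and index table lookups as the original search term.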
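To show more directly how the retrieval effect of the embodiment differs from the prior art, a comparison is made below using the drawings. Figures 4 and 5 give two concrete examples.
Figure 4 shows data returned by the Google keyword tool. A search for "大衣" in the tool returns at most 150 entries, including "大衣", "大衣围巾", "大衣 男", "大衣 女", "大衣 淘宝", "VERO MODA 大衣", "JACK JONES大衣", "时尚大衣", "外套大衣", "流行大衣", "男装大衣", "女装大衣", "流行大衣", "大衣批发" and so on. The returned entries are independent of one another, show no logical containment among themselves, and still less a search term directory table displaying such containment.
Figure 5 shows data returned by Baidu instant suggestions. When "大衣" is entered, the suggestions are "大衣柜" (wardrobe), "大衣柜图片" (wardrobe pictures), "大衣柜内部结构图" (wardrobe interior layout), "大衣搭配", "大衣款式", "大衣蝴蝶结的打法", "大衣图片" and so on. Apart from containing the search string 大衣, these entries show no logical containment among themselves, still less a search term directory table reflecting their logical relationships.
Comparing Figure 3 with Figures 4 and 5 shows that the search suggestions returned by both Baidu and Google are incomplete and carry no semantic features. Semantically, "男装大衣" in Figure 4 should be a kind of "大衣", and "大衣柜" in Figure 5 is the superordinate concept of "大衣柜图片" and "大衣柜内部结构图", but neither Figure 4 nor Figure 5 displays these logical connections intuitively in the form of a search term directory table.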
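Besides generating search suggestions, the technical solution provided by the embodiments can also be used to generate commodity catalogues automatically and to generate concept trees for specialist technical fields.
Application example 1: querying customs codes (HS codes) in the customs tariff. Figure 6 shows the result of a search on the General Administration of Customs website. When the search string "大衣" is queried under "Home > Home > Commodity query" on the official website, the site's fuzzy query engine returns 158 records over 8 pages, presented to the searcher as a list of commodity codes and commodity names. The intrinsic logical connections among the commodity names are not shown, for instance that "大衣" is the superordinate concept of "男装大衣", "纯毛机织男大衣" and "棉制梭织女大衣". If the method provided by the embodiment were applied, with commodity names as parent or child strings in a search term directory table that is generated and displayed to the searcher as in Figure 3, the user experience would improve greatly and the range the user has to search would shrink.
Application example 2: the classification of goods and services for trademark registration. Registering a trademark requires a search for similar marks on products of the same industry. The method provided by the embodiment makes it easier to find the product classes of registered marks and gives the user a reference for choosing. Figure 7 shows an example of a search term directory table constructed with the embodiment, for the keywords "帘子布" (cord fabric) and "布" (cloth). The figure has two parts: a partial screenshot of the "帘子布" directory tree on the left and of the "布" directory tree on the right. The left shows the kinds of goods and services related to cord fabric: goods such as "膨体帘子布", "涤纶帘子布", "锦纶帘子布" and "浸胶帘子布", and services such as "加工帘子布". The right shows goods related to the search string "布", such as "里子布" and "里布". It also reveals several different but synonymous descriptions related to "里布" (lining cloth), such as 里子布 and 里料布. Using the alias data table as well would further help the user find the target information accurately.
Application example 3: automatically classifying the products of specific industries in the Industrial Classification for National Economic Activities (《国民经济行业分类》), such as the textile, clothing, leather and electronics industries. The classification contains only brief industry descriptions; the products of these industries generally have no more detailed classification, and product types usually change quickly. With the approach provided by the embodiment, classification can be automated: whenever a new product type is added, it suffices to index or segment the product information automatically, extract the search strings, and generate the search term directory table and the information index data table, whereupon the new product is absorbed into the existing tables. The embodiment is therefore well able to accommodate new product types. Taking CDMA products as an example, the known product types include: CDMA手机 (CDMA mobile phone), 黑莓 CDMA手机, 豪华型 CDMA手机, 双 SIM卡 CDMA手机, CDMA手机充电器, CDMA天线, CDMA无线天线, CDMA上网卡, CDMA无线上网卡, CDMA电话机, CDMA固定无线电话机, CDMA投币电话机 and 户外 CDMA投币电话机. Figure 8 shows the search term directory table displayed after the embodiment processes these CDMA products by automatic indexing or word segmentation and generation of the search term directory table. Every CDMA product finds a suitable place in the table shown, and new products are easily added.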
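Application example 4: building a concept tree for the legal domain. Figure 9 shows an example of a concept tree constructed for "保险" (insurance) in an embodiment. The search term directory table constructed by the embodiment forms a tree-shaped semantic network: the root node is the superordinate concept of the set of concepts in the specific domain, and the child nodes are subordinate concepts. A superordinate concept summarizes the properties of all its subordinate concepts, and a subordinate concept refines the superordinate concept from a particular angle. The embodiment is implemented as follows:
1. After the legal provisions are automatically indexed or segmented, the concept phrases among them are used to build an inverted index. The phrases are 保险 (insurance), 财产保险 (property insurance), 车辆保险 (vehicle insurance), 当地保险 (local insurance), 出口信息保险, 行业保险 (industry insurance), 第三者责任法定保险 (statutory third-party liability insurance), 对前款保险标的的保险, 工伤保险 (work injury insurance), 国家法定保险 (state statutory insurance), 货物运输保险 (cargo transport insurance), 人身保险 (personal insurance), 劳动保险 (labour insurance), 养老保险 (pension insurance), 旅游意外保险 (travel accident insurance) and so on. The index can be built in any of the ways known in the prior art;
2. These concept phrases are then used as search strings to generate the original search term data table: the unique phrases are selected from the inverted index table of item 1, and the original search term data table is built with them as search strings;
3. The search term directory table is generated from the original search term data table. As shown in Figure 9, the table generated from the phrases above is displayed as a recursive directory tree whose nodes are legal concepts.
4. When a search request is received from the user, the concept tree structure containing the search string is returned. In this example, as Figure 9 shows, the concept tree related to insurance is returned.
Using the search term directory table, related legal terms can be classified; searching on this basis shows the related search strings and their logical connections clearly, giving higher precision.
In the application example of Figure 9 the concept tree takes the same form as the recursive directory tree. In the embodiments of the invention the recursive directory tree can in fact take many forms, not limited to those in the drawings; its essence lies in the kinship among the search strings.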
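Figure 10 is also a schematic diagram of the data flow in the structure of the embodiment.
On receiving the search string entered by the user, the search engine queries the search term directory table and the information index data table. The information index data table is generated from the information data table; the original search term data table is generated from the information index data table; and the search term directory table is generated from the original search term data table. The search engine also queries the alias data table for the aliases corresponding to the search string.
In the embodiments a directory tree can be constructed from the parent-child relationships (determined with the bplus field or in some other way) between the data items within the data item set of one directory tree in the search term directory table. The tree can be displayed with whatever technology suits the situation, including standalone software, client-server and browser-server modes: for example, a custom client programmed to render the tree, results wrapped in HTML for display as a tree in a browser, or results wrapped in WML for display as a tree in a mobile browser. The particular display method does not go beyond the scope of protection of the embodiments.
A recursive directory tree helps show the hierarchy among search strings. The directory tree is the most common and most intuitive form, but the actual display can vary and has many variants. A multi-level menu with a directory function, for example, is a variant of the directory tree in which each lower-level menu holds the children of the menu above. However the presentation varies, the parent-child relationships and hierarchy it expresses stay the same.
Figure 10-1 shows an example of a multi-level menu with a directory function provided by an embodiment. The child nodes under one node of the directory tree become the next-level menu, which is displayed dynamically when the cursor passes over it. For example, when the user is interested in 羊毛大衣 and moves the cursor onto it, its next-level nodes 男士羊毛大衣 and 女士羊毛大衣 are displayed. The essence of the recursive directory tree is unchanged: these child nodes all share the ancestor "大衣", and only the manner of display differs. This gives the user another way of viewing the directory tree structure.
Figure 10-2 shows an example of a multi-level menu with a directory function together with commodity information results. In this embodiment, when the cursor selects 羊绒大衣, the search term directory table and the information index data table are queried accordingly. The search term directory table returns the related search strings, such as 棉大衣, 羊绒大衣, 羊毛大衣 and 大衣钮扣. The information index data table returns the commodity information related to the selected search string, such as cashmere overcoats of various brands.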
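In the embodiments of the invention, the second data item set sent to the user can be obtained in the following ways:
1. The second data item set is obtained from the information index data table by simple matching of the search string through the first data item set, in one of two ways:
1.1 When the entered search string is a node of the first data item set, the information exactly matching the search string is displayed directly, as in Figure 3, where the commodity information for the search string "大衣" is shown directly;
1.2 When the entered search string is a node of the first data item set, information can be displayed indirectly through the suggestions, guidance and links of the first data item set, as in Figure 10-2, where commodity information for "羊绒大衣", which matches the search string "大衣" but is only indirectly related to it, is shown.
2. The second data item set is obtained by recursive combination matching of the search string through the first data item set, specifically:
2.1 When the user's search string has no result in the information index data table, a recursive match is returned. For example, if the user enters "兔毛大衣" (rabbit-fur overcoat) and it has no result in the information index data table, the results for a superordinate node of the search string, that is, a string in a containment relationship with it such as "大衣", can be returned;
2.2 Whichever node of the first data item set the user's search string hits, all or part of the information corresponding to the nodes of the first data item set is displayed. As in Figure 12, whichever node under "保险" is entered, for example "工伤社会保险", the information in the information index data table corresponding to the recursively matched nodes "保险", "社会保险" and "工伤社会保险" can be displayed; the information corresponding to related combinations such as "城镇社会保险" and "当地社会保险" can also be displayed.
Further to 2.1, when information on the superordinate concept of the search string has been returned and the user clicks a specific subordinate concept, the information on that subordinate concept is then displayed. For instance, whether "大衣" or "羊绒大衣" is entered, the commodity information for "大衣" is returned directly, and when the user goes on to select "羊绒大衣" or another subordinate concept, the information for that concept is returned. Further to 2.2, the recursively matched combined information can be displayed in any number of arrangements. Such variations in display do not go beyond the scope of protection of the invention.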
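Case 2.1, falling back to a superordinate node when the search string itself has no result, can be sketched in Python as follows. The sketch assumes right containment only and tries the longest elder first; the function name, the catalogue set and the postings dictionary (term to list of record ids) are hypothetical stand-ins for the search term directory table and the information index data table.

```python
def search_with_fallback(term, catalog_terms, postings):
    """Return (matched_term, record_ids); fall back to the nearest elder with data."""
    if postings.get(term):
        return term, postings[term]
    # Elders are catalogue terms that the query ends with; try the longest first.
    candidates = sorted(
        (t for t in catalog_terms if t and t != term and term.endswith(t)),
        key=len, reverse=True,
    )
    for elder in candidates:
        if postings.get(elder):
            return elder, postings[elder]
    return None, []


catalog_terms = {"大衣", "羊毛大衣"}
postings = {"大衣": [1, 2], "羊毛大衣": [2]}
hit = search_with_fallback("兔毛大衣", catalog_terms, postings)
```

Returning the matched term alongside the records lets the caller tell the user that the results belong to the broader concept rather than to the string that was typed.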
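The invention automatically indexes or segments the raw data, obtains the search strings, generates the search term directory table and returns to the user a search term directory table whose nodes are search strings, giving the user a good retrieval experience. The embodiments solve the technical problems of the data source, construction and display of the search term directory table, and have good prospects for market application.
The embodiments described above do not limit the scope of protection of the invention. Any modification, equivalent substitution or improvement made within the spirit and principles of the invention shall fall within its scope of protection.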

Claims

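1. A retrieval method, characterized by comprising the steps of:
A. querying a search term directory table according to a search term input by a user at a terminal, and obtaining a first data item set containing the input search term, wherein the data items of the first data item set are related by kinship;
B. querying an information index data table according to the data items of the first data item set associated with the input search term, and obtaining a second data item set;
C. combining the first data item set and sending it to the terminal, wherein the first data item set is combined recursively; and
sending the second data item set to the terminal.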
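2. The retrieval method of claim 1, characterized in that step A further comprises the step of:
A1. generating the search term directory table.
3. The retrieval method of claim 1, characterized in that, in step B, obtaining the second data item set comprises:
querying the information index table with the data items of the first data item set and obtaining the second data item set by simple matching; or
querying the information index table with the data items of the first data item set and obtaining the second data item set by recursive combination matching.
4. The retrieval method of claim 2, characterized in that, in step A1, generating the search term directory table comprises the steps of:
A11. matching the original strings in the original search term data table against one another pairwise to determine the containment relationships between them;
A12. determining, from the containment relationships, the parent-child relationships between the pairwise-matched original strings;
A13. generating data item sets D1, D2, ..., Dn (n ≥ 1) from the pairwise-matched original strings that have parent-child relationships, the data item sets D1, D2, ..., Dn forming the search term directory table, wherein the original strings of the data items in a data item set Dn are related by kinship.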
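5. The retrieval method of claim 4, characterized in that the containment relationship is one of:
left containment, right containment, middle containment, or no containment.
6. The retrieval method of claim 4, characterized in that step A12 comprises the steps of:
if at least two original strings are in a left or right containment relationship, setting the two original strings as parent and child, the contained original string being the parent; and
if at least two original string sets are in a containment relationship, setting the two original string sets as parent and child, the contained original string set being the parent.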
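7. The retrieval method of claim 6, characterized in that, in step A13, when the containment relationship is right containment, forming the search term directory table from the data item sets further comprises forming it on the basis of sorting the data items after string reversal;
sorting after string reversal comprises the steps of:
A131. generating a reversed search term field by reversing the search term field;
A132. initializing the inherited direct-line tree stack to empty;
A133. sorting by the reversed search term field to obtain all [id, reversed search term] records;
A134. reading the current [id, reversed search term] record for processing against the inherited direct-line tree stack; if there is no data, that is, the current record is empty, jumping to step A1310; otherwise proceeding to step A135;
A135. initializing the temporary direct-line tree stack to empty;
A136. if the inherited direct-line tree stack is empty, jumping to step A138;
A137. if the inherited direct-line tree stack is not empty, searching it for the elder nodes of the current [id, reversed search term] record;
if elder nodes are found, pushing them onto the temporary direct-line tree stack, the last of these elder nodes being the parent node of the current record, and setting the value at the current cursor to the id of that parent node;
if no elder node is found in the inherited direct-line tree stack, setting the value at the cursor to 0; then assigning the temporary direct-line tree stack to the inherited direct-line tree stack, pushing the current [id, reversed search term] record onto the inherited direct-line tree stack, and jumping to step A139;
A138. pushing the current [id, reversed search term] record onto the inherited direct-line tree stack and setting the value at the current cursor to 0;
A139. advancing the cursor by 1 and jumping to step A134 to read the next [id, reversed search term] record, thereby looping;
A1310. end.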
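8. The retrieval method of claim 4, characterized in that step A11 further comprises a step of generating the original search term data table, which comprises:
deduplicating the information index data of the information index data table to generate the set of original strings, thereby obtaining the original search term data table.
9. The retrieval method of claim 8, characterized by further comprising a step of generating the information index data table, which comprises:
obtaining index terms from the information data table;
building an inverted data table from the index terms or index term sets, and generating the information index data table from the inverted data table.
10. The retrieval method of claim 1, characterized by further comprising the steps of:
querying the alias data table according to the input search term to obtain the alias search terms corresponding to the input search term; and repeating steps A to C with each alias search term as a new input search term.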
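11. An information retrieval processing system, characterized by comprising:
- an information data table module, for storing all the information content;
- an information index data table module, for generating the information index data table from all the information content of the information data table module and storing it;
- an original search term data table module, for deduplicating the information index data table to obtain original strings, the original strings forming original string sets, and all the original string sets being composed into and stored as the original search term data table;
- a search term directory table module, for generating data item sets by matching each original string against the other original strings in the original search term data table according to their kinship, the data item sets being composed into and stored as the search term directory table;
- a search engine module, for querying the search term directory table according to the search term input by the user at the terminal and obtaining the first data item set containing the input search term, wherein the data items of the first data item set are related by kinship; querying the information index data table according to the data items of the first data item set associated with the input search term and obtaining the second data item set; and combining and sending the first data item set, and sending the second data item set, to the terminal.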
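12. The processing system of claim 11, characterized in that the containment relationship is one of no containment, left containment, right containment and middle containment;
the mutual matching in the search term directory table module means matching each original string against the other original strings for containment, and generating the data item from each original string together with the other original strings that are in a left and/or right containment relationship with it.
13. The processing system of claim 11, characterized in that the data item sets of the search term directory table form a sorted multiway search tree.
14. The processing system of claim 11, characterized in that the search engine module is further configured to present the first data item set related to the input search term according to the recursive relationships of the search tree.
15. The processing system of claim 11, characterized in that the second data item set is associated with the first data item set.
16. The processing system of claim 11, characterized by further comprising an alias data table module, for storing alias strings of the original strings, the alias strings being used for alias queries on the original strings;
the second data item set is also associated with the first data item set through the alias data table.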
PCT/CN2010/080578 2010-12-31 2010-12-31 一种检索的方法和系统 WO2012088706A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
US13/977,528 US9870392B2 (en) 2010-12-31 2010-12-31 Retrieval method and system
PCT/CN2010/080578 WO2012088706A1 (zh) 2010-12-31 2010-12-31 一种检索的方法和系统
CN201080071023.1A CN103314371B (zh) 2010-12-31 2010-12-31 一种检索的方法和系统

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2010/080578 WO2012088706A1 (zh) 2010-12-31 2010-12-31 一种检索的方法和系统

Publications (1)

Publication Number Publication Date
WO2012088706A1 true WO2012088706A1 (zh) 2012-07-05

Family

ID=46382223

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2010/080578 WO2012088706A1 (zh) 2010-12-31 2010-12-31 一种检索的方法和系统

Country Status (3)

Country Link
US (1) US9870392B2 (zh)
CN (1) CN103314371B (zh)
WO (1) WO2012088706A1 (zh)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI640878B (zh) * 2013-01-09 2018-11-11 香港商阿里巴巴集團服務有限公司 Query word fusion method, product information publishing method, search method and system
CN111782750A (zh) * 2020-06-28 2020-10-16 北京百度网讯科技有限公司 地图检索信息倾向地域的确定方法、装置、电子设备和存储介质
CN113010537A (zh) * 2019-12-20 2021-06-22 北京宸瑞科技股份有限公司 一种通用结构化数据灵活检索系统及方法

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10325239B2 (en) 2012-10-31 2019-06-18 United Parcel Service Of America, Inc. Systems, methods, and computer program products for a shipping application having an automated trigger term tool
CN104615782B (zh) * 2015-03-02 2017-10-10 武汉工程大学 基于滑动窗口最大匹配算法的地址匹配方法
US10719802B2 (en) * 2015-03-19 2020-07-21 United Parcel Service Of America, Inc. Enforcement of shipping rules
CN107632972B (zh) * 2017-08-31 2021-02-09 北京秒针人工智能科技有限公司 表单处理方法和装置
US11281850B2 (en) * 2017-12-28 2022-03-22 A9.Com, Inc. System and method for self-filing customs entry forms
CN110149804B (zh) 2018-05-28 2022-10-21 北京嘀嘀无限科技发展有限公司 用于确定兴趣点的父-子关系的系统和方法
CN110851459B (zh) * 2018-07-25 2021-08-13 上海柯林布瑞信息技术有限公司 一种搜索方法及装置、存储介质、服务器
CN109241456A (zh) * 2018-09-13 2019-01-18 上海宇佑船舶科技有限公司 地点推荐方法、装置及服务器
CN109542890B (zh) * 2018-10-11 2024-01-26 平安科技(深圳)有限公司 数据修改方法、装置、计算机设备及存储介质
CN109559256A (zh) * 2018-11-15 2019-04-02 苏州征之魂专利技术服务有限公司 一种专利数据挖掘系统及方法
CN110134888B (zh) * 2019-04-03 2022-05-31 广州朗国电子科技股份有限公司 树形结构节点检索方法、装置、存储介质及服务器
CN110502611B (zh) * 2019-08-01 2022-04-12 武汉虹信科技发展有限责任公司 字符串检索方法和装置
KR20210060958A (ko) * 2019-11-19 2021-05-27 엔에이치엔 주식회사 메타 쇼핑몰의 상품 검색결과 제공 방법
CN113886433A (zh) * 2021-10-01 2022-01-04 浙江大学 一种层次结构区域检索方法

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101201841A (zh) * 2007-02-15 2008-06-18 刘二中 电子文本处理与检索的便捷方法和系统
US20080154875A1 (en) * 2006-12-21 2008-06-26 Thomas Morscher Taxonomy-Based Object Classification
US20080177731A1 (en) * 2007-01-23 2008-07-24 Justsystems Corporation Data processing apparatus, data processing method and search apparatus
JP2010146430A (ja) * 2008-12-22 2010-07-01 Nec Corp 情報処理装置

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6772141B1 (en) * 1999-12-14 2004-08-03 Novell, Inc. Method and apparatus for organizing and using indexes utilizing a search decision table
US6876997B1 (en) * 2000-05-22 2005-04-05 Overture Services, Inc. Method and apparatus for indentifying related searches in a database search system
US7099885B2 (en) * 2001-05-25 2006-08-29 Unicorn Solutions Method and system for collaborative ontology modeling
US7765247B2 (en) * 2003-11-24 2010-07-27 Computer Associates Think, Inc. System and method for removing rows from directory tables
AU2004292680B2 (en) * 2003-11-28 2010-04-22 Canon Kabushiki Kaisha Method of constructing preferred views of hierarchical data
US7797312B2 (en) * 2006-04-06 2010-09-14 International Business Machines Corporation Database query processing method and system
CN1983255A (zh) * 2006-05-17 2007-06-20 唐红春 一种互联网搜索方法
US9015301B2 (en) * 2007-01-05 2015-04-21 Digital Doors, Inc. Information infrastructure management tools with extractor, secure storage, content analysis and classification and method therefor
CN101063975A (zh) * 2007-02-15 2007-10-31 刘二中 电子文本处理与检索的方法和系统
JP2009026083A (ja) * 2007-07-19 2009-02-05 Fujifilm Corp コンテンツ検索装置
CN101281530A (zh) * 2008-05-20 2008-10-08 上海大学 基于概念衍生树的关键词层次聚类方法
JP4342596B1 (ja) * 2008-05-20 2009-10-14 株式会社東芝 電子装置およびコンテンツデータ提供方法
US8239395B2 (en) * 2008-12-26 2012-08-07 Sandisk Il Ltd. Storage device presenting to hosts only files compatible with a defined host capability


Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI640878B (zh) * 2013-01-09 2018-11-11 香港商阿里巴巴集團服務有限公司 Query word fusion method, product information publishing method, search method and system
CN113010537A (zh) * 2019-12-20 2021-06-22 北京宸瑞科技股份有限公司 一种通用结构化数据灵活检索系统及方法
CN111782750A (zh) * 2020-06-28 2020-10-16 北京百度网讯科技有限公司 地图检索信息倾向地域的确定方法、装置、电子设备和存储介质
CN111782750B (zh) * 2020-06-28 2024-01-09 北京百度网讯科技有限公司 地图检索信息倾向地域的确定方法、装置、电子设备

Also Published As

Publication number Publication date
US20130275466A1 (en) 2013-10-17
US9870392B2 (en) 2018-01-16
CN103314371A (zh) 2013-09-18
CN103314371B (zh) 2017-12-15

Similar Documents

Publication Publication Date Title
WO2012088706A1 (zh) Retrieval method and system
US10657161B2 (en) Intelligent navigation of a category system
US7702685B2 (en) Querying social networks
US20210303529A1 (en) Hierarchical structured data organization system
US20120221571A1 (en) Efficient presentation of comupter object names based on attribute clustering
CN103294692B (zh) Information recommendation method and system
CN104794242B (zh) Search method
JP5097328B2 (ja) System and method for hierarchical data-driven navigation for information retrieval
CA2780918A1 (en) Systems, methods, and computer program products for generating relevant search results using snomed ct and semantic ontological terminology
US20170068732A1 (en) Multi-system segmented search processing
EP2131293A1 (en) Method for mapping an X500 data model onto a relational database
CN109002432A (zh) Synonym mining method and apparatus, computer-readable medium, and electronic device
Miele et al. A methodology for preference-based personalization of contextual data
CN107291951B (zh) Data processing method, apparatus, storage medium and processor
US20170277687A1 (en) System and methods for searching documents in a relational database using a tree structure stored in a tabular format
JP3367174B2 (ja) Document group analysis apparatus and method
KR20000049333A (ko) Intelligent Internet shopping mall product comparison search engine
US9659059B2 (en) Matching large sets of words
Li et al. Exploring personal corespace for dataspace management
JPH11282882A (ja) Document management method
Proper Interactive query formulation using spider queries
JP2004234582A (ja) Dictionary construction method, system and screen
TWI605351B (zh) Query method, system and device based on vertical search
Chandsarkar et al. Information retrieval system: For skill set improvement in software projects
Adomavicius et al. Rql: A query language for recommender systems

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 10861297; Country of ref document: EP; Kind code of ref document: A1)
WWE Wipo information: entry into national phase (Ref document number: 13977528; Country of ref document: US)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 10861297; Country of ref document: EP; Kind code of ref document: A1)