WO2008154823A1 - Procédé, système et dispositif de recherche - Google Patents
Search method, system and device (Procédé, système et dispositif de recherche)
- Publication number
- WO2008154823A1 (PCT/CN2008/070615, CN2008070615W)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- morpheme
- compound
- document
- index
- search
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/316—Indexing structures
- G06F16/319—Inverted lists
Definitions
- the present invention belongs to the field of search technologies, and in particular relates to a search method, system and device.
- the search expression input by the user is a search string
- the search string may contain one or more keywords, separated by spaces, and spaces indicate that the keywords before and after are logically ANDed.
- the keyword is a string of one or more morphemes, and the word segmentation system can segment the keywords. If two morphemes are segmented, the keyword is said to be a 2-ary compound morpheme. If three morphemes are segmented, the keyword is called a ternary compound morpheme. Morphemes are the smallest language units that can express independent semantics and are inseparable. In Chinese, the morphemes are Chinese words that are segmented by the word segmentation system. In English, morphemes are basic English words or letters.
- the background indexing technology is the core technology, which is directly related to the organization and access mode of the index structure. Index access overhead due to different index structures varies widely.
- in theory, the entire index file can be placed in memory, which gives access 3 to 4 orders of magnitude faster than file index access; in practice, however, the memory capacity of a network server is always insufficient, so file indexing is generally used to let a single network server support a larger number of indexed documents.
- the existing file index is mainly implemented by using the inverted index method.
- an inverted index is a data structure used to speed up retrieval of a particular search string; it may take the form of a disk file or be loaded into memory, and it includes a correspondence table between each search keyword and the documents: $\langle d_1, w_{d_1,t}, loc_1, loc_2, \ldots, loc_{f_{d_1,t}} \rangle \langle d_2, \ldots \rangle \ldots \langle d_{f_t}, \ldots \rangle$
- $t$ represents the keyword
- $d_i$ represents the IDs of the series of documents containing $t$
- $w_{d,t}$ represents the weight of $t$ in document $d$
- $loc$ represents the position offset of each place where $t$ appears in document $d$, usually expressed in two bytes.
- the inverted index file consists of N of the above data items, and the number of N is equal to the sum of all the different keywords obtained by the entire document collection during the analysis.
- when the search string input by the user can be divided into two or more morphemes, obtaining the list of IDs of documents containing all of these morphemes requires fetching the document ID list for each morpheme, ANDing the lists, and then performing position offset matching of the search keywords on the ANDed list; the position offsets of keyword hits in documents can be stored in a separate file, which ensures that the ANDed morphemes are adjacent in the documents where they occur together.
- when the search string can be divided into two or more morphemes, multiple input/output (I/O) accesses to the disk file are therefore needed, resulting in a low search speed for the search string.
- when the search string input by the user is composed of two or more morphemes and includes high-frequency morphemes such as "China", "Net" or "We", the amount of data read for the document ID list and the position offset list is large.
- the index data of high-frequency morphemes usually accounts for a large proportion of the entire inverted index file, so it is difficult to read in a short time; most of the search time is therefore spent on disk file I/O reads, which reduces the search speed of the search string and the overall concurrency of the system, and also increases wear on the network server hardware, making the server prone to failure.
- Embodiments of the present invention provide a search method, system, and apparatus, which can improve the retrieval speed of long search strings.
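- a search method including: dividing the search term to be searched into at least one compound morpheme, where the compound morpheme includes at least two morphemes; searching according to the at least one compound morpheme and a pre-established file index that uses compound morphemes as index words; and determining a document including the search term.
- an indexing method including: obtaining an index word from a document, where the index word is a compound morpheme and the compound morpheme includes at least two morphemes; and establishing an index entry corresponding to the document and the compound morpheme.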
- a search system including:
- An index database configured to store a file index with a compound morpheme as an index word, where the compound morpheme includes at least two morphemes;
- a searcher configured to divide the search term to be searched into at least one compound morpheme, search the index database according to the segmented at least one compound morpheme; and determine a document including the search term.
- a search device comprising:
- a search word segmentation module configured to divide the search term to be searched into at least one compound morpheme;
- an index search module configured to search, according to the at least one compound morpheme, a pre-established file index that uses compound morphemes as index words, and to obtain an index entry corresponding to each compound morpheme, where the index entry includes an identifier of the document;
- a logic processing module configured to perform logical AND operations on the acquired index entries in sequence, and to obtain the identifiers of the documents that include all of the at least one compound morpheme;
- a search result output module configured to determine a document containing the search term.
- a file index is created by using compound morphemes as index words, the search term to be retrieved is divided into at least one compound morpheme, and the search is performed according to those compound morphemes.
- compared with retrieval that uses single morphemes as index words, using compound morphemes as index words reduces the number of index words generated when the search term is segmented, and so reduces the number of document ID list reads; it also reduces the amount of file index data corresponding to each index word, and so reduces the number of I/O accesses per document ID list read.
- the embodiment of the invention therefore effectively improves the search speed of the search string, keeps network search stable under large-scale concurrent access, and saves on the overall hardware cost of the network server.
- FIG. 1 is a flowchart of an implementation of a network search method according to an embodiment of the present invention
- FIG. 2 is a flowchart of an implementation of a network search by using a multivariate compound morpheme according to an embodiment of the present invention
- FIG. 3 is a structural diagram of a network search system to which an embodiment of the present invention is applied;
- FIG. 4 is a structural diagram of a network search engine according to an embodiment of the present invention.
- FIG. 5 is a structural diagram of a searcher according to an embodiment of the present invention.
- in order to make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described herein merely illustrate the invention and are not intended to limit it.
- the file index is established by using the compound morpheme as an index word, and the search term to be searched is divided into at least one compound morpheme, and the search is performed according to the compound morphemes of the segmentation.
- the search term in the embodiment of the present invention includes at least two morphemes, that is, the search term can be divided into at least one compound morpheme.
- FIG. 1 shows a flow of a search method provided by an embodiment of the present invention, and the process includes the following steps.
- step S101 a file index is established by using a compound morpheme as an index word, and the compound morpheme includes at least two morphemes.
- in general, the way morphemes connect to one another (their co-occurrence) is fixed.
- the probability that two morphemes A and B are adjacent to each other is stable and does not jump erratically as the document collection data and document contents change. That is, the number of morphemes that can follow morpheme A is limited and their kinds are fixed, and it will not grow multiplicatively as permutation and combination theory would suggest.
- the compound morpheme includes 2 morphemes or 3 morphemes in the embodiment of the present invention.
- the file index is dynamically created by using the compound morpheme as the index word, which can effectively improve the speed of long string search and long sentence accurate search.
- index words can be obtained through an existing word segmentation system.
- the word segmentation system usually generates single-morpheme segmentation results.
- the segmentation results can be compounded to generate compound morphemes.
- the compound morphemes obtained, such as "Peking University" and "computer", are used as the index words of the document.
- the file index is established according to the obtained index word.
- the file index adopts an inverted index mode.
- the position offset information generally indicates which index word of the document a given index word is. Assume that the index word "Peking University" appears twice in document 1, as its third and seventh index words, and once in document 2, as its second index word; and that the index word "computer" appears once in document 1, as its fourth index word, and once in document 3, as its first index word. The index structure then becomes:

Index word | Document ID [frequency] | Position offsets |
---|---|---|
Peking University | 1[2], 2[1] | 3, 7, 2 |
computer | 1[1], 3[1] | 4, 1 |

- the above three columns are saved as a Dictionary ( Term Dictionary ), a frequency file (frequencies), and a position file (positions).
- the dictionary file not only stores each index word, but also retains a pointer to the frequency file and the location file, and the frequency information and position offset information of the index word can be found by the pointer.
- in addition to the inverted index mode above, the position offset information of a compound morpheme in the document may be stored in a lossy compression mode, in order to reduce the amount of position offset data and improve retrieval speed and efficiency. The details are described below.
- each document is divided into a plurality of sections, wherein the section may be a "sentence” or a "segment” in the document, and the position offset information of the section in which the compound morpheme is located in the document is used as the The positional offset information of the compound morpheme in the document.
- the number of sections is preferably chosen so that it maps conveniently onto the number of bytes used to store the position offset information; the more sections a document is divided into, the higher the accuracy of the retrieval.
- when an inverted index is built on webpages in the network, the webpage content has gone through page analysis and often has no explicit sentence delimiters, so sections may be divided according to the following conditions.
- if the document has explicit sentence delimiters, sections follow the natural sentences; otherwise the number of morphemes contained in a section should fall within a certain range to ensure that the section has a reasonable text length.
- a section may be set to contain more than 6 and at most 32 morphemes, which is equivalent to the text length of an ordinary natural sentence.
- step S102 the search term to be retrieved is divided into at least one compound morpheme for network search.
- FIG. 2 shows a flow of searching according to a multivariate compound morpheme according to an embodiment of the present invention, which is described in detail below.
- step S201 the search term to be retrieved is divided into at least one compound morpheme.
- the number of compound morphemes that are segmented depends on the search keyword.
- to improve retrieval accuracy and efficiency, higher-order compound morphemes, such as 3-ary compound morphemes, are preferentially selected when the search keyword is segmented.
- step S202 an index entry corresponding to each compound morpheme is obtained according to the split compound morpheme, and the index entry is a document ID set containing a composite morpheme.
- step S203 the obtained index table items are sequentially logically ANDed, and a list of document IDs including each compound morpheme is obtained.
- step S204 position offset matching is performed according to the position offset information of each compound morpheme in the document, and the ID list of the document whose position offset matches is obtained.
- step S205 the corresponding search result is output according to the document ID of each document in the position-offset-matched document ID list.
- FIG. 3 shows the structure of a network search system to which the embodiment of the present invention is applied. For the convenience of description, only parts related to the embodiment of the present invention are shown.
- the network client 10 is generally an electronic communication device such as a personal computer (PC), a mobile terminal, or a personal digital assistant (PDA) for the customer, and has a web browser 11 loaded and operated therein.
- the network client 10 is connected to the network 20 by wire or wirelessly, and is then connected to the network server 30.
- the network 20 may be the Internet, the intranet, etc., and the web server 30 is loaded with the search engine 31.
- the search engine 31 is connected to the network 20, and the collector 311 in the search engine 31 automatically accesses the websites on the network 20, follows the hypertext links in them, extracts each file encountered on the websites according to keywords, parses the files and stores them in the original database 312.
- the Collector 311 is generally implemented by software such as "Crawler” or "Spider”.
- the original database 312 is a data set consisting of file records that are recorded and accessible via a web network on the network 20.
- each file record in the dataset typically contains a web address accessible to web browser 11, such as a Uniform Resource Locator (URL); predefined keywords; a short summary, usually a few lines or the first few lines of the file; and possibly a description of the file as provided in its Hypertext Markup Language (HTML) description section.
- to reduce storage, the original database 312 can also store the text content that remains after data such as pictures, audio and video data, and Hypertext Markup Language (HTML) tags are filtered out of the original web page file.
- the indexer 313 extracts the index words of the files stored in the original database 312, and builds the index database 314.
- the compound morpheme is used as the index word of the file at the time of extraction.
- the index database 314 can appear as a disk file or can be loaded into memory.
- the index database 314 adopts an inverted index mode.
- the inverted index is used to speed up the retrieval of a specific search string; its structure stores a correspondence table between each keyword and the documents, that is, the index data items.
- An index data item includes an index word corresponding to the index data item, an identifier of a series of documents including the index word, a frequency of occurrence of the index word in each document, and an occurrence of the index word in each document. Position offset.
- the index database 314 includes N index data items, the number of N being equal to the sum of all the different index words obtained in the entire index database 314 during the analysis.
- the user inputs a search string on the search engine 31 through the web browser 11, and submits a search request to the search engine 31.
- the retriever 315 segments the search string, generates at least one compound morpheme, queries the index database 314 based on the compound morpheme, performs a network search, and returns the search result to the network client 10.
- Fig. 5 shows the structure of a retriever in an embodiment of the present invention.
- the search word segmentation module 3151 divides the search term to be retrieved into at least one compound morpheme, which may include 2 morphemes or 3 morphemes.
- to improve retrieval accuracy and efficiency, the search term can preferentially be divided into compound morphemes containing three morphemes.
- the index lookup module 3152 queries the index database 314 according to the compound morphemes segmented by the search word segmentation module 3151, and obtains an index entry corresponding to each compound morpheme.
- the logic processing module 3153 performs an AND operation on the obtained index table items in turn, and searches for a list of document IDs that include each compound morpheme.
- the position offset matching module 3154 performs position offset matching according to the positional offset information of the compound morphemes in the document, and ensures that the composite morphemes are adjacent to each other before and after, so that the obtained position offset matching document ID list is basically the required search result.
- the location offset information of the compound morpheme in the document is the position offset information of the section where the compound morpheme is located in the document.
- the search result output module 3155 outputs the corresponding search result according to the document ID of each document in the position-offset-matched document ID list, and returns it to the network client 10.
- when the search term to be retrieved includes n morphemes and the compound morpheme includes m morphemes, where n is an integer greater than 1 and m is an integer greater than 1 and less than or equal to n, the n morphemes of the search term may be divided into n-m+1 compound morphemes, each including m adjacent morphemes of the search term; that is, each run of m adjacent morphemes in the search term is taken in turn as one compound morpheme.
- the search string input by the user may include more than one search term.
- for example, the search string may be "Peking University Patent" or "Peking University Invention Patent", that is, alongside "Peking University" it contains other search terms or single-morpheme words.
- in that case "Peking University" can be searched using the embodiment of the present invention, and "patent" or "invention patent" using the existing technology (or "invention patent" likewise using the embodiment); finally, the document identifiers corresponding to "Peking University" are logically ANDed with those corresponding to "patent" or to "invention patent", and the corresponding position offset matching is performed, to obtain the search result of the search string.
- if the search string entered by the user is AB, the index entry of the 2-ary compound morpheme AB is looked up directly; one disk file I/O operation yields the corresponding search result.
- if the search string entered by the user is A BC, i.e. A AND BC, the index entries of morpheme A and the 2-ary compound morpheme BC are looked up separately and a normal AND operation is performed, so that the returned documents contain both morpheme A and compound morpheme BC; position offset matching is then performed according to the section position offset information to ensure that A and BC appear in the same section, so that the resulting document ID list is essentially the required search result.
- if the search keyword entered by the user is AB C, i.e. AB AND C, the handling is similar to the A BC case.
- if the search string entered by the user is ABC, the index entry of the 3-ary compound morpheme ABC is looked up directly; one disk file I/O operation yields the corresponding search result.
- if the search string entered by the user is ABCD, the AND of 3-ary compound morphemes is preferred: the index entries corresponding to the 3-ary compound morphemes ABC and BCD are ANDed, and section position offset matching then yields the corresponding search result.
- if the search string entered by the user is ABCDEF, the number of morphemes n is 6; with 3-ary compound morphemes, that is, with m equal to 3, it is divided into four (n-m+1) 3-ary compound morphemes, and the corresponding index entries are ANDed in turn, for example ABC&BCD&CDE&DEF.
- the index entries corresponding to the multi-ary compound morphemes are ANDed one after another rather than all at once.
- the 3-ary compound morphemes involved in the operation are often low-frequency, that is, few documents contain them, so ANDing just the first few compound morphemes often already leaves a small number of result documents, which are usually the required search results.
- the search speed of the search string can be improved, and the data index management capability of the single network server can be improved, and a single network server can be managed to manage a large number of documents, thereby effectively improving the speed of index access and ensuring The stability of network search under the condition of large concurrent access, and at the same time can save the hardware comprehensive cost of the network server.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
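A search method, system and device
Technical Field
The present invention belongs to the field of search technologies, and in particular relates to a search method, system and device.
Background of the Invention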
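In network search, the search expression entered by the user is a search string. A search string may contain one or more keywords separated by spaces, where a space indicates that the keywords before and after it are combined with a logical AND. A keyword is a string made up of one or more morphemes, and a word segmentation system can segment it. If two morphemes are segmented out, the keyword is called a 2-ary compound morpheme; if three are segmented out, it is called a 3-ary compound morpheme. A morpheme is the smallest language unit that can express independent semantics and cannot be divided further. In Chinese, morphemes are the Chinese words produced by the word segmentation system; in English, morphemes are basic English words or letters.
When a search string is searched for, the set of all documents containing that search string must be found in a short time. This document set is usually represented as a list of document identifiers (IDs). Each retrieved document is uniquely numbered by its document ID, so that each document corresponds to exactly one ID and can be located.
Of the technologies used in a network search engine, back-end indexing is the core one; it directly determines how the index structure is organized and accessed. The index access overhead of different index structures varies widely. In theory, the entire index file could be placed in memory, giving access 3 to 4 orders of magnitude faster than file index access. In practice, however, the memory capacity of a network server is always insufficient, so file indexing is generally used to let a single network server support a larger number of indexed documents.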
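Existing file indexes are mainly implemented as inverted indexes. An inverted index is a data structure used to speed up retrieval of a particular search string. It may take the form of a disk file or be loaded into memory, and it contains a correspondence table between each search keyword and the documents:
$\langle d_1, w_{d_1,t}, loc_1, loc_2, \ldots, loc_{f_{d_1,t}} \rangle \langle d_2, \ldots \rangle \ldots \langle d_{f_t}, \ldots \rangle$
where $t$ denotes the keyword, $d_i$ denotes the IDs of the series of documents containing $t$, $w_{d,t}$ denotes the weight of $t$ in document $d$, and $loc$ denotes the position offset of each place where $t$ occurs in document $d$, usually expressed in two bytes.
The inverted index file consists of N such data items, where N equals the total number of distinct keywords obtained when the entire document collection is analyzed. With this data structure, a particular search keyword can be looked up quickly.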
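When the search string entered by the user can be segmented into two or more morphemes, obtaining the list of IDs of documents that contain all of these morphemes requires first fetching the document ID list for each morpheme and then ANDing the lists. A position offset matching operation on the search keywords must then be performed on the ANDed document ID list, to ensure that the ANDed morphemes are adjacent, in order, in the documents where they occur together; the position offsets of keyword hits in documents may be stored in a separate file. Thus, when a search string can be segmented into two or more morphemes, multiple disk file input/output (I/O) accesses are needed, which lowers the search speed for the search string.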
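The single-morpheme procedure described in this paragraph can be sketched as follows. This is a minimal illustration, not the implementation of any particular engine: the in-memory dictionary `index`, the function name `phrase_search` and the sample postings are all assumptions introduced here. It shows why each morpheme costs one posting-list read and why an adjacency check is still needed after the AND.

```python
from typing import Dict, List

# Hypothetical in-memory stand-in for a single-morpheme inverted index:
# morpheme -> {doc_id: positions of that morpheme in the document}
Postings = Dict[str, Dict[int, List[int]]]

def phrase_search(index: Postings, morphemes: List[str]) -> List[int]:
    """Return IDs of documents in which the morphemes occur consecutively."""
    if not morphemes:
        return []
    # One posting-list read per morpheme (one disk access each in a file index).
    lists = [index.get(m, {}) for m in morphemes]
    # AND step: keep only documents that contain every morpheme.
    candidates = set(lists[0])
    for postings in lists[1:]:
        candidates &= set(postings)
    hits = []
    for doc in sorted(candidates):
        # Position matching: morpheme k must sit k slots after the first one.
        if any(all(start + k in lists[k][doc] for k in range(1, len(lists)))
               for start in lists[0][doc]):
            hits.append(doc)
    return hits

index = {
    "Peking": {1: [3], 2: [5]},
    "University": {1: [4], 2: [9]},
}
print(phrase_search(index, ["Peking", "University"]))  # [1]
```

Document 2 contains both morphemes but not next to each other, so the AND alone would wrongly keep it; only the position check removes it.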
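Furthermore, when the search string entered by the user consists of two or more morphemes and includes high-frequency morphemes such as "中国" (China), "网" (net) or "我们" (we), the amount of data read for the document ID list and the position offset list is large. The index data of high-frequency morphemes usually makes up a large proportion of the whole inverted index file, so it is hard to finish reading it in a short time. Most of the search time is therefore spent on disk file I/O reads, which lowers the search speed of the search string and the overall concurrency of the system, and also increases wear on the network server hardware, making the server prone to failure.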
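Summary of the Invention
Embodiments of the present invention provide a search method, system and device that can increase the retrieval speed for long search strings.
A search method, comprising:
dividing a search term to be retrieved into at least one compound morpheme, where the compound morpheme contains at least two morphemes;
searching according to the at least one compound morpheme and a pre-established file index that uses compound morphemes as index words; and determining the documents containing the search term.
An indexing method, comprising:
obtaining index words from a document, where each index word is a compound morpheme and the compound morpheme contains at least two morphemes;
establishing an index entry corresponding to the document and the compound morpheme.
A search system, comprising:
an index database, configured to store a file index that uses compound morphemes as index words, where the compound morpheme contains at least two morphemes;
a retriever, configured to divide a search term to be retrieved into at least one compound morpheme, search the index database according to the at least one compound morpheme obtained by the division, and determine the documents containing the search term.
A search device, comprising:
a search word segmentation module, configured to divide a search term to be retrieved into at least one compound morpheme;
an index lookup module, configured to look up, according to the at least one compound morpheme, a pre-established file index that uses compound morphemes as index words, and obtain the index entry corresponding to each compound morpheme, where the index entry includes document identifiers;
a logic processing module, configured to perform logical AND operations on the obtained index entries in sequence, and obtain the identifiers of the documents that contain all of the at least one compound morpheme;
a search result output module, configured to determine the documents containing the search term.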
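In the embodiments of the present invention, a file index is built with compound morphemes as index words, the search term to be retrieved is divided into at least one compound morpheme, and the search is performed according to those compound morphemes. Compared with retrieval that uses single morphemes as index words, using compound morphemes as index words reduces the number of index words generated when the search term is segmented, and so reduces the number of document ID list reads. Moreover, it reduces the amount of file index data corresponding to each index word, and so reduces the number of I/O accesses per document ID list read. The embodiments therefore effectively increase the search speed for search strings, keep network search stable under large-scale concurrent access, and save on the overall hardware cost of network servers.
Brief Description of the Drawings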
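FIG. 1 is a flowchart of the implementation of the network search method provided by an embodiment of the present invention;
FIG. 2 is a flowchart of the implementation of network search using multi-ary compound morphemes provided by an embodiment of the present invention;
FIG. 3 is an architecture diagram of a network search system to which an embodiment of the present invention applies;
FIG. 4 is a structural diagram of the network search engine provided by an embodiment of the present invention;
FIG. 5 is a structural diagram of the retriever provided by an embodiment of the present invention.
Mode for Carrying Out the Invention
To make the objects, technical solutions and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. The specific embodiments described here serve only to explain the present invention and are not intended to limit it.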
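In the embodiments of the present invention, a file index is built with compound morphemes as index words, the search term to be retrieved is divided into at least one compound morpheme, and the search is performed according to those compound morphemes. A search term in the embodiments contains at least 2 morphemes, that is, the search term can be divided into at least one compound morpheme.
FIG. 1 shows the flow of the search method provided by an embodiment of the present invention. The flow includes the following steps.
In step S101, a file index is built with compound morphemes as index words, where the compound morpheme contains at least two morphemes.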
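In general, the way morphemes connect to one another (their co-occurrence) is fixed. For example, the probability that two morphemes A and B occur adjacent to each other is stable and does not jump erratically as the document collection data or the document contents change. In other words, the number of morphemes that can follow morpheme A is limited and their kinds are fixed, and it does not grow multiplicatively as permutation and combination theory would suggest. Therefore, if every two consecutive basic morphemes $A_i$ and $A_{i+1}$ segmented from each document, where $i$ is an integer greater than 0, are paired into the 2-ary compound morpheme $A_iA_{i+1}$, the index data added for the 2-ary compound morphemes $A_iA_{i+1}$ does not expand without limit. Likewise, three sequentially adjacent basic morphemes $A_i$, $A_{i+1}$ and $A_{i+2}$ in a document can be combined into the 3-ary compound morpheme $A_iA_{i+1}A_{i+2}$, and the index data corresponding to $A_iA_{i+1}A_{i+2}$ also does not expand without limit. However, the more morphemes a multi-ary compound morpheme contains, the more the index data expands. To avoid excessive expansion of the index data, compound morphemes in the embodiments of the present invention include 2 morphemes or 3 morphemes.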
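The pairing of adjacent morphemes described above can be illustrated with a short sketch. The function name `index_words` and the 0-based position convention are assumptions made for this example; the segmentation into basic morphemes is taken as given.

```python
from typing import Iterator, List, Sequence, Tuple

def index_words(morphemes: List[str],
                orders: Sequence[int] = (2, 3)) -> Iterator[Tuple[str, int]]:
    """Yield (compound morpheme, start position) for every run of adjacent morphemes.

    `orders` lists the compound sizes to generate; (2, 3) mirrors the 2-ary
    and 3-ary compound morphemes used in the embodiments.
    """
    for m in orders:
        for i in range(len(morphemes) - m + 1):
            yield "".join(morphemes[i:i + m]), i

print(list(index_words(["A", "B", "C", "D"])))
# [('AB', 0), ('BC', 1), ('CD', 2), ('ABC', 0), ('BCD', 1)]
```

A document of n morphemes contributes at most (n-1) + (n-2) compound occurrences, so the per-document cost grows linearly with its length; how many distinct compounds appear across a whole collection depends on the co-occurrence behaviour the paragraph describes.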
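When building the file index, the embodiments of the present invention dynamically create it with compound morphemes as index words, which can effectively speed up long-string retrieval and exact retrieval of long sentences.
The process of building the file index in the embodiments is illustrated below by example. Suppose there are several documents with the following contents:

Document | Content |
---|---|
Doc1 | …北京大学计算机系… (…Department of Computer Science, Peking University…) |
Doc2 | …北京大学主页… (…Peking University home page…) |
Doc3 | …计算机的发展… (…the development of computers…) |
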
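First, the index words of each document are obtained. This can be done with an existing word segmentation system. A word segmentation system usually produces single-morpheme segmentation results; in the embodiments of the present invention, those results can be compounded to generate compound morphemes. The compound morphemes obtained, such as "北京大学" (Peking University) and "计算机" (computer), are used as the index words of the document. A file index is then built from the index words obtained; in the embodiments, the file index is an inverted index.
In an inverted index, each "index word" maps to "the identifiers of all documents that contain the index word", as shown in the following table:

Index word | Document identifiers |
---|---|
北京大学 (Peking University) | 1, 2 |
计算机 (computer) | 1, 3 |
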
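Besides the identifiers of the documents in which an index word occurs, an inverted index also needs to record how often the index word occurs in each document and its position offset information. The position offset information generally indicates which index word of the document a given index word is. Suppose the index word "北京大学" (Peking University) occurs twice in document 1, as its 3rd and 7th index words, and once in document 2, as its 2nd index word; and the index word "计算机" (computer) occurs once in document 1, as its 4th index word, and once in document 3, as its 1st index word. The index structure then becomes:

Index word | Document identifier [frequency] | Position offsets |
---|---|---|
北京大学 (Peking University) | 1[2], 2[1] | 3, 7, 2 |
计算机 (computer) | 1[1], 3[1] | 4, 1 |
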
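When the inverted index is implemented, the three columns above are stored as a term dictionary file (Term Dictionary), a frequency file (frequencies) and a position file (positions), respectively. The dictionary file stores each index word and also keeps pointers into the frequency file and the position file, through which the frequency information and position offset information of the index word can be found.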
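As a rough illustration of how the three columns relate, the sketch below builds an in-memory inverted index and prints the dictionary, frequency and position columns for one index word. The names `build_index` and `three_columns` are hypothetical, the documents are reduced to lists of index words with `None` standing in for words that are not of interest, and the positions follow the worked example in the text (1-based index-word positions). A real file index would store the columns in separate files linked by pointers rather than in one dictionary.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

def build_index(docs: Dict[int, List[Optional[str]]]) -> Dict[str, Dict[int, List[int]]]:
    """Map each index word to {doc_id: [1-based positions]}; None entries are skipped."""
    index: Dict[str, Dict[int, List[int]]] = defaultdict(dict)
    for doc_id, words in sorted(docs.items()):
        for pos, word in enumerate(words, start=1):
            if word is not None:
                index[word].setdefault(doc_id, []).append(pos)
    return index

def three_columns(index: Dict[str, Dict[int, List[int]]], word: str) -> Tuple[str, str, str]:
    """Render the dictionary, frequency and position columns for one index word."""
    postings = index[word]
    freq = ",".join(f"{doc}[{len(pos)}]" for doc, pos in postings.items())
    positions = ",".join(str(p) for pos in postings.values() for p in pos)
    return word, freq, positions

docs = {
    1: [None, None, "Peking University", "computer", None, None, "Peking University"],
    2: [None, "Peking University"],
    3: ["computer"],
}
idx = build_index(docs)
print(three_columns(idx, "Peking University"))  # ('Peking University', '1[2],2[1]', '3,7,2')
print(three_columns(idx, "computer"))           # ('computer', '1[1],3[1]', '4,1')
```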
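In the embodiments of the present invention, in addition to the inverted index mode above, the position offset information of a compound morpheme in a document may also be stored in a lossy compression mode, in order to reduce the amount of position offset data and improve retrieval speed and efficiency. This is described in detail below.
In the embodiments, each document is divided into several sections, where a section may be a "sentence" or a "paragraph" of the document, and the position offset information of the section containing a compound morpheme is used as the position offset information of that compound morpheme in the document. The number of sections is preferably chosen so that it maps conveniently onto the number of bytes used to store the position offset information; the more sections a document is divided into, the higher the retrieval accuracy.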
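Take position offset information stored in 8 bytes (64 bits) as an example. The number of sections in a document can then be 64: the whole document is divided in order into 64 sections, and if the document is very long, section 63 holds all the remaining text. Section position offsets start at 0 and go up to 63, and the position offset information of each section is stored in 1 bit. If a compound morpheme occurs in the M-th section, the bit corresponding to the M-th section among the 64 bits is set to 1; otherwise it is 0. When position offset matching is performed for several compound morphemes, all of their section offset information is ANDed bit by bit; a resulting bit of 1 indicates that the multi-ary compound morphemes all occur in the corresponding section of a document.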
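The 64-bit section bitmap lends itself to a compact sketch. The helper names `section_mask` and `shared_sections` are assumptions for this example; the clamping of overflow sections to bit 63 follows the rule that the last section absorbs all remaining text.

```python
from typing import Iterable, List

SECTIONS = 64  # one bit per section, as in the 8-byte example

def section_mask(section_numbers: Iterable[int]) -> int:
    """Set one bit for each section (0-based) in which a compound morpheme occurs."""
    mask = 0
    for s in section_numbers:
        # Text beyond the 64th section is folded into section 63.
        mask |= 1 << min(s, SECTIONS - 1)
    return mask

def shared_sections(masks: Iterable[int]) -> List[int]:
    """Bitwise-AND the masks and list the sections where every compound occurs."""
    combined = (1 << SECTIONS) - 1
    for mask in masks:
        combined &= mask
    return [s for s in range(SECTIONS) if (combined >> s) & 1]

abc = section_mask([0, 5])    # compound ABC occurs in sections 0 and 5
bcd = section_mask([5, 70])   # compound BCD occurs in section 5 and past the end
print(shared_sections([abc, bcd]))  # [5]
```

The scheme is lossy in the sense the text describes: a set bit only says that the compounds share a section, not that they are adjacent inside it, which is the price paid for a fixed 8 bytes per posting.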
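When an inverted index is built for web pages on the network, the page content has gone through page analysis and often has no explicit sentence delimiters. In the embodiments of the present invention, sections can be divided according to the following conditions.
1. If the document has explicit sentence delimiters, such as "。", "!" or "?", sections are divided along the natural sentences of the document.
2. If the document has no explicit sentence delimiters, the number of morphemes in a section should fall within a certain range to ensure that the section has a reasonable text length. In the embodiments, a section can be set to contain more than 6 and at most 32 morphemes, which is equivalent to the text length of an ordinary natural sentence.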
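The two conditions above can be sketched as a small sectioning routine. Everything beyond the two rules is an assumption of this example: the delimiter set, the function name `split_sections`, the choice of fixed 32-morpheme chunks when no delimiter exists, and the rebalancing of a short final chunk so that it does not fall to 6 morphemes or fewer.

```python
from typing import List

DELIMITERS = {"。", "！", "？", ".", "!", "?"}  # assumed delimiter set

def split_sections(morphemes: List[str], max_len: int = 32, min_len: int = 7) -> List[List[str]]:
    """Split a segmented document into sections following the two rules above."""
    # Rule 1: explicit sentence delimiters present, so cut after each one.
    if any(m in DELIMITERS for m in morphemes):
        sections, current = [], []
        for m in morphemes:
            current.append(m)
            if m in DELIMITERS:
                sections.append(current)
                current = []
        if current:
            sections.append(current)
        return sections
    # Rule 2: no delimiters, so cut into chunks of at most max_len morphemes.
    sections = [morphemes[i:i + max_len] for i in range(0, len(morphemes), max_len)]
    # If the tail is too short, share the last two chunks evenly between them.
    if len(sections) > 1 and len(sections[-1]) < min_len:
        tail = sections[-2] + sections[-1]
        half = len(tail) // 2
        sections[-2:] = [tail[:half], tail[half:]]
    return sections

print(split_sections(["a", "b", "。", "c", "d"]))                          # [['a', 'b', '。'], ['c', 'd']]
print([len(s) for s in split_sections([f"m{i}" for i in range(35)])])      # [17, 18]
```

A document with 6 morphemes or fewer and no delimiter still ends up as a single short section, since there is nothing to rebalance it with.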
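In step S102, the search term to be retrieved is divided into at least one compound morpheme and a network search is performed.
FIG. 2 shows the flow of searching according to multi-ary compound morphemes provided by an embodiment of the present invention, described in detail below.
In step S201, the search term to be retrieved is divided into at least one compound morpheme.
The number of compound morphemes produced depends on the search keyword. To improve retrieval accuracy and efficiency, higher-order compound morphemes, such as 3-ary compound morphemes, are preferred when the search keyword is segmented.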
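In step S202, the file index is queried according to the compound morphemes obtained, and the index entry corresponding to each compound morpheme is fetched; the index entry is the set of IDs of documents containing the compound morpheme.
In step S203, the fetched index entries are logically ANDed one after another to obtain the list of IDs of documents that contain every compound morpheme.
In step S204, position offset matching is performed according to the position offset information of each compound morpheme in the documents, and the ID list of the documents whose position offsets match is obtained.
In step S205, the corresponding search results are output according to the document ID of each document in the position-offset-matched document ID list.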
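Steps S201 to S205 can be strung together in a short sketch. The index layout (compound morpheme mapped to a dictionary of document ID and section bitmask), the function name `search` and the sample data are assumptions made for illustration; the bitmask stands for the lossy section offsets described earlier.

```python
from typing import Dict, List

# Hypothetical index layout: compound morpheme -> {doc_id: section bitmask}
CompoundIndex = Dict[str, Dict[int, int]]

def search(index: CompoundIndex, morphemes: List[str], m: int = 3) -> List[int]:
    """Return the IDs of documents matching the search term, following S201-S205."""
    if len(morphemes) < 2:
        raise ValueError("the search term must contain at least two morphemes")
    # S201: split into compound morphemes, preferring the higher order m.
    m = min(m, len(morphemes))
    compounds = ["".join(morphemes[i:i + m]) for i in range(len(morphemes) - m + 1)]
    # S202: fetch the index entry of every compound morpheme.
    entries = [index.get(c, {}) for c in compounds]
    # S203: AND the document ID sets one after another.
    docs = set(entries[0])
    for entry in entries[1:]:
        docs &= set(entry)
    # S204: position offset matching on the section bitmasks.
    hits = []
    for doc in sorted(docs):
        mask = entries[0][doc]
        for entry in entries[1:]:
            mask &= entry[doc]
        if mask:
            hits.append(doc)
    # S205: the matching document IDs form the search result.
    return hits

index = {
    "ABC": {1: 0b0100, 2: 0b0001},
    "BCD": {1: 0b0100, 2: 0b0010, 3: 0b0001},
}
print(search(index, list("ABCD")))  # [1]
```

Document 2 survives the AND in S203 but is dropped in S204 because ABC and BCD fall into different sections.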
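FIG. 3 shows the structure of a network search system to which an embodiment of the present invention applies. For ease of description, only the parts related to the embodiment are shown.
The network client 10 is generally an electronic communication device used by a customer, such as a personal computer (PC), a mobile terminal or a personal digital assistant (PDA), on which a web browser 11 is installed and running. The network client 10 connects to the network 20 by wire or wirelessly, and through it to the network server 30. The network 20 may be the Internet, an intranet or the like, and the search engine 31 is installed on the network server 30.
As shown in FIG. 4, the search engine 31 is connected to the network 20. The collector 311 in the search engine 31 automatically visits the websites on the network 20, follows the hypertext links in them, extracts each file encountered on the websites according to keywords, parses it accordingly and stores it in the original database 312. The collector 311 is generally implemented in software, such as a "web crawler (Crawler)" or "web spider (Spider)".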
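In this embodiment, the original database 312 is a dataset made up of file records, which are recorded and can be accessed through the Web on the network 20. Each file record in the dataset usually contains a web address that the web browser 11 can visit, such as a Uniform Resource Locator (URL); predefined keywords; a short summary, usually a few lines or the first few lines of the file; and possibly a description of the file as provided in its Hypertext Markup Language (HTML) description section. To reduce the amount of file storage, the original database 312 may also store the text content that remains after data such as pictures, audio and video data, and Hypertext Markup Language (HTML) tags are filtered out of the original web page file.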
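The indexer 313 extracts the index words of the files stored in the original database 312 and builds the index database 314. In the embodiments of the present invention, compound morphemes are used as the index words of the files during extraction.
The index database 314 may take the form of a disk file or be loaded into memory. In the embodiments, the index database 314 uses the inverted index mode. An inverted index is used to speed up retrieval of a particular search string; its structure stores the correspondence table between each keyword and the documents, that is, the index data items. An index data item contains the index word it corresponds to, the identifiers of the series of documents containing that index word, the frequency of the index word in each document, and the position offsets at which the index word occurs in each document. The index database 314 includes N index data items, where N equals the total number of distinct index words obtained during analysis for the entire index database 314.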
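During a network search, the user enters a search string into the search engine 31 through the web browser 11, submitting a search request to the search engine 31. The retriever 315 segments the search string to generate at least one compound morpheme, queries the index database 314 according to the compound morphemes, performs the network search, and returns the search results to the network client 10.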
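FIG. 5 shows the structure of the retriever in an embodiment of the present invention. Referring to FIG. 5, the search word segmentation module 3151 divides the search term to be retrieved into at least one compound morpheme, which may include 2 morphemes or 3 morphemes. To improve retrieval accuracy and efficiency, the search term may preferentially be divided into compound morphemes containing 3 morphemes.
The index lookup module 3152 queries the index database 314 according to the compound morphemes produced by the search word segmentation module 3151 and obtains the index entry corresponding to each compound morpheme. The logic processing module 3153 ANDs the obtained index entries one after another to find the list of IDs of documents that contain every compound morpheme. The position offset matching module 3154 performs position offset matching according to the position offset information of the compound morphemes in the documents, ensuring that the compound morphemes are adjacent to one another, so that the resulting position-offset-matched document ID list consists essentially of the required search results. In the embodiments, the position offset information of a compound morpheme in a document is the position offset information, within the document, of the section containing the compound morpheme. The search result output module 3155 outputs the corresponding search results according to the document ID of each document in the position-offset-matched document ID list and returns them to the network client 10.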
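When the embodiments of the present invention are applied and the search term to be retrieved includes n morphemes while the compound morpheme includes m morphemes, where n is an integer greater than 1 and m is an integer greater than 1 and less than or equal to n, the n morphemes of the search term can be divided into n-m+1 compound morphemes. Each compound morpheme includes m adjacent morphemes of the search term; that is, each run of m adjacent morphemes in the search term is taken in turn as one compound morpheme.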
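The n-m+1 rule can be checked with a few lines of code. The function name `split_query` and the fallback to a smaller m for short search terms are assumptions of this sketch; the preference for m = 3 follows the text.

```python
from typing import List

def split_query(morphemes: List[str], max_order: int = 3) -> List[str]:
    """Divide a search term of n morphemes into n-m+1 compound morphemes of m morphemes each."""
    n = len(morphemes)
    if n < 2:
        raise ValueError("a search term needs at least two morphemes")
    m = min(max_order, n)  # fall back to a 2-ary compound for two-morpheme terms
    return ["".join(morphemes[i:i + m]) for i in range(n - m + 1)]

print(split_query(list("ABCDEF")))  # ['ABC', 'BCD', 'CDE', 'DEF']  (6 - 3 + 1 = 4)
print(split_query(list("AB")))      # ['AB']
```

Consecutive compounds overlap in m-1 morphemes, which is what lets the later AND and offset matching reassemble the original order.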
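In addition, when the embodiments of the present invention are applied, the search string entered by the user may include more than one search term. For example, the search string may be "北京大学 专利" (Peking University, patent) or "北京大学 发明专利" (Peking University, invention patent); that is, alongside "北京大学" the search string contains other search terms or single-morpheme words. In this case, "北京大学" can be retrieved using the embodiments of the present invention and "专利" or "发明专利" using existing techniques, or "发明专利" can likewise be retrieved using the embodiments. Finally, the document identifiers corresponding to "北京大学" are logically ANDed with the document identifiers corresponding to "专利", or with those corresponding to "发明专利", and the corresponding position offset matching is performed, to obtain the retrieval result for the search string.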
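A search string with several terms can be handled by routing each term to the appropriate lookup and ANDing the resulting document ID sets. In the sketch below, `search_string`, the three callables and the toy lookup tables are all hypothetical stand-ins: `segment` represents the word segmentation system, `compound_search` the compound-morpheme search, and `single_search` an existing single-morpheme lookup. Position offset matching between terms is left out for brevity.

```python
from typing import Callable, List, Optional, Set

def search_string(query: str,
                  segment: Callable[[str], List[str]],
                  compound_search: Callable[[List[str]], Set[int]],
                  single_search: Callable[[str], Set[int]]) -> Set[int]:
    """AND together the document ID sets of the space-separated search terms."""
    result: Optional[Set[int]] = None
    for term in query.split():
        morphemes = segment(term)
        if len(morphemes) >= 2:
            docs = compound_search(morphemes)    # term forms a compound morpheme
        else:
            docs = single_search(morphemes[0])   # single-morpheme word
        result = docs if result is None else result & docs
    return result if result is not None else set()

# Toy data standing in for the segmenter and the two indexes.
SEGMENTS = {"北京大学": ["北京", "大学"], "专利": ["专利"]}
COMPOUND_DOCS = {"北京大学": {1, 2, 5}}
SINGLE_DOCS = {"专利": {2, 5, 9}}

hits = search_string(
    "北京大学 专利",
    segment=lambda term: SEGMENTS[term],
    compound_search=lambda ms: COMPOUND_DOCS.get("".join(ms), set()),
    single_search=lambda m: SINGLE_DOCS.get(m, set()),
)
print(sorted(hits))  # [2, 5]
```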
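The network search process of the embodiments of the present invention is described below, taking different types of search strings as examples.
1. If the search string entered by the user is AB, the index entry of the 2-ary compound morpheme AB is looked up directly; one disk file I/O operation yields the corresponding search results.
2. If the search string entered by the user is A BC, i.e. A AND BC, the index entries of morpheme A and of the 2-ary compound morpheme BC are looked up separately and a normal AND operation is performed, ensuring that the returned documents contain both morpheme A and compound morpheme BC. Position offset matching is then performed according to the section position offset information, ensuring that A and BC occur in the same section, so that the resulting document ID list consists essentially of the required search results. If the search keyword entered by the user is AB C, i.e. AB AND C, the handling is similar to the A BC case.
3. If the search string entered by the user is ABC, the index entry of the 3-ary compound morpheme ABC is looked up directly; one disk file I/O operation yields the corresponding search results.
4. If the search string entered by the user is ABCD, the AND of 3-ary compound morphemes is preferred: the index entries corresponding to the 3-ary compound morphemes ABC and BCD are ANDed, and section position offset matching then yields the corresponding search results.
5. If the search string entered by the user is ABCDEF, the number of morphemes n it includes is 6. With 3-ary compound morphemes, that is, when the number of morphemes m in a compound morpheme is 3, the string is divided into 4 (that is, n-m+1) 3-ary compound morphemes, and the index entries corresponding to these 3-ary compound morphemes are ANDed in turn, for example ABC&BCD&CDE&DEF. In the embodiments of the present invention, the index entries corresponding to multi-ary compound morphemes are ANDed one after another rather than all at once. Because the 3-ary compound morphemes involved are often low-frequency, that is, few documents contain them, ANDing just the first few compound morphemes often already leaves very few result documents, and these are usually the required search results.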
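The benefit of ANDing the entries one after another, as in case 5, is that later index entries need not be read once the running result has shrunk far enough. The sketch below models each disk read as a zero-argument callable; `sequential_and`, the `stop_at` threshold and the sample data are assumptions for illustration. With the default `stop_at=0` the loop stops only when the result is empty, which never changes the final answer; a larger threshold trades exactness for fewer reads.

```python
from typing import Callable, Iterable, Optional, Set, Tuple

def sequential_and(loaders: Iterable[Callable[[], Set[int]]],
                   stop_at: int = 0) -> Tuple[Set[int], int]:
    """AND posting sets one at a time; return (result, number of entries read)."""
    result: Optional[Set[int]] = None
    reads = 0
    for load in loaders:
        postings = load()          # stands in for one disk read of an index entry
        reads += 1
        result = postings if result is None else result & postings
        if len(result) <= stop_at:
            break                  # remaining entries are never read
    return (result if result is not None else set()), reads

disk = {"ABC": {1, 2}, "BCD": {7}, "CDE": {1, 7}, "DEF": {1}}
loaders = [lambda key=key: disk[key] for key in ("ABC", "BCD", "CDE", "DEF")]
print(sequential_and(loaders))  # (set(), 2)
```

Here the result is already empty after ABC & BCD, so the entries for CDE and DEF are skipped.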
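It can be seen that applying the embodiments of the present invention increases the search speed for search strings and improves the data index management capability of a single network server, so that a single network server can manage a large number of documents. This effectively speeds up index access, keeps network search stable under heavy concurrent access, and saves on the overall hardware cost of network servers.
The above are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.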
Claims
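1. A search method, characterized by comprising:
dividing a search term to be retrieved into at least one compound morpheme, the compound morpheme containing at least two morphemes;
searching according to the at least one compound morpheme and a pre-established file index that uses compound morphemes as index words; and determining the documents containing the search term.
2. The search method according to claim 1, characterized in that, when the search term includes n morphemes and the compound morpheme includes m morphemes, dividing the search term into at least one compound morpheme comprises:
dividing the n morphemes of the search term into n-m+1 compound morphemes, each compound morpheme including m adjacent morphemes of the search term, where n is an integer greater than 1 and m is an integer greater than 1 and less than or equal to n.
3. The search method according to claim 1, characterized in that searching according to the at least one compound morpheme and the pre-established file index that uses compound morphemes as index words comprises:
querying the file index according to each of the at least one compound morpheme, and obtaining the index entry corresponding to each compound morpheme, the index entry including document identifiers;
performing logical AND operations on the obtained index entries in sequence, and obtaining the identifiers of the documents that contain all of the at least one compound morpheme.
4. The search method according to claim 3, characterized by further comprising:
obtaining the identifiers of the documents whose position offsets match, according to preset position offset information of the compound morphemes in the documents.
5. The search method according to claim 3, characterized in that the position offset information of a compound morpheme in a document is position offset information of the compound morpheme in a section of the document;
obtaining the identifiers of the documents whose position offsets match according to the preset position offset information of the compound morphemes in the documents comprises:
performing a logical AND operation on the position offset information of the at least one compound morpheme, and, when the result of the logical AND operation indicates that the at least one compound morpheme is contained in one and the same section, obtaining the identifier of the document corresponding to the section.
6. The search method according to claim 1, characterized by further comprising:
retrieving a character string that has a logical relationship with the search term, and determining the documents containing the character string;
performing, according to the logical relationship, a logical operation on the identifiers of the documents containing the search term and the identifiers of the documents containing the character string;
determining the documents that satisfy the search condition, in which the logical relationship is satisfied between the search term and the first search term.
7. The search method according to any one of claims 1 to 6, characterized in that the file index is an inverted index and the index entry is an inverted index entry.
8. The search method according to any one of claims 1 to 6, characterized in that the compound morpheme includes 2 morphemes or 3 morphemes.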
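9. An indexing method, characterized by comprising:
obtaining index terms from a document, the index terms being compound morphemes, each compound morpheme containing at least two morphemes;
establishing index entries corresponding to the document and the compound morphemes.
10. The indexing method according to claim 9, characterized in that
the index entry contains the identifier of the document and position offset information of the compound morpheme in the document.
11. The indexing method according to claim 10, characterized in that the position offset information of the compound morpheme in the document is:
position offset information, in the document, of the section in which the compound morpheme is located.
12. The indexing method according to claim 10, characterized in that the section is a natural sentence in the document, or text of a length equivalent to one natural sentence.
13. The indexing method according to claim 10, characterized in that the position offset information of the section in the document is represented by 1 bit: when the section contains the compound morpheme, the bit is set to 1, and when the section does not contain the compound morpheme, the bit is set to 0.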
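14. A search system, characterized by comprising:
an index database, configured to store a file index that uses compound morphemes as index terms, each compound morpheme containing at least two morphemes;
a retriever, configured to segment a search term to be searched into at least one compound morpheme, search the index database according to the at least one compound morpheme obtained by segmentation, and determine the documents that contain the search term.
15. A search device, characterized by comprising:
a query segmentation module, configured to segment a search term to be searched into at least one compound morpheme;
an index lookup module, configured to look up, according to the at least one compound morpheme, a pre-established file index that uses compound morphemes as index terms, and obtain the index entry corresponding to each compound morpheme, the index entry including document identifiers;
a logic processing module, configured to perform logical AND operations in sequence on the obtained index entries to obtain the identifiers of documents that contain all of the at least one compound morpheme;
a search result output module, configured to determine the documents that contain the search term.
16. The search device according to claim 15, characterized by further comprising:
a position offset matching module, configured to obtain, according to preset position offset information of the compound morphemes in documents, the identifiers of documents whose position offsets match.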
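As a rough illustration of the index entries and section matching recited in the claims (a document identifier plus one bit per section, combined by logical AND), the following sketch may help. It is not the claimed implementation: the function names, the in-memory dictionary layout, the use of a Python integer as the per-document bitmap, and the assumption that each section arrives already segmented into a list of morphemes are all choices made here for illustration.

```python
from collections import defaultdict


def build_index(docs, m=2):
    """Build {compound morpheme: {doc_id: section bitmap}}.

    `docs` maps a document ID to its sections; each section is a list of
    morphemes. Bit k of a bitmap is 1 when section k contains the compound.
    """
    index = defaultdict(dict)
    for doc_id, sections in docs.items():
        for k, section in enumerate(sections):
            for i in range(len(section) - m + 1):
                gram = "".join(section[i:i + m])
                index[gram][doc_id] = index[gram].get(doc_id, 0) | (1 << k)
    return index


def match_same_section(index, grams):
    """Return IDs of documents in which one section contains every compound."""
    if not grams:
        return set()
    entries = [index.get(g, {}) for g in grams]
    # Documents that contain all compound morphemes somewhere.
    common = set(entries[0])
    for entry in entries[1:]:
        common &= set(entry)
    hits = set()
    for doc_id in common:
        bits = entries[0][doc_id]
        for entry in entries[1:]:
            bits &= entry[doc_id]   # a bit survives only if all share the section
        if bits:
            hits.add(doc_id)
    return hits


# Usage with toy data: each letter stands for one morpheme.
toy_docs = {1: [list("ABCD"), list("EF")], 2: [list("AB"), list("CD")]}
toy_index = build_index(toy_docs, m=2)
same_section = match_same_section(toy_index, ["AB", "CD"])
```

In the toy data, document 2 contains both AB and CD but in different sections, so the bitwise AND of its two bitmaps is zero and it is filtered out.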
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2007100761163A CN101075252A (zh) | 2007-06-21 | 2007-06-21 | 一种网络搜索方法及系统 |
CN200710076116.3 | 2007-06-21 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2008154823A1 (fr) | 2008-12-24 |
Family
ID=38976305
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2008/070615 WO2008154823A1 (fr) | 2007-06-21 | 2008-03-28 | Procédé, système et dispositif de recherche |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN101075252A (zh) |
WO (1) | WO2008154823A1 (zh) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605708A (zh) * | 2013-11-11 | 2014-02-26 | 中国科学院计算技术研究所 | Kad网络中由关键词哈希值推测关键词的方法及系统 |
CN109241356A (zh) * | 2018-06-22 | 2019-01-18 | 腾讯科技(深圳)有限公司 | 一种数据处理方法、装置及存储介质 |
Families Citing this family (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101075252A (zh) * | 2007-06-21 | 2007-11-21 | 腾讯科技(深圳)有限公司 | 一种网络搜索方法及系统 |
CN102270201B (zh) * | 2010-06-01 | 2013-07-17 | 富士通株式会社 | 用于网络文件的多维索引的方法和设备 |
CN102880614B (zh) * | 2011-07-15 | 2015-04-15 | 阿里巴巴集团控股有限公司 | 数据搜索方法及设备 |
CN103164408B (zh) * | 2011-12-09 | 2016-03-30 | 阿里巴巴集团控股有限公司 | 基于垂直搜索引擎的信息存储、查询方法及其装置 |
CN104572789B (zh) * | 2013-10-29 | 2018-05-01 | 北大方正集团有限公司 | 一种文本排序方法及设备 |
CN104679778B (zh) * | 2013-11-29 | 2019-03-26 | 腾讯科技(深圳)有限公司 | 一种搜索结果的生成方法及装置 |
WO2015096609A1 (zh) * | 2013-12-26 | 2015-07-02 | 乐视网信息技术(北京)股份有限公司 | 视频资源的倒排索引文件建立方法及其系统 |
CN105279233A (zh) * | 2015-09-23 | 2016-01-27 | 浙江宇视科技有限公司 | 一种资源的检索方法和装置 |
CN106227734B (zh) * | 2016-07-08 | 2019-06-25 | 大唐融合通信股份有限公司 | 一种基于问题搜索系统的数据处理方法和系统 |
CN109086285B (zh) * | 2017-06-14 | 2021-10-15 | 佛山辞荟源信息科技有限公司 | 基于语素的汉语智能处理方法和系统及装置 |
CN107506481A (zh) * | 2017-09-15 | 2017-12-22 | 北京京东尚科信息技术有限公司 | 用于处理信息的方法及装置 |
CN108376170A (zh) * | 2018-02-27 | 2018-08-07 | 广州酷狗计算机科技有限公司 | 确定检索结果的方法和装置 |
CN109213758B (zh) * | 2018-07-24 | 2021-03-30 | 中国联合网络通信集团有限公司 | 数据存取方法、装置、设备及计算机可读存储介质 |
CN110413735B (zh) * | 2019-07-25 | 2022-04-29 | 深圳供电局有限公司 | 一种问答检索方法及其系统、计算机设备、可读存储介质 |
CN114443728B (zh) * | 2022-01-04 | 2022-11-15 | 广州粤建三和软件股份有限公司 | 一种基于Elasticsearch的检测报告搜索方法及装置 |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0954781A (ja) * | 1995-08-17 | 1997-02-25 | Oki Electric Ind Co Ltd | 文書検索システム |
JPH09198395A (ja) * | 1996-01-16 | 1997-07-31 | Fuji Xerox Co Ltd | 文書検索装置 |
JP2000322416A (ja) * | 1999-05-06 | 2000-11-24 | Ntt Data Corp | 文書検索装置 |
CN101075252A (zh) * | 2007-06-21 | 2007-11-21 | 腾讯科技(深圳)有限公司 | 一种网络搜索方法及系统 |
- 2007-06-21 CN CNA2007100761163A patent/CN101075252A/zh active Pending
- 2008-03-28 WO PCT/CN2008/070615 patent/WO2008154823A1/zh active Application Filing
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103605708A (zh) * | 2013-11-11 | 2014-02-26 | 中国科学院计算技术研究所 | Kad网络中由关键词哈希值推测关键词的方法及系统 |
CN103605708B (zh) * | 2013-11-11 | 2017-12-08 | 中国科学院计算技术研究所 | Kad网络中由关键词哈希值推测关键词的方法及系统 |
CN109241356A (zh) * | 2018-06-22 | 2019-01-18 | 腾讯科技(深圳)有限公司 | 一种数据处理方法、装置及存储介质 |
Also Published As
Publication number | Publication date |
---|---|
CN101075252A (zh) | 2007-11-21 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 08715351; Country of ref document: EP; Kind code of ref document: A1 |
| NENP | Non-entry into the national phase | Ref country code: DE |
| 32PN | Ep: public notification in the ep bulletin as address of the addressee cannot be established | Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 28.04.10) |
| 122 | Ep: pct application non-entry in european phase | Ref document number: 08715351; Country of ref document: EP; Kind code of ref document: A1 |