WO2012119339A1 - Retrieval method and apparatus - Google Patents

Retrieval method and apparatus

Publication number
WO2012119339A1
Authority
WO
WIPO (PCT)
Prior art keywords
file
information
files
same
key information
Prior art date
Application number
PCT/CN2011/073036
Other languages
French (fr)
Chinese (zh)
Inventor
齐波
Original Assignee
中兴通讯股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中兴通讯股份有限公司
Publication of WO2012119339A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying

Definitions

  • The present invention relates to the field of information retrieval, and in particular to a search method and apparatus.
  • BACKGROUND OF THE INVENTION Many files are currently shared over networks, so retrieving the files that users need is particularly important.
  • The prior art provides a method for searching for and downloading mobile phone files: a server in the domain where the mobile terminal is located receives the keyword information of a search request, searches for resources in the mobile communication network, and returns a list of the resources found to the mobile terminal.
  • The mobile terminal receives the data source information that the user selects from the resource list and sends a request to download the required resource to the server in its domain; that server then sends the required resource to the mobile terminal.
  • A primary object of the present invention is to provide a search method and apparatus that solve at least the above problems.
  • A search method includes: acquiring a keyword for which a search is requested; and acquiring information of a plurality of files corresponding to a word segmentation item that contains the keyword or is identical to the keyword.
  • The information of each of the plurality of files includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each item occurs in the key information, the key information being the information set for retrieving the file. The method further includes: determining the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds a threshold; and returning the search result, wherein the search result retains one file out of the same files.
  • The same files further include files whose key information, as included in the file information, is completely identical.
  • Determining the same files among the plurality of files comprises: determining that files whose key information is completely identical are the same files; retaining only one of those same files; and then determining that, among the remaining files, files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds the threshold are the same files.
  • For the same files, the search result retains one of the files together with multiple pieces of the information needed to obtain that file.
  • The method further comprises: segmenting the key information of each file according to the reverse maximum matching algorithm to obtain one or more word segmentation items corresponding to the file.
  • A search apparatus includes: a first acquisition module, configured to acquire a keyword for which a search is requested; a second acquisition module, configured to acquire information of a plurality of files corresponding to a word segmentation item that contains the keyword or is identical to the keyword, wherein the information of each of the plurality of files includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each item occurs in the key information, the key information being the information set for retrieving the file; a determination module, configured to determine the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds a threshold; and a return module, configured to return the search result, wherein the search result retains one file out of the same files.
  • The same files determined by the determination module further include files whose key information, as included in the file information, is completely identical.
  • The determination module includes: a first determination module, configured to determine that files whose key information is completely identical are the same files; and a second determination module, configured to retain only one of those same files and then determine that, among the remaining files, files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds the threshold are the same files.
  • The apparatus further includes: a word segmentation module, configured to segment the key information of each file according to the reverse maximum matching algorithm to obtain one or more word segmentation items corresponding to the file.
  • FIG. 2 is a structural block diagram of a search apparatus according to an embodiment of the present invention;
  • FIG. 3a is a schematic diagram of the overall framework of the index server a101 according to an embodiment of the present invention;
  • FIG. 3b is a schematic diagram of the internal structure of a word segmentation item according to an embodiment of the present invention;
  • FIG. 3c is a schematic diagram of the structure of the position information of a word segmentation item according to an embodiment of the present invention;
  • FIG. 3d is a schematic diagram of the structure of the shared file information from one terminal according to an embodiment of the present invention;
  • FIG. 3e is a schematic diagram of the composition of the shared file information according to an embodiment of the present invention;
  • FIG. 3f is a schematic diagram of the structure information of the space vector according to an embodiment of the present invention;
  • FIG. 4 is a flowchart of a search method according to a preferred embodiment of the present invention; and
  • FIG. 5 is a flowchart of creating a space vector according to a preferred embodiment of the present invention.
  • FIG. 1 is a flowchart of a search method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps: Step S102: acquire the keyword for which a search is requested.
  • Step S104: acquire information of a plurality of files corresponding to a word segmentation item that contains the keyword or is identical to the keyword, wherein the information of each file includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each item occurs in the key information.
  • The key information is the information set for retrieving the file, for example the file summary, author, and title.
  • Step S106: determine the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds a threshold. For example, the word segmentation items of file A are (a1, occurs 10 times), (b1, occurs 9 times), (c1, occurs once), and those of file B are (a1, occurs 10 times), (b1, occurs 9 times), (c1, occurs once); A and B can therefore be judged to be the same file. If c1 occurred twice in file B, files A and B would still be highly similar and could also be regarded as the same file.
  • Step S108: return the search result, wherein the search result retains one file out of the same files. For example, if file A and file B are the same file, it is enough to keep one of them in the search result.
  • Through the above steps, the query result that the server returns to the user is denoised, which solves the problem caused by duplicate, redundant search results in the prior art, improves the user experience, and saves network resources.
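The same-file test of step S106 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function names `match_ratio` and `is_same_file`, the dictionary representation (word segmentation item mapped to frequency), and the choice to count a dimension as matching only when both frequencies are equal are all assumptions, and the threshold value is left to the caller because the text does not fix one for this step.

```python
def match_ratio(terms_a, terms_b):
    """Share of word segmentation items whose frequency is identical in both files.

    Each argument maps an item to its frequency in the file's key information.
    (Hypothetical representation chosen for this sketch.)
    """
    all_terms = set(terms_a) | set(terms_b)
    if not all_terms:
        return 1.0  # two empty descriptions are trivially identical
    same = sum(1 for t in all_terms if terms_a.get(t, 0) == terms_b.get(t, 0))
    return same / len(all_terms)


def is_same_file(terms_a, terms_b, threshold):
    """Treat two files as the same when the matching share exceeds the threshold."""
    return match_ratio(terms_a, terms_b) > threshold


# The example from step S106: identical items and frequencies.
file_a = {"a1": 10, "b1": 9, "c1": 1}
file_b = {"a1": 10, "b1": 9, "c1": 1}
# Variant in which c1 occurs twice in file B: two of three dimensions still agree.
file_b_variant = {"a1": 10, "b1": 9, "c1": 2}
```

Whether the variant counts as the same file depends entirely on the threshold chosen: two of its three dimensions match, so it passes a loose threshold and fails a strict one.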
  • Preferably, for the same files the search result retains one of the files together with multiple pieces of the information needed to obtain it. For example, if file A is retained, the download addresses of both file A and file B can be kept, so that the user can download from multiple sources.
  • In implementation, key information with a small amount of data may be compared first; if it is identical, the two files may be regarded as the same. For example, the author and the title may be compared. Such a simple comparison filters out some duplicate files and also reduces the load on the search server. That is, where the information of each file further includes one or more items of the key information, the same files further include files whose key information is completely identical. In this case, determining the same files in step S106 includes: determining that files whose key information is completely identical are the same files; retaining only one of those same files;
  • and then determining that, among the remaining files, files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds the threshold are the same files.
  • Key information can be segmented in many ways. This embodiment uses the following one: the key information of each file is segmented according to the reverse maximum matching algorithm
  • to obtain one or more word segmentation items corresponding to the file.
  • Whichever segmentation method is used, applying the method shown in FIG. 1 improves the user experience.
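The reverse maximum matching segmentation mentioned above can be sketched as follows. The text names the algorithm but gives no details, so the lexicon contents, the `max_len` window, and the single-character fallback for text not found in the lexicon are assumptions made for illustration.

```python
from collections import Counter


def reverse_max_match(text, lexicon, max_len=4):
    """Segment text from right to left, preferring the longest lexicon word.

    At each position the longest candidate ending at the cursor is tried first;
    if nothing matches, a single character is emitted (assumed fallback).
    """
    terms = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            candidate = text[end - size:end]
            if size == 1 or candidate in lexicon:
                terms.append(candidate)
                end -= size
                break
    terms.reverse()  # restore reading order
    return terms


# Hypothetical lexicon and input, chosen so that scanning from the right matters.
lexicon = {"研究", "研究生", "生命", "起源"}
items = reverse_max_match("研究生命起源", lexicon)
frequencies = Counter(items)  # item -> number of occurrences in the key information
```

Scanning from the right finds 起源 and 生命 before 研究, so the input is split into 研究 / 生命 / 起源 rather than starting with the longer word 研究生, which a left-to-right scan would pick first.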
  • This embodiment further provides a search apparatus, which may be located in a server that provides a search function. The apparatus is used to implement the foregoing embodiments and their preferred implementations.
  • FIG. 2 is a structural block diagram of a search apparatus according to an embodiment of the present invention. As shown in FIG. 2, the search apparatus includes a first acquisition module 20, a second acquisition module 22, a determination module 24, and a return module 26. The structure is described below.
  • The first acquisition module 20 is configured to acquire a keyword for which a search is requested.
  • The second acquisition module 22 is connected to the first acquisition module 20 and is configured to acquire information of a plurality of files corresponding to a word segmentation item that contains the keyword or is identical to the keyword,
  • wherein the information of each of the plurality of files includes all word segmentation items corresponding to the key information of the file and the frequency with which each item occurs in the key information, the key information being the information set for retrieving the file.
  • The determination module 24 is connected to the second acquisition module 22 and is configured to determine the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds a threshold.
  • The return module 26 is connected to the determination module 24 and is configured to return the search result, wherein the search result retains one file out of the same files.
  • The same files determined by the determination module 24 further include files whose key information, as included in the file information, is completely identical.
  • The determination module 24 may include: a first determination module 242, configured to determine that files whose key information is completely identical are the same files; and a second determination module 244, connected to the first determination module 242 and configured to retain only one of those same files and then determine that, among the remaining files, files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds the threshold are the same files.
  • In the search result returned by the search module, one of the same files is retained together with multiple pieces of the information needed to obtain it.
  • The apparatus further comprises: a word segmentation module, configured to segment the key information of each file according to the reverse maximum matching algorithm to obtain one or more word segmentation items corresponding to the file.
  • Any other word segmentation method achieves the same effect as long as it can segment words.
  • The following description concerns the search of shared files related to a mobile terminal (for example, a mobile phone).
  • The following preferred embodiments can also be used in terminals other than mobile terminals.
  • The mobile terminal can share files in two ways.
  • In the first, the publisher uploads the file to be shared directly to the shared area of a file server, and the downloader downloads the file by accessing the shared area of that relay server.
  • When publishing the file, the publisher can also set permissions so that only authorized downloaders can obtain it. In this mode the key information of the file, such as its name and author, may need to be obtained; the publisher can enter this information when uploading the file.
  • In the second, more optimized way, the publisher publishes only information such as the name, summary, type, and size of the file to be shared to the server, rather than the original file.
  • The downloader accesses the relay server and selects a file according to their own needs, which locates the original publisher from whom the shared file is obtained.
  • The preferred embodiment provides a denoising method for terminal file search based on a space vector algorithm. The space vector is a multi-dimensional vector whose dimensions are the word segmentation items; the frequency with which an item occurs is the height of the vector in that dimension, and the vector is obtained by combining the data on all dimensions.
  • In this preferred embodiment, identical or similar files are denoised during the search of files shared between mobile terminals, with the space vector used as the denoising factor. Building the index and the search process are described separately below, starting with building an index model with denoising capability.
  • The process may include the following steps. Step 1: the mobile terminal publishes a piece of shared file information to the index server. Step 2: the index server allocates space to store the shared information published by the terminal. Step 3: the index server, referring to a lexicon, performs lexical analysis on key information in the shared information, such as the file summary, author, and title, and splits the text into a set of word segmentation items. For example, the lexical analysis may look up the lexicon and segment the key information in the shared file information according to the reverse maximum matching algorithm; a complete article can likewise be decomposed into a set of word segmentation items.
  • Step 4: the index server counts the frequency and positions with which each word segmentation item occurs in the key information and records the unique identification code of each item, called the master code in this embodiment. Step 5: the index server combines the frequency and master code of each item in turn, and then integrates the combined values along the dimensions of the vector to form an abstract data model of the shared file, also called the space vector. Step 6: the index server serializes the counted frequencies, positions, and space vector, together with their correspondence to the shared file information, into an encrypted file to form the index.
  • The index server mainly stores the terminals' shared file information and manages the published data in that storage area; an inverted index over the data published to the server can be built in another area to facilitate terminal searches.
  • The word segmentation item formed by the above steps is a data structure that mainly contains the item's master code and the frequency and positions with which the word occurs; the master code is the unique identification code of each word in the lexicon. The main purpose of this mapping is to make it easier to form the mathematical abstraction of the space vector. Frequency refers to the number of times the item occurs in the key information of the current shared file information, and position refers to where the item occurs in the key information.
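Steps 4 and 5 above can be sketched as follows. The sketch assumes the key information has already been segmented into a token list; the class name `TermStats`, the use of token indexes as positions, and the decision to skip tokens that have no master code in the lexicon are illustrative assumptions rather than details given in the text. Serialization and encryption (step 6) are left out.

```python
from dataclasses import dataclass, field


@dataclass
class TermStats:
    """One word segmentation item: master code, frequency, and positions."""
    master_code: int
    frequency: int = 0
    positions: list = field(default_factory=list)


def collect_term_stats(tokens, master_codes):
    """Step 4: count frequency and positions per item, keyed by master code."""
    stats = {}
    for position, token in enumerate(tokens):
        code = master_codes.get(token)
        if code is None:
            continue  # token missing from the lexicon: ignored in this sketch
        entry = stats.setdefault(code, TermStats(code))
        entry.frequency += 1
        entry.positions.append(position)
    return stats


def to_space_vector(stats):
    """Step 5: combine each master code with its frequency, one pair per dimension."""
    return sorted((s.master_code, s.frequency) for s in stats.values())


# Hypothetical lexicon (word -> master code) and segmented key information.
codes = {"search": 7, "file": 3}
stats = collect_term_stats(["search", "file", "search"], codes)
vector = to_space_vector(stats)
```

Sorting by master code gives every file the same dimension order, which keeps later vector comparisons simple.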
  • Corresponding to the index built by the index server, the search process may include the following steps. Step 1: the index server enumerates the search results and adds them to the pending queue. Step 2: the index server takes one piece of search result information from the pending queue and obtains the shared file's key information whose data volume is below a threshold, such as the title, author, size, and creation time. Step 3: if the completed queue is empty, the information is enqueued directly; if it is not empty, the index server compares the information one by one with the search result information in the completed queue.
  • If identical information is found in the completed queue, the download address of the shared file is recorded and added to that identical entry in the completed queue, and the current information is discarded. Step 4: if the comparison in step 3 finds no identical entry, the space vector of the information is taken out and compared with the information in the completed queue. If identical information is found in the completed queue, the download address of the shared file is recorded and added to the identical entry, and the current information is discarded; otherwise, the information is added to the completed queue. Step 5: the index server continues to take information from the pending queue, repeating steps 2, 3, and 4 until the pending queue is empty.
  • Step 6: the index server assembles the information in the completed queue into a result list in a certain format and sends it to the terminal.
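The two-queue search process in steps 1 to 6 can be sketched as follows. The dictionary keys (`title`, `author`, `size`, `created`, `vector`, `address`), the function names, and the default vector comparison (plain equality) are assumptions for illustration; the text leaves the record layout and the result list format open.

```python
from collections import deque


def light_key(info):
    """Step 2: the small key information that is cheap to compare."""
    return (info["title"], info["author"], info["size"], info["created"])


def find_duplicate(done, info, same_vector):
    """Steps 3 and 4: cheap comparison first, space vector comparison second."""
    key = light_key(info)
    for kept in done:
        if light_key(kept) == key:
            return kept
    for kept in done:
        if same_vector(kept["vector"], info["vector"]):
            return kept
    return None


def dedupe(results, same_vector=lambda a, b: a == b):
    """Drain the pending queue into a duplicate-free completed queue."""
    pending = deque(results)          # step 1
    done = []                         # the completed queue
    while pending:                    # step 5
        info = pending.popleft()
        kept = find_duplicate(done, info, same_vector)
        if kept is None:
            done.append({**info, "addresses": [info["address"]]})
        else:
            # Duplicate: keep only its download address, discard the rest.
            kept["addresses"].append(info["address"])
    return done                       # step 6 would format and send this list
```

Passing a threshold-based comparison as `same_vector` turns the exact vector match into the looser similarity test described for near-identical files.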
  • The space vector has many dimensions. When determining whether files are the same, a threshold may be set in advance: dimensions with small frequency values may be omitted from the comparison between vectors as appropriate. For example, two articles that are 98% similar can also be regarded as the same article.
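One way to read the paragraph above is sketched below: dimensions whose frequency is small in both vectors are skipped, and the remaining dimensions are compared against a similarity threshold. The cut-off parameter `min_freq`, the rule that a dimension is skipped only when it is small in both files, and the function names are assumptions; the text gives only the 98% figure as an example.

```python
def pruned_similarity(vec_a, vec_b, min_freq):
    """Share of compared dimensions on which two space vectors agree.

    Vectors map master code -> frequency. A dimension is left out of the
    comparison when its frequency is below min_freq in both vectors
    (assumed pruning rule).
    """
    dims = [
        d for d in set(vec_a) | set(vec_b)
        if max(vec_a.get(d, 0), vec_b.get(d, 0)) >= min_freq
    ]
    if not dims:
        return 1.0  # nothing significant left to distinguish the files
    agree = sum(1 for d in dims if vec_a.get(d, 0) == vec_b.get(d, 0))
    return agree / len(dims)


def same_article(vec_a, vec_b, min_freq, threshold=0.98):
    """Regard two files as the same when similarity reaches the threshold."""
    return pruned_similarity(vec_a, vec_b, min_freq) >= threshold


# Two vectors that differ only on a low-frequency dimension (master code 3).
vec_a = {1: 10, 2: 9, 3: 1}
vec_b = {1: 10, 2: 9, 3: 2}
```

With a cut-off of 3 the differing dimension is ignored and the two vectors agree on everything that is compared; with a cut-off of 2 it is included and the similarity drops to two thirds.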
  • This embodiment optimizes the storage structure of the server's index data and balances query time against the uniqueness and accuracy of the shared file information returned from the server to the terminal, which improves the user experience.
  • FIGS. 3a to 3f are schematic diagrams of the structure of a space-vector-based index server system according to an embodiment of the present invention. With reference to these figures, the role of the space vector and the feasibility of denoising are explained, working from the framework of the server down to its constituent units.
  • FIG. 3a shows the composition of the index server a101 at the level of the overall framework.
  • The index server a101 includes two sub-parts: an index module a102 and a file information module a104.
  • The index module a102 stores index information, chiefly the word segmentation items a103: every word segmentation item a103 obtained by lexical analysis of the shared file information is stored in the index module a102 in a certain format.
  • The file information module a104 is a collection that contains the terminal file information a105 published for sharing by each terminal.
  • The terminal file information a105 is the root directory of one terminal's published shared information, and stores the resources of the shared file information.
  • FIG. 3b depicts the internal structure of the word segmentation item a103. As shown in FIG. 3b, b101 is the master code of the item a103, and b102 describes the positions at which the item a103 occurs across all shared file information. These entries can be sorted by frequency from high to low.
  • FIG. 3c illustrates the composition of the position information b102. As shown in FIG. 3c, it includes the terminal number c101, the terminal shared file number c102, and the frequency c103 with which the item occurs in that shared file.
  • FIG. 3d illustrates the structure of the shared file information a105 from one terminal in the file information module.
  • As shown in FIG. 3d, the terminal directory is divided by shared file into many shared file information items d101, one item per piece of shared file information.
  • FIG. 3e depicts the composition of one shared file information item d101. As shown in FIG. 3e, e101 is the terminal shared file number c102 mentioned in FIG. 3c, and this number links FIG. 3e to FIG. 3c; the space vector e102 is used to determine whether two files are the same or similar; the file name e103 is auxiliary information shown in the shared file list that the index server later returns to the terminal; and the file address e104 is the address of the other terminal that published the shared file information, serving as the entry point for later communication between the two terminals.
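The structures in FIGS. 3a to 3e can be sketched as plain data classes. The reference numerals in the comments come from the figures; the class names, field types, and the `publish` and `lookup` methods are hypothetical and only show how the pieces could fit together.

```python
from dataclasses import dataclass, field


@dataclass
class Posting:                      # position information b102 (FIG. 3c)
    terminal_id: int                # terminal number c101
    file_id: int                    # terminal shared file number c102
    frequency: int                  # frequency c103


@dataclass
class TermItem:                     # word segmentation item a103 (FIG. 3b)
    master_code: int                # master code b101
    postings: list = field(default_factory=list)


@dataclass
class SharedFileInfo:               # shared file information item d101 (FIG. 3e)
    file_id: int                    # e101, same number as c102
    space_vector: dict              # e102: master code -> frequency
    name: str                       # e103
    address: str                    # e104: address of the publishing terminal


@dataclass
class IndexServer:                  # index server a101 (FIG. 3a)
    index: dict = field(default_factory=dict)   # index module a102
    files: dict = field(default_factory=dict)   # file information module a104

    def publish(self, terminal_id, info):
        """Store one shared file entry and update the postings of its items."""
        self.files.setdefault(terminal_id, {})[info.file_id] = info
        for code, freq in info.space_vector.items():
            item = self.index.setdefault(code, TermItem(code))
            item.postings.append(Posting(terminal_id, info.file_id, freq))
            # Keep postings ordered by frequency, highest first.
            item.postings.sort(key=lambda p: p.frequency, reverse=True)

    def lookup(self, master_code):
        """Return the shared file entries that contain the given item."""
        item = self.index.get(master_code)
        if item is None:
            return []
        return [self.files[p.terminal_id][p.file_id] for p in item.postings]
```

A lookup by master code then returns the matching shared file entries with the most frequent occurrences first, which is the order suggested for the position information in FIG. 3b.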
  • FIG. 4 is a flowchart of a search method according to a preferred embodiment of the present invention.
  • As shown in FIG. 4, the process by which a terminal user issues a search request and finally obtains search results includes the following steps. Step S401: the terminal user sends a search request to the index server. Step S402: the index server starts searching its internal index and performs the index search operation.
  • Step S403: if word segmentation items matching the search request are found, the relevant shared file information is located from those items. Step S404: all of that shared file information is acquired and pushed into a pending queue. Step S405: identical or similar shared files are denoised, mainly by comparing space vectors: the shared file information in the completed queue serves as the reference against which each piece of shared file information taken from the pending queue is compared, and if the two differ, the information taken from the pending queue is put into the completed queue. Step S406: the shared file information in the completed queue is collated into a shared file list, which is sent to the searching terminal user. Step S407: the terminal user selects a shared file from the list and establishes a point-to-point link with the terminal that published it. Step S408: after being authorized by the publishing terminal, the terminal downloads the shared file.
  • FIG. 5 is a flowchart of creating a space vector according to a preferred embodiment of the present invention. The process may run on the index server. As shown in FIG. 5, it includes the following steps. Step S501: obtain one piece of shared file information from the file information module.
  • Step S502: extract a key statement from the key information of the shared file information, such as the file summary, author, and title.
  • Step S503: perform lexical analysis on the key statement and, by looking up the keyword lexicon in the server, split the statement into multiple word segmentation items.
  • Step S504: count the frequency with which each distinct word segmentation item occurs in the key statement and record the master code of each.
  • Step S505: determine whether any statements remain that have not been lexically analyzed; if so, return to step S502; if not, proceed to step S506.
  • Step S506: combine the master code of every word segmentation item produced by the lexical analysis with its frequency, then integrate the combined values along the dimensions of the vector to form the space vector corresponding to the shared file information.
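The loop of steps S502 to S506 can be sketched as follows. The segmentation routine is passed in as a parameter because the text allows any word segmentation method; the whitespace split used in the example, the lexicon contents, and the name `build_space_vector` are stand-ins chosen for illustration, not the patent's lexical analysis.

```python
from collections import Counter


def build_space_vector(statements, segment, master_codes):
    """Accumulate item frequencies over all key statements of one shared file.

    statements   -- key statements taken from summary, author, title, etc.
    segment      -- callable that splits a statement into word segmentation items
    master_codes -- lexicon mapping each item to its master code
    """
    counts = Counter()
    for statement in statements:          # S502, repeated until S505 finds none left
        for item in segment(statement):   # S503: lexical analysis
            code = master_codes.get(item)
            if code is not None:
                counts[code] += 1         # S504: frequency per master code
    # S506: one dimension per master code, in a stable order.
    return dict(sorted(counts.items()))


# Hypothetical lexicon and key statements; str.split stands in for the segmenter.
codes = {"mobile": 1, "file": 2, "search": 3, "method": 4}
vector = build_space_vector(
    ["mobile file search", "file search method"], str.split, codes
)
```

The resulting dictionary maps each master code to its total frequency across the statements, which is the form the comparison steps of the search process operate on.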
  • The above embodiments solve the problems caused by duplicate, redundant search results in the prior art, save bandwidth, and improve the user experience.
  • The above modules or steps of the present invention can be implemented by general-purpose computing devices; they can be concentrated on a single computing device or distributed over a network of multiple computing devices.
  • They may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module.
  • The invention is not limited to any specific combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A retrieval method and apparatus are disclosed. The retrieval method includes: obtaining a keyword for which retrieval is requested; obtaining information of a plurality of documents corresponding to word segmentation items of the keyword or word segmentation items similar to the keyword, wherein the information of each of the plurality of documents includes one or more word segmentation items corresponding to key information of the document and the frequency of each word segmentation item's occurrence in the key information, the key information being the information defined for retrieval of the document; determining similar documents from the plurality of documents, wherein the similar documents are documents for which the ratio of similarity of the corresponding word segmentation items and of the frequency corresponding to each word segmentation item is above a threshold value; and returning a retrieval result that retains only one of the similar documents. The present invention saves bandwidth resources and improves user experience.

Description

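Search Method and Apparatus

Technical Field

The present invention relates to the field of information retrieval, and in particular to a search method and apparatus.

Background

Many files are currently shared over networks, so retrieving the files that users need is particularly important.

The prior art provides a method for searching for and downloading mobile phone files: a server in the domain where the mobile terminal is located receives the keyword information of a search request, searches for resources in the mobile communication network, and returns a list of the resources found to the mobile terminal; the mobile terminal receives the data source information that the user selects from the resource list and sends a request to download the required resource to the server in its domain; the server in the domain where the mobile terminal is located then sends the required resource to the mobile terminal.

The search results of the above method, and of other methods in the prior art, may contain duplicate, redundant items. This redundancy not only consumes bandwidth but also degrades the user experience.

Summary of the Invention

The main object of the present invention is to provide a search method and apparatus that solve at least the above problems.

According to one aspect of the present invention, a search method is provided, including: acquiring a keyword for which a search is requested; acquiring information of a plurality of files corresponding to a word segmentation item that contains the keyword or is identical to the keyword, wherein the information of each of the plurality of files includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each item occurs in the key information, the key information being the information set for retrieving the file; determining the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds a threshold; and returning the search result, wherein the search result retains one file out of the same files.

Preferably, where the information of each file further includes one or more items of the key information, the same files further include files whose key information, as included in the file information, is completely identical.

Preferably, determining the same files among the plurality of files includes: determining that files whose key information is completely identical are the same files; retaining only one of those same files; and then determining that, among the remaining files, files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds the threshold are the same files.

Preferably, for the same files, the search result retains one of the files together with multiple pieces of the information needed to obtain that file.

Preferably, the method further includes: segmenting the key information of each file according to the reverse maximum matching algorithm to obtain one or more word segmentation items corresponding to the file.

According to another aspect of the present invention, a search apparatus is also provided, including: a first acquisition module, configured to acquire a keyword for which a search is requested; a second acquisition module, configured to acquire information of a plurality of files corresponding to a word segmentation item that contains the keyword or is identical to the keyword, wherein the information of each of the plurality of files includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each item occurs in the key information, the key information being the information set for retrieving the file; a determination module, configured to determine the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds a threshold; and a return module, configured to return the search result, wherein the search result retains one file out of the same files.

Preferably, where the information of each file further includes one or more items of the key information, the same files determined by the determination module further include files whose key information, as included in the file information, is completely identical.

Preferably, the determination module includes: a first determination module, configured to determine that files whose key information is completely identical are the same files; and a second determination module, configured to retain only one of those same files and then determine that, among the remaining files, files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds the threshold are the same files.

Preferably, in the search result returned by the search module, one of the same files is retained together with multiple pieces of the information needed to obtain that file.

Preferably, the apparatus further includes: a word segmentation module, configured to segment the key information of each file according to the reverse maximum matching algorithm to obtain one or more word segmentation items corresponding to the file.

The present invention solves the problems caused by duplicate, redundant search results in the prior art, saves bandwidth, and improves the user experience.

Brief Description of the Drawings

The drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the present invention and their description are used to explain the invention and do not unduly limit it. In the drawings:

FIG. 1 is a flowchart of a search method according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a search apparatus according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of the overall framework of the index server a101 according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of the internal structure of a word segmentation item according to an embodiment of the present invention;
FIG. 3c is a schematic diagram of the structure of the position information of a word segmentation item according to an embodiment of the present invention;
FIG. 3d is a schematic diagram of the structure of the shared file information from one terminal according to an embodiment of the present invention;
FIG. 3e is a schematic diagram of the composition of the shared file information according to an embodiment of the present invention;
FIG. 3f is a schematic diagram of the structure information of the space vector according to an embodiment of the present invention;
FIG. 4 is a flowchart of a search method according to a preferred embodiment of the present invention;
FIG. 5 is a flowchart of creating a space vector according to a preferred embodiment of the present invention.

Detailed Description

The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments. Provided they do not conflict, the embodiments in this application and the features in those embodiments may be combined with one another.

The following embodiments can be applied to retrieval in various networks. Because mobile terminals are subject to strict bandwidth constraints, and repeated data transfer means paying additional traffic charges, applying the embodiments to retrieval on mobile terminals yields a better effect.

This embodiment provides a search method. FIG. 1 is a flowchart of a search method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:

Step S102: acquire the keyword for which a search is requested.

Step S104: acquire information of a plurality of files corresponding to a word segmentation item that contains the keyword or is identical to the keyword, wherein the information of each file includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each item occurs in the key information. The key information is the information set for retrieving the file, for example the file summary, author, and title.

Step S106: determine the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds a threshold. For example, the word segmentation items of file A are (a1, occurs 10 times), (b1, occurs 9 times), (c1, occurs once), and those of file B are (a1, occurs 10 times), (b1, occurs 9 times), (c1, occurs once); A and B can therefore be judged to be the same file. If c1 occurred twice in file B, files A and B would still be highly similar and could also be regarded as the same file.

Step S108: return the search result, wherein the search result retains one file out of the same files. For example, if file A and file B are the same file, it is enough to keep one of them in the search result.

Through the above steps, the query result that the server returns to the user is denoised, which solves the problem caused by duplicate, redundant search results in the prior art, improves the user experience, and saves network resources.

Preferably, for the same files, the search result retains one of the files together with multiple pieces of the information needed to obtain it. For example, if file A is retained, the download addresses of both file A and file B can be kept, so that the user can download from multiple sources.

Preferably, in implementation, key information with a small amount of data may be compared first; if it is identical, the two files may be regarded as the same. For example, the author and the title may be compared. Such a simple comparison filters out some duplicate files and also reduces the load on the search server. That is, where the information of each file further includes one or more items of the key information, the same files further include files whose key information is completely identical. In this case, determining the same files in step S106 includes: determining that files whose key information is completely identical are the same files; retaining only one of those same files; and then determining that, among the remaining files, files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds the threshold are the same files.

Key information can be segmented in many ways. This embodiment uses the following one: the key information of each file is segmented according to the reverse maximum matching algorithm to obtain one or more word segmentation items corresponding to the file. Whichever segmentation method is used, applying the method shown in FIG. 1 improves the user experience.

This embodiment also provides a search apparatus, which may be located in a server that provides a search function. The apparatus is used to implement the above embodiments and their preferred implementations; what has already been explained is not repeated here. The modules involved in the apparatus are described below. FIG. 2 is a structural block diagram of a search apparatus according to an embodiment of the present invention. As shown in FIG. 2, the search apparatus includes a first acquisition module 20, a second acquisition module 22, a determination module 24, and a return module 26. The structure is described below.

The first acquisition module 20 is configured to acquire a keyword for which a search is requested. The second acquisition module 22 is connected to the first acquisition module 20 and is configured to acquire information of a plurality of files corresponding to a word segmentation item that contains the keyword or is identical to the keyword, wherein the information of each of the plurality of files includes all word segmentation items corresponding to the key information of the file and the frequency with which each item occurs in the key information, the key information being the information set for retrieving the file. The determination module 24 is connected to the second acquisition module 22 and is configured to determine the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds a threshold. The return module 26 is connected to the determination module 24 and is configured to return the search result, wherein the search result retains one file out of the same files.

Preferably, where the information of each file further includes one or more items of the key information, the same files determined by the determination module 24 further include files whose key information, as included in the file information, is completely identical. In this case, the determination module 24 may include: a first determination module 242, configured to determine that files whose key information is completely identical are the same files; and a second determination module 244, connected to the first determination module 242 and configured to retain only one of those same files and then determine that, among the remaining files, files for which the proportion of identical word segmentation items and identical corresponding frequencies exceeds the threshold are the same files.

Preferably, in the search result returned by the search module, one of the same files is retained together with multiple pieces of the information needed to obtain that file.

Preferably, the apparatus further includes: a word segmentation module, configured to segment the key information of each file according to the reverse maximum matching algorithm to obtain one or more word segmentation items corresponding to the file. This is, of course, only one preferred implementation of the word segmentation module; any other segmentation method achieves the same effect as long as it can segment words.

The following description concerns the search of shared files related to a mobile terminal (for example, a mobile phone); the preferred embodiments below can, of course, also be used in terminals other than mobile terminals.

In this preferred embodiment, the mobile terminal can share files in two ways. In the first, the publisher uploads the file to be shared directly to the shared area of a file server, and the downloader downloads the file by accessing the shared area of that relay server. When publishing the file, the publisher can also set permissions so that only authorized downloaders can obtain it. In this mode the key information of the file, such as its name and author, may need to be obtained; the publisher can enter this information when uploading the file. In the second, more optimized way, the publisher publishes only information such as the name, summary, type, and size of the file to be shared to the server, rather than the original file; the downloader accesses the relay server and selects a file according to their own needs, which locates the original publisher from whom the shared file is obtained.

This preferred embodiment provides a denoising method for terminal file search based on a space vector algorithm. The space vector is a multi-dimensional vector whose dimensions are the word segmentation items; the frequency with which an item occurs is the height of the vector in that dimension, and the vector is obtained by combining the data on all dimensions. In this preferred embodiment, identical or similar files are denoised during the search of files shared between mobile terminals, with the space vector used as the denoising factor. Building the index module and the search process are described separately below.

Building an index model with denoising capability. The process may include the following steps:

Step 1: the mobile terminal publishes a piece of shared file information to the index server.

Step 2: the index server allocates space to store the shared information published by the terminal.

Step 3: the index server, referring to a lexicon, performs lexical analysis on key information in the shared information, such as the file summary, author, and title, and splits the text into a set of word segmentation items. For example, the lexical analysis may look up the lexicon and segment the key information in the shared file information, such as the file summary, author, and title, according to the reverse maximum matching algorithm; a complete article can likewise be decomposed into a set of word segmentation items.

Step 4: the index server counts the frequency and positions with which each word segmentation item occurs in the key information and records the unique identification code of each item, called the master code in this embodiment.

Step 5: the index server combines the frequency and master code of each item in turn, and then integrates the combined values along the dimensions of the vector to form an abstract data model of the shared file, also called the space vector.

Step 6: the index server serializes the counted frequencies, positions, and space vector, together with their correspondence to the shared file information, into an encrypted file to form the index.

Preferably, the index server mainly stores the terminals' shared file information and manages the published data in that storage area; an inverted index over the data published to the server can be built in another area to facilitate terminal searches.

The word segmentation item formed by the above steps is a data structure that mainly contains the item's master code and the frequency and positions with which the word occurs; the master code is the unique identification code of each word in the lexicon. The main purpose of this mapping in the database is to make it easier to form the mathematical abstraction of the space vector. Frequency refers to the number of times the item occurs in the key information of the current shared file information, and position refers to where the item occurs in the key information.

Because the basic unit of a file is the word, and different files differ in both the kinds of words they contain and the word frequencies, using words as dimensions can uniquely distinguish files with different content.

Corresponding to the index built by the index server, the search process may include the following steps:

Step 1: the index server enumerates the search results and adds them to the pending queue.

Step 2: the index server takes one piece of search result information from the pending queue and obtains the shared file's key information whose data volume is below a threshold, such as the title, author, size, and creation time.

Step 3: if the completed queue is empty, the information is enqueued directly. If the completed queue is not empty, the index server compares the information one by one with the search result information in the completed queue; if identical information is found in the completed queue, the download address of the shared file is recorded and added to that identical entry in the completed queue, and the current information is discarded.

Step 4: if the comparison in step 3 finds no identical entry, the space vector of the information is taken out and compared with the information in the completed queue. If identical information is found in the completed queue, the download address of the shared file is recorded and added to the identical entry, and the current information is discarded; otherwise, the information is added to the completed queue.

Step 5: the index server continues to take information from the pending queue, repeating steps 2, 3, and 4 until the pending queue is empty.

Step 6: the index server assembles the information in the completed queue into a result list in a certain format and sends it to the terminal.

The space vector has many dimensions. When determining whether files are the same, a threshold may be set in advance: dimensions with small frequency values may be omitted from the comparison between vectors as appropriate. For example, two articles that are 98% similar can also be regarded as the same article.

This embodiment optimizes the storage structure of the server's index data and balances query time against the uniqueness and accuracy of the shared file information returned from the server to the terminal, which improves the user experience.

FIGS. 3a to 3f are schematic diagrams of the structure of a space-vector-based index server system according to an embodiment of the present invention. With reference to these figures, the role of the space vector and the feasibility of denoising are explained below, working from the framework of the server down to its constituent units.

FIG. 3a shows the composition of the index server a101 at the level of the overall framework. The index server a101 includes two sub-parts, an index module a102 and a file information module a104. As shown in FIG. 3a, the index module a102 stores index information, chiefly the word segmentation items a103: every word segmentation item a103 obtained by lexical analysis of the shared file information is stored in the index module a102 in a certain format. The file information module a104 is a collection that contains the terminal file information a105 published for sharing by each terminal. The terminal file information a105 is the root directory of one terminal's published shared information, and stores the resources of the shared file information.

FIG. 3b depicts the internal structure of the word segmentation item a103. As shown in FIG. 3b, b101 is the master code of the item a103, and b102 describes the positions at which the item a103 occurs across all shared file information. These entries can be sorted by frequency from high to low.

FIG. 3c illustrates the composition of the position information b102. As shown in FIG. 3c, it includes the terminal number c101, the terminal shared file number c102, and the frequency c103 with which the item occurs in that shared file.

FIG. 3d illustrates the structure of the shared file information a105 from one terminal in the file information module. As shown in FIG. 3d, the terminal directory is divided by shared file into many shared file information items d101, one item per piece of shared file information.

FIG. 3e depicts the composition of one shared file information item d101. As shown in FIG. 3e, e101 is the terminal shared file number c102 mentioned in FIG. 3c, and this number links FIG. 3e to FIG. 3c; the space vector e102 is used to determine whether two files are the same or similar; the file name e103 is auxiliary information shown in the shared file list that the index server later returns to the terminal; and the file address e104 is the address of the other terminal that published the shared file information, serving as the entry point for later communication between the two terminals.

FIG. 3f shows the structure information of the space vector e102 in more detail: it is a multi-dimensional vector formed by combining the frequency with which each word segmentation item a103 occurs in the current shared file information with its master code b101.

FIG. 4 is a flowchart of a search method according to a preferred embodiment of the present invention. As shown in FIG. 4, the process by which a terminal user issues a search request and finally obtains search results includes the following steps:

Step S401: the terminal user sends a search request to the index server.

Step S402: the index server starts searching its internal index and performs the index search operation.

Step S403: if word segmentation items matching the search request are found, the relevant shared file information is located from those items.

Step S404: all of that shared file information is acquired and pushed into a pending queue.

Step S405: identical or similar shared files are denoised in this step, mainly by comparing space vectors: the shared file information in the completed queue serves as the reference against which each piece of shared file information taken from the pending queue is compared, and if the two differ, the information taken from the pending queue is put into the completed queue.

Step S406: the shared file information in the completed queue is collated into a shared file list, which is sent to the searching terminal user.

Step S407: the terminal user selects a shared file from the list and establishes a point-to-point link with the terminal that published it.

Step S408: after being authorized by the publishing terminal, the terminal downloads the shared file.

In step S405, if the two pieces of file information being compared are duplicates, the file information taken from the pending queue does not need to be added to the completed queue; its address information is simply appended to the address field of the identical file in the completed queue. After the downloading terminal receives the shared file list, this allows a single file to be downloaded from multiple sources.

FIG. 5 is a flowchart of creating a space vector according to a preferred embodiment of the present invention. The process may run on the index server. As shown in FIG. 5, the process includes the following steps: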
TECHNICAL FIELD

The present invention relates to the field of information retrieval, and in particular to a search method and apparatus.

BACKGROUND OF THE INVENTION

Many files are currently shared over networks, so retrieving the files a user needs is particularly important. The prior art provides a method for searching for and downloading mobile phone files: a server in the domain where the mobile terminal is located receives the keyword information of a search request, searches for resources in the mobile communication network, and returns a list of the resources found to the mobile terminal. The mobile terminal receives the data source information that the user selects from the resource list and sends the server in its domain a request to download the required resource, and that server sends the required resource to the mobile terminal.

The retrieval results of this method, and of other prior-art methods, may contain duplicate, redundant entries. Such redundancy not only consumes bandwidth but also degrades the user experience.
SUMMARY OF THE INVENTION

A primary object of the present invention is to provide a search method and apparatus that solve at least the above problem.

According to one aspect of the present invention, a search method is provided, including: acquiring a keyword of a search request; acquiring information of a plurality of files corresponding to segmentation terms that contain the keyword or are identical to the keyword, where the information of each of the plurality of files includes one or more segmentation terms corresponding to the key information of that file and the frequency with which each segmentation term occurs in the key information, the key information being information set for retrieving the file; determining identical files among the plurality of files, where identical files are files for which the proportion of segmentation terms that are the same, with the same frequency for each term, exceeds a threshold; and returning a search result, where, for identical files, the search result retains one of them.

Preferably, where the information of each file further includes one or more items of the key information, identical files further include files whose information contains exactly the same key information.

Preferably, determining identical files among the plurality of files includes: determining that files whose information contains exactly the same key information are identical files; and, after retaining only one of those identical files, determining that, among the remaining files, files for which the proportion of segmentation terms that are the same, with the same frequency for each term, exceeds the threshold are identical files.

Preferably, for identical files, the search result retains one of the files together with multiple pieces of information needed to obtain the file.
Preferably, the method further includes: segmenting the key information of each file using the reverse maximum matching algorithm to obtain the one or more segmentation terms corresponding to that file.

According to another aspect of the present invention, a search apparatus is provided, including: a first acquiring module, configured to acquire a keyword of a search request; a second acquiring module, configured to acquire information of a plurality of files corresponding to segmentation terms that contain the keyword or are identical to the keyword, where the information of each of the plurality of files includes one or more segmentation terms corresponding to the key information of that file and the frequency with which each segmentation term occurs in the key information, the key information being information set for retrieving the file; a determining module, configured to determine identical files among the plurality of files, where identical files are files for which the proportion of segmentation terms that are the same, with the same frequency for each term, exceeds a threshold; and a returning module, configured to return a search result, where, for identical files, the search result retains one of them.

Preferably, where the information of each file further includes one or more items of the key information, the identical files determined by the determining module further include files whose information contains exactly the same key information.
Preferably, the determining module includes: a first determining module, configured to determine that files whose information contains exactly the same key information are identical files; and a second determining module, configured to retain only one of those identical files and then determine that, among the remaining files, files for which the proportion of segmentation terms that are the same, with the same frequency for each term, exceeds the threshold are identical files.

Preferably, for identical files, the returned search result retains one of the files together with multiple pieces of information needed to obtain the file.

Preferably, the apparatus further includes a segmentation module, configured to segment the key information of each file using the reverse maximum matching algorithm to obtain the one or more segmentation terms corresponding to that file.

The invention solves the problem caused by duplicate, redundant search results in the prior art, saves bandwidth, and improves the user experience.

BRIEF DESCRIPTION OF THE DRAWINGS

The accompanying drawings illustrate embodiments of the present invention. In the drawings:

FIG. 1 is a flowchart of a search method according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a search apparatus according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of the structure of an index server according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of the internal structure of a segmentation term according to an embodiment of the present invention;
FIG. 3c is a schematic diagram of the composition of the position information of a segmentation term according to an embodiment of the present invention;
FIG. 3d is a schematic diagram of the structure of the shared file information from one terminal according to an embodiment of the present invention;
FIG. 3e is a schematic diagram of the composition of one item of shared file information according to an embodiment of the present invention;
FIG. 3f is a schematic diagram of the structure of the space vector according to an embodiment of the present invention;
FIG. 4 is a flowchart of a search method according to a preferred embodiment of the present invention; and
FIG. 5 is a flowchart for creating a space vector according to a preferred embodiment of the present invention.

DETAILED DESCRIPTION OF THE EMBODIMENTS

The present invention is described in detail below with reference to the accompanying drawings and in conjunction with embodiments. Provided they do not conflict, the embodiments in this application and the features in the embodiments may be combined with each other.

The following embodiments can be applied to retrieval in various networks. Because mobile terminals are strict about bandwidth, and repeated data transfer means paying additional traffic charges, retrieval applied to mobile terminals benefits the most.

This embodiment provides a search method. FIG. 1 is a flowchart of a search method according to an embodiment of the present invention. As shown in FIG. 1, the flow includes the following steps:

Step S102: acquire the keyword of the search request.

Step S104: acquire information of a plurality of files corresponding to segmentation terms that contain the keyword or are identical to the keyword, where the information of each file includes one or more segmentation terms corresponding to the key information of that file and the frequency with which each term occurs in the key information. The key information is information set for retrieving the file, for example the file abstract, author, and title.
Step S106: determine identical files among the plurality of files, where identical files are files for which the proportion of segmentation terms that are the same, with the same frequency for each term, exceeds a threshold. For example, the terms of file A are (a1, 10 occurrences), (b1, 9 occurrences), (c1, 1 occurrence), and the terms of file B are (a1, 10 occurrences), (b1, 9 occurrences), (c1, 1 occurrence); A and B can therefore be judged identical. If c1 instead occurred twice in file B, files A and B would still be highly similar and could also be treated as identical.

Step S108: return the search result, where, for identical files, the search result retains one of them. For example, if files A and B are identical, only one of A and B is kept in the search result.

Through the above steps, the query results that the server returns to the user are denoised, which solves the problem caused by duplicate, redundant search results in the prior art, improves the user experience, and saves network resources.

Preferably, for identical files, the search result retains one file together with multiple pieces of information needed to obtain it. For example, if file A is retained, the download addresses of both file A and file B can be retained, so that the user can download from multiple sources.

Preferably, in an implementation, key information with a small data volume can be compared first; if that key information is the same, the two files can be considered identical. For example, the author and title can be compared. Such a simple comparison screens out some duplicate files and also reduces the load on the search server.
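Steps S106 and S108 can be illustrated with a minimal Python sketch. It is not the patented implementation: the `FileInfo` record, the `match_ratio` definition (terms with equal frequency divided by the union of both files' terms), and the 0.6 threshold are assumptions chosen so that the A/B example above behaves as described.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class FileInfo:
    """Per-file record (hypothetical field names)."""
    title: str
    author: str
    term_freq: Dict[str, int]            # segmentation term -> frequency in the key information
    addresses: List[str] = field(default_factory=list)


def match_ratio(a: FileInfo, b: FileInfo) -> float:
    """Share of terms, over the union of both files' terms, with equal frequency in both."""
    terms = set(a.term_freq) | set(b.term_freq)
    if not terms:
        return 1.0
    same = sum(1 for t in terms if a.term_freq.get(t) == b.term_freq.get(t))
    return same / len(terms)


def is_same_file(a: FileInfo, b: FileInfo, threshold: float = 0.6) -> bool:
    # Cheap check first: identical small key information (here title and author).
    if (a.title, a.author) == (b.title, b.author):
        return True
    # Otherwise compare terms and frequencies against the (assumed) threshold.
    return match_ratio(a, b) > threshold


def deduplicate(files: List[FileInfo], threshold: float = 0.6) -> List[FileInfo]:
    """Keep one file per group of identical files and merge their download addresses."""
    kept: List[FileInfo] = []
    for f in files:
        for k in kept:
            if is_same_file(k, f, threshold):
                k.addresses.extend(x for x in f.addresses if x not in k.addresses)
                break
        else:
            kept.append(FileInfo(f.title, f.author, dict(f.term_freq), list(f.addresses)))
    return kept
```

With file A as in the example and a file B in which c1 occurs twice, two of the three terms agree, so `match_ratio` is 2/3 and the files are merged under the assumed 0.6 threshold; the retained entry carries both download addresses.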
That is, where the information of each file further includes one or more items of the key information, identical files further include files whose information contains exactly the same key information. In this case, determining identical files in step S106 includes: determining that files whose information contains exactly the same key information are identical; retaining only one of those identical files; and then determining that, among the remaining files, files for which the proportion of segmentation terms that are the same, with the same frequency for each term, exceeds the threshold are identical.

Key information can be segmented in many ways. This embodiment uses one of them: the key information of each file is segmented using the reverse maximum matching algorithm to obtain the one or more segmentation terms corresponding to the file. Whichever segmentation method is used, the method shown in FIG. 1 improves the user experience.

This embodiment also provides a search apparatus, which may be located in a server that provides a search function. The apparatus implements the above embodiment and its preferred implementations; what has already been described is not repeated here. The modules of the apparatus are described below. FIG. 2 is a structural block diagram of a search apparatus according to an embodiment of the present invention. As shown in FIG. 2, the search apparatus includes a first acquiring module 20, a second acquiring module 22, a determining module 24, and a returning module 26. The structure is described below.
The first acquiring module 20 is configured to acquire the keyword of a search request. The second acquiring module 22 is connected to the first acquiring module 20 and is configured to acquire information of a plurality of files corresponding to segmentation terms that contain the keyword or are identical to the keyword, where the information of each of the plurality of files includes all segmentation terms corresponding to the key information of that file and the frequency with which each term occurs in the key information, the key information being information set for retrieving the file. The determining module 24 is connected to the second acquiring module 22 and is configured to determine identical files among the plurality of files, where identical files are files for which the proportion of segmentation terms that are the same, with the same frequency for each term, exceeds a threshold. The returning module 26 is connected to the determining module 24 and is configured to return the search result, where, for identical files, the search result retains one of them.

Preferably, where the information of each file further includes one or more items of the key information, the identical files determined by the determining module 24 further include files whose information contains exactly the same key information.
In this case, the determining module 24 may include: a first determining module 242, configured to determine that files whose information contains exactly the same key information are identical; and a second determining module 244, connected to the first determining module 242 and configured to retain only one of those identical files and then determine that, among the remaining files, files for which the proportion of segmentation terms that are the same, with the same frequency for each term, exceeds the threshold are identical.

Preferably, for identical files, the returned search result retains one of the files together with multiple pieces of information needed to obtain it.

Preferably, the apparatus further includes a segmentation module, configured to segment the key information of each file using the reverse maximum matching algorithm to obtain the one or more segmentation terms corresponding to that file. Of course, this is only one preferred implementation of the segmentation module; any other method that can segment the text achieves the same effect.

The following description concerns the search of shared files related to mobile terminals (for example, mobile phones). Of course, the preferred embodiments below can also be used with terminals other than mobile terminals.

In this preferred embodiment, a mobile terminal can share files in two ways. In the first, the publisher uploads the file to be shared directly to the shared area of a file server, and the downloader downloads the file by accessing the shared area of that relay server. When publishing the file, the publisher can also set permissions so that only authorized downloaders can obtain it.
With this approach, the key information of the file, such as its name and author, may need to be obtained; the publisher can enter this information when uploading the file. In the second, more optimized way, the publisher publishes to the server only information such as the name, abstract, type, and size of the file to be shared, not the original file; the downloader accesses the relay server, selects a file according to its needs, is thereby directed to the original publisher, and obtains the shared file.

This preferred embodiment provides a method, based on a space vector algorithm, for denoising terminal file searches. The space vector is a multidimensional vector whose dimensions are the segmentation terms: the frequency of each term is the height of the vector along that dimension, and the vector is obtained by combining the data of all dimensions. In this preferred embodiment, identical or similar files are removed during shared-file searches among mobile terminals, with the space vector as the denoising factor. Building the index and the search process are described separately below.

Building an index model with denoising capability may include the following steps:

Step 1: the mobile terminal publishes an item of shared file information to the index server.

Step 2: the index server allocates space to store the shared information published by the terminal.

Step 3: the index server performs lexical analysis, with reference to a lexicon, on key information in the shared information such as the file abstract, author, and title, splitting the text into a set of segmentation terms. For example, the lexical analysis may look up the lexicon and segment the key information (file abstract, author, title, and so on) in the shared file information using the reverse maximum matching algorithm; a complete article can likewise be decomposed into a set of segmentation terms.

Step 4: the index server counts the frequency and positions at which each segmentation term occurs in the key information and records the unique identification code of each term, called the primary code in this embodiment.

Step 5: the index server combines the frequency and primary code of each term in turn, then assembles these combined values along the dimensions of a vector to form an abstract data model of the shared file, also called the space vector.

Step 6: the index server serializes the counted frequencies, positions, and space vector, together with their correspondence to the shared file information, into an encrypted file to form the index.

Preferably, the index server mainly stores the shared file information of terminals and manages the published data in that storage area; in another area it can build an inverted index over the data published to the server, to facilitate terminal searches.

A segmentation term produced by the above steps is a data structure that mainly contains the term's primary code and the frequency and positions of its occurrences, where the primary code is the unique identification code of each word in the lexicon. The main purpose of this mapping in the database is to make it easy to form the mathematical abstraction of the space vector. Frequency is the number of times the term occurs in the key information of the current shared file information, and position is where the term occurs in the key information.
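Steps 3 and 4 of the index-building flow can be sketched as follows. This is an illustrative sketch only: the function names, the `max_len` window, the tiny lexicon, and the choice to record a position as the token index are assumptions, and the source does not say how out-of-lexicon characters are handled (here they become single-character tokens and are left out of the term entries).

```python
from typing import Dict, List


def reverse_max_match(text: str, lexicon, max_len: int = 4) -> List[str]:
    """Segment `text` from right to left, taking the longest lexicon word ending at the cursor."""
    terms: List[str] = []
    end = len(text)
    while end > 0:
        for size in range(min(max_len, end), 0, -1):
            piece = text[end - size:end]
            # A single character is always accepted so the scan cannot stall.
            if size == 1 or piece in lexicon:
                terms.append(piece)
                end -= size
                break
    terms.reverse()
    return terms


def build_term_entries(key_info: str, lexicon_codes: Dict[str, int]) -> Dict[int, dict]:
    """Map primary code -> {"freq": occurrences, "positions": token indexes} for one file."""
    entries: Dict[int, dict] = {}
    for pos, token in enumerate(reverse_max_match(key_info, lexicon_codes)):
        code = lexicon_codes.get(token)
        if code is None:
            continue  # assumption: tokens missing from the lexicon are not indexed
        entry = entries.setdefault(code, {"freq": 0, "positions": []})
        entry["freq"] += 1
        entry["positions"].append(pos)
    return entries


# Hypothetical lexicon: word -> primary code.
codes = {"研究": 1, "研究生": 2, "生命": 3, "起源": 4}
tokens = reverse_max_match("研究生命起源", codes)   # scanning from the right avoids cutting "研究生"
entries = build_term_entries("研究生命起源研究", codes)
```

In this example the right-to-left scan yields 研究 / 生命 / 起源, whereas a left-to-right longest match would first cut 研究生; this difference is one common reason for preferring reverse matching on Chinese text.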
Note that because the basic unit of a file is the word, and different files differ in which words they contain and how often each occurs, using words as dimensions can uniquely distinguish files with different content.

Corresponding to the index built by the index server above, the search flow may include the following steps:

Step 1: the index server enumerates the search results and adds them to the pending queue.

Step 2: the index server takes one item of search result information from the pending queue and obtains key information of the shared file whose data volume is below a threshold, such as the title, author, size, and creation time.

Step 3: if the completed queue is empty, the item is enqueued directly. If the completed queue is not empty, the index server compares the item one by one with the search result information in the completed queue; if an item with the same information is found there, the download address of the shared file is recorded by adding it to that matching item, and the current item is discarded.

Step 4: if the comparison in step 3 finds no match, the space vector of the item is retrieved and compared with the items in the completed queue. If a match is found, the download address of the shared file is added to the matching item and the current item is discarded; otherwise, the item is added to the completed queue.

Step 5: the index server continues to take items from the pending queue, repeating steps 2, 3, and 4 until the pending queue is empty.

Step 6: the index server assembles the items in the completed queue into a result list in a given format and sends it to the terminal.
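The queue-based search flow above can be sketched in Python as follows. The dictionary keys, the `vector_similarity` measure (share of dimensions with equal frequency), and the default threshold are assumptions made for illustration; the source does not fix a similarity formula or a result format.

```python
from collections import deque
from typing import Dict, List


def vector_similarity(u: Dict[int, int], v: Dict[int, int]) -> float:
    """Share of dimensions (primary codes) on which two space vectors carry the same frequency."""
    dims = set(u) | set(v)
    if not dims:
        return 1.0
    return sum(1 for d in dims if u.get(d) == v.get(d)) / len(dims)


def small_key(item: dict) -> tuple:
    """Key information with a small data volume, compared first (step 3)."""
    return (item["title"], item["author"], item["size"], item["created"])


def denoise(results: List[dict], threshold: float = 0.9) -> List[dict]:
    pending = deque(results)              # step 1: enumerate results into the pending queue
    done: List[dict] = []                 # completed queue
    while pending:                        # step 5: repeat until the pending queue is empty
        item = pending.popleft()          # step 2
        # Step 3: cheap comparison on the small key information.
        match = next((k for k in done if small_key(k) == small_key(item)), None)
        if match is None:
            # Step 4: fall back to comparing space vectors.
            match = next(
                (k for k in done
                 if vector_similarity(k["vector"], item["vector"]) > threshold),
                None,
            )
        if match is None:
            done.append({**item, "addresses": list(item["addresses"])})
        else:
            # Duplicate: keep only its download addresses, then drop the item.
            match["addresses"].extend(
                a for a in item["addresses"] if a not in match["addresses"]
            )
    return done                           # step 6: the caller formats and sends this list
```

Comparing the small key information before the space vectors keeps the common case cheap, which matches the stated goal of reducing the load on the server.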
Note that the space vector has many dimensions. When deciding whether files are identical, a threshold can be set in advance, so that dimensions with small frequency values can be omitted from the comparison as appropriate; for example, two articles that are 98% similar can also be regarded as the same article.

This embodiment optimizes the storage structure of the server's index data and balances query time against the uniqueness and accuracy of the shared file information returned from the server to the terminal, which improves the user experience.

FIGS. 3a to 3f are schematic diagrams of the structure of a space-vector-based index server system according to an embodiment of the present invention. With reference to these figures, the role of the space vector and the feasibility of denoising are explained below, from the server framework down to its component units.

FIG. 3a shows the overall structure of the index server a101. The index server a101 includes two parts, the index module a102 and the file information module a104. As shown in FIG. 3a, the index module a102 stores index information, chiefly the segmentation terms a103: all terms a103 produced by lexical analysis of the shared file information are stored in the index module a102 in a given format. The file information module a104 is a collection containing the terminal file information a105 shared and published by each terminal. The terminal file information a105 is the root directory for one terminal's published shared information and stores the shared file information resources.

FIG. 3b depicts the internal structure of a segmentation term a103. As shown in FIG. 3b, b101 is the primary code of the term a103, and b102 gives the positions at which the term a103 occurs across all shared file information. This information can be sorted by frequency from high to low.

FIG. 3c illustrates the composition of the term position information b102. As shown in FIG. 3c, it includes the terminal number c101, the terminal shared file number c102, and the frequency c103 with which the term occurs in that shared file.

FIG. 3d illustrates the structure of the shared file information a105 from one terminal in the file information module. As shown in FIG. 3d, the terminal directory is divided by shared file into many shared file information items d101, one item per shared file.

FIG. 3e depicts the composition of one shared file information item d101. As shown in FIG. 3e, e101 is the terminal shared file number c102 mentioned in FIG. 3c, and this number links FIG. 3e to FIG. 3c. The space vector e102 is used to decide whether two files are identical or similar. The file name e103 is shown as auxiliary information in the shared file list that the index server later returns to the terminal. The file address e104 is the address of the other terminal that published the shared file information and serves as the entry point for later communication between the two terminals.

FIG. 3f shows in more detail the structure of the space vector e102: a multidimensional vector formed by combining the frequency with which each term a103 occurs in the current shared file information with its primary code b101.

FIG. 4 is a flowchart of a search method according to a preferred embodiment of the present invention. As shown in FIG. 4, the process in which a terminal user issues a search request and finally obtains the search results includes the following steps:

Step S401: the terminal user sends a search request to the index server.

Step S402: the index server starts searching its internal index, performing the search-index operation.
Step S403: if segmentation terms matching the search request are found, locate the related shared file information from the terms found.

Step S404: obtain all the shared file information and push it into a pending queue.

Step S405: perform the denoising of identical or similar shared files. This is decided mainly by comparing space vectors: the shared file information in the completed queue serves as the reference against which each item of shared file information taken from the pending queue is compared, and if the two differ, the item taken from the pending queue is placed in the completed queue.

Step S406: collate the shared file information in the completed queue, generate the shared file list, and send it to the searching terminal user.

Step S407: the terminal user selects a shared file from the list and establishes a point-to-point connection with the terminal that published it.

Step S408: the terminal downloads the shared file after being authorized by the publishing terminal.

Note that in step S405, if the two items compared are duplicates, the item taken from the pending queue is not added to the completed queue; only its address information is appended to the address field of the identical file in the completed queue, so that a downloading terminal that receives the shared file list can download one file from multiple sources.

FIG. 5 is a flowchart for creating a space vector according to a preferred embodiment of the present invention. This flow may take place on the index server. As shown in FIG. 5, the flow includes the following steps:

Step S501: obtain one item of shared file information from the file information module.
Step S502: extract one key sentence from the key information of the shared file information, such as the file abstract, author, and title.

Step S503: perform lexical analysis on the key sentence, looking up the keyword lexicon on the server, and split the sentence into a set of segmentation terms.

Step S504: count the frequency with which each distinct term occurs in the key sentence and record the primary code of each distinct term.

Step S505: determine whether any sentences remain that have not undergone lexical analysis; if so, return to step S502; if not, go to step S506.

Step S506: combine the primary codes of all terms produced by lexical analysis with their frequencies, then assemble these combined values along the dimensions of a vector into the space vector corresponding to the shared file information.

In summary, the above embodiments solve the problem caused by duplicate, redundant retrieval results in the prior art, save bandwidth, and improve the user experience.

Obviously, those skilled in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network of multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
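As one possible software realization of the FIG. 5 flow (steps S501 to S506), the sketch below builds a space vector from the key sentences of one shared file. The function name, the representation of the vector as a sorted list of (primary code, frequency) pairs, and the default whitespace segmenter are assumptions; in the described system the segmenter would be the lexicon-based lexical analysis.

```python
from collections import Counter
from typing import Callable, Dict, Iterable, List, Tuple


def create_space_vector(
    key_sentences: Iterable[str],
    lexicon_codes: Dict[str, int],
    segment: Callable[[str], List[str]] = str.split,
) -> List[Tuple[int, int]]:
    """Return the space vector as (primary code, frequency) pairs, one per dimension."""
    counts: Counter = Counter()
    for sentence in key_sentences:           # S502 and S505: take sentences until none remain
        for term in segment(sentence):       # S503: split the sentence into terms
            code = lexicon_codes.get(term)   # S504: look up the term's primary code
            if code is not None:             # assumption: unknown terms are ignored
                counts[code] += 1            # S504: accumulate the frequency
    return sorted(counts.items())            # S506: assemble the dimensions into one vector


# Hypothetical lexicon and key sentences, for illustration only.
codes = {"search": 1, "file": 2, "shared": 3}
vector = create_space_vector(["shared file search", "file search tool"], codes)
```

Sorting by primary code gives every file's vector the same dimension order, so two vectors can be compared position by position or converted to dictionaries for a per-dimension comparison.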
The above are only preferred embodiments of the present invention and are not intended to limit it; for those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, improvement, and the like made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

权 利 要 求 书 Claim
1. A search method, comprising:
acquiring a keyword for which a search is requested; acquiring information of a plurality of files corresponding to word segmentation items that include the keyword or are identical to the keyword, wherein the information of each of the plurality of files comprises: one or more word segmentation items corresponding to key information of the file, and the frequency with which each word segmentation item occurs in the key information, the key information being information set for retrieving the file;
determining the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items having identical corresponding frequencies exceeds a threshold;
returning a search result, wherein, for the same files, the search result retains one of the files.
2. The method according to claim 1, wherein, in a case where the information of each file further comprises one or more items of the key information, the same files further comprise: files whose information includes exactly the same key information.
3. The method according to claim 2, wherein determining the same files among the plurality of files comprises:
determining that files, among the plurality of files, whose information includes exactly the same key information are the same files;
retaining only one file of those same files, and then determining that files, among the remaining files of the plurality of files, for which the proportion of identical word segmentation items having identical corresponding frequencies exceeds the threshold are the same files.
4. The method according to claim 1, wherein, for the same files, the search result retains one of the files and a plurality of pieces of information required to acquire that file.
5. The method according to any one of claims 1 to 4, further comprising:
segmenting the key information of each file according to a reverse maximum matching algorithm to obtain the one or more word segmentation items corresponding to the file.
6. A search apparatus, comprising: a first acquisition module, configured to acquire a keyword for which a search is requested;
a second acquisition module, configured to acquire information of a plurality of files corresponding to word segmentation items that include the keyword or are identical to the keyword, wherein the information of each of the plurality of files comprises: one or more word segmentation items corresponding to key information of the file, and the frequency with which each word segmentation item occurs in the key information, the key information being information set for retrieving the file;
a determination module, configured to determine the same files among the plurality of files, wherein the same files are files for which the proportion of identical word segmentation items having identical corresponding frequencies exceeds a threshold;
a return module, configured to return a search result, wherein, for the same files, the search result retains one of the files.
7. The apparatus according to claim 6, wherein, in a case where the information of each file further comprises one or more items of the key information, the same files determined by the determination module further comprise: files whose information includes exactly the same key information.
8. The apparatus according to claim 7, wherein the determination module comprises:
a first determination module, configured to determine that files, among the plurality of files, whose information includes exactly the same key information are the same files;
a second determination module, configured to retain only one file of those same files, and then determine that files, among the remaining files of the plurality of files, for which the proportion of identical word segmentation items having identical corresponding frequencies exceeds the threshold are the same files.
9. The apparatus according to claim 6, wherein, in the search result returned by the search module, one of the same files and a plurality of pieces of information required to acquire that file are retained.
10. The apparatus according to any one of claims 6 to 9, further comprising:
a word segmentation module, configured to segment the key information of each file according to a reverse maximum matching algorithm to obtain the one or more word segmentation items corresponding to the file.
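The duplicate handling recited in claims 1, 3 and 4 can be illustrated with a short sketch, offered under explicit assumptions rather than as the applicant's implementation. The claims do not define how the proportion is computed, so `match_ratio` below assumes it is the share of word segmentation items, taken over the union of both files' items, that occur in both files with equal frequency. The `FileInfo` record, the function names, the threshold value, and the example URLs are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FileInfo:
    key_info: tuple   # e.g. (title, author, summary)
    freqs: dict       # word segmentation item -> frequency in the key information
    sources: list = field(default_factory=list)  # information needed to fetch the file


def match_ratio(a, b):
    """Assumed measure: share of items present in both files with equal frequency."""
    all_items = set(a.freqs) | set(b.freqs)
    if not all_items:
        return 0.0
    same = sum(1 for w in all_items if a.freqs.get(w) == b.freqs.get(w))
    return same / len(all_items)


def merge_same(files, is_same):
    """Keep one file per group of same files and collect every source on it."""
    kept = []
    for f in files:
        match = next((k for k in kept if is_same(k, f)), None)
        if match is None:
            kept.append(FileInfo(f.key_info, dict(f.freqs), list(f.sources)))
        else:
            match.sources.extend(s for s in f.sources if s not in match.sources)
    return kept


def deduplicate(files, threshold=0.8):
    # Stage 1 (claim 3): files whose key information is exactly the same.
    stage1 = merge_same(files, lambda a, b: a.key_info == b.key_info)
    # Stage 2: remaining files whose item/frequency match ratio exceeds the threshold.
    return merge_same(stage1, lambda a, b: match_ratio(a, b) > threshold)


a = FileInfo(("Song A", "Band", ""), {"song": 2, "band": 1, "live": 1},
             ["http://host1.example/a.mp3"])
b = FileInfo(("Song A", "Band", ""), {"song": 2, "band": 1, "live": 1},
             ["http://host2.example/a.mp3"])
c = FileInfo(("Song A (copy)", "Band", ""), {"song": 2, "band": 1, "live": 1, "copy": 1},
             ["http://host3.example/a.mp3"])
d = FileInfo(("Other", "X", ""), {"video": 1, "band": 1},
             ["http://host4.example/d.avi"])

results = deduplicate([a, b, c, d], threshold=0.7)
for r in results:
    print(r.key_info[0], len(r.sources))
```

Running the exact key information comparison first, as claim 3 orders it, removes the cheapest duplicates before the pairwise ratio comparison is applied to the smaller remaining set; keeping every source on the retained entry corresponds to claim 4.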
PCT/CN2011/073036 2011-03-04 2011-04-19 Retrieval method and apparatus WO2012119339A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201110052484.0A CN102654879B (en) 2011-03-04 2011-03-04 Search method and device
CN201110052484.0 2011-03-04

Publications (1)

Publication Number Publication Date
WO2012119339A1 (en) 2012-09-13

Family

ID=46730505

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2011/073036 WO2012119339A1 (en) 2011-03-04 2011-04-19 Retrieval method and apparatus

Country Status (2)

Country Link
CN (1) CN102654879B (en)
WO (1) WO2012119339A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105447169B (en) * 2015-12-07 2019-02-12 百度在线网络技术(北京)有限公司 Document normalizing method, literature search method and corresponding intrument
CN107544982B (en) * 2016-06-24 2022-12-02 中兴通讯股份有限公司 Text information processing method and device and terminal
CN109697642A (en) * 2017-10-23 2019-04-30 北京京东尚科信息技术有限公司 Data push method, device and computer readable storage medium
CN109558417B (en) * 2018-11-28 2023-08-08 亚信科技(南京)有限公司 Data processing method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101373486A (en) * 2008-10-16 2009-02-25 北京航空航天大学 Personalized summary system based on user interest model
CN101630321A (en) * 2009-08-26 2010-01-20 中山大学 On-line article screening method based on data mining (DM)
CN101853298A (en) * 2010-05-26 2010-10-06 上海大学 Event-oriented query expansion method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN100454308C (en) * 2006-08-30 2009-01-21 华为技术有限公司 Method of file distributing and searching and its system
CN1924854B (en) * 2006-09-18 2010-05-12 浙江大学 Desktop searching method for intelligent mobile terminal
CN101853272B (en) * 2010-04-30 2012-07-04 华北电力大学(保定) Search engine technology based on relevance feedback and clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LOU HUA FENG: "Research of Feature Vector Value Weighted Based on Semantic Analysis in Chinese Text Clustering", A DISSERTATION SUBMITTED TO SHANGHAI JIAO TONG UNIVERSITY FOR THE DEGREE OF MASTER, 26 July 2010 (2010-07-26) *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110546028A (en) * 2017-02-03 2019-12-06 宾利汽车有限公司 Regenerative braking system
CN113449063A (en) * 2021-06-25 2021-09-28 树根互联股份有限公司 Method and device for constructing document structure information retrieval library
CN113449063B (en) * 2021-06-25 2023-06-16 树根互联股份有限公司 Method and device for constructing document structure information retrieval library

Also Published As

Publication number Publication date
CN102654879A (en) 2012-09-05
CN102654879B (en) 2015-01-28

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 11860684

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 11860684

Country of ref document: EP

Kind code of ref document: A1