WO2012119339A1 - Retrieval method and apparatus

Info

Publication number: WO2012119339A1
Application number: PCT/CN2011/073036
Authority: WO
Inventors: 齐波
Original assignee: 中兴通讯股份有限公司 (ZTE Corporation)
Priority date: 2011-03-04
Filing date: 2011-04-19
Publication date: 2012-09-13
Also published as: CN102654879A; CN102654879B

Abstract

A retrieval method and apparatus are disclosed. The retrieval method includes: obtaining a keyword for a retrieval request; obtaining information of a plurality of documents corresponding to word segmentation items of the keyword or to word segmentation items similar to the keyword, wherein the information of each of the plurality of documents includes one or more word segmentation items corresponding to key information of the document and the frequency with which each word segmentation item occurs in the key information, the key information being information defined for retrieval of the document; determining similar documents among the plurality of documents, wherein similar documents are documents for which the proportion of corresponding word segmentation items, and of the frequencies corresponding to those items, that are the same is above a threshold value; and returning a retrieval result that retains only one of the similar documents. The present invention saves bandwidth resources and improves the user experience.

Description

TECHNICAL FIELD
The present invention relates to the field of information retrieval, and in particular to a search method and apparatus.
BACKGROUND OF THE INVENTION
Currently, many files are shared on the network, so retrieving the files that users need is particularly important. The prior art provides a method for searching for and downloading mobile phone files: a server in the domain where the mobile terminal is located receives search request keyword information, searches for resources in the mobile communication network, and returns a list of the resources found to the mobile terminal. The mobile terminal receives the data source information selected by the user from the resource list and sends the in-domain server a request to download the required resource, and the server then sends the required resource to the mobile terminal.
With the above method, as with other prior-art methods, the retrieval results may contain redundant duplicate items. Such redundancy not only occupies bandwidth resources but also degrades the user experience.
SUMMARY OF THE INVENTION
A primary object of the present invention is to provide a search method and apparatus that solve at least the above problem.
According to an aspect of the present invention, a search method is provided, including: acquiring a keyword for requesting a search; acquiring information of a plurality of files corresponding to a word segmentation item of the keyword or to a word segmentation item identical to the keyword, wherein the information of each of the plurality of files includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each word segmentation item occurs in the key information, the key information being information set for retrieving the file; determining which of the plurality of files are the same file, wherein files are the same file when the proportion of their corresponding word segmentation items, and of the frequencies corresponding to those items, that are identical exceeds a threshold; and returning a search result in which, for files that are the same file, only one of the files is retained.
Preferably, in a case where the information of each file further includes one or more items of the key information, the same file further includes: files whose key information, as included in the information of the files, is completely identical.
Preferably, determining the same file among the plurality of files comprises: determining that files among the plurality of files whose included key information is completely identical are the same file; and, after retaining only one of those files, determining that, among the remaining files, files for which the proportion of identical word segmentation items and corresponding frequencies exceeds the threshold are the same file.
Preferably, for files that are the same file, the search result retains one of the files together with the multiple pieces of information required to acquire the file.
Preferably, the method further comprises: segmenting the key information of each file according to a reverse maximum matching algorithm to obtain the one or more word segmentation items corresponding to the file.
According to another aspect of the present invention, a search apparatus is further provided, including: a first obtaining module, configured to acquire a keyword for requesting a search; a second obtaining module, configured to acquire information of a plurality of files corresponding to a word segmentation item of the keyword or to a word segmentation item identical to the keyword, wherein the information of each of the plurality of files includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each word segmentation item occurs in the key information, the key information being information set for retrieving the file; a determining module, configured to determine which of the plurality of files are the same file, wherein files are the same file when the proportion of their corresponding word segmentation items, and of the frequencies corresponding to those items, that are identical exceeds a threshold; and a return module, configured to return a search result in which, for files that are the same file, only one of the files is retained.
Preferably, in a case where the information of each file further includes one or more items of the key information, the same file determined by the determining module further includes: files whose key information, as included in the information of the files, is completely identical.
Preferably, the determining module includes: a first determining module, configured to determine that files among the plurality of files whose included key information is completely identical are the same file; and a second determining module, configured to, after only one of those files is retained, determine that, among the remaining files, files for which the proportion of identical word segmentation items and corresponding frequencies exceeds the threshold are the same file.
Preferably, in the search result returned by the return module, for files that are the same file, one of the files is retained together with the multiple pieces of information required to acquire the file.
Preferably, the apparatus further includes: a word segmentation module, configured to segment the key information of each file according to a reverse maximum matching algorithm to obtain the one or more word segmentation items corresponding to the file.
The invention solves the problems caused by repeated, redundant search results in the prior art, saves bandwidth resources, and improves the user experience.
BRIEF DESCRIPTION OF THE DRAWINGS
The accompanying drawings are provided to illustrate the present invention. In the drawings:
FIG. 1 is a flowchart of a search method according to an embodiment of the present invention;
FIG. 2 is a structural block diagram of a search apparatus according to an embodiment of the present invention;
FIG. 3a is a schematic diagram of the composition of an index server according to an embodiment of the present invention;
FIG. 3b is a schematic diagram of the internal structure of a word segmentation item according to an embodiment of the present invention;
FIG. 3c is a schematic structural diagram of the position information of a word segmentation item according to an embodiment of the present invention;
FIG. 3d is a schematic structural diagram of the shared file information of a terminal in the file information module according to an embodiment of the present invention;
FIG. 3e is a schematic diagram of the composition of the shared file information according to an embodiment of the present invention;
FIG. 3f is a schematic diagram of the structure information of the space vector according to an embodiment of the present invention;
FIG. 4 is a flowchart of a search method according to a preferred embodiment of the present invention; and
FIG. 5 is a flowchart for creating a space vector according to a preferred embodiment of the present invention.
BEST MODE FOR CARRYING OUT THE INVENTION
Hereinafter, the present invention will be described in detail with reference to the accompanying drawings. It should be noted that the embodiments in the present application and the features in the embodiments may be combined with each other provided they do not conflict.
The following embodiments can be applied to retrieval in various networks. Because mobile terminals have strict bandwidth requirements and repeated data transmission means additional traffic charges, applying the retrieval to mobile terminals yields better results.
In this embodiment, a search method is provided. FIG. 1 is a flowchart of a search method according to an embodiment of the present invention. As shown in FIG. 1, the process includes the following steps:
Step S102: Acquire a keyword for requesting a search.
Step S104: Acquire information of a plurality of files corresponding to a word segmentation item of the keyword or to a word segmentation item identical to the keyword, wherein the information of each file includes one or more word segmentation items corresponding to the key information of the file and the frequency with which each word segmentation item occurs in the key information. The key information is information set for retrieving the file, for example the document summary, author, and title.
Step S106: Determine which of the plurality of files are the same file, where files are the same file when the proportion of their corresponding word segmentation items, and of the frequencies corresponding to those items, that are identical exceeds a threshold. For example, the word segmentation items corresponding to file A are (a1, appears 10 times), (b1, appears 9 times), and (c1, appears once), and the word segmentation items corresponding to file B are (a1, appears 10 times), (b1, appears 9 times), and (c1, appears once); it can therefore be judged that A and B are the same file.
If c1 instead appears twice in file B, the similarity between file A and file B is still such that they may be considered the same file.
Step S108: Return the search result, in which only one of the files that are the same file is retained. For example, if file A and file B are the same file, only one of file A and file B is retained in the search result.
Through the above steps, the query result returned by the server to the user is denoised, which solves the problems caused by repeated, redundant search results in the prior art, improves the user experience, and saves network resources.
Preferably, for files that are the same file, the search result retains one of the files together with the multiple pieces of information required to acquire the file. For example, if file A is retained, the download addresses of both file A and file B may be retained, so that the user can download the file from multiple sources.
Preferably, in implementation, key information with a small amount of data may be compared first. If this key information is identical, the two files may be considered the same. For example, the author and the title may be compared. Such a simple comparison screens out some duplicate files and also reduces the load on the search server. That is, in a case where the information of each file further includes one or more items of the key information, the same file further includes files whose included key information is completely identical. In this case, determining the same file among the plurality of files in step S106 includes: determining that files among the plurality of files whose included key information is completely identical are the same file; retaining only one of those files; and then determining that, among the remaining files, files for which the proportion of identical word segmentation items and corresponding frequencies exceeds the threshold are the same file.
There are many ways to perform word segmentation on key information. This embodiment uses the following one: the key information of each file is segmented according to the reverse maximum matching algorithm to obtain the one or more word segmentation items corresponding to the file. Regardless of which word segmentation method is used, the user experience can be improved by using the method shown in the flowchart described above.
In this embodiment, a search device is further provided. The device may be located in a server that provides a search function and is used to implement the foregoing embodiments and their preferred implementations. The modules involved in the device are described below. FIG. 2 is a structural block diagram of a search device according to an embodiment of the present invention. As shown in FIG. 2, the search device includes a first obtaining module 20, a second obtaining module 22, a determining module 24, and a return module 26. The structure is described below.
The first obtaining module 20 is configured to acquire a keyword for requesting a search. The second obtaining module 22 is connected to the first obtaining module 20 and is configured to acquire information of a plurality of files corresponding to a word segmentation item of the keyword or to a word segmentation item identical to the keyword, wherein the information of each of the plurality of files includes all word segmentation items corresponding to the key information of the file and the frequency with which each word segmentation item occurs in the key information, the key information being information set for retrieving the file. The determining module 24 is connected to the second obtaining module 22 and is configured to determine which of the plurality of files are the same file, wherein files are the same file when the proportion of their corresponding word segmentation items, and of the frequencies corresponding to those items, that are identical exceeds a threshold. The return module 26 is connected to the determining module 24 and is configured to return the search result, wherein, for files that are the same file, the search result retains one of the files.
Preferably, in a case where the information of each file further includes one or more items of the key information, the same file determined by the determining module 24 further includes files whose included key information is completely identical.
In this case, the determining module 24 may include: a first determining module 242, configured to determine that files among the plurality of files whose included key information is completely identical are the same file; and a second determining module 244, connected to the first determining module 242 and configured to, after only one of those files is retained, determine that, among the remaining files, files for which the proportion of identical word segmentation items and corresponding frequencies exceeds the threshold are the same file.
Preferably, in the search result returned by the return module, for files that are the same file, one of the files is retained together with the multiple pieces of information required to acquire the file.
Preferably, the apparatus further comprises a word segmentation module, configured to segment the key information of each file according to the reverse maximum matching algorithm to obtain the one or more word segmentation items corresponding to the file. Of course, this is only a preferred implementation of the word segmentation module; any other word segmentation method achieves the same effect as long as it can perform word segmentation.
The following description is given in connection with searching for shared files on a mobile terminal (for example, a mobile phone). Of course, the following preferred embodiments can also be used with terminals other than mobile terminals.
In the preferred embodiment, the mobile terminal can share files in two ways. In the first, the publisher uploads the file to be shared directly to the shared area of a file server, and the downloader accesses the shared area of the relay server to download the file. When publishing the file, the publisher can also set the corresponding permissions, so that only an authorized downloader can obtain the file.
In this case, the key information of the file, such as the file name and author, may need to be obtained; the publisher can be asked to enter this information when uploading the file. The second way of sharing is more optimized: the publisher publishes only the name, abstract, type, size, and other information of the file to be shared to the server, not the original file. The downloader accesses the relay server as needed, selects a suitable file, locates the original publisher, and obtains the shared file.
The preferred embodiment provides a terminal file search denoising method based on a space vector algorithm. The space vector is a multi-dimensional vector in which each word segmentation item is a dimension and the frequency with which the item occurs is the height of the vector in that dimension; the vector is obtained by integrating the data of all dimensions. In the preferred embodiment, identical or similar files are denoised during the search of files shared by mobile terminals, with the space vector used as the denoising factor. The following describes the establishment of the index model and the search process.
Creating an index model with denoising capability may include the following steps:
Step 1: The mobile terminal publishes shared file information to the index server.
Step 2: The index server allocates space to store the shared information published by the terminal.
Step 3: The index server, referring to the lexicon, performs lexical analysis on key information in the shared information, such as the file summary, author, and title, and divides the text into a set of word segmentation items. For example, the lexical analysis may look up the lexicon and segment the key information in the shared file information (file summary, author, title, and so on) according to the reverse maximum matching algorithm; a complete article can likewise be decomposed into a set of word segmentation items.
Step 4: The index server counts the frequency and position of each word segmentation item in the key information and records the unique identifier of the word segmentation item, which is referred to as the master code in this embodiment.
Step 5: The index server combines the frequency of each word segmentation item with its master code in turn, and then integrates the combined values according to the dimensions of the vector to form an abstract data model of the shared file, also referred to as model space vector data.
Step 6: The index server serializes the counted frequencies, the positions, the space vector, and their correspondence with the shared file information into an encrypted file to form the index.
Preferably, the index server mainly stores the shared file information of the terminals and manages the published data in a storage area; in another area it builds an inverted index over the data published to the server, to facilitate searching by terminals.
The word segmentation item formed by the above steps is a data structure that mainly includes the master code of the item and the frequency and position of its occurrence. The master code is the unique identification code corresponding to each word in the lexicon. It should be noted that the main purpose of this mapping is to make it easy to form the mathematical abstract model of the space vector. The frequency is the number of times the word segmentation item occurs in the key information of the current shared file information, and the position is the location at which the item appears in the key information.
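The index-building steps above (segmentation, frequency counting, and combination with master codes) can be sketched as follows. This is a minimal illustration in Python under stated assumptions, not the implementation of the embodiment: the lexicon contents, the master code values, and the names `LEXICON`, `reverse_maximum_match`, and `build_space_vector` are hypothetical, and the space vector is assumed to be represented as a mapping from master code to frequency. Characters that match no lexicon word are emitted as single-character items and, as a further assumption, are left out of the vector.

```python
from collections import Counter

# Hypothetical lexicon mapping each known word to its master code.
LEXICON = {"mobile": 1, "file": 2, "search": 3}
MAX_WORD_LEN = max(len(word) for word in LEXICON)


def reverse_maximum_match(text, lexicon=LEXICON, max_len=MAX_WORD_LEN):
    """Segment text by scanning from its end and taking the longest lexicon match."""
    items = []
    end = len(text)
    while end > 0:
        start = max(0, end - max_len)
        # Shrink the candidate from the left until it is a lexicon word
        # or only a single character remains.
        while start < end - 1 and text[start:end] not in lexicon:
            start += 1
        items.append(text[start:end])
        end = start
    items.reverse()  # items were collected back to front
    return items


def build_space_vector(key_information, lexicon=LEXICON):
    """Return {master_code: frequency} over all fields of the key information."""
    counts = Counter()
    for field in key_information:
        for item in reverse_maximum_match(field, lexicon):
            if item in lexicon:  # assumption: unknown characters are ignored
                counts[lexicon[item]] += 1
    return dict(counts)
```

With the key information `["mobilefilesearchfile"]`, the sketch segments the text into `mobile`, `file`, `search`, `file`, so the dimension for `file` (master code 2 here) has height 2 and the other two dimensions have height 1.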
It should be noted that, since the basic constituent units of a file are words, and different files contain different kinds of words with different frequencies, words are used as dimensions to uniquely distinguish files with different content.
Corresponding to the index established by the index server, the search process may include the following steps:
Step 1: The index server enumerates the search results and adds them to the queue to be processed.
Step 2: The index server takes one item of search result information from the queue to be processed and obtains the information of the shared file whose data volume is below a threshold, such as the title, author, size, and creation time.
Step 3: If the processing completion queue is empty, the information is added to it directly. If the processing completion queue is not empty, the index server compares this information with the search result information in the processing completion queue. If identical information is found in the processing completion queue, the download address of the shared file is recorded and added to that identical information in the processing completion queue, and the information being processed is discarded.
Step 4: If no identical information is found in step 3, the space vector of the information is taken out and compared with the information in the processing completion queue. If the same information is found in the processing completion queue, the download address of the shared file is recorded and added to that information, and the information being processed is discarded; if not, the information is added to the processing completion queue.
Step 5: The index server continues to take information from the queue to be processed and repeats steps 2, 3, and 4 until the queue to be processed is empty.
Step 6: The index server assembles the information in the processing completion queue into a result list in a certain format and sends it to the terminal.
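The two-queue search process above can be sketched as follows. This is a hedged illustration only: the record layout (`title`, `author`, `size`, `vector`, `addresses`), the function names, and the default threshold are hypothetical, and the similarity measure is one possible reading of the text, namely the share of dimensions on which two space vectors hold exactly the same frequency.

```python
from collections import deque

# Fields with a small data volume that are compared first (assumed set).
CHEAP_FIELDS = ("title", "author", "size")


def similarity(vec_a, vec_b):
    """Share of dimensions on which both vectors hold the same frequency."""
    dims = set(vec_a) | set(vec_b)
    if not dims:
        return 0.0  # assumption: nothing to compare, so do not merge
    same = sum(1 for dim in dims if vec_a.get(dim) == vec_b.get(dim))
    return same / len(dims)


def find_same(item, done, threshold):
    """Return the entry in the processing completion queue that item duplicates."""
    # Step 3: cheap comparison of the small fields.
    for kept in done:
        if all(item[field] == kept[field] for field in CHEAP_FIELDS):
            return kept
    # Step 4: fall back to comparing the space vectors.
    for kept in done:
        if similarity(item["vector"], kept["vector"]) > threshold:
            return kept
    return None


def denoise(results, threshold=0.9):
    """Collapse files judged to be the same, keeping every download address."""
    pending = deque(results)  # queue to be processed (step 1)
    done = []                 # processing completion queue
    while pending:            # step 5: repeat until the queue is empty
        item = pending.popleft()  # step 2
        kept = find_same(item, done, threshold)
        if kept is not None:
            # Duplicate: record its addresses on the kept entry, drop the rest.
            kept["addresses"].extend(item["addresses"])
        else:
            done.append({**item, "addresses": list(item["addresses"])})
    return done  # step 6: the caller formats this into the result list
```

Running the cheap field comparison before the vector comparison mirrors the ordering of steps 3 and 4, so that obvious duplicates are merged without touching the space vectors.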
It should be noted that the space vector has many dimensions. When determining whether files are the same, a threshold may be given in advance; that is, for dimensions with small frequency values, the comparison between the vectors may be omitted as appropriate. For example, two articles whose similarity reaches 98% can also be considered the same article.
This embodiment optimizes the storage structure of the server data and takes into account both the query time and the uniqueness and accuracy of the shared file information returned from the server to the terminal, thereby improving the user experience.
FIGS. 3a to 3f are schematic diagrams showing the structure of a space vector based index server system according to an embodiment of the present invention. The following describes the role of the space vector and the feasibility of denoising, working from the overall framework of the server down to the composition of the relevant units.
FIG. 3a shows the composition of the index server a101 from the perspective of the overall framework. The index server a101 includes two parts: an index module a102 and a file information module a104. As shown in FIG. 3a, the index module a102 stores index information, including the word segmentation items a103; that is, all shared file information is split by lexical analysis and stored in the index module a102 in a certain format. The file information module a104 is a set that contains the terminal file information a105 shared by each terminal. The terminal file information a105 is the root directory for the information shared by one terminal and stores the resources of the shared file information.
FIG. 3b depicts the internal structure of the word segmentation item a103. As shown in FIG. 3b, b101 is the master code of the word segmentation item a103, and b102 is the position information recording where the word segmentation item a103 appears in all shared file information. This information can be sorted by frequency from high to low.
FIG. 3c illustrates the composition of the word segmentation item position information b102. As shown in FIG. 3c, it includes the terminal number c101, the terminal shared file number c102, and the frequency c103 with which the word segmentation item appears in the shared file.
FIG. 3d illustrates the structure of the terminal file information a105 of one terminal in the file information module. As shown in FIG. 3d, the terminal directory is divided by shared file into a plurality of shared file information items d101, one item per shared file.
FIG. 3e depicts the composition of one item of shared file information d101. As shown in FIG. 3e, e101 is the terminal shared file number c102 mentioned in FIG. 3c, and FIG. 3e is associated with FIG. 3c through this number. The space vector e102 is used to determine whether two files are the same or similar. The file name e103 serves as auxiliary information and is shown in the list of shared files that the index server later returns to the terminal. The file address e104 is the address of the other terminal that published the shared file information and serves as the entry point for communication between the two terminals.
FIG. 3f further details the structure information of the space vector e102, which is a multi-dimensional vector formed by combining the frequency with which each word segmentation item a103 occurs in the current shared file information with its master code b101.
FIG. 4 is a flowchart of a search method according to a preferred embodiment of the present invention. As shown in FIG. 4, the process by which a terminal user makes a search request and finally obtains a search result includes the following steps:
Step S401: A terminal user sends a search request to the index server.
Step S402: The index server starts its internal search and performs the index search operation.
Step S403: If a word segmentation item matching the search request is found, the relevant shared file information is located through that word segmentation item.
Step S404: All the located shared file information is acquired and pushed into a queue to be processed.
Step S405: A denoising operation on identical or similar shared files is performed, mainly by comparing space vectors. Using the shared file information in the processing completion queue as the reference, an item of shared file information taken from the queue to be processed is compared with it; if the two items of file information differ, the item taken from the queue to be processed is put into the processing completion queue.
Step S406: The shared file information in the processing completion queue is collated, and a shared file list is generated and sent to the searching terminal user.
Step S407: The terminal user selects a shared file in the list and establishes a point-to-point link with the terminal that published the shared file.
Step S408: After being authorized by the publishing terminal, the terminal downloads the shared file.
It should be noted that, in step S405, if the two items of file information being compared are duplicates, the file information taken from the queue to be processed does not need to be added to the processing completion queue. Only its address information is appended to the address field of the same file in the processing completion queue, so that the downloading terminal can download a file from multiple points after receiving the shared file list.
FIG. 5 is a flowchart of creating a space vector according to a preferred embodiment of the present invention. The process may be performed on an index server. As shown in FIG. 5, the process includes the following steps:
Step S501: Obtain one item of shared file information from the file information module.
Step S502: Extract a key sentence from the key information of the shared file information, such as the file summary, author, and title.
Step S503: Perform lexical analysis on the key sentence and, by looking up the keyword lexicon in the server, split the sentence into a plurality of word segmentation items.
Step S504: Count the frequency with which each distinct word segmentation item occurs in the key sentence and record the master code corresponding to each item.
Step S505: Determine whether any sentences remain that have not undergone lexical analysis. If yes, return to step S502; if not, proceed to step S506.
Step S506: Combine the master codes of all word segmentation items produced by the lexical analysis with their frequencies of occurrence, and then integrate the combined values according to the dimensions of the vector into the space vector corresponding to the shared file information.
In summary, the above embodiments solve the problems caused by repeated, redundant search results in the prior art, save bandwidth resources, and improve the user experience.
Obviously, those skilled in the art should understand that the above modules or steps of the present invention can be implemented by a general-purpose computing device. They can be concentrated on a single computing device or distributed over a network composed of multiple computing devices. Alternatively, they may be implemented by program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; or they may each be fabricated as an individual integrated circuit module, or several of the modules or steps may be fabricated as a single integrated circuit module. Thus, the invention is not limited to any specific combination of hardware and software.
The above describes only the preferred embodiments of the present invention and is not intended to limit the present invention; various modifications and changes can be made to the present invention. Any modifications, equivalent substitutions, improvements, and the like made within the scope of the present invention are intended to be included within the scope of the present invention.

Claims

1. A search method, comprising:

acquiring a keyword for requesting a search;

acquiring information of a plurality of files corresponding to a word segmentation item of the keyword or to a word segmentation item identical to the keyword, wherein the information of each of the plurality of files includes: one or more word segmentation items corresponding to the key information of the file, and the frequency with which each of the word segmentation items occurs in the key information, the key information being information set for retrieving the file;

determining which of the plurality of files are the same file, wherein files are the same file when the proportion of their corresponding word segmentation items, and of the frequencies corresponding to each of the word segmentation items, that are identical exceeds a threshold; and

returning a search result, wherein, for files that are the same file, the search result retains one of the files.

2. The method according to claim 1, wherein, in a case where the information of each file further includes one or more items of the key information, the same file further includes: files whose key information, as included in the information of the files, is completely identical.

3. The method according to claim 2, wherein determining the same file among the plurality of files comprises:

determining that files among the plurality of files whose included key information is completely identical are the same file; and

retaining only one of the files that are the same file, and then determining that, among the remaining files of the plurality of files, files for which the proportion of identical word segmentation items and corresponding frequencies exceeds the threshold are the same file.

4. The method according to claim 1, wherein, for files that are the same file, the search result retains one of the files and the multiple pieces of information required to acquire the file.

5. The method according to any one of claims 1 to 4, further comprising:

segmenting the key information of each file according to the reverse maximum matching algorithm to obtain the one or more word segmentation items corresponding to the file.

6. A search device, comprising:

a first obtaining module, configured to acquire a keyword for requesting a search;

a second obtaining module, configured to acquire information of a plurality of files corresponding to a word segmentation item of the keyword or to a word segmentation item identical to the keyword, wherein the information of each of the plurality of files includes: one or more word segmentation items corresponding to the key information of the file, and the frequency with which each of the word segmentation items occurs in the key information, the key information being information set for retrieving the file;

a determining module, configured to determine which of the plurality of files are the same file, wherein files are the same file when the proportion of their corresponding word segmentation items, and of the frequencies corresponding to each of the word segmentation items, that are identical exceeds a threshold; and

a return module, configured to return a search result, wherein, for files that are the same file, the search result retains one of the files.

7. The device according to claim 6, wherein, in a case where the information of each file further includes one or more items of the key information, the same file determined by the determining module further includes: files whose key information, as included in the information of the files, is completely identical.

8. The device according to claim 7, wherein the determining module comprises:

a first determining module, configured to determine that files among the plurality of files whose included key information is completely identical are the same file; and

a second determining module, configured to, after only one of the files that are the same file is retained, determine that, among the remaining files of the plurality of files, files for which the proportion of identical word segmentation items and corresponding frequencies exceeds the threshold are the same file.

9. The device according to claim 6, wherein, in the search result returned by the return module, for files that are the same file, one of the files is retained together with the multiple pieces of information required to acquire the file.

10. The device according to any one of claims 6 to 9, further comprising:

a word segmentation module, configured to segment the key information of each file according to the reverse maximum matching algorithm to obtain the one or more word segmentation items corresponding to the file.