WO2024022180A1 - Network disk document indexing method, device, network disk, and storage medium - Google Patents

Network disk document indexing method, device, network disk, and storage medium

Info

Publication number
WO2024022180A1
Authority
WO
WIPO (PCT)
Prior art keywords
index
organization
query
document
dictionary
Prior art date
Application number
PCT/CN2023/108029
Other languages
English (en)
French (fr)
Inventor
岳晨
Original Assignee
天津联想协同科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 天津联想协同科技有限公司
Publication of WO2024022180A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques

Definitions

  • Embodiments of the present invention relate to the field of network disk technology, and in particular, to a network disk document indexing method, device, network disk, and storage medium.
  • A network disk (netdisk) is an online storage service offered by Internet companies.
  • The data center of the network disk system allocates a certain amount of disk space to each user and provides file storage, access, backup, sharing, and other file management functions, either free or for a fee, with advanced disaster-recovery backup around the world. Users can regard the network disk as a hard disk or USB flash drive placed on the network: at home, at work, or anywhere else with an Internet connection, they can manage and edit the files on it, with nothing to carry and nothing to lose.
  • ElasticSearch is a document-oriented database that supports distributed real-time file storage and indexes each field so that it can be searched. It can be expanded to hundreds of servers at the same time, making it easy to process PB-level structured or unstructured data.
  • For multi-tenant enterprises, the same SaaS search engine service is generally shared; that is, each indexing service serves multiple enterprises. In this case, the service must first determine the enterprise to which the query initiator belongs, then determine the index address range in which that enterprise's index is located, and obtain the index results from that range.
  • However, network disk files change dynamically, so the index address range must be adjusted from time to time. This increases the load on the indexing service and reduces the efficiency of the indexing service provided to users.
  • Embodiments of the present invention provide a network disk document indexing method, device, network disk, and storage medium to solve the technical problem in the prior art that network disk indexing service efficiency is low in a multi-organization scenario.
  • Embodiments of the present invention provide a network disk document indexing method, including:
  • Embodiments of the present invention also provide a network disk document indexing device, including:
  • an acquisition module, used to obtain the organization to which the document creator belongs and to obtain the organization index code of that organization;
  • a coding determination module, used to determine the organization to which the query requester belongs based on the query request, and to determine the query organization index code based on that organization;
  • an index fragment determination module, used to determine the index fragment corresponding to the query request according to the query organization index code and the number of index fragments;
  • a search module, used to determine the keyword according to the query request and to search with the keyword in the multi-organization index dictionary of the corresponding index fragment to obtain the index results.
  • Embodiments of the present invention also provide a network disk, including:
  • one or more processors;
  • a storage device for storing one or more programs.
  • When the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the network disk document indexing method provided in the above embodiment.
  • Embodiments of the present invention also provide a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the network disk document indexing method provided in the above-mentioned embodiments.
  • The network disk document indexing method, device, network disk, and storage medium provided by embodiments of the present invention: obtain the organization to which the document creator belongs and obtain the organization index code of that organization; create an index for the document, add the organization index code to the document name in the index, and store the index in an index fragment; generate a multi-organization index dictionary for each index fragment; determine the organization to which the query requester belongs based on the query request, and determine the query organization index code based on that organization; determine the index fragment corresponding to the query request according to the query organization index code and the number of index fragments; and determine the keyword according to the query request and use it to search the multi-organization index dictionary of that index fragment to obtain the index result.
  • The organization index code corresponding to the querier serves as the query condition: it determines the corresponding index fragment, and the index results are then obtained by keyword from the multi-organization index dictionary in that fragment.
  • Figure 1 is a schematic flow chart of a network disk document indexing method provided by Embodiment 1 of the present invention
  • Figure 2 is a schematic flow chart of the network disk document indexing method provided by Embodiment 2 of the present invention.
  • Figure 3 is a schematic flow chart of a network disk document indexing method provided by Embodiment 3 of the present invention.
  • Figure 4 is a schematic flow chart of a network disk document indexing method provided by Embodiment 4 of the present invention.
  • Figure 5 is a schematic structural diagram of a network disk document indexing device provided in Embodiment 5 of the present invention.
  • Figure 6 is a schematic structural diagram of a network disk provided in Embodiment 6 of the present invention.
  • Figure 1 is a schematic flow chart of a network disk document indexing method provided in Embodiment 1 of the present invention. This embodiment applies where indexing services are provided to each organization in a multi-organization scenario. The method can be performed by a network disk document indexing device and includes the following steps:
  • Step 110 Obtain the organization where the document creator belongs and obtain the organization index code of the organization.
  • In this embodiment, the indexing service is available to multiple enterprises, so the index needs to be partitioned to meet the needs of multiple enterprises using the same index service.
  • Each enterprise that uses the indexing service can be assigned a code that distinguishes it from the other enterprises using the same service.
  • When indexing the documents of an enterprise organization, first determine the enterprise organization to which the creator of the document belongs.
  • The owner of the document can be considered to be that enterprise organization, so the corresponding organization index code needs to be obtained.
  • Step 120 Create an index for the document, add an organization index code to the document name in the index, store it in the index fragment, and generate a multi-organization index dictionary for each index fragment.
  • An index is usually a separate, physical storage structure that sorts the values of one or more columns in a database table: a collection of one or several column values together with a list of logical pointers to the data pages that physically hold those values. The document name is therefore included in the index, and when an index is created for a document, the organization index code is added to the corresponding document name.
  • The index, with the organization index code added, is stored in the index shard, and a multi-organization index dictionary is then generated for each index shard based on the index results.
  • A multi-organization index dictionary is an index dictionary that includes documents from multiple organizations.
  • Generating a multi-organization index dictionary for each index fragment includes: generating the dictionary based on the word segmentation results and the organization index codes of the documents that the segmented words come from.
  • That is, the word segmentation results and the organization index codes of the corresponding documents can be used to generate the multi-organization index dictionary for each index fragment.
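  • As a rough illustration of the dictionary described above, the following sketch builds a term-to-postings map in which every posting carries the organization index code of its document. The whitespace tokenizer, the function names, and the sample documents are all assumptions made for the example; the source does not specify a segmentation engine or a storage layout.

```python
from collections import defaultdict


def tokenize(text):
    # Stand-in for a real word-segmentation engine (assumption):
    # lowercase the text and split on whitespace.
    return text.lower().split()


def build_multi_org_dictionary(documents):
    """Build {term: [(org_code, doc_name), ...]} from (org_code, doc_name, text) tuples."""
    dictionary = defaultdict(list)
    for org_code, doc_name, text in documents:
        # Each distinct term of a document yields one posting tagged with the org code.
        for term in sorted(set(tokenize(text))):
            dictionary[term].append((org_code, doc_name))
    return dict(dictionary)


# Hypothetical documents from two organizations stored in the same shard.
docs = [
    (1, "budget.txt", "annual budget report"),
    (2, "plan.txt", "annual marketing plan"),
]
multi_org_dictionary = build_multi_org_dictionary(docs)
```

  • Because each posting keeps its organization index code, a later lookup can tell which organization a hit belongs to without consulting the document itself.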
  • Step 130 Determine the organization where the query requester is located based on the query request, and determine the query organization index code based on the organization.
  • Users can issue document query requests, which can contain keywords, document identifiers, or other information.
  • The organization to which the query requester belongs is first determined from the query request. For example, the ID of the query requester can be obtained, and the requester's organization determined from that ID. Since the mapping between organizations and organization index codes has been established in advance, the organization index code of the requester's organization can be determined by table lookup.
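  • The two-step lookup described above (requester ID to organization, then organization to index code) can be sketched as two table lookups. The table contents and names below are hypothetical; the source only states that the mapping is established in advance.

```python
# Hypothetical lookup tables; in practice these would live in a user directory
# and in the registry that assigned the organization index codes.
USER_TO_ORG = {"u-1001": "org-a", "u-2001": "org-b"}
ORG_TO_INDEX_CODE = {"org-a": 1, "org-b": 2}


def query_org_index_code(requester_id):
    """Resolve a requester ID to the index code of the requester's organization."""
    org = USER_TO_ORG.get(requester_id)
    if org is None:
        raise KeyError(f"unknown requester: {requester_id}")
    return ORG_TO_INDEX_CODE[org]
```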
  • Step 140 Determine the index fragment corresponding to the query request according to the query organization index code and the number of index fragments.
  • Index sharding is currently a common way to store an index.
  • An index can have multiple shards: a large index can be split into shards that are distributed across different nodes.
  • The organization's index shard can be determined from the query organization index code and the total number of index shards.
  • Step 150 Determine keywords according to the query request, and use the keywords to search in the multi-organization index dictionary corresponding to the index shard to obtain the index result.
  • The query request includes keywords. The keyword-to-document correspondence recorded in the multi-organization index dictionary of the index shard determined in the previous step is used to search and obtain the index results.
  • The operation feedback for a file in the drive letter is performed by the network disk, so the network disk side can be used to preview the execution result of the operation response function. After the network disk obtains the execution result using the operation response function, it returns the result for display. For example, a preview area can be set on the current interface, and the file preview content can be displayed in that area.
  • This embodiment obtains the organization to which the document creator belongs and obtains the organization index code of that organization; creates an index for the document, adds the organization index code to the document name in the index, and stores the index in the index shard; generates a multi-organization index dictionary for each index shard; determines the organization to which the query requester belongs based on the query request, and determines the query organization index code based on that organization; determines the index shard corresponding to the query request according to the query organization index code and the number of index shards; and determines keywords according to the query request and uses them to search the multi-organization index dictionary of that index shard to obtain the index results.
  • The organization index code corresponding to the querier serves as the query condition: it determines the corresponding index fragment, and the index results are then obtained by keyword from the multi-organization index dictionary in that fragment.
  • Figure 2 is a schematic flowchart of a network disk document indexing method provided in Embodiment 2 of the present invention.
  • This embodiment is optimized based on the above embodiment.
  • In this embodiment, the organization index code is refined to be a serial number generated in order. Accordingly, determining the index fragment corresponding to the query request based on the query organization index code and the number of index fragments is refined as: extracting the organization index code from the index result; taking the extracted organization index code modulo the number of index fragments; and determining the corresponding index fragment from the modulo result.
  • The network disk document indexing method provided by this embodiment specifically includes:
  • Step 210 Obtain the organization where the document creator belongs, and obtain the organization index code of the organization.
  • The organization index code is a serial number generated in order.
  • For example, an organization index code can be assigned to each enterprise organization according to the time at which it joined, with each newly allocated code obtained by adding one to the previous code.
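  • A minimal sketch of such sequential assignment is shown below. The class name, the starting value of 1, and the behavior of returning the existing code on repeated registration are assumptions; the source only says that codes are serial numbers incremented by one in join order.

```python
import itertools


class OrgCodeRegistry:
    """Assigns organization index codes as serial numbers in join order."""

    def __init__(self, start=1):
        # The starting value is an assumption; the source does not state one.
        self._counter = itertools.count(start)
        self._codes = {}

    def register(self, org_name):
        # A newly joined organization gets the next serial number;
        # an organization that is already registered keeps its code.
        if org_name not in self._codes:
            self._codes[org_name] = next(self._counter)
        return self._codes[org_name]
```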
  • Step 220 Create an index for the document, add an organization index code to the document name in the index, store it in the index fragment, and generate a multi-organization index dictionary for each index fragment.
  • Step 230 Determine the organization where the query requester is located based on the query request, and determine the query organization index code based on the organization.
  • Step 240 Use the query organization index code to perform a modulo operation on the number of index shards, and determine the corresponding index shards based on the modulo operation result.
  • The number of index shards and the index content stored in each can be determined from the actual number of documents across all company organizations.
  • The index can also be allocated according to the expected growth in scale. The index shards corresponding to each company organization are therefore arranged in sequence.
  • Accordingly, the query organization index code can be taken modulo the number of index shards.
  • The modulo operation is essentially the remainder of a division, so it identifies the index shard corresponding to the company organization. When the index is expanded, index placement still follows the same rule, and the modulo operation can likewise be used to obtain all index shards corresponding to the company organization.
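  • The shard selection described above reduces to a single remainder operation. The sketch below assumes shards are numbered from 0 to the number of shards minus one, which the source does not state; the function name is hypothetical.

```python
def shard_for(org_index_code, num_shards):
    """Return the shard number for an organization: code modulo shard count."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return org_index_code % num_shards
```

  • With sequential codes, consecutive organizations land on consecutive shards, so organizations spread evenly. Note that changing the shard count changes the result for most codes, so existing index data would need to be placed again under the same rule after an expansion.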
  • Step 250 Determine keywords according to the query request, and use the keywords to search in the multi-organization index dictionary corresponding to the index shard to obtain the index result.
  • In this embodiment, the organization index code is refined to be a serial number generated in order. Accordingly, determining the index fragment corresponding to the query request based on the query organization index code and the number of index fragments is refined as: extracting the organization index code from the index result; taking the extracted organization index code modulo the number of index shards; and determining the corresponding index shard from the modulo result.
  • This improves the efficiency of determining the index shard.
  • It also applies when the index must be expanded because the volume of network disk documents has grown.
  • Figure 3 is a schematic flowchart of a network disk document indexing method provided in Embodiment 3 of the present invention.
  • This embodiment is optimized based on the above embodiment.
  • In this embodiment, generating a multi-organization index dictionary for each index fragment is refined as: generating the multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words.
  • The network disk document indexing method provided by this embodiment specifically includes:
  • Step 310 Obtain the organization where the document creator belongs and obtain the organization index code of the organization.
  • Step 320 Generate a multi-organization index dictionary for each index fragment based on the word segmentation result and the document organization index code corresponding to the word segmentation.
  • For example, an index shard includes the index content of enterprise users A, B, and C.
  • In that case, results obtained from the multi-organization index dictionary would need to be filtered again by the organization index code of each document before the corresponding index results are obtained.
  • The organization index code can therefore be written directly into the multi-organization index dictionary, based on the index engine's word segmentation results and the corresponding document organization index codes, so that the index dictionary of each organization is isolated.
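  • One way to picture writing the organization index code directly into the dictionary is to make it part of the dictionary key, so a lookup never touches another organization's entries. The composite `(org_code, term)` key below is an assumption chosen for illustration; the source does not describe the concrete encoding.

```python
from collections import defaultdict


def build_isolated_dictionary(postings):
    """Build {(org_code, term): [doc_name, ...]} from (org_code, term, doc_name) tuples."""
    dictionary = defaultdict(list)
    for org_code, term, doc_name in postings:
        # The org code is part of the key, so entries of different
        # organizations never share a posting list.
        dictionary[(org_code, term)].append(doc_name)
    return dict(dictionary)


def lookup(dictionary, org_code, term):
    # No post-filtering is needed: only the caller's own key is read.
    return dictionary.get((org_code, term), [])
```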
  • Generating a multi-organization index dictionary for each index fragment based on the word segmentation result and the document organization index code corresponding to the word segmentation may also include: obtaining the index sorting rules of each organization; sorting the word segmentation results that share the same document organization index code according to that organization's sorting rules; and generating a multi-organization index dictionary for each index fragment according to the sorting results.
  • When an index dictionary is used to output index results, the results usually need to be sorted according to corresponding rules to achieve better recommendations and increase the probability that a result is selected.
  • The TF-IDF (term frequency–inverse document frequency) method is commonly used. It is a statistical method for evaluating how important a word is to a document within a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases with the frequency of its occurrence across the corpus.
  • Various forms of TF-IDF weighting are commonly used by search engines as a measure or ranking of the relevance of a document to a user's query.
  • TF-IDF is the product TF × IDF, where TF is the term frequency and IDF is the inverse document frequency.
  • TF is the frequency with which a term occurs in document d.
  • The main idea of IDF is: the fewer documents contain term t, that is, the smaller n is, the larger the IDF, which means that term t distinguishes categories well.
  • The number of documents containing t satisfies n = m + k, where, presumably, m is the number of documents containing t within one category and k is the number of documents containing t in all other categories.
  • When m is large, n is also large, and the IDF value obtained from the IDF formula is small, which suggests that term t distinguishes categories poorly. In fact, however, if a term appears frequently in the documents of one category, it represents the characteristics of that category's text well. Such terms should be given a higher weight and selected as characteristic words that distinguish that category from other categories of documents. This is where IDF falls short.
  • Term frequency (TF) is the frequency with which a given word appears in a document. The count is normalized by the total number of terms in the document to prevent bias toward longer files. Different organizations, however, use different recommendation rules, and sorting every organization's results in the same way would inevitably degrade the sorting results.
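  • To make the weighting concrete, here is a small sketch of TF-IDF. The source does not give the IDF formula, so the basic form log(N / n) is assumed, where N is the number of documents in the corpus and n is the number of documents containing the term; the function names and the toy corpus are illustrative only.

```python
import math


def tf(term, doc_tokens):
    # Term count normalized by document length, to avoid favoring long documents.
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    # n = number of documents containing the term; a smaller n gives a larger IDF.
    n = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n) if n else 0.0


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


# Toy corpus of already-segmented documents (hypothetical).
corpus = [
    ["network", "disk", "index"],
    ["network", "disk", "backup"],
    ["index", "shard", "index", "query"],
]
```

  • In this corpus "shard" appears in one document and "network" in two, so "shard" receives the higher IDF, which matches the idea that rarer terms discriminate better.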
  • In this embodiment, the sorting rules of each enterprise organization in the multi-organization index dictionary are obtained first; all indexes of an organization are retrieved through its organization index code and sorted according to that organization's own sorting rules; and the multi-organization index dictionary is then regenerated.
  • A multi-organization index dictionary generated through such personalized sorting can output index results that meet the requirements of each enterprise organization, achieving the sorting effect of a single-organization index.
  • Generating a multi-organization index dictionary for each index fragment according to the sorting result may also include: determining the maximum number for each segmented word in the multi-organization index dictionary; and generating the multi-organization index dictionary for each index shard according to the maximum number and the sorting result. The important index results are selected via the sorting result, an index dictionary is generated for each enterprise organization from those important results, and the multi-organization index dictionary is then generated.
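  • The per-organization sorting and the maximum number can be sketched as follows. The example interprets the maximum number as a cap on how many entries each organization keeps per term, which is one reading of the text rather than something the source states; the sort rules, names, and scores are hypothetical.

```python
def build_ranked_dictionary(postings, sort_keys, max_per_term):
    """Sort each organization's entries by its own rule and keep the top entries.

    postings:  {term: [(org_code, doc_name, score), ...]}
    sort_keys: {org_code: key function applied to (doc_name, score)}
    """
    ranked = {}
    for term, entries in postings.items():
        by_org = {}
        for org_code, doc_name, score in entries:
            by_org.setdefault(org_code, []).append((doc_name, score))
        ranked[term] = {
            org_code: sorted(items, key=sort_keys[org_code])[:max_per_term]
            for org_code, items in by_org.items()
        }
    return ranked


# Hypothetical rules: organization 1 ranks by score, organization 2 by name.
sort_keys = {
    1: lambda item: -item[1],
    2: lambda item: item[0],
}
postings = {
    "report": [
        (1, "q1.doc", 0.2), (1, "q2.doc", 0.9), (1, "q3.doc", 0.5),
        (2, "b.doc", 0.7), (2, "a.doc", 0.1),
    ]
}
ranked = build_ranked_dictionary(postings, sort_keys, max_per_term=2)
```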
  • Step 330 Determine the organization where the query requester is located based on the query request, and determine the query organization index code based on the organization.
  • Step 340 Determine the index fragment corresponding to the query request according to the query organization index code and the number of index fragments.
  • Step 350 Determine keywords according to the query request, and use the keywords to search in the multi-organization index dictionary corresponding to the index shard to obtain the index result.
  • This embodiment refines generating a multi-organization index dictionary for each index fragment as: generating the multi-organization index dictionary for each index fragment based on the word segmentation results and the document organization index codes corresponding to the segmented words.
  • Figure 4 is a schematic flowchart of a network disk document indexing method provided in Embodiment 4 of the present invention.
  • This embodiment is optimized based on the above embodiment.
  • In this embodiment, searching with the keyword in the multi-organization index dictionary corresponding to the index fragment to obtain the index result is refined as: searching the multi-organization index dictionary according to the keyword to obtain multi-organization index results; and searching the multi-organization index results according to the query organization index code to obtain the index results.
  • The network disk document indexing method provided by this embodiment specifically includes:
  • Step 410 Obtain the organization where the document creator belongs and obtain the organization index code of the organization.
  • Step 420 Obtain the index sorting rules of each organization.
  • Step 430 Sort the word segmentation results that share the same document organization index code according to the sorting rules of that organization.
  • Step 440 Determine the maximum number of each segment in the corresponding organization in the multi-organization index dictionary.
  • Step 450 Determine the organization where the query requester is located based on the query request, and determine the query organization index code based on the organization.
  • Step 460 Determine the index fragment corresponding to the query request according to the query organization index code and the number of index fragments.
  • Step 470 Determine keywords according to the query request.
  • Step 480 Search the multi-organization index dictionary according to the keyword to obtain a multi-organization index result.
  • Step 490 Search the multi-organization index results according to the query organization index code to obtain the index results.
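  • Steps 480 and 490 amount to a two-stage lookup: first by keyword across all organizations in the shard, then by the query organization index code. The sketch below assumes a dictionary of the form term to a list of (organization code, document name) pairs; that layout and the function name are illustrative, not taken from the source.

```python
def search(multi_org_dictionary, keyword, query_org_code):
    """Two-stage search: keyword first, then the requester's organization code."""
    # Stage 1 (Step 480): all hits for the keyword, across organizations.
    multi_org_results = multi_org_dictionary.get(keyword, [])
    # Stage 2 (Step 490): keep only hits of the requester's organization.
    return [doc for org_code, doc in multi_org_results if org_code == query_org_code]
```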
  • Searching the multi-organization index results according to the query organization index code to obtain the index result may include: taking the extracted organization index code modulo the maximum number, and determining the corresponding index result from the modulo result.
  • The query organization index code is taken modulo the maximum number of each word in the multi-organization index dictionary of each index shard.
  • The modulo operation is essentially the remainder of a division, so it can locate the organization's index entry for the keyword within the multi-organization index dictionary, and the index content corresponding to the keyword can be determined quickly.
  • The index content obtained through the modulo operation can still be sorted and displayed according to importance.
  • In this embodiment, searching with the keywords in the multi-organization index dictionary corresponding to the index fragment to obtain the index results is refined as: searching the multi-organization index dictionary according to the keywords to obtain multi-organization index results; and searching the multi-organization index results according to the query organization index code to obtain the index result.
  • The corresponding index results can thus be obtained quickly.
  • The modulo operation also allows results to be sorted and displayed by importance, which improves the indexing efficiency of the multi-organization index dictionary.
  • Figure 5 is a schematic structural diagram of a network disk document indexing device provided in Embodiment 5 of the present invention.
  • The device includes: an acquisition module 510, used to obtain the organization to which the document creator belongs and to obtain the organization index code of that organization;
  • an adding module 520, used to create an index for the document, add the organization index code to the document name in the index, store the index in the index fragment, and generate a multi-organization index dictionary for each index fragment;
  • a coding determination module 530, used to determine the organization to which the query requester belongs based on the query request, and to determine the query organization index code based on that organization;
  • an index fragment determination module 540, used to determine the index fragment corresponding to the query request according to the query organization index code and the number of index fragments;
  • a search module 550, configured to determine keywords according to the query request and to search with the keywords in the multi-organization index dictionary corresponding to the index shard to obtain the index results.
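  • The five modules can be pictured as methods of one class. The following is a sketch under several assumptions: the class and method names are hypothetical, word segmentation is a plain whitespace split, shard selection uses the modulo rule of Embodiment 2, and postings are kept as (organization code, document name) pairs rather than by literally prefixing the document name.

```python
from collections import defaultdict


class NetDiskIndexer:
    def __init__(self, num_shards, user_to_org, org_to_code):
        self.num_shards = num_shards
        self.user_to_org = user_to_org
        self.org_to_code = org_to_code
        # One multi-organization dictionary per shard: term -> [(org_code, doc_name)].
        self.shards = [defaultdict(list) for _ in range(num_shards)]

    def org_code_of(self, user_id):
        # Roles of modules 510 and 530: user -> organization -> index code.
        return self.org_to_code[self.user_to_org[user_id]]

    def determine_shard(self, org_code):
        # Role of module 540: code modulo the number of shards.
        return org_code % self.num_shards

    def add_document(self, creator_id, doc_name, text):
        # Role of module 520: index the document into its organization's shard.
        code = self.org_code_of(creator_id)
        shard = self.shards[self.determine_shard(code)]
        for term in set(text.lower().split()):
            shard[term].append((code, doc_name))

    def search(self, requester_id, keyword):
        # Role of module 550: look up the keyword, then keep the requester's hits.
        code = self.org_code_of(requester_id)
        shard = self.shards[self.determine_shard(code)]
        hits = shard.get(keyword.lower(), [])
        return sorted(doc for org_code, doc in hits if org_code == code)
```

  • Organizations whose codes share a remainder share a shard, so the final filter by organization code is what keeps their results apart.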
  • The network disk document indexing device provided by this embodiment obtains the organization to which the document creator belongs and obtains the organization index code of that organization; creates an index for the document, adds the organization index code to the document name in the index, and stores the index in the index fragment; generates a multi-organization index dictionary for each index fragment; determines the organization to which the query requester belongs based on the query request, and determines the query organization index code based on that organization; determines the index fragment corresponding to the query request according to the query organization index code and the number of index fragments; and determines the keyword according to the query request and uses it to search the multi-organization index dictionary of that index fragment to obtain the index result.
  • The organization index code corresponding to the querier serves as the query condition: it determines the corresponding index fragment, and the index results are then obtained by keyword from the multi-organization index dictionary in that fragment.
  • The organization index code is a serial number generated in order.
  • The index fragment determination module includes:
  • an arithmetic unit, configured to take the query organization index code modulo the number of index shards and to determine the corresponding index shard from the modulo result.
  • The adding module includes:
  • a generation unit, used to generate a multi-organization index dictionary for each index fragment based on the word segmentation result and the document organization index code corresponding to the word segmentation.
  • The generation unit includes:
  • a sorting subunit, used to sort the word segmentation results that share the same document organization index code according to that organization's sorting rules.
  • The search module includes:
  • a first search unit, configured to search the multi-organization index dictionary according to the keyword to obtain a multi-organization index result;
  • a second search unit, used to search the multi-organization index results according to the query organization index code to obtain the index results.
  • The adding module includes:
  • a determination unit, used to determine the maximum number of each segmented word of the corresponding organization in the multi-organization index dictionary;
  • a multi-organization index dictionary generation unit, configured to generate a multi-organization index dictionary for each index shard according to the maximum number and the sorting result.
  • The second search unit includes:
  • a modulo operation subunit, used to take the extracted organization index code modulo the maximum number and to determine the corresponding index result from the modulo result.
  • The network disk document indexing device provided by the embodiment of the present invention can execute the network disk document indexing method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the executed method.
  • Figure 6 is a schematic structural diagram of a network disk provided in Embodiment 6 of the present invention.
  • Figure 6 shows a block diagram of an exemplary network disk 12 suitable for implementing embodiments of the present invention.
  • the network disk 12 shown in FIG. 6 is only an example and should not impose any restrictions on the functions and scope of use of the embodiments of the present invention.
  • the network disk 12 is embodied in the form of a general computing device.
  • the components of the network disk 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures.
  • For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
  • Network disk 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the network disk 12, including volatile and non-volatile media, removable and non-removable media.
  • System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache 32.
  • the network disk 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • For example only, storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Figure 6, commonly referred to as a "hard drive").
  • Although not shown in Figure 6, a magnetic disk drive may be provided for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), as well as an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a CD-ROM, DVD-ROM, or other optical media).
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • System memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of embodiments of the invention.
  • A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment.
  • Program modules 42 generally perform functions and/or methods in the described embodiments of the invention.
  • The network disk 12 may also communicate with one or more external devices 14 (such as a keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the network disk 12, and/or with any device (e.g., network card, modem, etc.) that enables the network disk 12 to communicate with one or more other computing devices. This communication may occur through input/output (I/O) interface 22.
  • The network disk 12 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the network disk 12 through the bus 18.
  • It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the network disk 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
  • the processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example, implementing the network disk document indexing method provided by the embodiment of the present invention.
  • Embodiment 7 of the present invention also provides a storage medium containing computer-executable instructions. When executed by a computer processor, the computer-executable instructions are used to execute the network disk document indexing method provided in any of the above embodiments.
  • the computer storage medium in this embodiment of the present invention may be any combination of one or more computer-readable media.
  • the computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium.
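  • A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.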
  • a computer-readable storage medium may be any tangible medium that contains or stores a program for use by or in connection with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above.
  • A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer-readable medium may be transmitted using any suitable medium, including but not limited to wireless, wire, optical cable, RF, etc., or any suitable combination of the foregoing.
  • Computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • In situations involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

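A network disk document indexing method, device, network disk, and storage medium. The method includes: obtaining the organization to which the document creator belongs, and obtaining the organization index code of that organization (110); creating an index for the document, adding the organization index code to the document name in the index, storing the index in an index shard, and generating a multi-organization index dictionary for each index shard (120); determining, according to a query request, the organization to which the requester belongs, and determining the query organization index code according to that organization (130); determining the index shard corresponding to the query request according to the query organization index code and the number of index shards (140); and determining a keyword according to the query request, and using the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result (150). The correspondence between index shards and organizations does not need to be stored in advance, which reduces the load on the index service and improves the efficiency with which the index service is provided externally.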

Description

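Network disk document indexing method, device, network disk, and storage medium
Technical Field
Embodiments of the present invention relate to the field of network disk technology, and in particular to a network disk document indexing method, device, network disk, and storage medium.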
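Background
A network disk is an online storage service offered by Internet companies. The data center of the network disk system allocates a certain amount of disk space to each user, provides file management functions such as storage, access, backup, and sharing either free of charge or for a fee, and maintains advanced disaster recovery backups around the world. Users can think of a network disk as a hard disk or USB flash drive placed on the network: whether at home, at work, or anywhere else, as long as they are connected to the Internet they can manage and edit the files in the network disk. There is no need to carry it around, and no risk of losing it.
For an enterprise network disk, the number of documents it holds is massive. To make it easier for users to find documents, network disks currently usually provide an ES (Elasticsearch) index service. Elasticsearch is a document-oriented database that supports distributed real-time file storage and indexes every field so that it can be searched. It can also scale out to hundreds of servers, which makes it suitable for handling petabyte-scale structured or unstructured data.
In the course of implementing the present invention, the inventor identified the following technical problem: for cost reasons, a single SaaS search engine is commonly shared among multiple tenant enterprises, that is, each index service serves multiple enterprises. In this case, it is first necessary to determine the enterprise to which the query initiator belongs, then determine the index address range in which that enterprise is located, and obtain the index result from that address range. However, network disk files change dynamically, so the index address range has to be adjusted constantly, which increases the load on the index service and also reduces the efficiency with which the index service is provided externally.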
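Summary of the Invention
Embodiments of the present invention provide a network disk document indexing method, device, network disk, and storage medium, to solve the technical problem in the prior art that the network disk index service is inefficient in multi-organization scenarios.
In a first aspect, an embodiment of the present invention provides a network disk document indexing method, including:
obtaining the organization to which the document creator belongs, and obtaining the organization index code of that organization;
creating an index for the document, adding the organization index code to the document name in the index, storing the index in an index shard, and generating a multi-organization index dictionary for each index shard;
determining, according to a query request, the organization to which the requester belongs, and determining the query organization index code according to that organization; determining the index shard corresponding to the query request according to the query organization index code and the number of index shards;
determining a keyword according to the query request, and using the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result.
In a second aspect, an embodiment of the present invention further provides a network disk document indexing device, including:
an acquisition module, configured to obtain the organization to which the document creator belongs, and obtain the organization index code of that organization;
an adding module, configured to create an index for the document, add the organization index code to the document name in the index, store the index in an index shard, and generate a multi-organization index dictionary for each index shard;
a code determination module, configured to determine, according to a query request, the organization to which the requester belongs, and determine the query organization index code according to that organization;
an index shard determination module, configured to determine the index shard corresponding to the query request according to the query organization index code and the number of index shards; and a search module, configured to determine a keyword according to the query request, and use the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result.
In a third aspect, an embodiment of the present invention further provides a network disk, including:
one or more processors;
a storage device, configured to store one or more programs;
when the one or more programs are executed by the one or more processors, the one or more processors implement the network disk document indexing method provided in the above embodiments.
In a fourth aspect, an embodiment of the present invention further provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the network disk document indexing method provided in the above embodiments.
The network disk document indexing method, device, network disk, and storage medium provided by the embodiments of the present invention obtain the organization to which the document creator belongs and the organization index code of that organization; create an index for the document, add the organization index code to the document name in the index, store the index in an index shard, and generate a multi-organization index dictionary for each index shard; determine, according to a query request, the organization to which the requester belongs, and determine the query organization index code according to that organization; determine the index shard corresponding to the query request according to the query organization index code and the number of index shards; and determine a keyword according to the query request and use the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result. An organization index code is set for each organization and stored as part of the document name, and the multi-organization index dictionary is generated from the index results. When a query request is received, the organization index code of the requester's organization is used as a query condition to determine the corresponding index shard, and the index result is obtained by keyword from the multi-organization index dictionary in that shard. The correspondence between index shards and organizations does not need to be stored in advance, which reduces the load on the index service and improves the efficiency with which the index service is provided externally.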
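Brief Description of the Drawings
Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Figure 1 is a schematic flowchart of the network disk document indexing method provided in Embodiment 1 of the present invention;
Figure 2 is a schematic flowchart of the network disk document indexing method provided in Embodiment 2 of the present invention;
Figure 3 is a schematic flowchart of the network disk document indexing method provided in Embodiment 3 of the present invention;
Figure 4 is a schematic flowchart of the network disk document indexing method provided in Embodiment 4 of the present invention;
Figure 5 is a schematic structural diagram of the network disk document indexing device provided in Embodiment 5 of the present invention;
Figure 6 is a schematic structural diagram of a network disk provided in Embodiment 6 of the present invention.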
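Detailed Description
The present invention is described in further detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the present invention, not to limit it. It should also be noted that, for ease of description, the drawings show only the parts related to the present invention rather than the entire structure.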
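Embodiment 1
Figure 1 is a schematic flowchart of the network disk document indexing method provided in Embodiment 1 of the present invention. This embodiment is applicable to providing an index service to each organization in a multi-organization scenario. The method may be executed by a network disk document indexing device and specifically includes the following steps:
Step 110: Obtain the organization to which the document creator belongs, and obtain the organization index code of that organization.
In this embodiment, multiple enterprise users share one index service; that is, the index service serves multiple enterprises. The index therefore needs to be partitioned so that multiple enterprises can use the same index service.
Optionally, each enterprise user sharing the index service may be assigned a code that distinguishes it from the other enterprises using the same index service. When generating an index for a document of an enterprise organization, the enterprise organization to which the creator of the document belongs is determined first. When the creator of the document is a member of an enterprise organization, that enterprise organization can be regarded as the owner of the document, so the corresponding organization index code needs to be obtained.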
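Step 120: Create an index for the document, add the organization index code to the document name in the index, store the index in an index shard, and generate a multi-organization index dictionary for each index shard.
An index is usually a separate, physical storage structure that sorts the values of one or more columns of a database table; it is a collection of the values of one or more columns of a table together with a list of logical pointers to the data pages in the table that physically identify those values. The index therefore includes the document name, and when an index is created for a document, the organization index code is added to the corresponding document name. The index with the organization index code added is stored in an index shard, and a multi-organization index dictionary is then generated for each index shard according to the index results. A multi-organization index dictionary is an index dictionary that covers the documents of multiple organizations. Generating a multi-organization index dictionary for each index shard includes: generating the multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words. That is, the word segmentation results and the organization index codes in the corresponding documents can be used to generate, for each index shard, that shard's multi-organization index dictionary.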
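As a minimal sketch of the tagging part of Step 120, the snippet below attaches the organization index code to the document name and appends the entry to an in-memory shard. The `IndexEntry` type, the `code:name` format, and the function names are illustrative assumptions; the text does not specify how the code is attached to the name or how a shard is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexEntry:
    """Hypothetical index record; the field names are assumptions."""
    tagged_name: str  # document name with the organization index code attached
    org_code: int


def tag_document_name(org_code: int, doc_name: str) -> str:
    # Assumed format "<org_code>:<doc_name>"; the source does not fix a format.
    return f"{org_code}:{doc_name}"


def add_to_shard(shard: list, org_code: int, doc_name: str) -> IndexEntry:
    """Create a tagged entry and store it in the given shard (a plain list here)."""
    entry = IndexEntry(tag_document_name(org_code, doc_name), org_code)
    shard.append(entry)
    return entry
```

Because the code travels with every entry, documents from several organizations can share one shard and still be told apart later.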
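Step 130: Determine, according to a query request, the organization to which the requester belongs, and determine the query organization index code according to that organization.
In the network disk, a user can issue a document query request, which may carry information such as a keyword or a document identifier. In this embodiment, the organization to which the requester belongs is first determined according to the query request. For example, the requester's ID can be obtained and used to determine the organization. Because the mapping between organization index codes and organizations has been established in advance, the organization index code of the requester's organization can be determined by a table lookup.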
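The table lookup in Step 130 can be pictured with two small mappings. This is only a sketch: the table contents, the ID format, and the function name `query_org_code` are hypothetical, and a real system would presumably read them from its account service.

```python
# Hypothetical lookup tables; the IDs and codes below are made up for illustration.
USER_TO_ORG = {"u1001": "org-a", "u2001": "org-b"}
ORG_TO_CODE = {"org-a": 1, "org-b": 2}


def query_org_code(requester_id: str) -> int:
    """Resolve requester ID -> organization -> organization index code."""
    org = USER_TO_ORG[requester_id]  # raises KeyError for an unknown requester
    return ORG_TO_CODE[org]
```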
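Step 140: Determine the index shard corresponding to the query request according to the query organization index code and the number of index shards.
Because the index is large, it is now common to store it in shards. An index can have multiple shards; a large index can be split into several shards distributed across different nodes. The index shard of an organization can be determined from the query organization index code and the total number of index shards.
Step 150: Determine a keyword according to the query request, and use the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result.
The query request includes a keyword. The index result is found by using the correspondence between keywords and documents established in the multi-organization index dictionary of the index shard determined in the preceding steps.
In this embodiment, the feedback computation for operations on files under the drive letter is performed by the network disk. Therefore, the preview operation response function on the network disk side can be used to produce the execution result. After obtaining the execution result with the operation response function, the network disk sends the execution result to the network disk, and the network disk displays it. For example, a preview area can be set in the current interface, and the file preview content displayed in that preview area.
This embodiment obtains the organization to which the document creator belongs and the organization index code of that organization; creates an index for the document, adds the organization index code to the document name in the index, stores the index in an index shard, and generates a multi-organization index dictionary for each index shard; determines, according to a query request, the organization to which the requester belongs, and determines the query organization index code according to that organization; determines the index shard corresponding to the query request according to the query organization index code and the number of index shards; and determines a keyword according to the query request and uses the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result. An organization index code is set for each organization and stored as part of the document name, and the multi-organization index dictionary is generated from the index results. When a query request is received, the organization index code of the requester's organization is used as a query condition to determine the corresponding index shard, and the index result is obtained by keyword from the multi-organization index dictionary in that shard. The correspondence between index shards and organizations does not need to be stored in advance, which reduces the load on the index service and improves the efficiency with which the index service is provided externally.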
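Embodiment 2
Figure 2 is a schematic flowchart of the network disk document indexing method provided in Embodiment 2 of the present invention. This embodiment is an optimization based on the above embodiment. In this embodiment, the organization index code is refined to be a serial number generated in sequence; correspondingly, determining the index shard corresponding to the query request according to the query organization index code and the number of index shards is refined as: extracting the organization index code in the index result; performing a modulo operation on the extracted organization index code with respect to the number of index shards, and determining the corresponding index shard according to the result of the modulo operation.
Correspondingly, the network disk document indexing method provided in this embodiment specifically includes:
Step 210: Obtain the organization to which the document creator belongs, and obtain the organization index code of that organization, the organization index code being a serial number generated in sequence.
In this embodiment, each enterprise organization can be assigned an organization index code according to the time at which it joined, and the assigned organization index codes are serial numbers generated by incrementing by one in order.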
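A sequential allocator of this kind might look like the sketch below. The class name `OrgCodeAllocator` and the starting value of 0 are assumptions, since the text only says that codes increase by one in the order in which organizations join.

```python
import itertools


class OrgCodeAllocator:
    """Hands out organization index codes as serial numbers in join order."""

    def __init__(self, start: int = 0):
        self._next = itertools.count(start)  # start value is an assumption
        self._codes = {}

    def register(self, org_name: str) -> int:
        # An organization that has already joined keeps its existing code.
        if org_name not in self._codes:
            self._codes[org_name] = next(self._next)
        return self._codes[org_name]
```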
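Step 220: Create an index for the document, add the organization index code to the document name in the index, store the index in an index shard, and generate a multi-organization index dictionary for each index shard.
Step 230: Determine, according to a query request, the organization to which the requester belongs, and determine the query organization index code according to that organization.
Step 240: Perform a modulo operation on the query organization index code with respect to the number of index shards, and determine the corresponding index shard according to the result of the modulo operation.
In this embodiment, the number of index shards and the index content stored in each can be determined according to the actual number of documents of all company organizations; when index shards are plentiful, the index can also be allocated reasonably according to the expected scale of growth. The index shards corresponding to each company organization are therefore arranged in sequence.
Accordingly, a modulo operation can be performed on the query organization index code with respect to the number of index shards. The modulo operation essentially yields the remainder, so the index shard corresponding to the company organization can be determined. When the index is expanded, the index is still laid out according to the same rule. Likewise, the modulo operation can be used to obtain all index shards corresponding to the company organization.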
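The modulo routing of Step 240 can be written in a few lines. The function name `shard_for_org` and the use of zero-based shard numbers are assumptions made for this sketch.

```python
def shard_for_org(org_code: int, num_shards: int) -> int:
    """Return the shard number for an organization: org_code mod num_shards."""
    if num_shards <= 0:
        raise ValueError("num_shards must be positive")
    return org_code % num_shards
```

With four shards, for example, organization codes 0 through 5 map to shards 0, 1, 2, 3, 0, 1. The shard is computed from the code alone, so no table of organization-to-shard assignments has to be kept up to date.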
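Step 250: Determine a keyword according to the query request, and use the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result.
In this embodiment, the organization index code is refined to be a serial number generated in sequence; correspondingly, determining the index shard corresponding to the query request according to the query organization index code and the number of index shards is refined as: extracting the organization index code in the index result; performing a modulo operation on the extracted organization index code with respect to the number of index shards, and determining the corresponding index shard according to the result of the modulo operation. In this way, the index shard corresponding to a company organization can be obtained quickly and accurately, which improves the efficiency of determining the index shard. The same approach also applies when the index is expanded because the number of network disk documents grows.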
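Embodiment 3
Figure 3 is a schematic flowchart of the network disk document indexing method provided in Embodiment 3 of the present invention. This embodiment is an optimization based on the above embodiments. In this embodiment, generating a multi-organization index dictionary for each index shard may be refined as: generating the multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words.
Correspondingly, the network disk document indexing method provided in this embodiment specifically includes:
Step 310: Obtain the organization to which the document creator belongs, and obtain the organization index code of that organization.
Step 320: Generate a multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words.
Each index shard contains the index dictionaries of multiple enterprise organizations. When a user performs a query, using the index dictionary of that shard may return results from the index dictionaries of multiple enterprise organizations. For example, an index shard contains the index content of enterprise users A, B, and C, and each of these enterprise users has a large number of documents containing the word "business". In this case, the results obtained from the multi-organization index dictionary have to be filtered again using the organization index codes in the documents before the corresponding index result is obtained.
Therefore, in this embodiment, the organization index code can be written directly into the multi-organization index dictionary according to the word segmentation results of the index engine and the corresponding document organization index codes, so that the index dictionaries of the organizations are isolated from one another.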
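One way to picture a dictionary with the organization index code written into it is an inverted index keyed by the pair (organization code, term). This is a sketch under that assumption; the text does not describe the dictionary's actual layout, and `build_multi_org_dictionary` is a hypothetical name.

```python
from collections import defaultdict


def build_multi_org_dictionary(docs):
    """Build {(org_code, term): [doc_name, ...]} for one index shard.

    docs: iterable of (org_code, doc_name, terms), where terms is the
    word segmentation result for that document.
    """
    dictionary = defaultdict(list)
    for org_code, doc_name, terms in docs:
        for term in set(terms):  # count each term once per document
            dictionary[(org_code, term)].append(doc_name)
    return dictionary
```

In the example from the text, organizations A, B, and C can all index the word "business" in the same shard, and a lookup under one organization's key never returns another organization's documents.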
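Further, generating a multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words may also include: obtaining the index sorting rule of each organization;
sorting the word segmentation results that belong to the same organization, as identified by the document organization index code, according to that organization's sorting rule; and generating a multi-organization index dictionary for each index shard according to the sorting result.
When index results are output from the index dictionary, they usually also need to be sorted according to appropriate rules to achieve better recommendations and increase the probability of being selected. The method in common use today is TF-IDF (term frequency–inverse document frequency), a statistical method for evaluating how important a word is to one document in a document collection or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus. Various forms of TF-IDF weighting are often used by search engines as a measure or rating of the relevance between a document and a user query. The main idea of TF-IDF is that if a word or phrase appears with a high frequency (TF) in one article and rarely appears in other articles, it is considered to have good class discrimination ability and is suitable for classification. TF-IDF is in fact TF * IDF, where TF is the term frequency and IDF is the inverse document frequency. TF denotes the frequency with which a term appears in document d. The main idea of IDF is that the fewer documents contain term t, that is, the smaller n is, the larger the IDF, which indicates that term t has good class discrimination ability. If the number of documents in a class C that contain term t is m, and the total number of documents in other classes that contain t is k, then the number of all documents containing t is n = m + k. When m is large, n is also large, and the IDF value given by the IDF formula is small, which suggests that term t has weak class discrimination ability. In practice, however, if a term appears frequently in the documents of one class, the term represents the characteristics of that class of text well; such terms should be given a higher weight and selected as feature words of that class to distinguish it from documents of other classes. This is the shortcoming of IDF. In a given document, the term frequency (TF) is the frequency with which a given word appears in that document. This number is a normalization of the term count, to prevent a bias toward long documents. However, different organizations use different recommendation rules. If the same sorting method were used for all of them, the sorting results would inevitably be affected.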
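The paragraph above refers to "the IDF formula" without stating it. The sketch below assumes the common textbook form, TF = (occurrences of t in d) / (number of terms in d) and IDF = log(N / n), where N is the number of documents and n the number containing t; other variants (for example with smoothing) exist, and the source does not say which one is intended.

```python
import math


def tf(term, doc_terms):
    """Term frequency: occurrences of term divided by document length."""
    if not doc_terms:
        return 0.0
    return doc_terms.count(term) / len(doc_terms)


def idf(term, corpus):
    """Inverse document frequency, assumed here to be log(N / n)."""
    n = sum(1 for doc in corpus if term in doc)
    if n == 0:
        return 0.0  # term absent from the corpus; a convention chosen for this sketch
    return math.log(len(corpus) / n)


def tf_idf(term, doc_terms, corpus):
    return tf(term, doc_terms) * idf(term, corpus)
```

A term that occurs in every document gets IDF = log(1) = 0, which matches the statement that a large n means weak discrimination ability.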
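Therefore, in this embodiment, the ordering of the multi-organization index dictionary needs to be adjusted. Specifically, the sorting rule of each enterprise organization in the multi-organization index dictionary is obtained first; all index entries of an organization are retrieved by its organization index code and sorted according to that enterprise organization's own sorting rule; and the multi-organization index dictionary is regenerated. With the indexes of the enterprise organizations kept isolated, the multi-organization index dictionary generated by this personalized sorting can output index results tailored to the requirements of each enterprise organization, achieving the same index ordering as a single-organization index.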
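Per-organization sorting could be sketched as follows, assuming each organization's rule is a scoring function and that dictionary entries are keyed by (organization code, term). The record fields `score` and `modified` in the usage note are hypothetical examples of what a rule might rank on.

```python
def sort_by_org_rules(dictionary, rules, default_rule):
    """Sort each posting list with the rule of the organization that owns it.

    dictionary: {(org_code, term): [doc_record, ...]}
    rules: {org_code: key function returning a number, higher ranks first}
    default_rule: key function used for organizations without their own rule
    """
    return {
        (org_code, term): sorted(
            docs, key=rules.get(org_code, default_rule), reverse=True
        )
        for (org_code, term), docs in dictionary.items()
    }
```

For instance, one organization might rank by a relevance score (`lambda r: r["score"]`) while another ranks by recency (`lambda r: r["modified"]`); both orderings then coexist in the same shard.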
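In addition, to prevent the index dictionary from growing without bound and consuming a large amount of index resources, in this embodiment, generating a multi-organization index dictionary for each index shard according to the sorting result may further include: determining the maximum number of entries for each segmented word in the multi-organization index dictionary; and generating a multi-organization index dictionary for each index shard according to the maximum number and the sorting result. The important index results are selected by means of the sorting result, the index dictionary of each enterprise organization is generated from those important index results, and the multi-organization index dictionary is generated in turn.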
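Capping the dictionary can be sketched as truncating each already-sorted posting list to the maximum number. The function name `cap_postings` and the single global cap are assumptions; the text does not say whether the maximum is shared or set per word.

```python
def cap_postings(sorted_dictionary, max_per_term):
    """Keep only the top max_per_term entries of each sorted posting list."""
    if max_per_term < 0:
        raise ValueError("max_per_term must be non-negative")
    return {key: docs[:max_per_term] for key, docs in sorted_dictionary.items()}
```

Because the lists are sorted before truncation, the entries that survive are the ones each organization's rule ranks as most important.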
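Step 330: Determine, according to a query request, the organization to which the requester belongs, and determine the query organization index code according to that organization.
Step 340: Determine the index shard corresponding to the query request according to the query organization index code and the number of index shards.
Step 350: Determine a keyword according to the query request, and use the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result.
In this embodiment, generating a multi-organization index dictionary for each index shard is refined as: generating the multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words. This isolates the index dictionary of each enterprise organization within a single index shard. The multi-organization index dictionary can also be sorted according to each enterprise organization's sorting requirements, so that it outputs index results tailored to the requirements of each enterprise organization, achieving the same index ordering as a single-organization index.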
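Embodiment 4
Figure 4 is a schematic flowchart of the network disk document indexing method provided in Embodiment 4 of the present invention. This embodiment is an optimization based on the above embodiments. In this embodiment, using the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result may be refined as: searching the multi-organization index dictionary according to the keyword to obtain multi-organization index results; and searching the multi-organization index results according to the query organization index code to obtain the index result.
Correspondingly, the network disk document indexing method provided in this embodiment specifically includes:
Step 410: Obtain the organization to which the document creator belongs, and obtain the organization index code of that organization.
Step 420: Obtain the index sorting rule of each organization.
Step 430: Sort the word segmentation results that belong to the same organization, as identified by the document organization index code, according to that organization's sorting rule.
Step 440: Determine the maximum number of each segmented word in the corresponding organization in the multi-organization index dictionary.
Step 450: Determine, according to a query request, the organization to which the requester belongs, and determine the query organization index code according to that organization.
Step 460: Determine the index shard corresponding to the query request according to the query organization index code and the number of index shards.
Step 470: Determine a keyword according to the query request.
Step 480: Search the multi-organization index dictionary according to the keyword to obtain multi-organization index results.
The multi-organization index results are found by using the correspondence between keywords and documents in the multi-organization index dictionary.
Step 490: Search the multi-organization index results according to the query organization index code to obtain the index result.
For example, searching the multi-organization index results according to the query organization index code to obtain the index result may include: performing a modulo operation on the extracted organization index code with respect to the maximum number, and determining the corresponding index result according to the result of the modulo operation.
A modulo operation is performed on the query organization index code with respect to the maximum number of each word in the multi-organization index dictionary of each index shard. The modulo operation essentially yields the remainder, so the index entries of the organization that correspond to the keyword in the multi-organization index dictionary can be determined, and the index content corresponding to the keyword can be located quickly. Moreover, because the multi-organization index dictionary has already been sorted accordingly, the index content obtained through the modulo operation can still be displayed in order of importance.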
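The two-stage lookup of Steps 480 and 490 can be illustrated as below. Note the swap: this sketch narrows the multi-organization results with a plain equality filter on the organization index code rather than the modulo-based positioning described above, because the text does not define how the remainder maps to dictionary positions. The dictionary layout and the name `lookup` are assumptions.

```python
def lookup(dictionary, keyword, query_org_code):
    """Two-stage search: keyword first, then the requester's organization.

    dictionary: {term: [(org_code, doc_name), ...]} with postings already ranked.
    """
    # Step 480: every organization's postings for the keyword.
    multi_org_results = dictionary.get(keyword, [])
    # Step 490: keep only the requester's organization; the ranked order is preserved.
    return [doc for code, doc in multi_org_results if code == query_org_code]
```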
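In this embodiment, using the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result is refined as: searching the multi-organization index dictionary according to the keyword to obtain multi-organization index results; and searching the multi-organization index results according to the query organization index code to obtain the index result. The corresponding index result can be obtained quickly, and with the modulo operation the results can still be displayed in order of importance, which improves the indexing efficiency of the multi-organization index dictionary.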
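Embodiment 5
Figure 5 is a schematic structural diagram of the network disk document indexing device provided in Embodiment 5 of the present invention. As shown in Figure 5, the device includes:
an acquisition module 510, configured to obtain the organization to which the document creator belongs, and obtain the organization index code of that organization;
an adding module 520, configured to create an index for the document, add the organization index code to the document name in the index, store the index in an index shard, and generate a multi-organization index dictionary for each index shard;
a code determination module 530, configured to determine, according to a query request, the organization to which the requester belongs, and determine the query organization index code according to that organization;
an index shard determination module 540, configured to determine the index shard corresponding to the query request according to the query organization index code and the number of index shards;
a search module 550, configured to determine a keyword according to the query request, and use the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result.
The network disk document indexing device provided in this embodiment obtains the organization to which the document creator belongs and the organization index code of that organization; creates an index for the document, adds the organization index code to the document name in the index, stores the index in an index shard, and generates a multi-organization index dictionary for each index shard; determines, according to a query request, the organization to which the requester belongs, and determines the query organization index code according to that organization; determines the index shard corresponding to the query request according to the query organization index code and the number of index shards; and determines a keyword according to the query request and uses the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result. An organization index code is set for each organization and stored as part of the document name, and the multi-organization index dictionary is generated from the index results. When a query request is received, the organization index code of the requester's organization is used as a query condition to determine the corresponding index shard, and the index result is obtained by keyword from the multi-organization index dictionary in that shard. The correspondence between index shards and organizations does not need to be stored in advance, which reduces the load on the index service and improves the efficiency with which the index service is provided externally.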
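On the basis of the above embodiments, the organization index code is a serial number generated in sequence;
correspondingly, the index shard determination module includes:
an operation unit, configured to perform a modulo operation on the query organization index code with respect to the number of index shards, and determine the corresponding index shard according to the result of the modulo operation.
On the basis of the above embodiments, the adding module includes:
a generation unit, configured to generate a multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words.
On the basis of the above embodiments, the generation unit includes:
an acquisition subunit, configured to obtain the index sorting rule of each organization;
a sorting subunit, configured to sort the word segmentation results that belong to the same organization, as identified by the document organization index code, according to that organization's sorting rule;
a generation subunit, configured to generate a multi-organization index dictionary for each index shard according to the sorting result.
On the basis of the above embodiments, the search module includes:
a first search unit, configured to search the multi-organization index dictionary according to the keyword to obtain multi-organization index results;
a second search unit, configured to search the multi-organization index results according to the query organization index code to obtain the index result.
On the basis of the above embodiments, the adding module includes:
a determination unit, configured to determine the maximum number of each segmented word in the corresponding organization in the multi-organization index dictionary;
a multi-organization index dictionary generation unit, configured to generate a multi-organization index dictionary for each index shard according to the maximum number and the sorting result.
On the basis of the above embodiments, the second search unit includes:
a modulo operation subunit, configured to perform a modulo operation on the extracted organization index code with respect to the maximum number, and determine the corresponding index result according to the result of the modulo operation.
The network disk document indexing device provided by the embodiments of the present invention can execute the network disk document indexing method provided by any embodiment of the present invention, and has the functional modules and beneficial effects corresponding to the method executed.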
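Embodiment 6
Figure 6 is a schematic structural diagram of a network disk provided in Embodiment 6 of the present invention. Figure 6 shows a block diagram of an exemplary network disk 12 suitable for implementing embodiments of the present invention. The network disk 12 shown in Figure 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present invention.
As shown in Figure 6, the network disk 12 takes the form of a general-purpose computing device. The components of the network disk 12 may include, but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the different system components (including the system memory 28 and the processing unit 16).
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the Enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
The network disk 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the network disk 12, including volatile and non-volatile media, and removable and non-removable media.
System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache 32. The network disk 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. For example only, storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in Figure 6, commonly referred to as a "hard drive"). Although not shown in Figure 6, a magnetic disk drive may be provided for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), as well as an optical disc drive for reading from and writing to a removable non-volatile optical disc (e.g., a CD-ROM, DVD-ROM, or other optical media). In these cases, each drive may be connected to bus 18 through one or more data media interfaces. System memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present invention.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described in the present invention.
The network disk 12 may also communicate with one or more external devices 14 (such as a keyboard, pointing device, display 24, etc.), with one or more devices that enable a user to interact with the network disk 12, and/or with any device (e.g., network card, modem, etc.) that enables the network disk 12 to communicate with one or more other computing devices. This communication may occur through input/output (I/O) interface 22. The network disk 12 can also communicate with one or more networks (such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through the network adapter 20. As shown in the figure, the network adapter 20 communicates with the other modules of the network disk 12 through the bus 18. It should be understood that, although not shown in the figure, other hardware and/or software modules may be used in conjunction with the network disk 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, for example implementing the network disk document indexing method provided by the embodiments of the present invention.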
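Embodiment 7
Embodiment 7 of the present invention also provides a storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the network disk document indexing method provided in any of the above embodiments.
The computer storage medium of the embodiments of the present invention may be any combination of one or more computer-readable media. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection having one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In this document, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in connection with an instruction execution system, apparatus, or device.
A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying computer-readable program code. Such a propagated data signal may take a variety of forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code contained on a computer-readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wire, optical cable, RF, and so on, or any suitable combination of the above.
Computer program code for performing the operations of the present invention may be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as "C" or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In situations involving a remote computer, the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
Note that the above are only preferred embodiments of the present invention and the technical principles applied. Those skilled in the art will understand that the present invention is not limited to the specific embodiments described here, and that various obvious changes, readjustments, and substitutions can be made by those skilled in the art without departing from the scope of protection of the present invention. Therefore, although the present invention has been described in some detail through the above embodiments, it is not limited to them; it may include further equivalent embodiments without departing from the inventive concept, and its scope is determined by the scope of the appended claims.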

Claims (10)

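  1. A network disk document indexing method, characterized by including:
    obtaining the organization to which the document creator belongs, and obtaining the organization index code of that organization;
    creating an index for the document, adding the organization index code to the document name in the index, storing the index in an index shard, and generating a multi-organization index dictionary for each index shard;
    determining, according to a query request, the organization to which the requester belongs, and determining the query organization index code according to that organization;
    determining the index shard corresponding to the query request according to the query organization index code and the number of index shards;
    determining a keyword according to the query request, and using the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result.
  2. The method according to claim 1, characterized in that the organization index code is a serial number generated in sequence;
    correspondingly, determining the index shard corresponding to the query request according to the query organization index code and the number of index shards includes:
    performing a modulo operation on the query organization index code with respect to the number of index shards, and determining the corresponding index shard according to the result of the modulo operation.
  3. The method according to claim 1, characterized in that generating a multi-organization index dictionary for each index shard includes:
    generating a multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words.
  4. The method according to claim 3, characterized in that generating a multi-organization index dictionary for each index shard according to the word segmentation results and the document organization index codes corresponding to the segmented words further includes:
    obtaining the index sorting rule of each organization;
    sorting the word segmentation results that belong to the same organization, as identified by the document organization index code, according to that organization's sorting rule;
    generating a multi-organization index dictionary for each index shard according to the sorting result.
  5. The method according to claim 3, characterized in that using the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result includes:
    searching the multi-organization index dictionary according to the keyword to obtain multi-organization index results;
    searching the multi-organization index results according to the query organization index code to obtain the index result.
  6. The method according to claim 4, characterized in that generating a multi-organization index dictionary for each index shard according to the sorting result includes:
    determining the maximum number of each segmented word in the corresponding organization in the multi-organization index dictionary;
    generating a multi-organization index dictionary for each index shard according to the maximum number and the sorting result.
  7. The method according to claim 6, characterized in that searching the multi-organization index results according to the query organization index code to obtain the index result includes:
    performing a modulo operation on the extracted organization index code with respect to the maximum number, and determining the corresponding index result according to the result of the modulo operation.
  8. A network disk document indexing device, characterized by including:
    an acquisition module, configured to obtain the organization to which the document creator belongs, and obtain the organization index code of that organization;
    an adding module, configured to create an index for the document, add the organization index code to the document name in the index, store the index in an index shard, and generate a multi-organization index dictionary for each index shard;
    a code determination module, configured to determine, according to a query request, the organization to which the requester belongs, and determine the query organization index code according to that organization;
    an index shard determination module, configured to determine the index shard corresponding to the query request according to the query organization index code and the number of index shards;
    a search module, configured to determine a keyword according to the query request, and use the keyword to search the multi-organization index dictionary of the corresponding index shard to obtain an index result.
  9. A network disk, characterized in that the network disk includes:
    one or more processors;
    a storage device, configured to store one or more programs;
    when the one or more programs are executed by the one or more processors, the one or more processors implement the network disk document indexing method according to any one of claims 1-7.
  10. A storage medium containing computer-executable instructions which, when executed by a computer processor, are used to execute the network disk document indexing method according to any one of claims 1-7.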
PCT/CN2023/108029 2022-07-28 2023-07-19 Network disk document indexing method, device, network disk, and storage medium WO2024022180A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210894410.X 2022-07-28
CN202210894410.XA CN115080684B (zh) 2022-07-28 2022-07-28 Network disk document indexing method, device, network disk, and storage medium

Publications (1)

Publication Number Publication Date
WO2024022180A1 true WO2024022180A1 (zh) 2024-02-01

Family

ID=83243319

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/108029 WO2024022180A1 (zh) 2022-07-28 2023-07-19 Network disk document indexing method, device, network disk, and storage medium

Country Status (2)

Country Link
CN (1) CN115080684B (zh)
WO (1) WO2024022180A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115080684B (zh) * 2022-07-28 2023-01-06 天津联想协同科技有限公司 Network disk document indexing method, device, network disk, and storage medium

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101408882A (zh) * 2008-08-05 2009-04-15 北大方正集团有限公司 Retrieval method and system for authorized documents
US20120310928A1 (en) * 2011-06-01 2012-12-06 Microsoft Corporation Discovering expertise using document metadata in part to rank authors
CN107506464A (zh) * 2017-08-30 2017-12-22 武汉烽火众智数字技术有限责任公司 Method for implementing HBase secondary indexes based on ES
CN112395387A (zh) * 2019-08-15 2021-02-23 北京京东尚科信息技术有限公司 Full-text retrieval method and device, computer storage medium, and electronic device
CN113312355A (zh) * 2021-06-15 2021-08-27 北京沃东天骏信息技术有限公司 Data management method and device
CN114416670A (zh) * 2022-04-01 2022-04-29 天津联想协同科技有限公司 Index creation method and device for network disk documents, network disk, and storage medium
CN115080684A (zh) * 2022-07-28 2022-09-20 天津联想协同科技有限公司 Network disk document indexing method, device, network disk, and storage medium

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
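CN101408876B (zh) * 2007-10-09 2011-03-16 中兴通讯股份有限公司 Method and system for full-text retrieval of electronic documents
CN101246500B (zh) * 2008-03-27 2011-04-13 腾讯科技(深圳)有限公司 Retrieval system and method for fast data indexing
CN101599069A (zh) * 2009-07-10 2009-12-09 腾讯科技(深圳)有限公司 Search method and system for electronic documents
CN102073719A (zh) * 2011-01-10 2011-05-25 复旦大学 GML document indexing method based on interval encoding
WO2012126180A1 (en) * 2011-03-24 2012-09-27 Microsoft Corporation Multi-layer search-engine index
CN108628867A (zh) * 2017-03-16 2018-10-09 北京科瑞云安信息技术有限公司 Multi-keyword ciphertext retrieval method and system for cloud storage
CN110019647B (zh) * 2017-10-25 2023-12-15 华为技术有限公司 Keyword search method, device, and search engine
CN111737316A (zh) * 2020-06-19 2020-10-02 广联达科技股份有限公司 Engineering bill query method, device, computer equipment, and storage medium
CN112612845A (zh) * 2020-12-22 2021-04-06 中国建设银行股份有限公司 Method and device for implementing an organization view, electronic device, and readable storage medium
CN113486156A (zh) * 2021-07-30 2021-10-08 北京鼎普科技股份有限公司 ES-based associated document retrieval method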

Also Published As

Publication number Publication date
CN115080684A (zh) 2022-09-20
CN115080684B (zh) 2023-01-06

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23845392

Country of ref document: EP

Kind code of ref document: A1