CN116962516A - Data query method, device, equipment and storage medium - Google Patents

Data query method, device, equipment and storage medium

Info

Publication number
CN116962516A
Authority
CN
China
Prior art keywords
file
keyword
storage
query
files
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310715282.2A
Other languages
Chinese (zh)
Inventor
石志林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202310715282.2A
Publication of CN116962516A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/568 Storing data temporarily at an intermediate stage, e.g. caching
    • H04L67/5681 Pre-fetching or pre-delivering data based on network characteristics
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/10 Protocols in which an application is distributed across nodes in the network
    • H04L67/1097 Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/50 Network services
    • H04L67/56 Provisioning of proxy services
    • H04L67/561 Adding application-functional data or data for application control, e.g. adding metadata
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Library & Information Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiments of the application provide a data query method, device, equipment and storage medium, which relate to the field of cloud technologies and are applicable to a computing node in a cloud storage system. The method comprises: extracting query keywords from a data query request; performing the following operations for each query keyword: determining each mapping area associated with the query keyword by using different preset mapping methods, obtaining a combined inverted file from each mapping area, and intersecting the obtained combined inverted files to obtain the independent inverted file of the query keyword, wherein each combined inverted file is obtained by combining that independent inverted file with the independent inverted files of other keywords, the independent inverted file comprises the file information associated with the query keyword, and each piece of file information comprises a file identifier and the node identifier of the storage node where the corresponding storage file is located; based on the independent inverted files, sending data acquisition requests to the corresponding storage nodes and receiving request responses; and generating a data query response according to the storage files carried in the request responses. The storage and computation of data can thereby be separated.

Description

Data query method, device, equipment and storage medium
Technical Field
The present application relates to the field of cloud technologies, and in particular, to a data query method, device, equipment, and storage medium.
Background
Cloud storage is a mode of online storage in which data is stored on a server cluster hosted by a third party. The third party operates a large data center, and a party that needs to store data purchases or leases storage space from it. The third party prepares virtualized storage resources on the server cluster according to the customer's requirements and provides them as a storage resource pool, which the user can then use to store files.
In the related art, cloud storage provides its service to users by integrating computation and storage. As shown in fig. 1, each server of the server cluster 120 includes both computation logic and storage logic, and a user can upload, download and query files on the terminal device 110. In this integrated approach, however, each server must hold both storage files and the index structure of those storage files, and the storage files and corresponding index structures differ from server to server. As a result, even when there are few query tasks, every server in the cluster must execute every query task, which consumes substantial resources and degrades server performance. The whole server cluster must also be kept running, which further increases resource consumption and the cluster operating cost.
For example, assume that a server cluster includes 100 servers, each server stores about one hundred thousand files, and the index structure is a balanced multi-way tree (a balanced tree of order m, abbreviated as B-tree) composed of keywords, file IDs and file addresses. If there are ten thousand query tasks during the day, between 06:00 and 24:00, all running servers in the cluster can execute effective query tasks. If there are only 100 query tasks at night, between 00:00 and 06:00, only 50 servers in the cluster can execute effective queries, yet all servers in the cluster still need to be kept running, and for every query task each server must obtain the keywords of the query task, look up the file IDs corresponding to those keywords in its B-tree, and then obtain the files corresponding to those file IDs. This causes huge resource consumption, poor server performance and high cluster operating cost.
Therefore, a redesigned data query method is needed to overcome the above drawbacks.
Disclosure of Invention
The embodiments of the application provide a data query method, device, equipment and storage medium, which separate data storage from computation while reducing resource consumption, preserving good server performance and keeping the cluster operating cost low.
In a first aspect, an embodiment of the present application provides a data query method, which is applicable to a computing node in a cloud storage system, where the cloud storage system includes the computing node and a storage node, and the storage node includes a plurality of storage files, and the method includes:
extracting at least one query keyword from a data query request sent by a client;
the following operations are respectively executed for the at least one query keyword:
determining each mapping area associated with one query keyword by using different preset mapping methods, obtaining the corresponding combined inverted file from each mapping area, and intersecting the obtained combined inverted files to obtain the independent inverted file of the one query keyword; wherein each combined inverted file is obtained by combining the independent inverted file with the independent inverted files of other keywords; the independent inverted file of the one query keyword comprises each piece of file information associated with the one query keyword, and each piece of file information comprises a file identifier and the node identifier of the storage node where the corresponding storage file is located; and each preset mapping method is associated with a plurality of mapping areas, each of which contains one combined inverted file;
based on the node identifier and the file identifier of each piece of file information, sending a data acquisition request to each corresponding storage node, and receiving the request response sent by each storage node;
and generating a data query response and returning the data query response to the client according to the storage file carried in each received request response of each query keyword.
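The query steps of the first aspect can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation described by the application: the "preset mapping methods" are modelled as salted SHA-256 hashes, each "mapping area" as a bucket holding a set of (node identifier, file identifier) pairs, and the names `bucket_of`, `independent_inverted_file`, `answer_query`, `NUM_BUCKETS` and the `fetch` callback are all hypothetical.

```python
import hashlib
from typing import Callable, Dict, Iterable, List, Set, Tuple

FileInfo = Tuple[str, str]                 # (node identifier, file identifier)
Regions = List[Dict[int, Set[FileInfo]]]   # regions[method][bucket] -> combined inverted file

NUM_BUCKETS = 64  # assumed number of mapping areas per mapping method


def bucket_of(keyword: str, method: int) -> int:
    """Hypothetical preset mapping method: a hash salted with the method index."""
    digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_BUCKETS


def independent_inverted_file(regions: Regions, keyword: str) -> Set[FileInfo]:
    """Fetch one combined inverted file per mapping method and intersect them."""
    combined = [regions[m].get(bucket_of(keyword, m), set()) for m in range(len(regions))]
    return set.intersection(*combined) if combined else set()


def answer_query(regions: Regions, keywords: Iterable[str],
                 fetch: Callable[[str, str], str]) -> Dict[str, List[str]]:
    """For each query keyword, request every matching file from its storage node."""
    response = {}
    for keyword in keywords:
        infos = sorted(independent_inverted_file(regions, keyword))
        response[keyword] = [fetch(node_id, file_id) for node_id, file_id in infos]
    return response
```

Because other keywords may share a bucket under every mapping method, the intersection can contain extra file information; in this sketch it is always a superset of the keyword's true independent inverted file.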
In a second aspect, an embodiment of the present application provides a data query device, which is applicable to a computing node in a cloud storage system, where the cloud storage system includes the computing node and a storage node, and the storage node includes a plurality of storage files, and includes:
the extraction unit is used for extracting at least one query keyword from the data query request sent by the client;
the first processing unit is used for respectively executing the following operations for the at least one query keyword:
the mapping unit is used for determining each mapping area associated with one query keyword by using different preset mapping methods, obtaining the corresponding combined inverted file from each mapping area, and intersecting the obtained combined inverted files to obtain the independent inverted file of the one query keyword; wherein each combined inverted file is obtained by combining the independent inverted file with the independent inverted files of other keywords; the independent inverted file of the one query keyword comprises each piece of file information associated with the one query keyword, and each piece of file information comprises a file identifier and the node identifier of the storage node where the corresponding storage file is located; and each preset mapping method is associated with a plurality of mapping areas, each of which contains one combined inverted file;
the receiving and transmitting unit is used for sending, based on the node identifier and the file identifier of each piece of file information, a data acquisition request to each corresponding storage node, and for receiving the request response sent by each storage node;
and the second processing unit is used for generating a data query response according to the received storage file carried in each request response of each query keyword and returning the data query response to the client by adopting the receiving and transmitting unit.
Optionally, the mapping unit is specifically configured to,
for the different preset mapping methods, the following operations are respectively executed:
mapping the query keyword by adopting a preset mapping method to obtain a mapping value;
determining the mapping area associated with the mapping value, and acquiring the combined inverted file from the mapping area, wherein, under one preset mapping method, the keywords of the independent inverted files in the combined inverted file of a mapping area all have the same mapping value, and the combined inverted files in the plurality of mapping areas associated with one preset mapping method together contain the file information of the storage files of every storage node.
Optionally, the extracting unit is further configured to obtain, based on detection of each storage node, new file information of a new file, and to extract at least one inserted keyword from the new file, where the new file is a file that is received and stored by the storage node and differs from each already stored file;
the first processing unit is further configured to, if it is determined that a keyword record contains the at least one inserted keyword, where the keyword record records the keywords contained in the storage files of each storage node, perform the following operations for the at least one inserted keyword respectively:
the mapping unit is further configured to determine each mapping area associated with one inserted keyword by using the different preset mapping methods, obtain a corresponding combined inverted file from each mapping area associated with one inserted keyword, and store an association relationship between the one inserted keyword and the new file information in the corresponding combined inverted file.
Optionally, the extracting unit is further configured to obtain, based on detection of each storage node, new file information of a new file, and to extract at least one inserted keyword from the new file, where the new file is a file that is received and stored by the storage node and differs from each already stored file;
the first processing unit is further configured to determine that the at least one inserted keyword includes a new keyword that is not contained in a keyword record, where the keyword record records the keywords contained in the storage files of each storage node;
The mapping unit is further configured to generate, for the new keyword, an individual inverted file of the new keyword, where the individual inverted file of the new keyword includes: new file information of the new file associated with the new keyword;
the mapping unit is further configured to determine each mapping area associated with the new keyword by using the different preset mapping methods, obtain corresponding combined inverted files from each mapping area associated with the new keyword, and combine the single inverted files of the new keyword into the corresponding combined inverted files.
Optionally, the extracting unit is further configured to obtain deleted file information of a deleted file based on detection of each storage node, and extract at least one deleted keyword from the deleted file, where the deleted file is a file deleted by the storage node from each stored file stored in the storage node;
the first processing unit is further configured to perform, for the at least one deletion keyword, the following operations respectively:
the mapping unit is further used for determining each mapping area associated with one deletion keyword by adopting different preset mapping methods, obtaining corresponding combined inverted files from each mapping area associated with the one deletion keyword, and carrying out intersection processing on each obtained combined inverted file to obtain an independent inverted file of the one deletion keyword; and deleting the deleted file information from the single inverted file of the deleted keyword.
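The optional insertion and deletion behaviour above can be sketched as follows. This is a rough illustration under assumptions rather than the application's own implementation: mapping methods are modelled as salted SHA-256 hashes, combined inverted files as sets of (node identifier, file identifier) pairs, and the class `MergedIndex` and its method names are hypothetical. In this set-based model, inserting a known keyword and inserting a new keyword reduce to the same bucket update; the only difference is whether the keyword record grows. Deletion assumes the caller passes every keyword of the deleted file.

```python
import hashlib
from collections import defaultdict


class MergedIndex:
    """Hypothetical holder of combined inverted files, one bucket map per mapping method."""

    def __init__(self, num_methods=3, num_buckets=64):
        self.num_methods = num_methods   # assumed to be at least 1
        self.num_buckets = num_buckets
        self.regions = [defaultdict(set) for _ in range(num_methods)]
        self.keyword_record = set()      # keywords seen in any storage file

    def _bucket(self, keyword, method):
        digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.num_buckets

    def add_file(self, file_info, keywords):
        """Store the keyword -> new file association in every associated mapping area."""
        for keyword in keywords:
            self.keyword_record.add(keyword)  # no-op for an already recorded keyword
            for method in range(self.num_methods):
                self.regions[method][self._bucket(keyword, method)].add(file_info)

    def delete_file(self, file_info, keywords):
        """Remove the deleted file's information from the buckets of all its keywords."""
        for keyword in keywords:
            for method in range(self.num_methods):
                self.regions[method][self._bucket(keyword, method)].discard(file_info)

    def lookup(self, keyword):
        """Intersect the combined inverted files of the keyword's mapping areas."""
        return set.intersection(*(
            self.regions[method].get(self._bucket(keyword, method), set())
            for method in range(self.num_methods)
        ))
```

Discarding the file information from a shared bucket is safe here only because the file no longer exists, so no other keyword in that bucket should still refer to it.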
Optionally, the second processing unit is specifically configured to, for the at least one query keyword, perform the following operations in a parallel manner, respectively:
based on the node identification and the file identification of each file information of a query keyword, respectively sending a data acquisition request to each corresponding storage node, and receiving a request response sent by each storage node.
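One way to issue these requests in parallel is a thread pool, sketched below. The sketch assumes that a hypothetical `fetch(node_id, file_id)` callable stands in for the data acquisition request and its response; it submits one task per piece of file information, so requests overlap both across keywords and within a keyword, which is one possible reading of "in a parallel manner".

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(file_infos_by_keyword, fetch, max_workers=8):
    """Send one data acquisition request per (node_id, file_id) pair concurrently.

    file_infos_by_keyword: {keyword: set of (node_id, file_id)}
    fetch: hypothetical callable(node_id, file_id) -> storage file content
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit everything first so that requests for different keywords overlap.
        futures = {
            keyword: [pool.submit(fetch, node_id, file_id)
                      for node_id, file_id in sorted(infos)]
            for keyword, infos in file_infos_by_keyword.items()
        }
        # Collect results in submission order; result() re-raises any fetch error.
        return {keyword: [f.result() for f in futs] for keyword, futs in futures.items()}
```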
Optionally, the extracting unit is further configured to extract an initial keyword and file information of a file stored in each storage node;
the first processing unit is further configured to, for each initial keyword, perform the following steps:
acquiring file information of each storage file containing one initial keyword aiming at the initial keyword;
establishing association relations between the initial keywords and file information of each storage file respectively, and storing the association relations to initial independent inverted files of the initial keywords;
obtaining respective initial independent inverted files of the initial keywords, and respectively executing the following steps for the different preset mapping methods:
combining the initial independent inverted files of at least one initial keyword with the same mapping value based on the respective mapping value of each initial keyword under the preset mapping method to obtain initial combined inverted files;
And storing the initial merging and inverted file into a mapping area associated with the corresponding mapping value.
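The initial construction described above can be sketched in two passes: first build an initial independent inverted file per keyword, then, for each preset mapping method, merge the files of keywords that share a mapping value. The sketch assumes whitespace tokenisation as the keyword extraction step and salted SHA-256 hashes as the mapping methods; `bucket_of`, `build_independent` and `build_regions` are hypothetical names.

```python
import hashlib
from collections import defaultdict


def bucket_of(keyword, method, num_buckets):
    """Hypothetical preset mapping method: hash salted with the method index."""
    digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_buckets


def build_independent(files):
    """files: {(node_id, file_id): text}. Returns keyword -> set of file information."""
    independent = defaultdict(set)
    for file_info, text in files.items():
        for keyword in set(text.lower().split()):  # assumed keyword extraction
            independent[keyword].add(file_info)
    return independent


def build_regions(independent, num_methods=3, num_buckets=64):
    """Merge independent inverted files whose keywords share a mapping value."""
    regions = [defaultdict(set) for _ in range(num_methods)]
    for keyword, infos in independent.items():
        for method in range(num_methods):
            regions[method][bucket_of(keyword, method, num_buckets)] |= infos
    return regions
```

After construction, the computing node only needs to keep `num_methods * num_buckets` combined inverted files rather than one file per keyword.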
Optionally, the first processing unit is further configured to,
acquiring the number of keywords of initial keywords contained in each storage file and the query probability of each initial keyword, wherein the query probability is determined according to the historical query condition of the initial keywords;
determining the keyword query conditions of each storage file according to the query probability of each initial keyword, wherein the keyword query conditions are determined according to the query probability of the initial keywords not contained in the corresponding storage file;
wherein a functional (curve) relationship exists between, on the one hand, the number of keywords and the keyword query condition of each storage file, the number of mapping areas and the number of preset mapping methods and, on the other hand, the expected comprehensive query error rate, and the comprehensive query error rate characterizes the error rate of the data queried by the computing node;
and when the expected comprehensive query error rate meets a preset value condition, acquiring the different preset mapping methods that satisfy the functional relationship, wherein the number of methods is smaller than the number of areas.
Optionally, the second processing unit is specifically configured to,
determining the total number of the received storage files according to the storage files carried in each request response of each query keyword;
determining the sampling number according to the total number of the files, the comprehensive query error rate and a set probability value, wherein the set probability value is the ratio of the number of history-related files to the number of history samples, and the number of history-related files is the number of stored files which are contained in the stored files of the number of history samples and are related to the history query keywords;
according to the sampling number, randomly taking out the sampling number storage files from the received storage files;
and generating the data query response according to the sampled storage files, and returning it to the client.
In a third aspect, an embodiment of the present application provides a computer device, including a processor and a memory, where the memory stores a computer program, and when the computer program is executed by the processor, causes the processor to execute any one of the data query methods in the first aspect.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium, which includes a computer program, where the computer program is configured to cause a computer device to execute any one of the data query methods of the first aspect, when the computer program is run on the computer device.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program stored in a computer readable storage medium; when a processor of a computer device reads the computer program from a computer-readable storage medium, the processor executes the computer program, so that the computer device performs any one of the data query methods of the first aspect.
The application has the following beneficial effects:
The data query method, device, computer equipment and storage medium are applicable to a computing node in a cloud storage system that comprises a client, computing nodes and storage nodes. The computing node receives the data query request of the client, parses it to obtain the query keywords, and then obtains the file information of the files containing the query keywords. For each piece of file information, it generates a data acquisition request and sends it to the storage node corresponding to the node identifier in that file information. The storage node obtains the file information from the data acquisition request, looks up the corresponding storage file according to the file identifier in the file information, and returns the storage file to the computing node. The computing node receives the storage files returned by the storage nodes, generates a data query response based on them and returns it to the client, completing the data query.
In the related art, after a server receives the data query request of the client, the same server directly extracts the query keywords, looks up the file addresses associated with the query keywords, obtains the storage files according to the file addresses, and returns the storage files to the client. In the present application, by contrast, an independent inverted file is constructed in advance for each keyword. After obtaining the query keywords from a data query request, the computing node obtains the independent inverted file of each query keyword and thus the file information of that keyword, generates a data acquisition request according to the file information, determines the storage node to which the storage file belongs based on the node identifier in the file information, and sends the data acquisition request to that storage node, which obtains the corresponding storage file based on the file identifier. The computing node therefore communicates only with the relevant storage nodes, which avoids consuming unnecessary communication resources and incurring unnecessary communication cost.
When the number of keywords is huge, the number of independent inverted files is correspondingly large, and storing the associations between a large number of keywords and their independent inverted files in a computing node burdens the performance of the computing node. Therefore, using preset mapping methods and mapping areas, the independent inverted files of the keywords that one preset mapping method maps to the same mapping area are combined into a combined inverted file. For each query keyword, a combined inverted file can then be obtained from the corresponding mapping area under each of the different preset mapping methods, and intersecting these combined inverted files yields the independent inverted file.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the application. The objectives and other advantages of the application will be realized and attained by the structure particularly pointed out in the written description and claims thereof as well as the appended drawings.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the related art, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only embodiments of the present application, and other drawings may be obtained according to the provided drawings without inventive effort for those skilled in the art.
Fig. 1 is a schematic diagram of an application scenario in which a cloud storage technology adopts integration of computing and storage in a related technology provided by an embodiment of the present application;
fig. 2 is an optional schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 3 is a schematic flow chart of a data query method according to an embodiment of the present application;
fig. 4 is a schematic diagram of a method for mapping query keywords to corresponding mapping regions according to a preset mapping method provided in an embodiment of the present application;
FIG. 5 is a schematic diagram of a relationship between a query keyword and different preset mapping methods according to an embodiment of the present application;
FIG. 6 is a schematic diagram of a method for obtaining individual inverted files of query keywords according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a multi-layer hash map according to an embodiment of the present application;
FIG. 8 is a simplified schematic diagram of a merging inverted file intersection process according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a relationship between a separate inverted list, hash function, bucket, and merged inverted list of query keywords according to an embodiment of the present application;
FIG. 10 is a schematic diagram illustrating communications among a client, a computing node, and a storage node according to an embodiment of the present application;
FIG. 11 is a flowchart illustrating a method for adding a storage file according to an embodiment of the present application;
FIG. 12 is a flowchart illustrating a method for adding a storage file according to an embodiment of the present application;
FIG. 13 is a schematic diagram of a method for adding new keywords according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a method for adding new keywords according to an embodiment of the present application;
FIG. 15 is a flowchart of a method for deleting a storage file according to an embodiment of the present application;
FIG. 16 is a schematic diagram of a method for deleting a storage file according to an embodiment of the present application;
FIG. 17 is a schematic diagram of a method for deleting a storage file according to an embodiment of the present application;
FIG. 18 is a flowchart of a method for merging and inverted file acquisition according to an embodiment of the present application;
FIG. 19 is a schematic diagram of a method for obtaining statistical information of a stored file in a corpus according to an embodiment of the present application;
FIG. 20 is a schematic diagram of a search engine acquisition method according to an embodiment of the present application;
fig. 21 is a flow chart of a method for obtaining different preset mapping methods according to an embodiment of the present application;
fig. 22 is a schematic diagram of a data query device according to an embodiment of the present application;
FIG. 23 is a schematic diagram showing a hardware configuration of a computer device to which the embodiment of the present application is applied;
fig. 24 is a schematic diagram showing a hardware configuration of another computer device to which the embodiment of the present application is applied.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, not all embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application. Embodiments of the application and features of the embodiments may be combined with one another arbitrarily without conflict. Also, while a logical order is depicted in the flowchart, in some cases, the steps depicted or described may be performed in a different order than presented herein.
It will be appreciated that, in the following detailed description of the application, when embodiments involving data such as storage files, file information and data query requests are applied to a particular product or technology, the relevant permissions or consent must be obtained, and the collection, use and processing of the relevant data must comply with the relevant laws, regulations and standards of the relevant countries and regions. For example, where relevant data is required, volunteers may be recruited and sign an agreement authorizing the use of their data, after which the data of these volunteers is used; alternatively, the embodiments may be implemented within an organization that has granted authorization, using the data of the organization's internal members and making the relevant recommendations to those members; alternatively, the data used in the implementation may be simulated data, for example simulated data generated in a virtual scene.
In order to facilitate understanding of the technical solution provided by the embodiments of the present application, some key terms used in the embodiments of the present application are explained here:
search engine: the system is used for collecting information from the Internet by using a specific computer program according to a certain strategy, providing search service for users after organizing and processing the information, and displaying the searched related information to the users. The search engine is a search technology working on the Internet, and aims to improve the speed of acquiring and collecting information and provide a better network use environment for people. Search engines are roughly divided into four categories, namely full text search engines, meta search engines, vertical search engines, and directory search engines, in terms of functionality and principle.
Inverted index: arises from the practical need to find records according to the values of their attributes. Each entry in such an index includes an attribute value and the addresses of the records having that attribute value. Because the position of a record is determined by the attribute value, rather than the attribute value being determined by the record, the index is called an inverted index. It is a concrete storage form of the "keyword-file information matrix" (the file information of each storage file associated with a keyword), through which the file information of the storage files containing a keyword can be obtained quickly from the keyword. An inverted index mainly consists of two parts: a "word (keyword) dictionary" and an "inverted file". A file carrying an inverted index is called an inverted index file, or inverted file for short. In a data query scenario, the forward index is a mapping from documents to keywords (given a document, find its keywords), and the inverted index is a mapping from keywords to documents (given a keyword, find the documents). For example, an inverted index uses a character or a word as the keyword to index on; if an index table is used, the table entry for a keyword records all storage files in which that character or word appears, each entry being a field that records the file ID of a storage file and the positions of the keyword in that storage file.
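As a small illustration of the two directions of mapping, the sketch below builds a forward index (file to keywords) and a positional inverted index (keyword to file IDs and positions) from a toy collection. Whitespace tokenisation and the function name `build_indexes` are assumptions made for the example.

```python
from collections import defaultdict


def build_indexes(docs):
    """docs: {file_id: text}. Returns (forward index, positional inverted index)."""
    forward = {}                                    # file_id -> sorted keywords
    inverted = defaultdict(lambda: defaultdict(list))  # keyword -> file_id -> positions
    for file_id, text in docs.items():
        tokens = text.lower().split()               # assumed tokenisation
        forward[file_id] = sorted(set(tokens))
        for position, token in enumerate(tokens):
            inverted[token][file_id].append(position)
    return forward, inverted


docs = {"f1": "cloud storage query", "f2": "cloud query cloud"}
forward, inverted = build_indexes(docs)
print(forward["f2"])            # ['cloud', 'query']
print(dict(inverted["cloud"]))  # {'f1': [0], 'f2': [0, 2]}
```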
Word dictionary (lexicon): the usual index unit of a search engine is a word (keyword). The word dictionary is the set of strings formed by all words that appear in the collection of storage files; each index item in the word dictionary records some information about the word itself and a pointer to the file information (the file address of the storage file, the file identifier, and the position of the keyword within the storage file).
Corpus: a large-scale electronic text library, scientifically sampled and processed, that stores language material that has actually occurred in real use of the language.
B+ tree: a tree data structure that is a variant of the B-tree. The leaf nodes of a B+ tree store the keys and the addresses of the corresponding records, and the layers above the leaf nodes serve as an index. B+ trees are typically used in databases and in operating-system file systems. A B+ tree keeps data ordered and stable, and insertion and modification have stable logarithmic time complexity. Elements of a B+ tree are inserted bottom-up, in contrast to a binary tree.
Skip list: a randomized data structure built on linked lists. It can be regarded as an alternative to balanced binary trees, with performance comparable to red-black trees and AVL trees, but its principle is much simpler; it is used in both Redis and LevelDB. A skip list uses randomization to decide which nodes of the linked list receive additional forward pointers and how many. The head node must have enough pointer fields for the maximum possible number of levels, whereas the tail node needs no pointer fields. The main strength of a skip list is fast element lookup: it accelerates queries by introducing multiple index levels, each of which skips over some elements. Insertion and deletion are also efficient, which makes skip lists suitable for scenarios requiring fast lookup, such as inverted indexes in search engines.
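As an illustration of the randomized levels and forward pointers described above, the following is a minimal Python sketch of a skip list supporting insertion and lookup. The class names, the level cap of 16, and the promotion probability of 0.5 are assumptions chosen for the example.

```python
import random


class _Node:
    def __init__(self, key, level):
        self.key = key
        self.forward = [None] * level  # one forward pointer per level


class SkipList:
    MAX_LEVEL = 16  # assumed cap on the number of levels

    def __init__(self, p=0.5, seed=None):
        self._rng = random.Random(seed)
        self._p = p
        self._level = 1
        # The head node carries the maximum number of pointer fields.
        self._head = _Node(None, self.MAX_LEVEL)

    def _random_level(self):
        # Promote a node to the next level with probability p.
        level = 1
        while level < self.MAX_LEVEL and self._rng.random() < self._p:
            level += 1
        return level

    def insert(self, key):
        update = [self._head] * self.MAX_LEVEL
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
            update[i] = node  # last node before `key` on level i
        nxt = node.forward[0]
        if nxt is not None and nxt.key == key:
            return  # key already present
        level = self._random_level()
        self._level = max(self._level, level)
        new = _Node(key, level)
        for i in range(level):
            new.forward[i] = update[i].forward[i]
            update[i].forward[i] = new

    def __contains__(self, key):
        node = self._head
        for i in range(self._level - 1, -1, -1):
            while node.forward[i] is not None and node.forward[i].key < key:
                node = node.forward[i]
        node = node.forward[0]
        return node is not None and node.key == key

    def __iter__(self):
        node = self._head.forward[0]
        while node is not None:
            yield node.key
            node = node.forward[0]


skip_list = SkipList(seed=1)
for key in [30, 10, 20, 10]:
    skip_list.insert(key)
print(list(skip_list))  # [10, 20, 30]
print(20 in skip_list, 25 in skip_list)  # True False
```

The search starts at the highest level and drops down one level whenever the next key would overshoot, which is what lets upper levels skip over elements.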
Pointer: a pointer is a memory address, and a pointer variable is a variable that stores a memory address. Under the same CPU architecture, pointer variables of different types occupy memory units of the same length, whereas variables that store data occupy memory of different lengths depending on the data type. With pointers, a program can operate not only on the data itself but also on the address of the variable that stores the data. A pointer describes the location of data in memory; it identifies an entity that occupies storage space, or a relative offset from the start of that space. In C/C++, "pointer" usually means a pointer variable whose content is the first address of the object it points to; that object may be a variable (a pointer variable is itself a variable), an array, a function, or any other entity occupying storage space.
Protocol Buffers: a data description language developed by Google that, similarly to XML, can serialize structured data and can be used for data storage, communication protocols, and the like.
Prior probability: a probability obtained from past experience and analysis, for example via the law of total probability; it usually appears as the probability of the "cause" in problems that reason from cause to effect. In Bayesian statistical inference, the prior probability distribution of an uncertain quantity is the probability distribution that expresses the degree of belief about that quantity before some evidence is taken into account. For example, a prior probability distribution may represent the relative proportions of voters who will vote for a particular candidate in a future election. The unknown quantity may be a parameter of a model or a latent variable.
Boolean query: a query that joins terms using the AND, OR, and NOT operators.
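Over an inverted index, the three operators correspond to set intersection, union, and difference on the file sets of the terms. The sketch below uses hypothetical keywords and file IDs.

```python
# Hypothetical file sets per keyword and the set of all stored files.
postings = {
    "hello": {"f1", "f2", "f3"},
    "world": {"f1", "f3"},
    "system": {"f2"},
}
all_files = {"f1", "f2", "f3", "f4"}

hello_and_world = postings["hello"] & postings["world"]   # AND
hello_or_system = postings["hello"] | postings["system"]  # OR
hello_not_world = postings["hello"] - postings["world"]   # AND NOT
not_hello = all_files - postings["hello"]                 # NOT

print(sorted(hello_and_world))  # ['f1', 'f3']
print(sorted(hello_not_world))  # ['f2']
print(sorted(not_hello))        # ['f4']
```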
Elasticsearch (distributed search): a full-text search engine based on the Lucene library that can run in a distributed architecture. It supports complex search requests and provides real-time search and data analysis functions.
Solr: a standalone enterprise-level search application server that exposes a Web-service-like API. A user can submit an XML file in a prescribed format to the search engine server through an HTTP request to generate an index, and can issue a search request through an HTTP GET operation and receive the result in XML format. Solr is an open-source search platform based on Lucene; it provides distributed search and indexing functionality that can be used to build efficient search applications.
The technical solution of the embodiments of the present application relates to cloud technology. Cloud technology refers to a hosting technology that unifies a series of resources such as hardware, software, and networks within a wide area network or local area network to realize the computation, storage, processing, transmission, and sharing of data. Cloud technology is a general term for the network technology, information technology, integration technology, management platform technology, application technology, and so on that are applied under the cloud computing business model; these can form a resource pool that is used on demand, flexibly and conveniently. Background services of technical network systems, such as video websites, picture websites, and portal websites, require large amounts of computing and storage resources. With the rapid development and application of the Internet industry, every item may in the future have its own identification mark that must be transmitted to a background system for logical processing; data of different levels will be processed separately, and all kinds of industry data will need strong backend system support, which can only be realized through cloud computing. Cloud computing technology thus becomes an important support for cloud technology.
Cloud computing: in the narrow sense, cloud computing refers to the delivery and usage mode of IT infrastructure, meaning that the required resources are obtained over a network in an on-demand, easily scalable manner; in the broad sense, it refers to the delivery and usage mode of services, meaning that the required services are obtained over a network in an on-demand, easily scalable manner. Such services may be IT, software, or Internet related, or other services. Cloud computing is the product of the development and fusion of traditional computer and network technologies such as grid computing, distributed computing, parallel computing, utility computing, network storage technologies, virtualization, and load balancing. With the development of the Internet, real-time data streams, and the diversification of connected devices, and driven by demand for search services, social networks, mobile commerce, open collaboration, and the like, cloud computing has developed rapidly. For example, data query applies the distributed computing and network storage technologies of cloud computing: based on distributed computing, the computing logic of the data query is executed by computing nodes, and based on network storage, data is stored and provided to the computing nodes.
Cloud storage (cloud storage) is a new concept that extends and develops in the concept of cloud computing, and a distributed cloud storage system (hereinafter referred to as a storage system for short) refers to a storage system that integrates a large number of storage devices (storage devices are also referred to as storage nodes) of various types in a network to work cooperatively through application software or application interfaces through functions such as cluster application, grid technology, and a distributed storage file system, so as to provide data storage and service access functions for the outside. In the distributed computing technology, the computing node cluster can be applied to a large-scale data center, and can better adapt to mass data storage requirements of the large-scale data center and upload, query and download tasks of a large amount of data by adopting a cloud storage technology.
At present, the storage method of the storage system is as follows: when creating logical volumes, each logical volume is allocated a physical storage space, which may be a disk composition of a certain storage device or of several storage devices. The client stores data on a certain logical volume, that is, the data is stored on a file system, the file system divides the data into a plurality of parts, each part is an object, the object not only contains the data but also contains additional information such as an Identity (ID) of the data, the file system writes each object into a physical storage space of the logical volume, and the file system records storage position information of each object, so that when the client requests to access the data, the file system can enable the client to access the data according to the storage position information of each object.
The process by which the storage system allocates physical storage space to a logical volume is as follows: the physical storage space is divided into stripes in advance according to an estimate of the capacity of the objects to be stored on the logical volume (such estimates usually leave a large margin over the capacity of the objects actually stored) and according to the redundant array of independent disks (RAID) group; a logical volume can be understood as a stripe, and physical storage space is thereby allocated to the logical volume.
Big data (Big data) refers to a data set which cannot be captured, managed and processed by a conventional software tool within a certain time range, and is a massive, high-growth-rate and diversified information asset which needs a new processing mode to have stronger decision-making ability, insight discovery ability and flow optimization ability. With the advent of the cloud age, big data has attracted more and more attention, and special techniques are required for big data to effectively process a large amount of data within a tolerant elapsed time. Technologies applicable to big data include massively parallel processing databases, data mining, distributed file systems, distributed databases, cloud computing platforms, the internet, and scalable storage systems.
The solution provided by the embodiments of the present application relates to cloud computing, cloud storage, and big data technologies within cloud technology. When the storage files in the storage node cluster are obtained, big data analysis techniques may be used to analyze each storage file and obtain statistics such as the initial keywords it contains, the number of keywords, the file identifier, the storage address, the file length, and the number of storage files that contain a given initial keyword. The query probability of each initial keyword is determined from its historical query records, and the keyword query condition of each storage file (determined from the query probabilities of the initial keywords not contained in that storage file) is then derived from these query probabilities. A functional relation is established between, on the one hand, the keyword count and keyword query condition of each storage file and, on the other hand, the number of mapping areas and the number of preset mapping methods; this functional relation in turn has a curve relation with the expected overall query error rate.
Based on the functional relation involving the number of mapping areas and the number of preset mapping methods, together with the data analysis statistics and the storage addresses of the storage files, computing logic is designed and deployed to the computing node cluster. After a data query request from a client is received, a computing node can, using cloud computing technology, determine the file information of the storage files corresponding to the query keywords in the request and, using cloud storage technology, obtain those storage files from the storage nodes to generate the data query response for the client.
The following briefly describes the design concept of the embodiment of the present application:
Search engine: a system for retrieving the relevant documents that contain the query terms provided by the user. Commonly used search engines typically rely on some form of index to find relevant files quickly, for example indexes in the form of skip lists, B+ trees, or learned indexes.
In the related art, a computing node stores the storage files in storage nodes, and each storage node builds a file address index, in the form of a skip list, a B+ tree, or the like, over the storage files it holds. After receiving a data query request, the computing node sends the file identifier corresponding to the query keyword in the request to every storage node. Each storage node traverses its own file address index, obtains the storage file corresponding to the file identifier, and returns it to the computing node, which then returns it to the client. In this mode, where computation and storage are separated, the index traversal must still be executed even on storage nodes that do not hold a storage file corresponding to the query keyword, which incurs considerable communication cost and index traversal cost.
To traverse the index quickly and reduce inter-node communication cost, computation and storage have to be placed together on the same server. That is, the computation of the computing node and the storage files of the storage node reside on the same server, so that after the server receives a data query request from the client it can directly extract the query keyword, look up the file address associated with the query keyword, obtain the storage file at that file address, and return the storage file to the client. However, even when the client issues few data query requests, the cloud storage service must still keep all servers running and computing, which consumes substantial server resources, degrades server performance, and affects the quality of the data query service.
In view of this, the embodiments of the present application provide a data query method, an apparatus, a computer device, and a storage medium, applicable to a computing node in a cloud storage system that comprises a client, computing nodes, and storage nodes. The computing node receives a data query request from the client, parses it to obtain the query keywords, and then obtains each piece of file information that involves the query keywords. For each piece of file information it generates a data acquisition request and sends it to the storage node identified by the node identifier in that file information. The storage node extracts the file information from the data acquisition request, looks up the corresponding storage file according to the file identifier in the file information, and returns the file to the computing node. The computing node receives the storage files returned by the storage nodes, generates a data query response based on them, and returns the response to the client, completing the data query.
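The routing of data acquisition requests by node identifier can be sketched as follows. This is a minimal illustration under stated assumptions: storage nodes are modeled as in-memory dictionaries, the names `FileInfo`, `build_requests`, and `fetch` are hypothetical, and file identifiers are batched per storage node for brevity, whereas the text describes one request per piece of file information.

```python
from collections import defaultdict
from typing import NamedTuple


class FileInfo(NamedTuple):
    node_id: str  # identifier of the storage node holding the file
    file_id: str  # identifier of the storage file on that node


def build_requests(file_infos):
    """Group file identifiers by the storage node that must be contacted."""
    requests = defaultdict(list)
    for info in file_infos:
        requests[info.node_id].append(info.file_id)
    return dict(requests)


def fetch(requests, storage_nodes):
    """Simulate each addressed storage node returning the requested files."""
    files = {}
    for node_id, file_ids in requests.items():
        node = storage_nodes[node_id]
        for file_id in file_ids:
            files[file_id] = node[file_id]
    return files


# Hypothetical storage nodes, modeled as {file_id: content} dictionaries.
storage_nodes = {
    "node-a": {"f1": "hello world", "f9": "unrelated"},
    "node-b": {"f2": "hello system"},
    "node-c": {"f7": "never contacted"},
}
# Hypothetical file information obtained for the query keyword "hello".
hello_infos = [FileInfo("node-a", "f1"), FileInfo("node-b", "f2")]

requests = build_requests(hello_infos)
response = fetch(requests, storage_nodes)
print(requests)  # {'node-a': ['f1'], 'node-b': ['f2']}
print(response)  # {'f1': 'hello world', 'f2': 'hello system'}
```

In this sketch `node-c` receives no request at all, which is the communication saving the paragraph describes.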
In the related art, after receiving a data query request from the client, the server directly extracts the query keyword, looks up the file address associated with the query keyword, obtains the storage file at that file address, and returns it to the client. In contrast, the method of the present application builds an individual inverted file for each keyword in advance. The query keyword is obtained from the data query request, the individual inverted file of the query keyword is obtained, and the file information of the query keyword is read from it. A data acquisition request is generated from the file information, the storage node to which the storage file belongs is determined from the node identifier in the file information, and the data acquisition request is sent to that storage node, which retrieves the corresponding storage file based on the file identifier. Because the computing node communicates only with the relevant storage nodes, unnecessary consumption of communication resources and unnecessary communication cost are avoided.
When the number of keywords is huge, the number of individual inverted files is correspondingly large, and storing the association between a large number of keywords and their individual inverted files in the computing node harms the computing node's performance. Therefore, using preset mapping methods and mapping areas, the individual inverted files of all keywords that one preset mapping method maps to the same mapping area are merged to obtain a merged (combined) inverted file. For each query keyword, a merged inverted file can then be obtained from the corresponding mapping area under each of the different preset mapping methods, and the intersection of these merged inverted files is taken to obtain the individual inverted file of the query keyword.
The preferred embodiments of the present application will be described below with reference to the accompanying drawings of the specification, it being understood that the preferred embodiments described herein are for illustration and explanation only, and not for limitation of the present application, and embodiments of the present application and features of the embodiments may be combined with each other without conflict.
Fig. 2 is a schematic diagram of an application scenario according to an embodiment of the present application. The application scenario diagram includes any one of a plurality of terminal devices 210, any one of a plurality of servers 220 (computing nodes), and any one of a plurality of servers 230 (storage nodes).
In the embodiment of the present application, the terminal device 210 includes, but is not limited to, a mobile phone, a tablet computer, a notebook computer, a desktop computer, an electronic book reader, an intelligent voice interaction device, an intelligent home appliance, a vehicle-mounted terminal, and the like; the terminal device may be provided with a client related to the data query service, where the client may be software (such as a browser, communication software, etc.), or may be a web page, an applet, etc., and the servers 220 and 230 are background servers corresponding to the software or the web page, the applet, etc., or are background servers specially used for providing cloud storage service to the client, which is not particularly limited in the present application. The server 220 and the server 230 may be independent physical servers, or may be a server cluster or a distributed system formed by a plurality of physical servers, or may be cloud servers for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (Content Delivery Network, CDN), basic cloud computing services such as big data and artificial intelligence platforms, and the like.
It should be noted that, the data query method in the embodiment of the present application may be executed by a computer device, where the computer device may be the server 220, and when a client in the terminal device 210 initiates a data query request, the client sends the data query request to the server 220. The server 220 is provided with different preset mapping methods and mapping areas corresponding to the preset mapping methods respectively, and each mapping area is provided with a combined inverted file, wherein each combined inverted file is obtained by combining a plurality of independent inverted files, and the independent inverted files comprise: each piece of file information related to the query keyword comprises a node identifier and a file identifier of a storage node where a corresponding storage file is located; after receiving the data query request, the server 220 parses at least one query keyword in the data query request, determines each mapping area associated with the one query keyword by adopting different preset mapping methods for each query keyword, obtains corresponding combined inverted files from each mapping area, performs intersection processing on the obtained combined inverted files to obtain individual inverted files of the one query keyword, generates corresponding data acquisition requests based on each file information in the obtained individual inverted files, respectively sends the data acquisition requests to the corresponding servers 230, receives the request responses sent by the servers 230, and generates a data query response according to the storage files carried in each request response of each received query keyword, and returns the data query response to the client of the terminal device 210.
It should be noted that Fig. 2 is only an example; the numbers of terminal devices and servers and the communication manner between them are not limited in practice and are not specifically limited in the embodiments of the present application.
In addition, the data query in the embodiment of the application can be applied to various scenes, such as short video query, news query, novel query, literature query, shopping search query and the like.
The data query method provided by the exemplary embodiment of the present application will be described below with reference to the accompanying drawings in conjunction with the application scenario described above, and it should be noted that the application scenario described above is only shown for the convenience of understanding the spirit and principles of the present application, and the embodiment of the present application is not limited in any way in this respect.
Referring to fig. 3, a flow chart of a data query method provided by an embodiment of the present application is suitable for a computing node in a cloud storage system, where the cloud storage system includes the computing node and a storage node, and the storage node includes a plurality of storage files, and here, a server (computing node) is taken as an execution body for illustration, and a specific implementation flow of the method is as follows:
Step 301, extracting at least one query keyword from a data query request sent by a client.
In one embodiment, if the client is shopping software, the query keywords in the data query request may be "jacket", "summer", and other keywords.
In one embodiment, if the client is document related software, the query keyword in the data query request may be a keyword related to document related content, such as a polymer material, a medium, and the like. It should be noted that, the specific application scenario of the client is not limited herein.
In one embodiment, a typical flow in which a client uses a search engine includes performing a search based on query keywords. For example, if the client receives the strings "hello world" and "hello system", the search engine parses them into the keywords "hello" and "world", and "hello" and "system", respectively; these keywords may be used to generate data query requests that are sent to the computing node.
In one embodiment, a developer may set different parsing rules to convert the "hello world" and "hello system" received by the client into a format recognizable by the computing node, generate a data query request, and send it to the computing node.
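The parsing of query strings into keywords can be sketched as below. The whitespace-based rule and the function name `extract_keywords` are assumptions for illustration; the text leaves the actual parsing rules to the developer.

```python
def extract_keywords(query_strings):
    """Split each query string into keywords, keeping first-seen order."""
    keywords = []
    for query in query_strings:
        # Hypothetical parsing rule: lower-case and split on whitespace.
        for word in query.lower().split():
            if word not in keywords:
                keywords.append(word)
    return keywords


print(extract_keywords(["hello world"]))                  # ['hello', 'world']
print(extract_keywords(["hello world", "hello system"]))  # ['hello', 'world', 'system']
```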
Step 302, respectively executing the following operations for the at least one query keyword:
Step 3021, determining, using different preset mapping methods respectively, the mapping areas associated with a query keyword; obtaining the corresponding merged inverted file from each of these mapping areas; and intersecting the obtained merged inverted files to obtain the individual inverted file of the query keyword. Each merged inverted file is obtained by merging that individual inverted file with the individual inverted files of other keywords. The individual inverted file of a query keyword contains the file information associated with the query keyword, and each piece of file information contains the node identifier of the storage node where the corresponding storage file is located and the file identifier of that storage file. The different preset mapping methods are each associated with a plurality of mapping areas, and each mapping area contains one merged inverted file.
In one embodiment, one individual inverted file corresponds to one query keyword, and each piece of file information in the individual inverted file corresponds to a storage file that contains that query keyword. For example, the individual inverted file of the query keyword "system" contains the file information of 10000 storage files.
In one embodiment, one merged inverted file corresponds to a plurality of query keywords, and each piece of file information in the merged inverted file corresponds to a storage file that contains one or more of those query keywords. For example, suppose the individual inverted file of the query keyword "system" contains the file information of 10000 storage files, that of the query keyword "hello" contains the file information of 8000 storage files, and that of the query keyword "world" contains the file information of 50000 storage files. The merged inverted file of the query keywords "system", "hello", and "world" contains the file information from all three individual inverted files, that is, up to 68000 pieces of file information. If 1000 of these pieces of file information are duplicates, meaning that the corresponding storage files contain at least 2 of the query keywords, the merged inverted file contains 67000 pieces of file information. It follows that the file information in a merged inverted file need not be grouped by individual inverted file; the file information from each individual inverted file may be scattered throughout the merged inverted file.
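The counts in this example can be checked with a small set computation. The file ID ranges below are hypothetical and are chosen only so that the sizes and the 1000 duplicates match the figures in the text.

```python
# Hypothetical file IDs chosen so that the sizes match the example above.
system_files = set(range(0, 10_000))      # 10000 files contain "system"
hello_files = set(range(9_500, 17_500))   # 8000 files contain "hello"
world_files = set(range(17_000, 67_000))  # 50000 files contain "world"

total_entries = len(system_files) + len(hello_files) + len(world_files)
merged_file = system_files | hello_files | world_files  # union removes duplicates
duplicates = total_entries - len(merged_file)

print(total_entries, duplicates, len(merged_file))  # 68000 1000 67000
```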
Based on the method flow in Fig. 3, an embodiment of the present application provides a method for obtaining merged inverted files. In step 3021, determining the mapping areas associated with a query keyword using different preset mapping methods respectively, and obtaining the corresponding merged inverted file from each mapping area, includes:
for the different preset mapping methods, the following operations are respectively executed:
mapping the query keyword by adopting a preset mapping method to obtain a mapping value;
determining the mapping area associated with the mapping value, and obtaining the merged inverted file from that mapping area, wherein, under one preset mapping method, the keywords of all individual inverted files merged into the merged inverted file of a mapping area have the same mapping value, and the merged inverted files in the plurality of mapping areas associated with one preset mapping method together contain the file information of the storage files of every storage node.
In one embodiment, the preset mapping method may be a remainder (modulo) function, a hash function, or the like; the type of preset mapping method is not limited and may be set as required. The area identifier of a mapping area may be determined from the mapping value of the preset mapping method. If a preset mapping method can only produce the 10 integer mapping values 0 to 9, it may have 10 mapping areas whose area identifiers are the integers 0 to 9; after the query keyword is converted into numeric form, the mapping value computed by the preset mapping method determines the mapping area (area identifier) to which the query keyword corresponds. Fig. 4 is a schematic diagram, provided in an embodiment of the present application, of mapping query keywords to their mapping areas by a preset mapping method. Preset mapping method a corresponds to 10 mapping areas whose area identifiers are the integers 0 to 9 in order, and each mapping area contains one merged inverted file for preset mapping method a: mapping area 0 contains merged inverted file a-0, mapping area 1 contains merged inverted file a-1, and so on up to mapping area 9, which contains merged inverted file a-9. The mapping value of query keyword 1 under preset mapping method a is 0, so its merged inverted file is a-0; the mapping value of query keyword 2 is 2, so its merged inverted file is a-2; and the mapping value of query keyword 3 is 1, so its merged inverted file is a-1.
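A single preset mapping method with 10 mapping areas can be sketched as follows. The numeric conversion of a keyword (its UTF-8 bytes read as one integer), the function names, and the sample file information are assumptions made for illustration; the text does not specify how a keyword is converted into numeric form.

```python
def keyword_to_number(keyword):
    """Hypothetical numeric form of a keyword: its UTF-8 bytes as one integer."""
    return int.from_bytes(keyword.encode("utf-8"), "big")


def mapping_method_a(keyword, num_areas=10):
    """Remainder function: the mapping value doubles as the area identifier."""
    return keyword_to_number(keyword) % num_areas


def build_mapping_areas(individual_files, num_areas=10):
    """Merge the individual inverted files that fall into the same area."""
    areas = [set() for _ in range(num_areas)]
    for keyword, file_infos in individual_files.items():
        areas[mapping_method_a(keyword, num_areas)] |= file_infos
    return areas


# Hypothetical individual inverted files: keyword -> {(node_id, file_id)}.
individual_files = {
    "hello": {("node-a", "f1"), ("node-b", "f2")},
    "world": {("node-a", "f1"), ("node-c", "f3")},
    "system": {("node-b", "f2")},
}
areas = build_mapping_areas(individual_files)
area_of_hello = mapping_method_a("hello")
# The merged inverted file of that area contains everything recorded for "hello".
print(individual_files["hello"] <= areas[area_of_hello])  # True
```

Only the area identifiers and the merged files need to be kept on the computing node; the per-keyword association is replaced by the mapping computation.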
In one embodiment, there may be a plurality of different preset mapping methods, each of which yields one mapping value for each query keyword; query keyword 1, for example, has one mapping value under each of the preset mapping methods. Fig. 5 is a schematic diagram, provided in an embodiment of the present application, of the relation between a query keyword and different preset mapping methods: under preset mapping methods a, b, c, and d, query keyword 1 is mapped to mapping area 0, mapping area 6, mapping area 1, and mapping area 7, respectively.
In one embodiment, the different preset mapping methods of Fig. 4 and Fig. 5, the mapping areas corresponding to each of them, and the merged inverted file contained in each mapping area may together form an inverted index structure in the computing node; the search engine includes this inverted index structure, which is used to obtain the merged inverted files of a query keyword.
In one embodiment, building on the foregoing, Fig. 6 is a schematic diagram, provided in an embodiment of the present application, of a method for obtaining the individual inverted file of a query keyword. The merged inverted files a-0, b-6, c-1, and d-7 are obtained from mapping area 0, mapping area 6, mapping area 1, and mapping area 7, respectively. The computing node takes these four files as the merged inverted file set of query keyword 1 and intersects them to obtain the individual inverted file of query keyword 1. Then, according to the node identifier and file identifier in each piece of file information in that individual inverted file, it generates a data acquisition request to be sent to the storage node identified by the node identifier; the data acquisition request contains the file identifier of the storage file from the corresponding file information.
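The intersection step can be sketched as follows, assuming four preset mapping methods realized as salted SHA-256 hashes taken modulo 10 mapping areas. The salts, function names, and sample data are hypothetical. Note that the intersection always contains the keyword's own file information, but it may also contain extra entries when another keyword shares a mapping area with it under every method; the sketch therefore checks containment rather than equality.

```python
import hashlib


def make_mapping_method(salt, num_areas):
    """Return one preset mapping method: salted SHA-256 modulo num_areas."""
    def method(keyword):
        digest = hashlib.sha256(f"{salt}:{keyword}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % num_areas
    return method


def build_layer(individual_files, method, num_areas):
    """Merge individual inverted files per mapping area for one method."""
    areas = [set() for _ in range(num_areas)]
    for keyword, file_infos in individual_files.items():
        areas[method(keyword)] |= file_infos
    return areas


def query(keyword, layers):
    """Intersect the merged inverted files selected by every mapping method."""
    merged_files = [areas[method(keyword)] for method, areas in layers]
    return set.intersection(*merged_files)


# Hypothetical individual inverted files: keyword -> {(node_id, file_id)}.
individual_files = {
    "hello": {("node-a", "f1"), ("node-b", "f2")},
    "world": {("node-a", "f1"), ("node-c", "f3")},
    "system": {("node-b", "f2")},
    "cloud": {("node-c", "f4")},
}

NUM_AREAS = 10
layers = []
for salt in ("a", "b", "c", "d"):  # stands in for preset mapping methods a-d
    method = make_mapping_method(salt, NUM_AREAS)
    layers.append((method, build_layer(individual_files, method, NUM_AREAS)))

candidates = query("hello", layers)
print(individual_files["hello"] <= candidates)  # True
```

Adding mapping methods or mapping areas makes full collisions less likely, at the cost of storing more merged inverted files.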
In one embodiment, as shown in fig. 6, the processing logic that intersects the merged inverted files of a query keyword to obtain the individual inverted file of the query keyword may be implemented by an intersection index structure, which computes the intersection of the merged inverted files that the inverted index structure returns for the query keyword.
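The lookup described above (one mapping value per preset mapping method, one merged inverted file per mapping area, then an intersection) can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation of the application: the salted SHA-256 mapping, the constant `NUM_AREAS`, and the names `mapping_value`, `build_index`, and `query` are hypothetical, and each merged inverted file is modeled as a plain set of (node identifier, file identifier) pairs.

```python
import hashlib

NUM_AREAS = 8  # assumed number of mapping areas per preset mapping method


def mapping_value(method: str, keyword: str) -> int:
    """Hypothetical preset mapping method: salted SHA-256 reduced modulo NUM_AREAS."""
    digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_AREAS


def build_index(individual_files, methods):
    """Union each keyword's individual inverted file into one mapping area per method."""
    index = {m: [set() for _ in range(NUM_AREAS)] for m in methods}
    for keyword, file_infos in individual_files.items():
        for m in methods:
            index[m][mapping_value(m, keyword)] |= file_infos
    return index


def query(index, keyword):
    """Fetch one merged inverted file per method and intersect them."""
    merged = [areas[mapping_value(m, keyword)] for m, areas in index.items()]
    return set.intersection(*merged)


METHODS = ["a", "b", "c", "d"]
individual_files = {
    "keyword 1": {("storage node 1", "storage file 1"), ("storage node 3", "storage file 3")},
    "keyword 2": {("storage node 1", "storage file 2")},
    "keyword 3": {("storage node 5", "storage file 80")},
}
index = build_index(individual_files, METHODS)
result = query(index, "keyword 1")
```

Because every merged inverted file in this sketch is a union that includes the keyword's own individual inverted file, the intersection never drops a matching file; collisions between keywords can only add extra entries.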
In one embodiment, the intersection index may also be a part of the inverted index structure; for example, the plurality of different preset mapping methods, the mapping areas corresponding to each preset mapping method, the merged inverted file contained in each mapping area, and the intersection calculation logic for the merged inverted files together form the inverted index structure in the computing node. That is, the application does not limit the specific composition of the index structure set in the computing node. For example, the inverted index structure may also include other index structures such as a B tree or a skip list, and the individual inverted files of query keywords whose query frequency is higher than a set query frequency threshold may be directly mounted in such index structures; if a data query request includes a query keyword whose query frequency is higher than the set query frequency threshold, the B tree or skip list index structure is preferentially used to directly query the individual inverted file of that high-frequency query keyword. In an embodiment, if the different preset mapping methods are a plurality of different hash functions, as shown in fig. 7, a multi-layer hash mapping schematic diagram provided in an embodiment of the present application, a plurality of hash functions from hash function a to hash function L are included, and the number of buckets (mapping areas) corresponding to each hash function may be the same or different. Each hash function yields a unique hash value for query keyword 1, and the hash value may be used as the bucket identifier of a bucket of that hash function, so as to determine the bucket corresponding to query keyword 1. For example, under hash function a the hash value of query keyword 1 corresponds to bucket 1, which contains merged inverted file a-1; under hash function b it corresponds to bucket 2, which contains merged inverted file b-2; under hash function c it corresponds to bucket m, which contains merged inverted file c-m; and under hash function L it corresponds to bucket 1, which contains merged inverted file L-1.
In one embodiment, based on the above embodiment, as shown in fig. 7, which also illustrates a method for obtaining the individual inverted file of a query keyword provided in an embodiment of the present application, the corresponding merged inverted file a-1, merged inverted file b-2, merged inverted file c-m, and merged inverted file L-1 are obtained from hash function a-bucket 1, hash function b-bucket 2, hash function c-bucket m, and hash function L-bucket 1, respectively. The computing node takes merged inverted file a-1, merged inverted file b-2, merged inverted file c-m, and merged inverted file L-1 as the merged inverted file set of query keyword 1, performs intersection processing on these merged inverted files to obtain the individual inverted file of query keyword 1, and, according to the node identifier and the file identifier in each piece of file information in that individual inverted file, generates the data acquisition requests to be sent to the corresponding storage nodes, where each data acquisition request contains the file identifiers of the storage files recorded in the individual inverted file of query keyword 1. It should be noted that the above example does not specifically list the mappings of hash function d through hash function (L-1) for query keyword 1 or the merged inverted files of the corresponding buckets; nevertheless, the steps of mapping query keyword 1 with hash function d through hash function (L-1) and obtaining the merged inverted files from the corresponding buckets are still performed. The merged inverted files obtained in this way, together with the listed merged inverted files, form the merged inverted file set of query keyword 1, and intersection processing is performed on this set to obtain the individual inverted file of query keyword 1.
In one embodiment, the plurality of hash functions, the buckets corresponding to each hash function, and the merged inverted file contained in each bucket in fig. 7 may form an inverted index structure in the computing node; that is, the search engine includes an inverted index structure used for obtaining the merged inverted files of a query keyword. In addition, the processing logic that intersects the merged inverted files of a query keyword to obtain the individual inverted file of the query keyword may be implemented by an intersection index structure, which computes the intersection of the merged inverted files returned by the inverted index structure for the query keyword. Alternatively, the intersection index may be a part of the inverted index structure; for example, the plurality of hash functions, the buckets corresponding to each hash function, the merged inverted file contained in each bucket, and the intersection calculation logic for the merged inverted files together form the inverted index structure in the computing node. In one embodiment, based on the above embodiments, fig. 8 provides a simple schematic diagram of the intersection processing of merged inverted files according to an embodiment of the present application: assuming that query keyword 1 obtains merged inverted file 1, merged inverted file 2, merged inverted file 3, and merged inverted file 4 under the different preset mapping methods, the intersection of merged inverted file 1, merged inverted file 2, merged inverted file 3, and merged inverted file 4 is the individual inverted file of query keyword 1.
For example, suppose that merged inverted file 1 contains the individual inverted file 1 (1-1) of query keyword 1, the individual inverted file 2 (2-2) of another keyword 2, and the individual inverted file 3 (3-3) of another keyword 3;
merged inverted file 2 contains the individual inverted file 1 (1-1) of query keyword 1, the individual inverted file 4 (4-4) of another keyword 4, and the individual inverted file 5 (5-5) of another keyword 5;
merged inverted file 3 contains the individual inverted file 1 (1-1) of query keyword 1, the individual inverted file 6 (6-6) of another keyword 6, and the individual inverted file 7 (7-7) of another keyword 7;
merged inverted file 4 contains the individual inverted file 1 (1-1) of query keyword 1 and the individual inverted file 8 (8-8) of another keyword 8.
It can be seen that the intersection of merged inverted file 1, merged inverted file 2, merged inverted file 3, and merged inverted file 4 is [(1-1), (2-2), (3-3)] ∩ [(1-1), (4-4), (5-5)] ∩ [(1-1), (6-6), (7-7)] ∩ [(1-1), (8-8)] = (1-1); that is, the intersection result is the individual inverted file 1 (1-1) of query keyword 1, which is the shaded portion shown in fig. 8.
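The fig. 8 example can be reproduced with ordinary set intersection. In this small sketch the strings such as "1-1" merely stand in for the individual inverted files named in the example, and the variable names are chosen here for illustration.

```python
# Each merged inverted file is modeled as a set of labels, where a label
# such as "1-1" stands for the individual inverted file of one keyword.
merged_1 = {"1-1", "2-2", "3-3"}
merged_2 = {"1-1", "4-4", "5-5"}
merged_3 = {"1-1", "6-6", "7-7"}
merged_4 = {"1-1", "8-8"}

# Only the individual inverted file of query keyword 1 is present in all four.
individual = merged_1 & merged_2 & merged_3 & merged_4
print(individual)  # {'1-1'}
```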
In one embodiment, within a typical search engine, retrieval of relevant documents may be accomplished using an inverted index. An inverted index is a data structure used to quickly identify and retrieve the storage files that contain a query keyword. It is generally composed of two components: an inverted list and a keyword index. An inverted list is a list of the file information (e.g., file identifier: doc 1, node identifier: storage node 1; file identifier: doc 2, node identifier: storage node 1) of the files containing a query keyword (e.g., "hello"). There are as many inverted lists as there are query keywords. The keyword index is a mapping from each query keyword (e.g., "hello") to the location of its associated inverted list. Here, the inverted index may be regarded as an inverted file in the present application; the inverted file corresponding to one query keyword is an individual inverted file, and an inverted file corresponding to a plurality of query keywords is a merged inverted file.
In one embodiment, as shown in fig. 9, a schematic diagram of the relationship among the individual inverted files of query keywords, hash functions, buckets, and merged inverted files provided in an embodiment of the present application, the relationship between the merged inverted files and the hash functions is shown for 4 query keywords w1, w2, w3, and w4 that have different individual inverted files. Specifically, using three layers of hash functions, query keyword w2 is mapped to (layer 1, bucket 2), (layer 2, bucket 2), and (layer 3, bucket 1). It shares a bucket with w3 in the first layer, with w4 in the second layer, and with w1 and w3 in the third layer. Each bucket stores the merged inverted file of the query keywords mapped to it. For example, the merged inverted file of (layer 1, bucket 2) is the union of the individual inverted files of query keywords w2 and w3: {d2, d3} ∪ {d2, d3, d4} = {d2, d3, d4}. Thus, query keyword w2 yields three merged inverted files, whose intersection {d2, d3, d4} ∩ {d2, d3, d4, d5} ∩ {d1, d2, d3, d4} = {d2, d3, d4} is used as the individual inverted file of w2. Because each merged inverted file is a superset of the exact individual inverted file, this result has no false negatives, although it may contain false positives (false negative: a storage file in a storage node that matches the query keyword but does not appear in the query result; false positive: a storage file that is included in the query result but does not contain the query keyword). In addition, in bucket 1 of hash function a, query keyword w1 produces a merged inverted file that contains only the individual inverted file {d1}.
Step 3022, based on the node identifier and the file identifier of each file information, respectively sending a data acquisition request to each corresponding storage node, and receiving a request response sent by each storage node.
In one embodiment, it is assumed that the individual inverted file includes file information 1 to file information 7, where file information 1 includes node identifier: storage node 1 and file identifier: storage file 1; file information 2 includes node identifier: storage node 1 and file identifier: storage file 2; file information 3 includes node identifier: storage node 3 and file identifier: storage file 3; file information 4 includes node identifier: storage node 8 and file identifier: storage file 120; file information 5 includes node identifier: storage node 5 and file identifier: storage file 80; file information 6 includes node identifier: storage node 6 and file identifier: storage file 10; and file information 7 includes node identifier: storage node 7 and file identifier: storage file 40. Then the data acquisition request generated for storage node 1 includes the file identifiers storage file 1 and storage file 2; the data acquisition request generated for storage node 3 includes the file identifier storage file 3; the data acquisition request generated for storage node 5 includes the file identifier storage file 80; the data acquisition request generated for storage node 6 includes the file identifier storage file 10; the data acquisition request generated for storage node 7 includes the file identifier storage file 40; and the data acquisition request generated for storage node 8 includes the file identifier storage file 120. Based on this embodiment, as shown in fig. 10, a communication schematic diagram among a client, computing nodes, and storage nodes provided in an embodiment of the present application, after computing nodes 1 to 5 determine the data acquisition requests to be sent to the storage nodes, they send the data acquisition requests to the corresponding storage nodes. When there are few data query tasks, some computing nodes may be put into a sleep state to reduce computing cost; for example, computing nodes 3 to 5 may be set to the sleep state, which makes it convenient to scale the computing nodes out and in.
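The grouping of file information into one data acquisition request per storage node, as in the example above, can be sketched as follows. The tuple layout and the function name `build_requests` are assumptions made for illustration; an actual request would carry whatever fields the storage node interface requires.

```python
from collections import defaultdict

# File information from the example: (node identifier, file identifier).
individual_inverted_file = [
    ("storage node 1", "storage file 1"),
    ("storage node 1", "storage file 2"),
    ("storage node 3", "storage file 3"),
    ("storage node 8", "storage file 120"),
    ("storage node 5", "storage file 80"),
    ("storage node 6", "storage file 10"),
    ("storage node 7", "storage file 40"),
]


def build_requests(file_infos):
    """Group file identifiers by node identifier: one request per storage node."""
    requests = defaultdict(list)
    for node_id, file_id in file_infos:
        requests[node_id].append(file_id)
    return dict(requests)


requests = build_requests(individual_inverted_file)
for node_id, file_ids in sorted(requests.items()):
    print(node_id, "<-", file_ids)
```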
In one embodiment, the file information may include not only the node identifier and the file identifier but also file length information, file format information, and the like; the content of the file information is not specifically limited here.
In one embodiment, a merged inverted file may be stored in a blob (binary large object); that is, the individual inverted files of a plurality of query keywords are merged and stored in one blob. If the preset mapping methods are hash functions, the number of blobs required equals the number of buckets. The storage files in a storage node may be stored in a single blob (e.g., separated by line separators) or in different blobs. In each file identifier, the triple (blob name, offset, length) may be recorded as part of the file identifier, so that the data can be managed more easily and the file information in a merged inverted file can be read at arbitrary positions without affecting the performance of the computing node.
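Addressing data by (blob name, offset, length) can be sketched with an in-memory stand-in for blob storage. The dictionary `blobs` and the helpers `append_to_blob` and `read_from_blob` are hypothetical names used only for this illustration; a real deployment would read the same byte range from an object store.

```python
blobs = {}  # blob name -> bytes; in-memory stand-in for real blob storage


def append_to_blob(blob_name: str, payload: bytes):
    """Append payload to a blob and return its (blob name, offset, length) locator."""
    data = blobs.get(blob_name, b"")
    offset = len(data)
    blobs[blob_name] = data + payload
    return (blob_name, offset, len(payload))


def read_from_blob(locator):
    """Read exactly the byte range named by a (blob name, offset, length) locator."""
    blob_name, offset, length = locator
    return blobs[blob_name][offset:offset + length]


loc_1 = append_to_blob("blob-0", b"contents of storage file 1\n")
loc_2 = append_to_blob("blob-0", b"contents of storage file 2\n")
second = read_from_blob(loc_2)  # random access without scanning the first entry
```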
In one embodiment, the storage node may generate an index from the on-disk file address of each storage file in the storage node. For example, if the index is a B+ tree, each internal tree node may hold a range of file identifier values, and each leaf may hold the file address of the corresponding file identifier. It should be noted that this description of the index in the storage node is only an example; the index may also take the form of a skip list or the like, and the information and structure contained in the index in the storage node are not specifically limited.
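As a rough illustration of an ordered index from file identifier to file address, the sketch below uses a sorted list with binary search from Python's `bisect` module. This is a deliberately simpler stand-in for the B+ tree or skip list mentioned above, not an implementation of either; integer file identifiers and addresses and the class name `FileAddressIndex` are assumptions.

```python
import bisect


class FileAddressIndex:
    """Sorted-array index: file identifier -> file address (stand-in for a B+ tree)."""

    def __init__(self):
        self._ids = []        # file identifiers, kept sorted
        self._addresses = []  # addresses, parallel to self._ids

    def insert(self, file_id: int, address: int) -> None:
        pos = bisect.bisect_left(self._ids, file_id)
        if pos < len(self._ids) and self._ids[pos] == file_id:
            self._addresses[pos] = address  # overwrite an existing entry
        else:
            self._ids.insert(pos, file_id)
            self._addresses.insert(pos, address)

    def lookup(self, file_id: int):
        pos = bisect.bisect_left(self._ids, file_id)
        if pos < len(self._ids) and self._ids[pos] == file_id:
            return self._addresses[pos]
        return None  # unknown file identifier


address_index = FileAddressIndex()
for file_id, address in [(120, 4096), (3, 0), (80, 2048)]:
    address_index.insert(file_id, address)
```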
Step 303, generating a data query response according to the storage file carried in each received request response of each query keyword, and returning the data query response to the client.
In one embodiment, based on the above embodiments, after storage file 1 and storage file 2, storage file 3, storage file 80, storage file 10, storage file 40, and storage file 120 are obtained from storage node 1, storage node 3, storage node 5, storage node 6, storage node 7, and storage node 8, respectively, it may be determined that each obtained storage file contains a query keyword, and the display order of the storage files on the display page of the client may be determined. For example, the storage file containing the largest number of the query keywords in the data query request may be displayed first on the client display page, or the storage file in which a query keyword occurs most often may be displayed first. The manner of determining the display order is not specifically limited here and may be set as required.
In one embodiment, based on the foregoing embodiment, the computing node may send only some of the obtained storage files (storage file 1, storage file 2, storage file 3, storage file 80, storage file 10, storage file 40, and storage file 120) to the client. For example, if only 5 storage files can be sent to the client at a time, 5 storage files may be selected from the obtained storage files to generate the data query response, which is returned to the client. The selection may be made by determining the importance of each storage file according to information such as the number of query keywords it contains and the occurrence frequency of the query keywords: the more query keywords a storage file contains and the more frequently they occur, the more important the storage file is. The 5 most important storage files are selected, and the data query response is generated from them and returned to the client. It should be noted that this manner of calculating importance is only used to explain the scheme clearly and does not limit its specific implementation.
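One possible way to rank the obtained storage files and keep the first 5 is sketched below. The scoring tuple (number of distinct query keywords contained, total occurrences), the whitespace tokenization, the sample file contents, and the names `importance` and `select_top` are all assumptions for illustration; the text above leaves the importance calculation open.

```python
def importance(text, query_keywords):
    """Score = (distinct query keywords present, total keyword occurrences)."""
    words = text.lower().split()  # naive whitespace tokenization (assumption)
    counts = [words.count(k.lower()) for k in query_keywords]
    return (sum(1 for c in counts if c > 0), sum(counts))


def select_top(files, query_keywords, limit=5):
    """Return the identifiers of the `limit` most important storage files."""
    ranked = sorted(
        files.items(),
        key=lambda item: importance(item[1], query_keywords),
        reverse=True,  # sorted() stays stable, so ties keep their input order
    )
    return [file_id for file_id, _ in ranked[:limit]]


# Hypothetical file contents keyed by file identifier.
files = {
    "storage file 1": "hello world hello",
    "storage file 2": "hello there",
    "storage file 3": "world world world",
    "storage file 80": "hello world",
    "storage file 10": "nothing relevant",
    "storage file 40": "hello hello",
    "storage file 120": "world",
}
top_five = select_top(files, ["hello", "world"])
```

With these sample contents, storage file 1 ranks first because it contains both query keywords and the most occurrences, while storage file 10 is never selected.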
Based on the method flow in fig. 3, an embodiment of the present application further provides a method for adding a storage file, so that when a newly added storage file exists in a storage node and the newly added storage file contains no newly added keyword, the file information of the newly added storage file is updated into the corresponding merged inverted files in the computing node. As shown in fig. 11, the method includes:
Step 1101, obtaining newly added file information of a newly added file based on detection of each storage node, and extracting at least one inserted keyword from the newly added file, where the newly added file is a file that is received and stored by a storage node and is different from each already stored storage file.
In one embodiment, the storage node may further provide an add interface and a delete interface for uploading and downloading storage files. After a storage node receives a newly added file, it generates a notification and sends it to the computing node; the computing node parses the newly added file to obtain its newly added file information, such as the node identifier of the storage node to which the file belongs and the newly added file identifier, and extracts at least one inserted keyword from the newly added file.
In one embodiment, the computing node may monitor each storage node. When a storage node receives a newly added file, the computing node detects information such as the newly added file identifier, obtains the newly added file from the storage node, and parses it to obtain its newly added file information, such as the node identifier of the storage node to which the file belongs and the newly added file identifier, and extracts at least one inserted keyword from the newly added file. It should be noted that the manner in which the computing node determines that a storage node has a newly added file and the manner in which it obtains the newly added file are not limited and may be set as required.
Step 1102, if it is determined that a keyword record contains the at least one inserted keyword, where the keyword record is used to record the keywords contained in the storage files of each storage node, performing the following operations for each of the at least one inserted keyword:
determining, by using the different preset mapping methods, each mapping area associated with the inserted keyword, obtaining the corresponding merged inverted file from each mapping area associated with the inserted keyword, and storing the association relationship between the inserted keyword and the newly added file information into the corresponding merged inverted files.
In one embodiment, the keyword record contains all keywords contained in all storage files of each storage node. If the keyword record contains the at least one inserted keyword, the inserted keyword already exists; it does not need to be added to the keyword record, and only the newly added file information of the newly added file needs to be added to the merged inverted files corresponding to the inserted keyword.
In one embodiment, the keyword record may be a separate record, or may be determined from the merged inverted files of one preset mapping method; it should be noted that the form and the acquisition method of the keyword record are not specifically limited here.
In one embodiment, based on the above embodiment, if the preset mapping method is a remainder function, then based on the schematic diagrams of fig. 5 and fig. 6, assuming that the inserted keyword is the same as query keyword 1, the association relationship between the inserted keyword and the newly added file information is added to merged inverted file a-0, merged inverted file b-6, merged inverted file c-1, and merged inverted file d-7, respectively.
In one embodiment, based on the above embodiment, if the preset mapping method is a hash function, then based on the schematic diagram of fig. 7, assuming that the inserted keyword is the same as query keyword 1, the association relationship between the inserted keyword and the newly added file information is added to merged inverted file a-1, merged inverted file b-2, merged inverted file c-m, and merged inverted file L-1, respectively.
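The update for inserted keywords that already appear in the keyword record can be sketched as below. As before, this is an illustration under assumptions rather than the application's implementation: the salted SHA-256 mapping, `NUM_AREAS`, `METHODS`, and the function `add_file_for_known_keywords` are hypothetical, and a merged inverted file is modeled as a set of (node identifier, file identifier) pairs, so the keyword association is implied by the mapping area in which the pair is stored.

```python
import hashlib

NUM_AREAS = 8                   # assumed number of mapping areas per method
METHODS = ["a", "b", "c", "d"]  # assumed preset mapping methods


def mapping_value(method, keyword):
    """Hypothetical preset mapping method: salted SHA-256 modulo NUM_AREAS."""
    digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_AREAS


def add_file_for_known_keywords(merged_files, keyword_record, inserted_keywords, new_file_info):
    """Add the new file information to each merged inverted file of every known keyword."""
    handled = []
    for keyword in inserted_keywords:
        if keyword not in keyword_record:
            continue  # newly added keywords follow the separate flow of fig. 12
        for method in METHODS:
            merged_files[method][mapping_value(method, keyword)].add(new_file_info)
        handled.append(keyword)
    return handled


merged_files = {m: [set() for _ in range(NUM_AREAS)] for m in METHODS}
keyword_record = {"hello"}
new_info = ("storage node 2", "storage file 200")
handled = add_file_for_known_keywords(
    merged_files, keyword_record, ["hello", "brand-new"], new_info
)
```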
Based on the method flow in fig. 3, an embodiment of the present application provides a method for adding a storage file, so that when a newly added storage file exists in a storage node and the newly added storage file contains a newly added keyword, the file information of the newly added storage file and the individual inverted file of the newly added keyword are updated into the corresponding merged inverted files in the computing node. As shown in fig. 12, the method includes:
Step 1201, obtaining newly added file information of a newly added file based on detection of each storage node, and extracting at least one inserted keyword from the newly added file, where the newly added file is a file that is received and stored by a storage node and is different from each already stored storage file.
In one embodiment, the storage node may further provide an add interface and a delete interface for uploading and downloading storage files. After a storage node receives a newly added file, it generates a notification and sends it to the computing node; the computing node parses the newly added file to obtain its newly added file information, such as the node identifier of the storage node to which the file belongs and the newly added file identifier, and extracts at least one inserted keyword from the newly added file.
In one embodiment, the computing node may monitor each storage node. When a storage node receives a newly added file, the computing node detects information such as the newly added file identifier, obtains the newly added file from the storage node, and parses it to obtain its newly added file information, such as the node identifier of the storage node to which the file belongs and the newly added file identifier, and extracts at least one inserted keyword from the newly added file. It should be noted that the manner in which the computing node determines that a storage node has a newly added file and the manner in which it obtains the newly added file are not limited and may be set as required.
Step 1202, if it is determined that the at least one inserted keyword includes a newly added keyword that is not contained in the keyword record, where the keyword record is used to record the keywords contained in the storage files of each storage node, generating an individual inverted file for the newly added keyword, where the individual inverted file of the newly added keyword includes the newly added file information of the newly added file associated with the newly added keyword.
In one embodiment, the keyword record contains all keywords contained in all storage files of each storage node. If the keyword record does not contain every one of the at least one inserted keyword, a newly added keyword exists, that is, a keyword not previously contained in the keyword record. The newly added keyword needs to be added to the keyword record; in addition, an individual inverted file of the newly added keyword needs to be generated and merged into the corresponding merged inverted files.
In one embodiment, the keyword record may be a separate record, or may be determined from the merged inverted files of one preset mapping method; it should be noted that the form and the acquisition method of the keyword record are not specifically limited here.
In one embodiment, the individual inverted file of the newly added keyword includes the association relationship between the newly added file information of the newly added file and the newly added keyword, and the newly added file information includes the node identifier of the storage node to which the newly added file belongs, the file identifier of the newly added file, and related information such as the file length of the newly added file.
In one embodiment, if a plurality of newly added files are added to the storage nodes and some or all of them contain the newly added keyword, the individual inverted file is generated according to the association relationships between the newly added keyword and the newly added file information of each newly added file that contains it.
Step 1203, determining each mapping area associated with the newly added keyword by adopting different preset mapping methods, obtaining corresponding combined inverted files from each mapping area associated with the newly added keyword, and combining the independent inverted files of the newly added keyword into the corresponding combined inverted files.
In an embodiment, if the preset mapping method is a remainder function, the newly added keyword is mapped under the different preset mapping methods as shown in fig. 13, a schematic diagram of a method for adding a newly added keyword provided in an embodiment of the present application. The plurality of preset mapping methods in the computing node are preset mapping method a, preset mapping method b, preset mapping method c, and preset mapping method d; the mapping values of the newly added keyword under these methods are 2, 8, 1, and 0, respectively, and the corresponding mapping areas are mapping area 2, mapping area 8, mapping area 1, and mapping area 0. The corresponding merged inverted file a-2, merged inverted file b-8, merged inverted file c-1, and merged inverted file d-0 are obtained from mapping area 2, mapping area 8, mapping area 1, and mapping area 0, respectively, and the individual inverted file of the newly added keyword is then merged into each of them. In this way, when the newly added keyword later appears as a query keyword in a data query request, the corresponding mapping values can be obtained based on the different preset mapping methods, and merged inverted file a-2, merged inverted file b-8, merged inverted file c-1, and merged inverted file d-0, each of which contains the individual inverted file of the newly added keyword, can be obtained from the corresponding mapping areas.
In an embodiment, if the preset mapping method is a hash function, the newly added keyword is mapped under the different preset mapping methods as shown in fig. 14, a schematic diagram of a method for adding a newly added keyword provided in an embodiment of the present application. The plurality of preset mapping methods in the computing node are hash function a, hash function b, and hash function c; the hash values of the newly added keyword under these hash functions are 1, 2, and m, respectively, and the corresponding buckets are bucket 1, bucket 2, and bucket m. The corresponding merged inverted file a-1, merged inverted file b-2, and merged inverted file c-m are obtained from bucket 1, bucket 2, and bucket m, respectively, and the individual inverted file of the newly added keyword is merged into each of them. In this way, when the newly added keyword later appears as a query keyword in a data query request, the corresponding hash values can be obtained based on the different preset mapping methods, and merged inverted file a-1, merged inverted file b-2, and merged inverted file c-m, each of which contains the individual inverted file of the newly added keyword, can be obtained from the corresponding buckets.
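Steps 1202 and 1203 can be sketched as follows: an individual inverted file is built for each keyword missing from the keyword record and is then merged (by set union) into one mapping area per preset mapping method. The mapping function, `NUM_AREAS`, `METHODS`, the sample files, and the name `add_new_keywords` are assumptions introduced for this illustration.

```python
import hashlib

NUM_AREAS = 8              # assumed number of mapping areas (buckets) per method
METHODS = ["a", "b", "c"]  # assumed preset mapping methods


def mapping_value(method, keyword):
    """Hypothetical preset mapping method: salted SHA-256 modulo NUM_AREAS."""
    digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_AREAS


def add_new_keywords(merged_files, keyword_record, new_files):
    """new_files maps file information -> set of inserted keywords of that file."""
    # Step 1202: build one individual inverted file per newly added keyword.
    individual = {}
    for file_info, keywords in new_files.items():
        for keyword in keywords:
            if keyword not in keyword_record:
                individual.setdefault(keyword, set()).add(file_info)
    # Step 1203: merge each individual inverted file into its mapping areas.
    for keyword, file_infos in individual.items():
        keyword_record.add(keyword)
        for method in METHODS:
            merged_files[method][mapping_value(method, keyword)] |= file_infos
    return individual


merged_files = {m: [set() for _ in range(NUM_AREAS)] for m in METHODS}
keyword_record = {"hello"}
new_files = {
    ("storage node 2", "storage file 201"): {"hello", "fresh"},
    ("storage node 4", "storage file 202"): {"fresh"},
}
individual = add_new_keywords(merged_files, keyword_record, new_files)
```

In this sketch the already known keyword "hello" is skipped on purpose, since the flow for existing keywords only adds file information and does not create a new individual inverted file.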
Based on the method flow in fig. 3, an embodiment of the present application provides a method for deleting a storage file, so as to update a corresponding combined inverted file in a computing node according to deleted file information of the deleted storage file when the deleted storage file exists in the storage node, including:
Step 1510, acquiring deleted file information of a deleted file based on detection of each storage node, and extracting at least one deleted keyword from the deleted file, wherein the deleted file is a file deleted by the storage node from each stored storage file.
In one embodiment, the storage node may further provide an add interface and a delete interface for uploading and downloading storage files. After a storage node deletes a storage file, it generates a notification and sends it to the computing node; the computing node parses the deleted file to obtain its deleted file information, such as the node identifier of the storage node to which the file belonged and the deleted file identifier, and extracts at least one deletion keyword from the deleted file.
In one embodiment, the computing node may monitor each storage node. When a storage node deletes a storage file, the computing node detects information such as the deleted file identifier, obtains the deleted file from the storage node, and parses it to obtain its deleted file information, such as the node identifier of the storage node to which the file belonged and the deleted file identifier, and extracts at least one deletion keyword from the deleted file. It should be noted that the manner in which the computing node determines that a storage node has a deleted file and the manner in which it obtains the deleted file are not limited and may be set as required.
Step 1520, for at least one deletion keyword, performing the following operations respectively:
Step 1521, determining, by using the different preset mapping methods, each mapping area associated with the deletion keyword, obtaining the corresponding merged inverted file from each mapping area associated with the deletion keyword, and performing intersection processing on the obtained merged inverted files to obtain the individual inverted file of the deletion keyword;
Step 1522, removing the deleted file information from the individual inverted file of the deletion keyword.
In an embodiment, if the preset mapping method is a residual function, the deletion keyword is in different preset mapping methods, as shown in fig. 16, which is a schematic diagram of a method for deleting a storage file provided in an embodiment of the present application, a plurality of preset mapping methods in a computing node are a preset mapping method a, a preset mapping method b, a preset mapping method c, and a preset mapping method d, where the deletion keyword has mapping values of 2, 8, 1, and 0 in the preset mapping method a, the preset mapping method b, the preset mapping method c, and the preset mapping method d, respectively, and the mapping values and the corresponding mapping areas are: and the mapping area 2, the mapping area 8, the mapping area 1 and the mapping area 0 respectively obtain corresponding merged inverted files a-2, b-8, c-1 and d-0 from the mapping area 2, the mapping area 8, the mapping area 1 and the mapping area 0, and the merged inverted file a-2, the merged inverted file b-8, the merged inverted file c-1 and d-0 are subjected to intersecting processing to obtain the single inverted file of the deletion keyword, and the deletion file information of the deletion keyword is removed from the single inverted file of the deletion keyword contained in each of the merged inverted file a-2, the merged inverted file b-8, the merged inverted file c-1 and the merged inverted file d-0, so that when the deletion keyword is the query keyword in the data query request, the corresponding mapping value cannot be obtained based on different preset mapping methods.
In one embodiment, based on the above embodiment, after the merged inverted files a-2, b-8, c-1 and d-0 are obtained, each merged inverted file may instead be traversed directly: the association relationship between the deletion keyword and the deletion file information is located in each merged inverted file and deleted. Whether the association relationship is deleted by traversing the merged inverted files directly, or the deletion file information is removed after the merged inverted files have been intersected to obtain the individual inverted file of the deletion keyword, is not specifically limited; the approach that completes the deletion fastest for the data volume of the merged inverted files may be selected.
In one embodiment, the preset mapping methods are hash functions. As shown in fig. 17, a schematic diagram of a method for deleting a storage file in an embodiment of the present application, the computing node holds hash function a, hash function b and hash function c, under which the deletion keyword has the hash values 1, 2 and m, respectively. The buckets corresponding to these hash values are bucket 1, bucket 2 and bucket m, from which the merged inverted files a-1, b-2 and c-m are obtained; the association relationship between the deletion keyword and the deletion file information can then be deleted from the merged inverted files a-1, b-2 and c-m.
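The deletion flow described above can be sketched as follows. This is a minimal illustration, not the implementation of the application: each preset mapping method is modeled as a salted SHA-256 hash reduced modulo an assumed number of mapping areas, and each merged inverted file is modeled as a dictionary from keyword to file identifiers (the association relationships). The names `SEEDS`, `NUM_AREAS`, `add`, `query` and `delete` are hypothetical.

```python
import hashlib

NUM_AREAS = 10                # assumed number of mapping areas per method
SEEDS = ["a", "b", "c", "d"]  # one hypothetical seed per preset mapping method


def mapping_value(seed, keyword):
    """Mapping value of a keyword under one preset mapping method."""
    digest = hashlib.sha256(f"{seed}:{keyword}".encode()).hexdigest()
    return int(digest, 16) % NUM_AREAS


def add(index, keyword, file_id):
    """Record the keyword/file association in one mapping area per method."""
    for seed in SEEDS:
        area = index.setdefault((seed, mapping_value(seed, keyword)), {})
        area.setdefault(keyword, set()).add(file_id)


def query(index, keyword):
    """Intersect the merged inverted files the keyword maps to."""
    merged = []
    for seed in SEEDS:
        area = index.get((seed, mapping_value(seed, keyword)), {})
        merged.append(set().union(*area.values()) if area else set())
    return set.intersection(*merged)


def delete(index, keyword, file_id):
    """Remove the keyword/file association from every merged inverted file."""
    for seed in SEEDS:
        area = index.get((seed, mapping_value(seed, keyword)), {})
        area.get(keyword, set()).discard(file_id)


index = {}
add(index, "hello", "doc1")
add(index, "hello", "doc2")
add(index, "world", "doc1")
delete(index, "hello", "doc2")
print(query(index, "hello"))
```

After the deletion, a query for "hello" no longer returns doc2, because the association has been removed from every mapping area the keyword maps to.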
Based on the method flow in fig. 3 and the related methods and embodiments, an embodiment of the present application provides a method for processing a data query request. In step 3022, sending a data acquisition request to each corresponding storage node based on the individual inverted file, and receiving the request response sent by each storage node, specifically includes:
for the at least one query keyword, executing the following operation in parallel:
based on the node identifier and the file identifier in each piece of file information of a query keyword, sending a data acquisition request to each corresponding storage node, and receiving the request response sent by each storage node.
In one embodiment, assume that the data query request includes two query keywords. The individual inverted file of query keyword 1 is obtained first, and data acquisition requests are generated and sent to the corresponding storage nodes. Without waiting for the responses of those storage nodes, the individual inverted file of query keyword 2 is then obtained, and data acquisition requests are generated and sent to the corresponding storage nodes. During this process, the data acquisition responses for query keyword 1 and for query keyword 2 can be received as they arrive.
In one embodiment, data acquisition requests are generated simultaneously for the individual inverted files of query keyword 1 and query keyword 2 and sent to the corresponding storage nodes, and the data acquisition responses returned by those storage nodes are received.
In the above embodiments, data acquisition requests are sent only to the storage nodes identified by the node identifiers of the file information in the individual inverted file. Compared with the related art, in which a data acquisition request is sent to every storage node in the storage node cluster, this saves communication resources, and the communication resources thus freed are sufficient for sending the data acquisition requests in parallel.
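The parallel sending of data acquisition requests can be illustrated with a thread pool. This is only a sketch under stated assumptions: the storage nodes are replaced by an in-memory dictionary, and `STORAGE_NODES`, `fetch` and `fetch_for_keywords` are hypothetical names, not part of the application.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for storage nodes: node id -> {file id -> content}.
STORAGE_NODES = {
    "node-1": {"f1": "hello world", "f2": "hello system"},
    "node-2": {"f3": "system design"},
}


def fetch(node_id, file_id):
    """Stand-in for one data acquisition request and its request response."""
    return STORAGE_NODES[node_id][file_id]


def fetch_for_keywords(inverted_files, max_workers=8):
    """inverted_files: keyword -> list of (node id, file id) pairs.

    All requests for all keywords are submitted before any response is
    awaited, so no keyword waits for another keyword's responses.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            keyword: [pool.submit(fetch, node, fid) for node, fid in entries]
            for keyword, entries in inverted_files.items()
        }
        return {kw: [f.result() for f in fs] for kw, fs in futures.items()}


responses = fetch_for_keywords({
    "hello": [("node-1", "f1"), ("node-1", "f2")],
    "system": [("node-1", "f2"), ("node-2", "f3")],
})
print(responses["system"])
```

Only the nodes named in the individual inverted files are contacted, which is what keeps the number of outstanding requests small enough to issue them concurrently.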
Based on the method flow in fig. 3 and the related methods and embodiments, an embodiment of the present application provides a method for obtaining merged inverted files. Before at least one query keyword is extracted from the data query request sent by the client in step 3021, merged inverted files are generated in the computing node according to the storage files of the storage nodes. As shown in fig. 18, the method further includes:
Step 1810: extracting the initial keywords and file information of the storage files in each storage node.
In one embodiment, a builder may use a document parser to parse the binary large objects (blobs) in the corpus (all storage files in each storage node) into storage files; if a blob contains multiple storage files, all of them are parsed out of it. Using the keyword parser and the document parser, the builder collects, for each storage file, its initial keywords, the total number of its initial keywords and its length, as well as the number of storage files that contain each particular initial keyword. The blob name, offset (byte offset) and storage file length (byte size) of each storage file are thereby obtained, and the file identifier of the storage file can be generated from them. Fig. 19 shows a method for acquiring statistical information of the storage files in a corpus provided by an embodiment of the present application, where the statistical information contains the file information of each storage file.
In one embodiment, the file information may include information such as a file identifier (generated according to a binary large object name, an offset, and a stored file length), a node identifier of a home storage node, and the like.
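As an illustration of file information built from a blob name, an offset and a storage file length, the following sketch derives a file identifier and uses it to cut a storage file out of a blob. The field names and the `name:offset:length` identifier format are assumptions made for this example only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FileInfo:
    blob_name: str  # name of the binary large object holding the file
    offset: int     # byte offset of the storage file inside the blob
    length: int     # byte size of the storage file
    node_id: str    # node identifier of the home storage node

    @property
    def file_id(self):
        """Hypothetical identifier format derived from the three fields."""
        return f"{self.blob_name}:{self.offset}:{self.length}"


def read_file(blob, info):
    """Extract the bytes of one storage file from its blob."""
    return blob[info.offset:info.offset + info.length]


info = FileInfo("blob-0", 6, 5, "node-1")
print(info.file_id, read_file(b"hello world", info))
```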
In one embodiment, the storage files of the corpus in the foregoing embodiment may be obtained from the storage nodes, and the method for acquiring the statistical information of the storage files from the corpus may be executed on one or more computing nodes or on another independent node; the executing entity of this method is not specifically limited.
Step 1820: for each initial keyword, executing the following steps:
Step 1821: for an initial keyword, obtaining the file information of each storage file that contains the initial keyword.
Step 1822: establishing an association relationship between the initial keyword and the file information of each such storage file, and storing the association relationships in the initial individual inverted file of the initial keyword.
In one embodiment, for each initial keyword, the file information of every storage file containing the initial keyword is determined, association relationships between the initial keyword and that file information are established, and the association relationships are stored in the initial individual inverted file of the initial keyword. When a storage file containing the initial keyword is added or deleted, the file information of the newly added storage file is added to, or the file information of the deleted storage file is removed from, the initial individual inverted file accordingly.
In one embodiment, for each initial keyword, only the file information of the storage files in which the occurrence frequency of the initial keyword is higher than a set threshold is determined; association relationships between the initial keyword and that file information are established and stored in the initial individual inverted file of the initial keyword. The occurrence frequency of the initial keyword may be the ratio of the number of occurrences of the initial keyword in the storage file to the total number of occurrences of all keywords in the storage file. That is, how the association relationship between an initial keyword and a storage file is obtained is not limited here and may be set as needed.
Step 1830: obtaining the initial individual inverted file of each initial keyword, and executing the following steps for each of the different preset mapping methods:
Step 1831: based on the mapping value of each initial keyword under the preset mapping method, merging the initial individual inverted files of the initial keywords that share the same mapping value to obtain an initial merged inverted file.
Step 1832: storing the initial merged inverted file in the mapping area associated with the corresponding mapping value.
In one embodiment, for each preset mapping method, all initial keywords are grouped so that initial keywords with the same mapping value fall into the same group. The purpose is to allow the pointers from all initial keywords to their inverted files to be kept in memory: the initial individual inverted files of the initial keywords in one group are merged into the initial merged inverted file of that group, which is placed in a mapping area (if the preset mapping method is, for example, a remainder function) or in a bucket (if the preset mapping method is a hash function).
In one embodiment, assume three initial keywords and their individual inverted files: ("hello" -> (doc1, doc2)), ("world" -> (doc1)), ("system" -> (doc2, doc3)). If "hello" and "world" are merged into one group b1, the result is (b1 -> (doc1, doc2)), ("system" -> (doc2, doc3)). Merging reduces the number of keyword entries from 3 to 2, which reduces the memory occupied by the association relationships between the initial keywords and their inverted files, but the exact individual inverted file of each initial keyword is no longer available: the initial merged inverted file of b1 contains both the individual inverted file of "hello" and that of "world". Here doc2 does not contain "world", yet a direct query for "world" returns doc1 and doc2 from b1, so doc2 appears in the result as a false positive. For this reason, multiple different preset mapping methods can be set. Each preset mapping method groups all the initial keywords in its own way and yields its own initial merged inverted files. With a reasonable choice of preset mapping methods, a keyword maps to one merged inverted file under each method, and the individual inverted file of the keyword is obtained by intersecting those merged inverted files.
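The example above can be reproduced with a short sketch. The two preset mapping methods are written out as explicit lookup tables (an assumption made so that the groupings are visible; a real mapping method would be a remainder or hash function), and `build_merged` and `query` are hypothetical helper names.

```python
# Individual inverted files from the example in the text.
INDIVIDUAL = {
    "hello": {"doc1", "doc2"},
    "world": {"doc1"},
    "system": {"doc2", "doc3"},
}

# Two hypothetical preset mapping methods, given as explicit tables.
# Method 0 groups "hello" with "world"; method 1 groups "hello" with "system".
MAPPINGS = [
    {"hello": 0, "world": 0, "system": 1},
    {"hello": 0, "world": 1, "system": 0},
]


def build_merged(individual, mapping):
    """Merge the individual inverted files that share a mapping value."""
    merged = {}
    for keyword, docs in individual.items():
        merged.setdefault(mapping[keyword], set()).update(docs)
    return merged


MERGED = [build_merged(INDIVIDUAL, m) for m in MAPPINGS]


def query(keyword):
    """Intersect the merged inverted files the keyword maps to."""
    candidates = [merged[m[keyword]] for merged, m in zip(MERGED, MAPPINGS)]
    return set.intersection(*candidates)


print(MERGED[0][0])    # merged file of group b1: doc1 and doc2
print(query("world"))  # the intersection removes the false positive doc2
```

Under method 0 alone, "world" would return doc1 and doc2; intersecting with the merged inverted file from method 1 leaves only doc1.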
In one embodiment, the preset mapping methods are hash functions; let n be the number of storage files across all storage nodes and B the total number of buckets across all the hash functions. Taking into account the construction of the hash functions and hash seeds needed to reduce the memory occupied by the association relationships between keywords and merged inverted files, the worst-case complexity is O(min{nmL, Bn + L}), where a storage file has m keywords on average and the hash structure has L layers. A search engine in a computing node therefore needs O(B) memory to store the multi-layer hash structure (which includes O(L) hash seeds and O(B) bucket pointers, where L ≤ B). For a fixed B, the multi-layer search engine has significantly fewer false positives than a single hash function (L = 1); a false positive is a storage file that does not match the query keyword of the search but still appears in the query results. The number of false positives decreases rapidly as L increases from 1, but beyond a certain number of hash functions the allocated buckets are divided among too many layers, leaving fewer buckets per layer, so that more keywords share each bucket and false positives increase. These results show that multiple hash functions are needed instead of a single hash function, and that there is an optimal number of hash functions L that depends on the corpus and on the number of buckets B allotted to the hash functions.
In one embodiment, the builder then creates the multi-layer hash structure. The multi-layer hash structure stores pointers to the merged inverted files, which are compressed into a single blob. To locate each storage file, the multi-layer hash structure includes the file information associated with the keywords: blob name, byte offset and byte size. In addition, the builder stores the hash seeds (used to create the merged inverted files) and other metadata of the hash functions in one description file, which is persisted as another blob. The developer can configure the cloud storage system in different ways: the storage driver specifies how the corpus is read, and the parser specifies how storage files are separated within the corpus and how keywords are extracted from them. The query accuracy of the search engine can be set as the average number of storage files that are returned but are irrelevant to the query keyword (the comprehensive error rate, described in detail later). A memory limit for the multi-layer hash structure can also be set.
In one embodiment, fig. 20 shows a method for obtaining the search engine provided by an embodiment of the present application. A computing node obtains each initial keyword and its individual inverted file from the statistical information collected from the corpus, determines the different preset mapping methods and their respective mapping areas using parameters such as the memory occupied by the search engine and the comprehensive error rate of query results, and on that basis obtains the merged inverted file in each mapping area of each preset mapping method. The search engine, comprising the different preset mapping methods, their mapping areas and the merged inverted file in each mapping area, is persisted to each computing node to serve the data query and data acquisition flows in the cloud storage system. When storage files in the storage nodes are uploaded or deleted, the merged inverted files in the corresponding mapping areas of the search engine are updated dynamically.
Based on the method flow in fig. 3 and the related methods and embodiments, an embodiment of the present application provides a method for obtaining the different preset mapping methods. Before each mapping area associated with a query keyword is determined by the different preset mapping methods in step 3021, the statistical data of the storage files of each storage node is analyzed to determine the different preset mapping methods. As shown in fig. 21, the method further includes:
Step 2101: obtaining the number of initial keywords contained in each storage file and the query probability of each initial keyword, where the query probability is determined according to the historical query condition of the initial keyword.
Step 2102: determining the keyword query condition of each storage file according to the query probabilities of the initial keywords, where the keyword query condition of a storage file is determined by the query probabilities of the initial keywords not contained in that storage file.
Step 2103: determining the functional relationship that links the keyword count and keyword query condition of each storage file, the number of mapping areas and the number of preset mapping methods; this functional relationship has a curve relationship with the expected comprehensive query error rate, which characterizes the error rate of the data queried by the computing node.
Step 2104: when the expected comprehensive query error rate meets a preset value condition, obtaining the different preset mapping methods that satisfy the functional relationship, where the number of methods is smaller than the number of areas.
In one embodiment, the historical query condition of an initial keyword may be the ratio of the query frequency of the initial keyword to the sum of the query frequencies of all keywords within a period such as one month, half a month or one week. It should be noted that the manner of obtaining the query frequency of the initial keyword is not specifically limited.
In one embodiment, the method for acquiring the different preset mapping methods and the mapping areas of each preset mapping method includes:
Step 1: acquiring the number of initial keywords contained in each storage file and the query probability of each initial keyword, where the query probability is determined according to the historical query condition of the initial keyword;
Step 2: for each storage file, executing the following operations:
Step 21: for the ith storage file, acquiring the expected keyword query condition of the ith storage file according to the keyword count of the ith storage file and the functional relationship between the number of mapping areas and the number of preset mapping methods;
In one embodiment, the expected keyword query condition is the expected false positive of the storage file (a false positive means that a storage file does not match the query keyword of the search but still appears in the query results). The search engine selects a set of hash functions $H = \{h_l\}_{l=1}^{L}$, where each $h_l$ is the hash function of the lth layer. Let $W_i$ be the set of distinct initial keywords in the ith storage file, with size $|W_i|$, and let $W$ be the set of all keywords contained in all storage files in the corpus. Given the total number of buckets B over all hash functions, and assuming that B is divisible by L so that each layer has B/L buckets, the false positive probability of the ith storage file for any irrelevant query keyword w is $q_i(L) = \left(1 - (1 - L/B)^{|W_i|}\right)^{L}$ (1). That $q_i$ does not depend on the particular irrelevant query keyword w is a consequence of the hash function mapping. Formula (1) also admits the approximation $\tilde{q}_i(L) = \left(1 - \exp(-|W_i| L / B)\right)^{L}$, whose properties lead to efficient optimization.
Step 22: obtaining the sum of the query probabilities of all initial keywords other than the initial keywords contained in the ith storage file;
In one embodiment, $c_i = \sum_{w \in W \setminus W_i} p_w$ is the probability that the query keyword is an irrelevant keyword w not contained in the ith storage file (that is, the sum of the query probabilities of the initial keywords other than those in the ith storage file). It serves as the linear combination coefficient of the ith storage file in F (the comprehensive query error rate).
Step 23: determining the expected comprehensive query error rate according to the expected keyword query condition of each storage file and the probability sum of each storage file;
In one embodiment, the distribution of query keywords is assumed to be $w \sim \mathrm{Cat}(\{p_w\}_{w \in W})$, where $p_w$ is the prior probability of query keyword w in a query. The expected number of false positives per query over all query keywords is $F(L) = \sum_{w \in W} p_w \sum_{i:\, w \notin W_i} q_i(L) = \sum_{i=1}^{n} c_i\, q_i(L)$ (2). This is the main objective function for tuning the search engine. For brevity it is written as F(L), and the approximation $\tilde{F}(L) = \sum_{i=1}^{n} c_i\, \tilde{q}_i(L)$ is defined in the same way from the approximation $\tilde{q}_i(L) = z_i(L)^{L}$ with $z_i(L) = 1 - \exp(-|W_i| L / B)$. The number of false positives actually observed is highly concentrated around this expected value.
Although L is a discrete variable, its domain is extended to a continuous variable $L \in \mathbb{R}$ with $1 \le L \le B$ in order to study the richer properties of F(L). This extension makes the derivative available: $\frac{d\tilde{F}}{dL} = \sum_{i=1}^{n} c_i\, \frac{\tilde{q}_i(L)}{z_i(L)} \left[ z_i(L) \ln z_i(L) - (1 - z_i(L)) \ln(1 - z_i(L)) \right]$ (3), where substituting $z_i(L) = 1 - \exp(-|W_i| L / B)$ simplifies the formula. Focusing on the approximation eases the analysis and leads to an efficient algorithm for optimizing the search engine.
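The quantities in this step can be evaluated numerically. The sketch below assumes that each of the L layers holds B/L buckets and that hashing is uniform, so that the per-file false positive probability is $(1 - (1 - L/B)^{|W_i|})^L$ with approximation $(1 - \exp(-|W_i| L/B))^L$; these closed forms are a reconstruction from the definition of $z_i(L)$ in the text, and the function names are hypothetical.

```python
import math


def q_exact(num_keywords, B, L):
    """False positive probability of one file with num_keywords distinct
    keywords, assuming L layers of B/L uniformly hashed buckets each."""
    return (1.0 - (1.0 - L / B) ** num_keywords) ** L


def q_approx(num_keywords, B, L):
    """Exponential approximation z_i(L)**L with z_i = 1 - exp(-|W_i| L / B)."""
    return (1.0 - math.exp(-num_keywords * L / B)) ** L


def expected_false_positives(keyword_counts, coeffs, B, L, q=q_exact):
    """F(L) = sum_i c_i * q_i(L): expected false positives per query."""
    return sum(c * q(k, B, L) for k, c in zip(keyword_counts, coeffs))


# Two files with 100 distinct keywords each, B = 1000 buckets, c_i = 1.
for layers in (1, 2, 4):
    print(layers, expected_false_positives([100, 100], [1.0, 1.0], 1000, layers))
```

For these inputs the expected number of false positives falls as the number of layers grows from 1 to 4, matching the behavior described for the fast region.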
Step 3: determining, according to a preset value condition, the different preset mapping methods corresponding to the functional relationship, where the number of methods is smaller than the number of areas;
The preset value condition may be that the expected comprehensive query error rate takes its minimum value, or that the number of areas or the number of methods lies within a set value range. It should be noted that the preset value condition is not specifically limited; the expected comprehensive query error rate may be taken from the corresponding part of the curve that relates the functional relationship to the expected comprehensive query error rate.
In one embodiment, the number of hash functions (the number of layers) is optimized. When the number of layers is small, fewer merged inverted files have to be fetched for a query and fewer merged inverted files have to be intersected, which improves overall performance. In addition, a larger number of layers duplicates more documents across layers, which further increases the memory occupied by the search engine. The number of layers can therefore be minimized under the constraints of a given total number of buckets B and false positive rate $F_0$ (comprehensive query error rate). In other words, the optimization problem $\min L$ subject to $F(L) \le F_0$, $L \in \{1, \dots, B\}$ (4) seeks the smallest number of layers such that the (B, L) search engine has fewer than $F_0$ expected false positives.
F(L) is non-convex and may contain multiple minima, but analysis of its approximation $\tilde{F}(L)$ reveals three important properties that underlie Algorithm 1, the method for constructing the different mapping methods and their respective mapping areas. First, there is a fast optimization region that covers practical values of $F_0$ (Lemma 2). Second, although L may be as large as B, only a much smaller interval needs to be searched (Lemma 3). Finally, there is a lower bound that allows feasibility to be checked quickly.
Algorithm 1:
Input: number of buckets B, expected false positive rate $F_0$, keyword sets $\{W_i\}_{i=1}^{n}$, and the distribution of query probabilities $\{p_w\}_{w \in W}$ of the query keywords.
Output: minimum number of layers $L^*$.
Step (1): compute $L_{min} = (B / \max_i |W_i|) \ln 2$ and $L_{max} = (B / \min_i |W_i|) \ln 2$, and check the lower bound on F against $F_0$;
Step (2): if $F(L_{min}) \le F_0$ then
Step (3): $L^* \leftarrow$ binary search for $L \in [1, L_{min}]$;
Step (4): else if an iterative search over $L \in [L_{min}, L_{max}]$ finds an L satisfying the condition then
Step (5): $L^* \leftarrow$ the result of the iterative search;
Step (6): return $L^*$ if a value was found; otherwise reject.
Using these lemmas, Algorithm 1 first uses the lower bound to verify that the given constraints B and $F_0$ are feasible. It then determines whether L falls within the fast region or the slow region and selects the optimization procedure accordingly. In the fast region, where $\tilde{F}$ is decreasing, it performs a binary search to find the smallest L in the range $[1, L_{min}]$. In the slow region $[L_{min}, L_{max}]$, on the other hand, such monotonicity cannot be guaranteed, so the algorithm iteratively increases the value of L until the constraint is satisfied (steps (4) and (5)). If the lower bound check or the iterative search fails, no L satisfying the constraint can be found, so the algorithm rejects (step (6)).
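The procedure described for Algorithm 1 can be sketched as follows. This is an illustrative reading, not the text of the algorithm: it assumes the approximate false positive probability $(1 - \exp(-|W_i| L/B))^L$, takes $L_{min} = (B/\max_i |W_i|)\ln 2$ and $L_{max} = (B/\min_i |W_i|)\ln 2$ as region boundaries, and rounds them to integers; `min_layers` and its parameters are hypothetical names.

```python
import math


def q(num_keywords, B, L):
    """Assumed approximate false positive probability of one file."""
    return (1.0 - math.exp(-num_keywords * L / B)) ** L


def F(keyword_counts, coeffs, B, L):
    """Expected number of false positives per query."""
    return sum(c * q(k, B, L) for k, c in zip(keyword_counts, coeffs))


def min_layers(keyword_counts, coeffs, B, F0):
    """Return the smallest L found with F(L) <= F0, or None to reject."""
    ln2 = math.log(2.0)
    # Feasibility check: each q_i is bounded below by 2 ** (-(B/|W_i|) ln 2).
    lower = sum(c * 2.0 ** (-(B / k) * ln2)
                for k, c in zip(keyword_counts, coeffs))
    if lower > F0:
        return None
    l_min = max(1, math.floor(B / max(keyword_counts) * ln2))
    l_max = min(B, math.ceil(B / min(keyword_counts) * ln2))
    if F(keyword_counts, coeffs, B, l_min) <= F0:
        # Fast region: F is decreasing on [1, l_min], so binary search works.
        lo, hi = 1, l_min
        while lo < hi:
            mid = (lo + hi) // 2
            if F(keyword_counts, coeffs, B, mid) <= F0:
                hi = mid
            else:
                lo = mid + 1
        return lo
    # Slow region: no monotonicity guarantee, so try each L in turn.
    for L in range(l_min + 1, l_max + 1):
        if F(keyword_counts, coeffs, B, L) <= F0:
            return L
    return None


print(min_layers([100, 100], [1.0, 1.0], B=1000, F0=0.05))
print(min_layers([100, 100], [1.0, 1.0], B=1000, F0=0.001))
```

In the first call the constraint is met inside the fast region, so the binary search returns the smallest feasible number of layers; in the second call the lower bound already exceeds $F_0$, so the constraint is rejected without any search.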
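Lemma 1.
The minimizer of $\tilde{q}_i$ is obtained immediately as $L_i^* = (B / |W_i|) \ln 2$, and thus $\tilde{q}_i(L_i^*) = 2^{-(B / |W_i|) \ln 2}$.
Proof: From formula (3), a minimizer $L_i^*$ satisfies $\frac{\tilde{q}_i(L)}{z_i(L)} \left[ z_i(L) \ln z_i(L) - (1 - z_i(L)) \ln(1 - z_i(L)) \right] = 0$. The left-hand factor is always positive, so the right-hand factor must be zero, which is equivalent to $z_i(L_i^*) = 1/2$. Thus $\exp(-|W_i| L_i^* / B) = 1/2$; in other words, $L_i^* = (B / |W_i|) \ln 2$. Substituting this minimizer into formulas (1) and (2) yields the remaining results.
Because $\tilde{q}_i(L) \ge \tilde{q}_i(L_i^*)$ for $1 \le L \le B$, a lower bound $\tilde{F}(L) \ge \sum_{i=1}^{n} c_i\, 2^{-(B / |W_i|) \ln 2}$ is also obtained, which justifies the feasibility check in Algorithm 1.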
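Lemma 2. For $1 \le L < L_{min} = (B / \max_i |W_i|) \ln 2$, the expected false positive rate strictly decreases, $\tilde{F}'(L) < 0$, and $\tilde{F}(L) < \left(\sum_{i=1}^{n} c_i\right) 2^{-L}$.
Proof: If $L < L_{min}$, then $z_i(L) < 1/2$ for all i. The strict decrease follows from formula (3): $z_i \ln z_i - (1 - z_i) \ln(1 - z_i) < 0$ whenever $z_i < 1/2$. Likewise, from $z_i(L) < 1/2$ it follows that $\tilde{q}_i(L) = z_i(L)^{L} < 2^{-L}$. According to formula (1), if $L < L_{min}$, then $\tilde{q}_i(L) < 2^{-L}$ for all $i \in [n]$, so the expected false positive rate also decreases exponentially: $\tilde{F}(L) < \left(\sum_{i=1}^{n} c_i\right) 2^{-L}$.
Remark: the interval $[1, L_{min}]$ covers practical values of $F_0$. Even in the worst case where $c_i = 1$, this region covers expected numbers of false positives down to $n\, 2^{-L_{min}}$, i.e. $n\, 2^{-(B / \max_i |W_i|) \ln 2}$. Because the function is strictly decreasing on this interval, the optimization can be carried out quickly by binary search. Nevertheless, Algorithm 1 evaluates $F(L_{min})$, the lower limit of the expected false positives in this region, to decide whether to use the fast optimization.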
Lemma 3. For $L > L_{max} = (B / \min_i |W_i|) \ln 2$, the expected false positives strictly increase, $\tilde{F}'(L) > 0$.
Proof: If $L > L_{max}$, then $z_i(L) > 1/2$ for all i. Combined with formula (3), this means $\tilde{F}'(L) > 0$.
False positive guarantee: for a fixed number of buckets B and number of layers L, each false positive contributed by the ith storage file that is irrelevant to query keyword w is a scaled Bernoulli random variable, that is, $x_{i,w} = p_w b_i$ with $b_i \sim \mathrm{Bern}(q_i(L))$. Since $x_{i,w} \in [0, p_w]$ and $E[x_{i,w}] = p_w q_i(L)$, the Hoeffding inequality guarantees that, with probability at least $1 - \delta$, the observed number of false positives $X = \sum_{w} \sum_{i:\, w \notin W_i} x_{i,w}$ exceeds its expectation $E[X] = F(L)$ by no more than $\epsilon$,
where $\epsilon = \sqrt{\frac{\ln(1/\delta)}{2} \sum_{w} \sum_{i:\, w \notin W_i} p_w^2}$. The deviation is therefore bounded by the sum of the squared query probabilities. In the worst case, where a few query keywords that are irrelevant to all storage files dominate the distribution ($p_w \approx 1$), the deviation may be large, for example $\epsilon = O(\sqrt{n})$. Techniques such as query caching are sufficient to address this situation. In general, when there are many irrelevant query keywords of similar probability, the deviation shrinks as the number of keywords grows: $\epsilon = O(\sqrt{n / |W|})$.
In one embodiment, based on the above embodiment, although the optimization formula accommodates any categorical query probability distribution $\mathrm{Cat}(p_w)$ over keywords, the system defaults to a uniform distribution; in other words, every keyword in the corpus is assumed equally likely to appear as a query keyword in a data query request, i.e. $p_w = 1/|W|$. There is no further evidence to support or refute this choice, and it may be too simplistic. Other possible options are: (a) $p_w \propto \mathrm{occurrences}(w)$, determined from the number of occurrences of the keyword relative to the total number of keywords in the corpus; (b) prior probabilities $p_w$ provided or collected by the developer. A non-zero $p_{w'}$ may also be assigned to query keywords $w'$ that do not appear in the corpus.
In one embodiment, a method for compressing merged inverted files is provided, which avoids creating too many tiny files or a small number of very large files. The compressed layout consists of two components: a header block and merged inverted file blocks. Each merged inverted file block stores a contiguous sequence of serialized file information. The present application uses Protocol Buffers to serialize file information into byte arrays. While the search engine is being built, the builder tracks the position of each piece of file information and builds a dictionary of binary pointers. Given this block structure, each binary pointer needs to represent a block ID, an offset and a byte length so that the bytes of the storage file corresponding to the file information can be retrieved in a single transfer. In addition, the builder compresses repeated strings in the storage files into integer keys; this reduces the number of bytes that must be downloaded for each storage file and thereby speeds up queries. The builder persists the binary pointers and the string compression table in the header block, together with the hash seeds and other metadata. The hash seeds are collected from the hash functions of the search engine and succinctly represent the search engine's mapping. The header block can be loaded when the system's search engine is initialized.
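A binary pointer carrying a block ID, an offset and a byte length can be illustrated with a fixed-width encoding. The field widths below (32-bit block ID, 64-bit offset, 32-bit length, little-endian) are assumptions chosen for this sketch; the source does not specify the layout, and `pack_pointer` and `read_entry` are hypothetical names.

```python
import struct

# Assumed layout: block id (uint32), offset (uint64), length (uint32).
POINTER = struct.Struct("<IQI")


def pack_pointer(block_id, offset, length):
    """Encode one binary pointer as 16 bytes."""
    return POINTER.pack(block_id, offset, length)


def read_entry(blocks, raw_pointer):
    """Follow a binary pointer and return the bytes it refers to."""
    block_id, offset, length = POINTER.unpack(raw_pointer)
    return blocks[block_id][offset:offset + length]


blocks = [b"aaaabbbbcc"]       # one merged inverted file block
ptr = pack_pointer(0, 4, 4)    # block 0, bytes 4..7
print(len(ptr), read_entry(blocks, ptr))
```

Because the pointer names the block and the exact byte range, a reader can issue a single ranged read instead of downloading the whole block.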
Based on the method flow for obtaining the different preset mapping methods in fig. 21, after the different preset mapping methods and their corresponding mapping areas are obtained, and corresponding to step 303 in the method flow in fig. 3, an embodiment of the present application provides a method for generating a data query response. In step 303, generating a data query response according to the storage files carried in the received request responses of each query keyword, and returning the data query response to the client, includes:
Step a: determining the total number of received storage files according to the storage files carried in the request responses of each query keyword.
Step b: determining the sampling number according to the total number of files, the comprehensive query error rate and a set probability value, where the set probability value is the ratio of the number of historically relevant files to the historical sampling number, and the number of historically relevant files is the number of storage files, among the historically sampled storage files, that are relevant to the historical query keywords.
Step c: randomly taking the sampling number of storage files out of the received storage files.
Step d: generating the data query response from the sampled storage files and returning it to the client.
In one embodiment, the search engine supports retrieving at least K storage files relevant to a data query request rather than all relevant storage files. Such Top-K queries allow client-side paging for fast browsing or batch processing. Because of the false positive guarantee of the search engine, the storage files returned by the storage nodes in response to the data acquisition requests contain $F_0$ false positives on average, so the search engine can sample the obtained storage files. Let R be the total number of storage files obtained. If $K \ge R - F_0$, the search engine fetches all R storage files and sends them to the client. Otherwise, each storage file is treated as a relevant storage file according to a Bernoulli distribution $\mathrm{Bern}(p = 1 - F_0/R)$. Applying the Hoeffding inequality and solving the resulting quadratic inequality guarantees that, with probability at least $1 - \delta$, a sample of $R_K$ storage files (formula (6)) contains at least K relevant storage files.
Here $p = 1 - F_0/R$; $F_0$ is the comprehensive query error rate; R is the total number of storage files obtained; $R_K$ is the sampling number; and K is the number of storage files relevant to the query keyword that the sample must contain.
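One way to compute a sampling number with this guarantee is sketched below. It is an assumption-laden illustration rather than formula (6) itself: it treats the sampled files as independent Bernoulli(p) draws, applies the one-sided Hoeffding bound $\exp(-2 (R_K p - K)^2 / R_K) \le \delta$, and solves the resulting quadratic for the smallest admissible $R_K$. The function name `sample_size` is hypothetical.

```python
import math


def sample_size(K, R, F0, delta):
    """Number of files to sample so that, with probability >= 1 - delta,
    at least K sampled files are relevant (independence is assumed)."""
    if K >= R - F0:
        return R  # not enough slack: fetch everything
    p = 1.0 - F0 / R
    a = math.log(1.0 / delta) / 2.0
    # Larger root of p^2 n^2 - (2 p K + a) n + K^2 = 0.
    n = (2 * p * K + a + math.sqrt(a * a + 4 * p * K * a)) / (2 * p * p)
    return min(R, math.ceil(n))


print(sample_size(K=10, R=1000, F0=5.0, delta=0.05))
print(sample_size(K=998, R=1000, F0=5.0, delta=0.05))
```

The first call returns a sample only slightly larger than K because false positives are rare; the second call falls back to fetching all R storage files.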
In the data query method of fig. 3 and its related methods and embodiments, common keywords require special handling. Common keywords are keywords whose query frequency exceeds a query frequency threshold, or keywords that appear in many storage files in the corpus; how they are defined may be set as required and is not specifically limited. Some information retrieval systems treat them as stop words and filter them out in every retrieval or query step. In contrast, the present application supports searching for common keywords. The challenge is that their large individual inverted files would be merged into the corresponding merged inverted files in the search engine, degrading query performance for other keywords. The solution of the present application is to set aside 1% of the hash function buckets to store the individual inverted files of the most common keywords. For example, if B = 100,000, the system uses 99,000 buckets for intersection processing in the search engine and uses 1,000 buckets to store the individual inverted files of the 1,000 most common keywords. The individual inverted files of these common keywords may also use the compression method of the above embodiment.
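The bucket split for common keywords can be sketched as follows. The sketch assumes that "most common" is measured by the number of storage files containing a keyword, which is one of the definitions the text allows; `split_buckets` and `pick_common_keywords` are hypothetical names.

```python
from collections import Counter


def split_buckets(total_buckets, reserved_percent=1):
    """Return (buckets used for intersection, buckets reserved for
    the individual inverted files of common keywords)."""
    reserved = total_buckets * reserved_percent // 100
    return total_buckets - reserved, reserved


def pick_common_keywords(doc_keywords, reserved):
    """Choose the keywords contained in the most storage files.

    doc_keywords: iterable with one collection of keywords per storage file.
    """
    doc_freq = Counter(w for words in doc_keywords for w in set(words))
    return [w for w, _ in doc_freq.most_common(reserved)]


print(split_buckets(100_000))
print(pick_common_keywords([{"the", "cat"}, {"the", "dog"}, {"the"}], 1))
```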
In one embodiment, although the search engine natively supports only queries on a single query keyword, it can be adapted to accelerate other categories of queries. For example, as with an inverted index, the search engine's approach of applying different preset mapping methods and intersecting the combined inverted files generalizes naturally to Boolean queries. Let Q(w) be the query function of the search engine, which returns the combined inverted files of the query keyword w and other related storage files (for example, for value range queries, and synonym queries and antonym queries of the query keyword). The search engine performs any Boolean query by applying its query function to each query keyword. In addition, regular expressions can benefit from the same approach, just as they do from inverted indexes: by indexing N-grams, a regular expression engine uses the inverted index (here, the different preset mapping methods and the intersection processing of combined inverted files) as a filter to avoid scanning the whole corpus; the remaining storage files, that is, those listed in the individual inverted files obtained by intersection processing of the combined inverted files, are then matched by the regular expression engine to remove false positives.
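The generalization to Boolean queries can be illustrated with a short sketch. It assumes a hypothetical single-keyword function `lookup` playing the role of Q(w) and a nested tuple representation of the query; both are illustrative choices, not part of the application.

```python
def evaluate(query, lookup):
    """Evaluate a nested Boolean query using a single-keyword lookup.

    query: either a keyword string, or a tuple (op, left, right) where
           op is "AND" or "OR" and left/right are sub-queries.
    lookup: callable playing the role of Q(w); returns the candidate
            file identifiers for one keyword.
    """
    if isinstance(query, str):
        return set(lookup(query))
    op, left, right = query
    a = evaluate(left, lookup)
    b = evaluate(right, lookup)
    if op == "AND":
        return a & b
    if op == "OR":
        return a | b
    raise ValueError(f"unsupported operator: {op}")
```

NOT is deliberately left out of this sketch: because candidate sets may contain false positives, complementing them could drop relevant storage files. As with single-keyword queries, the candidates returned here would still be checked against the fetched storage files to remove false positives.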
In one embodiment, in the data query method of the present application, the cloud storage system may be implemented using Elasticsearch (a distributed search engine) or Solr; the specific implementation of the cloud storage system is not limited herein.
Based on the same concept, an embodiment of the present application provides a data query device 2200, which is applicable to a computing node in a cloud storage system, where the cloud storage system includes the computing node and a storage node, and the storage node includes a plurality of storage files, as shown in fig. 22, and includes:
an extracting unit 2210, configured to extract at least one query keyword from a data query request sent by a client;
a first processing unit 2220, configured to perform the following operations for the at least one query keyword respectively:
the mapping unit 2221 is configured to determine each mapping area associated with a query keyword by using different preset mapping methods, obtain the corresponding combined inverted file from each mapping area, and perform intersection processing on the obtained combined inverted files to obtain an independent inverted file of the query keyword; each combined inverted file is obtained by combining the independent inverted file of the query keyword with the independent inverted files of other keywords; the independent inverted file of a query keyword comprises the pieces of file information associated with the query keyword, each piece of file information comprising a node identifier of the storage node where the corresponding storage file is located and a file identifier; the plurality of mapping areas are respectively associated with the different preset mapping methods, and each mapping area contains a combined inverted file;
a transceiver unit 2222, configured to send a data acquisition request to each corresponding storage node based on the node identifier and the file identifier of each piece of file information, and receive a request response sent by each storage node;
the second processing unit 2230 is configured to generate a data query response according to the received storage file carried in each request response of each query keyword, and return the data query response to the client by using the transceiver unit 2222.
Optionally, the mapping unit 2221 is specifically configured to perform the following operations for the different preset mapping methods, respectively:
mapping the query keyword by adopting a preset mapping method to obtain a mapping value;
determining a mapping area associated with the mapping values, and acquiring combined inverted files from the mapping area, wherein the mapping values of the keywords of each single inverted file in the combined inverted files in the mapping area are the same in one preset mapping method, and the combined inverted files in a plurality of mapping areas associated with the preset mapping method comprise file information of storage files of each storage node.
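The lookup performed by the mapping unit can be sketched as follows. This is an illustration only: the preset mapping methods are modelled as salted SHA-256 hashes and each mapping area as a dictionary entry, which are assumptions not stated in the application.

```python
import hashlib


def mapping_value(keyword, method_index, num_areas):
    """One hypothetical preset mapping method per method_index (salted hash)."""
    data = f"{method_index}:{keyword}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % num_areas


def lookup(keyword, tables, num_areas):
    """Intersect the combined inverted files that the keyword maps to.

    tables[i] maps a mapping value to the combined inverted file, a set of
    (node_id, file_id) pairs, stored in that mapping area under method i.
    The result is a superset of the keyword's true file information; any
    extra entries are false positives caused by other keywords that share
    all of the keyword's mapping areas.
    """
    result = None
    for i, table in enumerate(tables):
        combined = table.get(mapping_value(keyword, i, num_areas), set())
        result = set(combined) if result is None else result & combined
    return result or set()
```

Using several mapping methods is what keeps the false positive rate low: a file that is unrelated to the keyword survives the intersection only if it appears in every one of the keyword's mapping areas.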
Optionally, the extracting unit 2210 is further configured to obtain new file information of a new file based on detection of each storage node, and extract at least one inserted keyword from the new file, where the new file is a file that is received by a storage node and stored in a storage node and is different from each stored file;
the first processing unit 2220 is further configured to, if it is determined that a keyword record contains the at least one inserted keyword, where the keyword record is used to record the keywords contained in the storage files of each storage node, perform the following operations for the at least one inserted keyword respectively:
the mapping unit 2221 is further configured to determine each mapping area associated with one inserted keyword by using the different preset mapping methods, obtain a corresponding combined inverted file from each mapping area associated with the one inserted keyword, and store an association relationship between the one inserted keyword and the new file information in the corresponding combined inverted file.
Optionally, the extracting unit 2210 is further configured to obtain new file information of a new file based on detection of each storage node, and extract at least one inserted keyword from the new file, where the new file is a file that is received by a storage node and stored in a storage node and is different from each stored file;
the first processing unit 2220 is further configured to determine that the at least one inserted keyword includes a new keyword that is not contained in a keyword record, where the keyword record is used to record the keywords contained in the storage files of each storage node;
The mapping unit 2221 is further configured to generate, for the new keyword, an individual inverted file of the new keyword, where the individual inverted file of the new keyword includes: new file information of the new file associated with the new keyword;
the mapping unit 2221 is further configured to determine each mapping area associated with the new keyword by using the different preset mapping methods, obtain corresponding combined inverted files from each mapping area associated with the new keyword, and combine the single inverted files of the new keyword into the corresponding combined inverted files.
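The two insertion cases above (an inserted keyword already in the keyword record, and a new keyword) can be sketched together. The representation is an assumption: mapping methods are salted hashes, each mapping area holds a set of (node_id, file_id) pairs, and the keyword record is a Python set.

```python
import hashlib


def mapping_value(keyword, method_index, num_areas):
    """One hypothetical preset mapping method per method_index (salted hash)."""
    data = f"{method_index}:{keyword}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % num_areas


def insert_file(keywords, file_info, tables, keyword_record, num_areas):
    """Register a newly added file under each of its inserted keywords.

    For a known keyword, the association with file_info is added to the
    combined inverted file of every mapping area the keyword maps to.
    For a new keyword, its individual inverted file is just {file_info};
    merging that file into the combined inverted files is the same set
    union, so the only extra step is updating the keyword record.
    """
    for kw in keywords:
        if kw not in keyword_record:
            keyword_record.add(kw)
        for i, table in enumerate(tables):
            area = mapping_value(kw, i, num_areas)
            table.setdefault(area, set()).add(file_info)
```

In this simplified in-memory form both cases reduce to the same update; in a real deployment the combined inverted files would live in the storage nodes and would have to be rewritten there.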
Optionally, the extracting unit 2210 is further configured to obtain, based on detection of each storage node, deletion file information of a deletion file, and extract at least one deletion keyword from the deletion file, where the deletion file is a file deleted by the storage node from each storage file stored in the storage node;
the first processing unit 2220 is further configured to perform the following operations, for the at least one deletion keyword, respectively:
the mapping unit 2221 is further configured to determine each mapping area associated with one deletion keyword by using the different preset mapping methods, obtain corresponding merged inverted files from each mapping area associated with the one deletion keyword, and perform intersection processing on each obtained merged inverted file to obtain an independent inverted file of the one deletion keyword; and deleting the deleted file information from the single inverted file of the deleted keyword.
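The application does not state how a deletion is written back to the combined inverted files. One possible approach, shown here purely as an assumption, is to keep a reference count per piece of file information in each mapping area, so that file information shared with other keywords in the same area is removed only when no keyword references it any more.

```python
import hashlib
from collections import Counter


def mapping_value(keyword, method_index, num_areas):
    """One hypothetical preset mapping method per method_index (salted hash)."""
    data = f"{method_index}:{keyword}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % num_areas


def delete_file(keywords, file_info, tables, num_areas):
    """Remove a deleted file from the mapping areas of its deletion keywords.

    tables[i][v] is assumed to be a Counter mapping file_info to the number
    of keywords in that mapping area that reference the file.
    """
    for kw in keywords:
        for i, table in enumerate(tables):
            combined = table.get(mapping_value(kw, i, num_areas))
            if combined and combined[file_info] > 0:
                combined[file_info] -= 1
                if combined[file_info] == 0:
                    del combined[file_info]
```

With this layout, intersection processing for a query would run over the keys of each Counter, so a deleted file disappears from the individual inverted file of the deletion keyword once its count reaches zero in at least one of that keyword's mapping areas.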
Optionally, the second processing unit 2230 is specifically configured to, for the at least one query keyword, perform the following operations in parallel:
based on the node identification and the file identification of each file information of a query keyword, respectively sending a data acquisition request to each corresponding storage node, and receiving a request response sent by each storage node.
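The parallel acquisition of storage files can be sketched with a thread pool. The callable `fetch` stands in for sending a data acquisition request to a storage node and reading the storage file from its request response; it is a hypothetical placeholder, as is the worker count.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(file_info_by_keyword, fetch, max_workers=8):
    """Fetch the storage files of all query keywords concurrently.

    file_info_by_keyword: maps each query keyword to a set of
        (node_id, file_id) pairs from its individual inverted file.
    fetch: hypothetical callable fetch(node_id, file_id) that sends the
        data acquisition request and returns the storage file.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            kw: [pool.submit(fetch, node, fid) for node, fid in sorted(infos)]
            for kw, infos in file_info_by_keyword.items()
        }
        # Collect results per keyword; result() re-raises fetch errors.
        return {kw: [f.result() for f in fs] for kw, fs in futures.items()}
```

All requests are submitted before any result is awaited, so requests for different keywords and different storage nodes overlap instead of running one after another.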
Optionally, the extracting unit 2210 is further configured to extract initial keywords and file information of the stored files in each storage node;
the first processing unit 2220 is further configured to perform the following steps for each initial keyword:
acquiring file information of each storage file containing one initial keyword aiming at the initial keyword;
establishing association relations between the initial keywords and file information of each storage file respectively, and storing the association relations to initial independent inverted files of the initial keywords;
obtaining respective initial independent inverted files of the initial keywords, and respectively executing the following steps for the different preset mapping methods:
combining the initial independent inverted files of at least one initial keyword with the same mapping value based on the respective mapping value of each initial keyword under the preset mapping method to obtain initial combined inverted files;
And storing the initial merging and inverted file into a mapping area associated with the corresponding mapping value.
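The initial construction steps above can be sketched as follows, assuming that the keywords of each storage file have already been extracted. The salted hash used as a preset mapping method and the in-memory dictionaries used as mapping areas are illustrative assumptions.

```python
import hashlib
from collections import defaultdict


def mapping_value(keyword, method_index, num_areas):
    """One hypothetical preset mapping method per method_index (salted hash)."""
    data = f"{method_index}:{keyword}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big") % num_areas


def build_index(files, num_methods, num_areas):
    """Build individual and combined inverted files.

    files: maps file information (node_id, file_id) to the initial
           keywords contained in that storage file.
    Returns (individual, tables), where individual[kw] is the initial
    individual inverted file of kw and tables[i][v] is the initial
    combined inverted file of mapping area v under mapping method i.
    """
    individual = defaultdict(set)
    for file_info, keywords in files.items():
        for kw in keywords:
            individual[kw].add(file_info)

    tables = [defaultdict(set) for _ in range(num_methods)]
    for kw, infos in individual.items():
        for i, table in enumerate(tables):
            # Keywords with the same mapping value share one combined file.
            table[mapping_value(kw, i, num_areas)] |= infos
    return individual, tables
```

Only the combined inverted files need to be stored for querying; the individual inverted files are an intermediate result in this sketch.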
Optionally, the first processing unit 2220 is further configured to,
acquiring the number of keywords of initial keywords contained in each storage file and the query probability of each initial keyword, wherein the query probability is determined according to the historical query condition of the initial keywords;
determining the keyword query conditions of each storage file according to the query probability of each initial keyword, wherein the keyword query conditions are determined according to the query probability of the initial keywords not contained in the corresponding storage file;
establishing a function association relation (a curve relation) among the number of keywords and the keyword query condition of each storage file, the number of areas of the mapping areas, the number of methods of the preset mapping methods, and an expected comprehensive query error rate, where the comprehensive query error rate represents the error rate of the data queried by the computing node;
and when the expected comprehensive query error rate meets a preset value condition, acquiring the different preset mapping methods that satisfy the function association relation, where the number of methods is smaller than the number of areas.
Optionally, the second processing unit 2230 is specifically configured to,
determining the total number of the received storage files according to the storage files carried in each request response of each query keyword;
determining the sampling number according to the total number of the files, the comprehensive query error rate and a set probability value, wherein the set probability value is the ratio of the number of history-related files to the number of history samples, and the number of history-related files is the number of stored files which are contained in the stored files of the number of history samples and are related to the history query keywords;
according to the sampling number, randomly taking out the sampling number storage files from the received storage files;
and generating the data query response according to the sampled storage files, and returning the data query response to the client.
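The random sampling step can be sketched as below. The sample count is taken as an input here, since its exact computation is given by the formula of the method embodiment; the function name and the optional random generator argument are illustrative.

```python
import random


def sample_received_files(received_files, sample_count, rng=random):
    """Randomly take sample_count storage files from the received files.

    If the sample count is not smaller than the number of received files,
    all received files are returned unchanged.
    """
    files = list(received_files)
    if sample_count >= len(files):
        return files
    return rng.sample(files, sample_count)
```

Passing a seeded `random.Random` instance as `rng` makes the sampling reproducible, which can be useful when testing the data query response.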
Based on the same inventive concept as the above-mentioned method embodiment, a computer device is also provided in the embodiment of the present application. In one embodiment, the computer device may be a server, such as server 220 and server 230 shown in FIG. 2. In this embodiment, the architecture of the computer device may include a memory 2301, a communication module 2303, and one or more processors 2302, as shown in FIG. 23.
Memory 2301 for storing computer programs executed by processor 2302. The memory 2301 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, programs required for running an instant messaging function, and the like; the storage data area can store various instant messaging information, operation instruction sets and the like.
The memory 2301 may be a volatile memory, such as a random-access memory (RAM); the memory 2301 may also be a non-volatile memory, such as a read-only memory (ROM), a flash memory, a hard disk drive (HDD) or a solid-state drive (SSD); or the memory 2301 may be any other medium that can be used to carry or store a desired computer program in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 2301 may also be a combination of the above memories.
The processor 2302 may include one or more central processing units (central processing unit, CPU) or digital processing units, or the like. A processor 2302 for implementing the data query method described above when invoking a computer program stored in memory 2301.
The communication module 2303 is used to communicate with terminal devices and other servers.
The specific connection medium between the memory 2301, the communication module 2303 and the processor 2302 is not limited in the embodiment of the application. In fig. 23, the memory 2301 and the processor 2302 are connected via a bus 2304, which is drawn as a thick line; the connections between other components are merely illustrative and not limiting. The bus 2304 may be divided into an address bus, a data bus, a control bus, and the like. For ease of description, only one thick line is drawn in fig. 23, but this does not mean that there is only one bus or only one type of bus.
The memory 2301 serves as a computer storage medium in which computer-executable instructions are stored for implementing the data query method of the embodiments of the present application. The processor 2302 is configured to perform the data query method described above, as shown in fig. 3 or 11 or 12 or 15 or 18 or 21.
In another embodiment, the computer device may also be other computer devices, such as the terminal device 210 shown in FIG. 2. In this embodiment, the structure of the computer device may include, as shown in fig. 24: communication assembly 2410, memory 2420, display unit 2430, camera 2440, sensor 2450, audio circuit 2460, bluetooth module 2470, processor 2480, and the like.
The communication component 2410 is used for communicating with a server. In some embodiments, it may include a wireless fidelity (WiFi) module; WiFi is a short-range wireless transmission technology, and the computer device can help the user send and receive information through the WiFi module.
Memory 2420 may be used to store software programs and data. The processor 2480 performs various functions and data processing of the terminal device 210 by executing software programs or data stored in the memory 2420. The memory 2420 may include high speed random access memory and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid state memory device. The memory 2420 stores an operating system that enables the terminal device 210 to operate. The memory 2420 of the present application may store an operating system and various application programs, and may also store a computer program for executing the data query method of the embodiment of the present application.
The display unit 2430 may also be used to display information input by a user or information provided to the user and a graphical user interface (graphical user interface, GUI) of various menus of the terminal device 210. Specifically, the display unit 2430 can include a display 2432 disposed on a front side of the terminal device 210. The display 2432 may be configured in the form of a liquid crystal display, light emitting diodes, or the like. The display unit 2430 may be used to display a data query user interface or the like in an embodiment of the present application.
The display unit 2430 may also be used to receive input numeric or character information and to generate signal inputs related to user settings and function control of the terminal device 210. Specifically, the display unit 2430 may include a touch screen 2431 disposed on the front of the terminal device 210, which may collect touch operations performed by the user on or near it, such as clicking buttons, dragging scroll boxes, and the like.
The touch screen 2431 may cover the display screen 2432, or the touch screen 2431 may be integrated with the display screen 2432 to implement the input and output functions of the terminal device 210; after integration, the two may be referred to simply as a touch display screen. The display unit 2430 may display an application program and the corresponding operation steps.
The camera 2440 may be used to capture still images and a user may comment on the images captured by the camera 2440 through an application. The number of cameras 2440 may be one or more. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive elements convert the optical signals to electrical signals, which are then transferred to a processor 2480 for conversion to digital image signals.
The terminal device may further comprise at least one sensor 2450, such as an acceleration sensor 2451, a distance sensor 2452, a fingerprint sensor 2453, a temperature sensor 2454. The terminal device may also be configured with other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, light sensors, motion sensors, and the like.
The audio circuit 2460, the speaker 2461 and the microphone 2462 can provide an audio interface between a user and the terminal device 210. The audio circuit 2460 may transmit the electrical signal converted from received audio data to the speaker 2461, where it is converted into a sound signal and output. The terminal device 210 may also be configured with a volume button for adjusting the volume of the sound signal. Conversely, the microphone 2462 converts collected sound signals into electrical signals, which are received by the audio circuit 2460 and converted into audio data; the audio data are then output to the communication component 2410 for transmission to, for example, another terminal device 210, or to the memory 2420 for further processing.
The bluetooth module 2470 is configured to interact with other bluetooth devices having bluetooth modules via a bluetooth protocol. For example, the terminal device may establish a bluetooth connection with a wearable computer device (e.g., a smart watch) that also has a bluetooth module through the bluetooth module 2470, thereby performing data interaction.
Processor 2480 is a control center of the terminal device and connects various parts of the entire terminal using various interfaces and lines, performs various functions of the terminal device and processes data by running or executing software programs stored in memory 2420, and invoking data stored in memory 2420. In some embodiments, processor 2480 can include one or more processing units; processor 2480 can also integrate an application processor that primarily handles operating systems, user interfaces, applications, and the like, with a baseband processor that primarily handles wireless communications. It will be appreciated that the baseband processor described above may not be integrated into the processor 2480. Processor 2480 can run an operating system, applications, user interface displays, and touch responses, as well as data query methods of embodiments of the present application. In addition, a processor 2480 is coupled to the display unit 2430.
In some possible embodiments, aspects of the data query method provided by the present application may also be implemented in the form of a program product comprising a computer program for causing a computer device to perform the steps of the data query method according to the various exemplary embodiments of the present application as described herein above, when the program product is run on a computer device, e.g. the computer device may perform the steps as shown in fig. 3 or 11 or 12 or 15 or 18 or 21.
The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. The readable storage medium can be, for example, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include the following: an electrical connection having one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The program product of embodiments of the present application may take the form of a portable compact disc read only memory (CD-ROM) and comprise a computer program and may run on a computer device. However, the program product of the present application is not limited thereto, and in this document, a readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with a command execution system, apparatus, or device.
The readable signal medium may comprise a data signal propagated in baseband or as part of a carrier wave in which a readable computer program is embodied. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with a command execution system, apparatus, or device.
A computer program embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer programs for performing the operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer program may execute entirely on the user's computer device, partly on the user's device, as a stand-alone software package, partly on the user's computer device and partly on a remote computer device or entirely on the remote computer device or server. In the case of remote computer devices, the remote computer device may be connected to the user computer device through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer device (for example, through the Internet using an Internet service provider).
It should be noted that although several units or sub-units of the apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, the features and functions of two or more of the elements described above may be embodied in one element in accordance with embodiments of the present application. Conversely, the features and functions of one unit described above may be further divided into a plurality of units to be embodied.
Furthermore, although the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that the operations must be performed in that particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps may be combined into one step, and/or one step may be decomposed into multiple steps.
It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It will be apparent to those skilled in the art that various modifications and variations can be made to the present application without departing from the spirit or scope of the application. Thus, it is intended that the present application also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims (15)

1. A data query method, characterized in that the method is applicable to a computing node in a cloud storage system, the cloud storage system comprises the computing node and a storage node, the storage node comprises a plurality of storage files, and the data query method comprises the following steps:
extracting at least one query keyword from a data query request sent by a client;
the following operations are respectively executed for the at least one query keyword:
determining each mapping area associated with one query keyword by using different preset mapping methods, obtaining the corresponding combined inverted file from each mapping area, and performing intersection processing on the obtained combined inverted files to obtain an independent inverted file of the one query keyword; each combined inverted file is obtained by combining the independent inverted file of the one query keyword with the independent inverted files of other keywords; the independent inverted file of the one query keyword comprises the pieces of file information associated with the one query keyword, each piece of file information comprising a node identifier of the storage node where the corresponding storage file is located and a file identifier; the plurality of mapping areas are respectively associated with the different preset mapping methods, and each mapping area contains a combined inverted file;
based on the respective node identification and the file identification of each file information, respectively sending a data acquisition request to each corresponding storage node, and receiving a request response sent by each storage node; and generating a data query response and returning the data query response to the client according to the storage file carried in each received request response of each query keyword.
2. The method of claim 1, wherein the determining each mapping area associated with a query keyword by using different preset mapping methods, and obtaining the corresponding combined inverted file from each mapping area respectively, includes:
for the different preset mapping methods, the following operations are respectively executed:
mapping the query keyword by adopting a preset mapping method to obtain a mapping value;
determining a mapping area associated with the mapping values, and acquiring combined inverted files from the mapping area, wherein the mapping values of the keywords of each single inverted file in the combined inverted files in the mapping area are the same in one preset mapping method, and the combined inverted files in a plurality of mapping areas associated with the preset mapping method comprise file information of storage files of each storage node.
3. The method as recited in claim 1, further comprising:
based on detection of each storage node, obtaining newly-added file information of a newly-added file, and extracting at least one inserted keyword from the newly-added file, wherein the newly-added file is a file that is received and stored by a storage node and is different from each stored storage file;
if it is determined that a keyword record contains the at least one inserted keyword, wherein the keyword record is used for recording the keywords contained in the storage files of each storage node, performing the following operations for the at least one inserted keyword respectively:
and determining each mapping area associated with one inserted keyword by adopting different preset mapping methods, obtaining corresponding combined inverted files from each mapping area associated with one inserted keyword, and storing the association relation between one inserted keyword and the newly added file information into the corresponding combined inverted files.
4. The method as recited in claim 1, further comprising:
based on detection of each storage node, obtaining newly-added file information of a newly-added file, and extracting at least one inserted keyword from the newly-added file, wherein the newly-added file is a file that is received and stored by a storage node and is different from each stored storage file;
determining that the at least one inserted keyword includes a newly added keyword that is not contained in a keyword record, wherein the keyword record is used for recording the keywords contained in the storage files of each storage node;
Generating an independent inverted file of the newly added keyword aiming at the newly added keyword, wherein the independent inverted file of the newly added keyword comprises: new file information of the new file associated with the new keyword;
and determining each mapping area associated with the new keyword by adopting different preset mapping methods, respectively obtaining corresponding combined inverted files from each mapping area associated with the new keyword, and combining the independent inverted files of the new keyword into the corresponding combined inverted files.
5. The method as recited in claim 1, further comprising:
obtaining, based on detection of each storage node, deleted file information of a deleted file, and extracting at least one deleted keyword from the deleted file, wherein the deleted file is a file that a storage node has deleted from its stored storage files;
performing the following operations for each of the at least one deleted keyword:
determining, by using the different preset mapping methods, each mapping area associated with one deleted keyword, obtaining the corresponding combined inverted file from each of those mapping areas, and intersecting the obtained combined inverted files to obtain the independent inverted file of the one deleted keyword; and
deleting the deleted file information from the independent inverted file of the one deleted keyword.
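The deletion steps of claim 5 can be sketched as follows. The sketch assumes a hypothetical salted hash (`mapping_value`) as the preset mapping method and models each combined inverted file as a set of `(node_id, file_id)` pairs; the function names and parameters are illustrative only.

```python
import hashlib


def mapping_value(method: int, keyword: str, num_areas: int) -> int:
    """Hypothetical stand-in for one preset mapping method: a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_areas


def independent_inverted_file(keyword, combined, num_methods, num_areas):
    """Intersect the combined inverted files of all areas the keyword maps to."""
    area_sets = [
        combined.get((method, mapping_value(method, keyword, num_areas)), set())
        for method in range(num_methods)
    ]
    return set.intersection(*area_sets) if area_sets else set()


def delete_file_from_keyword(keyword, file_info, combined, num_methods, num_areas):
    """Return the keyword's independent inverted file without the deleted file."""
    posting = independent_inverted_file(keyword, combined, num_methods, num_areas)
    posting.discard(file_info)  # no error if the file info is absent
    return posting
```

The claim does not say how the updated independent inverted file is written back to the shared combined inverted files, so the sketch returns the updated set and leaves `combined` untouched.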
6. The method of any one of claims 1-5, comprising:
performing the following operation in parallel for each of the at least one query keyword:
sending, based on the node identifier and the file identifier in each piece of file information of one query keyword, a data acquisition request to each corresponding storage node, and receiving the request response sent by each storage node.
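One way to issue the requests of claim 6 in parallel is a thread pool. The sketch below is an assumption-laden illustration: `fetch_file` is a hypothetical callable standing in for the data acquisition request to a storage node, and `max_workers=8` is arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def fetch_all(postings, fetch_file, max_workers=8):
    """Fetch every file of every query keyword concurrently.

    postings   -- dict mapping a query keyword to a list of (node_id, file_id)
    fetch_file -- hypothetical callable (node_id, file_id) -> request response
    Returns a dict mapping each keyword to its responses, in posting order.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Submit all requests first so that keywords do not wait on each other.
        futures = {
            keyword: [pool.submit(fetch_file, node_id, file_id)
                      for node_id, file_id in infos]
            for keyword, infos in postings.items()
        }
        # result() blocks until the corresponding request has completed.
        return {keyword: [future.result() for future in pending]
                for keyword, pending in futures.items()}
```

Threads suit this step because the work is dominated by waiting on network responses; an exception raised by `fetch_file` is re-raised when `result()` is called.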
7. The method according to any one of claims 1-5, further comprising, before extracting at least one query keyword from the data query request sent by the client:
extracting initial keywords and file information from the storage files stored in each storage node;
performing the following steps for each initial keyword:
obtaining, for one initial keyword, the file information of each storage file that contains the one initial keyword;
establishing an association between the one initial keyword and the file information of each such storage file, and storing the associations in an initial independent inverted file of the one initial keyword;
obtaining the initial independent inverted file of each initial keyword, and performing the following steps for each of the different preset mapping methods:
combining, based on the mapping value of each initial keyword under one preset mapping method, the initial independent inverted files of the initial keywords that share the same mapping value, to obtain an initial combined inverted file; and
storing the initial combined inverted file in the mapping area associated with the corresponding mapping value.
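The index construction of claim 7 can be sketched in two passes. The names and defaults here are assumptions for illustration: `mapping_value` is a hypothetical salted hash in place of a preset mapping method, and inverted files are modeled as sets of `(node_id, file_id)` pairs.

```python
import hashlib
from collections import defaultdict


def mapping_value(method: int, keyword: str, num_areas: int) -> int:
    """Hypothetical stand-in for one preset mapping method: a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_areas


def build_index(stored_files, num_methods=3, num_areas=16):
    """Build independent and combined inverted files from the stored files.

    stored_files -- dict mapping (node_id, file_id) to the keywords in that file
    Returns (independent, combined): keyword -> file infos, and
    (method, mapping value) -> file infos of all keywords mapped to that area.
    """
    # Pass 1: one initial independent inverted file per initial keyword.
    independent = defaultdict(set)
    for file_info, keywords in stored_files.items():
        for keyword in keywords:
            independent[keyword].add(file_info)

    # Pass 2: for each mapping method, merge the independent inverted files
    # of keywords that share a mapping value into one combined inverted file.
    combined = defaultdict(set)
    for method in range(num_methods):
        for keyword, posting in independent.items():
            combined[(method, mapping_value(method, keyword, num_areas))] |= posting
    return dict(independent), dict(combined)
```

Only the combined inverted files need to be kept for querying in this model; the independent inverted files are an intermediate product of the build.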
8. The method of any one of claims 1 to 5, wherein before determining each mapping area associated with a query keyword by using different preset mapping methods, the method further comprises:
obtaining the keyword count of the initial keywords contained in each storage file and the query probability of each initial keyword, wherein the query probability is determined from the historical queries of the initial keyword;
determining a keyword query condition of each storage file according to the query probability of each initial keyword, wherein the keyword query condition of a storage file is determined from the query probabilities of the initial keywords not contained in that storage file;
wherein the keyword count and keyword query condition of each storage file, the number of mapping areas, the number of preset mapping methods, and an expected comprehensive query error rate satisfy a functional relationship, the comprehensive query error rate characterizing the error rate of data queries by the computing node; and
when the expected comprehensive query error rate meets a preset value condition, obtaining the different preset mapping methods that satisfy the functional relationship, wherein the number of methods is smaller than the number of areas.
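The claim does not state the functional relationship itself. As a loose analogy only, the classic Bloom filter false-positive approximation, (1 - e^(-kn/m))^k for n keys, m slots, and k hash functions, shows how a method count smaller than the area count might be chosen against a target error rate. The formula is swapped in for illustration and is not taken from the patent; the function and parameter names are hypothetical.

```python
import math


def bloom_false_positive_rate(num_keywords: int, num_areas: int,
                              num_methods: int) -> float:
    """Classic Bloom filter approximation (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_methods * num_keywords / num_areas)) ** num_methods


def smallest_method_count(num_keywords: int, num_areas: int, target_rate: float):
    """Smallest k with k < num_areas whose estimated rate is <= target_rate.

    Returns None when no such k exists.
    """
    for k in range(1, num_areas):  # enforces "number of methods < number of areas"
        if bloom_false_positive_rate(num_keywords, num_areas, k) <= target_rate:
            return k
    return None
```

The patent's relationship also depends on per-file keyword counts and query probabilities, which this simplified estimate ignores.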
9. The method of claim 7, wherein generating a data query response according to the storage files carried in each received request response for each query keyword, and returning the data query response to the client, comprises:
determining the total number of received storage files according to the storage files carried in each request response for each query keyword;
determining a sampling number according to the total number of files, the comprehensive query error rate, and a set probability value, wherein the set probability value is the ratio of a historical related-file count to a historical sample count, and the historical related-file count is the number of storage files, among the historical sample count of storage files, that are related to the historical query keywords;
randomly selecting the sampling number of storage files from the received storage files; and
generating the data query response according to the selected storage files, and returning it to the client.
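The random selection step of claim 9 can be sketched as below. The claim does not give the formula that derives the sampling number from the total file count, the comprehensive query error rate, and the set probability value, so the sketch takes the sampling number as an input; `sample_files` and its `seed` parameter are hypothetical.

```python
import random


def sample_files(received_files, sample_count, seed=None):
    """Randomly pick sample_count distinct files from the received files.

    sample_count is assumed to be computed elsewhere; it is clamped to the
    range [0, len(received_files)] so the call never over-samples.
    """
    files = list(received_files)
    count = max(0, min(sample_count, len(files)))
    rng = random.Random(seed)  # a seed makes the selection reproducible
    return rng.sample(files, count)
```

`random.Random.sample` draws without replacement, so no storage file appears twice in the response.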
10. A data query device, applicable to a computing node in a cloud storage system, wherein the cloud storage system comprises the computing node and a storage node, the storage node comprising a plurality of storage files, the data query device comprising:
an extraction unit, configured to extract at least one query keyword from a data query request sent by a client;
a first processing unit, configured to perform the following operations for each of the at least one query keyword:
a mapping unit, configured to determine, by using different preset mapping methods, each mapping area associated with one query keyword, obtain the corresponding combined inverted file from each mapping area, and intersect the obtained combined inverted files to obtain the independent inverted file of the one query keyword; wherein each combined inverted file is obtained by combining that independent inverted file with the independent inverted files of other keywords; the independent inverted file of the one query keyword comprises each piece of file information associated with the one query keyword, each piece of file information comprising a file identifier and a node identifier of the storage node where the corresponding storage file is located; the different preset mapping methods are each associated with a plurality of mapping areas; and each mapping area comprises one combined inverted file;
a transceiver unit, configured to send, based on the node identifier and the file identifier of each piece of file information, a data acquisition request to each corresponding storage node, and to receive the request response sent by each storage node; and
a second processing unit, configured to generate a data query response according to the storage files carried in each received request response for each query keyword, and to return the data query response to the client through the transceiver unit.
11. The device of claim 10, wherein the mapping unit is configured to
perform the following operations for each of the different preset mapping methods:
mapping the one query keyword by using one preset mapping method to obtain a mapping value; and
determining the mapping area associated with the mapping value, and obtaining the combined inverted file from that mapping area, wherein the keywords of all independent inverted files within the combined inverted file of a mapping area have the same mapping value under the one preset mapping method, and the combined inverted files in the plurality of mapping areas associated with the one preset mapping method together comprise the file information of the storage files of every storage node.
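The lookup performed by the mapping unit can be sketched as an intersection over mapping areas. As before, this is a sketch under assumptions: `mapping_value` is a hypothetical salted hash in place of a preset mapping method, and each combined inverted file is a set of `(node_id, file_id)` pairs keyed by `(method, mapping value)`.

```python
import hashlib


def mapping_value(method: int, keyword: str, num_areas: int) -> int:
    """Hypothetical stand-in for one preset mapping method: a salted SHA-256 hash."""
    digest = hashlib.sha256(f"{method}:{keyword}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_areas


def lookup(keyword, combined, num_methods=3, num_areas=16):
    """Recover the independent inverted file of a query keyword.

    For each mapping method, map the keyword to a mapping area, read that
    area's combined inverted file, and intersect the results.
    """
    result = None
    for method in range(num_methods):
        area = (method, mapping_value(method, keyword, num_areas))
        area_files = combined.get(area, set())
        result = set(area_files) if result is None else result & area_files
    return result or set()
```

Provided the index was built with the same mapping methods, the result always contains every file that holds the keyword. It may also contain extra files when other keywords share all of the same mapping areas, which is the kind of error the comprehensive query error rate is meant to bound.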
12. The device of claim 10, wherein the extraction unit is further configured to obtain, based on detection of each storage node, newly added file information of a newly added file, and to extract at least one inserted keyword from the newly added file, wherein the newly added file is a file that a storage node has received and stored and that differs from the storage files already stored;
the first processing unit is further configured to, if it is determined that a keyword record contains the at least one inserted keyword, the keyword record being used to record the keywords contained in the storage files of all storage nodes,
perform the following operations for each of the at least one inserted keyword:
the mapping unit is further configured to determine, by using the different preset mapping methods, each mapping area associated with one inserted keyword, obtain the corresponding combined inverted file from each of those mapping areas, and store the association between the one inserted keyword and the newly added file information in each corresponding combined inverted file.
13. A computer readable non-volatile storage medium, characterized in that the computer readable non-volatile storage medium stores a program which, when run on a computer, causes the computer to implement the method of any one of claims 1 to 9.
14. A computer device, comprising:
a memory for storing a computer program;
a processor for invoking a computer program stored in said memory, performing the method according to any of claims 1 to 9 in accordance with the obtained program.
15. A computer program product comprising a computer program, the computer program being stored on a computer readable storage medium; when a processor of a computer device reads the computer program from the computer readable storage medium, the processor executes the computer program, causing the computer device to perform the method of any one of claims 1 to 9.
CN202310715282.2A 2023-06-15 2023-06-15 Data query method, device, equipment and storage medium Pending CN116962516A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310715282.2A CN116962516A (en) 2023-06-15 2023-06-15 Data query method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310715282.2A CN116962516A (en) 2023-06-15 2023-06-15 Data query method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116962516A true CN116962516A (en) 2023-10-27

Family

ID=88457355

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310715282.2A Pending CN116962516A (en) 2023-06-15 2023-06-15 Data query method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116962516A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117633362A (en) * 2023-12-13 2024-03-01 北京小懂科技有限公司 Medical information recommendation method and platform based on big data analysis technology
CN118035324A (en) * 2024-04-15 2024-05-14 航天宏图信息技术股份有限公司 Data processing query method, device, server and medium


Similar Documents

Publication Publication Date Title
CN111782965B (en) Intention recommendation method, device, equipment and storage medium
US12026155B2 (en) Executing one query based on results of another query
US11263268B1 (en) Recommending query parameters based on the results of automatically generated queries
US12093318B2 (en) Recommending query parameters based on tenant information
US11604799B1 (en) Performing panel-related actions based on user interaction with a graphical user interface
US10685071B2 (en) Methods, systems, and computer program products for storing graph-oriented data on a column-oriented database
Hu et al. Toward scalable systems for big data analytics: A technology tutorial
CA2953826C (en) Machine learning service
US7930288B2 (en) Knowledge extraction for automatic ontology maintenance
US11636128B1 (en) Displaying query results from a previous query when accessing a panel
Zhang et al. MRMondrian: Scalable multidimensional anonymisation for big data privacy preservation
CN116962516A (en) Data query method, device, equipment and storage medium
van Altena et al. Understanding big data themes from scientific biomedical literature through topic modeling
JP2006107446A (en) Batch indexing system and method for network document
KR20130049111A (en) Forensic index method and apparatus by distributed processing
CN112818195B (en) Data acquisition method, device and system and computer storage medium
US9223992B2 (en) System and method for evaluating a reverse query
US20230082446A1 (en) Compound predicate query statement transformation
CN117149777A (en) Data query method, device, equipment and storage medium
US20230153300A1 (en) Building cross table index in relational database
Rashid et al. Data lakes: a panacea for big data problems, cyber safety issues, and enterprise security
US20180060336A1 (en) Format Aware File System With File-to-Object Decomposition
DeBrie The dynamodb book
US11755626B1 (en) Systems and methods for classifying data objects
Pradeep et al. Big Data analysis: a step to define

Legal Events

Date Code Title Description
PB01 Publication