CN113821704B - Method, device, electronic equipment and storage medium for constructing index

Info

Publication number: CN113821704B
Application number: CN202010562441.6A
Authority: CN
Inventors: 顾明
Original assignee: Huawei Cloud Computing Technologies Co Ltd
Current assignee: Huawei Cloud Computing Technologies Co Ltd
Priority date: 2020-06-18
Filing date: 2020-06-18
Publication date: 2024-01-16
Anticipated expiration: 2040-06-18
Also published as: CN113821704A

Abstract

The embodiment of the application provides a method, a device, electronic equipment and a storage medium for constructing an index, wherein the method comprises the following steps: generating a first index and a second index according to the document, wherein the first index represents the mapping relation between the vector and the document, and the second index represents the mapping relation between the text and the document; storing a first index into a first type of file set, wherein the first index is in an available state, and the first index in the available state is used for searching documents associated with search content through vectors; and storing the second index into a file set of a second type, and establishing a mapping relation among the first index, the second index and the document. According to the method and the device, files where the first index is located do not need to be combined, so that the time for constructing the index can be saved, and the efficiency of constructing the index is improved. In the embodiment of the application, the mapping relation between the first index, the second index and the document is also established, so that the consistency of the indexes in the first type file set and the second type file set can be ensured.

Description

Method, device, electronic equipment and storage medium for constructing index

Technical Field

Embodiments of the present application relate to search technologies, and in particular, to a method, an apparatus, an electronic device, and a storage medium for constructing an index.

Background

A user can search for information through a web page or a search application on a terminal device. Taking a web page as an example, the user enters text in an input box of the web page to search; this search mode is called text search. With the development of search technology, a user can also input a picture or a video to search, and the terminal device displays the search results for the picture or video; this search mode is called vector search. In both text search and vector search, the terminal device sends the text, picture, or video input by the user to a server, and the server obtains the search results according to a constructed index. The index represents the mapping relation between text and documents or the mapping relation between vectors and documents.

With the advent of vector search, a need for joint text and vector search has arisen. To search text and vectors at the same time, a text search system and a vector search system can be integrated in a server, and each system constructs its own index. In the prior art, to ensure the consistency of the indexes constructed by the two systems, the vector search system generates indexes in the same way as the text search system: it sequentially generates small files that each contain a plurality of indexes, the indexes in a small file can be searched once the small file has been generated, and when the small files reach a certain number, the indexes in the small files are merged to generate a large file.

In the prior art, the vector search system continuously generates small files and merges the small files, so that the time for constructing the index is long and the efficiency is low.

Disclosure of Invention

The embodiment of the application provides a method, a device, electronic equipment and a storage medium for constructing an index, which can save the time for constructing the index, reduce the resources consumed by constructing the index and improve the efficiency of constructing the index.

In a first aspect, embodiments of the present application provide a method for constructing an index. The method may be applied to a server for constructing an index, and may also be applied to a chip in the server. The method is described below as applied to the server. The server can receive a document from the first terminal device, the document being a document for which an index is to be constructed. The server generates a first index and a second index according to the document, wherein the first index represents the mapping relation between a vector and the document, and the second index represents the mapping relation between text and the document. That is, in the embodiment of the present application, the first index is a vector type index, and the second index is a text type index. A vector type index means that, after the user inputs search content, the document related to the search content can be found through the vector corresponding to the search content and the first index. A text type index means that, after the user inputs search content, the document related to the search content can be found through the keywords of the search content and the second index.

In this embodiment of the present application, after generating the first index and the second index of a document, the server may store the first index into a first type of file set, store the second index into a second type of file set, and establish a mapping relationship among the first index, the second index, and the document. The first type of file set is used for storing vector type indexes, and the second type of file set is used for storing text type indexes. It should be noted that, in the embodiment of the present application, the files in the first type of file set are not merged, that is, they are not merged in the way that files holding text type indexes are. Instead, the first index is in an available state as soon as it is generated, and the first index in the available state is used for searching, through a vector, for the document associated with the search content. It should be understood that the available state means that the first index can be searched, that is, the first index can be used to obtain the above-mentioned document once the first index has been generated.

In the process of constructing the index, the files where the first index is located do not need to be combined, so that the time for constructing the index can be saved, and the efficiency of constructing the index is improved. In the embodiment of the application, the mapping relation between the first index, the second index and the document is also established, so that the consistency of the indexes in the first type file set and the second type file set can be ensured.

The first type of file set comprises at least one first file, and the first file is used for storing first indexes. The second type of file set comprises at least one second file, and the second file is used for storing second indexes. In the embodiment of the present application, storing the first index into the first type of file set means writing the first index into one first file in the first type of file set, and storing the second index into the second type of file set means writing the second index into one second file in the second type of file set. The first index may be written into any one of the first files, and the second index may be written into any one of the second files. Accordingly, the embodiments of the present application need to establish a mapping relationship among the document, the first index in the first file, and the second index in the second file.
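The mapping relationship among the document, the first index in a first file, and the second index in a second file can be pictured as a small lookup table. The following is a minimal sketch only: the source does not specify how the mapping is stored, so the `Location` record (file number plus offset) and the `IndexMapping` class are hypothetical names chosen for illustration.

```python
from typing import NamedTuple


class Location(NamedTuple):
    """Hypothetical address of one index entry: which file, and where in it."""
    file_no: int
    offset: int


class IndexMapping:
    """Hypothetical table linking a document to its first and second index."""

    def __init__(self):
        self._by_doc = {}

    def link(self, doc_id, first, second):
        # Record where the vector-type and text-type entries of doc_id live.
        self._by_doc[doc_id] = (first, second)

    def locate(self, doc_id):
        # Return (first index location, second index location) for doc_id.
        return self._by_doc[doc_id]


mapping = IndexMapping()
mapping.link("doc-1", first=Location(0, 3), second=Location(2, 0))
first_loc, second_loc = mapping.locate("doc-1")
```

Keeping both locations under one document key is one way to let the two file sets be checked against each other for consistency.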

The following describes a procedure of writing a first index into a first file in a first type of file set in the embodiment of the present application:

If the number of written indexes in the i-th first file in the first type of file set is smaller than a first threshold, the first index is written into the i-th first file, wherein i is an integer greater than or equal to 1. If the number of written indexes in the i-th first file is equal to the first threshold, an (i+1)-th first file is created, and the first index is written into the (i+1)-th first file. That is, the first files in the first type of file set are generated sequentially: when the number of first indexes written into one first file reaches the first threshold, a new first file is created, and subsequent first indexes are written into the newly created first file.
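The rollover rule for first files can be sketched as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: `FirstFile`, `FirstTypeFileSet`, and the in-memory lists standing in for files are all hypothetical, and file positions are 0-based here whereas the text counts from 1.

```python
from dataclasses import dataclass, field


@dataclass
class FirstFile:
    """One file of the first (vector) type; a list stands in for file contents."""
    entries: list = field(default_factory=list)


class FirstTypeFileSet:
    def __init__(self, first_threshold: int):
        self.first_threshold = first_threshold
        self.files = [FirstFile()]  # first files are created sequentially

    def write(self, doc_id: str, vector: list) -> int:
        """Write one first index and return the 0-based number of the file used."""
        current = self.files[-1]
        if len(current.entries) == self.first_threshold:
            # The current file is full: create the next file and write there.
            current = FirstFile()
            self.files.append(current)
        current.entries.append((doc_id, vector))
        return len(self.files) - 1


file_set = FirstTypeFileSet(first_threshold=2)
positions = [file_set.write(f"doc{n}", [float(n)]) for n in range(5)]
```

With a threshold of 2, five writes fill two files and start a third. No merge step appears anywhere in this class, which mirrors the statement that first files are never merged.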

The following describes a procedure of writing a second index into a second file in a second type of file set in the embodiment of the present application:

If the number of written indexes in the j-th second file in the second type of file set is smaller than a second threshold, the second index is written into the j-th second file, wherein j is an integer greater than or equal to 1. If the number of written indexes in the j-th second file is equal to the second threshold, a (j+1)-th second file is created, and the second index is written into the (j+1)-th second file. Similar to the above process of writing the first index into the first file, the second files in the second type of file set are generated sequentially: when the number of second indexes written into one second file reaches the second threshold, a new second file is created, and subsequent second indexes are written into the newly created second file. It should be noted that, because the second files in the second type of file set are small files that are continuously generated and then merged, the second threshold in the embodiment of the present application is smaller than the first threshold.

The first index in a first file in the first type of file set is searchable in real time. By contrast, a second file in the second type of file set is converted from the write mode to the read-only mode only when the number of second indexes in it reaches the second threshold, and only the second indexes in a second file that has been converted to the read-only mode are in an available state. Taking the j-th second file as an example, if the number of written indexes in the j-th second file is equal to the second threshold, the j-th second file is converted from the write mode to the read-only mode, the second indexes in the j-th second file converted to the read-only mode are in an available state, and a second index in the available state is used for searching, through text, for the document associated with the search content.
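The difference in availability can be made concrete with a small sketch of the second type of file set. It is illustrative only: `SecondFile`, `SecondTypeFileSet`, and `available_entries` are hypothetical names, and a Python list stands in for each file.

```python
class SecondFile:
    def __init__(self):
        self.entries = []        # (keyword, doc_id) pairs
        self.read_only = False   # a new file starts in write mode


class SecondTypeFileSet:
    def __init__(self, second_threshold):
        self.second_threshold = second_threshold
        self.files = [SecondFile()]

    def write(self, keyword, doc_id):
        current = self.files[-1]
        if current.read_only:
            # The previous file reached the threshold: create the next one.
            current = SecondFile()
            self.files.append(current)
        current.entries.append((keyword, doc_id))
        if len(current.entries) == self.second_threshold:
            # Convert from write mode to read-only mode; entries become available.
            current.read_only = True

    def available_entries(self):
        """Only entries in read-only files can be searched."""
        return [e for f in self.files if f.read_only for e in f.entries]


second_set = SecondTypeFileSet(second_threshold=2)
for keyword, doc_id in [("index", "doc1"), ("vector", "doc1"), ("index", "doc2")]:
    second_set.write(keyword, doc_id)
```

After three writes with a threshold of 2, the first file is read-only and searchable, while the third entry sits in a file that is still in write mode and is therefore not yet available.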

In one possible implementation manner of the embodiment of the present application, the number of second indexes that can be written into one second file, that is, the second threshold, may be determined according to a user setting. It should be understood that the second thresholds of the second files in embodiments of the present application may be the same or different. Through the first terminal device, the user may set a conversion duration of the second file, where the conversion duration is the time taken for the second file to be converted from the write mode to the read-only mode, that is, the time after which the second indexes in the second file can be searched. Further, in this embodiment of the present application, the second threshold may be determined according to the conversion duration. Specifically, the number of second indexes that can be written within the conversion duration, that is, the second threshold, may be determined according to the conversion duration and the time needed to write one second index.
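One plausible reading of this rule is a simple division: the second threshold is the number of whole writes that fit into the conversion duration. The sketch below assumes that reading. The function name, the millisecond units, and the lower bound of 1 are assumptions made for illustration and are not stated in the source.

```python
def second_threshold_from_duration(conversion_duration_ms: int,
                                   write_time_per_index_ms: int) -> int:
    """Number of second indexes that fit into one conversion duration.

    Integer milliseconds are used to avoid floating-point rounding.
    """
    if write_time_per_index_ms <= 0:
        raise ValueError("write time per index must be positive")
    # Assumption: at least one index is written before a file is sealed.
    return max(1, conversion_duration_ms // write_time_per_index_ms)


threshold = second_threshold_from_duration(conversion_duration_ms=1000,
                                           write_time_per_index_ms=2)
```

With a 1000 ms conversion duration and 2 ms per write, the threshold is 500 second indexes per file.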

Similar to the second file described above, the user may also set the number of first indexes that can be written into a first file, that is, the first threshold. Alternatively, the first threshold may be predefined.

It should be noted that the second files in the second type of file set are handled by continuously generating small files and merging them. Therefore, in the embodiment of the present application, the second files in the second type of file set may be merged, and the merge may be triggered in either of the following ways:

The first way: if the memory occupied by the second files converted into the read-only mode reaches a preset memory size, the second files converted into the read-only mode are merged. That is, in the second type of file set, when the second files converted into the read-only mode reach the preset memory size, they may be merged into one large file.

The second way: if the current available load is greater than a preset load, the second files converted into the read-only mode are merged. That is, the server may monitor its operating load and, when the available load is greater than the preset load, merge the second files converted into the read-only mode into one large file.

It should be noted that after merging the second file into one large file, the mapping relationship needs to be updated, that is, the mapping relationship of the second index in the merged second file, the first index in the first file, and the document needs to be reestablished.
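The two merge triggers and the subsequent update of the mapping relationship can be sketched as below. This is a hedged illustration: the dictionary layout of a file, the function names, and the `(file number, offset)` form of the mapping are all assumptions, since the source does not describe concrete data structures.

```python
def should_merge(read_only_bytes, preset_bytes, available_load, preset_load):
    """Either trigger suffices: memory reached, or enough spare load."""
    return read_only_bytes >= preset_bytes or available_load > preset_load


def merge_read_only(files):
    """Merge all read-only second files into one large file.

    Each file is a dict: {"read_only": bool, "entries": [(keyword, doc_id), ...]}.
    Files still in write mode are left untouched.
    """
    merged = {"read_only": True, "entries": []}
    still_writing = []
    for f in files:
        if f["read_only"]:
            merged["entries"].extend(f["entries"])
        else:
            still_writing.append(f)
    return [merged] + still_writing


def rebuild_mapping(files):
    """Re-establish doc_id -> [(file_no, offset), ...] after a merge."""
    mapping = {}
    for file_no, f in enumerate(files):
        for offset, (_keyword, doc_id) in enumerate(f["entries"]):
            mapping.setdefault(doc_id, []).append((file_no, offset))
    return mapping


files = [
    {"read_only": True, "entries": [("index", "d1")]},
    {"read_only": True, "entries": [("vector", "d2")]},
    {"read_only": False, "entries": [("text", "d3")]},
]
if should_merge(read_only_bytes=64, preset_bytes=32, available_load=0.1, preset_load=0.5):
    files = merge_read_only(files)
mapping = rebuild_mapping(files)
```

The mapping has to be rebuilt because the file number and offset of every merged second index change, while the first indexes in the first files stay where they are.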

While sending documents to the server through the first terminal device, the user may find that some of the documents already sent are erroneous and want to delete them. The embodiment of the present application therefore also supports deleting documents. In this embodiment of the present application, the server may delete the document after receiving a deletion instruction sent by the first terminal device, wherein the deletion instruction instructs deletion of the document.

The server in the embodiment of the present application may delete the document as follows: the document is marked as being in a deleted state but is not actually deleted, so that the document marked as deleted is not fed back to the terminal device. Alternatively, in the embodiment of the present application, the document may be marked as being in a deleted state and then, when the second files in the second type of file set are merged, actually deleted from the second type of file set, which releases the space that the document occupies in the server.
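The two deletion behaviours described above, marking a document as deleted and physically removing it when second files are merged, can be sketched as follows. The `DocumentStore` class and its method names are hypothetical and serve only to illustrate the idea.

```python
class DocumentStore:
    def __init__(self):
        self.documents = {}       # doc_id -> content
        self.deleted_ids = set()  # documents marked as deleted

    def add(self, doc_id, content):
        self.documents[doc_id] = content

    def mark_deleted(self, doc_id):
        """Mark only; the content stays in storage for now."""
        if doc_id in self.documents:
            self.deleted_ids.add(doc_id)

    def filter_results(self, doc_ids):
        """Marked documents are never fed back to the terminal device."""
        return [d for d in doc_ids if d not in self.deleted_ids]

    def purge_on_merge(self):
        """Called when second files are merged: release the occupied space."""
        purged = [d for d in self.deleted_ids if d in self.documents]
        for d in purged:
            del self.documents[d]
        return len(purged)


store = DocumentStore()
store.add("doc1", "first")
store.add("doc2", "second")
store.mark_deleted("doc2")
visible = store.filter_results(["doc1", "doc2"])
stored_before_merge = len(store.documents)
purged_count = store.purge_on_merge()
```

In this sketch the deletion marks are kept after the purge, so a stale index entry that still points at a purged document would continue to be filtered out.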

Optionally, for a scenario in which a large number of documents are to be deleted, the embodiment of the present application further provides a method for synchronizing deletion between the vector type index and the text type index. The first terminal device may provide a synchronization control, and when the user selects the synchronization control, the first terminal device is triggered to send a synchronous deletion instruction to the server. After receiving the synchronous deletion instruction from the first terminal device, the server can delete the document in the first type of file set according to the synchronous deletion instruction, wherein the synchronous deletion instruction instructs that the deletion of documents in the second type of file set be synchronized to the first type of file set. That is, in the embodiment of the present application, when documents are deleted during the merging of second files in the second type of file set, the corresponding documents in the first type of file set may also be deleted, so that, when triggered by the user, deletion in the first type of file set and deletion in the second type of file set are kept synchronized.

The foregoing describes how the server constructs an index in the embodiment of the present application. The following describes how the constructed index is used to perform a search:

In the embodiment of the present application, when the user searches for a document, the user can input search content through the second terminal device, and the second terminal device sends the search content to the server, so that the server obtains a search result according to the search content, the first type of file set, and the second type of file set, wherein the search result comprises the document. After obtaining the search result, the server can send the search result to the second terminal device, so that the second terminal device displays the search result on an interface or plays the search result. It should be understood that the second terminal device and the first terminal device in the embodiments of the present application may be the same or different.

In the process of obtaining the search result, the server can obtain a first search result according to the search content and the first type of file set; obtaining a second search result according to the search content and the second type of file set; and acquiring the search results according to the first search results and the second search results.

Because the second files in the second type of file set are small files that are continuously generated and then merged, the second indexes in the second type of file set may be partially or completely in an available state. The first indexes in the first type of file set, by contrast, can be searched in real time, so all of them are in the available state. Therefore, in the embodiment of the present application, the first search result may be obtained according to the search content and the first indexes in the first type of file set, and the second search result may be obtained according to the search content and the second indexes that are in an available state in the second type of file set.

In view of the fact that the server in the embodiment of the application can delete the document according to the setting of the user, when the search result corresponding to the search content hits the deleted document, the search result which does not include the document is sent to the second terminal device.

Alternatively, in order to reduce the workload of the server in the embodiments of the present application, the index generated according to the deleted document may be deleted, so that the search result obtained by the server does not include the deleted document.
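The search flow described above (a first result from the vector type index, a second result from the available text type index, a combined result, and removal of deleted documents) can be sketched as follows. The source does not say how vectors are compared or how the two results are combined, so the brute-force squared Euclidean distance and the order-preserving union used here are assumptions, as are all function names.

```python
def vector_search(query_vec, first_index, top_k):
    """first_index: list of (doc_id, vector); all entries are searchable."""
    def sq_dist(vec):
        return sum((a - b) ** 2 for a, b in zip(query_vec, vec))
    ranked = sorted(first_index, key=lambda entry: sq_dist(entry[1]))
    return [doc_id for doc_id, _ in ranked[:top_k]]


def text_search(keyword, available_second_index):
    """available_second_index: (keyword, doc_id) pairs from read-only files only."""
    return [doc_id for kw, doc_id in available_second_index if kw == keyword]


def search(query_vec, keyword, first_index, available_second_index,
           deleted_ids, top_k=2):
    first_result = vector_search(query_vec, first_index, top_k)
    second_result = text_search(keyword, available_second_index)
    # Assumed combination rule: union of both results, duplicates removed.
    combined = list(dict.fromkeys(first_result + second_result))
    # Documents marked as deleted are filtered out before the result is sent.
    return [d for d in combined if d not in deleted_ids]


first_index = [("d1", [0.0, 0.0]), ("d2", [1.0, 1.0]), ("d3", [5.0, 5.0])]
available_second_index = [("cat", "d2"), ("cat", "d3"), ("dog", "d1")]
result = search([0.1, 0.1], "cat", first_index, available_second_index,
                deleted_ids={"d2"})
```

Here the vector search returns d1 and d2, the text search returns d2 and d3, and d2 is dropped because it is marked as deleted.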

In a second aspect, an embodiment of the present application provides an apparatus for constructing an index, including: the receiving and transmitting module is used for receiving the document from the first terminal equipment; the processing module is used for generating a first index and a second index according to the document, storing the first index into a first type file set, storing the second index into a second type file set, and establishing a mapping relation among the first index, the second index and the document, wherein the first index represents the mapping relation between a vector and the document, the second index represents the mapping relation between a text and the document, the first index is in an available state, and the first index in the available state is used for searching the document related to search content through the vector.

In one possible implementation, the first type of file set includes at least one first file, where the first file is used to store a first index. The processing module is specifically configured to write the first index into a first file.

In a possible implementation manner, the second type of file set includes at least one second file, where the second file is used to store a second index. The processing module is specifically configured to write the second index into a second file.

In one possible implementation manner, the processing module is specifically configured to establish a mapping relationship between the first index in the first file, the second index in the second file, and the document.

In a possible implementation manner, the processing module is specifically configured to, if the number of written indexes in an i-th first file in the first type of file set is smaller than a first threshold, write the first index into the i-th first file, where i is an integer greater than or equal to 1; and if the number of written indexes in the i-th first file is equal to the first threshold, create an (i+1)-th first file and write the first index into the (i+1)-th first file.

In a possible implementation manner, the processing module is specifically configured to, if the number of written indexes in a j-th second file in the second type of file set is smaller than a second threshold, write the second index into the j-th second file, where j is an integer greater than or equal to 1; and if the number of written indexes in the j-th second file is equal to the second threshold, create a (j+1)-th second file and write the second index into the (j+1)-th second file.

In one possible implementation manner, the processing module is further configured to switch the j-th second file from a write mode to a read-only mode if the number of written indexes in the j-th second file is equal to the second threshold, where a second index in the j-th second file that has been switched to the read-only mode is in an available state, and the second index in the available state is used for searching, through text, for the document associated with the search content.

In one possible implementation manner, the transceiver module is further configured to receive a conversion duration of the second file from the first terminal device, where the conversion duration is a duration of converting the second file from the write mode to the read-only mode.

Correspondingly, the processing module is further configured to determine the second threshold according to the conversion duration.

In one possible implementation manner, the processing module is further configured to merge the second file converted into the read-only mode if an occupied memory of the second file converted into the read-only mode reaches a preset memory; or if the current available load is greater than the preset load, merging the second files converted into the read-only mode.

The processing module is further configured to establish a mapping relationship between the second index in the merged second file, the first index in the first file, and the document.

In one possible implementation, the documents are included in the second type of file collection.

The receiving and transmitting module is further configured to receive a deletion instruction sent by the first terminal device, where the deletion instruction indicates deletion of the document. Correspondingly, the processing module is further used for marking the document as a deleting state.

The receiving and transmitting module is further configured to receive a deletion instruction sent by the first terminal device, where the deletion instruction indicates deletion of the document. Correspondingly, the processing module is further configured to mark the document as a deletion state, and delete the document in the second type of file set when the second files in the second type of file set are merged.

In a possible implementation manner, the transceiver module is further configured to receive a synchronization deletion instruction from the first terminal device, where the synchronization deletion instruction indicates that a deletion situation of a document in the second type of file set is synchronized to the first type of file set.

Correspondingly, the processing module is further configured to delete the document in the first type of file set according to the synchronous deletion instruction.

In a possible implementation, the transceiver module is further configured to receive the search content from the second terminal device. Correspondingly, the processing module is further configured to obtain a search result according to the search content, the first type of file set and the second type of file set, where the search result includes the document.

The transceiver module is further configured to send the search result to the second terminal device.

In a possible implementation manner, the processing module is specifically configured to obtain a first search result according to the search content and the first type of file set; obtaining a second search result according to the search content and the second type of file set; and acquiring the search results according to the first search results and the second search results.

In a possible implementation manner, the processing module is specifically configured to obtain the first search result according to the search content and a first index in the first type of file set; and obtaining the second search result according to the search content and a second index in an available state in the second type file set.

In one possible implementation, the processing module is further configured to filter the document from the search result if the search result includes the document marked as deleted. Correspondingly, the transceiver module is specifically configured to send a search result that does not include the document to the second terminal device.

In a third aspect, embodiments of the present application provide an apparatus (e.g. a chip) for constructing an index, the apparatus for constructing an index having stored thereon a computer program which, when executed by the apparatus for constructing an index, implements a method as provided in the first aspect.

In a fourth aspect, embodiments of the present application provide an electronic device, which may be the server in the following embodiments. The electronic device includes: a processor, a memory, and a transceiver; the transceiver is coupled to the processor, and the processor controls the transceiving actions of the transceiver; wherein the memory is for storing computer executable program code, the program code comprising instructions; the instructions, when executed by the processor, cause the electronic device to perform the method as provided in the first aspect.

In a fifth aspect, embodiments of the present application provide a computer-readable storage medium storing computer instructions that, when executed by a computer, cause the computer to perform a method as provided in the first aspect.

The embodiment of the application provides a method, a device, electronic equipment and a storage medium for constructing an index, wherein the method comprises the following steps: generating a first index and a second index according to the document, wherein the first index represents the mapping relation between the vector and the document, and the second index represents the mapping relation between the text and the document; storing a first index into a first type of file set, wherein the first index is in an available state, and the first index in the available state is used for searching documents associated with search content through vectors; and storing the second index into a file set of a second type, and establishing a mapping relation among the first index, the second index and the document. Because the first index in the embodiment of the application is in the available state after being generated, that is, in the embodiment of the application, the files where the first index is located do not need to be combined, so that the time for constructing the index can be saved, and the efficiency of constructing the index is improved. In the embodiment of the application, the mapping relation between the first index, the second index and the document is also established, so that the consistency of the indexes in the first type file set and the second type file set can be ensured.

Drawings

FIG. 1 is a schematic diagram of a network architecture suitable for use in the embodiments of the present application;

FIG. 2 is a schematic diagram of a network architecture;

FIG. 3 is a schematic diagram of another network architecture;

FIG. 4 is a schematic diagram of constructing an index;

FIG. 5 is another schematic diagram of constructing an index;

FIG. 6 is a flowchart illustrating an embodiment of a method for constructing an index according to an embodiment of the present disclosure;

FIG. 7 is a first schematic diagram of constructing an index according to an embodiment of the present application;

FIG. 8 is a flowchart illustrating another embodiment of a method for constructing an index according to an embodiment of the present application;

FIG. 9 is a second schematic diagram of constructing an index according to an embodiment of the present application;

FIG. 10 is a flowchart illustrating another embodiment of a method for constructing an index according to an embodiment of the present application;

FIG. 11 is a first schematic interface diagram of a first terminal device provided in an embodiment of the present application;

FIG. 12 is a second schematic interface diagram of the first terminal device provided in the embodiment of the present application;

FIG. 13 is a flowchart illustrating another embodiment of a method for constructing an index according to an embodiment of the present application;

FIG. 14 is a flowchart of another embodiment of a method for constructing an index according to an embodiment of the present application;

FIG. 15 is an interface change schematic diagram of the second terminal device provided in the embodiment of the present application;

FIG. 16 is a first schematic structural diagram of an apparatus for constructing an index according to an embodiment of the present disclosure;

FIG. 17 is a second schematic structural diagram of an apparatus for constructing an index according to an embodiment of the present disclosure;

FIG. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

FIG. 1 is a schematic diagram of a network architecture suitable for the embodiment of the present application. As shown in FIG. 1, the network architecture includes a terminal device and a server. The user can search for information through a web page or a search application on the terminal device, and the server can find documents related to the search content input by the user and feed those documents back to the terminal device. It should be understood that the documents described herein and in the embodiments below are storage objects in text form, picture form, video form, a combination of several forms, or other forms. Documents in embodiments of the present application encompass a wide variety of forms, such as Word documents, Portable Document Format (PDF) files, HyperText Markup Language (HTML) files, Extensible Markup Language (XML) files, images, and videos, all of which may be referred to as documents. For example, an email, a short message, or a microblog post may also be referred to as a document.

It should be understood that the network architecture shown in FIG. 1 applies to the scenario in the embodiments of the present application in which a user searches for information through a terminal device. To distinguish this terminal device from the terminal device that provides documents to the server, described below, it is referred to as the second terminal device in the following embodiments and is labeled as the second terminal device in FIG. 1. In FIG. 1, the terminal device is illustrated as a smartphone.

In the embodiments of the present application, the terminal device may be a user equipment, an access terminal, a subscriber unit, a subscriber station, a remote terminal, a mobile device, a user terminal, a wireless communication device, a user agent, or a user apparatus. The terminal device may be a mobile phone, a tablet computer (pad), a computer with a wireless transceiver function, a cellular phone, a cordless phone, a session initiation protocol (SIP) phone, a personal digital assistant (PDA), a handheld device with a wireless communication function, a computer or other processing device, a vehicle-mounted device, a wearable device, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in a smart home, a terminal device in a future 5G network, or a terminal device in a future evolved public land mobile network (PLMN), which is not limited in the embodiments of the present application.

The server needs to build an index from the documents and then find the documents related to the search content through the built index. Indexes include the forward index and the inverted index. The inverted index may also be referred to as a reverse index, a postings file, or an inverted file.

The forward index is first described as follows:

In the server, each document corresponds to an identifier, such as a document ID, and the content of the document is represented as a set of keywords. For example, the server extracts 20 keywords by performing word segmentation on document 1, and further records the number of occurrences and the positions of each keyword in the document. The 20 keywords, together with the recorded number of occurrences and positions of each keyword in the document, form the index of document 1. In this way, the indexes of all documents can be obtained.

Table 1 below illustrates the structure of a forward index:

Table 1

Document	Keywords
Document 1	Keyword 1, keyword 2, keyword 3
Document 2	Keyword 1, keyword 3, keyword 4
…	…
Document 5	Keyword 2, keyword 4, keyword 5
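The forward index described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function names `build_forward_index` and `search_forward` are hypothetical, and splitting on whitespace is an assumed stand-in for real word segmentation.

```python
def build_forward_index(documents):
    """Map each document ID to its keywords, with occurrence counts and positions."""
    forward = {}
    for doc_id, text in documents.items():
        # Assumption: whitespace splitting stands in for real word segmentation.
        entry = {}
        for position, token in enumerate(text.lower().split()):
            info = entry.setdefault(token, {"count": 0, "positions": []})
            info["count"] += 1
            info["positions"].append(position)
        forward[doc_id] = entry
    return forward


def search_forward(forward, keyword):
    """Scan every document; this full scan is what makes forward-index search slow."""
    return [doc_id for doc_id, entry in forward.items() if keyword in entry]


docs = {1: "apple pie apple", 2: "banana pie", 5: "cherry tart"}
forward_index = build_forward_index(docs)
```

Note that `search_forward` has to visit every entry of the index, which mirrors the full scan over all documents described in the text.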

When searching for documents related to the search content by using the forward index, the server has to scan all the documents, find those that contain the search content or keywords in the search content, score the found documents with a scoring model (that is, compute the similarity or degree of association between each document and the search content or its keywords), rank the documents by score, and then present them to the user. Because the server builds the index over a large number of documents, every search with the forward index scans all of them, which takes a long time and cannot meet the requirement of returning documents in real time.

The inverted index was developed to solve the long search time and low search efficiency of the forward index. Unlike the forward index, the inverted index converts the mapping from documents to keywords into a mapping from keywords to documents, that is, each keyword corresponds to the documents in which it appears. When the server searches with the inverted index, it can therefore obtain the keywords related to the search content and feed back the documents corresponding to those keywords to the terminal device. Compared with the forward index, the inverted index shortens the search time and improves search efficiency.

In addition to the keywords and the mapping between keywords and documents, the inverted index may record the positions and the frequency at which each keyword appears in each document. The frequency of a keyword in each document can affect the final ranking of the documents.

Table 2 below illustrates the structure of an inverted index:

Table 2

Keyword	Documents (position, frequency)
Keyword 1	Document 1 (position a, frequency 10), document 2 (position b, frequency 5)
Keyword 2	Document 1 (position c, frequency 3), document 5 (position d, frequency 5)
…	…
Keyword 5	Document 5 (position e, frequency 1)
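The structure in Table 2 can be illustrated with a minimal Python sketch. The names `build_inverted_index` and `search_inverted` are hypothetical, whitespace splitting is an assumed stand-in for word segmentation, and ranking purely by frequency is an assumed simplification of the scoring model.

```python
def build_inverted_index(documents):
    """Map each keyword to the documents containing it, with positions and frequency."""
    inverted = {}
    for doc_id, text in documents.items():
        for position, token in enumerate(text.lower().split()):
            posting = inverted.setdefault(token, {}).setdefault(
                doc_id, {"positions": [], "frequency": 0}
            )
            posting["positions"].append(position)
            posting["frequency"] += 1
    return inverted


def search_inverted(inverted, keyword):
    """Look up one keyword directly; no scan over all documents is needed."""
    postings = inverted.get(keyword, {})
    # Assumption: rank by frequency only, as a stand-in for a real scoring model.
    return sorted(postings, key=lambda d: postings[d]["frequency"], reverse=True)


docs = {1: "apple pie apple", 2: "apple tart", 5: "pie"}
inverted_index = build_inverted_index(docs)
```

A lookup touches only the posting list of the queried keyword, which is why the inverted index shortens search time compared with the forward index.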

As described above, in the text search mode the server builds the mapping between keywords (text) and documents as the index, and the user inputs search content in text form, such as the word "apple" input by the user in FIG. 1. When searching for documents according to text-form search content, the server can perform word segmentation on the search content to obtain its keywords, calculate the similarity between these keywords and the keywords in the index, and feed back to the terminal device the documents corresponding to the index keywords with higher similarity.

It can be understood that, when feeding back the documents corresponding to the keywords with higher similarity, the server can score the documents to determine their order (that is, the order in which the user sees them on the terminal device). The score of a document may be determined according to a scoring model and the positions and frequency at which the keyword appears in the document, which is not described in detail in this embodiment. For example, if the keyword of the search content input by the user is "apple", the server may feed back the documents with higher similarity to "apple" to the terminal device.

With the development of search technology, a new type of search mode, i.e., a vector search mode, has emerged. That is, in addition to inputting text for searching, a user may also input search content of a picture, video, or other non-text type for searching. If a user inputs a picture in the terminal device, the server can feed back a document related to the picture according to the picture. In this way of vector searching, the server needs to build an index of vector types from the documents, and then search for related documents according to the index of vector types.

For example, the server may extract a vector from the document, characterize the document in terms of a vector, and construct a mapping relationship between the vector and the document, i.e., construct an index. Correspondingly, when searching, the server can extract the vector from the picture input by the user, and further calculate the distance between the vector of the picture and the vector in the index, so as to feed back the documents corresponding to the vectors with the closest vector distance to the picture to the terminal equipment. It should be understood that the manner in which the vector is extracted from the document may refer to the related description in the prior art, and will not be described herein.
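A brute-force version of this distance-based lookup can be sketched as follows. The sketch assumes that vectors have already been extracted from the documents and the query, that Euclidean distance is the distance measure, and that `nearest_documents` is a hypothetical helper; the text does not prescribe any of these.

```python
import math


def nearest_documents(vector_index, query_vector, k=2):
    """Return the IDs of the k documents whose vectors are closest to the query vector."""
    # Assumption: Euclidean distance; the source does not fix a distance measure.
    ranked = sorted(
        vector_index.items(),
        key=lambda item: math.dist(item[1], query_vector),
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Vector index: a mapping between document IDs and document vectors.
vector_index = {1: (0.0, 0.0), 2: (1.0, 1.0), 3: (5.0, 5.0)}
```

For a query vector of `(0.9, 0.9)`, document 2 is the closest and document 1 the second closest, so those two documents would be fed back first.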

With the advent of vector search, a need for joint text and vector search has arisen, that is, a user can enter text and non-text content at the same time as the search content. For example, the search content entered by the user may be a brand picture showing brand A together with the text "off-road vehicle"; the user's intention is to obtain documents about off-road vehicles of brand A.

In order to meet the requirement of the text and vector joint search, the server can simultaneously establish an index of a text type and an index of a vector type aiming at the document, and further jointly obtain the document related to the searched content by combining the two indexes. Fig. 2 is a schematic diagram of a network architecture. As shown in fig. 2, the network architecture includes: terminal equipment, server, text indexing system and vector indexing system.

When building the text index and the vector index, the server can send the documents received from the terminal device to the text indexing system and to the vector indexing system. The text indexing system builds a text index from the text it receives, and the vector indexing system builds a vector index from the vectors it receives. When the server receives search content from the terminal device, it can send the search content to both systems. Each system searches for related documents using its own index, and the server integrates the documents fed back by the two systems and outputs the final documents.

Because the text indexing system and the vector indexing system each build their own index from the documents, no mapping is established between the copies of the same document in the two systems, which makes it difficult for the server to integrate the documents fed back by the two systems. Therefore, when building their indexes, both systems need to record a unique key for each document, so that the same document can be identified in both systems by that key and the server can integrate the documents fed back by the two systems. However, storing the unique keys introduces additional space overhead.

In addition, the documents enter the two indexing systems separately, and the two systems build their indexes at different speeds. As a result, the data in the text indexing system and the data in the vector indexing system are poorly consistent, which affects the accuracy of the returned documents. For example, the search content entered by the user may be a brand picture showing brand A together with the text "off-road vehicle", with the intention of obtaining documents about off-road vehicles of brand A. Because the indexes of the two systems are inconsistent, the results may instead be documents only about brand A or documents only about off-road vehicles, and the results the user intends cannot be obtained. Moreover, the server has to interact with the text indexing system and the vector indexing system over a network to build the indexes and to obtain search results, so feedback efficiency is low.

In order to solve the problems of poor data consistency and low feedback efficiency caused by the network architecture in fig. 2, a network architecture as shown in fig. 3 is also provided. Fig. 3 is a schematic diagram of another network architecture. As shown in fig. 3, the network architecture includes: terminal equipment and a server. Unlike fig. 2, in fig. 3, the functions of the text indexing system and the vector indexing system are integrated in the server, and the server constructs the text index and the vector index for the document input by the terminal device at the same time, so as to avoid the problems of misalignment and poor consistency of data caused by entering the document into two independent systems. It should be understood that fig. 3 is also a network architecture to which the embodiments of the present application are applicable.

Here, to show that the terminal device in FIG. 3 is different from the terminal device in FIG. 1, the terminal device in FIG. 3 is illustrated as a computer. This terminal device is the first terminal device in the following embodiments and is labeled as the first terminal device in FIG. 3. It should be understood that, for possible forms of the terminal device in FIG. 3, reference may be made to the above description of FIG. 1.

Under the network architecture shown in FIG. 3, an index can be built in two ways.

Because both ways rely on the text index construction process and the vector index construction process, these two processes are first described briefly. When building a text index, the text indexing system generates an index for each document and writes the index into a small file. The indexes in a small file can be searched only after the number of indexes in the small file reaches a preset number (that is, only then can they be used to find the corresponding documents). When the small files meet a merging condition, they need to be merged into a large file. The merging condition may be that the memory occupied by the small files reaches a preset size, or that a preset length of time has elapsed. When building a vector index, the vector indexing system generates an index for each document and writes the index into a file, and the indexes in that file can be searched in real time. For the specific processes of building the text index and the vector index, reference may also be made to the detailed descriptions in the prior art; they are only outlined here.

The first way: FIG. 4 is a schematic diagram of constructing an index. As shown in FIG. 4, upon receiving document 1, the server may generate text index 1 and vector index 1, write text index 1 into small file 1, and write vector index 1 into small file 1'. When the number of indexes in small file 1 and small file 1' is greater than the preset number, the indexes in small file 1 and small file 1' can be searched. Correspondingly, the server may also generate small file 2 and small file 2', and small file 3'.

To keep the constructed indexes consistent, the server can handle both indexes in the same way that a text search system generates its index: the indexes in a small file become searchable after the number of indexes in the small file reaches a preset number, and the small files are merged into a large file after the number of small files reaches a certain number. For example, the server merges small file 1 and small file 2 to generate large file 4, and merges small file 1' and small file 2' to generate large file 4'.

It should be noted that the files holding text indexes can be merged by a relatively simple method, such as splicing together the indexes in small file 1 and small file 2. The files holding vector indexes, however, cannot be merged by splicing: the vector index has to be rebuilt from the contents of small file 1' and small file 2' in order to merge them. Continuously generating small files for the vector index and then merging them is therefore difficult and time-consuming, which makes index construction slow and inefficient. In addition, the server can provide the search function while it is building the index. In this approach the server must merge files during index construction, and merging multiple vector index files in particular consumes considerable resources, which degrades search efficiency.
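The difference in merging cost can be made concrete with a toy sketch. Here a text index file is modeled as a keyword-to-document-list mapping that can be spliced, and a vector index file is modeled as a nearest-neighbor graph that has to be recomputed over all vectors. Both models and all function names are assumptions for illustration; a real vector index such as a graph index is considerably more complex.

```python
import math


def merge_text_files(file_a, file_b):
    """Merge two text index files by splicing their posting lists together."""
    merged = {}
    for small_file in (file_a, file_b):
        for keyword, doc_ids in small_file.items():
            merged.setdefault(keyword, []).extend(doc_ids)
    return merged


def build_neighbor_graph(vectors):
    """Toy vector index: link every document to its single nearest neighbor."""
    graph = {}
    for doc_id, vec in vectors.items():
        others = [(math.dist(vec, v), other) for other, v in vectors.items() if other != doc_id]
        graph[doc_id] = min(others)[1] if others else None
    return graph


def merge_vector_files(vectors_a, vectors_b):
    """Merging cannot splice two graphs; the graph is rebuilt over all vectors."""
    return build_neighbor_graph({**vectors_a, **vectors_b})


vectors_1 = {1: (0.0, 0.0), 3: (10.0, 10.0)}
vectors_2 = {2: (1.0, 1.0), 4: (9.0, 9.0)}
merged_graph = merge_vector_files(vectors_1, vectors_2)
```

Before the merge, documents 1 and 3 are each other's nearest neighbors; after the merge every link changes (1 links to 2, 3 links to 4), so none of the links computed for the small files can simply be reused.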

The second way: FIG. 5 is another schematic diagram of constructing an index. As shown in FIG. 5, upon receiving document 1, the server may generate text index 1 and vector index 1, and write both text index 1 and vector index 1 into small file 1. When the number of indexes in small file 1 is greater than the preset number, the indexes in small file 1 can be searched. Correspondingly, the server may also generate small file 2 and small file 3.

To keep the constructed indexes consistent, the server can again handle the indexes in the same way that a text search system generates its index: the indexes in a small file become searchable after the number of indexes in the small file reaches a preset number, and the small files are merged into a large file after the number of small files reaches a certain number. For example, the server merges small file 1 and small file 2 to generate large file 4.

In the second way, the text index and the vector index are written into the same file. However, small files are still continuously generated and merged for the vector index, so the second way has the same problems as the first: merging is difficult and time-consuming, index construction is inefficient, and a large amount of resources is consumed.

To solve the above problems, the embodiments of the present application provide a method for constructing an index. On the basis of the network architecture shown in FIG. 3, the vector index is written into a file during its construction, but the files are not merged, which avoids the long construction time and low construction efficiency caused by merging small files. The embodiments of the present application also establish a mapping relationship between the text index and the vector index, which ensures consistency between the vector index and the text index.

It should be noted that the method for constructing an index provided in the embodiments of the present application is applicable to a scenario where a text index and a vector index are constructed, and may also be applicable to a scenario where a text index and other types of indexes (different from a vector index) are constructed, and may also be applicable to a scenario where a vector index and other types of indexes (different from a text index) are constructed.

The method for constructing the index provided in the embodiment of the present application is described below with reference to specific embodiments. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes. Fig. 6 is a flowchart of an embodiment of a method for constructing an index according to an embodiment of the present application. As shown in fig. 6, the method for constructing an index provided in the embodiment of the present application may include:

S601, receiving a document from a first terminal device.

S602, according to the document, generating a first index and a second index, wherein the first index represents the mapping relation between the vector and the document, and the second index represents the mapping relation between the text and the document.

S603, storing a first index into a first type file set, wherein the first index is in an available state, and the first index in the available state is used for searching documents associated with search content through vectors.

S604, storing the second index into a second type file set, and establishing a mapping relation among the first index, the second index and the document.
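Steps S601 to S604 can be summarized in a minimal in-memory sketch. The class `IndexServer`, its field names, and the file names are hypothetical, and the document is assumed to arrive already separated into a vector and its text; this only illustrates the flow and is not the claimed implementation.

```python
from dataclasses import dataclass, field


@dataclass
class IndexServer:
    """Toy server holding a first type file set, a second type file set, and the mapping."""

    first_type_files: dict = field(default_factory=lambda: {"file_1'": []})
    second_type_files: dict = field(default_factory=lambda: {"small_file_1": []})
    mapping: dict = field(default_factory=dict)

    def receive_document(self, doc_id, vector, text):
        # S601: a document arrives (assumed here to be a vector plus its text).
        # S602: generate the first (vector) index and the second (text) index.
        first_index = {"vector": tuple(vector), "doc_id": doc_id}
        second_index = {"keywords": sorted(set(text.lower().split())), "doc_id": doc_id}
        # S603: store the first index; it is available at once, with no merge step.
        self.first_type_files["file_1'"].append(first_index)
        # S604: store the second index and record the mapping among the
        # first index, the second index, and the document.
        self.second_type_files["small_file_1"].append(second_index)
        self.mapping[doc_id] = {
            "first_index_file": "file_1'",
            "second_index_file": "small_file_1",
        }

    def available_first_indexes(self):
        """First indexes are searchable as soon as they have been stored."""
        return list(self.first_type_files["file_1'"])


server = IndexServer()
server.receive_document(1, [0.1, 0.2], "apple pie apple")
```

The per-document entry in `mapping` is what ties the two file sets together, so a hit in either index can be traced back to the same document.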

In S601, under the network architecture shown in FIG. 3, the first terminal device may send a document to be indexed to the server so that the server builds an index from the document. Correspondingly, the server receives the document from the first terminal device. It should be understood that, for the form of the document, reference may be made to the description of documents given for FIG. 1, which is not repeated here. The first terminal device may send multiple documents to the server at the same time; the embodiments of the present application describe how the server processes one document.

In S602, two different types of index, a first index and a second index, may be generated from the document. The first index represents the mapping relationship between a vector and the document, that is, between the vector corresponding to the document and the document, and can be understood as the vector index described above. The second index represents the mapping relationship between text and the document, that is, between the text in the document and the document, and can be understood as the text index described above. In the embodiments of the present application, the first index may be a hierarchical navigable small world (HNSW) graph index, and the second index may be a Lucene index, that is, an index built with the Lucene search engine architecture.

Alternatively, in the embodiment of the present application, the manner of generating the first index according to the document may be: the server extracts the vector from the document, characterizes the document in a vector manner, and further builds a mapping relation between the vector and the document, namely builds a first index. The manner of generating the second index according to the document in the embodiment of the present application may be: the server extracts keywords in the document, records the occurrence frequency and the occurrence position of each keyword in the document, and establishes a mapping relation between the keywords and the document, wherein the second index comprises the mapping relation between the keywords and the document, and the occurrence frequency and the occurrence position of each keyword in the document. It should be noted that, in the embodiment of the present application, the index manner is inverted index.

In S603, in the embodiment of the present application, the first index may be stored in the first type of file set, and the second index may be stored in the second type of file set. The first type file set includes a plurality of files, and the files in the first type file set are all used for storing indexes of vector types, namely first indexes. Similarly, the second type of file set includes a plurality of files, and the files in the second type of file set are all used for storing the text type index, namely, the second index.

It should be noted that, in the examples above, the first index is stored in a small file after it is generated, and the indexes in the small file become available (that is, searchable) only when the number of indexes in the small file reaches a preset number, or only when the memory occupied by the small file reaches a preset size. In either case, the first index cannot be searched in real time. In the embodiments of the present application, by contrast, the first index is in the available state, that is, searchable, once it is generated. In other words, the first index is not stored in the same manner as the second index (the text index described above): it does not have to wait until the number of indexes in a small file reaches the preset number. Therefore, the files in which the first index is located do not need to be merged, which saves index construction time and improves index construction efficiency.

Similarly, in S604, the second index may be stored in the second type file set. Because the first index and the second index are stored in different file sets, a mapping relationship among the first index, the second index, and the document can be established to ensure consistency between the indexes in the two file sets, that is, both the first index and the second index can be mapped to the document.

FIG. 7 is a schematic diagram of constructing an index according to an embodiment of the present application. As shown in FIG. 7, the server includes a first type file set and a second type file set. The first index may be stored in file 1' in the first type file set, and the second index may be stored in small file 3 in the second type file set, where small file 1 and small file 2 in the second type file set also store text type indexes. As can be seen from FIG. 7, the second index is stored by continuously generating small files and merging them, whereas the first index is stored directly in file 1' in the first type file set, without merging files.

Compared with the method in FIG. 4, in the embodiments of the present application the server generates the first index and the second index for a document at the same time, so joint queries can be completed faster and without cross-system calls. The first index is in the available state once the server generates it, instead of becoming available only after the number of indexes in a small file reaches the preset number as the second index (the text index) does. Because the files in which the first index is located are not merged, the resources and time consumed by index construction are reduced and index construction efficiency is improved. In addition, the mapping relationship among the first index, the second index, and the document is established, which ensures consistency between the indexes in the first type file set and those in the second type file set.

The following embodiments describe how a server stores a first index to a first type of set of files and a second index to a second type of set of files. Fig. 8 is a flowchart of another embodiment of a method for constructing an index according to an embodiment of the present application. As shown in fig. 8, the method for constructing an index provided in the embodiment of the present application may include:

S801, a document from a first terminal device is received.

S802, generating a first index and a second index according to the document.

S803, the first index is written into a first file.

S804, writing the second index into a second file, and establishing a mapping relation among the first index in the first file, the second index in the second file and the document.

It should be understood that the implementation manners in S801 to S802 in the embodiments of the present application may refer to the relevant descriptions in S601 to S602 in the above embodiments, which are not described herein.

In S803 and S804, the first type file set includes at least one first file, where the first file is used to store an index of a vector type, i.e., a first index. Similarly, the second type file set includes at least one second file, where the second file is used to store an index of the text type, i.e., a second index.

As shown in FIG. 7, the first type file set includes one first file, namely file 1', so in the embodiments of the present application the first index may be written into file 1'. The second type file set includes three second files, namely file 1, file 2, and file 3, and the second index may be written into file 1, file 2, or file 3.

In the embodiments of the present application, after writing the first index and the second index into the corresponding files, the server may establish a mapping relationship among the first index in the first file, the second index in the second file, and the document. For example, assuming that the second index is written into file 2, a mapping relationship among the first index in file 1', the second index in file 2, and document 1 may be established, as shown in Table 3 below:

Table 3

Index in the first type file set	Index in the second type file set	Document
First index (file 1')	Second index (file 2)	Document 1

In one possible implementation, the process of writing the first index into the first file and the second index into the second file is described below with reference to FIG. 9. FIG. 9 is a second schematic diagram of constructing an index according to an embodiment of the present application. As shown in FIG. 9, to illustrate the index construction process, five points in time are used to describe generating indexes, writing indexes into files, creating files, merging files, and so on:

At time 1, the server receives document 1 from the first terminal device and generates first index 1 and second index 1' from document 1. The server creates a first file v1 in the first type file set and the first second file f1 in the second type file set, writes first index 1 into first file v1, writes second index 1' into second file f1, and establishes a mapping relationship among first index 1 in v1, second index 1' in f1, and document 1.

It should be noted that, in FIG. 9, an arrow pointing toward a file indicates that an index is written into the file, and an arrow pointing away from a file indicates that an index is in the available state. The first index in the embodiments of the present application is in the available state, that is, searchable, which FIG. 9 shows with an arrow pointing away from the first file. Second index 1' corresponding to document 1, however, becomes available only when the number of indexes written into second file f1 is greater than the preset number; second index 1' in FIG. 9 therefore has only the arrow for writing into second file f1 and no arrow indicating the available state.

If, after time 1, the server receives document 2, document 3, …, document n from the first terminal device, it may generate in turn the first index and the second index corresponding to each document, write them into first file v1 and second file f1 respectively, and establish the mapping relationship among each document, its first index in v1, and its second index in f1. Alternatively, the first terminal device may send all or some of the n documents to the server at the same time, and the server may generate the first and second indexes for the documents received together either one document after another or at the same time.

If the number of the written indexes in the j-th second file is smaller than the second threshold, continuing to write the generated second index into the j-th second file, wherein j is an integer greater than or equal to 1. If the number of written indexes in the j-th second file is equal to the second threshold value, the j+1th second file is newly built, and the second indexes are written in the j+1th second file. It should be understood that the indexes written in the second file are all text type indexes, i.e. the second indexes.

For example, assume that the number of second indexes that can be written in one second file is n. At time 2, the server receives the document n+1 from the first terminal device and generates a second index (n+1)' corresponding to the document n+1. When the server determines that the number of written indexes in f1 is n, it creates a new second file f2 and writes the second index (n+1)' into f2. Correspondingly, the server generates a first index (n+1) corresponding to the document n+1 and writes it into the first file v1. In addition, the server establishes a mapping relationship among the first index (n+1) in v1, the second index (n+1)' in f2, and the document n+1.

If the number of written indexes in the j-th second file is equal to the second threshold, the j-th second file is converted from the write mode to the read-only mode. The second indexes in the j-th second file converted to the read-only mode are in the available state, and a second index in the available state is used for searching, through text, documents associated with search content. As in the example above, if the number of written indexes in f1 is n, which is equal to the second threshold, f1 may be converted from the write mode to the read-only mode, so that the second indexes written in f1 are all in the available state.
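The rollover rule for second files described above can be sketched in Python as follows. This is a minimal in-memory illustration, not the implementation of the embodiment: the names `SecondFile`, `SecondFileSet`, `write` and `available_indexes` are hypothetical, and a real second file would live on disk.

```python
from dataclasses import dataclass, field


@dataclass
class SecondFile:
    """Hypothetical in-memory stand-in for a second file such as f1 or f2."""
    name: str
    entries: list = field(default_factory=list)
    read_only: bool = False  # read-only means its second indexes are searchable


class SecondFileSet:
    """Hypothetical second type of file set with threshold-based rollover."""

    def __init__(self, second_threshold: int):
        self.second_threshold = second_threshold
        self.files = [SecondFile("f1")]

    def write(self, second_index: str) -> str:
        """Write one second index; return the name of the file that holds it."""
        current = self.files[-1]
        if len(current.entries) == self.second_threshold:
            # The j-th file is full: convert it to read-only mode,
            # then create the (j+1)-th file and write there instead.
            current.read_only = True
            current = SecondFile(f"f{len(self.files) + 1}")
            self.files.append(current)
        current.entries.append(second_index)
        return current.name

    def available_indexes(self) -> list:
        """Second indexes in the available state, i.e. those in read-only files."""
        return [e for f in self.files if f.read_only for e in f.entries]


# With a second threshold of 3, the fourth index triggers the rollover to f2.
file_set = SecondFileSet(second_threshold=3)
placement = [file_set.write(f"{doc}'") for doc in range(1, 5)]
```

In this sketch a full file is sealed when the next index arrives, which mirrors the walkthrough in which f1 is converted to the read-only mode at time 2.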

It should be understood that, in the embodiment of the present application, the second threshold, i.e., the number of indexes written in one second file, may be customized according to how long the user can wait before a second file becomes searchable. If the user can accept a longer duration for a second file, the second threshold is larger, so that more indexes can be written in the second file; conversely, if the user requires a shorter duration, the second threshold is smaller, that is, only a small number of indexes can be written in one second file before subsequent second indexes must be written into the next second file.

Optionally, in the embodiment of the present application, the user may set a conversion duration of the second file in advance through the first terminal device, and correspondingly, the first terminal device may send the conversion duration set by the user to the server, where the conversion duration of the second file is the duration after which the second file is converted from the write mode to the read-only mode. After receiving the conversion duration of the second file, the server may determine the second threshold according to it. Specifically, the server determines, according to the conversion duration of the second file and the duration required for writing one index, the number of indexes that can be written into one second file, that is, the second threshold.
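One way to read this determination is as an integer division of the conversion duration by the per-index write duration. The sketch below makes that assumption explicit; the function name `second_threshold`, the millisecond units, and the lower bound of 1 are illustrative choices that do not come from the embodiment.

```python
def second_threshold(conversion_duration_ms: int, write_duration_ms: int) -> int:
    """Number of indexes that fit into one second file before it is converted.

    Assumption: the threshold is the conversion duration divided by the
    duration required for writing one index, rounded down, and at least 1.
    """
    if conversion_duration_ms <= 0 or write_duration_ms <= 0:
        raise ValueError("durations must be positive")
    return max(1, conversion_duration_ms // write_duration_ms)


# A 2-second conversion duration with 4 ms per index allows 500 indexes per file.
example_threshold = second_threshold(conversion_duration_ms=2000, write_duration_ms=4)
```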

Optionally, in the embodiment of the present application, the second threshold corresponding to each second file may be different. The user may preset a conversion duration for each second file, or may send the conversion duration of each second file to the server through the first terminal device during index construction. The above manner of determining the second threshold is only an example, and other manners of determining the second threshold may also be adopted in the embodiment of the present application. It should be understood that, in fig. 9, the second threshold corresponding to each second file is n by way of example.

Between time 2 and time 3, the server also receives document n+2, document n+3, … from the first terminal device. The server can sequentially generate a first index and a second index corresponding to each document, write them into the first file v1 and the second file f2 respectively, and establish a mapping relationship among each document, its first index in v1, and its second index in f2. It should be understood that, in fig. 9, the second indexes in each second file are labeled from 0 to n.

At time 3, the server receives the document 2n+1 from the first terminal device and generates a first index (2n+1) and a second index (2n+1)' based on the document 2n+1. At this time, because the number of second indexes written in the second file f2 is equal to the second threshold, the server may convert f2 from the write mode to the read-only mode, so that the second indexes in f2 are all in the available state. The server may also create a third second file f3, write the second index (2n+1)' into f3, write the first index (2n+1) into the first file v1, and establish a mapping relationship among the document 2n+1, the first index (2n+1) in v1, and the second index (2n+1)' in f3.

At this time, the server may merge the second files f1 and f2 to generate a large file f4, and the second indexes in f4 are all in the available state. In addition, in the embodiment of the present application, after merging second files, the server needs to establish a mapping relationship among the second index in the merged second file, the first index in the first file, and the document. For example, after merging f1 and f2 as described above, the server updates the mapping relationship, that is, establishes a mapping relationship among each document, its first index in v1, and its second index in f4. It should be noted that, in the embodiment of the present application, the timing for merging the second files may be determined in the following ways:

The first way: if the memory occupied by the second files converted to the read-only mode reaches a preset memory, the second files converted to the read-only mode are merged. In this embodiment of the present application, the server may obtain the memory occupied by the second files converted to the read-only mode and merge them when that memory reaches the preset memory. For example, if the server determines that the memory occupied by the second files f1 and f2 converted to the read-only mode reaches the preset memory, f1 and f2 may be merged.

The second way: the server may also obtain the current available load and, if the current available load is greater than a preset load, merge the second files converted to the read-only mode. For example, if the server detects that the current available load is greater than the preset load, f1 and f2 converted to the read-only mode may be merged.

The above two ways are examples of merging the second files in the embodiment of the present application, and other manners may also be used to determine to merge the second files, which is not limited in the embodiment of the present application.
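The two triggers can be sketched as a single check, shown below. The function name `should_merge`, the use of summed file sizes as the "occupied memory", and the choice to accept either trigger are assumptions made for illustration; the embodiment only states the two conditions as separate examples.

```python
from typing import Optional, Sequence


def should_merge(
    read_only_sizes_bytes: Sequence[int],
    preset_bytes: int,
    available_load: Optional[float] = None,
    preset_load: Optional[float] = None,
) -> bool:
    """Return True if read-only second files should be merged now.

    Assumption: "occupied memory" is the summed size of the read-only files,
    and either trigger alone is sufficient.
    """
    if len(read_only_sizes_bytes) < 2:
        return False  # fewer than two read-only files: nothing to merge
    # First way: occupied memory reaches the preset memory.
    if sum(read_only_sizes_bytes) >= preset_bytes:
        return True
    # Second way: current available load exceeds the preset load.
    if available_load is not None and preset_load is not None:
        return available_load > preset_load
    return False
```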

It should be understood that, while second files are being merged, the server need not delete the small files first, and the small files can continue to provide the search service externally. Illustratively, during the merging of f1 and f2, the second indexes in f1 and f2 are still available, and after f1 and f2 are merged into f4, the server may delete f1 and f2.
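The ordering described here (build the merged file, re-point the mapping, and only then delete the small files) can be sketched as follows. The dictionaries `files` and `mapping` and the function `merge_second_files` are hypothetical in-memory stand-ins, not structures named in the embodiment.

```python
def merge_second_files(files: dict, mapping: dict, sources: list, target: str) -> None:
    """Merge the read-only second files `sources` into a new file `target`.

    files:   file name -> list of (doc_id, second_index) entries
    mapping: doc_id -> {"first": first-file name, "second": second-file name}
    """
    # Build the merged content first; the small files are still present in
    # `files` at this point, so searches can keep using them.
    merged = [entry for name in sources for entry in files[name]]
    files[target] = merged
    # Re-establish the mapping so each document points at the merged file.
    for doc_id, _ in merged:
        mapping[doc_id]["second"] = target
    # Delete the small files only after the merge is complete.
    for name in sources:
        del files[name]


files = {"f1": [(1, "1'"), (2, "2'")], "f2": [(3, "3'"), (4, "4'")]}
mapping = {d: {"first": "v1", "second": "f1" if d <= 2 else "f2"} for d in range(1, 5)}
merge_second_files(files, mapping, sources=["f1", "f2"], target="f4")
```

Note that only the second-file part of the mapping changes; the first index of every document stays in v1.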

Similar to the second file, the number of first indexes written in one first file in the embodiment of the present application is limited to the first threshold. It should be noted that the first threshold is greater than the second threshold described above. Correspondingly, when writing the first index into a first file, the server may determine whether the number of written first indexes in the first file is smaller than the first threshold. If the number of written indexes in the i-th first file in the first type of file set is smaller than the first threshold, the first index is written into the i-th first file, where i is an integer greater than or equal to 1. If the number of written indexes in the i-th first file is equal to the first threshold, an (i+1)-th first file is newly created, and the first index is written into the (i+1)-th first file. It should be understood that the indexes written in the first file are all vector-type indexes, i.e., first indexes.

Illustratively, between time 3 and time 4, the server may further receive document 2n+2, document 2n+3, …, document 3n from the first terminal device, and the server may sequentially generate a first index and a second index corresponding to each document, write them into the first file v1 and the second file f3 respectively, and establish a mapping relationship among each document, its first index in v1, and its second index in f3.

At time 4, the server also receives a document 3n+1 from the first terminal device and generates a first index (3n+1) and a second index (3n+1)' from the document 3n+1. At this time, because the number of second indexes written in the third second file f3 is equal to the second threshold, the server may convert f3 from the write mode to the read-only mode, so that the second indexes in f3 are all in the available state. The server may also create a fourth second file f4 and write the second index (3n+1)' into f4. Assuming that the first threshold is 3n, the server determines that the number of first indexes written in the first file v1 has reached the first threshold, and may create a second first file v2 and write the first index (3n+1) into v2, thereby establishing a mapping relationship among the document 3n+1, the first index (3n+1) in v2, and the second index (3n+1)' in f4.
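The walkthrough from time 1 to time 4 can be reproduced with a short calculation. The sketch below assumes that files are filled strictly in arrival order and leaves merging out, so f4 here denotes the fourth second file; the function name `place_documents` and the concrete value n = 2 are illustrative only.

```python
def place_documents(num_docs: int, first_threshold: int, second_threshold: int) -> dict:
    """Return document number -> (first file, second file) under rollover.

    Because files are filled in order, the file number follows directly from
    how many indexes were written before the document arrived.
    """
    placement = {}
    for doc in range(1, num_docs + 1):
        i = (doc - 1) // first_threshold + 1   # index of the first file
        j = (doc - 1) // second_threshold + 1  # index of the second file
        placement[doc] = (f"v{i}", f"f{j}")
    return placement


# Second threshold n and first threshold 3n, as in the example in the text.
n = 2
fig9_placement = place_documents(num_docs=3 * n + 1, first_threshold=3 * n, second_threshold=n)
```

With these values, document n+1 lands in (v1, f2), document 2n+1 in (v1, f3), and document 3n+1 in (v2, f4), matching times 2, 3 and 4.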

The above describes the process in which the server writes the first index into a first file and the second index into a second file in the embodiment of the present application. At times after time 4, when the server receives documents from the first terminal device, it may continue to write indexes into files in the manner shown in fig. 9.

In this embodiment of the present application, the first index is in the available state once the server generates it. The vector-type index therefore does not need to be split into a large number of small files in step with the text-type index and then merged, which greatly reduces system resource consumption, improves the efficiency of constructing the index, and shortens the time for constructing the index. Compared with the technical solution in fig. 5, resource consumption can be reduced by half.

On the basis of the above embodiment, while the user sends documents to the server through the first terminal device, if the user finds that the sent documents contain many erroneous documents and wants to delete them, the embodiment of the present application can also delete documents. This process is described below in conjunction with fig. 10. Fig. 10 is a flowchart of another embodiment of a method for constructing an index according to an embodiment of the present application. As shown in fig. 10, the method for constructing an index provided in the embodiment of the present application may include:

S1001, a deleting instruction sent by the first terminal equipment is received, and the deleting instruction indicates deleting the document.

S1002, the document is marked as a deleted state.

S1003, the document is marked as a deleted state, and the document in the second type of file set is deleted when the second files in the second type of file set are merged.

It should be understood that S1002 and S1003 are alternatives: one of the two is executed, and they are not executed at the same time.

In S1001 described above, while the server constructs the index according to documents from the first terminal device, the user may send a deletion instruction through the first terminal device when wishing to delete a document that has been sent. The deletion instruction indicates deleting the document. In the embodiment of the present application, the document that the user needs to delete is exemplified by the document that the first terminal device in fig. 6 sends to the server in the above embodiment. Optionally, fig. 11 is a schematic interface diagram of the first terminal device provided in the embodiment of the present application. As shown in fig. 11, the interface of the first terminal device may display identifiers of documents, such as document 1 and document 2, and a delete control. When the user selects the delete control, the first terminal device is triggered to send the deletion instruction to the server. It should be understood that when the user selects a plurality of documents, for example document 1 and document 2, the deletion instruction may indicate deleting the plurality of documents selected by the user. It should be noted that the embodiment of the present application does not limit how the user triggers the first terminal device to send the deletion instruction to the server.

In S1002 above, it should be understood that, after generating the first index and the second index according to the document, writing the first index into the first file, writing the second index into the second file, and establishing the mapping relationship among the first index in the first file, the second index in the second file, and the document, the server may also store the document.

After receiving the deletion instruction, the server may mark the document indicated by the deletion instruction as a deleted state without actually deleting the document.

S1003 differs from S1002 in that, in this step, after receiving the deletion instruction, the server marks the document as a deleted state and additionally deletes the document in the second type of file set when the second files in the second type of file set are merged. It should be noted that a second file in the read-only mode can provide the search service externally, but the documents corresponding to the second indexes in that second file cannot be modified, for example deleted. When second files in the second type of file set are merged, the second files are in a rewritable mode, in which the document can be deleted. Accordingly, in the embodiment of the present application, the document in the second type of file set can be deleted when the second files in the second type of file set are merged.

It should be understood that in embodiments of the present application, the document in the first type of file set may also be marked as deleted.

Compared with S1002, deleting the document in the second type of file set releases the space that the document occupies in the server. Especially in a scenario where a large number of documents are deleted, a large amount of space can be released, which indirectly improves the efficiency with which the server constructs the index. Because deleting the documents releases space in the server, the search efficiency can also be indirectly improved, and the probability of hitting a deleted document is reduced.
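The difference between marking a document (S1002) and physically removing it at merge time (S1003) can be sketched as follows. The class `DocumentStore` and its methods are hypothetical names; the sketch keeps the deletion mark after the physical removal so that hits from the first type of file set can still be filtered, which is an illustrative design choice.

```python
class DocumentStore:
    """Hypothetical store of documents plus their deletion marks."""

    def __init__(self):
        self.documents = {}   # doc_id -> stored document content
        self.deleted = set()  # doc_ids marked as a deleted state

    def add(self, doc_id: int, content: str) -> None:
        self.documents[doc_id] = content

    def mark_deleted(self, doc_id: int) -> None:
        """S1002-style handling: mark only, the document itself is kept."""
        self.deleted.add(doc_id)

    def merge(self, second_files: list) -> list:
        """S1003-style handling: drop marked documents while merging.

        second_files: list of files, each a list of (doc_id, second_index).
        Returns the entries of the merged second file.
        """
        merged = []
        for entries in second_files:
            for doc_id, second_index in entries:
                if doc_id in self.deleted:
                    # Physical deletion happens here, releasing the space.
                    self.documents.pop(doc_id, None)
                    continue
                merged.append((doc_id, second_index))
        return merged


store = DocumentStore()
for doc_id in (1, 2, 3):
    store.add(doc_id, f"document {doc_id}")
store.mark_deleted(2)
merged_entries = store.merge([[(1, "1'"), (2, "2'")], [(3, "3'")]])
```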

Optionally, for a scenario in which a large number of documents are to be deleted, the embodiment of the present application further provides a method for synchronizing the vector-type index with the text-type index, that is, when a document in the second type of file set is deleted, the document in the first type of file set is also deleted synchronously.

In one possible implementation, the user may control, through the first terminal device, deletion synchronization of documents in the first type of document collection and the second type of document collection. Fig. 12 is a second schematic interface diagram of the first terminal device provided in the embodiment of the present application. As shown in fig. 12, compared with fig. 11, a synchronization control is further displayed on the interface of the first terminal device, where when the user selects the synchronization control, the first terminal device may be triggered to send a synchronization deletion instruction to the server, where the synchronization deletion instruction indicates that the deletion condition of the document in the second type of file set is synchronized to the first type of file set.

In this embodiment of the present application, after a user selects to delete a large number of documents, the synchronization control may be selected, or when the user needs to synchronize the deletion condition of the documents in the second type of document set to the first type of document set, the synchronization control may also be selected. Correspondingly, fig. 13 is a flowchart of another embodiment of a method for constructing an index according to an embodiment of the present application. As shown in fig. 13, after the step S1003, the embodiment of the present application may further include:

S1004, a synchronous deletion instruction from the first terminal device is received, where the synchronous deletion instruction indicates that the deletion status of documents in the second type of file set is to be synchronized to the first type of file set.

S1005, deleting the documents in the first type of file set according to the synchronous deleting instruction.

In the above embodiment, when the document in the second type of file set is deleted, the same document in the first type of file set is not deleted, so the undeleted document in the first type of file set still occupies considerable space. It should be understood that the first type of file set includes the document described above.

In the embodiment of the application, under the scene of deleting a large number of documents, the documents in the first type of document set can be deleted, namely the space occupied by the deleted documents in the server can be released, which is equivalent to the corresponding increase of resources allocated for subsequent searching, so that the searching efficiency is improved. That is, after receiving the synchronization deletion instruction from the first terminal device, the server may delete the documents in the first type of document set according to the synchronization deletion instruction.

One possible implementation of deleting the documents in the first type of file set is to reconstruct the vector-type index from the documents that are not marked as a deleted state. It should be noted that a large number of documents to be deleted occupy a large space (i.e., resources). Although reconstructing the vector-type index also consumes some resources, these are fewer than the space occupied by the documents to be deleted. Therefore, in the embodiment of the present application, the documents marked as a deleted state may be deleted at the cost of reconstructing the vector-type index.

Another possible implementation of deleting the documents in the first type of file set is to merge the first files that contain documents to be deleted, that is, to reconstruct the vector-type index from the documents in those first files that are not marked as a deleted state. For example, if the 1st first file contains 5 million documents to be deleted and the 2nd first file contains 5 million documents to be deleted, the 1st first file and the 2nd first file may be merged, that is, a new first file is reconstructed from the documents in the 1st and 2nd first files that are not marked as a deleted state. It should be understood that the resources consumed by merging the first files are fewer than the space occupied by the documents to be deleted, so the documents marked as a deleted state may be deleted at the cost of merging the first files.
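Merging first files while dropping marked documents can be sketched as below. Each first file is modeled as a plain dictionary from document identifier to vector, which is an assumption for illustration; a real vector index has additional structure that must be rebuilt, and the function name `rebuild_first_files` is hypothetical.

```python
def rebuild_first_files(first_files: list, deleted: set) -> dict:
    """Merge first files into one new first file, keeping only live documents.

    first_files: list of dicts mapping doc_id -> vector
    deleted:     doc_ids marked as a deleted state
    """
    new_file = {}
    for first_file in first_files:
        for doc_id, vector in first_file.items():
            if doc_id not in deleted:
                new_file[doc_id] = vector  # re-index only undeleted documents
    return new_file


v1 = {1: [0.1, 0.2], 2: [0.3, 0.4]}
v2 = {3: [0.5, 0.6], 4: [0.7, 0.8]}
rebuilt = rebuild_first_files([v1, v2], deleted={2, 3})
```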

In the embodiment of the present application, the user can instruct the server, through the first terminal device, to delete a document whose index has been constructed, and the server can either mark the document as a deleted state or, after marking it, delete the document, thereby releasing the space that the document occupies in the server. In addition, in a scenario where a large number of documents are deleted, the user can also instruct the server, through the first terminal device, to synchronize the deletion status of documents in the second type of file set to the first type of file set, so that the deletion status of documents in the two types of file sets is kept consistent and the space in both file sets is released. More resources can then be allocated for subsequent searching, which further improves the search efficiency.

Based on the above embodiment, in combination with the network architecture shown in fig. 1, in the process of constructing the index, the embodiment of the application may further provide an external search service, that is, search documents related to the search content input by the user by using the index of the available state. Fig. 14 is a flowchart of another embodiment of a method for constructing an index according to an embodiment of the present application. As shown in fig. 14, the method for constructing an index provided in the embodiment of the present application may include:

S1401, the search content from the second terminal device is received.

S1402, obtaining a search result according to the search content, the first type of file set and the second type of file set, wherein the search result comprises documents.

S1403, sending the search result to the second terminal device.

In S1401 described above, when the user searches for a document through the second terminal device, search content may be input on the second terminal device. The search content in the embodiment of the present application may be text-type search content and/or vector-type search content. Fig. 15 is an interface change schematic diagram of the second terminal device provided in the embodiment of the present application. As shown in fig. 15, a search content input box, in which the user can input text-type search content, is displayed on the interface 1501. In addition, an adding control for vector-type search content can be displayed on the interface, and the user can input vector-type search content by selecting the adding control. After the user inputs the text-type search content "red" and the vector-type search content "a picture including a brand A car", the interface 1501 jumps to the interface 1502, and the search content input by the user may be displayed on the interface 1502. After clicking the search control, the user can trigger the second terminal device to send the search content to the server. It should be understood that other recommended information, such as "XX poetry program" on XX month XX day, is also displayed in the interface 1501.

In S1402, if the server receives the search content from the second terminal device while constructing the index, the server may obtain the search result of the search content according to the search content, the first type of file set, and the second type of file set. To tie the description to the document sent by the first terminal device in fig. 6, the search result is described here as including that document.

When the search result is obtained, the first search result can be obtained according to the search content and the first type of file set, and the second search result can be obtained according to the search content and the second type of file set, so that the first search result and the second search result are integrated, and the final search result is obtained.

It should be understood that not all of the second indexes that the server writes into the second type of file set during index construction are in the available state: if the number of indexes in a second file has not reached the second threshold, the second indexes in that second file are in an unavailable state. Therefore, in the embodiment of the present application, the second search result may be obtained according to the search content and the second indexes in the available state in the second type of file set. Because all first indexes written into the first type of file set are in the available state, the first search result may be obtained according to the search content and the first indexes in the first type of file set.

In the embodiment of the present application, the server may extract keywords from the search content, obtain the similarity between the keywords in the search content and the keywords in the second indexes in the available state in the second type of file set, and use the documents corresponding to the keywords whose similarity is greater than a similarity threshold as the second search result. Similarly, the server may extract a vector of the search content, obtain the distance between the vector of the search content and the vectors in the first indexes in the available state in the first type of file set, and use the documents corresponding to the vectors whose distance is smaller than a distance threshold as the first search result.

Optionally, in the embodiment of the present application, integrating the first search result and the second search result to obtain the final search result may be: using the documents that appear in both the first search result and the second search result as the search result, or using the documents in the first search result and the second search result that contain the search content as the search result.

It should be noted that, if the search result in the embodiment of the present application includes a document marked as a deleted state, that document is filtered out of the search result, that is, a document marked as a deleted state is not fed back to the second terminal device.
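The search flow described above (vector search over all first indexes, text search over available second indexes only, integration, and filtering of deleted documents) can be sketched as follows. The Euclidean distance, the Jaccard keyword similarity, the intersection-based integration, and all names in the code are assumptions chosen for illustration; the embodiment does not prescribe specific measures.

```python
import math


def search(query_keywords, query_vector, first_index, second_files,
           deleted, distance_threshold, similarity_threshold):
    """Return the sorted doc_ids that match both the vector and the text query.

    first_index:  doc_id -> vector (every first index is available)
    second_files: list of {"read_only": bool, "entries": {doc_id: keyword set}}
    """
    # First search result: vectors closer than the distance threshold.
    first_result = {
        doc for doc, vec in first_index.items()
        if math.dist(query_vector, vec) < distance_threshold
    }
    # Second search result: only read-only second files are searchable.
    second_result = set()
    for second_file in second_files:
        if not second_file["read_only"]:
            continue
        for doc, keywords in second_file["entries"].items():
            union = query_keywords | keywords
            similarity = len(query_keywords & keywords) / len(union) if union else 0.0
            if similarity > similarity_threshold:
                second_result.add(doc)
    # Integrate by intersection, then filter out documents marked as deleted.
    return sorted((first_result & second_result) - deleted)


first_index = {1: [0.0, 0.0], 2: [0.1, 0.0], 3: [5.0, 5.0], 4: [0.0, 0.1]}
second_files = [
    {"read_only": True, "entries": {1: {"red", "car"}, 2: {"red", "car"}, 3: {"red", "car"}}},
    {"read_only": False, "entries": {4: {"red", "car"}}},
]
result = search({"red", "car"}, [0.0, 0.0], first_index, second_files,
                deleted={2}, distance_threshold=1.0, similarity_threshold=0.5)
```

In this example document 3 is excluded by the vector distance, document 4 by the unavailable second file, and document 2 by its deletion mark, so only document 1 remains.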

In S1403 described above, after obtaining the search result, the server may send the search result to the second terminal device. The second terminal device may display the search result after receiving the search result. As shown in fig. 15, the interface 1502 may jump to an interface 1503, where the interface 1503 displays search results including: document 1 and document 2, and picture 1.

In the embodiment of the present application, the server can provide the search service while constructing the index according to the document. For multi-type search content input by the user, the server can integrate the text-type search result and the vector-type search result by using the constructed first type of file set and second type of file set, so that a more accurate search result can be obtained. In addition, while constructing the index according to the document, the server does not need to generate a large number of small files for the vector-type index in step with the text-type index and then merge them. On the one hand, this greatly reduces system resource consumption, improves the efficiency of constructing the index, and shortens both the time for constructing the index and the time until the index becomes searchable, which improves the search speed and efficiency. On the other hand, because fewer system resources are consumed in constructing the index, more resources can be used for the search service, which further improves the search efficiency.

Fig. 16 is a schematic structural diagram of an apparatus for constructing an index according to an embodiment of the present application. The apparatus for constructing an index may be the server in the above embodiments, or a chip, a processor, or the like in the server. As shown in fig. 16, the apparatus for constructing an index includes: a transceiver module 1601 and a processing module 1602.

A transceiver module 1601, configured to receive a document from a first terminal device; the processing module 1602 is configured to generate a first index and a second index according to a document, store the first index into a first type of file set, store the second index into a second type of file set, and establish a mapping relationship between the first index, the second index, and the document, where the first index characterizes the mapping relationship between a vector and the document, the second index characterizes the mapping relationship between a text and the document, the first index is in an available state, and the first index in the available state is used for searching the document associated with the search content through the vector.

In one possible implementation, the first type of file set includes at least one first file, where the first file is used to store a first index. The processing module 1602 is specifically configured to write the first index into a first file.

In one possible implementation, the second type of file set includes at least one second file, where the second file is used to store a second index. The processing module 1602 is specifically configured to write the second index into a second file.

In one possible implementation, the processing module 1602 is specifically configured to establish a mapping relationship between a first index in a first file, a second index in a second file, and a document.

In one possible implementation, the processing module 1602 is specifically configured to: if the number of written indexes in the i-th first file in the first type of file set is smaller than a first threshold, write the first index into the i-th first file, where i is an integer greater than or equal to 1; and if the number of written indexes in the i-th first file is equal to the first threshold, newly create an (i+1)-th first file and write the first index into the (i+1)-th first file.

In one possible implementation, the processing module 1602 is specifically configured to: if the number of written indexes in the j-th second file in the second type of file set is smaller than a second threshold, write the second index into the j-th second file, where j is an integer greater than or equal to 1; and if the number of written indexes in the j-th second file is equal to the second threshold, newly create a (j+1)-th second file and write the second index into the (j+1)-th second file.

In one possible implementation, the processing module 1602 is further configured to, if the number of written indexes in the jth second file is equal to the second threshold, switch the jth second file from the writing mode to the read-only mode, and the second indexes in the jth second file switched to the read-only mode are in an available state, where the second indexes in the available state are used to search documents associated with the search content through text.

In one possible implementation, the transceiver module 1601 is further configured to receive a conversion duration of the second file from the first terminal device, where the conversion duration is a duration of converting the second file from the write mode to the read-only mode.

Correspondingly, the processing module 1602 is further configured to determine a second threshold according to the transition duration.

In one possible implementation, the processing module 1602 is further configured to merge the second files converted to the read-only mode if the occupied memory of the second files converted to the read-only mode reaches a preset memory; or if the current available load is greater than the preset load, merging the second files converted into the read-only mode.

The processing module 1602 is further configured to establish a mapping relationship among the second index in the merged second file, the first index in the first file, and the document.
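The two merge conditions and the re-established mapping can be sketched as follows. This is an assumption-laden illustration: should_merge, merge_and_remap, the (second_index, doc_id) pair layout and the location tuples are all hypothetical.

```python
def should_merge(read_only_sizes, preset_memory, available_load, preset_load):
    """Return True when either merge condition from the text holds."""
    memory_reached = sum(read_only_sizes) >= preset_memory
    load_available = available_load > preset_load
    return memory_reached or load_available


def merge_and_remap(read_only_files, first_index_by_doc):
    """Merge read-only second files and rebuild the three-way mapping.

    read_only_files: list of files, each a list of (second_index, doc_id) pairs.
    first_index_by_doc: dict from doc_id to the location of its first index.
    """
    merged = [entry for f in read_only_files for entry in f]
    mapping = {
        doc_id: {
            "second_index_pos": pos,  # position in the merged second file
            "first_index": first_index_by_doc[doc_id],  # unchanged first file
        }
        for pos, (_second_index, doc_id) in enumerate(merged)
    }
    return merged, mapping


files = [[("apple", "doc-1")], [("pear", "doc-2")]]
first_locations = {"doc-1": ("first-file-1", 0), "doc-2": ("first-file-1", 1)}
if should_merge([40, 70], preset_memory=100, available_load=0.2, preset_load=0.5):
    merged_file, mapping = merge_and_remap(files, first_locations)
    print(mapping["doc-2"])
```

Only the second files move during the merge; the locations of the first indexes are carried over unchanged, which is why the mapping has to be rebuilt on the second-index side only.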

In one possible implementation, the document is included in the second type of file set.

The transceiver module 1601 is further configured to receive a deletion instruction sent by the first terminal device, where the deletion instruction indicates deleting the document. Accordingly, the processing module 1602 is further configured to mark the document as deleted.

The transceiver module 1601 is further configured to receive a deletion instruction sent by the first terminal device, where the deletion instruction indicates deleting the document. Correspondingly, the processing module 1602 is further configured to mark the document as a deleted state, and delete the document in the second type of file set when the second files in the second type of file set are merged.
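A minimal sketch of this two-stage deletion (mark first, physically remove at merge time) is shown below. The class DocumentStore and its dictionary-based files are hypothetical stand-ins, not the structure used by the embodiment.

```python
class DocumentStore:
    """Hypothetical second type of file set that also holds the documents."""

    def __init__(self, second_files):
        # Each second file is modelled as a dict from doc_id to document text.
        self.second_files = second_files
        self.deleted = set()

    def mark_deleted(self, doc_id):
        """Handle a deletion instruction: only record the deleted state."""
        self.deleted.add(doc_id)

    def merge(self):
        """Merge all second files, dropping documents marked as deleted."""
        merged = {}
        for second_file in self.second_files:
            for doc_id, document in second_file.items():
                if doc_id not in self.deleted:
                    merged[doc_id] = document
        self.second_files = [merged]
        return merged


store = DocumentStore([{"doc-1": "alpha"}, {"doc-2": "beta", "doc-3": "gamma"}])
store.mark_deleted("doc-2")
still_present = "doc-2" in store.second_files[1]  # marked, not yet removed
merged_docs = store.merge()
print(merged_docs)
```

Deferring the physical removal to the merge avoids rewriting a read-only second file for every deletion instruction.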

In a possible implementation manner, the transceiver module 1601 is further configured to receive a synchronization deletion instruction from the first terminal device, where the synchronization deletion instruction indicates that a deletion situation of a document in the second type of file set is synchronized to the first type of file set.

Correspondingly, the processing module 1602 is further configured to delete the document in the first type of file set according to the synchronous deletion instruction.
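As a hedged illustration of the synchronization step, the sketch below removes from the first type of file set every entry whose document has been deleted in the second type of file set. The function name and the dictionary layout (doc_id to vector) are assumptions.

```python
def sync_deletions(first_type_files, deleted_doc_ids):
    """Return first files with entries for deleted documents removed.

    first_type_files: list of files, each a dict from doc_id to vector.
    deleted_doc_ids: set of documents deleted in the second type of file set.
    """
    return [
        {doc_id: vec for doc_id, vec in first_file.items()
         if doc_id not in deleted_doc_ids}
        for first_file in first_type_files
    ]


first_files = [{"doc-1": (1.0, 0.0), "doc-2": (0.0, 1.0)}, {"doc-3": (0.5, 0.5)}]
synced = sync_deletions(first_files, {"doc-2"})
print(synced)
```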

In one possible implementation, the transceiver module 1601 is further configured to receive search content from the second terminal device. Correspondingly, the processing module 1602 is further configured to obtain a search result according to the search content, the first type of file set, and the second type of file set, where the search result includes the document.

The transceiver module 1601 is further configured to send a search result to the second terminal device.

In one possible implementation, the processing module 1602 is specifically configured to obtain a first search result according to the search content and the first type of file set; obtaining a second search result according to the search content and the second type of file set; and obtaining the search results according to the first search results and the second search results.

In one possible implementation, the processing module 1602 is specifically configured to obtain a first search result according to the search content and a first index in the first type of file set; and obtaining a second search result according to the search content and a second index in an available state in the second type file set.
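The two-path search described above can be sketched as follows. The sketch assumes cosine similarity for the vector path, exact term matching for the text path, and an order-preserving union for combining the two results; none of these choices is stated in this passage, and all function names are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 for a zero vector)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_search(query_vec, first_indexes, top_k=2):
    """First search result: the top_k documents closest to the query vector."""
    ranked = sorted(first_indexes, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [doc_id for _vec, doc_id in ranked[:top_k]]


def text_search(query_text, available_second_indexes):
    """Second search result: uses only second indexes in the available state."""
    return [doc_id for term, doc_id in available_second_indexes if term == query_text]


def search(query_vec, query_text, first_indexes, available_second_indexes):
    first_result = vector_search(query_vec, first_indexes)
    second_result = text_search(query_text, available_second_indexes)
    # Assumed combination rule: union that keeps first-seen order.
    return list(dict.fromkeys(first_result + second_result))


first_indexes = [((1.0, 0.0), "doc-a"), ((0.0, 1.0), "doc-b"), ((0.9, 0.1), "doc-c")]
available_second = [("index", "doc-b"), ("vector", "doc-a")]
result = search((1.0, 0.0), "index", first_indexes, available_second)
print(result)
```

Because every first index is available as soon as it is written, the vector path can search all first files, whereas the text path is limited to second files already converted to the read-only mode.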

In one possible implementation, the processing module 1602 is further configured to filter out a document from the search result if the search result includes the document and the document is marked as deleted. Accordingly, the transceiver module 1601 is specifically configured to send, to the second terminal device, the search result that does not include the document.
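The filtering step can be illustrated with the short sketch below, which assumes that the deleted state is tracked as a set of document identifiers; the function name is hypothetical.

```python
def filter_deleted(search_result, deleted_doc_ids):
    """Drop documents marked as deleted before the result is sent."""
    return [doc_id for doc_id in search_result if doc_id not in deleted_doc_ids]


to_send = filter_deleted(["doc-a", "doc-c", "doc-b"], {"doc-c"})
print(to_send)
```

This lets a document disappear from search results immediately after the deletion instruction, even though its physical removal waits for the next merge.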

The device for constructing the index provided in the embodiment of the present application may perform the action of the server in the embodiment of the method, and its implementation principle and technical effect are similar, and are not described herein again.

Optionally, fig. 17 is a second schematic structural diagram of an apparatus for constructing an index according to an embodiment of the present application. In this embodiment, as shown in fig. 17, the processing module 1602 may include a mapping management unit 16021, a first index management unit 16022, and a second index management unit 16023. It should be noted that the mapping management unit 16021 is used to perform the step of establishing a mapping relationship in the above-described embodiment. The first index management unit 16022 performs the steps of generating the first index in S602 and S802, S603, and S803 in the above-described embodiments. The second index management unit 16023 performs the step of generating the second index in the above-described embodiments S602 and S802, the step of creating the mapping relationship in S604, and the step of creating the mapping relationship in S804.

It should be noted that the above transceiver module may be implemented in practice as a transceiver, or may include a transmitter and a receiver. The processing module may be implemented in the form of software invoked by a processing element, or in the form of hardware. For example, the processing module may be a separately arranged processing element, may be integrated in a chip of the above apparatus, or may be stored in a memory of the above apparatus in the form of program code, with the functions of the processing module being invoked and executed by a processing element of the above apparatus. In addition, all or some of the modules may be integrated together or implemented independently. The processing element described herein may be an integrated circuit having signal processing capabilities. In implementation, each step of the above method or each of the above modules may be implemented by an integrated logic circuit of hardware in the processing element or by instructions in the form of software.

For example, the above modules may be one or more integrated circuits configured to implement the above method, such as one or more application specific integrated circuits (ASIC), one or more digital signal processors (DSP), or one or more field programmable gate arrays (FPGA). For another example, when one of the above modules is implemented in the form of program code scheduled by a processing element, the processing element may be a general purpose processor, such as a central processing unit (CPU) or another processor that can invoke the program code. For another example, the modules may be integrated together and implemented in the form of a system-on-a-chip (SOC).

Fig. 18 is a schematic structural diagram of an electronic device according to an embodiment of the present application. The electronic device is the server in the above embodiment. As shown in fig. 18, the electronic device may include: a processor 1801 (e.g., a CPU), a memory 1802, and a transceiver 1803. The transceiver 1803 is coupled to the processor 1801, and the processor 1801 controls the transceiving actions of the transceiver 1803. The memory 1802 may include a random-access memory (RAM) and may also include a non-volatile memory (NVM), such as at least one magnetic disk memory; various instructions may be stored in the memory 1802 for performing various processing functions and implementing the method steps of the present application. Optionally, the electronic device related to the present application may further include: a power supply 1804, a communication bus 1805, and a communication port 1806. The transceiver 1803 may be integrated into a transceiver of the electronic device or may be a separate transceiver antenna on the electronic device. The communication bus 1805 is used to enable communication connections between the elements. The communication port 1806 is used to enable connection and communication between the electronic device and other peripheral devices.

In the embodiment of the present application, the memory 1802 is configured to store computer-executable program code, where the program code includes instructions. When the processor 1801 executes the instructions, the instructions cause the processor 1801 of the electronic device to perform the processing actions of the server in the above method embodiment, and cause the transceiver 1803 to perform the transceiving actions of the server in the above method embodiment. The implementation principle and technical effects are similar and are not described herein again.

The above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, they may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with the embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid state disk (SSD)).

The term "plurality" herein refers to two or more. The term "and/or" herein describes only an association relationship between associated objects and indicates that three relationships may exist; for example, A and/or B may represent: A exists alone, A and B exist together, or B exists alone. In addition, the character "/" herein generally indicates that the associated objects before and after it are in an "or" relationship; in a formula, the character "/" indicates that the associated objects before and after it are in a "division" relationship.

It will be appreciated that the various numerical numbers referred to in the embodiments of the present application are merely for ease of description and are not intended to limit the scope of the embodiments of the present application.

It should be understood that, in the embodiments of the present application, the sequence numbers of the above processes do not imply an order of execution. The execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.

Claims

1. A method of constructing an index, comprising:

receiving a document from a first terminal device;

generating a first index and a second index according to the document, wherein the first index represents the mapping relation between a vector and the document, and the second index represents the mapping relation between a text and the document;

storing the first index into a first type file set, wherein the first index is in an available state, the first index in the available state is used for searching the document associated with search content through vectors, the first type file set comprises at least one first file, and the number of the first indexes stored in the first file is a first threshold value;

storing the second index into a j-th second file in a second type file set, and establishing a mapping relation among the first index, the second index and the document, wherein j is an integer greater than or equal to 1;

the method further comprises the steps of:

if the number of written indexes in the j-th second file is equal to a second threshold value, converting the j-th second file from a writing mode to a read-only mode, wherein the second indexes in the j-th second file converted to the read-only mode are in an available state, and the second indexes in the available state are used for searching the documents related to search content through texts, and the second threshold value is smaller than the first threshold value;

if the occupied memory of the second file converted into the read-only mode reaches a preset memory, merging the second file converted into the read-only mode; or,

if the current available load is greater than the preset load, merging the second files converted into the read-only mode.

2. The method of claim 1, wherein the first file is configured to store a first index, and wherein storing the first index into a first type of file collection comprises:

writing the first index into a first file.

3. The method of claim 2, wherein the second type file set includes at least one second file, the second file being used to store a second index, the storing the second index into the second type file set comprising:

and writing the second index into a second file.

4. The method of claim 3, wherein the establishing a mapping relationship of the first index, the second index, and the document comprises:

and establishing a mapping relation between the first index in the first file, the second index in the second file and the document.

5. The method of claim 2, wherein writing the first index into a first file comprises:

if the number of written indexes in the i-th first file in the first type file set is smaller than a first threshold value, writing the first index into the i-th first file, wherein i is an integer greater than or equal to 1;

if the number of written indexes in the i-th first file is equal to the first threshold value, creating an (i+1)-th first file, and writing the first index into the (i+1)-th first file.

6. A method according to claim 3, wherein said writing said second index into a second file comprises:

if the number of written indexes in the j-th second file in the second type file set is smaller than a second threshold value, writing the second index into the j-th second file;

and if the number of written indexes in the j-th second file is equal to the second threshold value, creating a (j+1)-th second file, and writing the second index into the (j+1)-th second file.

7. The method of claim 6, wherein the method further comprises:

receiving a conversion time length of a second file from the first terminal equipment, wherein the conversion time length is a time length for converting the second file from a writing mode to a read-only mode;

and determining the second threshold according to the conversion duration.

8. The method of claim 1, wherein after merging the second files having the number of written indices equal to the second threshold, further comprising:

and establishing a mapping relation among the second index in the combined second file, the first index in the first file and the document.

9. The method of any of claims 1-8, wherein the documents are included in the second type of collection of files; the method further comprises the steps of:

receiving a deleting instruction sent by the first terminal equipment, wherein the deleting instruction indicates to delete the document;

and marking the document as being in a deleted state.

10. The method of any of claims 1-8, wherein the documents are included in the second type of collection of files; the method further comprises the steps of:

marking the document as being in a deleted state, and deleting the document in the second type of file set when the second files in the second type of file set are merged.

11. The method of claim 10, wherein the document is included in the first type of file collection, the method further comprising:

receiving a synchronous deletion instruction from the first terminal equipment, wherein the synchronous deletion instruction indicates that the deletion condition of the document in the file set of the second type is synchronized to the file set of the first type;

and deleting the documents in the first type of file set according to the synchronous deleting instruction.

12. The method according to any one of claims 1-8, further comprising:

receiving the search content from the second terminal device;

obtaining a search result according to the search content, the first type file set and the second type file set, wherein the search result comprises the document;

and sending the search result to the second terminal equipment.

13. The method of claim 12, wherein the obtaining search results from the search content, the first type of set of files, and the second type of set of files comprises:

obtaining a first search result according to the search content and the first type of file set;

obtaining a second search result according to the search content and the second type of file set;

and acquiring the search results according to the first search results and the second search results.

14. The method of claim 13, wherein the obtaining a first search result from the search content and the first type of file collection comprises:

obtaining the first search result according to the search content and a first index in the first type of file set;

the obtaining a second search result according to the search content and the second type of file set comprises:

and obtaining the second search result according to the search content and a second index in an available state in the second type file set.

15. The method according to claim 13 or 14, characterized in that the method further comprises:

filtering the document in the search result if the search result comprises the document marked as a deleting state;

the sending the search result to the second terminal device includes:

and sending the search result which does not comprise the document to the second terminal equipment.

16. An apparatus for constructing an index, comprising:

the receiving and transmitting module is used for receiving the document from the first terminal equipment;

the processing module is used for generating a first index and a second index according to the document, storing the first index into a first type file set, storing the second index into a j-th second file in the second type file set, and establishing a mapping relation among the first index, the second index and the document, wherein the first index characterizes the mapping relation among a vector and the document, the second index characterizes the mapping relation among a text and the document, the first index is in an available state, the first index in the available state is used for searching the document related to search content through the vector, the first type file set comprises at least one first file, the number of the first indexes stored in the first file is a first threshold value, and j is an integer greater than or equal to 1;

the processing module is further configured to:

17. An electronic device, comprising: memory, processor, and transceiver;

the processor being operative to couple with the memory, read and execute instructions in the memory to implement the method of any one of claims 1-16;

the transceiver is coupled to the processor and is controlled by the processor to transmit and receive messages.

18. A computer readable storage medium, characterized in that the computer storage medium stores computer instructions, which when executed by a computer, cause the computer to perform the method of any of claims 1-16.