CN116049350A - Data retrieval method, device, computer equipment and storage medium - Google Patents

Data retrieval method, device, computer equipment and storage medium

Info

Publication number
CN116049350A
Authority
CN
China
Prior art keywords
sentence
dimensional
hash
sample
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211681406.1A
Other languages
Chinese (zh)
Inventor
李慧琼
张悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN202211681406.1A priority Critical patent/CN116049350A/en
Publication of CN116049350A publication Critical patent/CN116049350A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3347 Query execution using vector based model
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F16/337 Profile generation, learning or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a data retrieval method, a data retrieval device, computer equipment and a storage medium, in the technical field of big data. The method comprises the following steps: acquiring text information input by a user; dividing the text information into a plurality of sentences through a sentence division model, and encoding each sentence to obtain a high-dimensional sentence feature vector corresponding to each sentence; generating a hash code corresponding to each high-dimensional sentence feature vector through a hash function, querying the database for hash codes identical to the hash code of the high-dimensional sentence feature vector, and taking the candidate data corresponding to the identical hash codes as the target retrieval data for the text information. By adopting the method, retrieval accuracy can be improved.

Description

Data retrieval method, device, computer equipment and storage medium
Technical Field
The present disclosure relates to the field of big data technologies, and in particular, to a data retrieval method, a data retrieval device, a computer device, and a storage medium.
Background
With the growth of big data, the complexity and diversity of data make fine-grained data processing very difficult. An efficient data retrieval method is therefore an important subject of study at the present stage.
In conventional data retrieval methods, a high-dimensional feature vector is converted into a low-dimensional binary hash code through a hash function. At retrieval time, the data to be retrieved is converted into a hash code by a trained hash function, the similarity between this hash code and the hash codes of all data in the database is computed, and a retrieval result is returned according to the computed similarity. However, because the hash function performs dimensionality reduction on the text information when converting it into a binary hash code, the hash code carries little of the text's sentence-level information, and the accuracy of data retrieval is therefore low.
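The conventional pipeline described above can be sketched with random-hyperplane hashing, one common way to derive binary codes from high-dimensional vectors and to compare them by bit agreement. The patent does not name a specific hash family, so this is illustrative only:

```python
import random

def make_hyperplane_hash(dim, n_bits, seed=0):
    """Random-hyperplane LSH: each output bit is the sign of the
    vector's dot product with a random hyperplane normal."""
    rng = random.Random(seed)
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    def hash_fn(vec):
        return tuple(1 if sum(p * v for p, v in zip(plane, vec)) >= 0 else 0
                     for plane in planes)
    return hash_fn

def hamming_similarity(a, b):
    """Fraction of hash bits on which two binary codes agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)
```

A retrieval system of the conventional kind would hash every database entry once, hash the query at search time, and rank entries by `hamming_similarity`; it is exactly this dimensionality reduction that, per the background above, loses sentence information.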
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data retrieval method, apparatus, computer device, computer readable storage medium, and computer program product.
In a first aspect, the present application provides a data retrieval method. The method comprises the following steps:
acquiring text information input by a user;
dividing the text information into a plurality of sentences through a sentence division model, and encoding each sentence to obtain a high-dimensional sentence feature vector corresponding to each sentence;
generating a hash code corresponding to each high-dimensional sentence feature vector through a hash function, querying the database for hash codes identical to the hash code of the high-dimensional sentence feature vector, and taking the candidate data corresponding to the identical hash codes as the target retrieval data for the text information.
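The three claimed steps can be sketched end to end as follows. `split_sentences`, `encode_sentence`, and `hash_code` are toy stand-ins (the patent's division model is a trained network and its encoder produces high-dimensional vectors); only the overall flow of splitting, encoding, hashing, and exact-match lookup follows the claim:

```python
# Hypothetical end-to-end sketch of the claimed retrieval flow.

def split_sentences(text):
    # Stand-in for the sentence division model.
    return [s for s in text.replace("!", ".").replace("?", ".").split(".")
            if s.strip()]

def encode_sentence(sentence):
    # Stand-in encoder: character-code sums per position bucket.
    vec = [0] * 4
    for i, ch in enumerate(sentence):
        vec[i % 4] += ord(ch)
    return vec

def hash_code(vec, n_bits=4):
    # Mean-threshold binarization of the feature vector.
    mean = sum(vec) / len(vec)
    return tuple(1 if x >= mean else 0 for x in vec[:n_bits])

def retrieve(text, database):
    """database maps hash code -> list of candidate data items."""
    results = []
    for sent in split_sentences(text):
        code = hash_code(encode_sentence(sent))
        results.extend(database.get(code, []))
    return results
```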
Optionally, the sentence division model includes a word embedding layer and a division layer, and dividing the text information into a plurality of sentences through the sentence division model and encoding each sentence to obtain its corresponding high-dimensional sentence feature vector includes:
determining a word embedding vector for each word segment in the text information through the word embedding layer, and position-encoding the order in which each word segment appears in the text information to obtain a position encoding vector for the word segment;
dividing each sentence in the text information into a plurality of short sentences through the division layer, and converting each short sentence to obtain a plurality of short-sentence vectors corresponding to the sentence;
and determining the high-dimensional sentence feature vector of each sentence in the text information according to the word embedding vectors of the word segments contained in the sentence, the position encoding vectors of those word segments, and the short-sentence vectors corresponding to the sentence.
Optionally, determining the high-dimensional sentence feature vector of each sentence in the text information according to the word embedding vectors of the word segments contained in the sentence, the position encoding vectors of those word segments, and the short-sentence vectors corresponding to the sentence includes:
for each sentence, performing vector superposition on the word embedding vectors of the word segments contained in the sentence, the position encoding vectors of those word segments, and the short-sentence vectors corresponding to the sentence, to obtain the high-dimensional sentence feature vector of the sentence.
Optionally, generating the hash code corresponding to the high-dimensional sentence feature vector through a hash function, querying the database for candidate data whose hash code is identical to that of the high-dimensional sentence feature vector, and taking the candidate data with identical hash codes as the target retrieval data for the text information, includes:
generating, for each high-dimensional sentence feature vector, the hash code corresponding to that vector through the hash function;
selecting any one piece of candidate data in the database, and judging whether the hash bits in the hash code of the candidate data are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector;
and, in the case that the hash bits in the hash code of the candidate data are inconsistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, returning to the step of selecting any one piece of candidate data in the database, until the hash bits in the hash code of the selected candidate data are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, and taking the candidate data with consistent hash bits as the target retrieval data for the text information.
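The bit-by-bit comparison loop above amounts to a linear scan for exact code matches. A minimal sketch, assuming candidate hash codes are precomputed and stored alongside the data:

```python
def find_exact_matches(query_code, candidates):
    """Scan candidate data, keeping every item whose stored hash code
    agrees with the query code on all hash bits (an exact match)."""
    matches = []
    for item, code in candidates:
        if len(code) == len(query_code) and all(
                a == b for a, b in zip(code, query_code)):
            matches.append(item)
    return matches
```

In practice the same result is obtained in O(1) per query by keying a dictionary on the hash code, as exact matching requires no per-bit similarity ranking.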
Optionally, the method further comprises:
acquiring sample text information and the hash code of each high-dimensional sample sentence feature vector in the sample text information;
dividing the sample text information into a plurality of sentences through an initial sentence division model, and encoding each sentence to obtain the high-dimensional sentence feature vector of each sentence;
and generating the hash code corresponding to each high-dimensional sentence feature vector through the hash function, and training the initial sentence division model according to the hash code of each high-dimensional sample sentence feature vector and the hash code of each high-dimensional sentence feature vector, to obtain the sentence division model.
Optionally, training the initial sentence division model according to the hash code of each high-dimensional sample sentence feature vector and the hash code of each high-dimensional sentence feature vector to obtain the sentence division model includes:
judging whether the hash code of any high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector;
in the case that the hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector, adjusting the division parameters of the initial sentence division model according to the difference between the two hash codes, and returning to the step of dividing the sample text information into a plurality of sentences through the initial sentence division model and encoding each sentence to obtain its high-dimensional sentence feature vector, until no hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector;
and taking the initial sentence division model for which no hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector as the sentence division model.
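The training loop above (compare hash codes, adjust division parameters by the difference, repeat until no mismatch remains) can be sketched with a toy one-parameter "model". The real model is a neural sentence divider and the patent does not give the update rule, so both the threshold model and the learning-rate step here are assumptions:

```python
def toy_hash(vec):
    # Sign binarization stands in for the hash function.
    return tuple(1 if x >= 0 else 0 for x in vec)

def train_division_model(samples, init_param=0.0, lr=0.5, max_iter=100):
    """samples: list of (feature_vector, target_hash_code).
    Iterates until every predicted hash code equals its target."""
    param = init_param
    for _ in range(max_iter):
        mismatches = 0
        for vec, target in samples:
            pred = toy_hash([x - param for x in vec])
            # Count differing hash bits as the "difference value".
            diff = sum(a != b for a, b in zip(pred, target))
            if diff:
                mismatches += 1
                param -= lr * diff  # nudge the division parameter
        if mismatches == 0:
            return param
    return param
```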
Optionally, the method further comprises:
acquiring high-dimensional sample sentence feature vectors and the sample data corresponding to each high-dimensional sample sentence feature vector;
computing the hash code of each high-dimensional sample sentence feature vector and the hash code of each piece of sample data through an initial hash function, and computing the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of each piece of sample data;
in the case that the similarity between the hash code of a high-dimensional sample sentence feature vector and the hash code of sample data other than its corresponding sample data is greater than a similarity threshold, adjusting the hash-code generation parameters of the initial hash function, and returning to the step of computing the hash codes through the initial hash function, until only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of its corresponding sample data is greater than the similarity threshold;
and taking the initial hash function for which only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of its corresponding sample data is greater than the similarity threshold as the hash function.
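The stopping criterion of the hash-function training above is that only each sample sentence vector's hash code and the hash code of its own corresponding sample data exceed the similarity threshold. That criterion can be written as a check; the parameter-adjustment rule itself is not specified by the patent, so only the stopping condition is sketched:

```python
def bit_similarity(a, b):
    """Fraction of matching bits between two equal-length hash codes."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def criterion_satisfied(matching_pairs, non_matching_pairs, threshold=0.75):
    """matching_pairs: (sentence_code, corresponding_data_code) tuples
    whose similarity must exceed the threshold; non_matching_pairs:
    code pairs that must stay at or below it."""
    return (all(bit_similarity(a, b) > threshold for a, b in matching_pairs)
            and all(bit_similarity(a, b) <= threshold
                    for a, b in non_matching_pairs))
```

A trainer would adjust the hash-generation parameters and repeat until `criterion_satisfied` returns True.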
In a second aspect, the present application further provides a data retrieval device. The device comprises:
the acquisition module, configured to acquire text information input by a user, the text information being the text data that the user needs to retrieve;
the division module, configured to divide the text information into a plurality of sentences through a sentence division model, and to encode each sentence to obtain the high-dimensional sentence feature vector corresponding to each sentence;
and the screening module, configured to generate the hash code corresponding to each high-dimensional sentence feature vector through the hash function, to query the database for candidate data whose hash code is identical to the hash code of the high-dimensional sentence feature vector, and to take the candidate data with identical hash codes as the target data.
Optionally, the dividing module is specifically configured to:
determining a word embedding vector for each word segment in the text information through the word embedding layer, and position-encoding the order in which each word segment appears in the text information to obtain a position encoding vector for the word segment;
dividing each sentence in the text information into a plurality of short sentences through the division layer, and converting each short sentence to obtain a plurality of short-sentence vectors corresponding to the sentence;
and determining the high-dimensional sentence feature vector of each sentence in the text information according to the word embedding vectors of the word segments contained in the sentence, the position encoding vectors of those word segments, and the short-sentence vectors corresponding to the sentence.
Optionally, the dividing module is specifically configured to:
for each sentence, performing vector superposition on the word embedding vectors of the word segments contained in the sentence, the position encoding vectors of those word segments, and the short-sentence vectors corresponding to the sentence, to obtain the high-dimensional sentence feature vector of the sentence.
Optionally, the screening module is specifically configured to:
generating, for each high-dimensional sentence feature vector, the hash code corresponding to that vector through the hash function;
selecting any one piece of candidate data in the database, and judging whether the hash bits in the hash code of the candidate data are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector;
and, in the case that the hash bits in the hash code of the candidate data are inconsistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, returning to the step of selecting any one piece of candidate data in the database, until the hash bits in the hash code of the selected candidate data are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, and taking the candidate data with consistent hash bits as the target retrieval data for the text information.
Optionally, the apparatus further includes:
the sample acquisition module, configured to acquire sample text information and the hash code of each high-dimensional sample sentence feature vector in the sample text information;
the encoding module, configured to divide the sample text information into a plurality of sentences through an initial sentence division model, and to encode each sentence to obtain the high-dimensional sentence feature vector of each sentence;
and the training module, configured to generate the hash code corresponding to each high-dimensional sentence feature vector through the hash function, and to train the initial sentence division model according to the hash code of each high-dimensional sample sentence feature vector and the hash code of each high-dimensional sentence feature vector, to obtain the sentence division model.
Optionally, the training module is specifically configured to:
judging whether the hash code of any high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector;
in the case that the hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector, adjusting the division parameters of the initial sentence division model according to the difference between the two hash codes, and returning to the step of dividing the sample text information into a plurality of sentences through the initial sentence division model and encoding each sentence to obtain its high-dimensional sentence feature vector, until no hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector;
and taking the initial sentence division model for which no hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector as the sentence division model.
Optionally, the apparatus further includes:
the vector acquisition module, configured to acquire high-dimensional sample sentence feature vectors and the sample data corresponding to each high-dimensional sample sentence feature vector;
the similarity calculation module, configured to compute the hash code of each high-dimensional sample sentence feature vector and the hash code of each piece of sample data through an initial hash function, and to compute the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of each piece of sample data;
the judging module, configured to adjust the hash-code generation parameters of the initial hash function in the case that the similarity between the hash code of a high-dimensional sample sentence feature vector and the hash code of sample data other than its corresponding sample data is greater than the similarity threshold, and to return to the step of computing the hash codes through the initial hash function, until only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of its corresponding sample data is greater than the similarity threshold;
and the determining module, configured to take the initial hash function for which only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of its corresponding sample data is greater than the similarity threshold as the hash function.
In a third aspect, the present application provides a computer device. The computer device comprises a memory storing a computer program and a processor which, when executing the computer program, implements the steps of the method of any of the first aspects.
In a fourth aspect, the present application provides a computer-readable storage medium on which a computer program is stored which, when executed by a processor, implements the steps of the method of any of the first aspects.
In a fifth aspect, the present application provides a computer program product comprising a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.
According to the data retrieval method, device, computer equipment and storage medium described above, text information input by a user is acquired; the text information is divided into a plurality of sentences through a sentence division model, and each sentence is encoded to obtain its corresponding high-dimensional sentence feature vector; a hash code corresponding to each high-dimensional sentence feature vector is generated through a hash function, the database is queried for identical hash codes, and the candidate data corresponding to the identical hash codes is taken as the target retrieval data for the text information. Dividing the text information into sentences and converting each sentence into a high-dimensional sentence feature vector raises the dimensionality of the text information and enriches the feature information of each sentence, so that the hash function generates the hash codes from these enriched high-dimensional vectors. Less sentence information is therefore lost when the hash function reduces the dimensionality of the text information, and retrieval accuracy is improved.
Drawings
FIG. 1 is a flow diagram of a method of data retrieval in one embodiment;
FIG. 2 is a flow chart illustrating the steps for determining feature vectors of a high-dimensional sentence in one embodiment;
FIG. 3 is a flow chart illustrating the training steps of a hash function in one embodiment;
FIG. 4 is a flow diagram of an example of data retrieval in one embodiment;
FIG. 5 is a block diagram of the structure of a data retrieval device in one embodiment;
fig. 6 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
The data retrieval method provided by the embodiments of the application can be applied to a server, or to a system comprising a terminal and a server, in which case it is realized through interaction between the terminal and the server. The terminal may be, but is not limited to, various personal computers, notebook computers, smart phones, tablet computers, and the like. The terminal divides the text information into a plurality of sentences and converts each sentence into a high-dimensional sentence feature vector, which raises the dimensionality of the text information and enriches the feature information of each sentence; the hash function then generates the hash codes of the high-dimensional sentence feature vectors from these enriched vectors, so that less sentence information is lost when the hash function reduces the dimensionality of the text information, and retrieval accuracy is improved.
In one embodiment, as shown in fig. 1, a data retrieval method is provided, described here as applied to a terminal by way of illustration, and includes the following steps:
step S101, obtaining text information input by a user.
In this embodiment, the terminal obtains the text information by receiving the text data entered by the user through its input interface.
Step S102, dividing the text information into a plurality of sentences through a sentence division model, and performing coding conversion processing on each sentence to obtain a high-dimensional sentence feature vector corresponding to each sentence.
In this embodiment, the terminal divides the acquired text information into a plurality of sentences through the sentence division model, where each sentence is a unit of long-sentence data. The terminal then encodes each piece of long-sentence data to obtain the high-dimensional sentence feature vector corresponding to each sentence. The sentence division model is a BERT model (Bidirectional Encoder Representations from Transformers); the specific division process is described in detail later.
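A hedged sketch of the division step: the patent's division model is a trained BERT network, but its effect at this stage is to cut the input text into long sentences, which a simple punctuation-based splitter can stand in for:

```python
import re

def divide_sentences(text):
    """Split text into long sentences on terminal punctuation
    (a toy stand-in for the BERT-based sentence division model;
    the class covers both ASCII and CJK sentence enders)."""
    parts = re.split(r"(?<=[.!?。！？])\s*", text)
    return [p for p in parts if p]
```

Usage: `divide_sentences("First sentence. Second one! Third?")` yields the three sentences with their punctuation attached.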
Step S103, generating the hash code corresponding to each high-dimensional sentence feature vector through a hash function, querying the database for hash codes identical to the hash codes of the high-dimensional sentence feature vectors, and taking the candidate data corresponding to the identical hash codes as the target retrieval data for the text information.
In this embodiment, the terminal converts the high-dimensional sentence feature vector corresponding to each sentence into a hash code through the hash function, and queries the terminal's database for hash codes identical to the hash code of each high-dimensional sentence feature vector. The terminal takes the candidate data in the database with the identical hash codes as the target retrieval data for the text information.
With this scheme, dividing the text information into a plurality of sentences and converting each sentence into a high-dimensional sentence feature vector raises the dimensionality of the text information and enriches the feature information of each sentence, so that the hash function generates the hash codes of the high-dimensional sentence feature vectors from these enriched vectors; less sentence information is lost when the hash function reduces the dimensionality of the text information, and retrieval accuracy is improved.
Optionally, as shown in fig. 2, the sentence division model includes a word embedding layer and a division layer, and dividing the text information into a plurality of sentences through the sentence division model and encoding each sentence to obtain its corresponding high-dimensional sentence feature vector includes:
Step S201, determining a word embedding vector for each word segment in the text information through the word embedding layer, and position-encoding the order in which each word segment appears in the text information to obtain the position encoding vector of the word segment.
In this embodiment, the terminal inserts, through the word embedding layer, word-segment identification marks at the start of each word segment in the text information, and encodes the words between two adjacent identification marks to obtain the word embedding vectors. The terminal then performs position encoding according to the position and order of each word segment in the text information to obtain the position encoding vector of each word segment.
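The position encoding itself is not fixed by the patent; the sinusoidal scheme from the Transformer literature is one common choice and is sketched here as an illustrative assumption:

```python
import math

def position_encoding(pos, dim):
    """Sinusoidal position encoding in the Transformer convention:
    even indices get sin, odd indices get cos, with frequencies
    decaying geometrically across dimension pairs."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec
```

Each word segment at position `pos` receives a `dim`-length vector that is later superposed with its word embedding.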
Step S202, dividing each sentence in the text information into a plurality of short sentences through dividing layers, and converting each short sentence to obtain a plurality of short sentence vectors corresponding to the sentence.
In this embodiment, the terminal divides each sentence in the text information into a plurality of short sentences through the division layer, which raises the dimensionality of the sentence, and encodes each short sentence to obtain the plurality of short-sentence vectors corresponding to each sentence.
Step S203, determining a high-dimensional sentence feature vector of each sentence in the text information according to the word embedding vector of the word segment included in the sentence in the text information, the position encoding vector of the word segment included in the sentence, and the short sentence vector corresponding to the sentence.
In this embodiment, the terminal superposes the word embedding vectors of the word segments contained in each sentence, the position encoding vectors of those word segments, and the short-sentence vectors corresponding to the sentence, to obtain the high-dimensional sentence feature vector of each sentence in the text information.
With this scheme, each sentence is segmented into words and short sentences, and each word segment and short sentence is encoded into a vector, yielding the high-dimensional sentence feature vector of each sentence; this raises the dimensionality of each sentence more effectively and improves retrieval accuracy.
Optionally, determining the high-dimensional sentence feature vector of each sentence in the text information according to the word embedding vector of the word segment included in the sentence in the text information, the position coding vector of the word segment included in the sentence, and the short sentence vector corresponding to the sentence, includes:
for each sentence, performing vector superposition processing on the word embedding vectors of the word segments included in the sentence, the position coding vectors of those word segments, and the short sentence vector corresponding to the sentence, to obtain the high-dimensional sentence feature vector of the sentence.
In this embodiment, for each sentence, the terminal performs a vector superposition process on a word embedding vector of a word segment included in the sentence, a position coding vector of the word segment included in the sentence, and a short sentence vector corresponding to the sentence, to obtain a high-dimensional sentence feature vector of the sentence. Similarly, the high-dimensional sentence feature vector of each sentence is obtained through the steps.
Based on this scheme, the high-dimensional sentence feature vector is obtained by superposing the word segment vectors and short sentence vectors of each sentence, which simplifies the dimension raising of each sentence and avoids excessively complex operations.
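As a rough illustration of the superposition step, the sketch below assumes, as in common transformer practice, that the three vectors are added element-wise per word segment and the result is then pooled into one sentence vector; the function name `sentence_feature` and the mean-pooling choice are our assumptions, not specified by the publication:

```python
import numpy as np

def sentence_feature(word_emb, pos_enc, phrase_vec):
    """
    word_emb:   (n_tokens, d) word embedding vectors of the sentence's segments
    pos_enc:    (n_tokens, d) position coding vectors of those segments
    phrase_vec: (n_tokens, d) short-sentence vector of the phrase each segment
                belongs to, broadcast to every token of that phrase
    Returns one d-dimensional high-dimensional sentence feature vector.
    """
    token_repr = word_emb + pos_enc + phrase_vec  # element-wise superposition
    return token_repr.mean(axis=0)                # pool tokens into one vector
```

Any other pooling (sum, max, a [CLS]-style token) would fit the description equally well; only the element-wise superposition of the three vector types is stated in the text.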
Optionally, generating a hash code corresponding to the high-dimensional sentence feature vector through a hash function, querying, in a database, to-be-selected data corresponding to a hash code identical to the hash code of the high-dimensional sentence feature vector, and using the to-be-selected data of the same hash code as target retrieval data corresponding to the text information, includes: generating, for each high-dimensional sentence feature vector, a hash code corresponding to the high-dimensional sentence feature vector through a hash function; selecting any one piece of data to be selected in the database, and judging whether the hash bits in the hash code of the data to be selected are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector; and in the case that the hash bits in the hash code of the data to be selected are inconsistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, returning to execute the step of selecting any one piece of data to be selected in the database, until the hash bits in the hash code of the selected data to be selected are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, and taking the data to be selected with consistent hash bits as the target retrieval data corresponding to the text information.
In this embodiment, when obtaining the high-dimensional sentence feature vector corresponding to each sentence, the terminal generates, for each high-dimensional sentence feature vector, a hash code corresponding to the high-dimensional sentence feature vector through a hash function. The terminal randomly selects one piece of data to be selected from a database of the terminal, and judges whether all hash bits in the hash code of the data to be selected are consistent with the hash bits in the hash code of the high-dimensional sentence feature vector. In the case that the hash bits in the hash code of the data to be selected are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, the terminal takes the data to be selected as the target retrieval data corresponding to the text information. In the case that the hash bits in the hash code of the data to be selected are inconsistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, the terminal returns to execute the step of randomly selecting one piece of data to be selected from the database of the terminal, until the hash bits in the hash code of the selected data to be selected are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, and the terminal takes the data to be selected with consistent hash bits as the target retrieval data corresponding to the text information.
Based on this scheme, the target retrieval data corresponding to the text information is screened through the hash codes, and it is only necessary to compare whether the hash bits corresponding to two sentences are consistent, which is equivalent to obtaining the similarity of the two sentences by an exclusive-OR operation, thereby reducing the response time of retrieval.
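The exclusive-OR comparison described above can be sketched as follows (Python; the function names are illustrative), treating a hash code as an integer whose bits are the hash bits:

```python
def hash_codes_match(code_a: int, code_b: int) -> bool:
    # Two hash codes are identical exactly when their exclusive OR is zero.
    return (code_a ^ code_b) == 0

def hamming_distance(code_a: int, code_b: int) -> int:
    # Number of differing hash bits: the population count of the XOR.
    return bin(code_a ^ code_b).count("1")
```

The speed claim rests on this being a single bitwise operation per candidate, rather than a floating-point distance computation over the full high-dimensional feature vectors.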
Optionally, the method further comprises: acquiring sample text information and hash codes of feature vectors of all high-dimensional sample sentences in the sample text information; dividing the sample text information into a plurality of sentences through an initial sentence dividing model, and carrying out coding conversion processing on each sentence to obtain a high-dimensional sentence feature vector of each sentence; generating hash codes corresponding to the high-dimensional sentence feature vectors through the hash function, and training the initial sentence division model according to the hash codes of the high-dimensional sample sentence feature vectors and the hash codes of the high-dimensional sentence feature vectors to obtain the sentence division model.
In this embodiment, the terminal obtains, from the history retrieval record, text information previously input by the user as sample text information, and queries, among the hash codes previously generated by the hash function, the hash codes of the high-dimensional sample sentence feature vectors in the sample text information. The terminal divides the sample text information into a plurality of sentences through an initial sentence division model, and performs coding conversion processing on each sentence to obtain a high-dimensional sentence feature vector of each sentence. The terminal generates, through a hash function, a hash code corresponding to each high-dimensional sentence feature vector, and trains the initial sentence division model according to the hash code of each high-dimensional sample sentence feature vector and the hash code of each high-dimensional sentence feature vector to obtain the sentence division model.
Based on the scheme, the sentence division model is trained, so that the division accuracy of the sentence division model is improved.
Optionally, training the initial sentence division model according to the hash code of each high-dimensional sample sentence feature vector and the hash code of each high-dimensional sentence feature vector to obtain a sentence division model, including: judging whether the hash codes of the high-dimensional sentence feature vectors are different from those of the high-dimensional sample sentence feature vectors or not; under the condition that the hash codes of the high-dimensional sentence feature vectors are different from the hash codes of the high-dimensional sample sentence feature vectors, according to the difference values between the hash codes of the high-dimensional sample sentence feature vectors and the hash codes of the high-dimensional sentence feature vectors, adjusting the dividing parameters of an initial sentence dividing model, returning to execute the process of dividing sample text information into a plurality of sentences through the initial sentence dividing model, and carrying out coding conversion processing on each sentence to obtain the high-dimensional sentence feature vectors of each sentence until the hash codes of the high-dimensional sentence feature vectors which are different from the hash codes of the high-dimensional sample sentence feature vectors are not present; and taking the initial sentence division model corresponding to the hash code of the high-dimensional sentence feature vector which is not different from the hash code of the high-dimensional sample sentence feature vector as the sentence division model.
In this embodiment, when the terminal trains the initial sentence division model, it determines whether there is a hash code of a high-dimensional sentence feature vector that differs from the hash code of the corresponding high-dimensional sample sentence feature vector. In the case that such a differing hash code exists, the terminal adjusts the dividing parameters of the initial sentence division model according to the difference value between the hash code of each high-dimensional sample sentence feature vector and the hash code of each high-dimensional sentence feature vector, and returns to execute the steps of dividing the sample text information into a plurality of sentences through the initial sentence division model and performing coding conversion processing on each sentence to obtain the high-dimensional sentence feature vector of each sentence, until no hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector. The terminal takes the initial sentence division model for which no hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector as the sentence division model.
Based on the scheme, the division parameters of the initial sentence division model are adjusted through the difference value between the hash code of the high-dimensional sample sentence feature vector and the hash code of the high-dimensional sentence feature vector, so that the accuracy of the adjustment of the initial sentence division model is improved.
Optionally, as shown in fig. 3, the method further includes:
step S301, obtaining a high-dimensional sample sentence feature vector and sample data corresponding to the high-dimensional sample sentence feature vector.
In this embodiment, the terminal obtains, from the history search record, the high-dimensional sample sentence feature vectors produced by the sentence division model, together with the sample data corresponding to each high-dimensional sample sentence feature vector.
In step S302, the hash code of each high-dimensional sample sentence feature vector and the hash code of each sample data are calculated by the initial hash function, and the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of each sample data is calculated.
In this embodiment, the terminal calculates the hash code of each high-dimensional sample sentence feature vector and the hash code of each sample data through the initial hash function, and calculates the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of each sample data.
Specifically, when the terminal trains the hash function, the labeled high-dimensional sample sentence feature vectors are used to learn the hash codes. For example, suppose the high-dimensional sample sentence feature vector a is "car", the high-dimensional sample sentence feature vector b is "large truck", and the high-dimensional sample sentence feature vector c is "large road". The terminal determines that the similarity between vector a and vector b is higher than the similarity between vector a and vector c; that is, the terminal determines that vectors a and b belong to the same category (vehicle), with label Label_ab=1, while vectors a and c belong to different categories, with label Label_ac=0. Therefore, when the terminal trains the hash function, the similarity between the hash codes of the high-dimensional sample sentence feature vectors is made to approach the pre-labeled Label information. For example, if the hash code of vector a is h_a=[1, 1, 0], the hash code of vector b is h_b=[1, 1, 0], and the hash code of vector c is h_c=[0, 0, 1], then the similarity between h_a and h_b can be computed as similarity_ab = 1 - (h_a_bit1⊕h_b_bit1 + h_a_bit2⊕h_b_bit2 + h_a_bit3⊕h_b_bit3)/h_length, where ⊕ denotes exclusive OR and h_length is the number of hash bits, giving similarity_ab = 1 = Label_ab; similarly, similarity_ac = 0 = Label_ac.
That is, for any two high-dimensional sample sentence feature vectors x and y, the hash code similarity similarity_xy is made to approach Label_xy. Training the hash function through these steps improves the ability of its hash codes to preserve the semantic similarity information of the data, and the trained hash function can convert any high-dimensional sentence feature vector into a hash code.
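The worked example above can be checked with a short sketch; `hash_similarity` (an illustrative name) implements the reconstructed formula similarity = 1 - (number of differing bits)/code length:

```python
def hash_similarity(h_a, h_b):
    """similarity = 1 - normalized Hamming distance:
    1.0 for identical codes, 0.0 for fully opposite codes."""
    diff = sum(a ^ b for a, b in zip(h_a, h_b))  # XOR counts differing bits
    return 1 - diff / len(h_a)

# labelled toy codes from the example: a="car", b="large truck", c="large road"
h_a, h_b, h_c = [1, 1, 0], [1, 1, 0], [0, 0, 1]
sim_ab = hash_similarity(h_a, h_b)  # should approach Label_ab = 1
sim_ac = hash_similarity(h_a, h_c)  # should approach Label_ac = 0
```

With these codes, sim_ab evaluates to 1.0 and sim_ac to 0.0, matching the labels, which is exactly the training target the text describes.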
Step S303, in the case that there exists a high-dimensional sample sentence feature vector whose hash code has a similarity greater than the similarity threshold with the hash code of non-corresponding sample data, adjusting the hash code generation parameters of the initial hash function, and returning to execute the step of calculating the hash code of each high-dimensional sample sentence feature vector and the hash code of each sample data through the initial hash function, until only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of the sample data corresponding to that vector is greater than the similarity threshold.
In this embodiment, the terminal determines whether there exists a high-dimensional sample sentence feature vector whose hash code has a similarity greater than the similarity threshold with the hash code of sample data that does not correspond to it. In the case that such a vector exists, the terminal adjusts the hash code generation parameters of the initial hash function and returns to execute step S302; the terminal stops the iterative operation when only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of the sample data corresponding to that vector is greater than the similarity threshold.
Step S304, taking as the hash function the initial hash function for which only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of the sample data corresponding to that vector is greater than the similarity threshold.
In this embodiment, the terminal uses, as the hash function, the initial hash function for which only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of the sample data corresponding to that vector is greater than the similarity threshold.
Based on the scheme, by training the hash function, the hash function is ensured to accurately generate the hash code, so that the retrieval accuracy is improved.
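The publication leaves the hash function's internal form and its parameter-adjustment rule unspecified. As one hedged stand-in, the sketch below uses random-hyperplane (sign-of-projection) hashing, with simple resampling of the projection matrix in place of gradient-style adjustment, keeping only the stated stopping criterion that code similarities agree with the pairwise labels; every name here is illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def hash_code(W, x):
    """Sign-of-projection hashing: one hash bit per hyperplane (row of W)."""
    return (W @ x > 0).astype(int)

def label_agreement(W, pairs):
    """Fraction of labelled pairs whose rounded code similarity equals the label."""
    hits = 0
    for x, y, label in pairs:
        sim = 1 - np.mean(hash_code(W, x) ^ hash_code(W, y))  # 1 - Hamming/len
        hits += int(round(sim) == label)
    return hits / len(pairs)

# toy labelled data: a and b are similar (label 1), c is dissimilar (label 0)
a, b, c = np.array([1.0, 1.0]), np.array([1.1, 0.9]), np.array([-1.0, -1.0])
pairs = [(a, b, 1), (a, c, 0), (b, c, 0)]

W = rng.normal(size=(3, 2))               # 3 hash bits over 2-d feature vectors
for _ in range(100):                      # stand-in for parameter adjustment:
    if label_agreement(W, pairs) == 1.0:  # stop once codes agree with labels
        break
    W = rng.normal(size=(3, 2))
```

A trained or learned hash family (e.g. the supervised scheme the publication trains) would replace the resampling loop with an update driven by the similarity-label mismatch, but the stopping condition is the same.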
The application also provides a data retrieval example, as shown in fig. 4, and the specific processing procedure comprises the following steps:
step S401, acquiring text information input by a user.
Step S402, determining a plurality of word embedding vectors through a word embedding layer for each word in the text information, and performing position coding on the front and back sequence of each word in the text information to obtain a position coding vector of the word.
Step S403, dividing each sentence in the text information into a plurality of short sentences through dividing layers, and performing conversion processing on each short sentence to obtain a plurality of short sentence vectors corresponding to the sentence.
Step S404, determining the high-dimensional sentence characteristic vector of each sentence in the text information according to the word embedding vector of the word segmentation contained in the sentence in the text information, the position coding vector of the word segmentation contained in the sentence and the short sentence vector corresponding to the sentence.
Step S405, for each high-dimensional sentence feature vector, generating a hash code corresponding to the high-dimensional sentence feature vector through a hash function.
Step S406, selecting any one data to be selected from the database, and judging whether the hash bits in the hash codes of the data to be selected are consistent with the hash bits in the hash codes corresponding to the high-dimensional sentence feature vectors.
Step S407, in the case that the hash bits in the hash code of the data to be selected are inconsistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, returning to execute the step of selecting any one piece of data to be selected in the database, until the hash bits in the hash code of the selected data to be selected are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, and using the data to be selected with consistent hash bits as the target retrieval data corresponding to the text information.
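Steps S405 to S407 amount to examining candidates until one's hash code matches the query's bit for bit. An equivalent sketch (Python; `retrieve` and the toy database are our illustrations) that returns all exact-code matches in one scan:

```python
def retrieve(query_code: int, database: dict) -> list:
    """Return every candidate whose hash code matches the query code exactly.
    database maps candidate id -> hash code (an int whose bits are hash bits);
    XOR == 0 means every hash bit is consistent."""
    return [name for name, code in database.items() if (code ^ query_code) == 0]

# hypothetical database of candidate data and their precomputed hash codes
db = {"doc1": 0b1010, "doc2": 0b0111, "doc3": 0b1010}
```

The publication's random-selection-with-retry loop visits the same candidates in a different order; a single scan makes the termination condition (a bit-exact match, or none) explicit.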
It should be understood that, although the steps in the flowcharts related to the embodiments described above are shown sequentially as indicated by the arrows, these steps are not necessarily performed in the order indicated. Unless explicitly stated herein, the steps are not strictly limited to that order and may be executed in other orders. Moreover, at least some of the steps in the flowcharts described in the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times, and these sub-steps or stages are not necessarily performed sequentially, but may be performed in turn or alternately with at least some of the other steps, sub-steps, or stages.
Based on the same inventive concept, the embodiment of the application also provides a data retrieval device for realizing the above related data retrieval method. The implementation of the solution provided by the device is similar to the implementation described in the above method, so the specific limitation of one or more embodiments of the data retrieval device provided below may refer to the limitation of the data retrieval method described above, and will not be repeated here.
In one embodiment, as shown in fig. 5, there is provided a data retrieval apparatus comprising: an acquisition module 510, a partitioning module 520, and a screening module 530, wherein:
an obtaining module 510, configured to obtain text information input by a user; the text information is text data which needs to be retrieved by a user;
the division module 520 is configured to divide the text information into a plurality of sentences through a sentence division model, and perform coding conversion processing on each sentence to obtain a high-dimensional sentence feature vector corresponding to each sentence;
and the screening module 530 is configured to generate a hash code corresponding to the high-dimensional sentence feature vector through a hash function, query a database for to-be-selected data corresponding to a hash code identical to the hash code of the high-dimensional sentence feature vector, and take the to-be-selected data of the identical hash code as the target retrieval data corresponding to the text information.
Optionally, the dividing module 520 is specifically configured to:
determining a plurality of word embedding vectors through the word embedding layer for each word in the text information, and carrying out position coding on the front and back sequence of each word in the text information to obtain a position coding vector of the word;
dividing each sentence in the text information into a plurality of short sentences through the dividing layer, and carrying out conversion processing on each short sentence to obtain a plurality of short sentence vectors corresponding to the sentence;
and determining a high-dimensional sentence characteristic vector of each sentence in the text information according to the word embedding vector of the word segmentation contained in the sentence in the text information, the position coding vector of the word segmentation contained in the sentence and the short sentence vector corresponding to the sentence.
Optionally, the dividing module 520 is specifically configured to:
and for each sentence, carrying out vector superposition processing on the word embedding vector of the word included in the sentence, the position coding vector of the word included in the sentence and the short sentence vector corresponding to the sentence to obtain the high-dimensional sentence feature vector of the sentence.
Optionally, the screening module 530 is specifically configured to:
Generating hash codes corresponding to the high-dimensional sentence feature vectors through a hash function aiming at each high-dimensional sentence feature vector;
selecting any one piece of data to be selected in the database, and judging whether the hash bits in the hash code of the data to be selected are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector;
and in the case that the hash bits in the hash code of the data to be selected are inconsistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, returning to execute the step of selecting any one piece of data to be selected in the database, until the hash bits in the hash code of the selected data to be selected are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, and taking the data to be selected with consistent hash bits as the target retrieval data corresponding to the text information.
Optionally, the apparatus further includes:
the sample acquisition module is used for acquiring sample text information and hash codes of feature vectors of all high-dimensional sample sentences in the sample text information;
the coding module is used for dividing the sample text information into a plurality of sentences through an initial sentence division model, and carrying out coding conversion processing on each sentence to obtain a high-dimensional sentence characteristic vector of each sentence;
The training module is used for generating hash codes corresponding to the high-dimensional sentence feature vectors through hash functions, and training the initial sentence division model according to the hash codes of the high-dimensional sample sentence feature vectors and the hash codes of the high-dimensional sentence feature vectors to obtain the sentence division model.
Optionally, the training module is specifically configured to:
judging whether the hash codes of the high-dimensional sentence feature vectors are different from the hash codes of the high-dimensional sample sentence feature vectors or not;
under the condition that the hash codes of the high-dimensional sentence feature vectors are different from the hash codes of the high-dimensional sample sentence feature vectors, adjusting the dividing parameters of the initial sentence dividing model according to the difference values between the hash codes of the high-dimensional sample sentence feature vectors and the hash codes of the high-dimensional sentence feature vectors, returning to execute the process of dividing the sample text information into a plurality of sentences through the initial sentence dividing model, and carrying out coding conversion processing on each sentence to obtain the high-dimensional sentence feature vectors of each sentence until the hash codes of the high-dimensional sentence feature vectors which are different from the hash codes of the high-dimensional sample sentence feature vectors are not present;
And taking the initial sentence division model corresponding to the hash code of the high-dimensional sentence feature vector which is not different from the hash code of the high-dimensional sample sentence feature vector as the sentence division model.
Optionally, the apparatus further includes:
the vector acquisition module is used for acquiring a high-dimensional sample sentence characteristic vector and sample data corresponding to the high-dimensional sample sentence characteristic vector;
the similarity calculation module is used for calculating the hash code of each high-dimensional sample sentence characteristic vector and the hash code of each sample data through an initial hash function, and calculating the similarity between the hash code of each high-dimensional sample sentence characteristic vector and the hash code of each sample data;
the judging module is used for adjusting the hash code generation parameters of the initial hash function in the case that there exists a high-dimensional sample sentence feature vector whose hash code has a similarity greater than the similarity threshold with the hash code of non-corresponding sample data, and returning to execute the step of calculating the hash code of each high-dimensional sample sentence feature vector and the hash code of each sample data through the initial hash function, until only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of the sample data corresponding to that vector is greater than the similarity threshold;
And the determining module is used for taking, as the hash function, the initial hash function for which only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of the sample data corresponding to that vector is greater than the similarity threshold.
The respective modules in the above-described data retrieval apparatus may be implemented in whole or in part by software, hardware, and combinations thereof. The above modules may be embedded in hardware or may be independent of a processor in the computer device, or may be stored in software in a memory in the computer device, so that the processor may call and execute operations corresponding to the above modules.
In one embodiment, a computer device is provided, which may be a terminal, and the internal structure of which may be as shown in fig. 6. The computer device includes a processor, a memory, a communication interface, a display screen, and an input device connected by a system bus. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system and a computer program. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The communication interface of the computer device is used for carrying out wired or wireless communication with an external terminal, and the wireless mode can be realized through WIFI, a mobile cellular network, NFC (near field communication) or other technologies. The computer program is executed by a processor to implement a data retrieval method. The display screen of the computer equipment can be a liquid crystal display screen or an electronic ink display screen, and the input device of the computer equipment can be a touch layer covered on the display screen, can also be keys, a track ball or a touch pad arranged on the shell of the computer equipment, and can also be an external keyboard, a touch pad or a mouse and the like.
It will be appreciated by those skilled in the art that the structure shown in fig. 6 is merely a block diagram of some of the structures associated with the present application and is not limiting of the computer device to which the present application may be applied, and that a particular computer device may include more or fewer components than shown, or may combine certain components, or have a different arrangement of components.
In an embodiment, a computer device is provided comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method of any of the first aspects when the computer program is executed.
In an embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, implements the steps of the method of any of the first aspects.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method of any of the first aspects.
It should be noted that, user information (including but not limited to user equipment information, user personal information, etc.) and data (including but not limited to data for analysis, stored data, presented data, etc.) referred to in the present application are information and data authorized by the user or sufficiently authorized by each party.
Those skilled in the art will appreciate that implementing all or part of the above described methods may be accomplished by way of a computer program stored on a non-transitory computer readable storage medium, which, when executed, may comprise the steps of the embodiments of the methods described above. Any reference to memory, database, or other medium used in the various embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include Read-Only Memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration, and not limitation, RAM may take a variety of forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the various embodiments provided herein may include at least one of relational databases and non-relational databases. The non-relational databases may include, but are not limited to, blockchain-based distributed databases and the like. The processors referred to in the embodiments provided herein may be general-purpose processors, central processing units, graphics processors, digital signal processors, programmable logic units, quantum-computing-based data processing logic units, etc., without being limited thereto.
The technical features of the above embodiments may be arbitrarily combined, and all possible combinations of the technical features in the above embodiments are not described for brevity of description, however, as long as there is no contradiction between the combinations of the technical features, they should be considered as the scope of the description.
The above examples express only a few embodiments of the present application and are described in some detail, but they are not to be construed as limiting the scope of the application. It should be noted that those skilled in the art could make various modifications and improvements without departing from the spirit of the present application, and such modifications and improvements fall within its scope of protection. Accordingly, the scope of protection of the present application shall be subject to the appended claims.

Claims (11)

1. A method of data retrieval, the method comprising:
acquiring text information input by a user;
dividing the text information into a plurality of sentences through a sentence division model, and performing coding conversion processing on each sentence to obtain a high-dimensional sentence feature vector corresponding to each sentence;
generating a hash code corresponding to the high-dimensional sentence feature vector through a hash function, querying the database for a hash code identical to the hash code of the high-dimensional sentence feature vector, and taking the candidate data corresponding to the identical hash code as the target retrieval data for the text information.
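The pipeline of claim 1 — sentence feature vectors hashed into short binary codes, then matched against stored codes — resembles locality-sensitive hashing. A minimal sketch in plain Python, assuming sign-of-projection hashing with fixed toy hyperplanes (the claim does not specify the hash function, so the hyperplanes, vectors, and document names below are all hypothetical):

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def hash_code(vec, planes):
    # One bit per hyperplane: the sign of the projection of the
    # sentence feature vector onto that plane's normal vector.
    return tuple(dot(vec, p) > 0 for p in planes)

# Toy hyperplanes and a toy "database": hash code -> candidate data.
PLANES = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [1.0, 1.0, 1.0, 1.0],
]

db = {}
for data, vec in [("doc A", [1.0, 0.2, -0.3, 0.5]),
                  ("doc B", [-1.0, 0.4, 0.9, -0.2])]:
    db.setdefault(hash_code(vec, PLANES), []).append(data)

def retrieve(query_vec):
    # Return candidate data whose stored hash code equals the query's.
    return db.get(hash_code(query_vec, PLANES), [])
```

A query vector close to `doc A`'s (e.g. `[0.9, 0.1, -0.25, 0.55]`) falls on the same side of every hyperplane, so its hash code equals `doc A`'s and the equal-hash-code lookup of the claim returns `doc A`.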
2. The method according to claim 1, wherein the sentence division model includes a word embedding layer and a division layer, and wherein the dividing of the text information into a plurality of sentences through the sentence division model and the coding conversion processing of each sentence to obtain the high-dimensional sentence feature vector corresponding to each sentence comprise:
determining, through the word embedding layer, a word embedding vector for each word segment in the text information, and encoding the position of each word segment in the word order of the text information to obtain the position encoding vector of the word segment;
dividing each sentence in the text information into a plurality of short sentences through the division layer, and converting each short sentence to obtain a plurality of short sentence vectors corresponding to the sentence;
and determining the high-dimensional sentence feature vector of each sentence in the text information according to the word embedding vectors of the word segments contained in the sentence, the position encoding vectors of those word segments, and the short sentence vectors corresponding to the sentence.
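Claim 2 combines per-token word embeddings with a code for token order. The claim does not fix a position-encoding formula; a common choice, shown here purely as an assumption, is the sinusoidal encoding:

```python
import math

def position_encoding(pos, dim):
    # Sinusoidal code for the token at index `pos` (hypothetical scheme;
    # the claim only requires that word order be encoded somehow).
    return [math.sin(pos / 10000 ** (i / dim)) if i % 2 == 0
            else math.cos(pos / 10000 ** ((i - 1) / dim))
            for i in range(dim)]
```

Each position yields a distinct `dim`-length vector, so adding it to a word embedding makes otherwise identical tokens distinguishable by position.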
3. The method according to claim 2, wherein the determining of the high-dimensional sentence feature vector of each sentence in the text information according to the word embedding vectors of the word segments contained in the sentence, the position encoding vectors of those word segments, and the short sentence vectors corresponding to the sentence comprises:
for each sentence, superposing the word embedding vectors of the word segments contained in the sentence, the position encoding vectors of those word segments, and the short sentence vectors corresponding to the sentence, to obtain the high-dimensional sentence feature vector of the sentence.
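The "vector superposition" of claim 3 is, on one plausible reading, an element-wise sum of the three vector groups. A sketch under that assumption (the function name and toy inputs are illustrative, not from the patent):

```python
def sentence_feature_vector(word_vecs, pos_vecs, phrase_vecs):
    # Element-wise sum of token embeddings, their position codes, and
    # the short-sentence (phrase) vectors belonging to one sentence.
    dim = len(word_vecs[0])
    total = [0.0] * dim
    for vec in word_vecs + pos_vecs + phrase_vecs:
        for i, x in enumerate(vec):
            total[i] += x
    return total
```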
4. The method according to claim 1, wherein the generating of the hash code corresponding to the high-dimensional sentence feature vector through a hash function, the querying of the database for candidate data corresponding to a hash code identical to the hash code of the high-dimensional sentence feature vector, and the taking of the candidate data with the identical hash code as the target retrieval data for the text information comprise:
generating, for each high-dimensional sentence feature vector, the hash code corresponding to that feature vector through the hash function;
selecting any one piece of candidate data in the database, and judging whether the hash bits in the hash code of the candidate data are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector;
and in the case that the hash bits in the hash code of the candidate data are inconsistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, returning to the step of selecting any one piece of candidate data in the database, until the hash bits in the hash code of the selected candidate data are consistent with the hash bits in the hash code corresponding to the high-dimensional sentence feature vector, and taking the candidate data with consistent hash bits as the target retrieval data for the text information.
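The select-compare-repeat loop of claim 4 amounts to a linear scan that stops at the first stored item whose hash bits all match the query's code. A minimal sketch (candidate order and tie handling are unspecified in the claim; this version simply takes candidates in list order):

```python
def find_target(query_code, candidates):
    # candidates: list of (data, hash_code) pairs drawn from the database.
    remaining = list(candidates)
    while remaining:
        data, code = remaining.pop(0)          # "select any one" candidate
        if len(code) == len(query_code) and all(
                a == b for a, b in zip(code, query_code)):
            return data                        # every hash bit consistent
    return None                                # no candidate matched
```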
5. The method according to claim 1, wherein the method further comprises:
acquiring sample text information and hash codes of feature vectors of all high-dimensional sample sentences in the sample text information;
dividing the sample text information into a plurality of sentences through an initial sentence division model, and performing coding conversion processing on each sentence to obtain the high-dimensional sentence feature vector of each sentence;
generating the hash code corresponding to each high-dimensional sentence feature vector through a hash function, and training the initial sentence division model according to the hash codes of the high-dimensional sample sentence feature vectors and the hash codes of the high-dimensional sentence feature vectors, to obtain the sentence division model.
6. The method according to claim 5, wherein training the initial sentence division model according to the hash code of each of the high-dimensional sample sentence feature vectors and the hash code of each of the high-dimensional sentence feature vectors to obtain the sentence division model includes:
judging whether the hash code of each high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector;
in the case that the hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector, adjusting the division parameters of the initial sentence division model according to the difference between the hash codes of the high-dimensional sample sentence feature vectors and the hash codes of the high-dimensional sentence feature vectors, and returning to the step of dividing the sample text information into a plurality of sentences through the initial sentence division model and performing coding conversion processing on each sentence to obtain the high-dimensional sentence feature vector of each sentence, until no hash code of a high-dimensional sentence feature vector differs from the hash code of the corresponding high-dimensional sample sentence feature vector;
and taking, as the sentence division model, the initial sentence division model for which the hash codes of the high-dimensional sentence feature vectors no longer differ from the hash codes of the high-dimensional sample sentence feature vectors.
7. The method according to claim 1, wherein the method further comprises:
acquiring a high-dimensional sample sentence feature vector and sample data corresponding to the high-dimensional sample sentence feature vector;
calculating, through an initial hash function, the hash code of each high-dimensional sample sentence feature vector and the hash code of each piece of sample data, and calculating the similarity between the hash codes of the high-dimensional sample sentence feature vectors and the hash codes of the sample data;
in the case that there is a high-dimensional sample sentence feature vector whose hash code has a similarity greater than the similarity threshold with the hash code of sample data other than its corresponding sample data, adjusting the hash code generation parameters of the initial hash function, and returning to the step of calculating the hash code of each high-dimensional sample sentence feature vector and the hash code of each piece of sample data through the initial hash function, until only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of its corresponding sample data is greater than the similarity threshold;
and taking, as the hash function, the initial hash function for which only the similarity between the hash code of each high-dimensional sample sentence feature vector and the hash code of its corresponding sample data is greater than the similarity threshold.
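Claims 6 and 7 both hinge on comparing hash codes against a similarity threshold. For binary codes the natural measure is the fraction of matching bits (Hamming similarity); the patent does not name the measure, so this is an assumed choice:

```python
def hamming_similarity(code_a, code_b):
    # Fraction of hash bits on which the two codes agree; the training
    # loops would compare this value against the similarity threshold.
    if len(code_a) != len(code_b):
        raise ValueError("hash codes must have equal length")
    return sum(a == b for a, b in zip(code_a, code_b)) / len(code_a)
```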
8. A data retrieval apparatus, the apparatus comprising:
the acquisition module is used for acquiring text information input by a user; the text information is text data which needs to be retrieved by a user;
the division module is used for dividing the text information into a plurality of sentences through a sentence division model, and performing coding conversion processing on each sentence to obtain the high-dimensional sentence feature vector corresponding to each sentence;
and the screening module is used for generating the hash code corresponding to each high-dimensional sentence feature vector through the hash function, querying the database for candidate data corresponding to a hash code identical to the hash code of the high-dimensional sentence feature vector, and taking the candidate data with the identical hash code as the target retrieval data.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
11. A computer program product comprising a computer program, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202211681406.1A 2022-12-27 2022-12-27 Data retrieval method, device, computer equipment and storage medium Pending CN116049350A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211681406.1A CN116049350A (en) 2022-12-27 2022-12-27 Data retrieval method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116049350A true CN116049350A (en) 2023-05-02

Family

ID=86122929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211681406.1A Pending CN116049350A (en) 2022-12-27 2022-12-27 Data retrieval method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116049350A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination