CN117910022A - Data searching method, device, computer equipment, storage medium and product - Google Patents

Info

Publication number
CN117910022A
Authority
CN
China
Prior art keywords
text
segment
sequence
text segment
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410311119.4A
Other languages
Chinese (zh)
Inventor
张民遐
许金明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Gaodeng Computer Technology Co ltd
Original Assignee
Shenzhen Gaodeng Computer Technology Co ltd
Abstract

The application relates to a data searching method, a data searching device, computer equipment, a storage medium and a product. The method comprises the following steps: acquiring an input text, and parsing the input text into a text segment sequence formed by a plurality of text segments; for each text segment in the text segment sequence, mapping the text segment in question to a corresponding feature value; determining a data identifier matching the text segment in question from a preconfigured index table corresponding to that text segment; screening out, from the data identifiers matched by the text segments in the text segment sequence, a target data identifier that matches every text segment in the text segment sequence; obtaining a pre-encrypted text corresponding to the target data identifier and decrypting it to obtain a plaintext candidate text; and performing text matching between the input text and the plaintext candidate text to obtain a data search result for the input text. By adopting the method, both the security of the encrypted data and the efficiency of searching the encrypted data can be improved.

Description

Data searching method, device, computer equipment, storage medium and product
Technical Field
The present application relates to the field of search technology, and in particular, to a data search method, apparatus, computer device, storage medium, and computer program product.
Background
With the rapid development of the internet and digital technology, data has become one of the core assets of modern society. Because some data involves personal privacy, enterprise confidentiality, national security and the like, such data is generally encrypted and stored as encrypted data in order to keep it secure. When the data is used, the content to be searched is input, and the data matching that content is searched out from a large amount of encrypted data. In the conventional technology, the full set of encrypted data is generally obtained and decrypted into plaintext data, the plaintext data is exactly matched against the content to be searched item by item, and the plaintext data matching the content to be searched is finally obtained.
However, decrypting the full set of encrypted data before exactly matching it against the content to be searched exposes the encrypted data to a high risk of leakage, so data security is low.
Disclosure of Invention
In view of the foregoing, it is desirable to provide a data search method, apparatus, computer device, computer-readable storage medium, and computer program product that can improve the security of encrypted data.
In a first aspect, the present application provides a data searching method, including:
acquiring an input text, and parsing the input text into a text segment sequence formed by a plurality of text segments;
for each text segment in the text segment sequence, mapping the text segment in question to a corresponding feature value;
determining an index record containing the feature value from a preconfigured index table corresponding to the text segment in question, and determining a data identifier matching the text segment in question from the index record;
screening out, from the data identifiers matched by the text segments in the text segment sequence, a target data identifier that matches every text segment in the text segment sequence;
obtaining a pre-encrypted text corresponding to the target data identifier and decrypting it to obtain a plaintext candidate text; and
performing text matching between the input text and the plaintext candidate text to obtain a data search result for the input text.
In a second aspect, the present application also provides a data searching apparatus, including:
a parsing module, used for acquiring an input text and parsing the input text into a text segment sequence formed by a plurality of text segments;
a feature value mapping module, used for mapping each text segment in the text segment sequence to a corresponding feature value; and
a data searching module, used for determining an index record containing the feature value from a preconfigured index table corresponding to the text segment in question, and determining a data identifier matching the text segment in question from the index record; screening out, from the data identifiers matched by the text segments in the text segment sequence, a target data identifier that matches every text segment in the text segment sequence; obtaining a pre-encrypted text corresponding to the target data identifier and decrypting it to obtain a plaintext candidate text; and performing text matching between the input text and the plaintext candidate text to obtain a data search result for the input text.
In a third aspect, the present application also provides a computer device comprising a memory and a processor, the memory storing a computer program, the processor implementing the following steps when executing the computer program:
acquiring an input text, and parsing the input text into a text segment sequence formed by a plurality of text segments;
for each text segment in the text segment sequence, mapping the text segment in question to a corresponding feature value;
determining an index record containing the feature value from a preconfigured index table corresponding to the text segment in question, and determining a data identifier matching the text segment in question from the index record;
screening out, from the data identifiers matched by the text segments in the text segment sequence, a target data identifier that matches every text segment in the text segment sequence;
obtaining a pre-encrypted text corresponding to the target data identifier and decrypting it to obtain a plaintext candidate text; and
performing text matching between the input text and the plaintext candidate text to obtain a data search result for the input text.
In a fourth aspect, the present application also provides a computer readable storage medium having stored thereon a computer program which when executed by a processor performs the steps of:
acquiring an input text, and parsing the input text into a text segment sequence formed by a plurality of text segments;
for each text segment in the text segment sequence, mapping the text segment in question to a corresponding feature value;
determining an index record containing the feature value from a preconfigured index table corresponding to the text segment in question, and determining a data identifier matching the text segment in question from the index record;
screening out, from the data identifiers matched by the text segments in the text segment sequence, a target data identifier that matches every text segment in the text segment sequence;
obtaining a pre-encrypted text corresponding to the target data identifier and decrypting it to obtain a plaintext candidate text; and
performing text matching between the input text and the plaintext candidate text to obtain a data search result for the input text.
In a fifth aspect, the application also provides a computer program product comprising a computer program which, when executed by a processor, performs the steps of:
acquiring an input text, and parsing the input text into a text segment sequence formed by a plurality of text segments;
for each text segment in the text segment sequence, mapping the text segment in question to a corresponding feature value;
determining an index record containing the feature value from a preconfigured index table corresponding to the text segment in question, and determining a data identifier matching the text segment in question from the index record;
screening out, from the data identifiers matched by the text segments in the text segment sequence, a target data identifier that matches every text segment in the text segment sequence;
obtaining a pre-encrypted text corresponding to the target data identifier and decrypting it to obtain a plaintext candidate text; and
performing text matching between the input text and the plaintext candidate text to obtain a data search result for the input text.
According to the data searching method, device, computer equipment, storage medium and computer program product described above, the input text is parsed into a text segment sequence formed by a plurality of text segments, each text segment is mapped to a corresponding feature value, and the data identifiers matching each text segment are determined through the preconfigured index table corresponding to that text segment. A target data identifier that matches every text segment in the text segment sequence is then screened out from those data identifiers, the pre-encrypted text corresponding to the target data identifier is obtained and decrypted into a plaintext candidate text, and the input text is matched against the plaintext candidate text to obtain a data search result.
Drawings
In order to illustrate the embodiments of the present application or the technical solutions in the related art more clearly, the drawings used in the description of the embodiments or the related art are briefly introduced below. The drawings described below show only some embodiments of the present application, and those skilled in the art may obtain other drawings from them without inventive effort.
FIG. 1 is an application environment diagram of a data search method in one embodiment;
FIG. 2 is a flow chart of a method of data searching in one embodiment;
FIG. 3 is a schematic diagram of an example index table in one embodiment;
FIG. 4 is a flow diagram of an index generation step in one embodiment;
FIG. 5 is a schematic diagram of a data search architecture in one embodiment;
FIG. 6 is a flow chart of a simplified step of data searching in one embodiment;
FIG. 7 is a flow chart illustrating detailed steps of data searching in one embodiment;
FIG. 8 is a block diagram of a data search device in one embodiment;
FIG. 9 is an internal structural diagram of a computer device in one embodiment.
Detailed Description
The present application will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present application more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application.
The data searching method provided by the embodiments of the application can be applied to the application environment shown in FIG. 1, in which the terminal 102 communicates with the server 104 via a network. A data storage system may store the data that the server 104 needs to process. The data storage system may be integrated on the server 104 or may be located on a cloud or another network server. The server 104 may obtain an input text entered through the terminal 102, parse the input text into a text segment sequence composed of a plurality of text segments, map each text segment in the text segment sequence to a corresponding feature value, determine the data identifiers matching each text segment, screen out a target data identifier that matches every text segment, decrypt the pre-encrypted text corresponding to the target data identifier, and match the decrypted text against the input text to obtain a data search result. The terminal 102 may be a personal computer, a notebook computer, a smart phone, a tablet computer, an internet of things device or a portable wearable device. The internet of things device may be a smart television, a smart vehicle-mounted device or the like. The portable wearable device may be a smart watch, a smart bracelet, a headset or the like. The server 104 may be implemented as a stand-alone server or as a server cluster of multiple servers.
In one embodiment, as shown in FIG. 2, a data searching method is provided. The method is described using its application to the server 104 in FIG. 1 as an example, and includes the following steps 202 to 212:
Step 202, acquiring an input text, and parsing the input text into a text segment sequence formed by a plurality of text segments.
The input text is text entered by the user for searching. A text segment is a segment of the input text. A text segment sequence is the sequence formed by all text segments of the input text, arranged in the order in which they appear in the input text. Each text segment is formed by characters of a single character type, and may be either a character segment or a symbol segment. A character segment is a segment formed by characters of any one of the Chinese character type, the English letter type and the numeral type. A symbol segment is a segment formed by characters of the symbol type. A text segment may include one character or multiple characters of the same character type.
In one embodiment, the server may obtain an input text input through the terminal, parse the character types of the characters in the input text, and parse the input text into a text segment sequence composed of a plurality of text segments according to a preset rule corresponding to the character types of the characters in the input text. The preset rule corresponding to the character type is used for analyzing the characters of the character type in the input text into text fragments. For example, the preset rule corresponding to the character type may be that a single character of the character type in the input text is determined as a text segment, or that a plurality of continuous characters of the character type in the input text is determined as a text segment.
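The parsing step can be illustrated with a minimal sketch. The specific preset rules below are assumptions chosen for illustration (each Chinese character forms its own segment, while consecutive English letters, digits or symbols are merged into one segment); the source only states that such per-type rules may exist. The names `char_type` and `parse_segments` are hypothetical.

```python
def char_type(ch: str) -> str:
    """Classify one character as 'hanzi', 'letter', 'digit' or 'symbol'."""
    if "\u4e00" <= ch <= "\u9fff":  # CJK Unified Ideographs (basic block only)
        return "hanzi"
    if ch.isascii() and ch.isalpha():
        return "letter"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "symbol"  # everything else, including whitespace


def parse_segments(text: str) -> list:
    """Split text into an ordered list of single-type text segments."""
    segments = []
    prev_type = None
    for ch in text:
        current = char_type(ch)
        # Assumed rule: a Chinese character is always its own segment;
        # other types extend the previous segment when the type repeats.
        if current != "hanzi" and current == prev_type:
            segments[-1] += ch
        else:
            segments.append(ch)
        prev_type = current
    return segments
```

Under these assumed rules, `parse_segments("张三abc123@#")` yields the five segments `"张"`, `"三"`, `"abc"`, `"123"` and `"@#"`, in input order.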
Step 204, for each text segment in the sequence of text segments, mapping the text segment in question to a corresponding feature value.
The feature value is obtained by mapping the text segment according to a mapping rule, and characterizes the text segment. When the mapping rule is known, the text segment represented by a feature value can be determined; when the mapping rule is not known, the feature value can be regarded as ciphertext, and the specific meaning it represents cannot be learned. The text segment sequence may include only character segments, or may include both character segments and symbol segments.
In one embodiment, when the text segment sequence includes character segments, the server may determine, for each character segment in the text segment sequence, a letter sequence corresponding to the character segment in question, and map that character segment to a corresponding feature value based on the letter sequence.
In one embodiment, when the text segment sequence includes a symbol segment and the symbol segment includes at least two consecutive symbols of the input text, the server may, for each symbol segment in the text segment sequence, determine the number of symbols in the symbol segment, determine the first symbol and the last symbol in the symbol segment, determine a salt value preconfigured for the symbol type, and map the symbol segment to a corresponding feature value based on the number of symbols, the first symbol, the last symbol and the salt value. The salt value is used to increase the complexity of the feature value; it may be a specified string with no actual meaning.
In one embodiment, the step of mapping the symbol segment to the corresponding feature value according to the number of symbols, the first symbol, the last symbol and the salt value preconfigured for the symbol type includes: the server splices the number of symbols, the first symbol, the last symbol and the salt value preconfigured for the symbol type to obtain a splicing result, and encrypts the splicing result to obtain the feature value corresponding to the symbol segment.
It is to be understood that the order in which the four items (the number of symbols, the first symbol, the last symbol and the salt value preconfigured for the symbol type) are spliced is configurable. For example, they may be spliced in the order of the number of symbols, the first symbol, the last symbol and the salt value, or in the order of the first symbol, the number of symbols, the last symbol and the salt value. For example, for a symbol segment of six symbols whose first symbol is "@" and whose last symbol is ")", with the salt value preconfigured for the symbol type being "%", the splicing result may be "6@)%" or "@6)%".
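The symbol-segment mapping can be sketched as follows, using one of the splicing orders described above (number of symbols, first symbol, last symbol, salt). The source does not name the algorithm used to encrypt the splicing result, so SHA-256 is used here purely as a stand-in assumption; the function names and the sample segment `"@#$%^)"` are hypothetical.

```python
import hashlib


def splice_symbol_segment(segment: str, salt: str) -> str:
    """Splice count, first symbol, last symbol and salt, in that order."""
    if len(segment) < 2:
        raise ValueError("this embodiment assumes at least two symbols")
    return f"{len(segment)}{segment[0]}{segment[-1]}{salt}"


def symbol_feature_value(segment: str, salt: str) -> str:
    """Map a symbol segment to a feature value.

    SHA-256 is an assumed stand-in for the unspecified encryption step.
    """
    spliced = splice_symbol_segment(segment, salt)
    return hashlib.sha256(spliced.encode("utf-8")).hexdigest()
```

For a six-symbol segment starting with "@" and ending with ")", and the salt "%", `splice_symbol_segment` returns `"6@)%"`. Because only the count and the two end symbols enter the feature value, different symbol runs can share a feature value; the final plaintext matching step is what removes such false positives.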
Step 206, determining an index record containing the feature value from the preconfigured index table corresponding to the text segment in question, and determining a data identifier matching the text segment in question from the index record.
The index table is a data table for storing a plurality of index records. An index record may include a feature value and the data identifier corresponding to that feature value. The data identifier may be the identifier of a pre-encrypted text obtained by encrypting a preset text in advance, and is used to distinguish different pre-encrypted texts. The data identifier may also be referred to as a data ID.
In one embodiment, the server may determine, from among a plurality of preconfigured index tables, the preconfigured index table corresponding to the text segment in question, determine in that index table an index record containing the feature value, and take the data identifier in the determined index record as the data identifier matching the text segment in question.
Step 208, screening out, from the data identifiers matched by the text segments in the text segment sequence, a target data identifier that matches every text segment in the text segment sequence.
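As an illustration of the lookup in step 206, an index table can be modeled as a list of records, each holding a feature value and a data identifier. This in-memory layout and the field names `feature_value` and `data_id` are assumptions made for the sketch; the source only says that an index record may include these two pieces of information.

```python
def lookup_data_ids(index_table: list, feature_value: str) -> set:
    """Return the data identifiers of all records holding this feature value."""
    return {
        record["data_id"]
        for record in index_table
        if record["feature_value"] == feature_value
    }


# Hypothetical index table with placeholder feature values.
example_table = [
    {"feature_value": "fv_a", "data_id": 1},
    {"feature_value": "fv_a", "data_id": 2},
    {"feature_value": "fv_b", "data_id": 3},
]
```

With this sample table, `lookup_data_ids(example_table, "fv_a")` returns the identifiers 1 and 2. In a real deployment the index table would more plausibly be a database table queried by feature value.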
In one embodiment, the server may screen out, from the data identifiers matched by the text segments in the text segment sequence, the candidate data identifiers that match every text segment in the sequence, and determine those candidate data identifiers as the target data identifiers. For example, the text segment sequence may include text segment 1, text segment 2 and text segment 3. If the data identifiers matched by text segment 1 are identifier 1, identifier 2 and identifier 3, those matched by text segment 2 are identifier 2, identifier 3 and identifier 4, and those matched by text segment 3 are identifier 2, identifier 3 and identifier 5, then the candidate data identifiers matching every text segment are identifier 2 and identifier 3, and the target data identifiers may be identifier 2 and identifier 3.
In one embodiment, the server may screen out, from the data identifiers matched by the text segments in the text segment sequence, a plurality of candidate data identifiers that match every text segment in the sequence. For each candidate data identifier, the server determines, from the index records in which that candidate data identifier was found, the text segment sequence number recorded for each text segment of the text segment sequence, and then screens out the target data identifier from the candidate data identifiers according to those text segment sequence numbers.
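The screening in this example amounts to a set intersection over the identifiers matched by each text segment. A minimal sketch, with the hypothetical function name `screen_target_ids`:

```python
def screen_target_ids(matches_per_segment: list) -> set:
    """Keep only the data identifiers matched by every text segment."""
    if not matches_per_segment:
        return set()
    return set.intersection(*(set(ids) for ids in matches_per_segment))


# The three segments from the example above.
example_matches = [{1, 2, 3}, {2, 3, 4}, {2, 3, 5}]
```

Here `screen_target_ids(example_matches)` leaves identifiers 2 and 3, so only two pre-encrypted texts need to be decrypted instead of the full data set.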
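The source does not spell out how the text segment sequence numbers are compared. One plausible reading, shown below strictly as an assumption, is that a candidate is kept only if the query segments occur in the stored text at consecutive, increasing sequence numbers. The data layout (for each candidate, one set of sequence numbers per query segment) and the function names are hypothetical.

```python
def occurs_in_order(positions: list) -> bool:
    """True if some start s has s + i in positions[i] for every query segment i."""
    return any(
        all(start + i in positions[i] for i in range(len(positions)))
        for start in positions[0]
    )


def filter_by_sequence_numbers(candidates: dict) -> set:
    """Keep candidates whose segments appear consecutively in query order."""
    return {
        data_id
        for data_id, positions in candidates.items()
        if positions and occurs_in_order(positions)
    }


# Hypothetical candidates: identifier 2 stores the segments at 0, 1, 2;
# identifier 3 stores the same segments in a different order.
example_candidates = {
    2: [{0}, {1}, {2}],
    3: [{2}, {0}, {1}],
}
```

Under this assumed rule only identifier 2 survives, which further reduces the number of pre-encrypted texts that have to be decrypted.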
Step 210, obtaining and decrypting the pre-encrypted text corresponding to the target data identifier, and obtaining the plaintext candidate text.
The pre-encrypted text may be the result of encrypting a plaintext in advance. The plaintext candidate text is the candidate text obtained by decrypting the pre-encrypted text. It will be appreciated that the pre-encrypted text may be stored in a data table different from the index table; storing the two separately can improve data security.
In one embodiment, the server may obtain the pre-encrypted text corresponding to the target data identifier, and decrypt it according to the reversible cryptographic algorithm that was used when the pre-encrypted text was generated, to obtain the plaintext candidate text. A reversible cryptographic algorithm is an algorithm whose output can be decrypted to restore the original plaintext data. The reversible cryptographic algorithm may be the Chinese national cryptographic SM4 algorithm (full name "SM4 block cipher algorithm"), the AES (Advanced Encryption Standard) algorithm, or another algorithm.
Step 212, performing text matching between the input text and the plaintext candidate text to obtain a data search result for the input text.
In one embodiment, the server may match the input text with plaintext candidates, determine a target text that is the same as the input text from among the plaintext candidates, and determine the target text as a data search result for the input text.
In one embodiment, the server may match the input text with plaintext candidates, determine a target text that includes the input text from the plaintext candidates, and determine the target text as a data search result for the input text.
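The two matching modes described in the embodiments above (the candidate equals the input text, or the candidate contains the input text) can be sketched as follows. The function names and sample strings are hypothetical.

```python
def match_exact(input_text: str, candidates: list) -> list:
    """Return the plaintext candidates identical to the input text."""
    return [text for text in candidates if text == input_text]


def match_contains(input_text: str, candidates: list) -> list:
    """Return the plaintext candidates that contain the input text."""
    return [text for text in candidates if input_text in text]


# Hypothetical decrypted candidates.
example_candidates = ["alice", "alice smith", "bob"]
```

With these samples, exact matching on "alice" keeps only "alice", while containment matching keeps both "alice" and "alice smith".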
In the data searching method, the input text is parsed into a text segment sequence formed by a plurality of text segments, each text segment is mapped to a corresponding feature value, and the data identifiers matching each text segment are determined through the preconfigured index table corresponding to that text segment. A target data identifier that matches every text segment in the text segment sequence is then screened out from those data identifiers, the pre-encrypted text corresponding to the target data identifier is obtained and decrypted into a plaintext candidate text, and the input text is matched against the plaintext candidate text to obtain a data search result.
In one embodiment, the text segment sequence includes character segments, and step 204 includes: for each character segment in the text segment sequence, determining a letter sequence corresponding to the character segment; determining the number of letters in the letter sequence, determining the target letter located at a preset position in the letter sequence, and obtaining the preset salt value corresponding to the first letter of the letter sequence; determining a jump degree feature value corresponding to the letter sequence according to the difference between every two adjacent letters in the letter sequence; and mapping the character segment in question to a corresponding feature value based on the number of letters, the target letter, the preset salt value and the jump degree feature value.
A character segment is formed by characters of any one of the Chinese character type, the English letter type and the numeral type. Characters of the Chinese character type are Chinese characters, characters of the English letter type are English letters, and characters of the numeral type are digits. A letter sequence is a sequence formed by a plurality of letters. Each letter may specifically be one of the 26 lower-case English letters a to z.
When the character segment is formed by Chinese characters, the letter sequence may be the sequence of lower-case English letters corresponding to the letters in the Chinese pinyin of the character segment. Chinese pinyin uses 26 pinyin letters; the pinyin letter ü corresponds to the lower-case English letter v, and each of the other pinyin letters is identical to its corresponding lower-case English letter. When the character segment is formed by English letters or digits, the letter sequence may be the sequence of lower-case English letters corresponding to the letters in the Chinese pinyin of the character segment, or the sequence of lower-case letters in the English spelling of the character segment. For example, when the character segment is 2, the letter sequence may be "er" (the lower-case letters of the Chinese pinyin of 2) or "two" (the lower-case letters of the English spelling of 2).
The number of letters is the count of letters in the letter sequence. The preset position is a position set in advance, and may be the first position or the last position. The preset salt value is a salt value set in advance. The jump degree feature value characterizes how much adjacent letters in the letter sequence jump relative to one another.
In this embodiment, a letter sequence is determined for the character segment, and the character segment is mapped to a corresponding feature value according to the number of letters in the letter sequence, the target letter, the preset salt value and the jump degree feature value. The resulting feature value is complex and not easy to crack, which improves data security.
In one embodiment, each letter in the letter sequence is one of the 26 lower-case English letters. In this embodiment, the server may determine the position of each letter of the letter sequence within the 26 lower-case English letters, and take the difference between the positions of every two adjacent letters as the difference between those two adjacent letters in the letter sequence. For example, for the letter sequence "aeg", a is at position 1, e is at position 5 and g is at position 7 among the 26 lower-case English letters. The pairs of adjacent letters are "ae" and "eg"; their position differences are 4 and 2 respectively, so the difference between the letters of "ae" is 4 and the difference between the letters of "eg" is 2.
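The position-based difference can be checked with a short sketch. Taking the absolute value of each difference is an assumption here (it is what the ASCII-based embodiment does); the function name is hypothetical.

```python
def adjacent_position_differences(letters: str) -> list:
    """Differences between alphabet positions (a=1 ... z=26) of adjacent letters."""
    positions = [ord(ch) - ord("a") + 1 for ch in letters]
    # Assumption: the difference is taken as an absolute value.
    return [abs(b - a) for a, b in zip(positions, positions[1:])]
```

For the letter sequence "aeg" the positions are 1, 5 and 7, and the function returns the differences 4 and 2, matching the example above.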
In one embodiment, the step of determining the jump degree feature value corresponding to the letter sequence according to the difference between every two adjacent letters in the letter sequence includes: acquiring an ASCII code value of each letter in the letter sequence; determining the difference value between ASCII code values of every two adjacent letters in the letter sequence; and determining the jump degree characteristic value corresponding to the letter sequence according to the difference value between the ASCII code values of every two adjacent letters in the letter sequence.
ASCII is the American Standard Code for Information Interchange. In ASCII, each of 128 characters corresponds to an ASCII code value, which may be represented as an eight-bit binary number and converted into a decimal number; that is, the ASCII code values of the 128 characters range from 0 to 127. The 128 characters include the 26 lower-case English letters, the 26 upper-case English letters, the Arabic numerals, English punctuation marks and control characters. Each letter in the letter sequence may be a lower-case English letter.
The difference value between the ASCII code values of two adjacent letters may be the absolute value of the difference between their ASCII code values. This difference value may be used as the difference between the corresponding two adjacent letters in the letter sequence.
In this embodiment, the difference between every two adjacent letters in the letter sequence is represented by the difference value between their ASCII code values, so the jump degree feature value corresponding to the letter sequence can be determined easily, which prepares for the subsequent generation of the feature value corresponding to the character segment.
In one embodiment, when the letter sequence includes at least three letters, the server may determine the minimum difference value and the maximum difference value among the difference values between the ASCII code values of adjacent letters in the letter sequence, and splice the minimum difference value and the maximum difference value to obtain the jump degree feature value corresponding to the letter sequence. For example, when the letter sequence is "xiao", the ASCII code values of x, i, a and o are 120, 105, 97 and 111 respectively, so the difference value for "xi" is 15, for "ia" is 8 and for "ao" is 14. Splicing the minimum 8 and the maximum 15 gives the jump degree feature value "815".
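The jump degree computation for the "xiao" example can be reproduced with a few lines. The function name `jump_feature` is hypothetical. For a two-letter sequence the single difference is used as both minimum and maximum, which is an assumption about a case this paragraph does not cover.

```python
def jump_feature(letters: str) -> str:
    """Splice the min and max absolute ASCII differences of adjacent letters."""
    diffs = [abs(ord(b) - ord(a)) for a, b in zip(letters, letters[1:])]
    if not diffs:
        raise ValueError("at least two letters are needed to form a difference")
    # With a single difference, min and max coincide (assumed behavior).
    return f"{min(diffs)}{max(diffs)}"
```

For "xiao" the differences are 15, 8 and 14, so `jump_feature("xiao")` returns `"815"`.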
In one embodiment, the step of mapping the character segment to the corresponding feature value based on the number of letters, the target letter, the preset salt value and the jump degree feature value includes: the server splices the number of letters, the target letter, the jump degree feature value and the preset salt value to obtain a splicing result, and encrypts the splicing result to obtain the feature value corresponding to the character segment. For example, when the character segment is "m", the corresponding letter sequence may be "mi", the number of letters is 2, the target letter may be the last letter, namely i, the jump degree feature value may be "44", the preset salt value may be "$", and the splicing result may be "2i44$". For another example, when the character segment is "what", the corresponding letter sequence may be "what", the number of letters is 4, the target letter may be the last letter, namely t, the jump degree feature value may be "719", the preset salt value may be "x", and the splicing result may be "4t719x".
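Putting the pieces together, the splice for a character segment can be sketched as below, in the order stated above (number of letters, target letter, jump degree feature value, salt) and with the last letter as the target letter. The encryption algorithm is not named in the source, so SHA-256 is an assumed stand-in; all function names are hypothetical.

```python
import hashlib


def jump_feature(letters: str) -> str:
    """Min and max absolute ASCII differences of adjacent letters, spliced."""
    diffs = [abs(ord(b) - ord(a)) for a, b in zip(letters, letters[1:])]
    if not diffs:
        raise ValueError("at least two letters are needed")
    return f"{min(diffs)}{max(diffs)}"


def splice_character_segment(letters: str, salt: str) -> str:
    """Splice letter count, last letter (assumed target), jump feature, salt."""
    return f"{len(letters)}{letters[-1]}{jump_feature(letters)}{salt}"


def character_feature_value(letters: str, salt: str) -> str:
    """SHA-256 is an assumed stand-in for the unspecified encryption step."""
    spliced = splice_character_segment(letters, salt)
    return hashlib.sha256(spliced.encode("utf-8")).hexdigest()
```

For the letter sequence "mi" and the salt "$", `splice_character_segment` produces `"2i44$"`, the splicing result given in the first example above.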
In one embodiment, the text segment sequence includes symbol segments and non-symbol segments (Chinese character segments, English segments, or numeric segments), and the data searching method further includes the steps of: when the text segment is a symbol segment, determining the preset index table among a plurality of preconfigured index tables as the preconfigured index table corresponding to the symbol segment; when the text segment is a non-symbol segment, determining the pronunciation category corresponding to the text segment, and determining the preconfigured index table corresponding to the text segment from the plurality of preconfigured index tables according to the first letter in the letter sequence and the pronunciation category.
A symbol segment is a segment formed by characters of the symbol type. The plurality of preconfigured index tables are index tables configured in advance for storing pre-generated index records.
The plurality of preconfigured index tables may include a preset index table and the index tables in a plurality of index groups. The preset index table may be used to store index records corresponding to symbol segments. The index tables in the index groups may be used to store index records corresponding to non-symbol segments. The plurality of index groups may specifically include index groups corresponding to the 26 lowercase English letters, respectively. The index group corresponding to each lowercase English letter may include a plurality of index tables corresponding to different pronunciation categories.
In this embodiment, when the text segment is a non-symbol segment, the index table corresponding to the text segment can be determined from the plurality of index tables according to the first letter of the letter sequence corresponding to the text segment and the pronunciation category. In other words, the index records of non-symbol segments are distributed over different index tables by first letter and pronunciation category, so that the index table corresponding to a text segment can be determined conveniently.
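The table selection described above can be illustrated with a short Python sketch. The function name `index_table_for`, the `kind` labels, and the table naming scheme (first letter followed by the pronunciation category, and `special` for the preset table) are assumptions chosen to mirror the table names used in the examples of this description.

```python
SPECIAL_TABLE = "special"  # assumed name of the preset (symbol) index table


def index_table_for(kind: str, letters: str = "", category: int = 0) -> str:
    """Pick the index table for one text segment.

    kind is "symbol" for symbol segments; any other value is treated as a
    non-symbol segment whose table depends on first letter and category.
    """
    if kind == "symbol":
        return SPECIAL_TABLE
    first = letters[:1]
    if not ("a" <= first <= "z") or category not in range(5):
        raise ValueError("expected a lowercase letter sequence and category 0-4")
    return f"{first}{category}"


print(index_table_for("hanzi", "mi", 3))   # m3
print(index_table_for("english", "abc"))   # a0
print(index_table_for("symbol"))           # special
```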
In one embodiment, the non-symbol segment includes a Chinese character segment formed of a single Chinese character, an English segment formed of consecutive English letters in the input text, or a numeric segment formed of a single number; the letter sequence corresponding to the Chinese character segment is formed based on the Chinese pinyin corresponding to the Chinese character segment; the pronunciation category corresponding to the Chinese character segment represents the tone of the Chinese pinyin corresponding to the Chinese character segment; the letter sequence corresponding to the English segment is the sequence formed by the lowercase letters corresponding to each letter in the English segment; the pronunciation category corresponding to the English segment is a preset pronunciation category; the letter sequence corresponding to the numeric segment is formed based on the Chinese pinyin corresponding to the numeric segment; the pronunciation category corresponding to the numeric segment represents the tone of the Chinese pinyin corresponding to the numeric segment.
An English segment formed of consecutive English letters in the input text is a run of English letters in the input text that is not interrupted by any character other than an English letter. For example, when the input text is "andy liu", "andy" and "liu" are separated by a space character, so "andy" and "liu" are treated as two different English segments.
The tones of the Chinese pinyin may include the first tone (yinping), the second tone (yangping), the third tone (shangsheng), and the fourth tone (qusheng). The pronunciation category corresponding to a Chinese character segment may be category 1, category 2, category 3, or category 4, each representing one of the tones of the Chinese pinyin. The preset pronunciation category may be denoted as category 0. The Chinese pinyin corresponding to a numeric segment is the pinyin of the number when it is read aloud in Chinese.
In this embodiment, the letter sequences corresponding to the Chinese character segments and the numeric segments can be formed conveniently from their respective Chinese pinyin. Because the pronunciation category corresponding to a Chinese character segment represents the tone of its Chinese pinyin, Chinese character segments that have the same letter sequence but different tones can be distinguished to a certain extent, so that the index records matched in the subsequent steps are more accurate and the data searching efficiency can be improved.
In one embodiment, the server converts each pinyin letter of the Chinese pinyin corresponding to the Chinese character segment into a lowercase English letter to obtain the letter sequence corresponding to the Chinese character segment.
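The mapping from a non-symbol segment to its letter sequence and pronunciation category can be sketched as follows. The lookup table `TOY_PINYIN` is a tiny hypothetical stand-in for a real pinyin library, and the function name is likewise an assumption; only the behavior (pinyin plus tone for Chinese characters and numbers, lowercase letters plus category 0 for English) comes from the description.

```python
# Hypothetical miniature pinyin table: segment -> (pinyin letters, tone).
TOY_PINYIN = {
    "米": ("mi", 3),
    "小": ("xiao", 3),
    "圈": ("quan", 1),
    "2": ("er", 4),  # the digit 2 read aloud in Chinese
}


def letters_and_category(segment: str) -> tuple[str, int]:
    """Return (letter sequence, pronunciation category) for one segment."""
    if segment in TOY_PINYIN:
        pinyin, tone = TOY_PINYIN[segment]
        return pinyin.lower(), tone
    if segment.isascii() and segment.isalpha():
        # English segment: lowercase letters, preset category 0.
        return segment.lower(), 0
    raise KeyError(f"no pinyin entry for {segment!r} in this toy table")


print(letters_and_category("米"))   # ('mi', 3)
print(letters_and_category("Abc"))  # ('abc', 0)
```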
In one embodiment, the step of parsing the input text into a text segment sequence composed of a plurality of text segments according to the preset rule corresponding to the character type of each character in the input text includes: when the input text includes Chinese characters (characters of the Chinese character type), English letters (characters of the English type), numbers (characters of the numeric type), and symbols (characters of the symbol type), the server may, following the arrangement order of the characters in the input text, parse each Chinese character in the input text into a Chinese character segment, parse consecutive English letters in the input text into an English segment, parse each number in the input text into a numeric segment, and parse consecutive symbols in the input text into a symbol segment.
The maximum length of an English segment or a symbol segment may be set to a preset length, for example, 20 characters. The numbers may specifically be Arabic numerals. Characters other than Chinese characters, English letters, numbers, and separators may be regarded as symbols. The separators may include the space and the tab character. For example, the input text may consist of the Chinese characters "米小圈写道", a colon, the English letters "woashiyiegedashuaiwgeni", a short Chinese clause meaning "but I am very dizzy", a run of consecutive symbols, and the digits "222". This input text may be parsed into a sequence of 17 text segments, including "米", "小", "圈", "写", "道", ":", "woashiyiegedashuaiwg", "eni", the segments of the Chinese clause, one symbol segment formed by the run of consecutive symbols, and a separate numeric segment "2" for each of the three digits. The 23 consecutive English letters exceed the preset length of 20 characters and are therefore split into "woashiyiegedashuaiwg" and "eni".
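The parsing rules above can be sketched as a small tokenizer. This is an illustrative sketch, not the word segmenter of the application: the function names are hypothetical, and Chinese characters are assumed to be those in the basic CJK Unified Ideographs block.

```python
MAX_RUN = 20              # preset maximum length of English and symbol segments
SEPARATORS = {" ", "\t"}  # separators are dropped and end the current run


def char_type(ch: str) -> str:
    if ch in SEPARATORS:
        return "sep"
    if "\u4e00" <= ch <= "\u9fff":  # assumption: basic CJK block only
        return "hanzi"
    if ch.isascii() and ch.isalpha():
        return "english"
    if ch.isascii() and ch.isdigit():
        return "digit"
    return "symbol"


def parse(text: str) -> list[str]:
    """Split text into Chinese character, English, numeric and symbol segments."""
    segments: list[str] = []
    run, run_type = "", None

    def flush() -> None:
        nonlocal run, run_type
        if run:
            segments.append(run)
        run, run_type = "", None

    for ch in text:
        kind = char_type(ch)
        if kind in ("english", "symbol"):
            # Consecutive letters (or symbols) share a segment up to MAX_RUN.
            if kind != run_type or len(run) >= MAX_RUN:
                flush()
            run, run_type = run + ch, kind
        else:
            flush()
            if kind != "sep":
                segments.append(ch)  # one segment per Chinese character or digit
    flush()
    return segments


print(parse("andy liu"))  # ['andy', 'liu']
```

Calling `parse("woashiyiegedashuaiwgeni")` splits the 23 letters into a 20-letter segment and the remaining "eni", matching the length limit described above.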
In one embodiment, step 208 includes: determining, from the data identifiers matched with the text segments in the text segment sequence, a plurality of candidate data identifiers each of which is matched with every text segment in the text segment sequence; for each candidate data identifier in the plurality of candidate data identifiers, determining, from the index records in which the candidate data identifier is located, the text segment sequence number that each text segment in the text segment sequence has under that candidate data identifier; and screening out target data identifiers from the plurality of candidate data identifiers, such that the order represented by the text segment sequence numbers that the text segments have under a target data identifier is consistent with the arrangement order of the text segments in the text segment sequence.
Each index record in an index table may include a data identifier, a text segment sequence number, and a feature value in one-to-one correspondence. The text segment sequence number in an index record represents the position, in the preset text corresponding to the data identifier in that index record, of the text segment corresponding to the feature value in that index record.
Because each of the candidate data identifiers is matched with every text segment in the text segment sequence, the plaintext candidate texts of the pre-encrypted texts corresponding to the candidate data identifiers may include texts that contain every text segment but do not match the input text; specifically, the text segments are arranged differently in such a plaintext candidate text than in the input text. For example, when the input text is "米小圈", it may be parsed into a text segment sequence of three text segments, "米", "小", and "圈", and the plaintext candidate texts may include "小米圈". Although "小米圈" contains the three text segments "小", "米", and "圈", their order in "小米圈" differs from their order in "米小圈", so "小米圈" does not match the input text. Therefore, further screening the candidate data identifiers by text segment sequence number to obtain the target data identifiers reduces the number of pre-encrypted texts that must subsequently be decrypted, which improves data security, and also improves the efficiency of the subsequent text matching between the input text and the plaintext candidate texts, which improves data searching efficiency.
In this embodiment, the target data identifiers are screened out from the plurality of candidate data identifiers such that the order represented by the text segment sequence numbers under a target data identifier is consistent with the arrangement order of the text segments in the text segment sequence. This reduces the decryption of pre-encrypted texts that do not match the input text, improves the efficiency of text matching between the input text and the plaintext candidate texts, and improves data searching efficiency.
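The order-based screening can be sketched in Python as follows. The data layout is an assumption made for illustration: for each candidate data identifier, `candidates` holds one list of sequence numbers per query segment (a list because the same segment could occur more than once in a stored text), and the function names are hypothetical.

```python
def in_order(seq_numbers_per_segment: list[list[int]]) -> bool:
    """True if one number can be picked per segment in strictly rising order."""
    previous = -1
    for numbers in seq_numbers_per_segment:
        later = [n for n in numbers if n > previous]
        if not later:
            return False
        # Taking the smallest usable number leaves the most room for later picks.
        previous = min(later)
    return True


def filter_targets(candidates: dict[int, list[list[int]]]) -> list[int]:
    """Keep the candidate data identifiers whose segments appear in query order."""
    return [data_id for data_id, per_segment in candidates.items()
            if in_order(per_segment)]


# Query segments in order: "米", "小", "圈".
candidates = {
    100: [[1000001], [1000002], [1000003]],  # stored text "米小圈"
    101: [[1010002], [1010001], [1010003]],  # stored text "小米圈"
}
print(filter_targets(candidates))  # [100]
```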
In one embodiment, before performing step 202, the data searching method further includes an index generating step, where the index generating step includes: encrypting the preset text, obtaining and storing the encrypted text corresponding to the preset text, and generating a data identifier corresponding to the encrypted text; parsing the preset text into a preset text segment sequence formed by a plurality of preset text segments; for each preset text segment in the preset text segment sequence, mapping the preset text segment to a corresponding feature value; generating a text segment sequence number corresponding to the preset text segment based on the data identifier corresponding to the encrypted text and the arrangement position of the preset text segment in the preset text segment sequence; and recording the data identifier corresponding to the encrypted text, the text segment sequence number corresponding to the preset text segment, and the feature value corresponding to the preset text segment, in one-to-one correspondence, in the preconfigured index table corresponding to the preset text segment.
The preset text is text provided in advance, and this step generates the index records corresponding to the preset text segments in the preset text. When the preset text is encrypted, a reversible cryptographic algorithm may be used.
In this embodiment, the preset text is encrypted to obtain and store the encrypted text, the data identifier corresponding to the encrypted text is generated, and the preset text is then parsed into the preset text segment sequence to generate the index records. Because an index record contains only the data identifier corresponding to the encrypted text of the preset text, the feature value corresponding to the preset text segment, and the text segment sequence number corresponding to the preset text segment, and does not contain the plaintext of the preset text, data security can be ensured. This also creates conditions for the subsequent execution of steps 202 to 208, so that data searching efficiency is improved while data security is ensured.
In one embodiment, the server may generate the text segment sequence number corresponding to the preset text segment according to the preset numerical value, the data identifier corresponding to the encrypted text, and the arrangement position of the preset text segment in the preset text segment sequence.
The preset value is a value configured in advance, and may be a large number, for example, 10000. The text segment sequence number may be the product of the preset value and the data identifier plus the arrangement position. When the preset value is a large number, the text segment sequence numbers corresponding to different preset texts are unlikely to repeat, which makes the search more efficient. For example, the preset value may be 10000, the data identifier 1 corresponding to the preset text 1 may be 100, and the data identifier 2 corresponding to the preset text 2 may be 101; the product of the preset value and the data identifier 1 is then 1000000, and the product of the preset value and the data identifier 2 is 1010000. The preset text 1 may be parsed into 3 preset text segments, and the 3 text segment sequence numbers corresponding to the preset text 1 may be 1000001, 1000002, and 1000003, respectively. The preset text 2 may be parsed into 4 preset text segments, and the 4 text segment sequence numbers corresponding to the preset text 2 may be 1010001, 1010002, 1010003, and 1010004, respectively.
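The arithmetic in this example can be checked with a short sketch. The function name is hypothetical, and positions are assumed to start at 1, as the numbers in the example suggest.

```python
PRESET_VALUE = 10000  # the preset large number from the example


def segment_sequence_numbers(data_id: int, segment_count: int,
                             preset: int = PRESET_VALUE) -> list[int]:
    """Sequence number = preset value * data identifier + position (from 1)."""
    return [preset * data_id + position
            for position in range(1, segment_count + 1)]


print(segment_sequence_numbers(100, 3))  # [1000001, 1000002, 1000003]
print(segment_sequence_numbers(101, 4))  # [1010001, 1010002, 1010003, 1010004]
```

Under this scheme the numbers of two texts with consecutive data identifiers can only collide if a text has at least as many segments as the preset value, which is why a large preset value is suggested.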
In one embodiment, the server may screen out the target data identifiers from the plurality of candidate data identifiers such that, under a target data identifier, the text segment sequence number of the first text segment in the text segment sequence is greater than the product of the target data identifier and the preset value, and the order represented by the text segment sequence numbers of the text segments in the text segment sequence is consistent with the arrangement order of those text segments in the text segment sequence.
In one embodiment, referring to the exemplary index table diagram shown in fig. 3, the plurality of index tables may include one preset index table (a special character index table) and 26 index groups corresponding to the lowercase English letters a to z, and each index group may include 5 index tables. Taking the a index group as an example, it may include an a0 index table, an a1 index table, an a2 index table, an a3 index table, and an a4 index table. The a0 index table may be used to store index records corresponding to English segments whose letter sequence begins with a. The a1 index table may be used to store index records corresponding to Chinese character segments or numeric segments whose letter sequence begins with a and whose Chinese pinyin has the first tone. The a2, a3, and a4 index tables are similar to the a1 index table, except that they correspond to the second tone, the third tone, and the fourth tone, respectively. Each of the b to z index groups is similar to the a index group, except that it stores index records corresponding to text segments whose letter sequence begins with the corresponding letter from b to z. Each index table is provided with a data identifier field (data ID field), a text segment sequence number field (index number field), and a feature value field, and includes a plurality of index records, each of which includes a data identifier (data ID), a text segment sequence number (index number), and a feature value.
An encrypted data table for storing encrypted text (encrypted data) and a data identification corresponding to the encrypted text exists in addition to the index table.
In one embodiment, referring to a flowchart of the index generating step shown in fig. 4, the above index generating step specifically includes the following steps.
The server may encrypt a preset text (original text), obtain and store an encrypted text corresponding to the preset text, generate a data identifier corresponding to the encrypted text, and store the encrypted text and the corresponding data identifier in a preset encrypted data table.
The server may parse the preset text into a preset text segment sequence formed by a plurality of preset text segments (the text segments may also be referred to as tokens). Specifically, when the preset text includes Chinese characters, English letters, numbers, and symbols, each Chinese character in the preset text is parsed into a Chinese character segment, consecutive English letters in the preset text are parsed into an English segment, each number in the preset text is parsed into a numeric segment, and consecutive symbols in the preset text are parsed into a symbol segment.
For each preset text segment in the preset text segment sequence, the server may map the preset text segment to a corresponding feature value. Specifically, when the preset text segment is a Chinese character segment (Chinese character token), the Chinese pinyin and tone corresponding to the Chinese character segment are obtained; each pinyin letter in the Chinese pinyin is converted into a lowercase English letter to obtain the letter sequence corresponding to the preset text segment; the number of letters in the letter sequence is determined, the last letter in the letter sequence is determined as the target letter, and the preset salt value corresponding to the first letter in the letter sequence is obtained; the ASCII code value of each letter in the letter sequence is determined, the differences between the ASCII code values of adjacent letters in the letter sequence are determined, the minimum difference and the maximum difference are determined from those differences, and the minimum difference and the maximum difference are spliced to obtain the jump degree feature value corresponding to the letter sequence; the number of letters, the target letter, the jump degree feature value, and the preset salt value are spliced to obtain a splicing result, and the splicing result is encrypted to obtain the feature value corresponding to the preset text segment.
When the preset text segment is an English segment (English token), each English letter in the English segment is obtained and converted into a lowercase English letter to obtain the letter sequence corresponding to the English segment; the subsequent steps are the same as the steps that follow obtaining the letter sequence when the preset text segment is a Chinese character segment, until the feature value corresponding to the preset text segment is obtained.
When the preset text segment is a numeric segment (numeric token), the Chinese pinyin corresponding to the numeric segment and the tone of that Chinese pinyin are determined; the subsequent steps are the same as the steps that follow obtaining the Chinese pinyin and tone when the preset text segment is a Chinese character segment, until the feature value corresponding to the preset text segment is obtained.
When the preset text segment is a symbol segment (special character token), the number of symbols in the symbol segment may be determined, the first symbol and the last symbol in the symbol segment are determined, and the salt value preconfigured for the symbol type is determined; the number of symbols, the first symbol, the last symbol, and the salt value preconfigured for the symbol type are spliced to obtain a splicing result, and the splicing result is encrypted to obtain the feature value corresponding to the preset text segment.
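The symbol-segment branch can be sketched as follows. The function name and the salt value "%" are hypothetical, the splicing order follows the order in which the four parts are listed above, and the encryption of the splicing result is again omitted because the algorithm is not specified.

```python
def splice_symbol_segment(segment: str, salt: str) -> str:
    """Splice symbol count, first symbol, last symbol and the symbol-type salt."""
    if not segment:
        raise ValueError("a symbol segment contains at least one symbol")
    return f"{len(segment)}{segment[0]}{segment[-1]}{salt}"


# "%" is a made-up salt value used only for this illustration.
print(splice_symbol_segment("#@#", "%"))  # 3##%
print(splice_symbol_segment(",", "%"))    # 1,,%
```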
Therefore, the feature value is generated in a different way depending on whether the preset text segment is a Chinese character segment, an English segment, a numeric segment, or a symbol segment. Because the feature values in the index tables are not all generated in the same way, they are harder to crack, which improves data security. In addition, because text segments are usually short, the application does not use the conventional MD5 algorithm (Message-Digest Algorithm 5) or an SHA algorithm (Secure Hash Algorithm) to generate the feature values, which avoids cracking by rainbow tables and improves data security. Moreover, the relation from text segments to letter sequences is many-to-one, that is, the same letter sequence may correspond to several different text segments, and the relation between letter sequences and feature values is many-to-many; even if a feature value is cracked, only a letter sequence can be obtained and the original text segment cannot be recovered, so cracking is difficult.
The server may determine a product of the data identifier corresponding to the encrypted text and the preset value, and determine a sum of the product and an arrangement position of the preset text segment in the preset text segment sequence as a text segment sequence number corresponding to the preset text segment.
The server may determine a preconfigured index table corresponding to the preset text segment from a plurality of index tables as shown in fig. 3, and record the data identifier corresponding to the encrypted text, the text segment number corresponding to the preset text segment, and the feature value corresponding to the preset text segment in the preconfigured index table corresponding to the preset text segment in a one-to-one correspondence.
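The recording step can be illustrated with an in-memory sketch. Everything here is a simplification for illustration: the index tables are plain Python lists keyed by table name, the feature values "X", "Y", and "Z" are placeholders rather than real encrypted values, and the names `IndexRecord` and `add_to_index` are hypothetical.

```python
from collections import defaultdict
from typing import NamedTuple


class IndexRecord(NamedTuple):
    data_id: int   # data identifier of the encrypted text
    seq_no: int    # text segment sequence number
    feature: str   # feature value of the segment (placeholder here)


def add_to_index(index: dict, data_id: int,
                 mapped_segments: list[tuple[str, str]],
                 preset: int = 10000) -> None:
    """Record one index entry per segment, given (table name, feature) pairs."""
    for position, (table, feature) in enumerate(mapped_segments, start=1):
        seq_no = preset * data_id + position
        index[table].append(IndexRecord(data_id, seq_no, feature))


index: dict = defaultdict(list)
# Preset text "米小圈" stored under data identifier 100.
add_to_index(index, 100, [("m3", "X"), ("x3", "Y"), ("q1", "Z")])
# Preset text "小米圈" stored under data identifier 101.
add_to_index(index, 101, [("x3", "Y"), ("m3", "X"), ("q1", "Z")])

for record in index["m3"]:
    print(record)
```

No plaintext of either preset text appears in the resulting records, only identifiers, sequence numbers, and feature values.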
In one embodiment, the data searching method described above may be implemented based on the data searching architecture shown in fig. 5. The terminal may run a client, the client submits an input text, and the search server may obtain the input text submitted by the client and execute the data searching method. The search server may be provided with a word segmenter, a pinyin library, and a search engine. The word segmenter may be used to parse the input text into a text segment sequence formed by a plurality of text segments. The pinyin library may be used to store the pinyin letters, the correspondence between pinyin letters and lowercase English letters, the correspondence between numbers and Chinese pinyin, and the like. The search engine may be used to perform fuzzy search and exact matching: steps 204 to 208 may be understood as the fuzzy search process, and steps 210 to 212 may be understood as the exact matching process. The search server may be connected to a database in which the encrypted data and the plurality of index tables (encrypted data indexes) may be stored; it is understood that the encrypted data and the plurality of index tables may be stored in the same database or in different databases. The database may be a relational database, a Redis (Remote Dictionary Server) database, a MongoDB database (a database written in C++ and based on distributed file storage), or another database.
In one embodiment, after the index generating step, referring to a schematic flow chart of a data searching simplified step shown in fig. 6 and a schematic flow chart of a data searching detailed step shown in fig. 7, the above data searching method specifically includes the following steps.
The server may obtain the input text and parse the input text into a text segment sequence composed of a plurality of text segments. The text segment sequence includes symbol segments and non-symbol segments, and a non-symbol segment is a Chinese character segment formed of a single Chinese character, an English segment formed of consecutive English letters in the input text, or a numeric segment formed of a single number.
For each text segment in the text segment sequence, the server may map the text segment to a corresponding feature value. Specifically, when the text segment is a symbol segment, the server determines the number of symbols in the symbol segment, determines the first symbol and the last symbol in the symbol segment, determines the salt value preconfigured for the symbol type, and maps the symbol segment to a corresponding feature value according to the number of symbols, the first symbol, the last symbol, and the salt value preconfigured for the symbol type.
When the text segment is a non-symbol segment, the server may determine the letter sequence corresponding to the text segment; determine the number of letters in the letter sequence, determine the target letter arranged at the preset position in the letter sequence, and obtain the preset salt value corresponding to the first letter in the letter sequence; obtain the ASCII code value of each letter in the letter sequence; determine the difference between the ASCII code values of every two adjacent letters in the letter sequence; determine the jump degree feature value corresponding to the letter sequence according to those differences; and map the text segment to a corresponding feature value based on the number of letters, the target letter, the preset salt value, and the jump degree feature value.
The letter sequence corresponding to a Chinese character segment is formed based on the Chinese pinyin corresponding to the Chinese character segment; the letter sequence corresponding to an English segment is the sequence formed by the lowercase letters corresponding to each letter in the English segment; the letter sequence corresponding to a numeric segment is formed based on the Chinese pinyin corresponding to the numeric segment.
When the text segment is a symbol segment, the server may determine the preset index table among the plurality of preconfigured index tables (such as the index tables shown in fig. 3) as the preconfigured index table corresponding to the symbol segment. When the text segment is a non-symbol segment, the server may determine the pronunciation category corresponding to the text segment, and determine the preconfigured index table corresponding to the text segment from the plurality of preconfigured index tables according to the first letter in the letter sequence and the pronunciation category.
Wherein, the pronunciation category corresponding to the Chinese character segment represents the tone of the Chinese phonetic alphabet corresponding to the Chinese character segment; the pronunciation category corresponding to the English fragment is a preset pronunciation category; the pronunciation category corresponding to the digital segment represents the tone of the Chinese phonetic alphabet corresponding to the digital segment.
The server may determine the index records containing the feature value from the preconfigured index table corresponding to the text segment, and determine the data identifiers matched with the text segment from those index records.
The server may determine, from the data identifiers matched with the text segments in the text segment sequence, a plurality of candidate data identifiers each of which is matched with every text segment in the text segment sequence; for each candidate data identifier in the plurality of candidate data identifiers, determine, from the index records in which the candidate data identifier is located, the text segment sequence number that each text segment in the text segment sequence has under that candidate data identifier; and screen out target data identifiers from the plurality of candidate data identifiers such that, under a target data identifier, the text segment sequence number of the first text segment in the text segment sequence is greater than the product of the target data identifier and the preset value, and the order represented by the text segment sequence numbers of the text segments in the text segment sequence is consistent with the arrangement order of those text segments in the text segment sequence.
After the feature value and the index table corresponding to each text segment in the text segment sequence are determined, the feature value and the name of the index table corresponding to each text segment may be added to a query condition list in turn, following the arrangement position of each text segment in the text segment sequence. A query statement for the fuzzy search may then be generated based on the query condition list. The query statement may specifically be used to implement the above steps, from determining the index records containing the feature value in the preconfigured index table corresponding to each text segment through screening out the target data identifiers.
For example, taking a relational database as an example and setting the preset value to 10000, if the input text submitted by the user is "米小圈", the query condition list corresponding to the input text may be:
"米": index table: m3, feature value: {X};
"小": index table: x3, feature value: {Y};
"圈": index table: q1, feature value: {Z};
Further, the query statement for the fuzzy search may be:
SELECT m3.id
FROM m3
INNER JOIN x3 ON x3.id = m3.id
INNER JOIN q1 ON q1.id = m3.id
WHERE m3.trait_value={X} AND x3.trait_value={Y} AND q1.trait_value={Z}
AND m3.id*10000<m3.index AND m3.index<x3.index AND x3.index<q1.index
The query statement is specifically an SQL (Structured Query Language) statement, in which id represents the data identifier field, trait_value represents the feature value field, and index represents the text segment sequence number field.
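Generating such a statement from a query condition list can be sketched in Python. This is a hedged illustration only: the function name `build_fuzzy_query` is hypothetical, the feature values are kept as placeholders such as {X}, and the sketch assumes that every segment maps to a different index table (a repeated table would need aliases).

```python
def build_fuzzy_query(conditions: list[tuple[str, str]],
                      preset: int = 10000) -> str:
    """Build the fuzzy-search SQL from (index table, feature placeholder) pairs.

    The pairs must be in the order of the text segments in the input text.
    """
    first = conditions[0][0]
    lines = [f"SELECT {first}.id", f"FROM {first}"]
    # Join every further table on the shared data identifier.
    lines += [f"INNER JOIN {table} ON {table}.id = {first}.id"
              for table, _ in conditions[1:]]
    # Each segment must match its feature value in its own table.
    matches = " AND ".join(f"{table}.trait_value={value}"
                           for table, value in conditions)
    lines.append(f"WHERE {matches}")
    # The first sequence number must belong to this data identifier, and the
    # sequence numbers must rise in the order of the segments.
    order = [f"{first}.id*{preset}<{first}.index"]
    order += [f"{a}.index<{b}.index"
              for (a, _), (b, _) in zip(conditions, conditions[1:])]
    lines.append("AND " + " AND ".join(order))
    return "\n".join(lines)


print(build_fuzzy_query([("m3", "{X}"), ("x3", "{Y}"), ("q1", "{Z}")]))
```

In a real system the feature values would normally be passed as bound parameters rather than interpolated into the statement text.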
For another example, if the input text submitted by the user is "micellar circle Abc,", the input text may be parsed into "m", "small", "circle", "Abc", and "5 text pieces altogether, the query condition list corresponding to the input text may be:
"Rice": index table: m3, eigenvalues: { X };
"Small": index table: x3, eigenvalues: { Y };
"circle": index table: q1, eigenvalues: { Z };
"Abc": index table: a0, eigenvalues: { O };
",": index table: special character index table (noted: special), feature value: { P };
Further, the query statement for the fuzzy search may be:
SELECT m3.id
FROM m3
INNER JOIN x3 ON x3.id = m3.id
INNER JOIN q1 ON q1.id = m3.id
INNER JOIN a0 ON a0.id = m3.id
INNER JOIN special ON special.id = m3.id
WHERE m3.trait_value={X} AND x3.trait_value={Y} AND q1.trait_value={Z} AND a0.trait_value={O} AND special.trait_value={P}
AND m3.id*10000<m3.index AND m3.index<x3.index AND x3.index<q1.index AND q1.index<a0.index AND a0.index<special.index
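The behavior of such a query can be tried out with Python's built-in sqlite3 module. This is only a sketch under stated assumptions: the table contents are invented for illustration, the feature values "X", "Y", and "Z" are placeholders, and the column name index is quoted because INDEX is a keyword in SQLite. It uses the three-segment query for "米小圈" against two stored texts, one in the queried order and one not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
for table in ("m3", "x3", "q1"):
    conn.execute(
        f'CREATE TABLE {table} (id INTEGER, "index" INTEGER, trait_value TEXT)'
    )

# Data identifier 100 stores "米小圈"; data identifier 101 stores "小米圈".
rows = {
    "m3": [(100, 1000001, "X"), (101, 1010002, "X")],
    "x3": [(100, 1000002, "Y"), (101, 1010001, "Y")],
    "q1": [(100, 1000003, "Z"), (101, 1010003, "Z")],
}
for table, records in rows.items():
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?, ?)", records)

query = """
SELECT m3.id
FROM m3
INNER JOIN x3 ON x3.id = m3.id
INNER JOIN q1 ON q1.id = m3.id
WHERE m3.trait_value = ? AND x3.trait_value = ? AND q1.trait_value = ?
AND m3.id*10000 < m3."index" AND m3."index" < x3."index" AND x3."index" < q1."index"
"""
matches = [row[0] for row in conn.execute(query, ("X", "Y", "Z"))]
print(matches)  # [100]
```

Both stored texts contain all three segments, but only data identifier 100 has them in the queried order, so only its encrypted text would need to be decrypted for exact matching.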
It should be understood that, although the steps in the flowcharts of the embodiments described above are shown in the order indicated by the arrows, these steps are not necessarily performed in that order. Unless explicitly stated herein, the execution order of the steps is not strictly limited, and the steps may be performed in other orders. Moreover, at least some of the steps in the flowcharts of the above embodiments may include a plurality of sub-steps or stages, which are not necessarily performed at the same time but may be performed at different times; these sub-steps or stages are also not necessarily performed one after another, but may be performed in turn or alternately with other steps or with at least some of the sub-steps or stages of other steps.
Based on the same inventive concept, an embodiment of the application also provides a data searching device for implementing the data searching method described above. The solution provided by the device is implemented in a manner similar to that described for the method, so for the specific limitations in the one or more embodiments of the data searching device provided below, reference may be made to the limitations of the data searching method above, which are not repeated here.
In one embodiment, as shown in fig. 8, there is provided a data search apparatus 800 comprising: a parsing module 810, a feature value mapping module 820, and a data searching module 830, wherein:
the parsing module 810 is configured to obtain an input text and parse the input text into a text segment sequence composed of a plurality of text segments;
the feature value mapping module 820 is configured to map, for each text segment in the text segment sequence, the text segment in question to a corresponding feature value;
the data searching module 830 is configured to determine an index record containing the feature value from a preconfigured index table corresponding to the text segment in question, and determine a data identifier matched with that text segment from the index record; screen out, from the data identifiers matched with the individual text segments in the text segment sequence, target data identifiers matched with every text segment in the text segment sequence; obtain a pre-encrypted text corresponding to the target data identifier and decrypt the pre-encrypted text to obtain a plaintext candidate text; and perform text matching on the input text and the plaintext candidate text to obtain a data search result for the input text.
In one embodiment, the text segment sequence includes character segments (text segments other than symbol segments), and the feature value mapping module 820 is further configured to: for each character segment in the text segment sequence, determine a letter sequence corresponding to the character segment; determine the number of letters in the letter sequence, determine the target letter located at a preset position in the letter sequence, and obtain the preset salt value corresponding to the first letter in the letter sequence; determine a jump degree characteristic value corresponding to the letter sequence according to the difference between every two adjacent letters in the letter sequence; and map the character segment in question to a corresponding feature value based on the number of letters, the target letter, the preset salt value, and the jump degree characteristic value.
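One way to picture this mapping is to combine the four named components into a single string and hash it. The Python sketch below is a minimal illustration under stated assumptions: the application does not say here how the components are combined, so the SHA-256 digest, the 64-bit truncation, the choice of the last letter as the "preset position", and the SALTS table are all hypothetical.

```python
import hashlib

# Hypothetical salt table keyed by first letter; real values would be kept secret.
SALTS = {"m": "salt-m", "x": "salt-x"}

def feature_value(letters, jump_degree, salts=SALTS, target_pos=-1):
    """Map a letter sequence to an integer feature value (illustrative only)."""
    count = len(letters)              # number of letters
    target = letters[target_pos]      # letter at the preset position (assumed: last)
    salt = salts[letters[0]]          # preset salt value for the first letter
    material = f"{count}|{target}|{salt}|{jump_degree}"
    digest = hashlib.sha256(material.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")   # keep 64 bits
```

Because only these four components enter the value, different letter sequences can share a feature value (for example, two sequences with the same length, first letter, last letter, and jump degree). This lossy behaviour is consistent with the method's final step, in which candidates are decrypted and matched against the input text.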
In one embodiment, the feature value mapping module 820 is further configured to obtain the ASCII code value of each letter in the letter sequence; determine the difference between the ASCII code values of every two adjacent letters in the letter sequence; and determine the jump degree characteristic value corresponding to the letter sequence according to those differences.
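The text above fixes the inputs (differences between the ASCII codes of adjacent letters) but not the aggregation rule. The following sketch assumes, purely for illustration, that the jump degree is the sum of the absolute differences; the function name jump_degree is likewise an assumption.

```python
def jump_degree(letters: str) -> int:
    """Sum of absolute ASCII differences between adjacent letters (assumed rule)."""
    codes = [ord(ch) for ch in letters]          # ASCII code of each letter
    return sum(abs(b - a) for a, b in zip(codes, codes[1:]))
```

Under this assumption, "mi" gives |105 - 109| = 4 and "xiao" gives 15 + 8 + 14 = 37, while a single-letter sequence gives 0.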
In one embodiment, the text segment sequence includes a symbol segment and a character segment, and the data search module 830 is further configured to: when the text segment in question is a symbol segment, determine a preset one of a plurality of preconfigured index tables as the preconfigured index table corresponding to the symbol segment; and when the text segment in question is a character segment, determine the pronunciation category corresponding to the character segment, and determine the preconfigured index table corresponding to the character segment from the plurality of preconfigured index tables according to the first letter in the letter sequence and the pronunciation category.
In one embodiment, the character segment is a Chinese character segment formed of a single Chinese character, an English segment formed of consecutive English letters in the input text, or a numeric segment formed of a single digit. The letter sequence corresponding to a Chinese character segment is formed from the Chinese pinyin of that segment, and the pronunciation category corresponding to a Chinese character segment represents the tone of that pinyin. The letter sequence corresponding to an English segment is the sequence of lowercase letters corresponding to the letters in the English segment, and the pronunciation category corresponding to an English segment is a preset pronunciation category. The letter sequence corresponding to a numeric segment is formed from the Chinese pinyin of that segment, and the pronunciation category corresponding to a numeric segment represents the tone of that pinyin.
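The selection of an index table from the first letter and the pronunciation category can be sketched as follows. This is an illustration under assumptions: the tiny PINYIN lookup stands in for a real pinyin dictionary, the value 0 for the preset English pronunciation category is inferred from the table name a0 in the query example, and the names letters_and_category and index_table_for are hypothetical.

```python
# Toy pinyin lookup: segment -> (letter sequence, tone). A real system would
# use a full pinyin dictionary; these four entries are only for illustration.
PINYIN = {"米": ("mi", 3), "小": ("xiao", 3), "圈": ("quan", 1), "5": ("wu", 3)}
PRESET_ENGLISH_CATEGORY = 0   # assumed from the "a0" table in the example
SYMBOL_TABLE = "special"      # the single table shared by all symbol segments

def letters_and_category(segment):
    """Return (letter sequence, pronunciation category), or None for symbols."""
    if segment.isascii() and segment.isalpha():
        return segment.lower(), PRESET_ENGLISH_CATEGORY   # English segment
    if segment in PINYIN:
        return PINYIN[segment]                            # Chinese character or digit
    return None                                           # treated as a symbol segment

def index_table_for(segment):
    """Pick the index table name from the first letter and pronunciation category."""
    resolved = letters_and_category(segment)
    if resolved is None:
        return SYMBOL_TABLE
    letters, category = resolved
    return f"{letters[0]}{category}"
```

With these definitions the segments "米", "小", "圈", "Abc", and "," map to the tables m3, x3, q1, a0, and special, matching the query condition list in the example. A character missing from the toy lookup would fall through to the symbol table, which a real implementation would need to handle differently.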
In one embodiment, the data search module 830 is further configured to: determine, from the data identifiers matched with the individual text segments in the text segment sequence, a plurality of candidate data identifiers that are matched with every text segment in the sequence; for each of the candidate data identifiers, determine, for each text segment in the text segment sequence, the text segment serial number recorded in the index record in which that candidate data identifier is located; and screen out target data identifiers from the plurality of candidate data identifiers, such that the order indicated by the text segment serial numbers recorded for the text segments under the target data identifier is consistent with the arrangement order of those text segments in the text segment sequence.
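The same screening can be expressed outside SQL. The sketch below is an assumption-laden illustration: it takes, for each text segment, the (data identifier, serial number) pairs found in that segment's index table, keeps the identifiers present for every segment, and then greedily checks whether serial numbers can be chosen in strictly increasing order. The name screen_targets and the greedy strategy are not from the application.

```python
def screen_targets(matches):
    """matches[i] is a list of (data_id, serial_number) pairs for segment i."""
    if not matches:
        return []
    # Group the serial numbers of each segment by data identifier.
    per_segment = []
    for records in matches:
        by_id = {}
        for data_id, serial in records:
            by_id.setdefault(data_id, []).append(serial)
        per_segment.append(by_id)
    # Candidates: identifiers matched by every segment.
    candidates = set(per_segment[0])
    for by_id in per_segment[1:]:
        candidates &= set(by_id)
    # Targets: candidates whose serial numbers can follow the input order.
    targets = []
    for data_id in sorted(candidates):
        previous = float("-inf")
        in_order = True
        for by_id in per_segment:
            later = [s for s in by_id[data_id] if s > previous]
            if not later:
                in_order = False
                break
            previous = min(later)   # smallest usable serial keeps options open
        if in_order:
            targets.append(data_id)
    return targets
```

Choosing the smallest serial number that still exceeds the previous one leaves the most room for later segments, so a record is accepted whenever some increasing assignment exists, including when the same segment occurs several times in a stored text.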
In one embodiment, the data searching apparatus 800 further includes an index generating module, where the index generating module is configured to: encrypt a preset text, obtain and store an encrypted text corresponding to the preset text, and generate a data identifier corresponding to the encrypted text; parse the preset text into a preset text segment sequence formed by a plurality of preset text segments; for each preset text segment in the preset text segment sequence, map the preset text segment in question to a corresponding feature value; generate a text segment serial number corresponding to the preset text segment in question based on the data identifier corresponding to the encrypted text and the arrangement position of the preset text segment in the preset text segment sequence; and record, in one-to-one correspondence, the data identifier corresponding to the encrypted text, the text segment serial number corresponding to the preset text segment, and the feature value corresponding to the preset text segment in the preconfigured index table corresponding to the preset text segment.
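The index generation steps can be sketched as a single routine that takes the encryption, parsing, table selection, and feature mapping steps as pluggable functions. Everything named here is hypothetical, and the serial number encoding data_id * 10000 + position is only an assumption suggested by the m3.id*10000<m3.index condition in the query example.

```python
from collections import defaultdict

SERIAL_STRIDE = 10000   # assumed encoding: serial = data_id * 10000 + position

def index_text(store, tables, data_id, text, encrypt, parse, table_for, feature_of):
    """Store the encrypted text and add one index record per text segment."""
    store[data_id] = encrypt(text)                      # ciphertext keyed by identifier
    for position, segment in enumerate(parse(text), start=1):
        serial = data_id * SERIAL_STRIDE + position     # encodes identifier and position
        # Record (id, trait_value, index) in the table chosen for this segment.
        tables[table_for(segment)].append((data_id, feature_of(segment), serial))

# Demonstration with stand-in functions; reversing a string is NOT encryption.
store, tables = {}, defaultdict(list)
index_text(store, tables, 7, "ab",
           encrypt=lambda t: t[::-1],
           parse=list,
           table_for=lambda s: s + "0",
           feature_of=ord)
```

After this call, the table a0 holds the record (7, 97, 70001) and the table b0 holds (7, 98, 70002); the plaintext itself appears only in encrypted form in the store.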
The modules in the above data search device may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in the computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can call and execute the operations corresponding to the modules.
In one embodiment, a computer device is provided, which may be a server, and the internal structure of which may be as shown in fig. 9. The computer device includes a processor, a memory, an Input/Output interface (I/O) and a communication interface. The processor, the memory and the input/output interface are connected through a system bus, and the communication interface is connected to the system bus through the input/output interface. Wherein the processor of the computer device is configured to provide computing and control capabilities. The memory of the computer device includes a non-volatile storage medium and an internal memory. The non-volatile storage medium stores an operating system, computer programs, and a database. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The database of the computer device is used for storing data to be stored when the data searching method is executed. The input/output interface of the computer device is used to exchange information between the processor and the external device. The communication interface of the computer device is used for communicating with an external terminal through a network connection. The computer program is executed by a processor to implement a data search method.
It will be appreciated by persons skilled in the art that the architecture shown in fig. 9 is merely a block diagram of some of the architecture relevant to the present inventive arrangements and is not limiting as to the computer device to which the present inventive arrangements are applicable, and that a particular computer device may include more or fewer components than shown, or may combine some of the components, or have a different arrangement of components.
In an embodiment, there is also provided a computer device comprising a memory and a processor, the memory having stored therein a computer program, the processor implementing the steps of the method embodiments described above when the computer program is executed.
In one embodiment, a computer-readable storage medium is provided, on which a computer program is stored which, when executed by a processor, carries out the steps of the method embodiments described above.
In an embodiment, a computer program product is provided, comprising a computer program which, when executed by a processor, implements the steps of the method embodiments described above.
It should be noted that the user information (including but not limited to user equipment information, user personal information, etc.) and the data (including but not limited to data for analysis, stored data, presented data, etc.) involved in the present application are information and data authorized by the user or fully authorized by all parties, and the collection, use, and processing of the related data must comply with the relevant regulations.
Those skilled in the art will appreciate that all or part of the above-described methods may be implemented by a computer program stored on a non-transitory computer-readable storage medium, and that the computer program, when executed, may comprise the steps of the method embodiments described above. Any reference to memory, database, or other medium used in the embodiments provided herein may include at least one of non-volatile and volatile memory. The non-volatile memory may include read-only memory (ROM), magnetic tape, floppy disk, flash memory, optical memory, high-density embedded non-volatile memory, resistive random access memory (ReRAM), magnetoresistive random access memory (Magnetoresistive Random Access Memory, MRAM), ferroelectric random access memory (Ferroelectric Random Access Memory, FRAM), phase change memory (Phase Change Memory, PCM), graphene memory, and the like. Volatile memory may include random access memory (Random Access Memory, RAM), external cache memory, and the like. By way of illustration and not limitation, RAM may take various forms, such as static random access memory (Static Random Access Memory, SRAM) or dynamic random access memory (Dynamic Random Access Memory, DRAM). The databases referred to in the embodiments provided herein may include at least one of a relational database and a non-relational database. The non-relational database may include, but is not limited to, a blockchain-based distributed database. The processor referred to in the embodiments provided herein may be, but is not limited to, a general-purpose processor, a central processing unit, a graphics processor, a digital signal processor, a programmable logic device, or a data processing logic device based on quantum computing.
The technical features of the above embodiments may be combined arbitrarily. For brevity, not all possible combinations of these technical features are described; however, as long as a combination of technical features involves no contradiction, it should be considered to be within the scope of this description.
The foregoing examples illustrate only a few embodiments of the application and are described in detail herein without thereby limiting the scope of the application. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the application, which are all within the scope of the application. Accordingly, the scope of the application should be assessed as that of the appended claims.

Claims (10)

1. A data searching method, the method comprising:
acquiring an input text, and parsing the input text into a text segment sequence formed by a plurality of text segments;
for each text segment in the text segment sequence, mapping the text segment in question to a corresponding feature value;
determining an index record containing the feature value from a preconfigured index table corresponding to the text segment in question, and determining a data identifier matched with the text segment in question from the index record;
screening out, from the data identifiers matched with the individual text segments in the text segment sequence, target data identifiers matched with every text segment in the text segment sequence;
obtaining a pre-encrypted text corresponding to the target data identifier and decrypting the pre-encrypted text to obtain a plaintext candidate text; and
performing text matching on the input text and the plaintext candidate text to obtain a data search result for the input text.
2. The method of claim 1, wherein the text segment sequence comprises character segments, and wherein, for each text segment in the text segment sequence, mapping the text segment in question to a corresponding feature value comprises:
for each character segment in the text segment sequence, determining a letter sequence corresponding to the character segment in question;
determining the number of letters in the letter sequence, determining a target letter located at a preset position in the letter sequence, and acquiring a preset salt value corresponding to the first letter in the letter sequence;
determining a jump degree characteristic value corresponding to the letter sequence according to the difference between every two adjacent letters in the letter sequence; and
mapping the character segment in question to a corresponding feature value based on the number of letters, the target letter, the preset salt value, and the jump degree characteristic value.
3. The method of claim 2, wherein determining the jump degree characteristic value corresponding to the letter sequence according to the difference between every two adjacent letters in the letter sequence comprises:
acquiring an ASCII code value of each letter in the letter sequence;
determining the difference between the ASCII code values of every two adjacent letters in the letter sequence; and
determining the jump degree characteristic value corresponding to the letter sequence according to the difference between the ASCII code values of every two adjacent letters in the letter sequence.
4. The method of claim 2, wherein the text segment sequence includes a symbol segment and a character segment, the method further comprising:
when the text segment in question is the symbol segment, determining a preset one of a plurality of preconfigured index tables as the preconfigured index table corresponding to the symbol segment; and
when the text segment in question is the character segment, determining a pronunciation category corresponding to the character segment, and determining the preconfigured index table corresponding to the character segment from the plurality of preconfigured index tables according to the first letter in the letter sequence and the pronunciation category.
5. The method of claim 4, wherein the character segment comprises a Chinese character segment formed from a single Chinese character, an English segment formed from consecutive English letters in the input text, or a numeric segment formed from a single digit;
the letter sequence corresponding to the Chinese character segment is formed based on the Chinese pinyin corresponding to the Chinese character segment; the pronunciation category corresponding to the Chinese character segment represents the tone of the Chinese pinyin corresponding to the Chinese character segment;
the letter sequence corresponding to the English segment is a sequence formed by the lowercase letters corresponding to the letters in the English segment; the pronunciation category corresponding to the English segment is a preset pronunciation category;
the letter sequence corresponding to the numeric segment is formed based on the Chinese pinyin corresponding to the numeric segment; the pronunciation category corresponding to the numeric segment represents the tone of the Chinese pinyin corresponding to the numeric segment.
6. The method according to claim 1, wherein screening out, from the data identifiers matched with the individual text segments in the text segment sequence, target data identifiers matched with every text segment in the text segment sequence comprises:
determining, from the data identifiers matched with the individual text segments in the text segment sequence, a plurality of candidate data identifiers matched with every text segment in the text segment sequence;
for each candidate data identifier of the plurality of candidate data identifiers, determining, for each text segment in the text segment sequence, the text segment serial number recorded in the index record in which the candidate data identifier is located; and
screening out target data identifiers from the plurality of candidate data identifiers, such that the order indicated by the text segment serial numbers recorded for the plurality of text segments under the target data identifier is consistent with the arrangement order of the plurality of text segments in the text segment sequence.
7. The method according to any one of claims 1-6, further comprising:
encrypting a preset text, obtaining and storing an encrypted text corresponding to the preset text, and generating a data identifier corresponding to the encrypted text;
parsing the preset text into a preset text segment sequence formed by a plurality of preset text segments;
for each preset text segment in the preset text segment sequence, mapping the preset text segment in question to a corresponding feature value;
generating a text segment serial number corresponding to the preset text segment in question based on the data identifier corresponding to the encrypted text and the arrangement position of the preset text segment in question in the preset text segment sequence; and
recording, in one-to-one correspondence, the data identifier corresponding to the encrypted text, the text segment serial number corresponding to the preset text segment in question, and the feature value corresponding to the preset text segment in question in the preconfigured index table corresponding to the preset text segment in question.
8. A data search device, the device comprising:
a parsing module, configured to acquire an input text and parse the input text into a text segment sequence formed by a plurality of text segments;
a feature value mapping module, configured to map, for each text segment in the text segment sequence, the text segment in question to a corresponding feature value; and
a data searching module, configured to determine an index record containing the feature value from a preconfigured index table corresponding to the text segment in question, and determine a data identifier matched with the text segment in question from the index record; screen out, from the data identifiers matched with the individual text segments in the text segment sequence, target data identifiers matched with every text segment in the text segment sequence; obtain a pre-encrypted text corresponding to the target data identifier and decrypt the pre-encrypted text to obtain a plaintext candidate text; and perform text matching on the input text and the plaintext candidate text to obtain a data search result for the input text.
9. A computer device comprising a memory and a processor, the memory storing a computer program, characterized in that the processor implements the steps of the method of any of claims 1 to 7 when the computer program is executed.
10. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when being executed by a processor, implements the steps of the method of any of claims 1 to 7.
CN202410311119.4A 2024-03-19 2024-03-19 Data searching method, device, computer equipment, storage medium and product Pending CN117910022A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410311119.4A CN117910022A (en) 2024-03-19 2024-03-19 Data searching method, device, computer equipment, storage medium and product

Publications (1)

Publication Number Publication Date
CN117910022A (en) 2024-04-19

Family

ID=90692525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410311119.4A Pending CN117910022A (en) 2024-03-19 2024-03-19 Data searching method, device, computer equipment, storage medium and product

Country Status (1)

Country Link
CN (1) CN117910022A (en)

Legal Events

Date Code Title Description
PB01 Publication