WO2021258848A1

WO2021258848A1 - Data dictionary generation method and apparatus, data query method and apparatus, and device and medium

Info

Publication number: WO2021258848A1
Application number: PCT/CN2021/090528
Authority: WO
Inventors: 刘东煜; 陈乐清; 曾增烽; 李炫�
Original assignee: 平安科技（深圳）有限公司
Priority date: 2020-06-24
Filing date: 2021-04-28
Publication date: 2021-12-30
Also published as: CN111737977A; CN111737977B

Abstract

Disclosed are a data dictionary generation method and apparatus, a computer device and a storage medium. The method comprises: acquiring first data to be stored, the first data to be stored comprising a first pinyin node and a second pinyin node; performing a query in a preset first data dictionary on the basis of the first pinyin node and the second pinyin node, so as to determine a first index sequence and a second index sequence; processing the first index sequence and the second index sequence by using a CSR method, so as to obtain candidate index groups; querying a candidate frequency value of each candidate index group in a preset second data dictionary, and screening out, from the candidate index groups, a target index group whose candidate frequency value meets a preset requirement; and mapping and storing the first data to be stored and the target index group to generate a third data dictionary. Because the third data dictionary is recovered by combining the first data dictionary and the second data dictionary, the problem of information redundancy during data storage is solved.

Description

Data dictionary generation method, data query method, device, equipment and medium

This application is based on the Chinese patent application filed on June 24, 2020, with the application number 202010589195.3, titled "Data dictionary generation method, data query method, device, equipment and medium", and claims its priority.

Technical field

This application relates to the field of cloud storage, and in particular to a data dictionary generation method, data query method, device, equipment, and medium.

Background

With the rapid development of the Internet and the growing informatization of all fields of society, the amount of data is exploding at an unprecedented rate, and mankind is entering the era of big data. In an information management system, a data dictionary is usually used to store data. The inventor realized that current dictionary libraries based on word segmentation generally require four types of underlying dictionaries: a 1gram word frequency dictionary, a 1gram pinyin-homonym mapping dictionary, a 2gram word frequency dictionary, and a 2gram pinyin-homonym mapping dictionary, and these four underlying dictionaries need to be stored separately. When the algorithm loads the dictionaries, it must not only load them as four HashMaps but also save the one-to-one mapping relationships in each dictionary separately. This traditional data dictionary storage method therefore often results in considerable information redundancy and wasted space.

Summary of the application

The embodiments of the present application provide a data dictionary generation method, device, computer equipment, and storage medium to solve the problem of information redundancy during data storage.

The embodiments of the present application provide a data query method, device, computer equipment, and storage medium to solve the problem of low efficiency of data query.

A method for generating a data dictionary, including:

Acquiring first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;

Querying a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node, and the second index sequence is the index sequence of the second pinyin node;

Processing the first index sequence and the second index sequence using a CSR method to obtain a candidate index group;

Querying the candidate frequency value of each candidate index group in a preset second data dictionary, and screening out, from the candidate index groups, a target index group whose candidate frequency value meets a preset requirement;

Mapping and storing the first data to be stored and the target index group to generate a third data dictionary.

A data query method, including:

Obtaining first data to be queried, querying the first data to be queried in a third data dictionary, and determining the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the data dictionary generation method of claim 1;

Querying the storage array of a fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary that stores sixth index values and the corresponding sample frequency values.

A data dictionary generating device includes:

The first obtaining module is configured to obtain first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;

The first query module is configured to query a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node, and the second index sequence is the index sequence of the second pinyin node;

The first processing module is configured to process the first index sequence and the second index sequence by using a CSR method to obtain a candidate index group;

The first screening module is configured to query the candidate frequency value of each candidate index group in a preset second data dictionary, and to screen out, from the candidate index groups, a target index group whose candidate frequency value meets a preset requirement;

The first mapping storage module is configured to map and store the first data to be stored and the target index group to generate a third data dictionary.

A data query device includes:

The second query module is configured to obtain first data to be queried, query the first data to be queried in a third data dictionary, and determine the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the data dictionary generation method of claim 1;

The third query module is configured to query the storage array of a fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary that stores sixth index values and the corresponding sample frequency values.

A computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, wherein the processor implements the following steps when the processor executes the computer-readable instructions:

One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:

In the above data dictionary generation method, device, computer equipment and storage medium, first data to be stored is acquired, the first data to be stored including a first pinyin node and a second pinyin node; a preset first data dictionary is queried based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node; the first index sequence and the second index sequence are processed using a CSR method to obtain candidate index groups; the candidate frequency value of each candidate index group is queried in a preset second data dictionary, and a target index group whose candidate frequency value meets a preset requirement is screened out from the candidate index groups; and the first data to be stored and the target index group are mapped and stored to generate a third data dictionary. Because the third data dictionary is recovered by combining the first data dictionary and the second data dictionary, data storage space is saved. In addition, when the dictionary is generated, the first data to be stored is stored in the form of a double-array dictionary tree, that is, the first pinyin node and the second pinyin node are converted into indexes for storage, which reduces redundant information in data storage and the inconvenience of storing character-type data.

In the above data query method, device, computer equipment and storage medium, first data to be queried is acquired, the first data to be queried is queried in a third data dictionary, and the index group to be queried for the first data to be queried is determined, where the third data dictionary is obtained by the data dictionary generation method of claim 1; the storage array of a fourth data dictionary is then queried based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary that stores sixth index values and the corresponding sample frequency values. The accuracy of the data query is thereby ensured.

The details of one or more embodiments of the present application are presented in the following drawings and description, and other features and advantages of the present application will become apparent from the description, drawings and claims.

Description of the drawings

In order to explain the technical solutions of the embodiments of the present application more clearly, the following will briefly introduce the drawings that need to be used in the description of the embodiments of the present application. Obviously, the drawings in the following description are only some embodiments of the present application. For those of ordinary skill in the art, other drawings can be obtained based on these drawings without creative labor.

FIG. 1 is a schematic diagram of an application environment of a data dictionary generation method and a data query method in an embodiment of the present application;

FIG. 2 is an example diagram of a method for generating a data dictionary in an embodiment of the present application;

FIG. 3 is another example diagram of a method for generating a data dictionary in an embodiment of the present application;

FIG. 4 is another example diagram of a method for generating a data dictionary in an embodiment of the present application;

FIG. 5 is another example diagram of a method for generating a data dictionary in an embodiment of the present application;

FIG. 6 is a functional block diagram of a data dictionary generating device in an embodiment of the present application;

FIG. 7 is an example diagram of a data query method in an embodiment of the present application;

FIG. 8 is another example diagram of a data query method in an embodiment of the present application;

FIG. 9 is a functional block diagram of a data query device in an embodiment of the present application;

FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described clearly and completely in conjunction with the accompanying drawings in the embodiments of the present application. Obviously, the described embodiments are part of the embodiments of the present application, rather than all of them. Based on the embodiments in this application, all other embodiments obtained by those of ordinary skill in the art without creative work shall fall within the protection scope of this application.

The data dictionary generation method provided by the embodiment of the present application can be applied to the application environment shown in FIG. 1. Specifically, the data dictionary generation method is applied in a data dictionary generation system. The data dictionary generation system includes a client and a server as shown in FIG. 1 and is used to solve the problem of information redundancy during data storage. The client, also called the user terminal, is the program that corresponds to the server and provides local services to the user. The client can be installed on, but is not limited to, personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices. The server can be implemented as a standalone server or as a server cluster composed of multiple servers.

In one embodiment, as shown in FIG. 2, a method for generating a data dictionary is provided. Taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:

S11: Acquire first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node.

Here, the first data to be stored refers to 2gram pinyin data to be stored. For example, the first data to be stored may be GaoKong CaoZuo, YanJing She or KaiKai XinXin. The first data to be stored includes the pinyin data of two nodes, namely the first pinyin node and the second pinyin node. The first pinyin node is the pinyin of the first 1gram in the first data to be stored, and the second pinyin node is the pinyin of the second 1gram in the first data to be stored. The first pinyin node and the second pinyin node may be the same or different. For example, if the first data to be stored is GaoKong CaoZuo, the first pinyin node is GaoKong and the second pinyin node is CaoZuo. Specifically, the first data to be stored can be obtained either by collecting 2gram pinyin data in real time or by reading 2gram pinyin data directly from a pinyin dictionary database.

S12: Based on the first pinyin node and the second pinyin node, query the preset first data dictionary to determine the first index sequence and the second index sequence, where the first index sequence is the index sequence of the first pinyin node, and the second index sequence is the index sequence of the second pinyin node.

Here, the first data dictionary refers to a 1gram homophone dictionary generated in advance for storing 1gram pinyin-homonym data. Specifically, the first data dictionary stores a number of 1gram pinyin nodes and the index sequence corresponding to each 1gram pinyin node. For example, the first data dictionary stores 1gram pinyin-homonym data whose key is GaiXing and whose value is [index1,index2,index3,index4...]. GaiXing is the 1gram pinyin node, and [index1,index2,index3,index4...] are the indexes of the character strings corresponding to the 1gram pinyin node GaiXing. For example, the character strings whose pinyin is GaiXing can include [modified, modified, changed surname, this new...]; processing these character strings with the double-array dictionary tree algorithm yields the index sequence [index1, index2, index3, index4...] corresponding to GaiXing. It should be noted that indexes are assigned per character string, and the index value corresponding to each character string is uniquely determined.

Specifically, after the first pinyin node and the second pinyin node are determined, each of them is matched against all 1gram pinyin nodes (keys) in the first data dictionary. The index sequence of the 1gram pinyin node that matches the first pinyin node is determined as the first index sequence, and the index sequence of the 1gram pinyin node that matches the second pinyin node is determined as the second index sequence. Optionally, the first index sequence may be denoted preIndex, the index sequence of the first pinyin node, and the second index sequence may be denoted sufIndex, the index sequence of the second pinyin node.
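Step S12 can be illustrated with a minimal sketch. It assumes, purely for illustration, that the first data dictionary is held as an ordinary Python dict from pinyin node to index sequence and that the two pinyin nodes are separated by a space; the function name lookup_index_sequences and the concrete index values are hypothetical, not taken from the application.

```python
# Hypothetical in-memory stand-in for the first data dictionary:
# each 1gram pinyin node maps to the indexes of its homophone strings.
first_dict = {"GaoKong": [1, 2], "CaoZuo": [3, 4]}


def lookup_index_sequences(data, first_dict):
    """Split 2gram pinyin data into its two nodes and look each one up."""
    first_node, second_node = data.split()  # assumes exactly two nodes
    pre_index = first_dict.get(first_node, [])   # first index sequence
    suf_index = first_dict.get(second_node, [])  # second index sequence
    return pre_index, suf_index


pre_index, suf_index = lookup_index_sequences("GaoKong CaoZuo", first_dict)
```

A pinyin node that is missing from the dictionary yields an empty sequence in this sketch, so no candidate index group can be formed for it later.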

S13: Use the CSR method to process the first index sequence and the second index sequence to obtain a candidate index group.

Here, the CSR (compressed sparse row) method is a sparse matrix storage method. When a sparse matrix is stored, the average number of bytes used per non-zero element (bytes per nonzero entry) is most stable in the CSR format. Specifically, CSR consists of three arrays: row offsets, column indices, and values. The row offsets array is indexed by row, and each of its entries gives the offset of the first non-zero value of that row; the column indices array gives the column of each stored element; and the values array gives the value of each stored element.
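The three CSR arrays can be made concrete with a small sketch that converts a dense matrix into row offsets, column indices, and values. The function name to_csr and the sample matrix are illustrative assumptions; following the usual CSR convention, the row offsets array carries one extra trailing entry equal to the number of stored values, so that the extent of the last row is also known.

```python
def to_csr(dense):
    """Convert a dense list-of-rows matrix into the three CSR arrays."""
    row_offsets = [0]   # row_offsets[r] = offset of row r's first stored value
    col_indices = []    # column of each stored (non-zero) value
    values = []         # the stored values themselves
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                col_indices.append(col)
                values.append(value)
        # Closing offset of this row, which is the opening offset of the next.
        row_offsets.append(len(values))
    return row_offsets, col_indices, values


dense = [
    [0, 40, 0],
    [0, 0, 20],
    [0, 0, 0],
]
row_offsets, col_indices, values = to_csr(dense)
```

The stored values of row r are values[row_offsets[r]:row_offsets[r + 1]], and their columns are the same slice of col_indices.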

Here, a candidate index group refers to an index group obtained by pairing any index value in the first index sequence with any index value in the second index sequence. A candidate index group is composed of two index values. For example, a candidate index group may be Index1-index3, Index2-index3, or Index3-index5. Specifically, after the first index sequence and the second index sequence are determined, the first index sequence is taken as the rows of the matrix and the second index sequence as the columns of the matrix. The row offsets and column indices of the CSR method are then used to determine the column index array of each row that corresponds to the first index sequence, and that column index array is intersected with the second index sequence to obtain the candidate index groups.
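The intersection just described can be sketched as follows. The sketch assumes, as a simplification, that index values are used directly as row and column numbers of a matrix already held in CSR form; the function name candidate_groups and the sample arrays are hypothetical.

```python
def candidate_groups(row_offsets, col_indices, pre_index, suf_index):
    """Pair each row index with the stored columns that occur in suf_index."""
    wanted_cols = set(suf_index)
    groups = []
    for row in pre_index:
        # Column index array of this row, read from the CSR arrays.
        cols = col_indices[row_offsets[row]:row_offsets[row + 1]]
        # Intersect the row's columns with the second index sequence.
        groups.extend((row, col) for col in cols if col in wanted_cols)
    return groups


# A 5x5 matrix with stored entries at (1, 3) and (2, 4).
row_offsets = [0, 0, 1, 2, 2, 2]
col_indices = [3, 4]
groups = candidate_groups(row_offsets, col_indices, [1, 2], [3, 4])
```

Only the columns actually stored for a row are inspected, so the cost depends on the number of non-zero entries in the selected rows rather than on the full product of the two index sequences.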

S14: Query the candidate frequency value of each candidate index group in the preset second data dictionary, and filter out the target index group whose candidate frequency value meets the preset requirement from the candidate index group.

Here, the second data dictionary refers to a pre-generated word frequency dictionary used to store the index values of 2gram character strings (words) and the frequency value corresponding to each 2gram character string (word). A 2gram character string is a phrase composed of two 1gram character strings, for example "high-altitude operation", "patent analysis" or "happy". Specifically, the second data dictionary stores the index groups corresponding to a number of 2gram character strings (words) and the frequency value corresponding to each 2gram character string (word). The frequency value refers to the number of times a given 2gram character string (word) appears in the text. The frequency value is one of the most important reference indicators for ranking candidate words: the larger the frequency value of a word, the greater the probability that it is the correct word. For example, the second data dictionary stores, for the word "high-altitude operation", the index group Index1-index3 together with the frequency value 45, where Index1 is the index value of "high-altitude", index3 is the index value of "operation", and 45 is the frequency value of "high-altitude operation".

Here, the target index group refers to an index group whose frequency value meets a preset requirement. Specifically, after the candidate index groups are determined, each candidate index group is queried in the preset second data dictionary to determine its candidate frequency value. The index groups whose candidate frequency value meets the preset requirement are then screened out from the candidate index groups and used as target index groups. In one specific embodiment, a frequency threshold is preset, the candidate frequency value of each candidate index group is compared with the frequency threshold, and the candidate index groups whose candidate frequency value is greater than the frequency threshold are determined to be target index groups that meet the preset requirement. Preferably, in order to ensure the diversity and universality of the stored data, the frequency threshold in this embodiment is set to 0; that is, every candidate index group with a candidate frequency value greater than 0 is determined to be a target index group, and a candidate frequency value of 0 means that the 2gram character string (word) corresponding to the candidate index group does not exist. In another specific embodiment, if no candidate frequency value is found for a candidate index group when it is queried in the preset second data dictionary, the candidate index group is directly determined not to meet the preset requirement and is eliminated.

For example, suppose the candidate index groups are Index1-index3 (high-altitude operation), Index2-index4 (high-altitude slot), Index1-index4 (high-altitude slot), and Index2-index3 (high control operation). After the second data dictionary is queried, the frequency value of Index1-index3 (high-altitude operation) is 40; the frequency value of Index2-index4 (high-altitude slot) is 20; the frequency value of Index1-index4 (high-altitude slot) is 0 (does not exist); and the frequency value of Index2-index3 (high control operation) is 0 (does not exist). Index1-index3 and Index2-index4 are therefore determined as the target index groups.
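The screening in step S14 can be sketched with the numbers from this example. The sketch assumes the second data dictionary is available as a Python dict keyed by index pairs, with missing pairs treated as frequency 0; the function name select_target_groups and the integer index values are illustrative only.

```python
def select_target_groups(candidates, second_dict, threshold=0):
    """Keep the candidate index groups whose frequency exceeds the threshold."""
    # A group absent from the dictionary is treated as having frequency 0.
    return [group for group in candidates
            if second_dict.get(group, 0) > threshold]


# Index1..index4 are represented by the integers 1..4.
candidates = [(1, 3), (2, 4), (1, 4), (2, 3)]
second_dict = {(1, 3): 40, (2, 4): 20}
targets = select_target_groups(candidates, second_dict)
```

With the threshold at its default of 0, any group that occurs at least once is kept; raising the threshold trades coverage for a smaller dictionary.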

S15: Map and store the data to be stored and the target index group to generate a third data dictionary.

Here, the third data dictionary refers to a 2gram homophone dictionary for storing 2gram pinyin-homonym data. Specifically, the third data dictionary includes a number of 2gram pinyin nodes and the sequence of index groups corresponding to each 2gram pinyin node. After the target index groups are determined, the data to be stored (the 2gram pinyin node) and its target index groups are mapped and stored to generate the third data dictionary. For example, if the data to be stored is GaoKong CaoZuo and its target index groups are Index1-index3 and Index2-index4, then GaoKong CaoZuo is stored as the key and Index1-index3 and Index2-index4 are stored as the value to generate the third data dictionary.

In this embodiment, the first data to be stored is acquired, the first data to be stored including the first pinyin node and the second pinyin node; the preset first data dictionary is queried based on the first pinyin node and the second pinyin node to determine the first index sequence and the second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node; the first index sequence and the second index sequence are processed using the CSR method to obtain the candidate index groups; the candidate frequency value of each candidate index group is queried in the preset second data dictionary, and the target index group whose candidate frequency value meets the preset requirement is screened out from the candidate index groups; and the data to be stored and the target index group are mapped and stored to generate the third data dictionary. Because the third data dictionary is recovered by combining the first data dictionary and the second data dictionary, data storage space is saved. In addition, when the dictionary is generated, the first data to be stored is stored in the form of a double-array dictionary tree, that is, the first pinyin node and the second pinyin node are converted into indexes for storage, which reduces redundant information in data storage and the inconvenience of storing character-type data.
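Steps S11 to S15 can be put together in one short sketch. It is a simplification under stated assumptions: both preset dictionaries are plain Python dicts, candidate index groups are enumerated with a Cartesian product in place of the CSR-based processing, and the function name build_third_dictionary and all sample values are hypothetical.

```python
from itertools import product


def build_third_dictionary(items, first_dict, second_dict, threshold=0):
    """Map each 2gram pinyin string to its target index groups."""
    third_dict = {}
    for data in items:
        first_node, second_node = data.split()            # S11
        pre_index = first_dict.get(first_node, [])        # S12
        suf_index = first_dict.get(second_node, [])
        candidates = product(pre_index, suf_index)        # S13 (simplified)
        targets = [group for group in candidates          # S14
                   if second_dict.get(group, 0) > threshold]
        if targets:
            third_dict[data] = targets                    # S15
    return third_dict


first_dict = {"GaoKong": [1, 2], "CaoZuo": [3, 4]}
second_dict = {(1, 3): 40, (2, 4): 20}
third_dict = build_third_dictionary(["GaoKong CaoZuo"], first_dict, second_dict)
```

The resulting dictionary holds only pinyin keys and pairs of integers; the character strings themselves stay in the 1gram structures, which is where the saving in storage comes from.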

In one embodiment, as shown in FIG. 3, before the preset first data dictionary is queried based on the first pinyin node and the second pinyin node, the data dictionary generation method further includes the following steps:

S21: Obtain second data to be stored. The second data to be stored includes N third pinyin nodes and M character strings corresponding to each third pinyin node.

Here, the second data to be stored refers to 1gram pinyin-homonym data to be stored. For example, the second data to be stored may be 1gram pinyin-homonym data whose key is GaiXing and whose value is [modified, changed surname, modified...], or whose key is GaoKong and whose value is [高空, 高控, 高孔...]. The second data to be stored includes N third pinyin nodes and M character strings corresponding to each third pinyin node. A third pinyin node is a key in the second data to be stored, for example GaiXing, GaoKong or CaoZuo. Understandably, the value corresponding to each key in the second data to be stored is the set of character strings corresponding to that third pinyin node. Each third pinyin node has at least one character string. For example, the character strings corresponding to the third pinyin node GaiXing include [modified, changed surname, modified...]. Specifically, the second data to be stored can be obtained either by collecting 1gram pinyin-homonym data in real time or by reading 1gram pinyin-homonym data directly from a pinyin-homonym dictionary database.

S22: Use the double-array dictionary tree algorithm to process each character string of each third pinyin node, and determine the index value set corresponding to each third pinyin node.

Here, the double-array dictionary tree (double-array trie) is an efficient indexing structure. In the tree, each node corresponds to a DFA state, and each directed edge from a parent node to a child node corresponds to a DFA transition. Traversal starts from the root node and consumes the keyword from head to tail. Each character of the keyword determines the next state: the edge labelled with that character is followed, and each such move consumes one character of the keyword and descends one level in the tree. If the keyword is exhausted at a leaf node, the keyword has been recognized. If traversal gets stuck, either because no edge is labelled with the current character or because the keyword is exhausted at an intermediate node, the keyword is not recognized by the trie.
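The traversal rules above can be illustrated with a deliberately simplified sketch. It is not a double-array trie: it keeps one child map per state instead of the compact base/check arrays, and serves only to show how each character selects the next state and how every stored string receives a unique index. The class name SimpleTrie and its methods are hypothetical.

```python
class SimpleTrie:
    """Plain trie that assigns a unique integer index to each stored string."""

    def __init__(self):
        self.children = [{}]  # children[state] maps a character to the next state
        self.index = [None]   # index[state] is set when a stored string ends here
        self.count = 0        # number of distinct strings stored so far

    def insert(self, key):
        state = 0  # start at the root
        for ch in key:
            nxt = self.children[state].get(ch)
            if nxt is None:  # no edge for this character yet: create one
                nxt = len(self.children)
                self.children.append({})
                self.index.append(None)
                self.children[state][ch] = nxt
            state = nxt
        if self.index[state] is None:  # first time this string is stored
            self.index[state] = self.count
            self.count += 1
        return self.index[state]

    def lookup(self, key):
        state = 0
        for ch in key:
            state = self.children[state].get(ch)
            if state is None:  # stuck: no edge labelled with this character
                return None
        return self.index[state]  # None if the key ends at a non-terminal node


trie = SimpleTrie()
indexes = [trie.insert(word) for word in ["高空", "高控", "高孔"]]
```

Inserting the same string twice returns the same index, which matches the requirement that each character string corresponds to exactly one index value.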

Specifically, the double-array dictionary tree algorithm is used to process each character string corresponding to each third pinyin node, that is, each character string corresponding to each third pinyin node is stored in the form of a double-array dictionary tree, thereby obtaining the index value set corresponding to each third pinyin node. In this way, the indexes of all homophones of a pinyin node can be obtained through that pinyin node when data is retrieved. It should be noted that each index value in the index value set corresponding to each third pinyin node is uniquely determined, and each character string corresponds to a unique index value.

For example, if the character strings corresponding to the third pinyin node GaiXing include [modified, changed surname, modified...], processing them with the double-array dictionary tree algorithm yields the index value set [index1,index2,index3...] for the third pinyin node GaiXing, where index1 is the index value corresponding to "modified", index2 is the index value corresponding to "changed surname", and index3 is the index value corresponding to "modified".

S23: Write the index value set corresponding to each third pinyin node into the preset first index array to obtain the first target index array.

Here, the first index array refers to a pre-established one-dimensional array used to record the index value set corresponding to each third pinyin node. Specifically, the index value set corresponding to each third pinyin node is written into the preset first index array to obtain the first target index array. For example, if the index value set corresponding to the third pinyin node GaiXing is [index1, index2, index3] and the index value set corresponding to the third pinyin node GaoKong is [index4, index5, index6], then after the index value sets corresponding to GaiXing and GaoKong are written into the preset first index array, the first target index array obtained is [index1, index2, index3, index4, index5, index6].

S24: Determine the starting index position of each third pinyin node from the first target index array.

Specifically, since each index value in the first target index array is uniquely determined, the array positions, in the first target index array, of the first index value and the last index value of the index value set corresponding to each third pinyin node are determined as the starting index position of that third pinyin node. For example, suppose the first target index array is [index1, index2, index3, index4, index5, index6]. index1 and index3 are the first index value and the last index value of the third pinyin node GaiXing; the array position of index1 in the first target index array is 0 and the array position of index3 is 2, so the starting index position of the third pinyin node GaiXing is (0, 2). Likewise, index4 and index6 are the first index value and the last index value of the third pinyin node GaoKong; the array position of index4 in the first target index array is 3 and the array position of index6 is 5, so the starting index position of the third pinyin node GaoKong is (3, 5).
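Steps S23 and S24 can be sketched together: the index value sets are concatenated into one flat array, and the positions of each node's first and last index value are recorded as they are written. The function name build_index_array is hypothetical, and index1 to index6 are represented by the integers 1 to 6.

```python
def build_index_array(index_sets):
    """Flatten per-node index sets and record each node's (first, last) position."""
    index_array = []
    positions = {}
    for node, indexes in index_sets.items():
        start = len(index_array)       # position of this node's first index value
        index_array.extend(indexes)
        end = len(index_array) - 1     # position of this node's last index value
        positions[node] = (start, end)
    return index_array, positions


index_sets = {"GaiXing": [1, 2, 3], "GaoKong": [4, 5, 6]}
index_array, positions = build_index_array(index_sets)
```

The sketch relies on every node having at least one index value, as the description states; an empty set would produce an end position smaller than the start position.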

S25: Use the double-array dictionary tree algorithm to process each third pinyin node to obtain the node identifier of each third pinyin node.

Specifically, each third pinyin node is processed using the double-array dictionary tree algorithm, that is, each third pinyin node is stored in the form of a double-array dictionary tree, so as to obtain the node identifier corresponding to each third pinyin node. Understandably, the node identifier corresponding to each third pinyin node is uniquely determined. It should be noted that the method and process of processing each third pinyin node with the double-array dictionary tree algorithm in this step are similar to those used in step S22 to process each character string of each third pinyin node, and are not repeated here.

S26: Map and store the node identifier of each third pinyin node and the corresponding starting index position to generate an offset array set.

Here, the offset array set refers to a set composed of several offset arrays. Each offset array includes a node identifier and the corresponding starting index position. Specifically, after the node identifier of each third pinyin node is determined, each node identifier is stored in association with the corresponding starting index position to generate the offset array set. For example, suppose the node identifier of the third pinyin node GaiXing is 0 and its starting index position is (0, 2), and the node identifier of the third pinyin node GaoKong is 1 and its starting index position is (3, 5). Node identifier 0 is mapped and stored with starting index position (0, 2) to generate a first offset array, node identifier 1 is mapped and stored with starting index position (3, 5) to generate a second offset array, and the first offset array and the second offset array form the offset array set.

S27: Combine the first target index array and the offset array set to generate a first data dictionary.

Among them, the first data dictionary is a dictionary for storing 1gram homophones. Specifically, after the first target index array and the offset array set are determined, the first target index array and the offset array set are combined to generate the first data dictionary. Understandably, in the first data dictionary, each 1gram pinyin node is stored in the form of a node identifier, and the character strings corresponding to each 1gram pinyin node are stored in the form of index values, thereby reducing redundant information during data storage.

In this embodiment, the second data to be stored is acquired, the second data to be stored including N third pinyin nodes and M character strings corresponding to each third pinyin node; each character string of each third pinyin node is processed using the double-array dictionary tree algorithm to determine the index value set corresponding to each third pinyin node; the index value set corresponding to each third pinyin node is written into the preset first index array to obtain the first target index array; the starting index position of each third pinyin node is determined from the first target index array; each third pinyin node is processed using the double-array dictionary tree algorithm to obtain the node identifier of each third pinyin node; the node identifier of each third pinyin node is mapped and stored with the corresponding starting index position to generate an offset array set; and the first target index array and the offset array set are combined to generate the first data dictionary. The second data to be stored is thus stored in the form of a double-array dictionary tree, that is, each third pinyin node is converted into a node identifier for storage, and the character strings corresponding to each third pinyin node are converted into index values for storage, thereby reducing redundant information during data storage.

In one embodiment, as shown in FIG. 4, before querying the candidate frequency value of each candidate index group in the preset second data dictionary, the data dictionary generating method further specifically includes the following steps:

S41: Obtain the third data to be stored. The third data to be stored includes the fourth pinyin byte, the fifth pinyin byte and the target frequency value.

Among them, the third data to be stored refers to the 2gram word frequency data to be stored. For example: the third data to be stored is 2gram word frequency data with a key of GaoKong CaoZuo and a value of 30, or with a key of YanJing Sheg and a value of 25. The third data to be stored includes the fourth pinyin byte, the fifth pinyin byte and the target frequency value. Among them, the fourth pinyin byte refers to the first 1gram pinyin in the third data to be stored, and the fifth pinyin byte refers to the second 1gram pinyin in the third data to be stored. The fourth pinyin byte and the fifth pinyin byte may be the same or different. The fourth pinyin byte and the fifth pinyin byte are combined as the key of the third data to be stored. The target frequency value refers to the frequency value corresponding to the combination of the fourth pinyin byte and the fifth pinyin byte, and is the value in the third data to be stored. For example: if the key in the third data to be stored is GaoKong CaoZuo and the value is 25, then the fourth pinyin byte is GaoKong, the fifth pinyin byte is CaoZuo, and the target frequency value is 25, where 25 is the frequency value of GaoKong CaoZuo. Specifically, the third data to be stored can be acquired by collecting 2gram word frequency data in real time, or by directly obtaining 2gram word frequency data from the pinyin dictionary database.

S42: Use the double-array dictionary tree algorithm to process the fourth pinyin byte and the fifth pinyin byte to obtain the fourth index value and the fifth index value, where the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte.

Specifically, the double-array dictionary tree algorithm is used to process the fourth pinyin byte and the fifth pinyin byte to obtain the fourth index value and the fifth index value, where the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte. It should be noted that the specific method and process of processing the fourth pinyin byte and the fifth pinyin byte using the double-array dictionary tree algorithm in this step are similar to those of processing each character string of each third pinyin node using the double-array dictionary tree algorithm in step S22, and are not repeated here.

S43: Use the CSR method to map and store the fourth index value, the fifth index value, and the target frequency value to generate the second data dictionary.

Among them, the second data dictionary refers to a word frequency dictionary used to store the index values of 2gram character strings (words) and the corresponding frequency values. Since a 2gram character string is composed of two 1gram character strings, each 2gram character string (word) has two index values, namely the fourth index value and the fifth index value. Specifically, a two-dimensional matrix can be preset, in which the fourth index value is used as the row, the fifth index value is used as the column, and the target frequency value is used as the element value, so as to perform the mapping storage. Furthermore, since many 2gram character string combinations do not exist in practice, the two-dimensional matrix is a sparse matrix. Therefore, the CSR (compressed sparse row) method is used to process the two-dimensional matrix to compress the storage space and generate the second data dictionary.
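The CSR storage described above can be illustrated with a minimal pure-Python sketch. It assumes that each (fourth index value, fifth index value) pair occurs at most once and that the index values are small non-negative integers; the function names `build_csr` and `csr_lookup`, and the concrete index values 3, 7, 5 and 2, are hypothetical. Absent 2gram combinations are reported with a frequency of 0.

```python
from bisect import bisect_left


def build_csr(triples, n_rows):
    """Build CSR arrays (row_ptr, col_idx, values) from (row, col, frequency) triples."""
    row_ptr = [0] * (n_rows + 1)
    col_idx, values = [], []
    for row, col, freq in sorted(triples):   # sort by row, then by column
        row_ptr[row + 1] += 1                # count the non-zero entries of each row
        col_idx.append(col)
        values.append(freq)
    for row in range(n_rows):                # prefix sums give the start of each row
        row_ptr[row + 1] += row_ptr[row]
    return row_ptr, col_idx, values


def csr_lookup(csr, row, col):
    """Return the frequency stored at (row, col), or 0 if the pair is absent."""
    row_ptr, col_idx, values = csr
    lo, hi = row_ptr[row], row_ptr[row + 1]
    pos = bisect_left(col_idx, col, lo, hi)  # columns are sorted within a row
    if pos < hi and col_idx[pos] == col:
        return values[pos]
    return 0


# Hypothetical index values: (fourth index, fifth index, target frequency value).
csr = build_csr([(3, 7, 30), (5, 2, 25)], n_rows=8)
print(csr_lookup(csr, 3, 7))  # 30
print(csr_lookup(csr, 3, 2))  # 0
```

Only the non-zero frequency values and their column numbers are stored, plus one row pointer per row, which is why this layout suits a matrix in which most 2gram combinations never occur.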

In this embodiment, the third data to be stored is obtained, the third data to be stored including the fourth pinyin byte, the fifth pinyin byte and the target frequency value; the fourth pinyin byte and the fifth pinyin byte are processed using the double-array dictionary tree algorithm to obtain the fourth index value and the fifth index value, where the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte; and the fourth index value, the fifth index value and the target frequency value are mapped and stored using the CSR method to generate the second data dictionary. The third data to be stored is thus stored in the form of a double-array dictionary tree, that is, the fourth pinyin byte and the fifth pinyin byte in the third data to be stored are represented by index values, thereby reducing redundant information during data storage and saving storage space.

In an embodiment, as shown in FIG. 5, the data dictionary generation method further specifically includes the following steps:

S16: Obtain fourth data to be stored, where the fourth data to be stored includes L sample character strings and a sample frequency value corresponding to each sample character string.

Among them, the fourth data to be stored refers to the 1gram word frequency data to be stored. The fourth data to be stored includes L sample character strings and a frequency value corresponding to each sample character string. Among them, the sample character string is the key in the fourth data to be stored, and the frequency value is the value in the fourth data to be stored. For example: if the fourth data to be stored includes 1gram word frequency data with a key of "high altitude" and a value of 40, and 1gram word frequency data with a key of "operation" and a value of 45, then "high altitude" is a sample character string and 40 is the frequency value corresponding to "high altitude"; "operation" is a sample character string and 45 is the frequency value corresponding to "operation". Understandably, the fourth data to be stored includes L key-value pairs, and each key corresponds to one frequency value, that is, each sample character string corresponds to one frequency value. Specifically, the fourth data to be stored can be acquired by collecting 1gram word frequency data in real time, or by directly obtaining 1gram word frequency data from the pinyin dictionary database.

S17: Use the double-array dictionary tree algorithm to process each sample string to obtain the sixth index value of each sample string.

Specifically, the double-array dictionary tree algorithm is used to process each sample character string, so as to obtain the sixth index value of each sample character string. Understandably, each sample character string corresponds to a unique sixth index value. It should be noted that the specific method and process of processing each sample character string using the double-array dictionary tree algorithm in this step are similar to those of processing each character string of each third pinyin node using the double-array dictionary tree algorithm in step S22, and are not repeated here.

S18: Write each sample character string and the corresponding sixth index value into the preset array to obtain a storage array.

In addition, since the double-array dictionary tree cannot look up a 1gram segment (sample character string) in reverse from its index (the sixth index value), this embodiment establishes a storage array that stores the sample character string corresponding to each sixth index value. Specifically, the array number of each entry in the storage array corresponds to the sixth index value. That is, each sample character string and its sixth index value are written into the storage array in ascending order of the sixth index value, so that the corresponding 1gram segment (sample character string) can be conveniently retrieved from an index value (sixth index value).

S19: Map and store each sixth index value with the corresponding sample frequency value to generate a fourth data dictionary.

Specifically, after the sixth index values are obtained, each sixth index value is mapped and stored with the corresponding sample frequency value to generate the fourth data dictionary. Among them, the fourth data dictionary refers to a 1gram word frequency dictionary for storing 1gram word frequency data. The fourth data dictionary includes the index values of several 1gram character strings and the corresponding frequency values. For example, the fourth data dictionary includes data with a key of index1 and a value of 30, and data with a key of index2 and a value of 40. Among them, index1 is the sixth index value of the sample character string "高空" ("high altitude"), and 30 is the frequency value of that sample character string; index2 is the sixth index value of the sample character string "operation", and 40 is the frequency value of that sample character string.
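Steps S17 to S19 can be sketched together as follows. This is a simplified illustration only: a plain Python dict stands in for the double-array dictionary tree, and sixth index values are assumed to be assigned sequentially from 0, which a real double-array dictionary tree does not guarantee. The function name `build_fourth_dictionary` is hypothetical, and the frequency values are those of the example given for step S16.

```python
def build_fourth_dictionary(word_freqs):
    """Return (index_of, storage_array, freq_of) for 1gram word frequency data."""
    index_of = {}       # stand-in for the double-array dictionary tree: string -> sixth index
    storage_array = []  # position i holds the sample string whose sixth index value is i
    freq_of = {}        # fourth data dictionary: sixth index value -> sample frequency value
    for word, freq in word_freqs.items():
        sixth_index = len(storage_array)
        index_of[word] = sixth_index
        storage_array.append(word)
        freq_of[sixth_index] = freq
    return index_of, storage_array, freq_of


index_of, storage_array, freq_of = build_fourth_dictionary(
    {"high altitude": 40, "operation": 45}
)
idx = index_of["operation"]
print(storage_array[idx], freq_of[idx])  # operation 45
```

Keeping the storage array alongside the frequency mapping is what allows a string to be recovered from its index value, which the tree alone cannot do.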

In this embodiment, the fourth data to be stored is obtained, the fourth data to be stored including L sample character strings and the sample frequency value corresponding to each sample character string; each sample character string is processed using the double-array dictionary tree algorithm to obtain the sixth index value of each sample character string; each sample character string and the corresponding sixth index value are written into the preset array to obtain the storage array; and each sixth index value is mapped and stored with the corresponding sample frequency value to generate the fourth data dictionary. The fourth data to be stored is thus stored in the form of a double-array dictionary tree, that is, each sample character string is converted into a sixth index value and stored with the corresponding sample frequency value, thereby reducing redundant information during data storage and eliminating the inconvenience caused by storing character-type data.

It should be understood that the size of the sequence number of each step in the foregoing embodiment does not mean the order of execution. The execution sequence of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiment of the present application.

In one embodiment, a data dictionary generating device is provided, and the data dictionary generating device corresponds one-to-one to the data dictionary generating method in the above-mentioned embodiment. As shown in FIG. 6, the data dictionary generating device includes a first acquisition module 11, a first query module 12, a first processing module 13, a first screening module 14 and a first mapping storage module 15. The detailed description of each functional module is as follows:

The first obtaining module 11 is configured to obtain first data to be stored, and the first data to be stored includes a first pinyin node and a second pinyin node;

The first query module 12 is configured to query in a preset first data dictionary based on the first pinyin node and the second pinyin node to determine the first index sequence and the second index sequence, where the first index sequence is the index sequence of the first pinyin node, and the second index sequence is the index sequence of the second pinyin node;

The first processing module 13 is configured to use the CSR method to process the first index sequence and the second index sequence to obtain a candidate index group;

The first screening module 14 is configured to query the candidate frequency value of each candidate index group in the preset second data dictionary, and filter out the target index group whose candidate frequency value meets the preset requirements from the candidate index group;

The first mapping storage module 15 is configured to map and store the first data to be stored and the target index group to generate a third data dictionary.

Preferably, the data dictionary generating device further includes:

The second acquisition module is configured to acquire second data to be stored, where the second data to be stored includes N third pinyin nodes and M character strings corresponding to each third pinyin node;

The second processing module is used to process each character string of each third pinyin node by using a double-array dictionary tree algorithm to determine the index value set corresponding to each third pinyin node;

The first writing module is used to write the index value set corresponding to each third pinyin node into the preset first index array to obtain the first target index array;

The first determining module is used to determine the starting index position of each third pinyin node from the first target index array;

The third processing module is used to process each third pinyin node by using the double-array dictionary tree algorithm to obtain the node identifier of each third pinyin node;

The second mapping storage module is used to map and store the node identifier of each third pinyin node and the corresponding starting index position to generate an offset array set;

The combination module is used to combine the first target index array and the offset array set to generate a first data dictionary.

Preferably, the data dictionary generating device further includes:

The third acquisition module is used to acquire the third data to be stored, the third data to be stored includes the fourth pinyin byte, the fifth pinyin byte and the target frequency value;

The fourth processing module is used to process the fourth pinyin byte and the fifth pinyin byte using the double-array dictionary tree algorithm to obtain the fourth index value and the fifth index value, where the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte;

The third mapping storage module is used to map and store the fourth index value, the fifth index value, and the target frequency value using the CSR method to generate a second data dictionary.

Preferably, the data dictionary generating device further includes:

The fourth acquiring module is configured to acquire fourth data to be stored, where the fourth data to be stored includes L sample character strings and a sample frequency value corresponding to each sample character string;

The fifth processing module is used to process each sample string using the double-array dictionary tree algorithm to obtain the sixth index value of each sample string;

The second writing module is used to write each sample string and the corresponding sixth index value into the preset array to obtain the storage array;

The fourth mapping storage module is used for mapping and storing each sixth index value and the corresponding sample frequency value to generate a fourth data dictionary.

For the specific definition of the data dictionary generating device, please refer to the above definition of the data dictionary generating method, which will not be repeated here. Each module in the above-mentioned data dictionary generating device can be implemented in whole or in part by software, hardware, and a combination thereof. The foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.

In one embodiment, as shown in FIG. 7, a data query method is provided. The method is applied to the server in FIG. 1 as an example for description, and includes the following steps:

S100: Obtain the first data to be queried, query the first data to be queried in a third data dictionary, and determine the index group to be queried of the first data to be queried, wherein the third data dictionary is generated using the data dictionary generation method of claim 1.

Among them, the first data to be queried refers to the 2gram pinyin node data to be queried. The first data to be queried is composed of a first pinyin node to be queried and a second pinyin node to be queried. For example: if the first data to be queried is GaoKong CaoZuo, then GaoKong is the first pinyin node to be queried, and CaoZuo is the second pinyin node to be queried. Specifically, the first data to be queried is matched against all 2gram pinyin nodes stored in the third data dictionary, and the target index group corresponding to the 2gram pinyin node that matches the first data to be queried is determined as the index group to be queried of the first data to be queried. Among them, the third data dictionary is obtained by using the above-mentioned data dictionary generation method.

S101: Query the storage array of the fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary used to store the sixth index values and the corresponding sample frequency values.

Specifically, in order to retrieve the corresponding character string from an index value, each sample character string and the corresponding sixth index value have already been written into the preset array in step S18 to obtain the storage array, that is, the storage array of the fourth data dictionary includes each sample character string and the corresponding sixth index value. Therefore, in this step, the index group to be queried is looked up in the storage array of the fourth data dictionary, and the sample character string corresponding to the sixth index value that matches the index group to be queried is determined as the target character string of the first data to be queried. Among them, the fourth data dictionary is obtained by using the above-mentioned data dictionary generating method.
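Steps S100 and S101 can be illustrated with a minimal sketch. It assumes that the third data dictionary maps a pair of pinyin nodes to a target index group of two sixth index values, and that the storage array is a list addressed by sixth index value; the function name `query_target_strings`, the tuple key format, and the sample data are hypothetical.

```python
def query_target_strings(query_key, third_dictionary, storage_array):
    """Resolve a 2gram pinyin query to its strings via the target index group."""
    index_group = third_dictionary.get(query_key)   # S100: index group to be queried
    if index_group is None:
        return None                                 # no matching 2gram pinyin node
    # S101: look each index value up in the storage array.
    return [storage_array[index] for index in index_group]


# Assumed sample data: sixth index 0 -> "high altitude", 1 -> "operation".
storage_array = ["high altitude", "operation"]
third_dictionary = {("GaoKong", "CaoZuo"): (0, 1)}

result = query_target_strings(("GaoKong", "CaoZuo"), third_dictionary, storage_array)
print(result)  # ['high altitude', 'operation']
```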

In this embodiment, the first data to be queried is acquired, the first data to be queried is queried in a third data dictionary, and the index group to be queried of the first data to be queried is determined, wherein the third data dictionary is obtained by using the data dictionary generation method of claim 1; and the storage array of the fourth data dictionary is queried based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary used to store the sixth index values and the corresponding sample frequency values; thereby ensuring the accuracy of the data query.

In an embodiment, as shown in FIG. 8, the data query method further specifically includes the following steps:

S110: Obtain the second data to be queried, query the second data to be queried in the offset array set of the first data dictionary, and determine the target offset array of the second data to be queried, wherein the first data dictionary is obtained by using the data dictionary generating method of claim 2.

Among them, the second data to be queried refers to the 1gram pinyin node data to be queried. For example, the second data to be queried may be the 1gram pinyin node data GaoKong, CaoZuo or GaiXing. Specifically, the offset array set of the first data dictionary includes the offset arrays of several third pinyin nodes. Therefore, the second data to be queried is matched against the third pinyin node of each offset array in the offset array set of the first data dictionary, and the offset array corresponding to the third pinyin node that matches the second data to be queried is determined as the target offset array of the second data to be queried. Among them, the first data dictionary is obtained by using the above-mentioned data dictionary generating method.

S111: Obtain the target starting index position in the target offset array, and based on the target starting index position, perform a query in the first target index array of the first data dictionary to determine the target index data of the second data to be queried.

From step S26, it can be seen that the node identifier of each third pinyin node and the corresponding starting index position are recorded in the offset array set. Therefore, the starting index position in the target offset array is determined as the target starting index position. Specifically, after the target starting index position is determined, a query is performed in the first target index array, and the index values located from the start position to the end position given by the target starting index position are determined as the target index data of the second data to be queried.

S112: Query in the storage array based on the target index data to obtain the target character string of the second data to be queried.

Specifically, the storage array is queried based on the target index data to obtain the target character string of the second data to be queried. It should be noted that the specific method and process of querying the storage array based on the target index data in this step are similar to those of querying the storage array based on the index group to be queried to obtain the target character string of the first data to be queried in step S101, and are not repeated here.
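Steps S110 to S112 can be sketched as a single lookup chain. The sketch assumes that a plain dict stands in for the double-array dictionary tree that yields node identifiers, that the offset array set is a mapping from node identifier to an inclusive (start, end) position pair, and that the first target index array holds integer sixth index values; the function name `query_1gram` and all sample values are hypothetical.

```python
def query_1gram(node, node_ids, offsets, index_array, storage_array):
    """Return every string stored for a 1gram pinyin node, or [] if the node is unknown."""
    node_id = node_ids.get(node)          # S110: find the target offset array
    if node_id is None:
        return []
    start, end = offsets[node_id]         # S111: target starting index position
    target_index_data = index_array[start:end + 1]   # end position is inclusive
    # S112: resolve each index value to its string through the storage array.
    return [storage_array[index] for index in target_index_data]


# Assumed sample data following the GaiXing / GaoKong example.
node_ids = {"GaiXing": 0, "GaoKong": 1}
offsets = {0: (0, 2), 1: (3, 5)}
index_array = [4, 0, 2, 5, 1, 3]
storage_array = ["s0", "s1", "s2", "s3", "s4", "s5"]

print(query_1gram("GaiXing", node_ids, offsets, index_array, storage_array))
# ['s4', 's0', 's2']
```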

In this embodiment, the second data to be queried is obtained, the second data to be queried is queried in the offset array set of the first data dictionary, and the target offset array of the second data to be queried is determined, where the first data dictionary is obtained by using the data dictionary generating method of claim 2; the target starting index position in the target offset array is obtained, and a query is performed in the first target index array of the first data dictionary based on the target starting index position to determine the target index data of the second data to be queried; and the storage array is queried based on the target index data to obtain the target character string of the second data to be queried. In this way, the accuracy of the data query is improved while the query efficiency is ensured.

In one embodiment, a data query device is provided, and the data query device corresponds to the data query method in the foregoing embodiment one-to-one. As shown in FIG. 9, the data query device includes a second query module 100 and a third query module 101. The detailed description of each functional module is as follows:

The second query module 100 is used to obtain the first data to be queried, query the first data to be queried in a third data dictionary, and determine the index group to be queried of the first data to be queried, wherein the third data dictionary is obtained by the above-mentioned data dictionary generation method;

The third query module 101 is configured to query the storage array of the fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary used to store the sixth index values and the corresponding sample frequency values.

Preferably, the data query device further includes:

The second determining module is used to obtain the second data to be queried, query the second data to be queried in the offset array set of the first data dictionary, and determine the target offset array of the second data to be queried, where the first data dictionary is obtained by the above-mentioned data dictionary generation method;

The fourth query module is used to obtain the target starting index position in the target offset array, and based on the target starting index position, query in the first target index array of the first data dictionary to determine the target index data of the second data to be queried;

The fifth query module is used to query the storage array based on the target index data to obtain the target character string of the second data to be queried.

For the specific limitation of the data query device, please refer to the above limitation on the data query method, which will not be repeated here. Each module in the above-mentioned data query device can be implemented in whole or in part by software, hardware, and a combination thereof. The foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.

In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure diagram may be as shown in FIG. 10. The computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer readable instructions, and a database. The internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium. The database of the computer equipment is used to store the data used in the data dictionary generating method and the data query method in the foregoing embodiments. The network interface of the computer device is used to communicate with an external terminal through a network connection. The computer-readable instruction is executed by the processor to implement a data dictionary generation method, or the computer-readable instruction is executed by the processor to implement a data query method. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.

In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: acquiring first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;

In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor. When the processor executes the computer-readable instructions, the following steps are implemented:

In one embodiment, one or more readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage media store computer-readable instructions, and when the computer-readable instructions are executed by one or more processors, the one or more processors implement the following steps:

A person of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions. The computer-readable instructions can be stored in a non-volatile readable storage medium or a volatile readable storage medium, and when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments. Any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. As an illustration and not a limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the division of the above-mentioned functional units and modules is used as an example. In practical applications, the above-mentioned functions can be allocated to different functional units and modules as required, that is, the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above.

The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments can still be modified, or some of their technical features can be equivalently replaced; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and should all be included within the scope of protection of the present application.

Claims

A method for generating a data dictionary, including:

Acquiring first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;

Querying in a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is an index sequence of the first pinyin node, and the second index sequence is an index sequence of the second pinyin node;

Processing the first index sequence and the second index sequence using a CSR method to obtain a candidate index group;

Querying the candidate frequency value of each candidate index group in a preset second data dictionary, and filtering out, from the candidate index groups, the target index group whose candidate frequency value meets a preset requirement;

Mapping and storing the first data to be stored and the target index group to generate a third data dictionary.
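The steps of this claim can be illustrated with a minimal sketch. It is not the implementation of the application: the names `FIRST_DICT`, `ROW_PTR`, `COL_IDX`, `FREQ`, `pair_frequency` and `build_third_dictionary` are hypothetical, the candidate index groups are assumed to be all pairs drawn from the two index sequences, and the preset requirement is assumed to be a minimum frequency threshold.

```python
from itertools import product

# Hypothetical first data dictionary: pinyin node -> index sequence.
FIRST_DICT = {"ping": [0, 1], "an": [2, 3]}

# Hypothetical second data dictionary held in CSR form.
# Row r owns the entries COL_IDX[ROW_PTR[r]:ROW_PTR[r + 1]].
ROW_PTR = [0, 1, 2, 2, 2]
COL_IDX = [2, 3]
FREQ = [50, 3]


def pair_frequency(i, j):
    """Return the stored frequency of the index pair (i, j), or 0 if absent."""
    for k in range(ROW_PTR[i], ROW_PTR[i + 1]):
        if COL_IDX[k] == j:
            return FREQ[k]
    return 0


def build_third_dictionary(key, first_node, second_node, min_freq):
    """Map the data to be stored (key) to its target index groups."""
    seq1 = FIRST_DICT[first_node]
    seq2 = FIRST_DICT[second_node]
    # Assumption: every pair of indices is a candidate index group.
    candidates = product(seq1, seq2)
    # Assumption: the preset requirement is a minimum frequency.
    targets = [(i, j) for i, j in candidates if pair_frequency(i, j) >= min_freq]
    return {key: targets}


third = build_third_dictionary("ping'an", "ping", "an", min_freq=10)
```

In this sketch only the pair (0, 2) has a stored frequency above the threshold, so it is the only group kept for the key.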
The data dictionary generating method according to claim 1, wherein, before querying in a preset first data dictionary based on the first pinyin node and the second pinyin node, the data dictionary generating method further includes:

Acquiring second data to be stored, where the second data to be stored includes N third pinyin nodes and M character strings corresponding to each of the third pinyin nodes;

Using a double-array dictionary tree algorithm to process each of the character strings of each of the third pinyin nodes, and determine the index value set corresponding to each of the third pinyin nodes;

Writing the index value set corresponding to each of the third pinyin nodes into a preset first index array to obtain a first target index array;

Determining the starting index position of each of the third pinyin nodes from the first target index array;

Processing each of the third pinyin nodes by using a double-array dictionary tree algorithm to obtain the node identifier of each of the third pinyin nodes;

Mapping and storing the node identifier of each of the third pinyin nodes and the corresponding starting index position to generate an offset array set;

Combining the first target index array and the offset array set to generate a first data dictionary.
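The layout of the first data dictionary described in this claim can be sketched as a flat index array plus an offset array. The sketch below is illustrative only: plain Python dictionaries stand in for the double-array dictionary tree, the function names are hypothetical, and the trailing sentinel offset is an added assumption that makes the end of the last node easy to find.

```python
def build_first_dictionary(node_to_strings):
    """Build (node_ids, index_array, offsets) from {pinyin node: [strings]}."""
    string_ids = {}   # stand-in for the trie: character string -> index value
    node_ids = {}     # stand-in for the trie: pinyin node -> node identifier
    index_array = []  # first target index array (flat)
    offsets = []      # offsets[node_id] = starting index position
    for node, strings in node_to_strings.items():
        node_ids[node] = len(node_ids)
        offsets.append(len(index_array))
        for s in strings:
            index_array.append(string_ids.setdefault(s, len(string_ids)))
    # Assumption: a sentinel marks the end of the last node's index values.
    offsets.append(len(index_array))
    return node_ids, index_array, offsets


def index_sequence(node, node_ids, index_array, offsets):
    """Return the index sequence stored for one pinyin node."""
    nid = node_ids[node]
    return index_array[offsets[nid]:offsets[nid + 1]]


node_ids, index_array, offsets = build_first_dictionary(
    {"ping": ["平", "苹"], "an": ["安", "按", "案"]}
)
```

Because each node stores only a starting position, the index values of all nodes share one contiguous array instead of one list per node.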
The data dictionary generating method according to claim 1, wherein, before querying the candidate frequency value of each candidate index group in the preset second data dictionary, the data dictionary generating method further comprises:

Acquiring third data to be stored, where the third data to be stored includes a fourth pinyin byte, a fifth pinyin byte, and a target frequency value;

Processing the fourth pinyin byte and the fifth pinyin byte using a double-array dictionary tree algorithm to obtain a fourth index value and a fifth index value, wherein the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte;

Mapping and storing the fourth index value, the fifth index value, and the target frequency value using a CSR method to generate the second data dictionary.
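As an illustration of the CSR (compressed sparse row) storage named in this claim, the following sketch stores (fourth index value, fifth index value, target frequency value) triples in three arrays. The function names `build_csr` and `lookup` and the sample triples are hypothetical, and the index values are assumed to have already been produced by the double-array dictionary tree.

```python
def build_csr(triples, n_rows):
    """Build CSR arrays (row_ptr, col_idx, freq) from (row, col, freq) triples."""
    row_ptr = [0] * (n_rows + 1)
    col_idx, freq = [], []
    for r, c, f in sorted(triples):
        row_ptr[r + 1] += 1  # count the entries of each row
        col_idx.append(c)
        freq.append(f)
    for r in range(n_rows):
        row_ptr[r + 1] += row_ptr[r]  # prefix sum turns counts into offsets
    return row_ptr, col_idx, freq


def lookup(row_ptr, col_idx, freq, r, c):
    """Return the frequency stored for (r, c), or 0 if the pair is absent."""
    for k in range(row_ptr[r], row_ptr[r + 1]):
        if col_idx[k] == c:
            return freq[k]
    return 0


row_ptr, col_idx, freq = build_csr([(1, 3, 3), (0, 2, 50), (0, 3, 7)], n_rows=4)
```

Only index pairs that actually occur consume storage, which is the reason a sparse layout suits a frequency table in which most pairs never co-occur.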
The data dictionary generating method according to claim 1, wherein the data dictionary generating method further comprises:

Acquiring fourth to-be-stored data, where the fourth to-be-stored data includes L sample character strings and sample frequency values corresponding to each of the sample character strings;

Processing each of the sample character strings using a double-array dictionary tree algorithm to obtain the sixth index value of each of the sample character strings;

Writing each of the sample character strings and the corresponding sixth index value into a preset array to obtain a storage array;

Mapping and storing each of the sixth index values and the corresponding sample frequency value to generate a fourth data dictionary.
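A minimal sketch of the fourth data dictionary follows. It assumes, for illustration only, that the sixth index value of a sample character string is simply its position in the storage array; in the claim this value is produced by the double-array dictionary tree. The name `build_fourth_dictionary` and the sample data are hypothetical.

```python
def build_fourth_dictionary(samples):
    """samples: list of (character string, frequency) pairs."""
    storage_array = []  # sixth index value -> sample character string
    freq_by_index = {}  # sixth index value -> sample frequency value
    for text, freq in samples:
        index = len(storage_array)  # assumption: index equals array position
        storage_array.append(text)
        freq_by_index[index] = freq
    return storage_array, freq_by_index


storage_array, freq_by_index = build_fourth_dictionary([("平安", 120), ("科技", 95)])
```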
A data query method, which includes:

Obtaining first data to be queried, querying the first data to be queried in a third data dictionary, and determining the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by using the data dictionary generating method of claim 1;

Querying in the storage array of a fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary storing the sixth index value and the corresponding sample frequency value.
The data query method according to claim 5, wherein the data query method further comprises:

Obtaining second data to be queried, querying the second data to be queried in the offset array set of the first data dictionary, and determining the target offset array of the second data to be queried, wherein the first data dictionary is obtained by using the data dictionary generating method of claim 2;

Obtaining the target starting index position in the target offset array, and querying in the first target index array of the first data dictionary based on the target starting index position to determine the target index data of the second data to be queried;

Querying in the storage array based on the target index data to obtain the target character string of the second data to be queried.
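The two query paths of the data query method can be sketched together. All constants and function names below are hypothetical sample data, and the sketch assumes that an index group is resolved by concatenating the strings at its indices in the storage array.

```python
# Hypothetical dictionaries, laid out as in the generation claims.
STORAGE_ARRAY = ["平", "苹", "安", "按", "案"]  # storage array
NODE_IDS = {"ping": 0, "an": 1}                 # pinyin node -> node identifier
OFFSETS = [0, 2, 5]                             # starting index positions plus sentinel
INDEX_ARRAY = [0, 1, 2, 3, 4]                   # first target index array
THIRD_DICT = {"ping'an": [(0, 2)]}              # data -> target index groups


def query_group(key):
    """Third-dictionary path: resolve each stored index group to a string."""
    groups = THIRD_DICT.get(key, [])
    return ["".join(STORAGE_ARRAY[i] for i in group) for group in groups]


def query_node(node):
    """First-dictionary path: offset array -> index array -> storage array."""
    nid = NODE_IDS[node]
    start, end = OFFSETS[nid], OFFSETS[nid + 1]
    return [STORAGE_ARRAY[i] for i in INDEX_ARRAY[start:end]]
```

With this sample data, `query_group("ping'an")` resolves the stored group (0, 2) through the storage array, while `query_node("an")` walks from the offset array through the index array to the storage array.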
A data dictionary generating device, which includes:

The first obtaining module is configured to obtain first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;

The first query module is configured to query in a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is an index sequence of the first pinyin node, and the second index sequence is an index sequence of the second pinyin node;

The first processing module is configured to process the first index sequence and the second index sequence by using a CSR method to obtain a candidate index group;

The first screening module is configured to query the candidate frequency value of each candidate index group in a preset second data dictionary, and filter out, from the candidate index groups, the target index group whose candidate frequency value meets a preset requirement;

The first mapping storage module is configured to map and store the first data to be stored and the target index group to generate a third data dictionary.
A data query device, which includes:

The second query module is configured to obtain first data to be queried, query the first data to be queried in a third data dictionary, and determine the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by using the data dictionary generating method of claim 1;

The third query module is configured to query in the storage array of a fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary storing the sixth index value and the corresponding sample frequency value.
A computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, wherein the processor implements the following steps when the processor executes the computer-readable instructions:

Acquiring first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;

Querying in a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is an index sequence of the first pinyin node, and the second index sequence is an index sequence of the second pinyin node;

Processing the first index sequence and the second index sequence using a CSR method to obtain a candidate index group;

Querying the candidate frequency value of each candidate index group in a preset second data dictionary, and filtering out, from the candidate index groups, the target index group whose candidate frequency value meets a preset requirement;

Mapping and storing the first data to be stored and the target index group to generate a third data dictionary.
The computer device of claim 9, wherein, before querying in a preset first data dictionary based on the first pinyin node and the second pinyin node, the processor further implements the following steps when executing the computer-readable instructions:

Acquiring second data to be stored, where the second data to be stored includes N third pinyin nodes and M character strings corresponding to each of the third pinyin nodes;

Using a double-array dictionary tree algorithm to process each of the character strings of each of the third pinyin nodes, and determine the index value set corresponding to each of the third pinyin nodes;

Writing the index value set corresponding to each of the third pinyin nodes into a preset first index array to obtain a first target index array;

Determining the starting index position of each of the third pinyin nodes from the first target index array;

Processing each of the third pinyin nodes by using a double-array dictionary tree algorithm to obtain the node identifier of each of the third pinyin nodes;

Mapping and storing the node identifier of each of the third pinyin nodes and the corresponding starting index position to generate an offset array set;

Combining the first target index array and the offset array set to generate a first data dictionary.
The computer device according to claim 9, wherein, before the candidate frequency value of each candidate index group is queried in the preset second data dictionary, the processor further implements the following steps when executing the computer-readable instructions:

Acquiring third data to be stored, where the third data to be stored includes a fourth pinyin byte, a fifth pinyin byte, and a target frequency value;

Processing the fourth pinyin byte and the fifth pinyin byte using a double-array dictionary tree algorithm to obtain a fourth index value and a fifth index value, wherein the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte;

Mapping and storing the fourth index value, the fifth index value, and the target frequency value using a CSR method to generate the second data dictionary.
The computer device of claim 9, wherein the processor further implements the following steps when executing the computer-readable instructions:

Acquiring fourth to-be-stored data, where the fourth to-be-stored data includes L sample character strings and sample frequency values corresponding to each of the sample character strings;

Processing each of the sample character strings using a double-array dictionary tree algorithm to obtain the sixth index value of each of the sample character strings;

Writing each of the sample character strings and the corresponding sixth index value into a preset array to obtain a storage array;

Mapping and storing each of the sixth index values and the corresponding sample frequency value to generate a fourth data dictionary.
A computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, wherein the processor implements the following steps when the processor executes the computer-readable instructions:

Obtaining first data to be queried, querying the first data to be queried in a third data dictionary, and determining the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by using the data dictionary generating method of claim 1;

Querying in the storage array of a fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary storing the sixth index value and the corresponding sample frequency value.
The computer device according to claim 13, wherein the processor further implements the following steps when executing the computer readable instruction:

Obtaining second data to be queried, querying the second data to be queried in the offset array set of the first data dictionary, and determining the target offset array of the second data to be queried, wherein the first data dictionary is obtained by using the data dictionary generating method of claim 2;

Obtaining the target starting index position in the target offset array, and querying in the first target index array of the first data dictionary based on the target starting index position to determine the target index data of the second data to be queried;

Querying in the storage array based on the target index data to obtain the target character string of the second data to be queried.
One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:

Acquiring first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;

Querying in a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is an index sequence of the first pinyin node, and the second index sequence is an index sequence of the second pinyin node;

Processing the first index sequence and the second index sequence by using a CSR method to obtain a candidate index group;

Querying the candidate frequency value of each candidate index group in a preset second data dictionary, and filtering out, from the candidate index groups, the target index group whose candidate frequency value meets a preset requirement;

Mapping and storing the first data to be stored and the target index group to generate a third data dictionary.
The readable storage medium of claim 15, wherein, before querying in a preset first data dictionary based on the first pinyin node and the second pinyin node, when the computer-readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Acquiring second data to be stored, where the second data to be stored includes N third pinyin nodes and M character strings corresponding to each of the third pinyin nodes;

Using a double-array dictionary tree algorithm to process each of the character strings of each of the third pinyin nodes, and determine the index value set corresponding to each of the third pinyin nodes;

Writing the index value set corresponding to each of the third pinyin nodes into a preset first index array to obtain a first target index array;

Determining the starting index position of each of the third pinyin nodes from the first target index array;

Processing each of the third pinyin nodes by using a double-array dictionary tree algorithm to obtain the node identifier of each of the third pinyin nodes;

Mapping and storing the node identifier of each of the third pinyin nodes and the corresponding starting index position to generate an offset array set;

Combining the first target index array and the offset array set to generate a first data dictionary.
The readable storage medium according to claim 15, wherein, before the candidate frequency value of each candidate index group is queried in the preset second data dictionary, when the computer-readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Acquiring third data to be stored, where the third data to be stored includes a fourth pinyin byte, a fifth pinyin byte, and a target frequency value;

Processing the fourth pinyin byte and the fifth pinyin byte using a double-array dictionary tree algorithm to obtain a fourth index value and a fifth index value, wherein the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte;

Mapping and storing the fourth index value, the fifth index value, and the target frequency value using a CSR method to generate the second data dictionary.
The readable storage medium of claim 15, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Acquiring fourth to-be-stored data, where the fourth to-be-stored data includes L sample character strings and sample frequency values corresponding to each of the sample character strings;

Processing each of the sample character strings using a double-array dictionary tree algorithm to obtain the sixth index value of each of the sample character strings;

Writing each of the sample character strings and the corresponding sixth index value into a preset array to obtain a storage array;

Mapping and storing each of the sixth index values and the corresponding sample frequency value to generate a fourth data dictionary.
One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:

Obtaining first data to be queried, querying the first data to be queried in a third data dictionary, and determining the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by using the data dictionary generating method of claim 1;

Querying in the storage array of a fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary storing the sixth index value and the corresponding sample frequency value.
The readable storage medium according to claim 19, wherein when the computer-readable instructions are executed by one or more processors, the one or more processors further execute the following steps:

Obtaining second data to be queried, querying the second data to be queried in the offset array set of the first data dictionary, and determining the target offset array of the second data to be queried, wherein the first data dictionary is obtained by using the data dictionary generating method of claim 2;

Obtaining the target starting index position in the target offset array, and querying in the first target index array of the first data dictionary based on the target starting index position to determine the target index data of the second data to be queried;

Querying in the storage array based on the target index data to obtain the target character string of the second data to be queried.