WO2021258848A1 - Data dictionary generation method and apparatus, data query method and apparatus, and device and medium - Google Patents

Data dictionary generation method and apparatus, data query method and apparatus, and device and medium

Info

Publication number
WO2021258848A1
WO2021258848A1 PCT/CN2021/090528 CN2021090528W WO2021258848A1 WO 2021258848 A1 WO2021258848 A1 WO 2021258848A1 CN 2021090528 W CN2021090528 W CN 2021090528W WO 2021258848 A1 WO2021258848 A1 WO 2021258848A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
index
pinyin
dictionary
array
Prior art date
Application number
PCT/CN2021/090528
Other languages
French (fr)
Chinese (zh)
Inventor
刘东煜
陈乐清
曾增烽
李炫�
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021258848A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries

Definitions

  • This application relates to the field of cloud storage, and in particular to a data dictionary generation method, data query method, device, equipment, and medium.
  • a data dictionary is usually used to store data.
  • when the algorithm loads the dictionary, it must not only load it as four HashMaps but also save the one-to-one mapping relationships in the dictionary separately. This traditional data dictionary storage method therefore often results in considerable information redundancy and wasted space.
  • the embodiments of the present application provide a data dictionary generation method, device, computer equipment, and storage medium to solve the problem of information redundancy during data storage.
  • the embodiments of the present application provide a data query method, device, computer equipment, and storage medium to solve the problem of low efficiency of data query.
  • a method for generating a data dictionary including:
  • acquiring first data to be stored, wherein the first data to be stored includes a first pinyin node and a second pinyin node;
  • based on the first pinyin node and the second pinyin node, querying a preset first data dictionary to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
  • the data to be stored and the target index group are mapped and stored to generate a third data dictionary.
  • a data query method including:
  • the target character string of the first data to be queried is obtained, wherein the fourth data dictionary is a word frequency dictionary that stores the sixth index value and the corresponding sample frequency value.
  • a data dictionary generating device includes:
  • the first obtaining module is configured to obtain first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;
  • the first query module is configured to query a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
  • the first processing module is configured to process the first index sequence and the second index sequence by using a CSR method to obtain candidate index groups;
  • the first screening module is configured to query the candidate frequency value of each candidate index group in a preset second data dictionary, and to filter out, from the candidate index groups, the target index group whose candidate frequency value meets the preset requirements;
  • the first mapping storage module is used for mapping and storing the data to be stored and the target index group to generate a third data dictionary.
  • a data query device includes:
  • the second query module is used to obtain the first data to be queried, query the first data to be queried in a third data dictionary, and determine the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by using the data dictionary generating method of claim 1;
  • the third query module is configured to query the storage array of the fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary that stores the sixth index value and the corresponding sample frequency value.
  • a computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, wherein the processor implements the following steps when the processor executes the computer-readable instructions:
  • acquiring first data to be stored, wherein the first data to be stored includes a first pinyin node and a second pinyin node;
  • based on the first pinyin node and the second pinyin node, querying a preset first data dictionary to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
  • the data to be stored and the target index group are mapped and stored to generate a third data dictionary.
  • a computer device includes a memory, a processor, and computer-readable instructions that are stored in the memory and can run on the processor, wherein the processor implements the following steps when the processor executes the computer-readable instructions:
  • the target character string of the first data to be queried is obtained, wherein the fourth data dictionary is a word frequency dictionary that stores the sixth index value and the corresponding sample frequency value.
  • One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • acquiring first data to be stored, wherein the first data to be stored includes a first pinyin node and a second pinyin node;
  • based on the first pinyin node and the second pinyin node, querying a preset first data dictionary to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
  • the data to be stored and the target index group are mapped and stored to generate a third data dictionary.
  • One or more readable storage media storing computer readable instructions, where when the computer readable instructions are executed by one or more processors, the one or more processors execute the following steps:
  • the target character string of the first data to be queried is obtained, wherein the fourth data dictionary is a word frequency dictionary that stores the sixth index value and the corresponding sample frequency value.
  • with the above-mentioned data dictionary generation method, device, computer equipment and storage medium, the first data to be stored is acquired, where the first data to be stored includes the first pinyin node and the second pinyin node; based on the first pinyin node and the second pinyin node, the preset first data dictionary is queried to determine the first index sequence and the second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
  • the CSR method is used to process the first index sequence and the second index sequence to obtain candidate index groups; the candidate frequency value of each candidate index group is queried in the preset second data dictionary, and the target index group whose candidate frequency value meets the preset requirements is filtered out from the candidate index groups;
  • the data to be stored and the target index group are mapped and stored to generate a third data dictionary; the third data dictionary is restored by combining the first data dictionary and the second data dictionary, thereby saving data storage space.
  • the first data to be stored is stored in the form of a double-array dictionary tree, that is, the first pinyin node and the second pinyin node are converted into indexes for storage, thereby reducing the redundancy of data storage.
  • with the above-mentioned data query method, device, computer equipment and storage medium, the first data to be queried is acquired and queried in a third data dictionary to determine the index group to be queried for the first data to be queried, where the third data dictionary is obtained by using the data dictionary generating method of claim 1;
  • based on the index group to be queried, the storage array of the fourth data dictionary is queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary that stores the sixth index value and the corresponding sample frequency value; this ensures the accuracy of data query.
  • FIG. 1 is a schematic diagram of an application environment of a data dictionary generation method and a data query method in an embodiment of the present application.
  • FIG. 2 is an example diagram of a method for generating a data dictionary in an embodiment of the present application.
  • FIG. 3 is another example diagram of a method for generating a data dictionary in an embodiment of the present application.
  • FIG. 4 is another example diagram of a method for generating a data dictionary in an embodiment of the present application.
  • FIG. 5 is another example diagram of a method for generating a data dictionary in an embodiment of the present application.
  • FIG. 6 is a functional block diagram of a data dictionary generating device in an embodiment of the present application.
  • FIG. 7 is an example diagram of a data query method in an embodiment of the present application.
  • FIG. 8 is another example diagram of a data query method in an embodiment of the present application.
  • FIG. 9 is a functional block diagram of a data query device in an embodiment of the present application.
  • FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
  • the data dictionary generation method provided by the embodiment of the present application can be applied to the application environment as shown in FIG. 1.
  • the data dictionary generation method is applied in a data dictionary generation system.
  • the data dictionary generation system includes a client and a server as shown in FIG. 1, and is used to solve the problem of information redundancy during data storage.
  • the client refers to the program that corresponds to the server and provides local services to the user.
  • the client can be installed on, but not limited to, various personal computers, notebook computers, smart phones, tablet computers, and portable wearable devices.
  • the server can be implemented with a standalone server or a server cluster composed of multiple servers.
  • a method for generating a data dictionary is provided. Taking the method applied to the server in FIG. 1 as an example, the method includes the following steps:
  • S11 Acquire first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node.
  • the first data to be stored refers to the 2gram pinyin data to be stored.
  • the first data to be stored may be GaoKong CaoZuo, YanJing She or KaiKai XinXin.
  • the first data to be stored includes pinyin data of two nodes, namely the first pinyin node and the second pinyin node.
  • the first pinyin node refers to the pinyin data of the first 1gram in the first data to be stored.
  • the second pinyin node refers to the pinyin corresponding to the second 1gram in the first data to be stored.
  • the first pinyin node and the second pinyin node may be the same or different.
  • for example, if the first data to be stored is GaoKong CaoZuo, the first pinyin node is GaoKong and the second pinyin node is CaoZuo.
  • the first data to be stored can be obtained by collecting 2gram pinyin data in real time as the first data to be stored; or directly obtaining 2gram pinyin data from the pinyin dictionary database as the first data to be stored.
  • S12 Based on the first pinyin node and the second pinyin node, query the preset first data dictionary to determine the first index sequence and the second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node.
  • the first data dictionary refers to a 1gram homophone dictionary generated in advance for storing 1gram pinyin-homonym data. Specifically, a number of 1gram pinyin nodes and an index sequence corresponding to each 1gram pinyin node are stored in the first data dictionary.
  • for example, the first data dictionary stores 1gram pinyin-homophone data whose key is GaiXing and whose value is [index1, index2, index3, index4...].
  • GaiXing is the 1gram pinyin node; [index1, index2, index3, index4...] are the indexes of the character strings corresponding to the 1gram pinyin node GaiXing.
  • the character strings whose pinyin is GaiXing may include [modified, modified, changed surname, this new...]; processing these character strings with the double-array dictionary tree algorithm yields the index sequence [index1, index2, index3, index4...] corresponding to GaiXing. It should be noted that indexes are assigned per character string, and the index value corresponding to each character string is uniquely determined.
  • the first pinyin node and the second pinyin node are each matched against all 1gram pinyin nodes (keys) in the first data dictionary; the index sequence corresponding to the 1gram pinyin node that matches the first pinyin node is determined as the first index sequence, and the index sequence corresponding to the 1gram pinyin node that matches the second pinyin node is determined as the second index sequence.
  • the first index sequence may be expressed as preIndex, which is expressed as the index sequence of the first pinyin node
  • the second index sequence may be expressed as sufIndex, which is expressed as the index sequence of the second pinyin node.
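The lookup in step S12 can be sketched as follows. This is a minimal illustration only: a plain Python dict stands in for the first data dictionary, and the integer index values and the helper name `lookup_index_sequences` are assumptions, not part of the application.

```python
# Hypothetical 1gram homophone dictionary: pinyin node -> index sequence.
# The integer index values are invented for illustration.
first_data_dictionary = {
    "GaoKong": [1, 2],
    "CaoZuo": [3, 4],
    "GaiXing": [5, 6, 7],
}


def lookup_index_sequences(first_data, dictionary):
    """Split 2gram pinyin data into its two nodes and look up each node."""
    first_node, second_node = first_data.split()
    pre_index = dictionary.get(first_node, [])   # first index sequence
    suf_index = dictionary.get(second_node, [])  # second index sequence
    return pre_index, suf_index


pre_index, suf_index = lookup_index_sequences("GaoKong CaoZuo", first_data_dictionary)
print(pre_index, suf_index)
```

A node that is absent from the dictionary yields an empty sequence here; how the method treats unmatched nodes is not stated in the text.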
  • the CSR method is a sparse matrix storage method.
  • the average number of bytes (Bytes per Nonzero Entry) used by non-zero elements in the CSR format is the most stable when storing a sparse matrix.
  • CSR mainly includes three types of data: row vector, column vector, and value vector.
  • the row vector is indexed by row, and each of its elements gives the offset, within the value vector, of the first non-zero value in that row; the column vector (column indices) gives the column of each non-zero element; the value vector (values) holds the corresponding non-zero element values.
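To make the three CSR arrays concrete, here is a small sketch that converts a dense matrix into row offsets, column indices, and values. The function name `to_csr`, the sample matrix, and the common convention of appending one trailing offset (so that each row's slice has an explicit end) are assumptions for illustration.

```python
def to_csr(dense):
    """Convert a dense row-major matrix (list of lists) to CSR arrays."""
    row_offsets = [0]  # row_offsets[i] is where row i starts in `values`
    col_indices = []   # column of each non-zero element
    values = []        # the non-zero elements themselves
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                col_indices.append(col)
                values.append(value)
        # Trailing offset marks the end of the row just processed.
        row_offsets.append(len(values))
    return row_offsets, col_indices, values


dense = [
    [0, 0, 40, 0],
    [0, 0, 0, 20],
    [0, 0, 0, 0],
]
row_offsets, col_indices, values = to_csr(dense)
print(row_offsets, col_indices, values)
```

Only the non-zero cells are stored, which is why the format suits a mostly empty matrix of index pairs.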
  • the candidate index group refers to an index group obtained by randomly combining any index value in the first index sequence and any index value in the second index sequence.
  • a candidate index group is composed of two index values.
  • the candidate index group may be Index1-index3, Index2-index3, or Index3-index5.
  • the first index sequence is taken as the rows of the matrix and the second index sequence as the columns; the row vector and the column vector of the CSR method are then used to determine the column index array of each row corresponding to the first index sequence, and that column index array is intersected with the second index sequence to obtain the candidate index groups.
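One possible reading of this row-and-intersect step is sketched below. The CSR arrays, the index values, and the function name `candidate_index_groups` are hypothetical; the sketch only shows how a row's column index array can be sliced out and intersected with the second index sequence.

```python
def candidate_index_groups(pre_index, suf_index, row_offsets, col_indices):
    """Pair each row index with the columns it shares with suf_index."""
    suf = set(suf_index)
    groups = []
    for row in pre_index:
        # Column index array of this row, read from the CSR arrays.
        start, end = row_offsets[row], row_offsets[row + 1]
        row_columns = col_indices[start:end]
        # Intersect with the second index sequence.
        groups.extend((row, col) for col in row_columns if col in suf)
    return groups


# Hypothetical CSR arrays for a 5x5 matrix whose non-zero cells are
# (1, 3), (2, 0) and (2, 4).
row_offsets = [0, 0, 1, 3, 3, 3]
col_indices = [3, 0, 4]

groups = candidate_index_groups([1, 2], [3, 4], row_offsets, col_indices)
print(groups)
```

Column 0 of row 2 is dropped because it is not in the second index sequence.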
  • S14 Query the candidate frequency value of each candidate index group in the preset second data dictionary, and filter out the target index group whose candidate frequency value meets the preset requirement from the candidate index group.
  • the second data dictionary refers to a pre-generated word frequency dictionary library used to store the index value of 2gram character strings (words) and the frequency value corresponding to each 2gram character string (word).
  • a 2gram string refers to a phrase composed of two 1gram strings.
  • for example, a 2gram string may be "high-altitude operation", "patent analysis", or "happy".
  • the index group corresponding to several 2gram character strings (words) and the frequency value corresponding to each 2gram character string (word) are stored in the second data dictionary.
  • the frequency value refers to the number of times a given 2gram character string (word) appears in the text. The frequency value is one of the most important reference indicators for ranking candidate words.
  • for example, the second data dictionary stores word frequency data whose key is the index group Index1-index3, corresponding to "high-altitude operation", and whose value is 45.
  • Index1 is the index value of "high-altitude", index3 is the index value of "operation", and 45 is the frequency value of "high-altitude operation".
  • the target index group refers to an index group whose frequency value meets a preset requirement.
  • each candidate index group is queried in the preset second data dictionary to determine the candidate frequency value of each candidate index group.
  • an index group whose candidate frequency value meets the preset requirements is screened out from the candidate index groups and used as the target index group.
  • a frequency threshold may be preset; the candidate frequency value of each candidate index group is then compared with the frequency threshold, and each candidate index group whose candidate frequency value is greater than the frequency threshold is determined to be a target index group that meets the preset requirement.
  • if the frequency threshold is set to 0, any candidate index group with a candidate frequency value greater than 0 is determined as a target index group; a candidate frequency value of 0 means that the 2gram character string (word) corresponding to the candidate index group does not exist.
  • in that case, the candidate frequency value of the candidate index group is directly determined not to meet the preset requirement, and the candidate index group is eliminated.
  • for example, the candidate index groups include Index1-index3 (high-altitude operation), Index2-index4 (high-altitude slot), Index1-index4 (high-altitude slot), and Index2-index3 (high-altitude operation);
  • the frequency value of Index1-index3 (high-altitude operation) is 40; the frequency value of Index2-index4 (high-altitude slot) is 20; the frequency value of Index1-index4 (high-altitude slot) is 0 (does not exist); the frequency value of Index2-index3 (high control operation) is 0 (does not exist); Index1-index3 and Index2-index4 are therefore determined as the target index groups.
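The screening in step S14 can be sketched with the numbers from the example above. A plain dict stands in for the second data dictionary, index groups are modelled as integer tuples, and treating a group that is missing from the dictionary as having frequency 0 is an assumption made for this illustration.

```python
# Hypothetical word frequency dictionary: index group -> frequency value.
second_data_dictionary = {
    (1, 3): 40,
    (2, 4): 20,
}


def select_target_groups(candidates, frequency_dictionary, threshold=0):
    """Keep the candidate index groups whose frequency exceeds the threshold."""
    targets = []
    for group in candidates:
        # Assumption: a group absent from the dictionary has frequency 0.
        frequency = frequency_dictionary.get(group, 0)
        if frequency > threshold:
            targets.append(group)
    return targets


candidates = [(1, 3), (2, 4), (1, 4), (2, 3)]
target_groups = select_target_groups(candidates, second_data_dictionary)
print(target_groups)
```

Raising `threshold` would additionally discard rare but existing 2gram strings.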
  • S15 Map and store the data to be stored and the target index group to generate a third data dictionary.
  • the third data dictionary refers to a 2gram homophone dictionary for storing 2gram pinyin-homonym data.
  • the third data dictionary includes several 2gram pinyin nodes and an index group sequence corresponding to each 2gram pinyin node.
  • for example, if the data to be stored is GaoKong CaoZuo and its corresponding target index groups are Index1-index3 and Index2-index4, then GaoKong CaoZuo is used as the key, and Index1-index3 and Index2-index4 are mapped and stored as the value to generate the third data dictionary.
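As a minimal sketch of the mapping in step S15, the 2gram pinyin data can be used as a key whose value is the list of target index groups. The dict-based storage and the function name `store_mapping` are assumptions; the application does not specify the container.

```python
def store_mapping(third_data_dictionary, data_to_store, target_groups):
    """Map the 2gram pinyin data to its target index groups."""
    third_data_dictionary[data_to_store] = list(target_groups)
    return third_data_dictionary


third_data_dictionary = store_mapping({}, "GaoKong CaoZuo", [(1, 3), (2, 4)])
print(third_data_dictionary["GaoKong CaoZuo"])
```

Because only index groups are stored, the character strings themselves live in one place and are recovered through the other dictionaries.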
  • in this embodiment, the first data to be stored is acquired, where the first data to be stored includes the first pinyin node and the second pinyin node; based on the first pinyin node and the second pinyin node, the preset first data dictionary is queried to determine the first index sequence and the second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node; the CSR method is used to process the first index sequence and the second index sequence to obtain candidate index groups; the candidate frequency value of each candidate index group is queried in the preset second data dictionary, and the target index group whose candidate frequency value meets the preset requirements is selected from the candidate index groups; the data to be stored and the target index group are mapped and stored to generate a third data dictionary; the third data dictionary is restored by combining the first data dictionary and the second data dictionary, thereby saving data storage space.
  • the first data to be stored is stored in the form of a double-array dictionary tree, that is, the first pinyin node and the second pinyin node are converted into indexes for storage, thereby reducing the redundancy of data storage.
  • before querying the preset first data dictionary based on the first pinyin node and the second pinyin node, the data dictionary generating method further includes the following steps:
  • the second data to be stored includes N third pinyin nodes and M character strings corresponding to each third pinyin node.
  • the second data to be stored refers to the 1gram pinyin-homonym data to be stored.
  • for example, the second data to be stored may be 1gram pinyin-homophone data whose key is GaiXing and whose value is [modified, changed surname, modified...], or whose key is GaoKong and whose value is the list of the corresponding homophones.
  • the third pinyin node refers to the key value in the second data to be stored.
  • for example, the third pinyin node may be GaiXing, GaoKong, or CaoZuo.
  • the value corresponding to each key value in the second to-be-stored data is the character string corresponding to each third pinyin node.
  • Each third pinyin node correspondingly includes at least one character string.
  • the string corresponding to the third pinyin node GaiXing includes [modified, changed surname, modified...].
  • the second data to be stored can be obtained by collecting 1gram pinyin-homophone data in real time, or by directly reading 1gram pinyin-homophone data from the pinyin-homophone dictionary database.
  • S22 Use the double-array dictionary tree algorithm to process each character string of each third pinyin node, and determine the index value set corresponding to each third pinyin node.
  • the double-array dictionary tree is an efficient indexing method.
  • in a dictionary tree, each node corresponds to a DFA state
  • each edge from a parent node to a child node corresponds to a DFA transition.
  • the traversal starts from the root node and reads the keyword from head to tail.
  • each character of the keyword determines the next state.
  • the edge labelled with the current character is selected for the move; each such move consumes one character from the keyword and descends to the next level of the tree. If the keyword is exhausted at a leaf node, the keyword has been recognized. If the traversal gets stuck, either because no edge is labelled with the current character or because the keyword is exhausted at an intermediate node, the keyword is not recognized by the trie.
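The traversal described above can be illustrated with a plain pointer-based trie. Note that this is not a double-array implementation: the double-array layout packs the same state transitions into two integer arrays, whereas the sketch below uses nested dicts purely to show the walk. The class and function names are assumptions, and a terminal flag generalizes the leaf-node test in the text.

```python
class TrieNode:
    def __init__(self):
        self.children = {}        # outgoing edges, keyed by character
        self.is_terminal = False  # True if a key ends at this node


def insert(root, key):
    """Add a key, creating one node (state) per character as needed."""
    node = root
    for ch in key:
        node = node.children.setdefault(ch, TrieNode())
    node.is_terminal = True


def recognizes(root, key):
    """Walk from the root, consuming one character per transition."""
    node = root
    for ch in key:
        node = node.children.get(ch)
        if node is None:  # stuck: no edge labelled with the current character
            return False
    # Key exhausted: accepted only if a stored key ends exactly here.
    return node.is_terminal


root = TrieNode()
for word in ("gao", "gai", "cao"):
    insert(root, word)

print(recognizes(root, "gao"), recognizes(root, "ga"), recognizes(root, "gaz"))
```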
  • the double-array dictionary tree algorithm is used to process each string corresponding to each third pinyin node, that is, each string corresponding to each third pinyin node is stored in the form of a double-array dictionary tree, thereby obtaining the index value set corresponding to each third pinyin node; thus, the indexes of all homophones of a pinyin node can be obtained through the pinyin node during data acquisition. It should be noted that each index value in the index value set corresponding to each third pinyin node is uniquely determined. Each string corresponds to a unique index value.
  • for example, if the character strings corresponding to the third pinyin node GaiXing are processed with the double-array dictionary tree algorithm, the index value set obtained for GaiXing is [index1, index2, index3...], where index1 is the index value corresponding to "modified", index2 is the index value corresponding to "changed surname", and index3 is the index value corresponding to "modified".
  • the first index array refers to a pre-established one-dimensional array used to record the index value set corresponding to each third pinyin node. Specifically, the index value set corresponding to each third pinyin node is written into the preset first index array to obtain the first target index array. For example, if the index value set corresponding to the third pinyin node GaiXing is [index1, index2, index3] and the index value set corresponding to the third pinyin node GaoKong is [index4, index5, index6], then after the index value sets corresponding to GaiXing and GaoKong are written into the preset first index array, the first target index array obtained is [index1, index2, index3, index4, index5, index6].
  • S24 Determine the starting index position of each third Pinyin node from the first target index array.
  • the array positions, in the first target index array, of the first index value and the last index value in the index value set corresponding to each third pinyin node are determined as the starting index position of that third pinyin node.
  • for example, suppose the first target index array is [index1, index2, index3, index4, index5, index6]. index1 and index3 are the first and last index values of the third pinyin node GaiXing; the array position of index1 in the first target index array is 0 and that of index3 is 2, so the starting index position of GaiXing is (0, 2). index4 and index6 are the first and last index values of the third pinyin node GaoKong; the array position of index4 is 3 and that of index6 is 5, so the starting index position of GaoKong is (3, 5).
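Writing the index value sets into one array and recording each node's (first, last) position can be sketched as below. The function name `build_index_array` and the use of strings such as "index1" as placeholder index values are assumptions; the sketch also assumes every node has at least one character string, as the text states.

```python
def build_index_array(index_value_sets):
    """Concatenate per-node index value sets and record each node's span.

    index_value_sets maps a pinyin node to its ordered list of index values.
    Returns the flat array and a dict: node -> (first position, last position).
    """
    target_index_array = []
    positions = {}
    for node, index_values in index_value_sets.items():
        start = len(target_index_array)
        target_index_array.extend(index_values)
        positions[node] = (start, len(target_index_array) - 1)
    return target_index_array, positions


index_value_sets = {
    "GaiXing": ["index1", "index2", "index3"],
    "GaoKong": ["index4", "index5", "index6"],
}
target_index_array, positions = build_index_array(index_value_sets)
print(target_index_array)
print(positions)
```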
  • S25 Use the double-array dictionary tree algorithm to process each third pinyin node to obtain the node identifier of each third pinyin node.
  • each third pinyin node is processed using the double-array dictionary tree algorithm, that is, each third pinyin node is stored in the form of a double-array dictionary tree, so as to obtain the node identifier corresponding to each third pinyin node. Understandably, the node identifier corresponding to each third pinyin node is uniquely determined.
  • the specific method and process of processing each third pinyin node using the double-array dictionary tree algorithm in this step are similar to those used in step S22 to process each character string of each third pinyin node, and are not repeated here.
  • S26 Map and store the node identifier of each third Pinyin node and the corresponding start index position to generate an offset array set.
  • the offset array set refers to a set composed of several offset arrays.
  • Each offset array includes a node identifier and a corresponding starting index position. Specifically, after the node identifier of each third pinyin node is determined, each node identifier and the corresponding starting index position are associated and stored to generate an offset array set.
  • for example, if the node identifier of the third pinyin node GaiXing is 0 and its corresponding starting index position is (0, 2), and the node identifier of the third pinyin node GaoKong is 1 and its corresponding starting index position is (3, 5), then node identifier 0 is mapped and stored with the starting index position (0, 2) to generate a first offset array, node identifier 1 is mapped and stored with the starting index position (3, 5) to generate a second offset array, and the first offset array and the second offset array form the offset array set.
  • the first data dictionary is a dictionary for storing 1gram homophones. Specifically, after the first target index array and the offset array set are determined, the first target index array and the offset array set are combined to generate the first data dictionary. Understandably, in the first data dictionary, each 1gram pinyin node is stored in the form of a node identifier, and the string corresponding to each 1gram pinyin node is stored in the form of an index, thereby reducing redundant information during data storage.
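Combining the first target index array with the offset array set, and then reading a node's index sequence back out, might look like the following sketch. The dict layout with keys "index_array" and "offsets", and both function names, are illustrative assumptions; node identifiers 0 and 1 follow the example above.

```python
def build_first_data_dictionary(target_index_array, node_ids, positions):
    """Pair each node identifier with its (first, last) span in the array."""
    offset_array_set = {node_ids[node]: span for node, span in positions.items()}
    return {"index_array": target_index_array, "offsets": offset_array_set}


def index_sequence_for(dictionary, node_id):
    """Recover the index sequence of a node from its stored span."""
    start, end = dictionary["offsets"][node_id]
    return dictionary["index_array"][start:end + 1]  # span is inclusive


first_data_dictionary = build_first_data_dictionary(
    ["index1", "index2", "index3", "index4", "index5", "index6"],
    {"GaiXing": 0, "GaoKong": 1},
    {"GaiXing": (0, 2), "GaoKong": (3, 5)},
)
print(index_sequence_for(first_data_dictionary, 1))
```

Each node costs two integers of offset data regardless of how many homophones it has, which is the source of the space saving described above.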
  • in this embodiment, the second data to be stored is acquired, where the second data to be stored includes N third pinyin nodes and M character strings corresponding to each third pinyin node; the double-array dictionary tree algorithm is used to process each character string of each third pinyin node to determine the index value set corresponding to each third pinyin node; the index value set corresponding to each third pinyin node is written into the preset first index array to obtain the first target index array; the starting index position of each third pinyin node is determined from the first target index array; the double-array dictionary tree algorithm is used to process each third pinyin node to obtain the node identifier of each third pinyin node;
  • the node identifier of each third pinyin node is mapped and stored with the corresponding starting index position to generate an offset array set; the first target index array and the offset array set are combined to generate the first data dictionary;
  • the second data to be stored is stored in the form of a double-array dictionary tree.
  • before querying the candidate frequency value of each candidate index group in the preset second data dictionary, the data dictionary generating method further includes the following steps:
  • the third data to be stored includes the fourth pinyin byte, the fifth pinyin byte and the target frequency value.
  • the third data to be stored refers to the 2gram word frequency data to be stored.
  • for example, the third data to be stored is 2gram word frequency data with a key of GaoKong CaoZuo and a value of 30, or with a key of YanJing Sheg and a value of 25.
  • the fourth pinyin byte refers to the first 1gram pinyin in the third data to be stored.
  • the fifth pinyin byte refers to the second 1gram pinyin in the third data to be stored.
  • the fourth pinyin byte and the fifth pinyin byte may be the same or different.
  • the fourth pinyin byte and the fifth pinyin byte are combined as the key of the third data to be stored.
  • the target frequency value refers to the frequency value corresponding to the combination of the fourth pinyin byte and the fifth pinyin byte.
  • the target frequency value is the value in the third data to be stored. For example, if the key in the third data to be stored is GaoKong CaoZuo and the value is 25, then the fourth pinyin byte is GaoKong, the fifth pinyin byte is CaoZuo, and the target frequency value is 25, where 25 is the frequency value of GaoKong CaoZuo.
  • the third data to be stored can be obtained by collecting 2gram word frequency data in real time, or by directly acquiring 2gram word frequency data from the Pinyin dictionary database.
  • S42 Use the double-array dictionary tree algorithm to process the fourth pinyin byte and the fifth pinyin byte to obtain the fourth index value and the fifth index value, where the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte.
  • a double-array dictionary tree algorithm is used to process the fourth pinyin byte and the fifth pinyin byte to obtain the fourth index value and the fifth index value.
  • the fourth index value is the index value of the fourth pinyin byte
  • the fifth index value is the index value of the fifth pinyin byte.
  • S43 Use the CSR method to map and store the fourth index value, the fifth index value, and the target frequency value to generate the second data dictionary.
  • the second data dictionary refers to a word frequency dictionary library used to store the index values of 2gram character strings (words) and the corresponding frequency values. Since a 2gram string is composed of two 1gram strings, each 2gram string (word) has two index values: the fourth index value and the fifth index value. Specifically, a two-dimensional matrix can be preset, in which the fourth index value is used as the row, the fifth index value is used as the column, and the target frequency value is stored as the element value at that position. Furthermore, since many 2gram string combinations do not occur in practice, the two-dimensional matrix is a sparse matrix. Therefore, the CSR method is used to compress the two-dimensional matrix and generate the second data dictionary.
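To make the compression step concrete, the following sketch builds the three standard CSR (compressed sparse row) arrays from a handful of (row, column) to frequency entries and looks a frequency up again. It is an illustration under stated assumptions: the names `build_csr` and `lookup` and the example index values are hypothetical, and a real system might use a sparse-matrix library instead of hand-rolled lists.

```python
from typing import Dict, List, Tuple

Csr = Tuple[List[int], List[int], List[int]]


def build_csr(freqs: Dict[Tuple[int, int], int], n_rows: int) -> Csr:
    """Build CSR arrays (indptr, indices, data) from (row, col) -> frequency."""
    indptr = [0] * (n_rows + 1)
    for row, _col in freqs:
        indptr[row + 1] += 1            # count the entries in each row
    for r in range(n_rows):
        indptr[r + 1] += indptr[r]      # prefix sums give each row's start
    indices: List[int] = []
    data: List[int] = []
    for row, col in sorted(freqs):      # row-major order matches indptr
        indices.append(col)
        data.append(freqs[(row, col)])
    return indptr, indices, data


def lookup(csr: Csr, row: int, col: int) -> int:
    """Return the stored frequency, or 0 if the 2gram was never observed."""
    indptr, indices, data = csr
    for k in range(indptr[row], indptr[row + 1]):
        if indices[k] == col:
            return data[k]
    return 0


# Hypothetical index values: row = fourth index value, col = fifth index value.
freqs = {(0, 1): 30, (2, 3): 25, (2, 0): 7}
csr = build_csr(freqs, n_rows=4)
```

Only the observed 2gram combinations occupy memory; empty rows cost a single repeated entry in `indptr`.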
  • the third data to be stored is obtained, where the third data to be stored includes the fourth pinyin byte, the fifth pinyin byte and the target frequency value; the double-array dictionary tree algorithm is used to process the fourth pinyin byte and the fifth pinyin byte to obtain the fourth index value and the fifth index value, where the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte; the CSR method is used to map and store the fourth index value, the fifth index value, and the target frequency value to generate the second data dictionary. By storing the third data to be stored in the form of a double-array dictionary tree, that is, by representing the fourth pinyin byte and the fifth pinyin byte in the third data to be stored as indexes, redundant information during data storage is reduced and storage space is saved.
  • the data dictionary generation method further specifically includes the following steps:
  • S16 Obtain fourth data to be stored, where the fourth data to be stored includes L sample character strings and a sample frequency value corresponding to each sample character string.
  • the fourth data to be stored refers to 1gram word frequency data to be stored.
  • the fourth data to be stored includes L sample character strings and a frequency value corresponding to each sample character string.
  • the sample string is the key value in the fourth data to be stored
  • the frequency value is the value value in the fourth data to be stored.
  • for example, the fourth data to be stored includes 1gram word frequency data with a key of "high altitude" and a value of 40, and a key of "operation" and a value of 45; then "high altitude" is a sample string and "40" is the frequency value corresponding to "high altitude"; "operation" is a sample string and "45" is the frequency value corresponding to "operation".
  • the fourth data to be stored includes L key-value pairs, and each key corresponds to a frequency value, that is, each sample string corresponds to a frequency value.
  • the fourth data to be stored can be obtained by collecting 1gram word frequency data in real time, or directly from the Pinyin dictionary database.
  • S17 Use the double-array dictionary tree algorithm to process each sample string to obtain the sixth index value of each sample string.
  • the double-array dictionary tree algorithm is used to process each sample character string, so as to obtain the sixth index value of each sample character string. Understandably, each sample string corresponds to a unique sixth index value. It should be noted that the specific method and process of using the double-array dictionary tree algorithm to process each sample string in this step are similar to those used in step S22 to process each character string of each third pinyin node, and are not repeated here.
  • a storage array for storing the sixth index value of each sample character string is established.
  • the position of each sixth index value in the storage array corresponds to that sixth index value. That is, the sixth index value of each sample string is written into the storage array in ascending order of the sixth index value, so that the corresponding 1gram segment (sample character string) can conveniently be looked up through its index value (sixth index value).
  • the fourth data dictionary refers to a 1gram word frequency dictionary for storing 1gram word frequency data.
  • the index value and corresponding frequency value of several 1gram strings are included.
  • for example, the fourth data dictionary includes data whose key is index1 with a value of 30, and data whose key is index2 with a value of 40, where index1 is the sixth index value of the sample string "high altitude", 30 is the frequency value of the sample string "high altitude", index2 is the sixth index value of the sample string "operation", and 40 is the frequency value of the sample string "operation".
  • the fourth data to be stored is obtained, where the fourth data to be stored includes L sample character strings and the sample frequency value corresponding to each sample character string; each sample character string is processed by the double-array dictionary tree algorithm to obtain the sixth index value of each sample character string; each sample character string and the corresponding sixth index value are written into the preset array to obtain the storage array; each sixth index value is mapped and stored with the corresponding sample frequency value to generate a fourth data dictionary. By storing the fourth data to be stored in the form of a double-array dictionary tree, that is, by converting each sample string into a sixth index value and storing it with the corresponding sample frequency value, redundant information during data storage is reduced and the inconvenience caused by storing character-type data is eliminated.
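The relationship between the storage array and the fourth data dictionary can be sketched as follows. As an assumption made only for illustration, the sixth index value is simply the position of a string in a sorted list; in the described method it would come from the double-array dictionary tree. The name `build_fourth_dictionary` is hypothetical.

```python
from typing import Dict, List, Tuple


def build_fourth_dictionary(
    word_freqs: Dict[str, int],
) -> Tuple[List[str], Dict[int, int]]:
    """Return (storage_array, freq_by_index) for 1gram word frequency data.

    storage_array[i] is the sample string whose (stand-in) sixth index value
    is i; freq_by_index maps that index value to the sample frequency value.
    """
    storage_array = sorted(word_freqs)  # position == stand-in index value
    index_of = {word: i for i, word in enumerate(storage_array)}
    freq_by_index = {index_of[word]: freq for word, freq in word_freqs.items()}
    return storage_array, freq_by_index


# Example values taken from the description above.
storage_array, freq_by_index = build_fourth_dictionary(
    {"high altitude": 40, "operation": 45}
)
```

The frequency dictionary then contains only integers, and the character strings are kept once, in the storage array.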
  • a data dictionary generating device is provided, and the data dictionary generating device corresponds one-to-one to the data dictionary generating method in the above-mentioned embodiment.
  • the data dictionary generating device includes a first acquisition module 11, a first query module 12, a first processing module 13, a first screening module 14 and a first mapping storage module 15. The detailed description of each functional module is as follows:
  • the first obtaining module 11 is configured to obtain first data to be stored, and the first data to be stored includes a first pinyin node and a second pinyin node;
  • the first query module 12 is configured to query in a preset first data dictionary based on the first pinyin node and the second pinyin node to determine the first index sequence and the second index sequence, where the first index sequence is the first An index sequence of a pinyin node, and the second index sequence is an index sequence of a second pinyin node;
  • the first processing module 13 is configured to use the CSR method to process the first index sequence and the second index sequence to obtain a candidate index group;
  • the first screening module 14 is configured to query the candidate frequency value of each candidate index group in the preset second data dictionary, and filter out the target index group whose candidate frequency value meets the preset requirements from the candidate index group;
  • the first mapping storage module 15 is used for mapping and storing the data to be stored and the target index group to generate a third data dictionary.
  • the data dictionary generating device further includes:
  • the second acquisition module is configured to acquire second data to be stored, where the second data to be stored includes N third pinyin nodes and M character strings corresponding to each third pinyin node;
  • the second processing module is used to process each character string of each third pinyin node by using a double-array dictionary tree algorithm to determine the index value set corresponding to each third pinyin node;
  • the first writing module is used to write the index value set corresponding to each third pinyin node into the preset first index array to obtain the first target index array;
  • the first determining module is used to determine the starting index position of each third pinyin node from the first target index array;
  • the third processing module is used to process each third pinyin node by using the double-array dictionary tree algorithm to obtain the node identifier of each third pinyin node;
  • the second mapping storage module is used to map and store the node identifier of each third pinyin node and the corresponding starting index position to generate an offset array set;
  • the combination module is used to combine the first target index array and the offset array set to generate a first data dictionary.
  • the data dictionary generating device further includes:
  • the third acquisition module is used to acquire the third data to be stored, the third data to be stored includes the fourth pinyin byte, the fifth pinyin byte and the target frequency value;
  • the fourth processing module is used to process the fourth pinyin byte and the fifth pinyin byte using the double-array dictionary tree algorithm to obtain the fourth index value and the fifth index value, where the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte;
  • the third mapping storage module is used to map and store the fourth index value, the fifth index value, and the target frequency value using the CSR method to generate a second data dictionary.
  • the data dictionary generating device further includes:
  • the fourth acquiring module is configured to acquire fourth data to be stored, where the fourth data to be stored includes L sample character strings and a sample frequency value corresponding to each sample character string;
  • the fifth processing module is used to process each sample string using the double-array dictionary tree algorithm to obtain the sixth index value of each sample string;
  • the second writing module is used to write each sample string and the corresponding sixth index value into the preset array to obtain the storage array;
  • the fourth mapping storage module is used for mapping and storing each sixth index value and the corresponding sample frequency value to generate a fourth data dictionary.
  • Each module in the above-mentioned data dictionary generating device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a data query method is provided.
  • the method is applied to the server in FIG. 1 as an example for description, and includes the following steps:
  • S100 Obtain the first data to be queried, query the first data to be queried in a third data dictionary, and determine the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the data dictionary generation method of claim 1.
  • the first data to be queried refers to the 2gram pinyin node data to be queried.
  • the first data to be queried is composed of a first pinyin node to be queried and a second pinyin node to be queried.
  • the first data to be queried is GaoKong CaoZuo. GaoKong is the first pinyin node to be queried, and CaoZuo is the second pinyin node to be queried.
  • the first data to be queried is matched against all 2gram pinyin nodes stored in the third data dictionary, and the target index group corresponding to the 2gram pinyin node that matches the first data to be queried is determined as the index group to be queried for the first data to be queried.
  • the third data dictionary is obtained by using the above-mentioned data dictionary generation method.
  • each sample character string and the corresponding sixth index value have been written into the preset array to obtain the storage array, that is, the storage array of the fourth data dictionary, so the storage array includes each sample character string and the corresponding sixth index value. Therefore, in this step, the index group to be queried is looked up in the storage array of the fourth data dictionary, and the sample strings corresponding to the sixth index values that match the index group to be queried are determined as the target string of the first data to be queried.
  • the fourth data dictionary is obtained by using the above-mentioned data dictionary generating method.
  • the first data to be queried is acquired and queried in a third data dictionary to determine the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the data dictionary generation method of claim 1; the storage array of the fourth data dictionary is queried based on the index group to be queried to obtain the target string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary used to store the sixth index value and the corresponding sample frequency value; thereby ensuring the accuracy of data query.
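The two-step 2gram query described above (pinyin key to index group, then index group to character strings) can be sketched as below. The dictionary contents, the index values 0 and 1, and the function name `query_2gram` are hypothetical; a plain Python dict stands in for the third data dictionary.

```python
from typing import Dict, List, Optional, Tuple

# Hypothetical third data dictionary: 2gram pinyin key -> target index group.
third_dictionary: Dict[str, Tuple[int, int]] = {"GaoKong CaoZuo": (0, 1)}

# Hypothetical storage array of the fourth data dictionary: index -> string.
storage_array: List[str] = ["high altitude", "operation"]


def query_2gram(pinyin: str) -> Optional[List[str]]:
    """Resolve a 2gram pinyin key to its target character strings."""
    index_group = third_dictionary.get(pinyin)  # step S100: find index group
    if index_group is None:
        return None                             # unknown 2gram pinyin
    # Second step: map each index in the group back to its sample string.
    return [storage_array[i] for i in index_group]


result = query_2gram("GaoKong CaoZuo")
```

Both steps are direct lookups, so the character strings themselves never need to be stored in the 2gram dictionary.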
  • the data query method further specifically includes the following steps:
  • S110 Obtain the second data to be queried, query the second data to be queried in the offset array set of the first data dictionary, and determine the target offset array of the second data to be queried, wherein the first data dictionary is obtained by the data dictionary generation method of claim 2.
  • the second data to be queried refers to the 1gram pinyin node data to be queried.
  • the second data to be queried may be 1gram pinyin node data of GaoKong, CaoZuo or GaiXing.
  • the offset array set of the first data dictionary includes the offset arrays of several third pinyin nodes. Therefore, the second data to be queried is matched against the third pinyin node of each offset array in the offset array set of the first data dictionary, and the offset array corresponding to the third pinyin node that matches the second data to be queried is determined as the target offset array of the second data to be queried.
  • the first data dictionary is obtained by using the above-mentioned data dictionary generating method.
  • S111 Obtain the target starting index position in the target offset array, and based on the target starting index position, perform a query in the first target index array of the first data dictionary to determine the target index data of the second data to be queried.
  • from step S26, it can be seen that the node identifier of each third pinyin node and the corresponding starting index position are recorded in the offset array set. Therefore, the starting index position in the target offset array is determined as the target starting index position. Specifically, after the target starting index position is determined, a query is performed in the first target index array to locate the starting position of the data to be queried, and the index values from that starting position to the end position are determined as the target index data of the data to be queried.
  • S112 Query in the storage array based on the target index data to obtain the target character string of the second data to be queried.
  • the storage array is queried based on the target index data to obtain the target character string of the second data to be queried.
  • the specific method and process for querying the storage array based on the target index data to obtain the target character string of the second data to be queried are similar to those used in step S101 to query the storage array based on the index group to be queried to obtain the target character string of the first data to be queried, and are not repeated here.
  • the second data to be queried is obtained and queried in the offset array set of the first data dictionary to determine the target offset array of the second data to be queried, where the first data dictionary is obtained by the data dictionary generation method of claim 2; the target starting index position in the target offset array is obtained, and a query is performed in the first target index array of the first data dictionary based on the target starting index position to determine the target index data of the second data to be queried; the storage array is queried based on the target index data to obtain the target character string of the second data to be queried. In this way, the accuracy of the data query is improved while query efficiency is ensured.
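Steps S110 to S112 can be sketched as a range lookup over the flat index array. Several details here are assumptions for illustration only: the offsets are held in a list with one extra trailing entry marking the end of the last run (the text does not say how the end position is found), node identifiers are small integers, and the homophone strings are placeholders.

```python
from typing import Dict, List

# Hypothetical first data dictionary.
node_ids: Dict[str, int] = {"GaoKong": 0, "CaoZuo": 1}  # pinyin -> node identifier
offsets: List[int] = [0, 2, 3]       # start of each node's run, plus end sentinel
index_array: List[int] = [0, 1, 2]   # first target index array

# Hypothetical storage array: index value -> character string (placeholders).
storage_array: List[str] = ["gaokong-1", "gaokong-2", "caozuo-1"]


def query_1gram(pinyin: str) -> List[str]:
    """Return all homophone strings stored for a 1gram pinyin node."""
    node_id = node_ids.get(pinyin)
    if node_id is None:
        return []                                         # pinyin not stored
    start, end = offsets[node_id], offsets[node_id + 1]   # S111: index range
    target_indexes = index_array[start:end]               # target index data
    return [storage_array[i] for i in target_indexes]     # S112: strings


gaokong = query_1gram("GaoKong")
```

Reading a contiguous slice avoids scanning the index array, which is what keeps the 1gram query efficient.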
  • a data query device is provided, and the data query device corresponds to the data query method in the foregoing embodiment one-to-one.
  • the data query device includes a second query module 100 and a third query module 101.
  • the detailed description of each functional module is as follows:
  • the second query module 100 is used to obtain the first data to be queried, query the first data to be queried in a third data dictionary, and determine the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the above-mentioned data dictionary generation method;
  • the third query module 101 is configured to query the storage array of the fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary used to store the sixth index value and the corresponding sample frequency value.
  • the data query device further includes:
  • the second determining module is used to obtain the second data to be queried, query the second data to be queried in the offset array set of the first data dictionary, and determine the target offset array of the second data to be queried, where the first data dictionary is obtained by the above-mentioned data dictionary generation method;
  • the fourth query module is used to obtain the target starting index position in the target offset array, and, based on the target starting index position, query the first target index array of the first data dictionary to determine the target index data of the second data to be queried;
  • the fifth query module is used to query the storage array based on the target index data to obtain the target character string of the second data to be queried.
  • Each module in the above-mentioned data query device can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the foregoing modules may be embedded in the form of hardware or independent of the processor in the computer device, or may be stored in the memory of the computer device in the form of software, so that the processor can call and execute the operations corresponding to the foregoing modules.
  • a computer device is provided.
  • the computer device may be a server, and its internal structure diagram may be as shown in FIG. 10.
  • the computer equipment includes a processor, a memory, a network interface, and a database connected through a system bus. Among them, the processor of the computer device is used to provide calculation and control capabilities.
  • the memory of the computer device includes a readable storage medium and an internal memory.
  • the readable storage medium stores an operating system, computer readable instructions, and a database.
  • the internal memory provides an environment for the operation of the operating system and computer readable instructions in the readable storage medium.
  • the database of the computer equipment is used to store the data used in the data dictionary generating method and the data query method in the foregoing embodiments.
  • the network interface of the computer device is used to communicate with an external terminal through a network connection.
  • the computer-readable instruction is executed by the processor to implement a data dictionary generation method, or the computer-readable instruction is executed by the processor to implement a data query method.
  • the readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
  • a computer device including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented: acquiring first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;
  • based on the first pinyin node and the second pinyin node, querying in a preset first data dictionary to determine a first index sequence and a second index sequence, wherein the first index sequence is an index sequence of the first pinyin node, and the second index sequence is an index sequence of the second pinyin node;
  • the data to be stored and the target index group are mapped and stored to generate a third data dictionary.
  • a computer device including a memory, a processor, and computer-readable instructions stored in the memory and capable of running on the processor.
  • when the processor executes the computer-readable instructions, the following steps are implemented:
  • the target character string of the first data to be queried is obtained, where the fourth data dictionary is a word frequency dictionary used to store the sixth index value and the corresponding sample frequency value.
  • one or more readable storage media storing computer readable instructions are provided.
  • the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage medium stores computer readable instructions, and when the computer readable instructions are executed by one or more processors, the one or more processors implement the following steps:
  • acquiring first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;
  • based on the first pinyin node and the second pinyin node, querying in a preset first data dictionary to determine a first index sequence and a second index sequence, wherein the first index sequence is an index sequence of the first pinyin node, and the second index sequence is an index sequence of the second pinyin node;
  • the data to be stored and the target index group are mapped and stored to generate a third data dictionary.
  • one or more readable storage media storing computer readable instructions are provided.
  • the readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media; the readable storage medium stores computer readable instructions, and when the computer readable instructions are executed by one or more processors, the one or more processors implement the following steps:
  • the target character string of the first data to be queried is obtained, where the fourth data dictionary is a word frequency dictionary used to store the sixth index value and the corresponding sample frequency value.
  • a person of ordinary skill in the art can understand that all or part of the processes in the methods of the above-mentioned embodiments can be implemented by instructing relevant hardware through computer-readable instructions.
  • the computer-readable instructions can be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium, and when the computer-readable instructions are executed, they may include the processes of the above-mentioned method embodiments.
  • any reference to memory, storage, database or other media used in the embodiments provided in this application may include non-volatile and/or volatile memory.
  • Non-volatile memory may include read only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory.
  • Volatile memory may include random access memory (RAM) or external cache memory.
  • RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

Abstract

Disclosed are a data dictionary generation method and apparatus, and a computer device and a storage medium. The method comprises: acquiring first data to be stored, said data comprising a first pinyin node and a second pinyin node; performing a query in a preset first data dictionary on the basis of the first pinyin node and the second pinyin node, so as to determine a first index sequence and a second index sequence; processing the first index sequence and the second index sequence by using a CSR method, so as to obtain candidate index groups; querying a candidate frequency value of each candidate index group in a preset second data dictionary, and screening out, from the candidate index groups, a target index group, the candidate frequency value of which meets a preset requirement; and mapping and storing said data and the target index group to generate a third data dictionary. Recovery is performed by means of combining a first data dictionary and a second data dictionary to obtain a third data dictionary, thereby solving the problem of information redundancy during data storage.

Description

Data dictionary generation method, data query method, device, equipment and medium
This application is based on, and claims priority to, the Chinese patent application No. 202010589195.3, filed on June 24, 2020 and titled "Data dictionary generation method, data query method, device, equipment and medium".
Technical field
This application relates to the field of cloud storage, and in particular to a data dictionary generation method, data query method, device, equipment, and medium.
Background
With the rapid development of the Internet and the improvement of informatization in various fields of society, the amount of data is exploding at an unprecedented rate, and mankind is entering the era of big data. In an information management system, a data dictionary is usually used to store data. The inventor realized that current dictionary libraries based on word segmentation generally require four types of underlying dictionaries: a 1gram word frequency dictionary, a 1gram pinyin-homophone mapping dictionary, a 2gram word frequency dictionary, and a 2gram pinyin-homophone mapping dictionary. These four types of underlying dictionaries need to be stored independently; when the algorithm loads the dictionaries, they must be loaded as four separate HashMaps, and the one-to-one mapping relationships in each dictionary must be saved separately. Therefore, this traditional data dictionary storage method often results in considerable information redundancy and wasted space.
Summary of the application
The embodiments of the present application provide a data dictionary generation method, device, computer equipment, and storage medium to solve the problem of information redundancy during data storage.
The embodiments of the present application provide a data query method, device, computer equipment, and storage medium to solve the problem of low data query efficiency.
A data dictionary generation method, including:
Acquiring first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;
Based on the first pinyin node and the second pinyin node, querying in a preset first data dictionary to determine a first index sequence and a second index sequence, wherein the first index sequence is an index sequence of the first pinyin node, and the second index sequence is an index sequence of the second pinyin node;
Processing the first index sequence and the second index sequence using a CSR method to obtain candidate index groups;
Querying the candidate frequency value of each candidate index group in a preset second data dictionary, and screening out, from the candidate index groups, the target index group whose candidate frequency value meets a preset requirement;
Mapping and storing the data to be stored and the target index group to generate a third data dictionary.
A data query method, comprising:
acquiring first data to be queried, and querying a third data dictionary with the first data to be queried to determine an index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the data dictionary generation method of claim 1;
querying the storage array of a fourth data dictionary based on the index group to be queried to obtain a target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary used to store sixth index values and their corresponding sample frequency values.
A data dictionary generation apparatus, comprising:
a first acquisition module, configured to acquire first data to be stored, the first data to be stored comprising a first pinyin node and a second pinyin node;
a first query module, configured to query a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
a first processing module, configured to process the first index sequence and the second index sequence using a CSR method to obtain a candidate index group;
a first screening module, configured to query a preset second data dictionary for the candidate frequency value of each candidate index group, and to select, from the candidate index groups, a target index group whose candidate frequency value meets a preset requirement;
a first mapping storage module, configured to map and store the data to be stored and the target index group to generate a third data dictionary.
A data query apparatus, comprising:
a second query module, configured to acquire first data to be queried, and to query a third data dictionary with the first data to be queried to determine an index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the data dictionary generation method of claim 1;
a third query module, configured to query the storage array of a fourth data dictionary based on the index group to be queried to obtain a target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary used to store sixth index values and their corresponding sample frequency values.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
acquiring first data to be stored, the first data to be stored comprising a first pinyin node and a second pinyin node;
querying a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
processing the first index sequence and the second index sequence using a CSR method to obtain a candidate index group;
querying a preset second data dictionary for the candidate frequency value of each candidate index group, and selecting, from the candidate index groups, a target index group whose candidate frequency value meets a preset requirement;
mapping and storing the data to be stored and the target index group to generate a third data dictionary.
A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
acquiring first data to be queried, and querying a third data dictionary with the first data to be queried to determine an index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the data dictionary generation method of claim 1;
querying the storage array of a fourth data dictionary based on the index group to be queried to obtain a target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary used to store sixth index values and their corresponding sample frequency values.
One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring first data to be stored, the first data to be stored comprising a first pinyin node and a second pinyin node;
querying a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
processing the first index sequence and the second index sequence using a CSR method to obtain a candidate index group;
querying a preset second data dictionary for the candidate frequency value of each candidate index group, and selecting, from the candidate index groups, a target index group whose candidate frequency value meets a preset requirement;
mapping and storing the data to be stored and the target index group to generate a third data dictionary.
One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
acquiring first data to be queried, and querying a third data dictionary with the first data to be queried to determine an index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by the data dictionary generation method of claim 1;
querying the storage array of a fourth data dictionary based on the index group to be queried to obtain a target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary used to store sixth index values and their corresponding sample frequency values.
In the above data dictionary generation method, apparatus, computer device, and storage medium, first data to be stored is acquired, the first data to be stored including a first pinyin node and a second pinyin node; a preset first data dictionary is queried based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node; the first index sequence and the second index sequence are processed using a CSR method to obtain candidate index groups; a preset second data dictionary is queried for the candidate frequency value of each candidate index group, and target index groups whose candidate frequency values meet a preset requirement are selected from the candidate index groups; the data to be stored and the target index groups are mapped and stored to generate a third data dictionary. Because the third data dictionary is recovered by combining the first data dictionary and the second data dictionary, data storage space is saved. In addition, when the dictionary is generated, the first data to be stored is stored in double-array trie form, that is, the first pinyin node and the second pinyin node are converted into indexes for storage, which reduces redundant information during data storage and avoids the inconvenience of storing character-type data.
In the above data query method, apparatus, computer device, and storage medium, first data to be queried is acquired and used to query a third data dictionary to determine the index group to be queried for the first data to be queried, where the third data dictionary is obtained by the data dictionary generation method of claim 1; the storage array of a fourth data dictionary is then queried based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary used to store sixth index values and their corresponding sample frequency values. This ensures the accuracy of data queries.
The details of one or more embodiments of the present application are set forth in the drawings and description below; other features and advantages of the present application will become apparent from the description, the drawings, and the claims.
Description of the Drawings
To explain the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic diagram of an application environment of the data dictionary generation method and the data query method in an embodiment of the present application;
FIG. 2 is an example diagram of the data dictionary generation method in an embodiment of the present application;
FIG. 3 is another example diagram of the data dictionary generation method in an embodiment of the present application;
FIG. 4 is another example diagram of the data dictionary generation method in an embodiment of the present application;
FIG. 5 is another example diagram of the data dictionary generation method in an embodiment of the present application;
FIG. 6 is a functional block diagram of the data dictionary generation apparatus in an embodiment of the present application;
FIG. 7 is an example diagram of the data query method in an embodiment of the present application;
FIG. 8 is another example diagram of the data query method in an embodiment of the present application;
FIG. 9 is a functional block diagram of the data query apparatus in an embodiment of the present application;
FIG. 10 is a schematic diagram of a computer device in an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in this application, without creative effort, fall within the protection scope of this application.
The data dictionary generation method provided by the embodiments of the present application can be applied in the application environment shown in FIG. 1. Specifically, the method is applied in a data dictionary generation system that includes a client and a server as shown in FIG. 1; the client and the server communicate over a network to solve the problem of information redundancy during data storage. The client, also called the user terminal, is the program that corresponds to the server and provides local services to the user. The client may be installed on, but is not limited to, personal computers, notebook computers, smartphones, tablet computers, and portable wearable devices. The server may be implemented as a standalone server or as a server cluster composed of multiple servers.
In an embodiment, as shown in FIG. 2, a data dictionary generation method is provided. Taking its application on the server in FIG. 1 as an example, the method includes the following steps:
S11: Acquire first data to be stored, the first data to be stored including a first pinyin node and a second pinyin node.
The first data to be stored is 2gram pinyin data that is to be stored, for example GaoKong CaoZuo, YanJing She, or KaiKai XinXin. It contains the pinyin data of two nodes, namely the first pinyin node and the second pinyin node. The first pinyin node is the pinyin of the first 1gram in the first data to be stored, and the second pinyin node is the pinyin of the second 1gram. The two nodes may be the same or different. For example, if the first data to be stored is GaoKong CaoZuo, the first pinyin node is GaoKong and the second pinyin node is CaoZuo. Specifically, the first data to be stored may be obtained by collecting 2gram pinyin data in real time, or by reading 2gram pinyin data directly from a pinyin dictionary library.
S12: Based on the first pinyin node and the second pinyin node, query a preset first data dictionary to determine a first index sequence and a second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node.
The first data dictionary is a pre-generated 1gram homophone dictionary used to store 1gram pinyin-homophone data. Specifically, it stores a number of 1gram pinyin nodes and the index sequence corresponding to each 1gram pinyin node. For example, the first data dictionary may store 1gram pinyin-homophone data whose key is GaiXing and whose value is [index1,index2,index3,index4...]. Here GaiXing is the 1gram pinyin node, and [index1,index2,index3,index4...] are the indexes of the character strings corresponding to GaiXing. For instance, the character strings pronounced GaiXing may include [改型,改性,改姓,该新...]; processing these strings with the double-array trie algorithm yields the index sequence [index1,index2,index3,index4...] for GaiXing. Note that indexes are assigned per character string, and the index value of each character string is unique.
Specifically, once the first pinyin node and the second pinyin node have been determined, each is matched against all the 1gram pinyin nodes (keys) in the first data dictionary. The index sequence of the 1gram pinyin node that matches the first pinyin node is taken as the first index sequence, and the index sequence of the 1gram pinyin node that matches the second pinyin node is taken as the second index sequence. Optionally, the first index sequence may be denoted preIndex, the index sequence of the first pinyin node, and the second index sequence may be denoted sufIndex, the index sequence of the second pinyin node.
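The lookup in step S12 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the first data dictionary is modeled as a plain Python dict, and the integer index values, the function name `lookup_index_sequences`, and the whitespace-based split of the 2gram are all assumptions made for the example.

```python
# Hypothetical first data dictionary: 1gram pinyin node -> index sequence.
# The integer indexes are invented for illustration only.
first_dict = {
    "GaoKong": [1, 2],  # e.g. indexes of 高空, 高控
    "CaoZuo": [3, 4],   # e.g. indexes of 操作, 槽座
}


def lookup_index_sequences(data, dictionary):
    """Split a 2gram pinyin string into two nodes and look up each node.

    Returns (pre_index, suf_index); a node with no matching key yields
    an empty sequence.
    """
    first_node, second_node = data.split()
    pre_index = dictionary.get(first_node, [])
    suf_index = dictionary.get(second_node, [])
    return pre_index, suf_index


pre_index, suf_index = lookup_index_sequences("GaoKong CaoZuo", first_dict)
```

Here `pre_index` and `suf_index` play the roles of preIndex and sufIndex from the text.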
S13: Process the first index sequence and the second index sequence using a CSR method to obtain candidate index groups.
The CSR (compressed sparse row) method is a sparse matrix storage method. When storing sparse matrices, the CSR format has the most stable average number of bytes per nonzero entry. Specifically, CSR consists of three arrays: a row vector, a column vector, and a value vector. The row vector (row offsets) has one entry per row, and each entry gives the offset of the first nonzero value in that row; the column vector (column indices) gives the column of each nonzero element; the value vector (values) gives the value of each corresponding element.
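To make the three CSR arrays concrete, the following sketch converts a small dense matrix into row offsets, column indices, and values. It is an illustration only: the function name `to_csr` and the sample matrix are assumptions, and the sketch follows the common convention of appending one trailing entry to the row offsets so that the end of the last row is known.

```python
def to_csr(dense):
    """Convert a dense matrix (list of rows) into CSR arrays.

    row_offsets[i] is the position in `values` of the first nonzero of
    row i; a final entry equal to len(values) marks the end of the last row.
    """
    row_offsets = [0]
    column_indices = []
    values = []
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                column_indices.append(col)
                values.append(value)
        row_offsets.append(len(values))
    return row_offsets, column_indices, values


# Hypothetical 3 x 5 matrix with two nonzero entries.
dense = [
    [0, 0, 0, 0, 0],
    [0, 0, 0, 40, 0],
    [0, 0, 0, 0, 20],
]
row_offsets, column_indices, values = to_csr(dense)
```

The nonzeros of row `i` are then `values[row_offsets[i]:row_offsets[i + 1]]`, located at columns `column_indices[row_offsets[i]:row_offsets[i + 1]]`.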
A candidate index group is an index group obtained by pairing any one index value from the first index sequence with any one index value from the second index sequence, so each candidate index group consists of two index values. For example, a candidate index group may be Index1-index3, Index2-index3, or Index3-index5. Specifically, once the first index sequence and the second index sequence have been determined, the first index sequence is treated as the rows of a matrix and the second index sequence as its columns. The row vector and column vector of the CSR method are then used to determine the column index array of each row corresponding to the first index sequence, and intersecting that column index array with the second index sequence yields the candidate index groups.
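The row lookup and intersection described above might look like the following sketch. It assumes, beyond what the text states, that the matrix rows are indexed by the first word's index value, the columns by the second word's index value, and that the row offsets carry a trailing end entry; the function name and sample arrays are hypothetical.

```python
def candidate_index_groups(pre_index, suf_index, row_offsets, column_indices):
    """Pair each row index with the stored columns that also occur in suf_index."""
    wanted_columns = set(suf_index)
    groups = []
    for row in pre_index:
        start, end = row_offsets[row], row_offsets[row + 1]
        # Column index array of this row, intersected with the second sequence.
        for col in column_indices[start:end]:
            if col in wanted_columns:
                groups.append((row, col))
    return groups


# Hypothetical CSR structure with rows 0..2:
# row 1 has nonzeros in columns 3 and 7, row 2 has a nonzero in column 4.
row_offsets = [0, 0, 2, 3]
column_indices = [3, 7, 4]

groups = candidate_index_groups([1, 2], [3, 4], row_offsets, column_indices)
```

Column 7 of row 1 is dropped because it does not appear in the second index sequence, which is the effect of the intersection step.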
S14: Query a preset second data dictionary for the candidate frequency value of each candidate index group, and select from the candidate index groups the target index groups whose candidate frequency values meet a preset requirement.
The second data dictionary is a pre-generated word frequency dictionary library used to store the index values of 2gram character strings (words) together with the frequency value of each 2gram character string. A 2gram character string is a phrase made up of two 1gram character strings, for example 高空操作, 专利分析, or 开开心心. Specifically, the second data dictionary stores the index groups of a number of 2gram character strings and the frequency value of each one. The frequency value measures how many times a given 2gram character string appears in text; it is one of the most important reference indicators for ranking candidate words, and the larger a word's frequency value, the more likely that word is the correct one. For example, the second data dictionary may store word frequency data in which the key "高空操作" is represented by the index group Index1-index3 and the value is 45. Here Index1 is the index value of 高空, index3 is the index value of 操作, and 45 is the frequency value of 高空操作.
A target index group is an index group whose frequency value meets the preset requirement. Specifically, once the candidate index groups have been determined, each candidate index group is looked up in the preset second data dictionary to determine its candidate frequency value. The index groups whose candidate frequency values meet the preset requirement are then selected from the candidate index groups as the target index groups. In one specific embodiment, a frequency threshold may be set in advance; the candidate frequency value of each candidate index group is compared with the threshold, and the candidate index groups whose candidate frequency values exceed the threshold are determined to be the target index groups that meet the preset requirement. Preferably, to ensure the diversity and generality of the stored data, the frequency threshold in this embodiment is set to 0, so that every candidate index group with a candidate frequency value greater than 0 is determined to be a target index group; a candidate frequency value of 0 means that the 2gram character string (word) corresponding to the candidate index group does not exist. In another specific embodiment, if no candidate frequency value is found for a candidate index group when it is looked up in the preset second data dictionary, the candidate frequency value of that group is directly judged not to meet the preset requirement, and the candidate index group is discarded.
For example, suppose the candidate index groups are Index1-index3 (高空操作), Index2-index4 (高控槽座), Index1-index4 (高空槽座), and Index2-index3 (高控操作). If querying the preset second data dictionary gives a frequency value of 40 for Index1-index3 (高空操作), 20 for Index2-index4 (高控槽座), 0 (does not exist) for Index1-index4 (高空槽座), and 0 (does not exist) for Index2-index3 (高控操作), then Index1-index3 and Index2-index4 are determined to be the target index groups.
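The screening in step S14 can be reproduced with the numbers from the example above. In this sketch the second data dictionary is modeled as a dict keyed by index pairs, which is an assumption made for illustration; the integers 1 to 4 stand in for Index1 to index4, and the function name `select_target_groups` is hypothetical.

```python
from itertools import product

# Hypothetical second data dictionary: (first index, second index) -> frequency.
# Pairs that are absent are treated as frequency 0 (the 2gram does not exist).
second_dict = {
    (1, 3): 40,  # Index1-index3, 高空操作
    (2, 4): 20,  # Index2-index4, 高控槽座
}


def select_target_groups(pre_index, suf_index, freq_dict, threshold=0):
    """Keep the index pairs whose frequency value exceeds the threshold."""
    targets = []
    for group in product(pre_index, suf_index):
        frequency = freq_dict.get(group, 0)
        if frequency > threshold:
            targets.append(group)
    return targets


targets = select_target_groups([1, 2], [3, 4], second_dict)
```

With the threshold at its default of 0, only the two pairs that exist in the dictionary survive; raising the threshold to 30 would keep only the pair with frequency 40.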
S15: Map and store the data to be stored and the target index groups to generate a third data dictionary.
The third data dictionary is a 2gram homophone dictionary used to store 2gram pinyin-homophone data. Specifically, it contains a number of 2gram pinyin nodes and the sequence of index groups corresponding to each 2gram pinyin node. Once the target index groups have been determined, the data to be stored (the 2gram pinyin node) and its target index groups are mapped and stored to generate the third data dictionary. For example, if the data to be stored is GaoKong CaoZuo and its target index groups are Index1-index3 and Index2-index4, then GaoKong CaoZuo is stored as the key and Index1-index3 and Index2-index4 as the value, generating the third data dictionary.
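A minimal sketch of the mapping step, assuming the third data dictionary can be modeled as a Python dict from a 2gram pinyin string to a list of index pairs. The function name `store_mapping` and the integer pairs are hypothetical.

```python
def store_mapping(third_dict, pinyin_2gram, target_groups):
    """Store a 2gram pinyin key with its target index groups.

    Groups already stored under the key are not duplicated.
    """
    stored = third_dict.setdefault(pinyin_2gram, [])
    for group in target_groups:
        if group not in stored:
            stored.append(group)
    return third_dict


third_dict = {}
store_mapping(third_dict, "GaoKong CaoZuo", [(1, 3), (2, 4)])
```

Because only index pairs are stored, the character strings themselves stay in a single place and are recovered through their indexes when needed.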
In this embodiment, first data to be stored is acquired, the first data to be stored including a first pinyin node and a second pinyin node; a preset first data dictionary is queried based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node; the first index sequence and the second index sequence are processed using a CSR method to obtain candidate index groups; a preset second data dictionary is queried for the candidate frequency value of each candidate index group, and target index groups whose candidate frequency values meet a preset requirement are selected from the candidate index groups; the data to be stored and the target index groups are mapped and stored to generate a third data dictionary. Because the third data dictionary is recovered by combining the first data dictionary and the second data dictionary, data storage space is saved. In addition, when the dictionary is generated, the first data to be stored is stored in double-array trie form, that is, the first pinyin node and the second pinyin node are converted into indexes for storage, which reduces redundant information during data storage and avoids the inconvenience of storing character-type data.
In an embodiment, as shown in FIG. 3, before the preset first data dictionary is queried based on the first pinyin node and the second pinyin node, the data dictionary generation method further includes the following steps:
S21: Acquire second data to be stored, the second data to be stored including N third pinyin nodes and M character strings corresponding to each third pinyin node.
The second data to be stored is 1gram pinyin-homophone data that is to be stored. For example, it may be 1gram pinyin-homophone data with key GaiXing and value [改性,改姓,改型...], or with key GaoKong and value [高空,高控,高孔...]. The second data to be stored includes N third pinyin nodes and M character strings corresponding to each third pinyin node. A third pinyin node is a key in the second data to be stored, for example GaiXing, GaoKong, or CaoZuo. The value corresponding to each key in the second data to be stored is the set of character strings for that third pinyin node, and each third pinyin node corresponds to at least one character string. For example, the character strings corresponding to the third pinyin node GaiXing include [改性,改姓,改型...]. Specifically, the second data to be stored may be obtained by collecting 1gram pinyin-homophone data in real time, or by reading 1gram pinyin-homophone data directly from a pinyin-homophone dictionary library.
S22: Process each character string of each third pinyin node with the double-array trie algorithm to determine the index value set corresponding to each third pinyin node.
A double-array trie is an efficient indexing structure. In the tree, each node corresponds to a DFA state, and each directed, labeled edge from a parent node to a child node corresponds to a DFA transition. Traversal starts at the root node and walks the keyword from head to tail: each character of the keyword determines the next state, and the edge labeled with that character is followed. Each such move consumes one character of the keyword and descends one level in the tree. If the keyword has been fully consumed and a leaf node has been reached, the exit for that keyword has been reached. If the traversal gets stuck at some node, either because no branch is labeled with the current character or because the keyword is exhausted at an intermediate node, the keyword is not recognized by the trie.
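The base/check lookup behind a double-array trie can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in this application: the class name `DoubleArrayTrie`, the terminator `END`, and the choice of using each string's insertion position as its index value are all hypothetical.

```python
from collections import deque

END = "\0"  # hypothetical terminator, appended so every key ends at a leaf


class DoubleArrayTrie:
    def __init__(self, words):
        # Character codes start at 1, so slot 0 stays reserved for the root.
        alphabet = sorted({ch for w in words for ch in w} | {END})
        self.code = {ch: i + 1 for i, ch in enumerate(alphabet)}
        self.base = [0]
        self.check = [0]   # check[t] == s: state t was reached from s; -1: free
        self.value = {}    # leaf state -> index value of the stored string

        # Step 1: build an ordinary trie (children[n] maps char -> node number).
        children = [{}]
        value_of = {}
        for idx, word in enumerate(words):
            n = 0
            for ch in word + END:
                if ch not in children[n]:
                    children.append({})
                    children[n][ch] = len(children) - 1
                n = children[n][ch]
            value_of[n] = idx  # assumption: index value = insertion position

        # Step 2: place nodes breadth-first into the base/check arrays.
        queue = deque([(0, 0)])  # (plain-trie node, double-array state)
        while queue:
            n, s = queue.popleft()
            if n in value_of:
                self.value[s] = value_of[n]
                continue
            codes = [self.code[ch] for ch in children[n]]
            b = 1
            while any(self._used(b + c) for c in codes):
                b += 1  # first base whose child slots are all free
            self.base[s] = b
            for ch, child in children[n].items():
                t = b + self.code[ch]
                self._reserve(t, s)
                queue.append((child, t))

    def _used(self, t):
        return t < len(self.check) and self.check[t] != -1

    def _reserve(self, t, parent):
        while len(self.check) <= t:
            self.base.append(0)
            self.check.append(-1)
        self.check[t] = parent

    def lookup(self, word):
        """Return the index value of `word`, or None if it is not recognized."""
        s = 0
        for ch in word + END:
            c = self.code.get(ch)
            if c is None:
                return None
            t = self.base[s] + c  # DFA transition: one character consumed
            if t >= len(self.check) or self.check[t] != s:
                return None       # no branch labeled with the current character
            s = t
        return self.value.get(s)


trie = DoubleArrayTrie(["改性", "改姓", "改型", "高空"])
print(trie.lookup("改姓"), trie.lookup("改"))
```

A lookup that stops at an intermediate node, such as `改`, fails because no transition on the terminator exists from that state, which matches the "keyword exhausted at an intermediate node" case described above.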
Specifically, each character string corresponding to each third pinyin node is processed with the double-array trie algorithm, that is, each such character string is stored in double-array trie form, which yields the index value set corresponding to each third pinyin node. As a result, when data is retrieved, the indexes of all homophones of a pinyin node can be obtained from the pinyin node alone. Note that every index value in the index value set of each third pinyin node is uniquely determined: each character string corresponds to exactly one index value.
For example, if the character strings corresponding to the third pinyin node GaiXing include [改性, 改姓, 改型...], then after processing with the double-array trie algorithm, the index value set corresponding to GaiXing is [index1, index2, index3...], where index1 is the index value of "改性", index2 is the index value of "改姓", and index3 is the index value of "改型".
S23: Write the index value set corresponding to each third pinyin node into a preset first index array to obtain a first target index array.
The first index array is a pre-established one-dimensional array used to record the index value set corresponding to each third pinyin node. Specifically, the index value set corresponding to each third pinyin node is written into the preset first index array to obtain the first target index array. For example, if the index value set corresponding to the third pinyin node GaiXing is [index1, index2, index3] and the index value set corresponding to the third pinyin node GaoKong is [index4, index5, index6], then after the index value sets of GaiXing and GaoKong are both written into the preset first index array, the resulting first target index array is [index1, index2, index3, index4, index5, index6].
S24: Determine the starting index position of each third pinyin node from the first target index array.
Specifically, because every index value in the first target index array is uniquely determined, the array positions, within the first target index array, of the first and last index values in each third pinyin node's index value set are taken as that node's starting index position (a pair consisting of a start position and an end position). For example, suppose the first target index array is [index1, index2, index3, index4, index5, index6]. index1 and index3 are the first and last index values of the third pinyin node GaiXing; index1 is at array position 0 and index3 is at array position 2, so the starting index position of GaiXing is (0, 2). index4 and index6 are the first and last index values of the third pinyin node GaoKong; index4 is at array position 3 and index6 is at array position 5, so the starting index position of GaoKong is (3, 5).
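Steps S23 and S24 can be sketched as a single pass that concatenates the index value sets and records where each one starts and ends. The function name `flatten_index_sets` and the concrete index values below are hypothetical; the only property taken from the text is that each node contributes at least one index value and that positions are 0-based and inclusive.

```python
def flatten_index_sets(index_sets):
    """Concatenate per-node index value sets into one array.

    index_sets: dict mapping a pinyin node to its non-empty list of index values.
    Returns (target_index_array, positions), where positions[node] is the
    inclusive (start, end) pair of that node's slice in the array.
    """
    target_index_array = []
    positions = {}
    for pinyin, values in index_sets.items():
        start = len(target_index_array)
        target_index_array.extend(values)
        positions[pinyin] = (start, len(target_index_array) - 1)
    return target_index_array, positions


# Hypothetical index values standing in for index1..index6.
index_sets = {"GaiXing": [11, 12, 13], "GaoKong": [24, 25, 26]}
target_index_array, positions = flatten_index_sets(index_sets)
print(target_index_array)
print(positions)
```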
S25: Process each third pinyin node with the double-array trie algorithm to obtain the node identifier of each third pinyin node.
Specifically, each third pinyin node is processed with the double-array trie algorithm, that is, each third pinyin node is stored in double-array trie form, which yields the node identifier corresponding to each third pinyin node. The node identifier of each third pinyin node is uniquely determined. Note that the method and process used in this step are similar to those used in step S22 to process each character string of each third pinyin node, and are not repeated here.
S26: Map and store the node identifier of each third pinyin node with its corresponding starting index position to generate an offset array set.
The offset array set is a set composed of several offset arrays, each of which contains one node identifier and the corresponding starting index position. Specifically, once the node identifier of each third pinyin node has been determined, each node identifier is stored in association with its starting index position to generate the offset array set. For example, suppose the node identifier of the third pinyin node GaiXing is 0 and its starting index position is (0, 2), and the node identifier of the third pinyin node GaoKong is 1 and its starting index position is (3, 5). Node identifier 0 is then mapped to starting index position (0, 2) to form the first offset array, node identifier 1 is mapped to starting index position (3, 5) to form the second offset array, and the first and second offset arrays together form the offset array set.
S27: Combine the first target index array and the offset array set to generate the first data dictionary.
The first data dictionary is a dictionary for storing 1gram homophones. Specifically, after the first target index array and the offset array set have been determined, they are combined to generate the first data dictionary. In the first data dictionary, each 1gram pinyin node is stored as a node identifier, and the character strings corresponding to each 1gram pinyin node are stored as indexes, which reduces redundant information in data storage.
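One way to picture the combined structure is a small container holding the first target index array and the offset arrays, with a lookup that goes from pinyin node to node identifier to slice. This is a sketch under assumptions: the class `FirstDataDictionary`, its field names, and the plain `dict` standing in for the double-array trie that yields node identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FirstDataDictionary:
    node_ids: dict     # pinyin node -> node identifier (stand-in for the trie)
    offsets: dict      # node identifier -> inclusive (start, end) position pair
    index_array: list  # first target index array

    def homophone_indexes(self, pinyin):
        """Return the index values of all homophones of a pinyin node."""
        node_id = self.node_ids.get(pinyin)
        if node_id is None:
            return []
        start, end = self.offsets[node_id]
        return self.index_array[start:end + 1]


first_dictionary = FirstDataDictionary(
    node_ids={"GaiXing": 0, "GaoKong": 1},
    offsets={0: (0, 2), 1: (3, 5)},
    index_array=[11, 12, 13, 24, 25, 26],  # hypothetical index values
)
print(first_dictionary.homophone_indexes("GaoKong"))
```

Only integers are stored per entry, which is the sense in which the pinyin keys and homophone strings are no longer repeated in the dictionary itself.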
In this embodiment, second data to be stored is obtained, the second data to be stored including N third pinyin nodes and M character strings corresponding to each third pinyin node; each character string of each third pinyin node is processed with the double-array trie algorithm to determine the index value set corresponding to each third pinyin node; the index value set corresponding to each third pinyin node is written into the preset first index array to obtain the first target index array; the starting index position of each third pinyin node is determined from the first target index array; each third pinyin node is processed with the double-array trie algorithm to obtain its node identifier; the node identifier of each third pinyin node is mapped and stored with the corresponding starting index position to generate the offset array set; and the first target index array and the offset array set are combined to generate the first data dictionary. By storing the second data to be stored in double-array trie form, that is, storing each third pinyin node as a node identifier and storing the character strings corresponding to each third pinyin node as indexes, redundant information in data storage is reduced.
In one embodiment, as shown in FIG. 4, before the candidate frequency value of each candidate index group is queried in the preset second data dictionary, the data dictionary generation method further includes the following steps:
S41: Obtain third data to be stored, where the third data to be stored includes a fourth pinyin byte, a fifth pinyin byte, and a target frequency value.
The third data to be stored is the 2gram word frequency data that is to be stored. For example, the third data to be stored may be 2gram word frequency data whose key is GaoKong CaoZuo and whose value is 30, or whose key is YanJing Sheg and whose value is 25. The third data to be stored includes a fourth pinyin byte, a fifth pinyin byte, and a target frequency value. The fourth pinyin byte (also referred to below as the fourth pinyin node) is the first 1gram pinyin in the third data to be stored, and the fifth pinyin byte (the fifth pinyin node) is the second 1gram pinyin in the third data to be stored. The fourth pinyin node and the fifth pinyin node may be the same or different, and together they form the key of the third data to be stored. The target frequency value is the frequency value corresponding to the combination of the fourth pinyin node and the fifth pinyin node, and is the value in the third data to be stored. For example, if the key in the third data to be stored is GaoKong CaoZuo and the value is 25, then the fourth pinyin node is GaoKong, the fifth pinyin node is CaoZuo, and the target frequency value is 25, where 25 is the frequency value of GaoKong CaoZuo. Specifically, the third data to be stored may be obtained by collecting 2gram word frequency data in real time, or by reading 2gram word frequency data directly from a pinyin dictionary library.
S42: Process the fourth pinyin byte and the fifth pinyin byte with the double-array trie algorithm to obtain a fourth index value and a fifth index value, where the fourth index value is the index value of the fourth pinyin byte and the fifth index value is the index value of the fifth pinyin byte.
Specifically, the fourth pinyin byte and the fifth pinyin byte are processed with the double-array trie algorithm to obtain the fourth index value and the fifth index value, where the fourth index value is the index value of the fourth pinyin byte and the fifth index value is the index value of the fifth pinyin byte. Note that the method and process used in this step are similar to those used in step S22 to process each character string of each third pinyin node, and are not repeated here.
S43: Map and store the fourth index value, the fifth index value, and the target frequency value using the CSR method to generate the second data dictionary.
The second data dictionary is a word frequency dictionary library used to store the index values of 2gram character strings (words) and their corresponding frequency values. Because a 2gram character string is formed by combining two 1gram character strings, every 2gram character string (word) has two index values, namely the fourth index value and the fifth index value. Specifically, a two-dimensional matrix may be preset in which the fourth index value is the row, the fifth index value is the column, and the target frequency value is the element value stored at that position. Furthermore, because many 2gram character string combinations do not occur in practice, this two-dimensional matrix is sparse; the CSR (compressed sparse row) method is therefore applied to the matrix to compress the space it occupies, generating the second data dictionary.
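The CSR layout mentioned here keeps only the non-zero entries of the frequency matrix in three flat arrays. The following is a minimal pure-Python sketch, assuming each (row, column) pair appears at most once; the function names `build_csr` and `csr_get` and the sample index values are hypothetical.

```python
from bisect import bisect_left


def build_csr(triples, n_rows):
    """Build CSR arrays from (row, col, freq) triples.

    Returns (indptr, indices, data): row r owns the half-open slice
    indptr[r]:indptr[r + 1] of `indices` (column numbers, ascending)
    and `data` (the matching frequency values).
    """
    rows = [[] for _ in range(n_rows)]
    for r, c, freq in triples:
        rows[r].append((c, freq))
    indptr, indices, data = [0], [], []
    for row in rows:
        for c, freq in sorted(row):
            indices.append(c)
            data.append(freq)
        indptr.append(len(indices))
    return indptr, indices, data


def csr_get(csr, r, c, default=0):
    """Return the frequency stored at (r, c), or `default` if it is absent."""
    indptr, indices, data = csr
    lo, hi = indptr[r], indptr[r + 1]
    pos = bisect_left(indices, c, lo, hi)  # binary search inside row r only
    if pos < hi and indices[pos] == c:
        return data[pos]
    return default


# Rows are fourth index values, columns are fifth index values.
csr = build_csr([(0, 2, 30), (1, 0, 25), (0, 1, 7)], n_rows=3)
print(csr)
print(csr_get(csr, 0, 2), csr_get(csr, 2, 2))
```

Storage grows with the number of 2gram pairs that actually occur rather than with the square of the vocabulary size, which is the space saving the paragraph refers to.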
In this embodiment, third data to be stored is obtained, the third data to be stored including a fourth pinyin byte, a fifth pinyin byte, and a target frequency value; the fourth pinyin byte and the fifth pinyin byte are processed with the double-array trie algorithm to obtain a fourth index value and a fifth index value, where the fourth index value is the index value of the fourth pinyin byte and the fifth index value is the index value of the fifth pinyin byte; and the fourth index value, the fifth index value, and the target frequency value are mapped and stored using the CSR method to generate the second data dictionary. By storing the third data to be stored in double-array trie form, that is, representing the fourth pinyin byte and the fifth pinyin byte of the third data to be stored by indexes, redundant information in data storage is reduced and storage space is saved.
In one embodiment, as shown in FIG. 5, the data dictionary generation method further includes the following steps:
S16: Obtain fourth data to be stored, where the fourth data to be stored includes L sample character strings and a sample frequency value corresponding to each sample character string.
The fourth data to be stored is the 1gram word frequency data that is to be stored. It includes L sample character strings and a frequency value corresponding to each sample character string, where each sample character string is a key in the fourth data to be stored and each frequency value is the corresponding value. For example, if the fourth data to be stored includes 1gram word frequency data whose key is 高空 and whose value is 40, and whose key is 操作 and whose value is 45, then "高空" is a sample character string and "40" is its frequency value, and "操作" is a sample character string and "45" is its frequency value. In other words, the fourth data to be stored contains L key-value pairs, and each key corresponds to one frequency value, that is, each sample character string corresponds to one frequency value. Specifically, the fourth data to be stored may be obtained by collecting 1gram word frequency data in real time, or by reading 1gram word frequency data directly from a pinyin dictionary library.
S17: Process each sample character string with the double-array trie algorithm to obtain a sixth index value for each sample character string.
Specifically, each sample character string is processed with the double-array trie algorithm to obtain its sixth index value. Each sample character string corresponds to one unique sixth index value. Note that the method and process used in this step are similar to those used in step S22 to process each character string of each third pinyin node, and are not repeated here.
S18: Write each sample character string and its corresponding sixth index value into a preset array to obtain a storage array.
In addition, because a double-array trie cannot be used to look up a 1gram segment (sample character string) from its index (sixth index value), this embodiment builds a storage array that records each sample character string together with its sixth index value. Specifically, the array position of each entry in the storage array corresponds to its sixth index value: entries are written into the storage array in ascending order of sixth index value, so that the corresponding 1gram segment (sample character string) can later be looked up from an index value (sixth index value).
S19: Map and store each sixth index value with its corresponding sample frequency value to generate a fourth data dictionary.
Specifically, after the sixth index values have been obtained, each sixth index value is mapped and stored with its corresponding sample frequency value to generate the fourth data dictionary. The fourth data dictionary is a 1gram word frequency dictionary used to store 1gram word frequency data; it contains the index values of several 1gram character strings and their corresponding frequency values. For example, the fourth data dictionary may include data whose key is index1 and whose value is 30, and whose key is index2 and whose value is 40, where index1 is the sixth index value of the sample character string "高空" and 30 is its frequency value, and index2 is the sixth index value of the sample character string "操作" and 40 is its frequency value.
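Steps S17 to S19 can be sketched as building three small structures at once: a string-to-index map, a storage array for the reverse lookup, and an index-to-frequency map. This sketch makes two assumptions that are not stated in the text: a plain `dict` stands in for the double-array trie, and sixth index values are assigned as consecutive integers starting at 0. The function name `build_fourth_dictionary` is hypothetical.

```python
def build_fourth_dictionary(word_freq):
    """Build the lookup structures for 1gram word frequency data.

    word_freq: dict mapping a sample character string to its frequency value.
    Returns (storage_array, index_of, freq_by_index).
    """
    storage_array = []   # position i holds the string whose index value is i
    index_of = {}        # string -> index value (stand-in for the trie)
    freq_by_index = {}   # index value -> sample frequency value
    for word, freq in word_freq.items():
        idx = len(storage_array)  # assumed: consecutive index values from 0
        storage_array.append(word)
        index_of[word] = idx
        freq_by_index[idx] = freq
    return storage_array, index_of, freq_by_index


storage_array, index_of, freq_by_index = build_fourth_dictionary(
    {"高空": 40, "操作": 45}
)
idx = index_of["操作"]
print(idx, storage_array[idx], freq_by_index[idx])
```

Because the storage array is ordered by index value, the reverse lookup from index value to string is a direct array access.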
In this embodiment, fourth data to be stored is obtained, the fourth data to be stored including L sample character strings and a sample frequency value corresponding to each sample character string; each sample character string is processed with the double-array trie algorithm to obtain its sixth index value; each sample character string and its corresponding sixth index value are written into a preset array to obtain the storage array; and each sixth index value is mapped and stored with its corresponding sample frequency value to generate the fourth data dictionary. By storing the fourth data to be stored in double-array trie form, that is, converting each sample character string into a sixth index value and storing it with the corresponding sample frequency value, redundant information in data storage is reduced and the inconvenience of storing character-type data is avoided.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of the present application.
In one embodiment, a data dictionary generation apparatus is provided, which corresponds one-to-one to the data dictionary generation method in the above embodiments. As shown in FIG. 6, the data dictionary generation apparatus includes a first acquisition module 11, a first query module 12, a first processing module 13, a first screening module 14, and a first mapping storage module 15. The functional modules are described in detail as follows:
The first acquisition module 11 is configured to obtain first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;
The first query module 12 is configured to query a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
The first processing module 13 is configured to process the first index sequence and the second index sequence using the CSR method to obtain candidate index groups;
The first screening module 14 is configured to query a preset second data dictionary for the candidate frequency value of each candidate index group, and to select from the candidate index groups a target index group whose candidate frequency value meets a preset requirement;
The first mapping storage module 15 is configured to map and store the data to be stored with the target index group to generate a third data dictionary.
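The flow through modules 12 to 15 can be sketched as pairing the two index sequences, scoring each pair by its 2gram frequency, and keeping the pair that satisfies the requirement. Several parts of this sketch are assumptions rather than statements from the text: candidates are formed as the Cartesian product of the two sequences, the "preset requirement" is taken to be "highest non-zero frequency", and a plain `dict` keyed by index pairs stands in for the CSR-stored second data dictionary. The function name `select_target_group` is hypothetical.

```python
from itertools import product


def select_target_group(first_seq, second_seq, bigram_freq):
    """Pick the candidate index group with the highest frequency.

    first_seq, second_seq: index sequences of the two pinyin nodes.
    bigram_freq: dict mapping (first_index, second_index) to a frequency value.
    Returns the chosen pair, or None if no candidate has a non-zero frequency.
    """
    candidates = list(product(first_seq, second_seq))
    if not candidates:
        return None
    best_freq, best_pair = max(
        (bigram_freq.get(pair, 0), pair) for pair in candidates
    )
    return best_pair if best_freq > 0 else None


# Hypothetical index values for the homophones of GaoKong and CaoZuo.
target_group = select_target_group(
    first_seq=[3, 4, 5],
    second_seq=[8, 9],
    bigram_freq={(3, 9): 30, (5, 8): 2},
)
# The third data dictionary then maps the pinyin pair to the chosen group.
third_dictionary = {("GaoKong", "CaoZuo"): target_group}
print(third_dictionary)
```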
Preferably, the data dictionary generation apparatus further includes:
A second acquisition module, configured to obtain second data to be stored, where the second data to be stored includes N third pinyin nodes and M character strings corresponding to each third pinyin node;
A second processing module, configured to process each character string of each third pinyin node with the double-array trie algorithm to determine the index value set corresponding to each third pinyin node;
A first writing module, configured to write the index value set corresponding to each third pinyin node into a preset first index array to obtain a first target index array;
A first determining module, configured to determine the starting index position of each third pinyin node from the first target index array;
A third processing module, configured to process each third pinyin node with the double-array trie algorithm to obtain the node identifier of each third pinyin node;
A second mapping storage module, configured to map and store the node identifier of each third pinyin node with the corresponding starting index position to generate an offset array set;
A combination module, configured to combine the first target index array and the offset array set to generate the first data dictionary.
Preferably, the data dictionary generation apparatus further includes:
A third acquisition module, configured to obtain third data to be stored, where the third data to be stored includes a fourth pinyin byte, a fifth pinyin byte, and a target frequency value;
A fourth processing module, configured to process the fourth pinyin byte and the fifth pinyin byte with the double-array trie algorithm to obtain a fourth index value and a fifth index value, where the fourth index value is the index value of the fourth pinyin byte and the fifth index value is the index value of the fifth pinyin byte;
A third mapping storage module, configured to map and store the fourth index value, the fifth index value, and the target frequency value using the CSR method to generate the second data dictionary.
Preferably, the data dictionary generation apparatus further includes:
A fourth acquisition module, configured to obtain fourth data to be stored, where the fourth data to be stored includes L sample character strings and a sample frequency value corresponding to each sample character string;
A fifth processing module, configured to process each sample character string with the double-array trie algorithm to obtain a sixth index value for each sample character string;
A second writing module, configured to write each sample character string and its corresponding sixth index value into a preset array to obtain a storage array;
A fourth mapping storage module, configured to map and store each sixth index value with its corresponding sample frequency value to generate a fourth data dictionary.
For the specific limitations of the data dictionary generation apparatus, refer to the limitations of the data dictionary generation method above, which are not repeated here. Each module in the data dictionary generation apparatus may be implemented in whole or in part by software, hardware, or a combination thereof. The modules may be embedded in, or independent of, a processor in a computer device in hardware form, or may be stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, as shown in FIG. 7, a data query method is provided. Taking the application of the method to the server in FIG. 1 as an example, the method includes the following steps:
S100: Obtain first data to be queried, query the first data to be queried in the third data dictionary, and determine the index group to be queried for the first data to be queried, where the third data dictionary is obtained using the data dictionary generation method of claim 1.
The first data to be queried is the 2gram pinyin node data that is to be queried. It consists of a first pinyin node to be queried and a second pinyin node to be queried. For example, if the first data to be queried is GaoKong CaoZuo, then GaoKong is the first pinyin node to be queried and CaoZuo is the second pinyin node to be queried. Specifically, the first data to be queried is matched against all 2gram pinyin nodes stored in the third data dictionary, and the target index group corresponding to the 2gram pinyin node that matches the first data to be queried is determined as the index group to be queried for the first data to be queried. The third data dictionary is obtained using the data dictionary generation method described above.
S101:基于待查询索引组在第四数据字典的存储数组中查询,得到第一待查询数据的目标字符串,其中,第四数据字典是指用于存储第六索引值与对应的样本频率值的词频字典。S101: Based on the index group to be queried in the storage array of the fourth data dictionary, the target string of the first data to be queried is obtained, where the fourth data dictionary is used to store the sixth index value and the corresponding sample frequency value Word frequency dictionary.
具体地,为了通过索引值可反查到对应的字符串,在步骤S18中已将每一样本字符串和对应的第六索引值写入预设数组中得到存储数组,即第四数据字典的存储数组中包括有每一样本字符串和对应的第六索引值。因此,在本步骤中,将待查询索引组在第四数据字典的存储数组中查询,将与待查询索引组相匹配的第六索引值所对应的样本字符串,确定为第一待查询数据的目标字符串。其中,第四数据字典是采用上述数据字典生成方法得到的。Specifically, in order to retrieve the corresponding character string through the index value, in step S18, each sample character string and the corresponding sixth index value have been written into the preset array to obtain the storage array, that is, the storage array of the fourth data dictionary The storage array includes each sample character string and the corresponding sixth index value. Therefore, in this step, the index group to be queried is queried in the storage array of the fourth data dictionary, and the sample string corresponding to the sixth index value that matches the index group to be queried is determined as the first data to be queried The target string. Among them, the fourth data dictionary is obtained by using the above-mentioned data dictionary generating method.
在本实施例中,获取第一待查询数据,将第一待查询数据在第三数据字典中查询,确定第一待查询数据的待查询索引组,其中,第三数据字典是采用权利要求1的数据字典生成方法得到的;基于待查询索引组在第四数据字典的存储数组中查询,得到第一待查询数据的目标字符串,第四数据字典是指用于存储第六索引值与对应的样本频率值的词频字典;从而保证了数据查询的准确性。In this embodiment, the first data to be queried is acquired, the first data to be queried is queried in a third data dictionary, and the index group to be queried for the first data to be queried is determined, wherein the third data dictionary adopts claim 1. The data dictionary generation method is obtained; based on the index group to be queried in the storage array of the fourth data dictionary, the target string of the first data to be queried is obtained. The fourth data dictionary is used to store the sixth index value and the corresponding The word frequency dictionary of the sample frequency value of the sample frequency value; thereby ensuring the accuracy of data query.
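Steps S100 and S101 can be illustrated with a minimal Python sketch. It is not the implementation described in this application: the plain `dict` objects stand in for the third data dictionary and for the storage array of the fourth data dictionary, and the names `third_dictionary`, `storage_array`, `query_2gram`, the index values, and the Chinese sample strings are all assumptions chosen for illustration.

```python
# Hypothetical stand-in for the third data dictionary:
# 2-gram pinyin key -> list of target index groups.
third_dictionary = {
    ("GaoKong", "CaoZuo"): [(3, 7)],
}

# Hypothetical stand-in for the storage array of the fourth data dictionary:
# sixth index value -> sample character string.
storage_array = {3: "高空", 7: "操作"}


def query_2gram(first_node, second_node):
    """Return the target character strings for a 2-gram pinyin query."""
    # S100: look up the index groups to be queried in the third dictionary.
    index_groups = third_dictionary.get((first_node, second_node), [])
    # S101: map every index value in each group back to its sample string.
    return [tuple(storage_array[i] for i in group) for group in index_groups]


print(query_2gram("GaoKong", "CaoZuo"))
```

A query for a 2-gram that is absent from the third dictionary returns an empty list in this sketch; how the real method handles a miss is not stated in the text.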
In one embodiment, as shown in FIG. 8, the data query method further includes the following steps:
S110: Obtain second data to be queried, query the offset array set of the first data dictionary with the second data to be queried, and determine the target offset array of the second data to be queried, where the first data dictionary is obtained using the data dictionary generation method of claim 2.
Here, the second data to be queried refers to the 1-gram pinyin node data to be queried. For example, the second data to be queried may be the 1-gram pinyin node data GaoKong, CaoZuo, or GaiXing. Specifically, the offset array set of the first data dictionary contains the offset arrays of a number of third pinyin nodes. The second data to be queried is therefore matched against the third pinyin node of each offset array in the offset array set of the first data dictionary, and the offset array corresponding to the matching third pinyin node is determined as the target offset array of the second data to be queried. The first data dictionary is obtained using the data dictionary generation method described above.
S111: Obtain the target starting index position in the target offset array and, based on the target starting index position, query the first target index array of the first data dictionary to determine the target index data of the second data to be queried.
As step S26 shows, the offset array set records the node identifier of each third pinyin node and its corresponding starting index position; the starting index position in the target offset array is therefore taken as the target starting index position. Specifically, once the target starting index position has been determined, the first target index array is queried to locate the starting index position of the data to be queried in the first target index array, and the index values from the start position to the end position associated with the target starting index position are determined as the target index data of the data to be queried.
S112: Query the storage array based on the target index data to obtain the target character string of the second data to be queried.
Specifically, the storage array is queried based on the target index data to obtain the target character string of the second data to be queried. Note that the method and process used in this step are similar to those used in step S101 to query the storage array based on the index group to be queried and obtain the target character string of the first data to be queried, and are not repeated here.
In this embodiment, the second data to be queried is obtained and looked up in the offset array set of the first data dictionary to determine its target offset array, where the first data dictionary is obtained using the data dictionary generation method of claim 2. The target starting index position in the target offset array is then obtained and used to query the first target index array of the first data dictionary, which determines the target index data of the second data to be queried. Finally, the storage array is queried based on the target index data to obtain the target character string of the second data to be queried. This improves the accuracy of the data query while maintaining query efficiency.
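Steps S110 to S112 can be sketched in Python as follows. This is only an illustration under stated assumptions: the pinyin string is used directly as the node identifier (the application derives it with a double-array trie), the end position of a node's slice is assumed to be the next larger starting position in the offset data, and the names `offset_array_set`, `first_target_index_array`, `storage_array`, `query_1gram`, and all sample values are hypothetical.

```python
# Hypothetical offset array set: node identifier -> starting index position.
offset_array_set = {"GaoKong": 0, "CaoZuo": 2, "GaiXing": 3}

# Hypothetical first target index array: the index value sets of all third
# pinyin nodes, written back to back.
first_target_index_array = [3, 11, 7, 5, 9]

# Hypothetical storage array: index value -> sample character string.
storage_array = {3: "高空", 11: "高控", 7: "操作", 5: "改性", 9: "改型"}


def query_1gram(node):
    """Return the target character strings for a 1-gram pinyin query."""
    # S110: find the target offset entry for the queried node.
    start = offset_array_set.get(node)
    if start is None:
        return []
    # S111: slice the index array from the start position to the end position.
    # Assumption: the slice ends where the next node's slice begins.
    later_starts = [s for s in offset_array_set.values() if s > start]
    end = min(later_starts, default=len(first_target_index_array))
    target_index_data = first_target_index_array[start:end]
    # S112: map each index value back to its sample string.
    return [storage_array[i] for i in target_index_data]


print(query_1gram("GaoKong"))
```

Storing only starting positions and deriving each end from the neighbouring start keeps the offset data compact, in the same way that a row pointer array works in a compressed sparse row layout.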
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present application in any way.
In one embodiment, a data query apparatus is provided that corresponds one-to-one to the data query method in the above embodiments. As shown in FIG. 9, the data query apparatus includes a second query module 100 and a third query module 101. The functional modules are described in detail as follows:
The second query module 100 is configured to obtain first data to be queried, query the third data dictionary with the first data to be queried, and determine the index group to be queried for the first data to be queried, where the third data dictionary is obtained using the data dictionary generation method described above;
The third query module 101 is configured to query the storage array of the fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary that stores sixth index values and their corresponding sample frequency values.
Preferably, the data query apparatus further includes:
A second determining module, configured to obtain second data to be queried, query the offset array set of the first data dictionary with the second data to be queried, and determine the target offset array of the second data to be queried, where the first data dictionary is obtained using the data dictionary generation method described above;
A fourth query module, configured to obtain the target starting index position in the target offset array and, based on the target starting index position, query the first target index array of the first data dictionary to determine the target index data of the second data to be queried;
A fifth query module, configured to query the storage array based on the target index data to obtain the target character string of the second data to be queried.
For specific limitations on the data query apparatus, refer to the limitations on the data query method above; they are not repeated here. Each module of the data query apparatus may be implemented wholly or partly in software, hardware, or a combination of the two. The modules may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and perform the operations corresponding to each module.
In one embodiment, a computer device is provided. The computer device may be a server, and its internal structure may be as shown in FIG. 10. The computer device includes a processor, a memory, a network interface, and a database connected through a system bus. The processor of the computer device provides computing and control capabilities. The memory of the computer device includes a readable storage medium and an internal memory. The readable storage medium stores an operating system, computer-readable instructions, and a database. The internal memory provides an environment for running the operating system and the computer-readable instructions in the readable storage medium. The database of the computer device stores the data used by the data dictionary generation method and the data query method in the above embodiments. The network interface of the computer device communicates with an external terminal through a network connection. When executed by the processor, the computer-readable instructions implement a data dictionary generation method or a data query method. The readable storage medium provided in this embodiment includes a non-volatile readable storage medium and a volatile readable storage medium.
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps of the above embodiments: obtaining first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;
Querying a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
Processing the first index sequence and the second index sequence using a CSR method to obtain candidate index groups;
Querying a preset second data dictionary for the candidate frequency value of each candidate index group, and selecting from the candidate index groups the target index groups whose candidate frequency values meet a preset requirement;
Mapping and storing the data to be stored together with the target index groups to generate a third data dictionary.
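The generation steps above can be sketched in Python. The sketch makes several assumptions that this passage does not confirm: a plain Cartesian product stands in for the CSR processing that pairs the two index sequences, the "preset requirement" is modelled as a minimum frequency threshold, and the dictionaries are ordinary `dict` objects. The names `first_dictionary`, `second_dictionary`, `generate_entry`, `min_frequency`, and the sample values are hypothetical.

```python
from itertools import product

# Hypothetical first data dictionary: pinyin node -> index sequence.
first_dictionary = {"GaoKong": [3, 11], "CaoZuo": [7]}

# Hypothetical second data dictionary: index group -> frequency value.
second_dictionary = {(3, 7): 42, (11, 7): 0}


def generate_entry(first_node, second_node, min_frequency=1):
    """Build one third-dictionary entry for a 2-gram pinyin key."""
    # Determine the index sequence of each pinyin node.
    first_sequence = first_dictionary.get(first_node, [])
    second_sequence = first_dictionary.get(second_node, [])
    # Assumption: candidate index groups are all pairings of the two sequences.
    candidates = product(first_sequence, second_sequence)
    # Keep only the groups whose frequency meets the assumed threshold.
    targets = [
        group for group in candidates
        if second_dictionary.get(group, 0) >= min_frequency
    ]
    # Map the data to be stored to its target index groups.
    return {(first_node, second_node): targets}


third_dictionary = generate_entry("GaoKong", "CaoZuo")
print(third_dictionary)
```

Filtering at generation time means that a later query only has to read the stored groups, without consulting the frequency data again.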
In one embodiment, a computer device is provided, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor. When executing the computer-readable instructions, the processor implements the following steps of the above embodiments:
Obtaining first data to be queried, querying the third data dictionary with the first data to be queried, and determining the index group to be queried for the first data to be queried, where the third data dictionary is obtained using the data dictionary generation method of claim 1;
Querying the storage array of the fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary that stores sixth index values and their corresponding sample frequency values.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. The readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the following steps:
Obtaining first data to be stored, where the first data to be stored includes a first pinyin node and a second pinyin node;
Querying a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, where the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
Processing the first index sequence and the second index sequence using a CSR method to obtain candidate index groups;
Querying a preset second data dictionary for the candidate frequency value of each candidate index group, and selecting from the candidate index groups the target index groups whose candidate frequency values meet a preset requirement;
Mapping and storing the data to be stored together with the target index groups to generate a third data dictionary.
In one embodiment, one or more readable storage media storing computer-readable instructions are provided. The readable storage media provided in this embodiment include non-volatile readable storage media and volatile readable storage media. The readable storage media store computer-readable instructions which, when executed by one or more processors, cause the one or more processors to implement the following steps:
Obtaining first data to be queried, querying the third data dictionary with the first data to be queried, and determining the index group to be queried for the first data to be queried, where the third data dictionary is obtained using the data dictionary generation method of claim 1;
Querying the storage array of the fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, where the fourth data dictionary is a word frequency dictionary that stores sixth index values and their corresponding sample frequency values.
A person of ordinary skill in the art will understand that all or part of the processes in the methods of the above embodiments can be carried out by computer-readable instructions directing the relevant hardware. The computer-readable instructions may be stored in a non-volatile computer-readable storage medium or a volatile readable storage medium and, when executed, may include the processes of the method embodiments described above. Any reference to memory, storage, a database, or another medium in the embodiments provided in this application may include non-volatile and/or volatile memory. Non-volatile memory may include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory may include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in many forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is given only as an example. In practice, the functions may be assigned to different functional units and modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above.
The above embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in those embodiments may still be modified, and some of their technical features may be replaced with equivalents. Such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and all fall within the scope of protection of the present application.

Claims (20)

  1. A data dictionary generation method, comprising:
    obtaining first data to be stored, the first data to be stored comprising a first pinyin node and a second pinyin node;
    querying a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
    processing the first index sequence and the second index sequence using a CSR method to obtain candidate index groups;
    querying a preset second data dictionary for the candidate frequency value of each candidate index group, and selecting from the candidate index groups the target index groups whose candidate frequency values meet a preset requirement;
    mapping and storing the data to be stored together with the target index groups to generate a third data dictionary.
  2. The data dictionary generation method of claim 1, wherein, before the querying of the preset first data dictionary based on the first pinyin node and the second pinyin node, the data dictionary generation method further comprises:
    obtaining second data to be stored, the second data to be stored comprising N third pinyin nodes and M character strings corresponding to each third pinyin node;
    processing each character string of each third pinyin node using a double-array trie algorithm to determine the index value set corresponding to each third pinyin node;
    writing the index value set corresponding to each third pinyin node into a preset first index array to obtain a first target index array;
    determining the starting index position of each third pinyin node from the first target index array;
    processing each third pinyin node using the double-array trie algorithm to obtain the node identifier of each third pinyin node;
    mapping and storing the node identifier of each third pinyin node together with the corresponding starting index position to generate an offset array set;
    combining the first target index array and the offset array set to generate the first data dictionary.
  3. The data dictionary generation method of claim 1, wherein, before the querying of the preset second data dictionary for the candidate frequency value of each candidate index group, the data dictionary generation method further comprises:
    obtaining third data to be stored, the third data to be stored comprising a fourth pinyin byte, a fifth pinyin byte, and a target frequency value;
    processing the fourth pinyin byte and the fifth pinyin byte using a double-array trie algorithm to obtain a fourth index value and a fifth index value, wherein the fourth index value is the index value of the fourth pinyin byte and the fifth index value is the index value of the fifth pinyin byte;
    mapping and storing the fourth index value, the fifth index value, and the target frequency value using a CSR method to generate the second data dictionary.
  4. The data dictionary generation method of claim 1, wherein the data dictionary generation method further comprises:
    obtaining fourth data to be stored, the fourth data to be stored comprising L sample character strings and a sample frequency value corresponding to each sample character string;
    processing each sample character string using a double-array trie algorithm to obtain a sixth index value of each sample character string;
    writing each sample character string and the corresponding sixth index value into a preset array to obtain a storage array;
    mapping and storing each sixth index value together with the corresponding sample frequency value to generate a fourth data dictionary.
  5. A data query method, comprising:
    obtaining first data to be queried, querying a third data dictionary with the first data to be queried, and determining the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained using the data dictionary generation method of claim 1;
    querying a storage array of a fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary that stores sixth index values and their corresponding sample frequency values.
  6. The data query method of claim 5, wherein the data query method further comprises:
    obtaining second data to be queried, querying the offset array set of a first data dictionary with the second data to be queried, and determining the target offset array of the second data to be queried, wherein the first data dictionary is obtained using the data dictionary generation method of claim 2;
    obtaining the target starting index position in the target offset array and, based on the target starting index position, querying the first target index array of the first data dictionary to determine the target index data of the second data to be queried;
    querying the storage array based on the target index data to obtain the target character string of the second data to be queried.
  7. A data dictionary generation apparatus, comprising:
    a first obtaining module, configured to obtain first data to be stored, the first data to be stored comprising a first pinyin node and a second pinyin node;
    a first query module, configured to query a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node and the second index sequence is the index sequence of the second pinyin node;
    a first processing module, configured to process the first index sequence and the second index sequence using a CSR method to obtain candidate index groups;
    a first screening module, configured to query a preset second data dictionary for the candidate frequency value of each candidate index group, and to select from the candidate index groups the target index groups whose candidate frequency values meet a preset requirement;
    a first mapping storage module, configured to map and store the data to be stored together with the target index groups to generate a third data dictionary.
  8. A data query apparatus, comprising:
    a second query module, configured to obtain first data to be queried, query a third data dictionary with the first data to be queried, and determine the index group to be queried for the first data to be queried, wherein the third data dictionary is obtained using the data dictionary generation method of claim 1;
    a third query module, configured to query a storage array of a fourth data dictionary based on the index group to be queried to obtain the target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary that stores sixth index values and their corresponding sample frequency values.
  9. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    acquiring first data to be stored, the first data to be stored comprising a first pinyin node and a second pinyin node;
    querying a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node, and the second index sequence is the index sequence of the second pinyin node;
    processing the first index sequence and the second index sequence using a CSR method to obtain candidate index groups;
    querying a preset second data dictionary for a candidate frequency value of each candidate index group, and selecting, from the candidate index groups, a target index group whose candidate frequency value meets a preset requirement;
    mapping and storing the data to be stored and the target index group to generate a third data dictionary.
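The steps of claim 9 can be sketched as follows. This is only an illustration under stated assumptions: the claim does not spell out how the CSR method produces candidate index groups, so the sketch simply enumerates every pair of index values from the two sequences; `frequency_of` is a hypothetical callable standing in for the second data dictionary, and `min_frequency` is a hypothetical form of the "preset requirement".

```python
from itertools import product


def select_target_groups(first_seq, second_seq, frequency_of, min_frequency=1):
    """Pair up index values and keep the pairs that are frequent enough.

    Assumption: a candidate index group is a pair (i, j) with i taken from
    the first index sequence and j from the second.
    """
    candidates = product(first_seq, second_seq)
    scored = [((i, j), frequency_of(i, j)) for i, j in candidates]
    return [group for group, freq in scored if freq >= min_frequency]


def build_third_dictionary(entries, index_sequences, frequency_of, min_frequency=1):
    """Map each (first pinyin, second pinyin) entry to its target index groups.

    `index_sequences` stands in for the first data dictionary:
    pinyin node -> index sequence.
    """
    third = {}
    for first_pinyin, second_pinyin in entries:
        third[(first_pinyin, second_pinyin)] = select_target_groups(
            index_sequences[first_pinyin],
            index_sequences[second_pinyin],
            frequency_of,
            min_frequency,
        )
    return third
```

A plain Python `dict` is used for the resulting third data dictionary here; the claim only requires that the data to be stored and the target index group be stored as a mapping.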
  10. The computer device of claim 9, wherein, before the querying in the preset first data dictionary based on the first pinyin node and the second pinyin node, the data dictionary generation method further comprises:
    acquiring second data to be stored, the second data to be stored comprising N third pinyin nodes and M character strings corresponding to each third pinyin node;
    processing each character string of each third pinyin node using a double-array trie algorithm to determine an index value set corresponding to each third pinyin node;
    writing the index value set corresponding to each third pinyin node into a preset first index array to obtain a first target index array;
    determining a starting index position of each third pinyin node from the first target index array;
    processing each third pinyin node using the double-array trie algorithm to obtain a node identifier of each third pinyin node;
    mapping and storing the node identifier of each third pinyin node and the corresponding starting index position to generate an offset array set;
    combining the first target index array and the offset array set to generate the first data dictionary.
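A minimal sketch of the layout described in claim 10, assuming a flat index array plus one start offset per pinyin node. To keep the example short, ordinary Python dictionaries stand in for the double-array trie that assigns index values and node identifiers; that substitution, and all function and key names, are assumptions made for illustration.

```python
def build_first_dictionary(pinyin_to_strings):
    """Build a flat index array plus per-node start offsets."""
    string_ids = {}   # stand-in for the trie: character string -> index value
    node_ids = {}     # stand-in for the trie: pinyin node -> node identifier
    index_array = []  # all index value sets, concatenated
    offsets = {}      # node identifier -> starting index position

    for pinyin, strings in pinyin_to_strings.items():
        node_id = node_ids.setdefault(pinyin, len(node_ids))
        offsets[node_id] = len(index_array)
        for s in strings:
            index_array.append(string_ids.setdefault(s, len(string_ids)))

    return {"index_array": index_array, "offsets": offsets, "node_ids": node_ids}


def index_values(first_dict, pinyin):
    """Return the index value set stored for one pinyin node."""
    node_id = first_dict["node_ids"][pinyin]
    start = first_dict["offsets"][node_id]
    # The next node's start marks the end of this node's slice.
    stop = first_dict["offsets"].get(node_id + 1, len(first_dict["index_array"]))
    return first_dict["index_array"][start:stop]
```

Storing only a start offset per node keeps the index values for all nodes in one contiguous array, so a lookup is a dictionary access followed by a slice.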
  11. The computer device of claim 9, wherein, before the querying of the candidate frequency value of each candidate index group in the preset second data dictionary, the processor, when executing the computer-readable instructions, further implements the following steps:
    acquiring third data to be stored, the third data to be stored comprising a fourth pinyin byte, a fifth pinyin byte, and a target frequency value;
    processing the fourth pinyin byte and the fifth pinyin byte using a double-array trie algorithm to obtain a fourth index value and a fifth index value, wherein the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte;
    mapping and storing the fourth index value, the fifth index value, and the target frequency value using a CSR method to generate the second data dictionary.
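CSR (compressed sparse row) storage of the kind named in claim 11 can be sketched as below. The sketch assumes the fourth index value is used as the row, the fifth index value as the column, and the target frequency value as the stored entry; that assignment, the function names, and the assumption that no (row, column) pair repeats are illustrative choices, not taken from the claim.

```python
from bisect import bisect_left


def build_csr(triples, n_rows):
    """Build CSR arrays from (row_index, col_index, frequency) triples."""
    values, col_indices = [], []
    row_ptr = [0] * (n_rows + 1)
    # Sorting gives row-major order with ascending columns inside each row.
    for row, col, freq in sorted(triples):
        col_indices.append(col)
        values.append(freq)
        row_ptr[row + 1] += 1
    # Prefix sums turn per-row counts into row start positions.
    for r in range(n_rows):
        row_ptr[r + 1] += row_ptr[r]
    return values, col_indices, row_ptr


def lookup(csr, row, col, default=0):
    """Return the frequency stored at (row, col), or `default` if absent."""
    values, col_indices, row_ptr = csr
    lo, hi = row_ptr[row], row_ptr[row + 1]
    pos = bisect_left(col_indices, col, lo, hi)
    if pos < hi and col_indices[pos] == col:
        return values[pos]
    return default
```

Because most pairs of index values never co-occur, CSR stores only the non-zero frequencies, and a lookup is a binary search within a single row.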
  12. The computer device of claim 9, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    acquiring fourth data to be stored, the fourth data to be stored comprising L sample character strings and a sample frequency value corresponding to each sample character string;
    processing each sample character string using a double-array trie algorithm to obtain a sixth index value of each sample character string;
    writing each sample character string and the corresponding sixth index value into a preset array to obtain a storage array;
    mapping and storing each sixth index value and the corresponding sample frequency value to generate a fourth data dictionary.
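As a rough sketch of claim 12, the storage array and the word frequency mapping can be built in one pass. The sixth index value is taken here to be the position in the storage array, which is an assumption standing in for the value the double-array trie would assign; the function name is likewise hypothetical.

```python
def build_fourth_dictionary(samples):
    """Build the storage array and the index -> frequency mapping.

    `samples` is an iterable of (sample string, sample frequency) pairs.
    """
    storage_array = []
    frequency_by_index = {}
    for text, freq in samples:
        index = len(storage_array)  # stand-in for the trie-assigned index value
        storage_array.append(text)
        frequency_by_index[index] = freq
    return storage_array, frequency_by_index
```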
  13. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps:
    obtaining first data to be queried, querying the first data to be queried in a third data dictionary, and determining an index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by using the data dictionary generation method of claim 1;
    querying a storage array of a fourth data dictionary based on the index group to be queried, to obtain a target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary that stores sixth index values and the corresponding sample frequency values.
  14. The computer device of claim 13, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    obtaining second data to be queried, querying the second data to be queried in the offset array set of a first data dictionary, and determining a target offset array of the second data to be queried, wherein the first data dictionary is obtained by using the data dictionary generation method of claim 2;
    obtaining a target starting index position in the target offset array, and querying the first target index array of the first data dictionary based on the target starting index position to determine target index data of the second data to be queried;
    querying the storage array based on the target index data to obtain a target character string of the second data to be queried.
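The query path of claim 14 can be illustrated with the sketch below. It assumes the offsets are kept as a list with one trailing sentinel entry, so that each node's slice ends where the next begins; the parameter names and that layout are assumptions, and `node_ids` is a plain dictionary standing in for the trie lookup.

```python
def query_strings(pinyin, node_ids, offsets, index_array, storage_array):
    """Resolve a pinyin query to its character strings.

    offsets[k] is the start of node k's slice in index_array;
    offsets[-1] is a sentinel equal to len(index_array).
    """
    node_id = node_ids.get(pinyin)
    if node_id is None:
        return []  # unknown pinyin node: nothing to return
    start = offsets[node_id]
    stop = offsets[node_id + 1]
    # Each index value points at one entry of the storage array.
    return [storage_array[i] for i in index_array[start:stop]]
```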
  15. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    acquiring first data to be stored, the first data to be stored comprising a first pinyin node and a second pinyin node;
    querying a preset first data dictionary based on the first pinyin node and the second pinyin node to determine a first index sequence and a second index sequence, wherein the first index sequence is the index sequence of the first pinyin node, and the second index sequence is the index sequence of the second pinyin node;
    processing the first index sequence and the second index sequence using a CSR method to obtain candidate index groups;
    querying a preset second data dictionary for a candidate frequency value of each candidate index group, and selecting, from the candidate index groups, a target index group whose candidate frequency value meets a preset requirement;
    mapping and storing the data to be stored and the target index group to generate a third data dictionary.
  16. The readable storage medium of claim 15, wherein, before the querying in the preset first data dictionary based on the first pinyin node and the second pinyin node, the data dictionary generation method further comprises:
    acquiring second data to be stored, the second data to be stored comprising N third pinyin nodes and M character strings corresponding to each third pinyin node;
    processing each character string of each third pinyin node using a double-array trie algorithm to determine an index value set corresponding to each third pinyin node;
    writing the index value set corresponding to each third pinyin node into a preset first index array to obtain a first target index array;
    determining a starting index position of each third pinyin node from the first target index array;
    processing each third pinyin node using the double-array trie algorithm to obtain a node identifier of each third pinyin node;
    mapping and storing the node identifier of each third pinyin node and the corresponding starting index position to generate an offset array set;
    combining the first target index array and the offset array set to generate the first data dictionary.
  17. The readable storage medium of claim 15, wherein, before the querying of the candidate frequency value of each candidate index group in the preset second data dictionary, the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
    acquiring third data to be stored, the third data to be stored comprising a fourth pinyin byte, a fifth pinyin byte, and a target frequency value;
    processing the fourth pinyin byte and the fifth pinyin byte using a double-array trie algorithm to obtain a fourth index value and a fifth index value, wherein the fourth index value is the index value of the fourth pinyin byte, and the fifth index value is the index value of the fifth pinyin byte;
    mapping and storing the fourth index value, the fifth index value, and the target frequency value using a CSR method to generate the second data dictionary.
  18. The readable storage medium of claim 15, wherein the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
    acquiring fourth data to be stored, the fourth data to be stored comprising L sample character strings and a sample frequency value corresponding to each sample character string;
    processing each sample character string using a double-array trie algorithm to obtain a sixth index value of each sample character string;
    writing each sample character string and the corresponding sixth index value into a preset array to obtain a storage array;
    mapping and storing each sixth index value and the corresponding sample frequency value to generate a fourth data dictionary.
  19. One or more readable storage media storing computer-readable instructions, wherein the computer-readable instructions, when executed by one or more processors, cause the one or more processors to perform the following steps:
    obtaining first data to be queried, querying the first data to be queried in a third data dictionary, and determining an index group to be queried for the first data to be queried, wherein the third data dictionary is obtained by using the data dictionary generation method of claim 1;
    querying a storage array of a fourth data dictionary based on the index group to be queried, to obtain a target character string of the first data to be queried, wherein the fourth data dictionary is a word frequency dictionary that stores sixth index values and the corresponding sample frequency values.
  20. The readable storage medium of claim 19, wherein the computer-readable instructions, when executed by the one or more processors, cause the one or more processors to further perform the following steps:
    obtaining second data to be queried, querying the second data to be queried in the offset array set of a first data dictionary, and determining a target offset array of the second data to be queried, wherein the first data dictionary is obtained by using the data dictionary generation method of claim 2;
    obtaining a target starting index position in the target offset array, and querying the first target index array of the first data dictionary based on the target starting index position to determine target index data of the second data to be queried;
    querying the storage array based on the target index data to obtain a target character string of the second data to be queried.
PCT/CN2021/090528 2020-06-24 2021-04-28 Data dictionary generation method and apparatus, data query method and apparatus, and device and medium WO2021258848A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010589195.3A CN111737977B (en) 2020-06-24 2020-06-24 Data dictionary generation method, data query method, device, equipment and medium
CN202010589195.3 2020-06-24

Publications (1)

Publication Number Publication Date
WO2021258848A1 true WO2021258848A1 (en) 2021-12-30

Family

ID=72650969

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/090528 WO2021258848A1 (en) 2020-06-24 2021-04-28 Data dictionary generation method and apparatus, data query method and apparatus, and device and medium

Country Status (2)

Country Link
CN (1) CN111737977B (en)
WO (1) WO2021258848A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111737977B (en) * 2020-06-24 2022-05-17 平安科技(深圳)有限公司 Data dictionary generation method, data query method, device, equipment and medium
CN112307183B (en) * 2020-10-30 2024-04-19 北京金堤征信服务有限公司 Search data identification method, apparatus, electronic device and computer storage medium
CN115329032B (en) * 2022-10-14 2023-03-24 杭州海康威视数字技术股份有限公司 Learning data transmission method, device, equipment and storage medium based on federated dictionary
CN116013488B (en) * 2023-03-27 2023-06-02 中国人民解放军总医院第六医学中心 Intelligent security management system for medical records with self-adaptive data rearrangement function

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180293298A1 (en) * 2017-04-07 2018-10-11 Sap Se Reordering of enriched inverted indices
CN109840254A (en) * 2018-12-14 2019-06-04 湖南亚信软件有限公司 Data virtualization and query method and device
CN110147413A (en) * 2019-04-26 2019-08-20 平安科技(深圳)有限公司 Data storage method, data query method, apparatus, equipment and storage medium
CN111143461A (en) * 2019-12-31 2020-05-12 中国银行股份有限公司 Mapping relation processing system and method and electronic equipment
CN111737977A (en) * 2020-06-24 2020-10-02 平安科技(深圳)有限公司 Data dictionary generation method, data query method, device, equipment and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10067909B2 (en) * 2014-06-25 2018-09-04 Sap Se Sparse linear algebra in column-oriented in-memory database
CN106528647B (en) * 2016-10-15 2019-07-23 传神语联网网络科技股份有限公司 A term matching method based on the cedar double-array trie algorithm
CN110197271B (en) * 2018-02-27 2020-10-27 上海寒武纪信息科技有限公司 Integrated circuit chip device and related product
CN109740023B (en) * 2019-01-03 2020-09-29 中国人民解放军国防科技大学 Sparse matrix compression storage method based on bidirectional bitmap

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114756591A (en) * 2022-04-15 2022-07-15 成都卓讯智安科技有限公司 Data screening method and system based on bidirectional linked list
CN114756591B (en) * 2022-04-15 2022-10-14 成都卓讯智安科技有限公司 Data screening method and system based on bidirectional linked list
CN116048478A (en) * 2023-03-07 2023-05-02 智慧眼科技股份有限公司 Dictionary escape method, device, equipment and computer readable storage medium
CN116048478B (en) * 2023-03-07 2023-05-30 智慧眼科技股份有限公司 Dictionary escape method, device, equipment and computer readable storage medium
CN117112718A (en) * 2023-10-16 2023-11-24 达文恒业科技(深圳)有限公司 Method for rapidly storing data of vehicle-mounted computer system
CN117112718B (en) * 2023-10-16 2024-01-26 达文恒业科技(深圳)有限公司 Method for rapidly storing data of vehicle-mounted computer system

Also Published As

Publication number Publication date
CN111737977A (en) 2020-10-02
CN111737977B (en) 2022-05-17

Similar Documents

Publication Publication Date Title
WO2021258848A1 (en) Data dictionary generation method and apparatus, data query method and apparatus, and device and medium
JP6662990B2 (en) System and method for modeling an object network
WO2022142613A1 (en) Training corpus expansion method and apparatus, and intent recognition model training method and apparatus
US10521441B2 (en) System and method for approximate searching very large data
US10671586B2 (en) Optimal sort key compression and index rebuilding
CN110532347B (en) Log data processing method, device, equipment and storage medium
US11526465B2 (en) Generating hash trees for database schemas
WO2021258853A1 (en) Vocabulary error correction method and apparatus, computer device, and storage medium
EP3926484B1 (en) Improved fuzzy search using field-level deletion neighborhoods
CN116383238B (en) Data virtualization system, method, device, equipment and medium based on graph structure
US11989185B2 (en) In-memory efficient multistep search
CN112307169B (en) Address data matching method and device, computer equipment and storage medium
CN108595437B (en) Text query error correction method and device, computer equipment and storage medium
US8321429B2 (en) Accelerating queries using secondary semantic column enumeration
US7672925B2 (en) Accelerating queries using temporary enumeration representation
CN114238334A (en) Heterogeneous data encoding method and device, heterogeneous data decoding method and device, computer equipment and storage medium
CN114461606A (en) Data storage method and device, computer equipment and storage medium
CN116860564B (en) Cloud server data management method and data management device thereof
CN117540056B (en) Method, device, computer equipment and storage medium for data query
CN110471901B (en) Data importing method and terminal equipment
JP4160627B2 (en) Structured document management system and program
Zhou et al. Multi-agent Based Container Microservices for Remote File Storage System
CN111626585A (en) Script data extraction method and device, computer equipment and storage medium
CN113377711A (en) Data processing method, device, equipment and computer readable storage medium
CN117540056A (en) Method, device, computer equipment and storage medium for data query

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21829107

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21829107

Country of ref document: EP

Kind code of ref document: A1